1. Introduction
Facial recognition is the most effective method for identifying individuals in daily life, as it uses the unique biological features of the human face, which are considered the primary and most natural characteristics of the human body. This non-invasive and contactless method of identification allows for easy operation and intuitive results, making it an important research area in the field of biometric identification. Recent advancements in computer processing power, the proliferation of data, and innovation in deep learning algorithms have significantly improved the accuracy of facial recognition technology, making it applicable in a range of scenarios. This has led to the widespread adoption of facial recognition technology for applications such as unlocking smartphones; security screening at train stations, airports, and customs; payment at shopping centers, hotels, and self-service terminals; and access to offices and laboratories.
Although facial recognition is widely applied due to its technological maturity and effectiveness, it is far from perfect. On the one hand, the continuous expansion of application boundaries raises the requirements for the accuracy and scale of facial recognition. Attendance systems in small buildings and offices need to recognize only a few hundred people; access control in residential districts, schools, and similar places must handle several thousand to tens of thousands of people; and urban security scenarios can involve millions. Moreover, different types of scenarios have significantly different accuracy requirements. For example, applications in the financial sector, such as facial recognition payments, require much higher accuracy than ordinary access scenarios. Facial recognition therefore still needs to improve its accuracy and reliability across these scenarios to meet industrial-level demands. On the other hand, although facial recognition based on deep neural networks has significant performance advantages over other methods, its huge computational cost is the main factor restricting wider application. Reducing the computational and storage costs of deep neural networks while maintaining accuracy is therefore an important and challenging issue in the field of facial recognition.
From an academic research perspective, facial recognition and deep neural networks have always been considered classic issues in the fields of artificial intelligence and computer vision. Their research progress can be validated and applied to other visual tasks such as image classification, detection, and segmentation, as well as pedestrian re-identification, thereby promoting communication and development across related disciplines. From an applied perspective, improving the accuracy of facial recognition can further expand its range of applications, allowing it to be deployed in more fields. Reduced computing and storage expenses lead to lower hardware costs and faster response times. Therefore, lightweight deep neural network facial recognition has enormous significance in terms of both academic research value and practical applications.
In order to address the limitations of large-scale deep neural networks, including their large parameter size and computational complexity, several methods have been used to compress or accelerate deep neural networks, such as pruning [1,2], knowledge distillation [3], low-rank approximation [4], and quantization [5,6]. In recent years, the development of lightweight deep neural networks has emerged as one of the most promising ways to achieve a better balance between speed and accuracy. Among the widely used architectures in visual recognition tasks are SqueezeNet [7], MobileNets [8,9,10], ShuffleNets [11,12], EfficientNet [13], and GhostNet [14], which have demonstrated impressive results. However, only a few works have proposed accurate lightweight architectures specifically designed for face recognition [15,16,17,18,19], indicating the need for further exploration in this area.
From a fundamental perspective, face recognition belongs to the category of image-matching tasks. However, it is not a simple pixel-level matching process, as numerous factors can affect face recognition performance, such as pose, age, lighting, occlusion, or variations in image quality. These factors can hinder the ability of Convolutional Neural Networks (CNNs) to extract local features: changes in pose can cause part of the face to disappear, blurry facial images can lead to indistinct facial regions, and variations in lighting can result in the loss of detailed facial information. Furthermore, different facial regions contribute differently to the final recognition outcome. Therefore, when utilizing a CNN to extract facial features, regions that contribute more to face recognition should be given greater weight, as should feature channels with higher discriminative ability. Besides the extraction of efficient local features, the fusion of local facial features and global abstract facial features also plays an important role in face recognition performance. The convolution operation possesses inherent local perceptual characteristics but may disregard the overall correlation between different regions of a facial image, limiting its ability to comprehend global semantic information. For instance, two individuals can possess very similar eyes, making it difficult to distinguish between them; however, whether two images belong to the same person can be determined by evaluating factors such as interocular distance or the overall arrangement of the facial features. Nonetheless, CNNs struggle to learn such global facial information. The self-attention mechanism of the Transformer effectively compensates for the CNN’s limitations in global modeling by enabling long-distance modeling even at shallow levels of the model.
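The idea of assigning larger weights to more discriminative feature channels can be sketched with a squeeze-and-excitation-style channel gate. This is a generic, illustrative sketch in PyTorch; the `ChannelGate` name and reduction ratio are our own choices, not the specific attention module proposed in this paper.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Illustrative squeeze-and-excitation-style gate: channels with
    higher discriminative power receive larger weights. A generic
    sketch, not the attention used by CFormerFaceNet."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):                    # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))      # per-channel weight in (0, 1)
        return x * w[:, :, None, None]       # reweight feature channels
```

The same gating idea extends to spatial locations, which is how higher-contributing facial regions could be emphasized.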
The literature has shown the feasibility of incorporating Transformer models into face recognition, and promising experimental outcomes have been reported.
Based on the aforementioned analysis, this paper presents a novel, efficient, and lightweight face recognition model known as CFormerFaceNet, which integrates a CNN and a Transformer. By fully utilizing the CNN’s ability to extract local facial features and the Transformer’s capability to model global facial features, the two techniques are seamlessly combined to form CFormerFaceNet. In addition, lightweight modifications significantly reduce the required parameters and computations without significantly compromising performance. CFormerFaceNet can therefore greatly reduce the processor requirements of edge devices in face recognition tasks, thereby improving real-time performance.
2. Related Works
The significant progress in deep face recognition can be attributed to the rapid development of deep convolutional neural networks. DeepFace [20] employs the AlexNet [21] architecture for the end-to-end learning of face representations. DeepID [22], on the other hand, utilizes a different feature extraction method: based on the locations of facial features, it segments the face image into 60 blocks of varying size and color and trains 4 convolutional layers to integrate them into an overcomplete high-dimensional feature. DeepID achieved an accuracy of 97.45% on the LFW benchmark dataset. To further improve its performance, DeepID2 [23], DeepID2+ [24], and DeepID3 [25] were subsequently proposed. FaceNet [26] trains with a triplet loss to directly learn a mapping from face images to Euclidean space, achieving state-of-the-art results in large-scale face recognition tasks. The center loss [27] enhances the discriminative power of deep features by penalizing the distance between each deep feature and its corresponding class center, significantly improving performance. Margin-based loss functions have been the primary research direction in deep face recognition in recent years, as they are simple and effective. L-Softmax [28] introduces a larger angular gap in deep feature learning, and subsequent margin-based loss functions follow this design. SphereFace [29] introduces a multiplicative margin to achieve tighter decision boundaries and also introduces weight normalization. To learn face embeddings with larger angular distances, feature normalization was introduced in [30]. CosFace [31] and ArcFace [32] achieved impressive results by using additive margins instead of multiplicative margins. Other margin-based loss functions have also shown excellent performance, including Circle loss [33], AdaCos [34], AdaptiveFace [35], CurricularFace [36], and DiscFace [37]. These models have demonstrated the effectiveness of deep convolutional neural networks in face recognition tasks; however, they do not consider model size or computational efficiency.
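The additive angular margin of ArcFace [32] can be sketched as follows. This is a minimal PyTorch illustration; the class name is ours, and the scale s = 64 and margin m = 0.5 are common defaults from the literature, not necessarily the settings of any specific paper above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAngularMarginLoss(nn.Module):
    """Sketch of an ArcFace-style loss: features and class weights are
    L2-normalized so logits are cosines of angles, and a margin m is
    added to the angle of the ground-truth class before rescaling by s."""

    def __init__(self, in_features, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine of the angle between each embedding and each class center
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin only on the ground-truth class
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)
```

A multiplicative margin (SphereFace [29]) would instead scale the target angle, i.e., cos(m·θ), which makes optimization harder; the additive form is one reason CosFace/ArcFace train more stably.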
The development of lightweight and efficient deep neural network architectures plays a crucial role in enabling high-performance computing on edge or portable devices with performance constraints. The effective implementation of efficient and lightweight deep face recognition architectures continues to pose a challenge for practical applications in real-world settings. Ref. [16] introduced an alternative to the ReLU function called Max-Feature-Map (MFM) to suppress low-activation neurons in each convolutional layer. In addition, small convolution filters, Network-in-Network (NIN) layers, and residual blocks were employed to reduce the parameter space and improve performance. Three lightweight CNN architectures were evaluated, demonstrating superior speed and storage efficiency compared with state-of-the-art large face models. These findings suggest the potential for significant performance improvements in face recognition tasks through the use of lightweight CNN structures.
MobileFaceNet [15] was built on the MobileNetV2 architecture and incorporated a global depth-wise convolution layer in place of global average pooling to output a discriminative feature vector, achieving significantly improved efficiency compared with previous state-of-the-art mobile CNNs for face verification. ShuffleFaceNet [18] was built on the ShuffleNetV2 [12] architecture and includes four ShuffleFaceNet models with different complexity levels. These models use fewer than 4.5 million parameters and have a maximum computational complexity of 1.05 GFLOPs. Among them, the ShuffleFaceNet 1.5× model exhibits the best balance between speed and accuracy. VarGFaceNet [38] employs efficient convolutions with variable groups and an equivalent angular distillation loss function to further increase the model’s discriminative power and interpretability. MobiFace [19] uses a residual bottleneck layer with an expansion layer to construct the model, optimizing the information contained within the output vector and achieving a notable accuracy of 99.73% on the LFW dataset. ELANet [17] was proposed based on MobileFaceNet and is able to acquire diverse multi-scale, multi-level characteristics while also discerning local features. However, these methods are all based on CNNs. Ref. [39] proposed GhostFaceNet, an efficient and lightweight CNN based on GhostNet [14] for embedded or mobile devices, which significantly reduces the model’s computational requirements.
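The global depth-wise convolution that MobileFaceNet [15] uses in place of global average pooling can be sketched as follows. Sizes here are illustrative assumptions: a 512-channel 7 × 7 feature map is a typical backbone output.

```python
import torch
import torch.nn as nn

# Sketch of a MobileFaceNet-style global depth-wise convolution. With
# groups == channels and kernel size equal to the spatial size, each
# channel learns its own spatial weighting over the final feature map,
# instead of the uniform weighting of global average pooling.
channels, spatial = 512, 7
gdconv = nn.Conv2d(channels, channels, kernel_size=spatial,
                   groups=channels, bias=False)   # one filter per channel
feat = torch.randn(1, channels, spatial, spatial)  # backbone output
out = gdconv(feat)   # collapses 7x7 to 1x1: a discriminative feature vector
```

The learned per-channel weighting lets the embedding emphasize informative facial regions (e.g., eyes over background), at a small parameter cost of channels × 7 × 7.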
This paper presents a novel lightweight face recognition model that organically combines a CNN and a Transformer. The contributions of this work are as follows:
- 1. A novel and efficient lightweight model known as CFormerFaceNet, which combines a CNN and Transformer for face recognition, is proposed in this study. This model significantly reduces the number of parameters and computational complexity while maintaining high performance.
- 2. A Group Depth-Wise Transpose Attention (GDTA) block is designed to effectively capture both local and global representations and mitigate the issue of limited receptive fields in CNNs, without increasing the parameters and Multiply-Add (MAdd) operations.
- 3. Cross-covariance attention is used to perform the attention operation across the feature channel dimension instead of the spatial dimension, which reduces the quadratic complexity of the original self-attention operation in the number of tokens to a linear complexity. As a result, global information is effectively encoded implicitly.
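The cross-covariance attention of contribution 3 can be sketched as follows. This is an illustrative PyTorch implementation in the spirit of XCiT-style cross-covariance attention; the exact module in CFormerFaceNet may differ in details such as the learnable temperature or the projection layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCovarianceAttention(nn.Module):
    """Sketch of cross-covariance attention: the attention map is formed
    over channels rather than tokens, so the cost is linear in the number
    of tokens N and quadratic only in the (small) per-head channel count."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)   # each: (B, heads, C/h, N)
        # normalize over the token dimension, then build a channel-channel map
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, h, C/h, C/h)
        attn = attn.softmax(dim=-1)
        out = attn @ v                          # (B, heads, C/h, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the softmax is taken over a (C/h × C/h) matrix instead of an (N × N) one, doubling the number of tokens only doubles the cost, rather than quadrupling it as in standard self-attention.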
4. Experiments
In this section, we evaluate the efficacy of our lightweight CFormerFaceNet architecture in terms of its accuracy, parameters, FLOPs, and speed. We compare it with other Face Transformer models and various CNN models on standard face recognition datasets.
4.1. Training Data and Test Data
The cleaned MS1M [44] dataset is used as the training dataset; it contains 5.1 M face images of 93 K identities of different races, ages, and genders from around the world. The LFW [45], CALFW [46], CPLFW [47], SLLFW [48], CFP [49], and AgeDB-30 [50] datasets are employed for testing.
LFW (Labeled Faces in the Wild) is a dataset comprising 13,233 face images obtained from the internet, with 5749 identities that have been labeled. The images exhibit diversity in terms of ethnicity, gender, age, expression, pose, and lighting conditions and have become the standard benchmark for assessing the efficacy of face recognition algorithms.
The CALFW (Cross-Age LFW), CPLFW (Cross-Pose LFW), and SLLFW (Similar-Looking LFW) datasets are extensions of the widely used LFW dataset and were designed to emphasize the challenges of recognizing faces with age differences, faces with pose variations, and similar-looking faces, respectively. These datasets were constructed to provide more realistic and challenging benchmarks for evaluating the effectiveness of face recognition algorithms in real-world scenarios.
The CFP (Celebrities in Frontal-Profile) dataset was created to evaluate the accuracy of face recognition algorithms in recognizing faces with frontal-profile pose variations. It includes 500 identities, each with 10 frontal and 4 profile face images. Similar to the LFW dataset, it is divided into 10 subsets, each containing 350 pairs of same-class samples and 350 pairs of different-class samples. The CFP dataset has two evaluation settings: frontal-to-frontal verification (CFP_FF) and frontal-to-profile verification (CFP_FP).
The AgeDB-30 dataset comprises 12,240 face images of 440 individuals from various professions. Each image is labeled with the subject’s identity, age, and gender. The dataset is divided into 4 testing tasks, each consisting of 10 subsets. Each subset includes 300 pairs of same-class samples and 300 pairs of different-class samples. The age difference between sample pairs varies across the four testing tasks and is set to 5 years, 10 years, 20 years, and 30 years, respectively. The AgeDB-30 dataset serves as a benchmark for evaluating the performance of cross-age face recognition algorithms.
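All of these benchmarks share the same pair-based verification protocol: a pair of images is declared "same identity" when the similarity of their embeddings exceeds a threshold. A simplified sketch with a fixed threshold is shown below; the function name is ours, and in the standard protocol the threshold is instead selected on the other nine folds via 10-fold cross-validation.

```python
import numpy as np

def verify_pairs(emb1, emb2, same, threshold):
    """Sketch of LFW-style pair verification: accept a pair as 'same
    identity' when the cosine similarity of its two embeddings exceeds
    a threshold; accuracy is the fraction of correctly classified pairs.
    emb1, emb2: (P, D) embedding arrays; same: (P,) boolean labels."""
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sim = np.sum(e1 * e2, axis=1)            # cosine similarity per pair
    pred = sim > threshold
    return np.mean(pred == same)             # verification accuracy
```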
4.2. Training Implementation Details
We set the batch size to 128 and trained the models on an Nvidia GeForce RTX 2080 Ti GPU (Nvidia Corporation, Santa Clara, CA, USA). The initial learning rate was set to 0.001, and cosine annealing was used to gradually reduce it. The Adam optimizer was used with a momentum of 0.9. The models were trained for 40 epochs. All experiments were implemented using the PyTorch framework.
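The training schedule described above can be sketched in PyTorch as follows. The `model` here is a placeholder for the actual backbone, and betas=(0.9, 0.999) are PyTorch's Adam defaults, with 0.9 matching the stated momentum.

```python
import torch

# Sketch of the stated schedule: Adam, initial LR 1e-3, cosine annealing
# over 40 epochs. A tiny linear layer stands in for the real model.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)

for epoch in range(40):
    # ... one pass over the training loader, optimizer.step() per batch ...
    scheduler.step()   # decay the LR along a cosine curve, reaching ~0
```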
4.3. Comparison with Face Transformer Models
The authors of [51] explored the efficacy of Transformer models for facial recognition, noting that the original Transformer architecture may overlook inter-patch information. Their Face Transformer models adjusted the patch generation process by introducing overlapping sliding patches to create tokens. The experiments demonstrated the feasibility of Transformer models in face recognition and yielded promising results. Our proposed CFormerFaceNet model further improves the efficiency of Transformer models in face recognition.
Table 2 shows that CFormerFaceNet achieved comparable performance to the Face Transformer models using substantially fewer parameters and MACs when evaluated against several face recognition benchmarks, including the LFW, SLLFW, CALFW, CPLFW, and AgeDB-30 datasets.
4.4. Comparison with Different CNN Models for Face Recognition
We compared the performance of CFormerFaceNet with popular CNN models used for face recognition in recent years. ArcFace [32] was used as the loss function. Except for GhostFaceNet, the results of the other lightweight face recognition models on the different datasets were obtained from [54]. Table 3 shows that, compared with these CNN models, the proposed CFormerFaceNet model achieved the best performance on the LFW and CALFW test datasets. In addition, its performance on the CPLFW, CFP_FF, CFP_FP, and AgeDB-30 datasets was also competitive. Notably, its computational cost was much lower than that of the other lightweight face recognition models. Among them, GhostFaceNet had the computational complexity closest to that of CFormerFaceNet, but its computational cost was still 3.75 times higher, and its performance on the LFW, CALFW, and CFP_FP test datasets was inferior. The significantly lower computational demand of CFormerFaceNet can substantially reduce the processing requirements of edge-device processors for face recognition tasks.
4.5. Speed Comparison of Different Lightweight Face Recognition Models on a Computer and an Embedded Device
We conducted a speed test on a computer (i9-9900K CPU). CFormerFaceNet was compared with the various lightweight models reported in [39]. The performance of the models on LFW, as well as their speeds, is shown in Table 4. CFormerFaceNet achieved the best performance on the LFW dataset and was faster than EfficientNet, MobileNetV2, and MobileFaceNet, despite being run on an older i9 processor.
Additionally, we compared the running speeds of CFormerFaceNet and MobileFaceNet on a resource-limited embedded device based on an ARMv8 rev 1 (v8l) ×4 processor to further demonstrate the superiority of the proposed model in such environments (Table 5).
4.6. Real-Time Test Experiments under Different Input Face Image Resolutions
To evaluate the real-time performance of CFormerFaceNet in practical applications, we conducted frame-rate tests at different input face image resolutions on laptops and a Jetson Nano development board. The results presented in Table 6 are the average values over 100 iterations of the test. The data in Table 6 show that good real-time performance can be achieved on a regular laptop without a GPU. Specifically, at an input resolution of 800 × 800, a frame rate of 0.92 fps is maintained, meaning face recognition completes in 1.09 s, which meets the typical real-time requirements for face recognition. When tested on the Jetson Nano embedded development board, CFormerFaceNet also performed satisfactorily, reaching 1.52 fps (0.66 s per recognition) at an input resolution of 448 × 448 and 0.5 fps (2 s per recognition) at 800 × 800. These results meet the requirements of general face recognition tasks. For example, in scenarios with limited system resources, such as attendance machines and community access control equipment, it is difficult to deploy complex models, and the proposed CFormerFaceNet model can easily complete face recognition tasks.
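The frame-rate measurement used above can be sketched as a simple timing loop. The function name and structure are ours; `infer` stands for one forward pass of the model on a preprocessed frame.

```python
import time

def average_fps(infer, n_iters=100):
    """Sketch of the frame-rate test: run inference n_iters times and
    report the average frames per second over the whole run."""
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed
```

Averaging over many iterations, as done for Table 6, smooths out per-frame jitter from the OS scheduler and cache warm-up.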