1. Introduction
Facial recognition is the most effective method for identifying individuals in daily life, as it uses the unique biological features of the human face, which are considered the primary and most natural characteristics of the human body. This non-invasive and contactless method of identification allows for easy operation and intuitive results, making it an important research area in the field of biometric identification. Recent advancements in computer processing power, the proliferation of data, and innovation in deep learning algorithms have significantly improved the accuracy of facial recognition technology, making it applicable in a range of scenarios. This has led to the widespread adoption of facial recognition technology for applications such as unlocking smartphones; security screening at train stations, airports, and customs; payment at shopping centers, hotels, and self-service terminals; and access to offices and laboratories.
Although facial recognition is widely applied due to its technological maturity and effectiveness, it is far from perfect. On the one hand, the continuous expansion of application boundaries raises the requirements for the accuracy and scale of facial recognition. Attendance systems in small buildings and offices need to recognize only a few hundred people; access control in residential districts, schools, and similar places must handle several thousand to tens of thousands of people; and urban security scenarios can involve millions. Moreover, different types of scenarios have significantly different accuracy requirements. For example, applications in the financial sector, such as facial recognition payments, require much higher accuracy than ordinary access scenarios. Facial recognition therefore still needs to improve its accuracy and reliability across these scenarios to meet industrial-level demands. On the other hand, although facial recognition based on deep neural networks has significant performance advantages over other methods, its huge computational cost is the main factor restricting wider application. Reducing the computational and storage costs of deep neural networks while maintaining accuracy is therefore an important and challenging issue in the field of facial recognition.
From an academic research perspective, facial recognition and deep neural networks have always been considered classic issues in the fields of artificial intelligence and computer vision. Their research progress can be validated and applied to other visual tasks such as image classification, detection, and segmentation, as well as pedestrian re-identification, thereby promoting communication and development across related disciplines. From an applied perspective, improving the accuracy of facial recognition can further expand its range of applications, allowing it to be deployed in more fields. Reduced computing and storage expenses lead to lower hardware costs and faster response times. Therefore, lightweight deep neural network facial recognition has enormous significance in terms of both academic research value and practical applications.
In order to address the limitations of large-scale deep neural networks, including their large parameter size and computational complexity, several methods have been used to compress or accelerate deep neural networks, such as pruning [1,2], knowledge distillation [3], low-rank approximation [4], and quantization [5,6]. In recent years, the development of lightweight deep neural networks has emerged as one of the most promising ways to achieve a better balance between speed and accuracy. Among the widely used architectures in visual recognition tasks are SqueezeNet [7], MobileNets [8,9,10], ShuffleNets [11,12], EfficientNet [13], and GhostNet [14], which have demonstrated impressive results. However, only a few works have proposed accurate lightweight architectures specifically designed for face recognition [15,16,17,18,19], indicating the need for further exploration in this area.
From a fundamental perspective, face recognition belongs to the category of image-matching tasks. However, it is not a simple pixel-level matching process, as numerous factors can affect face recognition performance, such as pose, age, lighting, occlusion, or variations in image quality. These factors can hinder the ability of Convolutional Neural Networks (CNNs) to extract local features: changes in pose can cause part of the face to disappear, blurry facial images can lead to indistinct facial regions, and variations in lighting can result in the loss of detailed facial information. Furthermore, different facial regions contribute differently to the final recognition outcome. Therefore, when utilizing a CNN to extract facial features, regions that contribute more to face recognition should be given greater weight, as should feature channels with higher discriminative ability. Besides the extraction of efficient local features, the fusion of local facial features and global abstract facial features also plays an important role in face recognition performance. The convolution operation possesses inherent local perceptual characteristics but may disregard the overall correlation between different regions of a facial image, limiting its ability to comprehend global semantic information. For instance, two individuals can possess very similar eyes, making it difficult to distinguish between them; however, whether two images belong to the same person can be determined by evaluating factors such as interocular distance or the overall arrangement of the facial features. Nonetheless, CNNs struggle to learn such global facial information. The self-attention mechanism of the Transformer effectively compensates for the CNN’s limitations in global modeling by enabling long-distance modeling even at shallow levels of the model.
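The idea of assigning larger weights to more discriminative feature channels can be sketched with a squeeze-and-excitation-style channel gate. This is a generic, illustrative sketch in PyTorch; the `ChannelGate` name and reduction ratio are our own choices, not the specific attention module proposed in this paper.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Illustrative squeeze-and-excitation-style gate: channels with
    higher discriminative power receive larger weights. A generic
    sketch, not the attention used by CFormerFaceNet."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):                    # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))      # per-channel weight in (0, 1)
        return x * w[:, :, None, None]       # reweight feature channels
```

The same gating idea extends to spatial locations, which is how higher-contributing facial regions could be emphasized.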
The literature has shown the feasibility of incorporating Transformer models into face recognition, and promising experimental outcomes have been reported.
Based on the aforementioned analysis, this paper presents a novel, efficient, and lightweight face recognition model known as CFormerFaceNet, which integrates a CNN and a Transformer. By fully utilizing the CNN’s ability to extract local facial features and the Transformer’s capability to model global facial features, the two techniques are seamlessly combined to form CFormerFaceNet. In addition, lightweight modifications significantly reduce the required parameters and computations without significantly compromising performance. CFormerFaceNet can therefore greatly reduce the processor requirements of edge devices in face recognition tasks, thereby improving real-time performance.
2. Related Works
The significant progress in deep face recognition can be attributed to the rapid development of deep convolutional neural networks. DeepFace [20] employs the AlexNet [21] architecture for the end-to-end learning of face representations. DeepID [22], on the other hand, utilizes a different feature extraction method: based on the locations of facial features, it segments the face image into 60 blocks of varying size and color and trains 4 convolutional layers to integrate them into an overcomplete high-dimensional feature. DeepID achieved an accuracy of 97.45% on the LFW benchmark dataset. To further improve its performance, DeepID2 [23], DeepID2+ [24], and DeepID3 [25] were subsequently proposed. FaceNet [26] trains with a triplet loss to directly learn a mapping from face images to Euclidean space, achieving state-of-the-art results in large-scale face recognition tasks. The center loss [27] enhances the discriminative power of deep features by penalizing the distance between each deep feature and its corresponding class center, significantly improving performance. Margin-based loss functions have been the primary research direction in deep face recognition in recent years, as they are simple and effective. L-Softmax [28] introduces a larger angular gap in deep feature learning, and subsequent margin-based loss functions follow this design. SphereFace [29] introduces a multiplicative margin to achieve tighter decision boundaries and also introduces weight normalization. To learn face embeddings with larger angular distances, feature normalization was introduced in [30]. CosFace [31] and ArcFace [32] achieved impressive results by using additive margins instead of multiplicative margins. Other margin-based loss functions have also shown excellent performance, including Circle loss [33], AdaCos [34], AdaptiveFace [35], CurricularFace [36], and DiscFace [37]. These models have demonstrated the effectiveness of deep convolutional neural networks in face recognition tasks; however, they do not consider model size or computational efficiency.
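The additive angular margin of ArcFace [32] can be sketched as follows. This is a minimal PyTorch illustration; the class name is ours, and the scale s = 64 and margin m = 0.5 are common defaults from the literature, not necessarily the settings of any specific paper above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAngularMarginLoss(nn.Module):
    """Sketch of an ArcFace-style loss: features and class weights are
    L2-normalized so logits are cosines of angles, and a margin m is
    added to the angle of the ground-truth class before rescaling by s."""

    def __init__(self, in_features, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine of the angle between each embedding and each class center
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin only on the ground-truth class
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)
```

A multiplicative margin (SphereFace [29]) would instead scale the target angle, i.e., cos(m·θ), which makes optimization harder; the additive form is one reason CosFace/ArcFace train more stably.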
The development of lightweight and efficient deep neural network architectures plays a crucial role in enabling high-performance computing on edge or portable devices with performance constraints. The effective implementation of efficient and lightweight deep face recognition architectures continues to pose a challenge for practical applications in real-world settings. Ref. [16] introduced an alternative to the ReLU function called Max-Feature-Map (MFM) to suppress low-activation neurons in each convolutional layer. In addition, small convolution filters, Network-in-Network (NIN) layers, and residual blocks were employed to reduce the parameter space and improve performance. Three lightweight CNN architectures were evaluated, demonstrating superior speed and storage efficiency compared with state-of-the-art large face models. These findings suggest the potential for significant performance improvements in face recognition tasks through the use of lightweight CNN structures.
MobileFaceNet [15] was built on the MobileNetV2 architecture and incorporated a global depth-wise convolution layer in place of global average pooling to output a discriminative feature vector, achieving significantly improved efficiency compared with previous state-of-the-art mobile CNNs for face verification. ShuffleFaceNet [18] was built on the ShuffleNetV2 [12] architecture and includes four ShuffleFaceNet models with different complexity levels. These models use fewer than 4.5 million parameters and have a maximum computational complexity of 1.05 GFLOPs. Among them, the ShuffleFaceNet 1.5× model exhibits the best balance between speed and accuracy. VarGFaceNet [38] employs efficient convolutions with variable groups and an equivalent angular distillation loss function to further increase the model’s discriminative power and interpretability. MobiFace [19] uses a residual bottleneck layer with an expansion layer to construct the model, optimizing the information contained within the output vector and achieving a notable accuracy of 99.73% on the LFW dataset. ELANet [17] was proposed based on MobileFaceNet and is able to acquire diverse multi-scale, multi-level characteristics while also discerning local features. However, these methods are all based on CNNs. Ref. [39] proposed GhostFaceNet, an efficient and lightweight CNN based on GhostNet [14] for embedded or mobile devices, which significantly reduces the model’s computational requirements.
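The global depth-wise convolution that MobileFaceNet [15] uses in place of global average pooling can be sketched as follows. Sizes here are illustrative assumptions: a 512-channel 7 × 7 feature map is a typical backbone output.

```python
import torch
import torch.nn as nn

# Sketch of a MobileFaceNet-style global depth-wise convolution. With
# groups == channels and kernel size equal to the spatial size, each
# channel learns its own spatial weighting over the final feature map,
# instead of the uniform weighting of global average pooling.
channels, spatial = 512, 7
gdconv = nn.Conv2d(channels, channels, kernel_size=spatial,
                   groups=channels, bias=False)   # one filter per channel
feat = torch.randn(1, channels, spatial, spatial)  # backbone output
out = gdconv(feat)   # collapses 7x7 to 1x1: a discriminative feature vector
```

The learned per-channel weighting lets the embedding emphasize informative facial regions (e.g., eyes over background), at a small parameter cost of channels × 7 × 7.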
This paper presents a novel lightweight face recognition model that organically combines a CNN and a Transformer. The contributions of this work are as follows:
- 1. A novel and efficient lightweight model known as CFormerFaceNet, which combines a CNN and Transformer for face recognition, is proposed in this study. This model significantly reduces the number of parameters and computational complexity while maintaining high performance.
- 2. A Group Depth-Wise Transpose Attention (GDTA) block is designed to effectively capture both local and global representations and mitigate the issue of limited receptive fields in CNNs, without increasing the parameters and Multiply-Add (MAdd) operations.
- 3. Cross-covariance attention is used to perform the attention operation across the feature channel dimension instead of the spatial dimension, which reduces the quadratic complexity of the original self-attention operation in the number of tokens to a linear complexity. As a result, global information is effectively encoded implicitly.
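The cross-covariance attention of contribution 3 can be sketched as follows. This is an illustrative PyTorch implementation in the spirit of XCiT-style cross-covariance attention; the exact module in CFormerFaceNet may differ in details such as the learnable temperature or the projection layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCovarianceAttention(nn.Module):
    """Sketch of cross-covariance attention: the attention map is formed
    over channels rather than tokens, so the cost is linear in the number
    of tokens N and quadratic only in the (small) per-head channel count."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)   # each: (B, heads, C/h, N)
        # normalize over the token dimension, then build a channel-channel map
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, h, C/h, C/h)
        attn = attn.softmax(dim=-1)
        out = attn @ v                          # (B, heads, C/h, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)
```

Because the softmax is taken over a (C/h × C/h) matrix instead of an (N × N) one, doubling the number of tokens only doubles the cost, rather than quadrupling it as in standard self-attention.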
4. Experiments
In this section, we evaluate the efficacy of our lightweight CFormerFaceNet architecture in terms of its accuracy, parameters, FLOPs, and speed. We compare it with other Face Transformer models and various CNN models on standard face recognition datasets.
4.1. Training Data and Test Data
The cleaned MS1M [44] dataset is used as the training dataset; it contains 5.1 M face images of 93 K identities of different races, ages, and genders from around the world. The LFW [45], CALFW [46], CPLFW [47], SLLFW [48], CFP [49], and AgeDB-30 [50] datasets are employed for testing.
LFW (Labeled Faces in the Wild) is a dataset comprising 13,233 face images obtained from the internet, with 5749 identities that have been labeled. The images exhibit diversity in terms of ethnicity, gender, age, expression, pose, and lighting conditions and have become the standard benchmark for assessing the efficacy of face recognition algorithms.
The CALFW (Cross-Age LFW), CPLFW (Cross-Pose LFW), and SLLFW (Similar-Looking LFW) datasets are extensions of the widely used LFW dataset and were designed to emphasize the challenges of recognizing faces with age differences, faces with pose variations, and similar-looking faces, respectively. These datasets were constructed to provide more realistic and challenging benchmarks for evaluating the effectiveness of face recognition algorithms in real-world scenarios.
The CFP (Celebrities in Frontal-Profile) dataset was created to evaluate the accuracy of face recognition algorithms in recognizing faces with frontal-profile pose variations. It includes 500 identities, each with 10 frontal and 4 profile face images. Similar to the LFW dataset, it is divided into 10 subsets, each containing 350 pairs of same-class samples and 350 pairs of different-class samples. The CFP dataset has two evaluation settings: frontal-to-frontal verification (CFP_FF) and frontal-to-profile verification (CFP_FP).
The AgeDB-30 dataset comprises 12,240 face images of 440 individuals from various professions. Each image is labeled with the subject’s identity, age, and gender. The dataset is divided into 4 testing tasks, each consisting of 10 subsets. Each subset includes 300 pairs of same-class samples and 300 pairs of different-class samples. The age difference between sample pairs varies across the four testing tasks and is set to 5 years, 10 years, 20 years, and 30 years, respectively. The AgeDB-30 dataset serves as a benchmark for evaluating the performance of cross-age face recognition algorithms.
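All of these benchmarks share the same pair-based verification protocol: a pair of images is declared "same identity" when the similarity of their embeddings exceeds a threshold. A simplified sketch with a fixed threshold is shown below; the function name is ours, and in the standard protocol the threshold is instead selected on the other nine folds via 10-fold cross-validation.

```python
import numpy as np

def verify_pairs(emb1, emb2, same, threshold):
    """Sketch of LFW-style pair verification: accept a pair as 'same
    identity' when the cosine similarity of its two embeddings exceeds
    a threshold; accuracy is the fraction of correctly classified pairs.
    emb1, emb2: (P, D) embedding arrays; same: (P,) boolean labels."""
    e1 = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    e2 = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sim = np.sum(e1 * e2, axis=1)            # cosine similarity per pair
    pred = sim > threshold
    return np.mean(pred == same)             # verification accuracy
```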
4.2. Training Implementation Details
We set the batch size to 128 and trained the models on an Nvidia GeForce RTX 2080 Ti GPU (Nvidia Corporation, Santa Clara, CA, USA). The initial learning rate was set to 0.001, and cosine annealing was used to gradually reduce it. The Adam optimizer was used with a momentum of 0.9. The models were trained for 40 epochs. All experiments were implemented using the PyTorch framework.
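The training schedule described above can be sketched in PyTorch as follows. The `model` here is a placeholder for the actual backbone, and betas=(0.9, 0.999) are PyTorch's Adam defaults, with 0.9 matching the stated momentum.

```python
import torch

# Sketch of the stated schedule: Adam, initial LR 1e-3, cosine annealing
# over 40 epochs. A tiny linear layer stands in for the real model.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=40)

for epoch in range(40):
    # ... one pass over the training loader, optimizer.step() per batch ...
    scheduler.step()   # decay the LR along a cosine curve, reaching ~0
```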
4.3. Comparison with Face Transformer Models
The authors of [51] explored the efficacy of Transformer models for facial recognition, noting that the original Transformer architecture may overlook inter-patch information. Their Face Transformer models adjusted the patch generation process by introducing overlapping sliding patches to create tokens. The experiments demonstrated the feasibility of Transformer models in face recognition and yielded promising results. Our proposed CFormerFaceNet model further improves the efficiency of Transformer models in face recognition.
Table 2 shows that CFormerFaceNet achieved comparable performance to the Face Transformer models using substantially fewer parameters and MACs when evaluated against several face recognition benchmarks, including the LFW, SLLFW, CALFW, CPLFW, and AgeDB-30 datasets.
4.4. Comparison with Different CNN Models for Face Recognition
We compared the performance of CFormerFaceNet with popular CNN models used for face recognition in recent years. ArcFace [32] was used as the loss function. Except for GhostFaceNet, the results of the other lightweight face recognition models on the different datasets were obtained from [54]. Table 3 shows that, compared with these CNN models, the proposed CFormerFaceNet model achieved the best performance on the LFW and CALFW test datasets. In addition, its performance on the CPLFW, CFP_FF, CFP_FP, and AgeDB-30 datasets was also competitive. Notably, its computational cost was much lower than that of the other lightweight face recognition models. Among them, GhostFaceNet had the computational complexity closest to that of CFormerFaceNet, but its computational cost was still 3.75 times higher, and its performance on the LFW, CALFW, and CFP_FP test datasets was inferior. The significantly lower computational demand of CFormerFaceNet can substantially reduce the processing requirements of edge-device processors for face recognition tasks.
4.5. Speed Comparison of Different Lightweight Face Recognition Models on a Computer and an Embedded Device
We conducted a speed test on a computer (i9-9900K CPU). CFormerFaceNet was compared with the various lightweight models reported in [39]. The performance of the models on LFW, as well as their speeds, is shown in Table 4. CFormerFaceNet achieved the best performance on the LFW dataset and was faster than EfficientNet, MobileNetV2, and MobileFaceNet, despite being run on an older i9 processor.
Additionally, we compared the running speeds of CFormerFaceNet and MobileFaceNet on a resource-limited embedded device based on an ARMv8 rev 1 (v8l) ×4 processor to further demonstrate the superiority of the proposed model in such environments (Table 5).
4.6. Real-Time Test Experiments under Different Input Face Image Resolutions
To evaluate the real-time performance of CFormerFaceNet in practical applications, we conducted frame-rate tests at different input face image resolutions on laptops and a Jetson Nano development board. The results presented in Table 6 are the average values over 100 iterations of the test. The data in Table 6 show that good real-time performance can be achieved on a regular laptop without a GPU. Specifically, at an input resolution of 800 × 800, a frame rate of 0.92 fps is maintained, meaning face recognition completes in 1.09 s, which meets the typical real-time requirements for face recognition. When tested on the Jetson Nano embedded development board, CFormerFaceNet also performed satisfactorily, reaching 1.52 fps (0.66 s per recognition) at an input resolution of 448 × 448 and 0.5 fps (2 s per recognition) at 800 × 800. These results meet the requirements of general face recognition tasks. For example, in scenarios with limited system resources, such as attendance machines and community access control equipment, it is difficult to deploy complex models, and the proposed CFormerFaceNet model can easily complete face recognition tasks.
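The frame-rate measurement used above can be sketched as a simple timing loop. The function name and structure are ours; `infer` stands for one forward pass of the model on a preprocessed frame.

```python
import time

def average_fps(infer, n_iters=100):
    """Sketch of the frame-rate test: run inference n_iters times and
    report the average frames per second over the whole run."""
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed
```

Averaging over many iterations, as done for Table 6, smooths out per-frame jitter from the OS scheduler and cache warm-up.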