CFormerFaceNet: Efficient Lightweight Network Merging a CNN and Transformer for Face Recognition
Abstract
1. Introduction
2. Related Works
1. A novel and efficient lightweight model called CFormerFaceNet, which combines a CNN and a Transformer for face recognition, is proposed in this study. This model significantly reduces the number of parameters and the computational complexity while maintaining high performance.
2. A Group Depth-Wise Transpose Attention (GDTA) block is designed to capture both local and global representations, mitigating the limited receptive field of CNNs without increasing the number of parameters or Multiply-Add (MAdd) operations.
3. Cross-covariance attention applies the attention operation across the feature channel dimension instead of the spatial dimension, which reduces the complexity of the original self-attention operation from quadratic to linear in the number of tokens. As a result, global information is encoded implicitly.
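
The complexity claim in the third contribution can be checked with a rough operation count. The sketch below is an illustration, not the authors' implementation: the function name `attention_madds` is hypothetical, a single attention head is assumed, and the projection layers, normalisation, and softmax are ignored. The token counts and channel widths are those listed for the three attention stages in the architecture table.

```python
def attention_madds(num_tokens: int, channels: int) -> dict:
    """Count multiply-adds of the two matrix products in each attention variant.

    Assumes a single head and ignores the Q/K/V and output projections,
    normalisation, and softmax (illustrative simplifications).
    """
    n, d = num_tokens, channels
    # Spatial self-attention: (n x d)(d x n) score map, then (n x n)(n x d).
    spatial = 2 * n * n * d
    # Cross-covariance attention: (d x n)(n x d) channel map, then (n x d)(d x d).
    cross_cov = 2 * n * d * d
    return {"spatial": spatial, "cross_covariance": cross_cov}


# (tokens, channels) for the 28x28, 14x14 and 7x7 attention stages.
stages = [(28 * 28, 64), (14 * 14, 96), (7 * 7, 128)]
costs = [attention_madds(n, d) for n, d in stages]
for (n, d), c in zip(stages, costs):
    print(f"tokens={n:4d} channels={d:3d} "
          f"spatial={c['spatial']:>11,} xca={c['cross_covariance']:>11,}")
```

Doubling the token count quadruples the spatial count but only doubles the cross-covariance count. Under these single-head assumptions the count favours spatial attention once the token count drops below the channel width, as at the 7 × 7 stage; the linear scaling pays off at the higher-resolution stages and for larger inputs.
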
3. Proposed Approach
3.1. N × N Conv. Block
3.2. GDTA Block
4. Experiments
4.1. Training Data and Test Data
4.2. Training Implementation Details
4.3. Comparison with Face Transformer Models
4.4. Comparison with Different CNN Models for Face Recognition
4.5. Speed Comparison of Different Lightweight Face Recognition Models on a Computer and an Embedded Device
4.6. Real-Time Test Experiments under Different Input Face Image Resolutions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2015, arXiv:1510.00149.
2. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. arXiv 2016, arXiv:1611.06440.
3. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
4. Zhang, X.; Zou, J.; Ming, X.; He, K.; Sun, J. Efficient and Accurate Approximations of Nonlinear Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1984–1992.
5. Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; Cheng, J. Quantized Convolutional Neural Networks for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 4820–4828.
6. Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep Learning with Limited Numerical Precision. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1737–1746.
7. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size. arXiv 2016, arXiv:1602.07360.
8. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
9. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
10. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
11. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
12. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
13. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
14. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
15. Chen, S.; Liu, Y.; Gao, X.; Han, Z. MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices. In Biometric Recognition; Zhou, J., Wang, Y., Sun, Z., Jia, Z., Feng, J., Shan, S., Ubul, K., Guo, Z., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 10996, pp. 428–438. ISBN 978-3-319-97908-3.
16. Wu, X.; He, R.; Sun, Z.; Tan, T. A Light CNN for Deep Face Representation with Noisy Labels. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2884–2896.
17. Zhang, P.; Zhao, F.; Liu, P.; Li, M. Efficient Lightweight Attention Network for Face Recognition. IEEE Access 2022, 10, 31740–31750.
18. Martinez-Diaz, Y.; Luevano, L.S.; Mendez-Vazquez, H.; Nicolas-Diaz, M.; Chang, L.; Gonzalez-Mendoza, M. ShuffleFaceNet: A Lightweight Face Architecture for Efficient and Highly-Accurate Face Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 2721–2728.
19. Duong, C.N.; Quach, K.G.; Jalata, I.; Le, N.; Luu, K. Mobiface: A Lightweight Deep Learning Face Recognition on Mobile Devices. In Proceedings of the 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), Tampa, FL, USA, 23–26 September 2019; pp. 1–6.
20. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. Deepface: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708.
21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
22. Sun, Y.; Wang, X.; Tang, X. Deep Learning Face Representation from Predicting 10,000 Classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1891–1898.
23. Sun, Y.; Chen, Y.; Wang, X.; Tang, X. Deep Learning Face Representation by Joint Identification-Verification. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27.
24. Sun, Y.; Wang, X.; Tang, X. Deeply Learned Face Representations Are Sparse, Selective, and Robust. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2892–2900.
25. Sun, Y.; Liang, D.; Wang, X.; Tang, X. Deepid3: Face Recognition with Very Deep Neural Networks. arXiv 2015, arXiv:1502.00873.
26. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
27. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A Discriminative Feature Learning Approach for Deep Face Recognition. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 499–515.
28. Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-Margin Softmax Loss for Convolutional Neural Networks. arXiv 2016, arXiv:1612.02295.
29. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. Sphereface: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 212–220.
30. Wang, F.; Xiang, X.; Cheng, J.; Yuille, A.L. Normface: L2 Hypersphere Embedding for Face Verification. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1041–1049.
31. Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; Liu, W. Cosface: Large Margin Cosine Loss for Deep Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5265–5274.
32. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. Arcface: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4690–4699.
33. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle Loss: A Unified Perspective of Pair Similarity Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6398–6407.
34. Zhang, X.; Zhao, R.; Qiao, Y.; Wang, X.; Li, H. Adacos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 10823–10832.
35. Liu, H.; Zhu, X.; Lei, Z.; Li, S.Z. AdaptiveFace: Adaptive Margin and Sampling for Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 11939–11948.
36. Huang, Y.; Wang, Y.; Tai, Y.; Liu, X.; Shen, P.; Li, S.; Li, J.; Huang, F. Curricularface: Adaptive Curriculum Learning Loss for Deep Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5901–5910.
37. SpringerLink. DiscFace: Minimum Discrepancy Learning for Deep Face Recognition. Available online: https://link.springer.com/chapter/10.1007/978-3-030-69541-5_22 (accessed on 23 April 2023).
38. Yan, M.; Zhao, M.; Xu, Z.; Zhang, Q.; Wang, G.; Su, Z. Vargfacenet: An Efficient Variable Group Convolutional Neural Network for Lightweight Face Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019.
39. Zhao, F.; Zhang, P.; Zhang, R.; Li, M. UnifiedFace: A Uniform Margin Loss Function for Face Recognition. Appl. Sci. 2023, 13, 2350.
40. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A Convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
41. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450.
42. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (Gelus). arXiv 2016, arXiv:1606.08415.
43. Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662.
44. SpringerLink. MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition. Available online: https://link.springer.com/chapter/10.1007/978-3-319-46487-9_6 (accessed on 23 April 2023).
45. Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In Proceedings of the Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, 17–20 October 2008.
46. Zheng, T.; Deng, W.; Hu, J. Cross-Age Lfw: A Database for Studying Cross-Age Face Recognition in Unconstrained Environments. arXiv 2017, arXiv:1708.08197.
47. Zheng, T.; Deng, W. Cross-Pose Lfw: A Database for Studying Cross-Pose Face Recognition in Unconstrained Environments. Beijing Univ. Posts Telecommun. Tech. Rep. 2018, 5, 1–6.
48. Deng, W.; Hu, J.; Zhang, N.; Chen, B.; Guo, J. Fine-Grained Face Verification: FGLFW Database, Baselines, and Human-DCMN Partnership. Pattern Recognit. 2017, 66, 63–73.
49. Sengupta, S.; Chen, J.-C.; Castillo, C.; Patel, V.M.; Chellappa, R.; Jacobs, D.W. Frontal to Profile Face Verification in the Wild. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9.
50. Moschoglou, S.; Papaioannou, A.; Sagonas, C.; Deng, J.; Kotsia, I.; Zafeiriou, S. AgeDB: The First Manually Collected, In-the-Wild Age Database. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1997–2005.
51. Zhong, Y.; Deng, W. Face Transformer for Recognition. arXiv 2021, arXiv:2103.14803.
52. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
53. Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.-H.; Tay, F.E.H.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 558–567.
54. Martínez-Díaz, Y.; Nicolás-Díaz, M.; Méndez-Vázquez, H.; Luevano, L.S.; Chang, L.; Gonzalez-Mendoza, M.; Sucar, L.E. Benchmarking Lightweight Face Architectures on Specific Face Recognition Scenarios. Artif. Intell. Rev. 2021, 54, 6201–6244.

Layer | Output Size | #Layers (n) | Kernel | Output Channels |
---|---|---|---|---|
Image | 112 × 112 | 1 | - | - |
Downsampling | 56 × 56 | 1 | 2 × 2 | 32 |
Conv. Block | 56 × 56 | 3 | 3 × 3 | 32 |
Downsampling | 28 × 28 | 1 | 2 × 2 | 64 |
Conv. Block | 28 × 28 | 5 | 5 × 5 | 64 |
GDTA Block | 28 × 28 | 1 | - | 64 |
Downsampling | 14 × 14 | 1 | 2 × 2 | 96 |
Conv. Block | 14 × 14 | 5 | 7 × 7 | 96 |
GDTA Block | 14 × 14 | 1 | - | 96 |
Downsampling | 7 × 7 | 1 | 2 × 2 | 128 |
Conv. Block | 7 × 7 | 2 | 9 × 9 | 128 |
GDTA Block | 7 × 7 | 1 | - | 128 |
Conv2d | 7 × 7 | 1 | 1 × 1 | 512 |
Global Depth-wise conv. | 1 × 1 | 1 | 7 × 7 | 512 |
Linear | 1 × 1 | 1 | - | 512 |
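
The spatial sizes in the architecture table can be reproduced with the standard convolution output formula. The sketch below is a sanity check rather than the authors' code: stride 2 with no padding for the 2 × 2 downsampling layers, stride 1 with no padding for the global depth-wise convolution, bias-free layers, and a 512-dimensional input to the final linear layer are all assumptions, and `conv_out` is a hypothetical helper name.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# Four 2x2 downsampling layers; stride 2 and no padding are assumptions,
# chosen because they reproduce the output sizes listed in the table.
size = 112
sizes = []
for _ in range(4):
    size = conv_out(size, kernel=2, stride=2)
    sizes.append(size)
print(sizes)

# A 7x7 global depth-wise convolution over the 7x7 map collapses it to 1x1.
pooled = conv_out(size, kernel=7, stride=1)

# Bias-free weight counts for the three head layers (assumed shapes).
head_params = {
    "conv1x1_128_to_512": 1 * 1 * 128 * 512,
    "global_depthwise_7x7": 7 * 7 * 512,  # one 7x7 filter per channel
    "linear_512_to_512": 512 * 512,
}
print(pooled, head_params, sum(head_params.values()))
```

Under these assumptions the three head layers hold roughly 0.35 M weights, most of them in the final linear layer.
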

Model | LFW | SLLFW | CALFW | CPLFW | AgeDB-30 | Params | MACs |
---|---|---|---|---|---|---|---|
ViT-P8S8 [52] | 99.83% | 99.53% | 95.92% | 92.55% | 97.82% | 63.2 M | 12.4 G |
T2T-ViT [53] | 99.82% | 99.63% | 95.85% | 93.00% | 98.07% | 63.5 M | 12.7 G |
ViT-P10S8 | 99.77% | 99.63% | 95.95% | 92.93% | 97.83% | 63.3 M | 12.4 G |
ViT-P12S8 | 99.80% | 99.55% | 96.18% | 93.08% | 98.05% | 63.3 M | 12.4 G |
CFormerFaceNet (ours) | 99.75% | 99.32% | 95.73% | 90.20% | 97.12% | 1.7 M | 0.079 G |

Model | LFW | CPLFW | CALFW | CFP_FF | CFP_FP | AgeDB-30 | Params | FLOPs |
---|---|---|---|---|---|---|---|---|
DenseNet | 99.22% | 86.84% | 93.03% | 99.18% | 94.44% | - | 66.37 M | 8.52 G |
ResNet-50 | 99.64% | 90.57% | 95.28% | 99.50% | 96.32% | 95.2% | 40.29 M | 2.19 G |
VarGFaceNet | 99.70% | 88.55% | 95.15% | 99.50% | 96.90% | 97.5% | - | - |
ProxylessFaceNAS | 99.20% | 84.17% | 92.55% | 98.80% | 94.70% | 94.4% | - | - |
EfficientNet | 99.53% | 90.92% | 95.78% | 99.5% | 96.32% | - | 6.58 M | 1.14 G |
MobileNetV2 | 99.55% | 89.43% | 95.34% | 99.48% | 93.17% | - | 2.26 M | 0.43 G |
MobileFaceNetV1 | 99.40% | 87.17% | 94.47% | 99.50% | 95.80% | 96.4% | - | - |
MobileFaceNet | 99.53% | 90.34% | 95.42% | 99.55% | 95.26% | 97.6% | 1.0 M | 0.45 G |
ShuffleFaceNet | 99.70% | 88.50% | 95.05% | 99.60% | 96.30% | 97.3% | 2.60 M | 0.58 G |
GhostFaceNet [39] | 99.50% | - | 94.63% | - | 94.54% | - | 0.82 M | 0.15 G |
CFormerFaceNet (ours) | 99.73% | 90.20% | 95.80% | 99.71% | 95.06% | 97.12% | 1.7 M | 0.04 G |

Model | Accuracy on LFW | Speed |
---|---|---|
EfficientNet | 99.53% | 36.50 ms |
MobileNetV2 | 99.55% | 25.67 ms |
MobileFaceNet | 99.53% | 23.67 ms |
GhostFaceNet | 99.63% | 15.30 ms |
CFormerFaceNet (ours) | 99.73% | 19.00 ms |

Model | Params | FLOPs | Speed |
---|---|---|---|
MobileFaceNet | 1.0 M | 0.45 G | 4.86 s |
CFormerFaceNet (ours) | 1.7 M | 0.04 G | 3.98 s |

Platform | 112 × 112 | 224 × 224 | 300 × 300 | 448 × 448 | 800 × 800 |
---|---|---|---|---|---|
Laptop with RTX3050Ti GPU | 105.02 fps | 90.48 fps | 61.89 fps | 37.59 fps | 12.47 fps |
Laptop with R5-5600H CPU | 45.86 fps | 12.18 fps | 6.56 fps | 2.87 fps | 0.92 fps |
Jetson Nano | 7.48 fps | 4.64 fps | 3.14 fps | 1.52 fps | 0.50 fps |
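
Frame rates are easier to compare with the millisecond timings elsewhere in the results once they are converted to per-frame latency. The sketch below copies the numbers from the table above and derives latency and the slowdown between the smallest and largest input. The helper names `latency_ms` and `slowdown` are hypothetical, and the table does not state what each frame includes, so the latencies are totals for whatever was measured.

```python
# Frames per second copied from the table above, keyed by input side length.
fps = {
    "RTX3050Ti GPU": {112: 105.02, 224: 90.48, 300: 61.89, 448: 37.59, 800: 12.47},
    "R5-5600H CPU": {112: 45.86, 224: 12.18, 300: 6.56, 448: 2.87, 800: 0.92},
    "Jetson Nano": {112: 7.48, 224: 4.64, 300: 3.14, 448: 1.52, 800: 0.50},
}


def latency_ms(frames_per_second: float) -> float:
    """Mean time per frame in milliseconds."""
    return 1000.0 / frames_per_second


def slowdown(platform: str, small: int = 112, large: int = 800) -> float:
    """How many times slower the large input runs than the small one."""
    return fps[platform][small] / fps[platform][large]


# Growth in pixel count from 112x112 to 800x800, for comparison.
pixel_ratio = (800 / 112) ** 2
for name, row in fps.items():
    ms = {side: round(latency_ms(v), 1) for side, v in row.items()}
    print(name, ms, f"slowdown={slowdown(name):.1f}x", f"pixels={pixel_ratio:.1f}x")
```

Comparing each platform's slowdown with the pixel ratio gives a rough sense of how closely throughput tracks input area on that device.
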
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
He, L.; He, L.; Peng, L. CFormerFaceNet: Efficient Lightweight Network Merging a CNN and Transformer for Face Recognition. Appl. Sci. 2023, 13, 6506. https://doi.org/10.3390/app13116506