Comparative Analysis of Vision Transformer Models for Facial Emotion Recognition Using Augmented Balanced Datasets
Abstract
1. Introduction
2. Related Works
2.1. Facial Emotion Recognition
2.2. Transformer Models in Computer Vision
2.3. Attaining Data Balance through Augmentation
3. Analysis of Vision Transformer Models
3.1. Vision Transformer Models
3.1.1. Base Vision Transformer (ViT)
- Every image patch is flattened into a vector of length P² · C, where P × P is the patch size and C is the number of channels; this yields N = HW/P² patches per image.
- Mapping the flattened patches to D dimensions with a trainable linear projection, E, produces a sequence of embedded image patches.
- A learnable class embedding, x_class, is prepended to the sequence of embedded image patches; its state at the output of the encoder represents the classification output, y.
- Finally, positional information, which is also learned during training, is added to the input by augmenting the patch embeddings with one-dimensional positional embeddings, E_pos.
- Multi-head self-attention (MHSA) layer: By using multiple “heads”, the ViT model can attend to different segments of an image simultaneously. Each head computes attention independently, producing a distinct representation of the image; these representations are then combined into a final image representation. This allows the model to capture more nuanced interactions between input components, at the cost of added complexity and computation from aggregating the outputs of all the heads. Figure 3 visualizes the attention mechanism of a vision transformer for different numbers of heads in the MHSA layer: the upper-left corner shows the original image and its attention visualization using the mean over all heads, while the lower part shows visualizations for varying numbers of heads. As Figure 3 shows, increasing the number of heads improves the model’s ability to identify interrelated objects in our dataset; however, using too many heads can also degrade accuracy. In most cases it is therefore necessary to determine the optimal number of heads for each model; in Figure 3, the best choice is four heads.
- Layer normalization (LN): LN normalizes the inputs to each block, stabilizing training without introducing new dependencies between training examples. This improves overall performance and training effectiveness.
- Feed-forward network (FFN): The outputs of the MHSA layers are processed by the FFN, which consists of two linear transformation layers with a nonlinear activation function between them.
- Multi-layer perceptron (MLP): This layer uses the GELU activation function in a two-layer structure. A minimal sketch of these components is given after this list.
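To make the components above concrete, the following is a minimal PyTorch sketch of a ViT forward pass: patch flattening, the linear projection E, the class token, one-dimensional positional embeddings, and a pre-norm encoder block with MHSA, LN, and a GELU MLP. The hyperparameters (48 × 48 grayscale input, patch size 16, embedding dimension 64, 4 heads, 7 emotion classes) are illustrative assumptions, not the exact settings used in this paper.

```python
# Minimal sketch of the ViT components described above; hyperparameters are
# illustrative assumptions, not the settings used in this paper.
import torch
import torch.nn as nn


class MiniViTBlock(nn.Module):
    def __init__(self, dim=64, heads=4, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                 # two linear layers with GELU
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        # Pre-norm MHSA with a residual connection; the attention weights could
        # be averaged over heads to produce visualizations like Figure 3.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x


class MiniViT(nn.Module):
    def __init__(self, img=48, patch=16, chans=1, dim=64, classes=7):
        super().__init__()
        n = (img // patch) ** 2                        # number of patches N = HW / P^2
        # Flatten P x P x C patches and project them to D dimensions (the matrix E).
        self.proj = nn.Linear(patch * patch * chans, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))   # 1-D positional embeddings
        self.block = MiniViTBlock(dim)
        self.head = nn.Linear(dim, classes)
        self.patch = patch

    def forward(self, imgs):                           # imgs: (B, C, H, W)
        b, c, _, _ = imgs.shape
        p = self.patch
        # Split the image into non-overlapping P x P patches and flatten each one.
        patches = imgs.unfold(2, p, p).unfold(3, p, p)             # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.proj(patches)                                # (B, N, D)
        tokens = torch.cat([self.cls.expand(b, -1, -1), tokens], dim=1)
        tokens = tokens + self.pos
        tokens = self.block(tokens)
        return self.head(tokens[:, 0])                 # classify from the class token


logits = MiniViT()(torch.randn(2, 1, 48, 48))          # e.g., two 48x48 grayscale faces
print(logits.shape)                                    # torch.Size([2, 7])
```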
3.1.2. Attention-Based Approaches
3.1.3. Patch-Based Approaches
3.1.4. Multi-Transformer-Based Approaches
3.2. Hybrid Vision Transformer Architectures
4. Experimental Results
4.1. Datasets
4.1.1. FER2013 Dataset
4.1.2. RAF-DB Dataset
4.1.3. Our Balanced FER2013 Dataset
- Contrast variation: The dataset may contain photos that are either too bright or too dark. Models that learn visual features automatically, such as CNNs, tend to perform better on high-contrast images, whereas low-contrast images convey less information and can degrade performance. This issue can be mitigated by enhancing the contrast of the facial regions in the images.
- Imbalance: A class imbalance arises when one class has many more photos than another, which can skew the model in favour of the dominant class. For instance, if there are 100 photographs of happy people and only 20 of fearful people, the model will favour the happy class. To address this issue, data augmentation techniques such as horizontal flipping, cropping, and padding can be applied to increase the amount of data available for the minority classes [35] (see the sketch after this list).
- Intra-class variation: The dataset includes a range of face types, such as real faces, animated faces, and drawings. The features of real and animated faces differ, which can make it difficult for the model to extract landmark features. To enhance model performance, only photographs of real human faces should be included in the dataset.
- Occlusion: Occlusion occurs when part of the image is hidden, for example when someone is wearing sunglasses or a mask, or when a hand covers a portion of the face, such as the right eye or nose. Occluded photos should be removed from the dataset, since the eyes and nose are crucial features for detecting and extracting emotions.
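The following is a minimal sketch of the balancing-by-augmentation idea from the imbalance item above, using the horizontal flipping, padding, and cropping operations named there. The torchvision transform parameters, the helper name balance_class, and the fear_paths/happy_paths lists are hypothetical, for illustration only; they are not the exact pipeline used to construct the balanced FER2013 dataset.

```python
# Illustrative sketch: oversample a minority emotion class with simple
# augmentations (horizontal flip, padded random crop) until it matches a
# target size. Transform choices and names are assumptions for illustration.
import random
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Pad(4),                       # pad by 4 pixels on each side ...
    transforms.RandomCrop(48),               # ... then take a random 48x48 crop
])

def balance_class(image_paths, target_count):
    """Return target_count PIL images for one emotion class, augmenting
    randomly chosen originals until the target is reached."""
    images = [Image.open(p).convert("L") for p in image_paths]   # FER2013 is grayscale
    balanced = list(images)
    while len(balanced) < target_count:
        balanced.append(augment(random.choice(images)))
    return balanced

# e.g., raise the under-represented "fear" class to the size of the largest class:
# fear_images = balance_class(fear_paths, target_count=len(happy_paths))
```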
4.2. Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mehendale, N. Facial Emotion Recognition Using Convolutional Neural Networks (FERC). SN Appl. Sci. 2020, 2, 446. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Chen, C.F.R.; Fan, Q.; Panda, R. CrossViT: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
- Xu, K.; Deng, P.; Huang, H. Vision transformer: An excellent teacher for guiding small networks in remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5618715. [Google Scholar] [CrossRef]
- Kaselimi, M.; Voulodimos, A.; Daskalopoulos, I.; Doulamis, N.; Doulamis, A. A Vision Transformer Model for Convolution-Free Multilabel Classification of Satellite Imagery in Deforestation Monitoring. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 3299–3307. [Google Scholar] [CrossRef]
- Tanzi, L.; Audisio, A.; Cirrincione, G.; Aprato, A.; Vezzetti, E. Vision Transformer for femur fracture classification. Injury 2022, 53, 2625–2634. [Google Scholar] [CrossRef]
- Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward transformer-based object detection. arXiv 2020, arXiv:2012.09958. [Google Scholar]
- Horváth, J.; Baireddy, S.; Hao, H.; Montserrat, D.M.; Delp, E.J. Manipulation detection in satellite images using vision transformer. arXiv 2021, arXiv:2105.06373. [Google Scholar]
- Wang, Y.; Ye, T.; Cao, L.; Huang, W.; Sun, F.; He, F.; Tao, D. Bridged Transformer for Vision and Point Cloud 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
- Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. LAVT: Language-Aware Vision Transformer for Referring Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Gu, J.; Kwon, H.; Wang, D.; Ye, W.; Li, M.; Chen, Y.H.; Lai, L.; Chandra, V.; Pan, D.Z. Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
- Luthra, A.; Sulakhe, H.; Mittal, T.; Iyer, A.; Yadav, S. Eformer: Edge enhancement based transformer for medical image denoising. arXiv 2021, arXiv:2109.08044. [Google Scholar]
- Fan, C.M.; Liu, T.J.; Liu, K.H. SUNet: Swin Transformer UNet for Image Denoising. arXiv 2022, arXiv:2202.14009. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Yao, C.; Jin, S.; Liu, M.; Ban, X. Dense residual Transformer for image denoising. Electronics 2022, 11, 418. [Google Scholar] [CrossRef]
- Mishra, P.; Verk, R.; Fornasier, D.; Piciarelli, C.; Foresti, G.L. VT-ADL: A vision transformer network for image anomaly detection and localization. In Proceedings of the 2021 IEEE 30th International Symposium on Industrial Electronics (ISIE), Kyoto, Japan, 20–23 June 2021. [Google Scholar]
- Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD—A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Yuan, H.; Cai, Z.; Zhou, H.; Wang, Y.; Chen, X. TransAnomaly: Video Anomaly Detection Using Video Vision Transformer. IEEE Access 2021, 9, 123977–123986. [Google Scholar] [CrossRef]
- Jamil, S.; Piran, J.; Kwon, O.J. A Comprehensive Survey of Transformers for Computer Vision. Drones 2023, 7, 287. [Google Scholar] [CrossRef]
- Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
- Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going deeper with image transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
- Lee, S.H.; Lee, S.; Song, B.C. Vision Transformer for Small-Size Datasets. arXiv 2021, arXiv:2112.13492. [Google Scholar] [CrossRef]
- Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. Deepvit: Towards deeper vision transformer. arXiv 2021, arXiv:2103.11886. [Google Scholar]
- Wang, W.; Yao, L.; Chen, L.; Cai, D.; He, X.; Liu, W. Crossformer: A versatile vision transformer based on cross-scale attention. arXiv 2021, arXiv:2108.00154. [Google Scholar]
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. arXiv 2021, arXiv:2101.11986. [Google Scholar]
- Touvron, H.; Cord, M.; El-Nouby, A.; Verbeek, J.; Jégou, H. Three things everyone should know about Vision Transformers. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar] [CrossRef]
- Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. LeViT: A vision transformer in ConvNet’s clothing for faster inference. arXiv 2021, arXiv:2104.01136. [Google Scholar]
- Xiao, T.; Singh, M.; Mintun, E.; Darrell, T.; Dollár, P.; Girshick, R. Early Convolutions Help Transformers See Better. Proc. Neural Inf. Process. Syst. (NeurIPS) 2021, 34, 30392–30400. [Google Scholar]
- Mehta, S.; Rastegari, M. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2022, arXiv:2110.02178. [Google Scholar]
- Chen, C.-F.; Panda, R.; Fan, Q. RegionViT: Regional-to-Local Attention for Vision Transformers. arXiv 2022, arXiv:2106.02689. [Google Scholar]
- Heo, B.; Yun, S.; Han, D.; Chun, S.; Choe, J.; Oh, S.J. Rethinking spatial dimensions of vision transformers. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11936–11945. [Google Scholar]
- Porcu, S.; Floris, A.; Atzori, L. Evaluation of Data Augmentation Techniques for Facial Expression Recognition Systems. Electronics 2020, 9, 1892. [Google Scholar] [CrossRef]
Model | RAF-DB (%) | FER2013 (%) | Balanced FER2013 (%) |
---|---|---|---|
Base ViT [3] | 68.34 | 49.64 | 60.25 |
CrossViT [4] | 69.74 | 50.27 | 62.43 |
CaiT [24] | 70.45 | 45.68 | 60.15 |
ViT for Small-Size Datasets [25] | 72.37 | 55.35 | 67.88 |
Deep ViT [26] | 63.25 | 43.45 | 50.37 |
Cross Former [27] | 72.47 | 59.95 | 75.12 |
Tokens-to-Token ViT [28] | 76.40 | 61.28 | 74.20 |
Parallel ViT [29] | 67.16 | 50.94 | 64.40 |
LeViT [30] | 65.71 | 47.22 | 60.85 |
Early ConViT [31] | 68.24 | 51.02 | 66.70 |
Mobile ViT [32] | 74.28 | 62.73 | 77.33 |
Region ViT [33] | 69.62 | 56.03 | 73.79 |
PiT [34] | 72.84 | 58.67 | 76.09 |