End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild
Abstract
1. Introduction
2. Related Work
2.1. Multimodal Emotion Recognition
2.2. ABAW-FER Challenge
3. Materials and Methods
3.1. Video-Based Deep Networks
3.1.1. Transfer Learning with VGGFace2-Based CNN
- VGGFace2-EE. The VGGFace2 model fine-tuned on the AffWild2 dataset;
- AffWild2-EE. AffectNet-EE, which was further fine-tuned on the AffWild2 dataset (dynamic frames). An illustrative fine-tuning sketch is given after this list.
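To make the transfer-learning setup above concrete, the following is a minimal PyTorch sketch of fine-tuning a pretrained face CNN into an emotion Embedding Extractor. The torchvision ResNet-50 backbone, the placeholder checkpoint path, the layer-freezing strategy, and the optimizer settings are assumptions for illustration only; the 512-dimensional embedding size follows the feature dimensionalities reported in Section 4.3.3 (2 × 512 = 1024 suprasegmental features).

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOTIONS = 7  # the seven AffWild2 expression categories

# Stand-in backbone; a VGGFace2-pretrained checkpoint would be loaded here.
backbone = models.resnet50(weights=None)
# backbone.load_state_dict(torch.load("vggface2_backbone.pth"))  # placeholder path

# Replace the face-identity head with a 512-d embedding layer plus an emotion head,
# so the penultimate activations can later serve as frame embeddings.
in_feats = backbone.fc.in_features            # 2048 for ResNet-50
backbone.fc = nn.Sequential(
    nn.Linear(in_feats, 512),                 # embedding layer (M = 512)
    nn.ReLU(),
    nn.Linear(512, NUM_EMOTIONS),             # emotion classification head
)

# Fine-tune only the deepest block and the new head; earlier layers stay frozen.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.Adam(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-4
)
criterion = nn.CrossEntropyLoss()             # class weighting (Section 4.3) omitted here
```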
3.1.2. Temporal Modeling
- EE + SVM system. First, embeddings for every frame were extracted by the Embedding Extractor (EE), which is one of the fine-tuned CNNs presented earlier. Next, we grouped the embeddings into sequences according to the chosen window length and applied statistical functionals, namely the mean and Standard Deviation (STD), to each sequence. This process summarizes the LLD matrix of dimensions N × M (where N denotes the number of frames in a sequence and M is the size of the embedding vector obtained from the EE for one image frame) into a suprasegmental feature vector of dimensionality 2M (since this vector contains the M mean and M STD statistics). Finally, an SVM classifier was trained on the obtained vectors to predict one emotion category for the whole window (sequence). The target emotion to train the SVM model was calculated using the mode (i.e., voting) of the emotion annotations in the sequence. This approach works best when the window size is close to the average duration of emotions (2–4 s, based on previous research). An illustrative sketch of this pipeline is given after this list (first code block);
- LSTM-based systems. These systems are based on recurrent neural networks, namely LSTM networks. As an EE subsystem, AffectNet-EE was used. For the LSTM network’s training, we also grouped the embeddings using a time window. Here, we experimented with two alternative training schemes:
  - EE + LSTM system. The extraction of the embeddings from the EE subsystem was separated from the LSTM network. Thus, the EE subsystem did not take part in the fine-tuning (training) process, and we exploited it solely for embedding extraction;
  - E2E system. The EE subsystem was combined with the LSTM network during the fine-tuning (training) process. Thus, the system was trained as a whole, making it an End-to-End (E2E) deep neural network. Moreover, training the EE subsystem as a part of the E2E system allowed it to generalize better, since it had “seen” more data, including the AffWild2 dataset. An illustrative sketch contrasting both training schemes is given after this list (second code block).
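The first sketch below illustrates the EE + SVM pipeline: it computes the mean/STD functionals over fixed windows of frame embeddings and trains a polynomial SVM. The toy data, the window of 15 frames, and the helper name `suprasegmental_features` are illustrative assumptions; only the kernel settings mirror those reported in Section 4.3.3.

```python
import numpy as np
from sklearn.svm import SVC

def suprasegmental_features(embeddings, labels, window):
    """Summarize an (N, M) frame-embedding (LLD) matrix into per-window
    [mean, STD] vectors of length 2M; each window's target label is the
    mode (majority vote) of its frame-level emotion annotations."""
    feats, targets = [], []
    for start in range(0, len(embeddings) - window + 1, window):
        chunk = embeddings[start:start + window]                     # (window, M)
        feats.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
        targets.append(np.bincount(labels[start:start + window]).argmax())
    return np.asarray(feats), np.asarray(targets)

# Toy usage: random arrays stand in for EE embeddings and frame annotations.
rng = np.random.default_rng(0)
emb = rng.normal(size=(600, 512))             # N = 600 frames, M = 512
lab = rng.integers(0, 7, size=600)            # 7 emotion classes
X, y = suprasegmental_features(emb, lab, window=15)
svm = SVC(kernel="poly", gamma=0.1, C=3).fit(X, y)   # settings from Section 4.3.3
```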
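The second sketch contrasts the two LSTM-based training schemes in PyTorch. The hidden size, the 512-d embedding dimension, the last-time-step readout, and the class names are illustrative assumptions rather than the paper’s exact architecture; the point is only that gradients stop at the precomputed embeddings in the first scheme and reach the EE in the second.

```python
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    """LSTM classifier over a window of frame embeddings: (B, T, M) -> (B, classes)."""
    def __init__(self, emb_dim=512, hidden=256, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])          # read out the last time step

# EE + LSTM scheme: embeddings are precomputed by the (frozen) EE,
# so only the LSTM receives gradient updates.
lstm_only = EmotionLSTM()
precomputed = torch.randn(8, 30, 512)          # batch of 8 windows, 30 frames each
logits = lstm_only(precomputed)

# E2E scheme: the EE is wrapped together with the LSTM, so backpropagation
# reaches the convolutional layers and the whole stack is trained jointly.
class E2ESystem(nn.Module):
    def __init__(self, embedding_extractor, emb_dim=512, num_classes=7):
        super().__init__()
        self.ee = embedding_extractor          # e.g., the fine-tuned AffectNet CNN
        self.temporal = EmotionLSTM(emb_dim, num_classes=num_classes)

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        emb = self.ee(frames.flatten(0, 1))    # (B*T, emb_dim)
        return self.temporal(emb.view(b, t, -1))
```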
3.2. Audio-Based Deep Networks
3.2.1. Audio Separation
3.2.2. Synchronization of Labels
3.2.3. 1D CNN + LSTM-Based Deep Network
3.3. Fusion Techniques
4. Experimental Setting
4.1. Datasets Used in the Study
4.2. Preprocessing of the AffWild2 Dataset
4.3. Models Setup
4.3.1. Audio Emotion Recognition Models
4.3.2. Visual Emotion Recognition Models
4.3.3. Temporal Modeling Techniques
- SVM. Means and STDs were calculated over each fixed-size window, resulting in a vector of length 1024. Next, a meta-parameter search for the SVM using the calculated suprasegmental features was carried out. We conducted extensive experiments with various kernels, including linear, polynomial, and RBF (γ optimized in [0.001, 0.1]), as well as the regularization parameter C (in [1, 25]). The best result was obtained with the following settings: kernel = polynomial, γ = 0.1, C = 3. An illustrative search sketch is given after this list (first code block);
- L-SVM. We calculated the means, STDs, and leading coefficients of first- and second-order polynomial fits over each fixed-size window, resulting in a suprasegmental feature vector of dimensionality 2048. Next, we applied cascaded normalization (z-, power-, and l2-normalization, in that order) and trained a Linear Support Vector Machine (L-SVM) on the obtained normalized vectors, as suggested in [88]. An illustrative sketch of this normalization cascade is given after this list (second code block).
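The following scikit-learn sketch illustrates the SVM meta-parameter search described above. The kernels and parameter ranges mirror the text, while the discrete grid steps, the 3-fold cross-validation split, and the UAR-style scorer (`recall_macro`) are assumptions about details the text does not spell out.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Kernels and parameter ranges follow the text; the specific grid steps are illustrative.
param_grid = [
    {"kernel": ["linear"], "C": [1, 3, 5, 10, 25]},
    {"kernel": ["poly", "rbf"], "C": [1, 3, 5, 10, 25], "gamma": [0.001, 0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, scoring="recall_macro", cv=3, n_jobs=-1)
# search.fit(X, y)   # X, y: suprasegmental features and window labels (Section 3.1.2)
# Best configuration reported above: kernel="poly", gamma=0.1, C=3.
```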
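The cascaded normalization preceding the L-SVM can be expressed as a scikit-learn pipeline, sketched below. The signed-square-root form of power normalization and the LinearSVC regularization constant are assumptions for illustration; see [88] for the original formulation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.svm import LinearSVC

class PowerNormalizer(BaseEstimator, TransformerMixin):
    """Signed square-root (power) normalization; one common instantiation,
    which may differ from the exact transform used in [88]."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.sign(X) * np.sqrt(np.abs(X))

# z-, power-, and l2-normalization applied in cascade, followed by a linear SVM.
l_svm = make_pipeline(
    StandardScaler(),       # z-normalization
    PowerNormalizer(),      # power normalization
    Normalizer(norm="l2"),  # l2-normalization
    LinearSVC(C=1.0),       # regularization constant is an assumption
)
# l_svm.fit(X_train, y_train)  # X_train: 2048-d suprasegmental features per window
```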
4.4. Performance Measure
5. Results
5.1. Visual Emotion Recognition Subsystem
5.1.1. EE + SVM Subsystem
5.1.2. EE + LSTM Subsystem
5.2. Audio Emotion Recognition Subsystem
5.3. Multimodal Emotion Recognition System
6. Analysis and Discussion
Inference Time Analysis
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Picard, R.W. Affective Computing; MIT Press: Cambridge, MA, USA, 2000. [Google Scholar]
- Gupta, P.; Rajput, N. Two-stream emotion recognition for call center monitoring. In Proceedings of the Eighth Annual Conference of the International Speech Communication Association, Antwerp, Belgium, 27–31 August 2007; Citeseer: Princeton, NJ, USA, 2007. [Google Scholar]
- Bojanić, M.; Delić, V.; Karpov, A. Call redistribution for a call center based on speech emotion recognition. Appl. Sci. 2020, 10, 4653. [Google Scholar] [CrossRef]
- Zatarain-Cabada, R.; Barrón-Estrada, M.L.; Alor-Hernández, G.; Reyes-García, C.A. Emotion recognition in intelligent tutoring systems for android-based mobile devices. In Proceedings of the Mexican International Conference on Artificial Intelligence, Tuxtla Gutierrez, Mexico, 16–22 November 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 494–504. [Google Scholar]
- Yang, D.; Alsadoon, A.; Prasad, P.C.; Singh, A.K.; Elchouemi, A. An emotion recognition model based on facial recognition in virtual learning environment. Procedia Comput. Sci. 2018, 125, 2–10. [Google Scholar] [CrossRef]
- van der Haar, D. Student Emotion Recognition Using Computer Vision as an Assistive Technology for Education. In Information Science and Applications; Springer: Berlin/Heidelberg, Germany, 2020; pp. 183–192. [Google Scholar]
- Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Collecting Large, Richly Annotated Facial-Expression Databases from Movies. IEEE Multimed. 2012, 19, 34–41. [Google Scholar] [CrossRef] [Green Version]
- Dhall, A.; Goecke, R.; Ghosh, S.; Joshi, J.; Hoey, J.; Gedeon, T. From individual to group-level emotion recognition: Emotiw 5.0. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 524–528. [Google Scholar]
- Kollias, D.; Zafeiriou, S. Aff-Wild2: Extending the Aff-Wild Database for Affect Recognition. arXiv 2018, arXiv:1811.07770. [Google Scholar]
- Kollias, D.; Nicolaou, M.A.; Kotsia, I.; Zhao, G.; Zafeiriou, S. Recognition of affect in the wild using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1972–1979. [Google Scholar]
- Kollias, D.; Zafeiriou, S. Expression, Affect, Action Unit Recognition: Aff-Wild2, Multi-Task Learning and ArcFace. arXiv 2019, arXiv:1910.04855. [Google Scholar]
- Zafeiriou, S.; Kollias, D.; Nicolaou, M.A.; Papaioannou, A.; Zhao, G.; Kotsia, I. Aff-wild: Valence and arousal ‘in-the-wild’challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1980–1987. [Google Scholar]
- Kollias, D.; Zafeiriou, S. A multi-task learning & generation framework: Valence-arousal, action units & primary expressions. arXiv 2018, arXiv:1811.07771. [Google Scholar]
- Avots, E.; Sapiński, T.; Bachmann, M.; Kamińska, D. Audiovisual emotion recognition in wild. Mach. Vis. Appl. 2019, 30, 975–985. [Google Scholar] [CrossRef] [Green Version]
- Eyben, F.; Weninger, F.; Gross, F.; Schuller, B. Recent developments in opensmile, the munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Spain, 21–25 October 2013; pp. 835–838. [Google Scholar]
- Eyben, F. Real-Time Speech and Music Classification by Large Audio Feature Space Extraction; Springer: Berlin/Heidelberg, Germany, 2015. [Google Scholar]
- Schuller, B.; Steidl, S.; Batliner, A. The INTERSPEECH 2009 emotion challenge. In Proceedings of the 10th Annual Conference of the International Speech Communication Association, Brighton, UK, 6–10 September 2009; pp. 312–315. [Google Scholar] [CrossRef]
- Schuller, B.; Steidl, S.; Batliner, A.; Hantke, S.; Hönig, F.; Orozco-Arroyave, J.R.; Nöth, E.; Zhang, Y.; Weninger, F. The INTERSPEECH 2015 computational paralinguistics challenge: Nativeness, Parkinson’s & eating condition. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 478–482. [Google Scholar] [CrossRef]
- Schuller, B.W.; Batliner, A.; Bergler, C.; Mascolo, C.; Han, J.; Lefter, I.; Kaya, H.; Amiriparian, S.; Baird, A.; Stappen, L.; et al. The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czech Republic, 30 August–3 September 2021; pp. 431–435. [Google Scholar] [CrossRef]
- Pancoast, S.; Akbacak, M. Bag-of-Audio-Words Approach for Multimedia Event Classification; Technical Report; SRI International Menlo Park United States: Menlo Park, CA, USA, 2012. [Google Scholar]
- Schmitt, M.; Ringeval, F.; Schuller, B. At the Border of Acoustics and Linguistics: Bag-of-Audio-Words for the Recognition of Emotions in Speech. In Proceedings of the 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 8–12 September 2016; pp. 495–499. [Google Scholar] [CrossRef] [Green Version]
- Kaya, H.; Karpov, A.A.; Salah, A.A. Fisher vectors with cascaded normalization for paralinguistic analysis. In Proceedings of the 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; pp. 909–913. [Google Scholar] [CrossRef]
- Kaya, H.; Karpov, A.A. Fusing Acoustic Feature Representations for Computational Paralinguistics Tasks. In Proceedings of the 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 8–12 September 2016; pp. 2046–2050. [Google Scholar] [CrossRef] [Green Version]
- Gosztolya, G. Using Fisher Vector and Bag-of-Audio-Words Representations to Identify Styrian Dialects, Sleepiness, Baby & Orca Sounds. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15–19 September 2019; pp. 2413–2417. [Google Scholar] [CrossRef] [Green Version]
- Soğancıoğlu, G.; Verkholyak, O.; Kaya, H.; Fedotov, D.; Cadée, T.; Salah, A.A.; Karpov, A. Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020; pp. 2097–2101. [Google Scholar] [CrossRef]
- Perronnin, F.; Dance, C. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8. [Google Scholar]
- Cummins, N.; Amiriparian, S.; Hagerer, G.; Batliner, A.; Steidl, S.; Schuller, B.W. An image-based deep spectrum feature representation for the recognition of emotional speech. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 478–484. [Google Scholar]
- Keesing, A.; Koh, Y.S.; Witbrock, M. Acoustic Features and Neural Representations for Categorical Emotion Recognition from Speech. In Proceedings of the 22nd Annual Conference of the International Speech Communication Association, Brno, Czech Republic, 30 August–3 September 2021; pp. 3415–3419. [Google Scholar] [CrossRef]
- Szep, J.; Hariri, S. Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020; pp. 2087–2091. [Google Scholar] [CrossRef]
- Lian, Z.; Tao, J.; Liu, B.; Huang, J.; Yang, Z.; Li, R. Context-Dependent Domain Adversarial Neural Network for Multimodal Emotion Recognition. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020; pp. 394–398. [Google Scholar] [CrossRef]
- Markitantov, M.; Dresvyanskiy, D.; Mamontov, D.; Kaya, H.; Minker, W.; Karpov, A. Ensembling End-to-End Deep Models for Computational Paralinguistics Tasks: ComParE 2020 Mask and Breathing Sub-Challenges. In Proceedings of the 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25–29 October 2020; pp. 2072–2076. [Google Scholar] [CrossRef]
- Dvoynikova, A.; Markitantov, M.; Ryumina, E.; Ryumin, D.; Karpov, A. Analytical Review of Audiovisual Systems for Determining Personal Protective Equipment on a Person’s Face. Inform. Autom. 2021, 20, 1116–1152. [Google Scholar] [CrossRef]
- Ahonen, T.; Hadid, A.; Pietikäinen, M. Face recognition with local binary patterns. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2004; pp. 469–481. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 21–23 September 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
- Slimani, K.; Kas, M.; El Merabet, Y.; Messoussi, R.; Ruichek, Y. Facial Emotion Recognition: A Comparative Analysis Using 22 LBP Variants. In Proceedings of the 2nd Mediterranean Conference on Pattern Recognition and Artificial Intelligence, Rabat, Morocco, 27–28 March 2018; MedPRAI ’18. Association for Computing Machinery: New York, NY, USA, 2018; pp. 88–94. [Google Scholar] [CrossRef]
- Julina, J.K.J.; Sharmila, T.S. Facial Emotion Recognition in Videos using HOG and LBP. In Proceedings of the 2019 4th International on Recent Trends on Electronics, Information, Communication & Technology (RTEICT), Bangalore, India, 17–18 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 56–60. [Google Scholar]
- Lakshmi, D.; Ponnusamy, R. Facial emotion recognition using modified HOG and LBP features with deep stacked autoencoders. Microprocess. Microsyst. 2021, 82, 103834. [Google Scholar] [CrossRef]
- Almaev, T.R.; Valstar, M.F. Local gabor binary patterns from three orthogonal planes for automatic facial expression recognition. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland, 2–5 September 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 356–361. [Google Scholar]
- Gürpınar, F.; Kaya, H.; Salah, A.A. Combining Deep Facial and Ambient Features for First Impression Estimation. In ECCV Workshop Proceedings; Springer: Cham, Switzerland, 2016; pp. 372–385. [Google Scholar]
- Kaya, H.; Gürpınar, F.; Salah, A.A. Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image Vis. Comput. 2017, 65, 66–75. [Google Scholar] [CrossRef]
- Hu, C.; Jiang, D.; Zou, H.; Zuo, X.; Shu, Y. Multi-task micro-expression recognition combining deep and handcrafted features. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 946–951. [Google Scholar]
- Escalante, H.J.; Kaya, H.; Salah, A.A.; Escalera, S.; Güçlütürk, Y.; Güçlü, U.; Baró, X.; Guyon, I.; Jacques, J.C.S.; Madadi, M.; et al. Modeling, Recognizing, and Explaining Apparent Personality from Videos. IEEE Trans. Affect. Comput. 2020, 1. [Google Scholar] [CrossRef]
- Fan, Y.; Lu, X.; Li, D.; Liu, Y. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; pp. 445–450. [Google Scholar]
- Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323. [Google Scholar]
- Kaya, H.; Fedotov, D.; Dresvyanskiy, D.; Doyran, M.; Mamontov, D.; Markitantov, M.; Akdag Salah, A.A.; Kavcar, E.; Karpov, A.; Salah, A.A. Predicting Depression and Emotions in the Cross-Roads of Cultures, Para-Linguistics, and Non-Linguistics. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, AVEC ’19, Nice, France, 21–25 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 27–35. [Google Scholar] [CrossRef] [Green Version]
- Yu, D.; Sun, S. A systematic exploration of deep neural networks for EDA-based emotion recognition. Information 2020, 11, 212. [Google Scholar] [CrossRef] [Green Version]
- Mou, W.; Shen, P.H.; Chu, C.Y.; Chiu, Y.C.; Yang, T.H.; Su, M.H. Speech Emotion Recognition Based on CNN+ LSTM Model. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021), Taoyuan, Taiwan, 15–16 October 2021; pp. 43–47. [Google Scholar]
- Rizos, G.; Baird, A.; Elliott, M.; Schuller, B. Stargan for Emotional Speech Conversion: Validated by Data Augmentation of End-To-End Emotion Recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3502–3506. [Google Scholar]
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J.; Schuller, B.W. Multi-task semi-supervised adversarial autoencoding for speech emotion recognition. IEEE Trans. Affect. Comput. 2020. [Google Scholar] [CrossRef] [Green Version]
- Pandit, V.; Schmitt, M.; Cummins, N.; Schuller, B. I see it in your eyes: Training the shallowest-possible CNN to recognise emotions and pain from muted web-assisted in-the-wild video-chats in real-time. Inf. Process. Manag. 2020, 57, 102347. [Google Scholar] [CrossRef]
- Kapidis, G.; Poppe, R.; Veltkamp, R.C. Multi-Dataset, Multitask Learning of Egocentric Vision Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 1. [Google Scholar] [CrossRef]
- Kollias, D.; Tzirakis, P.; Nicolaou, M.A.; Papaioannou, A.; Zhao, G.; Schuller, B.; Kotsia, I.; Zafeiriou, S. Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. Int. J. Comput. Vis. 2019, 127, 907–929. [Google Scholar] [CrossRef] [Green Version]
- Verkholyak, O.; Fedotov, D.; Kaya, H.; Zhang, Y.; Karpov, A. Hierarchical Two-level Modelling of Emotional States in Spoken Dialog Systems. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6700–6704. [Google Scholar] [CrossRef]
- Kollias, D.; Schulc, A.; Hajiyev, E.; Zafeiriou, S. Analysing Affective Behavior in the First ABAW 2020 Competition. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (FG), Buenos Aires, Argentina, 16–20 November 2020; IEEE Computer Society: Washington, DC, USA, 2020; pp. 794–800. [Google Scholar]
- Dresvyanskiy, D.; Ryumina, E.; Kaya, H.; Markitantov, M.; Karpov, A.; Minker, W. An Audio-Video Deep and Transfer Learning Framework for Multimodal Emotion Recognition in the wild. arXiv 2020, arXiv:2010.03692. [Google Scholar]
- Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80. [Google Scholar] [CrossRef]
- Kwon, O.W.; Chan, K.; Hao, J.; Lee, T.W. Emotion recognition by speech signals. In Proceedings of the Eighth European Conference on Speech Communication and Technology, Geneva, Switzerland, 1–4 September 2003. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Al Osman, H.; Falk, T.H. Multimodal affect recognition: Current approaches and challenges. In Emotion and Attention Recognition Based on Biological Signals and Images; IntechOpen: London, UK, 2017; pp. 59–86. [Google Scholar]
- Toisoul, A.; Kossaifi, J.; Bulat, A.; Tzimiropoulos, G.; Pantic, M. Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nat. Mach. Intell. 2021, 3, 42–50. [Google Scholar] [CrossRef]
- Xie, B.; Sidulova, M.; Park, C.H. Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion. Sensors 2021, 21, 4913. [Google Scholar] [CrossRef] [PubMed]
- Ranganathan, H.; Chakraborty, S.; Panchanathan, S. Multimodal emotion recognition using deep learning architectures. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–9 March 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–9. [Google Scholar]
- Liu, D.; Wang, Z.; Wang, L.; Chen, L. Multi-Modal Fusion Emotion Recognition Method of Speech Expression Based on Deep Learning. Front. Neurorobotics 2021, 15, 697634. [Google Scholar] [CrossRef] [PubMed]
- Tzirakis, P.; Trigeorgis, G.; Nicolaou, M.A.; Schuller, B.W.; Zafeiriou, S. End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 2017, 11, 1301–1309. [Google Scholar] [CrossRef] [Green Version]
- Tripathi, S.; Tripathi, S.; Beigi, H. Multi-modal emotion recognition on iemocap dataset using deep learning. arXiv 2018, arXiv:1804.05788. [Google Scholar]
- Poria, S.; Majumder, N.; Hazarika, D.; Cambria, E.; Gelbukh, A.; Hussain, A. Multimodal sentiment analysis: Addressing key issues and setting up the baselines. IEEE Intell. Syst. 2018, 33, 17–25. [Google Scholar] [CrossRef] [Green Version]
- Majumder, N.; Hazarika, D.; Gelbukh, A.; Cambria, E.; Poria, S. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 2018, 161, 124–133. [Google Scholar] [CrossRef] [Green Version]
- Tzirakis, P.; Chen, J.; Zafeiriou, S.; Schuller, B. End-to-end multimodal affect recognition in real-world environments. Inf. Fusion 2021, 68, 46–53. [Google Scholar] [CrossRef]
- Kuhnke, F.; Rumberg, L.; Ostermann, J. Two-Stream Aural-Visual Affect Analysis in the Wild. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 366–371. [Google Scholar]
- Gera, D.; Balasubramanian, S. Affect Expression Behaviour Analysis in the Wild using Spatio-Channel Attention and Complementary Context Information. arXiv 2020, arXiv:2009.14440. [Google Scholar]
- Liu, H.; Zeng, J.; Shan, S.; Chen, X. Emotion Recognition for In-the-wild Videos. arXiv 2020, arXiv:2002.05447. [Google Scholar]
- Deng, D.; Chen, Z.; Shi, B.E. Multitask Emotion Recognition with Incomplete Labels. arXiv 2020, arXiv:2002.03557. [Google Scholar]
- Do, N.T.; Nguyen-Quynh, T.T.; Kim, S.H. Affective Expression Analysis in-the-wild using Multi-Task Temporal Statistical Deep Learning Model. arXiv 2020, arXiv:2002.09120. [Google Scholar]
- Youoku, S.; Toyoda, Y.; Yamamoto, T.; Saito, J.; Kawamura, R.; Mi, X.; Murase, K. A Multi-term and Multi-task Analyzing Framework for Affective Analysis in-the-wild. arXiv 2020, arXiv:2009.13885. [Google Scholar]
- Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A dataset for recognising faces across pose and age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 67–74. [Google Scholar]
- Mollahosseini, A.; Hasani, B.; Mahoor, M.H. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef] [Green Version]
- Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.H.; et al. Challenges in representation learning: A report on three machine learning contests. In International Conference on Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2013; pp. 117–124. [Google Scholar]
- Laugs, C.; Koops, H.V.; Odijk, D.; Kaya, H.; Volk, A. The Influence of Blind Source Separation on Mixed Audio Speech and Music Emotion Recognition. In Proceedings of the Companion Publication of the 2020 International Conference on Multimodal Interaction, Utrecht, The Netherlands, 25–29 October 2020; ICMI ’20 Companion. Association for Computing Machinery: New York, NY, USA, 2020; pp. 67–71. [Google Scholar] [CrossRef]
- Hennequin, R.; Khlif, A.; Voituret, F.; Moussallam, M. Spleeter: A fast and efficient music source separation tool with pretrained models. J. Open Source Softw. 2020, 5, 2154. [Google Scholar] [CrossRef]
- Ekman, P.; Friesen, W.V. Constants across cultures in the face and emotion. J. Personal. Soc. Psychol. 1971, 17, 124. [Google Scholar] [CrossRef] [Green Version]
- Ryumina, E.; Karpov, A. Comparative analysis of methods for imbalance elimination of emotion classes in video data of facial expressions. J. Sci. Tech. J. Inf. Technol. Mech. Opt. 2020, 20, 683–691. [Google Scholar] [CrossRef]
- Mathias, M.; Benenson, R.; Pedersoli, M.; Van Gool, L. Face detection without bells and whistles. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 720–735. [Google Scholar]
- Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5203–5212. [Google Scholar]
- Ryumina, E.; Ryumin, D.; Ivanko, D.; Karpov, A. A Novel Method for Protective Face Mask Detection Using Convolutional Neural Networks and Image Histograms. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2021, XLIV-2/W1-2021, 177–182. [Google Scholar] [CrossRef]
- Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, Toulon, France, 24–26 April 2017. [Google Scholar]
- Kaya, H.; Karpov, A.A.; Salah, A.A. Robust acoustic emotion recognition based on cascaded normalization and extreme learning machines. In International Symposium on Neural Networks; Springer: Berlin/Heidelberg, Germany, 2016; pp. 115–123. [Google Scholar]
- Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894. [Google Scholar] [CrossRef]
- Gemmeke, J.F.; Ellis, D.P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio set: An ontology and human-labeled dataset for audio events. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 776–780. [Google Scholar]
- Zhang, Y.; Huang, R.; Zeng, J.; Shan, S. M3F: Multi-Modal Continuous Valence-Arousal Estimation in the Wild. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 617–621. [Google Scholar]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
- Huan, R.H.; Shu, J.; Bao, S.L.; Liang, R.H.; Chen, P.; Chi, K.K. Video multimodal emotion recognition based on Bi-GRU and attention fusion. Multimed. Tools Appl. 2021, 80, 8213–8240. [Google Scholar] [CrossRef]
- Hori, C.; Hori, T.; Lee, T.Y.; Zhang, Z.; Harsham, B.; Hershey, J.R.; Marks, T.K.; Sumi, K. Attention-based multimodal fusion for video description. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4193–4202. [Google Scholar]
- Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–16 June 2019; pp. 10502–10511. [Google Scholar]
Emotion | Training, Frames (%) | Validation, Frames (%) |
---|---|---|
Neutral | 589,215 (63.39%) | 183,636 (56.76%) |
Anger | 24,080 (2.59%) | 8002 (2.47%) |
Disgust | 12,704 (1.37%) | 5825 (1.80%) |
Fear | 11,155 (1.20%) | 9754 (3.01%) |
Happiness | 152,010 (16.35%) | 53,702 (16.60%) |
Sadness | 101,295 (10.90%) | 39,486 (12.21%) |
Surprised | 39,035 (4.20%) | 23,113 (7.14%) |
SysID | Window Length (s) | UAR (%) | CPM (%) |
---|---|---|---|
1 | 2 | 41.9 | 54.5 |
2 | 3 | 42.8 | 55.3 |
3 | 4 | 43.3 | 55.6 |
4 | 8 | 37.5 | 47.8 |
SysID | Face Detector | Embedding Extractor | Class Weighting | UAR (%) | CPM (%) |
---|---|---|---|---|---|
1 | HeadHunter | VGGFace2-EE | Logarithmic | 43.3 | 55.6 |
2 | HeadHunter | VGGFace2-EE | Balanced | 43.2 | 55.6 |
3 | RetinaFace | VGGFace2-EE | Logarithmic | 39.0 | 52.6 |
4 | HeadHunter | AffectNet-EE | Logarithmic | 42.3 | 55.3 |
5 | RetinaFace | AffectNet-EE | Logarithmic | 39.1 | 51.7 |
6 | RetinaFace | AffectNet-EE | Balanced | 38.2 | 50.9 |
7 | HeadHunter | AffWild2-EE | Logarithmic | 42.1 | 54.7
8 | RetinaFace | AffWild2-EE | Logarithmic | 42.4 | 55.3
SysID | Window Length (s) | Class Weighting | UAR (%) | CPM (%) |
---|---|---|---|---|
1 | 2 | Logarithmic | 43.3 | 55.1 |
2 | 4 | Logarithmic | 43.2 | 52.8 |
3 | 2 | Balanced | 52.8 | 49.0 |
4 | 4 | Balanced | 50.8 | 49.0 |
Window Length (s) | AffectNet-EE + LSTM, UAR (%) | AffectNet-EE + LSTM, CPM (%) | AffectNet-E2E, UAR (%) | AffectNet-E2E, CPM (%) |
---|---|---|---|---|
2 | 52.8 | 49.0 | 45.9 | 53.6 |
4 | 50.8 | 49.0 | 42.2 | 51.1 |
SysID | Modality | System | Val. CPM (%) | Test CPM (%) |
---|---|---|---|---|
1 | A | 1D CNN + LSTM | 35.09 | - |
2 | V | VGGFace2-EE | 50.23 | 40.60 |
3 | V | VGGFace2-EE + SVM | 55.66 | 42.00 |
4 | A & V | Decision Fusion of SysID 1 & 3 | 55.90 | 42.10 |
5 | V | AffWild2-EE + LSTM | 54.73 | 44.21 |
6 | V | SysID 5 & AffectNet-EE + LSTM | 57.61 | 46.34 |
7 | A & V | SysID 6 & 1D CNN + LSTM | 54.69 | 47.58 |
8 | A & V | SysID 7 & AffWild2-EE + L-SVM | 58.95 | 48.07 |
Rank | Work | CPM (%) |
---|---|---|
1 | Kuhnke et al. [70] | 50.9 |
2 | Gera and Balasubramanian [71] | 43.4 |
3 | Dresvyanskiy et al. [55] | 42.1 |
4 | Zhang et al. [91] | 40.8 |
5 | Deng et al. [73] | 40.5 |
— | This work | 48.1 |
# Development Clips | # Frames | # ECPs | [, ) | [, t) | [t, ) | [, ) |
---|---|---|---|---|---|---|
70 | 323,518 | 810 | 0.6191 | 0.4910 | 0.4835 | 0.5807 |
Operation | Preprocessing | Feature Extraction | Prediction Generation |
---|---|---|---|
Sub-processes | Video frame decimation; Face detection & cropping; Cosine similarity | AffectNet-EE (6.63 s, 0.06 SFI); AffWild2-EE (1.75 s, 0.01 SFI) | AffectNet-EE + LSTM (2.70 s, 0.02 SFI); AffWild2-EE + LSTM (2.40 s, 0.02 SFI); L-SVM (6.13 s); 1D CNN + LSTM (5.53 s, 0.05 SFI) |
Inference time | 54 s (0.45 SFI) | 7.78 s (0.07 SFI) | 16.76 s (0.14 SFI) |
Total inference time: 78.54 s (0.66 SFI)