Hybrid Multi-Attention Network for Audio–Visual Emotion Recognition Through Multimodal Feature Fusion
Abstract
1. Introduction
- We propose a novel hybrid attention mechanism that combines both unimodal and parallel cross-modal modules. This approach enables our model to efficiently learn intramodal correlations (within audio and visual modalities) and capture global information necessary for accurate emotion recognition.
- The single-modal attention (SMA) framework is introduced to explore and capture internal relationships within each modality (audio or visual) independently. By focusing on modality-specific correlations, this block extracts highly relevant features essential for identifying emotional cues within audio and visual data.
- The parallel cross-modal attention (PCMA) mechanism ensures consistency and alignment between the emotional signals conveyed by audio and visual modalities. By integrating cross-modal attention in parallel, the block facilitates the effective fusion of emotional context across both modalities, reducing misalignment and enhancing robustness.
- Unlike previous methods, this work introduces a Cross-Modality Relation Attention (CMRA) mechanism that enhances emotion recognition by aligning emotional cues, integrating complementary information from audio and visual modalities, improving robustness to noise or misalignment, and capturing the temporal dynamics of emotional expressions.
- The results of comprehensive experimental testing on the Aff-Wild2, AFEW-VA, and IEMOCAP datasets confirm that the developed audio–visual fusion model surpasses existing techniques in dimensional emotion recognition performance (a minimal structural sketch of the proposed pipeline follows this list).
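To make the pipeline summarized above concrete, the sketch below strings together modality backbones, single-modal attention, parallel cross-modal attention, and a fusion head that regresses valence and arousal, using standard PyTorch building blocks. The class name `HMATNSketch`, the feature dimensions, and the use of `nn.MultiheadAttention` as a stand-in for the SMA/PCMA blocks are illustrative assumptions, not the authors' implementation.

```python
# Minimal structural sketch of the proposed audio-visual pipeline (assumed, illustrative only).
import torch
import torch.nn as nn

class HMATNSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=4):
        super().__init__()
        # Stand-ins for the visual (I3D/R3D) and audio (ResNet-18) backbone outputs.
        self.visual_proj = nn.Linear(1024, d_model)
        self.audio_proj = nn.Linear(512, d_model)
        # Single-modal attention (SMA): self-attention within each modality.
        self.sma_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sma_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Parallel cross-modal attention (PCMA): each modality attends to the other.
        self.pcma_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pcma_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Fusion and regression head producing valence and arousal in [-1, 1].
        self.head = nn.Sequential(nn.Linear(4 * d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 2), nn.Tanh())

    def forward(self, feat_v, feat_a):
        # feat_v: (B, T, 1024) clip-level visual features; feat_a: (B, T, 512) audio features.
        xv, xa = self.visual_proj(feat_v), self.audio_proj(feat_a)
        sv, _ = self.sma_v(xv, xv, xv)      # intramodal relations (visual)
        sa, _ = self.sma_a(xa, xa, xa)      # intramodal relations (audio)
        cv, _ = self.pcma_v(xv, xa, xa)     # visual attends to audio
        ca, _ = self.pcma_a(xa, xv, xv)     # audio attends to visual
        fused = torch.cat([self.norm(xv + sv), self.norm(xa + sa),
                           self.norm(xv + cv), self.norm(xa + ca)], dim=-1)
        return self.head(fused.mean(dim=1))  # (B, 2) -> valence, arousal

va = HMATNSketch()(torch.randn(2, 16, 1024), torch.randn(2, 16, 512))
print(va.shape)  # torch.Size([2, 2])
```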
2. Related Works
2.1. Audio–Visual Emotion Recognition
2.2. Attention Models for A-V Integration
3. The Proposed Method
3.1. Visual Network
3.2. Audio Network
3.3. Cross-Attentional Fusion
3.4. Feature-Level Fusion
3.5. Overall Architecture
Modality Representation Module
- SEMantic ATtention (SEMAT) Subsystem: To ensure that the visual modality prioritizes semantic information closely linked to the audio modality, different weights are assigned to the various semantic features. The visual features and the audio features are first projected into a common feature space through linear mappings, and the transformed features are fused using a Hadamard element-wise product. A semantic attention map is generated by linearly transforming the fused features from both modalities and applying a sigmoid activation function. This semantic attention map is then applied to the visual features, enabling a focus on semantic details aligned with the audio modality; the learnable linear transformations and their weights parameterize each mapping, and the output of this block is the set of semantic-attentive visual features (a reconstruction of these steps in assumed notation follows this list).
- SPAtial ATtention (SPAAT) Subsystem: The spatial attention mechanism allows the model to concentrate on visually critical areas that are closely tied to emotions, reducing the influence of irrelevant or less important regions. As in SEMAT, the visual and audio features are linearly transformed into a common feature space and combined through a Hadamard element-wise product. A spatial attention map is derived by linearly transforming the fused features from both modalities and applying a softmax activation function. This attention map then reweights the visual features so that the model emphasizes areas carrying crucial and distinctive details, yielding the spatially attended visual features. The semantic-attentive and spatially attended visual features are summed to obtain the combined co-semantic and spatial attentive visual representation. Finally, two distinct Bidirectional LSTM (Bi-LSTM) networks, one per modality, capture the temporal dependencies in both the forward and backward directions (see the sketch after this list).
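As a compact summary of the SEMAT and SPAAT steps above, one plausible formulation is sketched below; the symbols (visual features F_v, audio features F_a, weight matrices W, the sigmoid σ, and the Hadamard product ⊙) are notation assumed here for illustration and need not match the paper's original equations.

```latex
% Assumed notation: F_v, F_a are the visual/audio features, W_* learnable weights,
% \sigma the sigmoid, and \odot the Hadamard (element-wise) product.
\begin{align}
  M_{\mathrm{sem}} &= \sigma\big(W_{s}[(W_{v}F_{v}) \odot (W_{a}F_{a})]\big), \qquad
  F_{v}^{\mathrm{sem}} = M_{\mathrm{sem}} \odot F_{v}, \\
  M_{\mathrm{spa}} &= \operatorname{softmax}\big(W_{p}[(W'_{v}F_{v}) \odot (W'_{a}F_{a})]\big), \qquad
  F_{v}^{\mathrm{spa}} = M_{\mathrm{spa}} \odot F_{v}, \\
  F_{v}^{\mathrm{css}} &= F_{v}^{\mathrm{sem}} + F_{v}^{\mathrm{spa}}, \qquad
  X_{v} = \mathrm{BiLSTM}_{v}(F_{v}^{\mathrm{css}}), \qquad
  X_{a} = \mathrm{BiLSTM}_{a}(F_{a}).
\end{align}
```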
3.6. Hybrid Attention of Single and Parallel Cross-Modal Framework
- Single-Modal Attention (SMA) Subsystem: The SMA block is designed to capture relationships between regions and frames within both the audio and visual modalities, and it is implemented with a self-attention mechanism [45]. Taking the visual modality as an example, the visual features are first linearly transformed into query, key, and value features of identical dimension. A scaled inner product followed by the softmax function then assesses the relevance of the various elements within the visual modality, and the updated visual features are obtained from the resulting attention weights. A residual connection adds the multi-head attention output back to the input features, and a layer normalization operation produces the final intramodal representations for the visual and audio streams.
- Parallel Cross-Modal Attention (PCMA) Block: Figure 3 presents a detailed visualization of the PCMA block. Motivated by the parallel attention module in [65], the PCMA block is developed to capture the correlation between audio and visual features, ensuring consistency in the emotional information conveyed by both modalities. Taking the visual features and audio features as inputs, two affinity matrices are computed, and the visual and audio attention maps are then generated from them (a sketch of these operations in assumed notation follows this list).
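In the same assumed notation, the SMA self-attention and the PCMA affinity computations described above can be sketched as follows, with X_v and X_a the temporally encoded visual and audio features, W_Q, W_K, W_V, W_va, W_av learnable projections, and d the feature dimension; this is an illustrative reconstruction rather than the paper's exact equations.

```latex
% SMA (visual branch; the audio branch swaps X_v and X_a):
\begin{align}
  Q_{v} &= W_{Q}X_{v}, \qquad K_{v} = W_{K}X_{v}, \qquad V_{v} = W_{V}X_{v}, \\
  \tilde{X}_{v} &= \operatorname{softmax}\!\left(\frac{Q_{v}K_{v}^{\top}}{\sqrt{d}}\right)V_{v}, \qquad
  \hat{X}_{v} = \operatorname{LayerNorm}\big(X_{v} + \tilde{X}_{v}\big).
\end{align}
% PCMA: affinity matrices between the modalities and the resulting attention maps.
\begin{align}
  C_{va} &= X_{v}W_{va}X_{a}^{\top}, \qquad C_{av} = X_{a}W_{av}X_{v}^{\top}, \\
  A_{v} &= \operatorname{softmax}(C_{va}), \qquad A_{a} = \operatorname{softmax}(C_{av}).
\end{align}
```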
3.7. Audio–Visual Fusion Module
4. Experimental Evaluation
4.1. Datasets
- AffWild2 Dataset [66]: Aff-Wild2 is one of the largest datasets in affective computing; it comprises 564 videos sourced from YouTube, all recorded in real-world, unconstrained settings. This extensive dataset contains approximately 2.8 million frames featuring 554 subjects (326 male and 228 female) and covers a diverse range of subjects, ages, ethnicities, and environments, making it a valuable resource for affective computing research. A team of experts annotated the videos on a frame-by-frame basis for three main behaviour tasks: valence–arousal (all 564 videos), expressions of the seven basic emotions (546 videos), and 12 action units (541 videos). In total, the dataset provides 2,816,832 frames featuring 455 unique subjects, with a demographic breakdown of 277 males and 178 females; continuous annotations for valence and arousal are provided on a scale ranging from −1 to 1. The final labels were obtained by averaging the four annotations, and the mean inter-annotator correlation is 0.63 for valence and 0.60 for arousal. All subjects appearing in each video were annotated, making Aff-Wild2 the largest audio–visual in-the-wild database annotated for valence and arousal. The dataset is divided into training, validation, and test subsets using a subject-independent partitioning, ensuring that no subject appears in more than one subset; this results in 341 videos for training, 71 for validation, and 152 for testing. Additionally, the training, validation, and test subsets include five, three, and eight videos, respectively, that feature two subjects.
- AFEW-VA dataset [67]: The AFEW-VA dataset (Acted Facial Expressions in the Wild—Valence and Arousal) is a specialized collection designed for emotion recognition and analysis, with a particular focus on facial expressions. Derived from authentic movie scenes, it offers a broad range of natural expressions captured in real-world contexts. It comprises 600 video clips, and each frame is annotated with 68 facial landmarks to provide a detailed representation of the face. Emotions are labeled along two key dimensions: valence, which describes the emotional tone from negative to positive, and arousal, which measures the intensity of the emotional response; low arousal values correspond to states of calmness or fatigue, while high values indicate heightened states such as excitement or surprise. For each video clip, we compute the mean arousal value across all of its frames and use it as the label for that clip; these arousal labels are further categorized into three groups: less than 0, between 0 and 3, and greater than 3. To assess the performance of HMATN in facial feature extraction, we randomly selected 400 clips for training and 200 for testing. The AFEW-VA dataset is particularly challenging due to the variability in conditions such as illumination, background, pose, and facial scale. The clips, ranging from 10 to 120 frames, are short but complex, often depicting diverse facial expressions in dynamic circumstances. Each frame is meticulously annotated with valence and arousal values ranging from −10 to 10, giving 21 possible intensity levels. Given its naturalistic (in-the-wild) setting, AFEW-VA presents significant challenges for emotion recognition, especially because of unpredictable environmental factors. The evaluation protocol for AFEW-VA uses Correlation Coefficients (CCs) computed under five-fold cross-validation to gauge the performance of continuous emotion recognition models. While the Aff-Wild dataset contains longer videos, AFEW-VA offers a smaller set, with fewer than 30,000 frames in total; k-fold cross-validation is therefore critical for robust model evaluation (a sketch of the label computation and evaluation metrics follows this list).
- IEMOCAP dataset [68]: The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset is a widely used multimodal corpus containing 12 h of audio–visual recordings, including audio, text transcriptions, videos, phonetic features, and facial expressions, making it a valuable resource for multimodal emotion recognition [68]. The dataset consists of interactions recorded from 10 professional actors (5 male and 5 female) engaged in scripted and spontaneous dyadic conversations. Each interaction, approximately 5 min long, is segmented into smaller utterances to facilitate emotion classification. A team of three annotators labeled each utterance, with the final annotation based on majority agreement, requiring at least two annotators to concur. The dataset covers nine emotional categories: happiness, anger, excitement, sadness, neutrality, disgust, surprise, frustration, and fear. In line with previous studies, the “happiness” class includes both “happy” and “excited” utterances due to their conceptual similarity. Additionally, the dataset provides dimensional annotations, where each utterance is rated on a 1-to-5 scale for activation and valence. The IEMOCAP corpus comprises approximately 5231 utterances labeled with their corresponding emotional categories, and the number of samples per category varies, ensuring a diverse distribution of emotional expressions. The availability of multimodal features, including audio and transcriptions, makes IEMOCAP a comprehensive benchmark for developing and evaluating emotion recognition models. For our experiments, the data are organized into 7380 samples, with 5162 allocated for training, 737 for validation, and 1481 for testing.
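As a concrete reading of the AFEW-VA protocol above, the snippet below derives a clip-level arousal label from frame annotations (mean over frames, bucketed into the three stated groups) and implements the CC metric, together with the CCC listed in the Abbreviations and commonly used for valence–arousal evaluation on Aff-Wild2. The function names are ours; the exact pipeline used in the paper may differ.

```python
# Hedged sketch of label handling and evaluation metrics (assumed helper names).
import numpy as np

def clip_arousal_label(frame_arousal):
    """Mean arousal over a clip's frames, bucketed into the three groups in the text."""
    m = float(np.mean(frame_arousal))
    return m, (0 if m < 0 else 1 if m <= 3 else 2)   # groups: <0, [0, 3], >3

def pearson_cc(pred, target):
    """Correlation Coefficient (CC) used in the AFEW-VA protocol."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.corrcoef(pred, target)[0, 1])

def concordance_cc(pred, target):
    """Concordance Correlation Coefficient (CCC), listed in the Abbreviations."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    pm, tm = pred.mean(), target.mean()
    pv, tv = pred.var(), target.var()
    cov = ((pred - pm) * (target - tm)).mean()
    return float(2 * cov / (pv + tv + (pm - tm) ** 2))

frames = np.array([2.1, 3.4, 1.8, 0.5])
print(clip_arousal_label(frames))                      # approx. (1.95, 1)
print(pearson_cc([0.1, 0.4, 0.8], [0.2, 0.5, 0.7]))    # Pearson CC
print(concordance_cc([0.1, 0.4, 0.8], [0.2, 0.5, 0.7]))  # CCC
```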
4.2. Implementation Details
4.3. Ablation Study
4.4. Performance and Comparison
4.4.1. Experiments on the AFEW-VA Dataset
4.4.2. Experiments on the Aff-Wild2 Dataset
4.4.3. Experiments on the IEMOCAP Dataset
4.4.4. Qualitative Evaluation
4.4.5. Limitations
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
ER | Emotion Recognition |
EEG | Electroencephalogram |
DL | Deep Learning |
CNN | Convolutional Neural Network |
RNN | Recurrent Neural Network |
LSTM | Long Short-Term Memory |
AVER | Audio–Visual Emotion Recognition |
3D-CNN | 3D Convolutional Neural Network |
2D-CNN | 2D Convolutional Neural Network |
1D-CNN | 1D Convolutional Neural Network |
DBN | Deep Belief Network |
SVM | Support Vector Machine |
CapsGCN | Capsule Graph Convolutional Network |
GCN | Graph Convolutional Network |
Bi-LSTM | Bidirectional Long Short-Term Memory |
FC | Fully Connected Layer |
LLD | Low-Level Descriptor |
DCNN | Deep Convolutional Neural Network |
Bi-RNN | Bidirectional Recurrent Neural Network |
CTNet | Conversational Transformer Network |
Bi-GRU | Bi-Directional Gated Recurrent Unit |
ATS-Fusion | Audio–Text–Speaker Fusion |
BERT | Bidirectional Encoder Representations from Transformers |
cLSTM | Contextual Long Short-Term Memory |
MMA | Multimodal Attention |
HASPCM | Hybrid Attention of Single and Parallel Cross-Modal |
SMA | Single-Modal Attention |
PCMA | Parallel Cross-Modal Attention |
CMRA | Cross-Modality Relation Attention |
SVR | Support Vector Regressor |
FER | Facial Emotion Recognition |
PCCE | Polarity-Consistent Cross Entropy |
CMFN | Cross-Modality attention Fusion Network |
FEN | Feature Extraction Networks |
ConvLSTM | Convolutional LSTM network |
MFCC | Mel-Frequency Cepstral Coefficients |
SEMAT | SEMantic ATtention |
SPAAT | SPAtial ATtention |
CSSA | Contextual Semantic and Spatial Attention |
SGD | Stochastic Gradient Descent |
FAN | Face Alignment Network |
SIFT | Scale-Invariant Feature Transform |
LBP | Local Binary Pattern |
BoW | Bag of Words |
CRF | Conditional Random Field |
CCC | Concordance Correlation Coefficient |
IMAN | Interactive Multimodal Attention Network |
References
- Moorthy, S.; KS, S.S.; Arthanari, S.; Jeong, J.H.; Joo, Y.H. Hybrid multi-attention transformer for robust video object detection. Eng. Appl. Artif. Intell. 2025, 139, 109606. [Google Scholar]
- Moorthy, S.; Joo, Y.H. Learning dynamic spatial-temporal regularized correlation filter tracking with response deviation suppression via multi-feature fusion. Neural Netw. 2023, 167, 360–379. [Google Scholar]
- Ali, Y.; Khan, H.U.; Khan, F.; Moon, Y.K. Building integrated assessment model for IoT technology deployment in the Industry 4.0. J. Cloud Comput. 2024, 13, 155. [Google Scholar]
- Moorthy, S.; Joo, Y.H. Formation control and tracking of mobile robots using distributed estimators and a biologically inspired approach. J. Electr. Eng. Technol. 2023, 18, 2231–2244. [Google Scholar]
- Chhimpa, G.R.; Kumar, A.; Garhwal, S.; Khan, F.; Moon, Y.K. Revolutionizing Gaze-based Human-Computer Interaction using Iris Tracking: A Webcam-Based Low-Cost Approach with Calibration, Regression and Real-Time Re-calibration. IEEE Access 2024, 12, 168256–168269. [Google Scholar]
- Iqbal, H.; Khan, A.; Nepal, N.; Khan, F.; Moon, Y.K. Deep Learning Approaches for Chest Radiograph Interpretation: A Systematic Review. Electronics 2024, 13, 4688. [Google Scholar] [CrossRef]
- Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar]
- Matsumoto, D. More evidence for the universality of a contempt expression. Motiv. Emot. 1992, 16, 363–368. [Google Scholar]
- Schlosberg, H. Three dimensions of emotion. Psychol. Rev. 1954, 61, 81. [Google Scholar]
- Potamianos, G.; Neti, C.; Gravier, G.; Garg, A.; Senior, A.W. Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 2003, 91, 1306–1326. [Google Scholar]
- D’mello, S.K.; Kory, J. A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. (CSUR) 2015, 47, 1–36. [Google Scholar]
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [PubMed]
- Tzirakis, P.; Trigeorgis, G.; Nicolaou, M.A.; Schuller, B.W.; Zafeiriou, S. End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 2017, 11, 1301–1309. [Google Scholar]
- Kaya, H.; Gürpınar, F.; Salah, A.A. Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image Vis. Comput. 2017, 65, 66–75. [Google Scholar]
- Glodek, M.; Tschechne, S.; Layher, G.; Schels, M.; Brosch, T.; Scherer, S.; Kächele, M.; Schmidt, M.; Neumann, H.; Palm, G.; et al. Multiple classifier systems for the classification of audio-visual emotional states. In Proceedings of the Affective Computing and Intelligent Interaction: Fourth International Conference, ACII 2011, Memphis, TN, USA, 9–12 October 2011; Part II. pp. 359–368. [Google Scholar]
- Wu, Z.; Cai, L.; Meng, H. Multi-level fusion of audio and visual features for speaker identification. In Proceedings of the Advances in Biometrics: International Conference, ICB 2006, Hong Kong, China, 5–7 January 2006; pp. 493–499. [Google Scholar]
- Arthanari, S.; Moorthy, S.; Jeong, J.H.; Joo, Y.H. Adaptive spatially regularized target attribute-aware background suppressed deep correlation filter for object tracking. Signal Process. Image Commun. 2025, 136, 117305. [Google Scholar]
- Kuppusami Sakthivel, S.S.; Moorthy, S.; Arthanari, S.; Jeong, J.H.; Joo, Y.H. Learning a context-aware environmental residual correlation filter via deep convolution features for visual object tracking. Mathematics 2024, 12, 2279. [Google Scholar] [CrossRef]
- Schoneveld, L.; Othmani, A.; Abdelkawy, H. Leveraging recent advances in deep learning for audio-visual emotion recognition. Pattern Recognit. Lett. 2021, 146, 1–7. [Google Scholar]
- Wang, L.; Wang, S.; Qi, J.; Suzuki, K. A multi-task mean teacher for semi-supervised facial affective behavior analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3603–3608. [Google Scholar]
- Tzirakis, P.; Chen, J.; Zafeiriou, S.; Schuller, B. End-to-end multimodal affect recognition in real-world environments. Inf. Fusion 2021, 68, 46–53. [Google Scholar]
- Zhang, S.; Zhang, S.; Huang, T.; Gao, W.; Tian, Q. Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 3030–3043. [Google Scholar]
- Liu, J.; Chen, S.; Wang, L.; Liu, Z.; Fu, Y.; Guo, L.; Dang, J. Multimodal emotion recognition with capsule graph convolutional based representation fusion. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6339–6343. [Google Scholar]
- Huang, J.; Tao, J.; Liu, B.; Lian, Z.; Niu, M. Multimodal transformer fusion for continuous emotion recognition. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3507–3511. [Google Scholar]
- Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl.-Based Syst. 2022, 244, 108580. [Google Scholar]
- Sharafi, M.; Yazdchi, M.; Rasti, R.; Nasimi, F. A novel spatio-temporal convolutional neural framework for multimodal emotion recognition. Biomed. Signal Process. Control 2022, 78, 103970. [Google Scholar]
- Hazarika, D.; Gorantla, S.; Poria, S.; Zimmermann, R. Self-attentive feature-level fusion for multimodal emotion detection. In Proceedings of the 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, USA, 10–12 April 2018; pp. 196–201. [Google Scholar]
- Priyasad, D.; Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Attention driven fusion for multi-modal emotion recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3227–3231. [Google Scholar]
- Lian, Z.; Liu, B.; Tao, J. CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 985–1000. [Google Scholar]
- Zhang, T.; Li, S.; Chen, B.; Yuan, H.; Chen, C.P. Aia-net: Adaptive interactive attention network for text–audio emotion recognition. IEEE Trans. Cybern. 2022, 53, 7659–7671. [Google Scholar]
- Fu, Y.; Okada, S.; Wang, L.; Guo, L.; Song, Y.; Liu, J.; Dang, J. Context- and knowledge-aware graph convolutional network for multimodal emotion recognition. IEEE Multimed. 2022, 29, 91–100. [Google Scholar]
- Poria, S.; Chaturvedi, I.; Cambria, E.; Hussain, A. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 439–448. [Google Scholar]
- Pan, Z.; Luo, Z.; Yang, J.; Li, H. Multi-modal attention for speech emotion recognition. arXiv 2020, arXiv:2009.04107. [Google Scholar]
- Ren, M.; Huang, X.; Shi, X.; Nie, W. Interactive multimodal attention network for emotion recognition in conversation. IEEE Signal Process. Lett. 2021, 28, 1046–1050. [Google Scholar]
- Zheng, J.; Zhang, S.; Wang, Z.; Wang, X.; Zeng, Z. Multi-channel weight-sharing autoencoder based on cascade multi-head attention for multimodal emotion recognition. IEEE Trans. Multimed. 2022, 25, 2213–2225. [Google Scholar]
- Zhao, J.; Li, R.; Jin, Q.; Wang, X.; Li, H. Memobert: Pre-training model with prompt-based learning for multimodal emotion recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4703–4707. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Ortega, J.D.; Cardinal, P.; Koerich, A.L. Emotion recognition using fusion of audio and video features. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3847–3852. [Google Scholar]
- Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A. Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. arXiv 2019, arXiv:1908.11540. [Google Scholar]
- Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. DialogueRNN: An attentive RNN for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6818–6825. [Google Scholar]
- Kuhnke, F.; Rumberg, L.; Ostermann, J. Two-stream aural-visual affect analysis in the wild. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 600–605. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
- Atmaja, B.T.; Akagi, M. Multitask learning and multistage fusion for dimensional audiovisual emotion recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4482–4486. [Google Scholar]
- Ni, R.; Yang, B.; Zhou, X.; Song, S.; Liu, X. Diverse local facial behaviors learning from enhanced expression flow for microexpression recognition. Knowl.-Based Syst. 2023, 275, 110729. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
- Zhao, S.; Ma, Y.; Gu, Y.; Yang, J.; Xing, T.; Xu, P.; Hu, R.; Chai, H.; Keutzer, K. An end-to-end visual-audio attention network for emotion recognition in user-generated videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 303–311. [Google Scholar]
- Kumar, A.; Vepa, J. Gated mechanism for attention based multi modal sentiment analysis. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4477–4481. [Google Scholar]
- Ni, R.; Yang, B.; Zhou, X.; Cangelosi, A.; Liu, X. Facial expression recognition through cross-modality attention fusion. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 175–185. [Google Scholar]
- Ghaleb, E.; Niehues, J.; Asteriadis, S. Multimodal attention-mechanism for temporal emotion recognition. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 251–255. [Google Scholar]
- Lee, J.; Kim, S.; Kim, S.; Sohn, K. Audio-visual attention networks for emotion recognition. In Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Seoul, Republic of Korea, 26 October 2018; pp. 27–32. [Google Scholar]
- Zhang, S.; Ding, Y.; Wei, Z.; Guan, C. Continuous emotion recognition with audio-visual leader-follower attentive fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3567–3574. [Google Scholar]
- Kim, D.H.; Baddar, W.J.; Jang, J.; Ro, Y.M. Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition. IEEE Trans. Affect. Comput. 2017, 10, 223–236. [Google Scholar]
- Wöllmer, M.; Kaiser, M.; Eyben, F.; Schuller, B.; Rigoll, G. LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis. Comput. 2013, 31, 153–163. [Google Scholar] [CrossRef]
- Nicolaou, M.A.; Gunes, H.; Pantic, M. Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans. Affect. Comput. 2011, 2, 92–105. [Google Scholar] [CrossRef]
- Rajasekhar, G.P.; Granger, E.; Cardinal, P. Deep domain adaptation with ordinal regression for pain assessment using weakly-labeled videos. Image Vis. Comput. 2021, 110, 104167. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Huang, L.; Wang, L.; Li, H. Foreground-action consistency network for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 8002–8011. [Google Scholar]
- Long, C.; Basharat, A.; Hoogs, A.; Singh, P.; Farid, H. A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Forged Videos. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 1–10. [Google Scholar]
- Sethu, V.; Epps, J.; Ambikairajah, E. Speech based emotion recognition. In Speech and Audio Processing for Coding, Enhancement and Recognition; Springer: New York, NY, USA, 2014; pp. 197–228. [Google Scholar]
- Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 1089–1093. [Google Scholar]
- Albanie, S.; Nagrani, A.; Vedaldi, A.; Zisserman, A. Emotion recognition in speech using cross-modal transfer in the wild. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 292–301. [Google Scholar]
- Slimi, A.; Hamroun, M.; Zrigui, M.; Nicolas, H. Emotion recognition from speech using spectrograms and shallow neural networks. In Proceedings of the 18th International Conference on Advances in Mobile Computing & Multimedia, Chiang Mai, Thailand, 30 November–2 December 2020; pp. 35–39. [Google Scholar]
- Praveen, R.G.; de Melo, W.C.; Ullah, N.; Aslam, H.; Zeeshan, O.; Denorme, T.; Pedersoli, M.; Koerich, A.; Bacon, S.; Cardinal, P.; et al. A joint cross-attention model for audio-visual fusion in dimensional emotion recognition. arXiv 2022, arXiv:2203.14779. [Google Scholar]
- Praveen, R.G.; Granger, E.; Cardinal, P. Deep weakly supervised domain adaptation for pain localization in videos. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 18–22 May 2020; pp. 473–480. [Google Scholar]
- Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical question-image co-attention for visual question answering. Adv. Neural Inf. Process. Syst. 2016, 29, 289–297. [Google Scholar]
- Kollias, D.; Tzirakis, P.; Nicolaou, M.A.; Papaioannou, A.; Zhao, G.; Schuller, B.; Kotsia, I.; Zafeiriou, S. Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. Int. J. Comput. Vis. 2019, 127, 907–929. [Google Scholar] [CrossRef]
- Kossaifi, J.; Tzimiropoulos, G.; Todorovic, S.; Pantic, M. AFEW-VA database for valence and arousal estimation in-the-wild. Image Vis. Comput. 2017, 65, 23–36. [Google Scholar] [CrossRef]
- Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
- Kollias, D.; Zafeiriou, S. Analysing affective behavior in the second abaw2 competition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3652–3660. [Google Scholar]
- Deng, D.; Chen, Z.; Shi, B.E. Multitask emotion recognition with incomplete labels. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 592–599. [Google Scholar]
- Toisoul, A.; Kossaifi, J.; Bulat, A.; Tzimiropoulos, G.; Pantic, M. Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nat. Mach. Intell. 2021, 3, 42–50. [Google Scholar] [CrossRef]
- Kossaifi, J.; Toisoul, A.; Bulat, A.; Panagakis, Y.; Hospedales, T.M.; Pantic, M. Factorized higher-order cnns with an application to spatio-temporal emotion estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6060–6069. [Google Scholar]
- Kim, D.; Song, B.C. Contrastive adversarial learning for person independent facial emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 5948–5956. [Google Scholar]
- Mitenkova, A.; Kossaifi, J.; Panagakis, Y.; Pantic, M. Valence and arousal estimation in-the-wild with tensor methods. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–7. [Google Scholar]
- Tellamekala, M.K.; Valstar, M. Temporally coherent visual representations for dimensional affect recognition. In Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, UK, 3–6 September 2019; pp. 1–7. [Google Scholar]
- Handrich, S.; Dinges, L.; Al-Hamadi, A.; Werner, P.; Saxen, F.; Al Aghbari, Z. Simultaneous prediction of valence/arousal and emotion categories and its application in an HRC scenario. J. Ambient Intell. Humaniz. Comput. 2021, 12, 57–73. [Google Scholar]
- Aspandi, D.; Sukno, F.; Schuller, B.; Binefa, X. An enhanced adversarial network with combined latent features for spatio-temporal facial affect estimation in the wild. arXiv 2021, arXiv:2102.09150. [Google Scholar]
- Pei, E.; Hu, Z.; He, L.; Ning, H.; Berenguer, A.D. An ensemble learning-enhanced multitask learning method for continuous affect recognition from facial images. Expert Syst. Appl. 2024, 236, 121290. [Google Scholar]
- Kollias, D. Abaw: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 18–24 June 2022; pp. 2328–2336. [Google Scholar]
- Zhang, W.; Qiu, F.; Wang, S.; Zeng, H.; Zhang, Z.; An, R.; Ma, B.; Ding, Y. Transformer-based multimodal information fusion for facial expression analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2428–2437. [Google Scholar]
- Karas, V.; Tellamekala, M.K.; Mallol-Ragolta, A.; Valstar, M.; Schuller, B.W. Continuous-time audiovisual fusion with recurrence vs. attention for in-the-wild affect recognition. arXiv 2022, arXiv:2203.13285. [Google Scholar]
- Savchenko, A.V. Frame-level prediction of facial expressions, valence, arousal and action units for mobile devices. arXiv 2022, arXiv:2203.13436. [Google Scholar]
- Nguyen, H.H.; Huynh, V.T.; Kim, S.H. An ensemble approach for facial expression analysis in video. arXiv 2022, arXiv:2203.12891. [Google Scholar]
- Zhang, S.; An, R.; Ding, Y.; Guan, C. Continuous emotion recognition using visual-audio-linguistic information: A technical report for abaw3. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 18–24 June 2022; pp. 2376–2381. [Google Scholar]
- Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar]
- Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
- Zhang, D.; Wu, L.; Sun, C.; Li, S.; Zhu, Q.; Zhou, G. Modeling both Context- and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 5415–5421. [Google Scholar]
- Mao, Y.; Sun, Q.; Liu, G.; Wang, X.; Gao, W.; Li, X.; Shen, J. Dialoguetrm: Exploring the intra- and inter-modal emotional behaviors in the conversation. arXiv 2020, arXiv:2010.07637. [Google Scholar]
- Li, J.; Wang, X.; Lv, G.; Zeng, Z. GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation. Neurocomputing 2023, 550, 126427. [Google Scholar]
Category | Method | Contributions, Innovations, and Limitations |
---|---|---|
Audio–Visual | Zhang et al. [22] | Hybrid deep learning using a CNN and 3D-CNN for feature extraction, a DBN for fusion, and an SVM for classification. No end-to-end training; the large number of parameters leads to high computational cost; developed for discrete emotions only. |
Audio–Visual | Liu et al. [23] | CapsGCN-based emotion recognition with 2D-CNNs, capsule networks, and a GCN for relational learning. Ignores the complementary information between different modalities. |
Audio–Visual | Huang et al. [24] | Transformer-based emotion recognition with eGeMAPS features for speech, geometric facial features, and multi-head attention. |
Audio–Visual | Middya et al. [25] | CNN-based feature extractors with concatenation fusion, FC layers, and Softmax for classification. Spectral audio features did not perform well when there was a class-distribution mismatch among datasets. |
Audio–Visual | Sharafi et al. [26] | Spatiotemporal CNN with Bi-LSTM for feature extraction, followed by FC layers and Softmax. The model cannot learn image features correctly when a pre-trained model is used. |
Audio–Text | Hazarika et al. [27] | Feature-level fusion using self-attention, LLDs for speech, CNNs for text, and FC with Softmax. Performance decreases as noise increases in either modality. |
Audio–Text | Priyasad et al. [28] | Deep learning approach with SincNet and a DCNN for audio, a Bi-RNN for text, and cross-attention for fusion. SincNet constrains only the first layer, focusing on low-frequency components and failing to capture formants. |
Audio–Text | Lian et al. [29] | Conversational transformer model with Bi-GRU speaker embeddings and ATS-Fusion for audio–text integration. Global utterance-sequence modeling overlooks the complex emotional interactions between multimodal utterances. |
Audio–Text | Zhang et al. [30] | AIA-Net with interactive attention, RoBERTa for text, and Wav-RoBERTa for speech. |
Audio–Text | Fu et al. [31] | Context- and knowledge-aware GCNs using CNN-BiLSTM for audio, BERT for text, and graph-based fusion. |
Audio–Visual–Text | Poria et al. [32] | Temporal CNN for visual–text feature extraction, combining image pairs for sequence sensitivity. The pooling operations in the CNN result in a loss of overall semantic dependency. |
Audio–Visual–Text | Pan et al. [33] | cLSTM-MMA multimodal attention mechanism with selective fusion across the three modalities. |
Audio–Visual–Text | Ren et al. [34] | IMAN uses cross-modal attention fusion and conversational modeling for speaker dependency, defining three gated recurrent units (GRUs) to capture context information. |
Audio–Visual–Text | Zheng et al. [35] | Multi-channel weight-sharing autoencoder to handle affective heterogeneity in multimodal emotion recognition. Ignores the interaction between modalities. |
Audio–Visual–Text | Zhao et al. [36] | MEmoBERT, a multimodal pre-training model that employs self-supervised learning with a prompt-based approach, combining the strengths of BERT with multimodal inputs to achieve state-of-the-art performance. |
Parameter | Value | Parameter | Value |
---|---|---|---|
Model for video | Inception I3D and R3D | Weight decay | 0.0005 |
Model for audio | ResNet-18 | Dropout | 0.8 |
Learning rate for video | 0.0001 | Optimizer | SGD |
Learning rate for audio | 0.001 | Activation | ReLU |
Learning rate decay start | 15 | Epochs | 50 |
Learning rate decay every | 5 | Batch size | 64 |
Learning rate decay rate | 0.9 | Momentum | 0.9 |
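A minimal PyTorch rendering of the training configuration in the table above is given below. The placeholder `nn.Linear` model, the dummy batch, and the reading of the decay schedule (multiply the learning rate by 0.9 every 5 epochs once epoch 15 is reached) are assumptions for illustration, not the authors' training script.

```python
# Sketch of the SGD training setup in the hyperparameter table (assumed interpretation).
import torch
from torch import nn, optim

model = nn.Linear(512, 2)                    # placeholder for the audio/visual networks
optimizer = optim.SGD(model.parameters(), lr=1e-3,   # the table lists 1e-4 for the video branch
                      momentum=0.9, weight_decay=5e-4)

def decay_factor(epoch, start=15, every=5, rate=0.9):
    # Multiply the base lr by `rate` once every `every` epochs after `start`.
    return 1.0 if epoch < start else rate ** ((epoch - start) // every + 1)

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=decay_factor)

for epoch in range(50):                      # 50 epochs; batch size 64 per the table
    x, y = torch.randn(64, 512), torch.randn(64, 2)   # dummy batch standing in for real data
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```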
Model | Parameter Scale (Million) |
---|---|
I3D (RGB) | 12 M |
R3D-18 (ResNet-3D, 18 layers) | 33 M |
2D CNN + LSTM (ResNet-18 + 2-layer LSTM) | 13.3 M |
ResNet-18 (2D CNN) | 11.7 M |
Method | Valence | Arousal |
---|---|---|
w/o CSSA | 0.421 | 0.343 |
w/o SEMAT | 0.451 | 0.367 |
w/o SPAAT | 0.447 | 0.359 |
Full Model [Ours] | 0.457 | 0.375 |
Method | Valence | Arousal |
---|---|---|
w/o HASPCM | 0.432 | 0.348 |
w/o SMA | 0.441 | 0.352 |
w/o PCMA | 0.449 | 0.361 |
Full Model [Ours] | 0.457 | 0.375 |
IEMOCAP | ACC | WA-F1 |
---|---|---|
HMATN | 75.39 | 78.56 |
Modality | ||
Visual | 67.52 | 68.95 |
Audio | 61.77 | 62.34 |
Feature concatenation | 70.12 | 70.65 |
Cross-attention | 72.75 | 72.52 |
Method | [71] | [72] | [73] | [74] | [75] | [76] | [77] | [78] | Our Method |
---|---|---|---|---|---|---|---|---|---|
Valence | 0.69 | 0.55 | 0.59 | 0.270 | 0.475 | 0.39 | 0.377 | 0.502 | 0.654 |
Arousal | 0.66 | 0.53 | 0.54 | 0.333 | 0.306 | 0.29 | 0.467 | 0.581 | 0.617 |
Average | 0.675 | 0.540 | 0.56 | 0.302 | 0.391 | 0.34 | 0.497 | 0.541 | 0.635 |
Method | [79] | [80] | [63] | [81] | [82] | [83] | [84] | [41] | Our Method |
---|---|---|---|---|---|---|---|---|---|
Valence | 0.180 | 0.300 | 0.374 | 0.418 | 0.417 | 0.450 | 0.520 | 0.448 | 0.457 |
Arousal | 0.170 | 0.244 | 0.363 | 0.406 | 0.453 | 0.448 | 0.601 | 0.417 | 0.375 |
Average | 0.175 | 0.272 | 0.369 | 0.412 | 0.435 | 0.449 | 0.560 | 0.432 | 0.416 |
Validation Set | Valence | Arousal | Mean |
---|---|---|---|
Fold 0 | 0.455 | 0.652 | 0.553 |
Fold 1 | 0.596 | 0.683 | 0.640 |
Fold 2 | 0.475 | 0.639 | 0.557 |
Fold 3 | 0.544 | 0.658 | 0.601 |
Fold 4 | 0.438 | 0.638 | 0.538 |
Fold 5 | 0.469 | 0.623 | 0.546 |
Baseline | Happy | Sad | Neutral | Angry | Excited | Frustrated |
---|---|---|---|---|---|---|
| Acc/F1 | Acc/F1 | Acc/F1 | Acc/F1 | Acc/F1 | Acc/F1 |
BC-LSTM-Att | 30.26/33.59 | 58.24/61.41 | 53.12/52.30 | 56.03/57.45 | 52.14/56.58 | 66.52/59.17 |
DialogueRNN | 29.13/34.51 | 74.11/77.55 | 59.16/60.14 | 63.72/66.13 | 81.57/74.15 | 65.12/62.55 |
ConGCN | 42.43/45.11 | 84.10/86.45 | 62.90/64.14 | 68.54/65.44 | 66.15/64.18 | 67.40/60.45 |
DialogueTRM | 48.87/53.16 | 79.11/80.14 | 65.02/65.56 | 73.18/68.11 | 80.15/79.93 | 53.17/53.17 |
AMER-Net | 55.14/56.21 | 82.17/78.94 | 58.93/64.14 | 70.17/71.01 | 80.15/75.01 | 69.16/69.11 |
GraphMFT | 53.94/56.11 | 82.17/78.84 | 61.03/63.74 | 70.70/71.05 | 77.87/75.08 | 69.56/69.01 |
HMATN | 58.73/67.87 | 84.96/87.14 | 61.82/62.45 | 74.14/73.51 | 71.82/75.57 | 70.84/70.33 |
Method | Accuracy | WA (F1) |
---|---|---|
TextCNN [85] | 48.47 | 48.05 |
BC-LSTM-Att [86] | 56.32 | 56.11 |
DialogueRNN [40] | 63.50 | 62.70 |
ConGCN [87] | 64.19 | 64.10 |
DialogueTRM [88] | 68.72 | 69.13 |
DialogueGCN [39] | 65.91 | 65.62 |
GraphMFT [89] | 69.76 | 70.05 |
HMATN [Ours] | 75.39 | 78.56 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).