Article

Hybrid Multi-Attention Network for Audio–Visual Emotion Recognition Through Multimodal Feature Fusion

by Sathishkumar Moorthy and Yeon-Kug Moon *
Department of Artificial Intelligence Data Science, Sejong University, Seoul 05006, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(7), 1100; https://doi.org/10.3390/math13071100
Submission received: 21 February 2025 / Revised: 17 March 2025 / Accepted: 24 March 2025 / Published: 27 March 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

Multimodal emotion recognition leverages complementary relationships across modalities to enhance the assessment of human emotions. Networks that integrate diverse information sources outperform single-modal approaches and offer greater robustness against noisy or missing data. Current emotion recognition approaches often rely on cross-modal attention mechanisms over audio and visual modalities and implicitly assume that the two modalities are complementary. In practice, however, non-complementary relationships frequently arise in real-world data, reducing the effectiveness of feature integration schemes that assume consistent complementarity. While audio–visual co-learning provides a broader understanding of contextual information for practical implementation, discrepancies between audio and visual data, such as semantic inconsistencies, pose challenges and can lead to inaccurate predictions. As a result, existing methods are limited in modeling intramodal and cross-modal interactions. To address these problems, we propose a multimodal learning framework for emotion recognition, called the Hybrid Multi-ATtention Network (HMATN). Specifically, we introduce a collaborative cross-attentional paradigm for audio–visual amalgamation, intended to effectively capture salient features across modalities while preserving both intermodal and intramodal relationships. The model calculates cross-attention weights by analyzing the relationship between the combined feature representation and the individual modalities. In addition, the network employs the Hybrid Attention of Single and Parallel Cross-Modal (HASPCM) mechanism, comprising a single-modal attention component and a parallel cross-modal attention component, which exploits complementary and hidden multimodal information to enrich the learned feature representations. Finally, the efficiency of the proposed method is demonstrated through experiments on complex videos from the AffWild2 and AFEW-VA datasets. The results show that the developed attentional audio–visual fusion model offers a cost-efficient solution that surpasses state-of-the-art techniques, even when the input data are noisy or modalities are missing.

1. Introduction

Emotions arise from personal cognitive experiences that reflect multifaceted psychological and physical conditions. Emotions significantly shape human choices, perceptions, social interactions, and intellectual processes, and they are a pivotal aspect of everyday life. Understanding emotions is vital to comprehending human actions and has become an increasingly important area of study in fields such as affective computing, artificial intelligence, and robotics [1,2,3,4]. Emotion recognition is crucial for various applications, including human–computer interaction, emotion regulation, and the diagnosis of emotion-related disorders. The ability to accurately detect and interpret emotions promises to enhance communication, improve user experience, and advance the understanding of human behaviour. Advances in data collection technologies have facilitated the acquisition of large multimodal datasets that have dramatically improved emotion recognition accuracy [5,6]. In this study, we selected three widely recognized corpora: IEMOCAP, Affwild2, and AFEW-VA. Our selection was based on key criteria, including modality diversity (audio, video, and text), dataset size and speaker variability, annotation quality, and suitability for benchmarking. These improvements have fueled interest in multimodal learning, which utilizes diverse data sources for better outcomes. EEG signals and eye movement data are particularly effective input data for emotion recognition, as they reflect internal physiological states and external subconscious behaviours with strong interpretability. Recent multimodal fusion methods that combine these signals have shown promising results, advancing the field significantly.
Emotion recognition (ER) is a complex and multifaceted challenge due to the wide diversity of emotional expressions between individuals, cultures, and contexts. Human emotions are deeply influenced by both biological and cultural factors, leading to a wide range of expressions that vary significantly between populations. Nevertheless, the foundational work of Paul Ekman on human emotions established that six basic emotions are universally recognized across cultures [7]. In subsequent studies, contempt was added as the seventh basic emotion, expanding the model further [8]. The categorical emotion model, which classifies emotions into distinct categories, has been widely studied in affective computing due to its simplicity and broad applicability. Meanwhile, as emotion recognition systems transition from controlled laboratory settings to real-world environments, the focus has shifted towards more dynamic and subtle emotional states that emerge naturally in everyday life. This shift reflects the increasing demand for emotion recognition systems that can process more complex, continuous emotional experiences, extending beyond discrete categories to encompass a broader spectrum of emotions.
To address this, dimensional models of emotion have become increasingly relevant, where emotions are not classified into distinct categories but are represented along continuous dimensions. The most widely used dimensional model is the valence–arousal space, which maps emotions onto two key axes: valence (pleasantness) and arousal (intensity) [9]. Valence captures the spectrum of emotional experience from negative (e.g., sadness) to positive (e.g., happiness), while arousal measures the level of intensity, ranging from low-arousal states (e.g., calmness and sleepiness) to high-arousal states (e.g., excitement and agitation). Therefore, recognizing emotions in this continuous dimensional space is crucial for many practical applications. In healthcare, emotion recognition has recently been receiving growing attention. Moreover, some physiological data, including heart rate, Electromyogram (EMG), Electrocardiogram (ECG), and Skin Conductance Level (SCL), reflect autonomic nervous system activity, which varies across different emotion categories. Affective computing is transforming marketing by enabling a deeper understanding of consumer behavior through emotion recognition technologies that analyze facial expressions, eye movements, and voice tones. These data allow for the optimization of marketing campaigns to better engage target audiences. For example, Realeyes employs such technology to enhance video advertisements by evaluating viewers’ emotional responses. In the automotive industry, affective computing plays a crucial role in improving driver safety and comfort. Emotion recognition systems continuously monitor drivers’ states, detecting signs of drowsiness or stress and issuing alerts to prevent accidents. Additionally, in-car assistants, such as Affectiva’s Automotive AI, dynamically adjust the vehicle environment to enhance the driver’s mood and overall experience, promoting both safety and comfort. Given the increasing need for accurate continuous emotion recognition in dynamic, real-world scenarios, this paper specifically focuses on dimensional emotion recognition within the valence–arousal framework, aiming to advance the capabilities of emotion recognition technologies for complex, real-world applications.
The fascination with multimodal fusion arises from three primary advantages. First, combining multiple modalities that analyze the same event enhances prediction accuracy and reliability, a principle widely embraced by the audio–visual speech recognition field [10]. Second, utilizing various modalities allows the integration of complementary information, uncovering insights that a single modality might overlook. Third, multimodal systems demonstrate adaptability by maintaining functionality even if one modality is unavailable—for instance, interpreting emotions through visual cues when verbal input is absent [11]. Multimodal fusion has been applied across diverse areas, such as audio–visual speech recognition [10], multimodal emotion analysis, medical imaging, and multimedia event identification.
Generally, the methods for multimodal fusion can be divided into two main groups: model-independent and model-specific approaches [12]. Model-independent techniques are simple to apply using standard unimodal machine learning frameworks but typically rely on general methods not optimized for multimodal data processing. Conversely, model-specific approaches utilize specialized algorithms explicitly crafted to analyze and merge information from multiple modalities. Depending on the fusion model used, these approaches are often further categorized into kernel-based techniques, graphical models, and neural networks [12]. Model-independent fusion offers greater adaptability, as it can be implemented with virtually any unimodal regressor or classifier without the need for dedicated fusion models. This category encompasses the majority of existing fusion methods, where integration is frequently achieved through feature concatenation [13] or by aggregating predictions from individual modalities [14]. While its simplicity is a major advantage, it also underscores the limitations in fully harnessing the richness of multimodal data.
Model-independent approaches are further divided into early fusion (feature-based), late fusion (decision-based), and hybrid fusion. Technically, early fusion integrates features immediately after their extraction, typically by concatenating their representations. This method is among the pioneering approaches to multimodal research aimed at achieving multimodal representation learning; it identifies correlations and interactions between the low-level features of individual modalities. A notable benefit of early fusion lies in its straightforwardness, as it involves training just one model, simplifying the training process compared to late or hybrid fusion approaches. Late fusion, in contrast, combines outputs from individual modalities after each has made a decision, often using classification or regression methods. Integration is performed using mechanisms such as averaging, voting, weights derived from channel noise or signal variance, or even learned fusion models [15]. Late fusion provides flexibility by allowing each modality to be modeled using distinct predictors, which can be optimized independently for their respective modalities. This approach is particularly robust in scenarios where one or more modalities are missing and enables training even when correlated multimodal data are unavailable. Despite these advantages, late fusion disregards feature-level associations among the modalities, potentially missing important cross-modal relationships. Hybrid fusion, on the other hand, aims to merge the strengths of both early and late fusion within a unified framework. It integrates outputs from early fusion with those of individual unimodal predictors, leveraging the advantages of both methods. Hybrid fusion has been successfully applied in domains such as multimodal speaker identification and multimedia event detection [16].
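To make the distinction concrete, the following minimal PyTorch sketch contrasts early fusion (feature concatenation feeding a single predictor) with late fusion (averaging per-modality decisions). The module names and feature dimensions are illustrative and are not taken from any specific system discussed here.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features, then train a single predictor."""
    def __init__(self, d_audio=128, d_visual=256, d_out=2):  # 2 = valence, arousal
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_audio + d_visual, 128), nn.ReLU(), nn.Linear(128, d_out))

    def forward(self, f_a, f_v):                 # f_a: (B, d_audio), f_v: (B, d_visual)
        return self.head(torch.cat([f_a, f_v], dim=-1))

class LateFusion(nn.Module):
    """Independent per-modality predictors whose decisions are averaged."""
    def __init__(self, d_audio=128, d_visual=256, d_out=2):
        super().__init__()
        self.audio_head = nn.Linear(d_audio, d_out)
        self.visual_head = nn.Linear(d_visual, d_out)

    def forward(self, f_a, f_v):
        return 0.5 * (self.audio_head(f_a) + self.visual_head(f_v))

f_a, f_v = torch.randn(4, 128), torch.randn(4, 256)
print(EarlyFusion()(f_a, f_v).shape, LateFusion()(f_a, f_v).shape)  # (4, 2) each
```

Hybrid fusion would combine both paths, e.g., by feeding the early-fusion prediction and the unimodal predictions into a final aggregation stage.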
Deep learning (DL) models achieve cutting-edge performance across a wide array of vision-based recognition tasks, including image classification, object detection, visual object tracking, and action recognition [17,18]. Building on this success, researchers have proposed numerous methods for video-based dimensional ER, employing Convolutional Neural Networks (CNNs) for extracting deep features and Recurrent Neural Networks (RNNs) for modeling temporal dependencies [13,19]. In the context of vocal emotion recognition, DL models have gained significant traction by utilizing spectrograms with 2D CNNs [19,20], or by processing raw waveforms through 1D CNNs [13]. Currently, the majority of dimensional ER approaches [13,21] integrate audio–visual data by joining deep features, which are subsequently fed into a Long Short-Term Memory (LSTM) network for predicting valence and arousal. Although LSTM-based fusion models excel in capturing intramodal relationships, they often fail to represent intermodal interactions between the modalities effectively. To overcome these challenges, we propose the extraction of richer and more distinctive features that capitalize on the complementary nature of audio and visual modalities, thereby enhancing the efficiency of dimensional ER systems.
Generally, the goal of audio–visual emotion recognition (AVER) is to effectively integrate both visual and auditory data to predict emotional states accurately. However, real-world videos often exhibit significant challenges due to misalignment between the audio and visual components. For instance, visual frames may lack corresponding audio tracks, or audio signals may not have matching visual content, making it difficult to interpret information from both modalities. Further complexities arise when multiple audio sources are associated with a single visual target or one audio source corresponds to several visual targets, leading to ambiguity in aligning the modalities. Additionally, visual data are often affected by external factors such as occlusions or poor lighting conditions, while audio signals are highly sensitive to temporal issues like background noise or synchronization errors. These challenges emphasize the importance of addressing alignment problems to improve the accuracy and reliability of emotion recognition systems. Table 1 summarizes the reviewed fusion techniques based on their suitable input modalities, strengths, limitations, and applied techniques.
To address the challenges of modality misalignment and achieve robust encoding of high-quality representations while retaining modality-specific details and extracting shared semantic content, this study introduces the Hybrid Attention of Single and Parallel Cross-Modal (HASPCM) framework. This innovative module ensures the normalization of embeddings within the same class, promoting proximity, while also driving separation among embeddings from different classes. The HASPCM module operates by first extracting single-modal information via the single-modal attention (SMA) block; then, it refines these representations with the parallel cross-modal attention (PCMA) method. By combining SMA and PCMA, our model captures both local modality-specific features and ensures global event consistency across audio and visual modalities, improving emotion recognition accuracy. The proposed cross-modality attention mechanism enhances emotion recognition by aligning emotional cues, integrating complementary audio-visual information, increasing robustness to noise or misalignment, and capturing the temporal evolution of emotional expression. The distinctive features of the proposed method include the following:
  • We propose a novel hybrid attention mechanism that combines both unimodal and parallel cross-modal modules. This approach enables our model to efficiently learn intramodal correlations (within audio and visual modalities) and capture global information necessary for accurate emotion recognition.
  • The single-modal attention (SMA) framework is introduced to explore and capture internal relationships within each modality (audio or visual) independently. By focusing on modality-specific correlations, this block extracts highly relevant features essential for identifying emotional cues within audio and visual data.
  • The parallel cross-modal attention (PCMA) mechanism ensures consistency and alignment between the emotional signals conveyed by audio and visual modalities. By integrating cross-modal attention in parallel, the block facilitates the effective fusion of emotional context across both modalities, reducing misalignment and enhancing robustness.
  • Unlike previous methods, this work introduces a Cross-Modality Relation Attention (CMRA) mechanism that enhances emotion recognition by aligning emotional cues, integrating complementary information from audio and visual modalities, improving robustness to noise or misalignment, and capturing the temporal dynamics of emotional expressions.
  • The results of comprehensive experimental testing on the Affwild2, AFEW-VA, and IEMOCAP datasets confirm that the developed audio–visual fusion model surpasses existing techniques for dimensional emotion recognition performance.
The remainder of this paper is structured as follows: Section 2 reviews the existing literature on dimensional emotion recognition and attention mechanisms for audio–visual fusion. Section 3 details the architecture and mechanisms of the proposed HASPCM fusion model. Section 4 discusses the experimental results and the analysis of the proposed method on three datasets. Finally, in Section 5, we summarize the advantages and limitations of the proposed method and discuss directions for further research.

2. Related Works

2.1. Audio–Visual Emotion Recognition

The advancement of deep neural networks has driven significant progress in core pattern recognition domains, including video object recognition, audio recognition, and speaker identification. These breakthroughs have extended to integrated applications, including AVER and the emerging domain of paralinguistics. One of the pioneering efforts in audio–visual integration for dimensional ER via deep learning models was introduced by Tzirakis et al. [13]. In this approach, visual features are extracted using ResNet50 [37], while audio features are captured using a 1D CNN. These features are processed by LSTM networks to generate final predictions. A multimodal framework for continuous emotion recognition was proposed in [38], integrating visual and acoustic features to predict arousal and valence levels. This method employs a pre-trained CNN for visual feature extraction, which is combined with handcrafted audio features. During fine-tuning, convolutional layers are initially frozen to retain pre-learned weights, while fully connected layers are optimized. Schöneveld et al. [19] leveraged advancements in deep learning, such as knowledge distillation and high-performing architectures, to employ model-level fusion strategies. This included using a teacher–student framework for visual modalities and 2D-CNN spectrograms for audio, followed by employing LSTMs to capture temporal dependencies. Technically, Ghosal et al. [39] introduced DialogueGCN, leveraging the capabilities of Graph Convolutional Networks (GCNs) to construct a dynamic graph model that simulates interactions between speakers. In this approach, speakers are represented as nodes, while dependencies between them form the edges of the graph. However, GCNs are susceptible to over-smoothing, which can hinder the extraction of deeper semantic information. In speaker-based emotion recognition, although DialogueGCN and similar models capture dependencies between different speakers, they fail to explicitly distinguish the speaker of an utterance during the final emotion recognition process. To address this limitation, Majumder et al. [40] proposed DialogueRNN, which integrates speaker information, utterance context, and emotional cues from multimodal features. The model employs three Gated Recurrent Units (GRUs)—Party GRU, Global GRU, and Emotion GRU—to effectively capture the speaker state, contextual dependencies, and affective state, respectively.
In addition, Kuhnke et al. [41] proposed a two-stream audio–visual network, where visual features are derived using R(2+1)D [42] pre-trained on action recognition datasets, while audio features are obtained through ResNet-18 [37]. These features are concatenated to enhance performance and predict valence and arousal. Similarly, Atmaja et al. [43] employed multitask learning, optimizing a loss function for three emotion attributes derived from audio–visual features. They also proposed a fusion strategy that combines early and late fusion through a multistage Support Vector Regressor (SVR) to improve recognition outcomes. Meanwhile, Ni et al. [44] proposed a complementary facial image fusion strategy to solve the interference due to pose and lighting changes in FER. A motion detail enhancement strategy is implemented to retain microexpression-relevant dynamics, and a novel data augmentation method generates diverse u/v/s images to address limited training data. After preprocessing, the LD-FMERN network is proposed to extract microexpression-related features, incorporating spatial-channel modulators to focus on salient regions. A locally diverse feature mining strategy further refines features by leveraging microexpression cues from small, diverse facial regions. Despite these advancements, existing approaches still often fail to adequately capture intermodal relationships and task-specific features, limiting their effectiveness. To overcome these limitations, this study emphasizes attention mechanisms as a means to derive complementary features from both modalities, enhancing the accuracy and robustness of our emotion recognition system.

2.2. Attention Models for A-V Integration

Attention-based multimodal integration was devised to seamlessly combine features from diverse modes, advancing the extraction of complementary information and the exploration of intricate relationships within multimodal data. Recent advancements have introduced several attention mechanism variants, which have found extensive utilization in NLP and computer vision tasks. These models leverage the self-attention mechanism, renowned for its capability to capture long-range semantic dependencies while enabling efficient parallelized training. More recently, the Transformer techniques leveraging a unique self-attention mechanism proposed by Vaswani et al. [45] have attracted growing research interest owing to their strong capabilities of modeling long-term dependencies. An innovative network architecture, VAANet, was introduced by Zhao et al. [46], combining spatial, channel-wise, and temporal attention mechanisms within a visual 3D CNN and temporal attention within an audio 2D CNN for video emotion recognition. The model further incorporates a novel PCCE loss function, allowing VAANet to produce polarity-preserved attention maps. On the other hand, Kumar et al. [47] proposed another noteworthy approach by employing bilinear self-attention in association with a gating system to capture contextual information over extended time periods. This method selectively integrates unimodal and cross-modal features, emphasising their respective contributions to enhancing overall performance. Most recently, Ni et al. [48] proposed a novel cross-modality attention fusion network (CMFN) to effectively fuse features extracted from multiple Feature Extraction Networks (FENs). To enhance spatial correlations across different facial modalities, a spatial attention mechanism is introduced, allowing the model to capture complementary information more effectively. This integration fully leverages modality complementarity, enabling a more robust and precise Facial Expression Recognition (FER) system.
In a similar vein, Ghaleb et al. [49] introduced an attention mechanism that assigns weights to video sequence time windows, thereby enabling the efficient modeling of temporal interactions between audio and visual modalities. Their approach employs transformer-based encoders [45] to compute attention weights through self-attention; these are subsequently used for emotion classification. To further advance multimodal learning, Lee et al. [50] developed a spatiotemporal attention-based deep neural network for dimensional emotion recognition. Their model utilizes temporal attention to identify emotional cues in speech; this is complemented by spatiotemporal attention applied via ConvLSTM modules to extract features from facial videos. Additionally, 3D CNNs have been used to recognize emotions in facial videos without requiring pixel-level annotations. Further technical developments were presented by Zhang et al. [51]; they explored methods to enhance fusion performance by looking beyond individual modalities. They developed a leader–follower fusion strategy for emotion recognition, where audio (A) and visual (V) features are encoded and combined to derive attention weights. These weights are then applied to the visual features, which are concatenated with the original visual features to generate the final predictions. Despite the advancements brought forth by these attention-based multimodal fusion techniques, most approaches still predominantly focus on intramodal relationships while overlooking the critical interactions between audio and visual modalities. This limitation restricts their ability to effectively capture complementary information, thereby constraining the potential of these approaches to further enhance emotion recognition performance while using audio–visual datasets.

3. The Proposed Method

3.1. Visual Network

Facial expression plays a crucial role in non-verbal communication, as it effectively conveys a person’s internal states, emotions, and intentions. Deep learning networks have significantly advanced the understanding of low-dimensional discriminative features extracted from high-dimensional, complex facial patterns for automatic Facial Expression Recognition (FER). Specifically, facial expressions can be seen as dynamic variations in key parts of the face (e.g., the eyes, nose, and mouth), which combine to change both local features and the entire face. The primary challenge for FER lies in capturing these dynamic variations in facial structure across consecutive frames. Kim et al. [52] proposed a frame-based approach to FER, where spatial image features from representative expression-state frames are learned using a convolutional neural network. Subsequently, the temporal dynamics of these spatial features are captured through LSTM for facial expression analysis. Various methods for dimensional FER have been investigated using 2D-CNNs and LSTMs [53,54]. However, 3D-CNNs have proven to be more effective in the detection of spatiotemporal motion in video data [55]; these networks have also been applied to dimensional facial ER. Building on the success of 3D-CNNs, an Inflated 3D-CNN (I3D) model [56] has been used to extract spatiotemporal features from facial video clips.
Carreira and Zisserman [56] developed the I3D model for action recognition; they demonstrated its ability to efficiently capture spatiotemporal dynamics within the visual modality while maintaining a lower parameter count compared to conventional 3D CNNs. By expanding the filters and pooling kernels of the 2D ConvNet into a 3D structure, the I3D architecture offers a distinctive advantage. This configuration allows the use of pre-trained 2D CNNs, trained on extensive image datasets, improving spatial feature learning from video data. Although initially tailored for action recognition, the I3D model has been effectively employed in other affective computing domains, including video-based localization [57,58]. The proposed method derives spatiotemporal features from the facial modality using the I3D model.
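As a rough illustration of the inflation idea, a pre-trained 2D convolution kernel can be repeated along a new temporal axis and rescaled so that the inflated 3D filter initially reproduces the 2D response on a temporally constant input. The sketch below is a simplified version of this weight-inflation step, not the authors' I3D implementation; layer sizes and the clip shape are hypothetical.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a 2D convolution into 3D by repeating its weights over time."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride), padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None)
    with torch.no_grad():
        # Repeat the 2D kernel along the temporal axis and rescale so that the
        # response to a temporally constant input matches the original filter.
        w2d = conv2d.weight.data                      # (out, in, kH, kW)
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias.data)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv3d = inflate_conv2d(conv2d)
clip = torch.randn(1, 3, 8, 112, 112)                 # (B, C, T, H, W) facial clip
print(conv3d(clip).shape)
```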

3.2. Audio Network

Speech is the fastest and most natural mode of communication among humans. It not only conveys formal features of linguistic expressions—such as phonology, morphology, syntax, and semantics—but also reflects the speaker’s emotional state. Speech carries affective information through both linguistic (explicit) and paralinguistic cues, and this information can be effectively extracted using advanced speech processing techniques. Vocal ER has predominantly relied on traditional handcrafted features, such as Mel-Frequency Cepstral Coefficients (MFCCs) [59]. However, recent advancements in DL models have significantly enhanced the field. While emotion recognition can be approached with spectrograms using 2D CNNs [19,20] or with raw audio signals via 1D-CNNs [13], spectrograms are particularly valuable as they capture crucial paralinguistic features that reflect an individual’s emotional state [60]. Furthermore, some research has explored the use of 2D-CNNs for emotion recognition from spectrograms [61,62]. In the proposed method, we integrate spectrograms and 2D-CNN models to extract emotional features. For the Affwild2 dataset, we used the ResNet18 [37] model. Given the varying sizes of these datasets, distinct 2D-CNN models were used for each to prevent overfitting.
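A minimal sketch of how a standard torchvision ResNet-18 can be adapted to single-channel spectrogram input with a two-dimensional (valence–arousal) output is shown below; the exact layer configuration and output head used in this work may differ, so treat the snippet as illustrative rather than the precise audio backbone.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Train from scratch (weights=None) and accept 1-channel log-power spectrograms.
audio_net = resnet18(weights=None)
audio_net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
audio_net.fc = nn.Linear(audio_net.fc.in_features, 2)   # valence and arousal

spec = torch.randn(8, 1, 64, 107)    # batch of 64x107 spectrograms (one per subsequence)
print(audio_net(spec).shape)         # torch.Size([8, 2])
```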

3.3. Cross-Attentional Fusion

Multimodal learning has recently demonstrated significant success by leveraging the mutually beneficial relationships among different modalities through cross-modal attention mechanisms [48,63]. Cross-modal attention, also known as cross-attention or co-attention, enables one modality to focus on another based on their intermodal interactions [24]. Building on this concept, cross-attention (CA) has been effectively employed to capture complementary relationships, achieving state-of-the-art performance in emotion recognition (ER) [50,64]. The audio (A) and visual (V) models are trained separately, with deep features extracted for both modalities. The performance of valence and arousal varies significantly between the A and V modalities. The V modality, rich in appearance-based information, effectively conveys valence-related cues by capturing facial expressions throughout a sequence. In contrast, audio signals encode crucial information about expression intensity, which is reflected in the energy of A signals. In a given video sequence, the V modality may provide more relevant information in certain clips, while the A modality may be more informative in others. Since multiple modalities offer diverse and complementary information for valence and arousal, integrating the A and V modalities enhances predictive performance beyond what a single modality can achieve. To ensure the effective fusion of these modalities for valence and arousal prediction, we employ a cross-attention-based fusion mechanism that efficiently encodes intermodal interactions while preserving intramodal characteristics.

3.4. Feature-Level Fusion

To improve the extraction of modality-specific features, this study explored the fusion of features generated by multiple backbone models for both audio (A) and visual (V) modalities. Incorporating multiple backbone networks within each modality enables the acquisition of diverse, complementary information, enhancing the richness of the overall feature set. Specifically, visual features were obtained using a combination of I3D, R3D, and 2D CNNs, which were further integrated with LSTM networks. This approach is fruitful because the 3D CNN models, namely I3D and R3D, excel in capturing spatiotemporal relationships, particularly those associated with short-term temporal dynamics. Meanwhile, the 2D CNN, paired with LSTM, focuses on extracting spatial features and modeling long-term temporal dependencies effectively. For the audio modality, feature extraction was performed using a 2D CNN trained on spectrograms, complemented by traditional handcrafted MFCCs. MFCCs are widely used in speech processing across various tasks and add an essential layer of auditory feature representation.

3.5. Overall Architecture

The overall framework of our approach is illustrated in Figure 1. Initially, audio–visual pairs are processed through the modality representation module to extract high-dimensional embeddings for both audio and visual inputs. These embeddings are then passed to the multimodal fusion module, where distinct mechanisms refine the information. The SMA block leverages self-attention to capture intramodal dependencies, enhancing the understanding of the individual modalities. Meanwhile, the PCMA block employs a parallel attention mechanism to model cross-modal interactions, effectively complementing the extracted unimodal features. In the final stage, both multimodal and single-modal representations are used in the emotion recognition module to generate the final outputs.

Modality Representation Module

Given an input video sub-sequence S, L non-overlapping video clips are uniformly sampled. The deep feature vectors for these clips are then extracted using pre-trained models for both audio and visual modalities. We denote the deep feature vectors for audio and visual modalities as F a and F v , respectively; they correspond to the fixed-size input video sub-sequence S, as described below:
$$F_a = \{ f_a^1, f_a^2, \ldots, f_a^L \} \in \mathbb{R}^{d_a \times L},$$
$$F_v = \{ f_v^1, f_v^2, \ldots, f_v^L \} \in \mathbb{R}^{d_v \times L},$$
where $d_a$ and $d_v$ indicate the dimensions of the audio and visual feature vectors, respectively. The audio and visual feature vectors of the individual clips are denoted by $f_a^l$ and $f_v^l$, where $l = 1, 2, \ldots, L$. As illustrated in Figure 2, the CSSA module consists of two components: the SEMantic ATtention (SEMAT) block and the SPAtial ATtention (SPAAT) subsystem.
  • SEMantic ATtention (SEMAT) Subsystem
    To ensure the visual modality prioritizes semantic information closely linked to the audio modality, different weights are assigned to various semantic features. Initially, the visual features $F_v$ and the audio features $F_a$ are projected into a common feature space using linear mappings. The transformed features are then fused using a Hadamard element-wise product. A semantic attention map $SEM_{att}$ is generated by linearly transforming the fused features from both modalities, after which a sigmoid activation function is applied. This semantic attention map $SEM_{att}$ is subsequently applied to the visual features, enabling a focus on semantic details aligned with the audio modality. A comprehensive breakdown of these steps is provided below:
    $$F_1^{av} = \Delta_1^{se}(W_1^{se} F_a) \odot \Delta_2^{se}(W_2^{se} F_v),$$
    $$SEM_{att} = \mathrm{sigmoid}\big(\Delta_3^{se}(W_3^{se} F_v) \odot F_1^{av}\big),$$
    $$F_{se}^{v} = (SEM_{att} + 1) \odot F_v.$$
    Here, $\Delta_i^{se}$ represents a linear transformation, and $W_i^{se}$ denotes the corresponding transformation weight. The semantic attentive visual features are denoted by $F_{se}^{v}$.
  • SPAtial ATtention (SPAAT) Subsystem
    The spatial attention mechanism allows the model to concentrate on visually critical areas that are closely tied to emotions. By doing so, it reduces the impact of areas that are irrelevant or hold less importance. To begin, both visual and audio features are linearly transformed into a common feature space. These transformed features are then combined through the Hadamard element-wise product. A spatial attention map $SPA_{att}$ is derived by linearly transforming the fused features from both modalities; this is followed by the application of a softmax activation function. The attention map is then used to modify the visual features, enabling the model to emphasize areas with crucial and distinctive details. A detailed explanation of this process follows:
    $$F_2^{av} = \Delta_1^{sp}(W_1^{sp} F_a) \odot \Delta_2^{sp}(W_2^{sp} F_v),$$
    $$SPA_{att} = \mathrm{softmax}\big(\Delta_3^{sp}(W_3^{sp} F_v) \odot F_2^{av}\big),$$
    $$F_{sp}^{v} = SPA_{att} \odot F_v.$$
    Here, $\Delta_i^{sp}$ represents a linear transformation, while $W_i^{sp}$ denotes the associated transformation weight, and $F_{sp}^{v}$ corresponds to the spatially attended visual features. Finally, we add $F_{se}^{v}$ and $F_{sp}^{v}$ to obtain the co-semantic and spatial attentive visual feature $F_{ssaf}^{v}$:
    $$F_{ssaf}^{v} = F_{se}^{v} + F_{sp}^{v}.$$
    Ultimately, two distinct Bidirectional LSTM (Bi-LSTM) networks are employed to capture the temporal dependencies in both the forward and backward directions for each modality. This procedure can be formally expressed as follows:
    $$F_l^v = \mathrm{BiLSTM}(F_{ssaf}^{v}),$$
    $$F_l^a = \mathrm{BiLSTM}(F_a),$$
    where $F_l^v \in \mathbb{R}^{T \times d_l^v}$ and $F_l^a \in \mathbb{R}^{T \times d_l^a}$; here, $d_l^v$ and $d_l^a$ denote the updated dimensions of the visual and audio features, respectively. A code sketch of the full CSSA computation (SEMAT, SPAAT, and Bi-LSTM) is provided after this list.
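The following PyTorch sketch strings the SEMAT, SPAAT, and Bi-LSTM steps together under simplifying assumptions: each $\Delta(W\cdot)$ pair is folded into a single linear layer, all features are projected to a common dimension, and the attention normalization axis is chosen for illustration. It is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CSSA(nn.Module):
    """Sketch of the co-semantic (SEMAT) and spatial (SPAAT) attention steps.

    Each Delta(W .) pair from the equations is modeled by one nn.Linear, and
    all features are projected to a common dimension d; both choices are
    simplifying assumptions made for illustration.
    """
    def __init__(self, d_a=512, d_v=512, d=256):
        super().__init__()
        self.sem_a, self.sem_v, self.sem_g = nn.Linear(d_a, d), nn.Linear(d_v, d), nn.Linear(d_v, d)
        self.spa_a, self.spa_v, self.spa_g = nn.Linear(d_a, d), nn.Linear(d_v, d), nn.Linear(d_v, d)
        self.proj_v = nn.Linear(d_v, d)                     # so attention maps match the visual size
        self.bilstm_v = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.bilstm_a = nn.LSTM(d_a, d // 2, batch_first=True, bidirectional=True)

    def forward(self, F_a, F_v):                            # (B, L, d_a), (B, L, d_v)
        Fv_p = self.proj_v(F_v)
        # SEMAT: fuse modalities with a Hadamard product, gate with a sigmoid map.
        f1 = self.sem_a(F_a) * self.sem_v(F_v)
        sem_att = torch.sigmoid(self.sem_g(F_v) * f1)
        F_se = (sem_att + 1.0) * Fv_p
        # SPAAT: same fusion, attention map normalized over the clip axis.
        f2 = self.spa_a(F_a) * self.spa_v(F_v)
        spa_att = torch.softmax(self.spa_g(F_v) * f2, dim=1)
        F_sp = spa_att * Fv_p
        F_ssaf = F_se + F_sp                                # co-semantic + spatial visual features
        F_l_v, _ = self.bilstm_v(F_ssaf)                    # temporal modeling per modality
        F_l_a, _ = self.bilstm_a(F_a)
        return F_l_v, F_l_a

F_a, F_v = torch.randn(2, 8, 512), torch.randn(2, 8, 512)
F_l_v, F_l_a = CSSA()(F_a, F_v)
print(F_l_v.shape, F_l_a.shape)       # (2, 8, 256) each
```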

3.6. Hybrid Attention of Single and Parallel Cross-Modal Framework

  • Single-Modal Attention (SMA) Subsystem
    The SMA block is engineered to capture relationships between regions and frames within both the audio and visual modalities. It is implemented using a self-attention mechanism [45]. Focusing on the visual modality as an example, we begin by linearly transforming the visual features $F_l^v$ into query features $Q_v$, key features $K_v$, and value features $V_v$, all of which share the same dimensions of $\mathbb{R}^{T \times d_l^v}$. Next, we apply the scaled inner-product operation followed by the softmax function to assess the relevance of various elements within the visual modality. Subsequently, the updated visual features are obtained. The overall process can be summarized as follows:
    $$Q_v, K_v, V_v = \Delta(F_l^v W_q^v, F_l^v W_k^v, F_l^v W_v^v),$$
    $$Z_{att} = \mathrm{softmax}\!\left(\frac{Q_v (K_v)^T}{\sqrt{d}}\right) V_v,$$
    where $W_q^v$, $W_k^v$, and $W_v^v \in \mathbb{R}^{d_l^v \times d_l^v}$ represent the learnable transformation weights, and $\Delta(\cdot)$ is the linear transformation. In our approach, multi-head attention employs m parallel attention heads to capture diverse semantic information from the input data, analogous to multi-filter convolution in deep learning architectures, where multiple filters operate in parallel to extract different features from the input. Each attention head independently learns to focus on different aspects of the input, enabling the model to gather a variety of features, such as fine-grained local patterns or broader contextual relationships. By processing the input through multiple attention heads simultaneously, the model can integrate complementary information and enhance its ability to represent complex relationships within the data effectively. This design improves the overall performance by allowing the attention mechanism to attend to different positions and perspectives within the input sequence.
    $$H_v = \mathrm{MultiHead}(F_l^v, F_l^v, F_l^v),$$
    $$H_a = \mathrm{MultiHead}(F_l^a, F_l^a, F_l^a).$$
    Then, a residual connection combines the multi-head attention output with the corresponding input features. The final intramodal representations $U_v$ and $U_a$ are obtained after a layer normalization operation. This process can be written as follows:
    $$U_v = \psi(H_v + F_l^v),$$
    $$U_a = \psi(H_a + F_l^a),$$
    where $\psi$ represents the LayerNorm operation.
  • Parallel Cross-Modal Attention (PCMA) block.
    Figure 3 presents a detailed visualization of the PCMA block. Motivated by the parallel attention module in [65], the PCMA block is developed to capture the correlation between audio and visual features, ensuring uniformity in the emotion details captured by both modalities. By employing the visual features $U_v \in \mathbb{R}^{T \times d}$ and audio features $U_a \in \mathbb{R}^{T \times d}$ as inputs, we can compute two affinity matrices, namely $M_{aff}^v \in \mathbb{R}^{T \times T}$ and $M_{aff}^a \in \mathbb{R}^{T \times T}$, using the following equations:
    $$M_{aff}^v = \frac{(W_1^v U_v)(W_1^a U_a)^T}{\sqrt{d}},$$
    $$M_{aff}^a = \frac{(W_1^a U_a)(W_1^v U_v)^T}{\sqrt{d}},$$
    where $W_1^v \in \mathbb{R}^{T \times T}$ and $W_1^a \in \mathbb{R}^{T \times T}$ are two learnable weight matrices in the linear transformation, and $(\cdot)^T$ denotes the transpose operation.
    The visual attention map $A_v$ and the audio attention map $A_a$ are generated via the following equations:
    $$A_v = \mathrm{softmax}\big(\tanh(W_2^v U_v + W_2^a U_a)\, M_{aff}^v\big),$$
    $$A_a = \mathrm{softmax}\big(\tanh(W_2^a U_a + W_2^v U_v)\, M_{aff}^a\big),$$
    where $W_2^v$ and $W_2^a$ are weight matrices. In the end, the fused visual and audio features are obtained by
    $$C_v = A_v U_v,$$
    $$C_a = A_a U_a.$$
    A code sketch of the SMA and PCMA computations is provided after this list.
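Below is a compact sketch of the SMA and PCMA computations described above, using nn.MultiheadAttention for the single-modal block and explicit affinity matrices for the parallel cross-modal block. The weight shapes (in particular, mapping features to the clip axis before the attention maps) are assumptions made only to keep the example runnable; they do not claim to match the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class HASPCM(nn.Module):
    """Sketch of the SMA (per-modality self-attention) and PCMA blocks.

    The use of nn.MultiheadAttention and the assumed weight shapes are
    illustrative simplifications of the equations above.
    """
    def __init__(self, d=256, T=8, heads=4):
        super().__init__()
        self.sma_v = nn.MultiheadAttention(d, heads, batch_first=True)
        self.sma_a = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_v, self.norm_a = nn.LayerNorm(d), nn.LayerNorm(d)
        # PCMA parameters: temporal mixing matrices W1 (T x T) and maps to R^T.
        self.W1_v = nn.Parameter(torch.randn(T, T) / T)
        self.W1_a = nn.Parameter(torch.randn(T, T) / T)
        self.W2_v, self.W2_a = nn.Linear(d, T), nn.Linear(d, T)
        self.scale = d ** 0.5

    def forward(self, F_l_v, F_l_a):                        # (B, T, d) each
        # SMA: multi-head self-attention + residual + LayerNorm per modality.
        H_v, _ = self.sma_v(F_l_v, F_l_v, F_l_v)
        H_a, _ = self.sma_a(F_l_a, F_l_a, F_l_a)
        U_v, U_a = self.norm_v(H_v + F_l_v), self.norm_a(H_a + F_l_a)
        # PCMA: cross-modal affinity matrices over the temporal axis.
        M_v = (self.W1_v @ U_v) @ (self.W1_a @ U_a).transpose(1, 2) / self.scale
        M_a = (self.W1_a @ U_a) @ (self.W1_v @ U_v).transpose(1, 2) / self.scale
        A_v = torch.softmax(torch.tanh(self.W2_v(U_v) + self.W2_a(U_a)) @ M_v, dim=-1)
        A_a = torch.softmax(torch.tanh(self.W2_a(U_a) + self.W2_v(U_v)) @ M_a, dim=-1)
        return A_v @ U_v, A_a @ U_a                         # C_v, C_a: (B, T, d)

F_l_v, F_l_a = torch.randn(2, 8, 256), torch.randn(2, 8, 256)
C_v, C_a = HASPCM()(F_l_v, F_l_a)
print(C_v.shape, C_a.shape)            # torch.Size([2, 8, 256]) each
```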

3.7. Audio–Visual Fusion Module

A joint representation $F_c^{va}$ is obtained by element-wise multiplication of the attended visual features $C_v$ and audio features $C_a$. The output obtained by temporal concatenation of $C_v$ and $C_a$ is combined with $F_c^{va}$ and fed to the CMRA to yield an attentive output denoted by Z. The joint representation $F_c^{va}$ is then combined with the attentive output Z, and this combined representation subsequently undergoes layer normalization:
$$F_c^{va} = C_v \odot C_a,$$
$$Z = \mathrm{CMRA}\big(F_c^{va}, \mathrm{cat}(C_v, C_a)\big),$$
$$Z_{va} = \mathrm{LayerNorm}(Z + F_c^{va}),$$
where $\mathrm{CMRA}\big(F_c^{va}, \mathrm{cat}(C_v, C_a)\big) = \mathrm{softmax}\!\left(\frac{Q_1 K_{1,2}^T}{\sqrt{d_m}}\right) V_{1,2}$. Here, $Q_1 = F_c^{va} W_Q$, $K_{1,2} = \alpha_{av} W_K$, $V_{1,2} = \alpha_{av} W_V$, and $\alpha_{av} = \mathrm{cat}(C_v, C_a)$.
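The fusion step can be sketched as follows, with the joint (element-wise) representation providing the query and the temporally concatenated attended features providing the keys and values; dimensions and layer choices are illustrative only.

```python
import torch
import torch.nn as nn

class CMRAFusion(nn.Module):
    """Sketch of the audio-visual fusion step: the joint representation queries
    the concatenated attended features (CMRA), followed by a residual LayerNorm."""
    def __init__(self, d=256):
        super().__init__()
        self.W_Q, self.W_K, self.W_V = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.norm = nn.LayerNorm(d)
        self.scale = d ** 0.5

    def forward(self, C_v, C_a):                    # (B, T, d) each
        F_c = C_v * C_a                             # element-wise joint representation
        alpha = torch.cat([C_v, C_a], dim=1)        # temporal concatenation: (B, 2T, d)
        Q = self.W_Q(F_c)
        K, V = self.W_K(alpha), self.W_V(alpha)
        Z = torch.softmax(Q @ K.transpose(1, 2) / self.scale, dim=-1) @ V
        return self.norm(Z + F_c)                   # Z_va: (B, T, d)

C_v, C_a = torch.randn(2, 8, 256), torch.randn(2, 8, 256)
print(CMRAFusion()(C_v, C_a).shape)                 # torch.Size([2, 8, 256])
```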
The Concordance Correlation Coefficient ($\rho_c$) has been widely used in the literature to measure the level of agreement between the predictions ($x$) and ground truth ($y$) annotations for dimensional ER [13]. Let $\mu_x$ and $\mu_y$ represent the mean of the predictions and ground truth, respectively. Similarly, if $\sigma_x^2$ and $\sigma_y^2$ denote the variance of the predictions and ground truth, respectively, then $\rho_c$ between the predictions and ground truth is as follows:
$$\rho_c = \frac{2 \sigma_{xy}^2}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2},$$
where $\sigma_{xy}^2$ denotes the predictions–ground truth covariance. Although MSE has been widely used as a loss function for regression models, we use $L = 1 - \rho_c$, since it is a standard and conventional loss function in the literature on dimensional ER [13]. The parameters of our A-V fusion model ($W_1^{se}, W_2^{se}, W_3^{se}, W_1^{sp}, W_2^{sp}, W_3^{sp}, W_q^v, W_k^v, W_v^v$) are optimized according to this loss.
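A minimal sketch of the CCC-based training objective $L = 1 - \rho_c$ is given below; the batching and reduction strategy are assumptions, since the loss can equally be computed per sequence or per mini-batch.

```python
import torch

def ccc_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L = 1 - rho_c, the Concordance Correlation Coefficient loss (illustrative)."""
    mu_x, mu_y = pred.mean(), target.mean()
    var_x, var_y = pred.var(unbiased=False), target.var(unbiased=False)
    cov_xy = ((pred - mu_x) * (target - mu_y)).mean()
    ccc = 2 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)
    return 1.0 - ccc

pred, target = torch.rand(64), torch.rand(64)
print(ccc_loss(pred, target))          # scalar tensor; equals 0 when predictions match exactly
```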
Test mode: As shown in Figure 1, we assume that a continuous video sequence is input to our model during inference. Feature representations F a and F v are extracted by A and V backbones for successive input clips and spectrograms and fed to the fusion model for the prediction of valence and arousal.

4. Experimental Evaluation

4.1. Datasets

  • AffWild2 Dataset [66]: Aff-Wild2 is one of the largest datasets in affective computing; it comprises 564 videos sourced from YouTube, all recorded in real-world, unconstrained settings. This extensive dataset contains approximately 2.8 million frames featuring 554 subjects (326 male and 228 female). Annotations were conducted for three main behaviour tasks: valence–arousal (all 564 videos), expressions of the seven basic emotions (546 videos), and 12 action units (541 videos). The annotations were performed on a frame-by-frame basis by a team of experts. Aff-Wild2 showcases a diverse range of subjects, ages, ethnicities, and environments, making it a valuable resource for affective computing research. The dataset includes a total of 2,816,832 frames featuring 455 unique subjects, with a demographic breakdown of 277 males and 178 females. Continuous annotations for valence and arousal are provided on a scale ranging from −1 to 1. The dataset is divided into three subsets—training, validation, and testing—using a subject-independent partitioning approach while ensuring that no subject appears in more than one subset. This partitioning results in 341 videos for training, 71 videos for validation, and 152 videos for testing.
    The final labels were obtained by averaging the four annotations. The mean inter-annotator correlation is 0.63 for valence and 0.60 for arousal. It is important to note that all subjects in each video were annotated. Aff-Wild2 is the largest audio–visual in-the-wild database annotated for valence and arousal. The dataset is divided into three subsets: training, validation, and test. The partitioning is subject-independent, ensuring that a person appears in only one of these subsets. Additionally, the training, validation, and test subsets include five, three, and eight videos, respectively, that feature two subjects.
  • AFEW-VA dataset [67]: The AFEW-VA dataset (Acted Facial Expressions in the Wild—Valence and Arousal) is a specialized collection designed for emotion recognition and analysis, and it is particularly focused on facial expressions. It is derived from authentic movie scenes, and it offers a broad range of natural expressions captured in real-world contexts. Comprising 600 video clips, each frame is annotated with 68 facial landmarks to provide a detailed representation of the face’s features. The dataset labels emotions according to two key dimensions: valence and arousal. Valence describes the emotional tone, ranging from negative to positive feelings, while arousal measures the intensity of the emotional response. Arousal values are classified on a scale from low to high: low values correspond to states of calmness or fatigue, while high values indicate heightened emotional states such as excitement or surprise. For each video clip, we compute the mean arousal value across all its frames, which is then used as the label for that particular video. These arousal labels are further categorized into three groups: less than 0, between 0 and 3, and greater than 3.
    To assess the performance of HMATN in facial feature extraction, we randomly selected 400 clips for training and 200 for testing. The AFEW-VA dataset is particularly challenging due to the variability in conditions such as illumination, background, pose, and facial scale. The clips, ranging from 10 to 120 frames, are short but complex, often depicting diverse facial expressions in dynamic circumstances. These video frames are meticulously annotated with both valence and arousal values, each ranging from −10 to 10, with 21 possible intensity levels. Given the dataset’s naturalistic (wild) setting, AFEW-VA presents significant challenges for emotion recognition, especially due to the unpredictable environmental factors. The evaluation protocol for AFEW-VA uses Correlation Coefficients (CCs) based on five-fold cross-validation to gauge the performance of continuous emotion recognition models. While the Aff-Wild dataset contains longer videos, AFEW-VA offers a smaller set, with fewer than 30,000 frames in total. As such, the use of k-fold cross-validation is critical to ensuring robust model evaluation.
  • IEMOCAP dataset [68]: The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset is a widely used multimodal corpus containing 12 h of audio–visual recordings, including audio, text transcriptions, videos, phonetic features, and facial expressions, making it a valuable resource for multimodal emotion recognition [68]. The dataset consists of interactions recorded from 10 professional actors (5 male and 5 female) engaged in scripted and spontaneous dyadic conversations. Each interaction, approximately 5 min long, is segmented into smaller utterances to facilitate emotion classification. A team of three annotators labeled each utterance, with the final annotation based on a majority agreement where at least two annotators had to concur. The dataset covers nine emotional categories: happiness, anger, excitement, sadness, neutrality, disgust, surprise, frustration, and fear. In line with previous studies, the “happiness” class includes both “happy” and “excited” utterances due to their conceptual similarity. Additionally, the dataset provides dimensional annotations, where each utterance is rated on a 1-to-5 scale for activation and valence. The IEMOCAP corpus comprises approximately 5231 utterances, each labeled with its corresponding emotional category. The number of samples per category varies, ensuring a diverse distribution of emotional expressions. The availability of multimodal features, including audio and transcriptions, makes IEMOCAP a comprehensive benchmark for developing and evaluating emotion recognition models. It consists of a total of 7380 samples, with 5162 samples allocated for training, 737 for validation, and 1481 for testing.

4.2. Implementation Details

We performed all experiments on an NVIDIA RTX A6000 graphics card with a GPU memory of 49140 MiB, Intel Core i9-12900KS processor, and Ubuntu 20.04 operating system. Pytorch was used as a deep learning framework, and its version was 1.12.0. All experiments were implemented with Python 3.9.12. The whole pre-training process took roughly 40 h. All the parameters used in this experiment are listed in Table 2. Moreover, the model parameter scale of this work is given in Table 3. For the Affwild2 dataset, we used the cropped and aligned images provided by the challenge organizers for the visual (V) modality [69]. In cases where frames were missing, we substituted in black frames (i.e., frames where all pixel values are zero). The faces were resized to 224 × 224 pixels before being input to the I3D network. To configure the video input, we established specific parameters for frame processing. The subsequence length was defined as 8 frames, while the overall sequence length was set to 64 frames. These values were derived by applying a down-sampling factor of 4 to the original 256-frame sequences, effectively reducing the data while maintaining essential temporal information. As a result, each sequence consisted of eight subsequences; this corresponds to 196,265 training samples, 41,740 validation samples, and 92,941 test samples.
In line with our approach for the AffWild2 dataset, we used the I3D model, initially pre-trained on Kinetics-400, and subsequently fine-tuned it as a 3D-CNN using facial expression videos from AffWild2. Departing from conventional methods, we replaced the standard pooling layer that usually follows the final convolutional layer with a scaled dot product of audio and visual features, as outlined in previous research. To mitigate overfitting, dropout was implemented in the linear layers with a probability of 0.8. For optimization, we employed stochastic gradient descent (SGD) with an initial learning rate of $1 \times 10^{-3}$ and a momentum of 0.8. Weight decay was set at $5 \times 10^{-4}$. We chose a batch size of eight and applied data augmentation to the training set through random cropping, enhancing the model’s scale invariance. The training process spanned 50 epochs, with early stopping implemented to select optimal weights.
For audio processing, we extracted the vocal signal from the corresponding video and resampled it at 44.1 kHz. This signal was then segmented to align with the 256-frame subsequences from the visual network. For the computation of the spectrogram, we applied a Discrete Fourier Transform (DFT) of length 1024 to each segment, using a window length of 20 ms and a hop length of 10 ms. The resulting spectrogram, with dimensions of 64 × 107 , corresponds to each visual subsequence.
Next, a normalization step was carried out on the spectrogram, converting it to a log-power spectrum in decibels (dB). Following this, mean and variance normalization was applied to standardize the spectrogram. The processed spectrograms were then input into the ResNet18 model [37] to extract features for the A modality. Given the large volume of samples in the AffWild2 dataset, we opted to train the ResNet18 model from scratch. Modifications were made to the first convolutional layer of the ResNet18 model to accommodate the single-channel input of the spectrogram. Training was performed with an initial learning rate of 0.001 , while the Adam optimizer was used for weight optimization. A batch size of 64 was chosen, with early stopping employed to select the best model for accurate predictions.
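A sketch of this spectrogram pipeline using librosa is given below (44.1 kHz audio, 1024-point DFT, 20 ms window, 10 ms hop, log-power in dB, then mean/variance normalization). The file path and the exact cropping/binning that yields the 64 × 107 input per subsequence are not specified here and are treated as assumptions.

```python
import numpy as np
import librosa

def make_spectrogram(wav_path: str) -> np.ndarray:
    """Log-power spectrogram following the settings described above (illustrative)."""
    y, sr = librosa.load(wav_path, sr=44100)                  # resample to 44.1 kHz
    stft = librosa.stft(y, n_fft=1024,
                        win_length=int(0.020 * sr),           # 20 ms window
                        hop_length=int(0.010 * sr))           # 10 ms hop
    log_power = librosa.power_to_db(np.abs(stft) ** 2)        # log-power in dB
    # Mean/variance normalization before feeding the 2D CNN.
    return (log_power - log_power.mean()) / (log_power.std() + 1e-8)
```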
Data Imbalance:
Similar to [70], we utilized external datasets to mitigate the data imbalance in the AffWild2 dataset. Specifically, most frames in AffWild2 have valence values concentrated in the range [0, 0.4], leading to an imbalanced valence–arousal distribution. To improve the diversity and balance of the training data for valence–arousal estimation, we incorporate the AFEW-VA dataset, which consists of 30,051 frames annotated with valence–arousal scores in the range [−10, 10], rescaled to [−1, 1] for consistency. We down-sample the AffWild2 dataset by a factor of 5 and merge it with AFEW-VA. To further mitigate imbalance, we discretize the valence–arousal scores into 20 equal-width bins, treating each bin as a separate category. We then apply oversampling and undersampling strategies, similar to those used in expression recognition, to enhance the balance of the dataset and improve the generalization of the model.
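A minimal sketch of the binning-and-resampling idea is shown below, using placeholder valence annotations and PyTorch's WeightedRandomSampler; the bin count follows the 20 equal-width bins mentioned above, while everything else is illustrative.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Discretize valence scores (in [-1, 1]) into 20 equal-width bins and sample
# inversely to bin frequency; the scores below are placeholders.
valence = np.random.uniform(-1, 1, size=10000)            # placeholder annotations
bins = np.linspace(-1, 1, 21)                             # 20 equal-width bins
bin_ids = np.clip(np.digitize(valence, bins) - 1, 0, 19)
bin_counts = np.bincount(bin_ids, minlength=20)
weights = 1.0 / bin_counts[bin_ids]                       # rarer bins drawn more often
sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                num_samples=len(weights), replacement=True)
# DataLoader(dataset, batch_size=64, sampler=sampler) would then yield rebalanced batches.
```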

4.3. Ablation Study

To evaluate the efficacy of the individual components of our framework and elucidate their respective roles in improving its overall performance, we conducted a comprehensive series of ablation studies. These experiments were designed to validate three crucial aspects of the proposed architecture. The findings, presented in Table 4 and Table 5, demonstrate that the CSSA module, the HASPCM module, and the single-modal visual task each contribute significantly to improving the accuracy of audio–visual emotion recognition (AVER). These results substantiate the effectiveness of each module in optimizing the model’s performance and underscore their importance within the overall framework.
In our ablation study, we first removed the CSSA module from the complete framework, followed by the elimination of its constituent SEMantic ATtention (SEMAT) and SPAtial ATtention (SPAAT) blocks. We use the notation “w/o CSSA” to indicate a model lacking the entire CSSA module, while “w/o SEMAT” and “w/o SPAAT” represent models without the SEMAT and SPAAT blocks, respectively. As illustrated in Table 4, the complete framework incorporating the CSSA module achieves valence and arousal values of 0.457 and 0.375, compared with 0.421 and 0.343 for the model without CSSA, corresponding to improvements of 0.036 and 0.032, respectively. Notably, the model devoid of the CSSA module shows a marked decrease in accuracy relative to the optimal performance of the complete framework. The performance of the model without SEMAT decreases to 0.451 and 0.367. These observations underscore the critical role of the CSSA module in bolstering the overall effectiveness of our emotion recognition system.
To evaluate the significance of the Hybrid Attention of Single and Parallel Cross-Modal (HASPCM) module, we conducted a series of ablation experiments. Initially, we removed the entire HASPCM module from our framework. We then proceeded to eliminate its constituent parts: the single-modal attention (SMA) and parallel cross-modal attention (PCMA) blocks. In our notation, “w/o HASPCM” denotes the model lacking the complete HASPCM module, while “w/o SMA” and “w/o PCMA” indicate the absence of the SMA and PCMA blocks, respectively. Table 5 presents the results of these experiments. The complete framework, including the HASPCM module, achieves accuracy rates of 0.457 for valence and 0.375 for arousal. These results show improvements of 2.5% and 2.7%, respectively, compared to models without the HASPCM module. Further analysis reveals that removal of the SMA block leads to performance decreases of 0.84% and 0.86% relative to the optimal results. Similarly, the absence of the PCMA block results in performance drops of 1.6% and 2.3%. These findings highlight the crucial roles of both the SMA and PCMA blocks within the framework. The observed performance decreases provide compelling evidence for the integral nature of the HASPCM module and its components in enhancing the effectiveness of the Audio–Visual Emotion Learning (AVEL) framework.
The ablation analysis on the IEMOCAP dataset is given in Table 6. The HMATN model achieves the highest performance on the IEMOCAP dataset, with an accuracy of 75.39% and an F1 score of 78.56%. Among the unimodal approaches, the visual modality significantly outperforms the others, with 67.52% accuracy and a 68.95% F1 score, highlighting its crucial role in emotion recognition. In contrast, the audio modality achieves 61.77% accuracy and a 62.34% F1 score, which is lower than the visual modality. The plain audio–visual fusion obtains 72.75% accuracy and a 72.52% F1 score, which further emphasizes the benefit of combining modalities while the visual features remain dominant. The best performance is achieved when both modalities are combined within HMATN, demonstrating that emotions are conveyed by both vocal and visual expressions. These findings validate the necessity of multimodal integration for AVER, as leveraging multiple modalities enhances overall recognition accuracy.

4.4. Performance and Comparison

4.4.1. Experiments on the AFEW-VA Dataset

The experimental findings presented in Table 7 compare our proposed method with the leading audio–visual fusion techniques, demonstrating its performance across various metrics on the AFEW-VA dataset. Moreover, Figure 4 presents some sample images from the AFEW-VA dataset.
EmoFAN [71] enhances emotion recognition by leveraging features from a face alignment network (FAN). It improves robustness through joint prediction of categorical and continuous emotions, employs an attention mechanism for facial region focus, integrates knowledge distillation for label smoothing, and optimizes affect recognition using a tailored loss function, achieving a valence score of 0.69 and an arousal score of 0.66. Kossaifi et al. [72] introduced a hybrid system that combines deep learning with classical geometric and texture features for affect recognition. Their approach incorporates features extracted from the Scale-Invariant Feature Transform (SIFT), Local Binary Pattern (LBP), and facial landmarks, which are used alongside classifiers such as Bag of Words (BoW) and Conditional Random Field (CRF). Transfer learning is applied to multiple convolutional network-based models to assess their effectiveness, resulting in a valence score of 0.55 and an arousal score of 0.53. Kim [73] proposed a contrastive adversarial learning method that utilizes weak emotion learning based on strong emotion samples, achieving a valence score of 0.59 and an arousal score of 0.54. Tellamekala et al. [75] introduced a constrained representation learning approach that enhances temporal consistency by applying a first-order temporal coherency regularization constraint to the supervised loss, using a ResNet50 CNN pre-trained on VGG-Face as the encoder. Their method obtained a valence score of 0.475 and an arousal score of 0.306. Aspandi et al. [77] developed a model that efficiently captures spatial and temporal features through enhanced temporal modeling of latent representations. It integrates three key networks—Generator, Discriminator, and Combiner—trained in an adversarial setting to estimate valence (V) and arousal (A) affect domains. Latent features are leveraged for temporal modeling using LSTM RNNs, progressively trained with curriculum learning and adaptive attention, leading to a valence score of 0.377 and an arousal score of 0.467. Additionally, compared to the factorized framework of Kossaifi et al. [72], our approach improves performance by approximately 17.6% in valence and 16.4% in arousal, demonstrating its ability to more effectively capture affective states. Moreover, Figure 5 illustrates the confusion matrices of the proposed HMATN on the AFEW-VA dataset. Overall, these results underscore the effectiveness of our method in advancing emotion recognition systems. The demonstrated improvements suggest that our model leverages superior feature representations and learning strategies, leading to more accurate and reliable predictions.

4.4.2. Experiments on Affwild2 Dataset

Figure 6 presents some sample images from the Affwild2 dataset, and Table 8 compares the methods in terms of valence, arousal, and average scores. Kollias et al. [79] report the lowest average score of 0.175, with valence and arousal scores of 0.180 and 0.170, respectively. Zhang et al. [80] proposed a transformer-based multimodal framework for action unit detection and expression recognition, integrating static vision and dynamic multimodal features. A fusion module with cross-attention enhances key feature focus for improved detection, with the static vision feature extractor initialized by a pre-trained expression embedding model distilled from a DLN, achieving a valence score of 0.300 and an arousal score of 0.244. Praveen et al. [63] introduced a joint cross-attention fusion model that leverages intermodal relationships for accurate valence and arousal prediction. By computing cross-attention weights from joint and individual feature correlations, the approach outperforms the standard cross-attention module, achieving a valence score of 0.374 and an arousal score of 0.363. Karas et al. [81] report an average score of 0.412, with valence and arousal scores of 0.418 and 0.406, respectively. Savchenko et al. [82] addressed real-time video-based facial emotion analytics, including expression recognition, valence–arousal prediction, and action unit detection; a frame-level approach utilizing an EfficientNet model pre-trained on AffectNet enables deployment on mobile devices, resulting in a valence score of 0.417 and an arousal score of 0.453. Nguyen et al. [83] focused on valence–arousal estimation and action unit detection using multimodal and temporal learning, where RegNet-extracted features are processed by a GRU and a Transformer, followed by a GRU with local attention, for prediction, leading to a valence score of 0.450 and an arousal score of 0.448. Kuhnke et al. [41] proposed a two-stream aural–visual model for affective behavior recognition from videos, where audio and image streams are processed separately using CNNs, with temporal convolutions for sequence analysis. Face alignment features and emotion representation correlations enhance performance during training, achieving a valence score of 0.448 and an arousal score of 0.417. Finally, our method delivers an average score of 0.416, with valence and arousal scores of 0.457 and 0.375, respectively, positioning it as competitive among the compared methods.
We conducted six-fold cross-validation to determine the best average fusion model for the Affwild2 dataset. As shown in Table 9, the proposed model achieves state-of-the-art performance with multimodal input under six-fold cross-validation. We opted for six-fold cross-validation instead of the Leave-One-Subject-Out (LOSO) protocol because of the high computational cost of processing video data. Our empirical results indicate that the proposed fusion model achieves Concordance Correlation Coefficients (CCCs) of 0.596 for valence and 0.683 for arousal on the best-performing validation fold. By integrating multiple modalities, our proposed model surpasses unimodal performance and establishes new benchmarks on the Affwild2 dataset.
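For reference, the Concordance Correlation Coefficient used to score valence and arousal can be computed as follows; this is a minimal NumPy sketch assuming frame-level predictions and annotations as one-dimensional arrays.

```python
# Minimal sketch of the Concordance Correlation Coefficient (CCC) used to evaluate
# valence/arousal predictions; inputs are 1-D arrays of frame-level values.
import numpy as np

def concordance_cc(pred: np.ndarray, gold: np.ndarray) -> float:
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covariance = np.mean((pred - pred_mean) * (gold - gold_mean))
    return 2.0 * covariance / (pred.var() + gold.var() + (pred_mean - gold_mean) ** 2)

# Example: perfectly concordant predictions give CCC = 1.0
x = np.linspace(-1.0, 1.0, 100)
print(round(concordance_cc(x, x), 3))  # 1.0
```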

4.4.3. Experiments on IEMOCAP Dataset

In this section, we evaluate the effectiveness of the proposed Hybrid Multi-Attention Network (HMATN) on the IEMOCAP dataset. Figure 7 presents some sample images from the IEMOCAP dataset. To ensure a comprehensive performance assessment, we compare HMATN with seven state-of-the-art deep learning-based approaches: TextCNN [85], BC-LSTM-Att [86], DialogueRNN [40], ConGCN [87], DialogueTRM [88], DialogueGCN [39], and GraphMFT [89]. These models serve as baseline architectures, enabling a comparative analysis of our fusion method. In this setting, TextCNN employs Convolutional Neural Networks (CNNs) for text feature extraction, while a 3D-CNN and openSMILE are used to extract video and audio features, respectively; the extracted multimodal representations are then processed by a 1D-CNN, which performs emotion classification. Against these baselines, HMATN is designed to better capture complex dependencies across multiple modalities, thereby improving overall emotion recognition performance.
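As an illustration of the baseline fusion pipeline described above, the sketch below concatenates pre-extracted text, video, and audio features and classifies them with a small 1D-CNN; the feature dimensions, layer sizes, and class count are assumptions for illustration only.

```python
# Illustrative sketch of the baseline fusion pipeline: pre-extracted text (TextCNN),
# video (3D-CNN) and audio (openSMILE) features are concatenated per utterance and
# classified with a 1D-CNN. Dimensions and layer sizes are assumptions.
import torch
import torch.nn as nn

class FusionClassifier1DCNN(nn.Module):
    def __init__(self, text_dim=100, video_dim=512, audio_dim=100, num_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),   # pool over the fused feature dimension
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, text_feat, video_feat, audio_feat):
        fused = torch.cat([text_feat, video_feat, audio_feat], dim=-1)  # (B, fused_dim)
        fused = fused.unsqueeze(1)              # (B, 1, fused_dim) for Conv1d
        pooled = self.conv(fused).squeeze(-1)   # (B, 64)
        return self.classifier(pooled)          # (B, num_classes) emotion logits

logits = FusionClassifier1DCNN()(torch.randn(4, 100), torch.randn(4, 512), torch.randn(4, 100))
print(logits.shape)  # torch.Size([4, 6])
```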
Several models have been proposed for emotion recognition in conversations, each employing different techniques to capture contextual and speaker-related information. BC-LSTM utilizes a Bidirectional LSTM to encode contextual semantic information but does not incorporate speaker-related details. DialogueRNN employs three GRUs to capture distinct aspects of speaker-related information, focusing on contextual information, speaker identity, and emotional content. DialogueGCN is the first method to apply Graph Convolutional Networks (GCNs) for emotion classification, modeling contextual relationships between utterances but relying solely on textual modality, limiting its ability to leverage complementary multimodal information. In contrast, GraphMFT employs an advanced graph-based multimodal fusion technique, representing multimodal data as a graph where nodes correspond to data objects, and edges capture both intramodal contextual relationships and intermodal dependencies. By utilizing multiple enhanced graph attention networks, GraphMFT effectively integrates complementary information across modalities, improving emotion recognition in conversations.
To assess the effectiveness of the proposed HMATN, we use accuracy and the weighted-average F1 score as evaluation metrics; both are well suited to imbalanced datasets and ensure a fair comparison across emotion classes. The experimental results for all models are summarized in Table 10 and Table 11 and show that HMATN outperforms all baseline models. On the IEMOCAP dataset, our proposed HMATN method achieves an accuracy of 75.39% and a weighted F1 score of 78.56%. As shown in Table 11, HMATN surpasses GraphMFT by 8.51% in weighted F1 score and exceeds another competitive baseline, DialogueTRM, by 6.67% in accuracy. A per-class analysis of the F1 scores in Table 10 reveals that HMATN achieves significant improvements for most emotions, with the exceptions of “neutral” and “excited.” Notably, HMATN performs particularly well in recognizing “happy” (67.87%), “sad” (87.14%), and “frustrated” (70.33%), outperforming the existing models in these categories. Overall, the results indicate that HMATN achieves superior performance in multimodal emotion recognition, effectively enhancing the fusion of textual, visual, and auditory modalities. These findings highlight the model’s robustness in handling complex emotional expressions and its potential for advancing audio–visual emotion recognition (AVER).
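The evaluation metrics can be reproduced directly with scikit-learn; the snippet below is a minimal example on toy labels, not the actual experimental data.

```python
# Minimal sketch of the evaluation metrics: overall accuracy and the weighted-average
# F1 score, which weights each emotion class by its support to handle class imbalance.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["happy", "sad", "sad", "angry", "neutral", "sad"]       # toy labels
y_pred = ["happy", "sad", "neutral", "angry", "neutral", "sad"]   # toy predictions

print(f"Accuracy:    {accuracy_score(y_true, y_pred):.4f}")
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.4f}")
```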
The analysis of the confusion matrix in Figure 8 indicates that the proposed model achieves high accuracy in recognizing the emotions “sad” and “anger”. Specifically, 84.96% of the “sad” samples are correctly identified, demonstrating the model’s strong sensitivity to this emotion, and 74.14% of the “anger” samples are accurately classified, further highlighting the model’s robustness in detecting strong negative emotions. Despite this strong performance, certain misclassification trends are evident. Notably, 25.16% of the samples labeled “anger” are misclassified as “frustrated”, which can be attributed to the similarity in expression between these two emotions. In addition, “neutral” expressions are frequently misclassified as “frustrated” (21.1%) and occasionally as “happy” (5.88%), suggesting that the model struggles to differentiate emotions with lower intensity or subtle variations in expression. The confusion matrix also reveals a relatively low error rate between “excited” and “sad”, although some misclassification persists; this residual confusion underscores the difficulty of distinguishing high-intensity emotions, particularly when exaggerated facial expressions or strong vocal cues are present. In summary, while the model performs strongly on distinct emotions such as “sad” and “anger”, it faces challenges in differentiating closely related emotional states. Future research should therefore focus on improving the model’s ability to discriminate between emotions with overlapping features to raise overall classification accuracy.
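The per-class percentages discussed above correspond to a confusion matrix normalized over the true labels (each row sums to one); a minimal sketch on toy data is shown below.

```python
# Sketch of how the per-class percentages are obtained: a confusion matrix normalized
# over the true labels (rows), so cm[i, j] is the fraction of class i predicted as j.
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]
# y_true / y_pred stand in for the model's test-set labels and predictions (toy data)
y_true = ["sad", "sad", "angry", "angry", "neutral", "frustrated", "happy", "excited"]
y_pred = ["sad", "sad", "frustrated", "angry", "frustrated", "frustrated", "happy", "happy"]

cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(np.round(cm, 2))
```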

4.4.4. Qualitative Evaluation

To enhance understanding of the proposed model, we have provided visualizations of the attention scores obtained from the individual audio and visual components, as seen in Figure 9. The model primarily leverages temporal attention through each modality. This allows us to examine clip-level attention scores in order to provide a clear understanding of the video’s key clips, showing how the fusion attention model focuses on the temporal progression of both the A and V modalities. For example, as depicted in Figure 9, the HMATN model emphasizes the V modality when the person smiles, since there are significant changes in the facial muscles around the nose and mouth over this time period. Similarly, the model gives higher attention to the A modality when there is a notable change in the person’s vocal expressions.
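A lightweight way to produce this kind of visualization, assuming the clip-level attention scores of the audio and visual branches have already been extracted from the model, is sketched below with placeholder values.

```python
# Sketch of the qualitative visualization: plotting clip-level attention scores of the
# audio (A) and visual (V) branches over time. The score arrays are assumed to have
# been extracted from the fusion model's attention weights; values here are placeholders.
import numpy as np
import matplotlib.pyplot as plt

clips = np.arange(20)             # 20 consecutive clips of one video
audio_attn = np.random.rand(20)   # placeholder A-branch attention scores
visual_attn = np.random.rand(20)  # placeholder V-branch attention scores

plt.plot(clips, audio_attn, marker="o", label="Audio attention")
plt.plot(clips, visual_attn, marker="s", label="Visual attention")
plt.xlabel("Clip index")
plt.ylabel("Attention score")
plt.legend()
plt.tight_layout()
plt.savefig("attention_scores.png")  # inspect which clips each modality emphasizes
```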

4.4.5. Limitations

Despite the strong performance of the proposed HMATN, the model occasionally encounters challenges in distinguishing similar or closely related emotions. Specifically, it may misclassify “frustrated” as “angry” or “happy” as “excited” due to the overlapping characteristics of these emotional states. Such misclassifications arise from subtle variations in facial expressions, vocal tone, and linguistic context, which can lead to ambiguity in emotion recognition. Figure 8 presents an analysis of the limitations of the proposed HMATN method using confusion matrices. One of the primary challenges in AVER is the similar emotion problem, where models struggle to differentiate between emotions with overlapping characteristics. Similar to existing approaches, the proposed HMATN also encounters difficulties in distinguishing closely related emotions. As illustrated in Figure 8, the HMATN model misclassifies the ground-truth emotion “angry” as “frustrated” with a probability of 25.16%. Additionally, the ground-truth emotion “excited” is incorrectly detected as “happy” in 14.25% of cases, further highlighting the challenge of distinguishing high-arousal emotional states. Another key limitation in AVER is the non-neutral emotion problem, where models exhibit a tendency to classify various emotions as “neutral”. This issue arises due to the reliance of most AVER models on text-based features, which can lead to incorrect classifications of non-neutral utterances. For instance, short conversational responses such as “Okay” and “Yeah” may be directly classified as “neutral” despite their emotional context. As depicted in Figure 8, the ground-truth emotion “happy” is misclassified as “neutral” with a probability of 12.96%, demonstrating the model’s inclination to overgeneralize neutral states.
Additionally, the HASPCM module, while beneficial in capturing correlations between emotional labels, slightly restricts the model’s ability to handle exceptional cases that deviate from these learned correlations. This limitation can affect the model’s adaptability in scenarios where emotions do not conform to typical patterns. Furthermore, significant interference in certain modalities, such as background noise in audio signals, may negatively impact recognition accuracy. External noise and other disruptive factors can obscure emotional cues, leading to suboptimal performance in real-world applications.

5. Conclusions

In this paper, we presented a hybrid multi-attention network for audio–visual emotion recognition. Specifically, the proposed HMATN demonstrates significant advancements in multimodal representation learning by effectively integrating audio–visual data through a collaborative cross-attentional paradigm and the HASPCM mechanism. Testing the model on complex videos from the AffWild2 and AFEW-VA datasets confirms its ability to enhance feature richness, capture salient cross-modal interactions, and maintain robustness even in the presence of noisy or missing modalities. Furthermore, the superior performance of HMATN compared to state-of-the-art techniques highlights its efficiency and effectiveness as a cost-efficient solution for real-world affective computing and human emotion recognition applications. Looking ahead, future work will involve applying the architecture to other tasks, such as localization and scene classification. Moreover, an extension of this work for enhanced emotion recognition will incorporate different modalities, including eye tracking and skin temperature. This could further expand the applicability of our method to a broader range of real-world scenarios.
Our approach has broader applications beyond emotion recognition. In human–computer interaction (HCI), it enhances affective computing by enabling AI to interpret and respond to emotions in virtual assistants, social robots, and adaptive learning systems. Mental health monitoring can benefit from this technology by detecting emotional states to assess stress, anxiety, and depression, supporting personalized interventions in telemedicine. Intelligent video surveillance can leverage emotion recognition to identify abnormal expressions, aiding in crime prevention and public safety. Similarly, automotive driver monitoring can be used to assess emotions like fatigue and frustration, improving advanced driver-assistance systems (ADASs) for safer driving. Furthermore, entertainment and content recommendation systems can personalize user experiences in interactive gaming, virtual reality (VR), and streaming platforms based on emotional responses. These applications highlight the broader impact and relevance of our research.

Author Contributions

Conceptualization, S.M.; funding acquisition, Y.-K.M.; coordination, Y.-K.M.; investigation, S.M. and Y.-K.M.; methodology, S.M.; project administration, Y.-K.M.; supervision, Y.-K.M.; visualization, S.M.; writing—original draft, S.M.; writing—review and editing, S.M. and Y.-K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Technology Innovation Program (RS-2024-00487049, Development and demonstration of complex emotion recognition on-device AI technology for in-vehicle driver emotional services) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea).

Data Availability Statement

No new data were created or analyzed in this study. The Affwild2 dataset is available at https://link.springer.com/article/10.1007/s11263-019-01158-4, accessed on 27 December 2024. The AFEW-VA dataset is available at https://www.sciencedirect.com/science/article/pii/S0262885617300379, accessed on 27 December 2024. The IEMOCAP dataset is available at https://sail.usc.edu/iemocap/iemocap_release.htm, accessed on 25 January 2025. The code will be available on GitHub at https://github.com/MSathishkumar1990/Hybrid-Multi-Attention-Network-for-Audio-Visual-Emotion-Recognition-using-Feature-Fusion, accessed on 25 March 2025, commit b2c9e039db76540aff5e9b0b0097af18ff37b20c.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ER	Emotion Recognition
EEG	Electroencephalogram
DL	Deep Learning
CNN	Convolutional Neural Network
RNN	Recurrent Neural Network
LSTM	Long Short-Term Memory
AVER	Audio–Visual Emotion Recognition
3D-CNN	3D Convolutional Neural Network
2D-CNN	2D Convolutional Neural Network
1D-CNN	1D Convolutional Neural Network
DBN	Deep Belief Network
SVM	Support Vector Machine
CapsGCN	Capsule Graph Convolutional Network
GCN	Graph Convolutional Network
Bi-LSTM	Bidirectional Long Short-Term Memory
FC	Fully Connected Layer
LLD	Low-Level Descriptor
DCNN	Deep Convolutional Neural Network
Bi-RNN	Bidirectional Recurrent Neural Network
CTNet	Conversational Transformer Network
Bi-GRU	Bi-Directional Gated Recurrent Unit
ATS-Fusion	Audio–Text–Speaker Fusion
BERT	Bidirectional Encoder Representations from Transformers
cLSTM	Contextual Long Short-Term Memory
MMA	Multimodal Attention
HASPCM	Hybrid Attention of Single and Parallel Cross-Modal
SMA	Single-Modal Attention
PCMA	Parallel Cross-Modal Attention
CMRA	Cross-Modality Relation Attention
SVR	Support Vector Regressor
FER	Facial Emotion Recognition
PCCE	Polarity-Consistent Cross Entropy
CMFN	Cross-Modality Attention Fusion Network
FEN	Feature Extraction Network
ConvLSTM	Convolutional LSTM Network
MFCC	Mel-Frequency Cepstral Coefficients
SEMAT	SEMantic ATtention
SPAAT	SPAtial ATtention
CSSA	Contextual Semantic and Spatial Attention
SGD	Stochastic Gradient Descent
FAN	Face Alignment Network
SIFT	Scale-Invariant Feature Transform
LBP	Local Binary Pattern
BoW	Bag of Words
CRF	Conditional Random Field
CCC	Concordance Correlation Coefficient
IMAN	Interactive Multimodal Attention Network

References

  1. Moorthy, S.; KS, S.S.; Arthanari, S.; Jeong, J.H.; Joo, Y.H. Hybrid multi-attention transformer for robust video object detection. Eng. Appl. Artif. Intell. 2025, 139, 109606. [Google Scholar]
  2. Moorthy, S.; Joo, Y.H. Learning dynamic spatial-temporal regularized correlation filter tracking with response deviation suppression via multi-feature fusion. Neural Netw. 2023, 167, 360–379. [Google Scholar]
  3. Ali, Y.; Khan, H.U.; Khan, F.; Moon, Y.K. Building integrated assessment model for IoT technology deployment in the Industry 4.0. J. Cloud Comput. 2024, 13, 155. [Google Scholar]
  4. Moorthy, S.; Joo, Y.H. Formation control and tracking of mobile robots using distributed estimators and a biologically inspired approach. J. Electr. Eng. Technol. 2023, 18, 2231–2244. [Google Scholar]
  5. Chhimpa, G.R.; Kumar, A.; Garhwal, S.; Khan, F.; Moon, Y.K. Revolutionizing Gaze-based Human-Computer Interaction using Iris Tracking: A Webcam-Based Low-Cost Approach with Calibration, Regression and Real-Time Re-calibration. IEEE Access 2024, 12, 168256–168269. [Google Scholar]
  6. Iqbal, H.; Khan, A.; Nepal, N.; Khan, F.; Moon, Y.K. Deep Learning Approaches for Chest Radiograph Interpretation: A Systematic Review. Electronics 2024, 13, 4688. [Google Scholar] [CrossRef]
  7. Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar]
  8. Matsumoto, D. More evidence for the universality of a contempt expression. Motiv. Emot. 1992, 16, 363–368. [Google Scholar]
  9. Schlosberg, H. Three dimensions of emotion. Psychol. Rev. 1954, 61, 81. [Google Scholar]
  10. Potamianos, G.; Neti, C.; Gravier, G.; Garg, A.; Senior, A.W. Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 2003, 91, 1306–1326. [Google Scholar]
  11. D’mello, S.K.; Kory, J. A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. (CSUR) 2015, 47, 1–36. [Google Scholar]
  12. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [PubMed]
  13. Tzirakis, P.; Trigeorgis, G.; Nicolaou, M.A.; Schuller, B.W.; Zafeiriou, S. End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process. 2017, 11, 1301–1309. [Google Scholar]
  14. Kaya, H.; Gürpınar, F.; Salah, A.A. Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image Vis. Comput. 2017, 65, 66–75. [Google Scholar]
  15. Glodek, M.; Tschechne, S.; Layher, G.; Schels, M.; Brosch, T.; Scherer, S.; Kächele, M.; Schmidt, M.; Neumann, H.; Palm, G.; et al. Multiple classifier systems for the classification of audio-visual emotional states. In Proceedings of the Affective Computing and Intelligent Interaction: Fourth International Conference, ACII 2011, Memphis, TN, USA, 9–12 October 2011; Part II. pp. 359–368. [Google Scholar]
  16. Wu, Z.; Cai, L.; Meng, H. Multi-level fusion of audio and visual features for speaker identification. In Proceedings of the Advances in Biometrics: International Conference, ICB 2006, Hong Kong, China, 5–7 January 2006; pp. 493–499. [Google Scholar]
  17. Arthanari, S.; Moorthy, S.; Jeong, J.H.; Joo, Y.H. Adaptive spatially regularized target attribute-aware background suppressed deep correlation filter for object tracking. Signal Process. Image Commun. 2025, 136, 117305. [Google Scholar]
  18. Kuppusami Sakthivel, S.S.; Moorthy, S.; Arthanari, S.; Jeong, J.H.; Joo, Y.H. Learning a context-aware environmental residual correlation filter via deep convolution features for visual object tracking. Mathematics 2024, 12, 2279. [Google Scholar] [CrossRef]
  19. Schoneveld, L.; Othmani, A.; Abdelkawy, H. Leveraging recent advances in deep learning for audio-visual emotion recognition. Pattern Recognit. Lett. 2021, 146, 1–7. [Google Scholar]
  20. Wang, L.; Wang, S.; Qi, J.; Suzuki, K. A multi-task mean teacher for semi-supervised facial affective behavior analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3603–3608. [Google Scholar]
  21. Tzirakis, P.; Chen, J.; Zafeiriou, S.; Schuller, B. End-to-end multimodal affect recognition in real-world environments. Inf. Fusion 2021, 68, 46–53. [Google Scholar]
  22. Zhang, S.; Zhang, S.; Huang, T.; Gao, W.; Tian, Q. Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 3030–3043. [Google Scholar]
  23. Liu, J.; Chen, S.; Wang, L.; Liu, Z.; Fu, Y.; Guo, L.; Dang, J. Multimodal emotion recognition with capsule graph convolutional based representation fusion. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6339–6343. [Google Scholar]
  24. Huang, J.; Tao, J.; Liu, B.; Lian, Z.; Niu, M. Multimodal transformer fusion for continuous emotion recognition. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3507–3511. [Google Scholar]
  25. Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl.-Based Syst. 2022, 244, 108580. [Google Scholar]
  26. Sharafi, M.; Yazdchi, M.; Rasti, R.; Nasimi, F. A novel spatio-temporal convolutional neural framework for multimodal emotion recognition. Biomed. Signal Process. Control 2022, 78, 103970. [Google Scholar]
  27. Hazarika, D.; Gorantla, S.; Poria, S.; Zimmermann, R. Self-attentive feature-level fusion for multimodal emotion detection. In Proceedings of the 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), Miami, FL, USA, 10–12 April 2018; pp. 196–201. [Google Scholar]
  28. Priyasad, D.; Fernando, T.; Denman, S.; Sridharan, S.; Fookes, C. Attention driven fusion for multi-modal emotion recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3227–3231. [Google Scholar]
  29. Lian, Z.; Liu, B.; Tao, J. CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 985–1000. [Google Scholar]
  30. Zhang, T.; Li, S.; Chen, B.; Yuan, H.; Chen, C.P. Aia-net: Adaptive interactive attention network for text–audio emotion recognition. IEEE Trans. Cybern. 2022, 53, 7659–7671. [Google Scholar]
  31. Fu, Y.; Okada, S.; Wang, L.; Guo, L.; Song, Y.; Liu, J.; Dang, J. Context-and knowledge-aware graph convolutional network for multimodal emotion recognition. IEEE Multimed. 2022, 29, 91–100. [Google Scholar]
  32. Poria, S.; Chaturvedi, I.; Cambria, E.; Hussain, A. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 439–448. [Google Scholar]
  33. Pan, Z.; Luo, Z.; Yang, J.; Li, H. Multi-modal attention for speech emotion recognition. arXiv 2020, arXiv:2009.04107. [Google Scholar]
  34. Ren, M.; Huang, X.; Shi, X.; Nie, W. Interactive multimodal attention network for emotion recognition in conversation. IEEE Signal Process. Lett. 2021, 28, 1046–1050. [Google Scholar]
  35. Zheng, J.; Zhang, S.; Wang, Z.; Wang, X.; Zeng, Z. Multi-channel weight-sharing autoencoder based on cascade multi-head attention for multimodal emotion recognition. IEEE Trans. Multimed. 2022, 25, 2213–2225. [Google Scholar]
  36. Zhao, J.; Li, R.; Jin, Q.; Wang, X.; Li, H. Memobert: Pre-training model with prompt-based learning for multimodal emotion recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4703–4707. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  38. Ortega, J.D.; Cardinal, P.; Koerich, A.L. Emotion recognition using fusion of audio and video features. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3847–3852. [Google Scholar]
  39. Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A. Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. arXiv 2019, arXiv:1908.11540. [Google Scholar]
  40. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. Dialoguernn: An attentive rnn for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6818–6825. [Google Scholar]
  41. Kuhnke, F.; Rumberg, L.; Ostermann, J. Two-stream aural-visual affect analysis in the wild. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 600–605. [Google Scholar]
  42. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  43. Atmaja, B.T.; Akagi, M. Multitask learning and multistage fusion for dimensional audiovisual emotion recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4482–4486. [Google Scholar]
  44. Ni, R.; Yang, B.; Zhou, X.; Song, S.; Liu, X. Diverse local facial behaviors learning from enhanced expression flow for microexpression recognition. Knowl.-Based Syst. 2023, 275, 110729. [Google Scholar] [CrossRef]
  45. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
  46. Zhao, S.; Ma, Y.; Gu, Y.; Yang, J.; Xing, T.; Xu, P.; Hu, R.; Chai, H.; Keutzer, K. An end-to-end visual-audio attention network for emotion recognition in user-generated videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 303–311. [Google Scholar]
  47. Kumar, A.; Vepa, J. Gated mechanism for attention based multi modal sentiment analysis. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 4477–4481. [Google Scholar]
  48. Ni, R.; Yang, B.; Zhou, X.; Cangelosi, A.; Liu, X. Facial expression recognition through cross-modality attention fusion. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 175–185. [Google Scholar]
  49. Ghaleb, E.; Niehues, J.; Asteriadis, S. Multimodal attention-mechanism for temporal emotion recognition. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 251–255. [Google Scholar]
  50. Lee, J.; Kim, S.; Kim, S.; Sohn, K. Audio-visual attention networks for emotion recognition. In Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, Seoul, Republic of Korea, 26 October 2018; pp. 27–32. [Google Scholar]
  51. Zhang, S.; Ding, Y.; Wei, Z.; Guan, C. Continuous emotion recognition with audio-visual leader-follower attentive fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3567–3574. [Google Scholar]
  52. Kim, D.H.; Baddar, W.J.; Jang, J.; Ro, Y.M. Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition. IEEE Trans. Affect. Comput. 2017, 10, 223–236. [Google Scholar]
  53. Wöllmer, M.; Kaiser, M.; Eyben, F.; Schuller, B.; Rigoll, G. LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis. Comput. 2013, 31, 153–163. [Google Scholar] [CrossRef]
  54. Nicolaou, M.A.; Gunes, H.; Pantic, M. Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans. Affect. Comput. 2011, 2, 92–105. [Google Scholar] [CrossRef]
  55. Rajasekhar, G.P.; Granger, E.; Cardinal, P. Deep domain adaptation with ordinal regression for pain assessment using weakly-labeled videos. Image Vis. Comput. 2021, 110, 104167. [Google Scholar]
  56. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  57. Huang, L.; Wang, L.; Li, H. Foreground-action consistency network for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 8002–8011. [Google Scholar]
  58. Long, C.; Basharat, A.; Hoogs, A.; Singh, P.; Farid, H. A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Forged Videos. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 1–10. [Google Scholar]
  59. Sethu, V.; Epps, J.; Ambikairajah, E. Speech based emotion recognition. In Speech and Audio Processing for Coding, Enhancement and Recognition; Springer: New York, NY, USA, 2014; pp. 197–228. [Google Scholar]
  60. Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 1089–1093. [Google Scholar]
  61. Albanie, S.; Nagrani, A.; Vedaldi, A.; Zisserman, A. Emotion recognition in speech using cross-modal transfer in the wild. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 292–301. [Google Scholar]
  62. Slimi, A.; Hamroun, M.; Zrigui, M.; Nicolas, H. Emotion recognition from speech using spectrograms and shallow neural networks. In Proceedings of the 18th International Conference on Advances in Mobile Computing & Multimedia, Chiang Mai, Thailand, 30 November–2 December 2020; pp. 35–39. [Google Scholar]
  63. Praveen, R.G.; de Melo, W.C.; Ullah, N.; Aslam, H.; Zeeshan, O.; Denorme, T.; Pedersoli, M.; Koerich, A.; Bacon, S.; Cardinal, P.; et al. A joint cross-attention model for audio-visual fusion in dimensional emotion recognition. arXiv 2022, arXiv:2203.14779. [Google Scholar]
  64. Praveen, R.G.; Granger, E.; Cardinal, P. Deep weakly supervised domain adaptation for pain localization in videos. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 18–22 May 2020; pp. 473–480. [Google Scholar]
  65. Lu, J.; Yang, J.; Batra, D.; Parikh, D. Hierarchical question-image co-attention for visual question answering. Adv. Neural Inf. Process. Syst. 2016, 29, 289–297. [Google Scholar]
  66. Kollias, D.; Tzirakis, P.; Nicolaou, M.A.; Papaioannou, A.; Zhao, G.; Schuller, B.; Kotsia, I.; Zafeiriou, S. Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. Int. J. Comput. Vis. 2019, 127, 907–929. [Google Scholar] [CrossRef]
  67. Kossaifi, J.; Tzimiropoulos, G.; Todorovic, S.; Pantic, M. AFEW-VA database for valence and arousal estimation in-the-wild. Image Vis. Comput. 2017, 65, 23–36. [Google Scholar] [CrossRef]
  68. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  69. Kollias, D.; Zafeiriou, S. Analysing affective behavior in the second abaw2 competition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3652–3660. [Google Scholar]
  70. Deng, D.; Chen, Z.; Shi, B.E. Multitask emotion recognition with incomplete labels. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 592–599. [Google Scholar]
  71. Toisoul, A.; Kossaifi, J.; Bulat, A.; Tzimiropoulos, G.; Pantic, M. Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nat. Mach. Intell. 2021, 3, 42–50. [Google Scholar] [CrossRef]
  72. Kossaifi, J.; Toisoul, A.; Bulat, A.; Panagakis, Y.; Hospedales, T.M.; Pantic, M. Factorized higher-order cnns with an application to spatio-temporal emotion estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6060–6069. [Google Scholar]
  73. Kim, D.; Song, B.C. Contrastive adversarial learning for person independent facial emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 5948–5956. [Google Scholar]
  74. Mitenkova, A.; Kossaifi, J.; Panagakis, Y.; Pantic, M. Valence and arousal estimation in-the-wild with tensor methods. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–7. [Google Scholar]
  75. Tellamekala, M.K.; Valstar, M. Temporally coherent visual representations for dimensional affect recognition. In Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, UK, 3–6 September 2019; pp. 1–7. [Google Scholar]
  76. Handrich, S.; Dinges, L.; Al-Hamadi, A.; Werner, P.; Saxen, F.; Al Aghbari, Z. Simultaneous prediction of valence/arousal and emotion categories and its application in an HRC scenario. J. Ambient Intell. Humaniz. Comput. 2021, 12, 57–73. [Google Scholar]
  77. Aspandi, D.; Sukno, F.; Schuller, B.; Binefa, X. An enhanced adversarial network with combined latent features for spatio-temporal facial affect estimation in the wild. arXiv 2021, arXiv:2102.09150. [Google Scholar]
  78. Pei, E.; Hu, Z.; He, L.; Ning, H.; Berenguer, A.D. An ensemble learning-enhanced multitask learning method for continuous affect recognition from facial images. Expert Syst. Appl. 2024, 236, 121290. [Google Scholar]
  79. Kollias, D. Abaw: Valence-arousal estimation, expression recognition, action unit detection & multi-task learning challenges. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 18–24 June 2022; pp. 2328–2336. [Google Scholar]
  80. Zhang, W.; Qiu, F.; Wang, S.; Zeng, H.; Zhang, Z.; An, R.; Ma, B.; Ding, Y. Transformer-based multimodal information fusion for facial expression analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2428–2437. [Google Scholar]
  81. Karas, V.; Tellamekala, M.K.; Mallol-Ragolta, A.; Valstar, M.; Schuller, B.W. Continuous-time audiovisual fusion with recurrence vs. attention for in-the-wild affect recognition. arXiv 2022, arXiv:2203.13285. [Google Scholar]
  82. Savchenko, A.V. Frame-level prediction of facial expressions, valence, arousal and action units for mobile devices. arXiv 2022, arXiv:2203.13436. [Google Scholar]
  83. Nguyen, H.H.; Huynh, V.T.; Kim, S.H. An ensemble approach for facial expression analysis in video. arXiv 2022, arXiv:2203.12891. [Google Scholar]
  84. Zhang, S.; An, R.; Ding, Y.; Guan, C. Continuous emotion recognition using visual-audio-linguistic information: A technical report for abaw3. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 18–24 June 2022; pp. 2376–2381. [Google Scholar]
  85. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar]
  86. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
  87. Zhang, D.; Wu, L.; Sun, C.; Li, S.; Zhu, Q.; Zhou, G. Modeling both Context-and Speaker-Sensitive Dependence for Emotion Detection in Multi-speaker Conversations. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 5415–5421. [Google Scholar]
  88. Mao, Y.; Sun, Q.; Liu, G.; Wang, X.; Gao, W.; Li, X.; Shen, J. Dialoguetrm: Exploring the intra-and inter-modal emotional behaviors in the conversation. arXiv 2020, arXiv:2010.07637. [Google Scholar]
  89. Li, J.; Wang, X.; Lv, G.; Zeng, Z. GraphMFT: A graph network based multimodal fusion technique for emotion recognition in conversation. Neurocomputing 2023, 550, 126427. [Google Scholar]
Figure 1. Architecture of the developed HMATN emotion recognition system.
Figure 2. The Contextual Semantic and Spatial Attention (CSSA) segment comprises two distinct components: the SEMantic ATtention (SEMAT) subsystem and the SPAtial ATtention (SPAAT) subsystem. This segment leverages audio input as a guiding signal to direct the model’s focus toward visually significant regions associated with events. It enhances the extraction of meaningful semantic information from the visual domain, ensuring alignment with the audio modality. By integrating these attention mechanisms, the CSSA module effectively bridges audio-visual information, enabling a more robust understanding of events through cross-modal interaction. The colored cubes represent the fused audio-visual features.
Figure 3. The parallel cross-modal attention (PCMA) subsystem is built on a co-attention mechanism to efficiently capture interactions between the visual and audio modalities. This block ensures consistency in event information by synchronizing and integrating features from both modalities, enabling the model to achieve a cohesive understanding of cross-modal data. The colored boxes represent the audio and visual features.
Figure 4. Examples of training samples from the AFEW-VA dataset.
Figure 5. The confusion matrices of the proposed HMATN on the AFEW-VA dataset.
Figure 6. Examples of training samples from the Affwild2 dataset.
Figure 7. Examples of training samples from the IEMOCAP dataset.
Figure 8. The confusion matrices of the proposed HMATN on the IEMOCAP dataset.
Figure 9. Visualization of the spectrogram used with our proposed A-V fusion models on the video named “63-10-1920-1080” from the Affwild2 dataset.
Table 1. Major contributions of papers related to multimodal emotion recognition (method, contributions and innovations, and limitations).
Audio–Visual:
Zhang et al. [22]: Hybrid deep learning using a CNN and 3D-CNN for feature extraction, a DBN for fusion, and an SVM for classification. Limitations: no end-to-end training, a large number of parameters leads to high computational cost, and the method is developed for discrete emotions only.
Liu et al. [23]: CapsGCN-based emotion recognition with 2D-CNNs, capsule networks, and a GCN for relational learning. Limitation: ignores the complementary information between different modalities.
Huang et al. [24]: Transformer-based emotion recognition with eGeMAPS for speech, geometric facial features, and multi-head attention.
Middya et al. [25]: CNN-based feature extractors with concatenation fusion, FC, and Softmax for classification. Limitation: spectral audio features did not perform well when there was a class distribution mismatch among datasets.
Sharafi et al. [26]: Spatiotemporal CNN with Bi-LSTM for feature extraction, followed by FC and Softmax. Limitation: the model cannot learn image features correctly when a pre-trained model is used.
Audio–Text:
Hazarika et al. [27]: Feature-level fusion using self-attention, LLDs for speech, CNNs for text, and FC with Softmax. Limitation: performance decreases as noise increases in both modalities.
Priyasad et al. [28]: Deep learning approach with SincNet and a DCNN for audio, a Bi-RNN for text, and cross-attention for fusion. Limitation: SincNet constrains only the first layer, focusing on low-frequency components and failing to capture formants.
Lian et al. [29]: Conversational transformer model with Bi-GRU speaker embeddings and ATS-Fusion for audio–text integration. Limitation: global utterance sequence modeling minimizes the complex emotional interactions between multimodal utterances.
Zhang et al. [30]: AIA-Net with interactive attention, RoBERTa for text, and Wav-RoBERTa for speech.
Fu et al. [31]: Context- and knowledge-aware GCNs using CNN-BiLSTM for audio, BERT for text, and graph-based fusion.
Audio–Visual–Text:
Poria et al. [32]: Temporal CNN for visual–text feature extraction, combining image pairs for sequence sensitivity. Limitation: the pooling operations in the CNN result in the loss of overall semantic dependency.
Pan et al. [33]: cLSTM-MMA multimodal attention mechanism with selective fusion across three modalities.
Ren et al. [34]: IMAN uses cross-modal attention fusion and conversational modeling for speaker dependency, defining three gated recurrent units (GRUs) to capture context information.
Zheng et al. [35]: Multi-channel weight-sharing autoencoder to handle affective heterogeneity in multimodal emotion recognition. Limitation: ignores the interaction of modalities.
Zhao et al. [36]: MEmoBERT, a multimodal pre-training model that employs self-supervised learning with a prompt-based approach, combining the strengths of BERT with multimodal inputs to achieve state-of-the-art performance.
Table 2. Parameters of the proposed method.
Model for video: Inception I3D and R3D; Model for audio: ResNet-18
Learning rate for video: 0.0001; Learning rate for audio: 0.001
Learning rate decay start: epoch 15; Learning rate decay every: 5 epochs; Learning rate decay rate: 0.9
Weight decay: 0.0005; Dropout: 0.8; Optimizer: SGD; Activation: ReLU
Epochs: 50; Batch size: 64; Momentum: 0.9
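For reproducibility, the training configuration in Table 2 roughly corresponds to the following PyTorch setup; the backbone models are placeholders, and the scheduler is one plausible reading of the decay parameters (decay by 0.9 every 5 epochs, starting at epoch 15).

```python
# Sketch of the Table 2 training configuration: SGD with momentum, per-modality learning
# rates, weight decay, and a step decay starting at epoch 15. The backbones are
# placeholders standing in for the I3D/R3D visual network and the ResNet-18 audio network.
import torch
import torch.nn as nn

video_model = nn.Linear(512, 2)   # placeholder for the I3D/R3D visual backbone
audio_model = nn.Linear(128, 2)   # placeholder for the ResNet-18 audio backbone

optimizer = torch.optim.SGD(
    [
        {"params": video_model.parameters(), "lr": 1e-4},
        {"params": audio_model.parameters(), "lr": 1e-3},
    ],
    momentum=0.9,
    weight_decay=5e-4,
)

def lr_lambda(epoch: int) -> float:
    # multiply the base learning rate by 0.9 every 5 epochs, starting at epoch 15
    return 0.9 ** ((epoch - 15) // 5 + 1) if epoch >= 15 else 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=[lr_lambda, lr_lambda])

for epoch in range(50):
    # ... one training epoch with batch size 64 would run here ...
    scheduler.step()
```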
Table 3. Parameter scale of different models for emotion recognition.
I3D (RGB): 12 M
R3D-18 (ResNet-3D, 18 layers): 33 M
2D CNN + LSTM (ResNet-18 + 2-layer LSTM): 13.3 M
ResNet-18 (2D CNN): 11.7 M
Table 4. Impact of the CSSA module on the Aff-Wild2 dataset (Valence / Arousal).
w/o CSSA: 0.421 / 0.343
w/o SEMAT: 0.451 / 0.367
w/o SPAAT: 0.447 / 0.359
Full Model [Ours]: 0.457 / 0.375
Table 5. Impact of the HASPCM module on the Aff-Wild2 dataset (Valence / Arousal).
w/o HASPCM: 0.432 / 0.348
w/o SMA: 0.441 / 0.352
w/o PCMA: 0.449 / 0.361
Full Model [Ours]: 0.457 / 0.375
Table 6. Results of the ablation analysis on the IEMOCAP dataset (ACC / WA-F1, %).
HMATN: 75.39 / 78.56
Visual modality: 67.52 / 68.95
Audio modality: 61.77 / 62.34
Feature concatenation: 70.12 / 70.65
Cross-attention: 72.75 / 72.52
Table 7. Quantitative results on the AFEW-VA dataset (Valence / Arousal / Average).
[71]: 0.69 / 0.66 / 0.675
[72]: 0.55 / 0.53 / 0.540
[73]: 0.59 / 0.54 / 0.56
[74]: 0.270 / 0.333 / 0.302
[75]: 0.475 / 0.306 / 0.391
[76]: 0.39 / 0.29 / 0.34
[77]: 0.377 / 0.467 / 0.497
[78]: 0.502 / 0.581 / 0.541
Our Method: 0.654 / 0.617 / 0.635
Table 8. Quantitative results on the Affwild2 dataset (Valence / Arousal / Average).
[79]: 0.180 / 0.170 / 0.175
[80]: 0.300 / 0.244 / 0.272
[63]: 0.374 / 0.363 / 0.369
[81]: 0.418 / 0.406 / 0.412
[82]: 0.417 / 0.453 / 0.435
[83]: 0.450 / 0.448 / 0.449
[84]: 0.520 / 0.601 / 0.560
[41]: 0.448 / 0.417 / 0.432
Our Method: 0.457 / 0.375 / 0.416
Table 9. Validation set results of the Affwild2 dataset for valence and arousal (Valence / Arousal / Mean). The best results are highlighted in bold.
Fold 0: 0.455 / 0.652 / 0.553
Fold 1: 0.596 / 0.683 / 0.640
Fold 2: 0.475 / 0.639 / 0.557
Fold 3: 0.544 / 0.658 / 0.601
Fold 4: 0.438 / 0.638 / 0.538
Fold 5: 0.469 / 0.623 / 0.546
Table 10. Experimental results on the IEMOCAP dataset, reported as Acc/F1 (%) per emotion class in the order Happy, Sad, Neutral, Angry, Excited, Frustrated. The best result in each column is in bold.
BC-LSTM-Att: 30.26/33.59, 58.24/61.41, 53.12/52.30, 56.03/57.45, 52.14/56.58, 66.52/59.17
DialogueRNN: 29.13/34.51, 74.11/77.55, 59.16/60.14, 63.72/66.13, 81.57/74.15, 65.12/62.55
ConGCN: 42.43/45.11, 84.10/86.45, 62.90/64.14, 68.54/65.44, 66.15/64.18, 67.40/60.45
DialogueTRM: 48.87/53.16, 79.11/80.14, 65.02/65.56, 73.18/68.11, 80.15/79.93, 53.17/53.17
AMER-Net: 55.14/56.21, 82.17/78.94, 58.93/64.14, 70.17/71.01, 80.15/75.01, 69.16/69.11
GraphMFT: 53.94/56.11, 82.17/78.84, 61.03/63.74, 70.70/71.05, 77.87/75.08, 69.56/69.01
HMATN: 58.73/67.87, 84.96/87.14, 61.82/62.45, 74.14/73.51, 71.82/75.57, 70.84/70.33
Table 11. Performance comparison on the IEMOCAP corpus (Accuracy / Weighted-Average F1, %). The best results are highlighted in bold.
TextCNN [85]: 48.47 / 48.05
BC-LSTM-Att [86]: 56.32 / 56.11
DialogueRNN [40]: 63.50 / 62.70
ConGCN [87]: 64.19 / 64.10
DialogueTRM [88]: 68.72 / 69.13
DialogueGCN [39]: 65.91 / 65.62
GraphMFT [89]: 69.76 / 70.05
HMATN [Ours]: 75.39 / 78.56
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
