Article

Speech Emotion Recognition Using Multi-Scale Global–Local Representation Learning with Feature Pyramid Network

College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(24), 11494; https://doi.org/10.3390/app142411494
Submission received: 10 October 2024 / Revised: 4 December 2024 / Accepted: 6 December 2024 / Published: 10 December 2024

Abstract
Speech emotion recognition (SER) is important in facilitating natural human–computer interactions. In speech sequence modeling, a vital challenge is to learn context-aware sentence expression and temporal dynamics of paralinguistic features to achieve unambiguous emotional semantic understanding. In previous studies, the SER method based on the single-scale cascade feature extraction module could not effectively preserve the temporal structure of speech signals in the deep layer, downgrading the sequence modeling performance. To address these challenges, this paper proposes a novel multi-scale feature pyramid network. The enhanced multi-scale convolutional neural networks (MSCNNs) significantly improve the ability to extract multi-granular emotional features. Experimental results on the IEMOCAP corpus demonstrate the effectiveness of the proposed approach, achieving a weighted accuracy (WA) of 71.79% and an unweighted accuracy (UA) of 73.39%. Furthermore, on the RAVDESS dataset, the model achieves an unweighted accuracy (UA) of 86.5%. These results validate the system’s performance and highlight its competitive advantage.

1. Introduction

Speech emotion recognition (SER) refers to extracting and analyzing emotion-related features from speech signals, allowing the computer to understand the speaker’s emotional expression [1]. As an essential part of affective computing, SER is the key to facilitating natural human–computer interactions (HCIs) [2]. The HCI system with an affective computing module has been applied to many tasks, such as psychological assessment [3], mobile services [4], and safe driving [5]. Decades of research in SER have been devoted to modeling emotional representations from linguistic (e.g., lexical, syntactic, discourse and rhetorical features) and paralinguistic (e.g., supra-segmental phoneme and prosodic features) characteristics and to developing appropriate algorithms for implementing robust and effective emotion recognition [6,7,8,9,10].
Recently, significant progress has been made in the field of SER using multi-scale convolutional neural networks (MSCNNs). MSCNNs can extract multi-scale temporal and spatial features. Compared to single-scale network methods, MSCNNs [11,12,13,14,15] effectively capture emotion-related features of variable lengths from speech inputs and are expected to further enhance semantic understanding and emotion recognition.
However, the MSCNN module in the previous research used a one-way transmission design. The temporal dynamic range that deep emotional features can retain depends on the size of the convolutional kernel. This unidirectional design can lead to the loss of temporal structure of speech in progressive resolution reduction. To enhance SER performance, it is crucial to learn both global–local semantic representation and temporal dynamics in speech. The main limitations comprise the following: (1) The high-level semantic features learned by the hierarchical MSCNN layers suffer from the loss of temporal structure of speech. (2) In context-aware semantic understanding, the MSCNN independently produces features that ignore the long-term relation between multi-scale sound units, and the correlations between multi-scale features should be explored.
Motivated by these limitations of existing feature extraction networks, we propose a novel framework that learns context-aware representations using a multi-scale feature pyramid network (MSFPN) for SER. We explore the bi-directional fusion of multi-scale semantic features with a feature pyramid network while preserving resolution for temporal dynamics learning and global–local semantic understanding. More specifically, we adopt parallel MSCNN groups for multi-scale feature learning, supplemented with a forward fusion mechanism that combines these features into high-level semantic features. Moreover, to learn global–local correlations within the MSCNN, we improve the convolutional self-attention (CSA) [16] layer so that it effectively focuses on emotion-related periods within local regions. The MSFPN enhances the connections between adjacent elements and captures the interactions between features extracted by different attention heads. In the top-down pathway, features are merged back into low-level acoustic features by backward fusion, and a bi-directional long short-term memory (BiLSTM) network generates an utterance-level representation for emotion classification.
Our research contributions can be summarized as follows:
1. We conducted a detailed exploration of the application of feature pyramids to speech emotion recognition for the first time and enhanced the MSCNN by integrating the CSA module to better capture local emotional correlations.
2. We improved the CSA by using a multi-scale convolutional module, avoiding the degradation problem of the convolutional attention network.
3. We designed a backward fusion approach that effectively captures features across different levels of detail, preserving the importance of local dynamics and deep semantics in the emotional representation.
The rest of this paper is organized as follows: We summarize related work in Section 2. In Section 3, we propose our framework in detail. Experiments are reported in Section 4. We summarize this paper in Section 5.

2. Related Work

2.1. Background of Speech Emotion Recognition

Emotions are psychological states triggered by neurophysiological changes and are variously associated with thoughts, feelings, behavioral responses, and degrees of pleasure or displeasure. Currently, there is no scientific consensus on the definition of emotions, which often intertwine with feelings, temperament, personality, character, and creativity. Emotions consist primarily of subjective experiences, physiological responses, and behavioral reactions, and they play a crucial role in building interpersonal relationships. They can be recognized through speech, facial expressions, and body language. Basic emotions include surprise, joy, disgust, sadness, anger, and fear, while complex emotions, such as contempt, amusement, and embarrassment, are blends of multiple feelings and are more challenging to identify.
Speech emotion recognition (SER) technology aims to identify human emotions through voice. Typically, people are not very accurate in recognizing others’ emotions, making emotion recognition a burgeoning field of study where appropriate technology can enhance accuracy. The core of this process lies in identifying emotions from speech without considering cognitive content. SER mainly involves two processes, feature extraction and emotion classification, and it holds significant potential for applications in security, healthcare, entertainment, and education. The task of emotion recognition is complex due to the highly subjective nature of emotions, and there is yet to be a unified standard for classifying or measuring them.
As illustrated in Figure 1, an SER system consists of modules for speech signal input, pre-processing, feature extraction and selection, classification, and emotion recognition. The ability to extract robust emotional features from speech is crucial to the success of an SER system. Studies have shown that extracting features across multiple modalities [17,18,19,20,21,22,23] and multiple scales can significantly improve the accuracy of emotion recognition.

2.2. Multi-Scale Network Model

A multi-scale network model is a deep learning architecture designed to capture and process features across different scales simultaneously, thereby enhancing overall model performance. Typically, the model consists of multiple branches, each handling input data at various scales. These features are then fused at specific stages, enabling the extraction of more robust and comprehensive representations. In speech emotion recognition (SER), where emotions are subjective and vary among individuals, many researchers focus on extracting emotion features from speech at multiple scales to enhance SER performance. Zhu et al. [13] utilized a global perceptual fusion method to extract features across various scales in speech emotion recognition. They designed a dedicated neural network that learns and combines these multi-scale features using a global perceptual fusion module. Peng et al. [11] proposed a framework using an MSCNN with statistical pooling units to obtain both the audio and text hidden representations and employed an attention mechanism to improve performance further. Xie et al. [24] utilized multi-head attention to extract emotional features from both the temporal and frequency domains. They employed additive attention in the frequency domain, enhancing the capability to extract nonlinear features. In modeling temporal dependence, Chen et al. [15] employed an MSCNN with a bi-directional long short-term memory (BiLSTM) network to model local-aware temporal dynamics for SER. Gan et al. [25] and related studies [26,27,28] focused on temporal features and proposed Transformer-based models for sequential feature extraction, incorporating multi-scale temporal feature operators and attention modules. This line of work effectively extracts emotion features across different time scales, thereby improving the accuracy of emotion recognition.
Nowadays, multi-scale feature extraction for emotion has become widely adopted. However, existing multi-scale networks often employ one-way transfers, which can compromise the temporal structure of speech. To tackle this, we have incorporated a bi-directional transfer design in our model, which better maintains the emotional features’ temporal structure compared to other models.

2.3. Feature Pyramid Network Model

The feature pyramid network [29] was initially designed to address multi-scale challenges in object detection. Due to its capability to effectively capture features at various granularities, it has become widely adopted in object detection tasks. Lately, researchers have been investigating how feature pyramid networks can be applied to sequential tasks like speech recognition and classification. Liu et al. [30] introduced a contextual pyramid generative adversarial network for speech enhancement. This design effectively captures speech details at various levels and removes audio noise in a structured way. Luo et al. [31] explored sound event detection based on feature pyramid networks, and their experiments demonstrated that the model utilizing feature pyramid networks surpassed long short-term memory (LSTM) in performance. Furthermore, researchers [32,33,34] have utilized feature pyramid networks to enhance speech recognition capabilities by combining multi-level features.
Previous studies have shown that feature pyramid networks can effectively capture multi-scale features in sequential tasks. However, there has been limited research on how these networks benefit the extraction of emotion features from speech. Therefore, we delved deeper into using feature pyramid networks for speech emotion recognition and enhanced the original feature pyramid network by introducing a more powerful convolutional attention network to extract stronger emotion features.

3. Methodology

In this section, we propose our framework in detail. The proposed framework, as illustrated in Figure 2, comprises two primary modules: the multi-scale feature pyramid model and the global–local representation learning model. The multi-scale feature pyramid model consists of three sets of convolutional blocks (Conv Blocks), each characterized by distinct kernel sizes. Specifically, the first set employs 3 × 3 kernels, the second set utilizes 5 × 5 kernels, and the third set incorporates 7 × 7 kernels. This varied kernel configuration facilitates the effective extraction of emotional features across different scales. Furthermore, the groups interact and fuse features through both forward and backward fusion mechanisms, enhancing the robustness of the extracted multi-scale emotional characteristics. The global–local representation learning model is composed of three bi-directional long short-term memory (BiLSTM) networks, which are specifically designed to capture global features from the outputs of the multi-scale feature pyramid model.
The process commences with the input speech signal being processed through a pre-trained Hidden-Unit BERT (HuBERT) model [35], which is a self-supervised learning model designed for speech representation. HuBERT captures high-level acoustic features by learning from large amounts of unlabeled speech data, providing robust embeddings for downstream tasks. These embeddings are subsequently input into both the multi-scale feature pyramid model and the global–local representation learning model to extract multi-scale discourse-level features. Ultimately, the framework generates the final emotional output through two fully connected layers.

3.1. Multi-Scale Feature Pyramid Network

The overall architecture of the MSFPN is shown in Figure 2. The structure consists of two paths: one extracting deep emotional features from bottom to top and another merging multi-scale emotional features from top to bottom.
In the bottom-up pathway, as shown in Figure 3, the MSFPN includes a series of ConvBlock layers to learn and emphasize fine-grained emotion-related features at different levels. Each ConvBlock contains two CNN layers for feature learning combined with an improved convolutional self-attention layer (detailed in Section 3.2) for local correlation learning.
Specifically, there are three groups (X, Y, Z) of bottom-up pathways with different kernel widths (k_w = 3, 5, 7), each extracting three levels of semantic features at different time scales. Given a speech dataset with n utterances, the features of Group1 (x_1, x_2, x_3) are calculated using the hierarchical ConvBlock layers shown in Figure 3. After each layer, the number of output channels is halved: the output channels are 1024 for x_1, 512 for x_2, and 256 for x_3. The deep features of Group2 (y_1, y_2, y_3) and Group3 (z_1, z_2, z_3) are computed in the same way, except that y_i and z_i also receive the fine-grained acoustic emotional features from the preceding group through a forward fusion, which enhances the local semantic understanding of phrases. In practice, the forward fusion is implemented as element-wise addition.
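As a concrete illustration of the bottom-up pathway described above, the following PyTorch sketch assembles three parallel groups of ConvBlocks with kernel widths 3, 5, and 7, channel halving to 1024, 512, and 256, and additive forward fusion from the preceding group. It is a minimal sketch rather than the authors' implementation: the 768-dimensional input (HuBERT base embeddings), the stride-2 downsampling, and the placement of the fusion step are assumptions, and the CSA stage is replaced by an identity placeholder (a CSA sketch follows in Section 3.2).

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two Conv1d layers plus a local-attention stage.

    The paper's ConvBlock couples two CNN layers with an improved convolutional
    self-attention (CSA) layer; here the CSA stage is stood in by nn.Identity so
    the bottom-up pathway can be read on its own. The stride-2 downsampling is
    an assumption.
    """
    def __init__(self, in_ch, out_ch, kernel_width, csa=None):
        super().__init__()
        pad = kernel_width // 2
        self.conv = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_width, padding=pad),
            nn.ReLU(),
            nn.Conv1d(out_ch, out_ch, kernel_width, stride=2, padding=pad),
            nn.ReLU(),
        )
        self.csa = csa if csa is not None else nn.Identity()

    def forward(self, x):                  # x: (batch, channels, frames)
        return self.csa(self.conv(x))


class BottomUpPathway(nn.Module):
    """Three parallel groups (kernel widths 3, 5, 7), three ConvBlock levels per
    group with channel halving (1024 -> 512 -> 256), and additive forward fusion
    of the same-level feature from the preceding group."""
    def __init__(self, in_ch=768, channels=(1024, 512, 256), kernel_widths=(3, 5, 7)):
        super().__init__()
        self.groups = nn.ModuleList()
        for kw in kernel_widths:
            blocks, prev = [], in_ch
            for ch in channels:
                blocks.append(ConvBlock(prev, ch, kw))
                prev = ch
            self.groups.append(nn.ModuleList(blocks))

    def forward(self, x):                  # x: (batch, in_ch, frames), e.g. HuBERT embeddings
        outputs = []                       # outputs[g][i] = level-i feature of group g
        for g, blocks in enumerate(self.groups):
            feats, h = [], x
            for i, block in enumerate(blocks):
                h = block(h)
                if g > 0:                  # forward fusion: element-wise addition (placement is an assumption)
                    h = h + outputs[g - 1][i]
                feats.append(h)
            outputs.append(feats)
        return outputs


if __name__ == "__main__":
    emb = torch.randn(2, 768, 200)         # stand-in for HuBERT embeddings
    levels = BottomUpPathway()(emb)
    print([f.shape for f in levels[-1]])   # 1024, 512, and 256 channels at progressively halved resolutions
```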
In the top-down pathway, a backward fusion mechanism is introduced to combine semantic feature maps with high-resolution acoustic features. As shown in Figure 4, for the deep features F_i = [x_i, y_i, z_i], we first compute the attention score \alpha_i of the i-th layer using Equation (1). We then multiply \alpha_i by the features from the different depth levels to extract the emotion features most pertinent to the i-th layer and, finally, sum these products to obtain the feature G_i of the i-th layer, as in Equation (2).
\alpha_i = \mathrm{softmax}\left(W_2 \times \tanh\left(W_1 F_i^{T}\right)\right)  (1)
G_i = \sum_{i=1}^{3} \alpha_i \times F_i  (2)
where G_i denotes the multi-scale deep features of the i-th level in the MSFPN, and W_1 and W_2 are trainable parameters.
Then, the multi-scale deep feature G is upsampled by a transposed convolution, and the high-level semantic information is added to the low-level features. This yields deep features that contain adequate semantic information for understanding the emotional expression of each utterance while retaining a high-resolution sequential structure for temporal dependence learning.
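The backward fusion and upsampling steps can be sketched as follows. This is only an approximation under stated assumptions: the time pooling applied before scoring with Equation (1), the hidden width of 128, and the kernel size, stride, and channel change of the transposed convolution are not specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackwardFusion(nn.Module):
    """Attention-weighted fusion of same-level features from the three groups
    (cf. Equations (1)-(2)), followed by transposed-convolution upsampling so the
    fused map can be added to the higher-resolution level below it."""
    def __init__(self, in_ch, out_ch, hidden=128):
        super().__init__()
        self.w1 = nn.Linear(in_ch, hidden, bias=False)   # W1 in Eq. (1)
        self.w2 = nn.Linear(hidden, 1, bias=False)       # W2 in Eq. (1)
        self.upsample = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)

    def fuse(self, feats):                               # feats: list of 3 tensors (batch, C, T)
        stack = torch.stack(feats, dim=1)                # (batch, 3, C, T)
        pooled = stack.mean(dim=-1)                      # time-pooled descriptor (assumption)
        scores = self.w2(torch.tanh(self.w1(pooled)))    # (batch, 3, 1)
        alpha = F.softmax(scores, dim=1).unsqueeze(-1)   # attention over the three groups
        return (alpha * stack).sum(dim=1)                # G: (batch, C, T)

    def forward(self, feats, lower_level=None):
        g = self.fuse(feats)
        up = self.upsample(g)                            # double the temporal resolution
        return up if lower_level is None else up + lower_level

if __name__ == "__main__":
    level3 = [torch.randn(2, 256, 25) for _ in range(3)]   # deepest level from the three groups
    level2 = torch.randn(2, 512, 50)                       # next level up (higher resolution)
    print(BackwardFusion(256, 512)(level3, level2).shape)  # torch.Size([2, 512, 50])
```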

3.2. Convolutional Self-Attention

A vanilla convolutional self-attention mechanism is proposed in [16] for learning short-range dependencies. Self-attention [36] is a mechanism that allows a model to focus on different parts of an input sequence when processing each element, enabling it to capture long-range dependencies by assigning attention weights across all elements. However, the self-attention mechanism distributes attention across all elements, which can sometimes lead to the neglect of relationships between adjacent elements and phase-level patterns. To address this, convolutional self-attention (CSA) [16] introduces a convolutional layer within the self-attention framework. CSA applies localized filters that focus on neighboring elements, enabling the model to capture short-range dependencies and fine-grained features between adjacent utterances. This makes CSA particularly effective for extracting local patterns in speech data. As illustrated in Figure 5a, the multi-head convolutional self-attention mechanism operates with each color representing a different attention head. The darker regions indicate the areas that convolutional self-attention focuses on. From the figure, it is evident that CSA restricts its attention to local regions, concentrating on the contextual relationships between adjacent elements. This localized attention allows CSA to capture finer details and better extract local features, making it especially suited for modeling dependencies between neighboring elements.
However, CSA with a fixed kernel size has limitations in fusing multi-scale features. Specifically, when CSA projects multi-head spaces of the same dimension, the restricted regions of interest and the high number of heads can cause the regions projected by different heads to converge into similar feature spaces, leading to a lack of diversity in the feature representations learned by the different heads.
To overcome this limitation, we propose a multi-scale feature extraction approach to enhance the feature space projection. This allows self-attention to compute temporal dependencies across different regions of information while mitigating the overfitting problem associated with the multi-head mechanism in CSA. In this paper, we introduce an improved convolutional self-attention layer that incorporates multi-scale feature extraction. As shown in Figure 5b, we use different kernel sizes across the attention heads, enabling the extraction of fine-grained features at multiple scales. Given an input sequence X, CSA focuses on local regions for each query q_i and restricts its attention region to a local scope of fixed size m + 1 (m ∈ {3, 7, 11, 15}), centered at position i. To maintain the same output dimension as the input, we reduce the output dimension for each kernel width by a factor of four.
\hat{K}_m = \{ k_{i-\frac{m}{2}}, \ldots, k_i, \ldots, k_{i+\frac{m}{2}} \}
\hat{V}_m = \{ v_{i-\frac{m}{2}}, \ldots, v_i, \ldots, v_{i+\frac{m}{2}} \}
Each head linearly maps the query, key, and value to dimensions d_k, d_k, and d_v, respectively. In practice, self-attention computes the attention function on a set of queries simultaneously: the queries, keys, and values are packed into the matrices Q, \hat{K}_m, and \hat{V}_m. The attention output is calculated as
\mathrm{Attention}(Q, \hat{K}_m, \hat{V}_m) = \mathrm{softmax}\left(\frac{Q \hat{K}_m^{T}}{\sqrt{d_k}}\right) \hat{V}_m
\mathrm{Head}_{i,m} = \mathrm{Attention}(Q W_i^{Q}, \hat{K}_m W_i^{\hat{K}_m}, \hat{V}_m W_i^{\hat{V}_m})
\mathrm{MultiHead}(Q, \hat{K}_m, \hat{V}_m) = \mathrm{Concat}(\mathrm{head}_{1,3}, \ldots, \mathrm{head}_{l,m})
where W_i^{Q}, W_i^{\hat{K}_m}, and W_i^{\hat{V}_m} are the weight matrices of the multi-head attention, with dimensions d_k/l, d_k/l, and d_v/l, respectively.
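A minimal PyTorch sketch of the improved CSA is given below. It restricts each head's attention to a local window and uses a different window size per head, with the per-head output dimension reduced to d_model/4 as described above; using a single head per window size, plain linear projections in place of the convolutional key/value construction, and the exact window boundary handling are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleCSA(nn.Module):
    """Local windowed self-attention with a different window size per head.

    Each head attends only to positions within roughly m + 1 frames of the query
    (m in {3, 7, 11, 15}), and each head's output dimension is d_model // 4 so the
    concatenation matches the input width.
    """
    def __init__(self, d_model, windows=(3, 7, 11, 15)):
        super().__init__()
        assert d_model % len(windows) == 0
        self.windows = windows
        self.d_head = d_model // len(windows)
        self.q_proj = nn.ModuleList([nn.Linear(d_model, self.d_head) for _ in windows])
        self.k_proj = nn.ModuleList([nn.Linear(d_model, self.d_head) for _ in windows])
        self.v_proj = nn.ModuleList([nn.Linear(d_model, self.d_head) for _ in windows])
        self.out = nn.Linear(d_model, d_model)

    @staticmethod
    def local_mask(t, m, device):
        # True where |i - j| <= m // 2, i.e. inside the local scope around the query
        idx = torch.arange(t, device=device)
        return (idx[None, :] - idx[:, None]).abs() <= m // 2

    def forward(self, x):                                   # x: (batch, frames, d_model)
        t = x.size(1)
        heads = []
        for m, wq, wk, wv in zip(self.windows, self.q_proj, self.k_proj, self.v_proj):
            q, k, v = wq(x), wk(x), wv(x)
            scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
            scores = scores.masked_fill(~self.local_mask(t, m, x.device), float("-inf"))
            heads.append(F.softmax(scores, dim=-1) @ v)     # attention restricted to the window
        return self.out(torch.cat(heads, dim=-1))

if __name__ == "__main__":
    x = torch.randn(2, 100, 256)
    print(MultiScaleCSA(256)(x).shape)                      # torch.Size([2, 100, 256])
```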

3.3. Global–Local Representation Learning Module

In previous studies, the BiLSTM network has been proven effective in capturing the long temporal dynamics of deep features to aggregate global–local representations [15], but the progressive resolution reduction limits the sequence modeling performance [10]. In this paper, we explore the relative interaction between each emotional state in the progressive acoustic feature extraction. We model the multi-scale temporal dependence to generate the global–local representations for SER. In practice, the representation R is aggregated by a BiLSTM, H = (h_1, h_2, \ldots, h_t), over the multi-scale deep features G = (g_1, g_2, \ldots, g_t).
\begin{aligned}
f_t &= \sigma(W_f [g_t, h_{t-1}] + b_f) \\
i_t &= \sigma(W_i [g_t, h_{t-1}] + b_i) \\
o_t &= \sigma(W_o [g_t, h_{t-1}] + b_o) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh(W_c [g_t, h_{t-1}] + b_c) \\
h_t &= o_t \odot \tanh(C_t) \\
R &= h_t^{1} \oplus h_t^{2} \oplus h_t^{3}
\end{aligned}
Here, σ denotes the sigmoid activation function; f, i, o, and C represent the forget gate, input gate, output gate, and memory cell activation vectors, respectively; ⊙ denotes element-wise multiplication; and ⊕ denotes concatenation. The weight matrices and bias vectors for each gate are indicated by W and b. The last hidden outputs, h_t^i, of the three BiLSTMs are concatenated to form the utterance-level representation R, which is subsequently input into fully connected layers for emotion inference.
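This global–local module can be sketched as three BiLSTMs whose final forward and backward hidden states are concatenated into R and classified by two fully connected layers. The hidden size, the classifier width, and the per-level feature dimensions entering the BiLSTMs are assumptions, since the paper does not list them.

```python
import torch
import torch.nn as nn

class GlobalLocalRepresentation(nn.Module):
    """Three BiLSTMs (one per pyramid level); their last forward/backward hidden
    states are concatenated into the utterance representation R and classified by
    two fully connected layers (4 emotion classes for IEMOCAP)."""
    def __init__(self, level_dims=(256, 256, 256), hidden=128, n_classes=4):
        super().__init__()
        self.lstms = nn.ModuleList(
            [nn.LSTM(d, hidden, batch_first=True, bidirectional=True) for d in level_dims])
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden * len(level_dims), 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, levels):                            # levels: list of (batch, frames_i, dim_i)
        reps = []
        for g, lstm in zip(levels, self.lstms):
            _, (h, _) = lstm(g)                           # h: (2, batch, hidden) for one BiLSTM layer
            reps.append(torch.cat([h[0], h[1]], dim=-1))  # last forward and backward hidden states
        return self.classifier(torch.cat(reps, dim=-1))   # R -> emotion logits

if __name__ == "__main__":
    levels = [torch.randn(2, t, 256) for t in (100, 50, 25)]
    print(GlobalLocalRepresentation()(levels).shape)      # torch.Size([2, 4])
```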

4. Experiments

In this section, we evaluate the proposed framework on the IEMOCAP and RAVDESS corpora. We compare the results with related state-of-the-art methods and conduct ablation studies to measure each component's contribution.

4.1. Corpora Description

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus [37] is commonly utilized for evaluation. IEMOCAP is an interactive emotional dyadic motion capture database developed by the SAIL Lab at the University of Southern California, containing a total of 12 h of recordings. The data were recorded by ten professional actors (five males and five females) in a studio setting. Each recording is accompanied by discrete emotional labels and annotations for emotional dimensions. The IEMOCAP database consists of five sessions, each featuring dialogues between a female and a male actor, divided into two parts: improvised performances and scripted performances. The former involves spontaneous dialogues without predetermined content, while the latter follows a predefined script. This database encompasses multimodal information, including audio and text, making it suitable for various unimodal emotion recognition studies. To ensure a balanced representation of audio samples across categories, this study merges the excitement emotion into the happiness category. Ultimately, the dataset comprises 5531 audio samples, distributed as follows: 1103 instances of anger, 1084 instances of sadness, 1708 instances of neutrality, and 1636 instances of happiness. Figure 6 illustrates the number of audio samples corresponding to each emotional label. As there is no predefined data split in IEMOCAP, we perform 10-fold cross-validation with a leave-one-speaker-out strategy to obtain comparable results.
In addition to IEMOCAP, we also evaluate our model on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). RAVDESS consists of 24 actors (12 males and 12 females) and includes both emotional speech and song samples covering a range of emotions, such as happiness, sadness, anger, fear, surprise, and neutrality. For this dataset, we adopt a random 8:2 split for the training and test sets; Figure 7 illustrates the number of audio samples corresponding to each emotional label. This split strategy aligns with the approach used in other studies on the RAVDESS dataset, facilitating comparison of our results with those reported in the literature; the differing data split strategies between IEMOCAP and RAVDESS are therefore intended to ensure consistency with prior research on the RAVDESS dataset.
In this study, we used the HuBERT pre-trained model to extract speech embeddings, which served as the input to our pyramid network model. The HuBERT model, pre-trained on a large corpus of speech data, provides high-quality representations that capture both phonetic and prosodic features. These embeddings were crucial in facilitating accurate emotion recognition, as they allowed the model to focus on high-level speech characteristics rather than low-level acoustic details. By incorporating these pre-trained embeddings, the pyramid network could more effectively discern emotional cues in speech, leading to improved model performance across different emotion categories. This approach highlights the importance of leveraging pre-trained models for feature extraction in speech-based emotion recognition tasks.
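As a concrete example of this front end, the snippet below extracts HuBERT embeddings with torchaudio's pretrained HUBERT_BASE bundle. The specific checkpoint, the transformer layer whose output is used, and the file path are assumptions; the paper does not state them.

```python
import torch
import torchaudio

# Load torchaudio's pretrained HuBERT (base) bundle and extract frame-level embeddings.
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

# "utterance.wav" is a hypothetical file path; IEMOCAP/RAVDESS audio would be loaded the same way.
waveform, sr = torchaudio.load("utterance.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)   # one tensor per transformer layer
embeddings = features[-1]                            # (batch, frames, 768); layer choice is an assumption
print(embeddings.shape)                              # these embeddings feed the pyramid network
```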

4.2. Implementation Details

The proposed framework was implemented in PyTorch. For training, we used the Adam optimizer with a learning rate of 1 × 10⁻⁵, a batch size of 32, and early stopping [38]. To handle data imbalance, we evaluated performance using both weighted accuracy (WA) and unweighted accuracy (UA). WA measures the overall classification accuracy by dividing the total number of correctly predicted samples by the total number of samples, while UA calculates the average accuracy over the emotion categories, providing insight into the model's performance across different emotions. The experiments were conducted in a local environment with an NVIDIA GeForce RTX 3080 GPU (NVIDIA, Santa Clara, CA, USA).
\mathrm{WA} = \frac{\sum_{i=1}^{K} N_i \, \mathrm{Accuracy}_i}{\sum_{i=1}^{K} N_i}, \qquad \mathrm{UA} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{Accuracy}_i
where K denotes the number of emotion categories, i indexes the i-th emotion category, and N_i denotes the number of samples in the i-th emotion category.
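Both metrics can be computed directly from reference and predicted labels, as in the short sketch below (the labels in the example are illustrative only).

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """Weighted accuracy (overall fraction correct) and unweighted accuracy
    (mean of per-class recalls), as defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float((y_true == y_pred).mean())
    per_class = [float((y_pred[y_true == c] == c).mean()) for c in np.unique(y_true)]
    ua = float(np.mean(per_class))
    return wa, ua

# Toy example with three classes and six samples (illustrative labels only)
print(wa_ua([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]))  # -> approximately (0.667, 0.667)
```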

4.3. Experiment Results and Discussion

To fairly compare the performance of the proposed framework, we implement the end-to-end method E2ESA (without multi-task learning), the resolution-maintained method DRN, and the multi-scale feature representation method GLAM (without data augmentation) for evaluation. The experimental results are shown in Table 1. In the MSCNN-based method GLAM, the multiple feature representations are built on high-level semantic features, which limits fine-grained temporal dynamics and local representation learning. The end-to-end method E2ESA uses single-scale correlation modeling, which limits global–local representation learning for context-aware emotion classification. The DRN method appropriately preserves the resolution of deep features, avoiding the loss of the temporal structure of speech in hierarchical CNNs; however, directly applying the MSCNN module in the DRN leads to a gridding effect when learning high-level temporal dynamics from the intermediate hidden features. The proposed MSFPN mainly benefits from its semantically strong, full-resolution feature map for global–local representation learning and achieves the highest results, with UA improvements of 3.69% over GLAM, 3.39% over Xie [24], 2.53% over E2ESA, and 1.8% over DRN. Additionally, the model inference time on the IEMOCAP dataset is 0.5 s, enabling efficient real-time processing.
Furthermore, we also evaluate the performance of MSFPN on the RAVDESS dataset. Since most existing studies on RAVDESS report results using unweighted accuracy (UA), we adopt UA as the evaluation metric for MSFPN on this dataset as well. The experimental results are shown in Table 2. Compared to the methods proposed by Chakhtouna [40] and Ullah [41], MSFPN better extracts multi-granularity emotional features across different scales, which significantly improves emotion recognition accuracy on RAVDESS. These comparison results demonstrate that the proposed method effectively captures multi-scale contextual information in speech and achieves superior performance on this dataset as well. The model inference time on the RAVDESS dataset is 0.2 s, further underscoring the efficiency of the proposed method.
Moreover, the feature space learned by MSFPN is compact and well clustered. The t-SNE visualization is depicted in Figure 8, which shows that our method produces well-differentiated emotional features. Compared with DRN, the neutral emotion features are more condensed and less mixed into the other emotion clusters.
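For reference, a projection in the style of Figure 8 can be produced with scikit-learn's t-SNE on the utterance-level representations; the sketch below uses random stand-in features and assumed hyperparameters, since the paper does not report its t-SNE settings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Random stand-ins for learned utterance-level representations and their labels.
reps = np.random.randn(500, 768)
labels = np.random.randint(0, 4, size=500)
names = ["angry", "sad", "neutral", "happy"]

points = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(reps)
for c, name in enumerate(names):
    mask = labels == c
    plt.scatter(points[mask, 0], points[mask, 1], s=8, label=name)
plt.legend()
plt.title("t-SNE of utterance-level representations")
plt.savefig("tsne_msfpn.png", dpi=200)
```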

4.4. Ablation Study

To further measure the contribution of each component of the proposed model, we conducted ablation experiments on the IEMOCAP and RAVDESS datasets; the results are shown in Table 3 and Table 4.
The results in Table 3 and Table 4 provide a clear measurement of the contribution of each component. The forward fusion fuses multi-scale deep features for robust semantic feature learning during progressive downsampling. The backward fusion mainly focuses on the salient emotional periods in the multi-scale deep features and aggregates them into a high-resolution feature map. The CSA in the MSCNN is important for local correlation learning, effectively enhancing the expression of emotion in semantically strong features. The core component of our proposed framework is the MSCNN, which extracts multiple deep features and enables the fusion mechanisms to appropriately capture emotion-related features.

5. Conclusions

In this paper, we propose a multi-scale feature pyramid network for context-aware speech emotion recognition. The MSCNN-based feature extraction module with an improved CSA layer is useful for capturing global–local correlations in the speech sequence. With bottom-up and top-down connections, semantically strong features are effectively aggregated with high-resolution acoustic features, which contain adequate emotional characteristics and retain the temporal structure for global–local representation learning. The experimental results on IEMOCAP and RAVDESS demonstrate the effectiveness of our framework, which outperforms state-of-the-art approaches.

6. Future Work

While the proposed multi-scale feature pyramid network has proven effective in capturing multi-scale features from speech, its potential in multimodal emotion recognition was explored only to a limited extent. Previous studies have shown that integrating multiple modalities can further improve recognition performance [42,43,44]. We plan to extend this framework to a multimodal setting by incorporating complementary information from diverse modalities, such as text, visual cues, and physiological signals. By harnessing the strengths of multi-scale feature learning across modalities, we aim to achieve a deeper understanding of emotional expressions and further enhance the system’s robustness and performance in complex real-world scenarios.

Author Contributions

Conceptualization, Y.W. and Z.Z.; methodology, Z.Z., J.H. and Y.W.; software, Z.Z. and J.H.; validation, Z.Z. and J.H.; formal analysis, Z.Z., Y.W. and J.H.; investigation, Z.Z. and J.H.; resources, Y.W. and X.Z.; data curation, Z.Z. and J.H.; writing—original draft preparation, Z.Z. and J.H.; writing—review and editing, Z.Z., Y.W. and J.H.; visualization, Z.Z. and J.H.; supervision, Y.W. and H.L.; project administration, Y.W. and H.L.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Korsmeyer, C.; Rosalind, W. Picard, affective computing. Minds Mach. 1999, 9, 443–447. [Google Scholar] [CrossRef]
  2. Schuller, B.W. Speech emotion recognition two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
  3. Low, L.-S.A.; Maddage, N.C.; Lech, M.; Sheeber, L.; Allen, N.B. Detection of clinical depression in adolescents’ speech during family interactions. IEEE Trans. Biomed. Eng. 2011, 58, 574–586. [Google Scholar] [CrossRef]
  4. Yoon, W.-J.; Cho, Y.-H.; Park, K.-S. A study of speech emotion recognition and its application to mobile services. In Ubiquitous Intelligence and Computing; Springer: Berlin/Heidelberg, Germany, 2007; pp. 758–766. [Google Scholar]
  5. Tawari, A.; Trivedi, M. Speech based emotion classification framework for driver assistance system. In Proceedings of the 2010 IEEE Intelligent Vehicles Symposium, La Jolla, CA, USA, 21–24 June 2010; pp. 174–178. [Google Scholar]
  6. Ma, H.; Yarosh, S. A review of affective computing research based on function-component-representation framework. IEEE Trans. Affect. Comput. 2021, 14, 1655–1674. [Google Scholar] [CrossRef]
  7. Deshmukh, S.; Gupta, P.; Mane, P. Investigation of results using various databases and algorithms for music player using speech emotion recognition. In International Conference on Soft Computing and Pattern Recognition; Springer International Publishing: Cham, Switzerland, 2021; pp. 205–215. [Google Scholar]
  8. Basu, S.; Chakraborty, J.; Aftabuddin, M. Emotion recognition from speech using convolutional neural network with recurrent neural network architecture. In Proceedings of the 2017 2nd International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 19–20 October 2017; pp. 333–336. [Google Scholar]
  9. Etienne, C.; Fidanza, G.; Petrovskii, A.; Devillers, L.; Schmauch, B. CNN+LSTM architecture for speech emotion recognition with data augmentation. arXiv 2018, arXiv:1802.05630. [Google Scholar]
  10. Li, R.; Wu, Z.; Jia, J.; Zhao, S.; Meng, H. Dilated residual network with multi-head self-attention for speech emotion recognition. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6675–6679. [Google Scholar]
  11. Peng, Z.; Lu, Y.; Pan, S.; Liu, Y. Efficient speech emotion recognition using multi-scale CNN and attention. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3020–3024. [Google Scholar]
  12. Liu, J.; Liu, Z.; Wang, L.; Guo, L.; Dang, J. Speech emotion recognition with local-global aware deep representation learning. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7174–7178. [Google Scholar]
  13. Zhu, W.; Li, X. Speech emotion recognition with global-aware fusion on multi-scale feature representation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6437–6441. [Google Scholar]
  14. Xu, M.; Zhang, F.; Cui, X.; Zhang, W. Speech emotion recognition with multiscale area attention and data augmentation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6319–6323. [Google Scholar]
  15. Chen, M.; Zhao, X. A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 374–378. [Google Scholar]
  16. Yang, B.; Wang, L.; Wong, D.F.; Chao, L.S.; Tu, Z. Convolutional self-attention networks. arXiv 2019, arXiv:1904.03107. [Google Scholar]
  17. Lee, J.-H.; Kim, J.-Y.; Kim, H.-G. Emotion Recognition Using EEG Signals and Audiovisual Features with Contrastive Learning. Bioengineering 2024, 11, 997. [Google Scholar] [CrossRef] [PubMed]
  18. Liu, G.; Hu, P.; Zhong, H.; Yang, Y.; Sun, J.; Ji, Y.; Zou, J.; Zhu, H.; Hu, S. Effects of the Acoustic-Visual Indoor Environment on Relieving Mental Stress Based on Facial Electromyography and Micro-Expression Recognition. Buildings 2024, 14, 3122. [Google Scholar] [CrossRef]
  19. Das, A.; Sarma, M.S.; Hoque, M.M.; Siddique, N.; Dewan, M.A.A. AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition. Sensors 2024, 24, 5862. [Google Scholar] [CrossRef]
  20. Udahemuka, G.; Djouani, K.; Kurien, A.M. Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review. Appl. Sci. 2024, 14, 8071. [Google Scholar] [CrossRef]
  21. Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects. Expert Syst. Appl. 2024, 237, 121692. [Google Scholar] [CrossRef]
  22. Wang, Y.; Li, Y.; Cui, Z. Incomplete multimodality-diffused emotion recognition. Adv. Neural Inf. Process. Syst. 2024, 36, 17117–17128. [Google Scholar]
  23. Meng, T.; Shou, Y.; Ai, W.; Yin, N.; Li, K. Deep imbalanced learning for multimodal emotion recognition in conversations. In IEEE Transactions on Artificial Intelligence; IEEE: New York, NY, USA, 2024. [Google Scholar]
  24. Xie, Y.; Liang, R.; Liang, Z.; Zhao, X.; Zeng, W. Speech emotion recognition using multihead attention in both time and feature dimensions. IEICE Trans. Inf. Syst. 2023, 106, 1098–1101. [Google Scholar] [CrossRef]
  25. Gan, C.; Wang, K.; Zhu, Q.; Xiang, Y.; Jain, D.K.; García, S. Speech emotion recognition via multiple fusion under spatial–temporal parallel network. Neurocomputing 2023, 555, 126623. [Google Scholar] [CrossRef]
  26. Li, Z.; Xing, X.; Fang, Y.; Zhang, W.; Fan, H.; Xu, X. Multi-scale temporal transformer for speech emotion recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Dublin, Ireland, 20–24 August 2023; pp. 3652–3656. [Google Scholar]
  27. Yu, L.; Xu, F.; Qu, Y.; Zhou, K. Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion. Appl. Acoust. 2024, 216, 109752. [Google Scholar] [CrossRef]
  28. Andayani, F.; Theng, L.B.; Tsun, M.T.; Chua, C. Hybrid LSTM-transformer model for emotion recognition from speech audio files. IEEE Access 2022, 10, 36018–36027. [Google Scholar] [CrossRef]
  29. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  30. Liu, G.; Gong, K.; Liang, X.; Chen, Z. CP-GAN: Context pyramid generative adversarial network for speech enhancement. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6624–6628. [Google Scholar]
  31. Luo, S.; Feng, Y.; Liu, Z.J.; Ling, Y.; Dong, S.; Ferry, B. High precision sound event detection based on transfer learning using transposed convolutions and feature pyramid network. In Proceedings of the 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2023; pp. 1–6. [Google Scholar]
  32. Basbug, A.M.; Sert, M. Acoustic scene classification using spatial pyramid pooling with convolutional neural networks. In Proceedings of the 2019 IEEE 13th International Conference on Semantic Computing (ICSC), Newport Beach, CA, USA, 30 January–1 February 2019; pp. 128–131. [Google Scholar]
  33. Gupta, S.; Karanath, A.; Mahrifa, K.; Dileep, A.D.; Thenkanidiyoor, V. Segment-level probabilistic sequence kernel and segment-level pyramid match kernel based extreme learning machine for classification of varying length patterns of speech. Int. J. Speech Technol. 2019, 22, 231–249. [Google Scholar] [CrossRef]
  34. Ren, Y.; Peng, H.; Li, L.; Xue, X.; Lan, Y.; Yang, Y. A voice spoofing detection framework for IoT systems with feature pyramid and online knowledge distillation. J. Syst. Archit. 2023, 143, 102981. [Google Scholar] [CrossRef]
  35. Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Self-supervised speech representation learning by masked prediction of hidden units. In IEEE/ACM Transactions on Audio, Speech, and Language Processing; IEEE: New York, NY, USA, 2021; Volume 29, pp. 3451–3460. [Google Scholar]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  37. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  38. Prechelt, L. Early stopping—But when? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998; pp. 55–69. [Google Scholar]
  39. Li, Y.; Zhao, T.; Kawahara, T. Improved end-to-end speech emotion recognition using self-attention mechanism and multitask learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2803–2807. [Google Scholar]
  40. Chakhtouna, A.; Sekkate, S.; Abdellah, A. Unveiling embedded features in Wav2vec2 and HuBERT models for Speech Emotion Recognition. Procedia Comput. Sci. 2024, 232, 2560–2569. [Google Scholar] [CrossRef]
  41. Ullah, R.; Asif, M.; Shah, W.A.; Anjam, F.; Ullah, I.; Khurshaid, T.; Wuttisittikulkij, L.; Shah, S.; Ali, S.M.; Alibakhshikenari, M. Speech emotion recognition using convolution neural networks and multi-head convolutional transformer. Sensors 2023, 23, 6212. [Google Scholar] [CrossRef] [PubMed]
  42. Manelis, A.; Miceli, R.; Satz, S.; Suss, S.J.; Hu, H.; Versace, A. The Development of Ambiguity Processing Is Explained by an Inverted U-Shaped Curve. Behav. Sci. 2024, 14, 826. [Google Scholar] [CrossRef] [PubMed]
  43. Arslan, E.E.; Akşahin, M.F.; Yilmaz, M.; Ilgın, H.E. Towards Emotionally Intelligent Virtual Environments: Classifying Emotions Through a Biosignal-Based Approach. Appl. Sci. 2024, 14, 8769. [Google Scholar] [CrossRef]
  44. Sun, L.; Yang, H.; Li, B. Multimodal Dataset Construction and Validation for Driving-Related Anger: A Wearable Physiological Conduction and Vehicle Driving Data Approach. Electronics 2024, 13, 3904. [Google Scholar] [CrossRef]
Figure 1. Functional diagram of the SER system.
Figure 2. Overview of the proposed multi-scale feature pyramid network.
Figure 3. Bottom-up pathway, where k_w denotes different kernel widths and CSA denotes convolutional self-attention.
Figure 4. Backward fusion structure, where φ represents the attention score calculation function shown in Equation (1), and F_i denotes the feature of the i-th layer.
Figure 5. Convolutional self-attention (CSA) framework: (a) vanilla CSA; (b) improved CSA.
Figure 6. The number of audio samples corresponding to each emotional label in IEMOCAP.
Figure 7. The number of audio samples corresponding to each emotional label in RAVDESS.
Figure 8. The t-SNE visualization of the proposed framework: (a) MSFPN; (b) DRN.
Table 1. Comparisons of UA and WA with state-of-the-art methods on IEMOCAP. The best results are highlighted in bold.

Model         UA       WA
GLAM [13]     69.70%   68.75%
E2ESA [39]    70.86%   69.25%
Xie [24]      70.0%    68.8%
DRN [10]      71.59%   70.23%
MSFPN         73.39%   71.79%

Table 2. Comparisons of UA with state-of-the-art methods on RAVDESS. The best results are highlighted in bold.

Model             UA
Chakhtouna [40]   82.6%
Ullah [41]        82.3%
MSFPN             86.5%

Table 3. Performance of ablation studies on the IEMOCAP dataset. 'w/o' denotes the MSFPN framework without the corresponding component.

Method                UA       WA
w/o CSA               66.89%   65.35%
w/o MSCNN             70.95%   69.05%
w/o forward fusion    71.85%   70.09%
w/o backward fusion   72.72%   71.26%
MSFPN                 73.39%   71.79%

Table 4. Performance of ablation studies on the RAVDESS dataset. 'w/o' denotes the MSFPN framework without the corresponding component.

Method                UA
w/o CSA               68.8%
w/o MSCNN             80.6%
w/o forward fusion    80.9%
w/o backward fusion   83.0%
MSFPN                 86.5%