Article

Speech Emotion Recognition Using RA-Gmlp Model on Time–Frequency Domain Features Extracted by TFCM

1 School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
2 College of Information Technology, Xinjiang Teacher’s College (Xinjiang Education Institute), Urumqi 830043, China
3 School of Software, Xinjiang University, Urumqi 830091, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(3), 588; https://doi.org/10.3390/electronics13030588
Submission received: 21 December 2023 / Revised: 20 January 2024 / Accepted: 28 January 2024 / Published: 31 January 2024

Abstract:
Speech emotion recognition (SER) is a key branch in the field of artificial intelligence, focusing on the analysis and understanding of emotional content in human speech. It draws on multidisciplinary knowledge from acoustics, phonetics, linguistics, pattern recognition, and neurobiology, aiming to establish a connection between human speech and emotional expression. This technology has shown broad application prospects in the medical, educational, and customer service fields. With the evolution of deep learning and neural network technologies, SER research has shifted from relying on manually designed low-level descriptors (LLDs) to utilizing complex neural network models for extracting high-dimensional features. A perennial challenge for researchers has been how to comprehensively capture the rich emotional features present in speech. Given that emotional information is present in both the time and frequency domains, our study introduces a novel time–frequency domain convolution module (TFCM) based on Mel-frequency cepstral coefficient (MFCC) features to deeply mine the time–frequency information of MFCCs. In the deep feature extraction phase, we introduce hybrid dilated convolution (HDC) into the SER field for the first time, significantly expanding the receptive field of neurons and thereby enhancing feature richness and diversity. Furthermore, we propose the residual attention-gated multilayer perceptron (RA-GMLP) structure, which combines the global feature recognition ability of GMLP with the concentrated weighting of the multihead attention mechanism, effectively focusing on the key emotional information within the speech sequence. Through extensive experimental validation, we demonstrate that TFCM, HDC, and RA-GMLP surpass existing advanced technologies in enhancing the accuracy of SER tasks, showcasing the advantages of the proposed modules.

1. Introduction

Speech, as the most instinctive and natural form of human communication, has always been a focal point for scientists and researchers. Based on this, the study of SER has emerged, aiming to build a bridge between human speech information and emotional expression using computer science and technology. SER is a key component of artificial intelligence (AI), involving multiple disciplines, such as acoustics, phonetics, linguistics, information theory, pattern recognition, and neurobiology. In human–computer interaction, SER plays a crucial role, aiming to more accurately understand and recognize the emotional state conveyed by the speaker through the voice signal. This technology has been widely practiced and applied in various intelligent applications. For example, in intelligent interaction systems, voice customer service has become the main means to achieve effective interaction [1]. Moreover, in the medical field [2,3], SER can assist doctors and medical workers in more accurately understanding the emotional state of mental patients, thereby providing more personalized and precise medical services. In the field of autonomous driving, by analyzing the emotional state of drivers through SER technology, the system can more accurately determine whether to take corresponding safety measures to avoid potential dangers. In the field of education, teachers and educators can use SER to more comprehensively understand the emotional state of students, to provide more personalized and effective teaching methods and psychological support. In the field of criminal investigation, by analyzing the voice of suspects, one can more accurately understand their emotional state, thus providing important auxiliary information for the investigation and trial of cases. These application examples demonstrate the potential value and practical application of SER in different fields. With further research and development, SER is expected to play a larger role in enhancing human–machine interaction, medical services, autonomous driving, education, and criminal investigation.
Traditional SER methods mostly rely on manually designed low-level descriptors (LLDs) for classification. Lampropoulos et al. [4] evaluated the application of MPEG-7 audio descriptors in emotion category modeling, especially in an emotion classification task based on an RBF-SVM classifier, showing good performance. However, with the continuous development and maturity of deep learning and neural network technologies, there has been a gradual shift towards using neural networks to extract more advanced and comprehensive features. Although these methods have achieved a certain degree of success in enhancing the diversity and richness of features, many challenges remain in practical applications. Chief among them is how to extract the most representative and informative features from vast and complex speech signals, which remains a highly challenging issue. To address these issues, researchers have proposed various methods and techniques. For instance, using the multi-attribute decision-making (MADM) method, M. Virvou et al. [5] integrated the visual and speech modalities in a dual-mode user interface, realizing dynamic recognition of the six basic emotions of computer users. M. R. Makiuchi et al. [6] suggest combining acoustic information with textual information, utilizing high-level contextual information to aid emotion prediction. Zhang et al. [7] use features of three modalities, speech, text, and image, to achieve more comprehensive and accurate emotion recognition. Moreover, Heqing Zou et al. [8] performed emotion prediction by integrating spectrogram features, MFCC features, and the time domain and frequency domain information from Wav2Vec2. However, these methods have not yet sufficiently considered the impact of time domain and frequency domain features on SER tasks.
By enlarging the receptive field of neurons without increasing the number of parameters, we can extract more diverse and rich emotional features, thereby more accurately reflecting the speaker’s emotional state. In the field of SER, we are the first to apply hybrid dilated convolution (HDC) to speech feature extraction. In recent studies, Mirsamadi et al. have proposed a recurrent neural network (RNN) [9] with a local attention architecture to learn features in SER. F. Tao et al. [10] have introduced a new variant of long short-term memory (LSTM), namely, advanced LSTM (A-LSTM), to better model the temporal context for SER. Chen et al. [11] have employed an attention-based convolutional RNN (ACRNN) to extract high-level emotional feature representations from log Mel spectrograms. In [12], emotion recognition was achieved using vision transformers. The introduction of MLP-Mixer [13], which uses neither convolution nor self-attention, showed that such architectures can achieve results comparable to vision transformers [14], leading to renewed interest in MLP research. Ding et al. [15] proposed RepMLP, combining the global feature extraction capability of MLPs with the local feature extraction ability of CNNs, thereby improving performance in translation-invariant tasks, such as semantic segmentation. Qiu et al. [16] achieved a 33% reduction in mean squared error (MSE) by stacking GMLP and ReZero to learn sequential drug-target affinity information. Yan et al. [17] introduced TT-MLP, using tensor decomposition to compress deep MLPs and thus reduce the number of training parameters. Zhu et al. [18] utilized GMLP components and the global attention module (Glam) to achieve good results. However, speech emotional information is not uniformly distributed across speech features, and the aforementioned MLP structures did not consider how to focus the model on the required information. Therefore, we propose a new type of MLP structure called RA-GMLP, which uses multihead attention to weight the importance of emotional information. This allows the model to focus on the key information in each frame and to comprehensively understand and represent the emotional information in speech, thereby significantly improving the accuracy of SER tasks.
In this paper, we will detail the design and implementation specifics of the model we propose and validate its performance and effectiveness through extensive experiments and analysis. We will also compare it with other mainstream SER methods to demonstrate the superiority of our model. In summary, the main contributions of this paper in the field of SER research are as follows:
(1)
We propose a residual attention-based gated multilayer perceptron (RA-GMLP) structure, in which the GMLP helps the model determine which information is retained, and the attention mechanism further weights this important emotional information. This structure fills the gap of the GMLP mechanism in terms of attention, enhancing the accuracy of SER.
(2)
This study proposes TFCM, a new method that achieves a time–frequency domain fusion of MFCC features. The experimental results show that this fusion method is superior to traditional feature extraction methods.
(3)
To increase the receptive field of neurons and the diversity of speech emotion features, hybrid dilated convolution is introduced to the SER task for the first time. Experimental results show that the model achieves outstanding performance in speech emotion classification tasks.
(4)
This paper proposes a time–frequency SER model based on the RA-GMLP module, which has higher classification accuracy.

2. Related Work

2.1. Traditional Features

In the classification of speech emotions, traditional methods involve using manually designed low-level descriptors (LLDs) for classification. Different high-level statistical functions (HSFs), such as the mean, maximum value, and variance, are applied to each LLD, and the results are then concatenated into a long feature vector. With the development of artificial intelligence, features have become more diverse. Currently, the main features in SER include prosodic and spectral features [19]. Prosodic features mainly consist of fundamental frequency-related features and energy-related features [20], from which both global and local statistics, such as the maximum, minimum, mean, and variance, can be extracted to form a high-dimensional feature set. Spectral features reflect differences in the frequency domain of the signal and include linear spectral features and cepstral features. Common linear spectral features include linear prediction coefficients (LPCs) and log frequency power coefficients (LFPCs), while common cepstral features include Mel-frequency cepstral coefficients (MFCCs) and linear predictive cepstral coefficients (LPCCs). Studies have shown that cepstral features have better discriminative power for speech emotions than linear spectral features, as demonstrated by the research of Bou-Ghazale et al. [21]. In feature extraction, Han et al. [22] used a 16-dimensional feature set, where the first 9 dimensions were prosodic features and the remaining 7 were spectral features, including energy, formants, and the harmonics-to-noise ratio, and utilized a nonlinear proximal support vector machine based on a Gaussian kernel for classification. Hsiao et al. [23] extracted features such as fundamental frequency, frequency perturbation, zero-crossing rate, energy, harmonics-to-noise ratio, and MFCCs, and used a deep RNN model for classification, increasing the unweighted average recall (UAR) from 37.00% to 46.30% on the FAU-Aibo task. Meanwhile, Yuan et al. [24] extracted features including the mean zero-crossing rate, energy, fundamental frequency, formants, and MFCCs, and used a main-auxiliary network feature fusion method to achieve 72.50% unweighted accuracy on the IEMOCAP dataset.
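As a concrete illustration of the LLD + HSF pipeline described above, the sketch below computes MFCCs with librosa and applies a few high-level statistical functions over time. The choice of 13 MFCCs, a 16 kHz sampling rate, and mean/max/variance as HSFs are our own illustrative assumptions, not the exact configurations of the cited works.

```python
# Minimal sketch of an LLD + HSF utterance-level feature vector (illustrative settings only).
import numpy as np
import librosa

def llds_with_hsfs(wav_path, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)                 # load mono audio at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # LLD matrix of shape (n_mfcc, frames)
    # Apply high-level statistical functions (HSFs) over the time axis of each LLD
    hsfs = [mfcc.mean(axis=1), mfcc.max(axis=1), mfcc.var(axis=1)]
    return np.concatenate(hsfs)                              # one long utterance-level feature vector

# feature_vector = llds_with_hsfs("example.wav")             # shape: (3 * n_mfcc,)
```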

2.2. Popular SER Neural Network

Recurrent neural networks (RNNs) have been widely applied to SER tasks. RNNs are capable of modeling temporal dependencies within audio sequences. In contrast, convolutional neural networks (CNNs) [25] focus on extracting local information within sequences, while long short-term memory networks (LSTMs) [26] are used to handle long-term dependencies. Therefore, some studies have adopted a CNN+LSTM [27] structure, which has achieved good results. With the advancement of deep learning, the Transformer [28] was proposed and has been successfully applied to fields like machine translation, text classification, and text generation. In the realm of SER, there has been research applying the Transformer [29] to emotion recognition tasks, and a key-sparse Transformer [30] was proposed to better focus on emotional information. Recently, Google’s ViT team introduced a novel visual framework, MLP-Mixer, which uses multilayer perceptrons (MLPs) instead of the convolution operations in traditional CNNs and the self-attention mechanism in Transformers, achieving comparable results to Transformers. Additionally, by combining spatial projections and gated units [31], a method for capturing the most important emotional information within a global range was also proposed. Finally, Zhu [18] achieved good accuracy using GMLP as a classifier, proving the effectiveness of GMLP in the field of speech.

2.3. Hybrid Dilated Convolution

Dilated convolution is a convolution operation widely applied in the fields of computer vision and image processing. Compared with traditional convolution operations, dilated convolution introduces a dilation rate parameter to increase the receptive field of the convolution kernel, thereby capturing a larger range of contextual information while maintaining computational efficiency. The origin of dilated convolution can be traced back to 2015, when it was first proposed by Fisher Yu and Vladlen Koltun [32]. In that study, they introduced a new convolution operation, the dilated convolution, for image semantic segmentation tasks. Dilated convolution incorporates dilations (or holes) in the convolution kernel, sliding the kernel with a fixed sampling interval and thereby indirectly enlarging its receptive field. The dilation rate parameter determines the size of the sampling interval; when the dilation rate is 1, the dilated convolution is equivalent to the traditional convolution operation, and when the dilation rate is greater than 1, the dilated convolution expands the receptive field. Dilated convolution has achieved remarkable results in fields such as image segmentation, object detection, and image semantic analysis. For example, Chen et al. [33] extensively applied dilated convolution in the DeepLab series of methods proposed in 2017; by incorporating multiscale dilated convolutions, DeepLab can effectively capture context information at different scales, achieving excellent performance in image segmentation tasks. In addition, dilated convolution has been applied to tasks such as image super-resolution and image generation; for instance, Lin et al. [34] applied dilated convolution to image super-resolution reconstruction, achieving remarkable results. Although dilated convolution has a wide range of applications and developments in the image field, it has not yet been applied to SER, which is a unique aspect of our work. Our research fills this gap, exploring the potential of dilated convolution in SER.
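To make the dilation-rate mechanics concrete, the short PyTorch sketch below builds 3 × 3 convolutions with different dilation rates; the input shape and channel counts are arbitrary placeholders, and the effective kernel size follows the standard k + (k − 1)(a − 1) rule used later in Formula (1).

```python
# Sketch: how the dilation rate enlarges the effective kernel of a 3x3 convolution.
# Channel counts and input size are placeholders; only the dilation behaviour is the point.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 78, 57)                 # (batch, channels, freq, time) dummy input
for rate in (1, 2, 3):
    conv = nn.Conv2d(16, 16, kernel_size=3, dilation=rate, padding=rate)  # padding keeps size
    effective_k = 3 + (3 - 1) * (rate - 1)     # effective kernel: 3, 5, 7 for rates 1, 2, 3
    print(rate, effective_k, conv(x).shape)    # spatial dimensions are preserved by padding=rate
```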

3. Architecture Design

In the following sections, we will elaborate in detail on the architecture of the model we propose, as illustrated in Figure 1. The time domain features, frequency domain features, and time–frequency domain features, derived from the same audio segment, are represented as $X_t \in \mathbb{R}^{B_t \times C_t \times T_t \times F_t}$, $X_f \in \mathbb{R}^{B_f \times C_f \times T_f \times F_f}$, and $X_{tf} \in \mathbb{R}^{B_{tf} \times C_{tf} \times T_{tf} \times F_{tf}}$, respectively. The model consists of three main parts: the MFCC feature extraction module that produces the time–frequency features, the HDC module that extracts deep features, and the RA-GMLP blocks that perform classification.
The MFCC feature extraction module uses three parallel convolution blocks to extract the time domain information $X_t$, the frequency domain information $X_f$, and the time–frequency domain interaction information $X_{tf}$, which are then combined by the mapping $X_t \oplus X_f \oplus X_{tf} \rightarrow X_1$, yielding $X_1 \in \mathbb{R}^{B \times C \times F \times T}$. We then introduce a multiscale dilated convolution structure to enrich the features and capture deep time–frequency domain features of the speech signal. Subsequently, the resulting deep features $X_2 \in \mathbb{R}^{B \times C \times F \times T}$ are input into the RA-GMLP for further extraction and understanding of more comprehensive feature information. Finally, classification is performed through a fully connected layer, realizing the recognition and classification of different speech signals. This architecture effectively leverages the individual strengths of the MFCC time domain features, MFCC frequency domain features, and MFCC time–frequency domain interaction features, enhancing the model’s performance in complex speech signal processing tasks.

3.1. Feature Extraction Block

As Figure 2 shows, this method extracts time domain and frequency domain information separately, then concatenates the original time–frequency features to achieve a more effective feature fusion. Figure 1 shows the size variation of our features; we used a kernel size of 1 × 3 for temporal convolution. To better capture frequency domain information, we used a kernel size of 3 × 1 for frequency domain convolution. To maintain the connectivity between the time and frequency domains, we performed convolution on the overall features using a kernel size of 3 × 3, with the number of output channels set to 16.
In addition, batch normalization is performed after the tconv, fconv, and tfconv extraction. We concatenated the outputs of the temporal convolution, frequency domain convolution, and time–frequency domain convolution. After passing through the concatenation layer, we obtained the time–frequency domain features $X_1 \in \mathbb{R}^{B \times C \times F \times T}$ (where B is the batch size of 32, C is the number of channels, 16, F is the frequency dimension, 78, and T is the time dimension, 57). Through such processing, we are able to better capture the temporal and frequency domain information in speech signals, enhancing the model’s understanding of speech features.
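For illustration, the following PyTorch sketch mirrors the TFCM description above: a 1 × 3 temporal convolution, a 3 × 1 frequency convolution, and a 3 × 3 time–frequency convolution, each followed by batch normalization, with the three outputs concatenated. The padding choices are our assumptions, and the concatenation along the frequency axis is inferred from the reported output shape (78 = 3 × 26 frequency bins), so this is a sketch rather than the authors’ exact implementation.

```python
# Hedged sketch of the TFCM feature-extraction block described in Section 3.1.
# Kernel sizes (1x3, 3x1, 3x3) and 16 output channels follow the text; padding and the
# concatenation axis are assumptions consistent with the stated (32, 16, 78, 57) output.
import torch
import torch.nn as nn

class TFCM(nn.Module):
    def __init__(self, in_ch=1, out_ch=16):
        super().__init__()
        self.tconv = nn.Sequential(   # temporal convolution (1 x 3)
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, 3), padding=(0, 1)), nn.BatchNorm2d(out_ch))
        self.fconv = nn.Sequential(   # frequency convolution (3 x 1)
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)), nn.BatchNorm2d(out_ch))
        self.tfconv = nn.Sequential(  # joint time-frequency convolution (3 x 3)
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 3), padding=1), nn.BatchNorm2d(out_ch))

    def forward(self, x):             # x: (B, 1, F, T), e.g. a (32, 1, 26, 57) MFCC map
        xt, xf, xtf = self.tconv(x), self.fconv(x), self.tfconv(x)
        return torch.cat([xt, xf, xtf], dim=2)   # concatenate along the frequency axis

# Example: a (32, 1, 26, 57) MFCC tensor becomes (32, 16, 78, 57) after TFCM.
```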

3.2. Dilated Convolution

In the field of imaging, deep convolutional neural networks have been proven to effectively extract feature representations of images. However, we aim to achieve a larger receptive field with fewer parameters to capture richer features. To address this issue, dilated convolution is introduced, since it allows us to achieve our goals while reducing the parameter count. However, dilated convolution may lead to the gridding effect by inserting zeros between pixels in the convolution kernel, potentially losing important information. This paper therefore employs a hybrid dilated convolution block for further extraction of the time–frequency mixed features $X_1 \in \mathbb{R}^{B \times C \times F \times T}$. This hybrid dilated convolution block effectively avoids the gridding effect and can capture spectral feature maps while preserving more spatial information. As illustrated in Figure 1, the deep feature extraction module includes three dilated convolutional layers; after each convolution, we perform downsampling by adding a max pooling layer. As shown in Figure 3, the choice of dilation rates follows a heuristic similar to a sawtooth wave, sequentially using dilation rates of 1, 2, and 3. Using three consecutive dilated convolution blocks, the receptive field is 13 × 13, which is much larger than that of an ordinary convolution (in Figure 3, the bluer the color, the more times a feature point is covered). Following Formula (1), where $a$ is the dilation rate, $k_1$ is the original convolution kernel size, $k$ is the effective dilated kernel size, $p$ is the padding, $s$ is the stride, and $w$ is the feature dimension, we ultimately obtain $X_2 \in \mathbb{R}^{B \times C \times F \times T}$ (where B is the batch size of 32, C is the number of channels, 16, T is the temporal dimension, 18, and F is the frequency dimension, 12). By combining dilated convolutions with different dilation rates, we expand the receptive field, making the model more adaptable to the complexity of time–frequency domain features; we can therefore capture the emotional features of speech more richly, improving performance in emotion classification tasks. Finally, the features $X_3 \in \mathbb{R}^{B \times C \times D}$ are obtained through a reshape (change of dimension).
$$k = k_1 + (k_1 - 1)\cdot(a - 1), \qquad w' = \frac{w - k + 2p}{s} + 1 \tag{1}$$
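The sketch below assembles the hybrid dilated convolution block as described: three 3 × 3 dilated convolutions with rates 1, 2, and 3, each followed by max pooling. The kernel size, activation, and pooling configuration beyond what the text states are our assumptions, and the exact downsampling factors that yield the paper’s (32, 16, 12, 18) output are not fully specified, so the shapes here are illustrative.

```python
# Hedged sketch of the HDC block from Section 3.2: three 3x3 dilated convolutions with
# rates 1, 2, 3 (sawtooth schedule), each followed by max pooling. Details beyond the
# stated dilation rates are assumptions.
import torch
import torch.nn as nn

class HDCBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        layers = []
        for rate in (1, 2, 3):                      # sawtooth-style dilation schedule
            layers += [
                nn.Conv2d(channels, channels, kernel_size=3, dilation=rate, padding=rate),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2),        # downsampling after each dilated conv
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                           # x: (B, 16, 78, 57) from TFCM
        return self.net(x)

# Receptive field of the three stacked dilated convolutions (ignoring pooling), per
# Formula (1): effective kernels 3, 5, 7, so 3 + (5 - 1) + (7 - 1) = 13, i.e. 13 x 13.
```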

3.3. Residual Attention Perceptron Block

Figure 4 illustrates the overall structure of RA-GMLP. To compensate for the shortcomings of GMLP, we supplement it with an attention mechanism; we refer to the resulting structure, introduced in detail in this section, as the residual attention-gated multilayer perceptron (RA-GMLP). The input to RA-GMLP is $X_3 \in \mathbb{R}^{B \times C \times D}$ (where B represents the batch size of 32, C represents the channel size of 128, and D represents the speech feature dimension of 216). The self-attention mechanism represents content stored in memory as key-value pairs, where each element consists of an address (key) and a value. A query can match a key and, depending on the relevance between the query and the key, the corresponding value is retrieved from memory. As shown in Formula (2), the query, key, and value are typically multiplied by parameter matrices W to obtain Q, K, and V. By calculating attention scores, the model can capture complex contextual relationships between speech frames, enhancing the capability of feature expression.
$$Q = W_Q \cdot \mathrm{query}, \quad K = W_K \cdot \mathrm{key}, \quad V = W_V \cdot \mathrm{value}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{D_k}}\right)V \tag{2}$$
After the multihead attention layer, the dimensions of the output features remain unchanged. We then perform a linear projection along the D dimension, expanding the D dimension of the feature $X_3$ to 4D, and introduce nonlinearity with the GELU activation function. Next, we split the features along the D dimension into $D_1$ and $D_2$, where $D = D_1 + D_2$. In the gating mechanism, we use $f_{w,b}(D_2)$ as the gate to judge the importance of the information in $D_1$. Specifically, each position in $D_1$ is multiplied by the learnable gate $f_{w,b}(D_2)$, resulting in a new position representation. This representation retains some features of the original position while diminishing unimportant emotional features. Thus, through the gating mechanism, we can control the flow of information in $D_1$, improving the model’s expressiveness and generalization ability.
$$f_{w,b}(D) = wD + b$$
$$S(Z) = D_1 \times f_{w,b}(D_2)$$
$$X_4 = S(X_3) \tag{3}$$
As illustrated by Formula (3), we perform multiplication along the D dimension, resulting in new features $X_4$ that have the same dimensions as the input features $X_3 \in \mathbb{R}^{B \times C \times D}$. The synergistic operation of the attention and gating mechanisms enhances the model’s capability to capture information: the former reveals the complex contextual relationships between different features at the current moment, while the gating mechanism controls the retention of emotion-related information, accurately reflecting the dynamic variations in the speech signal. In this way, we can more accurately express the emotional information of the input features, thereby improving the model’s classification accuracy.
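The sketch below puts the pieces of RA-GMLP together as we read them: multihead self-attention over the input, a linear expansion to 4D with GELU, a split into two halves, and a gating product. The head count, normalization placement, the final projection back to D, and the residual wiring are our assumptions (Figure 4 is not reproduced here), so treat this as a minimal sketch rather than the authors’ exact block.

```python
# Hedged sketch of the RA-GMLP block (Section 3.3). Multihead attention weights the
# sequence, then a gated MLP retains the important half. LayerNorm, the output
# projection, and residual connections are assumptions.
import torch
import torch.nn as nn

class RAGMLPBlock(nn.Module):
    def __init__(self, d_model=216, n_heads=8, expand=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, expand * d_model)                  # D -> 4D
        self.act = nn.GELU()
        self.gate = nn.Linear(expand * d_model // 2, expand * d_model // 2)  # f_{w,b}
        self.proj_out = nn.Linear(expand * d_model // 2, d_model)            # back to D

    def forward(self, x):                              # x: (B, C, D); channels act as the sequence
        a, _ = self.attn(x, x, x)                      # multihead self-attention over channels
        h = self.act(self.proj_in(self.norm(a + x)))   # residual + expansion to 4D with GELU
        d1, d2 = h.chunk(2, dim=-1)                    # split along D into D1 and D2
        gated = d1 * self.gate(d2)                     # gating: D1 weighted elementwise by f(D2)
        return x + self.proj_out(gated)                # residual connection back to (B, C, D)

# Example: RAGMLPBlock()(torch.randn(32, 128, 216)).shape -> torch.Size([32, 128, 216])
```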

4. Detailed Analysis of the Dataset and Precise Configuration of Experimental Parameters

4.1. Analysis of the Dataset

In this section, we will provide a detailed introduction to the performance of our model and comprehensively evaluate the impact of different input variables and model components on the final performance through a series of meticulously designed ablation studies. To validate the effectiveness of our proposed model, we conducted experiments using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This dataset will be introduced in detail below.
The IEMOCAP dataset was recorded by the University of Southern California and contains about 12 h of multimodal audiovisual data, recorded in five sessions. The dataset is organized in two ways, one based on scripts and the other on improvisation, and includes performances by male and female professional actors in both scripted and improvised scenarios. The scripted part consists of the actors performing predetermined emotional scenes, while the improvised part is closer to natural speech and performance. Each sample is annotated with both dimensional and categorical emotion labels. The dataset includes a total of 5531 samples, with happiness accounting for 29.5%, sadness for 19.6%, anger for 19.9%, and neutrality for 30.8%.

4.2. Configuration of Model Parameters

In our research, the model is implemented in the PyTorch library with Python 3.6. We tested the models on a TITAN RTX GPU. Table 1 summarizes the parameters of our proposed model and the baseline models. For the experiments, we randomly divided the IEMOCAP dataset into an 80% training set and a 20% test set. Additionally, we processed each speech segment in the dataset by splitting it into 2 s clips with a 1.6 s overlap. In terms of feature extraction, we developed a time–frequency domain feature extraction method based on Mel-frequency cepstral coefficient (MFCC) features. Further, we introduced the hybrid dilated convolution technique to extract deeper features and proposed the RA-GMLP module to enhance performance. The experiments were divided into two main parts: the first three sets of experiments did not use the time–frequency domain feature extraction method, and their input size was 26 × 57; the latter three sets applied time–frequency domain feature extraction, resulting in an input size of 78 × 57. The training configurations for all experiments were consistent, using a learning rate of $10^{-3}$, a batch size of 32, and a total of 50 training epochs. We chose cross-entropy as the loss function and selected the Adam optimizer. The output size for all experiments was uniformly set to 32 × 4. Finally, we chose weighted average accuracy (WA), unweighted average accuracy (UA), training time, t-SNE scatter plots, and the loss curve as the evaluation criteria for our model.
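For reference, the following is a hedged sketch of the training configuration described above (Adam, learning rate $10^{-3}$, batch size 32, 50 epochs, cross-entropy loss). The `model` and `train_loader` objects are placeholders for illustration, and anything beyond the stated hyperparameters is an assumption.

```python
# Training-loop sketch matching the stated hyperparameters; `model` and `train_loader`
# are placeholders, and details beyond the stated settings are assumptions.
import torch
import torch.nn as nn

def train(model, train_loader, device="cuda", epochs=50, lr=1e-3):
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()                        # cross-entropy loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam with lr = 1e-3
    for epoch in range(epochs):                              # 50 epochs
        model.train()
        running_loss = 0.0
        for mfcc, label in train_loader:                     # batches of 32 (78 x 57) inputs
            mfcc, label = mfcc.to(device), label.to(device)
            optimizer.zero_grad()
            loss = criterion(model(mfcc), label)             # logits over the 4 emotion classes
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"epoch {epoch + 1}: loss {running_loss / max(1, len(train_loader)):.4f}")
```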

5. Detailed Analysis and Discussion of Experimental Results

5.1. Our Model Was Compared and Analyzed against Current Advanced Models

When evaluating the performance of deep learning models, accuracy and training efficiency are two crucial metrics. As seen in Table 2, our model shows a moderate per-epoch training time of 1.5 s: it is not as fast as the SPSL-MFCC model, which achieves the best training speed at 1.2 s per epoch, but it is far from the slowest models in this comparison, which require up to 3.5 s per epoch. In terms of accuracy, our model achieves a WA of 75.31% and a UA of 75.09%, surpassing all other competing models. While the SEQCAP-Spectrogram model comes closest to ours in WA, its training time per epoch is 2.7 s, noticeably slower than our model, indicating that our model maintains high training efficiency while achieving high accuracy.
Figure 5 illustrates the trend of loss values for the various deep learning models over the training epochs. In the initial phase of training, the SPSL-MFCC and APCNN-MFCC models showed lower loss values. Apart from the SEQCAP-Spectrogram model, the loss values of the other models declined rapidly. However, as training progressed, particularly by the 7th epoch, our model (indicated by the red line) began to show a clear advantage in loss reduction, a trend that was maintained and intensified in the subsequent training. By the 30th epoch, our model had reached the lowest loss value among the compared models, demonstrating optimal performance on the training set. This result clearly signifies the superior performance and effectiveness of our model.
Based on Table 3, we can observe the sample distribution of the IEMOCAP dataset that we have utilized. We have utilized four emotion categories from this dataset: neutral (neu), happy (hap), sad (sad), and angry (ang), with each emotion type having varying sample sizes. Notably, the angry category has the highest number of samples, while the sad category has the least. The sample size is also approximately balanced between genders, which is beneficial in avoiding gender bias impacting the model’s performance.
Figure 6 displays the dimensionality-reduced visualization of test data after model training. Blue represents ‘neu’, red represents ‘hap’, yellow represents ‘sad’, and green represents ‘ang’. From the visualization, it is evident that our model can distinctly differentiate between various emotional categories, with clear boundaries between each category, indicating robust classification capabilities. The balanced and diverse distribution of samples aids the model in learning more generalized features, thus enhancing the model’s performance on unseen data. Moreover, the substantial volume of samples provides the model with ample information to learn complex decision boundaries, which is particularly crucial for fine-grained tasks such as emotion classification. In Figure 6, it is apparent that our model has utilized these balanced data more effectively during training, leading to a better distinction of categories in the dimensionality reduction visualization. In summary, the balanced distribution of training samples is critical for the overall performance of the model, and our model excels in this regard, which also explains its superior classification results.

5.2. Investigating the Impact of Varying Dilation Rates in Hybrid Dilated Convolution on Model Performance

Figure 7 demonstrates how the receptive field changes with different dilation factors, while Figure 8 shows the accuracy of the three combinations of dilation factors we attempted for the HDC when extracting deep features. When the dilation factor of all three dilated convolution layers is set to 2, the model’s performance is lower than that of most of the models we compared; when all dilation factors are set to 3, the model’s performance is on par with current methods. This degradation is mainly due to the gridding effect caused by dilated convolutions. In contrast, when the dilation factors are set to 1, 2, and 3, respectively, there is a significant improvement in accuracy compared with existing SER models. Therefore, to avoid the gridding effect and fully leverage the advantages of dilated convolutions, we set the dilation factors of the three dilated convolution layers to 1, 2, and 3, respectively. This parameter setting not only prevents the gridding effect but also extracts features effectively, so we adopted this third configuration to ensure the correct application of dilated convolutions and to optimize performance.

5.3. Exploring the Specific Impact of Different Attention Heads on Model Performance

Multihead attention is one of the core components of the RA-GMLP model; it allows the model to capture emotional information across different representational subspaces by computing multiple sets of attention weights in parallel. The vectors from the different subspaces are then concatenated and passed through a linear transformation to produce the final output. For the processing of deep speech emotion features, we experimentally compared RA-GMLP with 1, 2, 4, and 8 attention heads against GMLP without attention. As shown in Figure 9, apart from RA-GMLP-Head2, which performs comparably to GMLP, the multihead attention variants significantly outperform GMLP, thanks to the attention mechanism’s ability to focus on key information.

5.4. A Detailed Comparative Analysis of the GMLP Mechanism and Its Enhanced Version, the RA-GMLP Mechanism

As shown in Figure 10, after incorporating the attention module, the accuracy rates for emotions other than “Anger” improved, while the accuracy for “Anger” became slightly lower. Moreover, it substantially eliminated the confusion set “Ang-Sad”. Based on the experimental results, we adopted the most outstanding 8-head attention mechanism in RA-GMLP to ensure the best model performance.

5.5. Conducted Ablation Studies and Provided an In-Depth Analysis of the Results

Figure 11 and Figure 12 clearly depict the trajectory of accuracy changes for our proposed modules during the first 50 epochs. To measure performance, we used two key indicators: WA and UA. If the sum of WA and UA does not improve in a given epoch, the reported WA and UA values are not updated for that epoch (i.e., the best values so far are carried forward).
From Figure 11 and Figure 12, it can be seen that from the 1st to the 10th epoch, there was a significant improvement in both WA and UA for all models. From the 10th to the 30th epoch, the growth in accuracy began to plateau, and over the subsequent 20 epochs the accuracy changed only slightly. Initially, the “HDC (w/TF)” model held the lead in accuracy, but as training progressed, the “HDC+RA-GMLP (w/TF)” model gradually surpassed the other models. The models using TFCM showed the best performance, both in the early stages and towards the end of training. After completing 50 epochs of training, the “HDC+RA-GMLP (w/TF)” model achieved the highest level of accuracy.
As seen in Figure 13, under the condition of using the same model architecture, we found that the method using TFCM significantly outperforms the traditional conv method in both WA and UA. For instance, without TFCM, the baseline model (Baseline w/o TF) achieved a WA accuracy of 70.10%, while with TFCM (w/TF), the accuracy increased to 71.03%. Similarly, when the HDC module and RA-GMLP module were introduced, the “HDC (w/TF)” and “HDC+RA-GMLP (w/TF)” models achieved WA accuracies of 74.19% and 75.31%, respectively. This is a respective increase of 4.29% and 1.16% over “HDC (w/o TF)” and “HDC+RA-GMLP (w/o TF)” without TFCM. Models utilizing TFCM also showed an increase in UA accuracy, proving the effectiveness of our proposed feature extraction method. As the first three bar graphs indicate, even without TFCM, “HDC (w/o TF)” showed a modest increase in WA and UA over the baseline model, but with the addition of the RA-GMLP module, WA and UA were respectively improved by 4.05% and 3.96%, indicating an overall increase in accuracy. In the last three bar graphs, we observed that the addition of the HDC module and RA-GMLP module, when using TFCM, brought respective improvements of 3.36% and 0.92% in WA, with a steady increase in UA accuracy as well. Therefore, our proposed modules demonstrated good performance in both types of feature extraction methods involved in this study.
As shown in Figure 14, without the use of TFCM, the HDC module improved the recognition accuracy for the “Anger (Ang)” and “Sadness (Sad)” emotions, but at the same time, it reduced the recognition accuracy for the “Happiness (Hap)” and “Neutral (Neu)” emotions. Nonetheless, the overall accuracy rate was still improved. The RA-GMLP module effectively reduced the confusion between “Neutral-Anger (Neu-Ang)” and “Sadness-Anger (Sad-Ang)”, resulting in an increase of 27 and 13 correctly identified samples for the “Happiness” and “Neutral” emotions, respectively. As depicted in Figure 15, with the use of TFCM, the model eliminated the confusion between “Happiness-Neutral (Hap-Neu)” and “Neutral-Sadness (Neu-Sad)”. The number of correctly recognized samples for the “Neutral (Neu)”, “Happiness (Hap)”, and “Anger (Ang)” emotions increased by 8, 9, and 4, respectively. Moreover, with the basis of time–frequency domain feature module extraction, the HDC module eliminated the confusion in “Neutral-Anger (Neu-Ang)”, significantly improving the recognition accuracy for the “Neutral” emotion, with an increase of 14 correct samples.

6. Conclusions

This study introduces an innovative SER architecture based on time–frequency domain features of speech. The architecture utilizes the time–frequency domain convolution module (TFCM) to extract time–frequency domain features from Mel-frequency cepstral coefficients (MFCCs). Experiments indicate that, compared with the baseline model, this feature extraction method can improve recognition performance by 1% over traditional methods. To broaden the receptive field of neurons and enrich feature representation, this paper is the first to introduce hybrid dilated convolution (HDC) into the task of emotion recognition. Through experimental analysis, it was found that the newly proposed residual attention-gated multilayer perceptron (RA-GMLP) can more accurately capture key emotional information in sequences, thereby achieving higher recognition accuracy. While the presented HDC and RA-GMLP demonstrate significant improvements in emotion recognition, their adaptability to other domains and to larger, more diverse datasets has not been fully assessed; this constitutes an avenue for future research. To comprehensively evaluate the performance of the model, tests were conducted on the IEMOCAP dataset. The results show that the proposed model achieved a WA of 75.31% and a UA of 75.09%, demonstrating superior recognition performance compared with current mainstream models. These promising results pave the way for future applications in more interactive and emotionally aware computing systems, such as adaptive e-learning platforms and empathetic customer service chatbots. Further research will focus on refining the model to cater to these applications while exploring the integration of real-time emotional feedback mechanisms.

Author Contributions

M.S. designed and performed a series of experiments, analyzed the results, and wrote the manuscript; W.Y. supervised the experiment and supervised the paper; F.W. guided the experiment; Z.L. supervised the ablation experiments and guided the analysis of the training sample and training time; M.C. helped revise the manuscript; C.M. helped analyze the results; L.Z. performed the investigation and visualization; H.S. assisted in the conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the “Tianshan Talent” Research Project of Xinjiang (No. 2022TSYCLJ0037), the National Natural Science Foundation of China (No. 62262065), and the National Key R&D Program of China (No. 2022ZD0115802).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schelinski, S.; Von Kriegstein, K. The relation between vocal pitch and vocal emotion recognition abilities in people with autism spectrum disorder and typical development. J. Autism Dev. Disord. 2019, 49, 68–82. [Google Scholar] [CrossRef] [PubMed]
  2. Paris, M.; Mahajan, Y.; Kim, J.; Meade, T. Emotional speech processing deficits in bipolar disorder: The role of mismatch negativity and P3a. J. Affect. Disord. 2018, 234, 261–269. [Google Scholar] [CrossRef] [PubMed]
  3. Hsieh, Y.H.; Chen, S.C. A decision support system for service recovery in affective computing: An experimental investigation. Knowl. Inf. Syst. 2020, 62, 2225–2256. [Google Scholar] [CrossRef]
  4. Lampropoulos, A.S.; Tsihrintzis, G.A. Evaluation of MPEG-7 descriptors for speech emotional recognition. In Proceedings of the 2012 Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Piraeus-Athens, Greece, 18–20 July 2012; pp. 98–101. [Google Scholar]
  5. Virvou, M.; Tsihrintzis, G.A.; Alepis, E.; Stathopoulou, I.O.; Kabassi, K. Emotion recognition: Empirical studies towards the combination of audio-lingual and visual-facial modalities through multi-attribute decision making. Int. J. Artif. Intell. Tools 2012, 21, 1240001. [Google Scholar] [CrossRef]
  6. Makiuchi, M.R.; Uto, K.; Shinoda, K. Multimodal emotion recognition with high-level speech and text features. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 350–357. [Google Scholar]
  7. Zhang, X.; Wang, M.J.; Guo, X.D. Multi-modal emotion recognition based on deep learning in speech, video and text. In Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 23–25 October 2020; pp. 328–333. [Google Scholar]
  8. Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech emotion recognition with co-attention based multi-level acoustic information. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7367–7371. [Google Scholar]
  9. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231. [Google Scholar]
  10. Tao, F.; Liu, G. Advanced LSTM: A study about better time dependency modeling in emotion recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2906–2910. [Google Scholar]
  11. Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444. [Google Scholar] [CrossRef]
  12. Arjun, A.; Rajpoot, A.S.; Panicker, M.R. Introducing attention mechanism for eeg signals: Emotion recognition with vision transformers. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Mexico City, Mexico, 1–5 November 2021; pp. 5723–5726. [Google Scholar]
  13. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  15. Ding, X.; Xia, C.; Zhang, X.; Chu, X.; Han, J.; Ding, G. Repmlp: Re-parameterizing convolutions into fully-connected layers for image recognition. arXiv 2021, arXiv:2105.01883. [Google Scholar]
  16. Qiu, Z.; Jiao, Q.; Wang, Y.; Chen, C.; Zhu, D.; Cui, X. rzMLP-DTA: GMLP network with ReZero for sequence-based drug-target affinity prediction. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 9–12 December 2021; pp. 308–313. [Google Scholar]
  17. Yan, J.; Ando, K.; Yu, J.; Motomura, M. TT-MLP: Tensor Train Decomposition on Deep MLPs. IEEE Access 2023, 11, 10398–10411. [Google Scholar] [CrossRef]
  18. Zhu, W.; Li, X. Speech emotion recognition with global-aware fusion on multi-scale feature representation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6437–6441. [Google Scholar]
  19. Pepino, L.; Riera, P.; Ferrer, L.; Gravano, A. Fusion approaches for emotion recognition from speech using acoustic and text-based features. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6484–6488. [Google Scholar]
  20. Laukka, P.; Neiberg, D.; Forsell, M.; Karlsson, I.; Elenius, K. Expression of affect in spontaneous speech: Acoustic correlates and automatic detection of irritation and resignation. Comput. Speech Lang. 2011, 25, 84–104. [Google Scholar] [CrossRef]
  21. Bou-Ghazale, S.E.; Hansen, J.H. A comparative study of traditional and newly proposed features for recognition of speech under stress. IEEE Trans. Speech Audio Process. 2000, 8, 429–442. [Google Scholar] [CrossRef]
  22. Han, Z.; Wang, J. Speech emotion recognition based on Gaussian kernel nonlinear proximal support vector machine. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 2513–2516. [Google Scholar]
  23. Hsiao, P.W.; Chen, C.P. Effective attention mechanism in dynamic models for speech emotion recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2526–2530. [Google Scholar]
  24. Yuan, Z.; Li, S.; Zhang, W.; Du, R.; Sun, X.; Wang, H. Speech Emotion Recognition Based on Secondary Feature Reconstruction. In Proceedings of the 2021 6th International Conference on Computational Intelligence and Applications (ICCIA), Xiamen, China, 11–13 June 2021; pp. 149–154. [Google Scholar]
  25. Liu, Z.T.; Han, M.T.; Wu, B.H.; Rehman, A. Speech emotion recognition based on convolutional neural network with attention-based bidirectional long short-term memory network and multi-task learning. Appl. Acoust. 2023, 202, 109178. [Google Scholar] [CrossRef]
  26. Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech emotion recognition with dual-sequence LSTM architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6474–6478. [Google Scholar]
  27. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control. 2019, 47, 312–323. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  29. Lian, Z.; Li, Y.; Tao, J.; Huang, J. Improving speech emotion recognition via transformer-based predictive coding through transfer learning. arXiv 2018, arXiv:1811.07691. [Google Scholar]
  30. Chen, W.; Xing, X.; Xu, X.; Yang, J.; Pang, J. Key-sparse transformer for multimodal speech emotion recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6897–6901. [Google Scholar]
  31. Liu, H.; Dai, Z.; So, D.; Le, Q.V. Pay attention to mlps. Adv. Neural Inf. Process. Syst. 2021, 34, 9204–9215. [Google Scholar]
  32. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  33. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  34. Lin, G.; Wu, Q.; Qiu, L.; Huang, X. Image super-resolution using a dilated convolutional neural network. Neurocomputing 2018, 275, 1219–1230. [Google Scholar] [CrossRef]
  35. Noh, K.J.; Jeong, C.Y.; Lim, J.; Chung, S.; Kim, G.; Lim, J.M.; Jeong, H. Multi-path and group-loss-based network for speech emotion recognition in multi-domain datasets. Sensors 2021, 21, 1579. [Google Scholar] [CrossRef] [PubMed]
  36. Wu, X.; Liu, S.; Cao, Y.; Li, X.; Yu, J.; Dai, D.; Ma, X.; Hu, S.; Wu, Z.; Liu, X.; et al. Speech emotion recognition using capsule networks. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6695–6699. [Google Scholar]
Figure 1. Architectural overview of our proposed model.
Figure 2. Structure of TFCM.
Figure 3. Dilated convolution blocks for SER.
Figure 4. Detailed view of our proposed RA-GMLP architecture.
Figure 5. Model comparisons with other methods on IEMOCAP (scripted + improvised) in terms of loss.
Figure 6. Visualized distribution plots of different models.
Figure 7. Selection of dilated convolution parameters. (The blue and green cells represent the regions for performing dilated convolution computations, where the blue cells are unaffected by the dilation factor).
Figure 8. Performance comparison of different dilations of HDC.
Figure 9. Performance comparison of multihead attention.
Figure 10. Performance comparison between GMLP and RA-GMLP-Head8.
Figure 11. We plot and compare the WA of the first 50 epochs for the proposed method.
Figure 12. We plot and compare UA of the first 50 epochs for the proposed method.
Figure 13. Performance comparison of six models at the 50th epoch.
Figure 14. Confusion matrices of three models on the IEMOCAP dataset without the application of TFCM, revealing true versus predicted emotions.
Figure 15. Confusion matrices of three models on the IEMOCAP dataset with the application of TFCM, revealing true versus predicted emotions.
Table 1. Experimental parameter settings.

| Parameter | Baseline (w/o TF) | Baseline (w/TF) | HDC (w/o TF) | HDC (w/TF) | HDC+RA-GMLP (w/o TF) | HDC+RA-GMLP (w/TF) |
| Segment length | 2 s | 2 s | 2 s | 2 s | 2 s | 2 s |
| Overlap | 1.6 s | 1.6 s | 1.6 s | 1.6 s | 1.6 s | 1.6 s |
| Batch size | 32 | 32 | 32 | 32 | 32 | 32 |
| Epochs | 50 | 50 | 50 | 50 | 50 | 50 |
| Learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| Feature input | (26, 57) | (78, 57) | (26, 57) | (78, 57) | (26, 57) | (78, 57) |
| Optimizer | Adam | Adam | Adam | Adam | Adam | Adam |
| Feature output | (32, 4) | (32, 4) | (32, 4) | (32, 4) | (32, 4) | (32, 4) |
Table 2. Model comparisons with other methods on IEMOCAP (scripted + improvised) in terms of WA, UA, and training time.

| Model | WA (%) | UA (%) | Time/Epoch |
| SPSL-MFCC - Mel-spec [35] | 60.00 | 58.00 | 1.2 s |
| SEQCAP-Spectrogram [36] | 72.73 | 59.71 | 2.7 s |
| APCNN-MFCC [18] | 69.00 | 67.00 | 1.5 s |
| MHCNN-MFCC [18] | 69.80 | 70.09 | 3.5 s |
| AACNN-MFCC [18] | 70.94 | 71.04 | 1.3 s |
| Ours | 75.31 | 75.09 | 1.5 s |
Table 3. The distribution of the IEMOCAP dataset.

| Emotion Type | Sample Size (Total) | Sample Size (Male) | Sample Size (Female) |
| neu | 929 | 466 | 463 |
| hap | 855 | 430 | 425 |
| sad | 653 | 251 | 402 |
| ang | 1071 | 531 | 540 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
