1. Introduction
As a cross-ethnic and cross-cultural feature shared by all mankind, emotion plays a significant role in human perception and decision-making, especially in increasingly popular human–computer interaction (HCI) systems [1]. Sentiment analysis is an important task in natural language processing (NLP), whose main purpose is to enable machines to understand human emotions [2]. As the main medium of people's daily communication, speech contains abundant emotional information [3]. As an important branch of affective computing, a speech emotion recognition (SER) system can be defined as a set of methods for processing and classifying speech signals to detect the emotions implicit in them [4]. Therefore, how to extract key emotional information from speech signals is a challenging research problem within the scope of speech processing, and it has attracted the attention of many researchers.
In order to clearly perceive the emotional changes in speech, extracting the most relevant acoustic features has always been a subject of intense interest in speech emotion recognition research [5]. Over the years, emotion recognition models have improved significantly with the introduction of deep neural networks. For emotion classification in SER, a common method uses discrete states to represent emotions, for instance, sadness and happiness. Some researchers believe that there is a certain continuity to how people express their emotions in daily life; therefore, the three variables of arousal, potency, and valence are used to express emotions in a three-dimensional (3D) continuous space [6].
Commonly used SER methods mainly include traditional classification or regression algorithms and deep learning methods. Traditional methods have lower algorithmic complexity and faster processing speeds. Commonly used examples include the support vector machine (SVM), the Gaussian mixture model (GMM), the hidden Markov model (HMM), and random forest (RF). The SVM seeks the hyperplane with the largest margin in the sample space to produce more robust classification results; for more complex samples, the data can be mapped from the original space to a higher-dimensional space. The GMM classifies data by fitting a probabilistic model through unsupervised learning, which has guided many subsequent methods. The HMM can estimate and predict unknown variables from observed data; moreover, it can efficiently improve the match between the evaluation model and the observation sequence. RF has a simple structure and low computational cost, can be used for both classification and regression problems, and can maintain high classification accuracy even on incomplete datasets. With advances in computing hardware and sensor technology, deep learning methods are being chosen by more and more researchers. Compared with some classic speech emotion recognition models, deep neural networks have clear advantages in training speed and recognition performance.
Deep learning methods can extract speech's emotional features more accurately, and in recent years many speech emotion recognition models have incorporated deep learning. ED-TTS [7] utilizes speech emotion diarization and speech emotion recognition to model emotions at different levels; this method can obtain fine-grained frame-level emotional information. Zhou et al. [8] proposed an end-to-end speech emotion recognition system based on multi-level acoustic information. The model uses MFCCs, spectrograms, and embedded high-level acoustic information to construct a multimodal feature input and then uses a co-attention mechanism for feature fusion. In such architectures, the neural network mainly models the observation probability of speech. Compared with the traditional support vector machine model, Stuhlsatz et al. [9] adopted a DNN architecture to learn discriminative representations from traditional statistical Log-Mel spectrogram features and achieved higher accuracy. At the same time, owing to the excellent performance of end-to-end networks in the image domain, more and more researchers are applying neural networks to speech emotion recognition to extract emotional features and their representations [10]. Kim et al. [11] applied CNNs to sentence-level classification tasks and achieved good results. In addition, to improve the performance of current models in sentiment analysis and question classification, they suggested simple modifications of the CNN architecture to allow the simultaneous use of task-specific and static vectors. Badshah et al. [12] used a CNN architecture with rectangular filters and demonstrated its effectiveness in smart medical centers. In speech emotion recognition, the recurrent neural network (RNN) has also proven to be an effective approach to sentiment analysis problems. Haşim Sak et al. [13] used LSTM, a recurrent neural network architecture, and achieved good results. Fei et al. [14] introduced an advanced long short-term memory (A-LSTM) method, which employs a pooled RNN to learn sequences and performs better than a simple LSTM. Since the attention mechanism makes a model focus on the most relevant parts of the input when producing its output, it has been extensively verified in many fields [15]. Chung-Cheng Chiu et al. [16] introduced a multi-head attention (MHA) method to improve an existing automatic speech recognition (ASR) framework. In reference [17], to improve the recognition performance of the framework, multi-head attention was used to attend to information from different positions in different representation subspaces.
A spectrogram is a time–frequency decomposition of a voice signal, showing how its frequency content changes over time. In this paper, we build on recent work on deep learning and spectrograms in speech processing and propose a novel speech emotion recognition architecture based on multi-head attention and a bidirectional gated recurrent unit (Bi-GRU) network. For speech data, it is challenging to evaluate and compare different methods because no sufficiently comprehensive public corpus of labeled emotional speech exists [18]. We selected the IEMOCAP [19] and Emo-DB [20] corpora to evaluate our proposed model. These corpora contain a large amount of speech data, annotated at the sentence level, and have been used in state-of-the-art research [21].
Throughout the research history of speech emotion recognition, recognizing emotion from the paralinguistic components of speech has attracted a great deal of research interest [10,11]. Common sentiment classification research mainly focuses on classification and regression over extracted features, including short-term frame-level information and utterance-level information [16,17]. In recent research on feature recognition and classification, in contrast to traditional handcrafted low-level (frame-level) features, researchers have performed statistical learning on speech features at all levels of a deep network in order to improve the classification accuracy of the entire system. At present, most speech emotion recognition research uses architectures composed of neural networks, including CNNs, RNNs, LSTMs, or their combinations [22].
As a typical neural network, the CNN is often designed to process data with a grid-like topology, for instance, in voice recognition and image processing. By applying correlation filters, a CNN can obtain more temporal and spatial feature information from the input signal. It then uses its own structure to simplify the original input signal without losing the feature form, which effectively improves classification performance while reducing algorithmic complexity [23]. Trigeorgis et al. [24] used a CNN to preprocess the raw samples, which can obtain more audio feature information beneficial for classification; their experimental results show the advantages of the proposed method in the field of speech recognition. To find more feature information beneficial to the classification results, Mao et al. [25] proposed a strategy of using a CNN to learn features for support vector machines and achieved good results. In addition to CNNs, RNNs have shown great capacity in many sequence modeling tasks [26]. Kerkeni et al. [27] put forward a speech emotion recognition method in which MFCC and modulation spectral (MS) features are extracted, and seven emotions from the Emo-DB and Spanish databases are then classified using an RNN learning algorithm. Zhao et al. [28] proposed a new CNN–LSTM structure; they used the Log-Mel spectrogram as the input of the entire model to extract time-series feature information from the speech. In addition, compared with DNNs, Tara N. Sainath et al. [29] found that CNNs and LSTMs achieve better classification results in a wide variety of speech recognition tasks. Chen et al. [30] proposed a multi-scale fusion framework, STSER, based on speech and text information; they took the log-spectrogram as the model input and used neural networks (CNN, Bi-LSTM) and an attention mechanism to train on and classify the source data.
Li Yang [31] put forward a new sentiment analysis framework, SLCABG, used to analyze the emotions of large numbers of customers reviewing products on e-commerce platforms. The model is based on a sentiment lexicon and combines a convolutional neural network with an attention-based Bi-GRU to achieve better recognition performance. Over the past few years, the attention mechanism of deep learning has been successful in speech emotion recognition [32,33,34], as it tends to focus on the emotionally charged words in people's communication. Feng Xuanzhen and Liu Xiaohong [35] designed a Bi-GRU and attention-based structure to focus on fine-grained emotional features that are easily overlooked by humans, aiming to capture the scattered emotional feature information in review texts. Po-Yao Huang et al. [36] proposed an attention-based training model with a variable number of heads, drawing on visual object detection technology; this model is mainly used to obtain feature information from the speaker's voice, face, and other modalities. Given the excellent performance of end-to-end networks in speech recognition, Tomoki Hayashi [37] introduced them into the multi-head attention model in the expectation of better recognition results.
In general, current SER faces the following problems: (1) inconsistent feature scales make it difficult to balance the importance of local features and long-sequence emotional features; (2) redundant information limits the accuracy and stability of SER; and (3) overly complex models tend to overfit to a specific task or dataset.
Based on the above factors, we propose a speech emotion recognition model. The main contributions of this paper can be summarized as follows:
We propose a new SER model, which combines a double Bi-GRU with multi-head attention and learns speech representations from spectrograms extracted from the speech data.
During training, CNN layers are first applied to learn local associations from the spectrograms, and dual Bi-GRU layers are used to learn long-term correlations and contextual feature information. Second, multi-head attention is used to focus on emotion-related features. Finally, a softmax layer outputs the emotion categories to improve overall performance. Experimental verification shows that our proposed model has better classification performance, with unweighted accuracies on the speaker-independent IEMOCAP and Emo-DB emotion datasets of 75.04% and 88.93%, respectively. In contrast to the methods proposed in references [38,39], our model achieves better classification performance in speech emotion recognition. Moreover, the gated recurrent unit (GRU) is more efficient to train than long short-term memory [40].
We also conducted training on different tasks and datasets to verify the generalization performance and stability of the model. Finally, we analyzed and summarized the experimental results and proposed possible directions for future research.
2. Materials and Methods
In this section, a new speech emotion recognition model is proposed, which combines the Bi-GRU network and attention mechanism for the extraction of speech signal emotion features.
Figure 1 shows the network structure we constructed, mainly including the spectrogram feature input, gated recurrent unit, multi-head attention layer, and softmax layer, which are used for the extraction, training, and classification of speech emotion information. The Bi-GRU layer is used to learn long-term correlation and contextual feature information. Multi-head attention is used to focus on emotion-related features. The softmax layer is used to output various emotions to improve the overall performance.
2.1. Spectrogram Extraction
As described in [38], the spectrogram performs well in speech emotion recognition. In the preparation stage of our experiment, since the spectrogram retains rich frequency-domain information, we adopted it as the input of the proposed speech emotion model. We obtained the spectrogram by taking the short-time Fourier transform of the original speech signal. To preserve the integrity of the emotional information in the original signal, we sampled the speech in the experimental dataset at 16 kHz and organized it into single sentences with durations ranging from under a second to about 20 s. In our experiments, we used the IEMOCAP and Emo-DB corpora and selected emotion subsets of similar size. This approach helps us comprehensively analyze the model's classification performance for each type of emotion during training. To show the proportion of each emotion in the corpora more intuitively, the number of sentences selected for each emotion is listed in Table 1. Every sentence is marked with at least one emotion. For the overlapping Hamming window sequence, we set the frame step to 10 ms and the frame length to 40 ms. For each frame, the DFT length used in the experiment is 1600; in other words, we used a 10 Hz frequency grid resolution. Since the frequency of the human voice generally lies in the range 300–3400 Hz, and after repeated experimental verification and analysis, we finally kept the frequency range 0–4 kHz and ignored the rest. After assembling the short-time spectra, we obtained a matrix of size N × M, where N corresponds to the number of time frames (the selected temporal grid resolution) and M to the number of frequency bins (the frequency grid resolution). After obtaining the DFT data, we converted it into a logarithmic power spectrum, which was then z-normalized using the mean and standard deviation of the training dataset.
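The extraction pipeline above can be sketched as follows. This is a minimal NumPy illustration under the stated settings (40 ms Hamming frames, 10 ms hop, 1600-point DFT, 0–4 kHz band), not the authors' exact implementation; for simplicity it z-normalizes per utterance rather than with training-set statistics, and the function name is our own.

```python
import numpy as np

def log_spectrogram(signal, sr=16000, frame_ms=40, hop_ms=10,
                    n_fft=1600, fmax=4000):
    """Log-power spectrogram: 40 ms Hamming frames, 10 ms hop,
    1600-point DFT (10 Hz grid), keeping only the 0-4 kHz band."""
    frame_len = sr * frame_ms // 1000          # 640 samples
    hop = sr * hop_ms // 1000                  # 160 samples
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    n_bins = int(fmax / (sr / n_fft)) + 1      # 401 bins cover 0-4 kHz
    log_spec = np.log(spec[:, :n_bins] + 1e-10)
    # the paper z-normalizes with training-set mean/std; per-utterance here
    return (log_spec - log_spec.mean()) / (log_spec.std() + 1e-10)

x = np.random.randn(16000)                     # 1 s of dummy audio
S = log_spectrogram(x)
print(S.shape)                                 # (97, 401): frames x freq bins
```

In this sketch, N = 97 time frames and M = 401 frequency bins for a one-second input, matching the N × M matrix described above.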
Sentence lengths of speech samples usually differ. To improve computational efficiency, we sort the samples by length: spectrograms with similar temporal lengths are organized into the same batch and zero-padded to the maximum spectrogram length in that batch. In the training stage of our proposed network, each batch of samples is processed in parallel.
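The sort-and-pad batching described here can be sketched as follows; this is a minimal NumPy illustration, and the function name and batch size are our own illustrative choices.

```python
import numpy as np

def make_batches(specs, batch_size=4):
    """Sort variable-length spectrograms by duration, group neighbours
    into batches, and zero-pad each batch to its own longest sample."""
    order = sorted(range(len(specs)), key=lambda i: specs[i].shape[0])
    batches = []
    for start in range(0, len(specs), batch_size):
        group = [specs[i] for i in order[start:start + batch_size]]
        T = max(s.shape[0] for s in group)      # batch-local max length
        padded = np.stack([np.pad(s, ((0, T - s.shape[0]), (0, 0)))
                           for s in group])
        batches.append(padded)
    return batches

specs = [np.ones((t, 401)) for t in (50, 120, 60, 115)]
b = make_batches(specs, batch_size=2)
print(b[0].shape, b[1].shape)   # (2, 60, 401) (2, 120, 401)
```

Because similar lengths share a batch, each batch pads only to its own maximum, so far less computation is wasted on zeros than with a global maximum length.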
2.2. The Bi-GRU Layer
The GRU network, a variant of LSTM, was proposed to solve problems such as long-term memory and gradients in backpropagation. Compared with the LSTM network, it can obtain the contextual feature information of the input speech with lower computational complexity. Speech signals are closely related to the time dimension: in conversation, people usually form what they will say next based on the utterance information of the preceding moments. In our work, the GRU layer likewise exploits its strengths in the time dimension; in other words, the experiment first analyzes the emotional tone of the whole utterance and then analyzes the local emotional features within it in depth. In their study of the GRU network, Cho et al. [41] tested the GRU and LSTM models under the same experimental conditions. Their results show that the GRU network has lower algorithmic complexity in training and that its recognition performance is also better [8].
For the traditional GRU network, the emotional features in speech are usually extracted and analyzed in temporal order, and the potential emotion of the next moment is inferred from the emotion of the previous moment. In practice, however, emotional changes in daily conversations are quite complex, and the current speech may also be related to future speech: humans express the current words with a certain emotion, and that emotion keeps changing as the conversation progresses. Considering this, and based on our experimental verification, adding a Bi-GRU layer to the network can better compensate for part of the emotional feature loss caused by a unidirectional GRU network. A typical GRU architecture mainly includes a reset gate and an update gate. Intuitively, the reset gate determines how the new input is combined with the previous memory, and the update gate defines how much of the previous memory is carried to the current time step. Quantitative calculations of the two gates are shown in Equations (1) and (2). When x_t is input at time t, the bidirectional GRU network obtains the forward hidden state h→_t and the backward hidden state h←_t.
Both the forward and backward passes of the Bi-GRU include the calculation of the reset gate r_t, the update gate z_t, and the hidden state h_t. Their calculation processes are as follows:
r_t = σ(W_r x_t + U_r h_{t−1} + b_r),  (1)
z_t = σ(W_z x_t + U_z h_{t−1} + b_z),  (2)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h),  (3)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,  (4)
where σ represents the sigmoid function, W and U represent weights, b represents bias, and ⊙ denotes element-wise multiplication.
The hidden state outputs at time t are h→_t and h←_t, and the combined state is h_t = [h→_t; h←_t]. The global context information is then obtained by combining h→_t and h←_t to capture the contextual feature information vector of the speech.
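To make the gate equations concrete, the following is a minimal NumPy sketch of one bidirectional pass: a reset gate, an update gate, a candidate state, and interpolation with the previous hidden state, followed by per-step concatenation of the forward and backward states. The function names, weight scale, and dimensions are our own illustrative choices, not the paper's configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step: reset gate r, update gate z, candidate state,
    then interpolation between the previous and candidate states."""
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde

def init_params(d_in, d_h, rng):
    p = {}
    for g in "rzh":                          # reset, update, candidate
        p["W" + g] = 0.1 * rng.standard_normal((d_h, d_in))
        p["U" + g] = 0.1 * rng.standard_normal((d_h, d_h))
        p["b" + g] = np.zeros(d_h)
    return p

def bi_gru(X, p_fwd, p_bwd, d_h):
    """Run the sequence forwards and backwards, then concatenate the
    two hidden states at every step: h_t = [h_fwd_t; h_bwd_t]."""
    T = len(X)
    hf, hb = np.zeros((T, d_h)), np.zeros((T, d_h))
    h = np.zeros(d_h)
    for t in range(T):                       # forward pass
        h = gru_step(X[t], h, p_fwd)
        hf[t] = h
    h = np.zeros(d_h)
    for t in reversed(range(T)):             # backward pass
        h = gru_step(X[t], h, p_bwd)
        hb[t] = h
    return np.concatenate([hf, hb], axis=1)  # shape (T, 2*d_h)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))             # 20 frames, 5 features each
H = bi_gru(X, init_params(5, 8, rng), init_params(5, 8, rng), d_h=8)
print(H.shape)                               # (20, 16)
```

Each output row thus carries both past context (forward state) and future context (backward state) for that frame, which is what the Bi-GRU layer feeds to the attention stage.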
2.3. The Multi-Head Attention
Over the past few years, since the attention mechanism of deep learning networks lets a classifier attend to specific positions of a given sample according to the attention weight of each part of the input, it has often been applied to speech emotion recognition. The attention function mainly involves four vectors: query, key, value, and output. Generally speaking, a typical attention mechanism automatically learns and calculates the contribution of the input data to the output data by computing a mapping between keys and values. During training, the model predicts the current time step based on the input data and the historical outputs of the neurons and finally obtains the weight of each dimension of the input data. Each value in this process is weighted by a compatibility function of the query with the corresponding key. The traditional attention mechanism concentrates the acquired context vector on one specific representation subspace of the input sequence, so the resulting context vector reflects only one aspect of the semantics of the input. In general, however, a sentence may involve multiple semantic spaces, especially for a long input sequence.
In order to achieve better classification performance during model training, a multi-head attention mechanism is introduced into our Bi-GRU model. Multi-head attention can represent different subspaces according to the number of heads and jointly attend to information from different positions:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),
Attention(Q, K, V) = softmax(Q K^T / √d_k) V,
where the projections are the parameter matrices W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, and W^O ∈ R^{h·d_v × d_model}.
In this work, after obtaining the output vectors of the GRU network, we find it beneficial to linearly project the queries, keys, and values h times to d_k, d_k, and d_v dimensions, respectively, using different learned projections. We then execute the attention function in parallel on each projected query, key, and value, producing d_v-dimensional output values. Finally, these vectors are concatenated and projected once more to obtain the final values.
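A minimal NumPy sketch of this computation is shown below. It is illustrative only: slicing a single shared weight matrix per head is a common simplification of the per-head projections W_i^Q, W_i^K, W_i^V, and the dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, h=4):
    """Scaled dot-product attention in h parallel subspaces,
    concatenated and projected: MultiHead(Q, K, V)."""
    d_model = Q.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)     # this head's subspace
        q, k, v = Q @ Wq[:, s], K @ Wk[:, s], V @ Wv[:, s]
        scores = softmax(q @ k.T / np.sqrt(d_k))   # (T, T) attention map
        heads.append(scores @ v)                   # (T, d_k)
    return np.concatenate(heads, axis=-1) @ Wo     # (T, d_model)

rng = np.random.default_rng(0)
T, d_model = 20, 16
H = rng.standard_normal((T, d_model))              # e.g. Bi-GRU outputs
Wq, Wk, Wv, Wo = (0.1 * rng.standard_normal((d_model, d_model))
                  for _ in range(4))
out = multi_head_attention(H, H, H, Wq, Wk, Wv, Wo)  # self-attention
print(out.shape)                                     # (20, 16)
```

Here the queries, keys, and values all come from the Bi-GRU output (self-attention), so each head can weight different emotionally salient frames of the utterance.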
The overall process of the model is shown in Algorithm 1.
Algorithm 1. The pseudocode of the model.
Input: Time-frequency features X = (x_1, …, x_T)
Output: Emotion categories and probabilities Y, P
// Bi-GRU algorithm (forward and backward)
h→_t ← GRU(x_t, h→_{t−1}); h←_t ← GRU(x_t, h←_{t+1}); h_t ← [h→_t; h←_t]
// Multi-head attention algorithm
head_i ← Attention(H W_i^Q, H W_i^K, H W_i^V); M ← Concat(head_1, …, head_h) W^O
// Linear layer and output layer
Y, P ← softmax(W M + b)
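The final classification step of Algorithm 1 can be sketched as follows. This is an illustrative NumPy fragment: mean-pooling the attention output over time before the linear layer is our own assumption, and all names and sizes are hypothetical.

```python
import numpy as np

def classify(M, W, b):
    """Linear layer followed by softmax: returns the predicted emotion
    index Y and the probability vector P over the emotion classes."""
    logits = M.mean(axis=0) @ W + b      # pool over time, then project
    e = np.exp(logits - logits.max())    # numerically stable softmax
    P = e / e.sum()
    return int(P.argmax()), P

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 16))        # attention-layer output
W = 0.1 * rng.standard_normal((16, 4))   # e.g. 4 emotion classes
Y, P = classify(M, W, np.zeros(4))
```

The softmax guarantees that P sums to one, so each component can be read directly as the probability of the corresponding emotion category.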
5. Conclusions
This paper proposes a new speech emotion recognition model, which combines the Bi-GRU network and the multi-head attention mechanism. On the IEMOCAP and Emo-DB datasets, the proposed model achieves 75.04% and 88.93% unweighted accuracy, respectively. Compared with the results in [38], our results are a 10.82% improvement on the IEMOCAP dataset. Compared with the results obtained by Mustaqeem et al. [39], the experimental results improved by 2.79% and 3.36% on the IEMOCAP and Emo-DB datasets, respectively. In addition, the model improved by 4.40% and 5.71%, respectively, over the baseline (Bi-GRU) [44]. Compared with most advanced speech emotion recognition models, our model performs better on the IEMOCAP and Emo-DB datasets. We further compared the proposed model with common deep learning methods, and the emotion recognition model based on Bi-GRU and multi-head attention achieved the best recognition results. In addition, we applied the model to the speech emotion analysis task and achieved more stable prediction results across different datasets. In these experiments, we used rich evaluation metrics to analyze the model more comprehensively and to prevent it from overfitting to specific parameters. Finally, we analyzed the experimental results.
SER is of great significance in emotion recognition. Owing to the variability of emotions, a piece of speech often carries multiple emotions, which makes the accurate extraction of speech emotional features challenging. Across multiple languages, cross-cultural emotion recognition is a future development trend: people in different countries and regions have certain cultural differences, but even when humans cannot understand what a foreign speaker is saying, they can roughly understand the tone and attitude. In the future, the cost of training the model may be of interest, especially on long input sequences. Our idea for future work is similar to that of Nikita Kitaev [45]: in order to reduce the performance loss of the multi-head attention layer, we will add locality-sensitive hashing (LSH) attention in our subsequent research.
The experimental results show that the recognition accuracy of our model for one specific emotion has not improved greatly. Our analysis suggests that this may be because the speech features of the emotion of calm are not obvious and our model does not use complex feature fusion methods. In the future, we will consider research on feature fusion. At the same time, speech is only one facet of emotion; combining text information with speech recognition should significantly improve emotion recognition performance. This research direction is also more in line with the development of artificial intelligence.
At the same time, multimodal sentiment analysis [46] is also very popular; it carries richer emotional information and is more comprehensive in predicting and identifying human emotions. Therefore, our model can be improved and applied to more scenarios with the help of advanced methods such as multi-task learning, transfer learning, or joint learning.