1. Introduction
The human voice carries rich information, and spoken interaction is the most direct and convenient form of communication between people. In the era of artificial intelligence, voice is of great significance for human–computer interaction, and people hope to give machines the same voice interaction capabilities as humans. Since the early 21st century, with the rapid development of voice technology, voice interaction has no longer been limited to the textual content that the speaker wants to express. How to make smart devices understand and judge the emotional state of the speaker has become a new problem and a research hotspot. Speech emotion recognition (SER) [1,2,3,4] refers to technology by which a computer infers the emotional state of the speaker by analyzing and processing the speech signal collected from a sensor. SER enables smart devices to better understand the user's intention and thus to serve people better.
In recent years, researchers have mainly studied SER from two aspects: the construction of speech emotion databases and the extraction of speech emotion features. Feature extraction methods can be divided into two categories. One is traditional handcrafted features. The traditional static model obtains the emotional features of each sample in two steps. The first step extracts low-level descriptors from the speech signal, such as the harmonic-to-noise ratio, fundamental frequency, and zero-crossing rate. In the second step, statistical aggregation functions, including the mean, variance, skewness, kurtosis, linear regression coefficients, and quartiles, are applied to each low-level descriptor to form a feature vector that expresses the temporal variation and contour of that descriptor at the sentence level [5]. However, traditional machine learning methods based on such high-level statistical features cannot effectively utilize the contextual information of the original speech signal, resulting in a low recognition rate.
The other category is deep feature extraction based on deep learning, which has shown outstanding performance in SER tasks in recent years [6,7,8]. Compared with traditional handcrafted feature extraction, deep neural networks (DNNs) are able to extract task-specific hierarchical feature representations from a large number of training samples through supervised learning. The authors of [9] designed a speech emotion recognition system that combines a convolutional neural network (CNN) with a recurrent neural network (RNN) and uses the spectrogram as the model input; the recognition rate on the EMO-DB dataset reached 88.01%, demonstrating the effectiveness of the CRNN model in the speech emotion classification task. Zhong et al. [10] introduced an attention mechanism into the CRNN model, which improved the weighted accuracy by 7% compared with the original CRNN network and showed that the attention mechanism can effectively improve the accuracy of speech emotion recognition. Li et al. [11] used a one-dimensional convolutional network as the basis of a speech emotion recognition system, with the Mel spectrogram and the Log-Mel spectrogram combined as the input, and demonstrated the reliability of this method. Zhang et al. [12] proposed a speech emotion recognition method based on a fully convolutional network and an attention mechanism, which achieved an unweighted accuracy of 63.9% on the IEMOCAP dataset.
Building on an analysis of existing research, this paper proposes an SER model based on the AA-CBGRU network. We first extract the spectrogram and its first- and second-order derivatives from the original speech signal and feed these features into an advanced convolutional neural network to extract spatial features. A residual network is introduced to avoid the vanishing gradient problem caused by the deep network structure. Finally, a bidirectional gated recurrent unit (BGRU) network with an attention mechanism is designed to strengthen the learning of time series information and improve the classification performance of the model.
2. Speech Emotion Recognition Model Based on AA-CBGRU Network
The architecture of the speech emotion recognition model based on the AA-CBGRU network proposed in this paper is shown in Figure 1. The model consists of four parts: data input, spatial feature extraction, time series feature extraction, and classification.
The data input part first preprocesses the original speech signal and then extracts the spectrogram and its first- and second-order derivatives as the input of the advanced convolutional neural network. The spatial feature extraction part extracts spatial features through the convolutional neural network with a residual structure. The time series feature extraction part then feeds the extracted spatial features into the BGRU layer along the time axis to capture the time series features of the speech signal. The classification part passes the output of the attention layer to the fully connected layer, and the model finally performs emotion classification through the SoftMax layer.
2.1. Feature Selection
Static features such as the logarithmic Mel-spectrogram cannot reflect the process of emotional change [13], whereas dynamic features can reflect the changing process of emotion, retain effective emotional information, and reduce the influence of emotionally irrelevant factors such as content, environment, and speaker. Therefore, the model in this paper uses the spectrogram together with its first- and second-order derivatives, which contain dynamic features, as the input of the improved convolutional neural network.
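For a given speech signal, the power spectrum of each frame is obtained by the discrete Fourier transform, and the output $p_i$ is obtained by passing it through the Mel filter bank. Then the Log-Mel static feature $m_i$, the first-order difference feature $m_i^{d}$, and the second-order difference feature $m_i^{dd}$ are calculated as follows:

$$m_i = \log(p_i)$$

$$m_i^{d} = \frac{\sum_{n=1}^{N} n\,(m_{i+n} - m_{i-n})}{2\sum_{n=1}^{N} n^{2}}$$

$$m_i^{dd} = \frac{\sum_{n=1}^{N} n\,(m_{i+n}^{d} - m_{i-n}^{d})}{2\sum_{n=1}^{N} n^{2}}$$

where $i$ is the frame index, $N$ is set to 2 following common practice, and $n$ is the summation variable.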
After the spectrogram and its first- and second-order derivatives are calculated, the 3D feature $X \in \mathbb{R}^{t \times f \times c}$ is used as the input of the convolutional neural network, where $t$ represents the time (frame) length, $f$ represents the number of Mel filter banks, and $c$ represents the number of feature channels (here $c = 3$: the static, first-order, and second-order features).
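The feature construction described in this subsection can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the small constant added before the logarithm, and the edge handling (repeating the first and last frame) are assumptions, since the text does not specify them. The difference uses a window of two frames on each side, as stated above.

```python
import numpy as np


def delta(feat: np.ndarray, N: int = 2) -> np.ndarray:
    """Regression-style difference along the time axis.

    feat has shape (frames, mel_bands). Boundary frames are handled by
    repeating the first and last frame (assumption, not from the paper).
    """
    T = feat.shape[0]
    padded = np.pad(feat, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feat, dtype=float)
    for n in range(1, N + 1):
        # padded[N + i + n] is frame i + n, padded[N + i - n] is frame i - n
        out += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return out / denom


def three_channel_input(mel_power: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Stack static, first-order and second-order features into (t, f, 3)."""
    static = np.log(mel_power + eps)  # eps avoids log(0); assumed value
    d1 = delta(static)
    d2 = delta(d1)
    return np.stack([static, d1, d2], axis=-1)
```

For a feature that increases by exactly one unit per frame, `delta` returns 1 at every interior frame, which is a quick sanity check on the normalization term in the denominator.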
2.2. Advanced Convolutional Neural Network (Advanced CNN, A-CNN)
In recent years, many researchers have used convolutional neural networks (CNNs) to process spectrogram features in speech emotion recognition tasks [14,15,16]. In a deep neural network model, the front end of the network generally contains many convolutional layers, each of which describes the information gathered from part or all of the input features. A deeper network structure widens the receptive field and can therefore achieve a better recognition effect. However, practice has shown that a network structure that is too deep increases the risk of vanishing gradients.
In order to extract more effective features and avoid the risk of vanishing gradients, this paper adds a residual network [17] to the network architecture. As shown in the feature representation module in Figure 1, a convolutional neural network with residual blocks is constructed in the feature representation layer. The advanced convolutional neural network consists of a shallow convolution module and a residual module. Specifically, the shallow convolution module consists, in order, of one convolutional layer, one max-pooling layer, three convolutional layers, and one max-pooling layer. The first convolutional layer has 128 feature maps, while the remaining convolutional layers have 256 feature maps, and the filter size of each convolutional layer is 5 × 3. In the two max-pooling layers, the pooling size is 2 × 2. The shallow convolution module takes the spectrogram and its first- and second-order derivatives as input, extracts low-level features, and passes them to the residual module. The residual module consists of five identical residual blocks. The residual block structure is shown in Figure 2.
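The layer sequence of the shallow convolution module can be sketched in PyTorch as below. Several details are assumptions rather than statements from the paper: the class name `ShallowConvModule` is hypothetical, the 5 × 3 kernel is taken as 5 along time and 3 along frequency, "same" padding is used so that only the pooling layers change the spatial size, and ReLU is used as the activation. The residual module is omitted because its internal structure is given only in Figure 2.

```python
import torch
from torch import nn


class ShallowConvModule(nn.Module):
    """conv(128) -> max-pool -> 3 x conv(256) -> max-pool (hypothetical sketch)."""

    def __init__(self, in_channels: int = 3):
        super().__init__()

        def conv(cin: int, cout: int) -> nn.Sequential:
            # 5 x 3 filter; padding (2, 1) keeps time/frequency size (assumption)
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=(5, 3), padding=(2, 1)),
                nn.ReLU(),  # activation choice is an assumption
            )

        self.net = nn.Sequential(
            conv(in_channels, 128),
            nn.MaxPool2d(kernel_size=2),
            conv(128, 256),
            conv(256, 256),
            conv(256, 256),
            nn.MaxPool2d(kernel_size=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, mel_bands); PyTorch expects channels first
        return self.net(x)
```

Under these assumptions, each 2 × 2 pooling layer halves both the time and frequency dimensions, so the module reduces them by a factor of four overall while expanding the three input channels to 256 feature maps.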
Residual networks add skip connections around the convolutional layers. Let the input of the network be $x$, the result of the operation after the convolutional layers be $F(x)$, and the output of the residual network be $H(x)$. If the dimensions of $x$ and $F(x)$ match, then

$$H(x) = F(x) + x$$

If the dimensions of $x$ and $F(x)$ differ, padding layers and average pooling layers are added to the skip connection; the padding layer extends the dimension of the input so that the output of the skip connection has the same dimensions as $F(x)$.
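A minimal NumPy sketch of this skip connection is given below. The function names and the `(channels, height, width)` layout are assumptions made for illustration; the sketch covers only the case described in the text, where the convolutional branch has at least as many channels and a spatial size that is an integer fraction of the input.

```python
import numpy as np


def shortcut(x: np.ndarray, out_channels: int, pool: int = 1) -> np.ndarray:
    """Adapt x of shape (C, H, W) to (out_channels, H // pool, W // pool)."""
    c, h, w = x.shape
    if pool > 1:
        # Average pooling with a pool x pool window reduces the spatial size
        h2, w2 = h // pool, w // pool
        x = x[:, : h2 * pool, : w2 * pool]
        x = x.reshape(c, h2, pool, w2, pool).mean(axis=(2, 4))
    if out_channels > c:
        # Zero padding extends the channel dimension
        x = np.pad(x, ((0, out_channels - c), (0, 0), (0, 0)))
    return x


def residual_output(x: np.ndarray, fx: np.ndarray) -> np.ndarray:
    """Add the (possibly adapted) input x to the convolutional branch fx."""
    if x.shape != fx.shape:
        pool = x.shape[1] // fx.shape[1]
        x = shortcut(x, fx.shape[0], pool)
    return fx + x
```

Because the padded channels are zero, the extra feature maps of the output come from the convolutional branch alone, while the original channels still receive the identity path.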
2.3. Bidirectional Gated Recurrent Unit (BGRU)
The gated recurrent unit (GRU) [18] is an effective variant of the LSTM [19] network. LSTM introduces three gate functions: the input gate, forget gate, and output gate. The GRU has only two: the update gate and the reset gate. The update gate controls how much valid information from the previous moment is carried forward and how much candidate information is admitted. The reset gate controls whether the calculation of the candidate state depends on the state of the previous moment. The GRU involves fewer tensor operations, so it is faster to train than LSTM. The network structure is shown in Figure 3.
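At time $t$, let the current input be $x_t$, the hidden state output of the GRU be $h_t$, and the hidden state at the previous time be $h_{t-1}$. The calculation method is as follows:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate state, $W_z$, $W_r$, and $W$ are the weight matrices connecting the two layers, $\odot$ denotes element-wise multiplication, and $\sigma$ and tanh are the activation functions.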
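A single GRU update can be sketched in NumPy as follows. The sketch assumes the common bias-free formulation in which the update gate interpolates between the previous hidden state and the candidate state; the function and argument names are illustrative, and library implementations may use a different gate convention or add bias terms.

```python
import numpy as np


def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU time step.

    x_t: (input_size,), h_prev: (hidden_size,)
    W_z, W_r, W: (hidden_size, hidden_size + input_size)
    """
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)  # update gate
    r_t = sigmoid(W_r @ concat)  # reset gate
    # Candidate state: the reset gate scales how much of h_prev is used
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))
    # Update gate mixes the previous state with the candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde
```

With all weights set to zero, both gates equal 0.5 and the candidate state is zero, so the new hidden state is exactly half of the previous one; this makes the interpolating role of the update gate easy to see.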
However, a standard GRU can only pass information in one direction in time when dealing with sequence problems; that is, data can only flow from the past to the present, without considering the influence of later data on earlier data. The bidirectional recurrent neural network (BRNN) improves on the unidirectional recurrent neural network by processing the input sequence in both the forward and backward directions at the same time. The bidirectional gated recurrent unit (BGRU) can be regarded as two unidirectional GRUs running in opposite directions, one processing the sequence from front to back and the other from back to front, both connected to the same output layer [20]. The BGRU structure can therefore attend to both past and future context to make more accurate judgments. In this paper, we use one BGRU layer to capture the temporal information; each direction contains 128 cells, so we obtain a sequence of 256-dimensional high-level feature representations. The network structure is shown in Figure 4.
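For a given n-dimensional input $(x_1, x_2, \ldots, x_n)$, at time $t$, the hidden layer of the BGRU outputs $h_t$, and the calculation method is as follows:

$$\overrightarrow{h_t} = \mathrm{GRU}(x_t, \overrightarrow{h_{t-1}})$$

$$\overleftarrow{h_t} = \mathrm{GRU}(x_t, \overleftarrow{h_{t+1}})$$

$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t$$

where $w_t$ and $v_t$ are the weights of the forward and reverse hidden states, $b_t$ is the bias vector, and $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the outputs of the forward and reverse GRUs, respectively.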
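The bidirectional pass can be sketched independently of the recurrent cell. In the NumPy sketch below, `step_fwd` and `step_bwd` are hypothetical placeholders for any recurrent update of the form `h = step(x_t, h)`, such as a GRU step. The sketch assumes that the two directions are simply concatenated at each time step, which is what yields 256 dimensions for 128 cells per direction; any learned combination weights and bias are omitted.

```python
import numpy as np


def run_direction(step, xs: np.ndarray, hidden_size: int) -> np.ndarray:
    """Run a recurrent step function over xs of shape (T, input_size)."""
    h = np.zeros(hidden_size)
    outputs = []
    for x_t in xs:
        h = step(x_t, h)
        outputs.append(h)
    return np.stack(outputs)  # (T, hidden_size)


def bidirectional(step_fwd, step_bwd, xs: np.ndarray, hidden_size: int) -> np.ndarray:
    """Concatenate forward and backward hidden states at each time step."""
    fwd = run_direction(step_fwd, xs, hidden_size)
    # Process the reversed sequence, then flip back so index t lines up
    bwd = run_direction(step_bwd, xs[::-1], hidden_size)[::-1]
    return np.concatenate([fwd, bwd], axis=1)  # (T, 2 * hidden_size)
```

Using a running sum as the step function makes the behavior easy to inspect: at each time step the forward half holds the sum of all frames up to that step, and the backward half holds the sum of all frames from that step onward.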
2.4. Attention Mechanism
When humans process information, they selectively focus on the most important parts while ignoring other visible information to a certain extent. This is called the attention mechanism. In the speech emotion recognition task, not all features in an utterance play an equally important role in judging the emotional state. In this paper, the attention mechanism is applied to the features extracted from emotional speech so that the model can assign different attention weights to the outputs of the BGRU network. The specific implementation is as follows:

$$u_t = \tanh(W h_t + b) \tag{11}$$

$$\alpha_t = \frac{\exp(u_t^{\top} u)}{\sum_{\tau} \exp(u_{\tau}^{\top} u)} \tag{12}$$

$$c = \sum_{t} \alpha_t h_t \tag{13}$$

where $h_t$ is the BGRU output at time $t$, $W$ and $b$ are the weight and bias of the multilayer perceptron (MLP) layer, and $u$ is a learnable context vector. Equation (11) passes $h_t$ through the MLP layer with tanh as the nonlinear activation function to obtain a new representation $u_t$ of the input sequence. Equation (12) normalizes the attention score to an attention weight $\alpha_t$ between 0 and 1 through the SoftMax function. Equation (13) uses the attention weights to weight the frame-level feature vectors and finally obtains the weighted feature representation $c$.
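The three steps (tanh projection, SoftMax normalization, weighted sum) can be sketched in NumPy as below. The parameter names and the use of a learnable context vector to turn each projected frame into a scalar score are assumptions, following a common formulation of this kind of attention rather than a detail confirmed by the text.

```python
import numpy as np


def attention_pool(H: np.ndarray, W: np.ndarray, b: np.ndarray, u: np.ndarray):
    """Attention-weighted pooling over time.

    H: (T, d) sequence of BGRU outputs
    W: (a, d), b: (a,) parameters of the tanh MLP layer
    u: (a,) context vector (assumed)
    Returns the pooled vector (d,) and the attention weights (T,).
    """
    U = np.tanh(H @ W.T + b)        # new representation of each frame
    scores = U @ u                  # one scalar score per frame
    scores = scores - scores.max()  # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # SoftMax over time
    c = alpha @ H                   # weighted sum of frame features
    return c, alpha
```

When all scores are equal the weights are uniform and the pooled vector reduces to the plain average over time, so the attention layer can be read as a learned generalization of mean pooling.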
3. Experimental Results and Analysis
3.1. Sentiment Corpus and Network Parameter Settings
This paper used the IEMOCAP emotional speech database recorded by the University of Southern California for the experiments. The corpus contains approximately 12 h of audiovisual data. It consists of five sessions, each performed by one male and one female actor. The two professional actors performed scripted and improvised scenarios designed to elicit specific emotions. The IEMOCAP database is annotated with categorical labels by multiple annotators and contains a total of nine emotion categories. Following previous work [21], this paper selected the four most representative and most widely studied emotion categories, with a total of 2280 sentences: 289 angry, 284 happy, 608 sad, and 1099 neutral. We performed five-fold cross-validation. In each fold, the data from four sessions were used for model training, and the data from the remaining session were split by actor: one actor for validation and the other for testing.
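The fold construction can be sketched as follows. The speaker identifiers (`Ses01F`, `Ses01M`, and so on) and the choice of the female actor for validation and the male actor for testing are assumptions for illustration; the text states only that one actor of the held-out session is used for each purpose.

```python
def session_folds(n_sessions: int = 5):
    """Build leave-one-session-out folds with a per-actor validation/test split."""
    folds = []
    for held_out in range(1, n_sessions + 1):
        # Both actors of the four remaining sessions go to training
        train = [
            f"Ses{s:02d}{gender}"
            for s in range(1, n_sessions + 1)
            if s != held_out
            for gender in "FM"
        ]
        folds.append(
            {
                "train": train,
                "val": f"Ses{held_out:02d}F",   # assumed: female actor validates
                "test": f"Ses{held_out:02d}M",  # assumed: male actor tests
            }
        )
    return folds
```

Each fold trains on eight speakers and evaluates on a speaker never seen in training, so the reported accuracy reflects speaker-independent performance.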
The model was implemented in the PyTorch deep learning framework with the following settings: the learning rate was 0.0001, the batch size was 40, and the number of training epochs was 300. Adam was used as the optimizer.
3.2. Evaluation Indicators
This paper used two evaluation indicators commonly used in the field of speech emotion recognition: weighted accuracy (WA) and unweighted accuracy (UA). WA is the overall accuracy over all test samples, and UA is the mean of the per-class accuracies.
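Both indicators can be computed from a confusion matrix, as in the sketch below. The function name and the row/column convention (rows are true classes, columns are predicted classes) are assumptions, and the numbers in the usage note are a toy example, not results from this paper.

```python
def wa_ua(confusion):
    """Return (WA, UA) for confusion[i][j] = count of class-i samples predicted as j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    wa = correct / total  # overall accuracy over all samples
    # Per-class accuracy (recall), then an unweighted mean over classes
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
    ua = sum(recalls) / len(recalls)
    return wa, ua
```

For the toy matrix `[[8, 2], [1, 1]]`, WA is 9/12 = 0.75 while UA is (0.8 + 0.5)/2 = 0.65. The gap illustrates why UA is reported alongside WA on an imbalanced corpus: WA is dominated by the larger class, whereas UA gives every class equal weight.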
3.3. Experimental Results and Analysis
In order to verify the effectiveness of the proposed model, experiments were carried out on the IEMOCAP emotion corpus. Table 1 compares the accuracy of existing methods with that of our model; all of the experiments in the table used the improvised part of IEMOCAP as the dataset.
As can be seen from Table 1, compared with the methods in the above literature, our proposed model achieves state-of-the-art performance, improving both WA and UA on the IEMOCAP dataset.
To verify the effectiveness of each module in the proposed model, we conducted ablation experiments, and the experimental results are shown in Table 2.
The experiments ablated the residual block, the attention mechanism, and the bidirectional network structure in turn. Compared with the full model, both WA and UA declined to varying degrees when any of these components was removed. Among them, the ablation of the residual block and of the attention mechanism had the most obvious impact on the classification performance of the model. The experimental results verified the effectiveness of the proposed model and the importance of each component to its classification performance.
To analyze the recognition behavior of the model in more detail, the confusion matrix of the model on the IEMOCAP dataset is given in Figure 5.
It can be seen from the confusion matrix in Figure 5 that three emotions, anger, sadness, and neutral, obtained high recognition accuracies of 82%, 75%, and 71%, respectively, while the recognition rate of happy was only 43%; 41% of the happy samples were wrongly classified as neutral. There are two possible reasons for the low recognition rate of happiness on the IEMOCAP dataset: (1) in the dimensional space of emotion, the activation level of happiness is higher, while neutral lies at the center of the activation-valence space; (2) the IEMOCAP dataset suffers from sample imbalance, and happy has relatively few samples. In future work, the number of samples can be balanced by overlapping methods.