1. Introduction
The evolution of innovative technologies has made daily life more comfortable, and as interactions between humans and machines increase, speech has become one of the most convenient methods of communication. People can recognize emotions communicated in other ways relatively easily, such as through body gestures [1], facial expressions [2], and images [3], and can successfully detect emotional states (e.g., happiness, anger, sadness, and neutrality). Speech-based emotion recognition (SER) is receiving increasing attention in many areas, such as forensic science [4], customer support call review and analysis [5], mental health surveillance, intelligent systems, and quality assessment in education [6]. Deep learning has accelerated the recognition of emotions from speech, but research on SER still has deficiencies, such as shortages of training data and inadequate model performance [7,8]. Moreover, emotion classification from speech in real time is difficult due to several dependencies, such as the speaker, culture, gender, age, and dialect [9].
Most importantly, the feature extraction step in an SER task can effectively bridge the gap between speech samples and the corresponding emotional states. Various hand-crafted features, such as pitch, duration, formants, intensity, and linear prediction coefficients (LPCs), have been used in SER so far [10,11]. However, these hand-crafted features offer limited capability and low accuracy when building an efficient SER system [12] because they are low-level and may not be discriminative enough to predict the respective emotional category [13]. Nowadays, spectral features are more popular and useful than traditional hand-crafted features because they capture more emotional information by considering both time and frequency [14]. Owing to these advantages, a considerable amount of research has been carried out using spectral features [15,16,17]. However, these features still cannot adequately express the emotions in an utterance. Therefore, to overcome the limitations of traditional hand-crafted and spectral features, high-level deep feature representations need to be extracted from speech signals using efficient deep learning algorithms [12]. Over the last few years, researchers have introduced different deep learning algorithms using various discriminative features [18,19,20]. At present, deep learning methods still rely on several low-level descriptor (LLD) features, which differ from traditional SER features [21]. Extracting more detailed and relevant emotional information from speech is therefore the first issue we have to address.
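To make the feature set concrete, the sketch below shows how frame-level spectral features of the kind listed above (MFCCs, chromagram, spectral contrast, zero-crossing rate, and root mean square energy) could be extracted with the librosa library; the sampling rate and feature dimensions are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np
import librosa

def extract_frame_features(path, sr=16000, n_mfcc=40):
    """Extract frame-level spectral features (assumed, illustrative configuration)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # (12, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # (7, T)
    zcr = librosa.feature.zero_crossing_rate(y)               # (1, T)
    rms = librosa.feature.rms(y=y)                            # (1, T)
    # Stack along the feature axis and transpose to (time, feature_dim)
    feats = np.vstack([mfcc, chroma, contrast, zcr, rms]).T
    return feats
```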
Newly emerging deep learning methods may provide solutions to this issue. Two of the most common are Convolutional Neural Networks (CNNs) [22] and Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM) networks and Gated Recurrent Units (GRUs) [23,24]. Models combining an RNN and a CNN have been built to represent high-level features from low-level data for SER tasks [25]. The CNN produces a high-level feature map, while the RNN extracts long-term temporal contextual information from the low-level acoustic features to identify the emotional class [26]. Several studies [14,27] have successfully used CNNs and RNNs to learn features in speech signal processing.
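To make this division of labor concrete, the following minimal PyTorch sketch (an illustration of the general CNN-plus-RNN pattern, not the architecture proposed in this paper) applies a small 2D CNN to a spectrogram to obtain a high-level feature map and then feeds the time-ordered features to a GRU for temporal modeling; all layer sizes and the number of classes are assumptions.

```python
import torch
import torch.nn as nn

class CnnRnnSketch(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=4):
        super().__init__()
        # CNN: local time-frequency patterns -> high-level feature map
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # RNN: long-term temporal context over the CNN feature map
        self.rnn = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                    # spec: (batch, 1, n_mels, time)
        f = self.cnn(spec)                      # (batch, 64, n_mels/4, time/4)
        f = f.permute(0, 3, 1, 2).flatten(2)    # (batch, time/4, 64 * n_mels/4)
        out, _ = self.rnn(f)
        return self.fc(out.mean(dim=1))         # utterance-level prediction
```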
Moreover, most speech emotion databases contain only utterance-level class labels, and not all parts of an utterance carry emotional information: utterances contain unvoiced parts, short pauses, background sounds, transitions between phonemes, and so on [28]. Unfortunately, CNNs and RNNs cannot efficiently deal with this situation when analyzing acoustic features extracted from voice [29]. We must therefore distinguish the emotionally relevant parts and determine whether a speech frame is voiced or unvoiced. Capsule networks (CapNets) have been introduced to overcome the drawback of CNNs in capturing spatial information [30] and have proven effective in various tasks [31,32,33]. A capsule-based network was developed for emotion classification tasks [30]. Meanwhile, we apply a self-attention mechanism, which emphasizes the capture of prominent features [34]; for this reason, we moved from the traditional attention mechanism to self-attention.
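For illustration only, a common frame-level heuristic for separating voiced frames from unvoiced or silent ones thresholds the short-time energy and zero-crossing rate; the sketch below uses assumed threshold values and is not the mechanism used in the proposed model, which instead relies on learned attention weights.

```python
import librosa

def voiced_mask(y, frame_length=400, hop_length=160,
                energy_thresh=0.02, zcr_thresh=0.25):
    """Heuristic voiced/unvoiced decision per frame (illustrative thresholds)."""
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)[0]
    # Voiced frames tend to have higher energy and a lower zero-crossing rate
    return (rms > energy_thresh) & (zcr < zcr_thresh)
```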
Although these deep-learning-based models are efficient and adaptable to practical situations, some significant challenges remain to be overcome.
Therefore, a novel system is required that can independently learn the local and global contextual emotional features in speech utterances of varied length. We were inspired by RNNs and capsule networks, which have achieved high accuracy and good generalizability in many fields, especially on time-series data. To reduce redundancy and enhance complementarity, we propose a novel fusion-based dual-channel self-attention deep learning model that combines a CNN with a capsule network (Conv-Cap) and a Bi-GRU (shown in Figure 1), as presented in Figure 2. The proposed model effectively exploits the advantages of different features provided as parallel inputs to increase the performance of the system. A final issue is designing a suitably sized input array so that each model performs well.
The contributions and innovations of this study can be summarized as follows:
We propose an efficient speech emotion recognition architecture utilizing a modified capsule network and Bi-GRU modules in two streams with a self-attention mechanism. Our model learns spatial and temporal cues from low-level features and automatically models temporal dependencies. We therefore aim to contribute to the SER literature by using different spectral features in an effective parallel input mode and then fusing the sequentially learned features;
We propose and explore for the first time a Conv-Cap and Bi-GRU feature-learning-based architecture for SER systems. To the best of our knowledge, the proposed methods are novel. We pass spectrograms through the Conv-Cap network and Mel-frequency cepstral coefficients, chromagrams, the contrast, the zero-crossing rate, and the root mean square through the Bi-GRU network in parallel to enhance the feature learning process in order to capture more detailed emotional information and to improve baseline methods;
We propose a dual-channel self-attention strategy to selectively focus on important emotional cues and ensure the system’s performance using discriminative features. We use a novel learning strategy in the attention layers, which utilizes the outputs of the proposed capsule network and sequential network simultaneously and captures the attention weight of each cue. Moreover, the proposed SER system’s time complexity is much lower than that of a single standalone model;
We demonstrate the significance of a fusion strategy by using a confidence-based fusion technique that ensures the best outcomes for these integrated, separately learned features by using a fully connected network (FCN). We demonstrate the effectiveness and efficiency of our proposed method on the IEMOCAP and EMO-DB databases and our own Odia speech corpus (SITB-OSED). We conducted extensive experimentation and obtained results that outperformed state-of-the-art approaches for emotion recognition.
The remainder of this paper is arranged as follows. A review of the literature on SER is given in Section 2. Section 3 provides the details of our proposed system. Section 4 presents the datasets, the experimental setup, and a comparative analysis that demonstrates the model’s efficacy and reliability. Finally, our conclusions and future research directions are provided in Section 5.
2. Literature Review
SER is an active field of research, and researchers have designed numerous strategies over the last few years. Before the widespread use of advanced deep learning strategies, researchers mostly applied complex hand-crafted features (e.g., eGeMAPS, ComParE, IS09) for SER with conventional machine learning approaches such as k-nearest neighbors (kNN), support vector machines (SVMs), Gaussian Mixture Models (GMMs), and Hidden Markov Models (HMMs) [37,38,39]. Subsequently, scholars have been greatly inspired by the growing use of deep learning methods for SER tasks. In [40], the authors implemented the first deep learning model for SER. Then, in [41], the authors utilized a CNN model to learn salient emotional features. Later, in [42], the authors implemented a CNN model to obtain emotional information from spectrograms to identify the speech emotion [43]. A spectrogram is a two-dimensional visualization of a speech signal; it is used in 2D CNN models to extract high-level discriminative features and has become increasingly prevalent [44]. Zhao et al. [45] extracted features from spectrograms of different dimensions using CNNs and passed them to an LSTM network, which learned global contextual information from the resulting CNN features. Kwon et al. [46] employed a new method to determine the most informative sequence segments using Radial Basis Functions (RBFs) and kNN; the selected key segments of the spectrogram were passed through a CNN for feature extraction, and the obtained features were normalized and classified using bidirectional long short-term memory (BiLSTM). These researchers found that 2D CNN-LSTM networks work much better than traditional classifiers. In addition to 2D CNNs [47], some studies have used 1D CNNs and achieved satisfactory performance in SER. For example, in [48], Mustaqeem et al. proposed a multi-learning framework using a one-dimensional dilated CNN followed by a bidirectional GRU. The authors utilized residual skip connections to learn discriminative features and long-term contextual dependencies, thereby improving the locally learned features extracted from the speech signal for SER. In [17], a new model was built using five spectral features (MFCCs, the chromagram, the Mel-scale spectrogram, the spectral contrast, and the Tonnetz representation), and a 1D CNN was used to learn the features for classification. Transfer learning methods can also be applied by fine-tuning pre-trained networks, such as AlexNet [3] and VGG [49], for SER using spectrograms.
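For reference, a log-Mel spectrogram of the kind fed to such 2D CNN models can be computed with librosa roughly as follows; the FFT, hop, and Mel-band settings are assumptions rather than those of any cited work.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=16000, n_fft=512, hop_length=160, n_mels=64):
    """Compute a log-scaled Mel spectrogram (assumed parameters)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # (n_mels, time), in dB
    return log_mel[np.newaxis, ...]                  # add channel axis for a 2D CNN
```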
Moreover, some researchers have concluded that LSTM networks perform better when a CNN is used to extract high-level features. The LSTM model is well suited to emotion analysis problems, but it is still time-consuming and difficult to train in parallel on large datasets. Cho et al. [24] employed the GRU, which has a shorter training time and fewer parameters than the LSTM network and can capture global contextual features. The GRU is an advanced type of RNN similar to the LSTM [50]. It is easier and less complex to train than the LSTM network, which helps to increase training efficiency. It has only two gates: an update gate and a reset gate. The update gate plays the role of the LSTM network’s forget gate and input gate, deciding which information is discarded and which new information is stored, respectively. The reset gate decides how to combine the previous information with the newly stored information and determines the degree to which past information is forgotten [51].
For time step $t$, the input feature sequence $X_i = (x_1, x_2, \ldots, x_T)$ of the $i$-th utterance (where $x_t$ is the feature vector at time $t$) is encoded into a forward hidden state $\overrightarrow{h_t}$ and a backward hidden state $\overleftarrow{h_t}$. The two hidden states are then used together, and the output $y_t$ is jointly calculated from both of them, making the result more robust. The calculation is as follows:

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{GRU}}\big(x_t, \overrightarrow{h}_{t-1}\big), \qquad (1)$$
$$\overleftarrow{h_t} = \overleftarrow{\mathrm{GRU}}\big(x_t, \overleftarrow{h}_{t+1}\big), \qquad (2)$$
$$y_t = \overrightarrow{W}\,\overrightarrow{h_t} + \overleftarrow{W}\,\overleftarrow{h_t} + \overrightarrow{b} + \overleftarrow{b}, \qquad (3)$$

where $\overrightarrow{W}$ and $\overleftarrow{W}$ are the weight matrices of the forward and backward directions of the GRU layer, and $\overrightarrow{b}$ and $\overleftarrow{b}$ are the biases of the forward and backward directions of the GRU layer, respectively. Compared with the LSTM network, the number of parameters, the complexity of the GRU model, and the experimental costs are all reduced [50].
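A minimal PyTorch sketch of this bidirectional GRU computation, mirroring Equations (1)–(3) with assumed dimensions (e.g., a 61-dimensional frame-level feature vector), is shown below.

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    def __init__(self, feat_dim=61, hidden=128, n_classes=4):
        super().__init__()
        self.bigru = nn.GRU(input_size=feat_dim, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        # Joint projection of forward and backward hidden states (W_fwd, W_bwd, biases)
        self.proj = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                # x: (batch, time, feat_dim)
        h, _ = self.bigru(x)             # h: (batch, time, 2*hidden) = [h_fwd ; h_bwd]
        y = self.proj(h)                 # per-frame output y_t from both directions
        return y
```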
In addition, combinations of CNN and LSTM models were proposed by Trigeorgis et al. [35] to solve the problem of extracting context-aware, emotion-related features and obtain a better representation from the raw audio signal. They did not use any hand-crafted features, such as linear prediction coefficients and MFCCs. The CNN and LSTM combination has received a great deal of attention compared with the FCN network for various tasks. Similarly, Tzirakis et al. [52] used CNN and LSTM networks to capture the spatial and temporal features of audio data.
Furthermore, the attention mechanism is often used with deep neural networks. Chorowski et al. [53] first proposed an attention mechanism for speech recognition. Recurrent neural networks combined with attention mechanisms have achieved outstanding results in emotion recognition, allowing models to focus on obtaining more emotional information [54]. Rajamani et al. [55] adopted a new approach using an attention-based Bi-GRU model to learn contextual information and affective influences from previous utterances to help recognize the current utterance’s emotion; the authors used low-level descriptor (LLD) features such as MFCCs, pitch, and related statistics. Zhao et al. [56] proposed a method that automatically learns spatiotemporal representations of speech signals. However, the self-attention mechanism has distinct advantages over the conventional attention mechanism [57]: it can determine the different weights of frames with different emotional intensities as well as the autocorrelation between frames [58]. The self-attention mechanism is an upgraded version of the attention mechanism. The formula for the scaled dot-product attention algorithm is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (4)$$

In Equation (4), $Q$, $K$, and $V$ denote the queries, keys, and values of the input matrix, and $d_k$ denotes the dimension of the keys. When $Q = K = V$, it is a self-attention process. Self-attention models can more easily learn semantic information, such as word dependencies in sentences.
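A compact PyTorch sketch of this scaled dot-product self-attention, where the queries, keys, and values are all projections of the same input (illustrative only; the attention layers in the proposed model may differ in detail):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention: Q, K, and V all come from the same input."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (batch, time, dim)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = torch.softmax(scores, dim=-1)    # attention weight of each frame
        return weights @ V
```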
Although CNNs have achieved extremely successful results in computer vision and signal classification, they have some limitations [59]. Recently, capsule networks (CapNets) have been proposed to overcome the drawback of CNNs in capturing spatial information. CNNs cannot learn the relationship between the parts of an object and the positions of objects in an image. Additionally, traditional CNNs lose some semantic information during forward propagation through the pooling layer, making it difficult to identify spatial relationships. To overcome these issues, Hinton et al. built capsule networks to maintain the positions of objects within the image and model the spatial relationships of the object’s properties [30]. A capsule is a group of neurons whose output is conveyed in the form of a vector. The vector’s length represents the probability that the entity described by the capsule is present, and its orientation captures the internal representation parameters of that entity. A squashing function is used to obtain the vector output of each capsule. For each time step $t$, the output $v_k$ of capsule $k$ is defined as

$$v_k = \frac{\lVert s_k \rVert^{2}}{1 + \lVert s_k \rVert^{2}}\,\frac{s_k}{\lVert s_k \rVert}, \qquad (5)$$

where $s_k$ is the total input to capsule $k$ and Equation (5) is the squashing function. Except for the first capsule layer, the total input $s_k$ to capsule $k$ is the weighted summation of the prediction vectors $\hat{u}_{k|l}$:

$$s_k = \sum_{l} c_{lk}\,\hat{u}_{k|l}, \qquad (6)$$

where each prediction vector is obtained by multiplying the output $u_l$ of a capsule in the lower layer by a weight matrix $W_{lk}$, so that the impact of the viewpoint is modeled by matrix multiplication:

$$\hat{u}_{k|l} = W_{lk}\,u_l. \qquad (7)$$

Here, $c_{lk}$ denotes the coupling coefficients, which are calculated by the routing softmax function expressed in Equation (8):

$$c_{lk} = \frac{\exp(b_{lk})}{\sum_{j}\exp(b_{lj})}. \qquad (8)$$
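The squashing function and routing procedure in Equations (5)–(8) can be sketched in PyTorch as follows; this follows the standard dynamic-routing formulation, and the tensor layout and iteration count are assumptions.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """v = (|s|^2 / (1 + |s|^2)) * (s / |s|): keeps direction, bounds length below 1."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: (batch, n_lower, n_upper, dim) prediction vectors W_lk @ u_l."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits b_lk
    for _ in range(n_iters):
        c = torch.softmax(b, dim=2)                          # coupling coefficients c_lk
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)             # s_k = sum_l c_lk * u_hat_{k|l}
        v = squash(s)                                        # v_k
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)         # agreement update
    return v
```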
Wu et al. [32] implemented a new approach using a capsule network for SER tasks, with spectrograms as the model’s input. The capsules are used to minimize information loss in the feature representation and to retain more emotional information, which is important in SER tasks. In 2019, Jalal et al. [60] implemented a hybrid model based on a BLSTM, a 1D Conv-Cap, and capsule routing layers for SER. Ng and Liu [61] used a capsule-network-based model to encode spatial information from speech spectrograms and analyzed its performance under various loss functions on several datasets.
Moreover, fusion strategies are used to combine different types of features or the decisions of different modeling methods in order to reach a more reliable conclusion. Su et al. [62] developed a model that fed frame-level acoustic features to an RNN and HSFs to an SVM. A confidence score was determined from the class probabilities of the RNN and SVM models, and the confidence scores of the two models were averaged to obtain the fused confidence score used for the final classification of each class. Yao et al. [21] used a decision-level fusion method after obtaining the outputs from three classifiers (HSF-DNN, MS-CNN, and LLD-RNN) for the final class prediction.
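As an illustration of such confidence-based (decision-level) fusion, averaging the class-probability outputs of two models can be sketched as follows; the function and variable names are hypothetical.

```python
import numpy as np

def average_confidence_fusion(probs_a, probs_b):
    """Fuse two models' class-probability vectors by averaging their confidence scores.

    probs_a, probs_b: arrays of shape (n_classes,) that each sum to 1.
    Returns the index of the predicted emotion class.
    """
    fused = (np.asarray(probs_a) + np.asarray(probs_b)) / 2.0
    return int(np.argmax(fused))

# Hypothetical example: probabilities from a capsule branch and a recurrent branch
p_convcap = [0.10, 0.55, 0.25, 0.10]
p_bigru = [0.20, 0.40, 0.30, 0.10]
print(average_confidence_fusion(p_convcap, p_bigru))   # -> 1
```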
Considering the limited number of speech samples, we also adopted a decision-level fusion method in this study to fulfill the goal of developing a high-performance SER system. We developed a new approach for SER tasks to efficiently process emotional information through the Conv-Cap and Bi-GRU modules from speech features. After that, we used the self-attention layer in each module to explore the autocorrelation of phonemes between the features. In addition, due to the limited amount of data, we simultaneously applied different learning strategies to manage and classify emotional states. Finally, we integrated a confidence-based fusion approach with the probability score of two models to identify the final emotional state. To the best of our knowledge, this is the first time the proposed framework has been implemented in emotion recognition, and we demonstrate that it is more efficient than other state-of-the-art methods.