Article

MSFL: Explainable Multitask-Based Shared Feature Learning for Multilingual Speech Emotion Recognition

School of Education Science, Nanjing Normal University, Nanjing 210024, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(24), 12805; https://doi.org/10.3390/app122412805
Submission received: 13 November 2022 / Revised: 10 December 2022 / Accepted: 12 December 2022 / Published: 13 December 2022

Abstract

Speech emotion recognition (SER), a rapidly evolving task that aims to recognize the emotions of speakers, has become a key research area in affective computing. However, the variety of languages encountered in natural multilingual scenarios severely challenges the generalization ability of SER models, causing performance to degrade quickly and raising the question of how to improve multilingual SER. Recent studies mainly rely on feature fusion and language-controlled models to address this challenge, but key points such as the intrinsic association between languages and the deep analysis of multilingual shared features (MSFs) are still neglected. To solve this problem, an explainable Multitask-based Shared Feature Learning (MSFL) model is proposed for multilingual SER. The introduction of multi-task learning (MTL) provides MSFL with related task information from language recognition, improves its generalization in multilingual situations, and lays the foundation for learning MSFs. Specifically, considering both the generalization capability and the interpretability of the model, the MTL module is combined with a long short-term memory network and an attention mechanism to maintain generalization in multilingual situations. The feature weights acquired from the attention mechanism are then ranked in descending order, and the top-ranked MSFs are compared with the top-ranked monolingual features, enhancing model interpretability through the feature comparison. Experiments were conducted on the EMO-DB, CASIA, and SAVEE corpora from the perspectives of model generalization and interpretability. The results indicate that MSFL performs better than most state-of-the-art models, with an average improvement of 3.37–4.49%. Moreover, the top 10 MSFs cover almost all of the top-ranked features of the three monolingual feature sets, which effectively demonstrates the interpretability of MSFL.


1. Introduction

Speech emotion recognition (SER), which aims to recognize the emotion of speakers by extracting acoustic features from the speech signal, started in the 1990s [1] and is a key research area in affective computing. Although modalities such as facial expressions, text, and physiological signals have produced important and prominent results in affective computing [2,3,4], research reveals that the speech modality, as the most convenient and natural medium of communication, has several advantages over them [5]: compared with facial expressions, speech carries stronger temporal structure, making it easier to identify emotional changes across an entire sequence; compared with text, speech is more expressive through its intonation; and compared with physiological signals such as those taken from an electroencephalogram (EEG), speech data are easy to collect with lightweight devices. In view of these advantages, SER has made breakthroughs in theoretical methods and key technologies over nearly three decades of development [6,7,8] and is widely used in intelligent vehicles [9], distance education [10], medical treatment [11], media retrieval systems [12], and other fields.
However, emotion involves large intra-class and inter-class differences, and objective factors such as gender [13], age [14], language [15], and speaker [16] reduce the performance of existing methods. Since most studies in SER focus on a single language, multilingual SER has been addressed in only a few studies, which motivates us to focus on the effect of diverse languages on SER. Feraru et al. [17] found that SER performs better within the same language or language family than across languages or language families, which means that a model trained on a single-language corpus with a small sample size will be limited by that corpus. This is also the case in real life: taking SER in the classroom as an example, language courses show lower emotion recognition performance than non-language courses because the model generalizes poorly in multilingual scenarios. In other words, existing models mainly focus on feature learning within a single corpus to improve SER performance, without considering the influence of different languages across multiple corpora, which is far from SER in real, complex situations. A natural question arises: how can we improve SER in multilingual scenarios? To answer this question, the connection between different languages and SER is investigated in this research, and the key challenges are summarized as follows: (1) What are the similarities in the feature representations of emotional expression across different languages? (2) How can we build a generalized model that improves SER performance in multiple languages simultaneously?
To address the above challenges, scholars have attempted to find breakthroughs from the perspectives of features and models. For the first issue, researchers have explored fusing multiple features to find those that are more conducive to multilingual emotional expression and thus lead to an improvement. Feature fusion has so far included fusion between traditional handcrafted features [18,19,20,21,22,23], fusion between traditional handcrafted features and deep features [24,25,26,27], and fusion between deep features [28,29]. In addition, feature selection is an effective way to obtain optimal acoustic features. Li et al. [30] used a three-layer model consisting of acoustic features, semantic primitives, and emotion dimensions, inspired by human emotion perception, to map multilingual acoustic features to emotion dimensions; acoustic features were selected using the Fisher discriminant ratio and sequential forward selection to develop a shared standard acoustic parameter set. For the second issue, the main point is how to control the influence of language as an objective factor on SER. Most existing studies have trained language recognition classifiers for model selection [31,32] or improved the model to enhance generalization [33,34,35,36,37,38,39]. Although these approaches show good results, some limitations remain: (1) existing methods emphasize controlling the influence of the languages and ignore the intrinsic connections between languages; (2) most studies focus either on exploring multilingual features or on improving models, with little consideration given to studying models and features together.
A more efficient alternative for addressing the above limitations is multi-task learning (MTL), which is inspired by the fact that humans can learn multiple tasks simultaneously and use the knowledge learned in one task to help learn the others [40]. MTL can learn robust and generalized feature representations from multiple tasks to better enable knowledge sharing between them; its core idea is to reduce the risk of overfitting for each task by weighing the training information across tasks. This raises a question: if the machine learns language recognition and emotion recognition jointly, will the multilingual shared features learned during MTL training improve multilingual SER? The literature suggests that this is feasible. Lee [41] investigated multilingual SER across English and French using MTL trained with gender recognition and language recognition as auxiliary tasks; through comparative experiments, he confirmed that the MTL strategy leads to further improvements under all conditions and is effective for multilingual SER. Zhang et al. [42] proposed a multi-task deep neural network with shared hidden layers and jointly trained several SER tasks from different corpora; this method achieved large-scale data aggregation and obtained a feature transformation of all corpora through the shared hidden layers. Sharma [43] combined 25 open-source datasets to create a relatively large multilingual corpus, which showed good performance in a multilingual, multi-task SER system based on the multilingual pre-trained wav2vec 2.0 model; several auxiliary tasks were used, including gender prediction, language prediction, and three regression tasks related to acoustic features. Gerczuk et al. [44] created a novel framework based on residual adapters for multi-corpus SER from a deep transfer learning perspective, in which the multi-task transfer experiment trained a shared network for all datasets while only the adapter modules and final classification layers were specific to each dataset; multi-task transfer improved results for 21 of the 26 databases and achieved the best performance. From these studies, it is clear that applying MTL to SER is beneficial for aggregating data, sharing features, and establishing emotional representations. However, previous studies have only applied MTL to improve the generalization ability of models and have not fully considered the interpretability of the model and the shared features it generates. In other words, MTL should not only be a method for improving model generalization but also an effective way to analyze and explain shared features.
To this end, considering the variability of emotional expression across languages, we propose an explainable Multitask-based Shared Feature Learning (MSFL) model for multilingual SER, which can improve the SER performance of each language and effectively analyze multilingual shared features (MSFs). Following the basic idea of MTL, the model is divided into a task-sharing module and a task-specific module. The task-sharing module is the key component of MSFL, as it performs the feature selection and transformation that uncover generalized high-level discriminative representations; the task-specific module handles the classification of the emotion and language tasks. Specifically, the task-sharing module combines a long short-term memory network (LSTM) and an attention mechanism from a new perspective: the LSTM uses the global feature dimensions as time steps to capture long-term dependencies among features, and the attention layer lets the model weigh the contribution of each feature in the MSFs by assigning different weights. The weights generated by the attention mechanism are essential for explaining the improved validity and generalizability of the MSFL model and its MSFs. The main work and contributions of this paper are summarized below.
(1)
For model generalization, our proposed model uses gradient normalization-based MTL to jointly learn the emotion recognition and language recognition tasks, where gradient normalization dynamically adjusts the gradient of each task. With the LSTM-attention structure in MSFL, the two tasks can learn features from different perspectives to better regularize the model and uncover the high-level discriminative MSFs.
(2)
For model interpretability, the validity and generalization of the model in multilingual scenarios are explained from the perspective of MSFs. The MSFs are ranked according to the attention weights in MSFL, and the top-ranked features are compared with the monolingual features of the three datasets to identify their differences and commonalities.
(3)
This study provides both a technical pipeline, from model building to feature analysis, and a model interpretability perspective. In particular, the feature ranking and analysis lay a theoretical foundation for multi-language, multi-corpus data aggregation to alleviate data sparsity and advance research on multilingual SER.
The rest of this paper is structured as follows. Section 2 presents the related literature on deep learning and MTL on SER. Section 3 describes the proposed MSFL model in detail. Section 4 presents the experimental setup, including the corpora employed to perform experiments, speech acoustic features extraction, and model configuration. Section 5 presents the experimental results and the significance of the observed findings. The conclusions and future research directions are provided in Section 6.

2. Related Work

2.1. Deep Learning for Speech Emotion Recognition

Early SER techniques relied on extensive feature engineering and performed emotion recognition with traditional machine learning models such as the Hidden Markov Model (HMM), Support Vector Machine (SVM), and Gaussian Mixture Model (GMM) [45]. The flourishing development of deep learning has broadened the representation of acoustic features, and feature extraction is no longer limited to traditional feature engineering. Extracting deep representations with the powerful feature learning ability of deep neural networks has gradually become mainstream, laying the foundation for end-to-end models, and SER has formally entered an era of relying on deep learning and achieving good performance. Convolutional neural networks (CNNs) [46] and recurrent neural networks (RNNs) [47] have become the common deep neural networks in SER. CNNs are designed to process data with a grid-like topology, such as time series and image data, and generally contain convolutional, pooling, and fully connected layers. Since they overcome the scalability problem of standard neural networks by allowing multiple regions of the input to share the same weights [48], they have been widely used in SER to learn frequency- and time-domain representations from spectrogram images [49]. However, to enhance the interpretability of features, our study uses traditional handcrafted features, which are typically fed into deep neural networks (DNNs) and RNNs. Since DNNs are the basic building blocks of deep learning, we introduce RNNs in more detail below.
The self-connection property of an RNN gives it a great advantage in handling temporal sequences, but during training the gradient vanishes and long sequences become difficult to handle, which motivated the long short-term memory network [50]. Differing from the RNN, the LSTM adds a cell state that retains information over time and allows previously computed values to be reused. To protect and control the information in the cell state, the LSTM uses three gates: an input gate, a forget gate, and an output gate. Since the LSTM can learn long-term dependencies in the data and effectively alleviate the vanishing-gradient problem during training, frame-level and spectral features are generally fed into an LSTM to learn long-term contextual relationships in speech [51]. On this basis, the bidirectional long short-term memory network (BiLSTM) was proposed to capture both past and future information in an utterance [52]. To strengthen the ability to capture long-time dependencies in sequential data, Wang et al. [53] combined BiLSTM with a multi-residual mechanism, which targets the relationships between the current time step and more distant time steps instead of only the previous one. Additionally, the attention mechanism, which is borrowed from human visual selective attention and was first introduced into SER by Mirsamadi [54], is often combined with the LSTM to weigh the importance of a sentence or of particular frame segments within the whole time series [55].

2.2. Multi-Task Learning for Speech Emotion Recognition

Multi-task learning (MTL), also known as joint learning, learning to learn, and learning with auxiliary tasks, was proposed by Caruana in 1997 [56]. Its successful applications in natural language processing, computer vision, speech recognition, and other fields demonstrate the irreplaceable advantages of this learning paradigm. By exploiting shared low-dimensional representations across multiple related tasks, it helps alleviate the data sparsity problem; representations learnt in the MTL scenario thus become more generalized, which improves performance [57]. However, the premise of applying this method is that the tasks must be correlated; otherwise, negative transfer occurs and the inter-task learning effect is reduced, so selecting strongly correlated tasks is crucial for multi-task SER. Based on previous research, related auxiliary tasks fall into four categories: different emotion representations such as dimensional emotion [58]; objective factors related to speech emotion such as gender [59], speaker [60], and language [41]; different feature representations [61]; and related tasks from different databases [42]. Through these tasks, a multi-task SER model can share common feature representations to improve its generalization ability and performance. To establish the link between languages and emotions for multilingual SER, language recognition is used as the auxiliary task in this study.
Thung et al. [62] divide MTL models into single-input multi-output (SIMO), multi-input multi-output (MIMO), and multi-input single-output (MISO) models. According to the existing literature and the nature of SER, multi-task SER can be classified into SIMO and MIMO. SIMO usually takes traditional handcrafted features [63] or spectrograms [64] as model input and outputs multiple task targets. MIMO trains the model with multiple data sources, such as multimodal [65], multi-corpus [42], and multi-domain [66] data, as inputs, and each task is defined as predicting one target from one input source. The frameworks of the two models are shown in Figure 1. Generally, during MTL model optimization, the overall loss function is the weighted sum of the task loss functions. Previous studies on multi-task SER have assigned task weights using experience-based adjustment. However, the loss magnitudes of different tasks during training may be inconsistent, so training can become dominated by one task at certain stages and biased toward fitting that task. Scholars have therefore started to balance the task gradients adaptively during training to improve the performance of all tasks. Inspired by this, an adaptive loss balancing method called gradient normalization is introduced to improve the performance of the two tasks in our proposed model [67].

3. Proposed Model

Relying on the ability of MTL to learn features shared between tasks, an explainable multitask-based shared feature learning model that uses language recognition as an auxiliary task is proposed to establish the relationship between languages and emotions. MSFL incorporates a task-sharing module for feature representation and a task-specific module for multi-task classification. Figure 2 illustrates our proposed model.
Previous studies show that utterance-level features are usually transformed into deeper representations by a DNN, while an LSTM generally handles frame-level features with time steps tied to sentence length. In contrast, this study extracts 88-dimensional utterance-level features, also called High-level Statistics Functions (HSFs), and feeds them into the LSTM as time steps to obtain dependencies among features. However, not all features contribute equally to the emotional representation of an utterance, so an attention layer is stacked on the LSTM so that the model can weigh the contribution of each feature and focus on the important ones by assigning different weights. To further process the features selected by the attention mechanism and transform them into deeper representations, we stack two fully connected layers with ReLU activation as a non-linear DNN transformation. Finally, to turn the feature vectors into posterior probability distributions for classification, the more emotionally discriminative MSFs are fed into the task-specific module, which consists of two softmax layers for the emotion and language recognition tasks.
Specifically, three labeled corpora are used in this study, where $X_e$ denotes the EMO-DB corpus, $X_c$ the CASIA corpus, and $X_s$ the SAVEE corpus. Let $X = [X_e, X_c, X_s] \in \mathbb{R}^{n \times d}$, where $n = n_e + n_c + n_s$ and $d$ is the feature dimension. $Y_e = [y_{e,1}, y_{e,2}, \ldots, y_{e,n}] \in \mathbb{R}^{n \times c_e}$ denotes the labels of the emotion recognition task and $Y_l = [y_{l,1}, y_{l,2}, \ldots, y_{l,n}] \in \mathbb{R}^{n \times c_l}$ the labels of the language recognition task, where $c_e$ and $c_l$ are the numbers of emotion and language categories, respectively. Each row of $X$ is a feature vector $x = [x_1, x_2, \ldots, x_i, \ldots, x_d]$, where $x_i$ is an acoustic feature; since 88-dimensional acoustic features are extracted, $d = 88$. In our model, the feature vector $x$ is fed into an LSTM with 32 hidden units as 88 time steps of one input dimension each, producing the output $B = [b_1, b_2, \ldots, b_i, \ldots, b_d]$, where each $b_i$ is a 32-dimensional vector. Then, to align the 88 feature dimensions with the 88 nodes of the fully connected layer in the attention mechanism, the matrix $B$ is transposed and fed into the attention layer, which is calculated as follows.
$f(b_i) = W^{T} b_i$ (1)
$\alpha_i = \frac{\exp(f(b_i))}{\sum_{n} \exp(f(b_n))}$ (2)
$Z = \sum_{i} \alpha_i b_i$ (3)
Here, $W$ is a trainable parameter vector, $\alpha_i$ is the normalized attention weight of each feature computed by applying the softmax function to the attention scores, and $Z$ is the weighted sum of the LSTM outputs. The attention weights are the key to the interpretability of the model, since they allow the MSFs learned by MSFL to be compared with the monolingual features learned by the single-task version of MSFL (MSFL_ST), explaining why the proposed model learns generalized feature representations and improves recognition across multiple languages. Finally, $Z$ is transformed by the two fully connected layers and fed into the task-specific module for task classification.
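To make the architecture concrete, the following is a minimal PyTorch sketch of the task-sharing and task-specific modules described above, assuming the sizes stated in this section (88 feature time steps with one input dimension, a 32-unit LSTM, four emotion classes, and three language classes); the hidden width of the fully connected layers and the exact attention parameterization are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MSFL(nn.Module):
    def __init__(self, lstm_hidden=32, n_emotions=4, n_languages=3, fc_hidden=64):
        super().__init__()
        # Task-sharing module: the 88 HSF dimensions are treated as time steps,
        # each with a single input dimension.
        self.lstm = nn.LSTM(input_size=1, hidden_size=lstm_hidden, batch_first=True)
        # Attention scoring f(b_i) = W^T b_i (Equation (1)); one score per feature.
        self.attn_score = nn.Linear(lstm_hidden, 1, bias=False)
        # Two fully connected layers with ReLU for deeper shared representations.
        self.shared_fc = nn.Sequential(
            nn.Linear(lstm_hidden, fc_hidden), nn.ReLU(),
            nn.Linear(fc_hidden, fc_hidden), nn.ReLU(),
        )
        # Task-specific module: emotion and language heads (the softmax is applied
        # implicitly by the cross-entropy losses during training).
        self.emotion_head = nn.Linear(fc_hidden, n_emotions)
        self.language_head = nn.Linear(fc_hidden, n_languages)

    def forward(self, x):                        # x: (batch, 88) utterance-level HSFs
        b, _ = self.lstm(x.unsqueeze(-1))        # b: (batch, 88, 32)
        scores = self.attn_score(b)              # (batch, 88, 1)
        alpha = torch.softmax(scores, dim=1)     # alpha_i over features (Equation (2))
        z = (alpha * b).sum(dim=1)               # weighted sum Z (Equation (3))
        shared = self.shared_fc(z)
        return self.emotion_head(shared), self.language_head(shared), alpha.squeeze(-1)
```

The attention weights are returned alongside the two sets of logits so that they can later be averaged and ranked for the feature analysis in Section 5.3.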
The loss function of our model is shown in Equation (4), where $L_e$ is the loss of the emotion recognition task given in Equation (5) and $L_l$ is the loss of the language recognition task given in Equation (6). Note that, considering the unbalanced samples across emotion and language categories, weighted cross-entropy is used for both loss functions so that the network places equal emphasis on categories with smaller sample sizes.
$L = w_1 \times L_e(x_e, y_e) + w_2 \times L_l(x_l, y_l)$ (4)
$L_e(x, y_e) = -w_{y_e} \log \frac{\exp(x_{y_e})}{\sum_{c_e} \exp(x_{c_e})}$ (5)
$L_l(x, y_l) = -w_{y_l} \log \frac{\exp(x_{y_l})}{\sum_{c_l} \exp(x_{c_l})}$ (6)
Generally, the weight $w_i$ of the $i$-th task in a multi-task learning model is set by manual adjustment based on experience. However, owing to imbalances between back-propagated gradients, multi-task networks are difficult to train properly: the task with the larger gradient magnitude dominates training and leaves the model unable to learn the other tasks adequately. To obtain robust shared features that are useful across all tasks, an adaptive loss balancing method called gradient normalization (GradNorm) is introduced; it controls the training of multi-task networks by keeping the magnitudes of the task gradients similar during training, thereby encouraging the network to learn all tasks at the same speed [67]. In this method, $w_i$ is no longer a fixed value but varies with the training step $t$: $w_i = w_i(t)$. The aim is therefore to find the optimal $w_i(t)$ for each training step $t$, which results in two optimization goals: optimizing $w_i(t)$ and optimizing the label loss $L$. The optimization of $w_i(t)$ involves two aspects: balancing the gradient magnitudes $G_W^{i}(t)$ and balancing the training rates $r_i(t)$ of each task $i$ at each training step $t$. The specific calculation formulas can be found in [67].
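As a rough illustration of how GradNorm balances the two task losses, the sketch below follows the formulation in [67] with the alpha value used in Section 4.3; the function boundaries, the choice of the last shared layer's parameters, and the bookkeeping of initial losses are assumptions rather than details reported in the paper.

```python
import torch

def gradnorm_step(losses, task_weights, shared_params, initial_losses, alpha=0.16):
    """One GradNorm weighting step for a list of per-task losses.

    `shared_params` is a list of parameters of the last shared layer,
    `task_weights` is a learnable 1-D tensor (one weight per task), and
    `initial_losses` are the task losses recorded at the first training step.
    """
    # Weighted task losses and the total label loss used to train the network.
    weighted = [w * l for w, l in zip(task_weights, losses)]
    total_loss = torch.stack(weighted).sum()

    # G_W^i(t): gradient norm of each weighted task loss w.r.t. shared parameters.
    grad_norms = []
    for wl in weighted:
        grads = torch.autograd.grad(wl, shared_params,
                                    retain_graph=True, create_graph=True)
        grad_norms.append(torch.norm(torch.cat([g.flatten() for g in grads])))
    grad_norms = torch.stack(grad_norms)

    # r_i(t): relative inverse training rates, and the resulting target norms,
    # both treated as constants for the GradNorm loss.
    with torch.no_grad():
        loss_ratios = torch.stack([l / l0 for l, l0 in zip(losses, initial_losses)])
        inverse_rates = loss_ratios / loss_ratios.mean()
        target = grad_norms.mean() * (inverse_rates ** alpha)

    # GradNorm loss: pull each task's gradient magnitude toward its target so
    # that all tasks are learned at a similar speed.
    gradnorm_loss = torch.abs(grad_norms - target).sum()
    return total_loss, gradnorm_loss
```

In use, `total_loss` would be back-propagated into the network parameters and `gradnorm_loss` into the task weights, which are then renormalized so that they sum to the number of tasks.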

4. Experimental Setup

4.1. Corpora

To evaluate the multilingual performance of our proposed model, the Berlin Database of Emotional Speech (EMO-DB) [68], the Chinese Emotional Corpus (CASIA) [69], and the Surrey Audio-Visual Expressed Emotion Database (SAVEE) [70] are combined as input for the model, representing German, Chinese, and English, respectively. EMO-DB and CASIA are selected because they are the most representative and influential discrete-emotion corpora in German and Chinese, which facilitates comparison with other experiments. In contrast, SAVEE is used less frequently in English SER studies than IEMOCAP, the most widely used English corpus; however, because the large scale of IEMOCAP would unbalance the overall multilingual sample, SAVEE, which is roughly the same size as EMO-DB, is chosen in this study. These corpora contain recordings of professional actors who interpret the relevant emotions from fixed scripts, and they have similar emotion labeling schemes. As in most previous studies, to maintain consistency of emotions across the corpora, we concentrate on the recognition of four basic emotions (neutral, anger, happiness, and sadness) and selected a total of 1439 utterances for the experiments. The basic information for each corpus is given in Table 1; the corpora are briefly introduced below.

4.1.1. EMO-DB

EMO-DB is a German emotion corpus recorded by the Institute of Communication Science, Technical University Berlin. It comprises 10 German everyday life sentences including five short sentences and five long sentences, which were collected from ten actors for seven emotional states: neutral, anger, fear, happiness, sadness, disgust, and boredom. In total, there are 535 utterances in the corpus and emotions were annotated by 20–30 raters. Four basic emotion categories were selected, and a corpus of 339 sentences was formed in our study.

4.1.2. CASIA

CASIA is a Chinese emotion corpus recorded by the Institute of Automation of the Chinese Academy of Sciences, performed by four professional actors (two female, two male) in a clean recording environment. It covers 9600 utterances in six emotional states: anger, fear, happiness, neutral, sadness, and surprise. For each emotion, each speaker recorded 300 utterances with the same texts and 100 utterances with different texts. Since only part of the corpus was publicly released, we used the 50 publicly available identical texts recorded by the four actors in the four basic emotions, which reduces the effect of text content on emotion recognition and yields a corpus of 800 sentences.

4.1.3. SAVEE

SAVEE is a well-known multimodal dataset containing utterances from four native English speakers at the University of Surrey, covering seven emotions: neutral, anger, disgust, fear, happiness, sadness, and surprise. Each speaker recorded 30 neutral utterances and 15 utterances for each of the other emotions, totaling 480 utterances. Ten evaluators assessed the quality of the recordings under audio, visual, and combined audio-visual conditions. Our experiments used the 300 utterances belonging to the four basic emotion states.

4.2. Speech Features Extraction

Acoustic features convey emotional information independently of the linguistic content, and it is crucial to extract and select appropriate features because different features correlate differently with emotions. Currently, feature representations can be divided into traditional handcrafted features and deep features. Traditional handcrafted features comprise low-level descriptors (LLDs), such as prosodic, voice quality, spectral, and TEO (Teager energy operator)-based features, which are also called local features and are extracted from each frame of an utterance, and HSFs, also called global features, which are computed as statistics over the local features [71]. Deep features exploit the strong feature learning ability of deep learning to automatically learn deeper representations from the raw speech signal or spectrogram [72]. Although deep features have boosted the development of end-to-end models and improved SER performance, they also bring poor interpretability, high memory usage, and increased computational complexity.
Since one of our goals is to explain the features shared across languages, we use traditional handcrafted features, which are more interpretable, as the model input. To this end, we adopt the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) as the emotional acoustic feature set; it is extracted with the openSMILE toolkit and offers better parameter interpretability and generalization capability than large-scale, high-dimensional baseline feature sets. eGeMAPS contains 25 LLDs broadly sorted into three parameter groups: frequency-related parameters, energy-related (amplitude-related) parameters, and spectral parameters. To mitigate the effect of noise introduced during speech signal acquisition on emotion recognition, all LLDs were smoothed over time with a symmetric moving-average filter three frames long. The details of eGeMAPS are given in Table 2. In addition, six temporal features (the rate of loudness peaks, the mean length and standard deviation of continuously voiced regions, the mean length and standard deviation of unvoiced regions, and the number of continuous voiced regions per second) and the equivalent sound level are included. In total, 88 feature dimensions are used in our experiments.
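The 88 HSFs can be obtained, for example, with the openSMILE Python wrapper; the snippet below is a sketch under the assumption that the eGeMAPSv02 functionals set is used (the paper states only that the openSMILE toolkit extracts eGeMAPS features).

```python
# A sketch of extracting the 88 eGeMAPS functionals (HSFs) for one utterance.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # assumed feature-set version
    feature_level=opensmile.FeatureLevel.Functionals,  # utterance-level statistics
)
features = smile.process_file("speech_sample.wav")     # pandas DataFrame, 1 x 88
print(features.shape)
```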

4.3. Model Configuration

The experiments were performed on an Ubuntu 16.04 LTS operating system and implemented with the PyTorch deep learning framework. A total of 20% of the speech in each corpus was randomly selected as the test set, and the remaining speech was mixed for training. The model was trained with a batch size of 32 for 50 iterations, using Adaptive Moment Estimation (Adam) with a learning rate of 0.01 as the optimizer. When using the GradNorm adaptive loss balancing method, the alpha hyperparameter was set to 0.16. When using the conventional experience-based weight adjustment method, the value of $w_1$ was set in the range of 0.1 to 0.9, with $w_1 + w_2 = 1$. Ten-fold cross validation was used, and the average of the ten test results was taken as the final performance of the model. The details of the model are summarized in Table 3.
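A minimal training-setup sketch matching this configuration is given below; it reuses the MSFL sketch from Section 3, and the random tensors, uniform class weights, and fixed 0.5/0.5 task weights are placeholders for the real data loader, class weights, and GradNorm-adjusted weights.

```python
import torch
import torch.nn as nn

model = MSFL()  # the MSFL sketch shown in Section 3
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Weighted cross-entropy so that classes with fewer samples are not neglected;
# uniform weights here are placeholders for the real class weights.
emotion_criterion = nn.CrossEntropyLoss(weight=torch.ones(4))
language_criterion = nn.CrossEntropyLoss(weight=torch.ones(3))

# Random tensors stand in for one mini-batch from the mixed training set.
x = torch.randn(32, 88)                  # 88-dimensional eGeMAPS HSFs
y_emotion = torch.randint(0, 4, (32,))   # emotion labels
y_language = torch.randint(0, 3, (32,))  # language labels

for epoch in range(50):
    emo_logits, lang_logits, _ = model(x)
    # Fixed 0.5/0.5 weights stand in for the GradNorm-adjusted w1(t), w2(t).
    loss = 0.5 * emotion_criterion(emo_logits, y_emotion) \
         + 0.5 * language_criterion(lang_logits, y_language)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```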

5. Experiments and Results

In this section, experiments are conducted on the EMO-DB, CASIA, and SAVEE corpora from the aspects of model generalization and model interpretability; a brief summary of the experiments is given in Table 4.
The first two experiments focus on the generalization of the model, while the last experiment enhances the interpretability of the model from the perspective of acoustic features. For model evaluation, performance metrics are the criterion for validating the merits of a model, and different metrics can lead to different conclusions. Since the distribution of emotion categories in speech is uneven, unweighted average recall (UAR) and accuracy (ACC) were used as comparison measures, owing to their wide acceptance in SER. The calculation formulas are as follows. Note that, for each class, TP denotes the number of samples whose ground truth and prediction are both positive, TN the number whose ground truth and prediction are both negative, FP the number whose ground truth is negative but whose prediction is positive, and FN the number whose ground truth is positive but whose prediction is negative.
$\mathrm{UAR} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FN_i}$ (7)
$\mathrm{ACC} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i + TN_i}{TP_i + TN_i + FN_i + FP_i}$ (8)
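Both metrics can be computed from a confusion matrix; the sketch below illustrates Equations (7) and (8) with scikit-learn, where the library choice and the toy labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = [0, 1, 2, 3, 1, 2]     # illustrative ground-truth emotion labels
y_pred = [0, 1, 2, 1, 1, 2]     # illustrative predictions

# UAR (Equation (7)): mean of per-class recalls, i.e. macro-averaged recall.
uar = recall_score(y_true, y_pred, average="macro")

# ACC (Equation (8)): per-class accuracy from one-vs-rest counts, averaged.
cm = confusion_matrix(y_true, y_pred)
n = cm.sum()
per_class_acc = []
for i in range(cm.shape[0]):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp
    fp = cm[:, i].sum() - tp
    tn = n - tp - fn - fp
    per_class_acc.append((tp + tn) / n)
acc = float(np.mean(per_class_acc))
print(uar, acc)
```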

5.1. Experiment I: Different Weight Adjustment Methods for MSFL

To verify the generalization performance of the GradNorm method for MTL, three variants of the model with different weight adjustment methods are compared. The results are shown in Table 5, where M_MSFL denotes the MSFL model incorporating the MetaBalance approach of [73], G_MSFL denotes the MSFL model incorporating the GradNorm method of [67], and E_MSFL denotes the MSFL model whose task weights are adjusted manually based on experience. The models share the framework and parameters described in Section 4.3 and differ only in the weight adjustment method. Specifically, the relaxation and beta parameters of M_MSFL are set to 0.7 and 0.9, respectively, and the alpha parameter of G_MSFL to 0.16. For the weight parameters of E_MSFL, the model is evaluated for values of $w_1$ and $w_2$ ranging from 0.1 to 0.9, where $w_1$ (named alpha) is the weight of the emotion recognition task and $w_2$ (named beta) is the weight of the language recognition task. The performance of the two tasks for different alpha values is shown in Figure 3. As Figure 3 shows, the weights assigned to the tasks affect model performance: an alpha that is too high reduces the assistance that the language recognition task provides to the emotion recognition task, while an alpha that is too small biases the model toward the language recognition task and causes it to neglect emotion recognition. Comparing the different alpha values, E_MSFL achieves its best average result of about 80.77% when alpha is set to 0.5 (i.e., a weight of 0.5 for each task). Intuitively, jointly learning representations of emotions and languages helps uncover common high-level discriminative representations in multilingual scenarios, which leads to performance improvements in the SER system.
As shown in Table 5, all three models achieve about 99% UAR and ACC on the language recognition task, which means that language recognition is an easier task than emotion recognition, and emotion recognition remains the challenging one. The following analysis therefore focuses on the main task of emotion recognition. On this task, the average performance of G_MSFL is slightly higher than that of E_MSFL and M_MSFL, with average UAR improvements of about 0.89% and 0.77%, respectively. On EMO-DB and CASIA in particular, G_MSFL achieves 85.24% and 84.44% UAR, improvements of 2.66% and 2.24% over E_MSFL and of 1.38% and 1.36% over M_MSFL. Compared with E_MSFL, the adaptive loss balancing approaches (M_MSFL and G_MSFL) achieve better results and avoid the time-consuming brute-force empirical tuning. Hence, this study adopts the GradNorm adaptive loss balancing approach to optimize the objective function of MSFL, and the MSFL model in the subsequent experiments uses GradNorm by default.

5.2. Experiment II: Comparison with Different Models

To verify the generalization of the proposed MSFL, a comparative experiment was conducted between MSFL and baseline models on the three corpora. Five baseline models were selected: MSFL_ST and MTL_DNN were implemented by us, while the other models were taken from studies published in the past three years, with their experimental results retained from the literature. MSFL_ST is included to verify the generalization ability gained from MTL, and the results of MTL_DNN were obtained by reproducing the method in [41]. The experimental results are shown in Table 6. Since not all multilingual SER studies use the same corpora as this study, some models show incomplete corpus results in Table 6. To further extend the scope of comparison, model performance is evaluated from a macro perspective, and the contribution of MTL to each emotion category in each corpus is explored from a micro perspective.
From a macro perspective, our evaluation first benchmarks the single-task model against the multi-task model (i.e., MSFL_ST vs. MSFL). As shown in Table 6, emotion recognition performance improves on every corpus, with an average improvement of 4.49%; MSFL outperforms MSFL_ST on EMO-DB, CASIA, and SAVEE by 3.36%, 5.36%, and 4.75%, respectively. This proves that introducing MTL with language recognition as an auxiliary task improves multilingual SER. We then compared MSFL with the other models, including MTL_DNN, MT-SHL-DNN, CAbiLS, and the Ensemble Model. Except for the Ensemble Model result on EMO-DB, MSFL outperforms the other models on all three corpora, with an average improvement of 3.37% over MTL_DNN. These results demonstrate the robustness and effectiveness of the proposed MSFL model.
From a micro perspective, two comparisons based on the confusion matrices are performed. The first compares the confusion matrices of MSFL_ST and MSFL to demonstrate the contribution of MTL to each emotion category; the second compares the confusion matrices of MSFL and MTL_DNN to demonstrate that the proposed model outperforms MTL_DNN on each emotion category. Figure 4, Figure 5 and Figure 6 show the confusion matrices of MSFL_ST, MTL_DNN, and MSFL on EMO-DB, CASIA, and SAVEE, respectively. The confusion matrix describes the accuracy and the degree of confusion between the four emotion categories: the x-axis shows the predicted labels, the y-axis the ground-truth labels, the diagonal values the proportion of correct predictions, and the remaining values the proportion of incorrect predictions (i.e., confusion). In the first comparison, except for a drop in neutral accuracy on SAVEE, MSFL outperforms MSFL_ST in recognition accuracy for almost all emotions in the three corpora; on CASIA, all emotions improve. The exception for neutral on SAVEE stems from the unbalanced sample size, as there are twice as many neutral samples as samples of any other emotion. This biases the monolingual single-task experiments toward predicting neutral, and sadness in SAVEE is especially easy to confuse with neutral. When MTL is used, by contrast, the degree of confusion decreases and the accuracy of the other emotions improves, as is also the case for the confusion between anger and happiness in EMO-DB. Hence, MTL can alleviate the problems of unbalanced sample sizes and data sparsity through data augmentation, thereby reducing emotional confusion in SER. In the second comparison, MSFL outperforms MTL_DNN in recognition accuracy for almost all emotions in the three corpora. Although MSFL recognizes sadness in EMO-DB slightly less accurately than MTL_DNN, the confusion between anger and happiness in EMO-DB decreases significantly. The same holds for SAVEE, where neutral recognition is slightly lower while the confusion between neutral and sadness decreases significantly. In CASIA, emotional confusion is not evident for either MSFL or MTL_DNN; while MSFL recognizes neutral and happiness slightly less accurately than MTL_DNN, it recognizes anger markedly better. Overall, MSFL is very sensitive to anger and achieves the best anger results on all three corpora. Moreover, it mitigates emotional confusion on EMO-DB and SAVEE, which benefits from the transformation and selection of acoustic features by the LSTM and attention mechanism.

5.3. Experiment III: Multilingual Shared Features Ranking and Analysis

One advantage of MTL is that it can exploit shared low-dimensional representations across multiple related tasks to improve the performance of all or some of them. In this study, MTL learns MSFs across language recognition and emotion recognition to enhance SER. To further investigate the similarities and differences between the MSFs learned by MSFL and the monolingual feature representations learned by MSFL_ST, we first obtain the feature weights from the attention layer of MSFL. Second, monolingual single-task experiments are run to obtain the feature weights of each language from MSFL_ST. The features are then sorted by attention weight in descending order, and the top 30 MSFs are compared with the top 30 EMO-DB features (EFs), CASIA features (CFs), and SAVEE features (SFs) learned from MSFL_ST. Owing to limited space, only the top 10 features of the MSFs, EFs, CFs, and SFs are listed in Table 7. The procedure of feature ranking and comparison is shown in Figure 7 and sketched in code below.
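The ranking step itself reduces to averaging the attention weights over a set of utterances and sorting the feature names; the sketch below assumes the MSFL sketch from Section 3 and a hypothetical list of the 88 eGeMAPS feature names in extraction order.

```python
import torch

def rank_shared_features(model, utterances, feature_names):
    """Rank eGeMAPS feature names by mean attention weight (descending).

    `model` is the MSFL sketch from Section 3, `utterances` is an (N, 88)
    tensor of HSF vectors, and `feature_names` lists the 88 eGeMAPS feature
    names in extraction order (both are assumptions).
    """
    model.eval()
    with torch.no_grad():
        _, _, alpha = model(utterances)      # alpha: (N, 88) attention weights
    mean_weights = alpha.mean(dim=0)         # average weight per feature
    order = torch.argsort(mean_weights, descending=True)
    return [feature_names[int(i)] for i in order]

# Example: the top 10 multilingual shared features from the combined test set.
# top10 = rank_shared_features(msfl_model, test_hsfs, egemaps_names)[:10]
```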
As shown in Table 7, the LLDs in the top 10 HSFs of the three monolingual corpora (i.e., EFs, CFs, and SFs) are as follows:
  • EMO-DB: HNR, Spectral Slope 0–500 Hz, Harmonic difference H1–H2, Hammarberg index, MFCC-3, F0 Pitch, and Formant-2 frequency;
  • CASIA: Spectral Flux, Spectral Slope 0–500 Hz and 500–1500 Hz, Formant-2 bandwidth, Hammarberg index, MFCC-1, Formant-1 frequency, and Loudness;
  • SAVEE: F0 Pitch, Spectral Slope 0–500 Hz and 500–1500 Hz, Formant-2 bandwidth, Spectral flux.
It can be seen that the emotion-related features of the languages differ from one another, while Spectral Slope ranks high in all three languages; correspondingly, the MSF named slopeV0-500_sma3nz_amean ranks first. In addition, the top feature of each monolingual set, namely HNRdBACF_sma3nz_amean, spectralFlux_sma3_stddevNorm, and F0semitoneFrom27.5Hz_sma3nz_pctlrange0–2, ranks fourth, fifth, and second among the MSFs, respectively. In other words, except for the third feature, the top five MSFs cover both the feature common to the three languages and the first feature of each language. It is worth mentioning that the third MSF also concerns the Spectral Slope, which coincides with the fact that Spectral Slope plays a vital role in emotional expression in SER [74]. We also find it interesting that HSFs related to F0 pitch account for almost half of the top 10 SFs, showing that F0 pitch is an essential LLD in SAVEE. Beyond SFs, this vital LLD appears in the top 10 EFs but not in the top 10 CFs. This is probably because SAVEE (English) and EMO-DB (German) belong to the same Germanic language family, so their acoustic features show a certain similarity when expressing emotions.
To provide a macro view of the similarities and differences in the top 30 monolingual and multilingual feature rankings, the features are divided into the following groups: Equivalent Sound Level, Temporal Features, Frequency, Energy, and Spectral (Figure 8). Spectral features account for the largest portion in all rankings, probably because spectral parameters make up a large share of the eGeMAPS features and carry strong emotional information, which is also why they are widely used in end-to-end models. Apart from the Spectral group, the different languages emphasize different features: for EMO-DB, the Temporal Features and Equivalent Sound Level are essential for recognizing emotion, whereas energy-related features are more critical in CASIA and frequency-related features in SAVEE. The MSFs balance the proportions of the feature groups, which may indicate that the model maximizes the learning of features shared by all languages and discards some language-sensitive features to achieve generalization.

6. Conclusions and Future Work

Multi-task learning can learn robust and generalized feature representations from multiple tasks to better enable knowledge sharing among them, which is why it is widely used in many fields, and SER is no exception. However, previous studies have applied MTL purely as a learning algorithm and have ignored the interpretability of the method and of its generalized feature representations. In this study, we move beyond these limitations and propose an explainable multitask-based shared feature learning model that uses the gradient normalization adaptive loss balancing method. We establish the association between emotions and languages by jointly training the language recognition and emotion recognition tasks, addressing the problem that acoustic features are overly sensitive to a single language and leave SER models poorly generalized in multilingual scenarios. Our model learns a set of shared emotional features in the multilingual case and explains why multi-task learning improves the generalization ability of the model. It also improves the interpretability of multi-task learning from the perspective of acoustic features and provides a useful reference for subsequent multilingual research.
Experiments were conducted on three corpora: EMO-DB, CASIA, and SAVEE. The experimental results demonstrate that the proposed model is more effective for speech emotion recognition on the three corpora than the compared models. In summary: (1) The gradient normalization method balances the task weights efficiently by computing the gradients at each training step; compared with adjusting the task weights based on experience, it removes the heavy workload of manual tuning and improves model performance. (2) The multi-task learning strategy enables MSFL to learn generalized features across multiple languages, improving recognition performance and alleviating the confusion between emotions, such as anger and happiness in EMO-DB and neutral and sadness in SAVEE. (3) The top 10 multilingual shared features cover almost all of the top-ranked monolingual features, especially slopeV0-500_sma3nz_amean, which occurs in all three monolingual feature sets and ranks first among the multilingual shared features. In addition, the top 30 feature rankings reveal both commonalities and differences among the three languages: the commonalities provide a theoretical basis for multi-language, multi-corpus data aggregation to alleviate data sparsity, and the differences open up research prospects for multilingual speech emotion recognition.
Future work will focus on multilingual acoustic features and the multi-task SER model. For the multilingual acoustic features, we will increase the number of corpora for each language and mitigate the influence of factors other than language on speech emotion, enhancing the generalizability of features across multiple corpora. On this basis, we will also attempt to fuse deep features to improve model performance. Moreover, modalities such as text and facial expressions have developed rapidly in recent years; in facial expression recognition in particular, the EmoAffectNet [75] and Emotion-GCN [76] models have shown good performance, and multimodal features could be combined with speech features in the future to explore commonalities and differences in multilingual emotional expression from multiple modal cues [77]. For the multi-task model, the first issue is the computational cost of the proposed model: although MSFL performs better than MTL_DNN, it runs roughly half as fast, which reminds us to consider model performance and model complexity together in future improvements. The second issue is the type of multi-task model: current multi-task SER models focus on hard parameter sharing, which requires strong correlation between tasks and limits both performance improvement and research on related tasks. Soft parameter sharing, in which each task has its own model and parameters whose correlation is encouraged by constraints between the models, can therefore be considered in the future. We expect these directions to further deepen the study of multilingual shared features and broaden the research perspective of multi-task SER.

Author Contributions

Conceptualization, Y.M. and W.W.; methodology, Y.M. and W.W.; software, Y.M. and W.W.; validation, Y.M. and W.W.; formal analysis, Y.M.; investigation, Y.M. and W.W.; data curation, Y.M.; writing—original draft preparation, Y.M.; writing—review and editing, Y.M. and W.W.; visualization, Y.M.; supervision, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Chinese National Social Science Foundation, grant number BCA150054.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to acknowledge Chinese National Social Science Foundation.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dellaert, F.; Polzin, T.; Waibel, A. Recognizing Emotion in Speech. In Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP ’96, Philadelphia, PA, USA, 3–6 October 1996; Volume 3, pp. 1970–1973. [Google Scholar]
  2. Savchenko, A.V.; Savchenko, L.V.; Makarov, I. Classifying Emotions and Engagement in Online Learning Based on a Single Facial Expression Recognition Neural Network. IEEE Trans. Affect. Comput. 2022, 13, 2132–2143. [Google Scholar] [CrossRef]
  3. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  4. Zhong, P.; Wang, D.; Miao, C. EEG-Based Emotion Recognition Using Regularized Graph Neural Networks. IEEE Trans. Affect. Comput. 2022, 13, 1290–1301. [Google Scholar] [CrossRef]
  5. Li, H.F.; Chen, J.; Ma, L.; Bo, H.J.; Xu, C.; Li, H.W. Dimensional Speech Emotion Recognition Review. Ruan Jian Xue Bao/J. Softw. 2020, 31, 2465–2491. (In Chinese) [Google Scholar]
  6. Kakuba, S.; Poulose, A.; Han, D.S. Attention-Based Multi-Learning Approach for Speech Emotion Recognition with Dilated Convolution. IEEE Access 2022, 10, 122302–122313. [Google Scholar] [CrossRef]
  7. Jiang, P.; Xu, X.; Tao, H.; Zhao, L.; Zou, C. Convolutional-Recurrent Neural Networks with Multiple Attention Mechanisms for Speech Emotion Recognition. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 1564–1573. [Google Scholar] [CrossRef]
  8. Guo, L.; Wang, L.; Dang, J.; Chng, E.S.; Nakagawa, S. Learning Affective Representations Based on Magnitude and Dynamic Relative Phase Information for Speech Emotion Recognition. Speech Commun. 2022, 136, 118–127. [Google Scholar] [CrossRef]
  9. Vögel, H.-J.; Süß, C.; Hubregtsen, T.; Ghaderi, V.; Chadowitz, R.; André, E.; Cummins, N.; Schuller, B.; Härri, J.; Troncy, R.; et al. Emotion-Awareness for Intelligent Vehicle Assistants: A Research Agenda. In Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, Gothenburg, Sweden, 28 May 2018; pp. 11–15. [Google Scholar]
  10. Tanko, D.; Dogan, S.; Burak Demir, F.; Baygin, M.; Engin Sahin, S.; Tuncer, T. Shoelace Pattern-Based Speech Emotion Recognition of the Lecturers in Distance Education: ShoePat23. Appl. Acoust. 2022, 190, 108637. [Google Scholar] [CrossRef]
  11. Huang, K.-Y.; Wu, C.-H.; Su, M.-H.; Kuo, Y.-T. Detecting Unipolar and Bipolar Depressive Disorders from Elicited Speech Responses Using Latent Affective Structure Model. IEEE Trans. Affect. Comput. 2020, 11, 393–404. [Google Scholar] [CrossRef]
  12. Merler, M.; Mac, K.-N.C.; Joshi, D.; Nguyen, Q.-B.; Hammer, S.; Kent, J.; Xiong, J.; Do, M.N.; Smith, J.R.; Feris, R.S. Automatic Curation of Sports Highlights Using Multimodal Excitement Features. IEEE Trans. Multimed. 2019, 21, 1147–1160.
  13. Vogt, T.; André, E. Improving Automatic Emotion Recognition from Speech via Gender Differentiation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06); European Language Resources Association (ELRA): Genoa, Italy, 2006.
  14. Mill, A.; Allik, J.; Realo, A.; Valk, R. Age-Related Differences in Emotion Recognition Ability: A Cross-Sectional Study. Emotion 2009, 9, 619–630.
  15. Latif, S.; Qayyum, A.; Usman, M.; Qadir, J. Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages. In Proceedings of the 2018 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 19 December 2018; pp. 88–93.
  16. Ding, N.; Sethu, V.; Epps, J.; Ambikairajah, E. Speaker Variability in Emotion Recognition—An Adaptation Based Approach. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 5101–5104.
  17. Feraru, S.M.; Schuller, D.; Schuller, B. Cross-Language Acoustic Emotion Recognition: An Overview and Some Tendencies. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Xi’an, China, 21–24 September 2015; pp. 125–131.
  18. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans. Affect. Comput. 2016, 7, 190–202.
  19. Schuller, B.; Steidl, S.; Batliner, A.; Burkhardt, F.; Devillers, L.; Müller, C.; Narayanan, S.S. The INTERSPEECH 2010 Paralinguistic Challenge. In Proceedings of the Interspeech 2010, ISCA, Chiba, Japan, 26–30 September 2010; pp. 2794–2797.
  20. Qadri, S.A.A.; Gunawan, T.S.; Kartiwi, M.; Mansor, H.; Wani, T.M. Speech Emotion Recognition Using Feature Fusion of TEO and MFCC on Multilingual Databases. In Proceedings of the Recent Trends in Mechatronics Towards Industry 4.0; Ab. Nasir, A.F., Ibrahim, A.N., Ishak, I., Mat Yahya, N., Zakaria, M.A., Abdul Majeed, A.P.P., Eds.; Springer: Singapore, 2022; pp. 681–691.
  21. Origlia, A.; Galatà, V.; Ludusan, B. Automatic Classification of Emotions via Global and Local Prosodic Features on a Multilingual Emotional Database. In Proceedings of the Fifth International Conference Speech Prosody 2010, Chicago, IL, USA, 10–14 May 2010.
  22. Bandela, S.R.; Kumar, T.K. Stressed Speech Emotion Recognition Using Feature Fusion of Teager Energy Operator and MFCC. In Proceedings of the 2017 8th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Delhi, India, 3–5 July 2017; pp. 1–5.
  23. Rao, K.S.; Koolagudi, S.G. Robust Emotion Recognition Using Sentence, Word and Syllable Level Prosodic Features. In Robust Emotion Recognition Using Spectral and Prosodic Features; Rao, K.S., Koolagudi, S.G., Eds.; SpringerBriefs in Electrical and Computer Engineering; Springer: New York, NY, USA, 2013; pp. 47–69. ISBN 978-1-4614-6360-3.
  24. Araño, K.A.; Gloor, P.; Orsenigo, C.; Vercellis, C. When Old Meets New: Emotion Recognition from Speech Signals. Cogn. Comput. 2021, 13, 771–783.
  25. Wang, C.; Ren, Y.; Zhang, N.; Cui, F.; Luo, S. Speech Emotion Recognition Based on Multi-feature and Multi-lingual Fusion. Multimed. Tools Appl. 2022, 81, 4897–4907.
  26. Sun, L.; Chen, J.; Xie, K.; Gu, T. Deep and Shallow Features Fusion Based on Deep Convolutional Neural Network for Speech Emotion Recognition. Int. J. Speech Technol. 2018, 21, 931–940.
  27. Yao, Z.; Wang, Z.; Liu, W.; Liu, Y.; Pan, J. Speech Emotion Recognition Using Fusion of Three Multi-Task Learning-Based Classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Commun. 2020, 120, 11–19.
  28. Al-onazi, B.B.; Nauman, M.A.; Jahangir, R.; Malik, M.M.; Alkhammash, E.H.; Elshewey, A.M. Transformer-Based Multilingual Speech Emotion Recognition Using Data Augmentation and Feature Fusion. Appl. Sci. 2022, 12, 9188.
  29. Issa, D.; Fatih Demirci, M.; Yazici, A. Speech Emotion Recognition with Deep Convolutional Neural Networks. Biomed. Signal Process. Control. 2020, 59, 101894.
  30. Li, X.; Akagi, M. Improving Multilingual Speech Emotion Recognition by Combining Acoustic Features in a Three-Layer Model. Speech Commun. 2019, 110, 1–12.
  31. Heracleous, P.; Yoneyama, A. A Comprehensive Study on Bilingual and Multilingual Speech Emotion Recognition Using a Two-Pass Classification Scheme. PLoS ONE 2019, 14, e0220386.
  32. Sagha, H.; Matějka, P.; Gavryukova, M.; Povolny, F.; Marchi, E.; Schuller, B. Enhancing Multilingual Recognition of Emotion in Speech by Language Identification. In Proceedings of the Interspeech 2016, ISCA, San Francisco, CA, USA, 8 September 2016; pp. 2949–2953.
  33. Bertero, D.; Kampman, O.; Fung, P. Towards Universal End-to-End Affect Recognition from Multilingual Speech by ConvNets. arXiv 2019, arXiv:1901.06486.
  34. Neumann, M.; Vu, N.T. Cross-Lingual and Multilingual Speech Emotion Recognition on English and French. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5769–5773.
  35. Zehra, W.; Javed, A.R.; Jalil, Z.; Khan, H.U.; Gadekallu, T.R. Cross Corpus Multi-Lingual Speech Emotion Recognition Using Ensemble Learning. Complex Intell. Syst. 2021, 7, 1845–1854.
  36. Sultana, S.; Iqbal, M.Z.; Selim, M.R.; Rashid, M.M.; Rahman, M.S. Bangla Speech Emotion Recognition and Cross-Lingual Study Using Deep CNN and BLSTM Networks. IEEE Access 2022, 10, 564–578.
  37. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Schuller, B.W. Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2022.
  38. Tamulevičius, G.; Korvel, G.; Yayak, A.B.; Treigys, P.; Bernatavičienė, J.; Kostek, B. A Study of Cross-Linguistic Speech Emotion Recognition Based on 2D Feature Spaces. Electronics 2020, 9, 1725.
  39. Fu, C.; Dissanayake, T.; Hosoda, K.; Maekawa, T.; Ishiguro, H. Similarity of Speech Emotion in Different Languages Revealed by a Neural Network with Attention. In Proceedings of the 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, CA, USA, 3–5 February 2020; pp. 381–386.
  40. Caruana, R. Multitask Learning. Mach. Learn. 1997, 28, 41–75.
  41. Lee, S. The Generalization Effect for Multilingual Speech Emotion Recognition across Heterogeneous Languages. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5881–5885.
  42. Zhang, Y.; Liu, Y.; Weninger, F.; Schuller, B. Multi-Task Deep Neural Network with Shared Hidden Layers: Breaking down the Wall between Emotion Representations. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4990–4994.
  43. Sharma, M. Multi-Lingual Multi-Task Speech Emotion Recognition Using Wav2vec 2.0. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6907–6911.
  44. Gerczuk, M.; Amiriparian, S.; Ottl, S.; Schuller, B.W. EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2021.
  45. Akçay, M.B.; Oğuz, K. Speech Emotion Recognition: Emotional Models, Databases, Features, Preprocessing Methods, Supporting Modalities, and Classifiers. Speech Commun. 2020, 116, 56–76.
  46. Wang, W.; Cao, X.; Li, H.; Shen, L.; Feng, Y.; Watters, P. Improving Speech Emotion Recognition Based on Acoustic Words Emotion Dictionary. Nat. Lang. Eng. 2020, 27, 747–761.
  47. Hsu, J.-H.; Su, M.-H.; Wu, C.-H.; Chen, Y.-H. Speech Emotion Recognition Considering Nonverbal Vocalization in Affective Conversations. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1675–1686.
  48. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J. Direct Modelling of Speech Emotion from Raw Speech. In Proceedings of the Interspeech 2019, ISCA, Graz, Austria, 15 September 2019; pp. 3920–3924.
  49. Wu, X.; Cao, Y.; Lu, H.; Liu, S.; Wang, D.; Wu, Z.; Liu, X.; Meng, H. Speech Emotion Recognition Using Sequential Capsule Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3280–3291.
  50. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
  51. Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech Emotion Recognition with Dual-Sequence LSTM Architecture. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6474–6478.
  52. Graves, A.; Jaitly, N.; Mohamed, A. Hybrid Speech Recognition with Deep Bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–12 December 2013; pp. 273–278.
  53. Wang, Y.; Zhang, X.; Lu, M.; Wang, H.; Choe, Y. Attention Augmentation with Multi-Residual in Bidirectional LSTM. Neurocomputing 2020, 385, 340–347.
  54. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231.
  55. Hu, D.; Wei, L.; Huai, X. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 7042–7052.
  56. Zhang, Y.; Yang, Q. An Overview of Multi-Task Learning. Natl. Sci. Rev. 2018, 5, 30–43.
  57. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Qadir, J.; Schuller, B.W. Survey of Deep Representation Learning for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2021.
  58. Zhang, Z.; Wu, B.; Schuller, B. Attention-Augmented End-to-End Multi-Task Learning for Emotion Prediction from Speech. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6705–6709.
  59. Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of the Interspeech 2019, ISCA, Graz, Austria, 15 September 2019; pp. 2803–2807.
  60. Fu, C.; Liu, C.; Ishi, C.T.; Ishiguro, H. An End-to-End Multitask Learning Model to Improve Speech Emotion Recognition. In Proceedings of the 2020 28th European Signal Processing Conference (EUSIPCO), Virtual, 18–21 January 2021; pp. 1–5.
  61. Li, X.; Lu, G.; Yan, J.; Zhang, Z. A Multi-Scale Multi-Task Learning Model for Continuous Dimensional Emotion Recognition from Audio. Electronics 2022, 11, 417.
  62. Thung, K.-H.; Wee, C.-Y. A Brief Review on Multi-Task Learning. Multimed. Tools Appl. 2018, 77, 29705–29725.
  63. Xia, R.; Liu, Y. A Multi-Task Learning Framework for Emotion Recognition Using 2D Continuous Space. IEEE Trans. Affect. Comput. 2017, 8, 3–14.
  64. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J.; Schuller, B.W. Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2022, 13, 992–1004.
  65. Atmaja, B.T.; Akagi, M. Dimensional Speech Emotion Recognition from Speech Features and Word Embeddings by Using Multitask Learning. APSIPA Trans. Signal Inf. Process. 2020, 9, e17.
  66. Kim, J.-W.; Park, H. Multi-Task Learning for Improved Recognition of Multiple Types of Acoustic Information. IEICE Trans. Inf. Syst. 2021, E104.D, 1762–1765.
  67. Chen, Z.; Badrinarayanan, V.; Lee, C.-Y.; Rabinovich, A. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 3 July 2018; pp. 794–803.
  68. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A Database of German Emotional Speech. In Proceedings of the Interspeech 2005, ISCA, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520.
  69. Pan, S.; Tao, J.; Li, Y. The CASIA Audio Emotion Recognition Method for Audio/Visual Emotion Challenge 2011. In Proceedings of the Affective Computing and Intelligent Interaction; D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 388–395.
  70. Jackson, P.; Haq, S. Surrey Audio-Visual Expressed Emotion (SAVEE) Database; University of Surrey: Guildford, UK, 2014.
  71. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognit. 2011, 44, 572–587.
  72. Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu Features? End-to-End Speech Emotion Recognition Using a Deep Convolutional Recurrent Network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204.
  73. He, Y.; Feng, X.; Cheng, C.; Ji, G.; Guo, Y.; Caverlee, J. MetaBalance: Improving Multi-Task Recommendations via Adapting Gradient Magnitudes of Auxiliary Tasks. In Proceedings of the ACM Web Conference 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2205–2215.
  74. Eyben, F.; Weninger, F.; Schuller, B. Affect Recognition in Real-Life Acoustic Conditions—A New Perspective on Feature Selection. In Proceedings of the Interspeech 2013, ISCA, Lyon, France, 25–29 August 2013; pp. 2044–2048.
  75. Ryumina, E.; Dresvyanskiy, D.; Karpov, A. In Search of a Robust Facial Expressions Recognition Model: A Large-Scale Visual Cross-Corpus Study. Neurocomputing 2022, 514, 435–450.
  76. Antoniadis, P.; Filntisis, P.P.; Maragos, P. Exploiting Emotional Dependencies with Graph Convolutional Networks for Facial Expression Recognition; IEEE Computer Society: Washington, DC, USA, 2021; pp. 1–8.
  77. Kakuba, S.; Poulose, A.; Han, D.S. Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features. IEEE Access 2022, 10, 125538–125551.
Figure 1. The framework of multi-task SER based on SIMO and MIMO.
Figure 2. The overview of the MSFL model.
Figure 3. The performance of two tasks in the E_MSFL with different alpha values.
Figure 4. The confusion matrix of the EMO-DB corpus. (a) The confusion matrix for the MSFL_ST model. (b) The confusion matrix for the MTL_DNN model. (c) The confusion matrix for the MSFL model.
Figure 5. The confusion matrix of the CASIA corpus. (a) The confusion matrix for the MSFL_ST model. (b) The confusion matrix for the MTL_DNN model. (c) The confusion matrix for the MSFL model.
Figure 6. The confusion matrix of the SAVEE corpus. (a) The confusion matrix for the MSFL_ST model. (b) The confusion matrix for the MTL_DNN model. (c) The confusion matrix for the MSFL model.
Figure 7. The feature ranking and comparison procedure based on the attention mechanism.
Figure 8. The comparison of the top 30 features in EFs, CFs, SFs, and MSFs.
Table 1. The detailed information of corpora used in our experiment.

| Corpus | Language | Size | Actors | Sampling Rate | Neu | Ang | Hap | Sad |
|---|---|---|---|---|---|---|---|---|
| EMO-DB | German | 339 | 10 (5M, 5F) | 16 kHz | 79 | 127 | 71 | 62 |
| CASIA | Chinese | 800 | 4 (2M, 2F) | 22.05 kHz | 200 | 200 | 200 | 200 |
| SAVEE | English | 300 | 4 (4M) | 44.1 kHz | 120 | 60 | 60 | 60 |

Abbreviations: M: Male; F: Female; Neu: Neutral; Ang: Anger; Hap: Happiness; Sad: Sadness.
Table 2. Overview of LLDs and functions used in our experiment.

| Parameter Group | Low-Level Descriptors (LLDs) | Functions | Amount |
|---|---|---|---|
| Frequency | Pitch | (V): am, stddevNorm, 20th, 50th, and 80th percentile, the range of 20th to 80th percentile, the mean and standard deviation of the slope of rising/falling signal parts | 10 |
| Frequency | Jitter | (V): am, stddevNorm | 2 |
| Frequency | Formant 1, 2 and 3 frequency | (V): am, stddevNorm | 6 |
| Frequency | Formant 1, 2 and 3 bandwidth | (V): am, stddevNorm | 6 |
| Energy | Shimmer | (V): am, stddevNorm | 2 |
| Energy | Loudness | am, stddevNorm, 20th, 50th, and 80th percentile, pctlrange0–2, the mean and standard deviation of the slope of rising/falling signal parts | 10 |
| Energy | Harmonics-to-Noise Ratio (HNR) | (V): am, stddevNorm | 2 |
| Spectral | Alpha Ratio | (V): am, stddevNorm; (UV): am | 3 |
| Spectral | Hammarberg Index | (V): am, stddevNorm; (UV): am | 3 |
| Spectral | Spectral Slope 0–500 Hz and 500–1500 Hz | (V): am, stddevNorm; (UV): am | 6 |
| Spectral | Formant 1, 2 and 3 relative energy | (V): am, stddevNorm | 6 |
| Spectral | Harmonic difference H1-H2 | (V): am, stddevNorm | 2 |
| Spectral | Harmonic difference H1-A3 | (V): am, stddevNorm | 2 |
| Spectral | MFCC 1–4 | am, stddevNorm; (V): am, stddevNorm | 16 |
| Spectral | Spectral flux | am, stddevNorm; (V): am, stddevNorm; (UV): am | 5 |
Abbreviations: am: arithmetic mean; stddevNorm: coefficient of variation; pctlrange0–2: the range of the 20th to 80th percentile; UV: functions are applied to the unvoiced segment; V: functions are applied to the voiced segment. No special markings mean that functions are applied in all segments.
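As a minimal illustration of how an 88-dimensional functional set of this kind can be extracted, the sketch below uses the openSMILE Python wrapper with its eGeMAPSv02 functional configuration; the package, feature-set choice, and file path are assumptions for illustration and not a restatement of the paper's exact extraction pipeline.

```python
# Minimal feature-extraction sketch (assumption: openSMILE's Python wrapper and
# its eGeMAPSv02 functional set, which yields 88 utterance-level features
# comparable to Table 2). Not the authors' exact toolchain.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

features = smile.process_file("example_utterance.wav")  # hypothetical path
print(features.shape)               # (1, 88): one row of functionals per utterance
print(list(features.columns)[:5])   # feature names, e.g. F0 and loudness statistics
```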
Table 3. Settings of hyperparameters in training.

| Notation | Meaning | Value |
|---|---|---|
| d_lstm | Number of cells in LSTM | 32 |
| d_att | Number of nodes in attention | 88 |
| d_den1 | Number of nodes in FC1 | 512 |
| d_den2 | Number of nodes in FC2 | 300 |
| d_soft1 | Number of nodes in softmax1 | 4 |
| d_soft2 | Number of nodes in softmax2 | 3 |
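To make the sizes in Table 3 concrete, the following is a hypothetical Keras reconstruction of an MSFL-style network: an attention layer assigning a weight to each of the 88 input features, an LSTM with 32 cells, two fully connected layers (512 and 300 nodes), and two softmax heads for the 4 emotions and 3 languages. The sequence length, the pooling inside the attention block, and the exact layer wiring are assumptions; the authoritative architecture is Figure 2.

```python
# Hypothetical Keras sketch of an MSFL-style network sized per Table 3; the
# wiring of the published model (Figure 2) may differ.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, D = 100, 88                       # assumed number of time steps; 88 features per step (Table 2)

x_in = layers.Input(shape=(T, D))

# Feature-level attention: pool over time, score the 88 features, softmax to weights
# (these 88 weights are what a ranking procedure like Table 7's would sort).
pooled = layers.GlobalAveragePooling1D()(x_in)
feat_weights = layers.Dense(88, activation="softmax", name="feature_attention")(pooled)  # d_att = 88
weighted = layers.Lambda(lambda z: z[0] * tf.expand_dims(z[1], axis=1))([x_in, feat_weights])

h = layers.LSTM(32)(weighted)                    # d_lstm = 32
h = layers.Dense(512, activation="relu")(h)      # FC1, d_den1 = 512
h = layers.Dense(300, activation="relu")(h)      # FC2, d_den2 = 300
emotion = layers.Dense(4, activation="softmax", name="emotion")(h)    # softmax1: 4 emotions
language = layers.Dense(3, activation="softmax", name="language")(h)  # softmax2: 3 languages

model = Model(x_in, [emotion, language])
model.compile(
    optimizer="adam",
    loss={"emotion": "sparse_categorical_crossentropy",
          "language": "sparse_categorical_crossentropy"},
    loss_weights={"emotion": 0.5, "language": 0.5},  # placeholder weights; see the loss sketch after Table 5
)
model.summary()
```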
Table 4. The brief conclusion of experiments in our study.

| Purpose | Experiment | Introduction | Result |
|---|---|---|---|
| Model Generalization | Experiment I | Verify the effect of the gradient normalization method introduced in MTL on the generalization ability of the model | The gradient normalization method performs better than the other weight adjustment methods |
| Model Generalization | Experiment II | Compare the generalization ability of MSFL with that of other baseline models | MSFL achieves better results than most models, with an average improvement of 3.37–4.49% |
| Model Interpretability | Experiment III | Analyze and compare the features in MSFs and the monolingual features | The top 10 MSFs almost contain the top-ranked monolingual features |
Table 5. Experimental performance comparison of MSFL models with different weight adjustment methods.

| Models | Metric | Emotion (E) | Emotion (C) | Emotion (S) | Emotion (Avg) | Language Recognition |
|---|---|---|---|---|---|---|
| E_MSFL | UAR | 82.58% | 82.20% | 76.67% | 80.77% | 99.20% |
| E_MSFL | ACC | 82.80% | 82.86% | 77.43% | 81.09% | 98.96% |
| M_MSFL | UAR | 83.86% | 83.08% | 75.75% | 80.89% | 99.10% |
| M_MSFL | ACC | 83.77% | 82.99% | 76.81% | 81.19% | 98.83% |
| G_MSFL | UAR | 85.24% | 84.44% | 75.33% | 81.66% | 99.35% |
| G_MSFL | ACC | 85.29% | 84.41% | 75.89% | 81.86% | 99.13% |
Abbreviations: E: EMO-DB; C: CASIA; S: SAVEE; Avg: average performance of EMO-DB, CASIA and SAVEE.
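Figure 3 and Table 5 compare ways of weighting the two task losses. Purely as an illustration of the fixed-weight case (the exact scheme behind each variant is defined in the method section, and GradNorm [67] adapts the weights during training), a single trade-off weight alpha can be applied to the Keras sketch above; the value of alpha here is hypothetical.

```python
# Fixed-weight joint loss: alpha weights the emotion loss, (1 - alpha) the
# language loss (cf. the alpha sweep in Figure 3). Hypothetical alpha value;
# adaptive schemes such as GradNorm [67] re-balance these weights in training.
alpha = 0.7
model.compile(
    optimizer="adam",
    loss={"emotion": "sparse_categorical_crossentropy",
          "language": "sparse_categorical_crossentropy"},
    loss_weights={"emotion": alpha, "language": 1.0 - alpha},
)
```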
Table 6. Accuracy of different models for SER on discrete corpora.

| Models | EMO-DB | CASIA | SAVEE | Average |
|---|---|---|---|---|
| MSFL_ST | 81.93% | 79.05% | 71.14% | 77.37% |
| MTL_DNN | 81.14% | 82.30% | 72.03% | 78.49% |
| MT-SHL-DNN (2019) [42] | 82.34% | - | - | - |
| CAbiLS (2020) [39] | 75.61% | 64.33% | - | - |
| Ensemble Model (2021) [35] | 89.75% | - | 69.31% | - |
| MSFL (our proposed) | 85.29% | 84.41% | 75.89% | 81.86% |
Table 7. Feature ranking based on attention weights.

| Ranking | Multilingual Shared Features (MSFs) | EMO-DB Features (EFs) | CASIA Features (CFs) | SAVEE Features (SFs) |
|---|---|---|---|---|
| 1 | slopeV0-500_sma3nz_amean | HNRdBACF_sma3nz_amean | spectralFlux_sma3_stddevNorm | F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2 |
| 2 | F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2 | slopeV0-500_sma3nz_amean | slopeV500-1500_sma3nz_amean | F0semitoneFrom27.5Hz_sma3nz_amean |
| 3 | slopeUV500-1500_sma3nz_amean | StddevUnvoicedSegmentLength | slopeV0-500_sma3nz_amean | slopeV500-1500_sma3nz_amean |
| 4 | HNRdBACF_sma3nz_amean | logRelF0-H1-H2_sma3nz_amean | F2bandwidth_sma3nz_amean | F0semitoneFrom27.5Hz_sma3nz_percentile80.0 |
| 5 | spectralFlux_sma3_stddevNorm | EquivalentSoundLevel_dBp | hammarbergIndexV_sma3nz_amean | F2bandwidth_sma3nz_stddevNorm |
| 6 | loudness_sma3_percentile20.0 | hammarbergIndexV_sma3nz_amean | mfcc1V_sma3nz_amean | slopeV0-500_sma3nz_amean |
| 7 | spectralFluxUV_sma3nz_amean | HNRdBACF_sma3nz_stddevNorm | slopeUV0-500_sma3nz_amean | F0semitoneFrom27.5Hz_sma3nz_meanRisingSlope |
| 8 | loudnessPeaksPerSec | mfcc3_sma3_stddevNorm | F1frequency_sma3nz_stddevNorm | StddevUnvoicedSegmentLength |
| 9 | loudness_sma3_stddevFallingSlope | F0semitoneFrom27.5Hz_sma3nz_pctlrange0-2 | loudness_sma3_percentile20.0 | spectralFluxV_sma3nz_stddevNorm |
| 10 | F3bandwidth_sma3nz_amean | F2frequency_sma3nz_stddevNorm | hammarbergIndexV_sma3nz_stddevNorm | F0semitoneFrom27.5Hz_sma3nz_stddevRisingSlope |
Bold indicates the common features between the top 10 MSFs and the top 10 monolingual features (i.e., EFs, CFs, and SFs); underline indicates the common features among the top 10 EFs, CFs, and SFs; sma3: the LLD is smoothed with a moving-average window of three frames; nz: only non-zero values of the LLD are considered; pctlrange0-2: the range of the 20th to 80th percentile.
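For completeness, the sketch below is a toy version of the ranking-and-comparison procedure in Figure 7, reusing the hypothetical `feature_attention` layer from the sketch after Table 3: average the 88 attention weights over a set of utterances, sort them in descending order, and intersect the resulting top-10 lists. Names and shapes are assumptions tied to that sketch, not the authors' implementation.

```python
# Toy ranking/comparison sketch behind Table 7 and Figure 7 (assumes the
# hypothetical "model" and its "feature_attention" layer defined earlier).
import numpy as np
from tensorflow.keras import Model

att_model = Model(model.input, model.get_layer("feature_attention").output)

def top_k_features(x, feature_names, k=10):
    """Rank features by their mean attention weight over the utterances in x."""
    weights = att_model.predict(x, verbose=0).mean(axis=0)  # shape (88,)
    order = np.argsort(weights)[::-1]                       # descending
    return [feature_names[i] for i in order[:k]]

# msfs = top_k_features(x_multilingual, names)   # pooled EMO-DB + CASIA + SAVEE
# efs  = top_k_features(x_emodb, names)          # EMO-DB only
# shared = set(msfs) & set(efs)                  # overlap of the kind reported in Table 7
```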