Article

Multi-Modal Emotion Recognition Based on Wavelet Transform and BERT-RoBERTa: An Innovative Approach Combining Enhanced BiLSTM and Focus Loss Function

College of Information Science and Technology, Tibet University, Lhasa 850000, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3262; https://doi.org/10.3390/electronics13163262
Submission received: 12 July 2024 / Revised: 11 August 2024 / Accepted: 14 August 2024 / Published: 16 August 2024
(This article belongs to the Section Circuit and Signal Processing)

Abstract

Emotion recognition plays an increasingly important role in today’s society and has high social value. However, current emotion recognition technology faces the problems of insufficient feature extraction and imbalanced samples when processing speech and text information, which limits the performance of existing models. To overcome these challenges, this paper proposes a multi-modal emotion recognition method based on speech and text. The model is divided into two channels. In the first channel, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) features are extracted with OpenSmile, the original eGeMAPS feature set is merged with the wavelet-transformed eGeMAPS feature set, and speech features are then extracted through a sparse autoencoder. The second channel extracts text features through the BERT-RoBERTa model, then extracts deeper text features through a gated recurrent unit (GRU) and fuses the deeper features with the original text features. Emotions are identified by an attention layer and a dual-layer Bidirectional Long Short-Term Memory (BiLSTM) model, trained with a loss function that combines cross-entropy loss and focus loss. Experiments show that, compared with existing models, the weighted accuracy (WA) and unweighted accuracy (UA) of this model reach 73.95% and 74.27%, respectively, on the imbalanced IEMOCAP dataset, which is superior to other models. These results effectively address the problems of insufficient features and sample imbalance in traditional emotion recognition methods and provide a new approach for emotion analysis applications.

1. Introduction

Emotion recognition technology plays an increasingly important role in today’s society, especially in the era of artificial intelligence and big data. It has not only attracted widespread attention in academia, but has also demonstrated great potential in practical applications. Emotion recognition refers to the identification of corresponding emotional states through the analysis of emotional information in speech, text, or images, including but not limited to basic emotional categories such as happiness, sadness, and anger [1]. In this framework of emotion recognition, speech emotion recognition refers to the analysis of sound features in speech data, such as pitch, speech speed, etc., to infer the emotional state of the speaker [2]. Text emotion recognition is used to identify the author’s emotional tendency and emotional state by analyzing information such as words, semantics, and emotional vocabulary from the text data [3]. With the rapid development of information technology, speech and text, as the main carriers of human emotion expression, play a crucial role in intelligent interaction systems. Speech and text multi-modal emotion recognition technology can comprehensively analyze these two information modalities to more accurately identify and understand the user’s emotional state, and thus has a wide range of application prospects and far-reaching social value in multiple fields. In the field of intelligent customer service, the multi-modal emotion recognition technology can sense the speech and text emotions of customers in real time, helping enterprises to quickly respond to customer needs and improve service quality and customer satisfaction. In the field of autonomous driving, this technology is also of great importance. By recognizing the driver’s voice and text emotions, the system can judge the driver’s driving status and issue timely reminders or take emergency measures to ensure driving safety. In the field of education, the technology can be applied to intelligent teaching systems to monitor students’ learning emotions in real time and provide teachers with personalized teaching suggestions, thus improving teaching effectiveness and learning experience. At the same time, it can also help identify students’ mental health problems and provide data support for timely intervention.
Speech emotion recognition plays an important role in modern society. By analyzing a speaker’s vocal characteristics, such as pitch and speaking rate, a system can accurately capture and understand their emotional state, thereby enhancing user experience and improving communication efficiency. Many scholars have conducted extensive research on speech emotion recognition. In the literature [4], an improved two-layer stacked autoencoder structure was proposed, which combined the advantages of a sparse autoencoder and a denoising autoencoder and achieved a better recognition rate on a Chinese speech emotion recognition task than using either autoencoder alone. However, that study only considered the traditional acoustic characteristics of the speech signal, such as fundamental frequency, sound intensity, and zero-crossing rate, and failed to make full use of richer information, such as the frequency spectrum and time-frequency distribution of the speech signal. It also did not conduct experiments on an imbalanced dataset. One paper [5] combined single-frequency filtering technology and high-order nonlinear energy operators to extract instant modulation spectrum features for emotion recognition, and the detection accuracy rates on the EMODB, FAU-AIBO, and IEMOCAP datasets were 85.75%, 59.88%, and 65.78%, respectively, thus improving the performance and robustness of speech emotion recognition. Although the instant modulation spectrum features showed good robustness under the noise conditions tested, there might still be certain limitations when processing speech signals in highly non-stationary or other complex noise environments. The literature [6] introduced a method based on the semi-natural Egyptian Arabic Emotion dataset and considered a variety of features, including the long-term average spectrum (LTAS) and wavelet parameters, which significantly improved the emotion recognition of Arabic speech. However, the literature [6] only considered the prosodic, spectral, and continuous wavelet transform features of speech signals and did not consider their semantic information. Therefore, in this paper, the wavelet transform is added before the sparse autoencoder in the speech feature extraction channel to obtain more features of the speech signal and increase the richness of the features, so as to reduce the limitations of the above research as much as possible.
The importance of text emotion recognition lies in its ability to understand and analyze the complexity and diversity of human emotional expression. By automatically recognizing emotions in text, many practical applications can be realized. Many scholars have also carried out extensive research on text emotion recognition. By introducing the XLNet-BiGRU-Att deep learning model, based on bidirectional recurrent units and an attention mechanism, one paper [7] enhanced the performance of text emotion recognition on the IEMOCAP and CASIA datasets, achieving accuracies of 91.71% and 85.71%, respectively. However, XLNet might not handle short texts as well as BERT or RoBERTa, because XLNet does not pay special attention to short-text processing during the pre-training phase. The literature [8] introduced a BERT model and an SVM algorithm based on a neural network to improve the accuracy of text emotion recognition. Experiments showed that the model was significantly more accurate than LSTM-RNN and other traditional classifiers for sentiment classification. However, the BERT-SVM model might face large computational resource consumption when processing very long or complex text. In another study [9], a TE-LSTM+SC model was introduced which combined semantic role annotation with BERT processing of the predicate-argument structure and effectively improved the accuracy of text emotion recognition. However, a possible limitation of the algorithm was its limited ability to handle complex contextual relationships. Therefore, in this paper, we further improve the modeling of complex contextual relationships by extracting text features with BERT and RoBERTa, respectively, and then processing them through a GRU, to enhance the model’s ability to process short and long texts as much as possible, thus reducing these limitations.
Multi-modal emotion recognition is crucial for modern information processing and human–computer interaction [10]. By integrating speech, text, and other modal information, users’ emotional states can be captured and understood more comprehensively and accurately, thus improving the naturalness and intelligence of human–computer interaction. It is widely used in intelligent systems, virtual reality, education, and other fields, promoting the in-depth application and development of artificial intelligence technology in real life. Current research on multi-modal emotion recognition is committed to integrating multiple information sources, such as language, voice, and facial expression, and realizing accurate recognition and analysis of users’ emotional states through deep learning and machine learning technologies. The research results include multi-modal emotion recognition models based on deep neural networks, which can effectively integrate different information sources and improve recognition accuracy. One study [11] proposed a multi-modal emotion recognition method based on speech and text. By combining a convolutional neural network (CNN) and a pre-trained BERT model, speech and text features were extracted, and an attention-based fusion mechanism was used to integrate the features of these two modalities, thus improving the accuracy and adaptability of emotion recognition. Experiments showed that the proposed method achieved excellent performance on two benchmark datasets, CMU-MOSEI and MELD, significantly better than existing benchmark models. A new multi-modal emotion recognition method, HFU-BERT, was proposed in reference [12]. By combining pre-trained BERT models and heterogeneous features to fuse audio and visual features, good results were obtained on the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets. Reference [13] introduced a video multi-modal emotion recognition method based on Bi-GRU and an attention fusion network, which effectively improved the accuracy and performance of text, visual, and audio single-modal recognition as well as video multi-modal emotion recognition.
Although many breakthroughs have been made in deep-learning-based multi-modal emotion recognition models, relatively little research has addressed the insufficient feature extraction and low accuracy encountered in speech and text multi-modal emotion recognition. Therefore, aiming at the problems of inadequate feature extraction and low accuracy in multi-modal emotion recognition, this paper proposes a multi-modal emotion recognition method based on speech and text. The model is divided into two channels that extract features separately. The first channel extracts eGeMAPS features through OpenSmile, converts the extracted eGeMAPS features into a frequency-domain representation through the wavelet transform, fuses the original eGeMAPS features with the wavelet-transformed eGeMAPS features, and then optimizes the feature representation through a sparse autoencoder to improve the anti-noise capability. The second channel first combines the different semantic features extracted by the BERT model and the RoBERTa model, respectively, to increase the diversity of text data representation. Then, the mixed features are passed through a GRU to extract effective sequence features, which are combined with the fused text features to obtain multi-level text features. Next, the features extracted from the two channels are fused, and the loss function is decided jointly by cross-entropy loss and focus loss. Finally, we use an attention layer and a dual-layer BiLSTM for emotion recognition.
Briefly, the major contributions include:
  • In the speech feature extraction channel, we add the wavelet transform on the basis of the original sparse autoencoder to increase the richness of features and improve the anti-noise ability. The wavelet transform can capture the features of the signal at different scales, which makes the feature extraction more comprehensive, especially in the face of noise. The model can better separate the effective signal and noise, thus improving the classification accuracy.
  • In the text feature extraction channel, we parallel a RoBERTa model on the basis of the current mainstream BERT model, mix the features extracted from the two models, and then process the effective sequence features through GRU. This method not only enhances the text features, but also improves the processing ability of the model for short text and long text. By using RoBERTa in parallel, we can obtain richer text features. GRU further processes the mixed features, making the model more adaptable to various text lengths.
  • After mixing speech features and text features, we add an attention layer on the basis of the original BiLSTM to enhance the flexibility and performance of the model. The attention layer allows the model to better focus on important features and context information when processing sequence data, thus enhancing the ability to capture key features.
  • For an imbalanced sample dataset like IEMOCAP, we add the focus loss function to the common cross-entropy loss function to improve the classification ability of difficult samples. The focus loss function is specifically used to solve the problem of sample imbalance, and by paying more attention to difficult samples, the model’s performance on imbalanced sample datasets is significantly improved.
  • We conduct comprehensive tests, including a speech single-modal experiment, a text single-modal experiment, a loss function ratio experiment, a multi-modal comparison experiment with different models, and a multi-modal comparison experiment with models from the literature, and the results show that our model performs excellently in speech and text multi-modal emotion recognition. These experiments validate the performance of our model under different conditions and prove its effectiveness and superiority in practical applications.
The rest of this article is organized as follows. Section 2 introduces a multi-modal emotion recognition method based on speech and text. Section 3 describes the data set required for the experiment. Section 4 introduces the experiments and the results. Section 5 summarizes the thesis.

2. Model Structure

The construction of a multi-modal emotion recognition model generally includes speech feature extraction, text feature extraction, and feature fusion. The overall model structure is shown in Figure 1.

2.1. Speech Feature Extraction Channel

The speech feature extraction channel first extracts the eGeMAPS feature set from the voice data through OpenSmile, then transforms the feature set into the frequency domain representation through wavelet transform, combines the original features with the wavelet transformed features, and then improves the feature representation capability through a sparse autoencoder.
OpenSmile is an open-source toolkit for audio feature extraction and analysis, which provides a rich set of audio features and processing capabilities. OpenSmile can extract various features from audio files, including acoustic features (such as fundamental frequency, spectral convexity, etc.), speech quality features (such as noise, distortion, etc.), and advanced speech features (such as emotional features, speech flow features, etc.), and it provides preset feature sets. In this paper, eGeMAPS is selected as the feature set. The eGeMAPS feature set is a standard feature set for speech analysis and emotion recognition and consists of 88 acoustic parameters. It covers the fundamental frequency, spectral characteristics, time-domain characteristics, sound dynamic characteristics, harmonic characteristics, spectral cepstrum coefficients, formant characteristics, and time-frequency characteristics of voice signals.
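As a rough illustration of this step, the sketch below extracts the 88 eGeMAPS functionals with the opensmile Python package; the audio file name is a placeholder, and the paper does not state which OpenSmile interface it used.

```python
# Sketch: extracting the 88 eGeMAPS functionals described above (assumed interface).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # the 88 eGeMAPS acoustic functionals
    feature_level=opensmile.FeatureLevel.Functionals,   # one feature vector per utterance
)

features = smile.process_file("Ses01F_impro01_F000.wav")   # placeholder path; returns a (1, 88) DataFrame
print(features.shape)
```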
Wavelet transform is a signal processing technique that can decompose a signal into frequency components at different scales in order to better understand the time-domain and frequency-domain characteristics of the signal. The Haar wavelet transform is used in this paper. The Haar wavelet function is the simplest basis function and consists of a set of piecewise-constant functions. This set of functions is defined on the half-open interval [0,1), where each piecewise-constant function takes a nonzero value in one small sub-interval and 0 elsewhere. The mother wavelet of the Haar wavelet is represented by
$$\Psi(x) = \begin{cases} 1, & 0 \le x < 1/2 \\ -1, & 1/2 \le x < 1 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
The corresponding scaling function is expressed as:
$$\phi(x) = \begin{cases} 1, & 0 \le x < 1 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
Its filter $h[n]$ is defined as:
$$h[n] = \begin{cases} 1/\sqrt{2}, & n = 0, 1 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
Hence,
$$\Psi(t/2) = \sqrt{2} \sum_{n} (-1)^{1-n}\, h[1-n]\, \phi(t-n) \tag{4}$$
In other words,
$$\Psi(x) = \phi(2x) - \phi(2x-1) \tag{5}$$
The Haar scaling function captures the coarse-scale information of the signal, while the Haar wavelet function represents the detail information of the signal. In this paper, the Haar wavelet transform is used for one-dimensional wavelet decomposition to obtain approximation coefficients and detail coefficients. The obtained approximation and detail coefficients are combined into a one-dimensional array, and the combined wavelet-transform features are reconstructed into the same shape as the original features to maintain the consistency of the features.
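As a rough illustration of this step, the sketch below applies a one-level Haar decomposition to an eGeMAPS vector with the PyWavelets library and merges it with the original features; the random input vector and the truncation/padding rule are assumptions, since the paper only states that the combined coefficients are reshaped to match the original features.

```python
# Sketch: Haar decomposition of an eGeMAPS feature vector and fusion with the original features.
import numpy as np
import pywt

def haar_wavelet_features(feat: np.ndarray) -> np.ndarray:
    """One-level Haar decomposition; concatenated coefficients are fitted back to the input length."""
    cA, cD = pywt.dwt(feat, "haar")                 # approximation and detail coefficients
    combined = np.concatenate([cA, cD])             # merge into a single 1-D array
    if combined.size >= feat.size:                  # keep the shape consistent with the original features
        return combined[: feat.size]
    return np.pad(combined, (0, feat.size - combined.size))

egemaps = np.random.randn(88).astype(np.float32)                   # placeholder eGeMAPS vector
fused = np.concatenate([egemaps, haar_wavelet_features(egemaps)])  # original + wavelet features (176-dim)
```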
An autoencoder is an unsupervised method for data dimensionality compression and data feature representation. In most cases where autoencoders are mentioned, the compression and decompression functions are implemented by neural networks.
A simple autoencoder structure is shown in Figure 2. The structure is an encoder from the input to the hidden layer, and a decoder from the hidden layer to the output.
For a sample $x$, the activation value of the middle hidden layer of the autoencoder is the encoding of $x$, that is,
$$z = f(W^{(1)} x + b^{(1)}) \tag{6}$$
where $f$ is the activation function, and $W^{(1)}$ and $b^{(1)}$ are the weight matrix and bias of the encoder, respectively.
The output of the autoencoder is the reconstructed data $x'$:
$$x' = f(W^{(2)} z + b^{(2)}) \tag{7}$$
where $f$ is the activation function, and $W^{(2)}$ and $b^{(2)}$ are the weight matrix and bias of the decoder, respectively.
In this paper, a sparse autoencoder is used, and a regularization term is added to the original autoencoder to prevent overfitting. The regularization terms are shown below.
$$L = \sum_{n=1}^{N} \left\| x^{(n)} - x'^{(n)} \right\|^2 + \eta\, \rho(Z) + \lambda \|W\|^2 \tag{8}$$
where $\sum_{n=1}^{N} \| x^{(n)} - x'^{(n)} \|^2$ is the reconstruction error, which measures the difference between the original input $x$ and the reconstructed output $x'$; $\eta\,\rho(Z)$ is the sparsity constraint, in which $\rho(Z)$ is the sparsity measure of the hidden-layer representation and $\eta$ is the weight that controls the importance of sparsity, encouraging the hidden activations to be as sparse as possible and forcing the network to learn more meaningful features; and $\lambda \|W\|^2$ is the weight regularization, where $\|W\|^2$ is the $L_2$ norm of the weight matrix and $\lambda$ is its importance weight. This term helps prevent excessively large weights, further reducing the risk of overfitting. Through these regularization terms, the sparse autoencoder can effectively learn useful features of the input data while avoiding overfitting and improving the generalization ability of the model.
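A minimal PyTorch sketch of a sparse autoencoder trained with an objective of the form of Equation (8) is given below; the layer sizes, activation choice, L1 sparsity measure, and penalty weights are illustrative assumptions rather than the paper’s exact settings.

```python
# Sketch: sparse autoencoder with reconstruction, sparsity, and weight-regularization terms.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, in_dim=176, hidden_dim=64):          # 176 = 88 original + 88 wavelet features (assumed)
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # z = f(W1 x + b1), Eq. (6)
        x_rec = self.decoder(z)      # x' = f(W2 z + b2), Eq. (7)
        return x_rec, z

def sparse_ae_loss(model, x, x_rec, z, eta=1e-3, lam=1e-4):
    recon = ((x - x_rec) ** 2).sum()                         # reconstruction error term of Eq. (8)
    sparsity = z.abs().mean()                                # simple L1 proxy for the sparsity measure rho(Z)
    l2 = sum((p ** 2).sum() for p in model.parameters())     # ||W||^2 (here over all parameters, a simplification)
    return recon + eta * sparsity + lam * l2

model = SparseAutoencoder()
x = torch.rand(32, 176)
x_rec, z = model(x)
loss = sparse_ae_loss(model, x, x_rec, z)
loss.backward()
```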

2.2. Text Feature Extraction Channel

The text feature extraction channel firstly extracts the text data through the BERT model and RoBERTa model, respectively, and then combines the different semantic features extracted from the two models. Then GRU processes the mixed features, and finally combines them with the fused text features to obtain multi-level text features.
BERT [14] is, as a whole, a self-encoding language model, and two tasks are designed to pre-train it. The first task trains a language model in the masked-LM manner by randomly masking a certain percentage of input tokens and then predicting the masked tokens. The second task adds a sentence-level continuity prediction task on top of the text input; since BERT is trained on continuous text, this task helps the model learn the relationship between consecutive text fragments. A deep bidirectional model is strictly more powerful than a shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, because bidirectional conditioning would allow each word to indirectly “see itself”, so the model could trivially predict the target word from the multi-layered context. To train a deep bidirectional representation, BERT simply masks a certain percentage of input tokens at random and then predicts those masked tokens; this procedure is called “masked LM” (MLM), often referred to as a Cloze task. The final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary, as in a standard LM. In BERT’s pre-training, 15% of all WordPiece tokens in each sequence are masked at random, and only the masked words are predicted rather than reconstructing the entire input. Although this yields a bidirectional pre-trained model, it creates a mismatch between pre-training and fine-tuning, because the [MASK] token does not appear during fine-tuning. To mitigate this, the masked words are not always replaced with the actual [MASK] token. The training data generator randomly selects 15% of the token positions for prediction; Ti denotes the resulting token sequence, in which some tokens are randomly obscured or replaced to simulate what would happen in a real application. If the i-th token is selected, it is replaced with the [MASK] token with 80% probability, with a random token with 10% probability, or left unchanged with 10% probability. Ti is then used to predict the original token with a cross-entropy loss. Many important downstream tasks, such as question answering (QA) and natural language inference (NLI), are based on understanding the relationship between two sentences, which language modeling does not capture directly. To train a model capable of understanding sentence relationships, BERT is additionally pre-trained on a binarized next sentence prediction task that can be generated from any monolingual corpus: when choosing sentences A and B for each pre-training example, there is a 50% probability that B is a random sentence from the corpus (labeled NotNext). Pre-training on this task is very beneficial for both QA and NLI.
The RoBERTa model [15] uses the transformer architecture and includes multiple encoder layers that process input token sequences through self-attention mechanisms and feed-forward neural networks. In the pre-training phase, it learns language representations on large-scale text data through self-supervised tasks such as masked language modeling (a Cloze-style task) and sentence coherence prediction. RoBERTa performs masked language modeling with dynamically generated masks instead of fixed pre-processed masks: the model randomly masks part of the tokens in the input sequence and re-selects the tokens to be masked in each training iteration, making the model more robust and general when dealing with changing masks across iterations. During pre-training, RoBERTa learns to predict the masked tokens through this Cloze task: a portion of the tokens in each training sample is randomly selected and masked, and the model must infer the original tokens at those positions. RoBERTa also trains the model with a sentence coherence prediction task to determine whether the original order of two sentences in a text has been changed, emphasizing the model’s understanding of the more nuanced logical relationships between sentences in a passage.
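To make the feature-mixing step concrete, here is a hedged sketch that extracts sentence-level features from pre-trained BERT and RoBERTa checkpoints with the Hugging Face transformers library and concatenates them; using the first-token hidden state as the sentence feature and the specific bert-base-uncased and roberta-base checkpoints are assumptions, since the paper does not state its pooling strategy or model variants.

```python
# Sketch: mixing semantic features from BERT and RoBERTa (assumed checkpoints and pooling).
import torch
from transformers import AutoTokenizer, AutoModel

def sentence_feature(model_name: str, text: str) -> torch.Tensor:
    """Encode a sentence and return the first-token hidden state as its feature."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output = model(**inputs)
    return output.last_hidden_state[:, 0, :]           # shape (1, 768) for the base models

text = "I can't believe we finally made it."
bert_feat = sentence_feature("bert-base-uncased", text)        # assumed checkpoint
roberta_feat = sentence_feature("roberta-base", text)          # assumed checkpoint
mixed_text_feat = torch.cat([bert_feat, roberta_feat], dim=-1) # 1536-dim mixed text feature
```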
GRU [16], which stands for gated recurrent unit, is a variant of the recurrent neural network. The GRU structure diagram is shown in Figure 3. $x_t$ denotes the input data at the current time $t$, and $h_{t-1}$ denotes the hidden state output at the previous time $t-1$. $\sigma$ and $\tanh$ denote the sigmoid function and the hyperbolic tangent function, respectively. The GRU network structure mainly includes an update gate and a reset gate.
The update gate at time $t$ is:
$$z_t = \sigma\!\left(W_z \cdot [h_{t-1}, x_t]\right) \tag{9}$$
where $z_t$ is the update gate signal, $h_{t-1}$ is the historical hidden state, $x_t$ is the input data at time $t$, $W_z$ is the weight matrix, and $\sigma$ is the sigmoid function.
The reset gate at time $t$ is:
$$r_t = \sigma\!\left(W_r \cdot [h_{t-1}, x_t]\right) \tag{10}$$
where $r_t$ is the reset signal, and $W_r$ is the weight matrix.
Under the action of the update gate $z_t$ and the reset gate $r_t$, the hidden output state $h_t$ can be updated as:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{11}$$
where the candidate hidden state is:
$$\tilde{h}_t = \tanh\!\left(W \cdot [x_t,\; h_{t-1} \odot r_t]\right) \tag{12}$$
In Formula (12), $W$ is the weight matrix.
In the GRU model, Formulas (9) and (10) compute the update gate $z_t$ and the reset gate $r_t$, the two gating signals that determine how the current hidden state is updated. Specifically, the update gate $z_t$ determines the combination ratio of the historical hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$, balancing the influence of historical and new information. The reset gate $r_t$ controls the degree of influence of the historical hidden state when calculating the candidate hidden state $\tilde{h}_t$, adjusting its importance through an element-wise product with the historical hidden state. The candidate hidden state $\tilde{h}_t$ is obtained by a linear transformation of the current input $x_t$ and the adjusted historical hidden state $(h_{t-1} \odot r_t)$, followed by the hyperbolic tangent function $\tanh$. Finally, Formula (11) combines the historical hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$, weighted according to the update gate $z_t$, to obtain the current hidden state $h_t$. In this way, the GRU can effectively combine current input and historical information to handle long- and short-term dependencies in sequence data.
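To make Formulas (9)–(12) concrete, the following NumPy sketch performs a single GRU step with randomly initialized weights; the dimensions and bias-free form follow the equations above, and everything else is illustrative.

```python
# Sketch: one GRU step following Eqs. (9)-(12); bias terms are omitted as in the text.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                   # update gate, Eq. (9)
    r_t = sigmoid(W_r @ concat)                                   # reset gate, Eq. (10)
    h_tilde = np.tanh(W @ np.concatenate([x_t, h_prev * r_t]))    # candidate state, Eq. (12)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # new hidden state, Eq. (11)

hidden, inp = 8, 4                                                # illustrative sizes
rng = np.random.default_rng(0)
h_t = gru_step(rng.standard_normal(inp), np.zeros(hidden),
               rng.standard_normal((hidden, hidden + inp)),
               rng.standard_normal((hidden, hidden + inp)),
               rng.standard_normal((hidden, inp + hidden)))
```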

2.3. Model Training

Speech features and text features are fused and input into the attention layer and dual-layer BiLSTM network, with 256 BiLSTM network neurons in the first layer and 128 BiLSTM network neurons in the second layer. An attention layer is added before the first layer, and the loss function is jointly decided by cross-entropy loss and focus loss. The evaluation indexes are weighted accuracy WA and unweighted accuracy UA.
The attention layer is designed to achieve dynamic attention weighting of input sequences. First, the attention weight matrix is defined and initialized using normal distribution, then the bias term is defined and initialized, and the product of the input data and weight matrix is transformed by adding the bias term through the tan h function to obtain the attention weight and perform normalization processing. Finally, the input data is multiplied with the attention weight to obtain the weight output. This design approach allows the model to dynamically learn and adjust the importance of each time step based on the input data, thus improving the effectiveness and generalization ability of the model when processing serial data.
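The following PyTorch sketch reproduces the attention layer as described above (normal-initialized weight matrix, bias, tanh scoring, normalization, and element-wise weighting of the input); the feature dimension, initialization scale, and softmax normalization over time are assumptions, since the paper only describes the mechanism qualitatively.

```python
# Sketch: additive attention over time steps, following the textual description.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(feat_dim, 1) * 0.05)   # weight matrix, normal-initialized
        self.b = nn.Parameter(torch.zeros(1))                    # bias term

    def forward(self, x):                                        # x: (batch, time, feat_dim)
        scores = torch.tanh(x @ self.W + self.b)                 # tanh(xW + b), shape (batch, time, 1)
        weights = torch.softmax(scores, dim=1)                   # normalize over the time dimension
        return x * weights                                       # weight the input sequence

attention = TemporalAttention(feat_dim=256)                      # placeholder fused feature dimension
weighted = attention(torch.randn(2, 10, 256))                    # (batch=2, time=10, feat=256)
```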
BiLSTM is a deep learning model suitable for sequential data processing. It combines the ability of long short-term memory networks to capture long-term dependencies with the advantages of bidirectional recurrent neural networks to model and predict sequence data efficiently. The dual-layer bidirectional design means that the model consists of two stacked LSTM layers, each with a forward and a backward LSTM cell. The forward LSTM handles the flow of information from the beginning to the end of the sequence, while the backward LSTM handles the flow of information from the end to the beginning. This allows the model to take into account both past and future information at the present moment. The first LSTM layer has 256 neurons, which means that the input at each time step is processed by 256 parallel LSTM units. The second LSTM layer has 128 neurons and receives as input the output of the first layer’s 256 LSTM units at each time step. At each time step, the input data first passes through the 256 LSTM cells of the first layer; each cell updates its own state and generates an output based on the current input and the state of the previous time step. The output of the first layer then acts as the input of the second layer and is processed by the 128 LSTM units to generate the final output.
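A hedged sketch of this dual-layer BiLSTM head follows; interpreting 256 and 128 as hidden units per direction, classifying from the last time step, and the placeholder input dimension are assumptions not fixed by the text.

```python
# Sketch: dual-layer BiLSTM classifier (256 then 128 hidden units per direction, assumed reading).
import torch
import torch.nn as nn

class BiLSTMHead(nn.Module):
    def __init__(self, feat_dim=256, num_classes=4):
        super().__init__()
        self.bilstm1 = nn.LSTM(feat_dim, 256, batch_first=True, bidirectional=True)
        self.bilstm2 = nn.LSTM(512, 128, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                             # x: attention-weighted sequence, (batch, time, feat_dim)
        h1, _ = self.bilstm1(x)                       # (batch, time, 512)
        h2, _ = self.bilstm2(h1)                      # (batch, time, 256)
        return self.fc(self.dropout(h2[:, -1, :]))    # classify from the last time step (assumed)

logits = BiLSTMHead()(torch.randn(2, 10, 256))        # (batch, 4 emotion classes)
```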
Cross-entropy loss is a measure of the difference between two probability distributions and is widely used in classification. It achieves high prediction accuracy on classification problems, especially multi-class tasks, and during training the gradient depends directly on the difference between the predicted and actual values, so optimization is relatively straightforward. Focus loss is a loss function proposed to address the deficiencies of cross-entropy loss on class-imbalanced and easily confused samples. It improves the model’s classification of difficult samples by reducing the weight of easily classified samples and focusing on difficult ones. It can effectively handle class imbalance and can flexibly deal with different types of misclassification by adjusting its balance factor and focusing factor: the influence of easy-to-classify samples on the loss function is reduced, the weight of hard-to-classify samples is increased, and the classification performance of the model on hard-to-classify samples is enhanced.
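A minimal PyTorch sketch of the combined objective is shown below; the focusing parameter gamma = 2 is an assumption, and the 0.8:0.2 weighting follows the ratio reported in Section 4.2.3.

```python
# Sketch: weighted combination of cross-entropy loss and focus (focal-style) loss.
import torch
import torch.nn.functional as F

def focus_loss(logits, targets, gamma: float = 2.0):
    """Down-weight easy samples and emphasize hard ones (gamma is assumed)."""
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-sample cross-entropy
    pt = torch.exp(-ce)                                       # probability assigned to the true class
    return ((1.0 - pt) ** gamma * ce).mean()

def combined_loss(logits, targets, ce_weight=0.8, focus_weight=0.2):
    """0.8:0.2 weighting between cross-entropy loss and focus loss (cf. Section 4.2.3)."""
    return ce_weight * F.cross_entropy(logits, targets) + focus_weight * focus_loss(logits, targets)

logits = torch.randn(32, 4)                   # four emotion classes
targets = torch.randint(0, 4, (32,))
loss = combined_loss(logits, targets)
```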
In this paper, 5-fold cross-validation is used to obtain the final experimental results. In cross-validation, the data is divided into five folds, with four folds used for training and one for testing, so as to train and evaluate the model. To prevent overfitting, a dropout layer is also included. The parameters of the final model are as follows: the BiLSTM has two layers, the first with 256 neurons and the second with 128 neurons; the batch size is set to 32; the initial learning rate is set to 0.0001 and is updated by exponential decay; the number of iterations is set to 500; the dropout rate is set to 0.5; and the optimizer is Adam.
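The sketch below wires these settings together (5-fold split, Adam with an initial learning rate of 1e-4, exponential learning-rate decay, dropout of 0.5); the per-step decay rate, the placeholder model, and the random feature matrix are assumptions.

```python
# Sketch: 5-fold cross-validation and optimizer configuration described in the text.
import numpy as np
import torch
from sklearn.model_selection import KFold

features = np.random.randn(5531, 256).astype(np.float32)   # placeholder fused features (5531 utterances, cf. Table 1)
labels = np.random.randint(0, 4, size=5531)                # four emotion classes

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
    model = torch.nn.Sequential(torch.nn.Dropout(0.5), torch.nn.Linear(256, 4))  # placeholder for the full model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)    # decay rate assumed
    # ... train for 500 iterations with batch size 32 on train_idx, then evaluate WA/UA on test_idx ...
```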
The evaluation indexes are weighted accuracy, WA, and unweighted accuracy, UA. The calculation formula is
$$WA = \frac{\text{Number of correctly classified samples}}{\text{Number of all tested emotion samples}} \tag{13}$$
$$UA = \frac{1}{K}\sum_{i=1}^{K} \frac{\text{Number of samples of emotion } i \text{ correctly classified}}{\text{Number of all test samples for emotion } i} \tag{14}$$
Weighted accuracy is the proportion of samples correctly classified by the model out of the total number of test samples; it is an overall measure of the model’s performance across all categories. Unweighted accuracy averages the accuracy of each category: the accuracy of each category is first calculated (that is, the number of correctly classified samples of that category as a proportion of the total number of test samples of that category), and then the accuracies of all categories are averaged, where $K$ is the total number of categories (here, $K = 4$). UA measures the consistency of a model’s performance across different classes, regardless of the number of samples in each class.
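For reference, WA and UA can be computed from predictions as follows; note that UA equals the macro-average of per-class recall. The labels below are placeholders.

```python
# Sketch: computing WA (Eq. 13) and UA (Eq. 14) with scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0, 0, 1, 2, 3, 3, 3, 1])            # placeholder labels for four emotion classes
y_pred = np.array([0, 1, 1, 2, 3, 3, 0, 1])

wa = accuracy_score(y_true, y_pred)                    # WA: correct / all tested samples
ua = recall_score(y_true, y_pred, average="macro")     # UA: mean per-class accuracy
print(f"WA={wa:.4f}, UA={ua:.4f}")
```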

3. Dataset

The IEMOCAP dataset [17] was selected for this experiment.
The IEMOCAP dataset is a dataset for emotion recognition and interactive sentiment analysis, mainly used in the fields of speech emotion recognition and computer vision. This dataset contains emotional data from real conversations, including voice, facial expressions, and body movements, to support research on algorithms and models for emotion recognition and emotional interaction. The speech and text data were selected for this experiment. Four emotions, namely angry, neutral, sad, and happy, were selected from the dataset for recognition, with excitement also included as part of the happy emotion. Five groups of conversations were recorded in the IEMOCAP dataset, and the number of sentences with different emotions in the dataset is shown in Table 1.

4. Experimental Analysis

Aiming at the multi-modal emotion recognition of speech and text, we designed a speech single modal comparison experiment, text single modal comparison experiment, loss function ratio comparison experiment, multi-modal comparison experiment with different models, and comparison experiment between an existing multi-modal model and the current work.

4.1. Pre-Processing of the IEMOCAP Dataset

The pre-processing of the IEMOCAP dataset mainly consisted of the following steps. First, we traversed each session in the dataset and separately processed the emotion evaluation file, the transcribed text file, and the audio file. From the emotion evaluation file, we extracted the conversation turn name and emotion label with a regular expression and mapped them to specific emotion labels. Next, we looked for lines in the transcribed text file that matched the conversation turn name to obtain the corresponding input text. Then, for each sample with a target emotion label, the audio features were extracted using the speech extraction method described above. At the same time, we encoded the input text and extracted the text features using the text extraction method described above. Finally, we collected the extracted emotion labels and updated their statistics to provide support for the subsequent comparative experiments.
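A hedged sketch of the label-extraction step is given below; the bracketed line format assumed for the emotion evaluation files and the demo line are assumptions about the dataset layout and may need adjusting for a particular copy of IEMOCAP.

```python
# Sketch: extracting (turn name, emotion label) pairs from an evaluation file with a regular expression.
import re

# Map raw IEMOCAP labels to the four classes used in this paper; "exc" is folded into "happy".
LABEL_MAP = {"ang": "angry", "neu": "neutral", "sad": "sad", "hap": "happy", "exc": "happy"}

# Assumed evaluation-line format: "[start - end] <TAB> turn name <TAB> label <TAB> ..."
LINE_RE = re.compile(r"\[[\d.]+ - [\d.]+\]\t(Ses\w+)\t(\w+)\t")

def parse_evaluation_text(text: str):
    """Return (turn name, mapped emotion) pairs for the four target emotions."""
    samples = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(2) in LABEL_MAP:
            samples.append((m.group(1), LABEL_MAP[m.group(2)]))
    return samples

demo = "[6.2901 - 8.2357]\tSes01F_impro01_F000\texc\t[2.5000, 2.5000, 2.5000]"
print(parse_evaluation_text(demo))   # [('Ses01F_impro01_F000', 'happy')]
```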

4.2. Discussion of Experiment Results

4.2.1. Speech Single Modal Comparison Experiment

The eGeMAPS feature set was extracted from the speech signals through OpenSmile, and the original eGeMAPS feature set was merged with the eGeMAPS feature set after wavelet transform. After the fusion, the reconstructed speech features were obtained through a sparse autoencoder. These features were used for emotion recognition through the attention layer and dual-layer BiLSTM network model. Cross-entropy loss was used in this experiment.
Different speech single-mode models are listed in Table 2. It can be seen from the table that the combination of a sparse autoencoder and wavelet transform was significantly better than other models in terms of WA and UA performance, because the original time domain features were converted to the frequency domain after the eGeMAPS feature set was transformed by Haar wavelet transform, and the feature representation under different perspectives was obtained. The combination of the original feature and the feature after wavelet transform enriched the expression ability of the feature. Meanwhile, the sparse autoencoder learned the sparse representation in the data, which helped to compress the feature space and extract the most significant features, and remove the noise and redundant information in the data. At the same time, the combination of sparse autoencoder and wavelet transform made the features more robust and helped the model to deal with these changes better.

4.2.2. Text Single Modal Comparison Experiment

In the text data, the features extracted from the two models were fused through the BERT and RoBERTa models, respectively, and then through the GRU. The features extracted from the GRU were fused with the features extracted from the two models, and the emotion recognition was carried out through the attention layer and the dual-layer BiLSTM model. Cross-entropy loss was used in this experiment.
The different text single-mode models used are shown in Table 3. As can be seen from the table, the combination of BERT-RoBERTa-GRU significantly outperformed other models in terms of WA and UA performance. This is because the BERT and RoBERTa models are capable of capturing semantic and contextual information in text data, so the features they extract often contain rich semantic information and text representations. Combining these advanced features with the sequence information extracted by GRU, the model was able to consider both the deep semantic and the sequence pattern of the text while learning, so as to represent the text data more comprehensively. At the same time, the features extracted by BERT and RoBERTa are usually static, that is, they represent the features obtained by encoding the entire text sequence. The features extracted by GRU contain dynamic sequence information, which can capture time-dependent and sequential patterns in text. By combining the two, static and dynamic features can be effectively combined to make the model more comprehensive and accurate in understanding the text.

4.2.3. Loss Function Ratio Comparison Experiment

According to the model in this paper, the experiment was carried out on the different proportions of the two loss functions to find the most suitable proportion of the loss functions.
WA and UA with different proportions of cross-entropy loss and focus loss are shown in Table 4. It can be seen from the table that the IEMOCAP dataset achieved the best results when the ratio of cross-entropy loss to focus loss was 0.8:0.2. Cross-entropy loss is often used for multi-class classification tasks and can effectively penalize classification errors, prompting the model to predict the main classes more accurately. Focus loss reduces the weight of easy-to-classify samples while increasing the impact of hard-to-classify samples, which is particularly helpful for learning rare categories in imbalanced datasets. In an imbalanced dataset such as IEMOCAP, some emotion classes have fewer samples, so the model may prefer to learn the common classes while ignoring the rare ones during training. Setting the ratio of cross-entropy loss to focus loss to 0.8:0.2 makes the model pay more attention to the rare classes during training, thus improving the classification performance for minority classes. Cross-entropy loss ensures optimization of overall classification accuracy, while focus loss makes the model perform better in difficult classification situations, and the combination of the two balances the performance of the model across categories.

4.2.4. Multi-Modal Comparison Experiment with Different Models

The model in this paper was evaluated with an att single-layer LSTM, an att dual-layer LSTM, and an att dual-layer BiLSTM in turn, and was compared with other models to verify its actual effect. In the confusion matrices, ‘ang’ represents ‘angry’, ‘neu’ represents ‘neutral’, ‘sad’ represents ‘sad’, and ‘hap’ represents ‘happy’. Additionally, ‘att’ stands for ‘attention layer’. The ratio of cross-entropy loss to focus loss in this experiment was 0.8:0.2.
It can be seen from Table 5 that the model in this paper was superior to the other combined models, whether with the att single-layer LSTM, the att dual-layer LSTM, or the att dual-layer BiLSTM. At the same time, it can also be seen from the confusion matrices that the att dual-layer BiLSTM was better than the att single-layer LSTM and the att dual-layer LSTM. This is because the att dual-layer BiLSTM can more effectively capture the context information of each word in the text sequence, can more effectively handle long-distance patterns, can reasonably integrate information from different modalities, has strong expressive ability, and can learn complex sequence patterns and semantic features. In contrast, the att single-layer LSTM can only take advantage of information prior to the present moment (Figure 4), and although a normal att dual-layer LSTM (Figure 5) can learn deeper representations, it still processes the sequence in only one direction and may not take full advantage of bidirectional information. By processing forward and backward information at the same time, the att dual-layer BiLSTM can learn and express the features in text sequences more comprehensively, thereby improving the effect of multi-modal recognition (Figure 6).

4.2.5. Comparison Experiment between Multi-Modal Model and Current Work

This paper’s model was compared with those from the literature, where the cited models all use the same affective corpus IEMOCAP.
Xu et al. [18]: The study aligned speech frames and text words with attention mechanisms and input the aligned cross-modal features into a sequential model for emotion recognition, resulting in 72.50% WA and 70.90% UA on the IEMOCAP dataset, respectively.
Rajamani et al. [19]: In this study, a novel Attention ReLU GRU (AR-GRU) was proposed and applied to emotion recognition tasks by introducing attention-mechanism-based activation functions into GRU and BiGRU units. The final WA and UA on the IEMOCAP dataset were 68.30% and 66.90%, respectively.
Yeh et al. [20]: In this study, an Interactive Awareness Attention Network (IAAN) was proposed, which incorporated contextual information in the learned acoustic representations through a novel attentional mechanism, resulting in WA and UA of 64.70% and 66.30%, respectively, on the IEMOCAP dataset.
Makiuchi et al. [21]: The methods proposed in this study include a cross-representation speech model inspired by decoupling representation learning for processing wav2vec 2.0 speech features, and a convolutional neural network-based emotion recognition model for identifying emotions from text features extracted from a transformer-based model. The final WA and UA on the IEMOCAP dataset were 73.00% and 73.50%, respectively.
From Table 6, we can make the following observations. (1) Our proposed model has the highest WA and UA scores, which demonstrates its validity. (2) Meanwhile, the lowest WA and UA were observed for Yeh et al. Although IAAN introduces context information, its representation and fusion of audio features may not be as comprehensive as other methods, and its processing of features and fusion of information may not be as fine-grained and effective as other models. (3) Although Makiuchi et al. achieved good results by combining a cross-representation speech model and a convolutional neural network with a transformer-based model to extract text features, our model extracts richer features and introduces a new combination of the focus loss function and the cross-entropy loss function; its performance is still better than these competitive results, increasing WA from 73.00% to 73.95% and UA from 73.50% to 74.27%.

5. Conclusions

In this paper, aiming at problems in the multi-modal emotion recognition of speech and text, such as insufficient feature extraction and poor classification of imbalanced samples, a new emotion recognition method based on speech and text is proposed. The model is divided into two channels to extract speech and text features, respectively.
In order to improve the anti-noise ability and optimize the feature representation, the eGeMAPS feature set is extracted with OpenSmile, the original eGeMAPS feature set is merged with the wavelet-transformed eGeMAPS feature set, and speech features are then extracted through a sparse autoencoder. In order to increase the diversity of the text data, BERT and RoBERTa are used to extract different semantic features in the text extraction channel, and then a GRU is used to extract effective sequence features, which are fused with the features extracted by the two models. The features extracted from the two channels are then fused and passed through the attention layer and the dual-layer BiLSTM, and finally cross-entropy loss and focus loss are combined to improve the emotion recognition of imbalanced samples.
Future research should focus on the following directions: further optimizing speech and text feature extraction, for example by exploring more efficient wavelet transform and sparse autoencoder structures; enhancing the model’s robustness to noise; studying the combination and effect of different pre-trained models (such as BERT and RoBERTa) in text feature extraction; exploring more advanced sequence feature extraction methods, such as introducing more complex recurrent neural network structures or attention mechanisms; and optimizing multi-modal feature fusion strategies, perhaps considering more complex fusion models or attention mechanisms to better combine speech and text information. In order to improve the overall performance and robustness of emotion recognition, more effective loss function designs or sample weighting strategies should also be studied to address the problem of imbalanced samples.

Author Contributions

Conceptualization, S.Z.; Methodology, S.Z.; Validation, P.X.; Formal analysis, P.X.; Investigation, Y.R., Z.G. and R.L.; Resources, R.Y.; Writing—original draft, S.Z.; Writing—review & editing, Y.F.; Supervision, Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

This study analyzed a publicly available dataset. The data can be found at the following website: https://sail.usc.edu/iemocap/iemocap_release.htm (accessed on 15 March 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion 2020, 59, 103–126. [Google Scholar] [CrossRef]
  2. Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech Emotion Recognition Using Deep Learning Techniques: A Review. IEEE Access 2019, 7, 117327–117345. [Google Scholar] [CrossRef]
  3. Karna, M.; Juliet, D.S.; Joy, R.C. Deep learning based Text Emotion Recognition for Chatbot applications. In Proceedings of the 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184), Tirunelveli, India, 15–17 June 2020. [Google Scholar]
  4. Wei, P.; Yu, Z. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model. Pers. Ubiquitous Comput. 2019, 23, 521–529. [Google Scholar] [CrossRef]
  5. Thirumuru, R.; Gurugubelli, K.; Vuppala, A.K. Novel feature representation using single frequency filtering and nonlinear energy operator for speech emotion recognition. Digit. Signal Process. 2022, 120, 103293. [Google Scholar] [CrossRef]
  6. Abdel-Hamid, L. Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Commun. 2020, 122, 19–30. [Google Scholar] [CrossRef]
  7. Han, T.; Zhang, Z.; Ren, M.; Dong, C.; Jiang, X.; Zhuang, Q. Text Emotion Recognition Based on XLNet-BiGRU-Att. Electronics 2023, 12, 2704. [Google Scholar] [CrossRef]
  8. Hao, S.; Zhang, P.; Liu, S.; Wang, Y. Sentiment recognition and analysis method of official document text based on BERT–SVM model. Neural Comput. Appl. 2023, 35, 24621–24632. [Google Scholar] [CrossRef]
  9. Vishnu Priya, R.; Nag, P.K. Text-based emotion recognition using contextual phrase embedding model. Multimed. Tools Appl. 2023, 82, 35329–35355. [Google Scholar]
  10. Zhao, S.; Jia, G.; Yang, J.; Ding, G.; Keutzer, K. Emotion Recognition From Multiple Modalities: Fundamentals and methodologies. IEEE Signal Process. Mag. 2021, 38, 59–73. [Google Scholar] [CrossRef]
  11. Makhmudov, F.; Kultimuratov, A.; Cho, Y.I. Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures. Appl. Sci. 2024, 14, 4199. [Google Scholar] [CrossRef]
  12. Lee, S.; Han, D.K.; Ko, H. Multimodal Emotion Recognition Fusion Analysis Adapting Bert With Heterogeneous Feature Unification. IEEE Access 2021, 9, 94557–94572. [Google Scholar] [CrossRef]
  13. Huan, R.H.; Shu, J.; Bao, S.L.; Liang, R.H.; Chen, P.; Chi, K.K. Video multimodal emotion recognition based on Bi-GRU and attention fusion. Multimed. Tools Appl. 2021, 80, 8213–8240. [Google Scholar] [CrossRef]
  14. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding; North American Chapter of the Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 4171–4186. [Google Scholar]
  15. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  16. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  17. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  18. Xu, H.; Zhang, H.; Han, K.; Wang, Y.; Peng, Y.; Li, X. Learning alignment for multimodal emotion recognition from speech. arXiv 2019, arXiv:1909.05645. [Google Scholar]
  19. Rajamani, S.T.; Rajamani, K.T.; Mallol-Ragolta, A.; Liu, S.; Schuller, B. A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  20. Yeh, S.-L.; Lin, Y.-S.; Lee, C.-C. An interaction-aware attention network for speech emotion recognition in spoken dialogs. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar]
  21. Makiuchi, M.R.; Uto, K.; Shinoda, K. Multimodal emotion recognition with high-level speech and text features. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021. [Google Scholar]
Figure 1. Model structure diagram.
Figure 2. Autoencoder structure diagram.
Figure 3. GRU structure chart.
Figure 4. Confusion matrix of this paper’s model with the att single-layer LSTM.
Figure 5. Confusion matrix of this paper’s model with the att dual-layer LSTM.
Figure 6. Confusion matrix of this paper’s model with the att dual-layer BiLSTM.
Table 1. IEMOCAP number of sentences with different emotions.

Dialogue     Angry   Neutral   Sad    Happy
Dialogue1    229     384       194    278
Dialogue2    137     362       197    327
Dialogue3    240     320       305    286
Dialogue4    327     258       143    303
Dialogue5    170     384       245    442
Total        1103    1708      1084   1636
Table 2. Comparison results for the IEMOCAP dataset using speech-only.

Single-Modal Speech Model                           WA (%)   UA (%)
vanilla autoencoder                                 40.88    38.58
sparse autoencoder                                  43.32    41.78
vanilla autoencoder + frequency domain filtering    42.63    41.33
sparse autoencoder + frequency domain filtering     44.97    44.27
vanilla autoencoder + wavelet transform             43.39    42.22
sparse autoencoder + wavelet transform              47.37    46.86
Table 3. Comparison results on the IEMOCAP dataset using text-only.

Pre-training Model     WA (%)   UA (%)
BERT                   66.52    66.67
RoBERTa                65.58    65.78
BERT-RoBERTa           68.25    68.52
BERT-RoBERTa-GRU       73.39    73.61
Table 4. Comparison results on the IEMOCAP dataset using multi-modal models at different proportions of loss functions.

Cross-Entropy Loss : Focus Loss   WA (%)   UA (%)
0:1                               71.00    71.05
0.1:0.9                           69.86    69.80
0.2:0.8                           71.52    71.56
0.3:0.7                           71.29    71.27
0.4:0.6                           70.20    70.16
0.5:0.5                           70.67    70.68
0.6:0.4                           71.49    71.50
0.7:0.3                           70.53    70.67
0.8:0.2                           73.95    74.27
0.9:0.1                           72.55    73.03
1:0                               70.84    70.93
Table 5. Multi-modal comparison experiments of different models on the IEMOCAP dataset.

                                       Att Single-Layer LSTM    Att Dual-Layer LSTM    Att Dual-Layer BiLSTM
Multi-Modal Model                      WA (%)    UA (%)         WA (%)    UA (%)       WA (%)    UA (%)
vanilla autoencoder + BERT             60.66     59.83          64.56     64.24        64.42     64.27
sparse autoencoder + BERT              59.34     58.43          64.94     64.86        65.85     65.46
vanilla autoencoder + RoBERTa          47.30     44.10          59.56     59.13        61.29     60.82
sparse autoencoder + RoBERTa           46.84     43.48          60.86     60.35        64.44     64.17
vanilla autoencoder + BERT-RoBERTa     63.71     62.82          67.73     67.51        66.95     66.90
sparse autoencoder + BERT-RoBERTa      62.23     61.13          67.24     67.06        67.49     67.46
this paper model                       63.93     63.17          69.64     69.66        73.95     74.27
Table 6. Comparison of our multi-modal results with current works, tested on the IEMOCAP dataset.

Model                  WA (%)   UA (%)
Xu et al. [18]         72.50    70.90
Rajamani et al. [19]   68.30    66.90
Yeh et al. [20]        64.70    66.30
Makiuchi et al. [21]   73.00    73.50
this paper model       73.95    74.27
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
