Article

Multi-Label Emotion Recognition of Korean Speech Data Using Deep Fusion Models

Department of Industrial Engineering, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7604; https://doi.org/10.3390/app14177604
Submission received: 23 July 2024 / Revised: 22 August 2024 / Accepted: 22 August 2024 / Published: 28 August 2024
(This article belongs to the Special Issue Advances in HCI: Recognition Technologies and Their Applications)

Abstract

As speech is the most natural way for humans to express emotions, studies on Speech Emotion Recognition (SER) have been conducted in various ways. However, there are some areas for improvement in previous SER studies: (1) while some studies have performed multi-label classification, almost none have specifically utilized Korean speech data; (2) most studies have not utilized multiple features in combination for emotion recognition. Therefore, this study proposes deep fusion models for multi-label emotion classification using Korean speech data and follows four steps: (1) preprocessing speech data labeled with Sadness, Happiness, Neutral, Anger, and Disgust; (2) applying data augmentation to address the data imbalance and extracting speech features, including the Log-mel spectrogram, Mel-Frequency Cepstral Coefficients (MFCCs), and Voice Quality Features; (3) constructing models using deep fusion architectures; and (4) validating the performance of the constructed models. The experimental results demonstrated that the proposed model, which utilizes the Log-mel spectrogram and MFCCs with a fusion of Vision-Transformer and 1D Convolutional Neural Network–Long Short-Term Memory, achieved the highest average binary accuracy of 71.2% for multi-label classification, outperforming other baseline models. Consequently, this study anticipates that the proposed model will find applications in Korean speech-based services, specifically mental healthcare and smart service systems.

1. Introduction

Emotion is a crucial element in expressing the psychological state of humans [1]. For this reason, psychologists and neuroscientists have conducted studies over the past decades to recognize and classify human emotions [2]. Recently, driven by advancements in Artificial Intelligence, Big Data, and the Internet of Things, researchers have explored emotion recognition technology in Human–Computer Interaction (HCI) with the aim of automatically identifying and monitoring emotional states [3,4]. This emotion recognition technology utilizes modalities, such as facial expressions, brain waves, heart rates, and speech, to automatically recognize emotions [5,6]. The automated emotion recognition system can benefit social and healthcare systems by facilitating the detection of mental health issues such as depression and anxiety disorders [7].
Emotion recognition through human bio-information, such as brain waves, heart rates, and pulse, poses challenges due to the requirement for specialized equipment for data collection [8]. In contrast, speech offers an advantage, as such data are relatively less constrained temporally and spatially compared to other modalities, thus facilitating easier data collection [9]. Therefore, deep learning-based Speech Emotion Recognition (SER) studies are actively being conducted to recognize emotions using recorded speech and speech utterance data [10]. Deep learning-based SER studies are typically categorized into two approaches: multi-class classification and multi-label classification. Multi-class classification identifies only one emotion among the many, whereas multi-label classification focuses on recognizing both single and mixed emotions, considering the possibility of multiple emotions being derived in combination [11].
First, most SER studies have been performed based on multi-class classification, which utilizes various speech features. For example, a study was conducted using voice quality parameters, which represent the pitch, loudness, and speech clarity, from the EmoDB dataset built in German for SER [12]. Additionally, other studies have been conducted using the Log-mel spectrogram, which represents the frequency, amplitude, and time domain of speech, with data from the EmoDB dataset [13,14]. Another SER study was conducted using Mel-Frequency Cepstral Coefficients (MFCCs), which represent the unique speech information in the frequency domain, from the RAVDESS dataset built in English for SER [15,16]. Second, some SER studies have been performed based on multi-label classification. For example, a study was conducted using speech utterances from the IEMOCAP dataset built in English for SER [17]. Additionally, a study was conducted using the Log-mel spectrogram and MFCCs in combination to perform multi-label emotion classification on the RML dataset, which includes various languages such as English, Italian, Persian, and Chinese [18].
Despite the significant contributions of previous studies on SER, some areas still need improvement. First, while most studies have focused on multi-label emotion classification with English and German speech data, relatively few have addressed Korean speech data. Since there are differences in speech signals such as speech rate, intonation, and morphemes in the expression of emotions depending on the culture and language of each country [19], constructing a model using speech databases of each country’s language is imperative. Therefore, further studies are required to propose a multi-label classification model that accurately reflects the diverse characteristics of Korean speech. Second, most previous studies have utilized a single speech feature, such as the Log-mel spectrogram or MFCCs, to recognize emotions rather than combinations of different speech features. Although a single feature can be used to recognize emotions, combining and using multiple features extracted from speech signals further improves the accuracy and objectivity of emotion recognition [4,20]. For example, the Mel spectrogram, which represents the frequency and amplitude of speech signals over time, and MFCCs, which represent the distinctive spectral features of speech over time, can be used together [21,22]. Therefore, studies on models that exploit a combination of diverse speech features are needed to improve SER performance.
Since few studies have conducted multi-label emotion classification based on Korean speech data, this study proposes deep learning-based fusion models for multi-label classification using Korean speech data. Specifically, three types of speech features are utilized: Log-mel spectrogram, MFCCs, and Voice Quality Features (VQFs). Subsequently, the performances of the models constructed using one of these speech features, fusion models constructed using two of the three speech features, and a fusion model using all three speech features are compared and analyzed. The model construction follows four steps: (1) preprocessing Korean speech data, which are labeled with ‘Sadness’, ‘Happiness’, ‘Neutral’, ‘Anger’, and ‘Disgust’; (2) applying data augmentation techniques to reduce the variation in the number of labeled data; (3) extracting speech features, including the Log-mel spectrogram, MFCCs, and VQFs, for constructing multi-label classification models; and (4) validating the performance of the constructed models. The models proposed in this study are expected to have applications in fields such as mental health care, customer service, and entertainment.
The remainder of this paper is structured as follows: Section 2 reviews the related literature. Section 3 introduces the data and labeling process used in this study. Section 4 describes the speech feature extraction process and model construction. Section 5 presents the experimental results and Section 6 presents the concluding remarks, including the study’s contributions and future study directions.

2. Related Works

In recent years, studies on SER have emerged as a crucial area in HCI and speech processing [23]. Consequently, studies on SER using large-scale data are being conducted, aiming to apply deep learning-based methods to recognize various emotions [24]. Deep learning-based SER can automatically recognize human emotional states through HCI technology, enabling the provision of customized services that reflect individual moods. In addition, speech-based mental health monitoring can also contribute to the detection of stress or depression in users. To this end, SER studies have mainly focused on multi-class and multi-label classification.
First, most previous studies on SER were performed for multi-class classification, aimed at recognizing one emotion among many, using a single speech feature. For example, one study utilized segmented speech utterance data in a Deep Neural Network (DNN) to classify emotions such as Anger, Neutral, and Sadness [25]. Additionally, spectrograms extracted from the EmoDB dataset were applied to a deep Convolutional Neural Network (CNN) to classify emotions such as anger, boredom, disgust, fear, happiness, neutral, and sadness [26]. Furthermore, MFCCs extracted from the RAVDESS dataset were used in a Long Short-Term Memory (LSTM) network to identify emotions such as calm, happiness, sadness, anger, fear, surprise, and disgust [16]. Moreover, raw audio clips and Log-mel spectrograms extracted from the EmoDB and IEMOCAP datasets were utilized in 1D and 2D CNN-LSTM networks, respectively, to recognize emotions [14]. Additionally, a study was conducted on emotion recognition using Bi-LSTM with the Log-mel spectrogram extracted from the EmoDB and IEMOCAP datasets [13]. These studies have made significant contributions to deep learning-based SER by focusing on the use of a single speech feature, such as Mel spectrograms or MFCCs. While this approach effectively identifies distinct emotions, relying on only one speech feature may limit the ability to fully capture the complexity of emotional expressions.
To improve emotion recognition performance, some previous studies on SER aimed to enhance recognition accuracy by using multiple speech features. For example, one study proposed the application of a Dual-Sequence LSTM, an extension of LSTM, utilizing MFCCs and Mel spectrograms for recognizing emotions such as Happiness, Neutral, Anger, and Sadness, extracted from the IEMOCAP dataset [27]. Additionally, a study was conducted to improve the performance in emotion recognition by constructing a fusion model that combines Bi-LSTM and CNN, utilizing MFCCs and ERB spectrograms extracted from a Korean speech database [4]. These studies have contributed to enhancing the performance of SER by incorporating multiple speech features, such as MFCCs and Mel spectrograms. However, despite the improved performance, these studies primarily focused on multi-class classification and did not address the challenge of multi-label classification, which is essential for recognizing mixed emotions that often occur in real-world scenarios.
Recognizing the importance of addressing mixed emotions, some previous studies on SER were performed for multi-label classification. For example, to enhance the performance of emotion recognition, speech utterances were extracted from the IEMOCAP dataset and applied to a Response Residual Network to recognize mixed emotions such as a combination of happiness, neutral, sadness, and anger [17]. Additionally, a study was conducted to improve performance in recognizing emotions such as a combination of disgust, happiness, fear, anger, surprise, and sadness by constructing a fusion model of CNN and LSTM, utilizing Log-mel spectrograms and MFCCs extracted from the RML dataset [18]. These studies have contributed to the literature as the first to conduct multi-label classification using English or German speech data. However, an important gap in the research is the lack of focus on multi-label classification using languages other than English or German, such as Korean. Therefore, conducting multi-label emotion classification using Korean speech data is important. Table 1 summarizes the main results of these previous studies.
Table 1 shows that previous studies have advanced deep learning-based SER through various deep-learning methods. As shown in Table 1, most studies have primarily utilized English or German speech datasets for SER, while relatively few have used Korean speech data. Specifically, while some studies have conducted multi-class classification using Korean speech, few have focused on multi-label classification. Therefore, studies need to be conducted on multi-label emotion classification that accurately reflects the diversity and characteristics of Korean speech, while also effectively classifying mixed emotions. In addition, most previous studies recognized emotions using a single speech feature. Utilizing a single feature to classify emotions is possible; however, this approach does not fully consider the diverse characteristics and complexity of speech data, making it challenging to recognize mixed emotions. Therefore, studies need to be conducted on improving the performance of mixed emotion recognition by constructing fusion models that combine multiple speech features. By employing this hybrid feature fusion approach, the model utilizes a more comprehensive set of speech information, effectively capturing frequency domain and temporal patterns, as well as qualitative aspects of speech signals. In addition, this hybrid approach not only improves overall performance but also results in a more robust emotion recognition system [28]. Therefore, this study constructs deep fusion models for multi-label classification using Korean speech data. The model proposed in this study is expected to find application in mental health care, customer service, and entertainment.

3. Data Preparation

3.1. Korean Speech Emotional Database

To obtain a reliable database, this study used the Korean speech emotional database provided by the AI-Hub (https://www.aihub.or.kr (accessed on 20 November 2023)) of the National Information Society Agency (NIA). The AI-Hub speech emotional database consists of speech recordings from speakers in their 20s to 50s, with no restrictions on gender. Furthermore, it is a multi-label emotion database that includes utterances from various contexts. Table 2 provides an example of the dataset, where each speech file reflects the judgments of five experts who listened to the corresponding utterance and recognized the emotions. Since each speech file is labeled with between one and five emotions, the authors determined that the AI-Hub database is suitable for performing multi-label classification. Therefore, this study uses the database labeled with five emotions (‘Happiness’, ‘Sadness’, ‘Anger’, ‘Neutral’, ‘Disgust’) to perform the analysis.
This study employs a hard labeling approach to perform SER exclusively on specific emotions. Specifically, only emotions identified in agreement by at least two experts within each audio file are used. The hard labeling approach employed in SER utilizes only emotions judged by the majority of experts to eliminate ambiguous emotions [17]. Some studies have demonstrated high SER performance by applying a hard-labeling approach [15,29]. Figure 1 illustrates the process of applying hard labeling to each speech file. Through hard labeling, each audio file is labeled with one or two emotions based on the agreement of at least two out of five judges. In this process, speech files labeled with ‘Happiness and Disgust’ and ‘Happiness and Anger’ are excluded because of the small amount of data and the fact that these combinations are not logically appropriate for expressing emotions simultaneously. Consequently, experiments are conducted using 40,645 hard-labeled speech files. Table 3 lists the number of samples for each emotion label in the dataset. As shown in Table 3, a large variation is identified in the number of samples for each emotion label.

3.2. Preprocessing the Speech Data

This section describes the preprocessing of speech data to improve the effectiveness of emotion recognition. Preprocessing of data follows three steps: (1) converting the speech files to audio data by applying a sampling rate of 16,000 Hz, (2) removing low-volume audio data because identifying mixed emotions using low-volume audio is difficult, and (3) removing any silence that occurs before and after the audio data by a trimming technique. For this, power_to_db from the Python librosa (https://librosa.org (accessed on 20 November 2023)) library is utilized to convert the power of frequency bands to dB, and then data with an average dB below −70 are removed. Then, only audio data longer than 3 s after silence removal are used, because the authors concluded that at least 3 s of audio is needed to recognize mixed emotions. Figure 2 provides an example of using only audio data longer than 3 s after removing any silence occurring before and after. Consequently, preprocessing the speech data resulted in the removal of approximately 20% of the data.
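As an illustration of this preprocessing pipeline, the following is a minimal sketch using the librosa library; the trimming threshold (top_db) and the helper name are assumptions rather than the authors' implementation.

```python
import librosa
import numpy as np

SR = 16_000          # sampling rate used in this study
MIN_LEN_S = 3.0      # minimum clip length after silence removal
MEAN_DB_FLOOR = -70  # clips quieter than this average dB are discarded

def preprocess_clip(path):
    """Load a clip at 16 kHz, drop it if it is too quiet, and trim leading/trailing silence."""
    y, _ = librosa.load(path, sr=SR)

    # Average loudness in dB, obtained from the power spectrogram via librosa.power_to_db.
    power = np.abs(librosa.stft(y)) ** 2
    if librosa.power_to_db(power).mean() < MEAN_DB_FLOOR:
        return None  # low-volume audio is removed

    # Trim silence before and after the utterance (top_db=30 is an assumed setting).
    y_trimmed, _ = librosa.effects.trim(y, top_db=30)
    if len(y_trimmed) / SR < MIN_LEN_S:
        return None  # shorter than 3 s after trimming
    return y_trimmed
```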

3.3. Applying Data Augmentation

This section describes the data augmentation process to reduce the imbalance in the number of labeled data. Imbalanced training data can directly degrade the performance of a deep learning model. Using data augmentation to increase the amount of effective training and testing data improves the performance of deep learning models and helps avoid underfitting or overfitting [30]. Other studies have also used data augmentation to address the imbalance in the IEMOCAP dataset, which is widely used in SER research [31,32]. First, the preprocessed speech data are randomly divided into the Training, Validation, and Test datasets in an approximately 6:2:2 ratio. Shift augmentation is then applied to the Training, Validation, and Test datasets to address the data imbalance. Shift augmentation is a technique that horizontally shifts the original data either left or right. In this study, the shift augmentation technique is applied by shifting the data only slightly to the right. For example, if the length of the preprocessed speech is 3.5 s, the original speech data can be split into two by setting a time interval of 3.0 s and a shift time of 0.5 s. Figure 3 shows an example of the process in which an audio data segment is split into two by applying shift augmentation.
In this process, shift augmentation is not applied to ‘Sadness’, which has the largest number of samples, while ‘Happiness and Sadness’ and ‘Neutral and Disgust’, which have relatively fewer samples, are augmented extensively by setting a shorter shift time. Next, to standardize the input size of the audio data used for model construction, all data samples are set to a length of 3 s. Table 4 lists the number of samples for each emotion label in the dataset after applying shift augmentation. As shown in Table 4, the imbalance in the data is reduced compared to Table 3.
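A minimal sketch of this shift augmentation is given below; the per-label shift times are assumptions, since the paper only states that rarer labels were shifted by shorter intervals.

```python
import numpy as np

SR = 16_000
WINDOW_S = 3.0  # fixed 3 s input length used for all models

def shift_augment(y, shift_s=0.5, sr=SR, window_s=WINDOW_S):
    """Split one clip into 3 s segments by shifting the window to the right by shift_s seconds.
    Example: a 3.5 s clip with shift_s=0.5 yields two segments, [0.0, 3.0) s and [0.5, 3.5) s."""
    win, hop = int(window_s * sr), int(shift_s * sr)
    segments = []
    start = 0
    while start + win <= len(y):
        segments.append(y[start:start + win])
        start += hop
    return segments
```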

4. Feature Extraction and Model Construction

This section describes the extraction and utilization of three speech features from the preprocessed audio data to construct multi-label classification models. Specifically, Log-mel spectrogram, MFCCs, and VQFs are extracted from the preprocessed audio data. Then, three deep single models are constructed: one utilizing the Log-mel spectrogram with a Vision-Transformer (ViT), another utilizing the MFCCs with a 1D CNN-LSTM, and the other utilizing the VQFs with a DNN. Subsequently, four deep fusion models are constructed: three models are built through the fusion of two deep single models, and one model is built by the fusion of three deep single models. Finally, the performances of the constructed models are compared and validated. Figure 4 shows the overall process of constructing the deep single and fusion models.

4.1. Extracting Speech Features

In speech analysis, speech data are typically represented either in the time and frequency domains or as raw speech signals. Specifically, the speech features utilized in speech analysis include MFCCs and the Mel spectrogram, which exploit the characteristics of the speech signals in the frequency or time domains, and VQFs, which represent the raw signals and quality of the speech itself. First, the Mel spectrogram is a feature extraction technique used in speech analysis to capture changes in frequency and amplitude characteristics over time. Humans are more sensitive to low-frequency sounds and become less sensitive as the frequency increases. In other words, humans do not perceive frequency linearly, but rather on the Mel scale. This non-linear perception of frequency was modeled using the Mel-scale function, which reflects the criteria for human speech perception in terms of frequency, considering the characteristics of the human cochlea. The Mel-scale function, which converts Hertz to Mel, is represented by Equation (1), where M represents Mel, f represents Hertz, and the constant 2595 is used to convert the linear frequency to the Mel scale, reflecting the non-linear way humans perceive pitch [33].
M = 2595 \log_{10} \left( 1 + \frac{f}{700} \right) \quad (1)
Therefore, to extract the Log-mel spectrogram, the spectrogram obtained through the Short-Time Fourier Transform (STFT) is mapped to the Mel-scale function or a Mel-filterbank to convert it into the Mel spectrogram. The logarithm of each value of the Mel spectrogram is then taken. This process produces the Log-mel spectrogram, which distributes energy across a wider frequency range, enabling more detailed feature extraction. Second, MFCCs are derived by applying a discrete cosine transform to the Mel spectrogram to extract distinctive speech features [34]. The MFCCs represent unique feature vectors extracted from speech, capturing the energy distribution across frequency bands through a set of coefficients. The primary objective of extracting MFCCs is to efficiently capture various characteristics of speech data in vector form. The MFCC extraction function is represented by Equation (2), where \hat{S}_k denotes the vector values obtained through the Mel-scale process, k represents the index of the filterbank, and n indicates the index of the MFCC coefficient to be extracted, where n = 1, 2, …, K [35].
C_n = \sum_{k=1}^{K} \log \left( \hat{S}_k \right) \cos \left[ n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right] \quad (2)
In speech analysis, the Log-mel spectrogram and MFCCs represent heterogeneous data with different speech features. These are vectorized and stored separately so that they can be utilized in combination. Figure 5 shows a sample visualization of the Log-mel spectrogram and the MFCCs generated using the librosa library.
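The following sketch extracts both features with librosa; the 25 ms window and 10 ms hop are assumptions chosen so that a 3 s clip at 16 kHz yields roughly 300 frames, matching the input shapes reported in Section 5.

```python
import librosa
import numpy as np

SR = 16_000
N_MELS = 100           # number of Mel filters (Section 5)
N_MFCC = 40            # number of MFCC coefficients (Section 5)
WIN = int(0.025 * SR)  # 25 ms analysis window (assumed)
HOP = int(0.010 * SR)  # 10 ms hop (assumed), giving ~300 frames per 3 s clip

def extract_logmel_and_mfcc(y):
    """Return the Log-mel spectrogram and the MFCC matrix for one preprocessed clip."""
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS,
                                         n_fft=WIN, win_length=WIN, hop_length=HOP)
    log_mel = librosa.power_to_db(mel)          # shape: (100, ~300)

    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=N_MFCC,
                                n_fft=WIN, win_length=WIN, hop_length=HOP)
    # Channel axis for the ViT branch, and a transpose so MFCC frames form the time axis.
    return log_mel[..., np.newaxis], mfcc.T     # ~(100, 300, 1) and ~(300, 40)
```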
Third, VQFs are extracted to represent the quality and signals of the speech itself. Since human voices lack regularity, the acoustic parameters that measure irregularity, fluctuations, and noise in the waveform, referred to as VQFs, are often used to enhance the performance of speech recognition systems. Therefore, 22 acoustic parameters related to VQFs are extracted for analysis. Specifically, the acoustic parameters related to fundamental frequency, which signifies the lowest frequency of a periodic waveform (F0 mean, F0 stdev); parameters related to the ratio of harmonics to noise energy (Harmonics-to-Noise Ratio); parameters related to frequency variation or frequency perturbation (Local jitter, Local absolute jitter, Rap jitter, PPQ5 jitter, DDP jitter); and parameters related to amplitude variation (Local shimmer, Local dB shimmer, APQ3 shimmer, APQ5 shimmer, APQ11 shimmer, DDA shimmer) are extracted from the speech periodic wave. In addition, the parameters related to speech energy magnitude (Intensity max, Intensity mean, Intensity min, Intensity dynamic range, Intonation variation) and the parameters related to pitch (Pitch max, Pitch mean, Pitch range) are extracted. Table 5 contains the 22 acoustic parameters related to the VQFs.
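The paper does not name the tool used to compute these acoustic parameters; as one possible sketch, a Praat-based library such as parselmouth can compute a subset of the 22 parameters in Table 5, assuming a pitch floor and ceiling of 75–600 Hz.

```python
import parselmouth                    # Praat wrapper; the tool choice is an assumption
from parselmouth.praat import call

def extract_vqf_subset(path):
    """Sketch of a few of the 22 acoustic parameters listed in Table 5."""
    snd = parselmouth.Sound(path)
    pitch = snd.to_pitch()
    point_process = call(snd, "To PointProcess (periodic, cc)", 75, 600)

    f0_mean = call(pitch, "Get mean", 0, 0, "Hertz")                       # F0 mean
    f0_std = call(pitch, "Get standard deviation", 0, 0, "Hertz")          # F0 stdev
    jitter_local = call(point_process, "Get jitter (local)",
                        0, 0, 0.0001, 0.02, 1.3)                           # Local jitter
    shimmer_local = call([snd, point_process], "Get shimmer (local)",
                         0, 0, 0.0001, 0.02, 1.3, 1.6)                     # Local shimmer
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)                              # Harmonics-to-Noise Ratio

    return [f0_mean, f0_std, jitter_local, shimmer_local, hnr]
```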

4.2. Constructing Models for Multi-Label SER

This section describes the construction of three deep single models and four deep fusion models. The Log-mel spectrogram is a two-dimensional (2D) form of data that represents the change in frequency and amplitude over time. Therefore, the 2D Log-mel spectrogram images can be used in the ViT architecture. Specifically, ViT divides input images into patches and performs operations patch-by-patch. The ViT architecture comprises a patch embedding, an encoder, and the Multi-Layer Perceptron (MLP) classifier. Initially, the image is tokenized by dividing it into patches. Then, linear embedding is applied to each patch to form a sequence, which is input to the transformer encoder; the encoder generates features from the correlations between patches through the self-attention mechanism, and the generated features are finally classified by the classifier [36]. By dividing the image into multiple patches, local features can be better extracted, thereby improving the efficiency and accuracy of image processing [37]. In this patch division process, the image can be divided into patches of squares, rectangles, or triangles based on the features of the image [37,38,39]. Although conducting self-attention in square patches may be optimal, this study divides the Log-mel spectrogram into rectangular patches to analyze the original image without resizing it for compaction or extension. Resizing might be necessary to create square patches, especially when considering rectangular images. This decision is based on the understanding that speech is characterized by sequences and the Log-mel spectrogram represents frequency and amplitude at specific time intervals. Therefore, dividing the patches into rectangles along the time axis can capture changes in frequency and amplitude over short periods, thereby enabling the identification of evolving emotions. Once the image patches are extracted, linear embedding is applied to each token and then input to the encoder. Figure 6a shows the ViT encoder structure. The encoder mainly comprises Layer Normalization (LN), Multi-Head Self-Attention (MHSA), MLP, and residual connection, and is calculated using Equations (3) and (4) [36].
Z'_l = \mathrm{MHSA}\left(\mathrm{LN}\left(Z_{l-1}\right)\right) + Z_{l-1} \quad (3)
Z_l = \mathrm{MLP}\left(\mathrm{LN}\left(Z'_l\right)\right) + Z'_l \quad (4)
In Equation (3), Z'_l represents the result of the LN, MHSA, and residual connection applied to the embedding vector Z_{l-1}, which is used as the input to the encoder. In Equation (4), Z_l represents the result of the LN, MLP, and residual connection performed on the output of Equation (3). The processes described in Equations (3) and (4) are repeated for each layer of the encoder. Through this repetition, the patch embeddings of the input images are transformed into more complex representations of the underlying relationships. By combining and updating the embedding information for each patch and the entire image, the model can learn numerous parameters to classify the images effectively. Therefore, the Log-mel spectrogram can be utilized in ViT to recognize both the global features of the speech and the detailed changes and patterns occurring in specific frequency bands. In addition, ViT is especially effective for processing Log-mel spectrograms. Unlike CNNs, which use fixed-size filters, ViT employs a self-attention mechanism that captures both global and local dependencies across the entire image [40]. By dividing the spectrogram into patches and applying self-attention to these patches, ViT can efficiently learn complex patterns that carry important information for emotion recognition. Therefore, this study utilizes ViT instead of CNN. Some studies have also applied the Mel spectrogram to ViT and demonstrated high performance in SER research [41,42]. Although ViT models pretrained on ImageNet can be utilized, the goal of this study is to develop a model appropriate for Korean SER. Therefore, instead of using pretrained ViT models, we built the model from scratch.
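To make the patch-embedding and encoder steps concrete, the following is a minimal Keras sketch of the ViT branch with full-height rectangular patches implementing Equations (3) and (4); the embedding dimension, head count, and MLP width are illustrative assumptions, not the exact configuration in Figure 8.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AddPositionEmbedding(layers.Layer):
    """Learnable positional embedding added to the sequence of patch tokens."""
    def build(self, input_shape):
        self.pos = self.add_weight(name="pos", shape=(1, input_shape[1], input_shape[2]),
                                   initializer="random_normal", trainable=True)

    def call(self, x):
        return x + self.pos

def encoder_block(x, num_heads=4, key_dim=64, mlp_dim=256, dropout=0.1):
    """One transformer encoder layer implementing Eqs. (3) and (4)."""
    h = layers.LayerNormalization(epsilon=1e-6)(x)                       # LN(Z_{l-1})
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim,
                                  dropout=dropout)(h, h)                 # MHSA
    x = layers.Add()([x, h])                                             # Eq. (3): Z'_l
    h = layers.LayerNormalization(epsilon=1e-6)(x)                       # LN(Z'_l)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dense(x.shape[-1])(h)                                     # MLP
    return layers.Add()([x, h])                                          # Eq. (4): Z_l

def build_vit_branch(input_shape=(100, 300, 1), patch_w=10, embed_dim=256,
                     depth=6, n_classes=5):
    """ViT over full-height rectangular 100 x 10 patches (30 patches per spectrogram)."""
    inputs = layers.Input(shape=input_shape)
    # Rectangular patch embedding: one full-height, 10-frame-wide patch per time step.
    x = layers.Conv2D(embed_dim, kernel_size=(input_shape[0], patch_w),
                      strides=(input_shape[0], patch_w))(inputs)         # (1, 30, embed_dim)
    x = layers.Reshape((-1, embed_dim))(x)                               # (30, embed_dim)
    x = AddPositionEmbedding()(x)
    for _ in range(depth):                                               # six encoder layers
        x = encoder_block(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)           # multi-label output
    return tf.keras.Model(inputs, outputs)
```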
Second, MFCCs are a 1D spectral feature widely used for vectorizing speech data, representing the energy in frequency bands as coefficients. Therefore, 1D MFCCs can be utilized in the 1D CNN-LSTM architecture for emotion recognition. Figure 6b shows the structure of the 1D CNN-LSTM. Specifically, the MFCC 1D vector is input into the 1D CNN layer, where convolutional filters are applied to detect patterns in the temporal sequence of these MFCCs. The CNN layer operates by sliding these filters across the input vector, capturing localized temporal features that represent variations in speech that correspond to different emotional states. The output from the CNN layer consists of feature maps, which are then downsampled using max pooling to reduce the dimensionality while retaining essential features. These downsampled feature maps are subsequently passed into the LSTM layer, which is designed to capture long-term dependencies and sequential information in the data. Thereby, the LSTM, acting as a decoder in the 1D CNN-LSTM structure, has the advantage of learning effectively from sequential data. As such, utilizing MFCCs with LSTM can facilitate the learning of multiple speech signals, thus enabling the detection of various emotions. By integrating CNNs for feature extraction and LSTMs for sequence modeling, this architecture enables the effective recognition of emotions from speech. Some studies have applied MFCCs to the 1D CNN-LSTM and demonstrated high performance in SER research [30,43]. While this study utilized only LSTM layers for SER due to their effectiveness in capturing long-term dependencies in sequential data, exploring other RNN variants, such as the Content-Adaptive Recurrent Unit (CARU) and Gated Recurrent Unit (GRU), could offer additional advantages in handling complex temporal patterns [44,45].
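A minimal Keras sketch of this 1D CNN-LSTM branch is shown below; it reproduces the convolutional filter settings reported in Section 5, while the fully connected layer size and dropout rate are assumptions.

```python
from tensorflow.keras import layers, Model

def build_cnn_lstm_branch(input_shape=(300, 40), n_classes=5):
    """1D CNN encoder over MFCC frames followed by an LSTM decoder (Figure 9)."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for n_filters in (32, 64, 128):                    # conv blocks: (32, 3), (64, 3), (128, 3)
        x = layers.Conv1D(n_filters, kernel_size=3, activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)        # (300, 40) -> (35, 128) after all blocks
    x = layers.LSTM(15)(x)                             # sequence summary of length 15
    x = layers.Dense(64, activation="relu")(x)         # fully connected layer (size assumed)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)
    return Model(inputs, outputs)
```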
Third, the VQFs comprising the 22 acoustic parameters are dimensionless structured data. Therefore, the VQFs can be used in the DNN to classify emotions. The DNN is a neural network comprising multiple layers of perceptrons and can learn complex patterns from non-linear data. Hence, the network is trained by inputting the 22 acoustic parameter values into the DNN and passing them through several hidden layers to the output layer. In this process, the weights of each layer are adjusted to determine the relationship between the input and output. Finally, after constructing the deep single models, we proceed to construct four deep fusion models using feature fusion: three models by combining features from two single models and one model by combining features from three single models. Specifically, the first fusion model combines ViT and 1D CNN-LSTM, integrating the ability of ViT to capture global patterns in 2D Log-mel spectrograms with the ability of 1D CNN-LSTM to process temporal sequences in MFCCs. The second fusion model integrates ViT with DNN, where ViT focuses on frequency and amplitude variations, and DNN utilizes voice quality characteristics like pitch and intensity. The third fusion model combines 1D CNN-LSTM with DNN, using 1D CNN-LSTM to capture changes over time in MFCCs and DNN to analyze structured voice quality data. Then, the comprehensive fusion model integrates ViT, 1D CNN-LSTM, and DNN, combining features from Log-mel spectrograms, MFCCs, and VQFs to analyze a diverse set of speech characteristics.
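As a sketch of the feature-fusion step, the DNN branch below follows the layer sizes in Figure 10, and the fusion head concatenates the per-branch feature vectors before an MLP classifier; the head width is an assumption. In practice, each single model's penultimate fully connected layer would serve as the branch output fed into the fusion.

```python
from tensorflow.keras import layers, Model

def build_dnn_branch(input_dim=22):
    """DNN feature branch over the 22 VQF parameters (Figure 10), returning penultimate features."""
    inputs = layers.Input(shape=(input_dim,))
    x = layers.Dense(32, activation="gelu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(16, activation="gelu")(x)
    return Model(inputs, x)

def build_fusion_model(branches, n_classes=5):
    """Feature fusion: concatenate each branch's feature vector and classify with an MLP head."""
    fused = layers.Concatenate()([b.output for b in branches])
    x = layers.Dense(128, activation="relu")(fused)     # MLP head width is an assumption
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="sigmoid")(x)
    return Model([b.input for b in branches], outputs)
```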
To perform multi-label emotion classification with five emotions, the final output layer of the model is configured with five nodes, with a sigmoid function applied as the activation function to derive the probability values for each class. When the probability value for a class exceeds the threshold, it indicates the presence of that emotion class; when it falls below the threshold, it indicates the absence of that emotion class. Hence, since only five emotions are used, the model’s output can represent a combination of up to five emotions or as few as zero emotions. Moreover, this study employed a binary focal loss as the loss function for model training. While focal loss was initially developed for object detection tasks, it has also been shown to exhibit advantageous characteristics for handling imbalanced datasets in classification problems [46]. Focal loss is a refinement of the cross-entropy loss that addresses class imbalance by assigning greater weight to difficult or easily misclassified cases [47]. Focal loss has demonstrated effectiveness in balancing loss by amplifying the loss in classes that are difficult to classify. As mentioned in Section 3.3, shift augmentation was performed to reduce the data variation for each label. However, the data imbalance of the original dataset was extreme, and augmentation could not entirely resolve this imbalance. Hence, the binary focal loss function is applied to mitigate this imbalance during model training by assigning a higher loss to emotions that the model fails to predict well, thereby improving its accuracy. The formula for binary focal loss is given by Equation (5).
\mathrm{BFL}\left(y, \hat{p}\right) = -\alpha \, y \left(1 - \hat{p}\right)^{\gamma} \log\left(\hat{p}\right) - \left(1 - y\right) \hat{p}^{\gamma} \log\left(1 - \hat{p}\right) \quad (5)
In Equation (5), y represents the binary class label (0, 1), \hat{p} represents the prediction probability, α is the weight to balance the classes, and γ is the hyperparameter that determines how much weight to give the losses. Thus, adjusting the value of γ based on the model and the data is crucial, as the appropriate setting of γ values can significantly impact the model’s performance.
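A minimal implementation of the binary focal loss in Equation (5) is sketched below; the α default is an assumption, and γ would be tuned within the range reported in Section 5.

```python
import tensorflow as tf

def binary_focal_loss(alpha=0.25, gamma=2.0):
    """Binary focal loss of Eq. (5), averaged over the five emotion outputs."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)       # numerical stability
        pos = -alpha * y_true * tf.pow(1.0 - y_pred, gamma) * tf.math.log(y_pred)
        neg = -(1.0 - y_true) * tf.pow(y_pred, gamma) * tf.math.log(1.0 - y_pred)
        return tf.reduce_mean(pos + neg)
    return loss
```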

4.3. Validating the Performance of Constructed Models

This section presents an evaluation and analysis of the performance of the deep single and fusion models. Since the constructed models perform multi-label classification, the probability values for the five emotions are calculated using the sigmoid activation function in the final output layer of the models. Specifically, if the probability value exceeds the threshold, the label is 1, and if it falls below the threshold, the label is 0. A confusion matrix for the five emotions is thus employed to evaluate the performance of each emotion. The confusion matrix is a table used to evaluate the performance of a classification model, displaying the relationship between the predicted classes and the actual classes, as shown in Figure 7. The matrix in Figure 7 is employed to calculate the binary accuracy, precision, recall, and f1-score to evaluate the models’ performance.
The model’s binary accuracy, which represents the proportion of correctly classified samples by the model within a specific class, is calculated as (TP + TN)/(TP + FN + FP + TN). Precision, which signifies the proportion of accurate positive predictions among all positive predictions, is calculated as TP/(TP + FP). Recall, which signifies the proportion of correct positive predictions among all positive samples, is calculated as TP/(TP + FN). Subsequently, the f1-score is defined as the harmonic mean of precision and recall and is calculated as (2 × precision × recall)/(precision + recall). Therefore, the f1-score can be used to select models with high precision and recall. By evaluating the binary accuracy and f1-score calculated for each class, this study can select the most appropriate architecture for multi-label classification using Korean speech data.
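The per-emotion metrics described above can be computed directly from the confusion-matrix counts, as in the following sketch.

```python
import numpy as np

def per_emotion_metrics(y_true, y_pred):
    """Binary accuracy, precision, recall, and f1-score per emotion.
    y_true and y_pred are binary arrays of shape (n_samples, 5)."""
    results = {}
    for c in range(y_true.shape[1]):
        t, p = y_true[:, c], y_pred[:, c]
        tp = int(np.sum((t == 1) & (p == 1)))
        tn = int(np.sum((t == 0) & (p == 0)))
        fp = int(np.sum((t == 0) & (p == 1)))
        fn = int(np.sum((t == 1) & (p == 0)))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results[c] = {
            "binary_accuracy": (tp + tn) / (tp + tn + fp + fn),
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0,
        }
    return results
```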

5. Results and Performance Analysis

In this study, we compared the performance of both the single and fusion models constructed using speech features extracted from the dataset shown in Table 4. Table 6 provides an overview of the experimental environment.
To construct the SER models, we extracted three speech features. Specifically, to extract the Log-mel spectrogram, this study set the number of Mel filters to M = 100 and used a fixed 25 ms frame length with a 15 ms overlap (i.e., a 10 ms hop) in the STFT. Additionally, 40 MFCCs were extracted using the same STFT configuration. The extraction of 100 Log-mel spectrogram filters and 40 MFCCs is based on previous studies [48,49]. Subsequently, the 22 acoustic parameters constituting the VQFs were extracted. Given that all speech features were extracted from audio data with a length of 3 s, the input vector size of the Log-mel spectrogram for model training was (100, 300, 1). Next, a transpose operation was performed on the existing vectors to enable the utilization of the 1D CNN-LSTM [50]. Thus, the vector size of the MFCCs, representing 40 coefficients for each frame, was (300, 40). Moreover, the vector size of the VQFs representing the structured dimensionless data was (22).
Before constructing the single models, we compared the performance of the ViT model using rectangular versus square patches. As shown in Table 7, ViT models utilizing rectangular patches exhibited a slightly higher average f1-score compared to those using square patches. The rectangular patches more effectively captured the temporal, frequency, and amplitude characteristics of the Log-mel spectrogram, thereby enhancing the recognition of mixed emotions. Although rectangular patches are not commonly applied in ViT, using them on smaller speech datasets can help compensate for ViT's weak inductive bias, highlighting the effectiveness of patch selection strategies in SER. Therefore, the ViT with rectangular patches was used in this study.
Next, we describe the architectures of the ViT model with rectangular patches, the 1D CNN-LSTM, and the DNN. The architectures of these models constructed using the extracted speech features are illustrated in Figure 8, Figure 9 and Figure 10. In Figure 8, the input Log-mel spectrogram shape for the ViT was 100 × 300 pixels. Given that each input patch has a shape of 100 × 10 pixels, the image was divided into 30 patches. Thus, the embedding dimension for each patch was calculated as 100 × 10 × 1 = 1000. Positional embeddings containing sequence information were then applied to each patch after linear embedding. After the application of linear and positional embedding, each patch passed through six transformer encoder layers and was then flattened into a one-dimensional array, forming a vector. Finally, the emotions were classified using MLP, with dropout regularization applied to the fully connected layer to avoid overfitting.
The architectures of the 1D CNN-LSTM and DNN are shown in Figure 9 and Figure 10. In Figure 9, the 1D CNN acts as an encoder in the 1D CNN-LSTM architecture, comprising three layers. During the convolution operation, the number of filters and filter lengths were set to (32, 3), (64, 3), and (128, 3), respectively. Consequently, the input MFCCs vector sized (300, 40) underwent a convolution operation and max pooling to transform it into a feature vector sized (35, 128). The LSTM layer used this feature vector from the 1D CNN as input and converted it into a vector sized (1, 15). Subsequently, a fully connected layer with a dropout was employed. Finally, in Figure 10, the first hidden layer of the DNN model has 32 neurons, the second hidden layer has 16 neurons, and the output layer has 5 neurons. Between these layers, the GELU activation function, batch normalization, and dropout were applied to avoid overfitting.
Furthermore, for model training, the Adam optimizer was employed, selected for its computational efficiency and low memory requirements. The Adam optimizer’s learning rate was set within the range of [0.0005, 0.001]. Additionally, the hyperparameters were searched with epochs in the range [50, 150], batch sizes in [32, 256], early stopping patience in [5, 8], and gamma values in [0.5, 2.5]. We identified and utilized the hyperparameters that exhibited optimal performance throughout the model training and performance evaluation processes. Finally, the models were trained and evaluated with the sigmoid function threshold set to 0.45, which outputs the probability values for the five classes at the final output layer. This threshold was chosen instead of the standard 0.5 because it led to higher performance by enabling a more complex combination of various emotions to be derived. The overall performances of the single deep models constructed in this study are listed in Table 8. The performances of the fusion models are listed in Table 9.
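As a minimal illustration of this training configuration, the sketch below uses single values chosen from the reported ranges; the model and data variable names refer to the earlier sketches and are placeholders rather than the authors' code.

```python
import tensorflow as tf

# Example settings drawn from the reported ranges: learning rate 1e-3, 100 epochs,
# batch size 64, early-stopping patience 5, and a 0.45 decision threshold.
model = build_cnn_lstm_branch()        # e.g., the MFCC branch sketched in Section 4.2
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=binary_focal_loss(gamma=2.0),
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.45)])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, batch_size=64, callbacks=[early_stop])

# Multi-label prediction: an emotion is considered present when its sigmoid output exceeds 0.45.
y_prob = model.predict(x_test)
y_label = (y_prob > 0.45).astype(int)
```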
As Table 8 shows, both the ViT and 1D CNN-LSTM models exhibited notably higher performance than the DNN model. Specifically, the average binary accuracy was the highest for the 1D CNN-LSTM model at 70.7%, whereas the average f1-score was the highest for the ViT model at 51.2%. However, the DNN model utilizing VQFs did not exhibit high performance in the multi-label emotion classification, particularly showing relatively low recall and f1-score for the ‘Anger’ emotion compared to other emotions. This indicates that VQFs do not significantly contribute to identifying the ‘Anger’ emotion. In contrast, the ViT model utilizing the Log-mel spectrogram and the 1D CNN-LSTM model utilizing MFCCs demonstrated relatively higher performance across all emotions, suggesting that these two speech features are effective for SER.
Next, as shown in Table 9, the ViT and 1D CNN-LSTM fusion model achieved the highest performance, with an average binary accuracy of 71.2% and an average f1-score of 52.2%. By integrating the Log-mel spectrogram and MFCC features, the model learned diverse characteristics, such as frequency, amplitude, and frequency coefficients over time, which can facilitate the identification of complex emotions. Based on these results, the ViT and 1D CNN-LSTM fusion model performed the best. Indeed, while the other three fusion models exhibited enhancements in performance compared to the DNN model, they did not surpass that of the ViT and 1D CNN-LSTM fusion model. This indicates that the lower performance of the DNN model compared to the other single models contributed to a decrease in the overall performance when integrated into the fusion models.
Next, having the highest average binary accuracy and average f1-score indicates that the ViT and 1D CNN-LSTM fusion model more accurately identifies the multi-label classifications of mixed emotions compared to other models. The architecture of the ViT and 1D CNN-LSTM fusion model is shown in Figure 11. In Figure 11, the feature fusion approach was applied, where the fully connected layers derived from ViT and 1D CNN-LSTM are concatenated and processed through an MLP to perform multi-label classification. The other fusion models have similar architectures: they extract features from the single models, concatenate these features, and subsequently use an MLP for the final classification task. This approach allows for the integration of diverse feature representations, enabling the model to better capture the mixed emotions in speech data.
To further explore the model’s performance, this study conducts an in-depth analysis of the ViT and 1D CNN-LSTM fusion model, which has shown the best performance among the constructed models. When examining the model’s performance in more detail, the performance metrics reveal a higher binary accuracy, suggesting that the model is generally effective at distinguishing between the presence and absence of emotions. However, this model shows a lower f1-score and recall compared to the binary accuracy.
Specifically, the model’s ability to achieve a higher True Negative Rate (TNR) across all emotions highlights its effectiveness in minimizing false positives. The lower recall and f1-score imply that, while the model performs well in minimizing false positives, challenges arise in correctly identifying all classes of certain emotions, particularly when those emotions are less distinctly expressed or when multiple emotions are mixed in a speech recording. For instance, ‘Anger’ and ‘Disgust’ exhibit relatively lower recall values compared to other emotions. This issue is a common challenge in multi-label emotion recognition, where the complexity of the task often leads to imbalances in precision and recall. Despite these challenges, the ViT and 1D CNN-LSTM fusion model demonstrates reliable overall performance. With further improvements, the model has the potential to achieve even better results, enhancing its ability to accurately recognize a wide range of emotions.
In conclusion, by utilizing the Korean emotional speech database provided by the AI-Hub for multi-label emotion classification, this study demonstrated that the ViT and 1D CNN-LSTM fusion model is the most appropriate choice. While deep single models constructed using either the Log-mel spectrogram or MFCCs also demonstrated commendable performance, the fusion models that utilized both speech features generally exhibited better performance. This implies that integrating multiple speech features into SER is more effective than using a single feature. However, studies on multi-label emotion recognition are limited, making direct performance comparisons with prior studies challenging. Specifically, recent studies exploring multi-label emotion classification often employ multimodal approaches, where different modalities such as facial expressions or text are combined with speech to recognize emotions [51,52]. This multimodal approach in recent studies complicates direct comparisons with our method, which relies solely on the speech modality. Additionally, as mentioned in the Related Works section, other multi-label classification studies focus on identifying different sets of emotions, with some studies concentrating on as few as four emotions and others on as many as six [17,18]. In contrast, our study specifically focuses on detecting five emotions, which makes further performance comparisons difficult since the emotions identified in other studies do not fully align with those in our research. These variations in the number and type of emotions further complicate direct performance comparisons. To address these challenges, we adopted a strategy of constructing a diverse set of models and conducting comprehensive performance comparisons within the scope of this study. We anticipate that future research will be able to use this work as a benchmark, facilitating more meaningful performance comparisons in the field.

6. Concluding Remarks

Studies on SER using deep learning-based approaches are actively ongoing. However, almost none have specifically used Korean speech data for multi-label classification. Therefore, this study performed multi-label emotion classification using the Korean speech database. To validate this approach, the Log-mel spectrogram, MFCCs, and VQFs were extracted from the Korean speech database provided by the NIA’s AI-Hub. Consequently, three deep single and four fusion models were constructed for SER. In conclusion, the ViT and 1D CNN-LSTM fusion model demonstrated the highest performance in multi-label emotion classification, achieving an average binary accuracy of 71.2% and an average f1-score of 52.2%. The next highest performing model was the ViT, 1D CNN-LSTM, and DNN fusion model, which achieved an average binary accuracy of 70.6% and an average f1-score of 50.9%. This contributes to increasing the efficiency and accuracy of Korean speech multi-label emotion classification. Additionally, this highlights that extracting and utilizing multiple features from speech data is effective for emotion recognition tasks.
This study makes the following contributions. First, it contributes to the literature by performing multi-label emotion classification using a Korean speech database. We expect to establish a foundational framework for future studies on Korean speech processing and emotion analysis. Second, these fusion models effectively combine speech features, such as Log-mel spectrogram, MFCCs, and VQFs, to enhance accuracy, rather than relying solely on one speech feature. Thus, we demonstrated that utilizing multiple speech features in combination improves SER performance. Third, we used the AI-Hub speech emotional database, which consists of recordings from a wide range of ages from the 20s to the 50s, without limitations on specific age groups or genders. Therefore, the models constructed using the AI-Hub database can be applied by a diverse range of users in real-world settings where the Korean language is used. Additionally, these models can contribute to a better understanding and management of emotional states by monitoring diverse users.
Despite these contributions, further studies are required. First, this study utilized the Log-mel spectrogram, MFCCs, and VQFs to combine various speech features. However, MFCCs are particularly sensitive to background noise, which may affect performance in noisy environments. To address this limitation, future studies should incorporate noise reduction techniques during the preprocessing stage to improve the overall model performance. Next, since ViT has a low inductive bias, its performance can degrade when trained on small-scale datasets. To address this, this study aimed to enhance the model’s understanding of local structures and patterns by using rectangular patches instead of square ones, thereby improving emotion classification performance. Therefore, with the acquisition of additional data, studies can be carried out to enhance the performance of the ViT model, utilizing both the rectangular-patch strategy and a larger-scale database. Lastly, while this study did not employ interpretability techniques to analyze the decision-making process, future research will focus on integrating methods such as Grad-CAM or SHAP to improve interpretability. Specifically, by utilizing these advanced techniques, future studies aim to provide deeper insights into the inner workings of the model, offering a more comprehensive understanding of how specific features influence the model’s predictions.

Author Contributions

Conceptualization, S.P., B.J., S.L. and J.Y.; Methodology, S.P., B.J., S.L. and J.Y.; Software, S.P. and B.J.; Validation, J.Y.; Formal analysis, S.L.; Investigation, S.P., B.J. and J.Y.; Resources, J.Y.; Data curation, S.P. and B.J.; Writing—original draft, S.P., B.J., S.L. and J.Y.; Writing—review and editing, J.Y.; Visualization, S.P. and B.J.; Supervision, J.Y.; Project administration, S.P. and J.Y.; Funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported in 2023 by Konkuk University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data analyzed in this study are available in the AI-Hub (https://www.aihub.or.kr (accessed on 20 November 2023)) of the National Information Society Agency.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76. [Google Scholar] [CrossRef]
  2. Gunes, H.; Pantic, M. Automatic, dimensional and continuous emotion recognition. Int. J. Synth. Emot. (IJSE) 2010, 1, 68–99. [Google Scholar] [CrossRef]
  3. Alkatheiri, M.S. Artificial intelligence assisted improved human-computer interactions for computer systems. Comput. Electr. Eng. 2022, 101, 107950. [Google Scholar] [CrossRef]
  4. Jo, A.-H.; Kwak, K.-C. Speech emotion recognition based on two-stream deep learning model using Korean audio information. Appl. Sci. 2023, 13, 2167. [Google Scholar] [CrossRef]
  5. Ali, M.; Mosa, A.H.; Machot, F.A.; Kyamakya, K. Emotion recognition involving physiological and speech signals: A comprehensive review. In Recent Advances in Nonlinear Dynamics and Synchronization: With Selected Applications in Electrical Engineering Neurocomputing, and Transportation; Springer: Cham, Switzerland, 2018; pp. 287–302. [Google Scholar]
  6. Gangamohan, P.; Kadiri, S.R.; Yegnanarayana, B. Analysis of emotional speech—A review. In Toward Robotic Socially Believable Behaving Systems-Volume I: Modeling Emotions; Springer: Cham, Switzerland, 2016; pp. 205–238. [Google Scholar]
  7. Singh, J.; Saheer, L.B.; Faust, O. Speech emotion recognition using attention model. Int. J. Environ. Res. Public Health 2023, 20, 5140. [Google Scholar] [CrossRef]
  8. Dzedzickis, A.; Kaklauskas, A.; Bucinskas, V. Human emotion recognition: Review of sensors and methods. Sensors 2020, 20, 592. [Google Scholar] [CrossRef] [PubMed]
  9. Fahad, M.S.; Ranjan, A.; Yadav, J.; Deepak, A. A survey of speech emotion recognition in natural environment. Digit. Signal Process. 2021, 110, 102951. [Google Scholar] [CrossRef]
  10. Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech emotion recognition using deep learning techniques: A review. IEEE Access 2019, 7, 117327–117345. [Google Scholar] [CrossRef]
  11. Bendjoudi, I.; Vanderhaegen, F.; Hamad, D.; Dornaika, F. Multi-label, multi-task CNN approach for context-based emotion recognition. Inf. Fusion 2021, 76, 422–428. [Google Scholar] [CrossRef]
  12. Lugger, M.; Yang, B. The relevance of voice quality features in speaker independent emotion recognition. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Honolulu, HI, USA, 15–20 April 2007. [Google Scholar]
  13. Zhang, H.; Gou, R.; Shang, J.; Shen, F.; Wu, Y.; Dai, G. Pre-trained deep convolution neural network model with attention for speech emotion recognition. Front. Physiol. 2021, 12, 643202. [Google Scholar] [CrossRef]
  14. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323. [Google Scholar]
  15. Dolka, H.; Arul Xavier, V.M.; Juliet, S. Speech emotion recognition using ANN on MFCC features. In Proceedings of the 2021 3rd International Conference on Signal Processing and Communication (ICPSC), Coimbatore, India, 13–14 May 2021. [Google Scholar]
  16. Kumbhar, H.S.; Bhandari, S.U. Speech emotion recognition using MFCC features and LSTM network. In Proceedings of the 2019 5th International Conference on Computing, Communication, Control and Automation (ICCUBEA), Pune, India, 19–21 September 2019. [Google Scholar]
  17. Li, X.; Zhang, Z.; Gan, C.; Xiang, Y. Multi-label speech emotion recognition via inter-class difference loss under response residual network. IEEE Trans. Multimed. 2022, 25, 3230–3244. [Google Scholar] [CrossRef]
  18. Slimi, A.; Hafar, N.; Zrigui, M.; Nicolas, H. Multiple models fusion for multi-label classification in speech emotion recognition systems. Procedia Comput. Sci. 2022, 207, 2875–2882. [Google Scholar] [CrossRef]
  19. Byun, S.-W.; Lee, S.-P. A study on a speech emotion recognition system with effective acoustic features using deep learning algorithms. Appl. Sci. 2021, 11, 1890. [Google Scholar] [CrossRef]
  20. Tu, Z.; Liu, B.; Zhao, W.; Yan, R.; Zou, Y. A feature fusion model with data augmentation for speech emotion recognition. Appl. Sci. 2023, 13, 4124. [Google Scholar] [CrossRef]
  21. Joshi, D.; Pareek, J.; Ambatkar, P. Comparative study of Mfcc and Mel spectrogram for raga classification using CNN. Indian J. Sci. Technol. 2023, 16, 816–822. [Google Scholar] [CrossRef]
  22. Peng, N.; Chen, A.; Zhou, G.; Chen, W.; Zhang, W.; Liu, J.; Ding, F. Environment sound classification based on visual multi-feature fusion and GRU-AWS. IEEE Access 2020, 8, 191100–191114. [Google Scholar] [CrossRef]
  23. Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A comprehensive review of speech emotion recognition systems. IEEE Access 2021, 9, 47795–47814. [Google Scholar] [CrossRef]
  24. Abbaschian, B.J.; Sierra-Sosa, D.; Elmaghraby, A. Deep learning techniques for speech emotion recognition, from databases to models. Sensors 2021, 21, 1249. [Google Scholar] [CrossRef]
  25. Harár, P.; Burget, R.; Dutta, M.K. Speech emotion recognition with deep learning. In Proceedings of the 2017 4th International Conference on Signal Processing and Integrated Networks (SPIN), Delhi-NCR, India, 2–3 February 2017. [Google Scholar]
  26. Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech emotion recognition from spectrograms with deep convolutional neural network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017. [Google Scholar]
  27. Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech emotion recognition with dual-sequence LSTM architecture. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020. [Google Scholar]
  28. Panda, S.K.; Jena, A.K.; Panda, M.R.; Panda, S. Speech emotion recognition using multimodal feature fusion with machine learning approach. Multimed. Tools Appl. 2023, 82, 42763–42781. [Google Scholar] [CrossRef]
  29. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [Google Scholar] [CrossRef]
  30. Pan, S.-T.; Wu, H.-J. Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation. Electronics 2023, 12, 2436. [Google Scholar] [CrossRef]
  31. Etienne, C.; Fidanza, G.; Petrovskii, A.; Devillers, L.; Schmauch, B. CNN+LSTM architecture for speech emotion recognition with data augmentation. arXiv 2018, arXiv:1802.05630. [Google Scholar]
  32. Yi, L.; Mak, M.-W. Improving speech emotion recognition with adversarial data augmentation network. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 172–184. [Google Scholar] [CrossRef]
  33. Thornton, B. Audio Recognition Using Mel Spectrograms and Convolution Neural Networks; Academia: San Francisco, CA, USA, 2019. [Google Scholar]
  34. Abdul, Z.K.; Al-Talabani, A.K. Mel frequency cepstral coefficient and its applications: A review. IEEE Access 2022, 10, 122136–122158. [Google Scholar] [CrossRef]
  35. Gupta, S.; Jaafar, J.; Ahmad, W.W.; Bansal, A. Feature extraction using MFCC. Signal Image Process. Int. J. 2013, 4, 101–108. [Google Scholar] [CrossRef]
  36. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  37. Zhou, T.; Niu, Y.; Lu, H.; Peng, C.; Guo, Y.; Zhou, H. Vision transformer: To discover the “four secrets” of image patches. Inf. Fusion 2024, 105, 102248. [Google Scholar] [CrossRef]
  38. Chen, Z.; Zhu, Y.; Zhao, C.; Hu, G.; Zeng, W.; Wang, J.; Tang, M. Dpt: Deformable patch-based transformer for visual recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021. [Google Scholar]
  39. Kobat, S.G.; Baygin, N.; Yusufoglu, E.; Baygin, M.; Barua, P.D.; Dogan, S.; Yaman, O.; Celiker, U.; Yildirim, H.; Tan, R.-S. Automated diabetic retinopathy detection using horizontal and vertical patch division-based pre-trained DenseNET with digital fundus images. Diagnostics 2022, 12, 1975. [Google Scholar] [CrossRef]
  40. Akinpelu, S.; Viriri, S.; Adegun, A. An enhanced speech emotion recognition using vision transformer. Sci. Rep. 2024, 14, 13126. [Google Scholar] [CrossRef]
  41. Kumar, C.A.; Maharana, A.D.; Krishnan, S.M.; Hanuma, S.S.; Lal, G.J.; Ravi, V. Speech emotion recognition using CNN-LSTM and vision transformer. In Proceedings of the International Conference on Innovations in Bio-Inspired Computing and Applications, Seattle, WA, USA, 15–17 December 2022. [Google Scholar]
  42. Tran, M.; Soleymani, M. A pre-trained audio-visual transformer for emotion recognition. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022. [Google Scholar]
  43. Kawade, R.; Konade, R.; Majukar, P.; Patil, S. Speech Emotion Recognition Using 1D CNN-LSTM Network on Indo-Aryan Database. In Proceedings of the 2022 Third International Conference on Intelligent Computing Instrumentation and Control Technologies (ICICICT), Kannur, India, 11–12 August 2022. [Google Scholar]
  44. Chan, K.-H.; Ke, W.; Im, S.-K. CARU: A content-adaptive recurrent unit for the transition of hidden state in NLP. In Proceedings of the Neural Information Processing: 27th International Conference, ICONIP 2020, Bangkok, Thailand, 23–27 November 2020; Proceedings, Part I. [Google Scholar]
  45. Dey, R.; Salem, F.M. Gate-variants of gated recurrent unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017. [Google Scholar]
  46. Tran, G.S.; Nghiem, T.P.; Luong, C.M.; Burie, J.-C. Improving accuracy of lung nodule classification using deep learning with focal loss. J. Healthc. Eng. 2019, 2019, 5156416. [Google Scholar] [CrossRef] [PubMed]
  47. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. Proc. IEEE Int. Conf. Comput. Vis. 2017, 42, 318–327. [Google Scholar]
  48. Paseddula, C.; Gangashetty, S.V. Late fusion framework for Acoustic Scene Classification using LPCC, SCMC, and log-Mel band energies with Deep Neural Networks. Appl. Acoust. 2021, 172, 107568. [Google Scholar] [CrossRef]
  49. Patni, H.; Jagtap, A.; Bhoyar, V.; Gupta, A. Speech emotion recognition using MFCC, GFCC, chromagram and RMSE features. In Proceedings of the 2021 8th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 26–27 August 2021. [Google Scholar]
  50. Chang, Z.; He, R.; Yu, Y.; Zhang, Z.; Bai, G. A two-stream convolution architecture for ESC based on audio feature distanglement. Asian Conf. Mach. Learn. 2023, 189, 153–168. [Google Scholar]
  51. Yoon, S.; Byun, S.; Jung, K. Multimodal speech emotion recognition using audio and text. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018. [Google Scholar]
  52. Zhang, J.; Xing, L.; Tan, Z.; Wang, H.; Wang, K. Multi-head attention fusion networks for multi-modal speech emotion recognition. Comput. Ind. Eng. 2022, 168, 108078. [Google Scholar] [CrossRef]
Figure 1. Example of applying the hard labeling process.
Figure 2. Example of audio data preprocessing steps.
Figure 3. Example of applying shift augmentation.
Figure 4. Overview process of constructing the SER models.
Figure 5. Example of converting audio data to the Log-mel spectrogram and MFCCs.
Figure 6. (a) Vision-Transformer encoder; (b) 1D CNN-LSTM structure.
Figure 7. Confusion matrix.
Figure 8. Vision-Transformer model architecture with rectangle patches.
Figure 9. The 1D CNN-LSTM model architecture.
Figure 10. DNN model architecture.
Figure 11. Vision-Transformer and 1D CNN-LSTM fusion model architecture.
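As a complement to Figure 5, the following sketch illustrates one way the Log-mel spectrogram and MFCCs could be extracted with librosa; the file name, sampling rate, FFT/hop sizes, number of mel bands, and number of coefficients are illustrative assumptions rather than the authors' exact settings.

```python
# Minimal sketch of extracting a Log-mel spectrogram and MFCCs with librosa.
# The file path, sampling rate, n_mels, and n_mfcc values are assumptions for illustration.
import librosa
import numpy as np

y, sr = librosa.load("sample_utterance.wav", sr=16000)  # hypothetical file, resampled to 16 kHz

# Mel spectrogram converted to a log (dB) scale
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=100)
log_mel = librosa.power_to_db(mel, ref=np.max)           # shape: (n_mels, n_frames)

# MFCCs computed from the same signal
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_fft=1024, hop_length=256)

print(log_mel.shape, mfccs.shape)
```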
Table 1. Summary of related works on SER.

| Approach | Database (Language) | Features | Classifier | Reference |
|---|---|---|---|---|
| Multi-class classification | EmoDB (German) | Speech utterances | DNN | [25] |
| Multi-class classification | EmoDB (German) | Spectrogram | Deep CNN | [26] |
| Multi-class classification | EmoDB (German), IEMOCAP (English) | Raw audio clips, Log-mel spectrogram | 1D CNN-LSTM and 2D CNN-LSTM | [14] |
| Multi-class classification | RAVDESS (English) | MFCCs | LSTM | [16] |
| Multi-class classification | EmoDB (German), IEMOCAP (English) | Log-mel spectrogram | Bi-LSTM | [13] |
| Multi-class classification | IEMOCAP (English) | MFCCs, Mel spectrogram | Dual Sequence-LSTM | [27] |
| Multi-class classification | Korean Speech Database (Korean) | MFCCs, GTCCs, ERB spectrogram, etc. | CNN-LSTM | [4] |
| Multi-label classification | IEMOCAP (English) | Speech utterances | Response Residual Network | [17] |
| Multi-label classification | RML (English, Mandarin, Urdu, Punjabi, Persian, Italian) | MFCCs, Log-mel spectrogram | CNN-LSTM | [18] |
Table 2. Example of the AI-Hub emotional dataset.

| Audio Id | Judge 1 | Judge 2 | Judge 3 | Judge 4 | Judge 5 |
|---|---|---|---|---|---|
| Audio#1 | Sadness | Neutral | Neutral | Sadness | Neutral |
| Audio#2 | Neutral | Anger | Sadness | Anger | Disgust |
| Audio#3 | Happiness | Sadness | Neutral | Happiness | Happiness |
| Audio#N | Disgust | Neutral | Anger | Neutral | Disgust |
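Figure 1 and Table 2 describe the hard labeling of the five judges' annotations into multi-label targets. The sketch below shows one plausible implementation; the two-vote threshold and the label order are illustrative assumptions, not necessarily the authors' exact rule.

```python
# Minimal sketch: turning five judges' votes into a multi-hot label vector.
# The two-vote threshold is an illustrative assumption, not necessarily the authors' rule.
from collections import Counter

EMOTIONS = ["Sadness", "Happiness", "Neutral", "Anger", "Disgust"]

def hard_label(judge_votes, min_votes=2):
    """Return a multi-hot vector over EMOTIONS from a list of judge annotations."""
    counts = Counter(judge_votes)
    return [1 if counts[emotion] >= min_votes else 0 for emotion in EMOTIONS]

# Audio#1 from Table 2: Sadness, Neutral, Neutral, Sadness, Neutral
print(hard_label(["Sadness", "Neutral", "Neutral", "Sadness", "Neutral"]))
# -> [1, 0, 1, 0, 0], i.e., the mixed label "Sadness and Neutral"
```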
Table 3. The number of data for each emotion label.

| Label | Number of Data | Ratio |
|---|---|---|
| Sadness | 15,184 | 37.4% |
| Anger | 6466 | 15.9% |
| Neutral | 5515 | 13.6% |
| Happiness | 3126 | 7.7% |
| Disgust | 2099 | 5.2% |
| Sadness and Neutral | 2909 | 7.2% |
| Sadness and Anger | 1625 | 4.0% |
| Happiness and Neutral | 1313 | 3.2% |
| Anger and Neutral | 640 | 1.6% |
| Anger and Disgust | 627 | 1.5% |
| Sadness and Disgust | 480 | 1.2% |
| Happiness and Sadness | 395 | 1.0% |
| Neutral and Disgust | 266 | 0.7% |
| Sum | 40,645 | 100% |
Table 4. The number of data for each emotion label after applying shift augmentation.

| Label | Training, # of Data | Training, Ratio | Validation, # of Data | Validation, Ratio | Test, # of Data | Test, Ratio |
|---|---|---|---|---|---|---|
| Anger | 6354 | 8.9% | 897 | 7.5% | 1106 | 8.6% |
| Happiness | 5800 | 8.1% | 817 | 6.8% | 1039 | 8.1% |
| Neutral | 5781 | 8.1% | 866 | 7.2% | 1069 | 8.3% |
| Sadness | 5456 | 7.6% | 831 | 6.9% | 1013 | 7.9% |
| Disgust | 5151 | 7.2% | 861 | 7.2% | 767 | 6.0% |
| Anger and Disgust | 6178 | 8.7% | 1136 | 9.5% | 1085 | 8.4% |
| Happiness and Sadness | 5818 | 8.2% | 780 | 6.5% | 1317 | 10.2% |
| Sadness and Disgust | 5400 | 7.6% | 1279 | 10.7% | 796 | 6.2% |
| Happiness and Neutral | 5335 | 7.5% | 920 | 7.7% | 1114 | 8.7% |
| Sadness and Anger | 5274 | 7.4% | 781 | 6.5% | 982 | 7.6% |
| Sadness and Neutral | 5023 | 7.0% | 700 | 5.8% | 972 | 7.6% |
| Anger and Neutral | 4935 | 6.9% | 1032 | 8.6% | 799 | 6.2% |
| Neutral and Disgust | 4835 | 6.8% | 1068 | 8.9% | 790 | 6.1% |
| Sum | 71,340 | 100% | 11,968 | 100% | 12,849 | 100% |
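Table 4 reports the class balance after the shift augmentation illustrated in Figure 3. The following is a minimal sketch of a random time-shift augmentation with NumPy; the maximum shift and the circular-shift behavior are assumptions for illustration and may differ from the authors' implementation.

```python
# Minimal sketch of shift augmentation on a raw waveform.
# The maximum shift fraction is an illustrative assumption.
import numpy as np

def shift_augment(y, sr, max_shift_sec=0.5, rng=None):
    """Randomly shift the waveform in time by up to max_shift_sec seconds."""
    rng = rng or np.random.default_rng()
    max_shift = int(max_shift_sec * sr)
    shift = rng.integers(-max_shift, max_shift + 1)
    return np.roll(y, shift)  # circular shift; zero-padding is another common choice

# Example with a synthetic 1-second signal at 16 kHz
sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
y_shifted = shift_augment(y, sr)
```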
Table 5. Description of the 22 acoustic parameters.

| Main Features | Derivative Features |
|---|---|
| Fundamental frequency (F0) | F0 mean, F0 stdev |
| Harmonics-to-Noise Ratio (HNR) | HNR |
| Jitter | Local jitter, Local absolute jitter, Rap jitter, PPQ5 jitter, DDP jitter |
| Shimmer | Local shimmer, Local dB shimmer, APQ3 shimmer, APQ5 shimmer, APQ11 shimmer, DDA shimmer |
| Intensity | Intensity max, Intensity mean, Intensity min, Intensity dynamic range, Intonation variation |
| Pitch | Pitch max, Pitch mean, Pitch range |
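The parameters in Table 5 are voice quality measures of the kind commonly computed with Praat. The sketch below, which assumes the parselmouth wrapper and widely used Praat command defaults, shows how a few of them (F0 mean and stdev, HNR, local jitter, local shimmer) might be obtained; it is not the authors' extraction pipeline.

```python
# Minimal sketch of extracting a few voice quality features with parselmouth (Praat).
# This is an illustrative recipe using common Praat defaults, not the authors' exact pipeline.
import numpy as np
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("sample_utterance.wav")    # hypothetical file

# F0 statistics over voiced frames
pitch = snd.to_pitch()
f0_values = pitch.selected_array["frequency"]
f0_values = f0_values[f0_values > 0]               # drop unvoiced frames
f0_mean, f0_stdev = np.mean(f0_values), np.std(f0_values)

# Harmonics-to-Noise Ratio
harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
hnr = call(harmonicity, "Get mean", 0, 0)

# Jitter and shimmer from a point process of glottal pulses
point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
local_jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
local_shimmer = call([snd, point_process], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

print(f0_mean, f0_stdev, hnr, local_jitter, local_shimmer)
```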
Table 6. Experimental environment.

| Division | Item | Use |
|---|---|---|
| Hardware | CPU | AMD Ryzen™ 9 7950 @ 4.5 GHz (AMD, Santa Clara, CA, USA) |
| Hardware | GPU | NVIDIA GeForce RTX 4090 (NVIDIA, Santa Clara, CA, USA) |
| Hardware | RAM | 64 GB |
| Software | OS | Linux |
| Software | Kernel version | 5.15.0-58-generic |
| Software | Programming language | Python 3.8.16 |
Table 7. Performance comparison between ViT models with rectangular vs. square patches.

| Model | Image Size | Patch Size | Num of Patches | Avg. Binary Accuracy | Avg. Recall | Avg. Precision | Avg. F1-Score |
|---|---|---|---|---|---|---|---|
| Rectangle patches | (100, 300, 1) | (100, 10, 1) | 30 | 0.678 | 0.531 | 0.503 | 0.512 |
| Square patches | (128, 128, 1) | (32, 32, 1) | 16 | 0.677 | 0.506 | 0.500 | 0.499 |
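The patch counts in Table 7 follow from dividing each image dimension by the corresponding patch dimension, as the quick check below confirms.

```python
# Quick check of the patch counts reported in Table 7.
def num_patches(img_h, img_w, patch_h, patch_w):
    return (img_h // patch_h) * (img_w // patch_w)

print(num_patches(100, 300, 100, 10))  # rectangle patches -> 30
print(num_patches(128, 128, 32, 32))   # square patches    -> 16
```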
Table 8. Performance comparison of the ViT, 1D CNN-LSTM, and DNN models.

| Model | Emotion | Binary Accuracy | Avg. Binary Accuracy | Recall | Precision | F1-Score | Avg. F1-Score |
|---|---|---|---|---|---|---|---|
| Vision-Transformer | Happiness | 0.762 | 0.678 | 0.423 | 0.582 | 0.490 | 0.512 |
| Vision-Transformer | Sadness | 0.617 | | 0.657 | 0.512 | 0.575 | |
| Vision-Transformer | Anger | 0.663 | | 0.448 | 0.454 | 0.451 | |
| Vision-Transformer | Neutral | 0.658 | | 0.611 | 0.532 | 0.569 | |
| Vision-Transformer | Disgust | 0.693 | | 0.518 | 0.437 | 0.474 | |
| 1D CNN-LSTM | Happiness | 0.773 | 0.707 | 0.476 | 0.600 | 0.531 | 0.499 |
| 1D CNN-LSTM | Sadness | 0.692 | | 0.387 | 0.701 | 0.499 | |
| 1D CNN-LSTM | Anger | 0.693 | | 0.304 | 0.507 | 0.380 | |
| 1D CNN-LSTM | Neutral | 0.647 | | 0.672 | 0.517 | 0.584 | |
| 1D CNN-LSTM | Disgust | 0.728 | | 0.506 | 0.493 | 0.499 | |
| DNN | Happiness | 0.542 | 0.572 | 0.586 | 0.311 | 0.406 | 0.389 |
| DNN | Sadness | 0.521 | | 0.827 | 0.447 | 0.580 | |
| DNN | Anger | 0.671 | | 0.093 | 0.368 | 0.149 | |
| DNN | Neutral | 0.606 | | 0.413 | 0.453 | 0.432 | |
| DNN | Disgust | 0.518 | | 0.542 | 0.290 | 0.378 | |
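The per-emotion binary accuracies and averaged scores in Tables 8 and 9 can be computed from thresholded sigmoid outputs. The sketch below, assuming a 0.5 decision threshold and toy data, shows one way to do this with scikit-learn.

```python
# Minimal sketch of the multi-label evaluation style used in Tables 8 and 9.
# A 0.5 decision threshold on sigmoid outputs and toy labels are assumed for illustration.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

EMOTIONS = ["Happiness", "Sadness", "Anger", "Neutral", "Disgust"]

y_true = np.array([[1, 0, 0, 1, 0],                 # toy ground-truth multi-hot labels
                   [0, 1, 0, 0, 1],
                   [0, 1, 1, 0, 0]])
y_prob = np.array([[0.8, 0.2, 0.1, 0.6, 0.3],       # toy sigmoid outputs
                   [0.4, 0.7, 0.2, 0.1, 0.9],
                   [0.3, 0.6, 0.4, 0.2, 0.1]])
y_pred = (y_prob >= 0.5).astype(int)

for i, emotion in enumerate(EMOTIONS):
    acc = accuracy_score(y_true[:, i], y_pred[:, i])            # per-label binary accuracy
    f1 = f1_score(y_true[:, i], y_pred[:, i], zero_division=0)  # per-label F1-score
    print(f"{emotion}: binary accuracy={acc:.3f}, F1={f1:.3f}")

# Average over the five emotions, as reported in the "Avg." columns
avg_acc = np.mean([accuracy_score(y_true[:, i], y_pred[:, i]) for i in range(len(EMOTIONS))])
print("Avg. binary accuracy:", round(avg_acc, 3))
```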
Table 9. Performance comparison of fusion models.

| Model | Emotion | Binary Accuracy | Avg. Binary Accuracy | Recall | Precision | F1-Score | Avg. F1-Score |
|---|---|---|---|---|---|---|---|
| Vision-Transformer and 1D CNN-LSTM | Happiness | 0.776 | 0.712 | 0.471 | 0.610 | 0.531 | 0.522 |
| Vision-Transformer and 1D CNN-LSTM | Sadness | 0.684 | | 0.481 | 0.631 | 0.546 | |
| Vision-Transformer and 1D CNN-LSTM | Anger | 0.654 | | 0.621 | 0.456 | 0.526 | |
| Vision-Transformer and 1D CNN-LSTM | Neutral | 0.686 | | 0.528 | 0.582 | 0.554 | |
| Vision-Transformer and 1D CNN-LSTM | Disgust | 0.761 | | 0.368 | 0.584 | 0.451 | |
| Vision-Transformer and DNN | Happiness | 0.774 | 0.690 | 0.297 | 0.692 | 0.415 | 0.465 |
| Vision-Transformer and DNN | Sadness | 0.612 | | 0.633 | 0.508 | 0.563 | |
| Vision-Transformer and DNN | Anger | 0.657 | | 0.323 | 0.428 | 0.368 | |
| Vision-Transformer and DNN | Neutral | 0.674 | | 0.471 | 0.571 | 0.516 | |
| Vision-Transformer and DNN | Disgust | 0.732 | | 0.433 | 0.499 | 0.464 | |
| 1D CNN-LSTM and DNN | Happiness | 0.779 | 0.701 | 0.378 | 0.657 | 0.480 | 0.478 |
| 1D CNN-LSTM and DNN | Sadness | 0.663 | | 0.616 | 0.568 | 0.591 | |
| 1D CNN-LSTM and DNN | Anger | 0.669 | | 0.287 | 0.444 | 0.349 | |
| 1D CNN-LSTM and DNN | Neutral | 0.658 | | 0.561 | 0.535 | 0.547 | |
| 1D CNN-LSTM and DNN | Disgust | 0.739 | | 0.356 | 0.518 | 0.422 | |
| Vision-Transformer, 1D CNN-LSTM, and DNN | Happiness | 0.774 | 0.706 | 0.514 | 0.593 | 0.551 | 0.509 |
| Vision-Transformer, 1D CNN-LSTM, and DNN | Sadness | 0.674 | | 0.617 | 0.583 | 0.600 | |
| Vision-Transformer, 1D CNN-LSTM, and DNN | Anger | 0.665 | | 0.466 | 0.459 | 0.463 | |
| Vision-Transformer, 1D CNN-LSTM, and DNN | Neutral | 0.657 | | 0.693 | 0.527 | 0.599 | |
| Vision-Transformer, 1D CNN-LSTM, and DNN | Disgust | 0.762 | | 0.221 | 0.664 | 0.332 | |
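Table 9 compares late-fusion combinations of the three branches. The sketch below, written with the Keras functional API, outlines how the best-performing pairing (a ViT-style branch on the Log-mel spectrogram fused with a 1D CNN-LSTM branch on MFCCs) could feed a five-unit sigmoid head for multi-label prediction. The layer sizes, the simplified single attention block, the MFCC input shape, and the use of binary cross-entropy in place of the focal loss cited in [46,47] are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of a late-fusion multi-label SER model in the spirit of Figure 11.
# Branch widths, the simplified Transformer block, and the input shapes are
# illustrative assumptions, not the authors' exact architecture.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 5  # Sadness, Happiness, Neutral, Anger, Disgust

# --- Branch 1: ViT-style encoder on the Log-mel spectrogram (100 mel bands x 300 frames) ---
spec_in = layers.Input(shape=(100, 300, 1), name="log_mel")
patches = layers.Conv2D(64, kernel_size=(100, 10), strides=(100, 10))(spec_in)  # 30 rectangular patches
patches = layers.Reshape((-1, 64))(patches)
attn = layers.MultiHeadAttention(num_heads=4, key_dim=64)(patches, patches)
x1 = layers.LayerNormalization()(patches + attn)
x1 = layers.GlobalAveragePooling1D()(x1)

# --- Branch 2: 1D CNN-LSTM on MFCC frames (300 frames x 40 coefficients, assumed) ---
mfcc_in = layers.Input(shape=(300, 40), name="mfcc")
x2 = layers.Conv1D(64, kernel_size=5, activation="relu", padding="same")(mfcc_in)
x2 = layers.MaxPooling1D(2)(x2)
x2 = layers.LSTM(64)(x2)

# --- Late fusion and multi-label head ---
fused = layers.Concatenate()([x1, x2])
fused = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(NUM_EMOTIONS, activation="sigmoid")(fused)  # one sigmoid per emotion

model = models.Model(inputs=[spec_in, mfcc_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
model.summary()
```

In a fuller implementation, the single attention block would typically be replaced by a stacked Transformer encoder, and the loss could be swapped for the focal loss referenced in [46,47] to handle label imbalance.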
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
