Article

Integrating Large Language Models (LLMs) and Deep Representations of Emotional Features for the Recognition and Evaluation of Emotions in Spoken English

1 School of Culture and Education, Shaanxi University of Science and Technology, Xi’an 710021, China
2 School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
3 School of History and Civilization, Shaanxi Normal University, Xi’an 710062, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(9), 3543; https://doi.org/10.3390/app14093543
Submission received: 9 February 2024 / Revised: 5 March 2024 / Accepted: 5 March 2024 / Published: 23 April 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:
This study develops an innovative method for evaluating spoken English by integrating large language models (LLMs) with deep learning of emotional feature representations, focusing on the analysis and evaluation of emotional features in spoken language. Addressing the limitation of current spoken English evaluation software, which focuses primarily on acoustic features of speech (such as pronunciation, frequency, and prosody) while neglecting emotional expression, this paper proposes a method capable of deeply recognizing and evaluating emotional features in speech. The core of the method comprises three main parts: (1) the creation of a comprehensive spoken English emotion evaluation dataset that combines emotionally rich speech data synthesized using LLMs with the IEMOCAP dataset and student spoken audio; (2) an emotion feature encoding network based on the transformer architecture, dedicated to extracting emotional space features from audio; (3) an emotion evaluation network for spoken English that accurately identifies the emotions expressed by Chinese students by analyzing different audio characteristics. By decoupling emotional features from other sound characteristics in spoken English, this study achieves automated emotional evaluation. This method not only provides Chinese students with the opportunity to improve their ability to express emotions in spoken English but also opens new research directions in the fields of spoken English teaching and emotional expression evaluation.

1. Introduction

With the continuous advancement of artificial intelligence technology, especially breakthrough developments in speech evaluation technology [1,2], large-scale scoring of spoken English exams has become a reality. This transformative technology not only improves the timeliness of evaluations but also enhances their objectivity, significantly increasing the reference value of evaluation results. However, existing spoken English evaluation software and platforms primarily focus on speech-based scoring, considering acoustic features such as pronunciation, frequency, and prosody while neglecting the evaluation of spoken emotions. In fact, emotions play an indispensable role in spoken expression, especially in actual communication, where the emotional expression of language is crucial. Chinese students show a general deficiency in emotional expression, which is particularly evident in the process of learning English. Zhang [3] observes that Chinese people often reflect on themselves when making mistakes; they tend to be low-key and rarely mention their achievements, and when they succeed, they often attribute it not to their own efforts but to the help of their teachers or parents. In contrast, people in Western countries advocate for individual effort, especially in countries with a history of brilliant military achievements, and they often openly express their confidence and sense of honor. Westerners hold independent views of themselves and value personal independence. Taking conversations between Asians and Westerners as an example, cultural differences may lead to significant variations in how the same emotions are expressed and interpreted: in Western culture, individuals tend to express their emotions and feelings directly, whereas in Asian culture, people may be more reserved and indirect. Such differences can lead to misunderstandings and communication barriers. Therefore, incorporating emotional evaluation into the evaluation system is of great significance: it helps Chinese students improve their ability to express emotions in English and strengthens the evaluation system itself.
Currently, the discrete emotional model is primarily adopted to describe vocal emotions as independent labels without interrelation, among which the American psychologist Paul Ekman’s classification of six basic emotions—anger, disgust, fear, happiness, sadness, and surprise—is widely used in the current field of emotion-related research [4]. Speech emotion recognition technology is the process of extracting feature parameters related to emotional representation from speech signals and establishing a mapping model between these feature parameters and emotional categories. Its core goal is to achieve the emotional classification of different speech samples. Despite the significant progress achieved in the field of emotion classification [5,6,7,8], existing research primarily focuses on classification with limited attention to the quantitative evaluation of emotions, especially in the application of spoken English learning assessments.
To objectively and automatically assess the emotional expression in students’ spoken English, this study meticulously constructs a specialized dataset for the evaluation of emotions in spoken English. This is achieved by leveraging existing public emotion datasets (such as IEMOCAP [9]), emotional speech materials generated using a large language model (LLM) strategy, and students’ emotional speech audio recorded by the authors. Based on an attention-based transformer model, emotional features are decoupled from other speech characteristics, such as speech rate and intonation, in English spoken audio. Deep representations related to emotional features are thus extracted from speech, and a supervised learning method is employed to achieve the automated evaluation (scoring) of emotional expression in spoken English. To the best of our knowledge, this is the first work to realize the automated evaluation of emotional expression in spoken English in voice communication scenarios, which makes the study directly applicable to Chinese students’ English oral practice and helps them enhance their ability to express emotions in spoken English. The main contributions of this paper are as follows:
  • Combining an LLM with the deep representation of features in the emotional space, this study successfully builds a deep learning framework for the evaluation of emotions in spoken English based on the transformer attention mechanism. This is a first in the academic community, achieving the automated evaluation of emotions in spoken English.
  • This study integrates emotional speech data generated by an LLM with existing public emotion datasets. Using frequency domain transformations and methods based on the transformer, combined with the membership function method in fuzzy mathematics, it effectively achieves the decoupling of emotional features from other sound features in spoken English and establishes the corresponding deep learning representation of emotional features.
  • Compared to current multimodal emotion recognition and evaluation methods, this research focuses on training and evaluating vocal emotional expression in spoken English. This not only provides rich emotional expression practice materials for learning spoken English but also automatically provides learners with feedback and evaluation on emotional expression during the learning process, opening a new path for Chinese students to enhance their spoken English abilities.
Section 2 reviews work related to this paper; Section 3 introduces the dataset, models, and methods; Section 4 describes the experimental setup and results; and Section 5 concludes with a summary and discussion.

2. Related Work

2.1. Emotion Analysis Research on Spoken Language

Current research primarily focuses on building and evaluating datasets with emotional labels for spoken English emotion analysis models, and significant achievements have been made in emotion classification. In this context, scholars continue to explore more innovative and effective methods. Representative research cases include the following: Paranjape et al. used Longformer, BERT, and BigBird models for emotion classification and combined these with a threshold voting mechanism to derive the final results [10]; Chiorrini et al. utilized a BERT-based emotion classifier network for the sentiment classification of tweet data [11]; additionally, Chaudhari et al. employed a vision transformer for facial emotion recognition [12]. However, these works only use a single data source for emotion recognition and lack multimodal data sources. This leads to less accurate recognition results.
It is noteworthy that, recently, an increasing number of scholars have started to use multimodal information as model input to enhance the accuracy of emotion recognition: for instance, Pan et al. conducted a review of multimodal emotion recognition methods [13]; Zaidi et al. used a multimodal dual attention transformer for speech emotion recognition [14]; Luna-Jiménez et al. proposed an automatic emotion recognizer composed of speech and facial emotion recognizers [15], demonstrating the potential of multimodal approaches in emotion recognition and providing new perspectives and directions for future research; Siriwardhana et al. [5] and Tripathi et al. [6] employed text, audio, and visual inputs combined with self-supervised learning for emotion recognition; Wang et al. [7] and Voloshina et al. [8] each proposed transformer-based multimodal fusion methods to improve the accuracy of emotion recognition; and Vu et al. used a multi-scale transformer-based network to recognize emotions from multiple physiological signals [16].
As stated above, existing research focuses on classification and pays limited attention to the quantitative evaluation of emotions. Therefore, this paper aims to construct an English spoken language emotion evaluation framework based on the transformer attention mechanism by integrating an LLM with deep representations of emotional space features. This framework is not only capable of classifying emotions but also possesses a quantitative evaluation function, making it particularly suitable for assessing students’ emotional expression abilities in spoken English.

2.2. Research on Decoupling of Vocal Emotional Features

In previous studies in this field, scholars typically used acoustic features and signal processing techniques to construct the emotional features of spoken pronunciation: for instance, Patel et al. [17] explained the acoustic changes caused by emotions based on tension, perturbation, and occurrence frequency; and Kanluan et al. [18] used prosody and spectral features to represent the audio characteristics of emotional speech and conducted a regression analysis with the help of support vector machines. Recently, with the advancement of deep learning technology, some studies have focused on using deep neural networks to more accurately capture and express emotional information in spoken pronunciation. For example, Islam et al. [19] used three different transformation features, integrated them in a 3D form for input, and utilized deep learning models to recognize emotions in speech. Research in this area is valuable for enhancing the emotional perception capabilities of automatic speech recognition systems and the development of human-computer interaction fields. However, the studies described above limited their work to the use of datasets containing recordings of actual people (instead of combining these with materials created by an LLM). This led to a smaller volume of data and weaker models.
This study takes a step further from this basis by combining artificial intelligence generated content (AIGC) generative deep learning techniques, particularly the transformer model and the membership function method of fuzzy mathematics, to effectively decouple emotional features from other sound features in spoken English. The innovation of this method lies in its ability to more precisely capture and express emotional information in spoken pronunciation, thereby providing a more accurate and robust data foundation for speech emotion recognition.

2.3. Research on the Application of AI in Educational Evaluation

In past studies, scholars have consistently emphasized the importance of emotions in the educational process, arguing that paying attention to students’ emotional states is crucial for effective learning, and some progress has been made [20]. Orji et al. [21] and Melweth et al. [22] have shown, from different perspectives, a moderate positive correlation between the frequency of artificial intelligence usage and teaching capabilities. Despite certain advancements in modern educational technology in integrating AI, most applications are still focused on the transmission of knowledge and the assessment of English pronunciation accuracy, with less attention being paid to students’ emotional expression in spoken English. This has led to a widespread view that AI in student education needs to focus more on emotional aspects, offering a warmer and more humanized educational concept: Liefooghe et al. [23] believe that models in artificial intelligence should be more humanized in their optimization, promoting the development of personalized adaptation in artificial intelligence; Shao et al. [24] proposed that the application of AI technology in education should adhere to a people-oriented concept, further promoting the integration of personalization into education; Martínez-Miranda et al. [25] elucidated the importance of emotions in human intelligence and that computers aimed at mimicking human behavior should not only think and reason but also be able to exhibit human emotions. These works have a profound impact on improving the user experience of educational systems, enhancing learning effects, and promoting personalized teaching. However, many existing AI applications or systems in education have not yet fully integrated emotional intelligence. For example, Liu [26] preliminarily applied emotion recognition technology in psychological education, which was still insufficient for the comprehensive perception and evaluation of students’ emotional expression states.
This study, by incorporating emotion recognition technology into spoken English learning, not only provides learners with a wealth of materials for practicing emotional expression but also automatically offers timely feedback and evaluation on learners’ emotional expressions. The innovation of this method lies in its focus not only on knowledge transfer but also on the emotional experience of students, promoting a personalized and humanized educational experience, which is conducive to enhancing students’ learning outcomes and the overall educational experience.

3. Method and Model

This study develops a method that integrates an LLM with emotional representation for analyzing and scoring the emotional content in students’ spoken English audio samples. Firstly, using the text and emotional labels from the IEMOCAP dataset, emotionally rich corresponding speech data is synthesized through the LLM. Subsequently, the synthesized speech data is combined with the audio from the IEMOCAP dataset and student spoken language samples to form a comprehensive dataset. Next, an emotion feature encoding network based on the transformer architecture is introduced to deeply extract emotional features from the audio. Finally, an English spoken language emotion evaluation network is designed, which enables the regression analysis and precise scoring of students’ emotional expression in spoken English. The overall framework is illustrated in Figure 1 and specifically includes the following three components:
Step 1. LLM for emotion voice synthesis: the LLM, Typecast [27], is utilized in this paper to synthesize emotional voice by integrating textual content with emotional labels. This synthesized audio, combined with the IEMOCAP dataset audio and collected students’ spoken language samples, is used to construct a comprehensive dataset for evaluating emotions in students’ spoken language.
Step 2. Emotion feature encoding network: The paper introduces an emotion feature encoding network based on the transformer model, aimed at deeply extracting the emotional spatial features from the audio. This network is capable of eliminating the influence of factors such as timbre, pitch, and content in the audio, ensuring that the emotional features are independent of other audio information. This independence enhances the accuracy of the evaluation.
Step 3. Emotion evaluation network: The paper designs an English spoken language emotion evaluation network, aimed at achieving the accurate scoring of emotional expression in students’ spoken English. For any input of student spoken audio (accompanied by emotion labels), the process begins by utilizing the corresponding text and emotion labels to generate TTS synthesized speech through the LLM. Subsequently, the actual student spoken audio and the TTS synthesized speech are separately input into the emotion feature encoding network, resulting in the corresponding emotion feature encodings. These two types of encodings, along with the emotion mean feature vectors of the IEMOCAP audio dataset corresponding to the emotion labels, are collectively fed into the emotion evaluation network. This process can be treated as a regression problem, where the emotion evaluation network is coarse-tuned through similarity calculations of emotional features extracted from the three types of audios and fine-tuned based on teachers’ scoring. This method allows for the precise scoring of emotional expression in students’ spoken language audio. For a detailed description of the training process, please refer to Section 3.3.

3.1. Dataset

In this study, audio data from the IEMOCAP dataset, synthetic audio, and students’ spoken English audio are used as the primary data sources. These audio recordings undergo noise addition, speed variation, pitch alteration, and other preprocessing steps to create a comprehensive dataset consisting of 54,464 audio samples. Specifically, this dataset comprises three main components.

3.1.1. IEMOCAP Dataset Audio

IEMOCAP (interactive emotional dyadic motion capture) is a specialized multimodal emotion dataset designed to provide rich resources for research on human emotion expression and recognition. This study focuses on the dialogue portion of the dataset, which includes natural conversations recorded with ten professional actors. These dialogues take place in various contexts and aim to simulate emotional communication in real-life situations. The data includes scripted dialogues as well as spontaneous, emotionally rich interactions, covering a wide range of emotions such as anger, sadness, excitement, surprise, and more. The content of these dialogues holds significant research value and is essential for gaining a deeper understanding of human emotion expression and recognition.
In the data preprocessing phase of this research, 3106 speech samples were selected from the dataset. These samples not only include the speech content but also record the duration of the speech and emotional evaluations. The emotional evaluation part was completed by 3–4 expert evaluators who categorized the emotion of each speech sample into six types: frustration, sadness, anger, neutral, happiness, and excited.

3.1.2. TTS Speech Synthesized Audio

Based on the 3106 dialogues from the IEMOCAP dataset along with their emotional annotations, 2892 speech samples were synthesized using the TTS model, Typecast, ensuring that each synthesized speech corresponds to the highest-scoring emotional category from the original speech’s emotional annotations during the synthesis process.

3.1.3. Student-Recorded Spoken Audio and Teacher Emotional Evaluation Scores

To implement the proposed oral speech emotion evaluation method in actual teaching scenarios, a total of 45 students recorded 810 oral speech audio samples based on 18 representative dialogues from the IEMOCAP dataset. These recordings cover the six emotions contained in the 18 selected dialogues. During the recording process, students referred to the emotional annotations in the IEMOCAP dataset to express the corresponding emotions as accurately as they could. Additionally, we formed a team of English teachers who were responsible for assessing the emotional richness of the recordings. These teacher ratings served as crucial reference standards for training, evaluating, and optimizing the emotional evaluation model. Figure 2 displays the distribution of scores, including the average spoken score for each emotion category for each student. In the heat map, redder colors correspond to higher oral emotional expression scores and more purplish colors to lower scores, which makes the distribution of students’ scores across emotions easy to read.
From Figure 2, it can be observed that for the majority of students, their ability to express various emotions is consistent. For the same students, their evaluation scores for six types of emotions are distributed around the same value. Only a few students show excellent performance in expressing a specific emotion, as evidenced by significantly higher evaluation scores for one particular type of emotion compared to others.

3.2. Emotional Voice Synthesis Using LLM

3.2.1. English Spoken Language Emotion Synthesis Method and Fine-Tuning of LLM

To ensure sufficient augmentation of the dataset and to provide an accurate reference standard for the subsequent student emotion evaluation network, this study employed an LLM based on the transformer architecture with a masked self-attention mechanism. This model successfully generated 2892 emotionally rich speech data entries, further enhancing the diversity of the dataset. The self-attention mechanism played a central role in this process, enabling the model to more accurately capture and process emotional information.
The overall structure, as shown in Figure 3, illustrates that the model consists of a series of encoders and decoders. The encoders are responsible for transforming the input text into feature representations, while the decoders utilize these feature representations to generate the corresponding speech output. This design allows the model to flexibly adjust its focus on different parts of the text, thereby better capturing and expressing emotional information.
The LLM based on the Transformer architecture not only demonstrated its capability in handling emotional text but also highlighted its crucial role in emotional voice synthesis technology. The audio generated is not only suitable for assessing students’ spoken language skills but can also serve as standardized audio materials, providing learning guidance for students.

3.2.2. Emotion Label Determination Based on Membership Functions

To enhance the capability of the feature encoding network in extracting emotional features from audio, 2892 text samples were carefully selected with emotional labels based on the IEMOCAP dataset. The LLM, Typecast, was used to generate emotionally expressive spoken audio for these texts. In the IEMOCAP dataset, each speech sample has 3–4 different emotional labels annotated by experts. To determine the primary emotional label for the selected texts, we employed a membership function calculation method based on the normal distribution. This function accurately assesses the membership of each sentence in the IEMOCAP dataset based on six different emotions. Specifically, we utilize the following membership function formula:
$$ S_{emotion} = \int_{-\infty}^{x} e^{-\frac{1}{2} \left( \frac{t - \mu_{emotion}}{\sigma_{emotion}} \right)^{2}} \, dt , \quad (1) $$
where $\mu_{emotion}$ and $\sigma_{emotion}$ represent the average occurrence count and the standard deviation of the corresponding emotional evaluation across samples in the IEMOCAP dataset, respectively, and $x$ denotes the occurrence count of the corresponding emotional evaluation in each sample. Subsequently, by comparing the membership values of the six emotions, the primary emotional label for each sentence was determined. This way of calculating the primary emotional label captures the information entropy of the different emotions within the samples: information entropy represents the uncertainty of emotional labels in the dataset, and information gain measures the reduction in entropy before and after splitting the dataset using emotional features. The parameters $\mu_{emotion}$ and $\sigma_{emotion}$ of the membership function are chosen based on the feature with the highest information gain.
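The label-selection step can be illustrated with a short sketch. This is a minimal, hypothetical implementation assuming per-sample annotator counts for the six IEMOCAP emotions are available; it uses the normalized Gaussian CDF, which ranks emotions identically to Equation (1), and all names (EMOTIONS, annotation_counts, the example statistics) are illustrative rather than taken from the authors' code.

```python
from math import erf, sqrt

# Sketch of membership-based primary-label selection (Equation (1)),
# assuming per-sample annotator counts for the six IEMOCAP emotions.
EMOTIONS = ["frustration", "sadness", "anger", "neutral", "happiness", "excited"]

def gaussian_membership(x, mu, sigma):
    """Membership of count x under a normal model N(mu, sigma^2): the CDF up to x."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def primary_label(annotation_counts, mu, sigma):
    """annotation_counts, mu, sigma: dicts keyed by emotion name."""
    scores = {e: gaussian_membership(annotation_counts[e], mu[e], sigma[e]) for e in EMOTIONS}
    return max(scores, key=scores.get)

# Example with made-up statistics: 3 of 4 annotators voted "anger" for this sample.
mu = {e: 1.0 for e in EMOTIONS}
sigma = {e: 0.8 for e in EMOTIONS}
counts = {e: 0 for e in EMOTIONS}
counts["anger"], counts["frustration"] = 3, 1
print(primary_label(counts, mu, sigma))  # -> "anger"
```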

3.3. Deep Representation of Emotional Features in Spoken English

To unify the feature representation of different audios in the emotional space, an emotion feature encoding network was introduced here based on the transformer architecture. Firstly, a short-time Fourier transform (STFT) was applied to both the audio from the IEMOCAP dataset and the synthesized audio to convert the time-domain representation into a frequency-domain representation. Specifically, the STFT transformation formula is as follows:
$$ STFT(t, f) = \sum_{n=0}^{N-1} x(n) \, w(n - t) \, e^{-j 2 \pi f n} , \quad (2) $$
where $STFT(t, f)$ represents the short-time Fourier transform at time $t$ and frequency $f$, $x(n)$ denotes the input discrete signal, and $w(n - t)$ denotes the discrete window function of length $N$. Applying the short-time Fourier transform in the emotion encoding network converts the time-domain representation into the frequency domain, providing a balance between time and frequency resolution. This significantly mitigates the influence of factors such as emphasis and accent, enabling the network to extract emotion-related characteristics of the audio accurately and effectively.
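As an illustration of this time-to-frequency conversion, the following sketch computes a magnitude spectrogram with librosa. The sample rate, window size, hop length, and the use of a log-magnitude scale are assumptions, not parameters reported in the paper.

```python
import numpy as np
import librosa

# A minimal sketch of the time-to-frequency conversion described above.
def audio_to_spectrogram(path, sr=16000, n_fft=512, hop_length=160):
    y, _ = librosa.load(path, sr=sr)                 # mono waveform at a fixed sample rate
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hann")
    mag = np.abs(stft)                               # |STFT(t, f)| magnitude spectrogram
    return librosa.amplitude_to_db(mag, ref=np.max)  # log scale for a more stable input range

# spec = audio_to_spectrogram("student_utterance.wav")  # shape: (1 + n_fft//2, frames)
```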
The training process of the emotion feature encoding network is shown as Step 2 in Figure 1. Firstly, the input spoken audio data (comprising mainly synthesized audio and audio from the IEMOCAP dataset) underwent an STFT to obtain the corresponding frequency-domain features. Subsequently, these features were passed through the emotion feature encoding network, which consists of a multilayer perceptron (MLP) and a transformer, to extract and encode the emotional features of the audio. Next, the features were further processed by an emotion recognition network composed of an MLP for emotion classification. Using the emotional labels, the parameters of the emotion feature encoding network and the emotion recognition network were optimized with a cross-entropy loss. Because the network can identify the corresponding emotional classifications from the intermediate features that pass through the encoding network into the emotion recognition network, these intermediate features are regarded as the output of the emotion feature encoding network; they constitute a more abstract and meaningful representation of the emotional information in the audio.
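A minimal PyTorch sketch of this encoder-plus-classifier setup is given below. The layer sizes, number of transformer layers, and mean-pooling over time are assumptions; only the overall structure (MLP front-end, transformer encoder, MLP classifier trained with cross-entropy, intermediate projection used as the emotional feature) follows the description above.

```python
import torch
import torch.nn as nn

class EmotionFeatureEncoder(nn.Module):
    """MLP front-end + transformer encoder; the projected pooled vector is the emotional feature."""
    def __init__(self, n_freq_bins=257, d_model=256, n_heads=4, n_layers=4, feat_dim=128):
        super().__init__()
        self.input_mlp = nn.Sequential(nn.Linear(n_freq_bins, d_model), nn.ReLU(),
                                       nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, feat_dim)

    def forward(self, spec):                          # spec: (batch, frames, n_freq_bins)
        h = self.encoder(self.input_mlp(spec))
        return self.proj(h.mean(dim=1))               # mean-pool over time -> (batch, feat_dim)

class EmotionClassifier(nn.Module):
    """Small MLP head for the 4 merged emotion classes."""
    def __init__(self, feat_dim=128, n_classes=4):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, feat):
        return self.head(feat)

encoder, classifier = EmotionFeatureEncoder(), EmotionClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)

spec = torch.randn(8, 200, 257)                       # dummy batch: 8 spectrograms, 200 frames
labels = torch.randint(0, 4, (8,))
optimizer.zero_grad()
loss = criterion(classifier(encoder(spec)), labels)   # optimize both networks jointly
loss.backward(); optimizer.step()
```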
To enhance the robustness of the emotion feature encoding network, this study conducted rigorous adjustments to the classifications within the dataset [28]. The original six emotional labels, including frustration, sadness, anger, neutral, happiness, and excited, were streamlined to four categories, namely sadness, anger, neutral, and happiness. The frustration category was discarded due to its relatively low count, and the happiness and excitement categories were merged into a single happiness category, as suggested by Sahu [29].
During the training process, to further enhance the model’s robustness, the research team applied various forms of data augmentation to the audio data. This included operations such as pitch shifting, accelerated audio playback, and the addition of Gaussian noise. The design and adjustments in these training processes aimed to make the emotion feature extraction network more robust, providing a more reliable foundation for the deep extraction of emotional information from audios.
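The augmentations mentioned above could be implemented roughly as follows; the parameter ranges (about two semitones of pitch shift, 0.9–1.1× speed, noise standard deviation 0.005) are assumptions for illustration, not values reported in the paper.

```python
import numpy as np
import librosa

# Illustrative versions of the augmentations: pitch shift, speed change, Gaussian noise.
def augment(y, sr=16000, rng=np.random.default_rng()):
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=float(rng.uniform(-2, 2)))  # pitch shift
    y = librosa.effects.time_stretch(y, rate=float(rng.uniform(0.9, 1.1)))        # speed change
    y = y + rng.normal(0.0, 0.005, size=y.shape)                                  # additive Gaussian noise
    return y.astype(np.float32)
```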
Through the aforementioned methods, the effective extraction of emotional features was achieved in English spoken language audio, leading to the establishment of corresponding deep learning representations of emotional features.

3.4. Assessment of Emotional Expression in English Spoken Language

To assess the emotional expression in students’ spoken language audio, an emotion evaluation network was designed in this paper to analyze and evaluate the emotional features of the three types of audio in the previously constructed dataset, ultimately yielding emotional scores for students’ spoken language audio. The network structure is shown as Step 3 in Figure 1. Firstly, the audio from the IEMOCAP dataset, the students’ spoken language audio, and the synthesized audio were individually subjected to short-time Fourier transforms (STFT) to convert them from the time domain to the frequency domain. Subsequently, these frequency-domain representations were fed into the emotion feature encoding network to extract emotional features from the three types of audio. These emotional features were then processed through an MLP to obtain the final emotion evaluation scores, and the network was trained using mean squared error (MSE) loss.
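A possible sketch of this scoring head, assuming the three emotion feature vectors are simply concatenated before the MLP (the fusion strategy and layer sizes are not specified in the paper), is shown below.

```python
import torch
import torch.nn as nn

class EmotionEvaluationNetwork(nn.Module):
    """Regress a single score from the three emotion feature vectors (reference, TTS, student)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, f_ref, f_tts, f_student):
        return self.mlp(torch.cat([f_ref, f_tts, f_student], dim=-1)).squeeze(-1)

evaluator = EmotionEvaluationNetwork()
mse = nn.MSELoss()
f_ref, f_tts, f_stu = (torch.randn(8, 128) for _ in range(3))   # features from the encoder
target_scores = torch.rand(8) * 10                               # coarse scores or teacher ratings
loss = mse(evaluator(f_ref, f_tts, f_stu), target_scores)
```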
During the training process, a two-stage approach consisting of coarse-tuning and fine-tuning was employed. Firstly, the coarse-tuning stage was conducted, followed by the fine-tuning stage. The specific procedures for each stage are outlined below.
  • Coarse-tuning Stage: Firstly, we calculated the cosine similarity between the emotional features of the audio from the IEMOCAP dataset and those of the other two types of audio. Then, using the cosine similarity between the IEMOCAP audio and the synthesized audio as a reference, this reference was compared with the cosine similarity between the IEMOCAP audio and the students’ spoken language audio. By computing the ratio of the difference between the two to the reference similarity and applying Equation (3), a preliminary emotional evaluation score for students’ spoken language was obtained (a code sketch of this rule follows this list). The specific formula for the evaluation score is as follows:
    $$ s_i = \begin{cases} 10, & \frac{1000 \, (d_2 - d_1)}{d_1} \ge 10 - C \\ C \times \left( 1 + \frac{1000 \, (d_2 - d_1)}{d_1} \right), & -C < \frac{1000 \, (d_2 - d_1)}{d_1} < 10 - C \\ 0, & \frac{1000 \, (d_2 - d_1)}{d_1} \le -C \end{cases} \quad (3) $$
    where $d_1$ is the cosine similarity of emotional features between the audio from the IEMOCAP dataset and the synthesized audio, $d_2$ is the cosine similarity of emotional features between the students’ spoken language audio and the audio from the IEMOCAP dataset, and $C$ is the median of the teachers’ ratings for students’ spoken language. During the coarse-tuning process, only the emotion evaluation network was trained, without altering the parameters of the emotion feature encoding network;
  • Fine-tuning Stage: In this stage, based on the teachers’ ratings for students’ emotional expression in spoken language, the emotion evaluation network was fine-tuned with the objective of minimizing the mean squared error between the predicted results and teachers’ ratings. During the fine-tuning process, the parameters of the entire network, including both the emotion feature encoding network and the emotion evaluation network, were trained comprehensively.
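The coarse-tuning score referenced in the first item can be sketched as follows. This follows the reconstruction of Equation (3) given above and should be read as an illustration rather than the authors' exact code; the variable names are illustrative.

```python
import numpy as np

# Hedged implementation of the coarse-tuning score (Equation (3)):
# d1, d2 are cosine similarities of emotion features, C is the median teacher rating.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def coarse_score(f_ref, f_tts, f_student, C):
    d1 = cosine_similarity(f_ref, f_tts)        # reference vs. synthesized audio
    d2 = cosine_similarity(f_student, f_ref)    # student vs. reference audio
    r = 1000.0 * (d2 - d1) / d1                 # scaled relative difference
    if r >= 10.0 - C:
        return 10.0
    if r <= -C:
        return 0.0
    return C * (1.0 + r)

# Example: student features almost as close to the reference as the TTS features.
rng = np.random.default_rng(0)
ref = rng.normal(size=128)
score = coarse_score(ref, ref + 0.05 * rng.normal(size=128),
                     ref + 0.06 * rng.normal(size=128), C=7.0)
```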
The loss function of the network is divided into a coarse-tuning loss, $L_{coarse}$, and a fine-tuning loss, $L_{fine}$, both of which use mean squared error (MSE). Specifically, the overall loss is
$$ L_{emotion} = \lambda_1 L_{coarse} + \lambda_2 L_{fine} , \quad (4) $$
where, during coarse-tuning, $\lambda_1 = 1$ and $\lambda_2 = 0$, and during fine-tuning, $\lambda_1 = 0.1$ and $\lambda_2 = 1$. The fine-tuning weights bias the model toward the teachers’ ratings while still taking into account the objective signal from the coarse-tuning stage. Through this design, combining the reference similarity with the cosine similarity of the students’ audio allows the network to learn and adjust effectively in both training stages, resulting in more accurate emotional evaluation results.
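A compact sketch of this two-stage objective is shown below; the variable names and the way the stage is selected are illustrative.

```python
import torch
import torch.nn as nn

# Two-stage objective from Equation (4): MSE against coarse scores (L_coarse)
# and against teacher ratings (L_fine), weighted by (lambda1, lambda2).
mse = nn.MSELoss()

def emotion_loss(pred, coarse_target, teacher_target, stage):
    lam1, lam2 = (1.0, 0.0) if stage == "coarse" else (0.1, 1.0)   # weights from the paper
    l_coarse = mse(pred, coarse_target)
    l_fine = mse(pred, teacher_target)
    return lam1 * l_coarse + lam2 * l_fine

pred = torch.rand(8) * 10
loss = emotion_loss(pred, coarse_target=torch.rand(8) * 10,
                    teacher_target=torch.rand(8) * 10, stage="fine")
```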

4. Experiments and Analysis

4.1. Evaluation Metrics

Two main metrics, F1 score and unweighted accuracy (UA), are used in this paper to assess the performance of the emotion feature encoding network. Below is an introduction to these two metrics.
  • F1 score: The F1 score is a measure that takes into account both precision and recall, providing a balanced assessment of the model’s performance. It is the harmonic mean of precision and recall, making it robust for datasets with imbalanced classes. Specifically, a higher F1 score indicates that the model maintains precision while achieving a higher recall, which means it handles the imbalance between positive and negative cases better. The equation for F1 is as follows.
    $$ F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (5) $$
  • Unweighted accuracy (UA): Unweighted accuracy refers to the average accuracy of the model across all categories, without considering the number of samples in each category. This allows unweighted accuracy to fairly assess datasets with imbalanced classes, unaffected by sample distribution. By using unweighted accuracy, we can gain a more comprehensive understanding of the overall performance of the model across different categories. Its equation is as follows.
    $$ \mathrm{Accuracy} = \frac{T}{T + F} , \quad (6) $$
    where $T$ represents the number of correctly predicted samples and $F$ represents the number of incorrectly predicted samples.
For the emotion evaluation network, mean squared error (MSE) is chosen as the primary evaluation metric. MSE measures the average of the squares of differences between the model’s predicted outputs and the teachers’ scores. In the context of emotion evaluation, MSE provides a measure of the model’s regression performance on emotion values.
The selection of these evaluation metrics aims to provide a comprehensive understanding of the model’s accuracy, recall, and adaptability to imbalanced data, offering a thorough assessment of model performance.
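For reference, minimal implementations of these three metrics might look as follows; the per-class averaging used for UA follows the textual description above, and the toy inputs are illustrative.

```python
import numpy as np

# Minimal metric implementations: F1 from precision/recall counts,
# unweighted accuracy (UA) as the mean of per-class accuracies, and MSE.
def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def unweighted_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

def mse(scores_pred, scores_teacher):
    diff = np.asarray(scores_pred) - np.asarray(scores_teacher)
    return float(np.mean(diff ** 2))

print(unweighted_accuracy([0, 0, 1, 2, 3], [0, 1, 1, 2, 3]))  # 0.875
```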

4.2. Performance Evaluation of Emotion Encoding Network

As shown in Step 2 in Figure 1, the transformer-based emotion encoding network is utilized for emotion recognition experiments on audio from the IEMOCAP dataset and synthesized audio. The experiments covered both raw audio and audio that underwent specific augmentation processing, aiming to comprehensively assess the model’s performance. Considering the specific application context of this research, which is students’ English spoken language emotion evaluation, only spoken audio and their corresponding text were used as input data, and no image information was introduced. Furthermore, following the established method mentioned in the previous section, the 6-class problem of the dataset was adjusted to a 4-class problem to simplify the classification task. The experimental dataset includes 35,248 speech recordings. Detailed experimental results are as shown in Figure 4.
In this research, the emotion recognition method proposed achieved an unweighted accuracy (UA) of 66.6%, surpassing the DialogueCRN algorithm for multimodal emotion recognition, which includes video, audio, and text, with an accuracy of 66.2% [30]. The F1 score also reached 64.1%, ensuring that the algorithm has robustness to accommodate various differences in student speech tones, with a focus on accuracy in spoken language emotion expression. The research method utilized only audio and text as inputs, relying solely on speech emotion features for feature extraction, achieving accuracy comparable to multimodal emotion recognition that includes video. This demonstrates the clear research significance of spoken language emotion recognition in the context of speech environments. The research objective in this paper was to extract emotion features for subsequent score calculation. The experimental results prove that the emotion encoding network excels in extracting audio emotion features and exhibits good robustness under data augmentation conditions.

4.3. Effectiveness of Emotion Feature Similarity Evaluation

The emotion feature encoding network constructed in this paper aimed to accurately extract and encode emotional features from speech. By comparing cosine similarities between emotion feature vectors, corresponding evaluation scores were derived, providing strong support for the quantitative analysis of emotional features. The data distribution is depicted in Figure 5, with the central horizontal line representing the average data distribution. Figure 5 illustrates that the distribution of emotion scores calculated using cosine similarity is relatively broad, while the distribution of teacher ratings is comparatively narrow. This further confirms that the former can provide a more objective assessment of students’ spoken emotional features, whereas the latter may be influenced by encouragement or other factors, reducing the occurrence of very low or perfect scores, which is more in line with practical application requirements. Therefore, during the training stages of the emotion evaluation network, a two-stage approach involving coarse-tuning and fine-tuning was employed, harnessing the advantages of both scores to compensate for their respective limitations.

4.4. Performance Evaluation of the Emotion Evaluation Network

As shown in Step 3 (Training the emotion evaluation network) in Figure 1, a quantitative analysis of emotions in student spoken audio was conducted, and corresponding scores were generated. This process is divided into two main stages: coarse-tuning and fine-tuning. In the coarse-tuning stage, the emotion feature encoding network played a crucial role in extracting emotional features from the input audio. By comparing the cosine similarities between the IEMOCAP dataset audio, the synthesized audio, and the student spoken audio, emotion evaluation scores for students’ spoken audio were generated, which were subsequently used to train the emotion evaluation model. In the fine-tuning stage, the model was further trained using teachers’ ratings. Mean squared error (MSE) was used as the primary evaluation metric for this part. Detailed experimental results are available in Figure 6.
In the coarse-tuning stage (represented by the blue dotted line), the model’s mean squared error (MSE) is 2.808. However, in the fine-tuning stage (represented by the orange dotted line), this value decreases to 0.847. As shown in the above graph, during the coarse-tuning stage, the model is in its early training phase, and the loss function decreases rapidly but lacks precision. In contrast, during the fine-tuning stage, the model starts from a lower initial point in terms of loss. Although there may be some fluctuations in loss at the beginning, it demonstrates a stable trend over time.
The experimental results clearly demonstrate that the emotion evaluation network exhibits accuracy in evaluating emotions in student spoken language. The higher mean squared error (MSE) during the coarse-tuning stage may reflect the model’s initial understanding of spoken language emotions during the early learning phase. However, with professional guidance from teachers, the MSE significantly decreases during the fine-tuning stage. This strongly indicates that after careful adjustments, the model’s evaluation results become increasingly consistent with the assessments made by professional teachers.
During the inference stage, the evaluation metrics obtained by the proposed method for the four emotion categories are presented in Figure 7. The mean squared error (MSE) for each emotion category is as follows: sadness, 0.837; anger, 1.004; neutral, 0.773; and happiness, 1.006. Through validation, it has been confirmed that the emotion evaluation network is effective and reliable in assessing emotions in student spoken language. As the model continues to learn and adapt, it gradually approaches the assessment level of professional teachers, providing a dependable auxiliary tool for spoken language instruction.

4.5. Experimental Conclusion

After comprehensive experimental validation, the emotion evaluation method proposed in this paper has demonstrated satisfactory results in evaluating the emotional performance in students’ spoken English, thanks to the good performance of the LLM in emotional speech generation, of the transformer-based emotion encoding network in emotion feature representation, and of the emotion evaluation network in emotion score regression.
This method is not only versatile but also practical, which provides advanced and effective means for English spoken language learning and emotion evaluation and is expected to provide strong tool support in the field of spoken language education, offering high practical value.

5. Conclusions and Prospects

In the context of artificial intelligence and speech assessment technology, this research has made pioneering strides by incorporating emotion evaluation into the English spoken language evaluation system. By combining an LLM and emotion feature learning, this study has successfully achieved in-depth representation and automatic evaluation of emotional features in English spoken language. This method not only fills the gap in the emotional dimension of existing spoken English language evaluation software but also addresses the specific shortcomings of Chinese students in emotional expression, providing an effective path for improvement.
With further advancements in artificial intelligence technology, we anticipate optimizing and expanding upon this research method. Future research will continue to refine and enhance the English spoken language emotion analysis model, increasing its capability to recognize more complex and nuanced emotional expressions.

Author Contributions

Conceptualization, L.W.; Methodology, Y.Q.; Software, J.Y.; Formal analysis, Y.Q.; Resources, L.W. and J.L.; Data curation, J.Y. and Y.W.; Writing—original draft, L.W., J.Y., Y.W. and Y.Q.; Writing—review & editing, S.W. and J.L.; Project administration, J.L.; Funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2018 National Social Science Fund of China’s Western Project, grant number 18XKG003; the Teaching Reform Research Project of Shaanxi University of Science and Technology, grant number 23Y081; and the Internationalization of Education and Teaching Reform Research Project of Shaanxi University of Science and Technology, grant number GJ22YB09. The APC was funded by grant 18XKG003.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of School of Culture and Education, Shaanxi University of Science and Technology (protocol code 20240301 and date of approval is 6 January 2024).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Herein, our research team would like to express sincere gratitude for the support and assistance provided by the following projects to this thesis. We are thankful for the 2018 National Social Science Fund of China’s Western Project (Project No. 18XKG003), whose funding has facilitated the smooth progress of our research. We are also grateful for the support from two teaching reform projects provided by Shaanxi University of Science and Technology: The Teaching Reform Research Project of Shaanxi University of Science and Technology (Project No. 23Y081) and the Internationalization of Education and Teaching Reform Research Project of Shaanxi University of Science and Technology (Project No. GJ22YB09). The financial and resource support from these projects provided invaluable conditions for the completion of this thesis.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wei, S.; Wu, K.; Zhu, B.; Wang, S. Speech Evaluation Technology for Teaching and Evaluating Spoken English. Artif. Intell. View 2019, 3, 72–79. [Google Scholar]
  2. Du, J.; Huo, Q. An improved VTS feature compensation using mixture models of distortion and IVN training for noisy speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1601–1611. [Google Scholar] [CrossRef]
  3. Zhang, F. Contrast between Chinese and Western cultural values and its effects on English learning in China. Trans/Form/Ação 2024, 47, e0240062. [Google Scholar] [CrossRef]
  4. Ekman, P. Basic emotions. In Handbook of Cognition and Emotion; John Wiley & Sons: Hoboken, NJ, USA, 1999; Volume 98, p. 16. [Google Scholar]
  5. Siriwardhana, S.; Kaluarachchi, T.; Billinghurst, M.; Nanayakkara, S. Multimodal emotion recognition with transformer-based self-supervised feature fusion. IEEE Access 2020, 8, 176274–176285. [Google Scholar] [CrossRef]
  6. Tripathi, S.; Tripathi, S.; Beigi, H. Multi-modal emotion recognition on IEMOCAP dataset using deep learning. arXiv 2018, arXiv:1804.05788. [Google Scholar]
  7. Wang, Y.; Gu, Y.; Yin, Y.; Han, Y.; Zhang, H.; Wang, S.; Li, C.; Quan, D. Multimodal transformer augmented fusion for speech emotion recognition. Front. Neurorobotics 2023, 17, 1181598. [Google Scholar] [CrossRef] [PubMed]
  8. Voloshina, T.; Makhnytkina, O. Multimodal Emotion Recognition and Sentiment Analysis Using Masked Attention and Multimodal Interaction. In Proceedings of the 2023 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 24–26 May 2023; IEEE: New York, NY, USA, 2023; pp. 309–317. [Google Scholar]
  9. Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  10. Paranjape, A.; Kolhatkar, G.; Patwardhan, Y.; Gokhale, O.; Dharmadhikari, S. Converge at WASSA 2023 Empathy, Emotion and Personality Shared Task: A Transformer-based Approach for Multi-Label Emotion Classification. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, Toronto, ON, Canada, 14 July 2023; pp. 558–563. [Google Scholar]
  11. Chiorrini, A.; Diamantini, C.; Mircoli, A.; Potena, D. EmotionAlBERTo: Emotion Recognition of Italian Social Media Texts Through BERT. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; IEEE: New York, NY, USA, 2022; pp. 1706–1711. [Google Scholar]
  12. Chaudhari, A.; Bhatt, C.; Krishna, A.; Mazzeo, P.L. ViTFER: Facial emotion recognition with vision transformers. Appl. Syst. Innov. 2022, 5, 80. [Google Scholar] [CrossRef]
  13. Pan, B.; Hirota, K.; Jia, Z.; Dai, Y. A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods. Neurocomputing 2023, 561, 126866. [Google Scholar] [CrossRef]
  14. Zaidi, S.A.M.; Latif, S.; Qadir, J. Cross-Language Speech Emotion Recognition Using Multimodal Dual Attention Transformers. arXiv 2023, arXiv:2306.13804. [Google Scholar]
  15. Luna-Jiménez, C.; Kleinlein, R.; Griol, D.; Callejas, Z.; Montero, J.M.; Fernández-Martínez, F. A proposal for multimodal emotion recognition using aural transformers and action units on RAVDESS dataset. Appl. Sci. 2021, 12, 327. [Google Scholar] [CrossRef]
  16. Vu, T.; Huynh, V.T.; Kim, S.H. Multi-scale Transformer-based Network for Emotion Recognition from Multi Physiological Signals. arXiv 2023, arXiv:2305.00769. [Google Scholar]
  17. Patel, S.; Scherer, K.R.; Björkner, E.; Sundberg, J. Mapping emotions into acoustic space: The role of voice production. Biol. Psychol. 2011, 87, 93–98. [Google Scholar] [CrossRef] [PubMed]
  18. Kanluan, I.; Grimm, M.; Kroschel, K. Audio-visual emotion recognition using an emotion space concept. In Proceedings of the 2008 16th European Signal Processing Conference, Lausanne, Switzerland, 25–29 August 2008; IEEE: New York, NY, USA, 2008; pp. 1–5. [Google Scholar]
  19. Islam, M.R.; Akhand, M.A.H.; Kamal, M.A.S.; Yamada, K. Recognition of emotion with intensity from speech signal using 3D transformed feature and deep learning. Electronics 2022, 11, 2362. [Google Scholar] [CrossRef]
  20. Jennings, P.A.; Greenberg, M.T. The prosocial classroom: Teacher social and emotional competence in relation to student and classroom outcomes. Rev. Educ. Res. 2009, 79, 491–525. [Google Scholar] [CrossRef]
  21. Orji, F.A.; Vassileva, J. Automatic modeling of student characteristics with interaction and physiological data using machine learning: A review. Front. Artif. Intell. 2022, 5, 1015660. [Google Scholar] [CrossRef] [PubMed]
  22. Melweth, H.M.A.; Al Mdawi, A.M.M.; Alkahtani, A.S.; Badawy, W.B.M. The Role of Artificial Intelligence Technologies in Enhancing Education and Fostering Emotional Intelligence for Academic Success. Migr. Lett. 2023, 20, 863–874. [Google Scholar]
  23. Liefooghe, B.; van Maanen, L. Three levels at which the user’s cognition can be represented in artificial intelligence. Front. Artif. Intell. 2023, 5, 293. [Google Scholar] [CrossRef] [PubMed]
  24. Shao, T.; Zhou, J. Brief overview of intelligent education. J. Contemp. Educ. Res. 2021, 5, 187–192. [Google Scholar] [CrossRef]
  25. Martínez-Miranda, J.; Aldea, A. Emotions in human and artificial intelligence. Comput. Hum. Behav. 2005, 21, 323–341. [Google Scholar] [CrossRef]
  26. Liu, F. Psychological Education and Emotional Model Establishment Analysis Based on Artificial Intelligence in the Intelligent Environment. Adv. Educ. Technol. Psychol. 2021, 5, 174–190. [Google Scholar]
  27. Liew, T.W.; Tan, S.M.; Pang, W.M.; Khan, M.T.I.; Kew, S.N. I am Alexa, your virtual tutor!: The effects of Amazon Alexa’s text-to-speech voice enthusiasm in a multimedia learning environment. Educ. Inf. Technol. 2023, 28, 1455–1489. [Google Scholar] [CrossRef] [PubMed]
  28. Etienne, C.; Fidanza, G.; Petrovskii, A.; Devillers, L.; Schmauch, B. CNN+LSTM architecture for speech emotion recognition with data augmentation. arXiv 2018, arXiv:1802.05630. [Google Scholar]
  29. Sahu, G. Multimodal speech emotion recognition and ambiguity resolution. arXiv 2019, arXiv:1904.06022. [Google Scholar]
  30. Hu, D.; Wei, L.; Huai, X. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. arXiv 2021, arXiv:2106.01978. [Google Scholar]
Figure 1. Architectural diagram of the spoken English emotion analysis model.
Figure 2. Heatmap of student spoken language emotion evaluation scores.
Figure 3. Transformer-based LLM architecture.
Figure 4. Emotion recognition model evaluation results.
Figure 5. Box plot of cosine similarity and teacher evaluation scores.
Figure 6. Line chart of training loss for the emotion evaluation network.
Figure 7. Test results of the emotion evaluation network.