Article

Multimodal Information Fusion and Data Generation for Evaluation of Second Language Emotional Expression

1 School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi’an 710021, China
2 School of Culture and Education, Shaanxi University of Science and Technology, Xi’an 710021, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9121; https://doi.org/10.3390/app14199121
Submission received: 6 August 2024 / Revised: 14 September 2024 / Accepted: 29 September 2024 / Published: 9 October 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This study aims to develop an emotion evaluation method for second language learners, utilizing multimodal information to comprehensively evaluate students’ emotional expressions. Addressing the limitations of existing emotion evaluation methods, which primarily focus on the acoustic features of speech (e.g., pronunciation, frequency, and rhythm) and often neglect the emotional expressions conveyed through voice and facial videos, this paper proposes an emotion evaluation method based on multimodal information. The method includes the following three main parts: (1) generating virtual data using a Large Language Model (LLM) and audio-driven facial video synthesis, as well as integrating the IEMOCAP dataset with self-recorded student videos and audios containing teacher ratings to construct a multimodal emotion evaluation dataset; (2) a graph convolution-based emotion feature encoding network to extract emotion features from multimodal information; and (3) an emotion evaluation network based on Kolmogorov–Arnold Networks (KAN) to compare students’ emotion features with standard synthetic data for precise evaluation. The emotion recognition method achieves an unweighted accuracy (UA) of 68.02% and an F1 score of 67.11% in experiments with the IEMOCAP dataset and TTS data. The emotion evaluation model, using the KAN network, outperforms the MLP network, with a mean squared error (MSE) of 0.811 compared to 0.943, providing a reliable tool for evaluating language learners’ emotional expressions.

1. Introduction

In the current context of globalization, the importance of second language learning has become increasingly prominent, especially as English has become the main language of international communication [1]. With the rapid development of artificial intelligence, especially breakthroughs in emotional expression evaluation technologies, it has become possible to score large-scale second language expression exams using advanced technologies. For example, Liu et al. designed an emotion recognition system using both speech and text modalities to improve the accuracy and comprehensiveness of emotion recognition [2]. This advancement has improved the objectivity, efficiency, and reliability of scoring language expression exams. Nonetheless, most current affective assessment platforms focus on the basic characteristics of voice, e.g., pronunciation, intonation, and rhythm, and ignore the importance of emotional expression in oral communication. In recent years, multimodal emotion recognition techniques have developed rapidly, and the accuracy of emotion recognition has been significantly improved by combining visual features such as facial expressions in videos [3]. For example, Hossain et al. designed a multimodal emotion recognition system that achieves the accurate classification of emotional expressions by integrating audio and video features [4]. However, existing language evaluation platforms do not include the emotion evaluation of facial videos. Emotion in language expression not only enhances expressiveness but also improves the effectiveness of communication, making information exchange more accurate and vivid. Second language learners generally face challenges with emotional expression: particularly in non-native environments, students often have difficulty expressing emotions accurately and naturally, which may affect their performance in both oral exams and real-life communication. Traditional teaching and evaluation methods have not sufficiently focused on training emotional expression in speech. Therefore, incorporating the emotion evaluation of voice and video into the evaluation system of language expression will further enhance the comprehensiveness and accuracy of second language expression evaluation.
Currently, emotion descriptions often use discrete emotion models, which describe emotions as independent labels with no correlation between them. Among them, Paul Ekman, an American psychologist, categorized discrete emotions into six fundamental emotions: anger, disgust, fear, happiness, sadness, and surprise. This form of categorization is one of the more frequently used research methods in the existing field of emotion research [5]. Meanwhile, some researchers use dimensional affective models to capture subtle changes in emotions through continuous variables [6]. Emotion recognition technology combines audio and video analysis to create a mapping relationship between data features and emotion features by extracting audio features such as intonation, speech rate, and volume, and video features such as facial expression, eyes, and muscle movement, so as to achieve an accurate emotion classification. Although emotion analysis using multimodal information has achieved remarkable results in the field of classification [7,8,9,10], the quantitative evaluation of emotion is still limited. The complexity of multimodal information fusion also increases the difficulty of model design [11]. In view of this, this study focuses on multimodal information fusion technology to comprehensively evaluate language learners’ emotional expressions. Through a more accurate quantitative emotion evaluation, effective feedback and guidance will be provided to learners to help them improve their emotional expressions.
This study is dedicated to exploring and realizing a second language emotion evaluation method based on multimodal information that pays special attention to the accurate communication of students’ emotions. In this study, advanced artificial intelligence techniques, including a Transformer model, a graph convolutional model, and a Large Language Model (LLM), are used to extract emotionally relevant features from multimodal information. In addition, this paper utilizes audio-driven facial video synthesis techniques to enhance the accuracy of emotional expression evaluation, and a supervised learning approach based on a Kolmogorov–Arnold Network (KAN) model is used to achieve automated emotion evaluation in a second language. Students are thus given opportunities to improve their emotional expression in second language learning.
The primary contributions of this research are in the following aspects:
1. This paper constructs a comprehensive framework for emotion evaluation. The framework integrates an LLM, audio-driven facial video synthesis technology, and deep learning for the analysis and automatic evaluation of emotional expressions. Based on the constructed multimodal dataset, multimodal feature extraction and fusion of audio and video data are achieved by a graph convolution-based emotion encoding network, and the KAN model is used to evaluate the extracted learners’ emotion features. The framework is able to evaluate students’ emotional expressions more comprehensively and accurately in second language emotional expression practice scenarios.
2. This study introduces generative data to enhance emotion evaluation in second language learning. Using the texts and emotion labels of the IEMOCAP dataset, an LLM generates audio data, which are then converted into videos expressing the corresponding emotions via the speaker video generation network. A multimodal emotional expression dataset is then constructed by combining this generated data with students’ audio and video recordings and professional teachers’ evaluations of the students’ emotional expression.
3. This method automates emotion feature extraction and evaluation. It effectively extracts pure emotion features, enhancing the accuracy and comprehensiveness of emotion evaluation. It also provides automatic emotion feedback and evaluation during the learning process of second language learners. This method allows second language learners to obtain rich practice materials for emotional expressions, understand their performance instantly, and adjust their learning methods based on feedback.

2. Related Work

2.1. Emotion Recognition Dataset

Current research in emotion analysis primarily focuses on constructing and evaluating datasets with emotion labels. In recent years, six important datasets have advanced the field. For example, the IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset contains information in three modalities, audio, video, and text, covering angry, happy, sad, neutral, and other emotion categories [12]. The MELD (Multimodal EmotionLines Dataset) extends the EmotionLines dataset with more conversational and multimodal information covering the emotion categories of anger, happiness, sadness, neutrality, surprise, disgust, and fear [13]. The CMU-MOSEI (Multimodal Opinion Sentiment and Emotion Intensity) dataset contains more than 23,000 video clips covering the six basic emotions, with each clip containing audio, text, and video information [14]. The AFEW (Acted Facial Expressions in the Wild) dataset contains video clips of facial expressions extracted from movies covering the emotion categories of anger, disgust, fear, happiness, neutrality, sadness, and surprise [15]. The RAVDESS (Ryerson Audio–Visual Database of Emotional Speech and Song) dataset contains emotionally rich speech and song segments covering the eight emotion categories of neutral, happy, sad, angry, scared, disgusted, surprised, and calm [10]. In addition, the EmoReact dataset focuses on children’s emotional responses while watching videos and contains rich facial expression and body language information for studying children’s emotion recognition and behavior analysis [16]. However, the existing datasets are primarily designed for classification and contain little emotion evaluation data; moreover, for some specific tasks, their audio and video information cannot be applied directly.
Based on the existing dataset, this study added data on the evaluation of students’ emotional expressions in their second language. The evaluation data were obtained from a team of professional English teachers, who rated the emotional fullness of the students’ audio and video. In addition, this study generates audio and video data that match the needs of this task by integrating audio and video generation techniques tailored to the students’ second language emotional expression learning scenario. This gives second language learners standard practice materials for emotional expression, while also producing training data that is more compatible with the task of this paper. The method focuses on the improvement of students’ language skills and emphasizes their emotional expression, which helps to improve students’ learning outcomes and overall educational experience through accurate evaluation of and feedback on their emotional expressions.

2.2. Multimodal Emotion Feature Extraction

In emotion recognition, single-modality information (e.g., using only audio or text) may not effectively capture emotion features. Recent studies have begun to explore multimodal information fusion to improve the accuracy of emotion recognition. For example, Bei Pan et al. summarized the existing research on multimodal emotion recognition [17]; Syed Aun Muhammad Zaidi et al. added attention to the multimodal model to improve the accuracy of emotion recognition [18]; Cristina Luna-Jiménez et al. optimized feature extraction for audio and face to improve the accuracy of an emotion recognition model through a transfer learning technique [19]; and Shamane Siriwardhana et al. [7] and Samarth Tripathi et al. [20] implemented self-supervised emotion recognition using three modalities of audio, video, and text inputs. In addition, Yuanyuan Wang et al. [9] and Tatiana Voloshina et al. [8] enhanced the fusion of multiple-modal data using the Transformer model to achieve increased emotion recognition accuracy. Despite the significant progress that has been made in emotion recognition tasks, current research studies still focus mainly on classification tasks, while the quantitative evaluation of emotions is still limited, especially in terms of applications in the field of emotion evaluation for language learning.
This study realizes the effective decoupling and quantitative evaluation of emotion features from other features in language emotional expressions through an emotion evaluation framework based on multimodal information fusion. The framework not only accurately classifies emotions, but also has a quantitative evaluation function by comparing the evaluation data with the standard data generated by the generative model. It is particularly suitable for evaluating the emotional expressions of second language learners, which further enhances the quality of teaching and the effectiveness of personalized education.

2.3. AI in Educational Evaluation

The importance of emotions in education cannot be overlooked. Fidelia A et al. [21] and Hessah M AL Melweth et al. [22] demonstrated a moderate positive correlation between the usage of artificial intelligence technology and teaching effectiveness. Although modern educational technology has made some progress in integrating AI technology, it mainly focuses on knowledge transfer and the evaluation of the accuracy of pronunciation of language expressions, and pays less attention to students’ emotional expressions. Baptist Liefooghe et al. argued that AI models should be optimized in a more humane way [23]; Tiejun Shao et al. suggested that AI applications need to adhere to a human-centered philosophy [24]; Juan Martínez-Miranda et al. discussed the potential role of emotions in enhancing the decision making process, adaptability, and learning capabilities of AI, pointing the way to subsequent research [25]; Jaya Kannan et al. studied the impact of AI technology on creating new trends in second language education [1].
By introducing emotion evaluation technology into language learning, this study gives second language learners practice material for standard emotional expressions and the ability to automatically and immediately evaluate learners’ emotional expressions. This approach innovatively focuses on knowledge transfer and students’ emotional experience, providing a personalized and humanized approach to education that helps to promote language learners’ learning effectiveness and educational experience. Through enhancing the usage of AI technology in language learning evaluation, a more comprehensive and profound educational evaluation can be achieved to offer students a humanized and emotional learning environment.

3. Method and Model

This study follows up on the previous research on the evaluation of emotion in audio by developing a method based on multimodal information fusion and emotion characterization for the emotion analysis and evaluation of audio and video of students’ emotional expressions in their second language. First, the text and emotion labels in the IEMOCAP dataset are utilized to synthesize emotionally rich corresponding audio data with an LLM. Subsequently, the synthesized audio data and the audio in the IEMOCAP dataset are used to generate speaker videos that match the corresponding audio through the speaker video generation model, and these are combined with the student audio and video to build a comprehensive dataset. Then, an emotion feature encoding network based on multimodal information fusion is introduced to extract emotion features from the multimodal data. Finally, a KAN-based network for evaluating linguistic emotional expressions is designed to achieve a regression analysis and accurate evaluation of students’ second language emotional expressions by training on the emotion features extracted from multimodal information against professional teachers’ scores. This section also details the differences from citation [26]: that work focuses mainly on audio information alone, whereas this paper integrates multimodal information. Figure 1 is an illustration of the overall framework, which has three parts.
Step 1. Generating a Mixed Dataset: This paper utilizes an LLM, Typecast [27], to convert emotionally labeled text into audio data that conveys the corresponding emotion. These audios, together with the audios from the IEMOCAP dataset, are passed through the speaker video generation model, Dream-talk [28], to generate speaker videos that match the corresponding audio and emotion. The results are then combined with the collected student audio and video data to build a comprehensive dataset for student emotional expression evaluation.
Step 2. Multimodal Emotion Feature Encoding Network: In this work, an emotion feature encoding network based on multimodal information fusion is introduced to deeply extract emotion features from audio, video, and text. The data processing steps include the following: video data are processed through OpenFace’s face detection and feature extraction; audio features are extracted and normalized using the ComParE_2016 method in OpenSmile; and text data are pre-processed and converted into sentence embeddings using Sentence-BERT (SBERT) [29]. The network uses the Relational Temporal Graph Convolution Network (RT-GCN) [30] to extract global temporal features of the multimodal information and Pairwise Cross-Modal Feature Interaction [30] to capture local features between different modalities. It effectively disentangles emotion features from other factors in the multimodal data (e.g., language style in text, background in video, accent in audio) to improve the evaluation accuracy.
Step 3. Emotion Evaluation Network: For any input of students’ language emotional expressions (with emotion labels), the process first synthesizes reference audio via TTS through the LLM using the corresponding text and emotion label, and then generates the corresponding speaker video using the speaker video generation model. Afterwards, the actual student audio and video data, along with the synthesized audio and video data and the corresponding text, are fed into the emotion feature encoding network, which extracts the relevant emotion features. The student data encoding and the generated data encoding are jointly input into the Kolmogorov–Arnold Network (KAN)-based emotion evaluation network [31]. This part enables the accurate scoring of students’ emotional expressions by training on the emotion features extracted from both sets of data according to the teacher’s scoring.

3.1. Dataset Composition

The construction of this comprehensive dataset is crucial for the overall methodology of this study, as it provides the necessary multimodal information to accurately evaluate second language learners’ emotional expressions. This study constructs a comprehensive dataset based on the IEMOCAP dataset by generating emotionally rich TTS audio and synthesizing the corresponding speaker video with the speaker video generation model. Unlike citation [26], the dataset in this work contains not only audio and generated TTS speech from the IEMOCAP dataset, but also the corresponding speaker video and text generated by the speaker video generation model. The specific components are as follows:

3.1.1. IEMOCAP Dataset Text, Audio, and Synthesized Video

The IEMOCAP dataset is a multimodal emotional dataset that provides emotionally labeled dialogue resources recorded by 10 speakers. The research team selected 3106 audio samples from the IEMOCAP dataset for the study. The audios were categorized by 3–4 evaluators into 6 emotion categories: happy, sad, angry, excited, frustrated, and neutral, resulting in 3–4 emotion labels for each audio. To determine the main emotion label of each selected text, this paper uses a membership function calculation based on a normal distribution; with this method, we calculated the membership of each statement for the six emotions in the IEMOCAP dataset [26].
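To make the label-selection step concrete, the sketch below shows one way a normal-distribution (Gaussian) membership function can turn the 3–4 evaluator votes into a single main label. It is an illustrative assumption rather than the exact formulation of [26]; the membership centre c = 1.0 and width sigma are placeholders.

```python
# Illustrative only: Gaussian membership over evaluator vote shares.
import numpy as np

EMOTIONS = ["happy", "sad", "angry", "excited", "frustrated", "neutral"]

def gaussian_membership(x, c=1.0, sigma=0.4):
    """Membership degree of a vote share x in [0, 1], peaking at x == c."""
    return np.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def main_emotion(evaluator_labels):
    votes = np.array([evaluator_labels.count(e) for e in EMOTIONS], dtype=float)
    share = votes / votes.sum()              # fraction of evaluators per emotion
    membership = gaussian_membership(share)  # membership degree of each emotion
    membership[votes == 0] = 0.0             # ignore emotions nobody selected
    return EMOTIONS[int(np.argmax(membership))]

print(main_emotion(["happy", "excited", "happy", "happy"]))  # -> "happy"
```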
To ensure that the generated audio data have corresponding video information, a diffusion network-based model, Dream-talk, is used in this study to generate videos of the upper frontal part of the speaker corresponding to the audio data. Diffusion models have shown great capabilities in data generation tasks. This process involves the gradual recovery of clear facial movements from noisy data, thereby generating clear, high-quality facial expressions.
In the actual evaluation of students’ emotional expressions, the required recorded videos are of the upper frontal half of the person’s body. Since the video data in the IEMOCAP dataset are side videos of the speaker’s whole body, which does not meet the practical needs of the actual student emotion evaluation task, this study utilizes the 3106 audio data files in the IEMOCAP dataset to generate the speaker video. This enhances the applicability of the dataset to the task of this study.

3.1.2. TTS-Synthesized Audio and Synthesized Video

In order to adequately expand the existing dataset and to give comparative data for arbitrary content input in the emotion evaluation network, this study utilizes a speech generation model, Typecast. The model generated 2892 audio data files with emotions, increasing the diversity of the mixed dataset.
The synthesized audio was also used to further generate videos through the speaker video generation model, Dream-talk. During the synthesis process, it was ensured that the emotion label of each synthesized sample was taken from the highest-scoring emotion category of the original audio.

3.1.3. Student-Recorded Video and Audio and Teacher Scoring Data for Emotion Evaluation

In order to apply the language learning emotion evaluation method proposed in this study to real teaching scenarios, 18 emotionally representative audios were selected from the IEMOCAP dataset. These audios covered six categories of emotions. Subsequently, 52 second language learners recorded 936 video and audio samples based on the content and emotions of these audios. Unlike citation [26], the emotion scoring in this paper is not only for audio, but also comprehensively evaluates the emotional performance on video. We assembled a group of English teachers to evaluate the emotional fullness of the students’ audio and video. These teachers’ scores were utilized as criteria for training and evaluating the affective evaluation model. The average performance of each student for each emotion category is shown in Figure 2. On the heatmap, scores are distinguished by color to differentiate between good and bad emotional expression, with red representing high scores and green representing low scores. Differences in color and shade show the distribution and intensity of students’ emotional expression.
From Figure 2, it can be seen that, among the students sampled in this study, 36 students have emotional expression evaluation scores all below 7, i.e., only just passing or failing. Moreover, most students show uniform evaluation scores across the different emotions: for the same student, the scores for the six emotion categories cluster in a similar range. Only a small number of students excel at particular emotions, with evaluation scores for those emotions that are significantly higher than for the others. The statistical summary of students’ scores can be seen in Table 1.
Table 1 summarizes the descriptive statistics of students’ multimodal emotion evaluation scores across various emotion categories. Each category includes the maximum (max), minimum (min), average (avg), variance (var), and standard deviation (std) of the scores. The scores for emotions like neutral, frustration, anger, sadness, happiness, and excitement show similar ranges and averages, indicating that students generally have consistent abilities in expressing these emotions. The overall average score is 6.83, with a standard deviation of 0.98. This indicates challenges in effectively expressing emotions while learning a second language.
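For reference, the descriptive statistics reported in Table 1 can be reproduced directly from the teacher score records. The sketch below uses pandas on a toy table; the column names ('student', 'emotion', 'score') are assumptions about the raw score file, not the authors' actual schema.

```python
import pandas as pd

# Toy rows standing in for the 936 teacher-scored recordings (column names assumed).
scores = pd.DataFrame({
    "student": [1, 1, 2, 2],
    "emotion": ["Neutral", "Anger", "Neutral", "Anger"],
    "score":   [6.5, 7.0, 8.0, 5.5],
})
per_emotion = scores.groupby("emotion")["score"].agg(["max", "min", "mean", "var", "std"])
overall = scores["score"].agg(["max", "min", "mean", "var", "std"])
print(per_emotion.round(2))   # per-category statistics, as in Table 1
print(overall.round(2))       # "All of the Emotions" row
```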
The final dataset contains 6934 sets of video, audio, and text, providing richer and more comprehensive multimodal information. Their data sources and quantities are shown in Table 2. This dataset can be used by subsequent algorithms to offer an accurate judgment of spoken emotions and standard speech examples for second language learners.

3.2. The Deep Representation of Emotion Features in Second Language Expression

The above describes the dataset we constructed. Next, we will utilize this dataset for the deep representation and analysis of emotion features in second language expression. The extraction of emotion features is crucial for understanding and evaluating the emotional expressions of language learners. To extract consistent emotion features from the paired multimodal information, this paper, inspired by the CORECT method [30], uses a feature encoding network for multimodal information based on graph convolutional networks (GCNs). First, data processing is performed on the video, audio, and text information, respectively; second, global and local features of the language are extracted through the RT-GCN and Pairwise Cross-Modal Feature Interaction [30], respectively; finally, emotion classification is performed by the emotion recognition network. Unlike citation [26], this paper not only focuses on feature extraction from audio, but also integrates multimodal feature extraction from video and text information. Figure 1, Step 2 illustrates the process of training the multimodal emotion encoding network.

3.2.1. Multimodal Information Processing Methods

Multimodal information processing specifically includes the following steps:
1. Video Data Processing: First, a sequence of frames is extracted from the video at a fixed frame rate; next, face detection is performed on each extracted frame to locate the face; finally, a facial behavior analysis is performed on the face. The faces are detected and analyzed using OpenFace 2.2.0. The detected faces are aligned using OpenFace’s facial alignment function to remove the effects of angle and scale. Then, facial features are extracted using OpenFace, including facial action units, head pose, eye gaze direction, and facial landmark points.
2. Audio Data Processing: Audio features are extracted using the ComParE_2016 method in OpenSmile 2.5.0. ComParE_2016 is a large-scale audio feature set for comparative analysis. Its extracted audio features include MFCCs, energy, spectral features, and so on. Afterwards, the extracted audio features are normalized for subsequent feature fusion and analysis.
3. Text Data Processing: First, the text pre-processing is performed, which includes tokenization, the removal of stop words, and stemming. Then, the SBERT [29] method is used to transform the text into sentence embeddings. SBERT is a model for generating high-quality sentence embeddings. The steps include loading a pre-trained model for generating an initial encoding, inputting the pre-processed text into the SBERT model, and generating fixed-length sentence embeddings. These embeddings capture the semantic information of the text, effectively identifying similarities and emotional features across sentences.
Through the above processing steps, video, audio, and text data are transformed into feature representations suitable for subsequent network processing. The processing is shown in Figure 3, where the processed audio information is stored in a one-dimensional vector of length 6373, the processed video information is stored in a one-dimensional vector of length 527, and the processed text information is stored in a one-dimensional vector of length 768.
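As a concrete illustration of this pipeline, the following sketch extracts the three per-modality representations with the tools named above. The ComParE_2016 feature set (6373 dimensions), the use of OpenFace and SBERT, and the target vector sizes come from the paper; the file names, the specific SBERT checkpoint (here "all-mpnet-base-v2", which yields 768-dimensional embeddings), and the exact OpenFace command line are assumptions.

```python
import subprocess
import opensmile
from sentence_transformers import SentenceTransformer

# Audio: openSMILE ComParE_2016 functionals -> a 6373-dimensional vector per clip.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
audio_vec = smile.process_file("student_clip.wav")   # DataFrame of shape (1, 6373)
# (z-score normalisation across the whole corpus would follow here)

# Text: Sentence-BERT embedding -> a 768-dimensional vector.
sbert = SentenceTransformer("all-mpnet-base-v2")
text_vec = sbert.encode("I can't believe we finally made it!")  # shape: (768,)

# Video: OpenFace's FeatureExtraction tool writes action units, head pose, gaze, and
# landmarks to a CSV; those per-frame features are later aggregated into the 527-d vector.
subprocess.run(["FeatureExtraction", "-f", "student_clip.mp4", "-out_dir", "openface_out"])
```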

3.2.2. Emotion Feature Encoding Network

In the feature extraction stage, this paper adopts the structure of the CORECT method to encode the emotion features. First, global temporal features of video, audio, and text are extracted by an RT-GCN [30] to capture the long-term dependencies within each modality. Second, this paper uses Pairwise Cross-Modal Feature Interaction [30] to capture local features between different modalities. The specific steps are listed below.
First, the video, audio, and text features are constructed as graph structures, where each node represents the features at a time step and the edges encode the relationships between the features of different time steps. Through the RT-GCN, the node features and edge relations can be processed simultaneously to extract the global temporal features of each modality. Next, the Pairwise Cross-Modal Feature Interaction extracts local features of each modality. By calculating interaction features between each pair of modalities, this network captures the mutual influences between modalities, thereby enhancing the precision of emotion feature extraction. Finally, the two types of features are fused to generate a unified emotion feature representation, which provides input for subsequent emotion classification.
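The sketch below illustrates the basic idea of a graph convolution over such a temporal graph: nodes are time steps of one modality and edges connect time steps within a small window. It is a simplified stand-in for the RT-GCN of [30], with illustrative dimensions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimpleTemporalGCN(nn.Module):
    """One graph-convolution pass over a window-based temporal graph (illustrative)."""
    def __init__(self, in_dim: int, out_dim: int, window: int = 2):
        super().__init__()
        self.window = window
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, in_dim), per-time-step features of one modality
        T = x.size(0)
        idx = torch.arange(T)
        # adjacency: connect time steps whose distance is at most `window`
        adj = ((idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= self.window).float()
        deg = adj.sum(dim=1, keepdim=True)            # node degrees
        adj_norm = adj / deg                          # row-normalised adjacency
        return torch.relu(self.linear(adj_norm @ x))  # aggregate neighbours, then transform

audio_seq = torch.randn(20, 512)        # 20 time steps, 512-d features (illustrative)
gcn = SimpleTemporalGCN(512, 256)
print(gcn(audio_seq).shape)             # torch.Size([20, 256])
```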

3.2.3. Emotion Classification

In the emotion classification stage, the fused multimodal emotion features are fed into an emotion classification network composed of a Multi-Layer Perceptron (MLP) to classify the emotion features and output emotion categories. Through the above steps, the multimodal feature encoding network of this paper efficiently captures and combines the emotion features of video, audio, and text to achieve accurate emotion classification. Since the output of the emotion encoding network suffices for the classification network to recognize emotions, these intermediate features can be regarded as an abstract representation of the emotional information.
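A minimal classification head consistent with the 2-layer MLP, layer size (100, 6), and dropout 0.5 reported in Table 3 is sketched below; the fused input dimension of 1024 is an assumption.

```python
import torch
import torch.nn as nn

fused_dim = 1024                      # dimensionality of the fused emotion feature (assumed)
classifier = nn.Sequential(
    nn.Linear(fused_dim, 100),        # hidden layer of width 100
    nn.ReLU(),
    nn.Dropout(0.5),                  # dropout value from Table 3
    nn.Linear(100, 6),                # logits over the six emotion categories
)
logits = classifier(torch.randn(4, fused_dim))   # e.g., a batch of 4 fused feature vectors
```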
Through the above method, not only is the efficient extraction of emotional features from language achieved, but the corresponding emotion feature representation is also established. This provides a solid basis for a further evaluation of language learners’ emotional expressions. The quantification and classification of emotion features enables a more comprehensive evaluation of learners’ emotional expressions in language expression, thus providing them with targeted feedback and guidance to help them improve their emotional expression capabilities.

3.3. Evaluation of Second Language Emotional Expressions

The above detailed the extraction of emotion features. Next, we will focus on evaluating students’ emotional expressions in a second language. The extracted features will serve as the basis for accurate assessment, providing effective feedback and enhancing learners’ communicative competence. In order to evaluate the students’ emotional expressions in their second language, this paper designs an emotion evaluation network, which analyses and evaluates the two types of multimodal emotion features and finally obtains the emotion score of the students’ second language expressions. The network structure is shown in Step 3 in Figure 1, and the specific process is as follows: (1) According to the given text and emotion tags, obtain the student’s live-recorded video and audio, as well as the virtual voice synthesized with Typecast and the corresponding speaker video generated with Dream-talk. (2) Process the video, audio, and text of both the synthetic and the student-recorded information to obtain the pre-processed multimodal vectors according to Step 2, and input them into the emotion feature encoding network to extract the emotion features from both the synthetic and the student multimodal information. (3) Concatenate the two emotion encoding vectors obtained from the two kinds of information and input them into the emotion evaluation network, finally obtaining the students’ emotional expression scores. Unlike citation [20], this paper further extends IEMOCAP, considering that the text read by students in real-world applications is not necessarily text from the IEMOCAP dataset, whereas TTS audio generation can be based on any text. Therefore, retaining only TTS audio generation is the more flexible choice for dealing with different practical teaching scenarios.
This study evaluates emotional expression using multimodal data with complex correlations between modalities. Emotional expressions are highly nonlinear in nature: subtle changes in facial expression or audio may lead to significant emotional fluctuations and exhibit nonlinear distributions. The correlation between one modality and the other two cannot simply produce a correspondence for the emotion features of a given passage, and global and local information need to be integrated, which further increases the complexity of the nonlinear features. Compared to MLP networks, KANs [31] achieve not only a stronger nonlinear mapping ability but also more accurate function fitting. Therefore, this paper chooses a KAN for the emotion evaluation module, and its specific structure is shown in Figure 4. A KAN is a neural network architecture for representing and approximating multivariate functions. It maps the multivariate inputs into a series of linear combinations by introducing intermediate nodes, then transforms them nonlinearly through univariate activation functions, and finally performs a weighted summation to obtain the final output.
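For background, the Kolmogorov–Arnold representation theorem underlying KANs [31] expresses a continuous multivariate function as a nested sum of univariate functions, and a KAN layer generalizes this by placing a learnable univariate function on every edge. The equations below follow the notation of [31] and are included only as a reference, not as the exact configuration used in this paper.

$$f(x_1,\dots,x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \varphi_{q,p}(x_p)\right), \qquad x^{(l+1)}_j = \sum_{i} \phi_{l,j,i}\!\left(x^{(l)}_i\right)$$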
The loss function of the network in this paper is the mean squared error (MSE) loss, and its formula is
$$L_{\mathrm{emotion}} = \mathrm{MSE} = \frac{\sum_{i=1}^{n}\left(f_{\mathrm{evaluate}}\left(x_{\mathrm{multimodal},i}\right) - y_{\mathrm{teacher},i}\right)^{2}}{n}$$
where $f_{\mathrm{evaluate}}(\cdot)$ is the fitting function of the emotion evaluation network, $y_{\mathrm{teacher}}$ is the teacher’s real score, $x_{\mathrm{multimodal}}$ is the multimodal information input to the network, and $n$ is the number of samples. By inputting both synthetic and student-recorded information to the network, and allowing the network to use the synthetic data as a scoring criterion, the network can be made to converge more quickly and obtain students’ emotional expression scores more accurately.
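A hedged sketch of one optimization step with this MSE objective is given below. The random tensors stand in for the student and synthetic emotion encodings, the simple MLP is only a shape-checking placeholder for the KAN head, and apart from the learning rate of 0.0002 from Table 3 all dimensions are assumptions.

```python
import torch
import torch.nn as nn

batch, feat_dim = 10, 1024
student_enc = torch.randn(batch, feat_dim)      # emotion features of the student clip (placeholder)
reference_enc = torch.randn(batch, feat_dim)    # emotion features of the synthetic clip (placeholder)
teacher_score = torch.rand(batch, 1) * 4 + 5    # teacher scores, roughly in the 5-9 range

head = nn.Sequential(nn.Linear(2 * feat_dim, 512), nn.SiLU(), nn.Linear(512, 1))
optimizer = torch.optim.Adam(head.parameters(), lr=2e-4)

optimizer.zero_grad()
pred = head(torch.cat([student_enc, reference_enc], dim=1))  # x_multimodal: both encodings
loss = nn.functional.mse_loss(pred, teacher_score)           # L_emotion
loss.backward()
optimizer.step()
print(float(loss))
```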

4. Experiments and Analysis

4.1. Evaluation Metrics

This research evaluates the capability of the emotion feature encoding network using two main metrics: F1 score and unweighted accuracy (UA), as described below:
1. F1 Score: The F1 score is a measure that balances both the precision and recall of the model. In the emotion recognition task within second language emotion learning, the F1 score provides an effective measure of the model’s overall performance across various emotion categories, ensuring that the model can maintain both a high precision and high recall when dealing with various emotion categories. The formula for the F1 Score is
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
2. Unweighted Accuracy (UA): Unweighted accuracy refers to the average of the model’s accuracy across categories, regardless of the quantity of samples in each category. In the task of emotion recognition in second language emotion learning, using unweighted accuracy helps to evaluate the model’s performance in recognizing different emotion categories.
$$\mathrm{Accuracy} = \frac{T_{\mathrm{emotion}}}{T_{\mathrm{emotion}} + F_{\mathrm{emotion}}}$$
where $T_{\mathrm{emotion}}$ is the quantity of samples with a correct emotion prediction and $F_{\mathrm{emotion}}$ is the quantity of samples with an incorrect emotion prediction.
For the emotion evaluation network, we chose to use the MSE as the main evaluation metric. The MSE refers to the mean of the squared differences between the predicted values and the actual labels. In emotion evaluation, it is utilized to assess the model’s performance in emotion score regression.
These evaluation metrics were chosen to give an overall measure of the model’s accuracy, recall, and ability to handle unbalanced data, allowing for an in-depth analysis of the model’s performance.
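These metrics can be computed with standard tooling; the sketch below uses scikit-learn, treating unweighted accuracy as the mean of per-class recall (balanced accuracy). Whether the reported F1 is macro- or weighted-averaged is not stated in the paper, so macro averaging is an assumption, and the toy labels are illustrative.

```python
from sklearn.metrics import f1_score, balanced_accuracy_score, mean_squared_error

# Classification metrics for the emotion encoding network (6 emotion classes, toy labels).
y_true = [0, 2, 1, 3, 4, 5, 1]
y_pred = [0, 2, 2, 3, 4, 5, 1]
ua = balanced_accuracy_score(y_true, y_pred)      # unweighted accuracy: mean per-class recall
f1 = f1_score(y_true, y_pred, average="macro")    # F1 averaged over emotion classes

# Regression metric for the emotion evaluation network (toy teacher scores).
scores_true = [7.0, 6.5, 8.0]
scores_pred = [6.8, 6.9, 7.5]
mse = mean_squared_error(scores_true, scores_pred)
print(ua, f1, mse)
```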

4.2. Implementation Details

We used PyTorch 2.0.1 to train the model, and training and testing were performed on an RTX 4090 GPU with 24 GB of video memory in an Ubuntu 18 environment. The emotion feature encoding network used the model parameters from [30]. The hyperparameters used to train our model are shown in Table 3.

4.3. Performance Evaluation of the Emotion Encoding Network

As shown in Step 2 of Figure 1, the research team utilized an emotion encoding network based on multimodal information fusion, conducting emotion recognition experiments on the multimodal emotional dataset created from both the IEMOCAP and the TTS-based multimodal data. The experimental dataset ultimately included 5998 groups of data, and we chose the F1 score and unweighted accuracy as the evaluation metrics. Table 4 presents the experimental results.
The emotion recognition method designed in this paper obtained 68.02% on the accuracy metric, exceeding the 67.04% of COGMEN [35]. Meanwhile, the F1 score of this paper reached 67.11%, indicating that the algorithm is robust and can adapt to the various discrepancies in students’ language expressions. The UA and F1 scores of this paper’s six-category setting also outperform the four-category results of citation [26], showing the effective improvement this paper achieves over the previous work.
The recognition results for each emotion category are shown in Figure 5. Referring to the experimental results, the emotion recognition model shows significant improvements in accuracy and F1 scores across all emotion categories after incorporating TTS data. Specifically, the most notable increases in accuracy are seen in the “sadness” and “frustration” categories, which improved by 0.242 and 0.265, respectively. In terms of F1 scores, the “sadness” category shows a particularly significant increase, rising from 0.573 to 0.790.
This research method utilizes video, audio, and text as inputs and employs multimodal feature fusion for feature encoding, demonstrating that the method in this paper reaches SOTA standards on emotion recognition tasks. The aim of this part of the study is to capture emotional features in multimodal data for the score computations of the evaluation network. The results show an excellent performance of the emotion encoding network in capturing multimodal emotional features, proving the validity of the feature extraction network. The extracted features can be used as inputs to the subsequent emotion evaluation network, thus laying the foundation for a more accurate evaluation of language learners’ emotional expressions.
The ablation experiments of the emotion classification network on the multimodal dataset of this paper are shown in Table 5. When only a single modality is input, the model gives the best classification results for text data and the worst for video data. When two modalities are input, the model gives the best classification results for the combination of audio and text data, and the worst for audio and video data. Thus, text data contribute the most to emotion consistency, and video data contribute the least. The table also shows that inputting the combination of all three modalities into the model yields the optimal classification results.

4.4. Performance Evaluation of the Emotion Evaluation Network

Using the emotion evaluation network, this paper completes the quantitative analysis of students’ emotional expressions and generates the corresponding scores. This section compares the KAN used in the emotion evaluation module with the MLP network, and the MSE is utilized as the main evaluation metric for the emotion evaluation network. Figure 6 illustrates the specific experimental results.
When using the MLP network, the model’s final MSE is 0.943; however, with the KAN, the model’s final MSE is 0.811. Additionally, as shown in Figure 6, the KAN model converges faster than the MLP network during the training process. Compared to citation [26], this paper not only conducts emotion evaluations on audio but also combines multimodal data (video, audio, and text), resulting in a final MSE of 0.811, which is better than the MSE of 0.847 for the model in citation [26]. Upon validation, the KAN used in this paper has performed excellently in the emotion evaluation tasks, demonstrating the effectiveness and reliability of the emotion evaluation network in evaluating students’ emotional expressions. Through continuous learning and adjustment, the model has essentially reached the assessment level of professional teachers and provides reliable support for students’ emotional expression in second language learning.
The ablation experiments of this paper’s emotion evaluation network on the multimodal student emotion evaluation dataset are shown in Table 6. Because the text data of the generated data and the student-recorded data are identical in the emotion evaluation network, separate text-only inputs were not tested. When only a single modality is input, the evaluation results for audio data are better than those for video data. When two modalities are input, the model gives the best evaluation results for the combination of audio and video data, and the worst for text and video data. The table also shows that inputting the combination of all three modalities into the model yields the optimal evaluation results.

4.5. Conclusions of the Experiment

Following experimental validation, the student emotion evaluation method proposed in this paper shows strong performance in aspects of using audio and video generation models, a multimodal fusion-based emotion encoding network, and a KAN-based emotion evaluation network. This method provides an advanced and effective means of emotional expression evaluation for second language learners. At the same time, it can provide learners with precise emotional feedback and guidance to help them improve their emotional expressions.

5. Conclusions and Outlook

This study, within the background of AI and emotion evaluation technology, integrates multimodal information and audio-based video generation technology into the evaluation system of emotional expressions on the basis of the existing spoken emotion evaluation. Through data generation, multimodal information fusion, and deep learning, this paper achieves emotion feature extraction and automatic evaluation for emotion expression in second language learning. This method significantly enhances the existing evaluation system, addresses the limitations of pure audio evaluation, and provides a more detailed analysis of learners’ emotional expression. This study designed a method for integrating multimodal information and applied it for the first time in a language emotion evaluation system. This method can provide more accurate emotional feedback and improvement approaches, providing targeted guidance for students to enhance their expression ability in a second language.
As artificial intelligence technology further develops, we expect to optimize and extend the methodology of this study. Future research will further refine and fine-tune the emotion analysis model to enhance its ability in dealing with more complex environments and subtle emotional expressions, in order to better serve the evaluation of the emotional expressions of language learners. Meanwhile, continuous emotion evaluation according to time change should also be the direction of subsequent work. Finally, the study of the robustness of emotion evaluation for extreme data is also an important part of future work. This study, as a typical case of the Shaanxi University of Science and Technology’s Large Language Model supporting teaching, provides a strong example and reference for future educational technology innovation.

Author Contributions

Conceptualization, L.W.; Methodology, Y.Q. and H.C.; Software, J.Y.; Formal analysis, Y.Q.; Resources, L.W. and J.L.; Data curation, J.Y.; Writing—original draft, J.Y., L.W., Y.Q. and H.C.; Writing—review and editing, J.L.; Project administration, J.L.; Funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Internationalization of Education and Teaching Reform Research Project of the Shaanxi University of Science and Technology grant number GJ22YB09, the Shaanxi University of Science and Technology 2023 General Projects for University-Level Educational Reforms grant number 23Y080, and the National Natural Science Foundation of China (NSFC) grant number 62306172. The APC was funded by 23Y080.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of School of Culture and Education, Shaanxi University of Science and Technology (protocol code 20240301 and date of approval is 6 January 2024).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

Herein, our research team would like to express sincere gratitude for the support and assistance provided by the following projects to this thesis. We are thankful for the National Natural Science Foundation of China (NSFC) (Project No. 62306172), whose funding has facilitated the smooth progress of our research. We are also grateful for the support from two teaching reform projects provided by the Shaanxi University of Science and Technology: Shaanxi University of Science and Technology 2023 General Projects for University-Level Educational Reforms (Project No. 23Y080) and the Internationalization of Education and Teaching Reform Research Project of the Shaanxi University of Science and Technology (Project No. GJ22YB09). The financial and resource support from these projects provided invaluable conditions for the completion of this thesis.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kannan, J.; Munday, P. New trends in second language learning and teaching through the lens of ICT, networked learning, and artificial intelligence. Circ. Linguist. Apl. A Comun. 2018, 76, 13–30. [Google Scholar] [CrossRef]
  2. Liu, Y.; Sun, H.; Guan, W.; Guan, W.; Xia, Y.; Zhao, Z. Multi-modal speech emotion recognition using self-attention mechanism and multi-scale fusion framework. Speech Commun. 2022, 139, 1–9. [Google Scholar] [CrossRef]
  3. Abdullah, S.M.S.A.; Ameen, S.Y.A.; Sadeeq, M.A.M.; Zeebaree, S. Multimodal emotion recognition using deep learning. J. Appl. Sci. Technol. Trends 2021, 2, 73–79. [Google Scholar] [CrossRef]
  4. Hossain, M.S.; Muhammad, G. Emotion recognition using deep learning approach from audio–visual emotional big data. Inf. Fusion 2019, 49, 69–78. [Google Scholar] [CrossRef]
  5. Ekman, P. Basic emotions. In Handbook of Cognition and Emotion; John Wiley & Sons: Hoboken, NJ, USA, 1999; Volume 98, p. 16. [Google Scholar]
  6. Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161. [Google Scholar] [CrossRef]
  7. Siriwardhana, S.; Kaluarachchi, T.; Billinghurst, M.; Nanayakkara, S. Multimodal Emotion Recognition with Transformer-Based Self Supervised Feature Fusion. IEEE Access 2020, 8, 176274–176285. [Google Scholar] [CrossRef]
  8. Voloshina, T.; Makhnytkina, O. Multimodal Emotion Recognition and Sentiment Analysis Using Masked Attention and Multimodal Interaction. In Proceedings of the 33rd Conference of Open Innovations Association (FRUCT), Zilina, Slovakia, 24–26 May 2023; pp. 309–317. [Google Scholar]
  9. Wang, Y.; Gu, Y.; Yin, Y.; Han, Y.; Zhang, H.; Wang, S.; Li, C.; Quan, D. Multimodal transformer augmented fusion for speech emotion recognition. Front. Neurorobotics 2023, 17, 1181598. [Google Scholar] [CrossRef] [PubMed]
  10. Aslam, A.; Sargano, A.B.; Habib, Z. Attention-based multimodal sentiment analysis and emotion recognition using deep neural networks. Appl. Soft Comput. 2023, 144, 110494. [Google Scholar] [CrossRef]
  11. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef]
  12. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  13. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 527–536. [Google Scholar]
  14. Zadeh, A.A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 2236–2246. [Google Scholar]
  15. Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Acted Facial Expressions in the Wild Database; Technical Report TR-CS-11; Australian National University: Canberra, Australia, 2011; Volume 2. [Google Scholar]
  16. Nojavanasghari, B.; Baltrušaitis, T.; Hughes, C.E.; Morency, L.P. Emoreact: A multimodal approach and dataset for recognizing emotional responses in children. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; pp. 137–144. [Google Scholar]
  17. Pan, B.; Hirota, K.; Jia, Z.; Dai, Y. A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods. Neurocomputing 2023, 561, 126866. [Google Scholar] [CrossRef]
  18. Zaidi, S.A.M.; Latif, S.; Qadir, J. Cross-language speech emotion recognition using multimodal dual attention transformers. arXiv 2023, arXiv:2306.13804. [Google Scholar]
  19. Luna-Jiménez, C.; Kleinlein, R.; Griol, D.; Callejas, Z.; Montero, J.M.; Fernández-Martínez, F. A proposal for multimodal emotion recognition using aural transformers and action units on ravdess dataset. Appl. Sci. 2021, 12, 327. [Google Scholar] [CrossRef]
  20. Tripathi, S.; Tripathi, S.; Beigi, H. Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning. arXiv 2018, arXiv:1804.05788. [Google Scholar]
  21. Orji, F.A.; Vassileva, J. Automatic modeling of student characteristics with interaction and physiological data using machine learning: A review. Front. Artif. Intell. 2022, 5, 1015660. [Google Scholar] [CrossRef]
  22. Melweth, H.M.A.; Mdawi, A.M.M.A.; Alkahtani, A.S.; Badawy, W.B.M. The Role of Artificial Intelligence Technologies in Enhancing Education and Fostering Emotional Intelligence for Academic Success. Migr. Lett. 2023, 20, 863–874. [Google Scholar]
  23. Liefooghe, B.; Maanen, L.V. Three levels at which the user’s cognition can be represented in artificial intelligence. Front. Artif. Intell. 2023, 5, 1092053. [Google Scholar] [CrossRef]
  24. Shao, T.; Zhou, J. Brief Overview of Intelligent Education. J. Contemp. Educ. Res. 2021, 5, 187–192. [Google Scholar] [CrossRef]
  25. Martínez-Miranda, J.; Aldea, A. Emotions in human and artificial intelligence. Comput. Hum. Behav. 2005, 21, 323–341. [Google Scholar] [CrossRef]
  26. Wang, L.; Yang, J.; Wang, Y.; Qi, Y.; Wang, S.; Li, J. Integrating Large Language Models (LLMs) and Deep Representations of Emotional Features for the Recognition and Evaluation of Emotions in Spoken English. Appl. Sci. 2024, 14, 3543. [Google Scholar] [CrossRef]
  27. Liew, T.W.; Tan, S.M.; Pang, W.M.; Khan, M.T.I.; Kew, S.N. I am Alexa, your virtual tutor!: The effects of Amazon Alexa’s text-to-speech voice enthusiasm in a multimedia learning environment. Educ. Inf. Technol. 2023, 28, 1455–1489. [Google Scholar] [CrossRef] [PubMed]
  28. Zhang, C.; Wang, C.; Zhang, J.; Xu, H.; Song, G.; Xie, Y.; Luo, L.; Tian, Y.; Guo, X.; Feng, J. Dream-talk: Diffusion-based realistic emotional audio-driven method for single image talking face generation. arXiv 2023, arXiv:2312.13578. [Google Scholar]
  29. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar]
  30. Nguyen, C.V.T.; Mai, A.T.; Le, T.S.; Kieu, H.D.; Le, D.T. Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
  31. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. Kan: Kolmogorov-arnold networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  32. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
  33. Wei, Y.; Wang, X.; Nie, L.; He, X.; Hong, R.; Chua, T.S. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1437–1445. [Google Scholar]
  34. Hu, D.; Wei, L.; Huai, X. DialogueCRN: Contextual Reasoning Networks for Emotion Recognition in Conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 7042–7052. [Google Scholar]
  35. Joshi, A.; Bhat, A.; Jain, A.; Singh, A.; Modi, A. COGMEN: Contextualized GNN based multimodal emotion recognition. arXiv 2022, arXiv:2205.02455. [Google Scholar]
Figure 1. Architecture of the multimodal information fusion model for analyzing second language emotional expressions.
Figure 2. Heatmap of students’ multimodal emotion evaluation scores. It represents the heat distribution of professional teachers’ scoring of audio and video data recorded by 52 students. The value of each emotion in the graph is the average score of the students’ expression of three sentences with the same emotion but different content.
Figure 3. Flowchart of multimodal information data processing.
Figure 4. KAN structure diagram.
Figure 5. Graph of the results of the emotion recognition model: (a) Comparison of accuracy before and after adding TTS data based on the IEMOCAP dataset; (b) comparison of F1 score before and after adding TTS data based on the IEMOCAP dataset.
Figure 6. Line graph of training loss for the emotion evaluation network. It compares MSE losses of training and testing using MLP and KAN. The table in the figure shows the average and variance of MSE losses for the four types of lines.
Table 1. Descriptive statistics table of students’ multimodal emotion evaluation scores. It calculated the scores of all student-recorded data for 6 emotions (156 audios and videos for each emotion).
Emotion Categories     Max    Min    Avg    Var    Std
Neutral                8.33   5.33   6.87   0.55   0.74
Frustration            8.67   5.00   6.87   0.74   0.86
Anger                  9.00   5.00   7.01   1.33   1.15
Sadness                9.00   5.00   6.92   1.11   1.06
Happiness              9.00   5.00   6.62   1.05   1.03
Excited                8.67   5.00   6.71   0.90   0.95
All of the Emotions    9.00   5.00   6.83   0.96   0.98
In the table, max represents the maximum value, min represents the minimum value, avg represents the average value, var represents the variance, and std represents the standard deviation.
Table 2. Mixed dataset sources and quantity statistics table.
Data Category    Data Source (Audio)    Data Source (Video)           Data Source (Text)    Number of Samples
IEMOCAP data     IEMOCAP Dataset        Generating from Dream-talk    IEMOCAP Dataset       3106
Synthetic Data   Generating from TTS    Generating from Dream-talk    IEMOCAP Dataset       2892
Student Data     Student Recording      Student Recording             IEMOCAP Dataset       936
Table 3. Table of model-training hyperparameters.
Module                          Batch Size    Learning Rate    Dropout    Number of Layers    Layer Size
Emotion Classification Module   10            0.00025          0.5        2 (MLP)             (100, 6)
Emotional Evaluation Module     10            0.0002           -          3 (KAN)             (1024, 512, 6)
Table 4. Table of multimodal emotion recognition results for the IEMOCAP dataset.
Methods             Number of Categories    F1↑ (%)    Accuracy↑ (%)
bc-LSTM [32]        6                       59.58      59.10
MMGCN [33]          6                       65.71      65.56
DialogueCRN [34]    6                       65.34      65.31
Wang et al. [26]    4                       64.10      66.60
COGMEN [35]         6                       67.27      67.04
Ours                6                       67.11      68.02
Optimal values are bolded and sub-optimal values are italicized in the table.
Table 5. Ablation study results of the emotion classification network under different modal settings.
Modality Settings    F1↑ (%)    Accuracy↑ (%)
A                    47.71      50.42
T                    64.52      65.13
V                    30.44      25.79
A+T                  65.41      66.45
A+V                  51.67      50.57
T+V                  60.52      61.24
Ours (A+T+V)         67.11      68.02
Optimal values are bolded in the table.
Table 6. Ablation study results of the emotion evaluation network under different modal settings.
Modality Settings    MSE↓
A                    0.832
V                    2.946
A+T                  0.841
A+V                  0.828
T+V                  2.867
Ours (A+T+V)         0.811
In the table, “A” represents the audio modality, “T” represents the text modality, and “V” represents the video modality. Optimal values are bolded in the table.
