Article
Peer-Review Record

Multimodal Information Fusion and Data Generation for Evaluation of Second Language Emotional Expression

Appl. Sci. 2024, 14(19), 9121; https://doi.org/10.3390/app14199121
by Jun Yang 1, Liyan Wang 2, Yong Qi 1, Haifeng Chen 1 and Jian Li 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 6 August 2024 / Revised: 14 September 2024 / Accepted: 29 September 2024 / Published: 9 October 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Methodology:

What is the correlation between emotional expressions and language proficiency scores?

How does emotional expressiveness contribute to overall fluency?

How did you measure the validity of the emotion feedback provided by the evaluation system, where lessons are adjusted based on detected emotional gaps?

How well do emotions align with contextual meaning, considering the context within which a student's emotional expression is presented?

What would happen if different regularization techniques were applied in the KAN network to test which has the most positive impact on generalization? How did you arrive at the current version?

Experiments:

Add an ablation study on each modality (text, audio, video) to determine which feature contributes most to emotional expression evaluation. Also play with the KAN structure and remove certain layers or nodes to illustrate their specific contributions to performance.

How did you mitigate potential biases in emotion recognition (e.g., gender, ethnicity, age) to improve the fairness and robustness of the evaluation system?


How does emotional expression change over time? Does the model's performance degrade or improve with increased data and experience?

Add a robustness analysis of the model across more diverse environments (there seems to be no robustness testing here).

What happens with exaggerated and subdued expressions?

What is the consistency of emotional expression across modalities? (e.g., whether a sad voice correlates with sad facial expressions)


Add statistical reliability analysis.

Others:

Plagiarism of 29% is not acceptable; reduce it to 15% or less.

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions highlighted in the re-submitted files.

 

Methodology:

Comments [1]: What is the correlation between emotional expressions and language proficiency scores?

Response [1]: Thank you for your review comments. This study focuses specifically on evaluating emotional expression: in our dataset, the professional teachers scored the students' recordings solely on emotional expression, not on the content the students expressed, so no correlation with language proficiency scores was measured.

Comments [2]: How does emotional expressiveness contribute to overall fluency?

Response [2]: Thank you for your review comments. Emotional expression is an important part of language ability: the same sentence spoken with different emotions can convey different meanings. This study therefore focuses on evaluating emotional expression in spoken language, and the lack of emotion in second language learners' expression is discussed in Section 1 of the article.

 

Comments [3]: How did you measure the validity of the emotion feedback provided by the evaluation system, where lessons are adjusted based on detected emotional gaps?

Response [3]: Thank you for your review suggestions. This study can be used as an aid for second language teaching. When students use this model to practice emotional expression, the model outputs a standard example of the target expression, with voice generated by TTS and video generated by the DreamTalk model. Students can study this standard example to improve their emotional expression on their own, and then use the evaluation model again to measure their level after practice. The comparison of the scores before and after practice demonstrates the effectiveness of the students' learning. Automatic adjustment of lessons based on the evaluation results is also an important part of the research, but it is not the focus of this paper and will be investigated in subsequent work.

 

Comments [4]: How well do emotions align with contextual meaning, considering the context within which a student's emotional expression is presented?

Response [4]: Thank you for your review suggestions. Due to the complexity of evaluating emotions across scenarios, this study focuses on the emotion evaluation of a single utterance, which is assumed to carry a single emotion. The application scenario is as follows: the content and overall emotion of an utterance are specified by the examining teacher or by the student practicing emotional expression; the evaluation model then scores the student's video and audio and provides standard reference audio and video. The multi-emotion problem in complex scenarios is a potential direction for future work; it will be investigated in subsequent work, and we have added a note on this in Section 5.

 

 

Comments [5]: What would happen if different regularization techniques were applied in the KAN network to test which has the most positive impact on generalization? How did you arrive at the current version?

Response [5]: Because the training process of the KAN-based model in this study was relatively smooth and showed no serious overfitting, this article did not test many regularization techniques; only L2 regularization, dropout, and batch normalization were tested. After multiple tests with different parameters, the final KAN network uses L2 regularization and batch normalization. Dropout was not used because it was found to have no significant effect on the KAN. A minimal illustrative sketch of this setup is given below.
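For concreteness, here is a minimal sketch (not the authors' code) of how L2 regularization and batch normalization can be combined in a PyTorch scoring head; plain linear layers stand in for the actual KAN layers, and all layer sizes and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScoringHead(nn.Module):
    """Illustrative scoring head; Linear layers are stand-ins for KAN layers."""
    def __init__(self, in_dim: int = 256, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),   # stand-in for a KAN layer
            nn.BatchNorm1d(hidden),      # batch normalization, as in the response
            nn.SiLU(),
            nn.Linear(hidden, 1),        # regression output: emotion score
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = ScoringHead()
# L2 regularization enters through the optimizer's weight_decay term.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```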

 

Experiments:

Comments [1]: Add an ablation study on each modality (text, audio, video) to determine which feature contributes most to emotional expression evaluation. Also play with the KAN structure and remove certain layers or nodes to illustrate their specific contributions to performance.

Response [1]: Thank you for your review comments. In response to the request for ablation experiments on each modality, I have added Table 6 to show the contribution of the different modalities to the evaluation model, along with a description preceding the table.

In response to the suggestion regarding KAN: although KAN has the advantage of better interpretability, in the emotion evaluation network of this study the multimodal data have already been mixed into an abstract feature vector by the emotion encoding network, and individual modalities can no longer be distinguished within this vector. Removing layers or nodes from the KAN therefore would not reveal the contribution of the individual modalities to the model (see the sketch below for the intuition). For this reason, the ablation experiments in Table 6 are performed only on the multimodal data inputs. I hope to have your understanding.
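To make this point concrete, here is a minimal, purely illustrative sketch assuming a simple concatenate-and-project fusion (the paper's emotion encoding network may differ): after fusion, every coordinate of the feature vector depends on all three modalities at once, so pruning downstream nodes cannot isolate a single modality's contribution.

```python
import torch
import torch.nn as nn

# Hypothetical per-modality embeddings (batch of 8, 128 dimensions each).
text_emb = torch.randn(8, 128)
audio_emb = torch.randn(8, 128)
video_emb = torch.randn(8, 128)

# Concatenate and project: after this step every coordinate of `fused`
# mixes text, audio, and video features, so no downstream (KAN) node
# corresponds to a single modality.
proj = nn.Linear(3 * 128, 256)
fused = proj(torch.cat([text_emb, audio_emb, video_emb], dim=-1))
```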

 


Comments [2]: How did you mitigate potential biases in emotion recognition (e.g., gender, ethnicity, age) to improve the fairness and robustness of the evaluation system?

Response [2]: In Section 4.3 of the article, this study successfully classifies multimodal data into emotion categories using the emotion encoding network and an MLP classification module, and the classification results reach SOTA level. Because the data used to train the emotion encoding network contain multimodal samples from speakers of different genders, ethnicities, and ages, this demonstrates that the emotion encoding network mitigates potential biases in the data.

 

Comments [3]: How does emotional expression change over time? Does the model's performance degrade or improve with increased data and experience?

Response [3]: Thank you for your review comments. This paper focuses on discrete emotion evaluation, on the premise that a paragraph carries a single emotion by default. However, as you suggest, emotion evaluation over time is also an important research component. We are therefore conducting research on continuous emotion recognition over time and are in the process of producing a related dataset, and we have added this to Section 5. As this lies outside the scope of the present paper, we ask for your understanding.

 

Comments [4]: Add a robustness analysis of the model across more diverse environments (there seems to be no robustness testing here). What happens with exaggerated and subdued expressions?

Response [4]: Thank you for your valuable suggestions on our work. We understand the importance of robustness analyses in some contexts. However, this study focuses on emotional evaluation in a specific context: the student's second language learning scenario. In the normal course of this scenario, students do not submit very extreme data (such as strongly exaggerated or subdued expressions), so robustness analyses are not the focus of this work. We have designed our experiments to ensure that the current test environment is sufficiently representative of the application scenario we are targeting. Nonetheless, future research will further explore the robustness of the model in diverse environments to enhance its adaptability. This paper therefore adds a description of the studied scenario to the main contributions in Section 1 and a description of the future work in Section 5. Thank you for your interest in our work and your suggestions.

 

Comments [5]: What is the consistency of emotional expression across modalities? (e.g., whether a sad voice correlates with sad facial expressions)

Response [5]: Thank you for your valuable review suggestions. Regarding the consistency of emotion expression across modalities (e.g., speech and facial expression), we conducted a more in-depth analysis through ablation experiments on the emotion classification model. Specifically, we assessed the consistency of each modality by removing different modal inputs one at a time and observing the change in the classification model's performance (a minimal sketch of this protocol is given below). The experimental results are shown in Table 5, which reflects the complementarities and differences between modalities. Because classification performance drops sharply when the text data are removed, while removing either of the other two modalities changes it little, there is strong consistency between the speech and text data, whereas the video data contribute little to emotional consistency.
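A minimal sketch (assumed, not the authors' script) of this modality-ablation protocol: drop one modality at a time by zeroing its features, then compare model performance. The `model`, `evaluate`, and `test_batch` names in the commented usage are hypothetical placeholders.

```python
import torch

def ablate(batch: dict, drop: str) -> dict:
    """Return a copy of the batch with one modality's features zeroed out."""
    out = dict(batch)
    out[drop] = torch.zeros_like(batch[drop])
    return out

# Hypothetical usage:
# for name in ("text", "audio", "video"):
#     acc = evaluate(model, ablate(test_batch, drop=name))
#     print(f"without {name}: accuracy {acc:.3f}")
```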

 

Comments [6]: Add statistical reliability analysis.

Response [6]: Thank you for your valuable review suggestions. Because our dataset, training set, and test set are fixed, the experimental results are deterministic and do not change significantly between runs. Traditional statistical measures such as confidence intervals and standard errors are therefore not applicable in this scenario. However, verifying the reliability of the experimental results is indeed important, so we have added stability analyses of the model's training process in Section 4.4, showing the convergence and stability of the model's performance on both the training and validation sets (a minimal sketch of such a check is given below). We believe these analyses support our experimental results well and show that the model's performance on the current dataset is stable and reliable. We thank you for your attention and valuable suggestions on our work.
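A minimal sketch (illustrative only) of the stability check described above: plot training and validation loss per epoch and inspect convergence. The loss values below are hypothetical placeholders, not the paper's data.

```python
import matplotlib.pyplot as plt

train_loss = [0.92, 0.61, 0.44, 0.35, 0.31, 0.29]  # placeholder per-epoch losses
val_loss = [0.95, 0.66, 0.50, 0.42, 0.40, 0.39]    # placeholder per-epoch losses

plt.plot(train_loss, label="training loss")
plt.plot(val_loss, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```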

 

Others:

Comments [1]: Plagiarism of 29% is not acceptable; reduce it to 15% or less.

Response [1]: Thank you for reviewing our paper. The main reason for this similarity is that this paper builds on our previously published work on single-modality emotion evaluation, extending it to multimodal emotion evaluation with an improved model structure. As a result, parts of it inevitably overlap with our previous work. Nonetheless, we will make further revisions to the paper, especially to the parts that overlap with our previous work, reorganizing and rephrasing them to ensure that the similarity is reduced to 15% or below and meets the publication requirements. The modified parts are marked in green font in the text. Thank you for your advice and guidance on our work.

 

 

Thank you for your revision suggestions.

 

Reviewer 2 Report

Comments and Suggestions for Authors

In my opinion, the authors need to provide a summary of their datasets in a single table reporting the overall number of classes, instances, and data types. They also need to provide a second table reporting the hyperparameters of all their models.

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions highlighted in the re-submitted files.

 

Comments [1]: In my opinion, the authors need to provide a summary of their datasets in a single table reporting the overall number of classes, instances, and data types.

Response [1]: I agree with your review comments. I have added a statistical table at the end of Section 3.1 describing the composition of the dataset, the data sources, and the number of samples.

 

Comments [2]: They also need to provide a second table reporting the hyperparameters of all their models.

Response [2]: I agree with your review comments. I have added Section 4.2 to supplement the hyperparameters of the model.

 

Thank you for your revision suggestions.

 

Reviewer 3 Report

Comments and Suggestions for Authors

Here are the comments section by section:

Introduction: It is suggested to expand the background section to cover a wider variety of related works. This will go a long way in helping you position your research within the current literature.

Methods: Include more information on how the experiments were conducted, focusing especially on how the data were collected, captured, and analyzed. Readers should be able to reproduce your study from the details given. Some unnecessary information is provided, such as how the evaluation metrics are computed; this is well known and the details need not be explained. The most important thing is to present the experimental work clearly.

Results: Add more descriptive captions to the figures and tables to make their contents easier to understand. You may also wish to discuss the relevance of the findings and elaborate on them further. Figure 5 appears before Figure 4 (adjust the order); increase the font size in Figures 4 and 5.

Conclusions: Be sure to reinforce your conclusions by making clear what is new in your research and how it contributes to the progress of the subject field.

Comments on the Quality of English Language

The English is generally clear but contains occasional awkward phrasing and minor grammatical errors that need attention.

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions highlighted in the re-submitted files.

 

Comments [1] Introduction: It is suggested to expand the background section to cover a wider variety of related works. This will go a long way in helping you position your research within the current literature.

Response [1] Introduction: I agree with your opinion. I have expanded the background section, which now includes more related works. You can see the modified content in the red highlighted area of Section 1.

 

Comments [2] Methods: Include more information on how the experiments were conducted, focusing especially on how the data were collected, captured, and analyzed. Readers should be able to reproduce your study from the details given. Some unnecessary information is provided, such as how the evaluation metrics are computed; this is well known and the details need not be explained. The most important thing is to present the experimental work clearly.

Response [2] Methods: I agree with your review suggestions. The detailed methods for collecting, capturing, and analyzing the dataset are described in Section 3.1. To ensure that readers have access to the research details, we have added Section 4.2 to the experimental section, which explains the experimental and model details, and added Table 3 to list the model hyperparameters. Finally, some content on the evaluation metrics has been removed for conciseness.

 

Comments [3] Results: Add more descriptive captions to the figures and tables to make their contents easier to understand. You may also wish to discuss the relevance of the findings and elaborate on them further. Figure 5 appears before Figure 4 (adjust the order); increase the font size in Figures 4 and 5.

Response [3] Results: I agree with your review suggestions. I have added descriptive language to Figures 2 and 6 and Table 1. Figure 5 did not actually appear before Figure 4; rather, I had mistakenly written Figure 6 as Figure 4. I have made the necessary corrections and adjusted the font size in Figures 5 and 6.

 

Comments [4] Conclusions: Be sure to reinforce your conclusions by making clear what is new in your research and how it contributes to the progress of the subject field.

Response [4] Conclusions: I agree with your review comments. I have revised the conclusion section of the article to highlight its contribution and its potential application in second language learning scenarios.

 

Thank you for your revision suggestions.

 

Reviewer 4 Report

Comments and Suggestions for Authors

Title: Multimodal Information Fusion and Data Generation for Evaluation of Second Language Emotional Expression

Manuscript ID: applsci-3170128

 

This paper presents a method for evaluating "emotional expression" in English by language learners. To this end, the authors introduce multimodal synthetic data generated with large language models, a graph convolutional network to extract emotion features, and a Kolmogorov-Arnold network to estimate the evaluation scores given by experts.

This work would be beneficial to foreign language learners who have limited opportunities for one-on-one conversation practice with native speakers. There is clear progress from the authors' previous work [21]. In my opinion, this study has both high practical value and technical novelty. Moreover, the manuscript is well written and structured. I have no major concerns about this paper.

 


Minor Concerns

[1] Representation of Figure 2

Figure 2 does not look scientific, and I could not understand it. Line 309 says Figure 2 presents 36 (students) x 6 (emotion scores). Then, what are 'u' and 52 in Figure 2? Why are the 52 data points presented in a circle? (Do '13' and '40' correspond to each other, for example?) Which emotion does each of the six numbers in the radial direction correspond to? I therefore recommend revising Figure 2 so that the values are presented in a regular table (like Table 1) or summarized as histograms.

 

[2] References 7, 24, 29 should cite the published paper or proceedings.

 

 

Typo and Writing Suggestions

 

Line 409: indent

 

Line 436: "KAN[26] not only possess stronger nonlinear mapping ability, but also more accurate in function fitting." -> KAN[26] (possesses [or has achieved]) not only a stronger nonlinear mapping ability, but also more (accuracy in function fitting [or accurate function fitting]).

 

Best regards,

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions highlighted in the re-submitted files.

 

 

Minor Concerns

 

Comments [1]: Representation of Figure 2

Figure 2 does not look scientific, and I could not understand it. Line 309 says Figure 2 presents 36 (students) x 6 (emotion scores). Then, what are 'u' and 52 in Figure 2? Why are the 52 data points presented in a circle? (Do '13' and '40' correspond to each other, for example?) Which emotion does each of the six numbers in the radial direction correspond to? I therefore recommend revising Figure 2 so that the values are presented in a regular table (like Table 1) or summarized as histograms.

Response [1]: I apologize for the confusion caused by Figure 2. The "36" mentioned in line 300 (after modification) refers to the 36 students whose emotional expression evaluation scores are all below 7, while the "52" in Figure 2 is explained in line 284 and refers to all 52 students. Although presenting the values in a regular table or summarizing them as histograms is a good suggestion, I still prefer a circular heatmap to visually present the distribution of students' emotional scores. To make Figure 2 easier to understand, I have labeled the corresponding emotional category on each ring and explained the meaning of the x and y labels in the center of the figure (an illustrative sketch of such a circular heatmap is given below). Thank you for your suggestion.
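For readers unfamiliar with circular heatmaps, a minimal illustrative sketch in matplotlib follows: 52 students around the circle, six emotion categories as concentric rings. The scores and the generic "emotion i" ring labels below are placeholders, not the paper's data or category names.

```python
import numpy as np
import matplotlib.pyplot as plt

n_students, n_emotions = 52, 6
scores = np.random.randint(0, 11, size=(n_emotions, n_students))  # placeholder scores

theta = np.linspace(0.0, 2.0 * np.pi, n_students + 1)  # angular bin edges (students)
r = np.arange(n_emotions + 1)                          # radial bin edges (one ring per emotion)

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
mesh = ax.pcolormesh(theta, r, scores, cmap="viridis")
ax.set_yticks(r[:-1] + 0.5)
ax.set_yticklabels([f"emotion {i + 1}" for i in range(n_emotions)])  # category names assumed
fig.colorbar(mesh, label="score")
plt.show()
```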

 

Comments [2]: References 7, 24, 29 should cite the published paper or proceedings.

Response [2]: I have made the necessary corrections to references 7, 24, and 29. Due to the addition and modification of references, they are now references 13, 30, and 35.

 

 

Typos and Writing Suggestions

 

Comments Line 409: indent

Response Line 409: This has been corrected. Due to modifications, it is now line 407.

 

Comments Line 436: "KAN[26] not only possess stronger nonlinear mapping ability, but also more accurate in function fitting." -> KAN[26] (possesses [or has achieved]) not only a stronger nonlinear mapping ability, but also more (accuracy in function fitting [or accurate function fitting]).

Response Line 436: It has been modified to "KAN[26] has achieved not only a stronger nonlinear mapping ability, but also more accurate function fitting." Due to modifications, it is now line 434.

 

Thank you for your revision suggestions.

 
