Article
Peer-Review Record

A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset

Appl. Sci. 2022, 12(1), 327; https://doi.org/10.3390/app12010327
by Cristina Luna-Jiménez 1,*, Ricardo Kleinlein 1, David Griol 2, Zoraida Callejas 2, Juan M. Montero 1 and Fernando Fernández-Martínez 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 22 November 2021 / Revised: 22 December 2021 / Accepted: 27 December 2021 / Published: 30 December 2021

Round 1

Reviewer 1 Report

The authors present an approach to fuse posteriors from Speech Emotion Recognizer and Facial Emotion Recognizer for Multimodal Emotion Recognition. Results show improvement when fusing results, compared to using individual modes to solve the problem.

Concerns:

1) Although the authors introduce the work as a "multimodal recognizer", my expectation for a "multimodal recognizer" would be what is described in the paper as "early fusion". The reason is that in your case (which is described as "late fusion") the model learned in each mode is not affected in a "multi-task learning fashion" with the model for the other mode. For me, the "late fusion" is more related to an "ensemble approach", but using different input data per model. I think it would be interesting to, at least, mention "ensemble" to contextualize the work.

2) When using the logistic regression to fuse results, since it is not mentioned I assume the default "0.5" posterior threshold is used by logistic regression. It would be interesting to show, besides accuracy, the model precision / recall per class when different posterior thresholds are used. Specially if the classes are unbalanced. Alternatively, a ROC curve per class, for example, could be shown.

3) There is a small typo "sysmtes" instead of "systems" in the Introduction.

Author Response

The authors greatly appreciate the time the reviewers have dedicated to understanding this paper and making suggestions to improve its next version. Their comments have helped to increase the quality of this work and to explain some ideas more clearly.

The answer to each question is written in blue below the corresponding reviewer comment, and we have also quoted most of the paragraphs added to the article to address the reviewers' concerns.

1) Although the authors introduce the work as a "multimodal recognizer", my expectation for a "multimodal recognizer" would be what is described in the paper as "early fusion". The reason is that in your case (which is described as "late fusion") the model learned in each mode is not affected in a "multi-task learning fashion" with the model for the other mode. For me, the "late fusion" is more related to an "ensemble approach", but using different input data per model. I think it would be interesting to, at least, mention "ensemble" to contextualize the work.

Thank you for your comment. We have added the following lines at the end of Section 2.3 to clarify what we do in the paper to combine sources of information and also to include the word ensemble in case of doubts. 

Due to the simplicity and adequate performance of the late-fusion strategy on similar tasks [60,63], we decided to combine the posteriors of each model trained per modality (aural or visual). We then fed the generated outputs into a multinomial logistic regression. This process can also be understood as an ensemble approach: we assembled the posteriors learned independently by each model and then trained a multinomial logistic regression model to solve a single task, emotion recognition.
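For concreteness, a minimal sketch of this late-fusion (ensemble) step could look as follows; the file and variable names are illustrative assumptions, not our actual artifacts:

```python
# Minimal sketch of the late-fusion / ensemble step: the posteriors of the aural and
# visual models are concatenated and fed to a multinomial logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Posteriors produced independently by each modality, shape (n_samples, n_classes)
# for the 8 RAVDESS emotions; file names below are hypothetical.
aural_posteriors = np.load("aural_posteriors.npy")
visual_posteriors = np.load("visual_posteriors.npy")
labels = np.load("labels.npy")  # integer emotion labels

# The fused representation is simply the concatenation of both posterior vectors.
fused = np.concatenate([aural_posteriors, visual_posteriors], axis=1)

# With the default lbfgs solver, scikit-learn applies a multinomial (softmax)
# formulation for multiclass targets, i.e. a multinomial logistic regression.
fusion_model = LogisticRegression(max_iter=1000)
fusion_model.fit(fused, labels)
```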

 

2) When using the logistic regression to fuse results, since it is not mentioned I assume the default "0.5" posterior threshold is used by logistic regression. It would be interesting to show, besides accuracy, the model precision / recall per class when different posterior thresholds are used. Specially if the classes are unbalanced. Alternatively, a ROC curve per class, for example, could be shown.

Thank you for the suggestion.

Regarding the logistic regression: since our problem is multinomial (we have K classes), we used a multinomial logistic regression, which differs slightly from the binary case. In a binary logistic regression the decision threshold can play an important role, but in the multinomial setting it does not modify the final result: the probabilities are estimated differently, and the predicted class does not depend on where a threshold is placed but on which of the K trained logistic regressors produces the predominant output for each observation. A threshold would only matter if we introduced a 'quality criterion' to discard samples whose probability does not exceed a certain level; in that case, some samples would receive no label. We did not include this feature because we did not want to discard any sample during the evaluation. To avoid any confusion, we have added a clarification in the article to remark that we use a multinomial logistic regression. We have also included precision, recall, and accuracy per emotion for the top model in Table 3.
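As a brief illustration of this point, the hedged sketch below (continuing the hypothetical fusion example above) shows that the multinomial prediction is the argmax over the K class probabilities, with no 0.5 threshold involved, and how per-class precision and recall can be reported; fused_test and y_test are assumed hold-out arrays:

```python
# Hedged sketch: the predicted class is the argmax over the K class probabilities,
# so no posterior threshold is involved. fused_test / y_test are assumed hold-out data.
import numpy as np
from sklearn.metrics import classification_report

probs = fusion_model.predict_proba(fused_test)  # shape (n_samples, K)
y_pred = np.argmax(probs, axis=1)               # equivalent to fusion_model.predict(fused_test)

# Per-class precision and recall, in the spirit of the metrics added to Table 3.
print(classification_report(y_test, y_pred, digits=3))
```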

 

3) There is a small typo "sysmtes" instead of "systems" in the Introduction.

Thank you for your observation, we have corrected the typo. 

 

Reviewer 2 Report

In the work presented in this paper, the authors have proposed a methodology for multimodal emotion recognition using aural transformers and action units on the RAVDESS dataset. The work seems novel and holds potential for impact to the discipline. However, the following changes are necessary in the paper. The authors are suggested to make the necessary changes/updates to their paper as per the following comments:

  1. Missing references in the Introduction and Literature Review section: In multiple fact-based statements in these two sections, the supporting references are missing. For example in this sentence there should be a supporting reference - “Among all the suggested hypotheses, the literature indicates that two primary theories have been positioned in the center of the discussion from a psychological standpoint….”
  2. The authors state – “Other works as the proposed in [37–39], also employs CNNs, MLPs, or LSTMs to solve emotion recognition on RAVDESS using spectrograms or pre-processed features, obtaining accuracies of 80.00%, 96.18%, and 81%, respectively.” The accuracy of 96.18% is significantly higher than the accuracy of 86.70% that the authors have obtained. A comparative discussion should be added to justify the relevance of this approach even though the accuracy obtained is lower than a similar work in this field.
  3. In Figure 1 why are the static models limited to k-NN, MLP, and SVC. Why are some of the other machine learning models such as Random Forest, Decision Trees, ANN, not included here?
  4. The explanation related to using this dataset is not clear. The authors have provided just 1 sentence - “In our analysis, we have used the RAVDESS dataset mainly because it is a free of charge reference corpus for the scientific community for speech emotion recognition [34,64,65], but also because of its suitability to our experiments.” Please elaborate discussing how this dataset is suitable for your experiments.
  5. Discussion of potential applications of this work with supporting references is missing. It is suggested that the authors add a paragraph to discuss potential applications of their work such as use cases in ambient assisted living, user interactions with automated/semi-automated systems, and so on. Cite this paper - https://doi.org/10.3390/info12020081 related to the discussion about ambient assisted living and cite other papers to justify the interdisciplinary applications of this approach. This paragraph should also include how the work of this paper aligns with the scope of this journal. 
  6. English proofreading is recommended as there are multiple sentence construction errors and grammatical errors in the paper.

Author Response

The authors greatly appreciate the time the reviewers have dedicated to understanding this paper and making suggestions to improve its next version. Their comments have helped to increase the quality of this work and to explain some ideas more clearly.

The answer to each question is written in blue below the corresponding reviewer comment, and we have also quoted most of the paragraphs added to the article to address the reviewers' concerns.

  1. Missing references in the Introduction and Literature Review section: In multiple fact-based statements in these two sections, the supporting references are missing. For example in this sentence there should be a supporting reference - “Among all the suggested hypotheses, the literature indicates that two primary theories have been positioned in the center of the discussion from a psychological standpoint….”

Thank you for your comment. We have tried to modify all the statements that did not have a reference or we have looked for references that support them. 

Regarding the one that you mention, we have modified it to define more specifically what we wanted to transmit:

Among all the suggested psychological hypotheses, the literature indicates that two primary theories appear as models for annotating most of the current emotion recognition datasets [17]: the Discrete Emotion Theory and the Core Affect/Constructionist Theory [18]. …

We have also added the following review to support the statement about ‘traditional’ vs. ‘Deep-learning’ classifiers/approaches in Section 2.1: 

According to the reviews of Wani et al. [28] and Akçay and Oguz [27], we can distinguish two main ways to perform Speech Emotion Recognition: using traditional classifiers or using Deep-Learning classifiers. …

Additionally, we also modified some lines of Section 2.3 to add a review that categorizes the types of fusions that exist:

According to the review of Huang et al. [58], there are three basic ways of merging modalities: early fusion, joint fusion, and late fusion.

ADDED REFERENCES:

[17] Ashraf, A.; Gunawan, T.; Rahman, F.; Kartiwi, M. A Summarization of Image and Video Databases for Emotion Recognition; 2022; pp. 669–680. doi:10.1007/978-981-33-4597-3_60.

[18] Thanapattheerakul, T.; Mao, K.; Amoranto, J.; Chan, J. Emotion in a Century: A Review of Emotion Recognition. 2018, pp. 1–8. doi:10.1145/3291280.3291788.

[27] Akçay, M.B.; Oguz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76.

[28] Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A Comprehensive Review of Speech Emotion Recognition Systems. IEEE Access 2021,9, 47795–47814. doi:10.1109/ACCESS.2021.3068045.

[58] Huang, S.C.; Pareek, A.; Seyyedi, S.; Banerjee, I.; Lungren, M.P. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. npj Digital Medicine 2020,3, doi:10.1038/s41746-020-00341-z.

 

2. The authors state – “Other works as the proposed in [37–39], also employs CNNs, MLPs, or LSTMs to solve emotion recognition on RAVDESS using spectrograms or pre-processed features, obtaining accuracies of 80.00%, 96.18%, and 81%, respectively.” The accuracy of 96.18% is significantly higher than the accuracy of 86.70% that the authors have obtained. A comparative discussion should be added to justify the relevance of this approach even though the accuracy obtained is lower than a similar work in this field.

Thanks for your comment. We have moved the paragraph explaining why we cannot directly compare our contribution with other works. Basically, these publications do not state exactly how they distribute their data, i.e., whether the split is subject-wise or not. The way subjects are split is quite important, since including information from the same user in both the training and the test sets leads to higher accuracies than a subject-wise strategy. For this reason, we also included our code to allow other researchers to use our evaluation set-up and compare their contributions under the same conditions. Below you can see the modified paragraph of Section 2.1:

Other works, like those proposed in [36–38], also employ CNNs, MLPs, or LSTMs to solve emotion recognition on RAVDESS using spectrograms or pre-processed features, obtaining accuracies of 80.00%, 96.18%, and 81%, respectively.

Although RAVDESS appears in a growing number of publications, there are no standard evaluation criteria, which makes it complex to quantify and compare contributions. For example, the authors of [37] achieve 96.18% accuracy using a 10-fold cross-validation evaluation. Nonetheless, they do not specify how users are distributed across folds, so it is unclear whether the same user participates in both the training and the test sets. Whether the distribution of users is subject-wise or not is a relevant fact to consider when implementing an evaluation set-up: non-subject-wise scenarios will always result in higher performance rates because the training and the test sets contain samples from the same users.
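To make the subject-wise idea concrete, a minimal sketch of such a split could group folds by actor ID, which is the last field of every RAVDESS file name; the 'files' and 'labels' variables below are illustrative assumptions:

```python
# Hedged sketch of a subject-wise 5-fold split: folds are grouped by actor ID so
# that no actor contributes samples to both the training and the test partitions.
from sklearn.model_selection import GroupKFold

# RAVDESS file names end with the actor ID, e.g. "03-01-05-01-02-01-12.mp4" -> actor 12.
actor_ids = [f.split(".")[0].split("-")[-1] for f in files]

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(files, labels, groups=actor_ids)):
    train_actors = {actor_ids[i] for i in train_idx}
    test_actors = {actor_ids[i] for i in test_idx}
    assert train_actors.isdisjoint(test_actors)  # subject-wise guarantee
```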

 

3. In Figure 1 why are the static models limited to k-NN, MLP, and SVC. Why are some of the other machine learning models such as Random Forest, Decision Trees, ANN, not included here?

Thanks for your comment. We did not include other algorithms because we did not use them in our work; the diagram represents only the models and steps that we followed and tested. However, it is true that this may be confusing, so we have added some additional lines to the caption of the figure to clarify it.

Figure 1. Block diagram of the implemented systems. The figure represents the analyzed models from the existing families of sequential models, static models, and transformers.

 

4. The explanation related to using this dataset is not clear. The authors have provided just 1 sentence - “In our analysis, we have used the RAVDESS dataset mainly because it is a free of charge reference corpus for the scientific community for speech emotion recognition [34,64,65], but also because of its suitability to our experiments.” Please elaborate discussing how this dataset is suitable for your experiments.

 

Thanks for your observation. We have gathered the advantages of the dataset into a single paragraph in Section 3.1 to make it easier to identify and understand why we selected it:

We only used the full AV (audio-visual) material and the speech channel for our experiments because we are interested in audio-visual emotion recognition on speech rather than on songs. This selection limits the number of files to 1,440 videos with maximum and minimum durations of 5.31 and 2.99 seconds, respectively. The corpus has 24 actors distributed in a gender-balanced way, who speak lexically-matched statements in a neutral North American accent. This setup is suitable for studying the para-linguistics associated with emotions, isolating the lexical content and reducing the bias in emotional expressions that culture may induce. Among its advantages, it also has a proportional number of files per emotion, which avoids problems derived from training algorithms with unbalanced data. Additionally, RAVDESS is a reference dataset in the research community, employed in several works [34,64,65].

 

5. Discussion of potential applications of this work with supporting references is missing. It is suggested that the authors add a paragraph to discuss potential applications of their work such as use cases in ambient assisted living, user interactions with automated/semi-automated systems, and so on. Cite this paper - https://doi.org/10.3390/info12020081 related to the discussion about ambient assisted living and cite other papers to justify the interdisciplinary applications of this approach. This paragraph should also include how the work of this paper aligns with the scope of this journal. 

 

Thanks for the suggestion. We have added some extra applications of the work in the Introduction. We have also included the mentioned reference, which demonstrates an interesting real-world scenario where our system could be integrated. The paragraph on applications in Section 1 starts as follows:

Emotions play a crucial role in our life decisions. Comprehending them awakens interest due to their potential applications, since knowing how others feel allows us to interact and transmit information more effectively. With the help of an emotion recognizer, other systems could detect loss of trust or changes in emotions by monitoring people's conduct. This capability will help specific systems such as Embodied Conversational Agents (ECAs) [1,2] to react to these events and adapt their decisions to improve conversations, adjusting their tone or facial expressions to create a better socio-affective user experience [3].

Automobile safety is another important application of facial expression recognition. Detecting stress, rage, or exhaustion may be decisive in preventing traffic accidents [4] in intelligent vehicles, by allowing cars to make decisions based on the driver's current psychological state. Another use for these systems is human-machine interaction in an assisted-living experience for the elderly [5]. An emotion recognizer could monitor a person's emotional state to detect anomalies in their behavior; when an anomaly arises, it could mean that the person requires attention. The emotion recognizer could also be practical in the diagnosis of certain diseases (e.g., depressive disorders [6,7], Parkinson's disease [8], and so on) by detecting deficits in the expression of certain emotions, accelerating the diagnosis as well as the patient's treatment. Emotion recognizers will also be necessary for the future 'Next Revolution' [9], which will require the creation of social robots. These robots should know how to recognize people's emotions and convey and produce their own emotional state to build closer personal relationships with humans.

ADDED REFERENCES: 

[5] Thakur, N.; Han, C.Y. An Ambient Intelligence-Based Human Behavior Monitoring Framework for Ubiquitous Environments. Information 2021, 12, doi:10.3390/info12020081.

6. English proofreading is recommended as there are multiple sentence construction errors and grammatical errors in the paper.

Thank you for your comment. We have modified some sections of the paper to make them easier to understand, and we have also sent the document to an English philologist to review the style and grammar of the paper.

Reviewer 3 Report

Dear Authors,

Please find the attached file for my comments. Please update the paper based on the comments and resubmit it.

Best Regards

Comments for author File: Comments.pdf

Author Response

The authors greatly appreciate the time the reviewers have dedicated to understanding this paper and making suggestions to improve its next version. Their comments have helped to increase the quality of this work and to explain some ideas more clearly.

The answer to each question is written in blue below the corresponding reviewer comment, and we have also quoted most of the paragraphs added to the article to address the reviewers' concerns.

1. Please add the following acronyms in the abstract: 

    • Speech Emotion Recognition: SER 
    • Facial Emotion Recognition: FER 

Thank you for the note, we have added them in the abstract.

2. Suggested references : 

    • The extensive usage of the facial image threshing machine for facial emotion recognition performance 
    • Foreground Extraction Based Facial Emotion Recognition Using Deep Learning Xception Model 
    • Feature Vector Extraction Technique for Facial Emotion Recognition Using Facial Landmarks 

Thanks for the suggestion, we have included 2 of the references in Section 2.2:

…. According to Nguyen et al. [46] and Poulose et al. [47], landmarks encapsulate meaningful information about a person's facial expression that helps to solve automatic emotion recognition. …

… Finally, the work of H. Kim et al. [57] employs an Xception model to perform Facial Emotion Recognition …

ADDED REFERENCES: 

[47] Poulose, A.; Kim, J.H.; Han, D.S. Feature Vector Extraction Technique for Facial Emotion Recognition Using Facial Landmarks.2021 International Conference on Information and Communication Technology Convergence (ICTC),2021, pp. 1072–1076. doi:10.1109/ICTC52510.2021.9620798.

[57] Kim, J.H.; Poulose, A.; Han, D.S. The Extensive Usage of the Facial Image Threshing Machine for Facial Emotion Recognition Performance. Sensors 2021,21, doi:10.3390/s21062026.

  3. There are so many paragraphs in the Introduction section, and it is better to reduce the number of paragraphs. 

Thank you for your comment, we have reduced the size of the section, maintaining almost exclusively the applications and main contributions of our study. 

  4. Please add the paper contributions at the end of the introduction section. 

We have summarized this section a little more and added the paper contributions at the end of it, as you requested.

As a summary, the main contributions of this study are: 

    • We implemented a Speech Emotion Recognizer using an xlsr-Wav2Vec 2.0 pre-trained model on an English speech-to-text task. We analyzed the performance reached using two transfer-learning techniques: Feature Extraction and Fine-Tuning.
    • This work also incorporated visual information, which is a rarely used modality on RAVDESS ('The Ryerson Audio-Visual Database of Emotional Speech and Song') due to the difficulties associated with working with videos. However, our results showed it is a valuable source of information that should be explored to improve current emotion recognizers. We designed a Facial Emotion Recognizer using Action Units as features and evaluated them on two models: Static and Sequential.
    • To our knowledge, our study is the first that assembles the posteriors of a fine-tuned audio transformer with the posteriors extracted from the visual information of the models trained with the Action Units on the RAVDESS dataset. 
    •  We also released our code to allow the replication of our results and of our experimental set-up. In this way, we expect to create a common framework to compare contributions and model performance on the RAVDESS dataset. We decided to continue with the formulation of our previous paper, Luna-Jiménez et al. [13], which consists of a subject-wise 5-CV technique based on the eight emotions captured in the RAVDESS dataset.
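As an aside on the first contribution above, a minimal feature-extraction sketch with a pre-trained wav2vec 2.0 model from Hugging Face might look as follows; the checkpoint name, pooling strategy, and file name are illustrative assumptions rather than the authors' exact configuration:

```python
# Hedged sketch of the feature-extraction flavour of transfer learning: the pre-trained
# transformer is kept frozen and mean-pooled hidden states serve as utterance embeddings.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-large-xlsr-53"  # assumed xlsr checkpoint, not necessarily the authors'
extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2Model.from_pretrained(checkpoint).eval()

waveform, sr = torchaudio.load("03-01-05-01-02-01-12.wav")      # example RAVDESS-style file name
waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # wav2vec 2.0 expects 16 kHz audio

with torch.no_grad():
    inputs = extractor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state  # (1, frames, hidden_dim)
    embedding = hidden.mean(dim=1)              # fixed-size utterance vector for a light classifier

# Fine-tuning would instead unfreeze (part of) the transformer and train it end-to-end
# together with a classification head on the emotion labels.
```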
  5. Please reduce the length of the related work section for better understanding. 

Thanks, we have reduced some paragraphs from this section. More specifically:

    • From the introduction to the psychological theories (the introduction of Section 2), we have removed Plutchik's and Furey and Blue's theories and summarized the information on the other two in two paragraphs. 
    • From the Speech Emotion Recognition part of the related work (Section 2.1), we have moved our contribution to the introduction section and summarized some of the works.
    • From the Facial Emotion Recognition part of the related work (Section 2.2), we have summarized some of the works and added two of your suggested references. 
  6. Please define ‘RAVDESS’ when it is used for the first time.

    Thank you for the observation. We have added the definition at the first place where RAVDESS appears, which is in the contributions of the Introduction Section: 

    This work also incorporated visual information, which is a rarely used modality on RAVDESS ('The Ryerson Audio-Visual Database of Emotional Speech and Song') due to the difficulties associated with working with videos. However, our results showed it is a valuable source of information that should be explored to improve current emotion recognizers. We designed a Facial Emotion Recognizer using Action Units as features and evaluated them on two models: Static and Sequential.

  7. What is the main reason to use the Bi-LSTM model instead of a normal LSTM?

    Thanks for your question. We use a Bi-LSTM instead of an LSTM because it allows us to capture and process emotional information in both temporal directions. Additionally, some previous works have demonstrated that Bi-LSTMs outperform LSTMs in certain tasks, such as the work of Siami-Namini et al. [1]. 

    However, LSTMs could also be employed in this work and would probably perform similarly for this problem, since the main challenge in improving the visual modality probably lies elsewhere. We are currently studying several alternatives to discover why the models are not able to reach a higher accuracy, closer to human perception (a minimal Bi-LSTM sketch is included after the cited reference below).

    [1] S. Siami-Namini, N. Tavakoli and A. S. Namin, "The Performance of LSTM and BiLSTM in Forecasting Time Series," 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 3285-3292, doi: 10.1109/BigData47090.2019.9005997.
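For illustration only, a Bi-LSTM over per-frame Action Unit sequences might be sketched as below; the 17 AU inputs, hidden size, and 8-class head are assumed values, not the exact architecture reported in the paper:

```python
# Hedged sketch: a bidirectional LSTM over per-frame Action Unit intensities.
# Setting bidirectional=True is the only change with respect to a plain LSTM;
# it doubles the feature size seen by the classification head.
import torch
import torch.nn as nn

class BiLSTMEmotion(nn.Module):
    def __init__(self, n_aus=17, hidden=64, n_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(n_aus, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                # x: (batch, frames, n_aus)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # last time step -> emotion logits

logits = BiLSTMEmotion()(torch.randn(4, 120, 17))  # e.g. 4 clips of 120 frames each
```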

 

  8. Please reduce the length of the conclusion part. Please summarise the essential findings of your paper in a short paragraph. 

Thanks, we have summarized the main conclusions of our work:

Automatic emotion classification is a difficult task. Although similar patterns seem to exist, there are still many differences between individuals, even when they are actors of the same nationality who speak the same language variety, as is the case in the RAVDESS corpus.

In this paper, we proposed a multimodal system for emotion recognition based on speech and facial data.

Concerning the speech-based models, we have demonstrated that fine-tuning the pre-trained transformer outperforms the feature-extraction strategy by 25.29 points. Compared to human perception, our speech model achieves an increase of 14.82 percentage points, demonstrating the robustness of the proposed procedure for this modality. Furthermore, our proposal outperforms the solution previously proposed in [35] by 10.21 percent.

For the visual modality, the results show that the sequential model achieves the highest accuracy. The results obtained with the Static and Sequential models still fall short of the scores reached by the SER and of human capability. However, this study has revealed some issues that will be researched further in the future in order to model the dynamic nature of emotions. For example, we have discovered that some frames of a video appear to contain more important information than others, and neither the implemented temporal models nor the static ones are capable of capturing this knowledge from the Action Units.

Despite the lower performance of the visual modality with respect to the speech modality, the fusion of both sources achieves an accuracy of 86.70% in automatic emotion classification, improving on both single modalities.

In the future, we intend to improve the visual models by modifying the tested architectures or applying other transformer models. In addition, we will investigate how to extract the most relevant frames, i.e., those that carry a higher emotional load. If we succeed, we expect our models to achieve a performance closer to human perception. Finally, we will also test these strategies in real-world scenarios.

 

  9. The authors should update the writing style of the conclusion part before the next submission. 

Thank you for your comment. We have modified some sections of the paper to make them easier to understand, and we have also sent the document to an English philologist to review the style and grammar of the paper.




Round 2

Reviewer 2 Report

The authors have improved the paper significantly as per all my comments and suggestions. I do not have any additional comments at this point. I recommend this paper for publication in its current form. 

Reviewer 3 Report

Dear Authors, 

Thank you for addressing all my comments and the paper is accepted from my side. 

Best Regards
