Article
Peer-Review Record

Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures

Appl. Sci. 2024, 14(10), 4199; https://doi.org/10.3390/app14104199
by Fazliddin Makhmudov 1, Alpamis Kultimuratov 2 and Young-Im Cho 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 19 April 2024 / Revised: 7 May 2024 / Accepted: 12 May 2024 / Published: 15 May 2024
(This article belongs to the Special Issue Advanced Technologies for Emotion Recognition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript presents a novel approach to multimodal emotion recognition by integrating speech and text modalities, leveraging attention mechanisms within BERT and CNN architectures. The research has been tested on two datasets, CMU-MOSEI and MELD, with impressive results. However, some issues remain that warrant improvement:

1.      In the abstract, after introducing the background of the study, it is suggested to add a sentence on the shortcomings of existing methods and the problems this paper aims to address.

2.      In the keywords, it is suggested to add "attention mechanism" and "cross-modal emotion recognition".

3.      In the Literature Review, it is suggested to add secondary headings to distinguish the literature of different research directions.

4.      Some figures are not clear; it is suggested to replace them with higher-resolution versions and to make the text within the figures similar in size to the main text.

5.      It is suggested to add mathematical formulas to the Speech module and Text module sections to describe these two modules formally.

6.      The formulas should be centered; for example, formulas (1), (2), and (3) are not centered.

7.      It is suggested to add ablation experiments to demonstrate the validity of the innovations.

8.      It is suggested to add visualization diagrams to illustrate the superiority of the results more intuitively.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper introduces a novel multimodal emotion recognition approach that integrates speech and text modalities using CNNs and a BERT-based model, respectively. The authors utilize Mel spectrograms for speech analysis and bidirectional layers of BERT for text, with an attention-based mechanism to fuse the modal outputs effectively. They conducted testing on the CMU-MOSEI and MELD datasets, achieving accuracy rates of 88.4% and 67.81%, respectively. Though the studied problem is interesting, the following concerns need to be addressed to improve the work.

1.         I would suggest providing some examples to visualize the feature representations learned by the hidden layers of the proposed deep learning method. This can help readers better understand how the method performs feature learning.

2.         I would like to see an ablation study of the deep learning structure that shows the importance of each element of the model.

3.         What about the computational complexity and convergence? It would be better to include an analysis of both.

4.         In a separate paragraph, please provide some remarks to further discuss the proposed method; for example, what are its main advantages and limitations in comparison with existing methods?

5.         Emotion recognition using neuroimaging has also been a promising direction. The authors may review some of the relevant work, for example: "A survey on deep learning based non-invasive brain signals: recent advances and new frontiers" and "Sparse Bayesian learning for end-to-end EEG decoding".

6.         The authors may briefly discuss the potential limitations of the proposed method and the future research directions of this study. How can other researchers build on this study to continue this line of research?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed my comments. The revision can be considered for publication.
