Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript presents a novel approach to multimodal emotion recognition that integrates the speech and text modalities, leveraging attention mechanisms within BERT and CNN architectures. The approach has been tested on two datasets, CMU-MOSEI and MELD, with impressive results. However, several issues remain, and the following improvements are suggested:
1. In the abstract, after introducing the background of the study, it is suggested to add a sentence stating the shortcomings of existing methods and the specific problems this paper aims to address.
2. In the keywords, it is suggested to add "attention mechanism" and "cross-modal emotion recognition".
3. In the Literature Review, it is suggested to add secondary headings to distinguish the literature of different research directions.
4. Some figures are not clear; it is suggested to replace them with higher-resolution versions and to make the text within the figures similar in size to the text in the main body.
5. It is suggested to add mathematical formulas to the Speech module and Text module sections to describe these two modules formally (see the example equation after this list).
6. The formulas should be centered; for example, formulas (1), (2), and (3) are not centered.
7. It is suggested to add ablation experiments to demonstrate the validity of the innovations.
8. It is suggested to add visualizations to illustrate the superiority of the method more intuitively.
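As a purely illustrative example of the formal notation suggested in point 5 (not the authors' actual formulation), the scaled dot-product attention that underlies the attention mechanisms discussed in the manuscript is conventionally written as:

```latex
% Standard scaled dot-product attention (Vaswani et al., 2017), shown only as
% an example of the kind of formal module description suggested above.
% Q, K, V are the query, key, and value matrices; d_k is the key dimension.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V
```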
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The paper introduces a novel multimodal emotion recognition approach that integrates the speech and text modalities using CNNs and a BERT-based model, respectively. The authors use Mel spectrograms for speech analysis and the bidirectional layers of BERT for text, with an attention-based mechanism to fuse the modal outputs effectively. Testing on the CMU-MOSEI and MELD datasets shows accuracies of 88.4% and 67.81%, respectively. Although the studied problem is interesting, the following concerns need to be addressed to improve the work.
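To make the fusion design concrete before the numbered comments, here is a minimal sketch of the kind of attention-based fusion of CNN speech features and BERT text features described above. It assumes PyTorch, and all module names and dimensions (speech_dim, text_dim, fused_dim, and the choice of 7 emotion classes as in MELD) are illustrative assumptions, not the authors' actual implementation:

```python
# Illustrative sketch only; architecture details are assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse CNN speech features with BERT text features via cross-modal attention."""
    def __init__(self, speech_dim=128, text_dim=768, fused_dim=256, num_classes=7):
        super().__init__()
        # Project both modalities into a shared space (dimensions assumed).
        self.speech_proj = nn.Linear(speech_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        # Text queries attend over speech keys/values (cross-modal attention).
        self.attn = nn.MultiheadAttention(embed_dim=fused_dim, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(fused_dim * 2, num_classes)

    def forward(self, speech_feats, text_feats):
        # speech_feats: (batch, T_s, speech_dim), e.g. CNN maps over Mel frames
        # text_feats:   (batch, T_t, text_dim), e.g. BERT final hidden states
        s = self.speech_proj(speech_feats)
        t = self.text_proj(text_feats)
        attended, _ = self.attn(query=t, key=s, value=s)
        # Mean-pool both streams and classify on their concatenation.
        fused = torch.cat([attended.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Usage with random stand-in features:
model = AttentionFusion()
speech = torch.randn(2, 100, 128)  # 100 spectrogram frames per utterance
text = torch.randn(2, 32, 768)     # 32 BERT tokens per utterance
logits = model(speech, text)       # shape: (2, 7)
```

Letting the text stream query the speech stream is one common way to realize cross-modal attention; the authors' fusion may differ in its details.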
1. I would suggest providing examples that visualize the feature representations learned by the hidden layers of the proposed deep learning method. This would help readers better understand how the method learns features.
2. I would like to see an ablation study of the deep learning architecture that shows the importance of each element of the model.
3. What are the computational complexity and convergence behavior of the method? An analysis of both should be included.
4. A separate paragraph is needed with remarks that further discuss the proposed method: for example, what are its main advantages and limitations compared with existing methods?
5. Emotion recognition using neuroimaging has also been a promising direction. The authors may review some of the relevant work, for example: "A survey on deep learning based non-invasive brain signals: recent advances and new frontiers" and "Sparse Bayesian learning for end-to-end EEG decoding".
6. The authors should briefly discuss the potential limitations of the proposed method and the future research directions of this study. How can other researchers build on this study to continue this line of research?
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed my comments. The revision can be considered for publication.