Article
Peer-Review Record

EMOLIPS: Towards Reliable Emotional Speech Lip-Reading

Mathematics 2023, 11(23), 4787; https://doi.org/10.3390/math11234787
by Dmitry Ryumin *, Elena Ryumina and Denis Ivanko
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 25 October 2023 / Revised: 22 November 2023 / Accepted: 24 November 2023 / Published: 27 November 2023
(This article belongs to the Section Network Science)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have chosen an interesting topic for their research. The points listed below should be addressed in the revision:

1. In scientific papers, avoid singular or plural first-person forms and use the past perfect tense or past participles.

2. This work focuses on the deep learning "EMO-3DCNN-GRU" architecture for emotion recognition. Please describe the significance of the chosen DL architecture in this work.

3. The full form of each abbreviation should be given at least once in the manuscript, and it should not conflict with universally accepted abbreviated terms.

4. The configurations of the EMO-3DCNN-GRU and LIP-3DCNN-BiLSTM models are not clear, and the algorithm is not presented well.

5. Emphasize the results and discussion, and include a comparative analysis with other state-of-the-art methods.

6. Check all the papers listed in the references and cited inside the manuscript.

Comments on the Quality of English Language

Typographical and grammatical errors were found in the manuscript, so the authors should check the paper thoroughly and correct them.

Author Response

We sincerely thank the reviewer for their valuable time and effort in reviewing our manuscript. The comments are all valuable and very helpful for revising and improving our article, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with approval.

The different fonts used in this document are as follows:

  • Reviewers’ original comments are reproduced in red.
  • Plain font indicates our answers to the reviewers’ comments.
  • Text reproduced from the article is shown in blue.

 

Comment 1: In scientific papers, avoid singular or plural first-person forms and use the past perfect tense or past participles.

Reply: We thank the reviewer for this comment aimed at improving the readability of the article. We have proofread the article and taken this comment into account where possible.

 

Comment 2: This work focuses on the deep learning "EMO-3DCNN-GRU" architecture for emotion recognition. Please describe the significance of the chosen DL architecture in this work.

Reply: The choice of a deep model for emotion recognition is described in detail in lines 518-538. To date, the static model is the most effective model for the emotion recognition task [110], and the dynamic model builds on previous achievements, both by the authors of this article [109,111] and by other researchers [112,113,114]. To strengthen the validity of the model selection, we have added the following to the article:

Lines 520-521: This model has confirmed its performance in the emotion recognition task in comparison to other open-source models [110].
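For illustration, below is a minimal PyTorch sketch of a 3D-CNN + GRU emotion classifier in the spirit of the EMO-3DCNN-GRU design discussed above: a 3D convolutional feature extractor followed by a recurrent aggregation layer. The layer sizes, kernel shapes, and the class name Emo3DCNNGRU are illustrative assumptions, not the exact published configuration.

```python
# A minimal sketch of a 3D-CNN + GRU emotion classifier; all layer sizes
# and names are illustrative, not the authors' exact EMO-3DCNN-GRU setup.
import torch
import torch.nn as nn

class Emo3DCNNGRU(nn.Module):
    def __init__(self, num_emotions: int = 6, hidden: int = 256):
        super().__init__()
        # 3D convolutions capture short-term spatio-temporal facial dynamics.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the time axis
        )
        # The GRU aggregates per-frame features over the whole clip.
        self.gru = nn.GRU(64, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, time, height, width)
        feats = self.cnn3d(clip)                  # (B, 64, T, 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (B, T, 64)
        _, last = self.gru(feats)                 # final hidden state: (1, B, hidden)
        return self.classifier(last.squeeze(0))   # (B, num_emotions)

# Example: a batch of two 16-frame 88x88 face-region clips.
logits = Emo3DCNNGRU()(torch.randn(2, 3, 16, 88, 88))
```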

 

Comment 3: At least once in the manuscript, the full form of the abbreviation is listed in the paper. It should not mismatch with the universal abbreviated terms.

Reply: We have carefully reviewed and corrected all abbreviations to ensure consistent alignment of the full form with universal abbreviations throughout the article.

 

Comment 4: The configurations of the EMO-3DCNN-GRU and LIP-3DCNN-BiLSTM models are not clear, and the algorithm is not presented well.

Reply: We have added an algorithm for our EMOLIPS approach that includes all three emotional strategies, together with the comments necessary for researchers to clearly understand how EMOLIPS works; see Lines 485-499. The rationale for choosing the EMO-3DCNN-GRU and LIP-3DCNN-BiLSTM model architectures is detailed in Lines 519-538 and Lines 540-560, respectively. The model architectures are shown in Figure 6.
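To make the two-level flow concrete, the following is a minimal sketch of the routing logic as described in the reply: a first-level emotion classifier selects one of several emotion-specific lip-reading models, which then decodes the clip. The label set, the phrase-classification assumption, and the helper name emolips_predict are illustrative, not the published algorithm from Lines 485-499.

```python
# A minimal sketch of two-level EMOLIPS-style inference: route the clip by
# predicted emotion, then decode it with the matching lip-reading model.
# Labels, shapes, and names are placeholders, not the published algorithm.
from typing import Dict

import torch
import torch.nn as nn

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # illustrative label set

def emolips_predict(clip: torch.Tensor,
                    emotion_model: nn.Module,
                    lip_readers: Dict[str, nn.Module]) -> int:
    """Route a (1, C, T, H, W) clip by predicted emotion, then decode it
    with the lip-reading model trained for that emotional condition."""
    with torch.no_grad():
        # Level 1: pick the emotion with the highest logit.
        emotion = EMOTIONS[emotion_model(clip).argmax(dim=-1).item()]
        # Level 2: the emotion-specific model returns phrase-class logits.
        phrase_logits = lip_readers[emotion](clip)
    return int(phrase_logits.argmax(dim=-1).item())
```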

 

Comment 5: Emphasize the results and discussion, and include a comparative analysis with other state-of-the-art methods.

Reply: We highlight and discuss the results of the paper in the Conclusion section and also provide a comparison with existing models for both AVSR and SER tasks in the Experimental Results section. To the best of our knowledge, no work combining these two aspects has been reported in the scientific literature, and we emphasize this in our research.

A comparative analysis with other state-of-the-art methods for the AVSR and SER tasks separately was covered in our previous papers on emotion recognition [109] and audio-visual speech recognition [70]. Thus, in the presented study we refrain from repeating this evaluation, since we have already compared the methodology against the results of more than 10 different scientific groups obtained on benchmark corpora.

 

Comment 6: Check all the papers listed in the references and cited inside the manuscript.

Reply: All references cited in the article have been thoroughly checked and verified.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript titled “EMOLIPS: Towards Reliable Emotional Speech Lip-Reading” presents a two-level approach to emotional speech-to-text recognition for emotional speech lip-reading. The proposed approach uses visual speech data to determine the type of speech emotion.

The research looks like a continuation of the authors’ past works (references 109 and 110) and also a basis for future research.

The construction of the manuscript is correct, with all the needed sections, and each section’s content is well described and centered on the objective of the research. Also, the abbreviations list at the end of the manuscript is a great way to follow all the diverse features used in the research.

The article has an extensive literature review (related work) that helps explain the techniques and algorithms used for VER and lip-reading. Yet the presentation of each of them could be more focused (for example, the presentation of AV corpora of emotional speech, even though it supports the decision about which database to use, could be simplified), giving more importance to the results of the research (Method and Experimental Results occupy 5 pages, only 18.5% of the manuscript).

The extensive references help to explain the research methods, algorithms, mathematics and steps of the study.

Some small issues can be mentioned:

1. Equations (formulae) taken from other works should be cited in the references.

2. Can the use of a “simpler model architecture” (row 526) be verified for the same data?

3. In row 517, shouldn’t “c” be “z”?

Author Response

We sincerely thank the reviewer for their valuable time and effort in reviewing our manuscript. The comments are all valuable and very helpful for revising and improving our article, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with approval.

The different fonts used in this document are as follows:

  • Reviewers’ original comments are reproduced in red.
  • Plain font indicates our answers to the reviewers’ comments.
  • Text reproduced from the article is shown in blue.

 

Comment 1: Equations (formulae) taken from other works should be cited in the references.

Reply: Equations and formulae from other works have been duly referenced.

 

Comment 2: Can the use of a “simpler model architecture” (row 526) be verified for the same data?

Reply: For the lip-reading task, a reference model based on 3DResNet-18+BiLSTM was proposed and tested on the large and widely used LRW corpus. In our previous work, we proposed our own modification of the above model and also achieved state-of-the-art results [70]. In this article, motivated by our previous results, we compare the new model with a model based on the simpler 3DResNet-18 and achieve an accuracy gain of 23.1% on the CREMA-D corpus. Here, we focus on building a two-level approach to emotional visual speech recognition on two research corpora using well-established models based on the latest advances. Performing additional experiments is beyond the scope of this article.
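For reference, a baseline of the kind mentioned above can be sketched from torchvision’s generic r3d_18 backbone followed by a BiLSTM head. The pooling scheme, hidden size, and the class name LipReader3DResNetBiLSTM are our assumptions for illustration, not the evaluated model.

```python
# A minimal sketch of a 3DResNet-18 + BiLSTM lip-reading baseline built on
# torchvision's generic r3d_18 backbone (torchvision >= 0.13 for `weights=`).
# Pooling, hidden size, and the phrase-classification head are assumptions.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class LipReader3DResNetBiLSTM(nn.Module):
    def __init__(self, num_phrases: int, hidden: int = 256):
        super().__init__()
        backbone = r3d_18(weights=None)
        # Drop the global pool and classifier; keep spatio-temporal features.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # pool space, keep time
        self.bilstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_phrases)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, time, height, width)
        feats = self.pool(self.backbone(clip))    # (B, 512, T', 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (B, T', 512)
        out, _ = self.bilstm(feats)               # (B, T', 2 * hidden)
        return self.head(out.mean(dim=1))         # average over time

# Example: one 16-frame 112x112 mouth-region clip, 10 phrase classes.
logits = LipReader3DResNetBiLSTM(num_phrases=10)(torch.randn(1, 3, 16, 112, 112))
```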

 

Comment 3: In row 517, shouldn’t “c” be “z”?

Reply: There was indeed a mistake. We corrected it.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper proposes the EMOLIPS approach for automatic emotional speech lip-reading, which uses visual speech data to determine the type of speech emotion. The speech data are then processed by one of the emotional lip-reading models trained from scratch. This essentially resolves the multi-emotional lip-reading issue associated with most real-life scenarios. The extensive results confirm the effectiveness of the proposed method. The paper is well written and easy to follow. It meets the standard for publication.

Author Response

We sincerely thank the reviewer for their valuable time and effort in reviewing our manuscript. The comments are all valuable and very helpful for revising and improving our article, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with approval.

The different fonts used in this document are as follows:

  • Reviewers’ original comments are reproduced in red.
  • Plain font indicates our answers to the reviewers’ comments.
  • Text reproduced from the article is shown in blue.

 

Comment: The paper proposes the EMOLIPS approach for automatic emotional speech lip-reading, which uses visual speech data to determine the type of speech emotion. The speech data are then processed by one of the emotional lip-reading models trained from scratch. This essentially resolves the multi-emotional lip-reading issue associated with most real-life scenarios. The extensive results confirm the effectiveness of the proposed method. The paper is well written and easy to follow. It meets the standard for publication.

Reply: Thank you for your positive review. We appreciate your acknowledgment of the EMOLIPS approach and its effectiveness in resolving multi-emotional lip-reading challenges. Your comments on the clarity and quality of the paper are valued.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Dear Authors

The paper considers visual emotion recognition, which is very interesting. I feel that inserting some flowcharts or algorithms would improve the quality of the paper. In its current form, it is hard for readers to find the algorithm or procedure of your method, whereas adding algorithms and flowcharts would make the paper more convenient for readers.

Sincerely yours

Author Response

We sincerely thank the reviewer for their valuable time and effort in reviewing our manuscript. The comments are all valuable and very helpful for revising and improving our article, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope will meet with approval.

The different fonts used in this document are as follows:

  • Reviewers’ original comments are reproduced in red.
  • Plain font indicates our answers to the reviewers’ comments.
  • Text reproduced from the article is shown in blue.

 

Comment 1: I feel that inserting some flowcharts or algorithms would improve the quality of the paper. In its current form, it is hard for readers to find the algorithm or procedure of your method, whereas adding algorithms and flowcharts would make the paper more convenient for readers.

Reply: We have added an algorithm for our EMOLIPS approach that includes all three emotional strategies, together with the comments necessary for readers to clearly understand how EMOLIPS works. See Lines 485-499.

Author Response File: Author Response.pdf
