Article
Peer-Review Record

Combining wav2vec 2.0 Fine-Tuning and ConLearnNet for Speech Emotion Recognition

Electronics 2024, 13(6), 1103; https://doi.org/10.3390/electronics13061103
by Chenjing Sun 1, Yi Zhou 2, Xin Huang 1,*, Jichen Yang 3,* and Xianhua Hou 1
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 26 January 2024 / Revised: 11 March 2024 / Accepted: 15 March 2024 / Published: 17 March 2024
(This article belongs to the Special Issue New Advances in Affective Computing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents a neural architecture for extracting emotion-related features for speech emotion recognition. The proposed network is based on the pretrained W2V2 model and adapted for the classification task.

While the paper is well-structured and the results are well detailed, it lacks some crucial details.

The training of the model contains two phases: a representation-learning phase called "emotion embedding" and a classification phase called "ConLearnNet training". Although the training configuration is well documented, the training pipeline is missing, particularly for the first phase. How is the loss function calculated from the output of the FC layer and the softmax mentioned in line 182? What are the possible output classes of this network? What is the training set for this first phase? What is the accuracy of this model on the classification task?
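To make the question concrete, the kind of pipeline detail being requested could be sketched as follows. This is a hypothetical illustration only: the FC + softmax head is described in the paper, but the dimensions, class set, and weights below are invented for the example.

```python
import math

# Hypothetical sketch of the classification head the review asks about:
# one fully-connected (FC) layer mapping a pooled W2V2 embedding to class
# logits, followed by a softmax over the emotion classes. All values here
# are toy assumptions, not taken from the paper.

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed 4-class setup

def fc_softmax_head(embedding, weights, bias):
    """FC layer + softmax: returns one probability per emotion class."""
    logits = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 3-dimensional "embedding", 4 classes.
emb = [0.2, -0.1, 0.5]
W = [[0.1] * 3, [0.2] * 3, [-0.1] * 3, [0.0] * 3]
b = [0.0, 0.0, 0.0, 0.0]
probs = fc_softmax_head(emb, W, b)
```

The cross-entropy loss would then be computed against these probabilities, which is exactly the step the review asks the authors to spell out.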

Line 313: The stopping criterion of the training process needs more clarification. The authors state: "The training is stopped when the accuracy reaches 100% ...". How was this accuracy calculated, and which samples were taken into account for this evaluation?

Line 387: this model and its training need more details. Is it VFT? Were the parameters of W2V2 frozen during training? What are the differences from your W2V2 fine-tuning?

The literature review and the state of the art can be improved by adding more papers on SSL models and on fine-tuning pre-trained models. The novelty of the paper should be highlighted by comparison with papers that use a contrastive loss and a similar approach. Also, Section 4.3.5 can be improved by including other state-of-the-art results on the IEMOCAP dataset.
Here are some suggested studies to include in the literature review and Table 5:
- Wang, X., Zhao, S., Qin, Y. (2023) Supervised Contrastive Learning with Nearest Neighbor Search for Speech Emotion Recognition. Proc. INTERSPEECH 2023, 1913-1917
- Pepino, L., Riera, P., Ferrer, L. (2021) Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. Proc. Interspeech 2021, 3400-3404
- Morais, E., Hoory, R., Zhu, W., Gat, I., Damasceno, M., & Aronowitz, H. (2022). Speech emotion recognition using self-supervised features. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6922-6926). IEEE. 

Some minor remarks and proposition:
- Line 338: Does the term "our representation" refer to the emotion embedding in the W2V2 fine-tuning framework?
- The paragraph at line 40 reviewing the data-augmentation literature is disconnected from the rest of the text. Did the authors use data augmentation in their experiments?
- The terms defined as VFT and PFT are abandoned in the rest of the paper, although in places such as Section 2.2 or line 387 they could be used to avoid ambiguity.
- The class distribution of the training set could be added to the confusion-matrix analysis to give a clearer view of the problem.
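As a hypothetical illustration of this last remark, the class distribution could be reported next to the confusion matrix so that low per-class recall can be read against class imbalance. The labels below are invented for the example, not taken from the paper.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: fraction of samples} for a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in sorted(counts)}

def confusion_matrix(y_true, y_pred, classes):
    """Rows = true class, columns = predicted class, raw counts."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

# Toy example with an assumed 4-class setup and invented labels.
classes = ["angry", "happy", "neutral", "sad"]
y_true = ["angry", "happy", "happy", "sad", "neutral", "happy"]
y_pred = ["angry", "happy", "neutral", "sad", "neutral", "happy"]
dist = class_distribution(y_true)   # e.g. "happy" makes up half the set
cm = confusion_matrix(y_true, y_pred, classes)
```

Presenting `dist` alongside `cm` makes it immediately visible whether a weak diagonal entry coincides with an under-represented class.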

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The topic of the article is relevant and has practical applications.

The article however needs to address the following comments:

1. Expand abbreviations on first usage.

2. The abstract and introduction should clearly state the problem being addressed; in the current article this is vague.

3. Why were the two datasets chosen in different languages? One dataset is in English and the other in German. Different languages can have different features that represent emotions.

4. As per Table 5, the proposed work shows marginal improvement over the latest work available in the literature. What additional computational complexity is incurred to achieve this marginal improvement in performance?

5. The same applies to Table 9. What additional computational complexity is incurred to achieve this marginal improvement in performance?

6. Clearly, the accuracy on the English dataset is much lower than on the German dataset. Please provide an appropriate engineering reason for this, beyond simply saying that "it may be due to the smaller dataset".

Comments on the Quality of English Language

English is fine. Minor editing and typo corrections are required.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper examines speech emotion recognition, using a feed-forward network with skip connections (SCFFN) to fine-tune wav2vec 2.0, and then ConLearnNet for emotion classification. Experimental results are reported on the IEMOCAP and EMO-DB datasets.

The research is appropriate and appears well done. The paper is well organized and presented. Results are reasonable.

Specific points:

..computers struggle to capture .. - it is rather the human programmers who struggle to program them…

 

..heir strength of capturing intricate patterns.  - perhaps phrase this better

 

..While, it also introduces certain drawbacks, .. ->

..However, it also introduces certain drawbacks, ..

 

..inclusion of "dirty data", .. - this needs explaining

 

..Yue et al. [12] applie global .. ->

..Yue et al. [12] applied global ..

 

..classification component consists of a FC layer and a softmax layer.  - not all SER is so simple

 

..FC layer adjusts the dimensionality of the learned feature representations to match the number of classes, .. - I think that this is not correctly expressed

 

..applying pre-trained W2V2 model .. ->

..applying a pre-trained W2V2 model ..

 

..In which, the contrastive learning .. ->

..The contrastive learning ..

 

..exhibit discirminative representations .. ->

..exhibit discriminative representations ..

(This error occurs multiple times…)

 

..by combing the tuned W2V2 .. ->

..by combining the tuned W2V2 ..

 

..and Berlin emotional database .. ->

..and the Berlin emotional database ..

 

..by fine-tuning W2V2 model.

..by fine-tuning the W2V2 model.

 

..fine-tuning pre-trained W2V2 model .. ->

..fine-tuning the pre-trained W2V2 model ..

 

..As shown the Fig. 2, .. ->

..As shown in Fig. 2, ..

 

..As shown in Fig. 3.  - attach this to an adjacent sentence

 

  ..classify the leaned features. ->

  ..classify the learned features.

 

..information with long-distance interval [19].

..information with a long-distance interval [19].

 

..more suitable feature for final SER. ->

..more suitable features for final SER.

 

..set of all positive called positives.  - confusing

 

..Sail Lab at the University .. ->

..SAIL Lab at the University ..

 

..we use speaker-independent 10-fold .. ->

..we use the speaker-independent 10-fold ..

 

..As show in Fig. 3, ConLearnNet .. ->

..As shown in Fig. 3, ConLearnNet ..

 

Macaron, .. - cite or explain

..w/o .. -> without (no need to abbreviate)

 

..results on IEMOCAP dataset ..

..results on the IEMOCAP dataset ..

 

..emotion embedding. In which, .. ->

..emotion embedding, in which, ..

 

..results, confusion matrix on IEMOCAP dataset is .. ->

..results, the confusion matrix for the IEMOCAP dataset is ..

(This omission of “the” in this context (“on the X corpus…”) appears often…)

 

..system for each type emotion, ..

..system for each type of emotion, ..

 

..performed on IEMOCAP database ..

..performed on the IEMOCAP database ..

 

..using the macaron structure .. - be consistent in capitalizing

 

..is the optimal model for SER, .. - optimal among the set examined

 

..better adapted to classification task.

..better adapted to the classification task.

 

..Wherein constrictive learning is used to ..

.. Constrictive learning is used to ..

 

..commonly used features such as 3-D log-Mel and W2V2, .. - W2V2 is a system, not a feature

 

No need to repeat “ICASSP” and the date 2-3 times, as in refs. 9, 11, 18,…

 

Comments on the Quality of English Language

Fine, except for numerous small grammatical errors, mostly omissions of the word "the".

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Since the main contribution of the paper is the emotion-embedding extraction, an ablation experiment on the first training phase needs to be added to show the importance of this phase.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

The article has been revised to address all the comments and suggestions satisfactorily. 

Author Response

Thank you very much for your review.
