Peer-Review Record

Whisper40: A Multi-Person Chinese Whisper Speaker Recognition Dataset Containing Same-Text Neutral Speech

Information 2024, 15(4), 184; https://doi.org/10.3390/info15040184

by Jingwen Yang and Ruohua Zhou *

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 28 February 2024 / Revised: 26 March 2024 / Accepted: 26 March 2024 / Published: 28 March 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In this paper, the authors introduce a new database for whispered speech in the Chinese language.

1) The English language usage is good, but some sentences need to be shortened or rephrased for better readability. The paper is poorly organized, and some sections are very difficult to follow. For example, it is difficult to tell which data are used to pre-train the networks and which are used to fine-tune them.

2) The literature review is poor. The authors should also include studies related to normal/whisper speech classification, whisper-normal conversion and so on.

3) Table captions are not self-explanatory. For example, Table 6 simply says "EER results in cross-scenario". It would be clearer if the authors expanded the captions to include more details, for example, about the databases.

4) Regarding the wTIMIT data, the authors say "However, this dataset is not suitable for conducting WSR-related research due to the difficulty of obtaining it and the complexity of the recording environment". This claim is not valid. There are works that have used this database for various whispered-speech tasks.

For example, refer to this paper.

Patel, M., et al. "Novel Inception-GAN for Whisper-to-Normal speech conversion". In Proceedings 10th ISCA Workshop on Speech Synthesis (SSW 10) 20, 2020.

5) Based on Fig. 3, the authors say that there is no fundamental frequency in whispered speech. Yet they use a new data augmentation method, audio rewinding, which changes the pitch and rhythm of the original audio, to expand the data. How is changing the pitch a valid transformation when no pitch is present?

Comments on the Quality of English Language

The English language usage is good, but some sentences need to be shortened or rephrased for better readability.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

The authors present a novel whisper speech data set with fixed utterances for speaker recognition. They also test methods and deep networks that were trained for speaker recognition based on normal speech on this whisper data set and explore their efficiency.

The paper has the merit of the newly proposed dataset and the innovation of exploring networks that were working for normal speech to work in the case of whispered speech. The innovations are minimal, but the reviewer believes it is worth publishing for the first two reasons. The paper in general is very well written.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

Strengths

1. The paper introduces the creation of the Whisper40 dataset, which addresses the lack of large-scale and standardized whisper speech datasets.

2. The use of transfer learning to enhance the performance of the speaker recognition model is a strength as it leverages pre-existing knowledge to improve recognition accuracy.

3. Incorporating advanced speaker recognition models into the WSR system demonstrates a commitment to leveraging cutting-edge technology.

Weaknesses

1. The paper mentions the poor performance of the WSR system due to the small scale of training data, which could limit the generalizability of the results.

2. While the paper introduces new models and methods, there may be a lack of comprehensive comparative analysis with existing approaches in the field.

3. The paper may benefit from providing more detailed evaluation metrics to accurately assess the WSR system's performance.

Methodological Gaps

1. The paper could provide more details on the data collection process, including participant selection criteria and the recording environment.

2. It would be beneficial to include a comparison with baseline models or existing datasets to showcase the improvement achieved by Whisper40.

Additional Technical Issues

1. The paper could address the interpretability of the advanced models introduced in the WSR system to understand how they contribute to the recognition accuracy.

2. It would be valuable to include robustness testing to evaluate the performance of the WSR system under different noise conditions or variations in speech patterns.

Comments on the Quality of English Language

English is fine.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have done a good job in improving the manuscript. However, I still have the following concerns.

1. The justification for the need for a new database is not satisfactory. The numbers of speakers and utterances in the new database are almost the same as those in the CHAINs database, but far smaller than those in the wTIMIT dataset. The present unavailability of wTIMIT is not a valid drawback, since it may become available again in the near future. Claiming the current unavailability of the databases, or the small number of speakers in existing datasets, as justification is not valid. Please discuss specific advantages instead. For example, the other two databases were recorded from English speakers, so consider the need for a corpus in the Chinese language, i.e., focus on linguistic differences that may have a potential impact on WSR.

2. By tonality change, I understand a change in the quality of voice. Since whisper is itself a voice quality, augmenting it with a different voice quality does not seem appropriate. Since there is no discussion of the new audio rewinding technique, I could not understand the rationale behind it. It is also not clear whether the technique is being introduced by the authors or is an existing one (there is no reference to any previous work).

Simpler and better data augmentations, which I would prefer, are time-shifting, time-stretching, and/or noise injection, none of which poses a considerable threat to whisper-specific characteristics.
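As a rough illustration of two of the augmentations suggested above, the following is a minimal sketch in Python with NumPy. It is not the authors' pipeline: the function names, the 16 kHz sample rate, the 100 ms shift, and the 20 dB SNR are assumptions chosen only for the example. Time-stretching is left out because a faithful version that preserves the spectrum needs a phase vocoder or a similar method rather than simple resampling.

```python
import numpy as np


def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples (positive = delay)."""
    return np.roll(signal, shift)


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(signal ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Stand-in for one second of whispered audio at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
x = rng.normal(size=16000)

shifted = time_shift(x, 1600)               # delay by 100 ms (assumed value)
noisy = add_noise(x, snr_db=20.0, rng=rng)  # 20 dB SNR (assumed value)
```

Neither operation touches the spectral envelope of the utterance: the shift only moves the samples in time, and the noise is additive with a controlled level, which is why such transformations are less likely to distort whisper-specific cues than a pitch-oriented one.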

Comments on the Quality of English Language

There is a considerable improvement in the usage of the English language. The authors need to correct a few typos and grammatical errors.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have satisfactorily replied to the concerns raised and complemented the manuscript accordingly.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have sufficiently addressed my comments. I have no further concerns.
