Article
Peer-Review Record

Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language

Appl. Sci. 2024, 14(19), 9043; https://doi.org/10.3390/app14199043
by Vasile Păiș *, Verginica Barbu Mititelu, Elena Irimia, Radu Ion and Dan Tufiș
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 21 August 2024 / Revised: 20 September 2024 / Accepted: 2 October 2024 / Published: 7 October 2024
(This article belongs to the Special Issue Natural Language Processing in the Era of Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Let us express our interest in your work in such an important area of study as natural language processing, speech synthesis, and data preparation. In your work, dedicated to the study of the Romanian language, you set out to create a dataset of Romanian speakers that includes utterances from people of different ages and genders. According to your study, you conducted the research on publicly available data and verified that the information you use is public, which makes your dataset more open for use by other researchers in the area. Since your work focuses mostly on the creation of the dataset and the related methodology, let us focus on aspects that may improve the presentation of your results and findings and make the data easier to navigate.

Since your paper has no critical issues that would require substantial edits, I add some suggestions to consider for improving it.

1. In Section 2 it would be better to move Table 1 below line 59; however, I leave this to your consideration, since in your case it may improve the presentation of the language studies in related work.

2. While you mention the annotation software you use to annotate the language corpora, it would have been better to also mention and reference other known annotation software in the text.

3. Many of your tables have “open space” that could be filled with metadata, commentary, disambiguation or other useful information that illustrates your dataset. Although it is not obligatory to add more text, please consider adding some additional columns to your tables.

4. If you add information on your dataset regarding YouTube videos or other social media, please consider (in future studies) adding some information on the audience coverage of the videos, engagement, and categories (news, entertainment videos, TV shows, nature, history, travel). This may show the representation not only of age or gender but also of different kinds of videos across the Romanian-language social media you study, and it could be useful for statistics.

5. While your work mostly concerns dataset or corpus creation, you briefly mention the task of evaluating the dataset, which is used for TTS purposes to aid the synthesis of utterances in Romanian. You could have explained, for instance, how models like Whisper are useful for evaluating your dataset, why you chose them, and perhaps added some discussion of them. Since this is not obligatory, I leave it to your consideration or to future studies.

Conclusions are the following: the paper has the necessary sections, a background, a description of the dataset and its evaluation, and a vast number of sources related to the study. The paper has no critical issues; however, I have some concerns, which I mentioned in the remarks. In general, they can be summarized as adding more information on some aspects of the study, which may improve the overall impression of your paper. Since I consider them non-critical, I leave it to your consideration whether to include them in this study or to take them into account in future studies, for example when testing the dataset against NLP algorithms.

Author Response

Dear reviewer,
Thank you for taking the time to read our paper and to make insightful comments. We took into account all the comments and suggestions and we feel that our paper has been improved as a result.

Comment 1. In Section 2 it would be better to move Table 1 below line 59; however, I leave this to your consideration, since in your case it may improve the presentation of the language studies in related work.
Response 1: The table was moved down.

Comment 2. While you mention the annotation software you use to annotate the language corpora, it would have been better to also mention and reference other known annotation software in the text.
Response 2: We referenced other annotation software, such as NLP-Cube, RNNTagger, Stanza.

Comment 3. Many of your tables have “open space” that could be filled with metadata, commentary, disambiguation or other useful information that illustrates your dataset. Although it is not obligatory to add more text, please consider adding some additional columns to your tables.
Response 3: We re-organized Tables 4, 5 and added more columns in Tables 3 and 4 to reduce the open space. 

Comment 4. If you add information on your dataset regarding YouTube videos or other social media, please consider (in future studies) adding some information on the audience coverage of the videos, engagement, and categories (news, entertainment videos, TV shows, nature, history, travel). This may show the representation not only of age or gender but also of different kinds of videos across the Romanian-language social media you study, and it could be useful for statistics.
Response 4: We added a mention in Section 3 (Methodology) with regard to the diverse set of domains and number of views, as well as the fact that we did not perform specific filtering based on these criteria. We further added a mention in Conclusions with regard to future studies.

Comment 5. While your work mostly concerns dataset or corpus creation, you briefly mention the task of evaluating the dataset, which is used for TTS purposes to aid the synthesis of utterances in Romanian. You could have explained, for instance, how models like Whisper are useful for evaluating your dataset, why you chose them, and perhaps added some discussion of them. Since this is not obligatory, I leave it to your consideration or to future studies.
Response 5: The article mentions that the dataset is useful especially for evaluating ASR. As suggested, we added an explanation of why evaluating against Whisper and other state-of-the-art models is relevant: “Evaluating the performance of state-of-the-art pre-trained models on underrepresented speech is relevant in order to understand the necessity of including more such data in the training of new models or in the fine-tuning process. If the model's performance on the USPDATRO dataset is significantly lower compared to regular data, this is a strong indication that a larger underrepresented speech dataset is needed to be included in the model's training.” We further added a future work possibility mentioning the applicability of the dataset for TTS: “Furthermore, even though our work focused on the speech recognition task (ASR), gathering more underrepresented speech may enhance the capabilities of text to speech (TTS) synthesis models for such voices.”

Reviewer 2 Report

Comments and Suggestions for Authors

1. When collecting samples from YouTube and other social media platforms, it is important to note that the content on these platforms is often quite informal and spoken. Have the authors implemented any form of selection or filtering in this regard?

2. In lines 195-200, the authors mention "Subtitle Edit..."; please elaborate. For example, explain how the subtitles were obtained from the video and whether they were generated by AI or provided by humans.

3. Tables 3 and 4 could be represented as bar charts, which could also indicate whether the sample follows a normal distribution.

4. It is not recommended to use cross-references and citations in the conclusion.

Author Response

Dear reviewer,
Thank you for taking the time to read our paper and to make insightful comments. We took into account all the comments and suggestions and we feel that our paper has been improved as a result.

Comment 1. When collecting samples from YouTube and other social media platforms, it is important to note that the content on these platforms is often quite informal and spoken. Have the authors implemented any form of selection or filtering in this regard?
Response 1: We clarified the content selection methodology: “The speech content and style highly varies on these platforms, with many recordings being spontaneous and informal. We did not perform a pre-selection of the content based on style characteristics when searching for content. However, the collected samples were later classified by spontaneity and quality, as indicated in Section 5 and Table 3.”

Comment 2. In lines 195-200, the authors mention "Subtitle Edit..."; please elaborate. For example, explain how the subtitles were obtained from the video and whether they were generated by AI or provided by humans.
Response 2: We clarified the manual annotation of the content: “The actual transcription of the audio tracks was performed manually, by the human annotators, using the Subtitle Edit application and transcriptions were saved in CSV format files with the same ID as the transcribed video file. In order to ensure the high quality of the transcripts, no automated technologies were used at any point of the work. In the transcription process, manual subtitle segmentation was carried out at sentence level”

Comment 3. Tables 3 and 4 could be represented as bar charts, which could also indicate whether the sample follows a normal distribution.
Response 3: We added Figures 1 and 2. In Figure 2 we used a doughnut representation, including both genders, hence the gender information was not included in Figure 1. 

Comment 4. It is not recommended to use cross-references and citations in the conclusion.
Response 4: Following re-organization, the Conclusion no longer contains cross-references and citations.

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors, we read the paper with interest and have the following comments:

1. We realize that it is a lot of work to annotate a corpus of recorded speech. But at this moment there are tools available (Maestra) to transcribe Romanian audio to text using AI technology.

2. In the research, 4 experienced annotators were used. We were wondering what the agreement between the annotators was, measured with Krippendorff's alpha or a similar scale.

3. The goal of the paper was to record speech from usually underrepresented speakers. We were wondering what the impact of different recordings on speech recognition is. So the question is whether (additional) training for special under-sampled speaker groups is necessary.

4. In many countries, time and effort have been spent to create an extensive open-domain corpus for the national language. There is still a need for special groups of speakers with respect to age, speech disabilities and regional dialects. All these topics are not discussed and the impact is not proven.

5. The scientific aspects in the paper are not clear.

6. We recommend that the authors show the impact and use of the created database.

Author Response

Dear reviewer,
Thank you for taking the time to read our paper and to make insightful comments. We took into account all the comments and suggestions and we feel that our paper has been improved as a result.

Comment 1. We realize that it is a lot of work to annotate a corpus of recorded speech. But at this moment there are tools available (Maestra) to transcribe Romanian audio to text using AI technology.
Response 1: We are aware of tools and models for automatic transcription (we also mention a couple of models for the Romanian language). However, as we argue throughout the article, existing models and tools perform badly on underrepresented speech. Thus our dataset and method aim to improve such models and tools by providing a way to gather additional resources for underrepresented speech categories.

Comment 2. In the research, 4 experienced annotators were used. We were wondering what the agreement between the annotators was, measured with Krippendorff's alpha or a similar scale.
Response 2: We added a note at the end of the Methodology section stating that there were no discrepancies between the annotations for the segments that were double annotated.
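
As a rough illustration of the agreement measure the reviewer asks about, the following minimal Python sketch computes Krippendorff's alpha for nominal labels. It is not taken from the paper: the function name `krippendorff_alpha_nominal` and the example labels are hypothetical, and it assumes each double-annotated item carries a categorical label rather than a free-text transcript.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of items; each item is the list of labels assigned
    to it by the annotators who rated it (None marks a missing rating).
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by different annotators, weighted by 1 / (ratings - 1).
    coincidences = Counter()
    for labels in units:
        labels = [label for label in labels if label is not None]
        m = len(labels)
        if m < 2:
            continue  # items rated only once carry no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label value and overall number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item rated by two annotators")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Hypothetical labels from two annotators on four double-annotated segments.
full_agreement = [["F", "F"], ["M", "M"], ["F", "F"], ["M", "M"]]
print(krippendorff_alpha_nominal(full_agreement))
```

Under these assumptions, annotations without any discrepancy, as reported in the response, give the maximum value of 1.0, provided more than one label value occurs in the data.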

Comment 3. The goal of the paper was to record speech from usually underrepresented speakers. We were wondering what the impact of different recordings on speech recognition is. So the question is whether (additional) training for special under-sampled speaker groups is necessary.
Response 3: The impact of the new dataset on speech recognition is presented in Table 6 (on the Whisper state-of-the-art model the WER increased from 0.1261 to 0.4330) and Table 7 (WER increases from 0.4052 to 0.4874). We further explained it in text: “Evaluating the performance of state-of-the-art pre-trained models on underrepresented speech is relevant in order to understand the necessity of including more such data in the training of new models or in the fine-tuning process. If the model's performance on the USPDATRO dataset is significantly lower compared to regular data, this is a strong indication that a larger underrepresented speech dataset is needed to be included in the model's training.” and “the best results on the USPDATRO dataset are below the CoRoLa based results by 8% WER. This indicates the need to include underrepresented speech during model training or fine-tuning.” 
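
For reference, WER is the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below is a minimal illustration in Python, not the evaluation code used for the paper: the function name `word_error_rate` and the example sentences are hypothetical, while the four WER values are copied from the response above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four words.
example_wer = word_error_rate("a b c d", "a x c")

# WER values quoted in the response (lower value first, higher value second).
reported = {"Table 6": (0.1261, 0.4330), "Table 7": (0.4052, 0.4874)}
increases = {
    table: round(wer_to - wer_from, 4)
    for table, (wer_from, wer_to) in reported.items()
}
print(example_wer, increases)
```

The absolute increase for the Table 7 figures is 0.0822, which corresponds to the gap of roughly 8% WER quoted in the response.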

Comment 4. In many countries, time and effort have been spent to create an extensive open-domain corpus for the national language. There is still a need for special groups of speakers with respect to age, speech disabilities and regional dialects. All these topics are not discussed and the impact is not proven.
Response 4: We have added in Related Work more information about national corpora: British National Corpus, Czech National Corpus. Furthermore, in the conclusions we explained how this work will impact the CoRoLa Romanian national corpus.

Comment 5. The scientific aspects in the paper are not clear.
Response 5: We hope that following this round of reviews and the inclusion of more information, everything is clearer. We feel that the development of a new speech resource, including underrepresented speech, together with a method for replicating its creation in other less-resourced languages, is a significant contribution.

Comment 6. We recommend that the authors show the impact and use of the created database.
Response 6: The use of the dataset for evaluating ASR models on underrepresented speech categories is presented in Section 6. We further acknowledged its impact on expanding the CoRoLa corpus with missing speech types.

 

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

All problems have been addressed.

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors,

I am satisfied with the answers to my questions and wish you success in your future research.
