Article
Peer-Review Record

Analysis and Investigation of Speaker Identification Problems Using Deep Learning Networks and the YOHO English Speech Dataset

Appl. Sci. 2023, 13(17), 9567; https://doi.org/10.3390/app13179567
by Nourah M. Almarshady *, Adal A. Alashban and Yousef A. Alotaibi
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 8 August 2023 / Revised: 21 August 2023 / Accepted: 23 August 2023 / Published: 24 August 2023
(This article belongs to the Special Issue Automatic Speech Signal Processing)

Round 1

Reviewer 1 Report

This is a good paper presenting the results of developing a speaker recognition system. The topic is highly relevant to the topics of the journal and the special issue: Automatic Speech Signal Processing.

The authors give a wide overview of existing systems and the methods they use for speaker identification. The literature review could be extended with some more theoretical studies on speech recognition, but this is only a recommendation and is left to the authors' preference.

Some conclusions in the discussion are problematic. For example, line 405:

"After analysis, it was found that the majority of listeners failed to predict more than ten samples out of the 35. Therefore, we can conclude that the system we created with the suggested features closely resembles the human auditory system"

A high level of human error does suggest problematic or unrepresentative data, in terms of the distinctiveness of speakers, features, etc., but it does not necessarily explain poor system performance. Defending such a statement requires a more detailed analysis of why the system performs the way it does. In fact, the data might simply be insufficient to support such a conclusion.

The results in Table 9 also need a more thorough analysis. Feature selection is known to be key in ML and DL methods. It needs to be explained why adding certain features worsens the results (compare lines 6 and 7 in Table 9). The representativeness of this particular dataset should also be discussed, since these features might otherwise be important for speaker recognition. It would be interesting to see other research on the topic of feature selection and evaluation.
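To illustrate the kind of feature-level analysis requested here, the following is a minimal sketch of a feature ablation: the same classifier is evaluated on different feature subsets so that the effect of adding a feature can be isolated. Everything in it is an assumption made for illustration: the data are synthetic (not YOHO), the two features are hypothetical stand-ins (one tied to the speaker, one unrelated noise), and the nearest-centroid classifier is a simple substitute for the authors' deep learning models.

```python
import random


def nearest_centroid_accuracy(train, test, feature_idx):
    """Fit one centroid per speaker on the selected feature columns
    and return the fraction of test samples assigned to the right speaker."""
    sums, counts = {}, {}
    for features, label in train:
        running = sums.setdefault(label, [0.0] * len(feature_idx))
        for j, i in enumerate(feature_idx):
            running[j] += features[i]
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    correct = 0
    for features, label in test:
        selected = [features[i] for i in feature_idx]
        # Predict the speaker whose centroid is closest (squared Euclidean distance).
        predicted = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(selected, centroids[lab])),
        )
        correct += predicted == label
    return correct / len(test)


def make_synthetic_data(n_speakers=5, per_speaker=40, seed=0):
    """Build hypothetical data: column 0 depends on the speaker, column 1 is pure noise."""
    rng = random.Random(seed)
    data = []
    for speaker in range(n_speakers):
        for _ in range(per_speaker):
            informative = speaker + rng.gauss(0.0, 0.3)  # separates speakers
            noisy = rng.gauss(0.0, 5.0)                  # unrelated to speaker identity
            data.append(([informative, noisy], speaker))
    rng.shuffle(data)
    split = int(0.8 * len(data))  # 80/20 train/test split, chosen arbitrarily
    return data[:split], data[split:]


def ablation(train, test, feature_sets):
    """Evaluate the same classifier on each named feature subset."""
    return {name: nearest_centroid_accuracy(train, test, idx)
            for name, idx in feature_sets.items()}


train, test = make_synthetic_data()
results = ablation(train, test, {
    "informative only": [0],
    "informative + noisy": [0, 1],
})
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Comparing the two printed accuracies shows whether the extra feature helps or hurts on this particular dataset. A drop would not prove the feature is useless in general, only that it is not informative for this data and classifier, which is the distinction the comment above asks the authors to discuss.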

In my opinion, the conclusion should not repeat results that have already been presented, but rather summarise, outline the achievements and weaknesses, and focus on future work. Moreover, it would be interesting to see whether the methodology could be applied to other datasets and whether similar results would be expected.

In general, the text needs to be carefully proofread. Also, there are multiple repetitions that need to be avoided. Some sentences are unclear and need to be carefully reformulated.

line 14 Energy >> energy (I don't see a reason to use capitalisation)

line 18 high errors >> (highly) frequent errors (unclear what is meant by high errors)

line 33 punctuation unclear: DNNs [4]. VQ technique that operates in speaker recognition models with the text-independent system.

line 46 YOHO speech dataset from the Linguistic Data Consortium >> YOHO speech dataset developed by the Linguistic Data Consortium

line 50 the change in vocal cord vibration to each speaker; >> the change in vocal cord vibration of each speaker;

line 69 as feature input for evaluating a better speaker identification system’s performance >> as feature input for improving speaker identification system’s performance

line 76 the speakers' percentage recognition rate >> the recognition rate (it is already clear that you talk about speaker identification task; you can reduce repetitions)

line 136 YOHO dataset files, the upper number(101,102…,277) shows the speaker index. >> YOHO dataset files (numbers from 101 up to 277 represent the speaker index, and the enrolment sessions are listed below them) - or some clearer figure caption

line 154 does not affect the sound and language content and identities.  

- it is unclear what it means to affect 'identities'

line 185 Table 3. Detected speech results. >> Table 3. Results for speech detection

Dataset Before / After Detection >> prepositions are not capitalised: before / after

line 199 Feature extraction is a significant step in developing a speaker identification system before evaluating it. >> Feature extraction is a significant step in developing a speaker identification system.

line 281 from it, we extracted Eq.9, Eq.10, and Eq.11. >> from it, we calculate Eqs. 9 - 11.

line 345 To split the data on any model, the ratio of training/testing is a manner rule. -- unclear

line 442 Figure 13. Comparisons with stat-of-the-art. >> statE-of-the-art (also in the graphic)

line 448 In this article, we developed a speaker identification system >> In this article, we presented the development of a speaker identification system

 

Author Response

We processed these comments and reorganized the manuscript.

Thank you. 

Author Response File: Author Response.docx

Reviewer 2 Report

Suggestions and comments:

1) Consider dividing large chunks of text into separate paragraphs. See especially Chapter 2, which is currently hard to follow and does not hold the reader's attention.

2) Figs. 2, 4, 5, 10, 11, and 13 could use larger fonts so that they are easier to read. Moreover, in Fig. 13, omit the decimal part of the percentages on the Y axis; leave only the integer part.

3) What are the units of the Y axis in Figure 3? What do they describe? Always remember to label both the X and Y axes of all plots.

4) Fig. 8 is basically a table, and should be presented and labeled as one.

5) There are several minor editorial and formatting issues, so a careful, thorough examination seems necessary.

6) The authors could provide at least basic information about the experimental setup used, including PC hardware components, the simulation environment, third-party libraries, and whether open-source or custom-built software was used.

7) The number and quality of cited references are insufficient. The authors are strongly advised to extend the scope of cited related works.

To sum up, this is a good paper, but it could be a very good one. Therefore, the authors are requested to prepare a revised version of their initial submission.

Author Response

We processed these comments and reorganized the manuscript.

Thank you. 

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

The authors have prepared a revised version of their manuscript, which is more informative and pleasant to read. The contents, aim, and motivation are understandable. There are still some editorial and formatting issues, e.g., related to the formatting of tables and figures, as well as their proportions, yet these can be resolved at a later stage.

To sum up, this paper is practical and interesting, and it is surely worth publishing in the journal. Therefore, I recommend that it be processed further.
