Article
Peer-Review Record

Correct Pronunciation Detection of the Arabic Alphabet Using Deep Learning

Appl. Sci. 2021, 11(6), 2508; https://doi.org/10.3390/app11062508
by Nishmia Ziafat 1,*, Hafiz Farooq Ahmad 2, Iram Fatima 2, Muhammad Zia 1, Abdulaziz Alhumam 2 and Kashif Rajpoot 3,4
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 14 January 2021 / Revised: 28 February 2021 / Accepted: 5 March 2021 / Published: 11 March 2021

Round 1

Reviewer 1 Report

This paper describes automatic speech recognition methods and results for the Arabic alphabet, together with methods and results for mispronunciation detection. Although these results are useful, the following points are unclear:

1. In the Introduction: there are many studies on mispronunciation detection for other languages, such as English, Chinese, and Japanese. Please introduce some of them.

2. In line 111, why did you smooth the spectra over time?

3. In line 121: manipulating the rate of speech (utterance speed) has been used as a data augmentation method. Is it unsuitable for your approach?

4. In line 147, did you use 12 spectral features per frame as the input to the RNN? If so, why did you not use the mel-spectrum directly?

5. In Figure 6, the input image (-> input speech) has dimensions 227x227x3, the same as the input image of Figure 7. What do these dimensions mean? The reviewer guesses that the two values of 227 denote the number of spectral bins and the number of frames, and that 3 denotes the spectrum, delta spectrum, and delta-delta spectrum, respectively.

6. In Figure 6, how many filters (convolution kernels) did you use in each layer?

7. In Table 2, first row, (1 per sample) -> (1 per person)

8. In Table 3, did each of the 20 male speakers utter both correct pronunciations and mispronunciations? If so, how did they utter them?

9. In line 226, are the speakers in the testing data different from those in the training data? If not, please conduct the evaluation experiments under a speaker-independent (speaker-open) condition; this is very important.

10. In Tables 5 and 6, the reviewer guesses that the numbers along the rows and columns correspond to Arabic letters; if so, please label them with the letters instead of the numbers.
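For context on point 5 above, one way to produce a 227x227x3 input from speech is to stack a log-magnitude spectrogram with its delta and delta-delta along a third channel axis. The sketch below is a hypothetical reconstruction, not the authors' code; the FFT size, hop length, windowing, and zero-padding are all assumptions.

```python
import numpy as np

def to_cnn_input(signal, n_fft=512, hop=128, size=227):
    """Turn a 1-D waveform into a (size, size, 3) 'image' whose channels
    are the log-magnitude spectrum, its delta, and its delta-delta.
    All parameters here are illustrative assumptions."""
    # Short-time Fourier transform via explicit framing + FFT
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (bins, frames)
    log_spec = np.log1p(spec)
    # Crop/zero-pad both axes to the target size
    img = np.zeros((size, size))
    b = min(size, log_spec.shape[0])
    t = min(size, log_spec.shape[1])
    img[:b, :t] = log_spec[:b, :t]
    # Delta and delta-delta along the time axis
    d1 = np.gradient(img, axis=1)
    d2 = np.gradient(d1, axis=1)
    return np.stack([img, d1, d2], axis=-1)         # (size, size, 3)

x = to_cnn_input(np.random.randn(16000))
```

With these assumed settings, a one-second 16 kHz utterance yields an array of shape (227, 227, 3), matching the input dimensions of AlexNet-style networks.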

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The authors presented a comparison of several deep learning architectures for correct pronunciation detection in Arabic. The topic is relevant, but I can’t accept the paper until these issues are correctly addressed:

Minor Issues

  • Table 1: “Adams” optimizer algorithm? Do you mean Adam? If it really is Adams, please cite the relevant paper.
  • L56: “DL algorithms learn a model of abstraction from data...” Not very clear, please, rephrase.
  • L165: “By Transfer” -> “by transfer”
  • L271: as mentioned above, I do not know of an “Adams” optimizer.
  • Section 3.4 and L224–227 apparently explain the same thing. Please keep only one of them, perhaps the second because it is more detailed.
  • Section 4.3 (Validation Strategies): I have the impression this explains the same content as L228–232 in different words. If so, please condense it into a single explanation; that should suffice, since these evaluation strategies are not novel.
  • Equation 1 is a bit shifted.
  • L358–359: this reads more like a conclusion than part of the discussion.
  • RNN and BLSTM are mentioned interchangeably. Consider explaining at the beginning that BLSTM is a type of RNN and, from then on, mentioning only BLSTM; there are many RNN architectures.
  • You provide the details of the DCNN architecture, but I cannot see the corresponding details for the BLSTM, even though this architecture also has hyperparameters, for instance the number of units/neurons or the activation function.
  • If AlexNet (a quite old architecture) provides such good results, have you considered using more contemporary ones, such as Xception, Inception-ResNet, or NASNet?

Major Issues

  • How many times did you run the experiments? Are the reported performances averages? Could you provide the standard deviations or confidence intervals?
  • I do not really understand why different evaluation strategies are used. Why not only 5-fold?
  • If the reported results (98.41% and 99.14%) are obtained from a single run, the superiority of one method over the other could be dubious.

Author Response

Please see the attachment. 

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors' responses are not sufficient, especially for points 1, 3, and 7, and the originality and usefulness of this manuscript are marginal. Please revise these points more concretely.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The authors have correctly addressed my comments. Thank you.

Author Response

We thank the reviewer for their review and feedback, which has truly helped to improve the quality of this manuscript. We have revised our manuscript accordingly.

Thank you
