Article
Peer-Review Record

Augmented Latent Features of Deep Neural Network-Based Automatic Speech Recognition for Motor-Driven Robots

by Moa Lee and Joon-Hyuk Chang *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Appl. Sci. 2020, 10(13), 4602; https://doi.org/10.3390/app10134602
Submission received: 12 May 2020 / Revised: 29 June 2020 / Accepted: 30 June 2020 / Published: 2 July 2020
(This article belongs to the Special Issue Intelligent Speech and Acoustic Signal Processing)

Round 1

Reviewer 1 Report

I thank the Authors for having addressed my comments. The paper is clearer now. However, I believe that it still presents some major flaws that prevent it from being acceptable for publication:

  1. The ASR baseline on LibriSpeech is a bit weak. Today, the best systems achieve around 2% WER. I do not expect to see those ASR back-ends, but I believe that something better than 6.65% WER would be necessary to understand the actual benefit of the proposed method.
  2. It is not clear how the speech enhancement is trained (using the 100 h of contaminated speech?) or how the subsequent ASR system is trained (on the noisy speech? on the enhanced noisy speech?). Is there a model for each SNR?
  3. What is the performance of the speech enhancement component?
  4. The differences in Table 3 are very marginal. Basically, the 15 dB case is equivalent to the clean one, and the fluctuations among the three baselines and DNN_senone seem more related to experimental noise. How many models are trained for each experiment, i.e., what is the variance? (The sketch after this list shows the kind of statistic I mean.) The 20.52 WER of DNN_senone at 15 dB is a bit suspect, as the 15 dB case is very close to the clean one and I do not see a similarly large gain in the other cases.
  5. Figure 5 reports errors between 8.5% and 8.7% WER. I do not see these numbers in any of the tables. Moreover, there should be a discussion of the results (although the differences look like experimental fluctuations).
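
To illustrate the variance question in point 4, I would expect a mean and standard deviation of the WER over several independently trained models (e.g., different random seeds); a minimal sketch, with hypothetical numbers:

    import statistics

    # Hypothetical WERs (%) from five models trained with different random seeds
    wers = [20.5, 20.9, 20.4, 21.1, 20.7]
    print(f"WER = {statistics.mean(wers):.2f} +/- {statistics.stdev(wers):.2f} %")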

Table 4 is actually in line with what one would expect (i.e., no differences at 15 dB and an improvement of the proposed method at 5 dB). The ranking is also what one would expect, and it is consistent across all cases. Unfortunately, the results in the other table and the results on TIMIT are not. This discrepancy leaves the reader with some doubt as to whether the experiments are significant.

Finally, I believe that it should be discussed why 5 dB of fan noise does not impact the WER as much as 5 dB of motor-plus-fan noise (unless it is 5 dB of each noise).
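
For reference, if the motor and fan noises are uncorrelated and each is mixed at 5 dB SNR relative to the same speech, their powers add, so the combined SNR drops to roughly 2 dB; a quick check of that arithmetic:

    import math

    snr_each_db = 5.0
    # Speech power normalized to 1; two uncorrelated noises at 5 dB SNR each
    total_noise_power = 2 * 10 ** (-snr_each_db / 10)
    print(f"combined SNR = {-10 * math.log10(total_noise_power):.2f} dB")  # ~1.99 dB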

I do not believe that the paper should be rejected for the issues listed above, because I think they can be fixed (with the possible exception of point 1).

Therefore I am recommending a major revision.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

In this paper, the authors proposed a new method to encode information about the motor state of a humanoid robot. They showed through experimental analysis that incorporating this information into the ASR process can improve the robustness of the system. It is a very interesting research topic that has very practical implications, too. Here are my comments, which I hope might help improve the paper.

Major comments:

  • Since the microphone in a humanoid robot is close to the moving parts relative to the speakers communicating with the robot, the ego-noise is captured with strong power and the SNR of the captured speech signal is very low. The authors should discuss how realistic the selected SNRs (5 dB and 15 dB) are. If these values are far from those in real situations, they should probably provide an analysis for the missing SNR values (e.g., 0 dB or lower; the mixing sketch after this list shows how such conditions could be generated).
  • Please discuss the complexity of the proposed system, since one of the authors' concerns was the suitability of the method for embedded systems.
  • The results are not clearly reported and there are some ambiguities. Please clarify the following in the text:
    • the meaning of "clean" and "other" in Tables 3 and 4;
    • the numbers in the first column of these two tables;
    • why those numbers are reported only for the first baseline system;
    • the "average" used in Tables 3 and 4, to avoid confusion.

Minor comments:

  • On page 7, line 198, "prliminary" should read "preliminary".
  • On page 8, line 207, "represents" should read "represent".
  • Please follow the journal guidelines for numbering and use a consistent numbering style for tables and sections throughout the text; at present they appear as Roman numerals when referred to in the text.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

I thank the authors for having replied to most of my comments.

However, I cannot see the new results using the DNN-mapping denoising in Tables 3 and 4. This was a major comment (made by more than one reviewer), and without those results I cannot assess the method.

With the new Kaldi baseline, the improvements are very marginal and could be questioned, but I think that is acceptable.

I still believe that formulas 1 to 5 are useless in 2020, so my recommendation is to remove them.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

In the second version of the manuscript, the authors clarified some ambiguous points. However, several questions from my previous report have not been addressed properly and still need clarification.

  • On page 6, line 160, the authors start using the terms "clean" and "other" before introducing them.
  • Moreover, even though they have mentioned that the quality of the audio and of the transcriptions is higher for the "clean" recordings than for the "other" recordings, it is still not clear which database/subset and which acoustic/transcription conditions the authors are referring to with "clean" and "other".
  • The authors also did not sufficiently discuss how realistic the selected SNR values are. As I mentioned in my first report, the ego-noise itself (regardless of other background noises) can produce very low-SNR recordings, since the source of the ego-noise is very close to the microphone and can be captured with strong power. The authors should discuss the range of SNR values in real situations, either by providing experimental results or by referring to papers in which this has been evaluated or discussed.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

No further comments.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

This paper proposes to use bottleneck features from a speech and motor-state autoencoder as inputs to a speech recognition system. The paper is well written and the approach is interesting. However, I believe the following questions should be answered prior to publication to make the paper more convincing.

Here are some specific comments:

1) The authors mention that they use the cross-entropy loss. This makes sense for senone and motor-state targets, but I am not sure I understand how this loss function can apply to MFCCs. Shouldn't an MSE loss be used here?
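
To make the distinction concrete, in a framework such as PyTorch the loss would typically be chosen per target type; a minimal sketch with hypothetical dimensions (not necessarily the authors' exact setup):

    import torch
    import torch.nn as nn

    ce = nn.CrossEntropyLoss()   # discrete targets: senones, motor state
    mse = nn.MSELoss()           # continuous targets: MFCC regression

    senone_logits = torch.randn(32, 2000)   # assumed batch of 32, 2000 senones
    loss_senone = ce(senone_logits, torch.randint(0, 2000, (32,)))

    mfcc_pred, mfcc_true = torch.randn(32, 13), torch.randn(32, 13)
    loss_mfcc = mse(mfcc_pred, mfcc_true)    # MSE, not cross-entropy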

2) Why is the vector for the motor state 2-D? It seems like a 1-D binary value would suffice.

3) Can you explain why you have to concatenate the MFCC features with the bottleneck features for the second stage?

4) Have you considered training with multiple targets at the same time (e.g., senone, motor state, and MFCC) with some weighting?
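
Such a weighted multi-task objective is straightforward to set up; a sketch continuing the example above, where the weights are free hyperparameters that would need tuning on a development set:

    import torch
    import torch.nn as nn

    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    w_senone, w_ms, w_mfcc = 1.0, 0.3, 0.3   # hypothetical weights

    senone_logits = torch.randn(32, 2000, requires_grad=True)
    ms_logits = torch.randn(32, 2, requires_grad=True)    # motor on/off head
    mfcc_pred = torch.randn(32, 13, requires_grad=True)

    loss = (w_senone * ce(senone_logits, torch.randint(0, 2000, (32,)))
            + w_ms * ce(ms_logits, torch.randint(0, 2, (32,)))
            + w_mfcc * mse(mfcc_pred, torch.randn(32, 13)))
    loss.backward()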

5) When training the autoencoder to predict the motor state, why would the autoencoder store any information about the MFCCs?

6) It would be nice to see some spectrograms of the motor noise, fan noise, speech, etc., to get a feeling for what is going on here. Similarly, it would be nice to visualize the bottleneck features over time against the spectrogram.

7) In the experiments with TIMIT, the baseline performs similarly with the motor on and off in terms of PER, whereas the experiments with LibriSpeech show that the baseline performs worse with the motor on than with the motor off. I am not sure I understand why TIMIT seems more "robust" to motor noise.

8) The authors mention that they examined the effect of the size of the bottleneck features but that it did not yield any significant performance differences. It would be relevant to report some numbers.

Reviewer 2 Report

This paper attempts to tackle a significant problem: ASR for robots plagued by internal operating noise. The results show improvements over simple baseline methods. Section 4 was difficult to follow, and the writing style could be improved. My main concerns are the following:

1.) I do not see a comparison with state-of-the-art methods.

2.) As stated in line 177, the baseline neural network is stripped of fine-tuning techniques such as LDA. How do we know that the deterioration in PER is not due solely (or to an overwhelming degree) to the lack of this extra fine-tuning? To clearly show the effect of the motor-state bottleneck features, both baseline methods should have the same fine-tuning algorithms as the Kaldi implementations.

3.) Line 142: What is the "development set"?

4.) Line 152: There is no information about the test set.

5.) Line 196: More details should be provided on the LibriSpeech test. For instance, how was the ego-noise inserted?

Reviewer 3 Report

The paper presents an acoustic modelling approach robust to ego-noise in robotic platforms. The idea is to augment acoustic features (MFCC) with features carrying information about the motor state. The latter are derived as bottleneck features of another neural network.

In my opinion, the paper has a series of flaws at several levels (experiments, descriptions, soundness, presentation, ...), for which I recommend rejecting it. Although I struggle to understand the rationale behind the proposed approach and what the bottleneck features represent, I would encourage the Authors to resubmit the paper if they manage to improve it.

Here I list my comments, with the most critical first:

  • In the experimental analysis, the only comparative method is [9], by the same authors. I think that other approaches should be considered, in particular some of the noise-suppression approaches described in the introduction.
  • Clarify how the recorded ego-noise is used to generate the different SNRs and how these signals are used in data augmentation for AM training (one possible scheme is sketched after this list). How many samples are there for each SNR? Carefully detail the whole training procedure (data, hyperparameters, etc.).
  • I do not understand what the bottleneck features describe about the acoustics. Basically, the preliminary DNN is the acoustic model used in [9]. So, is it a sort of pre-training? Is it equivalent to adding layers to the AM? The motivations provided by the Authors are rather vague. Are the different models trained in the same (or a comparable) way? Is the AM architecture good enough?
  • Clarify how far the baseline used for TIMIT is from the state of the art. Typically, filter banks with fMLLR are used as ASR features instead of MFCCs, and the AM architecture is also far from the state of the art on TIMIT. Report the WER on clean TIMIT to give an idea of how good the AM is. Note that improving over a bad baseline is useless, so this aspect is fundamental: if the proposed method works, it should also help in the presence of a better AM.
  • I do not understand what the SNRs in Tables 3 and 4 refer to. If the motor is off in Table 3, what noise is that?
  • The numbers are a bit strange: with the motor on, the PER is lower than with the motor off. In addition, at 20 dB the proposed method is better than the clean condition. In my opinion, the Authors must consider state-of-the-art acoustic modelling to see what the actual benefits of the proposed method are.
  • The experiments on LibriSpeech are not described or discussed at all: what are "clean" and "other" in the tables? Is the AM the same as the one used for TIMIT? The state of the art on LibriSpeech is approximately 5% WER. This last experiment is totally useless if presented in its current form.
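
To be explicit about the second point, the kind of augmentation description I am asking for could be summarized as follows (an assumed scheme, not necessarily what the Authors did): each training utterance is mixed with a random segment of the recorded ego-noise, rescaled to each target SNR.

    import numpy as np

    rng = np.random.default_rng(0)

    def augment(speech, ego_noise, snrs_db=(5, 10, 15, 20)):
        # One noisy copy per target SNR, each using a random ego-noise segment.
        for snr_db in snrs_db:
            start = rng.integers(0, len(ego_noise) - len(speech))
            noise = ego_noise[start:start + len(speech)]
            scale = np.sqrt(np.mean(speech ** 2)
                            / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
            yield snr_db, speech + scale * noise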

Minor comments:

  • The state of the art describes some works on noise suppression, but there is nothing on data augmentation, which is where most of the improvements in ASR lie.
  • Explain why different SNRs are present; the ego-noise should be constant.
  • Formulas 1 to 4 and the text from lines 97 to 101 can be removed or compressed; this is basic NN background.
  • Line 121: clarify what the labels are. This sentence is not clear; it only becomes clear later that three different labels are used.
  • Why are only two motor states considered? Typically, the noise differs depending on which part is moving and how.
  • Revise the English.
