Peer-Review Record

Doppler Radar-Based Human Speech Recognition Using Mobile Vision Transformer

Electronics 2023, 12(13), 2874; https://doi.org/10.3390/electronics12132874

by Wei Li¹,*, Yongfu Geng¹, Yang Gao¹, Qining Ding¹, Dandan Li¹, Nanqi Liu² and Jinheng Chen²

Reviewer 1: Muahammad Islam

Reviewer 2: Anonymous

Reviewer 3: Alana Da Gama

Reviewer 4: Leonardo Santos

Submission received: 31 May 2023 / Revised: 21 June 2023 / Accepted: 27 June 2023 / Published: 29 June 2023

Round 1

Reviewer 1 Report

As the problem has not been mentioned in the abstract, there is no direct link between the solution and the problem, which must also be mentioned in the abstract.
Although Mobile Vision Transformer (Mobile ViT) is mentioned in the abstract as well as in the paper, we have not found it as the main keyword of the paper title.
The abstract reports an accuracy rate of 99.5%, whereas the conclusion states an accuracy rate of 99%; please explain the difference.
A number of references in the literature review are not properly cited in the text.
The reference [20] is cited but cannot be found in the references section of the paper.
As far as I can tell, some references, such as [18] and others, are not cited in the correct way.
The authors do not summarize the problem they want to solve in the introduction of the paper.
The diagram in Figure 1 is only a general representation of the manuscript; I have not been able to find the exact block diagram.
Describe how the authors obtained the images for Figure 3 and what the purpose of these images is in the method.
The experimental results section does not link the method the authors used to the results that were obtained.
Figure 6 is not necessary in the experimental results; only the results need to be provided.
I would like to know what the purpose of Figure 7 is in the experimental results.
In Figure 9, the text is not clear and cannot be read.
There is a need to rewrite the conclusion to the paper.
The references need to be updated and cited appropriately in the text.

Extensive editing of English language required

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper discusses non-contact Doppler radar detection of the human voice using CNN and Transformer models.

Strengths: CNN image processing approach and test phases.

Points of weakness: state of the art and discussions.

A minor revision is required.

Actions to do:

Given the weaknesses, I suggest improving the paper by addressing these points:

1- Please provide more details about the setting of the hyperparameters of the CNN network (not only the learning rate) by enhancing the error sensitivity;

2- Please add more comments about the adopted technology by comparing results with others found in the literature (such as https://doi.org/10.1007/978-981-16-7213-2_28);

3- Please add more comments in the introduction section about LSTM and CNN applied to image processing, including in other applications:

- https://doi.org/10.3390/s21144693

- https://doi.org/10.3390/s21093289

- DOI: 10.1109/MetroInd4.0IoT51437.2021.9488536

- https://doi.org/10.3390/s22020548

- https://doi.org/10.3390/info11050257

- https://doi.org/10.3390/s21030689

- https://doi.org/10.3390/s21062132

4- Conclusions should be improved by better summarizing the results.

Minor remarks:

All the images should be high resolution.

1. What is the main question addressed by the research?

Human voice detection by CNN

2. Do you consider the topic original or relevant in the field?

Average/High

Does it address a specific gap in the field?

Yes (about the merging of hardware technology and CNN-software one)

3. What does it add to the subject area compared with other published material?

Authors describe a state of the art about a partial scenario (the state of the art should be improved).

4. What specific improvements should the authors consider regarding the methodology?

The discussion of the model could be improved.

What further controls should be considered?

Hyper-parameter control

5. Are the conclusions consistent with the evidence and arguments presented and do they address the main question posed?

Conclusions should be improved.

6. Are the references appropriate?

Should be improved.

7. Please include any additional comments on the tables and figures.

Low resolution.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors propose a method using the Mobile Vision Transformer (Mobile ViT) to recognize words based on a Doppler radar signal, showing an accuracy rate of 99.5%. The idea is very interesting and sound. The paper has some minor errors that need attention. Most of the ones detected are listed below.

The authors have validated the idea using only digit words (0-9). How would it behave (accuracy) if more words were added to the recognition? What about testing with different users instead of only one person? Is it possible to create a generalized method so that it could recognize words from different people using the same approach (without new training)?

Please provide more information on how the Doppler radar is connected to the computer and how data is collected for training.

Please provide more information on the comparison performed with [8].

Please have the entire paper reviewed by a native speaker, because the number of errors may affect the paper's quality.

More general comments and minor errors are listed as follows.

" issue, Mobile" -> " issue, a Mobile"

"recognition [1]has " -> "recognition[1] has "

" of contact non-air" -> " of contacting non-air"

"the vibration will be transmitted to the sensor through the voltage change will be converted into a vibration signal sound signal." -> please rewrite

"Capturing human voice with radar devices " -> ?

"antennas[4]so" -> "antennas[4] so"

"signals[9]of" -> "signals[9] of"

"The larynx is not a solid structure, but a cavity inside it, called the laryngeal cavity, in which there are two parallel muscles, covered with mucous membrane, that is, the vocal cords, which are like two rubber bands side by side and have a strong elasticity." -> please rewrite

"Normally human" -> "Normally, human"

"Radar works" -> "Radars work"

" (STFT) [14]for" -> " (STFT)[14] for"

"which means that in STFT, the time resolution and frequency That is to say, in the short-time Fourier transform, the time resolution and frequency resolution cannot be combined, and should be traded off according to the specific needs. " -> please rewrite

" feature[15]is" -> " feature[15] is"

" from A-J respectively." -> "from A-J, respectively."

" under test。" -> " under test."

text from line 220 to line 235 has a smaller font size

"Figure 5 shows" -> "Figure 5 shows"

"Divide the feature map into patches, assuming that the patch size is 2×2 (to facilitate the calculation of the number of ignored channels), that is, the size of each patch consists of 4 Pixel, and when performing self-attention calculation, each token (each pixel or each small color block in the graph) only pays attention to its own token of the same color, achieving the goal of reducing the amount of calculation." -> please rewrite

"be much difference," -> "be much different,"

"if each token to pay attention to the neighboring pixels." -> please rewrite

"pixels.The " -> "pixels. The "

"Figure 5.Light" -> "Figure 5. Light"

"Figure 6.Detecting" -> "Figure 6. Detecting"

"The parameter settings for the FM wave are shown in Table 1, and the data sampling parameters are shown in Table 1." -> "The parameter settings for the FM wave together with the data sampling parameters are shown in Table 1."

"24GHZ" -> "24GHz"

"The same letters are measured 200 times," -> letters or digits (0-9)?

" compare the" -> " compare the"

"previously state," -> "previously stated,"

" as shown in Figure 10. " -> " as shown in Figure 9. "

"In the sixth section we summarize the results of this study and make suggestions for future research." -> there are no future work suggestions in the conclusion section

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

The paper "Doppler radar based human speech recognition using light weight neural network" by Li et al. is fascinating, and I congratulate the authors for their exceptional work.

The paper explores neural networks (Convolutional neural networks and Mobile Vision Transformer) for speech recognition.

However, some topics demand improvements:

There are just a few references in the Introduction Section. The authors should bring in the classical references of the research area, preferably including reviews.

Figure 1 is generic but valuable. I ask the authors for a more detailed caption.

Table 1 - Could the authors present or comment on sensitivity analysis?

Section 4 - what is the purpose of this Section? Its title needs to be clarified. Should not all methodological details be presented here?

The authors must improve the Conclusions. I suggest using the traditional approach: one paragraph restating the paper's goals, one paragraph summarizing the results, one paragraph on the limitations, and a final one with perspectives.

Minor editing of English language required

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

All my previous comments have been responded to by the authors.

Reviewer 3 Report

Dear authors, thank you for your time and effort considering my previous comments. Indeed the paper is improved a lot. I'm satisfied with the modifications performed. Congratulations, I believe the paper can be accepted now. Please, for the final version, consider fixing the following problems still found in the text.

Please avoid terms such as "the learning speed is relatively fast", as they are not precise.

"We discuss the radar principle and the micro-Doppler signal of the vocal cords in Part II." -> "We discuss the radar principle and the micro-Doppler signal of the vocal cords in Section 2."

"Part III presents the interpretation of data measurement and processing." -> "Section 3 presents the interpretation of data measurement and processing."

"Part IV describes the lightweight neural network Mobile ViT." -> "Section 4 describes the lightweight neural network Mobile ViT."

"Part V explains the experimental environment as well as the experimental procedure." -> "Section 5 explains the experimental environment as well as the experimental procedure."

"We summarize the results of this study and make suggestions for future research in PartVI." -> "We summarize the results of this study and make suggestions for future research in Section 6."

"It is easy to know that, the short-time Fourier transform is to multiply a function and a window function first, and then perform a one-dimensional Fourier transform. And through the sliding of the window function to get a series of Fourier change results, these results will be lined up to get a two-dimensional representation." -> please rewrite

"We put these micro-Doppler feature maps into the Mobile ViT neural network for training and validation, analyze the experimental results." -> please rewrite

"As this study, our work is limited in the following aspects." -> please rewrite

Reviewer 4 Report

Congratulations!

Minor editing of English language required
