Article
Peer-Review Record

A Real-Time Dynamic Gesture Variability Recognition Method Based on Convolutional Neural Networks

Appl. Sci. 2023, 13(19), 10799; https://doi.org/10.3390/app131910799
by Nurzada Amangeldy 1, Marek Milosz 2,*, Saule Kudubayeva 1, Akmaral Kassymova 3, Gulsim Kalakova 4 and Lena Zhetkenbay 1
Submission received: 24 May 2023 / Revised: 20 July 2023 / Accepted: 9 August 2023 / Published: 28 September 2023

Round 1

Reviewer 1 Report

1.- In line 42, citation 3 in the bibliography does not correspond to the World Health Organization. Please provide a source that supports the values given in lines 42 and 43.

2.- The argument in lines 118 to 123 is ambiguous, because EMG sensors can be either invasive sensors or contact sensors. Contact sensors such as the Myo armband are very useful for hand gesture recognition. Likewise, studies that test the Leap Motion Controller (LMC) report accuracies of 1 mm of deviation, including the prediction of finger positions when they are occluded. In addition, the new versions of the LMC bring several advances. The LMC is also a specialized device for tracking the hand and hand-held objects, returning 27 positions of the hand, fingers, and wrist.

3.- In the Materials and Methods section, lines 166-168, the authors mention that they will try to overcome several problems. They should instead describe the methodology they will use to address and overcome these problems. Similarly, in lines 168-170, the authors refer to four categories but do not name them.

4.- The authors refer to real-time operation throughout the document but do not define the notion of real time that they apply to their work.

5.- The authors mention that they are building a dataset of 2600 samples (line 174) for training, testing, and validation. They also state that this dataset addresses the problem of limited datasets (does this mean that no datasets with more samples exist?). In addition, you do not explain how the dataset was constructed: the number of people, the sampling time, etc. Finally, please explain whether the proposed model overfits this dataset.

6.- Figure 1 does not show the proposed model in enough detail. It is also blurred.

7.- What is written in lines 229 - 231 is not consistent with what is written in line 174.

8.- In line 230 you mention that each word is stored 60 times, but you do not mention which words they are.

9.- Please specify how many features the PCA reduces the data set to.

10.- Please explain what the input vector to the classifier is: section 2.2 refers to 480x240 frames, while section 2.1 describes extracting features with PCA and clustering them with K-means. What is the purpose of the steps you describe?

11.- Please explain what the contribution of the work is.


Author Response

Thank you for your work to improve our paper. We have tried to address all your remarks and requirements.

Detailed answers are in the file.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper presents a new model using a 2D CNN based on regular web cameras, which allows it to be implemented easily and with a lower investment. The main subject, sign language classification, is of great importance, since sign language is not understood by the majority of people.

 

The structure of the document is good and appropriate. However, there are some questions/points that should be addressed.

The abstract lacks some conclusions regarding, for example, the applicability of the developed model. It just identifies the achieved accuracy.

 

Page 3, lines 118-119, states “EMG signals is one … that may require invasive procedures”. Why is the EMG an invasive procedure and, for example, using a glove or smartwatch is not invasive? Both are harmless to the user…

 

Page 3, lines 149-150: the authors state that multilingual sign language may be effectively developed. Since the presented work is also a multilingual sign language model, this aspect should be explored further, namely how to train the model on, or extend its training to, a different language.

 

Throughout the document, the authors indicate that the research questions were answered. However, these questions are not clearly stated. Without such clarification, the reader may not be able to understand the sentences that refer to the research questions.

 

Figure 1 presents the proposed method, but it is not clearly described in the text. For example, it is unclear what the authors intend to show with the blue window.

 

All figures throughout the document must be replaced by versions of better quality. In some of them the text is hard to read, and in others, such as Figure 7, it is not readable at all.

 

Page 5, lines 204-206, give the calculation showing how the authors arrive at the total of 258 landmarks. However, it is not clear why some values were multiplied by 2 and others by 2 and 3. This must be clarified.
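
The kind of clarification requested here can be sketched as an explicit breakdown of the feature vector. The example below is only an illustration under stated assumptions: the group names, point counts, and values per point are hypothetical (they follow a layout like the one exposed by MediaPipe Holistic, with 33 pose points carrying x, y, z, and visibility, and 21 points per hand carrying x, y, and z) and are not taken from the manuscript. It shows one decomposition that reaches 258, with each multiplier named.

```python
# Hypothetical landmark layout; every count below is an assumption for illustration.
LANDMARK_GROUPS = {
    # 33 body points, each with x, y, z and a visibility score, tracked once
    "pose": {"points": 33, "values_per_point": 4, "copies": 1},
    # 21 hand points, each with x, y, z, tracked for the left and the right hand
    "hand": {"points": 21, "values_per_point": 3, "copies": 2},
}


def feature_vector_length(groups):
    """Sum points * values_per_point * copies over all landmark groups."""
    return sum(
        g["points"] * g["values_per_point"] * g["copies"] for g in groups.values()
    )


def breakdown(groups):
    """Return one readable line per group so each multiplier is explicit."""
    return [
        f'{name}: {g["points"]} x {g["values_per_point"]} x {g["copies"]} = '
        f'{g["points"] * g["values_per_point"] * g["copies"]}'
        for name, g in groups.items()
    ]


if __name__ == "__main__":
    for line in breakdown(LANDMARK_GROUPS):
        print(line)
    print("total:", feature_vector_length(LANDMARK_GROUPS))
```

Stating the multipliers per group in this way (coordinates per point, number of hands) would remove the ambiguity about where the factors 2 and 3 come from.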

 

Page 5, lines 229-232, “…ten words were chosen…”. What were the criteria to select these words, and what are such selected words?

 

The statement in lines 233-239 should be revised. It is not clear what the authors intend to present.

 

PCA was used to reduce the number of dimensions. By how many dimensions was the data reduced? What percentage of loss was tolerated to achieve this reduction?
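
The relation between the number of retained dimensions and the tolerated loss can be made concrete with a short sketch. It assumes the usual convention that the loss is measured as the fraction of total variance discarded, which the manuscript may or may not follow; the function name `components_for_variance` and the eigenvalue spectrum are made up for illustration.

```python
def components_for_variance(eigenvalues, tolerated_loss):
    """Return (k, retained) where k is the smallest number of principal
    components whose eigenvalues keep at least (1 - tolerated_loss) of the
    total variance, and retained is the fraction actually kept."""
    if not 0.0 <= tolerated_loss < 1.0:
        raise ValueError("tolerated_loss must be in [0, 1)")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    target = (1.0 - tolerated_loss) * total
    kept = 0.0
    for k, value in enumerate(ordered, start=1):
        kept += value
        # Small tolerance guards against floating-point rounding at the threshold.
        if kept >= target - 1e-12:
            return k, kept / total
    return len(ordered), 1.0


if __name__ == "__main__":
    # Made-up covariance eigenvalues, not data from the paper.
    spectrum = [5.0, 3.0, 1.0, 0.5, 0.5]
    k, retained = components_for_variance(spectrum, tolerated_loss=0.10)
    print(f"keep {k} of {len(spectrum)} components, retaining {retained:.0%} of the variance")
```

Reporting both numbers (components kept and variance retained) would answer the two questions above together.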

 

Figure 2 should be clarified in terms of PCA classes and k-means classes. According to the figure, the initial PCA identified, for example, yellow values that are present all over the plot. However, most of these points were reclassified. How did this impact the output?

 

In the text there are several references to subfigures (for example, Figure 3 a)). However, there is only a Figure 3, without sub-labels. The authors should clearly indicate a), b), c), etc. in these figures.

 

Page 8, lines 292-301: the authors state that the number and type of layers may produce a model that overfits or underfits the data. How did the authors arrive at the best architecture? What procedure was used to identify the best layer sequence, kernel types, and other values?

 

The model's quality is characterized only by its accuracy. However, this indicator per se is not sufficient to clearly define the quality. Why were F1 score or recall not considered? And why are such values used to compare the model with those of other authors (Table 1)?
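
A small worked example illustrates why accuracy alone can be misleading. The confusion matrix below is invented for illustration and has nothing to do with the results in the manuscript; the helper name `per_class_metrics` is likewise hypothetical. With an imbalanced two-class problem, overall accuracy stays high while recall and F1 for the minority class are poor.

```python
def per_class_metrics(confusion):
    """Compute accuracy plus per-class precision, recall, and F1.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp  # other, predicted c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, metrics


if __name__ == "__main__":
    # Invented data: 90 samples of class 0 (all correct),
    # 10 samples of class 1 (only 2 recognized).
    confusion = [[90, 0], [8, 2]]
    accuracy, metrics = per_class_metrics(confusion)
    print(f"accuracy: {accuracy:.2f}")
    for label, m in enumerate(metrics):
        print(f"class {label}: precision={m['precision']:.2f} "
              f"recall={m['recall']:.2f} f1={m['f1']:.2f}")
```

In this made-up case accuracy is 0.92, yet class 1 has a recall of 0.20 and an F1 of about 0.33, which is the kind of gap that per-class metrics would reveal.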

 

Page 10, lines 375-380: if the model achieves a higher accuracy level using a single camera, why did the authors consider it necessary to use multiple cameras?

 

In figures such as 6 and 7, despite their poor quality, the words should also be translated. I understand that the model was trained for a non-English sign language, but for the purpose of the paper and the presentation of the results, the confusion matrices would be easier to analyze if they were in English.

 

The conclusions present the main goals, but also lack the answers to the research questions.

The overall quality of the English is good. Only one sentence should be rewritten since it is not clear what the authors intend to say. This sentence was identified earlier.

Author Response

Thank you for your work to improve our paper. We have tried to address all your remarks and requirements.

Detailed answers are in the file.

Author Response File: Author Response.pdf
