Article
Peer-Review Record

A Performance Comparison of Japanese Sign Language Recognition with ViT and CNN Using Angular Features

Appl. Sci. 2024, 14(8), 3228; https://doi.org/10.3390/app14083228
by Tamon Kondo 1, Sakura Narumi 1, Zixun He 2, Duk Shin 2 and Yousun Kang 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 1 February 2024 / Revised: 4 April 2024 / Accepted: 9 April 2024 / Published: 11 April 2024
(This article belongs to the Special Issue Modern Computer Vision and Pattern Recognition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposes an experiment on sign language recognition with a Vision Transformer and a Convolutional Neural Network to evaluate their effectiveness.

The paper takes a real-time recognition approach using Japanese Sign Language.
The manuscript is well organized and well written. The concept, ideas, and experiments included in the paper are supported by the methodology and experimental results.

The motivation of this experiment and the approach to this field are explained and partially supported by related works, but without key information and results from other experiments on Japanese Sign Language.

Since the accuracy results and methodology of the works of Syosaku, T. et al. and Miku, K. et al. are not specified, the paper does not give the information needed to understand how far it advances beyond those experiments. It is recommended to also include the accuracy results and methodology from the mentioned papers.

The preparation steps and the definition of the dataset are clear and described in a way that allows this experiment to be reproduced.

Table 1, regarding information on the shooting data, needs minor clarification and/or reconsideration. Either the average video duration has an undesired value of 7 s, or its purpose is not clarified and the reason the male portion of the first dataset deviates significantly from the overall pattern is not explained. It is recommended to add clarification to Table 1 to avoid future confusion in interpretation and to avoid affecting the reproducibility of the results.

Sections 3.2 and 3.3 contain satisfactory information and cover the important aspects, with minor grammar corrections needed. The methodology and mathematical explanation are suitably prepared, and the figure selection is also explanatory.

It would be more helpful to explain the methodology used in MediaPipe, to help readers understand why MediaPipe is crucial and/or more helpful than alternative approaches in this experiment.
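
For context, here is a minimal sketch of how per-frame hand landmarks are typically obtained with MediaPipe's Python Hands solution before any features are computed; the function name and video path are illustrative and not taken from the manuscript.

```python
# Illustrative sketch: extracting 21 (x, y, z) hand landmarks per detected hand
# with MediaPipe Hands; not the authors' exact pipeline.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmarks(video_path: str):
    """Yield one list of 21 (x, y, z) landmarks per detected hand, per frame."""
    cap = cv2.VideoCapture(video_path)
    with mp_hands.Hands(static_image_mode=False, max_num_hands=2) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads frames as BGR.
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_hand_landmarks:
                for hand in result.multi_hand_landmarks:
                    yield [(lm.x, lm.y, lm.z) for lm in hand.landmark]
    cap.release()
```

A short description of this stage in the paper would also make clear what information the angular features are derived from.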

The structure of the ViT and CNN is explained satisfactorily, covering all aspects and supporting these steps with Figure 3. The architecture of both the ViT and ConvNet models is explained clearly with visual elements. The MLP approach, layer selection, and neural network parameters are sufficient to help readers interpret the test results. This section is satisfactory in content but requires a correction of the reference given for the core architecture [50].

The experimental results are well organized, with clear figures supporting them. Figure 4 gives clear insight into the difference in performance for different layers, and these values are explained sufficiently. Figure 5 also shows a comparison of the model's accuracy with different encoders over a series of epochs.
The paper also explains the reasons for the accuracy levels for both signs and words. It gives reasonable explanations for recognition errors in the context of the methodology. Guided by the results, the paper also discusses the strong and weak points of this approach and its future potential.
In conclusion, this paper contains experiments and results suitable for the scope of the journal. The methodology is clearly explained, and the paper is well prepared, needing only minor citation corrections and further explanation.
106 - the addition of a short explanation and results is recommended.
145 - re-evaluation of the table is recommended, since one data value is inconsistent with the pattern of the data. Furthermore, if the data are correct, an explanation of the average video duration and of the reason for choosing these values is recommended.
146 - correction: “calculation of angular features”.
162 - correction in the list. Continuation with list item ‘4’ is recommended.
210 - a correction is recommended for “core architecture [50]”, since this is possibly an incorrect reference, or the use of a different value may cause confusion as a reference.
399 - reference 13 is missing in the paper or not used.

Comments on the Quality of English Language

Minor grammar corrections.

Author Response

I appreciate the valuable comments and suggestions provided. I have attached the PDF file of my response.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

 

This paper presents a study of Japanese Sign Language recognition using the Vision Transformer (ViT) model and the MediaPipe library for hand and finger pose estimation. The researchers created their own small video dataset of finger-spelled letters and words in Japanese sign language with the help of native signers. They extracted angular features from the finger joint coordinates and applied the ViT and CNN models for recognition, comparing their performance on this dataset.

 

Advantages:

 1) The creation of a real video dataset of Japanese Sign Language with the participation of native signers is a valuable contribution to the field.

 2) The proposed method of extracting angular features from finger joint coordinates is a simple and intuitive approach to gesture representation.

 3) Direct comparison of the performance of the Vision Transformer (ViT) and Convolutional Neural Networks (CNNs) on the same dataset allows the relative advantages of these models to be evaluated.

 4) Analyzing the effect of the number of encoder layers in ViT on recognition accuracy provides insight into the capabilities of ViT on relatively small data (a brief sketch of such a depth ablation follows this list).

 5) Real-time experiments demonstrate the practical application of the developed models.
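
As a point of reference for that kind of depth analysis, the sketch below uses PyTorch's generic Transformer encoder to show how the number of encoder layers can be varied in an ablation; all dimensions are placeholders, and this is not the authors' implementation.

```python
# Hypothetical depth ablation: vary the number of Transformer encoder layers
# and run a placeholder batch through each configuration.
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
for num_layers in (1, 2, 4, 8):                  # candidate encoder depths
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
    tokens = torch.randn(2, 16, 64)              # (batch, tokens, embedding dim)
    print(num_layers, encoder(tokens).shape)     # output shape is unchanged by depth
```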

 Disadvantages:

 1) The description of the angular feature extraction method lacks sufficient detail, making it difficult to reproduce and evaluate the suitability of this approach.

 2) The analysis of the effect of model hyperparameters, such as input data size, activation functions, optimisers, etc., is not comprehensive enough.

 3) The choice of the Vision Transformer over other video analysis architectures, such as 3D CNNs, is not justified.

 4) The description of the ViT and CNN model architectures used is too brief, lacking details on the number of layers, kernel sizes, activation functions and other components.

 5) The real-time results are only superficially analysed, and no clear reasons are given for recognition errors in dynamic gestures.

6) The conclusion only mentions the need for improved data segmentation, but does not propose any concrete ideas or approaches to address this issue.

 Shortcomings:

 1) Provide a more detailed description of the angular feature extraction method, including any mathematical formulae or algorithms used (an illustrative sketch follows this list). This would allow other researchers to reproduce the approach and evaluate its suitability for sign language gesture recognition.

 2) Justify the choice of the Vision Transformer (ViT) over other architectures, such as 3D convolutional networks, for video analysis. Explain the specific advantages of ViT that make it a better choice for this application.

 3) Provide a more detailed description of the ViT and CNN model architectures used, including the number of layers, kernel sizes, activation functions, and other components. This information is crucial for other researchers to replicate and build on the work.
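
As an illustration of what such a description could look like, the sketch below computes the angle at a joint from the two bone vectors meeting there via the normalized dot product; this is a common formulation and is not claimed to be the authors' exact definition.

```python
# Illustrative angular feature: angle (in radians) at a joint formed by the
# segments joint->previous landmark and joint->next landmark.
import numpy as np

def joint_angle(p_prev, p_joint, p_next):
    v1 = np.asarray(p_prev) - np.asarray(p_joint)
    v2 = np.asarray(p_next) - np.asarray(p_joint)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

# Example with three hypothetical 3D landmarks along one finger.
angle = joint_angle((0.10, 0.20, 0.00), (0.15, 0.25, 0.00), (0.15, 0.32, 0.00))
```

Stating the formula in this form, together with which landmark triplets are used, would make the feature extraction reproducible.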

 Addressing these shortcomings would significantly improve the quality of the paper and provide a more comprehensive understanding of the study. The paper needs further development and refinement.

 

Comments on the Quality of English Language

Moderate editing of English language required

Author Response

I appreciate the valuable comments and suggestions provided. I have attached the PDF file of my response.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper presents a method for sign language recognition using ViT and CNNs.

The method uses angles extracted from pose estimation keypoints as features. These angles are converted to 2D tensors suitable for ViT and CNNs.

The resulting metrics are very high so we suggest the following:

1. Evaluate the level of overfitting.

2. Perform ablation studies with different input features and model sizes.

Author Response

I appreciate the valuable comments and suggestions provided. I have attached the PDF file of my response.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors took my comments into account and responded to them in some detail, and also made changes in line with the comments.

Comments on the Quality of English Language

Minor editing of English language required

Author Response

We had our manuscript edited by an English editing service with a native speaker.

We hope our corrected paper satisfies you.

Thank you.

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript looks improved. However, there are still some questions.

Can the authors explain how the data was split?

- Do the training and test sets contain signs from the same person, or from different people? It is better if the people used for testing are different from the ones used for training, to better evaluate generalization (see the sketch after these questions).

- How many participants did you use to collect the data?

- How many videos did you collect? It says, "The dataset encompassed video recordings of six distinct instances, with an equal division of three videos, each contributed by male and female participants." 

What does this mean? 6 x 3 = 18 recordings? This is not enough for training a deep learning model. What are instances? Does instance = sign? Please clarify.

If that is correct, can you explain how a deep learning model can learn from such a small dataset? This seems like memorization.
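
For reference, a signer-independent (group-wise) split of the kind suggested above could be set up as in the following sketch, assuming scikit-learn is available; `signer_ids` and the array shapes are hypothetical placeholders, not values from the paper.

```python
# Hypothetical signer-independent split: no participant appears in both
# the training set and the test set.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(18, 64)             # placeholder feature tensors
y = np.random.randint(0, 6, size=18)   # placeholder sign labels
signer_ids = np.repeat([0, 1, 2], 6)   # which participant produced each sample

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=signer_ids))
assert set(signer_ids[train_idx]).isdisjoint(signer_ids[test_idx])
```

Reporting accuracy under such a split would show how well the model generalizes to unseen signers rather than to unseen clips of the same signers.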

 

Please add a y-axis label to Figure 7.

Author Response

We appreciate the additional valuable comments and suggestions received. This response letter consists of sections addressing the comments of one reviewer.

We have attached our response at the link below.

Thank you.

Author Response File: Author Response.pdf

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

It is clearer now, thank you.

 
