Article
Peer-Review Record

Real-Time Machine Learning for Accurate Mexican Sign Language Identification: A Distal Phalanges Approach

Technologies 2024, 12(9), 152; https://doi.org/10.3390/technologies12090152
by Gerardo García-Gil, Gabriela del Carmen López-Armas *, Juan Jaime Sánchez-Escobar *, Bryan Armando Salazar-Torres and Alma Nayeli Rodríguez-Vázquez
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 5 June 2024 / Revised: 13 August 2024 / Accepted: 20 August 2024 / Published: 4 September 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Find the following suggestions/comments to improve the quality of the manuscript:

- The citations in (Line 34) should be placed at the end of the sentence. Same as in (Line 37). Consider checking the whole manuscript for such

- The introduction aspect of the "Abstract" needs to be summarized in 1-2 sentences. Then the abstract needs to be restructured.

- It is not clear whether they applied it during training or testing. Hence, the validation of the methodology should be clearly stated to ensure a fair evaluation

- Include the potential limitations of the sensor placement and a comparison should be made with the previous methods for sign language recognition

- The discussion section should be structured to bring out the novelty or motivation of the study

- The conclusion section needs modification to say the summary of the findings

- Check some minor spelling and grammatical errors

Comments on the Quality of English Language

Minor English Language Editing

Author Response

Dear Reviewer,

 

Thank you for reviewing our manuscript, "Real-Time Machine Learning for Accurate Mexican Sign Language Identification: A Distal Phalanges Approach." We have carefully considered all your suggestions and made the corresponding revisions. Below, we respond to each of your comments and detail the changes made to the manuscript.

 

Reviewer Comments 1: The citations in (Line 34) should be placed at the end of the sentence. Same as in (Line 37). Consider checking the whole manuscript for such.

 

Response 1: Thank you for your observation. The citations have been placed at the end of the sentences in the paragraphs where they were needed, as you correctly pointed out. Examples of this can be seen in the modifications in Section 1, lines 36, 43, and 57, and in Section 2, lines 86 and 127.

 

 

Reviewer Comments 2: The introduction aspect of the "Abstract" needs to be summarized in 1-2 sentences. Then, the abstract needs to be restructured.

 

Response 2: Dear Reviewer, thank you for your suggestion. We have modified it; two sentences in the abstract have been deleted, further restructuring the content. This is the current version:

 

 

Abstract: Effective communication is crucial in daily life, and for people with hearing disabilities, sign language is no exception, serving as their primary means of interaction. Various technologies, such as cochlear implants and mobile sign language translation applications, have been explored to enhance communication and improve the quality of life for the deaf community. This article presents a new, innovative method that uses real-time machine learning (ML) to accurately identify Mexican Sign Language (MSL) and is adaptable to any sign language. Our method is based on analyzing six features representing the angles between the distal phalanges and the palm, thus eliminating the need for complex image processing. Our ML approach achieves accurate sign language identification in real time, with an accuracy and F1 score of 99%. These results demonstrate that a simple approach can effectively identify sign language. This advance is significant as it offers an effective and accessible solution to improve communication for people with hearing impairments. Furthermore, the proposed method has the potential to be implemented in mobile applications and other devices to provide practical support to the deaf community. A video illustrating our method is available for download at https://github.com/gggvamp/pdi/blob/main/videomsl.mp4.

 

 

 

Reviewer Comment 3: It is not clear whether they applied it during training or testing. Hence, the validation of the methodology should be clearly stated to ensure a fair evaluation.

 

Response 3: We greatly appreciate your valuable observations. After careful consideration, we consider the methodology employed to be appropriate because our proposed method, based on DT C4.5, was validated during both the training and testing phases. Regarding the observation about validating the methodology during training and testing, we provide a detailed clarification below:

 

 

Data Splitting and Validation:

Our approach focused on classification and prediction in the sign language domain. The dataset was divided into training and testing sets, assigning 80% of the data to the training set and 20% to the test set to ensure adequate representation for the algorithms. We used the numeric value 42 to seed the random number generator, ensuring reproducibility of the data split.

For clarity, Section 3.2, Data Loading and Dataset Preparation, in the manuscript details the validation process used. This section describes preprocessing techniques during training and independent evaluation during testing, with the corresponding modification made to lines [181-197].

The process began with collecting a dataset explicitly designed for our task, extracted from a proprietary database stored in an Excel file with 5690 records collected and trained by four people of different genders and ages. The database can be downloaded from https://github.com/gggvamp/pdi/blob/main/datosO.xlsx, since we focus on classification and prediction in the sign language domain. The last column of the dataset contains the class labels corresponding to the sign language letters, while the preceding six columns contain six relevant predictive features of the hand, as shown in Figure 3(a,b). The training set was used to fit the model letter by letter. We applied essential preprocessing techniques to ensure the quality and adequacy of the data. This process included identifying and removing outliers, which could bias the analysis results, followed by normalization and standardization of the features: for each letter, we adjusted the scales of the six features to achieve a standard distribution, which facilitates the comparison of different hand signs (letters) and improves the stability of the analysis models.
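For concreteness, the following minimal sketch illustrates the data loading and 80/20 split described above. The file name and column layout follow this response (six angle features first, the letter label last); the exact code used in the manuscript may differ.

```python
# Minimal sketch of the data loading and 80/20 split described above.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_excel("datosO.xlsx")   # 5690 records, per the response
X = df.iloc[:, :6]                  # six distal-phalange angle features
y = df.iloc[:, -1]                  # sign language letter labels

# 80% training / 20% testing, random number generator seeded with 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```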

 

Training Phase:

 

During training, the training set was used to fit the model. We applied essential preprocessing techniques to ensure the quality and adequacy of the data. This process included:

1) Identification and elimination of outliers: We meticulously removed outliers that could potentially skew our analysis results, ensuring the integrity of our data.

2) Normalization or standardization of features: We adjusted the feature scales to achieve a standard distribution, which not only facilitates the comparison of different variables but also significantly improves the stability of the analysis models.

On page 5, we included text with two figures named Figure 3(a,b), which detail the explanations given above in points 1 and 2, respectively.

 

  • Set of characteristics for letter A without filtering.

 

 

 

  • Set of characteristics for letter A after filtering.
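As an illustration of points 1) and 2), the sketch below applies a z-score outlier filter followed by standardization to the six angle features of one letter. The 3-sigma threshold is an illustrative assumption, not necessarily the cutoff used in the manuscript.

```python
# Hedged sketch of the per-letter preprocessing: outlier removal plus standardization.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_letter(df_letter: pd.DataFrame, z_thresh: float = 3.0) -> pd.DataFrame:
    feats = df_letter.iloc[:, :6]
    # 1) Outlier removal: keep rows whose six angles all lie within z_thresh standard deviations
    z = np.abs((feats - feats.mean()) / feats.std(ddof=0))
    kept = df_letter[(z < z_thresh).all(axis=1)].copy()
    # 2) Standardization: rescale the six angle features to zero mean and unit variance
    kept.iloc[:, :6] = StandardScaler().fit_transform(kept.iloc[:, :6])
    return kept
```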

 

Evaluation and Testing:

 

After training the model with the training set, we evaluated its performance using the independent test set (via the confusion matrix). This strict separation between training and testing data ensures that the reported performance on the test set reflects the model's ability to generalize to previously unseen data, complemented by cross-validation. This approach helps us avoid overfitting issues and more accurately evaluates the model's performance under real-time and user-independent conditions.
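A minimal sketch of this held-out evaluation is shown below; scikit-learn's DecisionTreeClassifier (a CART implementation) stands in for DT C4.5, which the library does not provide directly, and the variable names follow the earlier loading sketch.

```python
# Sketch of the train/test evaluation: fit on the training split, then report the
# confusion matrix and per-class metrics on the independent test set.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))                 # per-letter error structure
print(classification_report(y_test, y_pred, digits=2))  # precision, recall, F1 per letter
```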

Furthermore, in Section 3.5.2, Validation Phase, in Table 3, 'Validation Process for Hand Gesture Recognition System with DT C4.5,' on line 23, it is stated that the decision tree model was employed in the validation process. This choice guaranteed a thorough and rigorous evaluation of the model's performance. However, we have included a section in the discussion to address this concern and provide additional context.

 

 

 

Reviewer Comment 4: Include the potential limitations of the sensor placement, and a comparison should be made with the previous methods for sign language recognition

 

Response 4: We value your question and the insights it brings. In response, the lighting conditions in the camera environment play an important role in the performance of the machine learning algorithms, as does the automatic identification of Mexican Sign Language. However, it is essential to note that by adequately controlling the lighting of the camera environment, we obtained the numerical results shown in Section 4, where each algorithm's performance was measured regardless of the position or distance of the hand from the camera, within a range of 2 meters. In addition, we made the relevant changes in the first and second paragraphs of pages 8 and 9 of the proposed work, including the comments mentioned above. See Figure 5(a,b) and lines 291 to 304.

 

 

     (a) Sign A, hand close to the camera.        (b) Hand far from the camera.

 

 

 

 

 

Reviewer Comment 5: The discussion section should be structured to bring out the novelty or motivation of the study

 

Response 5: Thank you for your feedback regarding the need to structure the discussion section better to highlight the novelty and motivation of the study. We have revised the discussion section to address these key aspects. In the revised version, we emphasize that our work is unique. It stands out due to its focus on dimensionality reduction, utilizing only six features representing the angles between distal phalanges and the palm. This approach minimizes the need for complex image processing, simplifying the model while addressing the scarcity of studies on Mexican Sign Language (MSL).

We also highlight that while state-of-the-art algorithms such as Random Forest, XGBoost, and LightGBM are highly accurate, they present significant drawbacks in terms of training time. In contrast, our use of the DT C4.5 algorithm demonstrated competitive performance with superior efficiency in training and prediction, which is crucial for real-time applications. Furthermore, we compare our work with previous studies in the literature, emphasizing how our methodology contributes simplicity and efficiency while maintaining high accuracy. These differences underscore the novelty of our approach and its potential applicability in practical scenarios.

 

We hope these revisions address your suggestion and provide a more structured discussion focused on our study's innovative and motivational aspects.

 

Reviewer Comment 6: The conclusion section needs modification to say the summary of the findings

 

Response 6: Thank you for your feedback on the conclusions section. In the revised version, we have summarized the key findings of our study, highlighting that our Decision Tree C4.5 algorithm demonstrated outstanding performance, achieving nearly 99% precision, recall, and F1 score. Compared to state-of-the-art algorithms such as Random Forest, XGBoost, LightGBM, and CatBoost, DT C4.5 stands out for its balance between computational efficiency and predictive accuracy. While CatBoost and neural networks offer competitive precision, they require significantly longer training times, which makes them less suitable for real-time or large-scale applications. On the other hand, DT C4.5 proved to be the most balanced option for our MSL predictor, delivering robust performance without the drawbacks associated with more computationally demanding models. This study underscores the potential of simpler models like DT C4.5 to achieve high accuracy in specialized tasks where advanced models' complexity and resource demands may not be justified.

 

Reviewer Comment 7: Check some minor spelling and grammatical errors

 

Response 7: Thank you. We have made an effort to correct spelling and grammatical errors as much as possible.

 

Your comments have been invaluable in strengthening our manuscript, and we are grateful for your role in improving the quality of our work. Your contributions have significantly improved the final manuscript, and we are truly thankful for your time and effort in reviewing our work.

 

Sincerely,

The corresponding authors:

Gerardo García-Gil, PhD; Gabriela del Carmen López Armas, PhD, MD; and Juan Jaime Sánchez Escobar, PhD.

Technical Industrial Teaching Center (Centro de Enseñanza Técnica Industrial)

[email protected]; [email protected]; [email protected]

 

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents a novel method for real-time Mexican Sign Language (MSL) identification using machine learning. The method focuses on six features representing the angles between the distal phalanges and the palm, reducing the need for complex image processing. The approach achieves high accuracy (99%) and F1 scores, demonstrating its efficiency and potential for implementation in mobile applications to aid communication for people with hearing impairments. Some suggestions are as follow for improvement: 

 - The dataset appears to be limited in terms of diversity. Increasing the variety of hand shapes, skin tones, and backgrounds in the dataset could enhance the robustness and generalizability of the model.

 -  While the paper focuses on six specific angles as features, it might be beneficial to explore the inclusion of additional features or combinations of features to see if they could improve model performance further.

 - The paper briefly mentions other machine learning models but does not provide a detailed comparative analysis. Including a comprehensive comparison with state-of-the-art models, such as deep learning approaches, could provide a better context for the effectiveness of the proposed method.

 - Conduct a detailed error analysis to understand the types of misclassifications the model makes. This could help in refining the model and addressing specific weaknesses

 - While the paper claims reduced computational time, providing concrete metrics on processing time and resource usage would be beneficial for readers to understand the practical implications of deploying the model on mobile devices.

 - Discussing the accessibility features and usability testing with the target audience (e.g., individuals with hearing impairments) can provide insights into the practical utility and acceptance of the proposed system.

 - Discuss/add the following into the introduction section:

 - DOI: 10.3837/tiis.2021.06.006

 

 

Comments on the Quality of English Language

- Some minor changes required 

Author Response

Dear Reviewer,

 

Thank you very much for reviewing our manuscript titled "Real-Time Machine Learning for Accurate Mexican Sign Language Identification: A Distal Phalanges Approach." We greatly appreciate your detailed and constructive feedback. We have carefully considered all your suggestions and made the corresponding revisions. Below, we respond to each of your comments and detail the changes made to the manuscript.

 

The paper uses machine learning to present a novel method for real-time Mexican Sign Language (MSL) identification. This method focuses on six features representing the angles between the distal phalanges and the palm, reducing the need for complex image processing. The approach achieves high accuracy (99%) and F1 scores, demonstrating its efficiency and potential for implementation in mobile applications to aid communication for people with hearing impairments. Some suggestions are as follows for improvement: 

 

Reviewer 2 Comments 1: The dataset appears to be limited in terms of diversity. Increasing the variety of hand shapes, skin tones, and backgrounds in the dataset could enhance the robustness and generalizability of the model.

 

Response 1:

We appreciate your comments and suggestions on including a diverse range of hand shapes, sizes, and colors in our study. We value the importance of diversity in the dataset for evaluating the strength and applicability of the model. However, it is important to note that if a participant had a congenital hand anomaly, the experiment could not be conducted, which would be a clear limitation. Additionally, skin color was not a factor considered in the algorithm programming for our research and, therefore, had no impact on our results. The robustness of the model consists precisely in that, regardless of the size or shape of the hands, it only needs the angles between the distal phalanges of the fingers and the palm to calculate the output label of the predicted letter. Nevertheless, following your suggestion, we added several example images of different hands (see the attached response file).


Reviewer 2 Comments 2: While the paper focuses on six specific angles as features, it might be beneficial to explore the inclusion of additional features or combinations of features to see if they could improve model performance further.

 

Response 2: We appreciate your comment and the suggestion to explore the inclusion of additional features to improve model performance. The following explains our choice to use six specific features and how this decision was made.

 

The choice of the number and type of features for a machine learning model is a trade-off between simplicity and performance. In addition, there are always pros and cons to this type of decision; for example, here are some structured points to consider:

 

Advantages of using fewer features:

  1. Simplicity: Fewer features make the model easier to interpret and less prone to overfitting, especially if the data set is small.
  2. Lower Computational Complexity: With fewer features, the model requires fewer computational resources to train and make predictions.
  3. Noise Reduction: Fewer features can mean less noise and redundancy in the data, improving the model’s ability to generalize to new data.

 

Disadvantages of using fewer features:

  1. Loss of Information: Important features that capture crucial aspects of the problem may be missing, which could limit model performance.
  2. Less Flexibility: A limited set of features may not capture all the variability in the data, which may result in a less robust model.

 

Advantages of using more features:

  1. Capture More Information: More features can provide a more complete and richer representation of the problem, potentially improving model performance.
  2. Performance Improvement: In some cases, adding additional features can help improve the accuracy and predictive capability of the model, especially if the new features capture relevant information that was not previously present.

 

Disadvantages of using more features:

  1. Complexity and Risk of Overfitting: With more features, the model can become more complex and prone to overfitting, especially if the data set is not large enough.
  2. Higher Computational Cost: More features can increase the computational and time requirements for training the model and making predictions.

 

A balanced strategy might be to start with a few features, such as the six currently in use, and then explore the possibility of adding additional features in a controlled manner. This can be done using feature selection techniques to determine the most relevant features and avoid redundancies. In our case, MediaPipe provides 21 different hand landmarks, and at first we worked with all of them, but this led to confusion between letters and longer response times. Even though it was relatively fast, the comparison showed that taking features from the angles between the distal phalanges was superior.
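For illustration only, the sketch below shows one way an angle feature could be derived from 3-D hand landmarks such as MediaPipe's 21 points; the landmark indices and the angle definition are hypothetical and do not reproduce our exact feature set.

```python
# Hypothetical sketch: angle (in degrees) at a joint formed by three 3-D landmarks.
import numpy as np

def angle_deg(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle at vertex b formed by points a-b-c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Example with MediaPipe-style indices: wrist (0), index MCP (5), index fingertip (8).
wrist = np.array([0.0, 0.0, 0.0])
index_mcp = np.array([0.0, 1.0, 0.0])
index_tip = np.array([0.5, 2.0, 0.1])
print(angle_deg(wrist, index_mcp, index_tip))  # one candidate angle feature
```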

 

So, considering this work, reducing the features to six was the best option. Fewer well-selected features can lead to better model performance and reduce the possibility of errors in the process, and it was essential to validate this through experimentation. The goal was to find a balance where the model performs optimally without unnecessary complexity. To support this conclusion, we consulted more detailed discussions in articles such as:

 

Jason Brownlee (2020), in "How to Choose a Feature Selection Method For Machine Learning," emphasizes that there is no one-size-fits-all approach to feature selection. He recommends experimenting with different methods and feature sets to find the optimal balance for a specific problem without including redundant variables. Techniques such as correlation coefficients, ANOVA, chi-square tests, and mutual information are commonly used to assess the importance of features.

 

Pradip Dhal et al. (2021), in "A comprehensive survey on feature selection in various fields of machine learning," discuss various feature selection methods and their impact on different machine learning tasks. They highlight that while reducing the number of features can improve the interpretability and efficiency of the model, it is crucial to ensure that important predictive information is not lost in the process.
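As a generic illustration of one of the techniques named above (mutual information), the following sketch scores the six angle features against the letter labels; this is a general example, not a step reported in the manuscript, and it reuses the variables from the earlier loading sketch.

```python
# Illustrative feature-scoring sketch using mutual information (higher = more informative).
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X_train, y_train, random_state=42)
for name, score in sorted(zip(X_train.columns, mi), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```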

 

 

Reviewer 2 Comments 3: The paper briefly mentions other machine learning models but does not provide a detailed comparative analysis. Including a comprehensive comparison with state-of-the-art models, such as deep learning approaches, could provide a better context for the effectiveness of the proposed method.

 

Response 3: We appreciate your comments on needing a more detailed benchmarking analysis with state-of-the-art models. Below, we explain the benchmarking analysis performed and plans to address this suggestion.

A detailed benchmarking analysis was performed in the context of machine learning models, which involves evaluating and comparing different models based on various metrics and criteria. Please keep in mind that in Section 2.1, Figure 2 presents the Taxonomy of Sign Language Recognition, and only algorithms of the same nature are considered for comparison.

 

Components of a detailed comparative analysis

Model Selection: The following types of machine learning models were included in the comparison:

Basic classification models (Decision Trees, Naive Bayes, K-Nearest Neighbors, Support Vector Machines) and advanced models (Random Forest, Gradient Boosting, LightGBM, CatBoost).

 

The suggestion is gladly accepted. Some deep learning algorithms were compared, and the results obtained show that, while deep learning algorithms can be highly accurate, they take a long time to train. To include cross-validation evaluation in the context of neural networks, the model is trained and assessed at each fold. However, cross-validation can be resource-intensive due to the high computational load involved in training RNNs.

 

Classification report: machine learning DT C4.5 vs. RNNs

Decision Tree (C4.5)

| Model Results | Execution Times | Total Metrics | Cross-Validation Scores |
| --- | --- | --- | --- |
| Training Accuracy: 1.00 | Training Time: 0.01 seconds | Precision: 0.99 | Cross-Validation Time: 0.07 seconds |
| Training Loss: 0.00 | Prediction Time: 0.00 seconds | Recall: 0.99 | CV Scores: [0.9771529, 0.98506151, 0.98418278, 0.98418278, 0.98506151] |
| Testing Accuracy: 0.99 | | F1 Score: 0.99 | CV Mean: 0.98 |
| Testing Loss: 0.01 | | | CV Standard Deviation: 0.00 |

 

 

 

 

 

 

 

 

Dense Neural Network (DNN), also known as a Fully Connected Network (FCN), is a type of artificial neural network in which each neuron in one layer is connected to every neuron in the next layer. This network structure is characterized by dense connections, allowing for the modeling of complex relationships in the data.

 

 

 

 

Dense Neural Network

| Model Results | Execution Times | Total Metrics | Cross-Validation Scores |
| --- | --- | --- | --- |
| Training Accuracy: 0.91 | Training Time: 7.06 seconds | Precision: 0.97 | Cross-Validation Time: 10.02 seconds |
| Training Loss: 0.29 | Prediction Time: 0.11 seconds | Recall: 0.96 | CV Scores: [0.92442882, 0.92970120, 0.90333920, 0.91388398, 0.92618626] |
| Testing Accuracy: 0.96 | | F1 Score: 0.96 | CV Mean: 0.92 |
| Testing Loss: 0.04 | | | CV Standard Deviation: 0.01 |

 

 

 

 

 

Deep Neural Network (DNN) is a type of artificial neural network with multiple layers between the input and output layers. It is a subclass of deep learning and is particularly suited for complex pattern recognition and classification tasks.

      

 

 

 

Deep Neural Network

| Model Results | Execution Times | Total Metrics | Cross-Validation Scores |
| --- | --- | --- | --- |
| Training Accuracy: 0.92 | Training Time: 7.09 seconds | Precision: 0.96 | Cross-Validation Time: 29.00 seconds |
| Training Loss: 0.27 | Prediction Time: 0.12 seconds | Recall: 0.96 | CV Scores: [0.96660810, 0.97012309, 0.96748679, 0.96836555, 0.97978907] |
| Testing Accuracy: 0.96 | | F1 Score: 0.96 | CV Mean: 0.97 |
| Testing Loss: 0.04 | | | CV Standard Deviation: 0.01 |

 

 

 

 

 

 

 

LSTM networks are a special kind of RNN capable of learning long-term dependencies. They are designed to avoid the long-term dependency problem, often referred to as the vanishing gradient problem.

 

      

 

 

 

 

 

LSTM Network

| Model Results | Execution Times | Total Metrics | Cross-Validation Scores |
| --- | --- | --- | --- |
| Training Accuracy: 0.96 | Training Time: 12.04 seconds | Precision: 0.97 | Cross-Validation Time: 21.78 seconds |
| Training Loss: 0.14 | Prediction Time: 0.42 seconds | Recall: 0.97 | CV Scores: [0.94376099, 0.92970120, 0.94024604, 0.92179262, 0.94639718] |
| Testing Accuracy: 0.97 | | F1 Score: 0.97 | CV Mean: 0.94 |
| Testing Loss: 0.03 | | | CV Standard Deviation: 0.01 |

 

 

 

 

 

 

Model Results Comparison

| Metric | Deep Neural Network | Dense Neural Network | LSTM Network | Decision Tree (C4.5) |
| --- | --- | --- | --- | --- |
| Training Accuracy | 0.92 | 0.91 | 0.96 | 1.00 |
| Training Loss | 0.27 | 0.29 | 0.14 | 0.00 |
| Testing Accuracy | 0.96 | 0.96 | 0.97 | 0.99 |
| Testing Loss | 0.04 | 0.04 | 0.03 | 0.01 |

 

Execution Times Comparison

| Metric | Deep Neural Network | Dense Neural Network | LSTM Network | Decision Tree (C4.5) |
| --- | --- | --- | --- | --- |
| Training Time | 7.09 seconds | 7.06 seconds | 12.04 seconds | 0.01 seconds |
| Prediction Time | 0.12 seconds | 0.11 seconds | 0.42 seconds | 0.00 seconds |

 

Total Metrics Comparison

| Metric | Deep Neural Network | Dense Neural Network | LSTM Network | Decision Tree (C4.5) |
| --- | --- | --- | --- | --- |
| Accuracy | 0.96 | 0.96 | 0.96 | 0.99 |
| Precision | 0.96 | 0.97 | 0.97 | 0.99 |
| Recall | 0.96 | 0.96 | 0.97 | 0.99 |
| F1 Score | 0.96 | 0.96 | 0.97 | 0.99 |

 

Cross-Validation Scores Comparison

| Metric | Deep Neural Network | Dense Neural Network | LSTM Network | Decision Tree (C4.5) |
| --- | --- | --- | --- | --- |
| Cross-Validation Time | 29.00 seconds | 10.02 seconds | 21.78 seconds | 0.07 seconds |
| CV Scores | [0.96660810, 0.97012309, 0.96748679, 0.96836555, 0.97978907] | [0.92442882, 0.92970120, 0.90333920, 0.91388398, 0.92618626] | [0.94376099, 0.92970120, 0.94024604, 0.92179262, 0.94639718] | [0.9771529, 0.98506151, 0.98418278, 0.98418278, 0.98506151] |
| CV Mean | 0.97 | 0.92 | 0.94 | 0.98 |
| CV Standard Deviation | 0.01 | 0.01 | 0.01 | 0.00 |

 

 

 

Given the compared data, it is determined that:

 

The Decision Tree (C4.5) shows the best training accuracy and training loss (100% and 0.00, respectively); the very high training accuracy may indicate overfitting, although its test accuracy and test loss are also very strong. It has the fastest training and prediction times (0.01 and 0.00 seconds, respectively), which makes it very efficient for real-time applications, and it achieves the best accuracy, precision, recall, and F1 score (0.99 in all cases), indicating the most consistent performance across the total metrics. It also has the best average cross-validation score (0.98) and the lowest standard deviation (0.00), indicating high consistency and reliable performance. It stands out for its execution speed, its consistency in cross-validation, and its ability to offer the best total metrics, although a risk of overfitting remains if the model becomes too complex.

 

The LSTM Network balances high training and test accuracy with low training and test loss, suggesting that the model generalizes well to new data. It has the longest training time (12.04 seconds) and the longest prediction time (0.42 seconds). It follows closely in accuracy, recall, and F1 score (0.97), showing solid performance on most metrics, and it performs well in cross-validation, with a mean score of 0.94 and a low standard deviation (0.01). Together with the deep neural network, it is a very good choice for tasks that require a good balance between accuracy and generalization capability, but at the cost of longer training and prediction times.

 

The Deep Neural Network has the best cross-validation mean (0.97) among the neural networks and solid overall performance, but it does not stand out as much as the other options in terms of accuracy, loss, and run times. The Dense Neural Network has the lowest cross-validation mean (0.92) and comparatively high variability, which may suggest less stable performance; otherwise, it yields results similar to the Deep Neural Network, with slightly lower training accuracy and higher training loss than the LSTM but the same test accuracy.

 

For real-time applications, such as a hand signal predictor in a mobile application, Decision Tree (C4.5) might be the best choice due to its prediction and training speed. However, if accuracy is crucial and longer training and prediction time can be tolerated, LSTM Network or Deep Neural Network might offer better results.
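The timing comparisons above can be reproduced in spirit with a sketch like the one below, where scikit-learn's MLPClassifier stands in for the dense network; the networks reported in this response were built differently, so absolute times will not match.

```python
# Sketch of measuring training/prediction time and test accuracy for two model families.
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

models = [
    ("Decision Tree (CART, C4.5 stand-in)", DecisionTreeClassifier(random_state=42)),
    ("Dense NN (MLP stand-in)", MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=42)),
]
for name, model in models:
    t0 = time.perf_counter(); model.fit(X_train, y_train); fit_s = time.perf_counter() - t0
    t0 = time.perf_counter(); acc = model.score(X_test, y_test); pred_s = time.perf_counter() - t0
    print(f"{name}: fit {fit_s:.2f}s, predict {pred_s:.2f}s, test accuracy {acc:.2f}")
```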

 

Reviewer 2 Comments 4: Conduct a detailed error analysis to understand the types of misclassifications the model makes. This could help in refining the model and addressing specific weaknesses

 

Reviewer 2 Comments 5: While the paper claims reduced computational time, providing concrete metrics on processing time and resource usage would be beneficial for readers to understand the practical implications of deploying the model on mobile devices.

 

Response 4 & 5:

 

We appreciate your suggestion regarding the detailed analysis of errors and performance times to better understand the types of misclassifications the model makes. With your permission, we address comments 4 and 5 together. We agree that this analysis could provide valuable insights for refining the model and addressing its specific weaknesses. Below, we present a detailed analysis of the errors and times recorded by the models:

 

A detailed comparative analysis was conducted in the context of machine learning models, which involves evaluating and comparing different models based on various metrics and criteria. Below, we explain the key components of this analysis and how to carry it out:

 

Evaluation Metrics

 

Some evaluation metrics, such as the confusion matrix and/or cross-validation for the other algorithms, may have been missing. The confusion matrix is a very useful tool for assessing the performance of a classification model. This matrix provides a detailed way to show the algorithm's performance, allowing one to identify how many predictions were correct and how many errors were made. The confusion matrix is especially valuable when working with imbalanced data or, in this case, multiple classes. Conversely, cross-validation is a technique used to evaluate a model's generalization ability, i.e., its performance on unseen data. Its goal is to mitigate the risk of overfitting by ensuring that the model is not simply memorizing the training data.

 

Cross-validation scores are the model’s performance metrics evaluated on different subsets of the training data. In the context of cross-validation, the data is divided into multiple folds, and the model is trained and evaluated multiple times, each time using a different fold as the test set and the others as the training set.
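A short sketch of this fold-by-fold evaluation, reusing the classifier from the earlier sketch, is given below; the 5-fold stratified split is an illustrative choice.

```python
# Sketch of 5-fold cross-validation: individual fold scores, their mean, and standard deviation.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(clf, X_train, y_train, cv=cv)
print("Fold scores:", np.round(fold_scores, 3))
print("Mean: %.2f  Std: %.2f" % (fold_scores.mean(), fold_scores.std()))
```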

 

Interpretation of Results

 

Individual Scores: Each score represents the model's performance on one of the cross-validation folds.

Mean of Scores: The mean of these scores provides a general estimate of the model’s performance.

Standard Deviation: The standard deviation of the scores indicates the variability in performance estimates, providing insight into the model's stability.

Data and Preparation: To ensure a fair comparison, it is crucial that all models are trained and evaluated on the same preprocessed dataset. This is noted in Section 3.2, particularly in Figures 3(a) and (b), which are included in the article.

 

Visualization of Results: Please see Tables 4 and 6 to visually compare the performance metrics of different models. We invite you to review Figure 9 (a-i). We agree that providing concrete metrics on processing time and resource usage is essential for understanding the practical implications of implementing the model. Providing these metrics helps readers evaluate the trade-offs between accuracy and efficiency and make informed decisions about the implementation of machine learning models on mobile devices.
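For the error analysis itself, a simple sketch such as the one below extracts the most frequently confused letter pairs from the confusion matrix; variable names follow the earlier sketches, and reporting the top three pairs is an illustrative choice.

```python
# Sketch of a misclassification analysis: rank the largest off-diagonal confusion-matrix cells.
import numpy as np
from sklearn.metrics import confusion_matrix

labels = sorted(y_test.unique())
cm = confusion_matrix(y_test, y_pred, labels=labels)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)

for flat_idx in np.argsort(off_diag, axis=None)[::-1][:3]:
    i, j = np.unravel_index(flat_idx, off_diag.shape)
    print(f"True '{labels[i]}' predicted as '{labels[j]}': {off_diag[i, j]} times")
```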

 

The metrics used to measure the performance of machine learning (ML) algorithms are well-established, and we referenced various scientific and academic sources that discuss in detail the most appropriate metrics and their interpretation. Among the works consulted are:

 

Kevin P. Murphy (2022). Probabilistic Machine Learning: An Introduction. MIT Press. This book, a continuation of the influential Machine Learning: A Probabilistic Perspective (2012), offers an updated approach to machine learning with an emphasis on probabilistic methods and the interpretation of evaluation metrics.

 

Sebastian Raschka et al. (2022). Machine Learning with PyTorch and Scikit-Learn. Packt Publishing. This practical resource covers a wide range of machine learning techniques, including model evaluation and performance metrics, using modern tools like PyTorch and Scikit-Learn.

Below, we include detailed tables with error measures (including standard deviations) and the processing times you suggested, which we hope also addresses comment 5. The tables have been incorporated into the manuscript along with the respective confusion matrices (pages 17 and 18) and performance tables of the compared ML algorithms (lines 511-532).

Model Results

| Metric/Method | DT (C4.5) | SVM | NB | k-NN | RF | XGBoost | LightGBM | CatBoost | RNNs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training Accuracy | 1.00 | 0.98 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 0.92 |
| Training Loss | 0.00 | 0.02 | 0.03 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.27 |
| Testing Accuracy | 0.99 | 0.96 | 0.96 | 0.98 | 0.99 | 0.99 | 0.99 | 1.00 | 0.96 |
| Testing Loss | 0.01 | 0.04 | 0.04 | 0.02 | 0.01 | 0.01 | 0.01 | 0.00 | 0.04 |
| Accuracy | 0.99 | 0.96 | 0.96 | 0.98 | 0.99 | 0.99 | 0.99 | 1.00 | 0.96 |
| Precision | 0.99 | 0.96 | 0.97 | 0.98 | 1.00 | 0.99 | 0.99 | 1.00 | 0.96 |
| Recall | 0.99 | 0.96 | 0.96 | 0.98 | 0.99 | 0.99 | 0.99 | 1.00 | 0.96 |
| F1 Score | 0.99 | 0.96 | 0.96 | 0.98 | 0.99 | 0.99 | 0.99 | 1.00 | 0.96 |

 

Execution Times

| Time/Method | DT (C4.5) | SVM | NB | k-NN | RF | XGBoost | LightGBM | CatBoost | RNNs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training Time | 0.01 sec | 0.05 sec | 0.00 sec | 0.02 sec | 0.36 sec | 0.31 sec | 0.51 sec | 2.42 sec | 7.09 sec |
| Prediction Time | 0.00 sec | 0.03 sec | 0.01 sec | 0.03 sec | 0.02 sec | 0.00 sec | 0.10 sec | 0.00 sec | 0.12 sec |

 

 

Cross-Validation Scores

| Metric/Method | DT (C4.5) | SVM | NB | k-NN | RF | XGBoost | LightGBM | CatBoost | RNNs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cross-Validation Time | 0.07 sec | 0.45 sec | 0.02 sec | 0.24 sec | 1.78 sec | 1.58 sec | 2.41 sec | 11.17 sec | 29.00 sec |
| CV Mean | 0.98 | 0.96 | 0.95 | 0.96 | 0.98 | 0.97 | 0.97 | 0.99 | 0.97 |
| CV Standard Deviation | 0.00 | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.00 | 0.01 |

CV Scores (per fold):
- DT (C4.5): [0.977, 0.985, 0.984, 0.984, 0.985]
- SVM: [0.946, 0.960, 0.932, 0.952, 0.951]
- NB: [0.954, 0.963, 0.936, 0.926, 0.976]
- k-NN: [0.965, 0.982, 0.954, 0.944, 0.973]
- RF: [0.984, 0.992, 0.971, 0.974, 0.976]
- XGBoost: [0.937, 0.988, 0.967, 0.976, 0.981]
- LightGBM: [0.937, 0.989, 0.967, 0.976, 0.981]
- CatBoost: [0.990, 0.992, 0.991, 0.995, 0.993]
- RNNs: [0.967, 0.970, 0.967, 0.968, 0.980]

Comparison with ML Models:

 

CatBoost: It demonstrates excellent performance with an overall accuracy and metrics of 1.00. However, its training time is significantly higher (2.42 seconds), and the cross-validation time is also longer (11.17 seconds).

 

Random Forest (RF): It achieves high accuracy with similar metrics, featuring moderate training times (0.36 seconds) and a good overall balance between accuracy and runtime.

 

XGBoost: It offers good accuracy and fast prediction times (0.00 seconds). However, its training time (0.31 seconds) and cross-validation time (1.58 seconds) are intermediate compared to other models.

 

Neural Networks: Although neural networks show good accuracy and recall (0.96), their training times (7.09 seconds) and cross-validation times (29.00 seconds) are significantly longer than those of decision trees and boosting models.

 

The study reveals that if the goal is efficiency and performance in terms of training and prediction time, the Decision Tree (DT) C4.5 is an excellent option due to its simple application and low computational cost. Its speed and high accuracy compared to other models make it suitable for scenarios where response time is critical and a straightforward, effective model is needed. For scenarios that require maximum accuracy and can tolerate longer training times, CatBoost offers the best performance in terms of accuracy and consistency. Despite its longer training and cross-validation times, its perfect accuracy and superior metrics justify its use if resources and time are not limiting factors.

 

If a balance between performance, accuracy, and efficiency is desired, Random Forest and XGBoost are solid intermediate options. They provide good accuracy and reasonable times for both training and prediction.

 

In conclusion, the Decision Tree (DT) C4.5 is an excellent choice if fast training and prediction times with high accuracy are prioritized. For applications that can accommodate longer training times in exchange for maximum accuracy, CatBoost is the preferred option.

 

We hope this adequately addresses comments 4 and 5 and provides the basic and additional information needed. We once again appreciate your valuable suggestions for improving our document.

 

Reviewer 2 Comments 6: Discussing the accessibility features and usability testing with the target audience (e.g., individuals with hearing impairments) can provide insights into the practical utility and acceptance of the proposed system.

 

Response 6:

We appreciate your valuable comments and suggestions on the proposed system's accessibility and usability. We recognize the importance of evaluating the system's performance in diverse contexts, including accessibility for the hearing impaired.

 

In our work, we have focused on the technical and performance analysis of the machine learning model. However, we are aware that accessibility and usability are critical factors for the system's adoption and effectiveness in the real world.

 

To address this issue, the results of this study will be considered in future revisions and in the next version of the system, which will handle signs with movement (letters and words); we expect results comparable to those obtained here. A second future step would be to offer our development to a group of hearing-impaired people, obtaining informed consent for participation and approval from a bioethics committee for the study. These steps would allow us to evaluate the system, determine whether it adapts to specific user needs, and establish how its features can be improved to make the system more accessible and acceptable. We welcome your comments again and are committed to improving the accessibility and usability of the proposed system.

 

 

Reviewer 2 Comments 7: Discuss/add the following into the introduction section: DOI: 10.3837/tiis.2021.06.006

 

Response 7: Yes, we have included this in reference number 3.

 

 

Thank you for your detailed feedback and valuable suggestions. We have carefully addressed your comments and made the necessary updates to our article. In particular, we have included new metrics, images, and tables as you suggested. Your contributions have been essential in enhancing the clarity and quality of our work. We appreciate your thorough review and the time you have invested in helping us improve.

 

Sincerely,

The corresponding authors:

Gerardo García-Gil, PhD; Gabriela del Carmen López Armas, PhD, MD; and Juan Jaime Sánchez Escobar, PhD.

Technical Industrial Teaching Center (Centro de Enseñanza Técnica Industrial)

[email protected]; [email protected]; [email protected]

 

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

No Further comments 
