Article

Real-Time Machine Learning for Accurate Mexican Sign Language Identification: A Distal Phalanges Approach

by Gerardo García-Gil, Gabriela del Carmen López-Armas *, Juan Jaime Sánchez-Escobar *, Bryan Armando Salazar-Torres and Alma Nayeli Rodríguez-Vázquez
Technical Industrial Teaching Center, Department of Investigation, Software Design and Development/Biomedical, Nueva Escocia Street 1885, Guadalajara CP 44638, Jalisco, Mexico
* Authors to whom correspondence should be addressed.
Technologies 2024, 12(9), 152; https://doi.org/10.3390/technologies12090152
Submission received: 5 June 2024 / Revised: 13 August 2024 / Accepted: 20 August 2024 / Published: 4 September 2024

Abstract:
Effective communication is crucial in daily life, and for people with hearing disabilities, sign language is no exception, serving as their primary means of interaction. Various technologies, such as cochlear implants and mobile sign language translation applications, have been explored to enhance communication and improve the quality of life of the deaf community. This article presents a new, innovative method that uses real-time machine learning (ML) to accurately identify Mexican sign language (MSL) and is adaptable to any sign language. Our method is based on analyzing six features that represent the angles between the distal phalanges and the palm, thus eliminating the need for complex image processing. Our ML approach achieves accurate sign language identification in real-time, with an accuracy and F1 score of 99%. These results demonstrate that a simple approach can effectively identify sign language. This advance is significant, as it offers an effective and accessible solution to improve communication for people with hearing impairments. Furthermore, the proposed method has the potential to be implemented in mobile applications and other devices to provide practical support to the deaf community.

1. Introduction

Sign language recognition (SLR) has become crucial for bridging the communication gap between hearing and deaf people, thus facilitating assistive technologies, primarily through mobile applications [1,2,3]. In recent decades, advances in computer vision and machine learning (ML) have led to significant progress in SLR [4,5]. At the same time, research efforts have focused on developing technology-based solutions that improve communication and overall quality of life for the hearing-impaired community [6].
This study focuses on Mexican sign language (MSL) recognition, although the approach can be applied to other sign languages, such as American sign language (ASL), using hand-angle analysis [7,8]. A MediaPipe hand skeleton descriptor is used for training, and the set of six angles for each letter is plotted using OpenCV [9,10,11]. These data are stored in a dataset and used to train a decision tree (DT C4.5) and label the predicted letter. To validate the model, the angles of the hand signals are compared to the training dataset. If a match is found, the model classifies it; otherwise, it makes a prediction. A schematic of the hand signals of the MSL letters is shown in Figure 1.
This method differs from other ML-based approaches by requiring a relatively modest amount of input data, significantly reducing computational time for accurate MSL interpretation. This demonstrates that classical ML approaches can be more effective for real-time classification problems than some state-of-the-art ML algorithms [12,13]. The main contributions of our method can be summarized as follows:
i. only six measurements (angles) are used as feature descriptors;
ii. efficient and accurate prediction;
iii. relatively small training dataset;
iv. high efficiency in letter prediction.
The proposed method, based on hand-angle analysis, achieves significantly higher accuracy and efficiency in MSL recognition than other methods reported in the field, as demonstrated below. A relatively small training dataset was used, and feature extraction from six specific hand angles allows for accurate and fast interpretation of sign language gestures, which improves communication and quality of life for the hearing impaired.
The remainder of this paper is organized as follows. Section 2 discusses previous related work. Section 3 describes the materials and methods. Section 4 presents the results of the experimental performance metrics and their analysis. Section 5 presents the discussion, and Section 6 contains the conclusions.

2. Related Work

Sign alphabet recognition is a form of communication that involves images or videos depicting one or two hands. Experts have devoted efforts to predicting letters, words, and ideas in sign languages to assist people with hearing impairments using advanced technologies and artificial intelligence algorithms. Despite the extensive research in this field, there remain limitations and opportunities for improving communication between hearing and non-hearing individuals. In this section, we focus on one-handed SLR. To this end, we present a brief literature review of recent works on SLR using deep-learning techniques, particularly convolutional neural networks (CNNs), and ML based on digital image processing and sensors [14,15,16,17].
Among the reviewed works, Ameen et al. (2016) explored the applicability of deep learning for sign language interpretation by developing a CNN to classify images based on manual spelling. They achieved 82% accuracy and 80% recall using intensity and depth data [18]. Thongtawee et al. (2018) presented an efficient feature-extraction method and algorithm to distinguish American sign language (ASL) from static and dynamic gestures, achieving 95% recognition using an artificial neural network (ANN) [19]. Rastgoo et al. (2020) addressed the challenge of real-time SLR using extra spatial hand relation (ESHR) and hand pose (HP) features, a 2D CNN, singular value decomposition (SVD), and long short-term memory (LSTM) with an accuracy of 86.32% [20,21].
Sharma et al. (2020) proposed a systematic statistical analysis and evaluated previously trained deep models for static Indian sign language (ISL) recognition. They achieved 99.0% and 97.6% recognition accuracy for numbers and letters, respectively, using a public ISL dataset [22]. Katoch et al. (2022) reported a technique using the bag-of-visual-words (BOVW) model to identify ISL alphabets and digits in a live video stream, achieving an average accuracy of 89.24% using support vector machine (SVM) and CNN [23,24,25]. Subramanian et al. (2022) proposed an optimized hand skeleton descriptor integrated with a gated recurrent unit model (MOPGRU) for ISL recognition, achieving an average accuracy of 95% using a bidirectional long short-term memory network (BiLSTM) [26]. Sundar et al. combined MediaPipe with a long short-term memory (LSTM) network, achieving an accuracy of 99% [27,28]. Pathan et al. investigated SLR using CNN and an ASL image dataset, achieving a test accuracy of 98.98% [29]. Sanchez et al. investigated a word-level SLR methodology on the Corpus LIBRAS dataset (Brazilian sign language), obtaining an accuracy of 94.33% using BiLSTM [30]. Mohsin et al. focused on letter and number recognition in ASL and achieved 96% accuracy using InceptionV3 [31]. Amangeldy et al. proposed an improved method for continuous recognition of Kazakh sign language, achieving an average accuracy of 97% [32]. Finally, Wali et al. comprehensively reviewed emerging frameworks and algorithms in SLR, identifying state-of-the-art techniques and suggesting new research directions [33].

2.1. The Taxonomy of Sign Language Recognition

Farooq et al. (2021) propose a taxonomy of SLR that includes applications, avatar technology, gesture recognition, natural language to sign language translation, and repositories of written text units, such as letters, words, or sentences. Individual signs in sign language consist of gestures or hand movements, each representing a specific word. They can also be considered static images, where one image corresponds to one word. Signed phrases, or running signs, contain multiple words and are considered sign alphabets or static signs. They are easier to recognize than full sentences. This section reviews several studies in this area [31,34].
Research on sign language translation identifies several possibilities. These efforts aim to improve communication between people with and without hearing impairments. Our proposed approach is highlighted in the white box in Figure 2.

2.2. Machine Learning Models in Sign Languages

Among the ML models similar to ours, MediaPipe was used by Bajaj et al. (2020) to experiment with different combinations of landmarks and classification algorithms, such as K-nearest neighbors (KNN), random forests, and neural networks. After preprocessing to bring the landmarks into a single reference frame, these researchers obtained average accuracies of 82.19% for KNN, 85.30% for random forests, and 90.95% for neural networks [35,36,37,38]. Sahoo et al. (2014) applied various classification techniques to sign language-related gestures, such as position- and movement-based SLR using multi-stream hidden Markov models (HMMs), the BoostMap embedding method, and neural networks. Their approach focused on determining the shape and center of gravity of the hands in images captured by a digital camera. They used landmarks for training and achieved satisfactory accuracy [39,40,41].
In 2021, Shah et al. used SVM multiple-kernel learning classification to recognize static Pakistani sign language alphabets. They extracted visual features from images, such as local binary patterns (LBP), histogram of oriented gradient (HOG), and speeded-up robust features (SURF). They classified them using multiple kernels, achieving an average accuracy of 89.24% [42,43,44]. Hussain et al. (2022) focused on ASL and Irish sign language (ISL) annotation, using MediaPipe to extract features and classify hand gestures. They achieved a maximum accuracy of 93% and 96.7%, respectively, with different classifiers [45]. Despite the extensive existing research, in this study, we will compare ML methods such as SVM, naive Bayes, KNN, random forests, XGBoost, and LightGBM, in addition to DT C4.5 [46,47,48,49].
Unlike previous research, our approach is characterized by its simplicity. It is based on only six features in real-time SLR. The careful feature selection and efficiency of the model support its applicability in practical situations.

3. Materials and Methods

3.1. Software and Hardware Characteristics

For this work, a CPU with an AMD Ryzen 5 5600G processor with Radeon graphics at 3.90 GHz, 16 GB of RAM, and a Logitech C920 HD Pro 1080p (960-000764) webcam were used. The model was built using the MediaPipe classifier as a descriptor of the hand skeleton, OpenCV to measure the internal angles of the hand, and the Python scikit-learn library (version 1.4.2) to construct the C4.5 decision-tree classifier [50].

3.2. Data Loading and Dataset Preparation

The process began with collecting a dataset explicitly designed for our task, extracted from a proprietary database stored in an Excel file with 5690 records collected from four people of different genders and ages. The database can be downloaded from https://github.com/gggvamp/pdi/blob/main/datosO.xlsx (accessed on 4 April 2024). The last column of the dataset contains the class labels corresponding to the sign language letters, while the preceding six columns contain the relevant predictive features of the hand, as shown in Figure 3a,b. The training set was used to fit the model letter by letter. We applied essential preprocessing techniques to ensure the quality and adequacy of the data, including the identification and removal of outliers, which could bias the results of the analysis, as well as the normalization and standardization of the features. We adjusted the scales of the six features to achieve a standard distribution, which facilitates the comparison of different hand signs (letters) and improves the stability of the analysis models. This was done for each letter [51,52].
We split the data into training and test sets, allocating 80% of the data to training and 20% to testing, to ensure adequate representation for fitting and evaluating the algorithms [53]. Additionally, we used the value 42 to seed the random number generator, ensuring the reproducibility of the data split. This separation allows the model to be trained on independent data before evaluation on the test set, avoiding overfitting and providing a more accurate assessment of model performance under real-time and user-independent conditions [54].
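The split and scaling described above can be sketched with scikit-learn as follows. This is only a sketch: the file name matches the repository link in this section, and the assumption that the six angle columns come first and the label column last follows Figure 3a,b.

```python
# Minimal sketch of the preprocessing and 80/20 split described above.
# Assumes the Excel file has the six angle features first and the letter label last.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_excel("datosO.xlsx")
X = df.iloc[:, :6].values            # six distal-phalange angles
y = df.iloc[:, -1].values            # class labels (MSL letters)

# 80% training / 20% test, seeded with 42 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the angle scales using training statistics only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```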

3.3. The Performance Evaluation of Trained Models

The ML algorithms were evaluated using several metrics, including prediction robustness, completeness, sensitivity, specificity, precision, recall, and F1 score (a combination of precision and recall), in addition to time and accuracy in prediction, training, and validation.
Once the models were trained, their performances were evaluated using the test suite, which involved calculating numerous metrics designed to provide a complete and detailed understanding of each model’s performance. Among these metrics are confusion matrices, which provide a robust evaluation of each classification method on datasets containing 21 classes (letters).
The confusion matrix is a valuable source of information about the classifier’s predictions. To improve clarity, it is common practice to normalize the confusion matrix by converting absolute counts to proportions. This normalization allows for comparing model performance between classes with different sample sizes. Additionally, the matrix is presented as a heat map, where darker shades represent higher values. This visualization helps to identify areas of high confusion, as shown in Table 1, and provides a clear understanding of model performance.
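A row-normalized confusion-matrix heat map of the kind described above can be produced with scikit-learn and Matplotlib; the sketch below is our own, and the color map and figure size are arbitrary choices.

```python
# Sketch of a normalized confusion-matrix heat map (darker cells = higher proportions).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_normalized_confusion(y_true, y_pred, labels):
    cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
    cm /= np.maximum(cm.sum(axis=1, keepdims=True), 1)   # counts -> per-class proportions
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(cm, cmap="Blues")
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted letter")
    ax.set_ylabel("True letter")
    fig.tight_layout()
    plt.show()
```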

3.4. Selection of Algorithms to Compare

Several ML algorithms have been selected for comparison to assess their performances in addressing the classification and prediction problem. The selected algorithms are:
  • SVM: data classification is achieved by identifying the optimal separating hyperplane between classes in a multidimensional space;
  • Naive Bayes: this approach is based on Bayes’ theorem and assumes independence between the features and the classes;
  • KNN: data classification is performed by assigning labels based on the labels of the nearest neighbors;
  • Decision trees: this method classifies data using a decision tree, where each node represents a feature and each leaf a label;
  • Random forests: combine multiple decision trees to classify data, reducing overfitting;
  • XGBoost: implements gradient boosting to improve model accuracy using sequential decision trees with regularization and parallelization;
  • LightGBM: another efficient implementation of gradient boosting; it uses sampling techniques to build decision trees faster and with lower memory usage;
  • CatBoost: an ML algorithm developed by Yandex for gradient boosting on decision trees. It is particularly effective at handling categorical features directly, preventing overfitting, and offering high performance with both CPU and GPU support;
  • RNNs: recurrent neural networks are designed for processing sequential data by maintaining a hidden state that captures information from previous time steps. They are widely used in applications like time-series forecasting and natural language processing.
The selected algorithms were implemented at this stage using well-known ML libraries, such as Scikit-learn 1.4.2, XGBoost, and LightGBM. Each model was carefully configured and trained using the training set prepared during the data-preprocessing phase. During the training process, rigorous monitoring was conducted by recording the time required for each model’s training. Several evaluation metrics were calculated and thoroughly analyzed, including training accuracy, a crucial measure of each model’s predictive ability and fit. It should be noted that all algorithms compared were subjected to the same circumstances and initial conditions. This meticulous approach ensured a thorough understanding of the models’ performance and allowed the identification of areas for improvement to optimize their performance in future phases of the project [55,56].
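The comparison loop can be sketched as follows, reusing the split from Section 3.2. Default hyperparameters and the entropy criterion for the decision tree are our assumptions, since the exact settings are not listed here.

```python
# Sketch of the comparison: train each classifier on the same split and record
# test accuracy and training time. Default hyperparameters are assumed.
import time
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Encode the letter labels as integers so every library accepts them.
le = LabelEncoder().fit(y_train)
y_tr, y_te = le.transform(y_train), le.transform(y_test)

models = {
    "Decision tree (entropy)": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_tr)
    elapsed = time.perf_counter() - start
    print(f"{name}: test accuracy = {model.score(X_test, y_te):.3f}, "
          f"training time = {elapsed:.3f} s")
```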

3.5. Proposed Work

This work uses a hand-feature extractor called the MediaPipe Hand Landmarker. This tool identifies key points on the hands in an image. These points can be used to detect significant locations on the hands and apply visual effects to them. See Figure 4a,b.
Measuring the internal angles between the distal phalanges and the palm is crucial to our methodology. These angles capture distinctive features without relying on correlation and convolution processes. Calculating these angles provides a concise representation of the hand signals and, through OpenCV, stabilizes the plot regardless of the position or distance of the hand from the camera. This stability is vital for improving the accuracy and speed of letter prediction in sign language. Given the direction vectors of two lines, u = (u1, u2) and v = (v1, v2), the angle formed by these two lines can be calculated using Equation (1):
$\cos \alpha = \dfrac{u \cdot v}{|u|\,|v|} = \dfrac{u_1 v_1 + u_2 v_2}{\sqrt{u_1^2 + u_2^2}\,\sqrt{v_1^2 + v_2^2}}$ (1)
where |u| and |v| are the magnitudes (norms) of vectors u and v, respectively; in this way, the angles between the distal phalanges and the palm are obtained. It should be noted that this contribution is particularly significant: not only are the features reduced from 21 to 6 dimensions, but, by focusing on distal angles, the values of these angles are also consistent regardless of the position or distance relative to the camera. This is a significant advantage in processing, classifying, and, above all, predicting MSL letter labels. See Figure 5a–d.
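Equation (1) translates directly into a small helper. Below is a minimal sketch using NumPy; the clipping step is our own addition to guard against floating-point rounding.

```python
# Sketch of Equation (1): angle between two 2D direction vectors u and v.
import numpy as np

def angle_between(u, v):
    """Return the angle (in degrees) between direction vectors u and v."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    cos_alpha = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos_alpha = np.clip(cos_alpha, -1.0, 1.0)   # guard against rounding error
    return np.degrees(np.arccos(cos_alpha))

# Example: the x-axis and the diagonal (1, 1) form a 45-degree angle.
print(angle_between((1, 0), (1, 1)))   # ~45.0
```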

3.5.1. Training Stage

During the training stage, real-time video recordings are performed for each letter of the Spanish alphabet using the MSL variant. The training begins by labeling each sign with its corresponding letter, starting with the sign for the letter A and ending with the sign for Y, excluding the letters J, K, Ñ, Q, X, and Z because they require movement. The six most critical characteristic points are used with Equation (1) to derive and record the six angles, from which 21 of the 27 MSL letters are obtained.
These data are stored in a dataset containing each letter of the corresponding alphabet. For example, in Figure 6, the letter ‘B’ is trained. Records of the features of each letter are stored in a more extensive dataset called ‘alphabet classes’, which is accessible in the training and validation phase. See Figure 6.
The training algorithm outlines a system that uses a camera to detect and track hands in real time. The process begins with importing the necessary libraries for image processing and hand detection. Video capture from the camera is set in a continuous loop, where the algorithm captures a frame and verifies the success of the capture. The algorithm then searches for the user’s hand within the frame. If a hand is detected, the algorithm analyzes it and calculates the coordinates of the hand landmarks. These coordinates are used to draw lines on the frame, representing the hand. Additionally, the algorithm calculates six angles using the hand landmark coordinates, which are stored. The modified frame with the detected hand and calculated angles is displayed on the screen. The loop continues until the user stops it, at which point the camera resources are released. For a detailed description of this phase, refer to Table 2.
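A condensed sketch of this capture loop, using MediaPipe Hands and OpenCV, is shown below. It relies on the angle_between() helper from the Equation (1) sketch, and the landmark indices used to form the six angles are illustrative choices, not necessarily the authors' exact selection.

```python
# Condensed sketch of the training-stage loop (cf. Table 2): capture frames, detect
# the hand with MediaPipe, compute six angles, and store one row per frame.
# The landmark indices below are illustrative, not the authors' exact selection.
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands
rows = []   # collected six-angle feature rows for the letter being trained

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            hand = result.multi_hand_landmarks[0]
            pts = np.array([(p.x, p.y) for p in hand.landmark])
            # Illustrative: angle at the wrist (landmark 0) between each fingertip
            # direction and the middle-finger base (landmark 9) direction.
            tips = [4, 8, 12, 16, 20, 17]
            angles = [angle_between(pts[t] - pts[0], pts[9] - pts[0]) for t in tips]
            rows.append(angles)
            mp.solutions.drawing_utils.draw_landmarks(
                frame, hand, mp_hands.HAND_CONNECTIONS)
        cv2.imshow("training", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):   # user stops the loop with 'q'
            break
cap.release()
cv2.destroyAllWindows()
# 'rows' can then be appended to the 'alphabet classes' dataset, e.g. with pandas.
```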

3.5.2. Validation Phase

Validation is a crucial step in ensuring the accuracy and reliability of the character recognition system. For this purpose, the DT C4.5 ML method is employed [57]. The validation process comprises several key steps. Initially, hand signals are processed analogously to those used during training. Six internal angles of the hand are extracted as features for validation. These angles are compared to a set of “alphabet classes,” which contain information about the signs corresponding to each of the 21 letters of the alphabet. The objective is to determine whether the captured hand signal corresponds to one of the previously trained alphabet classes. If no direct correspondence with the alphabet classes is found, the DT C4.5 algorithm is activated. This algorithm is renowned for its capacity to construct decision trees based on the information provided by the training data, and the resulting decision rules may be employed to classify unknown records from the captured hand signal. In this manner, the DT C4.5 functions as a predictor, determining which letter of the alphabet the signal belongs to and labeling it correctly. For further clarification, please refer to Figure 7.
This system employs a camera and a hand recognition model to capture and process gestures. The camera initiates video capture, while hand-gesture data are read from an Excel file, potentially containing information from a previous training session. An infinite loop is established to capture frames from the camera continuously. Each frame is processed with the hand recognition model, extracting the coordinates of the landmarks and calculating the angles between them. The processing cycle persists until the user elects to terminate it. At this juncture, the DT C4.5 model categorizes the captured signals. The image depicts the classified signal, and the resulting video with the gesture labels is displayed for future reference. Table 3 provides a detailed description of the validation process.
The pseudocode presented elucidates the real-time validation of hand-gesture recognition. As stated, it imports libraries and reads gesture data from an Excel file. After initializing the requisite variables, it initiates an infinite loop to capture and process images from the camera.
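A sketch of this validation loop is given below. Since the paper builds its classifier with scikit-learn, DecisionTreeClassifier with the entropy criterion is used here as a stand-in for DT C4.5 (scikit-learn implements CART); the file name and column layout follow Section 3.2, and classify_frame() is a hypothetical helper intended to be called from the capture loop shown earlier.

```python
# Sketch of the validation phase (cf. Table 3): load the trained angle records from
# the Excel file, fit a decision tree, and label each captured hand in real time.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.read_excel("datosO.xlsx")                     # six angles + letter label
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(data.iloc[:, :6].values, data.iloc[:, -1].values)

def classify_frame(angles):
    """Predict the MSL letter for the six angles measured in one frame."""
    return clf.predict([angles])[0]

# Inside the capture loop of the previous sketch, each new row of six angles can be
# labeled and drawn onto the frame before it is displayed:
#   letter = classify_frame(angles)
#   cv2.putText(frame, letter, (30, 60), cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
```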

3.6. Decision Tree

This algorithm merits explanation because it performed best among the ML algorithms compared in the results section; its decisions and possible outcomes are represented as a tree. It is used for classification and regression tasks; DT C4.5 in particular has served as the basis for several variants and improvements in the design of decision-tree algorithms, such as random forests [58,59]. In this context, the tool predicts the category (letter) of hand signs not found in the previously trained set of alphabet classes. Although this tool is not new, our experiments have shown extraordinary efficiency, responsiveness, and accuracy despite its simplicity of application. Its use in this work is further detailed by explaining the parts that make up this algorithm.

3.6.1. Entropy

Entropy is a physical quantity applied to a thermodynamic system in equilibrium; it measures the number of microstates compatible with the macroscopic equilibrium state of the system, which can be understood as the degree of organization of the system in that state. In our case, an entropy of zero represents maximum order. The decision tree is created based on the information gain obtained from the training examples and is then used to classify the test set. The classification task is typically performed with nominal attributes and no missing values in the dataset. If a probability distribution P = (p1, p2, …, pn) is provided, then the information carried by this distribution, known as entropy, is calculated by:
$\mathrm{Entropy}(P) = -\sum_{i=1}^{n} p_i \cdot \log(p_i)$ (2)
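As a quick numerical illustration of Equation (2), the following sketch uses the base-2 logarithm (our choice, since the equation does not fix a base):

```python
# Tiny sketch of Equation (2): entropy of a discrete probability distribution.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # zero-probability terms contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # 1.0 bit: maximum uncertainty for two classes
print(entropy([1.0]))        # 0.0: a pure node, i.e., maximum order
```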

3.6.2. Information Gain

To select the attribute for a given node at any position in the tree under construction, it is necessary to determine the gain of a test T at a position p of that node using the following equation:
$\mathrm{Gan}(p, T) = \mathrm{Entropy}(P) - \sum_{j=1}^{n} p_j \cdot \mathrm{Entropy}(p_j)$ (3)
The values pj correspond to the possible values of the attribute. The measure given in Equation (3) can be used to determine which attribute is better and to construct the decision tree by choosing, at each step, the node whose attribute has the highest information gain among all the attributes not yet considered on the route from the root node.
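The same idea can be written as a short, self-contained sketch of Equation (3): the entropy of the parent set minus the weighted entropy of the subsets produced by each value of the attribute.

```python
# Sketch of Equation (3): information gain of splitting a labeled set on one attribute.
import numpy as np

def _entropy_of(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    parent = _entropy_of(labels)
    weighted_children = 0.0
    for v in np.unique(attribute_values):
        mask = attribute_values == v
        weighted_children += mask.mean() * _entropy_of(labels[mask])
    return parent - weighted_children
```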

3.6.3. Algorithm C4.5

The C4.5 algorithm is an extension of the ID3 algorithm proposed by Quinlan to address some of the deficiencies of ID3, which was not designed for numeric attributes and does not use pruning to reduce overfitting. To solve these problems, C4.5 introduces a new calculation to measure the gain ratio [60]:
$\mathrm{RelGan}(p, T) = \dfrac{\mathrm{Gan}(p, T)}{\mathrm{infDiv}(p, T)}$ (4)
where
$\mathrm{infDiv}(p, \mathrm{test}) = -\sum_{j=1}^{k} \dfrac{p_j}{p} \cdot \log\left(\dfrac{p_j}{p}\right)$ (5)
where pj/p is the proportion of elements present at position p for the j-th outcome of the test. Unlike the entropy used in ID3, the gain ratio is less dependent on how the objects are distributed among the different classes. C4.5 also handles attributes with unknown values more effectively, evaluating the gain ratio for these attributes by considering only the records for which the attribute is defined. To accomplish this, the algorithm estimates the probabilities of the different outcomes. The gain criterion then takes the form:
$\mathrm{Gan}(p) = F \cdot \left(\mathrm{Info}(T) - \mathrm{Info}(p, T)\right)$ (6)
where F is the fraction of examples in the dataset with a known value for the attribute, i.e., the ratio of the number of examples with known values to the total number of examples, and
$\mathrm{Info}(T) = \sum_{j=1}^{n} p_j \cdot \mathrm{Entropy}(p_j)$ (7)
If we first partition T into sets T1, T2, …, Tn based on the value of a non-categorical attribute p, then the information needed to identify the class of an element of T becomes the weighted average of the information required to identify the class of an element of Ti, i.e., the weighted average of Info(Ti). C4.5 also handles attributes with continuous values: let pj be an attribute with continuous values over a range of constant values. The values of this attribute are examined in the training set, the gain for each candidate partition is calculated, and the gain-maximizing partition is selected. An additional approach used in the C4.5 algorithm is post-pruning: the algorithm does not stop during tree construction, which therefore allows overfitting, and only at the end are pruning rules applied to improve the generalization ability. Another difficulty it addresses is handling continuous-valued attributes, such as real numbers [61]. Lastly, C4.5 uses a pruning technique to minimize the error rate: it reduces the size of the tree by removing parts that may be due to incorrect or missing data, thereby reducing the complexity of the tree and improving its classification performance.
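A compact sketch of the gain ratio in Equations (4) and (5) is shown below; it reuses the information_gain() helper from the previous sketch and simply normalizes the gain by the split information of the attribute.

```python
# Sketch of Equations (4) and (5): C4.5's gain ratio = gain / split information.
import numpy as np

def split_information(attribute_values):
    _, counts = np.unique(attribute_values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, attribute_values):
    si = split_information(attribute_values)
    return 0.0 if si == 0 else information_gain(labels, attribute_values) / si
```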

3.6.4. Measure of Gini Impurity

Gini impurity is a metric used to create classification trees, providing more insight into the data distribution per node than classification accuracy alone. It is calculated by considering the proportion of each target category among all the records at a node. The Gini impurity is computed as the sum of the squares of these proportions subtracted from one. For example, when splitting a node, the algorithm seeks the split that maximizes the reduction in impurity, defined as the impurity of the parent node minus the weighted average impurity of the child nodes. The overall objective is to minimize the Gini impurity.
$\mathrm{Gini}(X_q) = 1 - \sum_{k=1}^{K} p_{k,q}^{2}$ (8)
This is used for categorical attributes. This criterion attempts to estimate the information provided by each attribute based on “information theory”. Entropy is a measure of the uncertainty or randomness of a random variable “x”. By calculating the entropy for each attribute, the information gain of the tree can be determined.
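Equation (8) likewise reduces to a few lines; the example label sets below are only illustrative.

```python
# Sketch of Equation (8): Gini impurity of a node from its class proportions.
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(["A", "A", "A"]))        # 0.0: a pure node
print(gini_impurity(["A", "B", "A", "B"]))   # 0.5: maximally mixed two-class node
```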

4. Results

4.1. Evaluation Metrics and Their Relevance

Various metrics have been used to evaluate model performance, including training and validation accuracy and training and prediction time. These metrics are essential for understanding different aspects of model performance. Accuracy provides a measure of the quality of predictions, while training and prediction time give information about the computational efficiency of the algorithms, which is critical for real-time or large-scale applications.
Several ML models have been trained, including random forest, SVM, naive Bayes, KNN, decision tree C4.5, LightGBM, XGBoost, CatBoost, and RNNs. This broad range allows for a comprehensive comparison between different modeling paradigms, which is crucial for understanding which approaches best fit a dataset and the sign language predictor. The importance of this analysis lies in its ability to determine the algorithm that offers the best performance in terms of accuracy, computational efficiency, and generalizability, all of which are essential for the success and utility of our sign language predictor. The results illustrate how the accuracy of the models varies with increasing training time. Table 4 presents the performance comparison of the ML models.
Throughout this process, we closely examined how performance metrics, such as accuracy, varied between the training and test sets, and how the performances of the different classification models were reflected in terms of precision, recall, F1 score, and accuracy. Table 5 presents the training and prediction time metrics.
The decision tree, naive Bayes, and k-NN are fast at training and prediction. SVM and random forest have moderate training and prediction times. XGBoost and CatBoost are quick predictors, but CatBoost has a longer training time. LightGBM is efficient for large datasets but slower in prediction. RNNs have long training and prediction times. The choice of algorithm depends on the problem requirements and available computational resources. Cross-validation scores are used to assess the generalization ability of a machine-learning model; see Table 6. Cross-validation is a technique that helps ensure that the model performs well on the training dataset and previously unseen new data.
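Cross-validation scores like those in Table 6 can be obtained with scikit-learn as sketched below; the five-fold setting is our assumption, since the fold count is not stated in this section, and X and y are the feature matrix and labels from Section 3.2.

```python
# Sketch of cross-validation for one of the compared models (fold count assumed).
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```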
All models have a high overall accuracy, ranging from 0.96 to 0.99. This indicates that most of the models’ predictions are correct compared to the total predictions. Precision, which measures the proportion of correct positive predictions, also ranges from 0.96 to 0.99, indicating a very high ability to classify instances correctly. State-of-the-art algorithms, such as CatBoost, XGBoost, and LightGBM, are the most accurate, although their training times are longer. Clustered bar graphs of the metrics facilitate visual comparisons between the different models. However, it is essential to interpret the results with caution. For example, a model with a high training accuracy but a low validation accuracy may indicate overfitting, while longer training and prediction times can be problematic in time-critical applications. See Figure 8.
It is essential to contextualize the results concerning the specific problem and data under consideration. What works well in one dataset may not apply to another. In addition, evaluation results should be cautiously generalized and validated on independent datasets to ensure their robustness and reliability. The implementation provides a valuable exploration of the performance of different ML algorithms. However, to describe the performance of all models based on the graph provided, we note the following:
i. Random forest: Although random forest is known for its ability to handle large datasets with many features, it appears to be more time-consuming in training and prediction compared to other models. Nonetheless, it provides good accuracy on both training and test sets;
ii. SVM: It has longer training and prediction times than other models, and its accuracy is not the highest on this dataset;
iii. Naive Bayes: Although it has shorter training and prediction times, its accuracy is lower than that of other models. However, it could be a good choice if speed is a priority and the required accuracy is reasonably high but not critical;
iv. KNN: It shows very short training times, but the prediction times are longer. Its accuracy is relatively high, but its distance-based nature may not be optimal for large datasets;
v. DT C4.5: It shows shorter training and prediction times compared to other models, and its accuracy is comparable to, and even better than, that of other more complex models on this dataset;
vi. XGBoost and LightGBM: These gradient-boosting models have good accuracy results but longer training times than DT C4.5. However, their prediction times are shorter than those of random forest and SVM;
vii. CatBoost: It performs excellently, with an accuracy of 1.00. However, its training time is significantly higher (2.42 s), and the cross-validation time is also longer (11.17 s);
viii. Neural networks: Although neural networks have good accuracy and recall (0.96), their training times (7.09 s) and cross-validation times (29.00 s) are much longer compared to the decision tree and boosting models.

4.2. Confusion Matrix of the Compared Models

The confusion matrix, a fundamental component in evaluating the performance of a classification model, quantifies the model’s accuracy by showing the number of correct and incorrect predictions for each class (letter). The confusion matrix is visualized to highlight the relationships between classes and to facilitate the identification of patterns of classification errors. The confusion matrices compared to determine which sign language predictor to choose are shown in Figure 9a–i.
The last matrix, DT C4.5, shows the number of correct and incorrect predictions for each class in a tabular format. It provides a detailed understanding of how the model classifies instances in each class. The model is accurate overall, with many correct predictions for most classes. The accuracy varies between the classes, indicating that the model may perform better for some classes than others. The high-performing classes are A, C, G, P, T, W, and Y, with mostly correct predictions and no significant false positives or negatives, indicating that the model classifies these classes effectively. Some classes, such as M, R, and U, show false positives or negatives; these classes present opportunities for improvement to enhance the model’s accuracy in classifying these instances. Overall, the confusion matrix results provide valuable information about the performance of the DT C4.5 model, highlighting areas of strength and areas that need improvement for more accurate classification. Based on these comparative analyses, we could decide which algorithm best balances accuracy and computational efficiency for our sign language predictor. This evaluation allowed us to select the most appropriate model to meet the problem’s requirements and ensure optimal performance in practical situations.

4.3. Performance of the DT C4.5 Classification Model

The following metrics provide a clear and concise evaluation of the proposed method’s performance with DT C4.5 for classifying different letters. These metrics, including accuracy, precision, recall, and F1 score, are based on the most recent results and are presented straightforwardly in Table 4, Table 5 and Table 6 and Figure 8. Please refer to Table 7 for a detailed breakdown.
Table 7 shows the high performance of the proposed method using DT C4.5 in classifying letters.
  • Accuracy: This represents the proportion of correct predictions from the total predictions made for each class, providing a general measure of model performance. Values range from 0.99 to 1.0, indicating that the model is highly accurate for most classes;
  • Precision: Indicates the proportion of instances correctly classified as positive out of all instances classified as positive. It measures the model’s ability to avoid misclassifying a negative instance as positive. Values range from 0.94 to 1.0, indicating that the model has a low false-positive rate for most classes;
  • Recall: Represents the proportion of positive instances correctly identified by the model out of all true-positive instances. It measures the model’s ability to identify all relevant instances in a dataset. Values between 0.94 and 1.0 indicate that the model correctly identifies most of the positive instances in each class;
  • F1 Score: Measures model accuracy by considering both precision and recall. The harmonic mean of precision and recall balances the two metrics. Values range from 0.96 to 1.0, indicating a good balance between precision and recall for most classes;
  • Support: values vary by class and represent the number of instances of each class in the test dataset.

4.4. Structure of Prediction in DT C4.5

A DT C4.5 is a graphical representation of a set of decision rules used to classify examples or predict outcomes, where each internal node represents a feature (attribute), each branch represents a decision based on that attribute, and each leaf represents the result of the decision. The following describes how decisions are made based on specific features by dividing the dataset into smaller groups at each internal node until a prediction is obtained. This leads to the classification of hand angles into letters of the alphabet. A decision tree is generated and visualized based on the given data, and the resulting image is saved. This process helps understand how DT C4.5 classifies different data and which features are most important for classification. See Figure 10.
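Generating and saving such a tree image can be sketched with scikit-learn's plot_tree, used here as a stand-in for the DT C4.5 visualization in Figure 10; the feature names are hypothetical placeholders for the six angles.

```python
# Sketch of generating and saving the decision-tree image (cf. Figure 10).
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)                    # split from Section 3.2

fig, ax = plt.subplots(figsize=(24, 12))
plot_tree(clf,
          feature_names=[f"angle_{i}" for i in range(1, 7)],   # hypothetical names
          class_names=[str(c) for c in clf.classes_],
          filled=True, ax=ax)
fig.savefig("decision_tree.png", dpi=150)
```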

4.5. The Results of the Characteristics of DT C4.5

i. Feature names: In the context of a C4.5 DT, these are the features used to make decisions at each tree node. If they are not specified explicitly, default feature names are used;
ii. Gini: The Gini impurity measures how impure a node is. It is critical for determining which features and split values are best for dividing the dataset into smaller subsets. In the decision-tree graph, nodes are split based on the Gini impurity value to minimize impurity in the resulting nodes;
iii. Examples: In the context of DT C4.5, this refers to the number of data instances that arrive at a particular node during the tree’s training process. The number of samples arriving at each node can be shown in the decision-tree graph, providing information about the data distribution in the tree;
iv. Value: The value at a node represents the class distribution of the samples arriving at that node. In the decision-tree graph, the value of a node can be visualized as a list showing how many samples of each class are present at that node;
v. Class: The leaf nodes of the decision tree represent the predominant class of the samples arriving at that node. In the decision-tree graph, the class of a leaf node can be displayed as a label indicating the predominant class at that node.
The leaf nodes in the tree indicate the predominant class of arriving samples, constituting the final classification of letters in sign language. These steps are essential for understanding and effectively applying the classification process using a DT C4.5 classifier in SLR.
In summary, the code uses a DT C4.5 classifier to predict letters based on the characteristics of the detected hands. The decision-tree graph provides a visualization of how these decisions are made, and terms such as Gini, impurity, samples, value, and class refer to different aspects of the tree construction process and data distribution in the tree. The images in Figure 11 show the manual sign results for the 21 non-movement letters that define MSL.
It is important to note that the results are obtained instantaneously from real-time video, not from photographs. As with all video systems, the two main limitations of the method are the lighting quality and the focus of the webcam.

5. Discussion

This work is distinctive in the scientific literature because it analyzes six features that represent the angles between the distal phalanges and the palm. This approach minimizes the need for complex image processing. Additionally, there is limited research on MSL.
In recent MSL research published by Gonzalez et al. [12], MediaPipe was used as a descriptor of the face, body, and hands to create avatars. The system includes an easy-to-use graphical interface with modes to translate between MSL and Spanish in both directions. Users can enter characters or text and receive corresponding translations. The performance evaluation shows high accuracy, with the bidirectional neural network model achieving an accuracy of 98.8%. Like us, they reduce the dimensionality of the features in their work to 11 for the face and 5 for the body but keep the 21 critical points of the hand. In contrast, in our work, the dimensionality is reduced to six features, obtaining similar results in accuracy.
On the other hand, Sosa et al. [13] conducted a study using MSL. They proposed a system to recognize and animate signs related to general medical consultations with avatars in real time. This system facilitates dynamic and non-intrusive interaction between hearing doctors and deaf patients. The recognition module uses an MS Kinect sensor to capture sign trajectories and images processed in real time by hidden Markov models (HMMs). The study involved 22 participants and demonstrated the recognition of 82 different signs, achieving average accuracy rates and F1 scores of 99% and 88%, respectively. The work uses MSL, but the Kinect sensor requires two computers to program speech and train an avatar using motion capture (MoCap), which cannot track finger movements and therefore needs to be adapted afterward. The researchers’ contribution is valuable because it focuses on helping hearing-impaired people communicate in a medical context. In comparison, we used a low-cost camera with a medium-capacity computer, which processes finger images much faster. We do not need to train an avatar or the patient beforehand; communication is facilitated directly by the person in need, regardless of hand size or skin color.
Following a similar methodology but without addressing MSL, we found promising work by Subramanian et al., who employed a hand-feature descriptor integrating a MediaPipe-optimized gated recurrent unit model (MOPGRU) for ISL recognition, obtaining an average accuracy of 95% [26]. In contrast, our work did not require an optimized MediaPipe, and we obtained similar results. Likewise, Hussain et al. [45] identified two alphabets, ASL and ISL-HS, using different kinds of ML, including random forest, DT C4.5, and naive Bayes, to classify hand gestures in a dataset with 28 gestures comprising letters and 2 signs. The random forest classifier performed best, showing an accuracy of 96.7% with ISL and 93.7% with ASL. In our study, random forest showed longer training and prediction times; nevertheless, it was our second-best performer after DT C4.5. Our work has some limitations: it covers only 21 letters of the alphabet, since it was not possible to include letters that involve movement, and it would be desirable to complete the MSL. In addition, some classes, such as M, R, and U, show false positives or negatives; these classes require special attention to improve the accuracy of the model in classifying these letters.

6. Conclusions

This study presents a comprehensive analysis of various ML models applied to the classification of MSL. Our decision tree C4.5 algorithm demonstrated remarkable performance, achieving near-perfect precision, recall, and an F1 score of 99%. Compared to cutting-edge algorithms like random forest, XGBoost, LightGBM, CatBoost, and neural networks, the DT C4.5 algorithm stands out for its balance between computational efficiency and predictive accuracy. While models such as CatBoost and neural networks offer competitive accuracy, they require significantly longer training times, which may not be ideal for real-time or large-scale applications. CatBoost, in particular, exhibited excellent performance in accuracy and handling categorical data, but its training time was considerably longer compared to DT C4.5.
Although the neural network was adequate, with a precision and recall of 96%, it presented the most extended training and cross-validation times among all the models tested. This makes it less practical for scenarios requiring quick deployment and iteration. However, its ability to handle complex patterns in the data is noteworthy, suggesting its potential for future improvements where computational resources are less constrained.
Our findings highlight the importance of model selection based on the specific needs of the application, such as training speed, prediction time, and accuracy. DT C4.5 proved to be the most balanced option for our MSL predictor, offering robust performance without the drawbacks associated with more computationally demanding models. This study underscores the potential of simpler models like DT C4.5 to achieve high accuracy in specialized tasks where the advanced models’ complexity and resource demands may not be justified.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/technologies12090152/s1, https://github.com/gggvamp/pdi/blob/main/datosO.xlsx (accessed on 4 April 2024), database, https://github.com/gggvamp/MSL/blob/main/letras2%20(3).png (accessed on 4 April 2024), extended decision tree C4.5, and https://github.com/gggvamp/pdi/blob/main/videomsl.mp4 (accessed on 4 April 2024), Video.

Author Contributions

G.G.-G. conceptualized the work, edited the manuscript, and prepared the dataset. B.A.S.-T. conceptualized the work and analyzed the dataset using computer vision algorithms. G.d.C.L.-A. conceptualized the work, wrote parts of the manuscript, provided additional analysis, and revised the manuscript. J.J.S.-E. supervised, wrote parts of the manuscript, provided additional analysis, and revised the manuscript. A.N.R.-V. supervised and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Technical Industrial Teaching Center through its academic direction action program (research work PI-03-2023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting this study are detailed in the Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Amrutha, K.; Prabu, P. ML Based Sign Language Recognition System. In Proceedings of the 2021 International Conference on Innovative Trends in Information Technology (ICITIIT), Kottayam, India, 11–12 February 2021; pp. 1–6. [Google Scholar] [CrossRef]
  2. Hekmat, A.; Abbas, H.; Shahadi, H. Sign Language Recognition and Hand Gestures Review. Kerbala J. Eng. Sci. 2022, 2, 209–234. [Google Scholar]
  3. Younas, F.; Nadir, J.; Usman, M.; Khan, M.A.; Khan, S.A.; Kadry, S.; Nam, Y. An Artificial Intelligence Approach for Word Semantic Similarity Measure of Hindi Language. KSII Trans. Internet Inf. Syst. 2021, 15, 2049–2068. [Google Scholar]
  4. Mahesh, B. Machine Learning Algorithms—A Review. Int. J. Sci. Res. 2019, 9, 381–386. [Google Scholar] [CrossRef]
  5. Napier, J. Sign Language Interpreter Training, Testing, and Accreditation: An International Comparison. Am. Ann. Deaf 2004, 149, 350–359. [Google Scholar] [CrossRef] [PubMed]
  6. Alaghband, M.; Maghroor, H.R.; Garibay, I. A survey on sign language literature. Mach. Learn. Appl. 2023, 14, 100504. [Google Scholar] [CrossRef]
  7. Escobar, L. Gestualidad y lengua en la lengua de señas mexicana. Lingüíst. Mex. Nueva Época 2019, 1, 141–166. [Google Scholar] [CrossRef]
  8. Valli, C.; Lucas, C. Linguistics of American Sign Language: An Introduction, 3rd ed.; Gallaudet University Press: Washington, DC, USA, 2000; ISBN 978-1-56368-097-7. [Google Scholar]
  9. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.-L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A Framework for Building Perception Pipelines. arXiv 2019, arXiv:1906.08172. [Google Scholar]
  10. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.-L.; Grundmann, M. MediaPipe Hands: On-device Real-time Hand Tracking. arXiv 2020, arXiv:2006.10214. [Google Scholar] [CrossRef]
  11. Zelinsky, A. Learning OpenCV—Computer Vision with the OpenCV Library (Bradski, G.R. et al.; 2008) [On the Shelf]. IEEE Robot. Autom. Mag. IEEE Robot Autom. 2009, 16, 100. [Google Scholar] [CrossRef]
  12. González-Rodríguez, J.-R.; Córdova-Esparza, D.-M.; Terven, J.; Romero-González, J.-A. Towards a Bidirectional Mexican Sign Language–Spanish Translation System: A Deep Learning Approach. Technologies 2024, 12, 7. [Google Scholar] [CrossRef]
  13. Sosa-Jimenez, C.O.; Rios-Figueroa, H.V.; Solis-Gonzalez-Cosio, A.L. A Prototype for Mexican Sign Language Recognition and Synthesis in Support of a Primary Care Physician. IEEE Access 2022, 10, 127620–127635. [Google Scholar] [CrossRef]
  14. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
  15. Wangchuk, K.; Riyamongkol, P.; Waranusast, R. Real-time Bhutanese Sign Language digits recognition system using Convolutional Neural Network. ICT Express 2021, 7, 215–220. [Google Scholar] [CrossRef]
  16. Kasapbaşi, A.; Elbushra, A.; Al-Hardanee, O.; Yilmaz, A. DeepASLR: A CNN based Human Computer Interface for American Sign Language Recognition for Hearing-Impaired Individuals. Comput. Methods Programs Biomed. Update 2022, 2, 100048. [Google Scholar] [CrossRef]
  17. Arooj, S.; Altaf, S.; Ahmad, S.; Mahmoud, H.; Mohamed, A.S.N. Enhancing sign language recognition using CNN and SIFT: A case study on Pakistan sign language. J. King Saud Univ. Comput. Inf. Sci. 2024, 36, 101934. [Google Scholar] [CrossRef]
  18. Ameen, S.; Vadera, S. A convolutional neural network to classify American Sign Language fingerspelling from depth and colour images. Expert Syst. 2017, 34, e12197. [Google Scholar] [CrossRef]
  19. Thongtawee, A.; Pinsanoh, O.; Kitjaidure, Y. A Novel Feature Extraction for American Sign Language Recognition Using Webcam. In Proceedings of the 2018 11th Biomedical Engineering International Conference (BMEiCON), Chiang Mai, Thailand, 21–24 November 2018; pp. 1–5. [Google Scholar] [CrossRef]
  20. Rastgoo, R.; Kiani, K.; Escalera, S. Video-based isolated hand sign language recognition using a deep cascaded model. Multimed. Tools Appl. 2020, 79, 22965–22987. [Google Scholar] [CrossRef]
  21. Rastgoo, R.; Kiani, K.; Escalera, S. Real-time isolated hand sign language recognition using deep networks and SVD. J. Ambient Intell. Humaniz. Comput. 2022, 13, 591–611. [Google Scholar] [CrossRef]
  22. Sharma, P.; Anand, R.S. A comprehensive evaluation of deep models and optimizers for Indian sign language recognition. Graph. Vis. Comput. 2021, 5, 200032. [Google Scholar] [CrossRef]
  23. Katoch, S.; Singh, V.; Tiwary, U.S. Indian Sign Language recognition system using SURF with SVM and CNN. Array 2022, 14, 100141. [Google Scholar] [CrossRef]
  24. Tripathi, S.; Singh, S.K.; Kuan, L.H. Bag of Visual Words (BoVW) with Deep Features—Patch Classification Model for Limited Dataset of Breast Tumours. arXiv 2022, arXiv:2202.10701. [Google Scholar] [CrossRef]
  25. Tian, Y.; Shi, Y.; Liu, X. Recent advances on support vector machines research. Technol. Econ. Dev. Econ. 2012, 18, 5–33. [Google Scholar] [CrossRef]
  26. Subramanian, B.; Olimov, B.; Naik, S.M.; Kim, S.; Park, K.-H.; Kim, J. An integrated mediapipe-optimized GRU model for Indian sign language recognition. Sci. Rep. 2022, 12, 11964. [Google Scholar] [CrossRef]
  27. Sak, H.; Senior, A.; Beaufays, F. Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition. arXiv 2014, arXiv:1402.1128. [Google Scholar] [CrossRef]
  28. Sundar, B.; Bagyammal, T. American Sign Language Recognition for Alphabets Using MediaPipe and LSTM. Procedia Comput. Sci. 2022, 215, 642–651. [Google Scholar] [CrossRef]
  29. Pathan, R.K.; Biswas, M.; Yasmin, S.; Khandaker, M.U.; Salman, M.; Youssef, A.A.F. Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network. Sci. Rep. 2023, 13, 16975. [Google Scholar] [CrossRef]
  30. Ruiz, D.S.; Olvera-López, J.A.; Olmos-Pineda, I. Word Level Sign Language Recognition via Handcrafted Features. IEEE Lat. Am. Trans. 2023, 21, 839–848. [Google Scholar] [CrossRef]
  31. Mohsin, S.; Salim, B.W.; Mohamedsaeed, A.K.; Ibrahim, B.F.; Zeebaree, S.R.M. American Sign Language Recognition Based on Transfer Learning Algorithms. Int. J. Intell. Syst. Appl. Eng. 2024, 12, 390–399. [Google Scholar]
  32. Amangeldy, N.; Ukenova, A.; Bekmanova, G.; Razakhova, B.; Milosz, M.; Kudubayeva, S. Continuous Sign Language Recognition and Its Translation into Intonation-Colored Speech. Sensors 2023, 23, 6383. [Google Scholar] [CrossRef]
  33. Wali, A.; Shariq, R.; Shoaib, S.; Amir, S.; Farhan, A.A. Recent progress in sign language recognition: A review. Mach. Vis. Appl. 2023, 34, 127. [Google Scholar] [CrossRef]
  34. Farooq, U.; Rahim, M.S.M.; Sabir, N.; Hussain, A.; Abid, A. Advances in machine translation for sign language: Approaches, limitations, and challenges. Neural Comput. Appl. 2021, 33, 14357–14399. [Google Scholar] [CrossRef]
  35. Zhang, M.-L.; Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048. [Google Scholar] [CrossRef]
  36. Louppe, G. Understanding Random Forests: From Theory to Practice. arXiv 2014, arXiv:1407.7502. [Google Scholar] [CrossRef]
  37. Hinton, G.E.; Osindero, S.; Teh, Y.-W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
  38. Bajaj, Y.; Malhotra, P. American Sign Language Identification Using Hand Trackpoint Analysis. arXiv 2020, arXiv:2010.10590. [Google Scholar] [CrossRef]
  39. Sahoo, A.K.; Mishra, G.S.; Ravulakollu, K.K. Sign Language Recognition: State of the Art. 2014.
  40. Maebatake, M.; Suzuki, I.; Nishida, M.; Horiuchi, Y.; Kuroiwa, S. Sign Language Recognition Based on Position and Movement Using Multi-Stream HMM. In Proceedings of the 2008 Second International Symposium on Universal Communication, Osaka, Japan, 15–16 December 2008; pp. 478–481. [Google Scholar] [CrossRef]
  41. Athitsos, V.; Alon, J.; Sclaroff, S.; Kollios, G. BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 89–104. [Google Scholar] [CrossRef] [PubMed]
  42. Rahim, A.; Hossain, N.; Wahid, T.; Azam, S. Face Recognition using Local Binary Patterns (LBP). Glob. J. Comput. Sci. Technol. 2013, 13, 1–8. [Google Scholar]
  43. Huang, C.; Huang, J. A Fast HOG Descriptor Using Lookup Table and Integral Image. arXiv 2017, arXiv:1703.06256. [Google Scholar] [CrossRef]
  44. Verma, R.; Kaur, M.R. Enhanced Character Recognition Using Surf Feature and Neural Network Technique. 2014. Available online: https://www.semanticscholar.org/paper/Enhanced-Character-Recognition-Using-Surf-Feature-Verma-Kaur/49f3939df922881dd857faac71aa5c7b873a606a (accessed on 18 May 2024).
  45. Hussain, M.; Shaoor, A.; Alsuhibany, S.; Ghadi, Y.; Shloul, T.; Jalal, A.; Park, J. Intelligent Sign Language Recognition System for E-Learning Context. Comput. Mater. Contin. 2022, 72, 5327–5343. [Google Scholar] [CrossRef]
  46. Zhang, H. The Optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS), Fredericton, NB, Canada, 1 January 2004. [Google Scholar]
  47. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
  48. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html (accessed on 18 May 2024).
  49. Lewis, R. An Introduction to Classification and Regression Tree (CART) Analysis. In Proceedings of the Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, CA, USA, 22–25 May 2000. [Google Scholar]
  50. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  51. Hubert, M.; Van der Veeken, S. Outlier detection for skewed data. J. Chemom. 2008, 22, 235–246. [Google Scholar] [CrossRef]
  52. Butcher, B.; Smith, B.J. Feature Engineering and Selection: A Practical Approach for Predictive Models. In The American Statistician; Kuhn, M., Johnson, K., Eds.; Chapman & Hall/CRC Press: Boca Raton, FL, USA, 2020; Volume 74, pp. 308–309. ISBN 978-1-13-807922-9. [Google Scholar] [CrossRef]
  53. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  54. Shin, J.; Matsuoka, A.; Hasan, M.A.M.; Srizon, A.Y. American Sign Language Alphabet Recognition by Extracting Feature from Hand Pose Estimation. Sensors 2021, 21, 5856. [Google Scholar] [CrossRef]
  55. Obi, Y.; Claudio, K.S.; Budiman, V.M.; Achmad, S.; Kurniawan, A. Sign language recognition system for communicating to people with disabilities. Procedia Comput. Sci. 2023, 216, 13–20. [Google Scholar] [CrossRef]
  56. Joksimoski, B.; Zdravevski, E.; Lameski, P.; Pires, I.M.; Melero, F.J.; Martinez, T.P.; Garcia, N.M.; Mihajlov, M.; Chorbev, I.; Trajkovik, V. Technological Solutions for Sign Language Recognition: A Scoping Review of Research Trends, Challenges, and Opportunities. IEEE Access 2022, 10, 40979–40998. [Google Scholar] [CrossRef]
  57. Fang, G.; Gao, W.; Zhao, D. Large Vocabulary Sign Language Recognition Based on Fuzzy Decision Trees. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2004, 34, 305–314. [Google Scholar] [CrossRef]
  58. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
  59. Quinlan, J.R. Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 1996, 4, 77–90. [Google Scholar] [CrossRef]
  60. Li, X.; Yi, S.; Cundy, A.B.; Chen, W. Sustainable decision-making for contaminated site risk management: A decision tree model using machine learning algorithms. J. Clean. Prod. 2022, 371, 133612. [Google Scholar] [CrossRef]
  61. Lyu, Y.; Huang, X. Road Segmentation Using CNN with GRU. arXiv 2018, arXiv:1804.05164. [Google Scholar]
Figure 1. Hand signs of the MSL letters.
Figure 2. The taxonomy of sign language recognition.
Figure 3. Measurements of the six features, shown here for the letter A. The unfiltered data are displayed in (a), while the filtered (noise-reduced) data obtained with the procedures described above are shown in (b).
Figure 4. The hand-landmark detection task works on both static images and video. (a) Twenty-one landmarks of the hand in image coordinates. (b) The angles between the distal phalanges and the palm are defined as the internal angles αA, αB, αC, αD, αE, and αF.
Figure 5. The algorithm detects the hand regardless of its location. (a) Training the MSL character ‘A’ near the camera. (b) Training the MSL character ‘A’ away from the camera. (c) Training the character ‘Y’ in an incorrect position. (d) Training the character ‘Y’ in the correct position. The overall process comprises two stages: training and validation.
Figure 6. Training and labeling process of the 21 static letters of the MSL.
Figure 7. Complete block diagram of the training and validation stages.
Figure 8. Metrics of the compared algorithms, showing their respective accuracy and execution time.
Figure 9. (a–i): Confusion matrices for various sign language prediction models. Each confusion matrix illustrates the performance of a specific model in correctly classifying sign language gestures. The models compared include (a) recurrent neural networks (RNNs), (b) support vector machines (SVM), (c) naïve Bayes, (d) K-nearest neighbors (KNN), (e) random forest (RF), (f) XGBoost, (g) LightGBM, (h) CatBoost, and (i) decision tree (DT C4.5). These matrices help identify the model with the highest accuracy and best classification performance for SLR.
Figure 10. DT C4.5 is magnified twice to show (1) the computed features and arguments at the input of the tree and (2) the output of the already labeled leaves in the DT C4.5 classifier for predicting letters based on certain hand features detected by the SLR system. The meaning of each part of the tree is fully explained.
Figure 11. Results obtained with the proposed method.
Table 1. Description of a confusion matrix. The confusion matrix evaluates classification accuracy and determines a classifier’s overall performance. It defines key concepts, such as precision (P), recall (R), true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
|                                | Actual Positive       | Actual Negative            |                                          |
|--------------------------------|-----------------------|----------------------------|------------------------------------------|
| Predicted Negative             | False Negative (FN)   | True Negative (TN)         | Neg. Pred. Value = TN/(TN + FN)          |
| Predicted Positive             | True Positive (TP)    | False Positive (FP)        | Precision = TP/(TP + FP)                 |
| F1-score = (2 × P × R)/(P + R) | Recall = TP/(TP + FN) | Specificity = TN/(TN + FP) | Accuracy = (TP + TN)/(TP + TN + FP + FN) |
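As an illustrative aid (not part of the original study), the short Python sketch below shows how the quantities defined in Table 1 can be computed from a binary confusion matrix with scikit-learn; the y_true/y_pred arrays are placeholder values introduced only for demonstration.

```python
# Minimal sketch: deriving the Table 1 metrics from a binary confusion matrix.
# The labels below are placeholder data, not results from this study.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # 1 = positive class, 0 = negative class
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision   = tp / (tp + fp)                      # TP / (TP + FP)
recall      = tp / (tp + fn)                      # TP / (TP + FN)
specificity = tn / (tn + fp)                      # TN / (TN + FP)
f1_score    = 2 * precision * recall / (precision + recall)
accuracy    = (tp + tn) / (tp + tn + fp + fn)

print(f"P={precision:.2f} R={recall:.2f} Spec={specificity:.2f} "
      f"F1={f1_score:.2f} Acc={accuracy:.2f}")
```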
Table 2. Training and labeling process.
Pseudocode 1 Hand Angle Data Collection and Training
1:  Start
2:  Import necessary libraries (cv2, mediapipe, numpy, pandas)
3:  Configure mediapipe for hand detection and capture video from the camera
4:  Write data into an Excel file
5:  Initialize counter to zero
6:  While true:
7:      Capture a frame from the camera
8:      If the frame was captured successfully:
9:          Flip the frame horizontally
10:         Convert the frame to RGB format
11:         Process the frame with the hand-detection model
12:         If hands are detected in the frame:
13:             For each detected hand:
14:                 For each hand landmark:
15:                     Calculate the coordinates of the finger landmarks and wrist
16:                     Draw circles at the finger landmarks and wrist
17:                 For each pair of landmarks forming the hand:
18:                     Calculate the angles between the fingers and wrist
19:                     Draw lines between landmarks to represent the hand
20:                     Display the calculated angles near the landmarks
21:                     Print the calculated angles
22:         Show the frame with the detected hands and calculated angles
23:         If the 'ESC' key is pressed:
24:             Exit the loop
25: Release camera resources and close all windows
26: End
The process begins with the importation of the requisite libraries and the configuration of hand detection. This is followed by writing data to an Excel file and initiating an infinite loop to acquire and process images captured by the camera. Within this loop, the algorithm detects the hand in the image, calculates the reference angles, displays the results, and stores the calculated angles.
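To make this pseudocode more concrete, the following Python sketch captures frames with OpenCV, extracts the 21 MediaPipe hand landmarks, computes six internal angles, and writes them to an Excel file. It is a minimal illustration, not the authors' exact implementation: the landmark triplets in ANGLE_TRIPLETS and the file name angles.xlsx are assumptions made only for this example.

```python
# Sketch of hand-angle data collection with MediaPipe Hands (illustrative only).
# ANGLE_TRIPLETS is an assumed set of (vertex, point_a, point_b) landmark indices;
# the paper defines six internal angles between the distal phalanges and the palm.
import cv2
import mediapipe as mp
import numpy as np
import pandas as pd

mp_hands = mp.solutions.hands
ANGLE_TRIPLETS = [(0, 4, 8), (0, 8, 12), (0, 12, 16), (0, 16, 20), (5, 0, 8), (17, 0, 20)]

def angle_at(vertex, a, b):
    """Internal angle (degrees) at `vertex` between rays vertex->a and vertex->b."""
    v1, v2 = a - vertex, b - vertex
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

rows = []
cap = cv2.VideoCapture(0)
with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.flip(frame, 1)                                   # mirror the image
        result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.multi_hand_landmarks:
            for hand in result.multi_hand_landmarks:
                pts = np.array([[lm.x, lm.y] for lm in hand.landmark])
                angles = [angle_at(pts[v], pts[a], pts[b]) for v, a, b in ANGLE_TRIPLETS]
                rows.append(angles)
        cv2.imshow("capture", frame)
        if cv2.waitKey(1) & 0xFF == 27:                              # ESC exits the loop
            break
cap.release()
cv2.destroyAllWindows()
pd.DataFrame(rows, columns=[f"alpha_{c}" for c in "ABCDEF"]).to_excel("angles.xlsx", index=False)
```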
Table 3. Validation process for the hand-gesture recognition system with DT C4.5.
Pseudocode 2 Real-Time Hand Gesture Recognition Validation
1:    Start
2:    Import necessary libraries (cv2, mediapipe, matplotlib, numpy, pandas, sklearn)
3:    Configure mediapipe for hand detection and capture video from the camera
4:    Read hand gesture data from an Excel file
5:    Initialize necessary variables and data structures
6:    With mediapipe.Hands(
7:        static_image_mode = False,
8:        max_num_hands = 2,
9:        min_detection_confidence = 0.5) as hands:
10:       While True:
11:            Read a frame from the camera
12:            If the frame was read successfully:
13:                Process the frame with the hand detection model
14:                If hands are detected in the frame:
15:                    For each detected hand:
16:                        Extract coordinates of finger landmarks and wrist
17:                        Calculate angles between finger landmarks and wrist
18:                        Visualize landmarks and hand lines on the frame
19:                        Display calculated angles near the landmarks
20:                        Store the calculated angles
21:            Show the frame with detected hands and calculated angles
22:            If the ‘ESC’ key is pressed:
23:               Perform gesture classification using a decision tree model
24:               Display the predicted gesture label on the frame
25:               If needed, save the output video with gesture labels
26:               Exit the loop
27:        Release camera resources and close all windows
28:    End
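The classification step at the end of Pseudocode 2 can be sketched with scikit-learn [50]. Note that scikit-learn's DecisionTreeClassifier implements an optimized CART [49]; criterion="entropy" is used below only to approximate the C4.5 behavior described in the paper, and the file name, angle columns, and the "letter" label column carry over as assumptions from the hypothetical collection sketch above.

```python
# Sketch of the train/test split and decision-tree classification over the angle features.
# "angles.xlsx" and its column names (including a "letter" label added during labeling)
# are assumptions from the earlier sketch, not the authors' actual files.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_excel("angles.xlsx")                     # six angle columns + a "letter" label
X = data[[f"alpha_{c}" for c in "ABCDEF"]]
y = data["letter"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)

print("Testing accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```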
Table 4. Performance of the compared models.
Model Results
| Metric/Method     | DT C4.5 | SVM  | NB   | k-NN | RF   | XGBoost | LightGBM | CatBoost | RNNs |
|-------------------|---------|------|------|------|------|---------|----------|----------|------|
| Training Accuracy | 1.00    | 0.98 | 0.97 | 0.99 | 1.00 | 1.00    | 1.00     | 1.00     | 0.92 |
| Training Loss     | 0.00    | 0.02 | 0.03 | 0.01 | 0.00 | 0.00    | 0.00     | 0.00     | 0.27 |
| Testing Accuracy  | 0.99    | 0.96 | 0.96 | 0.98 | 0.99 | 0.99    | 0.99     | 1.00     | 0.96 |
| Testing Loss      | 0.01    | 0.04 | 0.04 | 0.02 | 0.01 | 0.01    | 0.01     | 0.00     | 0.04 |
| Accuracy          | 0.99    | 0.96 | 0.96 | 0.98 | 0.99 | 0.99    | 0.99     | 1.00     | 0.96 |
| Precision         | 0.99    | 0.96 | 0.97 | 0.98 | 1.00 | 0.99    | 0.99     | 1.00     | 0.96 |
| Recall            | 0.99    | 0.96 | 0.96 | 0.98 | 0.99 | 0.99    | 0.99     | 1.00     | 0.96 |
| F1 Score          | 0.99    | 0.96 | 0.96 | 0.98 | 0.99 | 0.99    | 0.99     | 1.00     | 0.96 |
The numbers marked in red are the ones with the best performance.
Table 5. Training and prediction times.
Execution Times
| Time/Method     | DT C4.5 | SVM    | NB     | k-NN   | RF     | XGBoost | LightGBM | CatBoost | RNNs   |
|-----------------|---------|--------|--------|--------|--------|---------|----------|----------|--------|
| Training Time   | 0.01 s  | 0.05 s | 0.00 s | 0.02 s | 0.36 s | 0.31 s  | 0.51 s   | 2.42 s   | 7.09 s |
| Prediction Time | 0.00 s  | 0.03 s | 0.01 s | 0.03 s | 0.02 s | 0.00 s  | 0.10 s   | 0.00 s   | 0.12 s |
The numbers marked in red are the ones with the best time.
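Timings of the kind reported in Table 5 can be obtained with Python's time.perf_counter around the fit and predict calls; the minimal sketch below reuses the hypothetical clf, X_train, y_train, and X_test objects from the example following Pseudocode 2.

```python
# Rough timing sketch for training and prediction (hypothetical clf/X/y from above).
import time

t0 = time.perf_counter()
clf.fit(X_train, y_train)
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
clf.predict(X_test)
predict_time = time.perf_counter() - t0

print(f"Training: {train_time:.2f} s, Prediction: {predict_time:.2f} s")
```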
Table 6. Cross-validation metrics.
Cross-Validation Scores
| Metric/Method         | DT C4.5 | SVM | NB | k-NN | RF | XGBoost | LightGBM | CatBoost | RNNs |
|-----------------------|---------|-----|----|------|----|---------|----------|----------|------|
| Cross-Validation Time | 0.07 s | 0.45 s | 0.02 s | 0.24 s | 1.78 s | 1.58 s | 2.41 s | 11.17 s | 29.00 s |
| CV Scores             | 0.977, 0.985, 0.984, 0.984, 0.985 | 0.946, 0.960, 0.932, 0.952, 0.951 | 0.954, 0.963, 0.936, 0.926, 0.976 | 0.965, 0.982, 0.954, 0.944, 0.973 | 0.984, 0.992, 0.971, 0.974, 0.976 | 0.937, 0.988, 0.967, 0.976, 0.981 | 0.937, 0.989, 0.967, 0.976, 0.981 | 0.990, 0.992, 0.991, 0.995, 0.993 | 0.967, 0.970, 0.967, 0.968, 0.980 |
| CV Mean               | 0.98 | 0.96 | 0.95 | 0.96 | 0.98 | 0.97 | 0.97 | 0.99 | 0.97 |
| CV Standard Deviation | 0.00 | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.00 | 0.01 |
The numbers marked in red are the ones with the best performance.
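Five-fold scores, means, and standard deviations such as those in Table 6 correspond to the output of scikit-learn's cross_val_score; a minimal sketch, again reusing the hypothetical clf, X, and y from the earlier example, is shown below.

```python
# Minimal 5-fold cross-validation sketch (hypothetical clf, X, y from the earlier example).
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=5)
print("CV scores:", np.round(scores, 3))
print("CV mean:", scores.mean(), "CV std:", scores.std())
```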
Table 7. Metrics for the proposed method using the decision tree C4.5.
| Letter       | Accuracy | Precision | Recall | F1-Score | Support |
|--------------|----------|-----------|--------|----------|---------|
| A            | 0.99     | 0.98      | 1.00   | 0.99     | 82      |
| B            | 0.99     | 0.98      | 1.00   | 0.99     | 47      |
| C            | 0.99     | 1.00      | 1.00   | 1.00     | 54      |
| D            | 0.99     | 1.00      | 0.95   | 0.98     | 42      |
| E            | 0.99     | 1.00      | 0.99   | 0.99     | 70      |
| F            | 0.99     | 0.94      | 0.97   | 0.96     | 34      |
| G            | 0.99     | 1.00      | 1.00   | 1.00     | 67      |
| H            | 0.99     | 1.00      | 0.97   | 0.99     | 39      |
| I            | 0.99     | 1.00      | 1.00   | 1.00     | 62      |
| L            | 0.99     | 1.00      | 1.00   | 1.00     | 80      |
| M            | 0.99     | 0.98      | 0.94   | 0.96     | 52      |
| N            | 1.00     | 0.97      | 1.00   | 0.99     | 71      |
| O            | 0.99     | 0.97      | 0.98   | 0.98     | 62      |
| P            | 1.00     | 1.00      | 1.00   | 1.00     | 47      |
| R            | 0.99     | 1.00      | 0.96   | 0.98     | 50      |
| S            | 0.99     | 0.97      | 0.94   | 0.96     | 35      |
| T            | 1.00     | 1.00      | 1.00   | 1.00     | 54      |
| U            | 1.00     | 0.98      | 0.97   | 0.97     | 60      |
| V            | 0.99     | 0.95      | 0.98   | 0.97     | 64      |
| W            | 1.00     | 1.00      | 1.00   | 1.00     | 36      |
| Y            | 1.00     | 0.97      | 1.00   | 0.98     | 30      |
| Weighted Avg | 0.99     | 0.99      | 0.99   | 0.99     | 1138    |
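Per-letter precision, recall, F1-score, and support values in the format of Table 7 can be generated with scikit-learn's classification_report; the short sketch below reuses the hypothetical clf, X_test, and y_test from the earlier example.

```python
# Per-class metrics sketch (hypothetical clf, X_test, y_test from the earlier example).
from sklearn.metrics import classification_report

print(classification_report(y_test, clf.predict(X_test), digits=2))
```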
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
