1. Introduction
Communication plays an important role in conveying messages, feelings, and perceptions, and it is one of the main ways humans interact with their environment; it involves capturing sounds and interpreting the language used by others who intend to communicate [1]. Sign language is a communication tool used by the deaf community to facilitate communication, both among deaf people and with hearing individuals. It is expressed through hand gestures, facial expressions, and body movements [
2]. In addition, the use of sign language also helps increase the inclusion and involvement of students with disabilities in the general school environment [
3]. Normal hearing plays an important role in the acquisition and production of spoken language because it allows children to be immersed in the spoken language around them. For individuals with hearing disabilities, however, the challenge lies in developing an effective means of supporting communication [
4]. Any gesture or visual language that uses certain hand, arm, and finger shapes and movements, along with eye, face, head, and body movements, is called sign language [
5]. People with disabilities have limitations that can hinder their participation and role in daily life. In 2010, the Central Statistics Agency (BPS) reported that 3,024,271 of Indonesia's 191,709,144 people had physical disabilities, including hearing and speech disorders [
3].
This study aims to help individuals interact and share information with each other without communication gaps, and to develop a sign language detection model that makes it easier to communicate with people with disabilities, using a deep learning approach based on the YOLO (You Only Look Once) algorithm. In this study, we use American Sign Language (ASL) in the form of symbols representing words or sentences [
6]. The method used in this study is the YOLO-v11 algorithm. YOLO is a real-time object detection algorithm developed by Joseph Redmon and Ali Farhadi in 2015 [
7]. In their work, owing to its simple architecture, YOLO is reported to be very fast at identifying objects, and the average accuracy obtained reached 88% on the ImageNet 2012 validation set [
8].
YOLO-v11 is the latest state-of-the-art (SOTA) model; it builds on the success of previous YOLO versions and introduces new features and enhancements to further improve performance and flexibility. YOLO-v11 is designed to be fast, accurate, and easy to use, making it an excellent choice for a variety of object detection and tracking tasks, as well as instance segmentation, image classification, and pose estimation [
9]. Like earlier YOLO versions, YOLO-v11 uses convolutional neural networks (CNNs) to predict bounding boxes and object class probabilities directly from input images in a single pass at real-time speeds; the original YOLO already ran at 45 FPS (roughly 22 ms per image) [
10]. In this study, the data used are a manually created ASL dataset containing 4000 images of hand gestures in various positions [
11].
2. Related Works
ASL is a sign language used by people with disabilities to communicate. The study entitled “A Comprehensive Application for Sign Language Alphabet and World Recognition, Text-to-Action Conversion for Learners, Multi-Language Support and Integrated Voice Output Functionality” introduces a comprehensive application designed to help sign language users learn to communicate [
12]. In another study, a machine translation system was also applied, which aimed to convert spoken Turkish into Turkish Sign Language (TID) [
13]. Another study developed and introduced a semantic analysis algorithm for simple sentences, aimed at translating Russian text into Russian Sign Language based on a comparison of the proposed syntactic structures [
14].
Several studies have used convolutional neural networks (CNNs) and computer vision algorithms to translate ASL into text and speech in local languages, such as in a study in Nepal that achieved over 99% accuracy [
15]. Furthermore, other studies have shown that sign languages have dialectal variations that make automatic recognition difficult, so methods such as 3D convolutional networks and skeleton-based recognition have been used. The sign language transformer model excelled with a BLEU-4 score of 21.80, more than double that of the previous model (9.58) [
16]. Another study using R-CNN, 3D CNN, and LSTM achieved 99% accuracy in recognizing the sign language vocabulary, while CorrNet reduced the Word Error Rate to 18.8% on the training set [
17]. Translation on PHOENIX-Weather-2014T achieved a BLEU-4 of 24.32, lagging behind English–German translation (30.9) [
18]. The TGCN pose-based model achieved an accuracy of 62.63% for 2000 words [
19]. Another approach leverages a lightweight 3D convolution module; experimental results illustrate the performance of RealTimeSignNet on standard sign language datasets, achieving an accuracy of 88.1% on the large continuous sign language dataset (continuous SLR), 98.2% on the isolated sign language dataset (500 SLR), and 91.50% on the English sign language dataset (WLAS) [
20].
On the other hand, another study proposes a novel multi-lingual multi-modal SLR framework, MLMSign, achieving high precision on six benchmark datasets (i.e., Massey, Static ASL, NUS II, TSL Fingerspelling, BdSL36v1, and PSL) [
21]. Another study addresses the computational complexity associated with sign language recognition (SLR) methods; the proposed method is based on a residual graph convolutional network (ResGCN). The method is tested on five challenging SLR datasets—WLASL-100, WLASL-300, WLASL-1000, LSA-64, and MINDS-Libras—and achieves impressive accuracies of 83.33%, 72.90%, 64.92%, 100%, and 96.70%, respectively [
22].
Using the ADDSL dataset with the YOLOv5 method, a single-stage object detector, an average inference time of 9.02 ms per image and a best accuracy of 92% were achieved [
23]. In another study using YOLOv2, Vitis AI, and FINN for sign language recognition on a Field Programmable Gate Array (FPGA), a mean average precision (mAP) score of 61.2% was achieved on the Indian Sign Language (ISL) Hindi dataset [
24]. The YOLOv8 method has been applied to recognize and interpret sign language gestures in real time, with the best recognition accuracy of the proposed approach being 99.4% on the AASL dataset [
25]. Another study focuses specifically on YOLO-v9, which was released in 2024. Overall, although both variants perform well in real-time hand gesture recognition, YOLO-v9e is superior in terms of precision and classification, whereas YOLO-v9c may be the better choice when detection speed is the primary requirement. Both models can accurately identify all 26 ASL letters, illustrating their suitability for hand gesture recognition applications [
26].
In this study, using image recognition (pattern recognition) technology, the system identifies hand gestures and translates them into text or voice that can be understood by individuals who cannot communicate using sign language. This study uses the latest version of the YOLO method, namely YOLO-v11, a state-of-the-art (SOTA) model that builds on the success of previous YOLO versions and introduces new features and enhancements, as described in the Introduction. The original YOLO architecture is heavily influenced by the GoogLeNet backbone: a network of 24 convolutional layers performs feature extraction, followed by two fully connected (FC) layers that predict bounding box coordinates and object class probabilities. The architecture of YOLO is illustrated in
Figure 1.
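To make this grid-based design concrete, the following is a minimal PyTorch-style sketch of the classic YOLO detection head described above; the 24-layer backbone is abbreviated to a few layers, and all layer sizes are illustrative assumptions rather than the actual YOLO-v11 architecture.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the classic YOLO head described above: a convolutional
# backbone for feature extraction followed by two fully connected layers that
# output an S x S grid of box and class predictions. Sizes are assumptions.
S, B, C = 7, 2, 5  # grid size, boxes per cell, number of classes (5 ASL signs)

class TinyYOLOHead(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the 24-convolutional-layer backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d((S, S)),
        )
        # Two fully connected layers produce bounding boxes and class scores.
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * S * S, 496), nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        # For each grid cell: B boxes (x, y, w, h, confidence) + C class scores.
        return self.fc(self.backbone(x)).view(-1, S, S, B * 5 + C)

pred = TinyYOLOHead()(torch.randn(1, 3, 448, 448))
print(pred.shape)  # torch.Size([1, 7, 7, 15])
```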
Sign language is the main subject used as the basis for the image dataset. ASL image datasets fall into three categories, namely letters/alphabet, numbers, and symbols (one symbol per word), as illustrated in
Figure 2. This study uses a dataset in the form of symbols, words, or sentences.
3. Materials and Methods
This research aims to develop a sign language detection system using YOLO-v11, which is widely known as one of the most effective approaches to object detection because of its ability to process images quickly and efficiently. The research process consists of data collection, pre-processing, model training, and evaluation and testing, as shown in
Figure 3.
The first image, or
Figure 3a, provides an overview of the model training process, from data collection to the training phase, where the model learns to recognize patterns from the given data. This process involves data pre-processing, dataset partitioning, and the use of machine learning algorithms to optimize the model.
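As a minimal sketch of the dataset partitioning step mentioned above, the snippet below splits a flat folder of images into training, validation, and test subsets using the 75/15/10 ratio reported in Section 3.2; the directory layout and file extension are assumptions for illustration.

```python
import random
import shutil
from pathlib import Path

# Minimal sketch: split a flat folder of images into train/val/test subsets
# using the 75/15/10 ratio described in Section 3.2. The source and target
# directory names are assumptions for illustration.
def split_dataset(src="dataset/images", dst="dataset/split", seed=42):
    images = sorted(Path(src).glob("*.jpg"))
    random.Random(seed).shuffle(images)

    n = len(images)
    n_train, n_val = int(0.75 * n), int(0.15 * n)
    subsets = {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }

    for name, files in subsets.items():
        out_dir = Path(dst) / name
        out_dir.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy(f, out_dir / f.name)  # label files follow the same split
        print(f"{name}: {len(files)} images")

if __name__ == "__main__":
    split_dataset()
```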
Meanwhile, the second image, or
Figure 3b, focuses on the implementation of the trained model. Once the training process is complete, the model can be used to detect and recognize gestures in real time. This is performed using video input or images taken from the camera, where the model processes and classifies the incoming gestures directly. In other words, the second image shows the practical application of the trained model in real-world use, namely live and interactive gesture recognition.
3.1. Data Collection
This research uses five classes of image objects: “Hello”, “Thank You”, “No”, “Yes”, and “I Love You”; the dataset consists of 4000 images, with 800 images per class. Example images for each class are given in
Figure 4.
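For the five classes above, a dataset description of the kind expected by the Ultralytics YOLO tooling could look like the following sketch; the directory paths are assumptions, while the class names follow the dataset described in this section.

```python
from pathlib import Path

# Minimal sketch: write the dataset description file used by the Ultralytics
# YOLO tooling. The directory paths are assumptions; the five class names
# follow the dataset described above.
data_yaml = """\
path: dataset/split          # dataset root (assumed layout)
train: train/images
val: val/images
test: test/images

names:
  0: Hello
  1: Thank You
  2: No
  3: Yes
  4: I Love You
"""

Path("data.yaml").write_text(data_yaml)
print(Path("data.yaml").read_text())
```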
3.1.1. Pre-Processing
Labeling
The process begins with labeling the dataset to provide class information so that the model can properly learn patterns during training. Labeling is performed by marking each sample, such as a hand gesture, with the appropriate label, either manually or with semi-automatic tools. Accurate labeling is essential to ensure the model receives the correct information, allowing it to learn the proper mapping between the input features and the desired output. This process needs to be consistent and of high quality to avoid errors and biases in the model, especially for complex or overlapping gestures.
Bounding Box
In the context of image object detection, the bounding box provides coordinates for the location of the image object. For example, in an image showing hands forming the word “Hello” in sign language, the bounding box highlights the area where the hand is located. In addition to location, bounding boxes indicate the relative sizes of objects in the image. This information is important to help the model recognize objects even if their sizes are different in different images.
In
Figure 5, we can see the results of the labeling and bounding box process, where each object in the image is labeled according to the relevant category; for example, the ASL hand gesture meaning “Yes” is labeled “Yes”. The bounding box, a box that surrounds the object, marks the object’s location precisely, helping the model focus on a specific area, such as the hand or face. This process is important to allow the model to detect and distinguish objects in various positions and sizes, improving detection accuracy.
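For reference, YOLO-format annotations store one text line per bounding box: the class index followed by the box center and size, normalized to the image dimensions. The sketch below shows a hypothetical label line for a “Yes” gesture and how it maps back to pixel coordinates; the coordinate values are illustrative assumptions.

```python
# A YOLO-format label file contains one line per object:
#   <class_id> <x_center> <y_center> <width> <height>
# with coordinates normalized to [0, 1] relative to the image size.
# The values below are illustrative assumptions for a "Yes" gesture (class 3).
example_label = "3 0.512 0.468 0.310 0.545"

class_id, xc, yc, w, h = example_label.split()
img_w, img_h = 640, 640  # training resolution used in this study

# Convert the normalized box back to pixel corner coordinates.
x1 = (float(xc) - float(w) / 2) * img_w
y1 = (float(yc) - float(h) / 2) * img_h
x2 = (float(xc) + float(w) / 2) * img_w
y2 = (float(yc) + float(h) / 2) * img_h
print(f"class {class_id}: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")
```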
3.2. Model Training
In this study, the dataset was divided into 75% for training, 15% for validation, and 10% for testing, so the 4000 images (800 per class) were split into 3019 images for training, 583 for validation, and 401 for testing, all at a resolution of 640 × 640 pixels. The YOLO-v11 model was trained with a batch size of 16 for 50 epochs, with per-image speeds of 0.2 ms for pre-processing, 2.4 ms for inference, 0.0 ms for loss computation, and 2.3 ms for post-processing; the model showed a significant performance increase during the first few epochs. Training in Google Colab on an NVIDIA GeForce RTX 3080 GPU took about 0.936 h. The most significant improvements can be seen between epochs 40 and 50, where the model shows steady gains in almost all metrics (box loss, class loss, DFL loss, precision, recall, and mAP). Epoch 50 is the peak, recording a very high mAP50 value (0.994) and a good mAP50-95 value (0.731), indicating that the model achieved optimal performance in detecting objects. The results are summarized in
Table 1.
One of the main components supporting the success of this training is the choice of the Adam (Adaptive Moment Estimation) optimization algorithm. Adam is designed to speed up the learning process in neural networks; it adaptively sets the learning rate for each parameter in the model using two moment estimates (the first and second moments of the gradients).
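A minimal sketch of how this training setup could be reproduced with the Ultralytics YOLO API is shown below, assuming a `data.yaml` file describing the five classes and the 75/15/10 split; it mirrors the reported hyperparameters (640 × 640 input, batch size 16, 50 epochs, Adam) but is illustrative rather than the exact training script used in this study.

```python
from ultralytics import YOLO

# Minimal sketch of the training configuration described above (assumed, not
# the exact script used in this study): YOLO-v11, 640x640 inputs, batch size
# 16, 50 epochs, Adam optimizer.
model = YOLO("yolo11n.pt")  # pretrained YOLO-v11 checkpoint (nano variant assumed)

results = model.train(
    data="data.yaml",     # dataset description with the five gesture classes
    imgsz=640,            # 640 x 640 pixel resolution
    epochs=50,
    batch=16,
    optimizer="Adam",     # Adam (Adaptive Moment Estimation)
    lr0=0.001,            # assumed initial learning rate
)

# Evaluate on the held-out test split after training.
metrics = model.val(split="test")
print(metrics.box.map50, metrics.box.map)  # mAP50 and mAP50-95
```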
3.3. Model Evaluation and Testing
From the graphs, it can be concluded that the machine learning model being evaluated performs quite well. In the training process, illustrated in Figure 6a, the continuously decreasing loss value indicates that the model keeps improving as it learns from the training data. In addition, the high precision and recall values indicate that the model is able to detect objects accurately, both in terms of localization and classification. The orange line is a smoothed version of the blue line, making the main trends of the training progress easier to follow.
The graphs in
Figure 6b provide a good overview of the performance of the object detection model during the validation process. The high mAP value indicates that the model is able to detect objects with good accuracy. On the other hand, fluctuations in the loss and metric values on the validation data indicate that the training process may not be completely stable. The orange line is a smoothed version of the blue line, making the main trends of the validation progress easier to follow.
4. Results and Discussion
Testing used the model trained through the Google Colab training process. The dataset totals 4000 images divided into five classes: “Hello”, “Thank You”, “No”, “Yes”, and “I Love You”, with 3019 images for training, 583 for validation, and 401 for testing. The image size is 640 × 640 pixels, and training was carried out for 50 epochs; the results are shown in
Figure 7.
The Confusion Matrix above shows that most of the predictions are on the main diagonal, which means the model has a high level of accuracy. This indicates that the model correctly identifies the majority of classes, with relatively few misclassifications. However, it can be seen that the model still often misrecognizes “Background” or mixes it with other classes, particularly in cases where the background is visually similar to certain objects or gestures. This suggests that the model may have difficulty distinguishing between the background and objects with similar features, which could be a result of insufficient training data or a lack of clarity in some instances of the background.
As seen from
Table 2, the model has high precision and recall (P: 99.2%, R: 99.3%) with excellent average detection performance (mAP50 of 99.4%). The mAP50-95 result is 73.3%, indicating adequate performance that could still be improved at tighter IoU thresholds. When viewed per class, the “Thank You” class has the best performance on the mAP50-95 metric (77.6%), while the “Yes” class has the lowest (68.5%). However, precision and recall are very good for all classes, indicating that the model detects objects consistently. In terms of efficiency, with an average inference time of 10 ms per image, the model is very fast and suitable for real-time applications.
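For context, mAP50 evaluates detections at a single IoU (intersection over union) threshold of 0.5, whereas mAP50-95 averages performance over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is why it is the stricter metric. The short sketch below, with hypothetical box coordinates, shows how IoU between a predicted and a ground-truth box is computed.

```python
# Minimal sketch of the IoU (intersection over union) measure underlying
# mAP50 and mAP50-95. Boxes are (x1, y1, x2, y2) in pixels; the example
# coordinates are hypothetical.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (zero if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

predicted = (100, 120, 300, 340)     # hypothetical predicted box
ground_truth = (110, 130, 310, 350)  # hypothetical labeled box
score = iou(predicted, ground_truth)
print(f"IoU = {score:.2f}")  # counts as a hit at IoU >= 0.50 but not at >= 0.95
```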
In
Figure 8, the first row shows that the training was conducted on a GPU with CUDA, which is faster than a CPU. The model was trained on 3019 samples, a number that corresponds to the training split of the dataset. Training accuracy: the model accuracy on the training data is 94.67%, meaning the model classifies most of the training data correctly. Testing accuracy: the model accuracy on the test data is 93.02%, showing that the model generalizes well to new, previously unseen data.
In the webcam system test, the model was able to detect American Sign Language (ASL) gestures effectively and efficiently in real time. As can be seen in
Figure 9a, the model detects the I Love You sign with a confidence score of 0.69 and the Hello sign with 0.84. The No sign is detected with a confidence score of 0.68, as shown in
Figure 9b. The Yes sign is detected with a confidence score of 0.39, as shown in
Figure 9c. The system not only recognized the ASL gestures with high accuracy, but it also translated the detected signs into corresponding text, providing immediate feedback to the user. Moreover, the system incorporated a voice output feature that transformed the detected sign language gestures into spoken words, making it even more accessible for individuals who are not familiar with sign language. This audio feedback further enhances the usability of the system, especially for non-sign language users, by bridging the communication gap effectively. The real-time processing capability ensured that the user received prompt and accurate translations without noticeable delays, which is crucial for maintaining a natural flow in conversations.
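A minimal sketch of such a real-time loop, assuming a trained YOLO-v11 weights file (here called `best.pt`) and using OpenCV for the webcam feed and the pyttsx3 library as one possible offline text-to-speech option, is shown below; it illustrates the detect-then-speak flow rather than the exact system implemented in this study.

```python
import cv2
import pyttsx3
from ultralytics import YOLO

# Sketch of the real-time webcam pipeline described above (illustrative, not
# the exact implementation): detect gestures with the trained YOLO-v11 model,
# overlay the text label, and speak newly detected signs aloud.
model = YOLO("best.pt")          # assumed path to the trained weights
tts = pyttsx3.init()             # offline text-to-speech engine (one possible choice)
cap = cv2.VideoCapture(0)        # default webcam
last_spoken = None

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    results = model(frame, conf=0.35, verbose=False)[0]
    annotated = results.plot()   # frame with bounding boxes and labels drawn

    for box in results.boxes:
        label = model.names[int(box.cls)]
        if label != last_spoken:              # speak each newly detected sign once
            tts.say(label)
            tts.runAndWait()
            last_spoken = label

    cv2.imshow("ASL detection", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):     # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```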
5. Conclusions
This study, using YOLO-v11 as the main method for detecting sign language, obtained a training accuracy of 94.67% and a testing accuracy of 93.02%, indicating that the model performs very well in recognizing sign language from the training and testing datasets. Additionally, the model is very reliable in recognizing the classes “Hello”, “I Love You”, “No”, and “Thank You”, with a sensitivity close to or equal to 100%, showing that it recognizes the main gestures very well. The sensitivity for the “Background” class is still low, which means that the model often misrecognizes the background or mixes it with other classes. Several improvements are therefore needed: adding training data for classes that are often mispredicted, adding more representative data, using a loss function that handles class imbalance, and optimizing the detection threshold or data augmentation for the affected classes. With this research, it is hoped that sign language detection can be applied in daily life to help people with disabilities socialize and, in particular, communicate with others, so that communication equality is achieved.