Article

Exploration of Sign Language Recognition Methods Based on Improved YOLOv5s

by Xiaohua Li 1, Chaiyan Jettanasen 1 and Pathomthat Chiradeja 2,*
1 School of Engineering, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
2 Faculty of Engineering, Srinakharinwirot University, Bangkok 10110, Thailand
* Author to whom correspondence should be addressed.
Computation 2025, 13(3), 59; https://doi.org/10.3390/computation13030059
Submission received: 13 December 2024 / Revised: 20 January 2025 / Accepted: 6 February 2025 / Published: 24 February 2025

Abstract

Gesture is a natural and intuitive means of interpersonal communication. Sign language recognition has become a hot topic in scientific research, holding significant importance and research value in fields such as deep learning, human–computer interaction, and pattern recognition. A sign language recognition system must ensure both real-time performance and ease of deployment. Based on these two requirements, this paper proposes an improved YOLOv5s-based sign language recognition algorithm. The lightweight design concept of ShuffleNetV2 was applied to reduce the model size and improve deployability. The specific improvements are as follows: the Focus layer was removed, the ShuffleNetV2 structure was adopted, and the channels at the head of the neck layer were pruned. All convolutional layers and the cross-stage partial bottleneck layers with three convolutions (C3) in the backbone network were replaced with ShuffleBlock; the spatial pyramid pooling layer and its subsequent C3 structure were removed; and the C3 modules in the detection head were replaced with depthwise separable convolution modules. Experimental results show that the parameter count of the improved YOLOv5 algorithm decreased from 7.2 M to 0.72 M, and the inference time decreased from 3.3 ms to 1.1 ms.

1. Introduction

Sign language recognition holds significant research importance and value in fields such as deep learning, human–computer interaction, and pattern recognition [1]. It plays an indispensable role in the future development of human civilization by removing communication barriers for individuals with hearing impairments, thereby helping to normalize their daily lives. However, owing to the diversity and ambiguity of gestures, spatiotemporal differences, and complex hand deformations, vision-based sign language recognition remains a challenging interdisciplinary research topic.
In recent years, deep learning, as the vanguard of the new wave of artificial intelligence, has brought new vitality to the field of computer vision and new opportunities for research in sign language recognition. Currently, there are several challenging key issues in the process of sign language recognition: (1) Ensuring the effectiveness of sign language datasets from the deaf community is difficult. On the one hand, in order to adapt the training model for non-specific populations in sign language recognition, a large amount of demonstration data needs to be collected from different individuals; on the other hand, few studies can use authentic datasets from the deaf community. When using standard sign language data, the collected data are limited in scale, have low fault tolerance, and essentially ignore variations. (2) The practical application scenarios for sign language are often complex, with objective factors like background and lighting greatly affecting the algorithm’s recognition accuracy. (3) Compared to traditional gestures, gesture sequences in sign language contain a wealth of expressive terms and flexible movements, with severe body occlusions being more common, making it more challenging to design recognizable representations for sign language. (4) The ultimate goal of sign language recognition is to achieve recognition of continuous sign language [2]. However, there is transitional redundancy between consecutive sign symbols that does not belong to any specific symbol, which significantly impacts the accuracy of continuous sign language recognition.
To address the above challenges, this paper proposes a gesture recognition method based on deep learning, aimed at improving the accuracy and stability of gesture recognition. By enhancing the YOLOv5 algorithm and integrating model lightweighting, efficient representation and learning of gesture signals are achieved. Meanwhile, this paper designs a series of experiments to evaluate the performance of the proposed method on different datasets, comparing it with existing methods. Experimental results show that the improved algorithm achieves excellent performance in gesture recognition tasks, with strong generalization capabilities and robustness.
The rest of this work is organized as follows: Section 2 briefly reviews work related to object detection and deep learning-based detection methods; Section 3 introduces the basic YOLOv5 network and details the improved YOLOv5 network for gesture recognition; Section 4 describes our dataset and presents the evaluation and comparison results of our experiments; finally, Section 5 provides conclusions and an outlook.

2. Related Work

2.1. Overview of Object Detection Technology

Object detection methods can be divided into traditional methods and deep learning methods [3]. Traditional object detection methods are mainly based on manually designed features and machine learning algorithms, such as Haar features, Histogram of Oriented Gradients (HOG) features, and Scale-Invariant Feature Transform (SIFT) features, combined with classifiers like Support Vector Machine (SVM) and AdaBoost. These methods have achieved good results in certain scenarios, but they are limited when detecting complex scenes and multiple object categories. With the development of deep learning technology, deep learning-based object detection methods have made tremendous progress. Among them, Convolutional Neural Networks (CNNs) are one of the most widely used deep learning models. Typical CNN-based object detection methods include the R-CNN series, YOLO series, and SSD series, each with its own advantages and disadvantages; specific differences are detailed in Table 1. In this field, classic general-purpose models such as LeNet-5, AlexNet, VGGNet, GoogLeNet, and ResNet have emerged, capable of classifying multiple target objects and determining their locations within images. The general framework of existing object detection systems mainly consists of candidate region generation (region proposal), feature extraction and prediction, and non-maximum suppression (NMS), and is assessed with four evaluation indicators: frames per second (FPS) as a measure of network speed, mean average precision (mAP), the precision–recall (PR) curve, and intersection over union (IoU). These metrics evaluate the detection performance of a model across different categories and allow the performance of different models to be compared.
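To make the IoU indicator concrete, the minimal sketch below computes the intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) format; the helper name and the example coordinates are illustrative, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted box vs. a ground-truth box
print(iou((10, 10, 60, 60), (30, 30, 80, 80)))  # ≈ 0.22
```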
You Only Look Once (YOLO) is a network model for object detection developed by Joseph Redmon and others. YOLOv1 [4], released in 2016, is a single-stage detection network that unifies the two components of detection (object detector and classifier) [5]. YOLOv2 [6], also known as YOLO9000 and released in 2016, can detect over 9000 objects, hence the name. The mAP of YOLOv2 is not ideal; its main advantage is its fast testing speed. YOLOv3 [7], released in 2018, achieved a significant improvement in mAP compared to YOLOv2. YOLOv4 [8], developed by Bochkovskiy and others, introduced numerous optimization strategies based on the original YOLO object detection framework, with enhancements in data processing, backbone network, network training, activation functions, and loss functions.

2.2. Usage of Open Source Datasets for Gesture Recognition

The release of datasets such as ImageNet, MS COCO, Open Images, MNIST, and CIFAR on the internet has allowed many computer science researchers to use them as benchmarks for training algorithms and achieving state-of-the-art results. Public datasets provided by platforms like Kaggle, Papers with Code, HaGRID, Roboflow, and Mendeley Data have also greatly aided AI researchers [9]. Currently, in the field of sign language recognition, several datasets provided by European and American countries are frequently used. The most notable continuous sign language dataset is RWTH-PHOENIX-Weather [10], which provides annotations for facial expressions, hands, and the start and end points of sign language vocabulary, contributing significantly to tracking, feature extraction, and visual and language modeling. Another publicly available continuous sign language dataset is SIGNUM [11]. These two datasets are among the only sentence-level sign language datasets publicly available in recent years, whereas most other datasets are vocabulary-level, such as ASLLVD [12] and ASL-LEX [13].
The Hand Gesture Recognition Image Dataset (HaGRID) is a large-scale image dataset suitable for image classification or image detection tasks in contexts such as video conferencing, smart homes, and intelligent driving. The HaGRID dataset is 716 GB in size, containing 552,992 FullHD (1920 × 1080) RGB images across 18 gesture categories. The data are split into 92% for training and 8% for testing, with 509,323 images used for training and 43,669 images for testing. This dataset includes 34,730 unique individuals and at least as many unique scenes, featuring participants aged 18 to 65. It was mainly collected indoors, with significant lighting variation, including both artificial and natural light. The dataset also includes images captured under extreme conditions, such as facing and backlit windows. Additionally, participants had to display gestures at distances ranging from 0.5 to 4 m from the camera. For various considerations, the HaGRID dataset was used in this experiment.
Additionally, in the related field of gesture recognition, there are many excellent datasets, such as the well-regarded ChaLearn LAP ConGD [14] and 6DMG [15]. The number of Chinese sign language datasets is limited, with the most representative being the DEVISIGN dataset [16]. This dataset was established with sponsorship from Microsoft Research Asia, aiming to provide a large-scale, vocabulary-level Chinese sign language dataset for researchers worldwide to train and evaluate their recognition algorithms. Currently, the dataset includes 4414 Chinese sign language words, totaling 331,050 RGB-D videos along with corresponding skeletal data, contributed by 13 men and 17 women.

3. Proposed Methods

3.1. Introduction of the YOLOv5 Algorithm

The YOLOv5 algorithm is one of the newer algorithms in the YOLO series and was developed on the basis of the YOLOv4 algorithm. YOLOv5 can produce accurate, real-time, and efficient detection results. Compared with YOLOv4, YOLOv5 achieves model lightweighting while preserving accuracy; the YOLOv5 network model offers high detection accuracy and fast inference, with detection speeds of up to 140 f/s. The YOLOv5 architecture comes in four variants that differ mainly in the depth and width of their feature extraction, named YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. For the current problems in gesture recognition, such as missed detections, false detections, and low recognition rates in complex environments with uneven lighting or skin-colored backgrounds, the YOLOv5s model is the most suitable benchmark. The structure of YOLOv5 is shown in Figure 1. Its architecture consists of the following key components: the input layer, the backbone network (Backbone), the neck network (Neck), and the head network (Head).
YOLOv5 accepts fixed-size images as input, 640 × 640 pixels by default. To improve the generalization ability and robustness of the model, the input image is usually subjected to data augmentation such as random cropping, flipping, and color jittering during training. The backbone network is responsible for extracting features from the input image. YOLOv5 uses a lightweight backbone network called CSPDarknet53, which is based on the cross-stage partial network, to reduce computation while maintaining feature extraction capability. CSPDarknet53 contains multiple convolutional layers, batch normalization layers, and activation functions such as SiLU or Leaky ReLU, and it uses residual connections to facilitate information flow and gradient propagation. The CSPBlock introduced in CSPDarknet53 supports feature extraction through group convolution and path aggregation, optimizing computational efficiency while enhancing feature expression capability. The Focus layer is unique to YOLOv5; it slices the input image and recombines neighboring pixels into a larger feature map, effectively enlarging the receptive field of the model. YOLOv5’s neck network employs the Path Aggregation Network (PANet) structure, combined with the idea of the Feature Pyramid Network (FPN), to realize multi-scale feature fusion.
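As an illustration of the slicing idea behind the Focus layer, the sketch below rearranges every 2 × 2 pixel neighbourhood into the channel dimension before a convolution. It follows the public YOLOv5 implementation in spirit, but the module name is reused for clarity and the batch normalization and activation of the real block are omitted.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice a (B, C, H, W) image into 4 sub-images and stack them on the channel axis."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, k, stride=1, padding=k // 2)

    def forward(self, x):
        # Take every second pixel in four phase-shifted patterns: H and W halve, C quadruples.
        patches = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(patches)

x = torch.randn(1, 3, 640, 640)
print(Focus(3, 32)(x).shape)  # torch.Size([1, 32, 320, 320])
```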
The head network of YOLOv5 consists of three output branches at different scales, each corresponding to feature maps at a different resolution. Each output branch contains several convolutional layers for predicting bounding-box locations, confidence scores, and category probabilities. Unlike previous versions, YOLOv5 introduces an anchor-free design that predicts the location and size of object centers directly on the feature map, simplifying the model structure and improving detection speed. To better localize objects, YOLOv5 uses a new loss function, the grid-sensitive loss, which takes into account the relative positional relationships between grid cells, allowing the model to determine object locations more accurately.
In addition, YOLOv5 introduces the Focus structure, which slices and recombines the original input to reduce computation and speed up the network. It adopts GIoU_Loss as the bounding-box loss function and uses weighted non-maximum suppression (NMS) to quickly select the best prediction box, giving it high accuracy and fast identification and detection. The weight file of the YOLOv5 object detection network model is small, only about 10% of that of YOLOv4, while its detection accuracy remains high, with a small memory footprint and fast detection speed.
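For reference, GIoU extends IoU with a penalty based on the smallest enclosing box C: GIoU = IoU − |C \ (A ∪ B)| / |C|, and the loss is 1 − GIoU. The short sketch below (helper name and box values are ours, not from the paper) illustrates the computation for two boxes in (x1, y1, x2, y2) format.

```python
def giou_loss(a, b):
    """1 - GIoU for two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union if union > 0 else 0.0
    # Smallest axis-aligned box enclosing both a and b
    enc = ((max(a[2], b[2]) - min(a[0], b[0]))
           * (max(a[3], b[3]) - min(a[1], b[1])))
    giou = iou - (enc - union) / enc if enc > 0 else iou
    return 1.0 - giou

print(giou_loss((10, 10, 60, 60), (30, 30, 80, 80)))  # ≈ 0.94
```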

3.2. Improved YOLOv5-Lite Networking

The initial YOLOv5 model tends to lose small-target feature information in the feature extraction phase, which degrades small-target detection. Its network model is also large, involves many parameters, and has high hardware requirements, making it difficult to deploy on devices. To address this, the backbone network of the initial YOLOv5 is replaced in this study with the lightweight ShuffleNetv2 in order to reduce the number of parameters and obtain a lightweight network. By uniformly redistributing the channel information of the output features obtained by group convolution, feature communication between the different grouped feature maps is ensured, which enhances the feature extraction effect without increasing the computational workload.
Considering that sign language recognition systems need to focus on real-time performance, accuracy, and ease of deployment, the small version of YOLOv5 was chosen for the experiments in this paper. Based on these requirements and following the four guiding principles proposed by ShuffleNetV2 for designing lightweight networks [17], the following improvements were made to YOLOv5s. The algorithm achieves model size reduction by removing the Focus layer, adopting the ShuffleNetv2 structure, and then channel-pruning YOLOv5 at the head of the neck layer. First, to reduce the number of parameters, the Focus module in the Backbone was replaced with a 3 × 3 convolution layer, and all convolution layers and CSP Bottleneck with 3 Convolutions (C3) modules were replaced with ShuffleBlock. Second, to reduce parallel branches in the model, the Spatial Pyramid Pooling (SPP) layer and its subsequent C3 structure were removed. In the Head part, according to the first design guideline, the number of input and output channels of all layers was set to 96, and all C3 structures were replaced with depthwise separable convolution (DWConv) structures, as sketched below.
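The sketch below illustrates the depthwise separable convolution used in place of the C3 modules in the Head: a standard convolution is factorized into a per-channel (depthwise) 3 × 3 convolution followed by a 1 × 1 pointwise convolution. The class name, normalization, and activation choices here are ours, given as an assumption-laden example rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class DWConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 96, 40, 40)   # 96 channels, matching the Head-width choice above
print(DWConv(96, 96)(x).shape)   # torch.Size([1, 96, 40, 40])
```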

3.2.1. Removal of the Focus Layer

Because the Focus layer performs frequent image slicing operations, it places a heavy load on the chip and increases the computational burden, and converting the Focus layer is also tedious and complicated when deploying the model. Unlike YOLOv5, YOLOv5-Lite replaces the Focus layer with a faster plain convolution, which yields better performance, releases memory, reduces computation, and makes the model faster.
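A plain strided convolution achieves the same 2× spatial downsampling as the Focus slicing without the rearrangement overhead. A minimal sketch of such a drop-in stem is shown below; the output channel count of 32 is illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

# Assumed substitute: a 3x3 convolution with stride 2 halves H and W just like Focus,
# but without the four slicing operations that are costly on embedded chips.
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.SiLU(),
)
print(stem(torch.randn(1, 3, 640, 640)).shape)  # torch.Size([1, 32, 320, 320])
```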

3.2.2. Using ShuffleNetv2

ShuffleNet is a lightweight deep neural network proposed by Face++ Technology Co., Ltd. (Beijing, China), which maintains high accuracy while reducing computation through two operations: pointwise group convolution and channel shuffle. Building on this, a channel split operation was introduced, giving rise to ShuffleNetv2. It has two main advantages: ① the 1 × 1 group convolution in ShuffleNet is changed to an ordinary convolution, which reduces computation by limiting the use of group convolution; ② the ordinary convolutions on the branches are changed to depthwise separable convolutions, which greatly reduce computation and increase efficiency.
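The channel shuffle and channel split operations can be expressed in a few tensor reshapes; the sketch below follows the standard ShuffleNetV2 formulation, with function names of our choosing.

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups so grouped branches can exchange information."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap group and per-group axes
    return x.view(b, c, h, w)                  # flatten back: channels are now interleaved

def channel_split(x):
    """Split the channel dimension into two equal halves (ShuffleNetV2 Unit 1)."""
    c = x.shape[1] // 2
    return x[:, :c], x[:, c:]

x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```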
The structure of the improved lightweight YOLOv5 is shown in Figure 2; its main modules are SFB1_X and SFB2_X in the structure diagram, corresponding to Unit 1 and Unit 2 in Figure 3. In Unit 1, the input feature channels are split into two groups: the left branch passes through unchanged, while the right branch is processed by convolution and batch normalization. The channel-split features of the left branch are then fused with the convolutional output of the right branch, and channel shuffling is applied to enhance the information fusion between the two channel mappings. In Unit 2, both the left and right branches are downsampled, so the feature map size is halved and its dimension is doubled. Together, the two units form the ShuffleNetv2 network structure that realizes lightweight feature extraction.
In concrete terms, ShuffleBlock is a lightweight building block for convolutional neural networks that improves model efficiency and performance, primarily by introducing channel shuffling. Its core components are grouped convolution, depthwise separable convolution, and channel shuffling. ShuffleBlock first receives an input feature map; it then applies a 1 × 1 pointwise convolution to reduce or expand the number of channels, using group convolution to lower computational complexity. Next, a channel shuffle operation is performed so that information can flow between different groups. A depthwise separable convolution is then applied to further reduce computation while maintaining strong representational power. Finally, another 1 × 1 pointwise convolution restores the desired number of channels, with an optional residual connection when the input and output have the same size. ShuffleBlock significantly reduces computation and memory footprint compared with traditional convolutional layers, making it well suited to mobile devices and other resource-constrained platforms. Owing to its lightweight nature, ShuffleBlock can be easily integrated into various CNN architectures as an alternative to heavier traditional convolutional layers. Despite its lower computational cost, ShuffleBlock still delivers detection accuracy comparable to, and in some cases superior to, heavier models.
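A compact sketch of the stride-1 unit (Unit 1) described above: split the channels, transform only the right half, concatenate, then shuffle. The class name and layer arrangement are ours and are simplified relative to the full YOLOv5-Lite code.

```python
import torch
import torch.nn as nn

class ShuffleUnit(nn.Module):
    """Stride-1 ShuffleNetV2-style unit: split, transform one half, concat, shuffle."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),  # depthwise 3x3
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        left, right = x.chunk(2, dim=1)           # channel split: left half is passed through untouched
        out = torch.cat([left, self.branch(right)], dim=1)
        b, c, h, w = out.shape                    # channel shuffle across the two halves
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

print(ShuffleUnit(96)(torch.randn(1, 96, 40, 40)).shape)  # torch.Size([1, 96, 40, 40])
```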

3.2.3. Pruning FPN + PAN

FPN + PAN originates from PANet (CVPR), which was adapted and applied in YOLOv4, greatly improving feature extraction capability. YOLOv5-Lite builds on this by performing channel pruning and improving the FPN + PAN structure of YOLOv4. The specific improvements are as follows: ① the same number of channels is used throughout, which optimizes memory access and usage; ② the original PANet structure is used, with the cat operation of YOLOv4 restored to a sum operation, which further reduces memory usage.
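The memory effect of the second change can be seen directly: concatenation doubles the channel count of the fused map, whereas element-wise summation keeps it constant. A tiny sketch, assuming two equal-shape feature maps with illustrative sizes:

```python
import torch

p4 = torch.randn(1, 96, 40, 40)   # neck feature map
up = torch.randn(1, 96, 40, 40)   # upsampled deeper feature map

fused_cat = torch.cat([p4, up], dim=1)  # (1, 192, 40, 40): channels double, more memory traffic
fused_sum = p4 + up                     # (1, 96, 40, 40): channel count unchanged
print(fused_cat.shape, fused_sum.shape)
```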

4. Experiment

All data used by the models in this experiment come from the HaGRID (Hand Gesture Recognition Image Dataset) [18]. To reduce the dataset’s impact on the experimental results and enhance the model’s generalization ability, 9935 high-quality images were selected from it for training.

4.1. Experimental Environment and Data Configuration

This study was conducted under the Windows 10 operating system, utilizing the PyTorch 2.0.1 deep learning framework. The software environment includes CUDA 12.1 and torch 2.3.0, with development in the Python 3.8 programming language. The training parameter configurations are shown in Table 2.
Model detection performance evaluation is a multi-dimensional process. Based on the characteristics of the HaGRID dataset, this paper verifies performance from two aspects: detection speed and model complexity. The detection speed metrics are Floating Point Operations (FLOPs) and Frames Per Second (FPS). FLOPs measure the number of floating-point operations required for the model to execute one forward pass (i.e., the entire computation from input to output). A higher FLOPs value means the model requires more computational resources; smaller FLOPs generally imply a lower computational cost, requiring fewer resources (such as GPU or CPU) and less time. FPS represents the processing speed of the model when handling input data. This experiment uses four indicators: weight file size, FLOPs, FPS, and the number of parameters.
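As a concrete illustration of the parameter-count indicator, the snippet below counts the trainable parameters of a PyTorch module in millions; the small stand-in network is hypothetical and not the paper's model.

```python
import torch.nn as nn

def count_params(model: nn.Module) -> float:
    """Trainable parameters in millions (the 'Params/M' column of Table 4)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

toy = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.BatchNorm2d(32),
    nn.Conv2d(32, 64, 3, padding=1),
)
print(f"{count_params(toy):.4f} M parameters")
```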

4.2. Experimental Results and Analysis

By analyzing the visualized training results, the results.csv file, and the training result images, we observe that training on this dataset yields excellent performance. The target objects appear at varying distances and sizes, and the image clarity is not high; nevertheless, the recognition accuracy remains high, as shown in Figure 4. The confusion matrix for the dataset is shown in Figure 5.
Next, we performed real-time gesture recognition through a camera using the trained HaGRID sign language gesture model. The recognition response time is less than 1 s, and the real-time gesture recognition accuracy is over 94%, with a maximum of 99.5%, as shown in Figure 6, Figure 7 and Figure 8.
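A minimal sketch of such a camera loop is shown below, assuming the publicly available YOLOv5 hub interface; 'best.pt' is a placeholder path to a trained weight file, and the actual deployment code may differ.

```python
import cv2
import torch

# Hypothetical sketch: load a trained weight file via the public YOLOv5 hub entry point
# and run it on webcam frames ('best.pt' is a placeholder).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame[:, :, ::-1])     # BGR -> RGB before inference
    annotated = results.render()[0]        # frame with boxes and labels drawn (RGB)
    cv2.imshow('gesture', cv2.cvtColor(annotated, cv2.COLOR_RGB2BGR))
    if cv2.waitKey(1) & 0xFF == ord('q'):  # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```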
Using the same experimental conditions and dataset, we compared the proposed model with the recent models of Yan et al. [19], Ren et al. [20], and Chen et al. [21], as shown in Table 3; the results show that our proposed model achieves higher accuracy.
YOLOv5s was selected as the baseline model. On this basis, lightweight processing was carried out, followed by comparative experiments against three other YOLO versions, YOLOv3, YOLOv6, and YOLOv8, as shown in Table 4.
From the experimental results, the improved model achieved significant gains across all four indicators. The 2.6 GFLOPs indicate that the model’s computational cost is very low. The FPS of 416 f·s−1 demonstrates that the model’s detection speed is very fast, which is particularly important for applications requiring high-speed inference. A parameter count of 0.72 M and a weight file size of 1.6 M show that the model occupies very little disk space and can be deployed on mobile devices for real-time target detection.

5. Conclusions

From the above, it can be seen that the research in this paper aimed to design gesture recognition models for recognizing the sign language of deaf people, with training based mainly on the YOLOv5 algorithm. In the preparation stage before the experiment, the corresponding version of torch was set up using Anaconda, and the environment was constructed according to the officially provided requirements file. During the experiment, we used the PyCharm 2024.1 platform to run the code and learned how to change the file environment, activate and install plug-ins, and use the algorithmic statements and command line. When writing the program, in addition to the official documentation, we also referred to tutorials for previous versions of YOLO. Eventually, we successfully designed a new model file that realizes the recognition of common control gestures and verified the feasibility of sign language recognition technology for the deaf.
However, owing to the relative paucity of research on gesture control related to deafness and the inadequacy of datasets in this area, only a limited number of weight files are available for reference. In designing the experiments, the original training could only be performed by collecting hand data one by one in the laboratory, which is inefficient. Meanwhile, due to limitations of the collection equipment, the data samples collected in this experiment are few and the environment is relatively simple. Although relatively satisfactory results were obtained after several rounds of training, the original model was not further improved.
In subsequent research, we will consider expanding the dataset by acquiring more and clearer images through hardware such as sensors. At the same time, we will run the existing dataset through other algorithms to compare the accuracy and stability of the results. We also hope to obtain more data by recording real-life scenarios and designing more gesture recognition tasks in the future to achieve higher recognition accuracy.
The application of deep learning and neural networks improves gesture recognition technology and makes gesture-based devices an innovative way of interaction. It can provide more accurate and personalized operations, allowing deaf people to integrate better into society. In the future, gesture recognition technology could expand its applications in areas such as surgical simulation training, therapeutic rehabilitation, wearable devices, and health monitoring, as well as contribute to medical research and improved doctor–patient communication. Gesture recognition technology provides intuitive, natural, and safe human–computer interaction for medical devices and promotes progress in human communication.

Author Contributions

Conceptualization, X.L. and C.J.; methodology, X.L. and C.J.; software, X.L.; validation, X.L. and C.J.; formal analysis, X.L.; investigation, X.L.; resources, X.L. and P.C.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, X.L. and C.J.; visualization, X.L.; supervision, C.J.; project administration, P.C.; funding acquisition, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Srinakharinwirot University Research Fund.

Data Availability Statement

The original data presented in this study are openly available at https://github.com/hukenovs/hagrid (accessed on 1 September 2024).

Acknowledgments

The authors and all project members gratefully acknowledge the financial support for this research from the Srinakharinwirot University Research Fund.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, S.; Zhang, Q.; Li, H. Review of Sign Language Recognition Based on Deep Learning. J. Electron. Inf. Technol. 2020, 42, 1021–1032. [Google Scholar] [CrossRef]
  2. Elgendy, M. Deep Learning Computer Vision. Master’s Thesis, Tsinghua University Press, Beijing, China, 2022. [Google Scholar]
  3. Li, Y.; Cheng, R.; Zhang, C.; Chen, M.; Ma, J.; Shi, X. Sign language letters recognition model based on improved YOLOv5. In Proceedings of the 2022 9th International Conference on Digital Home (ICDH), Guangzhou, China, 28–30 October 2022; pp. 188–193. [Google Scholar] [CrossRef]
  4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  5. Paipaixing. To What Extent Has Object Detection Developed|CVHub Takes You to Talk About the 22 Years of Development in Object Detection. 14 February 2023. Available online: https://www.tonguebusy.com/index.php?m=home&c=View&a=index&aid=3895 (accessed on 8 January 2023).
  6. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  7. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  8. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  9. von Agris, U.; Zieren, J.; Canzler, U.; Bauer, B. Recent developments in visual sign language recognition. Univers. Access Inf. Soc. 2008, 6, 323–362. [Google Scholar] [CrossRef]
  10. Yuan, T.; Zhao, W.; Yang, X.; Hu, B. Establishment and Analysis of Large-Scale Continuous Chinese Sign Language Dataset. Comput. Eng. Appl. 2019, 55, 110–116. [Google Scholar]
  11. von Agris, U.; Kraiss, K.F. Signum database: Video corpus for signer-independent continuous sign language recognition. In Proceedings of the 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies (LREC), Valletta, Malta, 22–23 May 2010; pp. 243–246. [Google Scholar]
  12. Neidle, C.; Thangalim, A.; Sclaroff, S. Challenges in development of the American Sign Language Lexicon Video Dataset (ASLLVD) corpus. In Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions Between Corpus and Lexicon, Istanbul, Turkey, 21–27 May 2012. [Google Scholar]
  13. Caselli, N.K.; Sehyr, Z.S.; Cohen-Goldberg, A.M.; Emmorey, K. ASL-LEX: A lexical database of American Sign Language. Behav. Res. Methods 2017, 49, 784–801. [Google Scholar] [CrossRef] [PubMed]
  14. Wan, J.; Zhao, Y.; Zhou, S.; Guyon, I.; Escalera, S.; Li, S.Z. ChaLearn looking at people RGB-D isolated and continuous datasets for gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  15. Chen, M.; Alregib, G.; Juang, B.H. 6DMG: A new 6D motion gesture database. In Proceedings of the Multimedia Systems Conference, Chapel Hill, NC, USA, 22–24 February 2012. [Google Scholar]
  16. Wang, H.; Chai, X.; Hong, X.; Zhao, G.; Chen, X. Isolated sign language recognition with grassmann covariance matrices. ACM Trans. Access. Comput. 2016, 8, 14–22. [Google Scholar] [CrossRef]
  17. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  18. Kapitanov, A.; Kvanchiani, K.; Nagaev, A.; Kraynov, R.; Makhliarchuk, A. HaGRID—HAnd Gesture Recognition Image Dataset. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 4560–4569. [Google Scholar]
  19. Yan, S.; Xia, Y.; Smith, J.S.; Lu, W.; Zhang, B. Multiscale Convolutional Neural Networks for Hand Detection. Appl. Comput. Intell. Soft Comput. 2017, 2017, 9830641. [Google Scholar] [CrossRef]
  20. Ren, Z.; Yuan, J.; Meng, J.; Zhang, Z. Robust Part-Based Hand Gesture Recognition Using Kinect Sensor. IEEE Trans. Multimed. 2013, 15, 1110–1120. [Google Scholar] [CrossRef]
  21. Chen, L.; Fu, J.; Wu, Y.; Li, H.; Zheng, B. Hand Gesture Recognition Using Compact CNN via Surface Electromyography Signals. Sensors 2020, 20, 672. [Google Scholar] [CrossRef] [PubMed]
Figure 1. YOLOv5 structure diagram.
Figure 2. Network improvement diagram.
Figure 3. The ShuffleBlock structure diagram.
Figure 4. HaGRID sign language gesture recognition results.
Figure 5. Confusion matrix for the dataset.
Figure 6. Real-time identification effect of the training model 1.
Figure 7. Real-time identification effect of the training model.
Figure 8. Recall and accuracy curves for the model.
Table 1. Detection methods in object detection.
Algorithm type comparison | R-CNN | SSD, YOLO
Type of detector | Two-stage detector | Single-stage detector
Speed (FPS) | Generally lower than the single-stage detector | Generally higher than the two-stage detector
Average accuracy | Generally higher than the single-stage detector | Generally lower than the two-stage detector
Table 2. The configuration of the training parameters.
Training Parameters | Parameter Values
Number of batch training samples | 8
Number of iterations | 100
Initial learning rate | 0.01
Image size | 640 × 640
Table 3. Comparison of state-of-the-art work with our proposed model.
Related Work | Method | Accuracy
Yan et al. [19] | Multiscale convolutional neural networks | 85.5%
Ren et al. [20] | Kinect sensor | 94.6%
Chen et al. [21] | Surface electromyography signals | 93.1%
Ours | Lightweight YOLOv5 | 99.5%
Table 4. Comparison of experimental results.
Model | Params/M | FLOPs/G | FPS/(f·s−1) | Weight File/M
YOLOv5s | 7.04 | 15.8 | 227 | 14.0
YOLOv3 | 103.67 | 282.3 | 36 | 202
YOLOv6 | 4.23 | 11.8 | 222 | 8.4
YOLOv8 | 3.01 | 8.1 | 238 | 6.1
Ours | 0.72 | 2.6 | 416 | 1.6
