1. Introduction
People express their thoughts, feelings, and intentions through verbal communication, namely, speaking. However, many deaf or hard-of-hearing people have trouble speaking and understanding what is being said. According to the World Federation of the Deaf, there are over 70 million deaf people worldwide, with more than 80 percent living in developing countries [
1]. Many of these deaf and hard-of-hearing people use visual signs known as sign language to communicate with one another, rather than spoken words or auditory phrases [
2]. These visual signs comprise a mix of manual features—including hand shape, posture, position, and movement—and non-manual features, such as head and body posture, facial expressions, and lip movements [
3]. Sign language is the primary method of communication for deaf people. However, most hearing people cannot understand signs without training, and the number of interpreters is insufficient, creating a communication barrier between the deaf and the wider community [
4]. Consequently, many deaf individuals may find it challenging to form social relationships and encounter numerous barriers in education, healthcare, and employment. This results in feelings of social isolation and loneliness. Technological advancements have spurred the development of Sign Language Recognition (SLR) studies aimed at addressing this issue.
The primary objective of SLR studies is to develop algorithms and methods for accurate sign recognition. When deaf people’s gestures are translated into spoken or written language, ordinary hearing people can understand them. Therefore, SLR technologies promote the integration of the deaf minority by eliminating the linguistic barrier between them and the hearing majority [
4]. SLR is crucial not only for facilitating communication between deaf and hearing communities but also for increasing the content available to the deaf, through initiatives like developing accessible educational tools, games for the deaf community, and creating visual dictionaries of sign language [
5]. Developments in the field of deep learning have significantly impacted SLR studies.
In recent years, deep learning methods have succeeded significantly in image and video processing areas [
6,
7,
8,
9]. This progress has had a significant impact on SLR work, improving the accuracy and efficiency of these systems. Existing deep learning techniques are generally built on Convolutional Neural Networks (CNNs). For example, 2D-CNNs have been used to recognize static images such as sign alphabets or sign digits [
10,
11,
12]. However, 2D-CNNs alone cannot extract the temporal features of sign language words and sentences composed of video frames. This problem has been overcome by first using 2D-CNNs to create representations of video frames and then using Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks [
13], Bidirectional LSTM (Bi-LSTM) networks [
5], or transformers [
14] to extract temporal information. Other structures that capture spatial and temporal information in videos are 3D-CNNs [
15], formed by adding a third dimension to the 2D convolution process, and (2+1)D-CNNs [
16], obtained by combining 2D and 1D convolution. For the SLR task, researchers have often used 3D-CNNs [
17,
18,
19] and (2+1)D-CNNs [
20,
21,
22]. Although these models are frequently used for the SLR task, little research [
23,
24] has investigated building a new deep learning model that combines 3D-CNN and (2+1)D-CNN blocks. In this context, this study develops an innovative deep learning model that fuses the block structures of both methods to exploit the strengths of 3D-CNNs and (2+1)D-CNNs. The resulting model is expected to deliver more effective feature extraction and classification performance.
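To make the distinction between the two block types concrete, the following minimal PyTorch sketch contrasts a plain 3D convolutional block with a (2+1)D factorized block; the channel sizes and kernel shapes are illustrative and do not reproduce the exact R3D and R(2+1)D blocks fused in our network.

```python
# Minimal PyTorch sketch contrasting a plain 3D convolutional block with a (2+1)D
# factorized block. Channel sizes and kernel shapes are illustrative only.
import torch
import torch.nn as nn

class Conv3DBlock(nn.Module):
    """3D convolution: space and time are convolved jointly."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.block(x)

class Conv2Plus1DBlock(nn.Module):
    """(2+1)D factorization: a 2D spatial convolution followed by a 1D temporal one."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.block(x)

clip = torch.randn(2, 3, 32, 112, 112)        # a small batch of video clips
print(Conv3DBlock(3, 16)(clip).shape)          # torch.Size([2, 16, 32, 112, 112])
print(Conv2Plus1DBlock(3, 16)(clip).shape)     # torch.Size([2, 16, 32, 112, 112])
```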
Despite the success of deep learning methods in the SLR task, significant challenges still need to be addressed. In particular, differences between signers complicate the recognition process, so the developed systems must achieve high accuracy while remaining signer-independent [
25]. A complete SLR system should consider facial expressions [
26,
27], body language, and hand gestures since a holistic evaluation of these features is essential for full SLR [
28,
29]. The dynamic nature of sign language requires models that efficiently capture both the spatial and temporal dimensions of the data. Background differences, varying lighting conditions, and visual clutter pose additional challenges for algorithms that must correctly recognize and discriminate sign words [
30,
31]. The development of real-time applications places high demands on computational power and requires significant effort to keep algorithms fast and accurate [
32]. Overcoming these challenges requires in-depth research and innovative developments in the field of SLR. Only then can the communication barriers between sign language users and the wider society be removed, enabling both parties to better understand and interact with each other.
The motivation of our research is to overcome the current challenges in SLR technology and to facilitate the communication of hearing-impaired individuals in society by converting sign language gestures into text or speech with the SLR system we develop. In this context, our research aims to increase the accuracy and efficiency of the SLR process, thereby expanding social integration and communication opportunities for hearing-impaired individuals and building bridges of interaction between sign language users and the hearing community so that both groups can better understand and interact with each other.
For this purpose, we develop a robust and highly accurate system for recognizing isolated sign language words. This system is based on a novel R3(2+1)D-SLR network that combines the advantages of 3D and (2+1)D convolutional blocks. Designed to efficiently extract both spatial and temporal features, this network enhances the precision of SLR. Our approach fuses the features of the sign components (body, hands, and face) extracted by the R3(2+1)D-SLR network and classifies them with an SVM to create a comprehensive and accurate recognition system. Moreover, by basing our proposed system on pose data, we enhance its robustness to changing background conditions and performance challenges in real-world scenarios, thereby improving SLR accuracy irrespective of background noise.
In conclusion, this research aims to contribute to the elimination of communication barriers between hearing-impaired individuals and the general public and to make significant advances in SLR technology. We briefly summarize the contributions of this study to the literature as follows:
We introduce a novel R3(2+1)D-SLR network that innovatively merges R(2+1)D and R3D blocks for enhanced sign language interpretation, offering a deeper understanding of both spatial and temporal dimensions of sign language.
Utilizing the advanced capabilities of the MediaPipe library, our study leverages high-precision pose data to extract detailed spatial and temporal features. This methodology allows for a more accurate and nuanced understanding of sign language gestures, improving the model’s performance.
We achieve a comprehensive SLR system by integrating and classifying complex features from the body, hands, and face, utilizing SVM for superior accuracy, highlighting our approach’s effectiveness in capturing the nuances of sign communication.
Incorporating real-world backgrounds in our testing datasets underlines our system’s adaptability and reliability in diverse environments, addressing a critical challenge in SLR.
Our strategic use of pose images to enhance system robustness against variable backgrounds showcases our commitment to developing practical, real-world SLR solutions.
The rest of the paper is organized as follows:
Section 2 presents the literature review;
Section 3 presents the proposed R3(2+1)D-SLR model and feature fusion-based system;
Section 4 presents the experiments and results;
Section 5 concludes the paper.
2. Related Literature
This section reviews recent advancements in vision-based isolated SLR, focusing on research involving the BosphorusSign22k-general [
33,
34] and LSA64 [
35] datasets, and discusses methodological approaches in the field.
In studies on the BosphorusSign22k-general subset, consisting of 174 isolated Turkish Sign Language words, Kindiroglu et al. [
36] developed a feature set named Temporal Accumulative Features (TAF) for the isolated SLR task. Based on the temporal accumulation of heatmaps obtained from joint movements, TAF recognizes and classifies sign language gestures by visually encoding how movements and hand shapes change over time. Classification with a CNN achieved a recognition accuracy of 81.58%. In their study, Gündüz and Polat [
19] adopted a multistream data strategy, leveraging information from the face, hands, full body, and optical flow to train their Inception3D model. For the LSTM network, they used body and hand pose data. By integrating the feature streams generated from these models, they fed the combined data into a two-layer neural network. This approach increased the accuracy on the BosphorusSign22k-general dataset from 79.6% with full-body data to 89.3%, demonstrating the significant impact of combining feature streams on SLR research.
Regarding studies on the LSA64 dataset containing 64 Argentine Sign Language words, Ronchetti et al. [
37] proposed a probabilistic model that combines sub-classifiers based on different types of features such as position, movement, and hand shape, achieving an accuracy of 97% on the LSA64 dataset, in addition to an average accuracy of 91.7% in their signer-independent evaluation. Rodriguez et al. [
38] proposed a model using cumulative shape difference with SVM and achieved 85% accuracy on the LSA64 dataset. Konstantinidis et al. [
39] proposed a model based on processing hand and body skeletal features extracted from RGB videos using LSTM layers and late fusion of the processed features. Their proposed model achieved 98.09% accuracy. In later work, the same authors [
40] improved the performance of their method to 99.84% by adding additional streams that process RGB video and optical flow data. Masood et al. [
41] proposed a model that extracts spatial features with a CNN and then feeds the pooling layer output into an RNN to extract temporal features before producing a prediction. Their model achieved a 95.2% accuracy on 46 subcategories of the LSA64 dataset. Zhang et al. [
42] proposed a neural network with an alternative fusion of 3D-CNN and Convolutional LSTM, called a Multiple Extraction and Multiple Prediction (MEMP) network, to extract and predict motion videos’ temporal and spatial feature information multiple times. On LSA64, the network achieved an identification rate of 99.063%. Imran and Raman [
43] used motion history images, dynamic images, and RGB motion image templates, which can represent a video in a single image, to fine-tune three pre-trained CNNs. By fusing the outputs of these three networks with their proposed kernel-based extreme learning machine, they achieved a 97.81% accuracy. The model proposed by Elsayed and Fathy [
44] using 3D-CNN followed by Convolutional LSTM achieved a 97.4% test accuracy for 40 categories. Marais et al. [
45] trained the Pruned VGG network with raw images and achieved a 95.50% test accuracy. In another study by the same authors [
46], using the InceptionV3-GRU architecture, they achieved a 97.03% accuracy in signer-dependent testing and 74.22% in signer-independent testing. Alyami et al. [
47] classified the key points extracted from the signer’s hands and face using a transformer-based model. On the LSA64 dataset, they achieved accuracies of 98.25% and 91.09% in signer-dependent and signer-independent modes, respectively. Furthermore, the combination of hand and face data improved the recognition accuracy by 4% compared to hand data alone, emphasizing the importance of non-manual features in recognition systems.
Rastgoo et al. [
48] utilized a CNN-based model to estimate 3D hand landmark points from images detected by a Single Shot Detector (SSD). They applied the singular value decomposition method, a feature extractor, to the coordinates of these 3D hand key points and the angles between the finger segments, fed the obtained features as input to the LSTM, and predicted 100 Persian signs with a 99.5% accuracy. Samaan et al. [
49] achieved accuracies of 99.6%, 99.3%, and 100% in LSTM, Bi-LSTM, and GRU models, respectively, on a ten-class dataset, utilizing pose landmark points derived from MediaPipe. Although the test accuracy of the proposed method is high, the number of classes is quite low compared to the number of words used in general sign language dictionaries. Castro et al. [
50] introduced a multi-stream approach involving processing summarized RGB frames, segmented regions of the hands and face, joint distances, and artificially generated depth data through a 3D-CNN. In this method, it was shown that the addition of artificial depth maps increased the generalization capacity for different signers. Their method achieved a 91% recognition accuracy on a dataset with 20 classes. Hamza et al. [
51] obtained a 93.33% recognition accuracy using the convolutional 3D model and data augmentation techniques of transformation and rotation on a dataset of 80 classes, where each class contained very few examples. This study demonstrated the effectiveness of data augmentation methods on limited datasets. Laines et al. [
52] presented an innovative approach to isolated SLR using a Tree Structure Skeleton Image (TSSI) representation that converts pose data into an RGB image, improving the accuracy of skeleton-based models for SLR. In the TSSI representation, columns represent landmark points, rows capture the evolution of these points over time, and RGB channels encode the (x, y, z) coordinates of the points. These data were classified with DenseNet-121, achieving recognition accuracies of 81.47%, 93.13%, and 98% for datasets with 100, 226, and 30 classes, respectively. In particular, taking into account critical components of sign language, such as facial expressions and detailed hand gestures, significantly improved the accuracy of the model. In their study, Podder et al. [
25] obtained an 87.69% recognition accuracy on a dataset of 50 classes with their proposed MobileNetV2-LSTM-SelfMLP model using MediaPipe Holistic-based face and hand-segmented data. The method proposed by Jebali et al. [
53] is an innovative approach to SLR that integrates manual and non-manual features. Using deep learning models such as CNNs and LSTMs, their system simultaneously processes information from hand gestures and non-manual components such as facial expressions, and the use of non-manual features yields a significant improvement in performance. It achieved test accuracies of 90.12% and 94.87% on datasets with 450 and 26 classes, respectively.
In the real world, SLR may not take place against a flat or fixed background, yet all of the studies mentioned above were evaluated on images with a flat or fixed background. Consequently, this study tested the proposed systems on test datasets with varied backgrounds added to better simulate real-world conditions. In addition, the proposed R3(2+1)D-SLR network was trained separately on body, hands, and face images obtained from each video with the help of pose data, and the spatial and temporal features it extracts were fused and classified with an SVM for more accurate SLR. Moreover, to make the proposed system more robust to different background conditions, images derived from pose data are used instead of standard raw images.
4. Experiments and Results
For the implementation of the proposed SLR system, we utilized a PC equipped with Ubuntu 18.04, an Intel Core i5-8400 processor, 16 GB RAM, and a 12 GB GeForce GTX 1080 Ti GPU. The experiments were implemented with the PyTorch library. All models were trained with the SGD optimizer for 35 epochs. The batch size was set to 8, and the learning rate was initially set to 1 × 10⁻² and then reduced by a factor of 10 every 15 epochs.
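As a rough sketch of this configuration, the optimizer and learning-rate schedule could be set up in PyTorch as follows; the model class, dataset, and momentum value are placeholders or assumptions rather than the exact implementation.

```python
# Training setup described above: SGD, 35 epochs, batch size 8, LR 1e-2 decayed
# by a factor of 10 every 15 epochs. R3_2plus1D_SLR, train_dataset, num_classes,
# and the momentum value are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = R3_2plus1D_SLR(num_classes=174).cuda()        # hypothetical model class
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)    # dataset assumed

for epoch in range(35):
    model.train()
    for clips, labels in train_loader:                # clips: (8, 3, 32, 112, 112)
        clips, labels = clips.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                                  # LR: 1e-2 -> 1e-3 -> 1e-4
```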
4.1. Training Models
In order to evaluate the performance of our proposed R3(2+1)D-SLR model, we developed the R(2+1)D-10 model with only R(2+1)D blocks and the R3D-10 model with only R3D blocks, both with the same depth. We also used non-pretrained versions of these models in order to benchmark them against the R3D-18 and R(2+1)D-18 models [
16] with more layers. For the evaluation of the model performances, we used RGB body images from both BosphorusSign22k-general and LSA64 datasets as training material, as detailed in
Section 3.2.2. The RGB body data have dimensions of 3 × 32 × 112 × 112 (channels × frames × height × width). Post-training test performances of these models are compared in
Table 2.
Our R3(2+1)D-SLR model demonstrated a significant performance advantage over existing models on the BosphorusSign22k-general and LSA64 datasets. The R3(2+1)D-SLR model offered a balanced profile on the SLR task in terms of both test accuracy (79.66% and 98.90% on the two datasets, respectively) and inference time (116 ms). In particular, the model’s test accuracy was significantly higher than that of models containing only R(2+1)D or R3D blocks of the same depth, providing important evidence for how a block-level fusion approach can improve the generalization ability of models.
The training and inference time metrics showed that the R3(2+1)D-SLR model provides an appropriate balance between complexity and performance, underlining that the model can be a practical solution for real-time applications. Moreover, with only 3.8 M parameters, the model achieves a high test accuracy while keeping the computational load at reasonable levels, which further enhances its efficiency and applicability.
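Parameter counts and latencies of the kind reported in Table 2 can be estimated with a simple routine such as the sketch below; it illustrates the measurement idea rather than the exact protocol used in our experiments.

```python
# Sketch of estimating parameter count and per-clip inference latency in PyTorch.
import time
import torch

def count_parameters(model):
    """Number of trainable parameters (reported in millions in Table 2)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def mean_inference_ms(model, runs=50):
    """Average forward-pass latency for one 3 x 32 x 112 x 112 clip on the GPU."""
    model.eval().cuda()
    clip = torch.randn(1, 3, 32, 112, 112, device="cuda")
    for _ in range(5):                 # warm-up iterations
        model(clip)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(clip)
    torch.cuda.synchronize()
    return (time.time() - start) / runs * 1000.0
```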
Consequently, the balanced performance profile provided by the R3(2+1)D-SLR model positions it as a preferable choice for SLR tasks. Therefore, only the proposed R3(2+1)D-SLR network is used for SLR system design in the following sections of the paper.
4.2. Feature Fusion with Raw Data
Although body images capture all structures of a sign, reducing the resolution to 112 × 112 results in significant detail loss. This challenge is addressed in our study by separately analyzing each structure that represents the sign, including face and hands images, in addition to body images. Distinct models were trained for each of these features to harness their unique representational capabilities fully.
Our sign prediction strategy merges 128-dimensional feature vectors extracted from each model before the fully connected layer, creating a comprehensive 384-dimensional feature vector. This composite vector is subsequently utilized as the input for the SVM classifier, with the objective of capturing a holistic representation of sign language words.
Figure 8 outlines our proposed system.
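The fusion step can be sketched as follows, assuming each trained stream model exposes its 128-dimensional penultimate-layer activations; the function and variable names (forward_features, body_model, and so on) are illustrative placeholders rather than the actual implementation.

```python
# Sketch of the fusion step: 128-d features from the body, hands, and face models are
# concatenated into a 384-d vector and classified with an SVM. Names are placeholders.
import numpy as np
import torch
from sklearn.svm import SVC

@torch.no_grad()
def extract_features(model, loader):
    """Collect the 128-d activations preceding the fully connected layer for each video."""
    model.eval()
    feats, labels = [], []
    for clips, y in loader:
        f = model.forward_features(clips.cuda())   # assumed hook returning (N, 128)
        feats.append(f.cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

body_f,  y = extract_features(body_model,  body_loader)
hands_f, _ = extract_features(hands_model, hands_loader)
face_f,  _ = extract_features(face_model,  face_loader)

fused = np.hstack([body_f, hands_f, face_f])       # (N, 384) descriptor per video
svm = SVC()                                        # kernel and hyperparameters assumed
svm.fit(fused, y)
```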
The test results of the models trained for each feature and the results of the test performed by classifying the feature fusion with SVM are shown in
Table 3 and
Table 4.
Our experimental evaluation reveals the effectiveness of this feature fusion approach. As indicated in
Table 3, hands images consistently outperformed full-body images in the BosphorusSign22k-general dataset, capturing distinctive features with a test accuracy of 83.24%. Although face images achieved a lower test accuracy of 22.23%, they were instrumental in identifying unique characteristics across 174 word classes. The integration of body, hands, and face images through the SVM classification resulted in a significant increase in test accuracy to 91.78%.
Further analysis was conducted with test sets modified by the addition of backgrounds, aiming to assess model robustness under varying conditions. The introduction of backgrounds led to a noticeable decline in test accuracy, underscoring the impact of background complexity on model performance. Specifically, the accuracy for body images plummeted from 79.66% to 16.12%, and for hands images from 83.24% to 56.69% within the BosphorusSign22k-general dataset. Despite this, the SVM classification of the fused features demonstrated resilience, with the test accuracy falling less dramatically, from 91.78% to 72.39%, highlighting the robustness of our feature fusion approach against background variations.
Table 4 illuminates the efficacy of our system across signer-independent (E1 and E2) and signer-dependent (E3) evaluations within the LSA64 dataset. Specifically, the table shows the body image’s superior performance compared to the hands image. Remarkably, the face image demonstrated significant representation for the 64 word classes, underscoring its substantial role in sign word recognition. The synthesized feature fusion, classified via SVM, achieved an impressive signer-independent test accuracy of 99.37% in E1, an average of 99.53% in E2, and 99.84% in the signer-dependent assessment of E3. The introduction of backgrounds to the LSA64 dataset test data precipitated a notable decline in accuracy, although the adverse effect was mitigated by the feature fusion strategy. This mitigation underscores the robustness of the fused feature approach, particularly when contrasted with the outcomes from the BosphorusSign22k-general dataset.
Our analysis underlines the need to consider both manual and non-manual features for a correct evaluation in SLR tasks. Our empirical evidence, derived from both datasets, confirms that a fusion of features markedly surpasses the performance of individual features. Moreover, the addition of a background to the test images led to a degradation in accuracy, attributed to the introduction of noise and extraneous information into the classification model. These results show that the background is important in the recognition of sign language words; therefore, the background should be taken into account when designing an SLR model.
4.3. Feature Fusion with Pose-Based Data
To build a more robust SLR system resilient to varying background images, we repeated the model training with pose images as outlined in
Section 3.2.2 and illustrated in
Figure 9. The test accuracy performance of the models is shown in
Table 5,
Table 6 and
Table 7.
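A minimal sketch of how a background-free pose image can be produced from a video frame with MediaPipe Holistic is shown below; the drawing styles are illustrative and do not reproduce the exact rendering described in Section 3.2.2.

```python
# Sketch: render detected body, hand, and face landmarks on a black canvas so the
# resulting pose image carries no background information.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils
holistic = mp_holistic.Holistic(static_image_mode=False)

def frame_to_pose_image(frame_bgr):
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    canvas = np.zeros_like(frame_bgr)                      # black background
    if results.pose_landmarks:
        mp_drawing.draw_landmarks(canvas, results.pose_landmarks,
                                  mp_holistic.POSE_CONNECTIONS)
    for hand in (results.left_hand_landmarks, results.right_hand_landmarks):
        if hand:
            mp_drawing.draw_landmarks(canvas, hand, mp_holistic.HAND_CONNECTIONS)
    if results.face_landmarks:
        mp_drawing.draw_landmarks(canvas, results.face_landmarks,
                                  mp_holistic.FACEMESH_CONTOURS,
                                  landmark_drawing_spec=None)
    return canvas
```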
Table 5 demonstrates that, for the BosphorusSign22k-general dataset, pose images without backgrounds significantly improved performance for both body and hands features. In particular, there was a significant increase in accuracy in the hands pose image. Building on these insights, to explore the impact of finger color differentiation, we compared the accuracy of using colored versus uncolored hands pose images from the BosphorusSign22k-general dataset.
Table 6, alongside
Figure 10, presents the results for all hands data versions, showcasing the trained model’s outcomes.
Coloring each finger differently enhanced the test accuracy by 5.26% over the uncolored version and 5.16% over the raw-image version, because each finger’s orientation, position, and shape are more easily distinguished when it is drawn in a distinct color. Therefore, the hands pose image with colored fingers outperforms both the raw hands image and the uncolored hands pose image.
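For illustration, this per-finger coloring can be reproduced with a simple drawing routine over MediaPipe’s 21-point hand model; the colors and line width below are examples, not the exact palette used for our hands pose images.

```python
# Illustrative per-finger coloring over MediaPipe's 21-point hand model (wrist = 0,
# thumb = 1-4, index = 5-8, middle = 9-12, ring = 13-16, pinky = 17-20).
import cv2

FINGERS = {
    "thumb":  [0, 1, 2, 3, 4],
    "index":  [0, 5, 6, 7, 8],
    "middle": [0, 9, 10, 11, 12],
    "ring":   [0, 13, 14, 15, 16],
    "pinky":  [0, 17, 18, 19, 20],
}
COLORS = {  # example BGR colors, one per finger
    "thumb": (0, 0, 255), "index": (0, 255, 0), "middle": (255, 0, 0),
    "ring": (0, 255, 255), "pinky": (255, 0, 255),
}

def draw_colored_hand(canvas, landmarks_px):
    """landmarks_px: list of 21 (x, y) integer pixel coordinates for one detected hand."""
    for finger, chain in FINGERS.items():
        for a, b in zip(chain[:-1], chain[1:]):
            cv2.line(canvas, landmarks_px[a], landmarks_px[b], COLORS[finger], thickness=2)
    return canvas
```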
There was a decrease in accuracy for the face pose images obtained from the BosphorusSign22k-general dataset compared to the raw data. This is because although eyebrows, eyes, and lips are prominent in the face pose image, many details that could represent the sign are lost. Despite this challenge with face pose images, the SVM classification of the fused features improved from 91.78% to 94.52%.
When the evaluations were performed using test images with backgrounds, the test accuracy increased relative to the raw-image results for all features except the face images. The accuracy of the proposed final system rose from 72.39% to 93.25%.
The experimental studies using pose images for the BosphorusSign22k-general dataset were repeated using the LSA64 dataset, and the results are reported in
Table 7. For the LSA64 dataset, the use of pose data generally resulted in a decrease in accuracy in the evaluations on test images without backgrounds. However, using colored hands pose images generally provided a higher test accuracy than the raw images. There was a significant drop in accuracy for the face feature, as important details were lost in the pose data. In the tests performed on test images with backgrounds, accuracy increased for all data types except the face feature. For the proposed feature fusion-based system, although accuracy decreased in the background-free tests compared to the raw images, in the tests with backgrounds the signer-independent recognition accuracy increased from 95.31% to 95.46% for E1 and from 93.90% to 97.93% for E2, and the signer-dependent (E3) accuracy increased from 98.90% to 99.68%.
In summary, using pose images instead of normal images improved the accuracy of the final merged system on both datasets when evaluated with test images with backgrounds. Owing to the pose images, the system is robust to different backgrounds, making it more suitable for use in a real-world scenario. Furthermore, using pose data can eliminate factors that harm recognition accuracy, such as different background and lighting conditions and shadows, as well as personal differences such as beards, mustaches, long hair, and different clothes.
4.4. Comparison with Other Studies
This section compares the performance of the proposed final SLR system with that of studies in the literature using the BosphorusSign22k-general and LSA64 datasets. Only the results obtained with the original test sets are included for comparison. Studies using the BosphorusSign22k-general dataset and test results of the proposed system are shown in
Table 8.
As can be seen from
Table 8, while the study presented by Kindiroglu et al. [
36] is based on pose data with an accuracy of 81.58%, Gündüz and Polat [
19] achieved a higher accuracy rate of 89.35% by using multimodal data (RGB, pose, optical flow) and considering body, hands, and face data together. Our method surpassed existing studies, achieving a 91.78% accuracy with raw images and 94.52% with pose-based inputs. Moreover, the proposed R3(2+1)D-SLR network achieved an 88.40% test accuracy with the colored hands pose image. These results show that our model can effectively extract rich spatial and temporal features from both raw images and pose data. Fusing the features extracted from the body, hands, and face significantly improved recognition accuracy.
For the LSA64 dataset, signer-dependent and signer-independent evaluations were performed to make comparisons with the studies in the literature. The results are shown in
Table 9.
As can be seen from
Table 9, the proposed feature fusion-based method outperformed the existing works in the literature using the LSA64 dataset both in E1 and E2, which are signer-independent evaluations, and in E3, which is a signer-dependent evaluation. Compared to the method proposed by Marais et al. [
46], which is based on the InceptionV3-GRU architecture and uses the entire video frame with signers 5 and 10 held out for testing, our R3(2+1)D-SLR model achieved a recognition accuracy of 94.99%, an increase of 5.44% for the same input modality. This shows the superiority of our proposed deep learning model over InceptionV3-GRU in spatial and temporal feature extraction. This result increased to 99.37% with the inclusion of hands and face data. While the 91.09% accuracy rate obtained by Alyami et al. [
47] in E2, another signer-independent experimental environment, shows the strengths of transformer models, our model achieved a 99.53% accuracy for the raw-image input and a 98.53% accuracy for the pose-based input on the LSA64 dataset, which provides a significant advantage, especially for in-depth processing of spatio-temporal features. In E3, a signer-dependent evaluation, the 99.84% accuracy rate presented by Konstantinidis et al. [
40] is a remarkable achievement in SLR. Their model, based on VGG-16 and LSTM, thoroughly examines body, hands, and face data using a multimodal approach covering RGB, pose, and optical flow data. However, our proposed method achieved a 99.84% accuracy for raw image-based input and a 100% accuracy for pose-based input, which underscores the superiority of our method. In addition, the recognition accuracy of 99.53% achieved by our proposed R3(2+1)D-SLR model with RGB body data emphasizes the importance of the effective use of deep learning models.
5. Conclusions
In summary, our novel R3(2+1)D-SLR network marks a significant advancement in SLR by effectively merging R3D and R(2+1)D convolution blocks. This innovative approach facilitates a deeper understanding and a more accurate capture of sign language’s complex spatial and temporal dynamics. Together with this proposed network, our comprehensive methodology, which fuses body, hands, and face features based on pose data, consistently outperformed existing work in the literature across datasets and under different background conditions.
Our future work will focus on exploring alternative classifiers to the SVM used in our proposed SLR system and optimizing their hyperparameters. In particular, going beyond direct feature fusion, we aim to explore the potential of alternative ensemble techniques such as boosting, bagging, and stacking to more effectively integrate and evaluate different feature sets. We will also focus on the robustness and adaptability of the SLR system, examining the integration of additional data augmentation techniques in addition to background variations to increase the model’s adaptability to various environmental conditions. Together, these efforts aim to propel SLR technology forward, improving communication access and inclusivity for the deaf and hard-of-hearing.