Article

Techniques for Detecting the Start and End Points of Sign Language Utterances to Enhance Recognition Performance in Mobile Environments

Department of Computer Engineering, Chungbuk National University, Cheongju 28644, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9199; https://doi.org/10.3390/app14209199
Submission received: 30 August 2024 / Revised: 4 October 2024 / Accepted: 9 October 2024 / Published: 10 October 2024
(This article belongs to the Special Issue Deep Learning and Edge Computing for Internet of Things)

Abstract
Recent AI-based technologies in mobile environments have enabled sign language recognition, allowing deaf individuals to communicate effectively with hearing individuals. However, varying computational performance across different mobile devices can result in differences in the number of image frames extracted in real time during sign language utterances. The number of extracted frames is a critical factor influencing the accuracy of sign language recognition models: if too few frames are extracted, the performance of the recognition model may decline. Additionally, detecting the start and end points of sign language utterances is crucial for improving recognition accuracy, as the periods before the start point and after the end point typically contain no signing and therefore do not capture the distinctive characteristics of each sign. Therefore, this paper proposes a technique that dynamically adjusts the sampling rate based on the number of frames extracted in real time during sign language utterances in mobile environments, with the aim of accurately detecting the start and end points of each utterance. Experiments were conducted to compare the proposed technique with the fixed sampling rate method and with a no-sampling method as a baseline. Our findings show that the proposed dynamic sampling rate adjustment method improves performance by up to 83.64% in top-5 accuracy and by up to 66.54% in top-1 accuracy compared to the fixed sampling rate method. These results underscore the effectiveness of our dynamic sampling rate adjustment approach in enhancing the accuracy and robustness of sign language recognition systems across different operational conditions.

1. Introduction

Recently, with the improvement in computing performance of mobile devices such as smartphones, there has been active research on AI-powered mobile applications utilizing artificial intelligence models. For example, various AI-based services are continuously being developed in fields such as education, real-time translation, and privacy protection [1,2,3]. In relation to AI-powered mobile applications, research aimed at improving communication between hearing and deaf individuals through the use of artificial intelligence has also been actively conducted [4,5,6,7,8,9]. A notable example is the use of avatars in sign language services [10,11,12]. In contemporary society, mobile devices play a significant role in communication, information accessibility, and service usage. In this context, recognizing and translating sign language in mobile environments is crucial not only for deaf individuals but also for effectively facilitating smooth communication with hearing individuals.
However, recognizing and translating sign language in the current mobile environment has limitations [13]. One major limitation is that mobile environments have less computational power than server environments [14,15], resulting in fewer image frames being extracted within the same time period. If the number of extracted frames is too small, the performance of the sign language recognition model may decrease. Additionally, accurately detecting the start and end points of a sign language utterance is critical for improving the accuracy of sign language recognition models. As seen in automatic speech recognition (ASR), failure to accurately detect start and end points can severely degrade the performance of a recognition system [16]. The periods before the start and after the end of a sign language utterance typically involve no signing, so these segments do not adequately capture the unique characteristics of each sign. Including them in the training of a sign language recognition model can therefore reduce its accuracy. Addressing these challenges is essential for enhancing the performance and reliability of sign language recognition on mobile devices.
To accurately detect the start and end points of a sign language utterance in a mobile environment, it is necessary to dynamically adjust the sampling rate based on the number of frames extracted, rather than relying on a fixed sampling rate, as proposed in previous research [17]. This is because when fewer frames are extracted during a sign language utterance, the movement changes between consecutive frames are greater compared to when more frames are extracted. Therefore, it is necessary to adjust the sampling rate to determine the start and end points of the sign language utterance based on the number of extracted frames. This approach can better accommodate the computational limitations of mobile environments compared to server environments.
In previous research [17], the focus was on detecting the start and end points of sign language utterances by analyzing keypoint variations in sign language videos. This research was conducted in standard computing environments, such as personal computers or server computers. However, the method proposed in that research is not as suitable for resource-constrained mobile computing environments as it is for server environments. This is because mobile environments have varying computational performance across different devices, leading to differences in the number of sign language utterance image frames that can be obtained within the same time period. In practice, the average frame rate was measured to be 17 fps when extracting keypoint images during real-time sign language utterances on a Galaxy Z Flip4 (Snapdragon 8 Gen 2). Additionally, if multiple applications are running on a mobile device, the frame rate may decrease. Similarly, high memory usage on the mobile device can also cause a drop in frame rate [18]. Consequently, applying the fixed sampling rate method proposed in the previous research may degrade the performance of sign language recognition models in mobile environments.
This paper proposes an improved dynamic sampling rate adjustment method for mobile environments based on the technique suggested in previous research. Our method dynamically adjusts the sampling rate to detect the start and end points of sign language utterances based on the number of frames extracted in real time during sign language utterances in mobile environments. By applying this dynamic sampling rate adjustment method, we can enhance the accuracy of sign language recognition models.
To evaluate the recognition accuracy of the proposed dynamic sampling rate adjustment method, we compared its performance to that of a fixed sampling rate method and a no-sampling baseline across various frame rates: 5, 10, 15, 20, and 25 fps. The experimental results indicated that the proposed method consistently outperformed the fixed sampling rate method, particularly at lower frame rates of 5 and 10 fps, demonstrating its effectiveness in computationally constrained environments. Even at higher frame rates, the dynamic method maintained superior accuracy. Overall, it improved top-5 accuracy by up to 83.64% and top-1 accuracy by up to 66.54%, underscoring its ability to enhance both the accuracy and robustness of sign language recognition systems.
The remainder of this paper is organized as follows. Section 2 reviews related research. Section 3 provides a detailed explanation of the proposed dynamic sampling rate adjustment method. Section 4 describes the performance evaluation environment and presents the evaluation results in detail. Finally, Section 5 concludes the paper and discusses future work.

2. Related Works

This section reviews related research. Various sign language recognition techniques have been studied to improve recognition accuracy; Table 1 and Table 2 summarize key related works in this area.
Ekbote et al. proposed the development of an automatic recognition system specifically designed for Indian Sign Language numerals (0–9). The system was built using a self-created database of 1000 images, with 100 images dedicated to each numeral. To effectively capture the distinguishing features of these numerals, the study employed advanced feature extraction techniques, including shape descriptors, scale-invariant feature transform (SIFT), and histogram of oriented gradients (HOG). For classification, two robust machine learning models, artificial neural networks (ANN) and support vector machine (SVM), were utilized. The proposed system achieved a remarkable accuracy of up to 99%, demonstrating its potential for accurately recognizing Indian Sign Language numerals [19].
In the field of computer vision, convolutional neural networks (CNNs) have become foundational architectures, significantly impacting various applications, including sign language recognition, facial recognition, object detection, medical image analysis, and autonomous driving. For instance, Pathan et al.’s proposed methodology employs a two-layer image processing technique. In the first layer, images are processed as a whole for training purposes, while the second layer focuses on extracting hand landmarks to refine the recognition process. This approach is implemented through a multiheaded convolutional neural network (CNN) model, which effectively manages the dual processing tasks [20].
Katoch et al. proposed a novel Indian Sign Language (ISL) recognition system that combines Speeded Up Robust Features (SURF), support vector machine (SVM), and convolutional neural network (CNN) to improve communication for the hearing and speech-impaired communities. Through extensive experiments with a well-defined ISL gesture dataset, the system achieved over 90% accuracy, with CNN playing a key role in enhancing feature extraction and classification [21].
Kothadiya et al. proposed a model that utilizes long short-term memory (LSTM) and gated recurrent unit (GRU) networks to detect and recognize gestures from isolated video frames of Indian Sign Language (ISL). The authors developed their own dataset, IISL2020, and achieved approximately 97% accuracy in recognizing 11 different signs using a combination of LSTM and GRU layers [22].
More recently, transformer models have been applied to sign language recognition, showing promising results. Transformers, known for their capability to handle long-range dependencies and parallelize training, are particularly effective for sequential data, such as sign language videos.
Kothadiya et al. proposed a novel approach to recognizing static Indian Sign Language using a transformer encoder. The proposed system, which utilizes a vision transformer to process sign language through positional embedding patches and a transformer block with self-attention layers, significantly outperforms traditional convolutional architectures. Achieving an accuracy of 99.29% with minimal training, the proposed SIGNFORMER demonstrates robustness and effectiveness in real-world applications [23].
Alharthi et al. addressed the gap in existing research by applying pretrained models and vision transformers, typically used in image classification, to Arabic Sign Language (ArSL). Utilizing a dataset of 54,049 images representing 32 different Arabic letters, the research compared transfer learning models, including MobileNet, Xception, and ResNet, with CNN architectures. The results show that transfer learning approaches, particularly ResNet and InceptionResNet, achieved high accuracy of 98%, demonstrating the potential of these methods for enhancing sign language recognition in low-resourced languages [24].
Recent research in sign language recognition leverages keypoint-based approaches, drawing on methods proposed in [25]. This study significantly improves 3D human pose estimation by achieving an 18% reduction in prediction error compared to previous unsupervised methods, demonstrating effective 3D pose prediction from 2D inputs without relying on 3D training data. The results, validated on the Human3.6M dataset, show that PoseNet3D outperforms existing techniques, delivering robust and accurate 3D pose estimations with natural and realistic transitions across frames. The success of the model is attributed to the joint fine-tuning of teacher and student networks using temporal, self-consistency, and adversarial losses, enhancing overall prediction accuracy.
Bird et al. [26] explored the use of neuroevolution techniques to optimize deep neural networks for recognizing American Sign Language (ASL) fingerspelling. By employing hyperheuristic algorithms to fine-tune network architectures and hyperparameters, the study achieved a high mean accuracy of 97.44% on a dataset of 1678 images. The research highlights the potential of these techniques in enhancing the accuracy and efficiency of sign language recognition systems, making them more effective for individuals with hearing impairments.
Together, these studies illustrate the critical role of keypoints in improving the accuracy and efficiency of computer vision systems across diverse applications, from human action and pose estimation to facial and hand gesture recognition. By leveraging keypoint data, these models can achieve high performance while addressing the specific challenges associated with each application area.
Kim et al. [17] proposed a novel approach for detecting the start and end points of sign language utterances. This method, which can be considered a preliminary study to our work, utilizes keypoint data extracted from sign language video frames to identify the precise moments when a sign language utterance begins and ends. Kim et al. employed the MediaPipe Holistic model to extract keypoints from video frames, focusing on 59 keypoints, including 17 from the Pose model and 21 from each Hand model (left and right). Their method calculates the change in position of these keypoints between frames to determine the start and end of sign language gestures. The study compared the performance of their method using two different sets of keypoints: a set of 10 keypoints (including elbows, wrists, and middle joints of thumbs, middle fingers, and little fingers) and a set of 22 keypoints (focusing on wrists and the start and end joints of all fingers). It showed that using 10 keypoints yielded better results, with an average detection error of 2.2 frames and an execution time of 3.7 ns. This suggests that considering the overall arm movement, including the elbow, is more effective in detecting the start and end points of sign language utterances than focusing solely on hand movements. This work provides a foundation for improving sign language recognition models by accurately isolating the actual utterance portions of sign language videos, potentially contributing to increased recognition accuracy in sign language translation systems. Our proposed method builds upon this work, addressing some of its limitations and extending its capabilities to propose a dynamic sampling rate adjustment method that can be effectively used in mobile environments.

3. Dynamic Sampling Rate Adjustment Method

In this section, we introduce our dynamic sampling rate adjustment method, designed to improve the performance of sign language recognition in mobile environments by adjusting the sampling rate based on the number of frames extracted in real time during sign language utterances.
Mobile environments typically have less computational power compared to server environments, resulting in greater variability in the number of frames that can be captured in real time during sign language utterances. The method suggested in previous research [17] does not account for these variations, potentially leading to suboptimal recognition accuracy in mobile environments. Our proposed dynamic sampling rate adjustment method addresses this issue by dynamically adjusting the sampling rate according to the actual frame rate obtained during sign language utterances. Table 3 provides a comprehensive list of symbols and their descriptions to explain our proposed method.
In the previous research [17], the start and end points of sign language utterances were detected based on the amount of change in the position of each keypoint. $SR_{server}$ is the sampling rate at which keypoint coordinates are sampled and measured: the keypoint coordinates are measured every $SR_{server}$ frames. In that experiment, $SR_{server}$ was set to 5. $N_{change}$ represents the total number of keypoint position change records and can be calculated using Equation (1). The measured keypoint information is structured as $(x, y)$ coordinates. $C_i$ is the $i$-th keypoint change record, measured every $SR_{server}$ frames, and can be calculated using Equation (2).
$N_{change} = \frac{F_{total}}{S_r}$ (1)
$C_i = \sum_{k_{idx}=1}^{K_{max}} \sqrt{\left(x_i^{k_{idx}} - x_{i-1}^{k_{idx}}\right)^2 + \left(y_i^{k_{idx}} - y_{i-1}^{k_{idx}}\right)^2}$ (2)
In Equation (2), $k_{idx}$ represents the index of a measured keypoint, and the maximum index is $K_{max}$. The value of $K_{max}$ depends on the set of keypoints used to detect the start and end of hand motion. $x_i^{k_{idx}}$ is the x coordinate of the keypoint with index $k_{idx}$ in the $i$-th keypoint extraction frame; similarly, $y_i^{k_{idx}}$ is the corresponding y coordinate.
$C_{total}$ is the sum of the position changes of all keypoints and can be calculated using Equation (3). As shown in Equation (3), $C_{total}$ is the sum of all $C_i$.
$C_{total} = \sum_{i=1}^{N_{change}} C_i$ (3)
$C_{avg}$ is the average of the keypoint position changes and can be calculated using Equation (4). $C_{avg}$ is used as the threshold for determining the start and end of hand motion.
$C_{avg} = \frac{C_{total}}{N_{change}}$ (4)
Finally, in the detection algorithm, the start and end of hand motion are determined by Equations (5) and (6). The start of hand motion is the frame with the smallest index $i$ among the frames whose change $C_i$ exceeds $C_{avg}$; conversely, the end of hand motion is the frame with the largest such index $i$.
$S_{start} = \min\{\, i \mid C_{avg} \le C_i \,\}$ (5)
$S_{end} = \max\{\, i \mid C_{avg} \le C_i \,\}$ (6)
The key parameter is the sampling rate $SR_{server}$. If $SR_{server}$ is too small, many frames must be sampled and measured to detect the start and end points of the sign language utterance; conversely, if it is too large, fewer frames are measured. In short, this parameter determines how frequently keypoint changes are observed.
Applying this fixed sampling algorithm directly to mobile environments presents challenges due to mobile devices having lower computing power compared to servers. As a result, typical mobile devices, such as smartphones, extract fewer frames per unit of time, making the use of a server-optimized fixed sampling rate impractical. For example, in the case of a server, it is possible to obtain more than 30 fps (frames per second) during sign language utterances, whereas mobile devices, such as smartphones, may obtain less than 30 fps depending on their computing performance. Therefore, in mobile environments, it is necessary to dynamically adjust the sampling rate based on the obtained frames to detect the start and end points of sign language utterances.
As we explained, in mobile environments, frames are extracted in real time during sign language utterances, leading to variations in the number of frames extracted over the same period compared to servers, where all frames can be extracted from a video. This variation necessitates adjusting the sampling rate dynamically based on each mobile device’s frame extraction rate. By doing so, the accuracy of detecting the start and end points of sign language utterances can be improved, ultimately leading to an improvement in the overall accuracy of the sign language recognition model. This dynamic method is crucial in resource-constrained environments like mobile devices to provide sign language recognition-based services.
A detailed explanation of our dynamic sampling rate adjustment method, based on the actual frame rate of the extracted images when detecting the start and end points of sign language utterances in a mobile environment, is as follows.
The most important aspect of the proposed method is the ability to calculate the actual frame rate when extracting images during sign language utterances, considering the performance of the mobile device. The frame rate in mobile environments ($fps_{mobile}$) can be calculated using Equation (7).
$fps_{mobile} = \frac{N_{frames}}{T_{utterance}}$ (7)
It is also necessary to calculate the ratio of the frame rate in mobile environments to the frame rate on the server; based on this ratio, the sampling rate can be adjusted dynamically. The ratio ($R_{fps}$) is calculated using Equation (8).
$R_{fps} = \frac{fps_{mobile}}{fps_{server}}$ (8)
Finally, in the proposed method, the dynamic sampling rate ($SR_{mobile}$) is determined by adjusting the optimal fixed sampling rate used in server environments according to the frame rate ratio, as shown in Equation (9).
$SR_{mobile} = \left\lfloor SR_{server} \times R_{fps} + 0.5 \right\rfloor$ (9)
As shown in Equation (9), this ensures that the sampling rate is dynamically adjusted to maintain consistent recognition performance despite variations in frame rates in mobile environments.
Algorithm 1 illustrates the process of the proposed dynamic sampling rate adjustment for mobile devices. To clearly illustrate the method for calculating the dynamic sampling rate described in Algorithm 1, consider the following scenario, in which the optimal fixed sampling rate in server environments ($SR_{server}$) is 5. Additionally, assume that the number of frames extracted in real time ($N_{frames}$) is 50, the duration of the sign utterance ($T_{utterance}$) is 10 s, and the frame rate in server environments ($fps_{server}$) is 30 fps.
Algorithm 1 Dynamic sampling rate adjustment for mobile devices
1: Input: $SR_{server}$, $N_{frames}$, $T_{utterance}$, and $fps_{server}$
2: Output: $SR_{mobile}$
3: Calculate $fps_{mobile}$ using Equation (7):
4:     $fps_{mobile} \leftarrow N_{frames} / T_{utterance}$
5: Calculate $R_{fps}$, the ratio of mobile to server frame rate, using Equation (8):
6:     $R_{fps} \leftarrow fps_{mobile} / fps_{server}$
7: Calculate the dynamic sampling rate $SR_{mobile}$ using Equation (9):
8:     $SR_{mobile} \leftarrow \lfloor SR_{server} \times R_{fps} + 0.5 \rfloor$
9: return $SR_{mobile}$
Based on these assumptions, we can compute the frame rate in mobile environments ($fps_{mobile}$) as shown in Equation (10). The ratio of the mobile frame rate to the server frame rate ($R_{fps}$) is calculated as shown in Equation (11). Finally, we obtain the dynamic sampling rate ($SR_{mobile}$) as shown in Equation (12).
$fps_{mobile} = \frac{N_{frames}}{T_{utterance}} = \frac{50}{10} = 5\ \text{fps}$ (10)
$R_{fps} = \frac{fps_{mobile}}{fps_{server}} = \frac{5}{30} \approx 0.167$ (11)
$SR_{mobile} = \lfloor SR_{server} \times R_{fps} + 0.5 \rfloor = \lfloor 5 \times 0.167 + 0.5 \rfloor = \lfloor 1.335 \rfloor = 1$ (12)
In this scenario, the dynamic sampling rate in mobile environments ($SR_{mobile}$) is set to 1. This means that each extracted frame is sampled and checked to detect the start and end points of a sign language utterance. In summary, the proposed method checks image frames at shorter intervals as the number of extracted frames from a mobile device decreases, in order to improve sign language recognition accuracy. Therefore, the proposed method can maintain consistent recognition performance, even with a lower frame rate compared to server environments.
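To make this calculation concrete, the following Python sketch implements Equations (7)–(9). The function name, its default arguments ($SR_{server} = 5$, $fps_{server} = 30$), and the final clamp to a minimum rate of 1 are illustrative choices rather than details prescribed by the paper.

```python
import math

def dynamic_sampling_rate(n_frames: int, t_utterance: float,
                          sr_server: int = 5, fps_server: float = 30.0) -> int:
    """Compute SR_mobile from the frames extracted on the mobile device.

    Implements Equations (7)-(9): fps_mobile = N_frames / T_utterance,
    R_fps = fps_mobile / fps_server, and
    SR_mobile = floor(SR_server * R_fps + 0.5).
    """
    fps_mobile = n_frames / t_utterance                # Equation (7)
    r_fps = fps_mobile / fps_server                    # Equation (8)
    sr_mobile = math.floor(sr_server * r_fps + 0.5)    # Equation (9)
    # Clamp (not in the paper): guard against a sampling rate of 0
    # when very few frames are extracted.
    return max(sr_mobile, 1)

# Worked example from Equations (10)-(12): 50 frames over a 10 s utterance.
print(dynamic_sampling_rate(n_frames=50, t_utterance=10.0))  # -> 1
```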
Algorithm 2 provides a method for detecting the start and end points of sign language utterances. Once the dynamic sampling rate for mobile environments ($SR_{mobile}$) is determined based on Algorithm 1, this algorithm uses the movement of hand keypoints across frames to identify the exact points where a sign language utterance starts and ends.
As shown in Algorithm 2, it leverages the calculated $SR_{mobile}$ to compute the number of frames to be analyzed for changes, denoted as $N_{change}$ (where $S_r$ is replaced by $SR_{mobile}$). By assessing the cumulative movement of keypoints between frames ($C_i$) and comparing it to the average movement ($C_{avg}$), the algorithm detects the start point ($S_{start}$) and end point ($S_{end}$) of the utterance.
Algorithm 2 Detecting start and end points of sign language utterances through dynamic sampling rate
1: Input: $SR_{mobile}$, $K_{info}$, $F_{total}$, $K_{max}$
2: Output: $S_{start}$, $S_{end}$
3: Initialize $C_{total} \leftarrow 0$, $N_{change} \leftarrow 0$, $C_i \leftarrow 0$, $C_{avg} \leftarrow 0$, $S_{start} \leftarrow -1$, and $S_{end} \leftarrow -1$
4: Calculate $N_{change}$ using Equation (1) (replacing $S_r$ with $SR_{mobile}$):
5:     $N_{change} \leftarrow F_{total} / SR_{mobile}$
6: for $i = 1$ to $N_{change}$ do
7:     Calculate $C_i$ using Equation (2):
8:         $C_i \leftarrow \sum_{k_{idx}=1}^{K_{max}} \sqrt{(x_i^{k_{idx}} - x_{i-1}^{k_{idx}})^2 + (y_i^{k_{idx}} - y_{i-1}^{k_{idx}})^2}$
9:     Accumulate $C_{total}$:
10:        $C_{total} \leftarrow C_{total} + C_i$
11: end for
12: Calculate $C_{avg}$ using Equation (4):
13:     $C_{avg} \leftarrow C_{total} / N_{change}$
14: for $i = 1$ to $N_{change}$ do                ▹ Determine start and end points
15:     if $C_i \ge C_{avg}$ then
16:         if $S_{start} = -1$ then
17:             $S_{start} \leftarrow i$          ▹ First time exceeding $C_{avg}$
18:         end if
19:         $S_{end} \leftarrow i$                ▹ Update end point to current frame
20:     end if
21: end for
22: return $S_{start}$, $S_{end}$
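The following Python sketch is one possible implementation of Algorithm 2. It assumes the keypoint information ($K_{info}$) is supplied as a per-frame list of (x, y) tuples for the tracked keypoints; the function name and this data layout are assumptions made for illustration.

```python
import math

def detect_start_end(keypoints, sr_mobile: int):
    """Detect start/end indices of a sign language utterance (Algorithm 2).

    keypoints: list of frames, each a list of (x, y) tuples for the K_max
    tracked keypoints, ordered by frame index (K_info in Table 3).
    sr_mobile: dynamic sampling rate obtained from Algorithm 1.
    Returns (s_start, s_end) as indices into the sampled frame sequence,
    or (-1, -1) if no change record reaches the average threshold.
    """
    # Sample every sr_mobile-th frame, as in Equation (1) with S_r = SR_mobile.
    sampled = keypoints[::sr_mobile]
    n_change = len(sampled) - 1
    if n_change <= 0:
        return -1, -1

    # C_i: summed Euclidean displacement of all keypoints between
    # consecutive sampled frames (Equation (2)).
    changes = []
    for i in range(1, n_change + 1):
        c_i = sum(
            math.hypot(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(sampled[i - 1], sampled[i])
        )
        changes.append(c_i)

    c_avg = sum(changes) / n_change      # Equations (3)-(4): threshold

    s_start, s_end = -1, -1
    for i, c_i in enumerate(changes, start=1):
        if c_i >= c_avg:                 # Equations (5)-(6)
            if s_start == -1:
                s_start = i              # first change exceeding C_avg
            s_end = i                    # last change exceeding C_avg
    return s_start, s_end
```

The returned indices refer to the sampled sequence; multiplying them by $SR_{mobile}$ gives approximate positions in the original frame sequence.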
By dynamically determining the sampling rate based on the number of frames extracted in mobile environments, we can gain the following advantages. First, we can provide adaptability to different mobile device capabilities and conditions by adjusting the sampling rate in real time based on frame rates. Second, we can ensure accuracy in detecting the start and end points of sign language utterances, leading to better overall recognition performance. Lastly, we can reduce the use of computational resources, minimizing unnecessary processing, and lowering the computational load.
Figure 1 illustrates an example of the service flow using the proposed dynamic sampling rate adjustment method in a mobile computing environment. As shown in Figure 1, the process begins with the user initiating a sign language utterance, which is captured by the mobile device’s camera. The mobile device processes the captured frames in real time, applying algorithms to detect the start and end points of the utterance, thereby optimizing the input data for recognition.
After detecting these key segments, the processed images are transmitted to a sign language recognition model, typically located on a server or in the cloud. This model performs inference on the input data to identify the specific sign language gestures. Once the inference is completed, the recognition results are sent back to the mobile device for display or further use.
The mobile device focuses on capturing and preprocessing the gesture data, using dynamic sampling rate adjustments to enhance efficiency, particularly in resource-constrained environments. The recognition model then performs the computationally intensive task of interpreting the gestures. By dividing the workload in this manner, the proposed method ensures reliable and effective sign language recognition, despite the limited computational power available on mobile devices.
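For illustration only, the client-side part of this flow could look like the sketch below. The endpoint URL, payload format, and response field are hypothetical placeholders, and `detect_start_end` refers to the Algorithm 2 sketch above; none of these details are prescribed in the paper.

```python
import json
import requests  # any HTTP client would do

RECOGNITION_ENDPOINT = "https://example.com/sign-recognition"  # hypothetical

def recognize_utterance(frames, keypoints, sr_mobile):
    """Client-side flow: trim the utterance locally, then offload inference."""
    # 1. Detect start and end points on the device (Algorithm 2 sketch above).
    s_start, s_end = detect_start_end(keypoints, sr_mobile)
    if s_start == -1:
        return None  # no utterance detected

    # 2. Keep only the frames between the detected start and end points
    #    (approximate mapping from sampled indices back to frame indices).
    trimmed = frames[s_start * sr_mobile:(s_end * sr_mobile) + 1]

    # 3. Send the trimmed sequence to the recognition model on the server.
    payload = {"frames": [frame.tolist() for frame in trimmed]}  # NumPy frames assumed
    response = requests.post(RECOGNITION_ENDPOINT, data=json.dumps(payload),
                             headers={"Content-Type": "application/json"},
                             timeout=30)
    response.raise_for_status()

    # 4. The server returns the predicted sign class for display on the device.
    return response.json().get("predicted_class")
```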

4. Performance Evaluations

This section provides a detailed explanation of our performance evaluation environments and results. Ideally, the performance evaluation should be conducted in a mobile environment. However, this approach poses several challenges, including the extensive time required to evaluate a large dataset containing 510 sign language classes, the difficulty for sign language performers to master and perform all gestures accurately, and the variability in performance capabilities across different mobile devices. For these reasons, our dynamic sampling rate adjustment method was evaluated in a server environment rather than on a mobile platform. Although the performance evaluation was conducted in a server environment, it was performed with varying fps (frames per second) to account for differences in computing performance across different mobile environments, specifically by adjusting the number of image frames extracted during sign language utterances.
We compared the performance of our dynamic sampling rate adjustment method against the fixed sampling rate method proposed in previous research [17], as well as a baseline with no sampling adjustment. Table 4 shows the hardware and software configuration for the experimental environments.

4.1. Experimental Environments

4.1.1. Sign Language Recognition Model

To evaluate the performance of our dynamic sampling rate adjustment method, we used the Video Swin Transformer model [27]. This model stands out as a cutting-edge architecture in video recognition, celebrated for its efficiency and accuracy in processing video sequences. Since sign language recognition involves inferring actions from video data, the Video Swin Transformer model is highly suitable for sign language recognition and classification tasks. Specifically, we utilized the swin3d_b model with pretrained weights torchvision.models.video.Swin3D_B_Weights.KINETICS400_IMAGENET22K_V1 from PyTorch.
The overall architecture of the Video Swin Transformer used in the performance evaluations is shown in Figure 2. The final output layer was modified to predict 510 classes to match the sign language dataset used for training and validation. The Video Swin Transformer is specifically designed to capture both spatial and temporal information from video frames, making it highly suitable for tasks such as sign language recognition. It achieves this by segmenting video frames into nonoverlapping windows and applying self-attention mechanisms within each window. This methodology enables the model to focus effectively on relevant segments of the frames, facilitating the learning of intricate patterns associated with sign language movements. Key features of the Video Swin Transformer include the following:
  • Hierarchical structure: The model employs a hierarchical design that processes video data at multiple scales. This allows it to capture both fine-grained details and broader contextual information.
  • Shifted window partitioning: Unlike traditional transformer models, the Video Swin Transformer uses a shifted window partitioning scheme. This approach enables connections between windows, facilitating the flow of information across different regions of the video frames.
  • Relative position bias: The model incorporates relative position encoding, which helps maintain spatial relationships between different parts of the input, crucial for understanding the spatial configuration of sign language gestures.
  • 3D patch embedding: Video frames are divided into 3D patches, preserving both spatial and temporal information. This approach is particularly beneficial for capturing the motion dynamics in sign language.
  • Efficient computation: By limiting self-attention computation to local windows, the Video Swin Transformer achieves linear computational complexity with respect to input size, making it more efficient than global attention mechanisms.
As shown in Figure 2, in our experimental setup, the Video Swin Transformer processed sequences of images extracted from videos by first extracting keypoints of sign language utterances from these images. This approach allowed us to investigate the effectiveness of our dynamic sampling rate adjustment method in maintaining high recognition accuracy across different frame rates, reflecting real-world variability, such as background image information, in video capture conditions. Table 5 shows the hyperparameters and their settings used to train the sign language recognition model. The loss function used was LabelSmoothingCrossEntropy, which is suitable for classification purposes. The batch size was set to 4, considering the computing environment used in the experiment. The learning rate was set to $1 \times 10^{-3}$, a commonly used value. The optimizer was set to SGD.
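The model and training configuration described above can be sketched as follows. The `model.head` attribute is assumed to be the classifier exposed by torchvision's `swin3d_b`, and PyTorch's built-in label smoothing is used here as a stand-in for the LabelSmoothingCrossEntropy loss named above, with an illustrative smoothing value.

```python
import torch
import torch.nn as nn
from torchvision.models.video import swin3d_b, Swin3D_B_Weights

NUM_CLASSES = 510  # 405 individual words + 105 sentences

# Load the pretrained Video Swin Transformer (swin3d_b) backbone.
model = swin3d_b(weights=Swin3D_B_Weights.KINETICS400_IMAGENET22K_V1)

# Replace the final classification layer to predict the 510 sign classes.
model.head = nn.Linear(model.head.in_features, NUM_CLASSES)

# Training configuration from Table 5: SGD optimizer, learning rate 1e-3,
# batch size 4, label-smoothed cross-entropy loss.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing value assumed
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```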
By leveraging the capabilities of the Video Swin Transformer, our performance evaluations aim to provide evidence that the recognition accuracy of the model can be improved by the proposed dynamic sampling rate adjustment method compared to methods proposed in previous research.

4.1.2. Dataset and Data Preprocessing

We used a dataset provided by KETI (Korea Electronics Technology Institute, Seongnam, Republic of Korea), with some class data publicly available on AIHUB. Table 6 shows the composition of the dataset used in the experiment. Our dataset, used for model training and evaluation, comprises a total of 41,240 video samples, divided into training (28,868), testing (6186), and validation (6186) sets. It covers 510 classes, consisting of 405 individual words and 105 sentences, with each class represented by approximately 80 samples. All videos were captured in full HD (1920 × 1080) resolution, and the dataset includes recordings from multiple sign language speakers, ensuring diversity in the sign language utterances.
Each video sample in the dataset is a recording of a sign language utterance. To prepare the data for training and validation, we employed the following preprocessing steps.
  • Frame Extraction: Each video is broken down into individual frames.
  • Video Frame Centralization and Resizing:
    • Each extracted frame, originally at a resolution of 1920 × 1080 pixels, is centrally cropped to 1080 × 1080 pixels to ensure the sign language gesture is centered.
    • The centrally cropped frame is then resized to 256 × 256 pixels to standardize the input size for the model.
  • Keypoint Extraction: Keypoints are then extracted from each preprocessed frame (a code sketch of the cropping and resizing steps follows this list):
    • 17 keypoints (0–16) from the Pose model, shown on the left side of Figure 3.
    • 21 keypoints (0–20) from each Hand model (left and right), as depicted on the right side of Figure 3.
    • A total of 59 keypoints are extracted per frame.
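A minimal OpenCV sketch of the cropping and resizing steps is shown below; it assumes frames are read as 1920 × 1080 arrays (e.g., via `cv2.VideoCapture`), and the keypoint-extraction step itself is omitted.

```python
import cv2

def preprocess_frame(frame):
    """Center-crop a 1920x1080 frame to 1080x1080, then resize to 256x256."""
    h, w = frame.shape[:2]      # expected (1080, 1920)
    side = min(h, w)            # 1080
    x0 = (w - side) // 2        # left edge of the central square
    y0 = (h - side) // 2        # top edge (0 for a 1080-pixel-high frame)
    cropped = frame[y0:y0 + side, x0:x0 + side]
    return cv2.resize(cropped, (256, 256), interpolation=cv2.INTER_AREA)
```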

4.1.3. Data Augmentation and Model Training

To enhance the robustness of our sign language recognition model, we implemented several data augmentation techniques during the training process. In real-world scenarios, the form of sign language utterances can vary significantly depending on the signer. Additionally, the signer may not always perform directly in front of the camera and may even perform signs in reverse. Therefore, data augmentation techniques such as random horizontal flipping, random rotation, and random cropping were applied. Our data loading and augmentation pipeline is implemented in a custom class that is inherited from PyTorch’s Dataset class. The three data augmentation techniques used in our training process are as follows.
  • Random horizontal flipping: 50% probability;
  • Random rotation: ±5 degrees;
  • Random cropping: From 256 × 256 to 224 × 224.
The data loading process is optimized to handle varying video lengths and to apply transformations efficiently. Our implementation allows for the easy adjustment of parameters such as the number of frames, image size, and augmentation intensity. This robust data preparation and augmentation pipeline, combined with our large and diverse dataset, provides a strong foundation for training our sign language recognition model. It ensures that the model is exposed to a wide variety of input variations, promoting generalization and resilience to real-world variability in sign language performances.
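These augmentations can be expressed with standard torchvision transforms, as sketched below. In practice, the same random parameters should be applied to every frame of a clip so that the augmentation stays temporally consistent; how this is handled inside the custom Dataset class is not detailed here.

```python
from torchvision import transforms

# Training augmentations matching the settings listed above.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # 50% probability
    transforms.RandomRotation(degrees=5),     # rotate within +/-5 degrees
    transforms.RandomCrop(224),               # crop 256x256 input to 224x224
])
```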

4.1.4. Experimental Setup and Evaluation Scenarios

The performance evaluation was conducted using a pre-extracted test dataset consisting of 6186 samples. Each original sample in this dataset represents a sequence of images extracted from videos at 30 frames per second (fps), capturing the keypoints of sign language utterances by the signers. These keypoint images serve as the input for our sign language recognition model.
To assess the effectiveness of our dynamic sampling rate adjustment method, we used a modular arithmetic method to simulate lower frame rates from the original 30 fps video sequences. Specifically, we adjusted the frame rates to 5, 10, 15, 20, and 25 fps by selectively choosing frames from the original data. For example, to simulate a 15 fps mobile computing environment, we retained frames that satisfy the condition i mod 2 = 0 (i.e., we kept frames 0, 2, 4, 6, … from the original sequence). This approach allowed us to evaluate how effectively the dynamic sampling rate adjustment method adapts to different frame rates and maintains recognition accuracy.
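Only the 15 fps case is specified explicitly above, so the sketch below generalizes the modular selection under the assumption that each lower rate keeps frames in proportion to the target-to-source ratio; it reproduces the stated 15 fps behavior (frames 0, 2, 4, …).

```python
def simulate_fps(frames, target_fps, source_fps=30):
    """Subsample a 30 fps frame sequence to approximate a lower frame rate.

    A frame is kept whenever it starts a new output slot at the target rate;
    for target_fps=15 this keeps frames 0, 2, 4, ... (i mod 2 == 0).
    """
    kept = []
    for i, frame in enumerate(frames):
        if (i * target_fps) // source_fps != ((i - 1) * target_fps) // source_fps:
            kept.append(frame)
    return kept

# Example: a 30-frame (1 s at 30 fps) sequence reduced to 15 fps keeps 15 frames.
frames_15fps = simulate_fps(list(range(30)), target_fps=15)
assert len(frames_15fps) == 15
```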
The experiments were conducted using three different methods: (1) the original dataset without any modifications, (2) the proposed dynamic sampling rate adjustment method, and (3) the fixed sampling rate method proposed in previous research. We evaluated the performance of each method on the test datasets at 5, 10, 15, 20, and 25 fps. This approach allowed us to compare the recognition accuracy of the original, unmodified sequences against the proposed method and the previous fixed sampling rate method across different frame rate scenarios.
Hereafter, the methods for processing the input data for the sign language recognition model will be referred to as follows: the original dataset without any modifications will be called the Original Method, the proposed dynamic sampling rate adjustment method will be called the Dynamic Method, and, finally, the fixed sampling rate method proposed in previous research will be called the Fixed Method.

4.2. Experimental Results

4.2.1. Accuracy Results

Our experiments evaluated the performance of the Dynamic Method across various frame rates, comparing it with the Fixed Method and the Original Method. Figure 4 shows the top-1 accuracy comparison according to fps, and Figure 5 shows the top-5 accuracy comparison according to fps. As shown in Figure 4 and Figure 5, the Dynamic Method consistently outperforms the Fixed and Original Methods in both top-1 and top-5 accuracy across all tested frame rates.
Figure 4 indicates that at lower frame rates of 5 fps and 10 fps, our Dynamic Method significantly outperforms the Fixed Method, achieving 77.37% accuracy at 5 fps compared to only 10.83% with the Fixed Method. As the frame rate increases, both methods generally show improved accuracy due to more frequent sampling. The Dynamic Method consistently maintains higher accuracy across all tested frame rates, reaching its highest accuracy at 25 fps with a top-1 accuracy of 92.29%, demonstrating its effectiveness in adapting to higher frame rates. Figure 5 shows the top-5 accuracy results according to the frame rate. As shown in Figure 5, the pattern of accuracy performance is similar to the top-1 results.
Based on Figure 4 and Figure 5, it is clear that the Dynamic Method consistently outperforms the Fixed Method and Original Method across all tested frame rates, with significant improvements in accuracy at lower frame rates of 5 fps and 10 fps. This highlights its effectiveness in scenarios with computational limitations, such as mobile environments. As the frame rate increases, the performance gap between the Dynamic and Fixed Methods is reduced, but the Dynamic Method maintains a competitive edge, indicating its robustness in varying operational conditions. The observed performance enhancement validates the efficacy of the Dynamic Method in enhancing the accuracy and reliability of sign language recognition systems deployed in mobile environments.

4.2.2. Impact of FPS on Sign Language Recognition Accuracy

The detection of the start and end points of sign utterances plays a crucial role in the accuracy of sign language recognition models, and this detection is significantly influenced by the frames per second (fps) of the video input. Figure 6 shows examples of how different fps affect the quality of captured keypoint images. Figure 6a shows frames extracted at 5 fps, while Figure 6b shows frames extracted at 25 fps. Both were sourced from an original 30 fps video depicting the sign for the number '0'.
In the fixed sampling rate scenario, where the sampling rate was set to 5, keypoint information is extracted every five frames to detect the start and end points of the sign language utterance. This fixed rate was optimized for 30 fps environments; when applied to a 5 fps environment, the Fixed Method therefore becomes suboptimal.
Table 7 shows a comparison of the number of extracted images, keypoint extraction frequency, and recognition accuracy according to fps. In a 5 fps environment, only 25 frames can be extracted during a 5 s sign utterance. This scarcity results in a higher likelihood of capturing irrelevant frames that do not adequately represent the beginning or end of the gesture, thereby reducing recognition accuracy. In short, because the keypoint movement change between consecutive frames is much larger at low frame rates, it is advisable to sample at shorter intervals. As shown in Figure 6, the amount of change in each keypoint between consecutive frames is greater at 5 fps than at 25 fps.
Conversely, in a 25 fps environment, 125 frames are captured in the same duration, providing the model with more opportunities to evaluate keypoint changes and accurately identify gesture dynamics and movements. This increased frame density improves the model’s ability to discern the start and end points, yielding higher recognition accuracy.
Table 8 provides top-1 confusion matrix results (accuracy, precision, recall, F1-score) according to frame rates and methods. As shown in Table 8, the performance of the Dynamic Method is significantly higher than that of the Fixed Method in lower-fps environments. In the 5 fps environment, the Dynamic Method significantly outperforms both the Original and Fixed Methods with higher precision (0.81), recall (0.77), and F1-score (0.76), while the Fixed Method particularly struggles with low recall (0.11), highlighting its limitations in such environments. In the 10 fps environment, all methods show improvement, yet the Dynamic Method maintains superior performance with high precision (0.87), recall (0.85), and F1-score (0.84); the Original Method also performs well but trails slightly in F1-score (0.83). In the 15 fps environment, the Dynamic Method continues to excel with balanced precision (0.89) and recall (0.88), resulting in a strong F1-score of 0.87, while the Fixed Method, although improved, remains behind in recall (0.70). In the 20 fps and 25 fps environments, the Dynamic Method improves further, achieving the highest F1-scores of 0.90 and 0.92, respectively, ahead of both the Fixed and Original Methods.
As shown in Table 8, the superior performance of the Dynamic Method, particularly in low-fps environments, can be attributed to its ability to adaptively adjust the sampling rate based on real-time frame rates. Therefore, the Dynamic Method can accurately capture the start and end points of sign language utterances even when fewer frames are available. This adaptability allows the Dynamic Method to maintain high precision and recall by efficiently managing the trade-off between temporal resolution and computational resources, which is especially crucial in mobile environments where computational power is limited. Based on all of the experimental results, the Dynamic Method proves to be a reliable solution for sign language recognition in real-time applications, particularly in environments with challenging operational conditions and varying frame rates.

5. Conclusions and Future Works

In this paper, we proposed a dynamic sampling rate adjustment method designed to enhance the performance of sign language recognition in mobile environments. The proposed method addresses the challenges of varying computational capacities and frame extraction rates across different mobile devices. The primary objective of the proposed scheme is to dynamically adjust the sampling rate based on the real-time frame extraction rate during sign language utterances in mobile environments. By using the dynamic sampling rate, the proposed method improves the detection of the start and end points of sign language utterances. Consequently, it improves overall recognition accuracy compared to the previous fixed method.
Based on the experimental results, the proposed dynamic sampling rate adjustment method significantly outperformed the previous fixed sampling rate method. Specifically, our approach achieved up to an 83.64% improvement in top-5 accuracy and a 66.54% enhancement in top-1 accuracy. These improvements were particularly pronounced in low-frame-rate scenarios. This shows the effectiveness of our method in environments with limited computational resources such as mobile computing environments. We expect that the proposed method can be widely applied not only to the field of sign language recognition in mobile environments but also to the preprocessing of various artificial intelligence services in mobile environments.
This paper has the following limitations. First, the effectiveness of the proposed dynamic sampling rate method was verified in an experimental setting that simulated mobile computing environments by varying the frame rate. This approach was used because performing all sign language utterances directly on a mobile device would be very time-consuming. Therefore, further performance validation on actual mobile devices may be necessary for future research. In addition, the proposed method was applied only to the Video Swin Transformer model, and it is necessary to verify its performance by applying the dynamic sampling rate method to a broader range of AI models.
Our future research work includes the following. Firstly, expanding the research to include a wider range of mobile devices and operating conditions would provide more comprehensive insights into the robustness of the proposed method. Additionally, integrating machine learning models that can predict the optimal sampling rate based on device specifications and environmental factors could further enhance performance.

Author Contributions

Conceptualization, T.K. and B.K.; methodology, T.K. and B.K.; software, T.K.; validation, T.K. and B.K.; formal analysis, T.K. and B.K.; investigation, T.K. and B.K.; resources, T.K. and B.K.; data curation, T.K. and B.K.; writing—original draft preparation, T.K. and B.K.; writing—review and editing, T.K. and B.K.; visualization, T.K. and B.K.; supervision, B.K.; project administration, B.K.; funding acquisition, B.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2022-II220043, Adaptive Personality for Intelligent Agents).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nasir, H.M.; Brahin, N.M.A.; Ariffin, F.E.M.S.; Mispan, M.S.; Wahab, N.H.A. AI Educational Mobile App using Deep Learning Approach. Int. J. Inform. Vis. 2023, 7, 952–958. [Google Scholar] [CrossRef]
  2. Li, Y.; Dang, X.; Tian, H.; Sun, T.; Wang, Z.; Ma, L.; Klein, J.; Bissyandé, T.F. AI-driven Mobile Apps: An Explorative Study. arXiv 2024, arXiv:2212.01635. [Google Scholar] [CrossRef]
  3. Karunya, S.; Jalakandeshwaran, M.; Babu, T.; Uma, R. AI-Powered Real-Time Speech-to-Speech Translation for Virtual Meetings Using Machine Learning Models. In Proceedings of the 2023 Intelligent Computing and Control for Engineering and Business Systems (ICCEBS), Chennai, India, 14–15 December 2023; pp. 1–6. [Google Scholar] [CrossRef]
  4. Guo, Z.; Hou, Y.; Hou, C.; Yin, W. Locality-Aware Transformer for Video-Based Sign Language Translation. IEEE Signal Process. Lett. 2023, 30, 364–368. [Google Scholar] [CrossRef]
  5. Li, W.; Pu, H.; Wang, R. Sign Language Recognition Based on Computer Vision. In Proceedings of the 2021 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 28–30 June 2021; pp. 919–922. [Google Scholar] [CrossRef]
  6. Ko, S.K.; Kim, C.J.; Jung, H.; Cho, C. Neural Sign Language Translation Based on Human Keypoint Estimation. Appl. Sci. 2019, 9, 2683. [Google Scholar] [CrossRef]
  7. Chen, Y.; Wei, F.; Sun, X.; Wu, Z.; Lin, S. A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation. arXiv 2023, arXiv:2203.04287. [Google Scholar] [CrossRef]
  8. Elangovan, T.; Arockia Xavier Annie, R.; Sundaresan, K.; Pradhakshya, J.D. Hand Gesture Recognition for Sign Languages Using 3DCNN for Efficient Detection. In Proceedings of the Computer Methods, Imaging and Visualization in Biomechanics and Biomedical Engineering II, Bonn, Germany, 7–9 September 2021; Tavares, J.M.R.S., Bourauel, C., Geris, L., Vander Slote, J., Eds.; Springer: Cham, Switzerland, 2023; pp. 215–233. [Google Scholar]
  9. Naz, N.; Sajid, H.; Ali, S.; Hasan, O.; Ehsan, M.K. Signgraph: An Efficient and Accurate Pose-Based Graph Convolution Approach Toward Sign Language Recognition. IEEE Access 2023, 11, 19135–19147. [Google Scholar] [CrossRef]
  10. Patel, B.D.; Patel, H.B.; Khanvilkar, M.A.; Patel, N.R.; Akilan, T. ES2ISL: An Advancement in Speech to Sign Language Translation using 3D Avatar Animator. In Proceedings of the 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), London, ON, Canada, 30 August–2 September 2020; pp. 1–5. [Google Scholar] [CrossRef]
  11. Kim, J.H.; Hwang, E.J.; Cho, S.; Lee, D.H.; Park, J. Sign Language Production With Avatar Layering: A Critical Use Case over Rare Words. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 1519–1528. [Google Scholar]
  12. Moncrief, R.; Choudhury, S.; Saenz, M. Efforts to Improve Avatar Technology for Sign Language Synthesis. In Proceedings of the 15th International Conference on PErvasive Technologies Related to Assistive Environments (PETRA’22), Corfu, Greece, 29 June–1 July 2022; pp. 307–309. [Google Scholar]
  13. Mondal, R. Mobile Cloud Computing. In Emerging Trends in Cloud Computing Analytics, Scalability, and Service Models; IGI Global: Hershey, PA, USA, 2024; pp. 170–185. [Google Scholar]
  14. Mamchych, O.; Volk, M. Smartphone Based Computing Cloud and Energy Efficiency. In Proceedings of the 2022 12th International Conference on Dependable Systems, Services and Technologies (DESSERT), Athens, Greece, 9–11 December 2022; pp. 1–5. [Google Scholar] [CrossRef]
  15. Silva, P.; Rocha, R. Low-Power Footprint Inference with a Deep Neural Network offloaded to a Service Robot through Edge Computing. In Proceedings of the SAC’23: 38th ACM/SIGAPP Symposium on Applied Computing, New York, NY, USA, 27–31 March 2023; pp. 800–807. [Google Scholar] [CrossRef]
  16. Jayasimha, A.; Paramasivam, P. Personalizing Speech Start Point and End Point Detection in ASR Systems from Speaker Embeddings. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 771–777. [Google Scholar] [CrossRef]
  17. Kim, G.; Cho, J.; Kim, B. A Keypoint-based Sign Language Start and End Point Detection Scheme. KIISE Trans. Comput. Pract. 2023, 29, 184–189. [Google Scholar] [CrossRef]
  18. Waheed, T.; Qazi, I.A.; Akhtar, Z.; Qazi, Z.A. Coal not diamonds: How memory pressure falters mobile video QoE. In Proceedings of the 18th International Conference on Emerging Networking EXperiments and Technologies, New York, NY, USA, 6–9 December 2022; CoNEXT ’22. pp. 307–320. [Google Scholar] [CrossRef]
  19. Ekbote, J.; Joshi, M. Indian sign language recognition using ANN and SVM classifiers. In Proceedings of the 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), Coimbatore, India, 17–18 March 2017; pp. 1–5. [Google Scholar] [CrossRef]
  20. Pathan, R.; Biswas, M.; Yasmin, S.; Khandaker, M.U.; Salman, M.; Youssef, A.A.F. Sign language recognition using the fusion of image and hand landmarks through multi-headed convolutional neural network. Sci. Rep. 2023, 13, 16975. [Google Scholar] [CrossRef] [PubMed]
  21. Katoch, S.; Singh, V.; Tiwary, U.S. Indian Sign Language recognition system using SURF with SVM and CNN. Array 2022, 14, 100141. [Google Scholar] [CrossRef]
  22. Kothadiya, D.; Bhatt, C.; Sapariya, K.; Patel, K.R.; Gil-González, A.B.; Corchado, J.M. Deepsign: Sign Language Detection and Recognition Using Deep Learning. Electronics 2022, 11, 1780. [Google Scholar] [CrossRef]
  23. Kothadiya, D.R.; Bhatt, C.M.; Saba, T.; Rehman, A.; Bahaj, S.A. SIGNFORMER: DeepVision Transformer for Sign Language Recognition. IEEE Access 2023, 11, 4730–4739. [Google Scholar] [CrossRef]
  24. Alharthi, N.M.; Alzahrani, S.M. Vision Transformers and Transfer Learning Approaches for Arabic Sign Language Recognition. Appl. Sci. 2023, 13, 11625. [Google Scholar] [CrossRef]
  25. Tripathi, S.; Ranade, S.; Tyagi, A.; Agrawal, A. PoseNet3D: Learning Temporally Consistent 3D Human Pose via Knowledge Distillation. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020; pp. 311–321. [Google Scholar] [CrossRef]
  26. Bird, J.J.; Ihianle, I.K.; Machado, P.; Brown, D.J.; Lotfi, A. A Neuroevolution Approach to Keypoint-Based Sign Language Fingerspelling Classification. In Proceedings of the 2023 15th International Congress on Advanced Applied Informatics Winter (IIAI-AAI-Winter), Bali, Indonesia, 11–13 December 2023; pp. 215–220. [Google Scholar] [CrossRef]
  27. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. arXiv 2021, arXiv:2106.13230. [Google Scholar] [CrossRef]
Figure 1. An example of the service flow using the proposed dynamic sampling rate adjustment method in a mobile computing environment.
Figure 1. An example of the service flow using the proposed dynamic sampling rate adjustment method in a mobile computing environment.
Applsci 14 09199 g001
Figure 2. Overall architecture of sign language recognition model based on Video Swin Transformer.
Figure 2. Overall architecture of sign language recognition model based on Video Swin Transformer.
Applsci 14 09199 g002
Figure 3. Illustration of the keypoints information used by the MediaPipe models. (Left) The Pose model handles keypoints for full-body tracking. (Right) The Hands model focuses on hand keypoints detection.
Figure 3. Illustration of the keypoints information used by the MediaPipe models. (Left) The Pose model handles keypoints for full-body tracking. (Right) The Hands model focuses on hand keypoints detection.
Applsci 14 09199 g003
Figure 4. Top-1 accuracy comparison results according to FPS.
Figure 4. Top-1 accuracy comparison results according to FPS.
Applsci 14 09199 g004
Figure 5. Top-5 accuracy comparison results according to FPS.
Figure 5. Top-5 accuracy comparison results according to FPS.
Applsci 14 09199 g005
Figure 6. Examples of how different fps affect the quality of captured keypoint images.
Table 1. Overview of sign language recognition techniques and their backbone networks.

| Authors | Techniques and Their Explanations | Backbone Network |
|---|---|---|
| Ekbote et al. [19] | Developed an Indian Sign Language numeral recognition system (0–9) using ANN and SVM classifiers. | SVM |
| Pathan et al. [20] | Employed two layers of image processing: the first for whole images and the second for hand landmarks. | CNN |
| Katoch et al. [21] | Presented an Indian Sign Language recognition system using the SURF (Speeded Up Robust Features) method for feature extraction, combined with SVM (support vector machine) and CNN (convolutional neural network) classifiers; the Bag of Visual Words (BOVW) model is used to enhance recognition. | CNN |
| Kothadiya et al. [22] | Employed LSTM and GRU models combined sequentially, used dropout for regularization, and trained on the IISL2020 dataset to enhance Indian Sign Language recognition. | LSTM and GRU |
| Kothadiya et al. [23] | Used a transformer encoder with positional embedding patches and self-attention layers, followed by a multilayer perceptron network, to enhance the recognition of fixed Indian Sign Language. | Transformer |
| Alharthi et al. [24] | Employed transfer learning with various pretrained models (MobileNet, Xception, Inception, InceptionResNet, DenseNet, BiT) and vision transformers (ViT, Swin) to recognize Arabic Sign Language, comparing their performance with CNNs trained from scratch. | ViT, Swin |
Table 2. Overview of keypoint-based techniques for sign language recognition.

| Authors | Techniques and Their Explanations | Backbone Network |
|---|---|---|
| Tripathi et al. [25] | Developed a sign language recognition system employing video-based, skeleton-based, and deep learning techniques, enhanced by data augmentation for robust performance across various environments. | PoseConv3D |
| Bird et al. [26] | Proposed a neuroevolution approach to enhance fingerspelling recognition in sign language by optimizing deep neural networks through evolutionary algorithms; ASL fingerspelling images are processed into normalized keypoints, which are then used as inputs for the neural networks. | Neuroevolution-optimized deep neural network |
| Kim et al. [17] | Proposed a method for detecting the start and end points of sign language utterances using the MediaPipe Holistic model. | R(2+1)D |
Table 3. Symbols and descriptions.

| Symbols | Descriptions |
|---|---|
| $F_{total}$ | Total number of frames in a sign language utterance |
| $k_{idx}$ | Index of a keypoint |
| $K_{max}$ | Maximum index for keypoints |
| $S_r$ | Sampling rate |
| $N_{change}$ | Total number of keypoint position change records ($N_{change} = F_{total} / S_r$) |
| $C_i$ | Change record of the $i$-th keypoint position |
| $x_i^{k_{idx}}$ | x-coordinate of the keypoint at index $k_{idx}$ in the $i$-th frame |
| $y_i^{k_{idx}}$ | y-coordinate of the keypoint at index $k_{idx}$ in the $i$-th frame |
| $K_{info}$ | The x- and y-coordinate information of keypoints for all frames |
| $C_{total}$ | Sum of all keypoint position change records ($C_{total} = \sum_{i=1}^{N_{change}} C_i$) |
| $C_{avg}$ | Average of all keypoint position change records (used as the threshold value) |
| $S_{start}$ | Detected start frame of the sign language utterance |
| $S_{end}$ | Detected end frame of the sign language utterance |
| $T_{utterance}$ | Duration of the sign language utterance in mobile environments |
| $N_{frames}$ | Total number of frames captured during $T_{utterance}$ |
| $fps_{server}$ | Frame rate in server environments |
| $fps_{mobile}$ | Frame rate in mobile environments |
| $SR_{server}$ | Optimal fixed sampling rate in server environments |
| $SR_{mobile}$ | Dynamic sampling rate in mobile environments |
| $R_{fps}$ | Ratio of $fps_{mobile}$ to $fps_{server}$ |
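To make the notation concrete, the sketch below combines the symbols in Table 3 into a small routine: the mobile sampling rate is obtained by scaling the server-side rate with $R_{fps}$, and the averaged change record $C_{avg}$ is used as the threshold for locating $S_{start}$ and $S_{end}$. The specific relation $SR_{mobile} = \mathrm{round}(SR_{server} \times R_{fps})$ and the displacement measure used for $C_i$ are assumptions made for illustration, not the paper's exact formulation.

```python
# Illustrative interpretation of the Table 3 symbols; not the authors' exact method.
import numpy as np

def dynamic_sampling_rate(sr_server: int, fps_mobile: float, fps_server: float) -> int:
    """Scale the server-side sampling rate SR_server by the frame-rate ratio R_fps (assumed relation)."""
    r_fps = fps_mobile / fps_server
    return max(1, round(sr_server * r_fps))

def detect_start_end(keypoints: np.ndarray, sampling_rate: int):
    """keypoints: array of shape (F_total, K_max + 1, 2) holding (x, y) per keypoint per frame."""
    sampled = keypoints[::sampling_rate]
    if len(sampled) < 2:                        # not enough samples to measure change
        return 0, len(keypoints) - 1
    # C_i: summed keypoint displacement between consecutive sampled frames
    changes = np.linalg.norm(np.diff(sampled, axis=0), axis=2).sum(axis=1)
    c_avg = changes.mean()                      # C_avg acts as the threshold
    active = np.where(changes > c_avg)[0]
    if active.size == 0:                        # no motion above the threshold
        return 0, len(keypoints) - 1
    s_start = active[0] * sampling_rate         # map sampled indices back to frame indices
    s_end = min((active[-1] + 1) * sampling_rate, len(keypoints) - 1)
    return s_start, s_end
```

For example, with $SR_{server} = 4$, $fps_{server} = 30$, and $fps_{mobile} = 15$, the routine above yields $SR_{mobile} = 2$, i.e., every second mobile frame is sampled.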
Table 4. Hardware and software configuration for experimental environment.

| Components | Specifications |
|---|---|
| CPU | AMD Ryzen 9 7950X3D, 16 cores |
| RAM | Samsung DDR4 64 GB × 2 |
| GPU | ZOTAC GeForce RTX 4090 × 2, 24 GB GDDR6X each |
| OS | Ubuntu 22.04.2 LTS |
| CUDA Version | 12.2 |
| Python | 3.10.11 |
| MediaPipe | 0.10.11 |
| OpenCV-Python | 4.7.0.72 |
| Torch | 2.0.1 |
| Torchvision | 0.15.2 |
Table 5. Hyperparameters and their settings to train the sign language recognition model.

| Hyperparameters | Configurations |
|---|---|
| Loss Function | Label smoothing cross-entropy |
| Learning Rate | $1 \times 10^{-3}$ |
| Optimizer | SGD |
| Epochs | 200 |
| Batch Size | 4 |
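The settings in Table 5 map directly onto standard PyTorch components (Torch 2.0.1, Table 4); a sketch is shown below. The placeholder model and dataset, and the label smoothing factor of 0.1, are assumptions, since the table does not specify them.

```python
import torch

# Placeholder model and dataset for illustration only; substitute the Video Swin
# based recognition model and the sign language dataset from Table 6.
model = torch.nn.Linear(10, 510)
features = torch.randn(8, 10)
labels = torch.randint(0, 510, (8,))
train_dataset = torch.utils.data.TensorDataset(features, labels)

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing CE (factor assumed)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)     # SGD, learning rate 1e-3
loader = torch.utils.data.DataLoader(train_dataset, batch_size=4, shuffle=True)

for epoch in range(200):                                     # Epochs = 200
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```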
Table 6. Dataset composition.

| Sets | Samples | Classes | Words | Sentences |
|---|---|---|---|---|
| Training | 28,868 | 510 | 405 | 105 |
| Testing | 6186 | 510 | 405 | 105 |
| Validation | 6186 | 510 | 405 | 105 |
Table 7. Comparison of the number of extracted images, keypoints extraction frequency, and recognition accuracy according to FPS (when the utterance time is 5 s).

| FPS | Total Number of Frames | Keypoints Extraction Frequency | Recognition Accuracy |
|---|---|---|---|
| 5 fps | 25 | 5 | Low |
| 25 fps | 125 | 25 | High |
| 30 fps | 150 | 30 | Very High |
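The frame counts in Table 7 follow directly from the notation in Table 3:

$$N_{frames} = fps_{mobile} \times T_{utterance}, \qquad \text{e.g., } 25\ \text{fps} \times 5\ \text{s} = 125\ \text{frames}.$$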
Table 8. Confusion matrix results for top-1 (accuracy, precision, recall, F1-score) according to frame rates and methods.

| Frame Rate | Method | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 5 fps | Original | 0.52 | 0.62 | 0.52 | 0.51 |
| 5 fps | Fixed | 0.11 | 0.59 | 0.11 | 0.17 |
| 5 fps | Dynamic | 0.77 | 0.81 | 0.77 | 0.76 |
| 10 fps | Original | 0.84 | 0.86 | 0.84 | 0.83 |
| 10 fps | Fixed | 0.67 | 0.73 | 0.67 | 0.67 |
| 10 fps | Dynamic | 0.85 | 0.87 | 0.85 | 0.84 |
| 15 fps | Original | 0.87 | 0.90 | 0.87 | 0.88 |
| 15 fps | Fixed | 0.70 | 0.73 | 0.70 | 0.69 |
| 15 fps | Dynamic | 0.88 | 0.89 | 0.88 | 0.87 |
| 20 fps | Original | 0.83 | 0.90 | 0.83 | 0.85 |
| 20 fps | Fixed | 0.89 | 0.91 | 0.89 | 0.89 |
| 20 fps | Dynamic | 0.90 | 0.91 | 0.90 | 0.90 |
| 25 fps | Original | 0.77 | 0.89 | 0.77 | 0.81 |
| 25 fps | Fixed | 0.92 | 0.93 | 0.92 | 0.92 |
| 25 fps | Dynamic | 0.92 | 0.93 | 0.92 | 0.92 |
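For reference, metrics of the kind reported in Table 8 can be computed from predicted and ground-truth labels with standard tooling; a sketch using scikit-learn is shown below. scikit-learn is not listed in Table 4, and the weighted averaging scheme is an assumption, since the table does not state how precision, recall, and F1-score are averaged across classes.

```python
# Sketch of computing top-1 accuracy, precision, recall, and F1 (toy labels).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [3, 1, 7, 3, 2]          # ground-truth class indices (toy example)
y_pred = [3, 1, 7, 2, 2]          # top-1 predictions from the recognition model

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"Acc {accuracy:.2f}  P {precision:.2f}  R {recall:.2f}  F1 {f1:.2f}")
```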