1. Introduction
Sign language prioritizes manual communication, using hand gestures, body language, and lip movements instead of sound to communicate [1,2]. Usually, sign language is used by people who are deaf or hard of hearing, but it can also be used in situations where it is impossible or difficult to hear. Therefore, a sign language recognition (SLR) system is needed, since it helps to connect people who are hard of hearing with those who are not.
In recent years, researchers have focused much attention on SLR because of the rich visual information it provides. Recent SLR studies are usually grouped into isolated sign language recognition (ISLR) and continuous sign language recognition (CSLR). Several works address only ISLR [3,4], while others analyze only easier tasks, such as static gestures for alphabet recognition [5]. Meanwhile, the latest methods are usually more complicated, as they solve CSLR tasks [6,7,8]. Compared to ISLR, CSLR is a more challenging problem because it involves the reconstruction of sentences.
CSLR research is still in great demand because its implementation is closely related to everyday, real-world conditions. The aim of this approach is to recognize the series of glosses that occur in a video sequence with little or no explicit segmentation. Furthermore, it incorporates a great deal of machine learning research and a thorough understanding of human behavior; for instance, it involves human movement tracking [9], gesture recognition [10], and facial recognition [11]. Nevertheless, there are several challenges in performing CSLR tasks.
First, data collection and annotation are expensive for CSLR [12]. This is perhaps one of the main challenges in its development, since CSLR involves a large network, and the amount of data strongly affects performance [13]. Moreover, several available sign language datasets are only weakly annotated [12,14,15]. To address this issue, numerous studies have adopted a weakly supervised approach, applying an alignment module and a feature extractor module to the network architecture [12].
Second, compared to ISLR, CSLR is more complicated. Sufficient information is acquired by using several features; this has been proven to achieve better performance than using a single feature, as reported in previous works [16,17,18]. These multiple features consist of the main feature, a body image, which achieves the highest accuracy, and additional features, such as pose, head, left hand, and right hand, which have lower individual accuracy [17,18]. Training a large network with a large amount of data is time consuming [13]. Adding input streams also increases the training time, while using additional image-based features increases the cost [19]. Therefore, we need to choose the important features so that we can train efficiently.
Third, a video input contains a large number of images in sequence. Some images have an unclear hand shape due to fast movement, which may lead to incorrect information. Therefore, our proposed model utilizes self-attention based on [20] to help select the important information. Moreover, self-attention has been shown [21,22] to enhance performance.
Therefore, we propose a novel model, called the spatio-temporal attentive multi-feature (STAMF) model, to handle all of these problems. We followed previous works [17,23], which have been proven to work for CSLR with weak annotation. They construct the model from three main components: a spatial module, a temporal module, and a sequence learning module. We propose an efficient and effective multi-feature input that uses the full-frame feature along with keypoint features to perform CSLR tasks. The full-frame feature represents the body image as the main feature, and the keypoint features serve as the additional feature. The keypoint features describe the body pose, including the details of the hand pose. The body pose is the most effective additional feature, since previous works have shown that it achieves the highest accuracy after the full-frame feature [17,18]. We also utilize an attention module that uses self-attention based on [20] to capture the important features and to help the sequence learning module enhance performance.
The contributions of this manuscript are summarized as follows:
We introduce a novel temporal attention into the sequence module to capture the important time points that contribute to the final output;
We introduce multi-features consisting of the full-frame feature, taken from the RGB values of the frame, as the main feature, and keypoint features, which include the body pose with hand shape details, as additional features to enhance recognition performance;
We use the WER metric to show through experiments that our proposed STAMF model outperforms state-of-the-art models on both CSLR benchmark datasets.
2. Related Works
There have been several advancements in technology, and much research has been done on SLR. Previous studies [24,25,26,27] explored ISLR, in which each word is segmented. In recent years, deep learning-based methods have been used to extract features using convolutional networks, either 2D [28,29] or 3D [30,31], for their strong visual representation. The majority of early research on sign language recognition centered on ISLR with multimodal characteristics [30,31,32], such as RGB, depth maps, and skeletons, which give better performance.
Nowadays, CSLR has become more popular, even though the video is not clearly segmented between words. Early works use a CNN feature extractor [6,33] and an HMM [34] to build the target sequence. Some recent research on CSLR systems [17,23] includes three main steps in performing the recognition task: first, spatial feature extraction; then temporal segmentation; and finally sentence synthesis with a language model [35] or with sequence learning [17,23]. This sequence learning uses Bi-LSTM and CTC to mine the relationships between sign glosses in the video sequences. Even though they use weak annotation, where unsegmented video sequences define the sign glosses, these approaches have shown promising results.
However, the most recent related CSLR study that implemented a multi-feature approach [17] used five features simultaneously. Such a multi-feature approach is heavier than using fewer features [19]. It also cannot handle noisy frames in the video sequence that carry unclear information, such as a blurry hand shape due to fast movement. Moreover, relying on RNN-based sequence learning may cause problems with long sequences and may lose the global context [20].
The current research aims to improve performance by adding a self-attention mechanism [21,22] that can handle longer sequences to learn the global context. Self-attention is based on early research [20] showing that it has the advantage of handling long dependencies, although it learns shorter paths more easily than longer ones. In previous CSLR works [21,22], self-attention helped the network learn the features more effectively.
Therefore, in this paper we introduce a novel spatio-temporal attentive multi-feature model. The proposed model effectively extracts the important features and learns the sequence better by providing important information through a self-attention mechanism applied to the multi-features. All processes are executed in an end-to-end manner.
3. Proposed Method
This section details the core techniques of our proposed model for CSLR. We begin with an overview of the proposed model and then provide details of each key component: the spatial module, the temporal module, and the sequence learning module. We also explain our proposed attention module, which helps the model learn better. Finally, we integrate the framework for training and inference into our proposed model.
3.1. Framework Overview
Given a video input, our proposed model aims to predict the correct gloss sentence corresponding to the signs. The first module generates multiple spatial features, namely full-frame and keypoint features, for each of the T frames of the video. Then, the temporal module extracts the temporal correlations of the spatial features between frames for both streams. As a final step, the spatial and temporal networks are linked to a bidirectional long short-term memory (Bi-LSTM) network and CTC for sequence learning and inference. Next, we explain each main component in more detail. An overview of our proposed architecture is shown in Figure 1.
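To make the data flow concrete, the following skeleton sketches how the three modules connect in a forward pass. The module interfaces, names, and tensor shapes are our own illustrative assumptions, not the exact implementation:

```python
import torch.nn as nn

class STAMF(nn.Module):
    """Skeleton of the proposed pipeline: spatial module -> temporal module
    -> Bi-LSTM + CTC sequence learning. All submodules are placeholders;
    names and shapes are illustrative assumptions."""
    def __init__(self, spatial, temporal, sequence_learning):
        super().__init__()
        self.spatial = spatial              # full-frame + keypoint feature extraction
        self.temporal = temporal            # stacked temporal pooling with attention
        self.sequence = sequence_learning   # Bi-LSTM head producing CTC log-probs

    def forward(self, frames, keypoints):
        # frames: (B, T, 3, H, W); keypoints: per-frame keypoint input
        full, kp = self.spatial(frames, keypoints)  # per-frame spatial features
        fused = self.temporal(full, kp)             # (B, T', C), both streams fused
        return self.sequence(fused)                 # (B, T', vocab+1) log-probs
```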
3.2. Spatial Module
The spatial module exploits a full-frame feature and keypoint features, as shown in Figure 2. This module uses a 2D-CNN architecture as the backbone, and ResNet50 is chosen to capture the multi-features. ResNet50 is more efficient to use than more recent ResNet architectures in terms of time, while achieving comparable results [36,37]. The RGB stream uses ResNet50 directly, while the keypoints are obtained with HRNet [38] from the video frames and then processed by ResNet50 to obtain the keypoint features.
3.2.1. Full-Frame Feature
We applied our preprocessing steps to the RGB data and then fed the result into our architecture as the full-frame input. Figure 3 shows the original RGB image on the left side and the cropped image on the right side; the cropped image is used as the model input. This illustrates the preprocessing step, which removes the less important parts of the image and puts more focus on the signer. The cropping uses the random cropping method from [12] to augment the dataset. The full-frame feature is extracted from the cropped image for each frame in the sequence using ResNet50.
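As an illustration, a minimal sketch of applying one random crop consistently across a frame sequence; the crop size and the use of torchvision here are our own illustrative assumptions, not the exact preprocessing of [12]:

```python
import random
import torchvision.transforms.functional as F

def random_crop_sequence(frames, crop_size=224):
    """Apply one random crop consistently to every frame of a video clip.
    frames: list of PIL images for one video; crop_size is illustrative.
    Using the same crop for all frames keeps the signer spatially aligned
    across time while still augmenting the dataset."""
    w, h = frames[0].size
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    return [F.crop(f, top, left, crop_size, crop_size) for f in frames]
```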
3.2.2. Keypoint Features
We extracted the keypoint features in the spatial module from the RGB data for each frame of the video input. The quality of the keypoint features plays an important role in our proposed model, so we need a robust approach, such as HRNet [38]. We employed pretrained HRNet [38] to estimate all 133 body keypoints, and we utilized 27 of the 133 keypoints from its result. As shown in Figure 4, the left side shows the original upper-body keypoints, and the right side shows the selected 27 upper-body keypoints. These 27 keypoints include the wrists, elbows, shoulders, neck, hands, and fingers.
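For illustration, a minimal sketch of this keypoint subset selection; the concrete index list is a placeholder, since the exact 27 indices are shown in Figure 4 rather than stated in the text:

```python
import numpy as np

# Hypothetical index list: 27 of HRNet's 133 whole-body keypoints
# (wrists, elbows, shoulders, neck, hands/fingers). The indices below
# are placeholders, not the published selection.
SELECTED_KEYPOINTS = np.arange(27)  # replace with the actual 27 indices

def select_upper_body(keypoints_133):
    """keypoints_133: (T, 133, 3) array of (x, y, confidence) per frame.
    Returns the (T, 27, 3) subset used as the keypoint feature input."""
    return keypoints_133[:, SELECTED_KEYPOINTS, :]
```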
3.3. Temporal Module
The temporal module aims to learn spatio-temporal information from the spatial module. It is constructed from stacked temporal pooling blocks for each stream. As shown in Figure 5, the temporal pooling block consists of a temporal convolution layer and a pooling layer that extract features from the sequential inputs. The input is a list of spatial multi-features from the previous stage. The temporal feature is obtained using the temporal convolution layer, a single 1D convolutional layer with the same input and output lengths, followed by a single pooling layer that halves the temporal length. Using two stacked temporal pooling blocks is the best configuration, according to previous works [12]. After each temporal pooling block, we embed an attention module, which is explained in detail in Section 3.4. At the end, we concatenate the outputs of the temporal pooling from both streams.
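A minimal sketch of one temporal pooling block under these constraints; the kernel size and channel width are assumptions, as the text fixes only the length-preserving convolution and the halving pooling:

```python
import torch.nn as nn

class TemporalPooling(nn.Module):
    """One temporal pooling block: a length-preserving 1D convolution
    followed by a pooling layer that halves the temporal length.
    Kernel size and channel width are illustrative assumptions."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)   # same output length
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)  # halves length

    def forward(self, x):                       # x: (B, C, T)
        return self.pool(self.conv(x).relu())   # -> (B, C, T // 2)

# Two stacked blocks per stream, as in the configuration described above.
temporal_stream = nn.Sequential(TemporalPooling(512), TemporalPooling(512))
```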
3.4. Attention Module
The video has multiple frames, and some parts of an image are sometimes blurry. The RWTH-PHOENIX dataset [33,39] has more defective frames than the CSL dataset [8,40,41]. This happens when the movement is too fast, creating a blurry image and resulting in wrong keypoint locations. Such frames are considered defective and potentially lead to misinterpretation of both the RGB and keypoint features. Figure 6 shows an illustration of defective frames in the RWTH-PHOENIX dataset [33]. To deal with this problem, we added an attention layer.
Using the CTC algorithm, the alignment path and its labeling are obtained by introducing a blank label and removing repeated labels. When CTC cannot distinguish a gloss boundary, it prefers to predict blank labels rather than gloss boundaries, and none of these results are convincing. This leads a network trained with CTC to produce spiky outputs when analyzing, learning, and predicting [42,43]. Generally, the CTC loss seeks the keyframes, and the final result is the prediction for a particular keyframe that has a high probability of being a blank label or a non-blank label. If the model predicts the same label or a blank label consecutively, this results in the same output. However, if a different label is inserted between two identical labels, even a single such mistake results in a much bigger loss. Here, the addition of an attention layer helps to select the important temporal steps before they are used for sequence learning.
The attention module uses a multi-head self-attention mechanism [20]. The multi-head module runs several parallel attention mechanisms at the same time. Each head runs independently, so short-term and long-term dependencies can be handled in separate heads. The outputs are then concatenated and linearly transformed into the desired shape.
Concurrently, the multi-head self-attention mechanism attends to information from multiple representation subspaces, depending on the history of observations. For simplicity, we denote the input sequence as $X$. Mathematically, for the single-head attention model, given the input $X_{t-T+1:t} = [X_{t-T+1}, \cdots, X_t] \in \mathbb{R}^{T \times N \times P}$, three subspaces are obtained, namely, the query subspace $Q \in \mathbb{R}^{N \times d_q}$, the key subspace $K \in \mathbb{R}^{N \times d_k}$, and the value subspace $V \in \mathbb{R}^{N \times d_v}$. The latent subspace learning process can be formulated as [20]:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V.$$

Then, the scaled dot-product attention is used to calculate the attention output as [20]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V.$$

Furthermore, if we have multiple heads that concurrently follow multiple representations of the input, we can obtain more relevant results at the same time. The final step is to concatenate all of the heads and project them again to calculate the final score [20]:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i),$$

where $Q_i = XW^Q_i$, $K_i = XW^K_i$, $V_i = XW^V_i$, and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. Finally, the attention module can select the important parts of the feature sequence, because not all information in the sequence is important.
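To make these equations concrete, the following is a minimal PyTorch implementation of multi-head self-attention as formulated above; the layer sizes are illustrative:

```python
import math
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal implementation of the equations above (after [20]).
    d_model and num_heads are illustrative assumptions."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)  # W^Q
        self.w_k = nn.Linear(d_model, d_model)  # W^K
        self.w_v = nn.Linear(d_model, d_model)  # W^V
        self.w_o = nn.Linear(d_model, d_model)  # W^O

    def forward(self, x):                       # x: (B, T, d_model)
        B, T, _ = x.shape
        # Project and split into h heads: (B, h, T, d_k)
        q, k, v = (w(x).view(B, T, self.h, self.d_k).transpose(1, 2)
                   for w in (self.w_q, self.w_k, self.w_v))
        # Scaled dot-product attention per head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        out = scores.softmax(-1) @ v            # (B, h, T, d_k)
        # Concatenate heads and apply the output projection W^O
        return self.w_o(out.transpose(1, 2).reshape(B, T, -1))
```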
As shown in Figure 7, we use the attention module in several configurations. The first attention module is placed at the end of the spatial module, while the second and third attention modules are placed in the temporal module. The second attention module, called the early temporal attention module, is placed after the first temporal pooling block, whereas the third, called the late temporal attention module, is placed after the second temporal pooling block.
3.5. Sequence Learning
After obtaining the results from the spatial and temporal feature extractors in the form of gloss features, we need to arrange them into a complete sentence. Therefore, we use sequence learning to ensure that the glosses are ordered to form a good sentence. The most prevalent method in current research combines bidirectional long short-term memory (Bi-LSTM) and connectionist temporal classification (CTC) [17,23] to determine the highest probabilities over all potential alignments. Many sequence problems can be solved using CTC, and several recent works [17,23] adopt CTC for end-to-end training in CSLR.
Long short-term memory (LSTM) [44] is a variant of the recurrent neural network (RNN) and is widely used in sequence modeling. LSTM models long-term dependencies excellently; it can process the entire input sequence and use its internal state to model the state transitions. However, the shortcoming of forward RNNs is that the hidden states are learned from only one direction. After receiving the feature sequence as input, the RNN produces a hidden state as described by the equation below:

$$h_t = f(h_{t-1}, x_t),$$

where $h_t$ represents the hidden state, $x_t$ is the input feature at time step $t$, the initial state $h_0$ is a fixed all-zero vector, and $f$ is the recurrent unit. The RNN is used to predict the sign glosses according to the spatial feature sequence input.
Moreover, SL tasks are quite complex. They not only require features from a one-way context but also need two-way information to learn which words come before and after other words in the sentence. Consequently, we utilize Bi-LSTM [45] to learn the complex dynamic dependencies of image sequences by transforming their spatial and temporal representations into sequences of gloss features. As stated earlier, the Bi-LSTM method performs the LSTM computation in both directions, which means it can better classify sequential data. The computation of the Bi-LSTM's two-way hidden states can be seen as repeated LSTM computations, processed from front to back and then from back to front. Afterwards, the hidden state of each time step is passed through a fully connected layer and a softmax layer [17]:

$$y_t = \mathrm{softmax}(W h_t + b),$$

where $b$ is a bias vector and $y_{t,j}$ represents the probability of label $j$ at time step $t$. In our SL task, label $j$ is a word from the vocabulary obtained from the sentences. In practice, Bi-LSTM significantly increases the amount of information a network can access, which in turn improves the context available to the algorithm. This information contains the knowledge of which word comes after or before the current frame in the sentence feature sequence input.
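A minimal sketch of this sequence learning head, with illustrative dimensions (the extra output class anticipates the CTC blank label introduced below):

```python
import torch.nn as nn

class SequenceLearning(nn.Module):
    """Bi-LSTM over the fused temporal features, followed by a fully
    connected layer and softmax giving the per-step gloss probabilities
    y_{t,j} of the equation above. Hidden sizes are illustrative."""
    def __init__(self, feat_dim, hidden_dim, vocab_size):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, vocab_size + 1)  # +1: CTC blank

    def forward(self, feats):        # feats: (B, T, feat_dim)
        h, _ = self.bilstm(feats)    # (B, T, 2*hidden_dim), both directions
        return self.fc(h).log_softmax(-1)  # per-step log-probs for CTC
```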
Bi-LSTM sends hidden states as output to the CTC layer for each time step. Since the results of the Bi-LSTM do not account for the gloss arrangement, we use CTC to map the video sequence to a sign gloss sequence $l$ with a better arrangement. CTC is used in many fields, including handwriting recognition [46], speech recognition [47], and sign language recognition [17,23]. In this domain, it is used as a scoring function that does not require aligning the input and output sequences. In CSLR, CTC is known as a module developed for end-to-end tasks involving the classification of temporal data without segmentation. Through dynamic programming, CTC is able to solve the alignment issue by creating a blank label (-), which permits the system to produce optimal results for data that has no labels (such as epenthesis data and segments of non-gesture data). In particular, CTC generates a blank label "-" to extend the vocabulary $V$, where $V = V_{origin} \cup \{-\}$. The purpose of this blank label is to represent the transitions and to indicate the existence of a blank gloss that does not provide information for the learning process. As a result, CTC is able to manage the output of the spatial and temporal feature extractors, with the Bi-LSTM as the alignment module, by summarizing the probabilities of all possible paths. Each alignment path $\pi$ has a probability given the input sequence $X$, as follows [17]:

$$p(\pi|X) = \prod_{t=1}^{T} p(\pi_t|X).$$
In the next step, we define a many-to-one mapping operation $B$, which removes the blanks and duplicate words from the alignment path. For example, $B$(II-am- -a- -doctor) = I, am, a, doctor. As a result, we are able to calculate the conditional probability of the sign gloss sequence $l$ as the sum of the probabilities of all paths that can be mapped to $l$ by $B$, as described below [17]:

$$p(l|X) = \sum_{\pi \in B^{-1}(l)} p(\pi|X),$$

where $B^{-1}(l) = \{\pi \mid B(\pi) = l\}$ is the inverse operation of $B$.
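As an illustration, a minimal sketch of the mapping B, assuming glosses are represented as integer ids with 0 as the blank label (both assumptions for illustration):

```python
def ctc_collapse(path, blank=0):
    """Many-to-one mapping B: merge consecutive duplicates, then drop
    blanks. path: list of per-step label ids; blank id 0 is an assumption."""
    collapsed, prev = [], None
    for label in path:
        if label != prev and label != blank:
            collapsed.append(label)
        prev = label
    return collapsed

# A repeated gloss survives only if separated by a blank:
# ctc_collapse([5, 5, 0, 3, 0, 0, 3]) -> [5, 3, 3]
```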
Finally, the CTC loss of the feature sequence is defined as follows [17]:

$$\mathcal{L}_{\mathrm{CTC}} = -\ln p(l|X).$$

The conditional probability $p(\pi|X)$ can be calculated according to the conditional independence assumption.
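For completeness, this is how such a loss can be computed with PyTorch's built-in CTC implementation; the tensor sizes below are arbitrary illustrative values, and blank id 0 is an assumption:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # blank id 0 is an assumption

# log_probs: (T, B, V+1) per-step log-probabilities from the Bi-LSTM head
T, B, V = 50, 2, 1000
log_probs = torch.randn(T, B, V + 1).log_softmax(-1)
targets = torch.randint(1, V + 1, (B, 8))      # gloss id sequences (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 8, dtype=torch.long)

# Sums p(pi|X) over all paths pi with B(pi) = l via dynamic programming
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```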