Article

Beyond Granularity: Enhancing Continuous Sign Language Recognition with Granularity-Aware Feature Fusion and Attention Optimization

1
School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
2
Science and Technology on Integrated Information System Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8937; https://doi.org/10.3390/app14198937
Submission received: 5 September 2024 / Revised: 28 September 2024 / Accepted: 30 September 2024 / Published: 4 October 2024

Abstract

The advancement of deep learning techniques has significantly propelled the development of the continuous sign language recognition (cSLR) task. However, spatial feature extraction from sign language videos in the RGB space tends to focus on overall image information while neglecting traits at different granularities, from finer cues such as eye gaze and lip shape to more macroscopic ones such as posture and gestures. Exploring the efficient fusion of visual information at different granularities is crucial for accurate sign language recognition. In addition, applying a vanilla Transformer to sequence modeling in cSLR exhibits weak performance because specific video frames can interfere with the attention mechanism. These limitations constrain the capability to understand latent semantic characteristics. We introduce a feature fusion method for integrating visual features of disparate granularities and refine the attention metric to enhance the Transformer's comprehension of video content. Specifically, we extract CNN feature maps with varying receptive fields and employ a self-attention mechanism to fuse feature maps of different granularities, thereby obtaining multi-scale spatial features of the sign language frames. For video modeling, we first analyze why the vanilla Transformer fails in cSLR and observe that the magnitudes of the feature vectors of video frames can interfere with the distribution of attention weights. Therefore, we measure attention weights with the Euclidean distance among vectors instead of the scaled dot product to enhance dynamic temporal modeling capabilities. Finally, we integrate the two components to construct the MSF-ET (Multi-Scaled feature Fusion–Euclidean Transformer) model for cSLR and train it end-to-end. We perform experiments on large-scale cSLR benchmarks, PHOENIX-2014 and Chinese Sign Language (CSL), to validate the effectiveness of our method.

1. Introduction

Sign language serves as a natural and primary means of communication for hearing-impaired individuals, enabling them to convey complex concepts and emotions. The continuous sign language recognition (cSLR) task aims at converting sign videos into a sequence of glosses, and it involves extracting spatial features and interpreting temporal sequences [1]. Therefore, most cSLR model architectures incorporate a convolutional neural network (CNN)-based visual module for spatial analysis and a recurrent neural network (RNN)-based sequential module for temporal sequence interpretation to acquire comprehensive contextual information [2,3,4,5,6].
The semantic information conveyed through sign language is not solely reflected in hand gestures but also encompasses high-level semantic features such as body posture, subtle facial expressions, lip movements, and eye gaze [7,8]. Thus, the representation of semantic features in sign language videos is multi-scale, making it crucial to comprehensively extract and integrate spatial features of different granularities. On the other hand, the lack of boundary annotations in prevalent datasets, which only provide gloss-level annotations, poses a significant challenge to continuous sign language recognition; cSLR is therefore regarded as a weakly supervised learning problem [9]. Additionally, aligning extended sign videos with glosses adds to the complexity [10]. Hence, multi-scale perception of spatial features in continuous sign language gestures and temporal modeling of video sequences are crucial for improving recognition accuracy.
Currently, the predominant research approach involves extracting image features using CNNs, followed by temporal modeling with LSTMs or 1D-CNNs to perceive the semantic information in sign language videos [11,12]. Three-dimensional (3D) CNNs have also been used to integrate temporal and spatial features for continuous sign language recognition [6,13,14,15,16]. At the same time, research teams have achieved end-to-end learning for sign language recognition and translation using the Transformer model [17,18]. Moreover, hidden Markov models and iterative EM algorithms were applied in earlier studies to enhance sequence modeling capabilities [19,20]. Although previous research has improved cSLR accuracy, there are still limitations. First, in terms of spatial feature extraction, current studies mainly use convolutional neural networks (CNNs) to convert RGB or depth information into feature maps [10] without considering multi-level and multi-granular details, and non-manual semantic information, such as facial expressions and body postures, has not been effectively integrated into existing recognition frameworks. For temporal modeling [21], RNN-based models excel at understanding sign language sequences but struggle to parallelize sequence processing and to capture long-distance contextual relationships due to their network structure [22,23]. Although the Transformer model can address this issue [24], cSLR is a video–text multimodal task, where the information density and redundancy of the video sequence are much higher than those of text data [25]. Therefore, the original self-attention mechanism cannot fully capture the contextual information and struggles to understand the semantics of sign language gestures, which leads to weak recognition accuracy [18].
To address the above issues, we propose MSF-ET (Multi-Scaled feature Fusion–Euclidean Transformer), a novel framework for continuous sign language recognition. We improve both spatial feature extraction and temporal modeling to enhance the learning of high-level semantic information. Although multi-scale features are widely used in visual tasks, for SLR it is more important to fully integrate visual information of different granularities to model sign language semantics; however, previous work has overlooked feature fusion methods. We design a multi-scaled feature fusion module to detect image details at different granularities in RGB images, as shown in Figure 1. Whereas existing work has employed multi-scale ideas to enhance continuous sign language recognition [9], using KNN to achieve multi-scaling along the temporal dimension, our approach diverges significantly: our multi-scale focus emphasizes the incorporation of multi-granularity visual information. More critically, we introduce a multi-scale feature fusion method into cSLR. Although similar methods have been applied in other fields, the adaptation and integration of granularity-aware feature fusion and attention optimization for cSLR present a distinct implementation tailored to the intricacies of sign language, which involves dynamic and nuanced gestures and expressions. Our spatial perception module outputs fused features incorporating coarse-grained and fine-grained characteristics, enhancing the model's understanding of semantic information. Regarding the temporal dimension, we first analyze the problems of the original Transformer architecture in sign language video perception and observe that the dot-product attention mechanism is susceptible to the influence of vector magnitudes when calculating attention weights. Building on this foundation, we refine the traditional dot-product attention mechanism by designing a Transformer model based on a Euclidean distance attention metric for effective modeling of video sequences.
Our main contributions are summarized as follows:
  • We introduce a methodology for fusing multi-scale visual feature maps, leveraging a self-attention mechanism to facilitate the flexible and dynamic integration of visual feature maps across diverse dimensions and granularities.
  • We analytically examine the limitations of the vanilla Transformer in modeling sign language videos and propose an improvement to the self-attention mechanism by substituting the dot product with Euclidean distance.
  • Extensive experiments and ablation studies validate the efficacy of our method, significantly enhancing the performance of Transformer-based models on the cSLR task.

2. Related Work

2.1. Continuous Sign Language Recognition

Continuous sign language recognition (cSLR) aims at converting a sign language video into a gloss sequence. The challenge lies in the lack of explicit boundaries between glosses in videos, which makes cSLR a weakly supervised sequence labeling problem [10,26]. Deep learning advances cSLR by focusing on spatial representation and temporal modeling of videos. Two-dimensional convolutional neural networks (2D-CNNs) are employed to perceive the static information in video frames. Several state-of-the-art pre-trained CNNs, such as VGGs [27], ResNets [28], and EfficientNets [29], are introduced in cSLR, which strengthens the recognition and understanding of frame semantics [30,31,32]. Cui et al. integrate a 2D-CNN with temporal convolution to learn the spatial–temporal information [3]. Similarly, Cheng et al. propose a fully convolutional network (FCN) for cSLR that extracts frame features and models the feature sequence with a 1D-CNN [12]. Furthermore, to comprehend movement information, 3D-CNNs are utilized to learn motion characteristics, providing a more comprehensive understanding of dynamic actions in sign language videos [33,34].
Sequence modeling in cSLR learns the mapping between sign language videos and gloss sequences. The hidden Markov model (HMM) was adopted in earlier research to predict glosses through state transitions. Koller et al. adopted HMMs in combination with CNNs to achieve sign language recognition [4,19]. With the emergence of recurrent neural networks, improved models such as LSTM and GRU, which are more suitable for sequence processing, were applied to this task. Cui et al. introduced a bi-directional LSTM to model the sign sequences [3]. In addition, as the input video sequence is considerably longer than the corresponding glosses, the connectionist temporal classification (CTC) [35] approach was adopted for sequence alignment. Recent studies demonstrated that CTC effectively aligns unsegmented data by dynamically adjusting to varying sequence lengths, making it a powerful solution for handling lengthy and complex video sequences [13,36].
There have been several endeavors aimed at enhancing performance. Xie et al. designed a position-aware temporal convolution to capture locally consistent semantics within temporally neighboring frames [9]. This method addresses the issue of CNNs being agnostic to similarity or dissimilarity. Zhou et al. proposed a spatial–temporal multi-cue (STMC) network to learn implicit visual grammars underlying the collaboration of different visual cues [7]. Cui et al. introduced iterative training that uses the alignment proposal generated by the recognition model as strong supervision to tune the feature extraction module [37]. Additionally, their work applied RGB images and optical flow to capture both appearance and motion information, achieving competitive performance.

2.2. Feature Fusion

Incorporating multi-scale features is crucial for boosting performance in various tasks [38,39]. In the CNN backbone networks, recent studies generally agree that lower-level features possess higher resolution and more detailed information, yet due to the limited number of convolutional layers, they contain less semantic information and more noise. Conversely, the higher-level features exhibit the opposite characteristics. The module in the Inception network consists of four parallel branches: 1 × 1 convolution, 3 × 3 convolution, 5 × 5 convolution, and 3 × 3 max pooling [40]. The output feature maps of the branches are concatenated along the channel dimension, combining information from multiple scales to enhance feature representation. Similarly, U-Net [41] and the feature pyramid networks (FPN) [42] approach involve pooling or upsampling features at various scales before prediction.
There is a paucity of research on multi-scale feature fusion in continuous sign language recognition, despite its fundamental importance in sign language video understanding. Nevertheless, some studies employ multi-modal data for feature fusion. Cui et al. took RGB images and optical flow of hand regions as input to combine appearance and motion information, summing the two stream outputs generated by CNN modules for fusion [37]. On the other hand, a spatial multi-cue module in STMC was specifically designed for spatial representation and employs a self-contained pose estimation branch to explicitly decompose the visual features of different cues [7]. This approach also models and fuses each cue using sequence modeling, enriching the spatio-temporal information. The results demonstrate the benefits of feature fusion in learning sign language semantics. However, current methods still have certain limitations. First, acquiring optical flow data can be challenging, and its accuracy is difficult to guarantee due to sensitivity to noise [43,44]. Second, the STMC method relies on a pose estimator for keypoint detection; in practice, it can be difficult to accurately detect keypoints in motion images, which can lead to errors [45,46]. Therefore, it is necessary to explore more concise and reliable methods to address these issues.
Based on previous research, we propose a spatial feature extraction module for multi-scale feature fusion. This module uses a Transformer encoder to process feature maps of different receptive fields and concatenate them, effectively combining low-level details with high-level semantic information. This not only enhances the model’s ability to capture fine-grained features but also improves its understanding of complex semantic structures. This approach avoids the need for additional data and modules and enables end-to-end training.

3. Methodology

3.1. Overall Model Architecture

Formally, the proposed model aims to understand a sign language video containing many glosses and learn a mapping $\mathcal{R}: V \rightarrow G$ that recognizes the input video frame sequence $V$ as a gloss sequence $G$. The overall model framework is shown in Figure 2. Given a video, a multi-layer convolutional network first extracts spatial features from each frame and generates feature maps with diverse receptive fields. Then, a fusion module transforms each feature map into a feature vector and concatenates the multi-scaled features. Next, the sequence composed of the fused features is learned by the temporal module. Finally, a CTC decoder aligns the sequence for model training and inference.

3.2. Spatial Encoder and Multi-Scale Features

In addition to RGB images, fine-grained details such as facial expressions, lips, and hand shapes are essential for sign language recognition. We leverage the inherent properties of convolutional neural networks (CNNs), which are particularly adept at local perception within images. As layers are stacked in a CNN, the receptive field progressively enlarges, which enables the network to transition from capturing minute details to extracting broader semantic features. We harness this capability by extracting visual features at various receptive field sizes, thus obtaining a range of granularities—from fine-grained details such as facial expressions and hand shapes to more coarse-grained features like overall body posture and gesture dynamics. This fusion of multi-scaled features ensures that our model captures both detailed and global visual information, essential for the nuanced interpretation required in cSLR.
We design stacked convolutional basic modules as a spatial encoder (SE) to extract features from frames of sign language videos. Each basic module comprises two convolutional layers, batch normalization, and rectified linear unit (ReLU) activation function, as shown in Figure 2. The spatial encoder’s last two basic modules yield two distinct feature maps with dimensions of 7 × 7 and 3 × 3 , which are considered fine-grained (FG) and coarse-grained (CG) features, respectively. A larger feature map dimension corresponds to a narrower receptive field, incorporating a higher degree of detail-oriented features. Conversely, smaller dimensions encompass more abstract semantic information. Upon computation, the receptive fields for fine-grained and coarse-grained features are determined to be 31 and 67, respectively. We then propose a self-attention-based approach for fusing these features, leveraging the self-attention mechanism’s ability to capture dependencies across different scales.

3.3. Multi-Scale Features Fusion with Self-Attention

Feature maps are two-dimensional data that need to be converted into one-dimensional vectors for fusion and sequence modeling. Conventional techniques like global average pooling (GAP) or global max pooling (GMP) compress feature maps directly but often result in information loss [47]. Moreover, studies have indicated that the averaging operation in GAP excessively smoothes feature maps, thereby submerging information from crucial regions, whereas GMP is sensitive to outliers due to its maximization operation [48].
To address these issues, we propose a self-attention-based approach that dynamically integrates global information from feature maps. As depicted in Figure 3, the two-dimensional feature map is initially unfolded into a one-dimensional sequence, to which a randomly initialized special token < c l s > is appended at the beginning, serving as the label for the feature map. Subsequently, a two-layer Transformer encoder is employed to process the sequence, with the < c l s > token representing the feature map’s output as it encapsulates global information. Finally, the outputs of multi-scale feature maps are concatenated to achieve fusion. This method dynamically integrates global information from feature maps, offering a more flexible approach. The success of the Vision Transformer (ViT) model, which shares similarities with our method, provides a research foundation for our feature fusion strategy.
Specifically, given a video $V = \{p_1, p_2, \ldots, p_T\} \in \mathbb{R}^{T \times H \times W \times C}$ with $p_i \in \mathbb{R}^{H \times W \times C}$, where $T$ denotes the number of frames in $V$ and $H$, $W$, and $C$ represent the height, width, and channels of each frame, respectively, the process of handling and fusing multi-scale features is illustrated in Equations (1)–(4):
$m_1^i, m_2^i = SE(p_i)$  (1)
$v_1^i = f_{trans\_enc}([m_{cls}, \mathrm{Flatten}(m_1^i)])$  (2)
$v_2^i = f_{trans\_enc}([m_{cls}, \mathrm{Flatten}(m_2^i)])$  (3)
$c^i = \mathrm{Concat}(v_1^i, v_2^i)$  (4)
where $SE$ is the spatial encoder and $f_{trans\_enc}$ is the Transformer encoder layer. In addition, $p_i \in \mathbb{R}^{224 \times 224 \times 3}$ denotes the $i$th frame image, and $m_1^i \in \mathbb{R}^{7 \times 7 \times 256}$ and $m_2^i \in \mathbb{R}^{3 \times 3 \times 256}$ are the feature maps. The outputs $v_1^i, v_2^i \in \mathbb{R}^{1 \times 256}$ are obtained by applying the Transformer encoder to the flattened sequences prepended with the $\langle cls \rangle$ token. Finally, by concatenating $v_1^i$ and $v_2^i$, we derive $c^i \in \mathbb{R}^{1 \times 512}$, which serves as the image feature for subsequent temporal modeling, providing a comprehensive representation that integrates both fine-grained and coarse-grained information.
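A compact PyTorch sketch of Equations (1)–(4) is given below, assuming a shared two-layer Transformer encoder acts as the fusion module for both feature maps (class and variable names are illustrative, not the authors' released code):

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Sketch of the self-attention-based fusion of the 7x7 fine-grained and
    3x3 coarse-grained feature maps, following Eqs. (1)-(4). Hyperparameters
    (d_model=256, 8 heads, 2048 hidden units, 2 layers) follow Section 4.2.1."""

    def __init__(self, d_model=256, nhead=8, dim_ff=2048, num_layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_ff,
                                               dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))   # <cls> token

    def pool(self, fmap):
        # fmap: (B, C, H, W) -> token sequence (B, H*W, C) with <cls> prepended
        b = fmap.size(0)
        tokens = fmap.flatten(2).transpose(1, 2)
        tokens = torch.cat([self.cls.expand(b, -1, -1), tokens], dim=1)
        return self.encoder(tokens)[:, 0]            # <cls> output: (B, C)

    def forward(self, m1, m2):
        # m1: (B, 256, 7, 7) fine-grained; m2: (B, 256, 3, 3) coarse-grained
        v1, v2 = self.pool(m1), self.pool(m2)
        return torch.cat([v1, v2], dim=-1)            # fused feature c: (B, 512)
```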

3.4. Self-Attention Based on Euclidean Distance

In sequence modeling, learning the relationships between tokens and contextual information is crucial. The Transformer model employs a self-attention mechanism to capture the global context of sequences, demonstrating exceptional capabilities in processing text sequences. However, videos differ from sentences in that every token in a sentence can be converted into a word vector using embeddings, with finite vocabularies constraining the feature representation of words. In contrast, each frame of a video has an infinite number of representations within a specific space, resulting in a gap between the application of Transformers to videos and text.
First, we analyze the limitations of Transformers in processing video sequences. Then, we propose an improved approach to make them more adaptable for perceiving and learning contextual information within visual sequences, leveraging Euclidean distance for enhanced attention.
The self-attention in vanilla Transformer is formulated as Equation (5):
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$  (5)
The crucial part involves calculating the attention map by multiplying the query $Q \in \mathbb{R}^{t \times c}$ and key $K \in \mathbb{R}^{t \times c}$. We assume $Q = [q_1, q_2, \ldots, q_t]^T$ and $K = [k_1, k_2, \ldots, k_t]^T$, so we can obtain the following inference:
$QK^T = \begin{bmatrix} q_1^T \\ q_2^T \\ \vdots \\ q_t^T \end{bmatrix} \begin{bmatrix} k_1 & k_2 & \cdots & k_t \end{bmatrix} = \begin{bmatrix} q_1^T k_1 & q_1^T k_2 & \cdots & q_1^T k_t \\ q_2^T k_1 & q_2^T k_2 & \cdots & q_2^T k_t \\ \vdots & \vdots & \ddots & \vdots \\ q_t^T k_1 & q_t^T k_2 & \cdots & q_t^T k_t \end{bmatrix}$  (6)
$\alpha_{i,j} = q_i^T k_j = \|q_i\| \cdot \|k_j\| \cdot \cos\theta_{ij}$  (7)
where, in Equations (6) and (7), $\alpha_{i,j}$ represents the value at row $i$ and column $j$ of the attention map, and $\theta_{ij}$ denotes the angle between vectors $q_i$ and $k_j$. According to Equation (5), the softmax operation is applied to $QK^T$ along the row direction. Within each row, the query vector $q_i$ is fixed, so the attention weights are determined by the magnitudes of the key vectors and their cosine similarities to the query. However, our analysis reveals that an exceedingly large vector may disrupt the attention mechanism, causing it to focus excessively on a specific position while disregarding more global information (as shown in Figure 4).
Therefore, to address this issue, we propose EDAttention, which replaces the dot product with the Euclidean distance between vectors as the attention metric. Specifically, the attention weights are computed from the Euclidean distances between the query and key vectors, as shown in Equations (8) and (9):
$\mathrm{EDAttention}(Q, K, V) = \mathrm{Softmax}\big(ED(Q, K)\big)V$  (8)
$ED(Q, K) = \begin{bmatrix} \langle q_1, k_1 \rangle & \langle q_1, k_2 \rangle & \cdots & \langle q_1, k_t \rangle \\ \langle q_2, k_1 \rangle & \langle q_2, k_2 \rangle & \cdots & \langle q_2, k_t \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle q_t, k_1 \rangle & \langle q_t, k_2 \rangle & \cdots & \langle q_t, k_t \rangle \end{bmatrix}, \quad \langle q_i, k_j \rangle = -\|q_i - k_j\|_2^2 = -\|q_i\|_2^2 - \|k_j\|_2^2 + 2\,q_i^T k_j$  (9)
In the EDAttention mechanism, the attention score $\alpha_{i,j}$ is defined as the negation of the squared Euclidean distance between vectors $q_i$ and $k_j$, resulting in $\alpha_{i,j} \leq 0$. The subsequent softmax operation compresses $\alpha_{i,j}$ into $(0, 1]$. Moreover, EDAttention ensures that greater distances between vectors correspond to lower similarity and, consequently, smaller attention weights, which aligns with analytical logic and intuitive perception. This approach ensures that attention is distributed more evenly across all positions in the sequence, allowing the model to fully capture the global information in long video sequences.
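A minimal PyTorch sketch of EDAttention follows, assuming single-head attention without the learned projection matrices of a full Transformer layer (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def ed_attention(q, k, v):
    """EDAttention (Eqs. (8)-(9)): scores are the negated squared Euclidean
    distances between query and key vectors, so larger distances yield
    smaller weights after softmax. q, k, v: (B, T, d)."""
    dist_sq = torch.cdist(q, k, p=2).pow(2)   # pairwise squared distances (B, T, T)
    scores = -dist_sq                         # alpha_{i,j} <= 0
    weights = F.softmax(scores, dim=-1)       # row-wise normalization
    return weights @ v

# Minimal usage example
q = k = v = torch.randn(2, 10, 256)
out = ed_attention(q, k, v)                   # (2, 10, 256)
```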

3.5. The Local Window in Self-Attention

In continuous sign language recognition, there are strict sequential correspondences between video clips and glosses, even though there are no definite boundaries to segment them. Consequently, we incorporate a local Transformer encoder layer before the model's output, aiming to reinforce the perception of local information in the sequence after temporal modeling, thereby facilitating the alignment of local video clips with gloss semantics. As illustrated in Figure 5, we introduce a special masking operation before calculating the attention matrix and applying softmax. This operation restricts the primary receptive range of each frame to a confined window and assigns negative infinity to the attention values of elements outside the window, effectively impeding interaction between distant content and the current frame.
This localized self-attention mechanism focuses more on individual video segments rather than the entire video. In conjunction with the improved Transformer, it composes the temporal modeling module for videos, capturing both local and global contexts. The efficacy of this approach is further substantiated by ablation experiments discussed in the subsequent section.
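As an illustration, a local attention mask of the kind described above could be built as follows; how the window size maps to left and right context is our assumption (a symmetric window centered on the current frame):

```python
import torch

def local_window_mask(seq_len: int, window: int = 5) -> torch.Tensor:
    """Sketch of the local masking in Figure 5: positions outside a window
    around each frame receive -inf before softmax, so each frame attends
    only to its temporal neighborhood (window=5 performs best in our
    ablation, Section 4.3.4)."""
    idx = torch.arange(seq_len)
    outside = (idx[None, :] - idx[:, None]).abs() > window // 2
    mask = torch.zeros(seq_len, seq_len)
    mask[outside] = float('-inf')
    return mask  # additive mask applied to the attention scores before softmax

# e.g., pass as the attention mask of an encoder layer:
# encoder_layer(x, src_mask=local_window_mask(x.size(1)))
```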

3.6. CTC Loss and CTC Decoder

Continuous sign language videos are unsegmented, so there are no explicit boundaries between the glosses. Therefore, we introduce connectionist temporal classification (CTC) to align the sequences. CTC constructs a mapping from the frame sequence to the gloss sequence even though the two sequences have different and varying lengths; it was first applied to speech recognition.
In our model, given a sequence $S = \{s_i\}_{i=1}^{T}$ generated from the video by the spatial–temporal modules, CTC endeavors to map it to the sign gloss sequence $G = \{g_j\}_{j=1}^{L}$. The objective of CTC is to maximize the sum of the probabilities of all possible alignment paths between the input and target sequences. We introduce the special token $\langle BLANK \rangle$ to model the transitions meticulously and eliminate the ambiguity caused by repeated glosses. In addition, we define an alignment path of the source sequence as $\pi = \{\pi_t\}_{t=1}^{T}$, where each label $\pi_t$ belongs to the gloss vocabulary extended with the blank token. The probability of an alignment path $\pi$ given the input sequence is represented as follows:
$p(\pi|S) = \prod_{t=1}^{T} p(\pi_t|S) = \prod_{t=1}^{T} y_{t,\pi_t}$
We calculate the conditional probability of the gloss sequence $G$ as the sum of the probabilities of all paths that can be mapped to $G$:
$p(G|S) = \sum_{\pi \in \mathcal{B}^{-1}(G)} p(\pi|S)$
where $\mathcal{B}$ denotes the mapping operation that removes all $\langle BLANK \rangle$ tokens and repeated words from the alignment path, and $\mathcal{B}^{-1}(G) = \{\pi \mid \mathcal{B}(\pi) = G\}$ is the inverse operation of $\mathcal{B}$. Eventually, the CTC loss, which serves as the objective function of the model, is defined as follows:
$\mathcal{L}_{CTC} = -\log p(G|S) = -\log\Big[\sum_{\pi \in \mathcal{B}^{-1}(G)} p(\pi|S)\Big]$
The entire training process of MSF-ET is shown in Algorithm 1. During the training stage, we train the model by progressively minimizing the CTC loss, and the CTC decoder is applied for target sequence prediction during inference. The CTC decoder obtains the target sequence with the highest cumulative probability via the beam search algorithm.
Algorithm 1 Multi-scaled feature fusion–Euclidean Transformer (MSF-ET) for continuous sign language recognition.
Input: Sign language videos $V \in \mathbb{R}^{T \times 3 \times 224 \times 224}$
Output: Sequence of glosses $G$ corresponding to the video $V$
1: Initialize MSF-ET model parameters $\theta$
2: set $i = 0$
3: while $i < \mathrm{max\_epoch}$ do
4:     for $(V, G)$ in Dataset $D$ do
5:         $m_1, m_2 = f_{SE}(V)$ // Obtain spatial features
6:         $m_1 \leftarrow [\langle cls \rangle, \mathrm{flatten}(m_1)]$, $m_2 \leftarrow [\langle cls \rangle, \mathrm{flatten}(m_2)]$
7:         $v_1 = f_{trans\_enc}(m_1)$, $v_2 = f_{trans\_enc}(m_2)$ // Multi-granularity feature processing
8:         $c \leftarrow \mathrm{Concat}([v_1, v_2])$ // Feature fusion
9:         $S = f_{temp}(c)$ // Temporal sequence modeling
10:        $\mathcal{L} = -\log p(G|S) = -\log\big[\sum_{\pi \in \mathcal{B}^{-1}(G)} p(\pi|S)\big]$ // CTC loss calculation
11:        $\theta' = \theta - \eta \nabla_{\theta} \mathcal{L}$ // Gradient backward
12:        $\theta \leftarrow \theta'$ // Parameter update
13:    end for
14:    $i = i + 1$
15: end while
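To ground Algorithm 1 in runnable form, a minimal training step with PyTorch's built-in CTC loss might look like the sketch below; the model call, vocabulary layout, and length handling are placeholders rather than the released implementation:

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)   # <BLANK> assumed at index 0

def training_step(model, videos, feat_lens, glosses, gloss_lens, optimizer):
    # videos: (B, T, 3, 224, 224); glosses: (B, L) padded gloss indices
    # feat_lens: temporal lengths of the model outputs (== T if no downsampling)
    logits = model(videos)                                 # (B, T', vocab_size)
    log_probs = logits.log_softmax(-1).transpose(0, 1)     # (T', B, V), as CTCLoss expects
    loss = ctc_loss(log_probs, glosses, feat_lens, gloss_lens)
    optimizer.zero_grad()
    loss.backward()        # gradient backward (Algorithm 1, line 11)
    optimizer.step()       # parameter update (Algorithm 1, line 12)
    return loss.item()
```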

4. Experiments

4.1. Datasets and Metrics

We evaluate the proposed method on two mainstream continuous sign language recognition datasets: PHOENIX-2014 and CSL.
PHOENIX-2014 is the most popular benchmark for cSLR. All the videos are recorded from broadcast news, and the Train/Dev/Test split contains 5672, 540, and 629 videos, respectively. The dataset includes 9 different signers with a vocabulary size of 1295. All videos are converted into frame sequences at 25 fps.
CSL is a Chinese sign language dataset containing 100 sign sentences with 178 words. All the videos are performed by 50 signers, each repeated 5 times. Even though the vocabulary size is small, this dataset exhibits considerable diversity because of variations in the signers' clothing and signing speeds.
In this task, we use word error rate (WER) as the metric for evaluating recognition performance. There are three types of mistakes in a predicted sequence: substitutions (#sub), deletions (#del), and insertions (#ins). Thus, the WER from hypothesis to reference is defined as follows:
$WER = \dfrac{\#sub + \#del + \#ins}{\#Total\ words}$
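For reference, the WER can be computed from the minimum edit distance between the hypothesis and the reference gloss sequence; a small self-contained sketch (independent of any toolkit):

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate: (#sub + #del + #ins) / #reference words, via a plain
    dynamic-programming edit distance."""
    R, H = len(reference), len(hypothesis)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                     # deletions
    for j in range(H + 1):
        d[0][j] = j                     # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1  # substitution
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[R][H] / max(R, 1)

# e.g., wer("MORGEN REGEN NORD".split(), "MORGEN NORD".split()) == 1/3
```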

4.2. Implementation Details

4.2.1. Network Details

The spatial encoder consists of a preprocessing module and five basic modules; the preprocessing module includes a convolutional layer and max pooling. Each basic module contains two convolution layers with batch normalization and ReLU activation. The epsilon and momentum of batch normalization are set to $1 \times 10^{-5}$ and 0.1, respectively. A max pooling operation is applied between the two convolution layers for downsampling. All convolutional layers are configured as F3-S1-P1, where F, S, and P denote the filter size, stride, and padding size, respectively. The number of channels increases module by module, and the output channels of the final modules are unified to 256.
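A sketch of one basic module under this configuration (two F3-S1-P1 convolutions with batch normalization and ReLU, and max pooling between them) is given below; the exact layer ordering and channel widths are our reading of the description, not the released code:

```python
import torch.nn as nn

class BasicModule(nn.Module):
    """One basic module of the spatial encoder: conv-BN-ReLU, max pooling for
    downsampling, then conv-BN-ReLU. All convolutions are F3-S1-P1."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch, eps=1e-5, momentum=0.1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),        # downsampling
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch, eps=1e-5, momentum=0.1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```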
The fusion module consists of two vanilla Transformer encoder layers following Vaswani et al. [24]. Specifically, each layer has 8 heads and 2048 hidden units, and the dropout rate is 0.1. As illustrated in Figure 3, we design a single fusion module to process the different feature maps. In addition, the configuration of the Transformer layers in MSF-ET is similar to that of the fusion module.
In the inference stage, we utilize the CTC beam search algorithm to decode the sequence, and we set the beam size as 5.

4.2.2. Training Setup

During training, all samples are limited to a maximum length of 300 frames. For data augmentation, all frames in a video are first resized to 256 × 256 and then randomly cropped to 224 × 224 during training and center-cropped to 224 × 224 during testing. For temporal augmentation, we first randomly repeat 20% of the sequence and then randomly remove 20% of the content. We use Adam to optimize our model, and the weight factors $\lambda_1$ and $\lambda_2$ are empirically set to $10^{-4}$ and 0.05, respectively. The initial learning rate is $5 \times 10^{-5}$, and the learning rate is halved per epoch after 40 epochs. In these experiments, a single Nvidia RTX 3090 with 24 GB memory is used to accelerate model training and inference.
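As an illustration of the temporal augmentation, the following sketch randomly repeats roughly 20% of the frames and then randomly drops roughly 20% of the result; the precise sampling strategy is an assumption, since the text does not specify it:

```python
import random
import torch

def temporal_augment(frames: torch.Tensor, ratio: float = 0.2) -> torch.Tensor:
    """frames: (T, C, H, W). Randomly duplicate ~ratio of the frames, then
    randomly remove ~ratio of the lengthened sequence, preserving order."""
    T = frames.size(0)
    extra = random.sample(range(T), k=max(1, int(T * ratio)))   # indices to repeat
    idx = sorted(list(range(T)) + extra)
    frames = frames[idx]
    T2 = frames.size(0)
    keep = sorted(random.sample(range(T2), k=T2 - max(1, int(T2 * ratio))))
    return frames[keep]
```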

4.3. Ablation Study

4.3.1. Analysis of Multi-Scale Feature Fusion

In our proposed model, we introduce the multi-scale feature fusion method with self-attention to capture more comprehensive information in sign language videos. We specifically choose 3 × 3 and 7 × 7 feature maps to represent coarse-grained and fine-grained features, respectively.
To validate the effectiveness of using 3 × 3 and 7 × 7 feature maps for capturing coarse and fine details, we used class activation mapping (CAM) [49] to visualize the attention weights of feature maps at different scales, allowing us to observe the regions that the model focuses on at different feature scales. As shown in Figure 6, the 7 × 7 feature maps capture finer details such as facial expressions and hand shapes. These fine-grained features are crucial for distinguishing subtle variations in sign language gestures, providing detailed information that is essential for precise recognition. In contrast, Figure 6 shows that the 3 × 3 feature maps capture more coarse-level movements, such as overall body posture and hand trajectories. These coarse-grained features are essential for understanding the general structure and flow of sign language, providing a broader context that is necessary for recognizing overall gestures.
Table 1 compares the performance of different feature selections on the PHOENIX-2014 benchmark. For fine-grained features, we only utilize the 7 × 7 feature map with a receptive field of 31. For coarse-grained features, we use the 3 × 3 feature map. The fusion process is shown in Figure 3. The results demonstrate that using the proposed feature fusion method can achieve better performance compared to using single features, with WER reductions of 4.7 and 10.5, respectively. This indicates that the feature fusion method enhances the spatial representation of the sign language videos and facilitates the model’s understanding of sign language actions.
The visualizations support our experimental findings: fine-grained features, although detailed, also include more noise, making them less effective when used alone. Coarse-grained features provide a clearer semantic representation but miss finer details. By combining both types of features, our model leverages the strengths of each, resulting in a more robust understanding of sign language videos. This comprehensive approach ensures that both the macro (coarse) and micro (fine) aspects of sign language are effectively captured, leading to improved recognition performance.

4.3.2. Analysis of Feature Fusion Method

As for feature fusion, we introduce the self-attention mechanism to handle feature maps of different scales, instead of simple average pooling or max pooling. Furthermore, there are two conventional fusion methods: element-wise addition and concatenation. For cSLR, we conducted experiments to explore the effect of different fusion methods.
In Table 2, comparing vertically, the concatenation and element-wise addition fusion methods have almost no impact on performance, indicating that the information retained by the two operations is nearly identical; a more convenient method can therefore be chosen according to the task requirements. The results also indicate that our proposed feature map processing method with the self-attention mechanism is significantly better than both pooling methods, with a reduction of 3.5–4 in the WER metric. This is because both pooling methods result in a loss of information [48]. Max pooling only retains the maximum value in a local region, discarding other information that may be valuable [47], while global average pooling operates on the entire feature map, potentially ignoring detail information and reducing sensitivity to details. Our proposed method flattens the feature maps and uses a Transformer encoder to process them as a sequence, which can perceive global information while dynamically preserving details. In addition, it is more flexible than pooling operations. The success of the ViT model also attests to the effectiveness of this approach to some extent.

4.3.3. Analysis of Attention with Euclidean Distance

For sequence modeling, we explain the shortcomings of the dot product attention mechanism when applied to video sequences. It is affected by the magnitude of feature vectors in the sequence, causing attention weights to overly focus on local segments and thus undermining the model’s ability to perceive the global context. Therefore, we use the Euclidean distance between feature vectors to measure attention in sequence modeling. In Table 3, we observe that using Euclidean distance for sequence modeling in Transformer, whether with single or fused features, results in lower WER (1.5–3.9 lower) compared to the scale-dot product. The results demonstrate that our improvements to the Transformer are more effective in learning and understanding the semantics of continuous sign language gestures. Furthermore, we visualized the attention weight matrices of the Transformer using Euclidean distance and applied it to the same sample data as in Figure 4, as shown in Figure 7. Compared to Figure 4, it is evident from Figure 7 that, after using Euclidean distance to measure similarity, the attention weight distribution is more uniform, effectively alleviating the sparsity issue.

4.3.4. Analysis of Local Transformer

In the task of continuous sign language recognition, each gloss is associated with a specific segment in the continuous sign language video. To augment the model’s ability to learn local information within the sequence, we incorporated a local Transformer prior to the output layer. In order to compare its performance with 1D-CNN and examine the optimal size of the local window, we conducted experiments to investigate the differences in performance across various window sizes on PHOENIX-2014.
Based on the experimental results, it is evident that the absence of a network with local perception capabilities leads to a highly unsatisfactory final outcome. This finding underscores the significance of local temporal perception for cSLR tasks. The integration of a 1D-CNN contributes to an approximate reduction of three in the WER metric, as it leverages a smaller convolutional kernel for processing local regions within extended sequences, ultimately supplying local information for the final output. Utilizing a local Transformer further diminishes the WER, which can be ascribed to the 1D-CNN’s shared convolutional kernel that executes convolution operations using identical parameters on varying local information, thus yielding constrained information extraction. Conversely, the self-attention mechanism of the Transformer facilitates the dynamic assimilation of local sequences, demonstrating a more adaptable and resilient processing approach capable of capturing an abundance of local information. On the other hand, in Table 4, for the 1D-CNN, a larger convolutional kernel contributes to improved performance. However, with regards to the local Transformer, setting the local perception size to 5 yields the best results. If the size is too small, it may not be able to perceive sufficient information, whereas if it is too large, the performance might be adversely affected by the actions corresponding to adjacent glosses.

4.3.5. Analysis of Time Complexity and Inference Time Cost

Our model includes both spatial feature extraction using convolutional neural networks (CNNs) and temporal modeling using our multi-scaled feature fusion–Euclidean Transformer (MSF-ET), and it involves several computational stages, each contributing to the overall complexity. The spatial feature extraction stage employs a stack of convolutional layers; the time complexity of each convolution operation is generally $O(d^2 k^2)$, where $d$ is the dimension of the feature map and $k$ is the kernel size. In the feature fusion module, we utilize a self-attention mechanism across the feature maps generated by the spatial encoder; the complexity of the self-attention mechanism is $O(l^2 \cdot d)$, where $l$ is the sequence length and $d$ is the dimensionality of each token. During the temporal stage, the Euclidean Transformer involves calculating distances between all pairs of input sequence elements, resulting in a complexity of $O(t^2 \cdot d)$.
We conduct additional analyses to measure the processing time. Our MSF-ET model processes each frame through several stages, including spatial feature extraction, feature fusion, and temporal encoding, as shown in Figure 8. In our current implementation on an Nvidia RTX 3090 GPU, the average processing time per 25 frames is approximately 96–134 milliseconds (ms). This includes the convolution operations, feature fusion with self-attention, temporal encoding with our Euclidean Transformer, and CTC decoding. Owing to the parallel processing of the attention mechanisms, inference time remains below 250 ms even when the number of video frames exceeds 200 (as shown in Figure 8), enabling real-time inference even for sign language videos longer than 200 frames.

4.4. Comparison with Baselines

4.4.1. Evaluation on PHOENIX-2014

We conduct a comparative analysis of various state-of-the-art models to substantiate the efficacy of our research, as shown in Table 5. We categorize the compared models into two major groups based on their distinct sequence modeling approaches: methods based on the Transformer and those employing other network architectures. This is due to the weak performance of the original Transformer on cSLR tasks, which warrants a separate comparison. CMLLR and 1-Mio-Hands primarily utilize traditional HMM models to learn sign language videos. CNN-LSTM-HMM combines CNN-LSTM with HMM to exploit multi-stream data. DNF employs multimodal data comprising RGB and optical flow to enhance the capture of motion features. FCN establishes a fully convolutional network for spatial and temporal modeling. Finally, mLTSF-Net fuses temporal information based on local temporal similarity, thereby capturing the locally consistent semantics of adjacent frames. In comparison to these methods, our proposed approach surpasses them to varying degrees without using additional data, achieving end-to-end training. The VAC model, which integrates VGG and BiLSTM, employs alignment learning to enhance its capabilities, achieving the current state-of-the-art results; our result on the test set is slightly behind it (+0.2 WER). Our approach focuses on the application of Transformer models in continuous sign language recognition (cSLR) tasks, optimizing and improving issues related to single visual features and self-attention mechanisms. The alignment learning and auxiliary learning methods used in VAC aim to strengthen the alignment between visual features and glosses, offering valuable insights for us to learn from and further improve our model's performance.
Within the Transformer-based sequence modeling methods, it is worth noting that the self-attention-based approach consistently falls short compared to CNN or RNNs. Analyzing and addressing this phenomenon also constitutes one of the contributions of this work. As observed in Table 5, the application of Transformer to this task results in a WER metric of 30 or even higher. REINFORCE attempts to improve the encoder–decoder structure using reinforcement learning techniques. SL-Transformers employ pre-trained features from LSTM-HMM, reducing the WER from 50.18 to 26.7. In contrast to these models, our proposed method significantly lowers the WER metric, decreasing it by 15.6 points compared to the reinforcement learning approach, and even achieving a reduction of over 50% relative to the untrained SL-Transformers. The comparison with baseline models attests to the effectiveness of our proposed method, which not only achieves state-of-the-art performance among Transformer-based approaches but also rivals CNN/RNN models.

4.4.2. Evaluation on CSL

To demonstrate the robustness and effectiveness of our approach, we compare it with various baseline methods on a Chinese sign language dataset. As shown in Table 6, on the CSL dataset, our model achieves state-of-the-art (SOTA) results, albeit with a marginal improvement of 0.1 points in WER over CNN/RNN methods. The results demonstrate that our approach remains competitive in the domain of Chinese sign language recognition. Furthermore, to the best of our knowledge, there are no reported metrics for Transformer-based models on the CSL benchmark. Therefore, we replicate the SL-Transformer model and report its WER on CSL, as illustrated in the penultimate row of Table 6. Employing the self-attention mechanism, our method reduces the WER by 23.3% compared to SL-Transformer on this benchmark, achieving the state of the art among self-attention models.

5. Conclusions

In this paper, we develop an innovative method for multi-scale feature fusion in images, utilizing self-attention to preserve both low-level details and high-level semantic information. Furthermore, we improve the vanilla Transformer by adopting Euclidean distance instead of the scaled dot product to measure attention weights, effectively mitigating the attention sparsity issue.
Our experimental results on German and Chinese sign language datasets demonstrate the competitive performance of our proposed model in comparison to existing methods, showcasing its versatility and effectiveness across different domains. The successful application of our approach to cross-modal tasks highlights its potential for further research and development in a wide range of applications, from natural language processing to computer vision.

6. Limitations and Future Work

Our study integrates multi-scale features based on self-attention to fuse visual information of different granularities and employs Euclidean distance as the attention metric, improving the sequence modeling capability for videos. However, there are still limitations in this study. First, the feature fusion method increases time complexity, which slows training and inference. Next, although the multi-scale fusion method based on self-attention is sufficiently flexible, it incurs significant computational overhead. This limitation prevented us from conducting experiments on a broader range of scales (e.g., 14 × 14 and 28 × 28 feature maps), since videos require substantially more computation than images. On the other hand, this study focuses on feature extraction and understanding, while the VAC model's experience indicates that aligning visual features with glosses yields better results. Future research will incorporate alignment learning to further investigate continuous sign language recognition.

Author Contributions

Conceptualization, Y.D., T.P. and X.H.; Methodology, Y.D., T.P. and X.H.; Software, Y.D., T.P. and X.H.; Validation, Y.D., T.P. and X.H.; Formal analysis, Y.D., T.P. and X.H.; Investigation, Y.D., T.P. and X.H.; Resources, Y.D., T.P. and X.H.; Data curation, Y.D., T.P. and X.H.; Writing—original draft, Y.D., T.P. and X.H.; Writing—review and editing, Y.D., T.P. and X.H.; Visualization, Y.D., T.P. and X.H.; Supervision, Y.D., T.P. and X.H.; Project administration, Y.D., T.P. and X.H.; Funding acquisition, Y.D., T.P. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported in part by the National Key R&D Program of China under No. 2020-JCJQ-ZD-079-00.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Koller, O.; Ney, H.; Bowden, R. Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3793–3802. [Google Scholar]
  2. Cihan Camgoz, N.; Hadfield, S.; Koller, O.; Bowden, R. Subunets: End-to-end hand shape and continuous sign language recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3056–3065. [Google Scholar]
  3. Cui, R.; Liu, H.; Zhang, C. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7361–7369. [Google Scholar]
  4. Koller, O.; Zargaran, S.; Ney, H. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4297–4305. [Google Scholar]
  5. Song, P.; Guo, D.; Xin, H.; Wang, M. Parallel temporal encoder for sign language translation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1915–1919. [Google Scholar]
  6. Yang, Z.; Shi, Z.; Shen, X.; Tai, Y.W. Sf-net: Structured feature network for continuous sign language recognition. arXiv 2019, arXiv:1908.01341. [Google Scholar]
  7. Zhou, H.; Zhou, W.; Zhou, Y.; Li, H. Spatial-temporal multi-cue network for continuous sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13009–13016. [Google Scholar]
  8. Yin, K.; Read, J. Better sign language translation with STMC-transformer. arXiv 2020, arXiv:2004.00588. [Google Scholar]
  9. Xie, P.; Cui, Z.; Du, Y.; Zhao, M.; Cui, J.; Wang, B.; Hu, X. Multi-scale local-temporal similarity fusion for continuous sign language recognition. Pattern Recognit. 2023, 136, 109233. [Google Scholar] [CrossRef]
  10. Koller, O. Quantitative survey of the state of the art in sign language recognition. arXiv 2020, arXiv:2008.09918. [Google Scholar]
  11. Papastratis, I.; Dimitropoulos, K.; Konstantinidis, D.; Daras, P. Continuous sign language recognition through cross-modal alignment of video and text embeddings in a joint-latent space. IEEE Access 2020, 8, 91170–91180. [Google Scholar] [CrossRef]
  12. Cheng, K.L.; Yang, Z.; Chen, Q.; Tai, Y.W. Fully convolutional networks for continuous sign language recognition. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 697–714. [Google Scholar]
  13. Wang, S.; Guo, D.; Zhou, W.g.; Zha, Z.J.; Wang, M. Connectionist temporal fusion for sign language translation. In Proceedings of the 26th ACM international conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1483–1491. [Google Scholar]
  14. Pu, J.; Zhou, W.; Li, H. Dilated convolutional network with iterative optimization for continuous sign language recognition. In Proceedings of the 2018 International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; Volume 3, p. 7. [Google Scholar]
  15. Pei, X.; Guo, D.; Zhao, Y. Continuous sign language recognition based on pseudo-supervised learning. In Proceedings of the 2nd Workshop on Multimedia for Accessible Human Computer Interfaces, New York, NY, USA, 25 October 2019; pp. 33–39. [Google Scholar]
  16. Guo, D.; Wang, S.; Tian, Q.; Wang, M. Dense Temporal Convolution Network for Sign Language Translation. In Proceedings of the International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 744–750. [Google Scholar]
  17. Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Multi-Channel Transformers for Multi-Articulatory Sign Language Translation; European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2020; pp. 301–319. [Google Scholar]
  18. Camgoz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10023–10033. [Google Scholar]
  19. Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. Int. J. Comput. Vis. 2018, 126, 1311–1325. [Google Scholar] [CrossRef]
  20. Koller, O.; Camgoz, N.C.; Ney, H.; Bowden, R. Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2306–2320. [Google Scholar] [CrossRef]
  21. Vallathan, G.; John, A.; Thirumalai, C.; Mohan, S.; Srivastava, G.; Lin, J.C.W. Suspicious activity detection using deep learning in secure assisted living IoT environments. J. Supercomput. 2021, 77, 3242–3260. [Google Scholar] [CrossRef]
  22. Huang, F.; Lu, K.; Yuxi, C.; Qin, Z.; Fang, Y.; Tian, G.; Li, G. Encoding recurrence into transformers. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual, 25 April 2022. [Google Scholar]
  23. Peng, B.; Alcaide, E.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Cao, H.; Cheng, X.; Chung, M.; Grella, M.; GV, K.K.; et al. Rwkv: Reinventing rnns for the transformer era. arXiv 2023, arXiv:2305.13048. [Google Scholar]
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  25. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  26. Rastgoo, R.; Kiani, K.; Escalera, S. Sign language recognition: A deep survey. Expert Syst. Appl. 2021, 164, 113794. [Google Scholar] [CrossRef]
  27. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  29. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  30. Ma, Y.; Xu, T.; Kim, K. Two-Stream Mixed Convolutional Neural Network for American Sign Language Recognition. Sensors 2022, 22, 5959. [Google Scholar] [CrossRef] [PubMed]
  31. Kındıroglu, A.A.; Özdemir, O.; Akarun, L. Aligning accumulative representations for sign language recognition. Mach. Vis. Appl. 2023, 34, 12. [Google Scholar] [CrossRef]
  32. Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; Li, W. Video-based sign language recognition without temporal segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  33. De Castro, G.Z.; Guerra, R.R.; Guimarães, F.G. Automatic translation of sign language with multi-stream 3D CNN and generation of artificial depth maps. Expert Syst. Appl. 2023, 215, 119394. [Google Scholar] [CrossRef]
  34. Zhang, Z.; Pu, J.; Zhuang, L.; Zhou, W.; Li, H. Continuous sign language recognition via reinforcement learning. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 285–289. [Google Scholar]
  35. Graves, A.; Graves, A. Connectionist temporal classification. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 61–93. [Google Scholar]
  36. Borg, M.; Camilleri, K.P. Phonologically-meaningful subunits for deep learning-based sign language recognition. In Proceedings of the Computer Vision—ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 199–217. [Google Scholar]
  37. Cui, R.; Liu, H.; Zhang, C. A deep neural framework for continuous sign language recognition by iterative training. IEEE Trans. Multimed. 2019, 21, 1880–1891. [Google Scholar] [CrossRef]
  38. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  39. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
  40. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  41. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  42. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  43. Dong, Q.; Cao, C.; Fu, Y. Rethinking optical flow from geometric matching consistent perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1337–1347. [Google Scholar]
  44. Lagemann, C.; Lagemann, K.; Mukherjee, S.; Schröder, W. Challenges of deep unsupervised optical flow estimation for particle-image velocimetry data. Exp. Fluids 2024, 65, 30. [Google Scholar] [CrossRef]
  45. He, X.; Bharaj, G.; Ferman, D.; Rhodin, H.; Garrido, P. Few-shot geometry-aware keypoint localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21337–21348. [Google Scholar]
  46. Xu, X.; Guan, L.; Dunn, E.; Li, H.; Hua, G. DDM-NET: End-to-end learning of keypoint feature Detection, Description and Matching for 3D localization. arXiv 2022, arXiv:2212.04575. [Google Scholar]
  47. Zhao, L.; Zhang, Z. A improved pooling method for convolutional neural networks. Sci. Rep. 2024, 14, 1589. [Google Scholar] [CrossRef]
  48. Zafar, A.; Aamir, M.; Mohd Nawi, N.; Arshad, A.; Riaz, S.; Alruban, A.; Dutta, A.K.; Almotairi, S. A comparison of pooling methods for convolutional neural networks. Appl. Sci. 2022, 12, 8643. [Google Scholar] [CrossRef]
  49. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  50. Koller, O.; Forster, J.; Ney, H. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 2015, 141, 108–125. [Google Scholar] [CrossRef]
  51. Min, Y.; Hao, A.; Chai, X.; Chen, X. Visual alignment constraint for continuous sign language recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11542–11551. [Google Scholar]
  52. Zheng, J.; Wang, Y.; Tan, C.; Li, S.; Wang, G.; Xia, J.; Chen, Y.; Li, S.Z. Cvt-slr: Contrastive visual-textual transformation for sign language recognition with variational alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23141–23150. [Google Scholar]
  53. Guan, M.; Wang, Y.; Ma, G.; Liu, J.; Sun, M. Multi-Stream Keypoint Attention Network for Sign Language Recognition and Translation. arXiv 2024, arXiv:2405.05672. [Google Scholar]
  54. Zhou, M.; Ng, M.; Cai, Z.; Cheung, K.C. Self-attention-based fully-inception networks for continuous sign language recognition. In ECAI 2020; IOS Press: Amsterdam, The Netherlands, 2020; pp. 2832–2839. [Google Scholar]
  55. Xie, P.; Zhao, M.; Hu, X. Pisltrc: Position-informed sign language transformer with content-aware convolution. IEEE Trans. Multimed. 2021, 24, 3908–3919. [Google Scholar] [CrossRef]
  56. Hu, H.; Zhao, W.; Zhou, W.; Li, H. Signbert+: Hand-model-aware self-supervised pre-training for sign language understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11221–11239. [Google Scholar] [CrossRef] [PubMed]
  57. Guo, D.; Zhou, W.; Li, H.; Wang, M. Hierarchical LSTM for sign language translation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Figure 1. Currently, most video spatial-representation methods for cSLR extract features with pre-trained CNN backbone networks (left in the figure). Although this approach extracts high-level semantic information, it lacks the perception of details, such as mouth shape and gaze, that are important for understanding sign language. We propose a multi-scale feature fusion method based on a self-attention mechanism (right in the figure), which enables a more comprehensive extraction of semantic information.
Figure 2. Overall model architecture. Our proposed MSF-ET model consists of three main components: a spatial encoder, a feature fusion module, and a temporal encoder. The spatial encoder is composed of multiple 2D convolutional layers followed by max-pooling, which downsample the input into feature maps with different receptive fields. The feature fusion module uses a self-attention mechanism to fuse the multi-scale features of each frame. The temporal encoder consists of an encoder based on Euclidean-distance self-attention and a local Transformer layer; it learns the contextual information of the video and the local features used for gloss alignment. Finally, connectionist temporal classification (CTC) is used to train the model and decode the gloss sequences.
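For readers who want a concrete picture of how these three components could be wired together, the following PyTorch sketch composes a spatial encoder, a placeholder fusion step, a temporal encoder, and a CTC loss into one trainable module. It is a minimal illustration based only on the description above: the class names, layer sizes, and gloss-vocabulary size are our own assumptions, the fusion here is plain average pooling plus concatenation (the attention-based fusion is sketched after Figure 3), and a standard Transformer encoder stands in for the Euclidean-distance/local attention sketched after Figure 5.

```python
# Minimal sketch of an MSF-ET-style pipeline (our assumptions, not the authors' code):
# spatial encoder -> multi-scale fusion -> temporal encoder -> CTC.
import torch
import torch.nn as nn


class SpatialEncoder(nn.Module):
    """Toy 2D CNN exposing two feature maps with different receptive fields."""
    def __init__(self, dim=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, padding=1), nn.ReLU(),
        )
        self.to_coarse = nn.AdaptiveMaxPool2d(3)   # 3 x 3 map (macroscopic cues)
        self.to_fine = nn.AdaptiveMaxPool2d(7)     # 7 x 7 map (detailed cues)

    def forward(self, frames):                     # frames: (B*T, 3, H, W)
        x = self.stem(frames)
        return self.to_coarse(x), self.to_fine(x)


class MSFETSketch(nn.Module):
    """Spatial encoder -> fusion -> temporal encoder -> CTC (all placeholders)."""
    def __init__(self, dim=256, vocab_size=1296):  # vocabulary size is a placeholder
        super().__init__()
        self.spatial = SpatialEncoder(dim)
        self.temporal = nn.TransformerEncoder(     # stand-in for the Euclidean +
            nn.TransformerEncoderLayer(2 * dim, nhead=8, batch_first=True),
            num_layers=2)                          # local Transformer encoder
        self.classifier = nn.Linear(2 * dim, vocab_size + 1)   # +1 for the CTC blank
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def fuse(self, coarse, fine):
        # Placeholder fusion: global average pooling + concatenation.
        return torch.cat([coarse.mean(dim=(2, 3)), fine.mean(dim=(2, 3))], dim=-1)

    def forward(self, video, glosses, frame_lens, gloss_lens):
        b, t = video.shape[:2]                     # video: (B, T, 3, H, W)
        coarse, fine = self.spatial(video.flatten(0, 1))
        frame_feat = self.fuse(coarse, fine).view(b, t, -1)     # (B, T, 2*dim)
        logits = self.classifier(self.temporal(frame_feat))     # (B, T, V+1)
        log_probs = logits.log_softmax(-1).transpose(0, 1)      # (T, B, V+1) for CTC
        return self.ctc(log_probs, glosses, frame_lens, gloss_lens)


model = MSFETSketch()
video = torch.randn(2, 16, 3, 224, 224)            # 2 clips of 16 frames
glosses = torch.randint(1, 1297, (2, 5))           # 5 gloss labels per clip
loss = model(video, glosses, torch.tensor([16, 16]), torch.tensor([5, 5]))
print(loss.item())
```

A forward pass over a batch of clips returns the CTC loss directly, which is consistent with the end-to-end training described for MSF-ET, although the internals above are simplified stand-ins.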
Figure 3. Multi-scale feature integration and fusion. The spatial encoder outputs feature maps of sizes 3 × 3 and 7 × 7, respectively. These feature maps are first flattened into 1D token sequences. Then, a special [ c l s ] token is prepended to each sequence, similar to ViT. Next, trainable position embeddings are added to the flattened sequences, and a Transformer encoder captures the global context of each feature map. Finally, the two [ c l s ] outputs are concatenated to achieve multi-scale feature fusion.
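Read literally, the caption describes a ViT-style encoding of each feature map followed by concatenation of the two [cls] outputs. The sketch below is our interpretation of that step; the class names, number of layers, and embedding dimension are assumptions rather than the released implementation.

```python
# Sketch of the fusion step in Figure 3 (our interpretation): flatten each map into
# tokens, prepend a learnable [cls] token, add a trainable position embedding, run a
# Transformer encoder, and concatenate the two [cls] outputs.
import torch
import torch.nn as nn


class ScaleEncoder(nn.Module):
    def __init__(self, dim=256, num_tokens=49, layers=1):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_tokens + 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=layers)

    def forward(self, fmap):                          # fmap: (N, C, S, S)
        tokens = fmap.flatten(2).transpose(1, 2)      # (N, S*S, C)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos
        return self.encoder(x)[:, 0]                  # the [cls] output, (N, C)


class MultiScaleFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.coarse_enc = ScaleEncoder(dim, num_tokens=3 * 3)   # 3 x 3 map
        self.fine_enc = ScaleEncoder(dim, num_tokens=7 * 7)     # 7 x 7 map

    def forward(self, coarse, fine):                  # (N, C, 3, 3), (N, C, 7, 7)
        return torch.cat([self.coarse_enc(coarse),
                          self.fine_enc(fine)], dim=-1)          # (N, 2C)


fusion = MultiScaleFusion(dim=256)
fused = fusion(torch.randn(8, 256, 3, 3), torch.randn(8, 256, 7, 7))
print(fused.shape)                                    # torch.Size([8, 512])
```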
Figure 4. An example attention map of the vanilla Transformer. The heatmap denotes the attention scores, and the bar above the heatmap shows the magnitudes of the key vectors. The figure indicates that the attention weights are overly concentrated in regions where the key vectors have larger magnitudes, thereby drowning out information from other positions and hindering the Transformer’s ability to fully comprehend the global information within the sequence.
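As a toy illustration of this effect (our own example, not taken from the paper), the script below compares scaled-dot attention weights for three unit-norm keys against the same keys with the least-similar one enlarged tenfold; the weights shift from following content similarity to following magnitude.

```python
# Toy demonstration of how a key with a large norm can dominate scaled-dot attention.
import torch
import torch.nn.functional as F

d = 2
q = torch.tensor([[1.0, 0.0]])                     # query pointing along x
keys = torch.tensor([[1.0, 0.0],                   # well aligned, unit norm
                     [0.8, 0.6],                   # fairly aligned, unit norm
                     [0.5, 0.866]])                # least aligned, unit norm
long_keys = keys.clone()
long_keys[2] *= 10.0                               # same direction, 10x the magnitude

def scaled_dot_weights(q, k, dim=d):
    return F.softmax(q @ k.T / dim ** 0.5, dim=-1)

print(scaled_dot_weights(q, keys))        # ≈ [0.39, 0.34, 0.27]: follows similarity
print(scaled_dot_weights(q, long_keys))   # ≈ [0.05, 0.05, 0.90]: dominated by the long key
```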
Figure 5. Details of self-attention with Euclidean distance and a local window. The window size is assumed to be 5, so every frame attends to the other frames within the window centered on itself.
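A minimal sketch of such an attention layer is given below. It assumes the scores are negative Euclidean distances between projected queries and keys (scaled by the square root of the dimension) and that positions outside the local window are masked out; the class name, the scaling, and the single-head formulation are our assumptions, not the authors' released code.

```python
# Sketch (our reading of Figure 5): self-attention that scores pairs by negative
# Euclidean distance and restricts each frame to a local window of size 5
# (two neighbours on each side).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalEuclideanAttention(nn.Module):
    def __init__(self, dim, window=5):
        super().__init__()
        assert window % 2 == 1
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.half = window // 2
        self.scale = dim ** 0.5

    def forward(self, x):                                     # x: (B, T, D)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Pairwise Euclidean distances; a smaller distance yields a larger weight.
        scores = -torch.cdist(q, k, p=2) / self.scale         # (B, T, T)
        # Band mask: frame t may only attend inside [t - half, t + half].
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.half
        scores = scores.masked_fill(~band, float('-inf'))
        return F.softmax(scores, dim=-1) @ v                  # (B, T, D)


# Example: 2 videos, 8 frames, 256-dimensional fused frame features.
attn = LocalEuclideanAttention(dim=256, window=5)
out = attn(torch.randn(2, 8, 256))
print(out.shape)                                              # torch.Size([2, 8, 256])
```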
Figure 6. CAM visualization of the attention weights corresponding to feature maps at different scales. We applied this visualization to videos of different sign language performers to demonstrate the generalizability of the results. ((A) is sourced from ‘01April_2010_Thursday_heute_default-1’ and (B) from ‘03November_2010_Wednesday_tagesschau_default-7’, both in the PHOENIX-2014 validation set.)
Figure 7. An example of attention weight visualization for a Transformer using the Euclidean distance-based metric. The sample data in this figure are the same as in Figure 4. It is evident that using Euclidean distance significantly alleviates the over-concentration of attention weights.
Figure 8. Relationship between inference time and video sequence length.
Table 1. Evaluation of various feature selections on PHOENIX-2014. The ↓ means the lower the better.

| Spatial Features | Dev del/ins | Dev WER ↓ | Test del/ins | Test WER ↓ |
|---|---|---|---|---|
| Fine-grained features | 13.5/4.0 | 32.9 | 14.1/3.7 | 33.4 |
| Coarse-grained features | 12.0/3.6 | 27.1 | 11.8/3.6 | 27.0 |
| Fused features | 7.4/3.3 | 22.4 | 7.3/3.2 | 22.6 |
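The del/ins and WER columns in Tables 1–5 follow the standard edit-distance definition of word error rate over gloss sequences. The helper below is a minimal reference computation of our own (the PHOENIX-2014 benchmark ships its own evaluation script), and the gloss strings in the usage example are made up; the tables report the resulting ratio as a percentage.

```python
# Minimal WER computation by Levenshtein alignment (our own helper). Returns the
# word error rate together with the deletion and insertion counts of one
# minimal-cost alignment.
def wer(reference, hypothesis):
    r, h = reference, hypothesis
    n, m = len(r), len(h)
    # dp[i][j] = (cost, deletions, insertions) for aligning r[:i] with h[:j]
    dp = [[(0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, i, 0)                  # i deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, j)                  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = (dp[i - 1][j - 1][0] + (r[i - 1] != h[j - 1]),
                   dp[i - 1][j - 1][1], dp[i - 1][j - 1][2])
            dele = (dp[i - 1][j][0] + 1, dp[i - 1][j][1] + 1, dp[i - 1][j][2])
            ins = (dp[i][j - 1][0] + 1, dp[i][j - 1][1], dp[i][j - 1][2] + 1)
            dp[i][j] = min(sub, dele, ins)    # ties broken on cost, then del, then ins
    cost, dels, inss = dp[n][m]
    return cost / max(n, 1), dels, inss


ref = "WETTER MORGEN REGEN NORD".split()      # made-up gloss sequences
hyp = "WETTER REGEN NORD SUED".split()
print(wer(ref, hyp))                          # (0.5, 1, 1): WER 50%, 1 del, 1 ins
```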
Table 2. Evaluation of different feature fusion methods on PHOENIX-2014.

| Fusion Methods | Dev del/ins | Dev WER | Test del/ins | Test WER |
|---|---|---|---|---|
| Concatenation, max-pooling | 14.6/3.3 | 26.7 | 14.4/3.9 | 27.1 |
| Concatenation, avg-pooling | 12.1/4.9 | 25.9 | 12.8/3.6 | 26.2 |
| Concatenation, self-attention | 7.4/3.3 | 22.4 | 7.3/3.2 | 22.6 |
| Sum, max-pooling | 13.7/3.7 | 26.5 | 13.5/4.0 | 26.9 |
| Sum, avg-pooling | 12.8/4.2 | 26.1 | 12.2/3.7 | 25.8 |
| Sum, self-attention | 7.6/3.7 | 22.8 | 7.4/4.1 | 23.1 |
Table 3. Evaluation of different attention weight calculation methods on PHOENIX-2014.

| Spatial Features | Attention | Dev del/ins | Dev WER | Test del/ins | Test WER |
|---|---|---|---|---|---|
| Coarse-grained features | scaled-dot | 12.3/4.8 | 28.3 | 12.7/4.1 | 28.5 |
| Coarse-grained features | Euclidean | 12.0/3.6 | 27.1 | 11.8/3.6 | 27.0 |
| Fine-grained features | scaled-dot | 16.1/2.9 | 34.7 | 15.3/3.5 | 34.5 |
| Fine-grained features | Euclidean | 13.5/4.0 | 32.9 | 14.1/3.7 | 33.4 |
| Fused features | scaled-dot | 8.7/3.4 | 26.3 | 8.4/4.2 | 25.9 |
| Fused features | Euclidean | 7.4/3.3 | 22.4 | 7.3/3.2 | 22.6 |
Table 4. Evaluation of last local layers and local size before output on PHOENIX-2014. The term “w/o” denotes that the results are output directly without employing any additional network.

| Last Layers | Local Size | Dev del/ins | Dev WER | Test del/ins | Test WER |
|---|---|---|---|---|---|
| w/o | - | 10.3/3.2 | 28.9 | 10.5/3.4 | 29.3 |
| 1D-CNN | 3 | 11.2/3.9 | 27.2 | 11.8/3.7 | 27.7 |
| 1D-CNN | 5 | 7.9/3.8 | 26.1 | 7.8/3.7 | 25.9 |
| 1D-CNN | 7 | 7.7/3.7 | 25.6 | 7.9/3.7 | 25.8 |
| Local Transformer | 3 | 8.2/3.1 | 24.7 | 7.6/4.2 | 24.2 |
| Local Transformer | 5 | 7.4/3.3 | 22.4 | 7.3/3.2 | 22.6 |
| Local Transformer | 7 | 8.5/4.7 | 25.0 | 7.7/3.7 | 24.6 |
Table 5. Comparison of our proposed method with existing approaches on the PHOENIX-2014 dataset. “RGB” and “Flow” refer to frame images and optical flow, respectively; “SL” stands for supervised learning and “RL” for reinforcement learning. The ↓ means the lower the better.

| Models | Dev del ↓/ins ↓ | Dev WER ↓ | Test del ↓/ins ↓ | Test WER ↓ |
|---|---|---|---|---|
| HMM/CNN/RNN Based | | | | |
| CMLLR [50] | 21.8/3.9 | 55.0 | 20.3/4.5 | 53.0 |
| SubUNets [2] | 14.6/4.0 | 40.8 | 14.3/4.0 | 40.7 |
| Staged-Opt [3] | 13.7/7.3 | 39.4 | 12.2/7.5 | 38.7 |
| Re-sign [4] | -/- | 27.1 | -/- | 26.8 |
| LS-HAN [32] | -/- | - | - | 38.3 |
| CNN-HMM [19] | -/- | 31.6 | -/- | 32.5 |
| DenseTCN [16] | 10.7/5.1 | 35.9 | 10.5/5.5 | 36.5 |
| CNN-LSTM-HMM [20] | -/- | 26.0 | -/- | 26.0 |
| DNF (RGB) [37] | 7.8/3.5 | 23.8 | 7.8/3.4 | 24.4 |
| DNF (RGB + Flow) [37] | 7.3/3.3 | 23.1 | 6.7/3.3 | 22.9 |
| FCN [12] | -/- | 23.7 | -/- | 23.9 |
| SMC [7] | 7.6/3.8 | 22.7 | 7.4/3.5 | 22.4 |
| VAC [51] | 7.9/2.5 | 21.2 | 8.4/2.6 | 22.3 |
| mLTSF-Net [9] | 8.5/3.3 | 23.8 | 8.3/3.1 | 23.5 |
| CVT-SLR [52] | -/- | 21.8 | -/- | 22.0 |
| MSKA-SLR [53] | -/- | 21.7 | -/- | 22.1 |
| Transformer Based | | | | |
| REINFORCE (SL) [34] | 5.7/6.8 | 39.7 | 5.8/6.8 | 40.0 |
| REINFORCE (RL) [34] | 7.3/5.2 | 38.0 | 7.0/5.7 | 38.3 |
| SAFI [54] | 16.6/1.8 | 31.7 | 15.1/1.7 | 31.3 |
| SL-Transformers [18] | 39.3/2.8 | 50.18 | 37.0/2.8 | 47.96 |
| SL-Transformers (pre) [18] | 13.5/5.7 | 26.7 | 13.8/6.4 | 27.62 |
| PiSLTRc-R [55] | 8.1/3.4 | 23.4 | 7.6/3.3 | 23.2 |
| signBERT+ [56] | -/- | 34.0 | -/- | 34.1 |
| MSF-ET (Ours) | 7.4/3.3 | 22.4 | 7.3/3.2 | 22.6 |
Table 6. The evaluation results on the CSL dataset.

| Method | WER |
|---|---|
| LS-HAN [32] | 17.3 |
| HLSTM-attn [57] | 10.2 |
| CTF [13] | 11.2 |
| DenseTCN [16] | 14.3 |
| SF-Net [6] | 3.8 |
| FCN [12] | 3.0 |
| mLTSF-Net [9] | 2.8 |
| PiSLTRc-R [55] | 2.8 |
| SL-Transformer (our implementation) | 3.7 |
| MSF-ET (Ours) | 2.7 |