Article

HandFormer: A Dynamic Hand Gesture Recognition Method Based on Attention Mechanism

1 School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2 Key Laboratory of Applications of Computer Technology of Yunnan Province, Kunming 650500, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(7), 4558; https://doi.org/10.3390/app13074558
Submission received: 21 February 2023 / Revised: 30 March 2023 / Accepted: 31 March 2023 / Published: 3 April 2023

Abstract:
The application of dynamic gestures is extensive in the field of automated intelligent manufacturing. Due to the temporal and spatial complexity of dynamic gesture data, traditional machine learning algorithms struggle to extract accurate gesture features. Existing dynamic gesture recognition algorithms have complex network designs, high parameter counts, and inadequate gesture feature extraction. In order to solve the problems of low accuracy and high computational complexity in current dynamic gesture recognition, a network model based on the MetaFormer architecture and an attention mechanism was designed. The proposed network fuses a CNN (convolutional neural network) and Transformer model by embedding spatial attention convolution and temporal attention convolution into the Transformer model. Specifically, the token mixer in the MetaFormer block is replaced by the Spatial Attention Convolution Block and Temporal Attention Convolution Block to obtain the Spatial Attention Former Block and Temporal Attention Former Block. Firstly, each frame of the input image is quickly down-sampled by the PoolFormer block and then input to the Spatial Attention Former Block to learn spatial feature information. Then, the spatial feature maps learned from each frame are concatenated along the channel dimension and input to the Temporal Attention Former Block to learn the temporal feature information of the gesture action. Finally, the learned overall feature information is classified to obtain the category of dynamic gestures. The model achieves an average recognition accuracy of 96.72% and 92.16% on two publicly available datasets, Jester and NVGesture, respectively.

1. Introduction

With the increasing prevalence of artificial intelligence in both the manufacturing and service industries, intelligent manufacturing and intelligent services have emerged as the prevailing trends of future development. Within human–machine interaction, gestures have become a vital means of information communication, and are therefore a critical aspect to consider. Driven by the sensory inputs of sound and vision, the use of gestures is considered to be the most effective and powerful method in this context [1]. Gesture recognition has a wide range of practical applications, including remote automation equipment operation, intelligent vehicles, remote medical care, remote teaching guidance, and VR/AR (Virtual Reality/Augmented Reality) game interaction [2]. In addition, the gesture recognition community pays special attention to understanding the ways to improve the quality of life of hearing-impaired people, and gesture recognition can be used in sign language recognition and translation tasks for hearing-impaired people [3]. Numerous studies on gesture recognition have been conducted, and significant progress has been made in this field.
Gesture recognition can be categorized into static gesture recognition and dynamic gesture recognition. Static gesture recognition only requires learning and classifying the spatial characteristics of the gesture, without considering the temporal characteristics. In contrast, dynamic gesture recognition must take into account both the spatial and temporal characteristics of the gesture as it changes over time. As a result, dynamic gesture recognition is significantly more complex than static gesture recognition, but dynamic gestures have a wider range of applications. This paper presents a lightweight gesture action recognition network for real-time human–computer interaction and control, such as smart car interaction. Despite the great practical potential of effective gesture recognition, it remains an unsolved problem because of the significant differences in the semantic and syntactic structure of gestures. In particular, for sign language recognition technology for hearing-impaired people, explicit translation from sign language to a textual representation is currently close to unsolvable: there is still no fully automatic model or method that recognizes arbitrary dynamic gestures. Developing such a model requires deep semantic analysis, which can only be carried out at a superficial level given the limitations of current text analysis algorithms and knowledge bases, so only word-level recognition can be achieved. Nevertheless, the proposed method also has some reference significance for sign language recognition for hearing-impaired people.
In order to address the above problems, we propose a lightweight CNN and Transformer [4] fusion model based on spatial attention and temporal attention mechanisms. The model is mainly divided into three parts. The first part is the fast down-sampling PoolFormer model [5] to roughly extract the spatial features of gestures and compress them. The second part is the Spatial Attention Former Block based on a spatial attention mechanism to extract the spatial features of gestures. The third part is the Temporal Attention Former Block based on temporal attention to extract the temporal features of gestures.
In summary, the main contributions of this paper are as follows:
(1)
This paper proposes a deep neural network that fuses a CNN and Transformer model by embedding spatial attention convolution and temporal attention convolution into the Transformer model.
(2)
The spatial features of each frame of the input gesture are extracted, and the temporal information is fused into the channel dimension by concatenating the per-frame feature maps; the temporal features are then extracted from this fused representation.
(3)
The proposed method is tested on three datasets. The experimental results show that the proposed method has reference value and high robustness for applications in real-time human–computer interaction and control, sign language recognition, and other fields.
This paper is structured as follows: Section 2 introduces related work, followed by a detailed description of our proposed method in Section 3. Section 4 presents our experiments and analysis of the results. Section 5 discusses the work and outlines future directions, and Section 6 concludes the paper.

2. Related Works

2.1. Hand Gesture Recognition Datasets

For deep learning tasks, especially gesture recognition tasks, datasets are crucial because the input and output of the network model and the convergence of parameters are strongly dependent on the dataset. In this section, we describe the datasets related to gesture recognition and the best methods for these datasets.
As shown in Table 1, many datasets are commonly used for gesture recognition; these datasets and the methods associated with them are briefly introduced in the following. In 2016, Molchanov et al. introduced NVGesture, a human–vehicle gesture interaction dataset for driving scenarios [6]. Min, Yuecong et al. [7] formulated gesture recognition as an irregular sequence recognition problem and proposed the PointLSTM method, which captures long-term spatial correlations across point cloud sequences; its recognition accuracy on the NVGesture dataset reached 87.9%. Mahdi Abavisani et al. [8] used knowledge from multiple modalities to train a single-modality 3D convolutional neural network for the gesture recognition task and achieved 84.9% accuracy on the NVGesture dataset. Joze et al. [9] used a simple multimodal transfer module (MMTM) based on a squeeze-and-excitation operation to achieve 84.85% recognition accuracy on the NVGesture dataset. In 2017, Y. Zhang et al. released EgoGesture, a first-person gesture recognition dataset [10]. The EgoGesture dataset contains 2081 RGB-D videos, 24,161 gesture samples, and 2,953,224 frames from 50 distinct subjects. C. Cao et al. [11] achieved an average recognition accuracy of 92.2% on the EgoGesture dataset by fusing a 3D convolutional neural network and a spatio-temporal Transformer model. Köpüklü et al. [12] used the ResNeXt-101 deep learning network to achieve an accuracy of 94.03% on the EgoGesture dataset. In 2019, Materzynska et al. introduced Jester, a dataset of gesture interactions captured by a front-facing camera [13]. Kopuklu et al. [14] used Motion Fused Frames (MFFs) to fuse motion information into static images to better represent the spatio-temporal state of actions, and achieved 96.6% accuracy on the Jester dataset. Zhou, Bolei, et al. [15] proposed the Temporal Relation Network (TRN) based on temporal relation reasoning, which achieved 94.78% accuracy on the Jester dataset. In 2020, Sincan, O. M. et al. introduced AUTSL (Ankara University Turkish Sign Language), a dataset for sign language recognition [16]. Ryumin, D. et al. [17] used a set of unique spatiotemporal features together with prediction-level, feature-level, and model-level fusion to achieve 98.56% accuracy on the AUTSL dataset. Jiang et al. [18] used a multimodal sign language recognition method based on skeleton perception to achieve 98.56% accuracy on the AUTSL dataset. Novopoltsev et al. [19] used a multi-scale vision Transformer model fine-tuned on other sign language datasets to achieve 95.72% accuracy on the AUTSL dataset. In 2020, Li D et al. introduced the WLASL (Word-Level American Sign Language) dataset, another dataset for sign language recognition [20]. Jiang et al. [18] used a multimodal sign language recognition method based on skeleton perception to achieve 58.73% accuracy on the WLASL dataset. Novopoltsev et al. [19] used the Video Swin Transformer method to achieve 58.51% recognition accuracy on the WLASL dataset. In 2021, Khaleghi, Leyla et al. presented MuViHand, a synthetic 3D gesture recognition dataset [21]. Khaleghi et al. [22] used a Transformer sequence learning model to achieve 86.1% recognition accuracy on the MuViHand dataset. This paper aimed to explore dynamic gesture recognition technology that can be widely used in human–computer interaction systems.
Therefore, this paper selected Jester and NVGesture, two human–computer interaction datasets consisting of RGB images and videos captured by front-facing cameras.

2.2. Dynamic Hand Gesture Recognition

Dynamic gestures are widely used in real-life and production environments due to their ability to convey more information and offer greater discrimination between different gestures than static gestures. Numerous studies have focused on dynamic gesture recognition, including the work proposed in [23], which presents a multi-scale spatio-temporal feature fusion network based on the Convolutional Visual Transformer (CvT) model. The CvT network is used to extract spatial features from a single gesture image, and shallow and deep features of different spatial scales are combined. Additionally, a multi-time-scale aggregation module is designed to extract spatio-temporal features of dynamic gestures, combining the CvT network with the aggregation module to eliminate invalid features. The R-Drop model is applied to the multi-scale spatio-temporal feature fusion network for dynamic gesture recognition to overcome the limitations of the dropout layer in the CvT network. Furthermore, Chen Xuanqi et al. [24] proposed an Attention-guided Spatial Graph Convolution Simple Recurrent Unit (ASGC-SRU) network, which embeds spatial graph convolution into the gate structure of the SRU, enabling it to model temporal and spatial information of complex gestures with high-speed parallel computing. A joint attention guidance module is introduced to provide more importance to the critical joint points. Finally, an attention-enhanced spatial graph dropout (ASD) regularization method is used to reduce overfitting and enhance the accuracy of dynamic gesture recognition. Additionally, Mahdi Abavisani et al. [8] proposed a method to train a single-modality 3D convolutional neural network (3D-CNN) for dynamic gesture recognition by utilizing knowledge from multiple modalities. Specifically, the knowledge from multiple modalities is integrated into a single network to improve the performance of each unimodal network. However, since this method uses 3D convolution, it has limitations in terms of its application scenarios.
In recent years, attention mechanisms have become increasingly popular in gesture recognition research, especially following the success of the Visual Transformer [4]. One example is the Attention-enhanced Graph Convolution LSTM (long short-term memory) network (AGC-LSTM) proposed in [25], which can recognize human behavior from skeletal data by capturing discriminative features in both spatial configurations and temporal dynamics, and by exploring the co-occurrence relationship between the spatial and temporal domains. However, this approach employs the LSTM model, which limits its parallel computing ability. Another example is the multimodal transfer module (MMTM) proposed in [9], which uses squeeze-and-excitation operations to recalibrate channel features in each CNN stream with knowledge from multiple modalities, allowing for the full utilization of knowledge from multiple modalities in the convolutional neural network. This module can be added to different levels of the feature hierarchy to achieve slow-modal fusion. While this method significantly improves the recognition accuracy, it has a large number of parameters and computational costs.

3. Method

In this section, we provide a detailed description of our method. Inspired by PoolFormer [5], we adopt a convolutional neural network combined with a Transformer network, similar to the MetaFormer [5] structure, as the basic deep neural network block. Specifically, we design a network model for dynamic gesture recognition based on a spatial attention mechanism and temporal attention mechanism, based on the MetaFormer block. The proposed method is shown in Figure 1. The method consists of four stages, and it takes an n-frame video sequence as input. In the first stage, each frame of the input gesture action video sequence is quickly down-sampled by the PoolFormer block to coarsely extract the gesture features and compress the spatial feature information of the gesture. Then, in the second and third stages, which are stacked with the Spatial Attention Former Block (SAFB) designed based on the spatial attention mechanism, the spatial feature information of each frame of the input video sequence is learned, and the input feature map is further down-sampled to refine the spatial feature information of the gesture in each stage. The feature maps learned for each frame are then concatenated along the channel dimension to obtain the total feature map containing both spatial and temporal information. Finally, the feature map containing spatial and temporal features is input into the Temporal Attention Former Block (TAFB) designed based on the channel attention mechanism in the fourth stage to learn temporal feature information. The learned feature map containing spatial and temporal features is then input into a classifier for dynamic gesture classification and recognition.
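Before each block is described in detail, the following PyTorch sketch reproduces only the shape flow of the four stages outlined above (per-frame processing in stages 1-3, channel-wise concatenation of the frames, temporal processing in stage 4, then classification). The class name HandFormerSketch, the per-stage modules, and the channel widths are illustrative stand-ins rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class HandFormerSketch(nn.Module):
    """Shape-level sketch of the four-stage pipeline. The per-stage modules below are plain
    convolutional stand-ins for the PoolFormer, SAFB, and TAFB stages described in the text,
    and the channel widths are assumptions, not the authors' configuration."""
    def __init__(self, num_frames=30, num_classes=27, dim=64):
        super().__init__()
        # Stage 1: fast down-sampling applied to every frame (PoolFormer block role).
        self.stage1 = nn.Sequential(nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.GELU())
        # Stages 2-3: per-frame spatial feature learning with further down-sampling (SAFB role).
        self.stage2 = nn.Sequential(nn.Conv2d(dim, dim * 2, 3, stride=2, padding=1), nn.GELU())
        self.stage3 = nn.Sequential(nn.Conv2d(dim * 2, dim * 4, 3, stride=2, padding=1), nn.GELU())
        # Stage 4: temporal learning over the channel-concatenated per-frame maps (TAFB role).
        self.stage4 = nn.Sequential(
            nn.Conv2d(num_frames * dim * 4, dim * 4, 3, stride=2, padding=1), nn.GELU())
        self.head = nn.Linear(dim * 4, num_classes)

    def forward(self, x):                                  # x: (B, T, 3, H, W)
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)                 # treat each frame independently
        feats = self.stage3(self.stage2(self.stage1(frames)))
        feats = feats.reshape(b, -1, *feats.shape[-2:])    # concatenate frames along channels
        feats = self.stage4(feats)
        return self.head(feats.mean(dim=(-2, -1)))         # global average pool + classifier

clip = torch.randn(2, 30, 3, 224, 224)                     # batch of 2 clips, 30 frames each
print(HandFormerSketch()(clip).shape)                      # torch.Size([2, 27])
```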
In the following section, we provide a detailed description of the MetaFormer block, the Spatial Attention Former Block (SAFB) designed based on the spatial attention mechanism, and the Temporal Attention Former Block (TAFB) designed based on the temporal attention mechanism.

3.1. MetaFormer

PoolFormer has demonstrated that the high performance of Transformer-like models mainly originates from the powerful ability of the MetaFormer structure [5]. The MetaFormer structure is shown in Figure 2a, where the input of MetaFormer is first passed through a patch embedding:
$$F_{pe} = \mathrm{PatchEmbedding}(I).$$
Here, the formula shows that the tokenized feature map $F_{pe}$ is obtained after the input data are passed through the patch embedding, where $I \in \mathbb{R}^{H \times W \times C}$ is the input of MetaFormer, a feature map of size $H \times W$ with $C$ channels. The patch embedding layer converts the feature map into the tokenized feature map $F_{pe}$. Patch embedding is typically implemented as a $3 \times 3$ convolution with a padding size of 1 and a stride size of 1, which ensures that the size of the feature map remains unchanged before and after the operation. The tokenized feature map is then passed through the MetaFormer block, which consists of two residual connection blocks. The first residual connection block includes a normalization layer, a token mixer layer that learns features, and a skip connection, which can be expressed as
$$F_{tf} = F_{pe} + \mathrm{TokenMixer}(\mathrm{Norm}(F_{pe})).$$
The above equation represents the first residual connection block in the MetaFormer architecture, where $\mathrm{Norm}(\cdot)$ denotes a normalization layer, such as layer normalization or batch normalization, and $\mathrm{TokenMixer}(\cdot)$ is the core block in MetaFormer that learns fine-grained features from the input image; it can be implemented using a convolutional neural network or an MLP (Multilayer Perceptron) mixer. The second residual connection block consists of a Norm layer, an MLP layer with a learnable channel expansion and an activation function such as GELU or ReLU, and a skip connection:
$$F_{mf} = F_{tf} + \sigma(\mathrm{MLP}(\mathrm{Norm}(F_{tf}))).$$
The above equation represents the second residual connection block in the MetaFormer architecture, where $\mathrm{MLP}(\cdot)$ is a learnable channel expansion layer and $\sigma(\cdot)$ is an activation function, such as GELU or ReLU, that enhances the non-linear expressive capacity of the model.
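The following PyTorch sketch illustrates the generic MetaFormer block defined by the equations above, with PoolFormer's non-learnable average pooling standing in as the token mixer; the channel width, the GroupNorm-based normalization, and the GELU activation are assumptions rather than the authors' exact choices.

```python
import torch
import torch.nn as nn

# Patch embedding as described above: a 3x3 convolution with stride 1 and padding 1,
# so the spatial size is preserved. The output channel count of 64 is an assumption.
patch_embed = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)

class MetaFormerBlockSketch(nn.Module):
    """Generic MetaFormer block: Norm -> token mixer -> residual, then Norm -> channel MLP
    -> residual. The token mixer here is the non-learnable average pooling used by
    PoolFormer; the SAFB and TAFB below swap in their attention convolutions instead."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)          # channel-wise LayerNorm stand-in
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(                  # learnable channel expansion with activation
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):
        y = self.norm1(x)
        x = x + (self.pool(y) - y)                 # F_tf = F_pe + TokenMixer(Norm(F_pe))
        return x + self.mlp(self.norm2(x))         # F_mf = F_tf + MLP(Norm(F_tf))

tokens = patch_embed(torch.randn(1, 3, 56, 56))
print(MetaFormerBlockSketch(64)(tokens).shape)     # torch.Size([1, 64, 56, 56])
```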
Inspired by MetaFormer, we designed a network model for dynamic gesture recognition based on a spatial attention mechanism and a temporal attention mechanism. Each stacked block in this network is built on the MetaFormer block. In the first stage of the network, the token mixer is replaced with the simplest non-learnable pooling operation, yielding a PoolFormer block. In the second and third stages, the token mixer is replaced with the proposed spatial-attention-based convolution operations, yielding the Spatial Attention Former Block (SAFB) that extracts spatial feature information. In the final stage, the token mixer is replaced with the proposed temporal-attention-based convolution operations, yielding the Temporal Attention Former Block (TAFB) that extracts temporal feature information.

3.2. Spatial Attention Former Block (SAFB)

To learn the spatial feature information of each frame of the gesture, we propose the Spatial Attention Former Block (SAFB), which combines convolutional operations and the Transformer structure based on the spatial attention mechanism, as shown in Figure 2b. The SAFB is built on the MetaFormer block and replaces its token mixer with the Spatial Attention Convolution Block (SACB), shown in Figure 3, which is based on the spatial attention mechanism. The SACB uses convolutional neural networks and a spatial attention mechanism to learn the spatial features of the input image; the attention mechanism learns weights that select the spatial information useful for gesture recognition. Firstly, the input image is fed into two branches. One branch uses convolutional operations to extract the spatial features of the gesture. Specifically, it applies two depthwise convolutions (DWC) with different kernel sizes to the input feature map, channel by channel, to extract gesture spatial features with different receptive fields. The feature maps obtained by the two convolutional operations are then fused and passed through a pointwise convolution (PWC) that maps them to different dimensions [26]. This branch can be represented as follows:
$$F_{cb} = \sigma\big(\phi\big(\varphi_1^{3,3}(F_{in}) \oplus \varphi_1^{5,5}(F_{in})\big)\big).$$
The above equation represents the spatial feature extraction branch in the SACB, where $F_{in}$ is the input feature map, $\varphi_s^{k,k}(\cdot)$ denotes a depthwise convolution with kernel size $k$ and stride $s$, $\phi(\cdot)$ denotes the pointwise convolution, $\oplus$ denotes the feature fusion operation, and $\sigma(\cdot)$ denotes the activation function. The other branch generates the spatial attention weight matrix through the spatial attention mechanism and multiplies it with the spatial feature map learned by the first branch to select different spatial feature information. It first applies a 32-channel convolution with a kernel size of 3 and an activation function, followed by a 64-channel convolution with a kernel size of 3 and another activation function, to gradually increase the dimensionality of the feature map and enhance the non-linear expressive ability of the features. Finally, it applies a 1-channel convolution to generate the attention weight matrix for the corresponding spatial feature map. This branch can be represented as
$$W_{SA} = f_1^{1,1,1}\big(\sigma\big(f_1^{3,3,64}\big(\sigma\big(f_1^{3,3,32}(F_{in})\big)\big)\big)\big).$$
The above equation represents the spatial attention weight extraction branch in the SACB, where $F_{in}$ is the input feature map, $f_s^{k,k,c}(\cdot)$ denotes a convolution with kernel size $k$, output channels $c$, and stride $s$, $\sigma(\cdot)$ is the activation function, and $W_{SA}$ is the resulting attention weight matrix. After the spatial feature map $F_{cb}$ and the spatial attention weight matrix $W_{SA}$ are learned through the two branches, the Spatial Attention Convolution Block (SACB) applies the spatial attention weight matrix to the spatial feature map by performing a matrix multiplication between the two matrices. This operation can be represented as
$$F_{SA} = W_{SA} \odot F_{cb}.$$
The above equation represents the multiplication of the spatial attention weight matrix with the spatial feature map to obtain the spatial attention feature map, where $\odot$ denotes the matrix dot product operation and $F_{SA}$ is the resulting spatial attention feature map. Finally, by replacing the token mixer in MetaFormer with the Spatial Attention Convolution Block (SACB), we obtain the Spatial Attention Former Block (SAFB), which can be represented as
$$F_{out} = F_{in} + \mathrm{SACB}(\mathrm{Norm}(F_{in})).$$
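A possible PyTorch realization of the SACB and the resulting SAFB is sketched below. The 3x3/5x5 depthwise kernels and the 32/64/1-channel attention branch follow the description above, while the GELU activations, the GroupNorm normalization, and the Sigmoid that bounds the attention weights are assumptions.

```python
import torch
import torch.nn as nn

class SACBSketch(nn.Module):
    """Two-branch Spatial Attention Convolution Block sketch: a feature branch (two depthwise
    convolutions + pointwise convolution) modulated by a learned spatial weight map W_SA."""
    def __init__(self, dim):
        super().__init__()
        # Feature branch: depthwise convolutions with different receptive fields + pointwise conv.
        self.dwc3 = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.dwc5 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.pwc = nn.Conv2d(dim, dim, 1)
        self.act = nn.GELU()
        # Attention branch: 3x3/32 -> 3x3/64 -> 1x1/1 convolutions give the weight map W_SA.
        self.attn = nn.Sequential(
            nn.Conv2d(dim, 32, 3, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.GELU(),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())    # Sigmoid is an assumed bound on the weights

    def forward(self, x):
        f_cb = self.act(self.pwc(self.dwc3(x) + self.dwc5(x)))   # fuse the two DWC outputs
        return self.attn(x) * f_cb                               # F_SA = W_SA (.) F_cb

class SAFBSketch(nn.Module):
    """SAFB: a MetaFormer block whose token mixer is the SACB (F_out = F_in + SACB(Norm(F_in)))."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.GroupNorm(1, dim), nn.GroupNorm(1, dim)
        self.sacb = SACBSketch(dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
                                 nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):
        x = x + self.sacb(self.norm1(x))
        return x + self.mlp(self.norm2(x))

print(SAFBSketch(64)(torch.randn(1, 64, 28, 28)).shape)          # torch.Size([1, 64, 28, 28])
```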

3.3. Temporal Attention Former Block (TAFB)

In order to learn the temporal features of the gesture video frame sequence, a Temporal Attention Former Block (TAFB) is proposed based on MetaFormer. Its structure is shown in Figure 2c, where the token mixer in the MetaFormer block is replaced with a Temporal Attention Convolution Block (TACB) designed based on the channel attention mechanism, as shown in Figure 4. Before learning the temporal features, the spatial feature maps learned for each frame in the spatial feature extraction stage are concatenated along the channel dimension. In this way, the temporal feature information of the gesture is included in the channel of the obtained total feature map, and the channel attention mechanism is used to learn the temporal feature information of the gesture. Specifically, the spatial attention feature maps learned in the previous stage are first concatenated along the channel dimension according to the frame sequence to obtain the total feature map, which can be represented as
$$F_{ST} = \mathrm{Concat}(F_{SA}^{1}, F_{SA}^{2}, \ldots, F_{SA}^{n}).$$
The above equation represents the concatenation of the learned spatial attention feature maps along the channel dimension across the frame sequence, where $F_{SA}^{i}, i \in \{1, 2, \ldots, n\}$, is the spatial attention feature map obtained after the spatial feature extraction stage for the $i$th frame, $\mathrm{Concat}(\cdot)$ denotes the concatenation operation along the channel dimension, and $F_{ST}$ is the total feature map, which contains both the spatial attention features and the temporal features. The total feature map is then tokenized through patch embedding and input into the proposed Temporal Attention Former Block (TAFB) to learn the temporal features of the gestures. As the temporal information is embedded in the channel information of the total feature map, a channel attention module can be used to learn the temporal features of the gestures. The implementation of the channel attention mechanism is based on SENet, proposed in [27]. First, the input is globally average-pooled to obtain the global feature value of the gesture; it is then passed through an MLP layer built from fully connected layers and non-linear activation functions, followed by an up-sampling operation that increases the dimensionality of the learned features to obtain the feature map $F_{us}$. Finally, a convolution layer with the same kernel size as the feature map $F_{us}$ and the same number of channels as the input feature map is applied to obtain the temporal attention weight vector $V_{TA}$. This process can be expressed as follows:
$$V_{TA} = f_0^{H_{us}, W_{us}, C_{in}}\big(Y_{us}\big(\mathrm{MLP}\big(\mathrm{GAP}(F_{in})\big)\big)\big),$$
where $F_{in}$ is the input feature map with $C_{in}$ channels, $\mathrm{GAP}(\cdot)$ denotes global average pooling, $Y_{us}(\cdot)$ denotes the up-sampling operation, and $H_{us}$ and $W_{us}$ are the height and width of the feature map $F_{us}$ obtained after up-sampling, respectively. The MLP is implemented using three fully connected layers with activation functions, which can be represented as
$$V_{tf} = \sigma\big(FC_{n_1}\big(\sigma\big(FC_{n_2}\big(\sigma\big(FC_{n_3}(V_{in})\big)\big)\big)\big)\big).$$
The above equation represents the implementation of the MLP structure, where $V_{in}$ is the input feature vector, $V_{tf}$ is the output temporal feature vector, $\sigma(\cdot)$ is a non-linear activation function, and $FC_n(\cdot)$ is a fully connected layer with $n$ nodes. The $i$th component ($i \in \{1, 2, \ldots, C_{in}\}$) of the temporal attention weight vector $V_{TA}$ is then multiplied by the feature map of the $i$th channel of the input feature map $F_{in}$, resulting in the temporal attention feature map. This operation can be represented as
$$F_{TA} = \big[V_{TA}^{1} \times F_{in}^{(\cdot,\cdot,1)},\; V_{TA}^{2} \times F_{in}^{(\cdot,\cdot,2)},\; \ldots,\; V_{TA}^{C_{in}} \times F_{in}^{(\cdot,\cdot,C_{in})}\big],$$
where $V_{TA}^{i}, i \in \{1, 2, \ldots, C_{in}\}$, is the $i$th component of the attention weight vector $V_{TA}$, and $F_{in}^{(\cdot,\cdot,i)}, i \in \{1, 2, \ldots, C_{in}\}$, is the feature map on the $i$th channel of the input feature map $F_{in}$. Multiplying each $V_{TA}^{i}$ by the corresponding $F_{in}^{(\cdot,\cdot,i)}$ and concatenating the results along the channel dimension produces the temporal attention feature map $F_{TA}$.
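The channel-attention computation of the TACB can be sketched in PyTorch as follows. The GAP → MLP → up-sampling → convolution flow follows the equations above, while the hidden layer width, the 7 × 7 up-sampled size, and the Sigmoid gate are assumptions.

```python
import torch
import torch.nn as nn

class TACBSketch(nn.Module):
    """Sketch of the Temporal Attention Convolution Block: SENet-style channel attention over
    the frame-concatenated feature map, whose channels carry the temporal information."""
    def __init__(self, channels, hidden=128, up_size=7):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                 # GAP(F_in): one value per channel
        self.mlp = nn.Sequential(                          # three FC layers with activations
            nn.Linear(channels, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, channels), nn.GELU())
        self.up_size = up_size
        # Convolution whose kernel matches the up-sampled map F_us and whose output has C_in channels.
        self.to_weights = nn.Conv2d(channels, channels, kernel_size=up_size)
        self.gate = nn.Sigmoid()                           # assumed bound on the weights

    def forward(self, x):                                  # x: (B, C_in, H, W)
        v = self.mlp(self.gap(x).flatten(1))               # (B, C_in)
        f_us = v[:, :, None, None].expand(-1, -1, self.up_size, self.up_size)  # up-sampling step
        v_ta = self.gate(self.to_weights(f_us))            # temporal weight vector V_TA, (B, C_in, 1, 1)
        return x * v_ta                                    # re-weight each channel's feature map

print(TACBSketch(256)(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])
```

Wrapping this block in the same two-residual MetaFormer structure used for the SAFB above yields the TAFB.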

4. Results

4.1. Datasets

This paper conducted experiments on two publicly available datasets, Jester [13] and NVGesture [6], as well as a self-built dataset. The Jester dataset contains 148,092 video frame sequences of 27 dynamic gesture categories performed by 1376 action performers. The total number of video frames is 5,331,312, with an average of 36 frames per gesture action video. Each gesture action category contains no fewer than 4000 video frames, and the data are divided into training, validation, and test sets in an 8:1:1 ratio. The NVGesture dataset includes RGB and RGB-D multimodal video data of 25 gesture action categories, consisting of 1050 samples for training and 482 samples for testing. The self-built dataset comprises video frame sequences of eight gesture action categories performed by eight action performers, captured with a computer camera in different environments with insufficient or excessive lighting and large shooting angles. Each action was repeated five times by each performer, and the average execution time for each action was 3 s. The self-built dataset contains a total of 960 video frame sequences, and some key frames are shown in Figure 5. In addition, data augmentation techniques were used to improve the robustness of the proposed method. To make the model more adaptable to distorted images, the original images were augmented with affine and projection transformations, with per-axis scaling ranging from −10% to +10% (i.e., 0.9 to 1.1 times the original size). Furthermore, a rotation-and-scaling augmentation was also employed, with rotation angles ranging from −15 to +15 degrees and scaling factors ranging from 0.8 to 1.2.
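For illustration, a torchvision-based sketch of such an augmentation pipeline is given below; only the numeric ranges come from the text, while the parameter pairing, composition order, and the use of torchvision itself are assumptions.

```python
import torchvision.transforms as T

# Hedged sketch of the augmentation described above. The authors' exact implementation is
# not given; only the ranges quoted in the text are reproduced here.
affine_projective = T.Compose([
    T.RandomAffine(degrees=0, scale=(0.9, 1.1)),          # affine scaling of roughly -10% to +10%
    T.RandomPerspective(distortion_scale=0.1, p=0.5),     # projection-transform-style distortion
])
rotate_scale = T.RandomAffine(degrees=15, scale=(0.8, 1.2))   # rotation +/-15 deg, scale 0.8-1.2
augment = T.Compose([affine_projective, rotate_scale, T.Resize((224, 224)), T.ToTensor()])
# Usage: tensor = augment(pil_frame). For video clips, the same random parameters should in
# practice be reused across all frames of a clip to keep the sequence consistent.
```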

4.2. Experimental Settings

Our experiments were implemented with PyTorch and OpenCV. They were conducted on a Windows 10 computer with an Intel i7-12700 CPU, 64 GB of memory, and two Nvidia GeForce RTX 3090 GPUs, each with 24 GB of memory. The Adam optimizer was used for backpropagation, with an initial learning rate of 0.0005 that was decreased to 0.0002 after 100 epochs. We used the cosine annealing method to adjust the learning rate, which follows the shape of the cosine function: within each period, the value first decreases slowly, then drops more quickly, and this cycle repeats. After several rounds of training, the model has gradually fitted the dataset; at that point, the learning rate must be reduced so that the model can learn stably and converge towards the global optimum. The learning rate can be expressed as follows:
$$\eta_t = \eta_{\min}^{i} + \frac{1}{2}\big(\eta_{\max}^{i} - \eta_{\min}^{i}\big)\left(1 + \cos\!\left(\frac{T_{cur}}{T_i}\pi\right)\right).$$
The above equation gives the learning rate, where $i$ indexes the annealing cycle, $\eta_{\max}^{i}$ and $\eta_{\min}^{i}$ denote the maximum and minimum learning rates, respectively, $T_{cur}$ is the number of epochs executed so far within the current cycle, and $T_i$ is the total number of epochs in the $i$th cycle. Since the number of frames per gesture video sequence in the Jester dataset is not uniform, with each sequence containing 27 or more frames, a uniform sampling of 30 frames was adopted for the Jester dataset; if an original sequence contained fewer than 30 frames, its middle frames were duplicated to reach 30 frames. Similarly, 35 frames were sampled for the NVGesture dataset. Finally, all original images were resized to $224 \times 224$ as inputs.
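A minimal sketch of these training and sampling choices is given below; the dummy model and the cycle length passed to the scheduler are illustrative assumptions, not the authors' exact settings.

```python
import torch

# Adam at the stated initial rate with a cosine-annealed schedule bottoming out at 2e-4.
model = torch.nn.Linear(10, 27)                     # stand-in for the gesture network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=2e-4)
# In the training loop: optimizer.step() per batch, scheduler.step() once per epoch.

def sample_frames(frames, target=30):
    """Uniformly sample `target` frames; if the clip is shorter, duplicate middle frames."""
    n = len(frames)
    if n >= target:
        idx = [round(i * (n - 1) / (target - 1)) for i in range(target)]
    else:
        mid, pad = n // 2, target - n
        idx = list(range(mid)) + [mid] * pad + list(range(mid, n))
    return [frames[i] for i in idx]

print(len(sample_frames(list(range(27)))))          # 30 frames from a 27-frame clip
```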

4.3. Results on Jester Dataset

To validate the effectiveness of the proposed method, the method was compared with the current state-of-the-art methods on both the Jester dataset and the NVGesture dataset. The experimental results on the Jester dataset are shown in Table 2. The results show that the proposed method improved the accuracy from 96.60% of the current state-of-the-art method DRX3D [14] to 96.72%, with an increase of 0.12 percentage points, demonstrating the feasibility of the proposed method. In addition, the proposed method obtained significant improvements in terms of parameter and computational complexity, as well as accuracy, compared to the SlowFast method [28]. Specifically, the proposed method achieved a 9.1-percentage-point improvement in accuracy while reducing parameters by 38.9% and FLOPs by 56.2%. The main reason for this performance is that the proposed method uses simple convolution operations to implement spatial and temporal attention operations while using MetaFormer as the underlying architecture. This feature allows the proposed method to extract spatial and temporal features of gestures using convolutional attention with relatively low parameter and computational complexity while maintaining high accuracy to represent differences between gesture actions.

4.4. Results on NVGesture Dataset

To validate the effectiveness of the proposed method for multimodal data, experiments were conducted on the NVGesture multimodal dynamic gesture dataset. The results are shown in Table 3, and the confusion matrix of the experimental results on the NVGesture dataset is shown in Figure 6. NVGesture is a multimodal dynamic gesture dataset that includes RGB images, optical flow images, and depth images. Optical flow images can effectively represent the instantaneous movement speed of gesture actions, which helps in analyzing the temporal characteristics of actions performed at different speeds. From the optical flow images, both the structural spatial topology information containing spatial characteristics, such as the movement distance and angle of gesture actions, and the physical motion information containing temporal characteristics, such as the movement direction and speed of gesture actions, can be effectively analyzed. Therefore, this paper selected the conventional combination of RGB images and optical flow images as the input of the model. The experimental results show that the proposed method improved the accuracy from 87.9% to 92.16%, which is 4.26 percentage points higher than PointLSTM [7], the state-of-the-art method using point clouds as input data.

4.5. Results on Self-Built Dataset

To validate the recognition ability of the proposed method in different experimental environments, experiments were conducted on a self-built dataset. The confusion matrix of the experimental results is shown in Figure 7. The experimental results show that the proposed method achieved an average recognition accuracy of 90.71% for gesture actions in outdoor environments. As the self-built dataset was constructed under different noise conditions such as low light, strong light backgrounds, and tilted shooting angles, this indicates that the proposed method has good robustness.

4.6. Ablation Study

To validate the generalization ability, robustness, and performance of the proposed method, ablation experiments were conducted on the Jester dataset. The comparative results are shown in Table 4, with the PoolFormer model selected as the baseline. In the baseline, the token mixer of each stage is the simplest pooling layer with no learnable parameters, and, to ensure a fair comparison, the block layers of the four stages of PoolFormer are set to 3-3-6-3. The average recognition accuracy of the PoolFormer baseline was 77.50%, which is 19.22 percentage points lower than that of the method proposed in this paper. When only the SAFB based on the spatial attention mechanism was used in the last three stages, the average recognition accuracy was 88.82%, which is 7.90 percentage points lower than that of the proposed method. When only the TAFB based on the temporal attention mechanism was used in the last three stages, the average recognition accuracy was 86.16%, which is 10.56 percentage points lower than that of the proposed method. These two configurations have lower average recognition accuracy than the proposed method because each considers only one of the spatial or temporal features and lacks the ability to extract the other. In addition, a configuration that uses the SAFB based on the spatial attention mechanism only in the second stage and the TAFB based on the temporal attention mechanism in the last two stages was evaluated. Its average recognition accuracy was 91.47%, which is 5.25 percentage points lower than that of the proposed method, owing to its inadequate extraction of spatial feature information. The experimental results confirm the effectiveness of the proposed method.
To validate the influence of the proposed method's depth on the recognition accuracy, we conducted an ablation experiment on the Jester dataset. The experimental results are shown in Table 5. When all stages of the model were set to three layers, the average recognition accuracy was 89.82%, which is 6.90 percentage points lower than that of the proposed method; the reason is the insufficient extraction of spatial attention features in the third stage. When the number of layers in the last three stages was set to six, the average recognition accuracy of the model was 95.17%, which is 1.55 percentage points lower than that of the proposed method. When only the second and third stages, which extract spatial features, were increased to six layers, the model's average recognition accuracy was 95.56%. These two sets of experimental results indicate that further increasing the model's depth decreases the average recognition accuracy, because an excessively large number of layers can cause overfitting.
In order to verify the effectiveness of the spatial attention branch and the temporal attention branch in improving the model, a set of ablation experiments was designed; the results are shown in Table 6. The first experiment removed the spatial attention branch, i.e., the branch of the Spatial Attention Convolution Block (SACB) that generates the spatial attention weight matrix, so that spatial features are no longer selected by attention. This method's average recognition accuracy was 89.82%, which is 12.32 percentage points higher than that of the baseline model, PoolFormer, and 6.90 percentage points lower than that of the method proposed in this paper. The second experiment removed the temporal attention branch, i.e., the branch of the Temporal Attention Convolution Block (TACB) that generates the temporal attention weights. This method's average recognition accuracy was 91.22%, which is 13.72 percentage points higher than that of the baseline model, PoolFormer, and 5.50 percentage points lower than that of the method proposed in this paper. The third experiment removed both the spatial and temporal attention branches, resulting in an average recognition accuracy of 82.31%, which is 4.81 percentage points higher than that of the baseline model, PoolFormer, and 14.41 percentage points lower than that of the method proposed in this paper. The experimental results demonstrate that the spatial attention branch and the temporal attention branch bring a substantial improvement in recognition accuracy for dynamic gesture recognition.

5. Discussion

Real-time gesture recognition plays a key role in many fields, especially in human–computer interaction (HCI) systems, where it can bring great convenience to human life. Therefore, exploring real-time and accurate deep learning network models for dynamic gesture recognition has practical significance. In this paper, we proposed a lightweight method based on a spatio-temporal attention mechanism that fuses a CNN with a Transformer-style model. We evaluated our method on the Jester and NVGesture datasets using different evaluation criteria, including accuracy, F1 score, sensitivity, and specificity. Since we mainly used convolution blocks fused with the attention mechanism to extract the spatio-temporal features of gestures, and considered the spatial and temporal features of gestures jointly and globally, we achieved high recognition accuracy on the Jester and NVGesture datasets with the help of the Transformer architecture. The experimental results show that our method achieved high recognition accuracy while meeting the real-time requirements of interactive systems. We compared our proposed method with other methods, as shown in Table 7. Our proposed method has certain advantages in recognition accuracy compared with other methods, and our network model has fewer parameters, making it a lightweight network model. Therefore, it can be applied to embedded devices with limited computing resources, such as smart cars and smart assembly lines. At the same time, our method accepts multimodal data as input, which makes it more versatile. However, our method has a low recognition rate for gesture actions in extreme in-the-wild environments, especially in cases of multi-hand interaction. This is because the gesture action model is more complex when multiple hands interact, and a larger network needs to be designed.
In addition, we are committed to applying the proposed method in specific application scenarios where computing resources are scarce. To improve the robustness of the model, we collected a dataset under various extreme conditions, captured by a computer camera in different environments, such as low lighting, excessive lighting, and tilted shooting angles, to train the network model. Furthermore, we used various methods such as affine transformation, projection transformation, skewing, and scaling to augment the datasets and improve the model's adaptability.

6. Conclusions

This paper presents a new deep neural network model based on an attention mechanism, which is a variation of the Transformer model, for dynamic gesture recognition. The proposed model builds upon the MetaFormer model. The input video sequence is first processed by the PoolFormer block, which performs fast down-sampling feature extraction for each frame. The Spatial Attention Former Block (SAFB) is then applied to each frame’s gesture image to extract spatial attention features based on the spatial attention mechanism. The spatial attention features from each frame are then fused in the channel direction, incorporating the temporal features of the dynamic gesture into the channel dimension. The Temporal Attention Former Block (TAFB) based on the channel attention mechanism is then applied to extract temporal attention features for gesture action classification. The proposed method was evaluated on the Jester and NVGesture public datasets, as well as a self-built dataset in a complex environment. The experimental results show that the proposed method outperformed the current state-of-the-art methods and exhibited good robustness for data with complex backgrounds, such as low-light conditions, strong light backgrounds, and tilted shooting angles.
In future work, we will explore more robust models for dynamic gesture recognition tasks, especially in outdoor conditions and multi-hand interaction environments. We hope to propose a universally applicable method with high recognition accuracy and robustness in daily life. This work will be driven by an in-the-wild gesture dataset collected in various environments; we will build a large-scale dynamic gesture dataset covering different environments and design an unsupervised network model to apply to it.

Author Contributions

Conceptualization, F.W. and Y.Z.; methodology, F.W. and Y.Z.; software, F.W. and Y.Z.; validation, F.W.; formal analysis, F.W.; data curation, F.W.; writing—original draft preparation, F.W.; writing—review and editing, F.W.; visualization, F.W.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by The National Natural Science Foundation of China (61262043), The Science and Technology Program of Yunnan Province (2011FZ029), and The open research fund from Yunnan Provincial Key Lab (2020106).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kowdiki, M.; Khaparde, A. Adaptive hough transform with optimized deep learning followed by dynamic time warping for hand gesture recognition. Multimed. Tools Appl. 2022, 81, 2095–2126. [Google Scholar] [CrossRef]
  2. Oudah, M.; Al-Naji, A.; Chahl, J. Hand Gesture Recognition Based on Computer Vision: A Review of Techniques. J. Imaging 2020, 6, 73. [Google Scholar] [CrossRef]
  3. Kim, Y.; Baek, H. Preprocessing for Keypoint-Based Sign Language Translation without Glosses. Sensors 2023, 23, 3231. [Google Scholar] [CrossRef] [PubMed]
  4. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  5. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  6. Molchanov, P.; Yang, X.; Gupta, S.; Kim, K.; Tyree, S.; Kautz, J. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4207–4215. [Google Scholar]
  7. Min, Y.; Zhang, Y.; Chai, X.; Chen, X. An efficient pointlstm for point clouds based gesture recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  8. Abavisani, M.; VaeziJoze, H.R.; Patel, V.M. Improving the performance of unimodal dynamic hand gesture recognition with multimodal training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1165–1174. [Google Scholar]
  9. Joze, H.R.V.; Shaban, A.; Iuzzolino, M.L.; Koishida, K. MMTM: Multimodal transfer module for CNN fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  10. Zhang, Y.; Cao, C.; Cheng, J.; Lu, H. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Trans. Multimed. (T-MM) 2018, 20, 1038–1050. [Google Scholar] [CrossRef]
  11. Cao, C.; Zhang, Y.; Wu, Y.; Lu, H.; Cheng, J. Egocentric Gesture Recognition Using Recurrent 3D Convolutional Neural Networks with Spatio-temporal Transformer Modules. In Proceedings of the IEEE International Conference On Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  12. Köpüklü, O.; Gunduz, A.; Kose, N.; Rigoll, G. Real-time hand gesture detection and classification using convolutional neural networks. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019. [Google Scholar]
  13. Materzynska, J.; Berger, G.; Bax, I.; Memisevic, R. The Jester dataset: A large-scale video dataset of human gestures. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 2874–2882. [Google Scholar]
  14. Kopuklu, O.; Kose, N.; Rigoll, G. Motion fused frames: Data level fusion strategy for hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  15. Zhou, B.; Andonian, A.; Oliva, A.; Torralba, A. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  16. Sincan, O.M.; Keles, H.Y. AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods. IEEE Access 2020, 8, 181340–181355. [Google Scholar] [CrossRef]
  17. Ryumin, D.; Ivanko, D.; Ryumina, E. Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices. Sensors 2023, 23, 2284. [Google Scholar] [CrossRef] [PubMed]
  18. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Skeleton aware multi-modal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  19. Novopoltsev, M.; Verkhovtsev, L.; Murtazin, R.; Milevich, D.; Zemtsova, I. Fine-tuning of sign language recognition models: A technical report. arXiv 2023, arXiv:2302.07693. [Google Scholar]
  20. Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  21. Khaleghi, L.; Sepas-Moghaddam, A.; Marshall, J.; Etemad, A. Multi-view video-based 3D hand pose estimation. IEEE Trans. Artif. Intell. 2022, 1–14. [Google Scholar] [CrossRef]
  22. Khaleghi, L.; Marshall, J.; Etemad, A. Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022. [Google Scholar]
  23. Liu, J.; Wang, Y.; Tian, M. Dynamic Gesture Recognition Network based on Multi-scale spatio-temporal feature Fusion. J. Electron. Inf. Technol. 2022, 44, 1–9. [Google Scholar] [CrossRef]
  24. Chen, X.; She, Q.; Zhang, B.; Ma, Y.; Zhang, J. Based on attention to guide the airspace image convolution SRU dynamic gesture recognition. Control Decis. 2023, 1–9. [Google Scholar] [CrossRef]
  25. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  26. Howard Andrew, G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  28. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  29. Tran, D.; Ray, J.; Shou, Z.; Chang, S.-F.; Paluri, M. Convnet architecture search for spatiotemporal feature learning. arXiv 2017, arXiv:1708.05038. [Google Scholar]
  30. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  31. Zhang, W.; Wang, J.; Lan, F. Dynamic hand gesture recognition based on short-term sampling neural networks. IEEE/CAA J. Autom. Sin. 2020, 8, 110–120. [Google Scholar] [CrossRef]
  32. Sharir, G.; Asaf, N.; Lihi, Z.-M. An image is worth 16x16 words, what is a video worth? arXiv 2021, arXiv:2103.13915. [Google Scholar]
  33. Zhang, C.; Zou, Y.; Chen, G.; Gan, L. Pan: Towards fast action recognition via learning persistence of appearance. arXiv 2020, arXiv:2008.03462. [Google Scholar]
  34. Yang, X.; Molchanov, P.; Kautz, J. Making convolutional networks recurrent for visual sequence learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  35. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 1–14. [Google Scholar] [CrossRef]
  36. Min, Y.; Chai, X.; Zhao, L.; Chen, X. FlickerNet: Adaptive 3D Gesture Recognition from Sparse Point Clouds. BMVC 2019, 2, 1–13. [Google Scholar]
Figure 1. Framework of the proposed method.
Figure 2. Comparison of MetaFormer block, Spatial Attention Former Block, and Temporal Attention Former Block. (a) MetaFormer Block. (b) Spatial Attention Former Block (SAFB). (c) Temporal Attention Former Block (TAFB).
Figure 3. Spatial Attention Convolution Block (SACB).
Figure 4. Temporal Attention Convolution Block (TACB).
Figure 5. Illustration of the self-built dataset. (The first row shows low-light conditions, the second row high-light conditions, and the third row tilted shooting angles. Each column represents one hand gesture category.)
Figure 6. Confusion matrix of results on the NVGesture dataset.
Figure 7. Confusion matrix of results on the self-built dataset.
Table 1. Datasets for dynamic gesture recognition.

Dataset | Release Year | Characteristics
NVGesture [6] | 2016 | Front-facing cameras, gesture interaction in driving scenarios
EgoGesture [10] | 2017 | Egocentric
Jester [13] | 2019 | Front-facing cameras, hand gesture interaction
AUTSL [16] | 2020 | Sign language, multimodal
WLASL [20] | 2020 | Sign language
MuViHand [21] | 2021 | 3D synthetic gesture
Table 2. Comparison of the proposed method on the Jester dataset.

Method | FLOPs (G) | Parameters (M) | Accuracy (%) | F1-Score (%) | Sensitivity (%) | Specificity (%)
R3D-34 [29] | 25.420 | 63.791 | 85.48 | - | - | -
R(2 + 1)D-34 [30] | 25.813 | 63.748 | 84.75 | 91.20 | 86.30 | 89.21
STSNN [31] | - | - | 95.73 | 96.33 | 87.20 | 91.36
STAM [32] | 137.464 | 104.807 | 88.94 | 90.02 | 90.63 | 92.55
PAN [33] | 9.128 | 23.866 | 90.74 | 94.56 | 88.96 | 91.04
DRX3D [14] | - | - | 96.60 | 95.21 | 91.23 | 93.33
SlowFast [28] | 13.176 | 35.17 | 87.62 | 90.04 | 94.44 | 96.22
CvT-MTAM [23] | 11.544 | 26.979 | 92.26 | 96.66 | 89.30 | 90.64
Ours | 5.766 | 21.485 | 96.72 | 97.2 | 94.38 | 96.76
Table 3. Comparison of the proposed method on the NVGesture dataset.

Method | Mode | Accuracy (%) | F1-Score (%) | Sensitivity (%) | Specificity (%)
R3DCNN [6] | IR image | 63.5 | 73.35 | 75.62 | 73.22
R3DCNN [6] | Optical flow | 77.8 | 78.42 | 79.53 | 80.63
R3DCNN [6] | Depth video | 80.3 | 83.12 | 85.14 | 84.53
PreRNN [34] | Depth video | 84.4 | 85.66 | 85.97 | 86.57
MTUT [8] | Depth video | 84.9 | 85.48 | 88.52 | 89.91
PointNet++ [35] | Point clouds | 63.9 | 71.53 | 74.31 | 72.11
FlickerNet [36] | Point clouds | 86.3 | 88.56 | 89.04 | 90.10
PointLSTM [7] | Point clouds | 87.9 | 89.97 | 90.65 | 91.18
Human [6] | RGB video | 88.4 | 90.45 | 91.56 | 92.56
MMTM [9] | RGB video + optical flow | 84.85 | 88.61 | 88.09 | 90.28
STSNN [31] | RGB video + optical flow | 85.13 | 91.23 | 92.61 | 93.61
Ours | RGB video + optical flow | 92.16 | 92.49 | 92.55 | 93.94
Table 4. Ablation results of the proposed method.

Method | Accuracy (%)
PoolFormer [5] | 77.50
3PoolFB + 3SAFB + 6SAFB + 3SAFB | 88.82
3PoolFB + 3SAFB + 6SAFB + 3TAFB | 96.72
3PoolFB + 3SAFB + 6TAFB + 3TAFB | 91.47
3PoolFB + 3TAFB + 6TAFB + 3TAFB | 86.16
Table 5. Ablation results of the number of blocks of the proposed method.

Method | Accuracy (%)
PoolFormer [5] | 77.50
3PoolFB + 3SAFB + 3SAFB + 3TAFB | 89.82
3PoolFB + 3SAFB + 6SAFB + 3TAFB | 96.72
3PoolFB + 6SAFB + 6SAFB + 6TAFB | 95.17
3PoolFB + 6SAFB + 6SAFB + 3TAFB | 95.56
Table 6. Ablation results of spatial attention and temporal attention.

Method | Accuracy (%)
PoolFormer [5] | 77.50
Remove Spatial Attention | 89.82
Remove Temporal Attention | 91.22
Remove Spatial and Temporal Attention | 82.31
Ours | 96.72
Table 7. Comparison of our method with other methods.

Method | Characteristics
MTUT [8] | Unimodal, multimodal training, unimodal data, 3D convolution
GRX3D [14] | Motion Fused Frames, data-level fusion, convolutional neural network
SlowFast [28] | SlowFast network, two streams, 3D convolution
PointLSTM [7] | Point cloud, LSTM, CNN, unimodal data
Ours | Fusion of CNN with Transformer, multimodal data, lightweight, attention mechanism