*Article* **3D Skeletal Joints-Based Hand Gesture Spotting and Classification**

**Ngoc-Hoang Nguyen, Tran-Dac-Thinh Phan, Soo-Hyung Kim, Hyung-Jeong Yang and Guee-Sang Lee \***

Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Korea; hoangnguyenkcv@gmail.com (N.-H.N.); phantrandacthinh2382@gmail.com (T.-D.-T.P.); shkim@jnu.ac.kr (S.-H.K.); hjyang@jnu.ac.kr (H.-J.Y.)

**\*** Correspondence: gslee@jnu.ac.kr

**Abstract:** This paper presents a novel approach to continuous dynamic hand gesture recognition. Our approach contains two main modules: gesture spotting and gesture classification. First, the gesture spotting module pre-segments the video sequence of continuous gestures into isolated gestures. Second, the gesture classification module identifies the segmented gestures. In the gesture spotting module, the motion of the hand palm and fingers is fed into a Bidirectional Long Short-Term Memory (Bi-LSTM) network for gesture spotting. In the gesture classification module, three residual 3D Convolutional Neural Networks based on ResNet architectures (3D\_ResNet) and one Long Short-Term Memory (LSTM) network are combined to efficiently exploit multiple data channels: RGB, Optical Flow, Depth, and the 3D positions of key joints. The promising performance of our approach is demonstrated through experiments conducted on three public datasets—the Chalearn LAP ConGD dataset, 20BN-Jester, and the NVIDIA Dynamic Hand Gesture dataset. Our approach outperforms state-of-the-art methods on the Chalearn LAP ConGD dataset.

**Keywords:** continuous hand gesture recognition; gesture spotting; gesture classification; multi-modal features; 3D skeletal; CNN

#### **1. Introduction**

Nowadays, dynamic hand gesture recognition plays a crucial role in vision-based applications for human-computer interaction, telecommunications, and robotics, owing to its convenience and naturalness. Many approaches to isolated hand gesture recognition have succeeded with the recent development of neural networks, but in real-world systems, continuous dynamic hand gesture recognition remains a challenge due to the diversity and complexity of gesture sequences.

Initially, most continuous hand gesture recognition approaches were based on traditional methods such as Conditional Random Fields (CRF) [1], Hidden Markov Models (HMM), Dynamic Time Warping (DTW), and Bézier curves [2]. Recently, deep learning methods based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [3–7] have gained popularity.

The majority of continuous dynamic hand gesture recognition methods [3–6] consist of two separate procedures: gesture spotting and gesture classification. These methods exploit spatial and temporal features to improve performance mainly in gesture classification.

However, gesture spotting performance is limited by the inherent variability in gesture duration. In existing methods, gestures are usually spotted by detecting the transitional frames between two gestures. Recently, one approach [7] performed gesture spotting and gesture classification simultaneously, but it turned out to be suitable only for weakly segmented videos.

Most recent studies [8–11] focus on improving the performance of the gesture classification phase, while the gesture spotting phase is often neglected on the
assumption that isolated, pre-segmented gesture sequences are available as input to the gesture classification.

However, in real-world systems, gesture spotting plays a crucial role in the whole recognition process and hence greatly affects the final recognition performance. In [3], the authors segmented the videos into sets of images and used them to predict a fusion score, i.e., they performed gesture spotting and gesture classification simultaneously. The authors in [5] utilized connectionist temporal classification to detect the nucleus of the gesture and a no-gesture class to assist gesture classification without requiring explicit pre-segmentation. In [4,6], continuous gestures are spotted into isolated gestures based on the assumption that the hands are always put down at the end of each gesture, which turns out to be inconvenient. It does not work well in all situations, such as the "zoom in" and "zoom out" gestures, i.e., when only the fingers move while the hand stands still.

In this paper, we propose a spotting-classification algorithm for continuous dynamic hand gestures in which we separate the two tasks, as in [4,6], while avoiding the existing problems of those methods. In the spotting module, as shown in Figure 1, the continuous gestures from the unsegmented and unbounded input stream are first segmented into individual isolated gestures based on 3D key joints extracted from each frame by 3D human pose and hand pose extraction algorithms. The time series of 3D key poses is fed into a Bidirectional Long Short-Term Memory (Bi-LSTM) network with connectionist temporal classification (CTC) [12] for gesture spotting.

**Figure 1.** Gesture Spotting-Classification Module.

The isolated gestures segmented by the gesture spotting module are classified in the gesture classification module with a multi-modal M-3D network. As indicated in Figure 1, the M-3D network is built by combining multi-modal data inputs comprising RGB, Optical Flow, Depth, and 3D pose information channels. Three residual 3D Convolutional Neural Networks based on ResNet architectures (3D\_ResNet) [13], one for each of the RGB, Optical Flow, and Depth channels, and an LSTM network for the 3D pose channel are effectively combined through a fusion layer for gesture classification.

A preliminary version of this paper appeared in [14]. In this paper, depth information is considered together with the 3D skeleton joint information, and extensive experiments show that this results in improved performance.

The remainder of this paper is organized as follows. In Section 2, we review the related works. The proposed continuous dynamic hand gesture recognition algorithm is discussed in detail in Section 3. In Section 4, the experiments conducted with the proposed algorithm on three public datasets—the Chalearn LAP ConGD dataset, 20BN-Jester, and the NVIDIA Dynamic Hand Gesture dataset—are presented and discussed. Finally, we conclude the paper in Section 5.

#### **2. Related Works**

In general, the continuous dynamic gesture recognition task is more complicated than the isolated gesture recognition task: the sequence of gestures from an unsegmented and unbounded input stream must be separated into complete individual gestures, a step called gesture spotting or gesture segmentation, before classification. The majority of recent researchers solve the continuous dynamic gesture recognition task with two separate processes—gesture spotting and gesture recognition [1,4–6].

In the early years, approaches for gesture spotting were commonly based on traditional machine learning techniques for time series problems, such as Conditional Random Fields (CRF) [1], Hidden Markov Models (HMM) [2], and Dynamic Time Warping (DTW) [3]. Yang et al. [1] presented a CRF threshold model that recognized gestures based on the system vocabulary for labeling sequence data. Similarly, Lee et al. [2] proposed an HMM-based method that recognized gestures by likelihood threshold estimation of the input pattern. Celebi et al. [3] proposed a template matching algorithm, the weighted DTW method, which used the time sequence of weighted joint positions obtained from a Kinect sensor to compute the similarity of two sequences. Krishnan et al. [15] presented a method using the Adaptive Boosting algorithm with a threshold model for gesture spotting from continuous accelerometer data and an HMM for gesture classification. The limitations of these methods are that the model parameters are chosen empirically and that the algorithms are sensitive to noise. More recently, with the success of deep learning in computer vision, deep learning approaches have been applied to hand gesture recognition and achieve impressive performance compared to traditional methods.

The majority of methods using recurrent neural networks (RNN) [16–18] or CNNs [8,10,19–21] focus only on isolated gesture recognition, ignoring the gesture spotting phase. After the Chalearn LAP ConGD dataset for continuous gesture spotting was released, a number of methods were proposed to solve both gesture spotting and gesture recognition [3,4,6]. Naguri et al. [6] fed 3D motion data from infrared sensors into an algorithm based on CNN and LSTM to distinguish gestures; in this method, gestures are segmented by detecting transition frames between two isolated gestures. Similarly, Wang et al. [3] utilized transition frame detection with a two-stream CNN to spot gestures. In another approach, proposed by Chai et al. [4], continuous gestures were spotted based on the hand position detected by Faster R-CNN, and each isolated gesture was classified by two parallel recurrent neural networks (SRNN) with RGB-D data input. A multi-modal network combining a Gaussian-Bernoulli Deep Belief Network (DBN) with skeleton data input and a 3DCNN model with RGB-D data was effectively utilized for gesture classification by Di et al. [7]. Tran et al. [22] presented a CNN-based method using a Kinect camera for spotting and classification of hand gestures. However, the gesture spotting was done manually from a pre-specified hand shape or fingertip pattern, and the classification used only a basic 3DCNN network without an LSTM network. Because the system depends on the Kinect sensor, comparison on commonly used public datasets is almost impossible.

Recently, Molchanov et al. [5] proposed a method for joint gesture spotting and gesture recognition with zero or negative lag using a recurrent three-dimensional convolutional neural network (R3DCNN). This network is highly effective in recognizing weakly segmented gestures from multi-modal data.

In this paper, we propose an effective algorithm for both spotting and classification tasks by utilizing extracted 3D human and hand skeletal features.

#### **3. Proposed Algorithm**

In this section, we describe the proposed method, which consists of two main modules: gesture spotting and gesture classification. Over all frames of a continuous gesture video, the hand and finger speeds estimated from the extracted 3D human pose and 3D hand pose are used to segment the continuous gestures. Each isolated gesture segmented by the gesture spotting module is then classified by the proposed M-3D network using RGB, Optical Flow, Depth, and 3D key joint information.

#### *3.1. Gesture Spotting*

The gesture spotting module is shown on the left of Figure 1. All frames of the continuous gesture sequence are used to extract the 3D human pose with the algorithm proposed in [23]. From the RGB hand ROI localized around the 3D hand palm position *Jh(x,y,z)*, when the hand palm is stationary and above the spine base joint, a 3D hand pose estimation algorithm extracts the 3D positions of the finger joints. From the extracted 3D human pose, the hand speed *vhand* is estimated from the distance moved by the hand joint between two consecutive frames.

• **3D human pose extraction:** From each RGB frame, we obtain a 3D human pose by using one of the state-of-the-art methods for 2D/3D human pose estimation in the wild, the pose-hgreg-3d network. This network was proposed by Zhou et al. [23], who provide a model pre-trained on the Human3.6M dataset [24], the largest dataset providing both 2D and 3D annotations of human poses in 3.6 million RGB images. The network is fast, simple, and accurate; it uses 3D geometric constraints for weakly-supervised learning of the 3D pose from 2D joint annotations extracted by a state-of-the-art 2D pose estimation method, the stacked hourglass network of Newell et al. [25]. In our approach, we use this 3D human pose estimation network to extract accurate 3D hand joint information, which is utilized in both the gesture spotting and gesture recognition tasks.

Let *Jh*(*xhk, yhk, zhk*) and *Jh*(*xhk*−1, *yhk*−1, *zhk*−1) be the 3D positions of the hand joint at the *k*th and (*k* − 1)th frames, respectively. The hand speed is estimated as

$$
v_{hand} = \alpha \cdot \sqrt{(x_{hk} - x_{hk-1})^2 + (y_{hk} - y_{hk-1})^2 + (z_{hk} - z_{hk-1})^2} \tag{1}
$$

where *α* is the frame rate.

The finger speed is estimated from the change in distance between the 3D positions of the fingertips of the thumb and the index finger across consecutive frames. Let *Jft*(*xftk, yftk, zftk*) and *Jfi*(*xink, yink, zink*) denote the 3D positions of the thumb fingertip and the index fingertip at the *k*th frame, respectively. The distance between the two fingertips at the *k*th frame is given as

$$d_{fk} = \sqrt{(x_{ftk} - x_{ink})^2 + (y_{ftk} - y_{ink})^2 + (z_{ftk} - z_{ink})^2} \tag{2}$$

where *dfk* and *dfk*−1 represent the distances at the *k*th frame and the previous frame, respectively. The finger speed *vfinger* is estimated as

$$v_{finger} = \alpha \cdot \left(d_{fk} - d_{fk-1}\right) \tag{3}$$

The speed feature of each frame combines *vhand* and *vfinger*:

$$
v_k = v_{hand} + v_{finger} \tag{4}
$$

and is used as the input of the Bi-LSTM network to spot gestures from the video stream, as shown in Figure 2. In our network, the connectionist temporal classification (CTC) loss [12] is used to identify whether a frame in the sequence is a gesture frame or a transition frame.
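To make Equations (1)–(4) concrete, the per-frame speed feature can be sketched as below. The joint arrays and the frame rate value are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def speed_feature(hand, thumb_tip, index_tip, fps=30.0):
    """Per-frame speed feature v_k of Equations (1)-(4).

    hand, thumb_tip, index_tip: (T, 3) arrays of 3D joint positions over
    T frames; fps is the frame rate (alpha in Equations (1) and (3)).
    Returns a (T,) array; the first frame has no predecessor, so v_0 = 0.
    """
    # Equation (1): hand speed from consecutive hand-joint positions.
    v_hand = fps * np.linalg.norm(np.diff(hand, axis=0), axis=1)

    # Equation (2): thumb-tip / index-tip distance at every frame.
    d_f = np.linalg.norm(thumb_tip - index_tip, axis=1)

    # Equation (3): finger speed as the change of that distance.
    v_finger = fps * np.diff(d_f)

    # Equation (4): combined speed, padded so the output has length T.
    return np.concatenate(([0.0], v_hand + v_finger))
```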


Each LSTM unit of the network is updated as

$$
\begin{cases}
i_t = \sigma(W_i[x_t, h_{t-1}] + b_i) \\
f_t = \sigma(W_f[x_t, h_{t-1}] + b_f) \\
o_t = \sigma(W_o[x_t, h_{t-1}] + b_o) \\
\tilde{c}_t = \tanh(W_c[x_t, h_{t-1}] + b_c) \\
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t = \tanh(c_t) \odot o_t
\end{cases} \tag{5}
$$

where *i*, *f*, and *o* are the input, forget, and output gate vectors, respectively; *c̃t* and *ct* are the candidate hidden state and the internal memory of the unit; *ht* is the output hidden state; *σ*(·) is the sigmoid function; ⊙ denotes element-wise multiplication; and **W** and **b** are the connection weight matrices and bias vectors, respectively.
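As a concrete reading of Equation (5), a single LSTM step can be written directly in NumPy; the weight and bias shapes below are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update following Equation (5).

    x_t: (d,) input; h_prev, c_prev: (m,) previous hidden and memory states.
    W: dict of (m, d + m) weight matrices and b: dict of (m,) biases,
    one entry per gate: 'i', 'f', 'o', 'c'.
    """
    z = np.concatenate([x_t, h_prev])       # [x_t, h_{t-1}]
    i_t = sigmoid(W['i'] @ z + b['i'])      # input gate
    f_t = sigmoid(W['f'] @ z + b['f'])      # forget gate
    o_t = sigmoid(W['o'] @ z + b['o'])      # output gate
    c_hat = np.tanh(W['c'] @ z + b['c'])    # candidate hidden state
    c_t = f_t * c_prev + i_t * c_hat        # internal memory
    h_t = np.tanh(c_t) * o_t                # output hidden state
    return h_t, c_t
```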

**Figure 2.** Gesture segmentation with Bi\_LSTM and CTC loss.

• **Bi-LSTM network**: While the output of a single forward LSTM network depends only on previous input features, the Bi-LSTM network is an effective method for sequence labeling tasks because it benefits from both previous and future input features. A Bi-LSTM can be considered a stack of two LSTM layers, in which a forward LSTM layer utilizes the previous input features while the backward LSTM layer captures the future input features. Because the Bi-LSTM network considers both previous and future input features, it can effectively classify each frame of the sequence as a gesture frame or a transition frame, and the prediction error is reduced compared to a single LSTM.

• **Connectionist temporal classification:** Connectionist temporal classification (CTC) is a loss function that is highly effective for sequential label prediction problems. The proposed algorithm uses CTC to detect whether the frames of a sequence are gesture frames or transition frames, taking as input the sequence of softmax layer outputs, as sketched below.
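The sketch assumes a PyTorch implementation with a two-label output (gesture, transition) plus the CTC blank; the layer sizes and toy tensors are placeholders rather than the trained spotting network.

```python
import torch
import torch.nn as nn

class SpottingBiLSTM(nn.Module):
    """Bi-LSTM that emits per-frame label scores for CTC-based spotting."""
    def __init__(self, in_dim=1, hidden=50, num_labels=2):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)
        # One extra output for the CTC blank symbol (index 0 by convention).
        self.head = nn.Linear(2 * hidden, num_labels + 1)

    def forward(self, x):                  # x: (batch, T, in_dim)
        h, _ = self.bilstm(x)
        return self.head(h)                # (batch, T, num_labels + 1)

# Toy usage: a batch of 4 speed sequences, 20 time steps each.
model = SpottingBiLSTM()
ctc = nn.CTCLoss(blank=0)
x = torch.randn(4, 20, 1)
log_probs = model(x).log_softmax(-1).permute(1, 0, 2)  # CTC wants (T, N, C)
targets = torch.tensor([1, 2, 1, 2, 1, 2, 1, 2])       # concatenated label strings
input_lengths = torch.full((4,), 20, dtype=torch.long)
target_lengths = torch.full((4,), 2, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```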

#### *3.2. Gesture Classification*

The isolated gestures segmented by the gesture spotting module are classified into individual gesture classes in the gesture recognition module. The proposed gesture recognition module is a multi-modal network called the M-3D network. This model is a multi-channel network with four data modalities, as shown on the right of Figure 1.

In our approach, from each frame of a video, we extract the optical flow and the 3D pose (hand joint, thumb tip, and index fingertip) information that form the multi-channel feature input to the model. Optical flow is computed from two adjacent frames. There are several existing methods for optical flow extraction, such as Farneback [28], MPEG flow [29], and Brox flow [30]. The quality of the motion information in the optical flow clearly affects the performance of the gesture recognition model. Therefore, the Brox flow technique is adopted in our approach, as it yields better motion quality than the other optical flow extraction techniques.
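Dense optical flow can be computed from each frame pair with off-the-shelf tools. The sketch below uses OpenCV's Farneback implementation (one of the alternatives cited above) as an easily reproducible stand-in, since Brox flow typically requires the CUDA-enabled OpenCV contrib modules; the video path is a placeholder.

```python
import cv2

def optical_flow_frames(video_path):
    """Yield a dense optical-flow field for every pair of consecutive frames.

    Farneback flow is used here only as a readily available stand-in; the
    paper itself adopts Brox flow for better motion quality.
    """
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        yield flow                         # (H, W, 2): horizontal/vertical flow
        prev_gray = gray
    cap.release()
```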

Although the key hand and finger joint positions are extracted by the 3D human pose and 3D hand pose extraction networks presented in Section 3.1, we focus only on the two most important joints, the thumb tip and the index fingertip, which are sufficient to describe all gesture types. Our gesture classification algorithm combines three 3D\_ResNet stream networks for the RGB, Optical Flow, and Depth channels with an LSTM network over the 3D key joint features.

• **Three-stream RGB, Optical Flow, and Depth 3D\_ResNet networks:** The 3D\_CNN framework is regarded as one of the best frameworks for spatiotemporal feature learning. The 3D\_ResNet network is an improved version of the residual 3D\_CNN framework based on the ResNet [31] architecture. The effectiveness of 3D\_ResNet has been demonstrated by its remarkable performance in action video classification.

The single 3D\_ResNet is described in Figure 3. The 3D\_ResNet consists of a 3D convolutional layer followed by a batch normalization layer and a rectified-linear unit layer. The RGB and Optical Flow stream models are pre-trained on the largest action video classification dataset, the Sports-1M dataset [32]. Input videos are resampled into 16-frame clips before being fed into the network. Let a resampled sequence of 16 RGB frames be *Vc =* {*xc*1, *xc*2, ... , *xc*16}, the Optical Flow frames be *Vof* = {*xof*1, *xof*2, ... , *xof*16}, the Depth frames be *Vd* = {*xd*1, *xd*2, ... , *xd*16}, and the operation functions of the 3D\_ResNet networks for the RGB, Optical Flow, and Depth modalities be Θc(.), Θof(.), and Θd(.), respectively. The prediction probabilities of the three single-stream networks are then

$$P_c\{p_1, p_2, \dots, p_{16} \mid V_c\} = \Theta_c(V_c) \tag{6}$$

$$P_{of}\{p_1, p_2, \dots, p_{16} \mid V_{of}\} = \Theta_{of}(V_{of}) \tag{7}$$

$$P_d\{p_1, p_2, \dots, p_{16} \mid V_d\} = \Theta_d(V_d) \tag{8}$$

where *pi* is the prediction probability of the video belonging to the *i*th class.
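Each stream in Equations (6)–(8) can be approximated with an off-the-shelf residual 3D CNN. The sketch below uses torchvision's `r3d_18` (an 18-layer 3D ResNet, torchvision ≥ 0.13 API) purely as an illustrative backbone; the exact 3D\_ResNet variant used in the paper, its Sports-1M weights, and the class count are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

def make_stream(num_classes, in_channels=3):
    """One 3D-ResNet stream: a 16-frame clip -> class probabilities."""
    net = r3d_18(weights=None)             # randomly initialised in this sketch
    if in_channels != 3:                   # e.g. 1-channel depth or 2-channel flow
        net.stem[0] = nn.Conv3d(in_channels, 64, kernel_size=(3, 7, 7),
                                stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

rgb_stream = make_stream(num_classes=249)  # e.g. the ChaLearn ConGD classes
clip = torch.randn(1, 3, 16, 112, 112)     # (batch, channels, frames, H, W)
p_c = rgb_stream(clip).softmax(-1)         # P_c of Equation (6)
```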

• **LSTM network with 3D pose information:** In dynamic gesture recognition, temporal information learning plays a critical role in model performance. In our approach, we exploit temporal features by tracking the trajectory of the hand palm together with the thumb tip and index fingertip joints. An LSTM network is used to learn these features for the gesture classification task; its parameters follow the approach of [33]. The input sequence of the LSTM network is defined as *Vj =* {*vj*1, *vj*2, ... , *vj*16}, where *vjk =* {*Jh*(*xhk, yhk, zhk*), *Jft*(*xftk, yftk, zftk*), *Jfi*(*xink, yink, zink*)} is a 9 × 1 vector containing the 3D positions of the key joints at the *k*th frame. The LSTM input for each of the 16 sampled frames of a gesture video is therefore a 1 × 9 tensor. The prediction probability output of the LSTM with input *Vj* is

$$P_L\{p_1, p_2, \dots, p_{16} \mid V_j\} = \Theta_L(V_j) \tag{9}$$

where Θ*L*(.) denotes the operation function of the LSTM network.
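A minimal sketch of the 3D-pose stream in Equation (9), assuming the LSTM size reported later in Section 4.2 (3 layers of 256 cells); it is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PoseLSTM(nn.Module):
    """LSTM stream over 16 frames of 9-D key-joint vectors (Equation (9))."""
    def __init__(self, num_classes, in_dim=9, hidden=256, layers=3):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, v_j):                # v_j: (batch, 16, 9)
        out, _ = self.lstm(v_j)
        return self.head(out[:, -1])       # classify from the last time step

pose_stream = PoseLSTM(num_classes=249)
v_j = torch.randn(1, 16, 9)                # hand, thumb-tip, index-tip joints per frame
p_l = pose_stream(v_j).softmax(-1)         # P_L of Equation (9)
```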

**Figure 3.** Overview of the 3D\_ResNet architecture. The figure shows the number of feature maps and the kernel size of the 3D convolutional layer (3D Conv), the batch normalization layer (Batch-Norm), and the rectified-linear unit layer (ReLU).

• **Multi-modality fusion:** The outputs of the different channel networks are fused in a final fusion layer to predict the gesture class. This is a fully connected layer whose number of output units equals the number of classes in the dataset. The output probability of each class is estimated by the trained fusion layer with operation function Θ*fusion*(.):

$$P\{p_1, p_2, \dots, p_{16} \mid V\} = \Theta_{fusion}\left( P_c\{p_1, \dots, p_{16} \mid V_c\},\; P_{of}\{p_1, \dots, p_{16} \mid V_{of}\},\; P_d\{p_1, \dots, p_{16} \mid V_d\},\; P_L\{p_1, \dots, p_{16} \mid V_j\} \right) \tag{10}$$

The performance of the gesture recognition task is improved by combining the temporal information learned by the LSTM network with the spatiotemporal features learned by the 3D\_ResNet networks, as demonstrated by the experimental results.
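One plausible reading of the fusion step in Equation (10) is sketched below: the per-stream class probabilities are concatenated and passed through a single fully connected layer. The stream count and class count are assumptions mirroring the text.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Fully connected fusion of the four stream outputs (Equation (10))."""
    def __init__(self, num_classes, num_streams=4):
        super().__init__()
        self.fc = nn.Linear(num_streams * num_classes, num_classes)

    def forward(self, p_c, p_of, p_d, p_l):
        fused = torch.cat([p_c, p_of, p_d, p_l], dim=-1)
        return self.fc(fused).softmax(-1)

fusion = FusionLayer(num_classes=249)
p_c = p_of = p_d = p_l = torch.rand(1, 249)   # placeholder stream probabilities
final_probs = fusion(p_c, p_of, p_d, p_l)     # fused class prediction
```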

#### **4. Experiments and Results**

In this section, we describe the experiments that evaluate the performance of the proposed approach on three public datasets: 20BN\_Jester dataset [34], NVIDIA Dynamic Hand Gesture dataset [5], and Chalearn LAP ConGD dataset [35].

#### *4.1. Datasets*


• **NVIDIA Dynamic Hand Gesture dataset:** each weakly segmented gesture video includes the preparation, nucleus, and transition frames of the gesture.

• **Chalearn LAP ConGD dataset:** a large dataset containing 47,933 gesture instances in 22,535 RGB-Depth videos for both the continuous gesture spotting and gesture recognition tasks. The dataset includes 249 gesture classes performed by 21 different individuals and is divided into three subsets: a training set (14,314 videos), a validation set (4179 videos), and a test set (4042 videos).

The summary of the three datasets is shown in Table 1.


**Table 1.** Summary of the three datasets used in the experiments.

#### *4.2. Training Process*

• **Network training for hand gesture spotting:** To train the Bi-LSTM network for the segmentation of continuous gestures, we first use a pre-trained 3D human pose extraction network (trained on the Human3.6M dataset) and a pre-trained 3D hand pose extraction network (trained on the Stereo Hand Pose Tracking dataset) to extract the 3D positions of the key poses. The quality of the human pose and hand pose extraction algorithms is illustrated in Figure 4. Using these extracted features as input, we train the Bi\_LSTM network with the gesture segmentation labels provided by the training set of the Chalearn LAP ConGD dataset.

**Figure 4.** (**a**) The 2D and 3D human pose estimation examples and (**b**) The 2D and 3D hand pose estimation examples.

The Bi-LSTM network is trained with the CTC loss to predict a sequence of binary output values that classify each frame as a gesture frame or a transition frame. In the Bi\_LSTM, the input layer has 20 time steps, the hidden layer has 50 memory units, and the last fully connected layer outputs one binary value per time step with a sigmoid activation function. The efficient ADAM optimization algorithm [36] is applied to find the optimal weights of the network. The spotting output of the Bi-LSTM network for a given speed input is shown in Figure 5.

• **Network training for hand gesture classification:** Each single-stream network (pre-trained on the Sports-1M dataset) is separately fine-tuned on the large Chalearn LAP ConGD dataset. The weights of each fine-tuned 3D stream network are learned with ADAM optimization and a learning rate starting at 0.0001 and halved every 10 epochs over 200 epochs, as sketched below. Ensemble modeling with five 3D\_ResNet models is applied to increase the classification accuracy. The LSTM network parameters are selected from the observations of experimental results; the optimal LSTM model has 3 memory blocks and 256 LSTM cells per memory block and is trained with a learning rate of 0.0001 for 1000 epochs. After pre-training each stream network, we retrain these networks on the specific dataset. Finally, we concatenate the prediction probability outputs of the trained models to train the weights of the final fully connected fusion layer for gesture classification. Besides the 3D\_ResNet framework, we also train with a plain 3D\_CNN framework to demonstrate the effectiveness of the proposed algorithm.
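The optimizer and learning-rate schedule described above (ADAM, initial rate 0.0001, halved every 10 epochs over 200 epochs) map directly onto standard PyTorch utilities; the snippet below is a configuration sketch with a stand-in model, not the authors' training script.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in for one of the single-stream networks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(200):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()       # halves the learning rate every 10 epochs
```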

**Figure 5.** Example of sequence frames segmentation by the Bi\_LSTM network. The blue line is the given speed input, and the red line is gesture spotting output (a value of 1.0 indicates the gesture frames).


#### *4.3. Experimental Results*

Table 3 shows that our gesture recognition module obtains positive results. The recognition performance is improved by using the 3D positions of the key joints. Moreover, the recognition performance of our method is among the top performers of existing approaches, with an accuracy of 95.6% on the 20BN-Jester dataset and an accuracy of 82.4% on the NVIDIA Hand Gesture dataset.

• **Continuous hand gesture spotting and classification:** To fully evaluate our approach on continuous dynamic hand gesture spotting and recognition, we apply the Jaccard index [3] to measure performance. For a given gesture video, the Jaccard index estimates the average relative overlap between the ground-truth and predicted sequences of frames. For a sequence S, the ground truth of the *i*th gesture class is given as a binary vector *Gs,i*, and the prediction for the *i*th class as a binary vector *Ps,i*; the entries with value 1 indicate the frames in which the *i*th gesture class is being performed. The Jaccard index for the sequence S is computed as

$$J_{s,i} = \frac{G_{s,i} \cap P_{s,i}}{G_{s,i} \cup P_{s,i}} \tag{11}$$

When *Gs,i* and *Ps,i* are both empty, the Jaccard index *Js,i* is set to 0. For a given sequence S containing *ls* true class labels out of L gesture classes, the Jaccard index is estimated as

$$J_s = \frac{1}{l_s} \sum_{i=1}^{L} J_{s,i} \tag{12}$$

For all n testing sequences, *s* = {*s*1, *s*2, ... , *sn*}, the mean Jaccard index *J̄s* is used for evaluation:

$$\overline{J_s} = \frac{1}{n} \sum_{j=1}^{n} J_{s_j} \tag{13}$$
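For reference, Equations (11)–(13) reduce to a few lines over binary frame-label matrices; this is a sketch of the metric under the conventions stated above, not the official ChaLearn evaluation code.

```python
import numpy as np

def jaccard_sequence(gt, pred):
    """Per-sequence Jaccard index, Equations (11)-(12).

    gt, pred: (L, T) binary arrays; row i marks the frames in which the
    i-th gesture class occurs in the ground truth / the prediction.
    """
    true_labels = np.flatnonzero(gt.any(axis=1))   # the l_s classes actually performed
    if true_labels.size == 0:
        return 0.0
    scores = []
    for i in true_labels:
        inter = np.logical_and(gt[i], pred[i]).sum()
        union = np.logical_or(gt[i], pred[i]).sum()
        scores.append(inter / union if union else 0.0)
    return float(np.sum(scores) / true_labels.size)  # (1 / l_s) * sum of J_{s,i}

def mean_jaccard(pairs):
    """Equation (13): average the per-sequence scores over all test videos."""
    return float(np.mean([jaccard_sequence(g, p) for g, p in pairs]))
```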

Table 4 compares the spotting-recognition performance of our proposed approach with existing methods on the test set of the Chalearn LAP ConGD dataset.

**Table 2.** Gestures spotting performance comparison with different methods on NVIDIA Hand Gesture dataset. Bold values are highest indices.


**Table 3.** Gesture classification performance comparison of different methods on the 20BN\_Jester dataset and NVIDIA Hand Gesture dataset. Bold values are highest indices.



**Table 4.** The spotting-recognition performance comparison of our proposed approach to existing methods on the test set of the Chalearn LAP ConGD dataset. Bold values are highest indices.

From the results illustrated in Table 4, the mean Jaccard index on the test set of the Chalearn LAP ConGD dataset shows that the proposed method achieves satisfactory performance. By using 3D key joint features and multiple data modalities, the recognition performance is significantly enhanced.

#### **5. Discussions**

In Section 4.3, we have shown the effectiveness of our method on the three datasets. For hand gesture spotting, we obtain the best results for both indices on the NVIDIA Dynamic Hand Gesture dataset and the Chalearn LAP ConGD dataset. The extraction of the human pose and hand pose helps us track the hand movement more accurately and detect the beginning and the end of each gesture, avoiding minor motions that could contaminate the subsequent classification task. For hand gesture classification, Table 3 shows the benefit of adding modalities to our model on both the 20BN\_Jester dataset and the NVIDIA Dynamic Hand Gesture dataset: different views of the data are crucial to classification performance. Continuous gesture classification is more difficult when several gestures appear in one video, which means that the quality of gesture spotting greatly influences classification performance. As shown in Table 4, we obtain the best results when performing both tasks on the Chalearn LAP ConGD dataset.

#### **6. Conclusions**

In this paper, we presented an effective approach to continuous dynamic hand gesture spotting and recognition from RGB input data. The continuous gesture sequences are first segmented into separate gestures by feeding the motion speed of key 3D poses into a Bi-LSTM network. Each segmented gesture is then classified in the gesture classification module using a multi-modal M-3D network, in which three 3D\_ResNet stream networks for the RGB, Optical Flow, and Depth data channels and an LSTM network for the 3D key pose feature channel are effectively combined. The results of the experiments conducted on the ChaLearn LAP ConGD, NVIDIA Hand Gesture, and 20BN-Jester datasets prove the effectiveness of the proposed method. In future work, we will include other modalities to further improve performance. The gesture spotting and classification tasks in this paper are performed separately in two steps; our upcoming plan is to perform both tasks with a single end-to-end model, making the approach more practical for real-world problems.

**Author Contributions:** Conceptualization, G.-S.L. and N.-H.N.; methodology, N.-H.N.; writing review and editing, N.-H.N., T.-D.-T.P., and G.-S.L.; supervision, G.-S.L., S.-H.K., and H.-J.Y.; project administration, G.-S.L., S.-H.K., and H.-J.Y.; funding acquisition, G.-S.L., S.-H.K., and H.-J.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2020R1A4A1019191) and by the Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (NRF-2019M3E5D1A02067961).

**Institutional Review Board Statement:** Not Applicable.

**Informed Consent Statement:** Not Applicable.

**Data Availability Statement:** Not Applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

