*Article* **Multi-Hand Gesture Recognition Using Automotive FMCW Radar Sensor**

**Yong Wang 1,\*, Di Wang 1, Yunhai Fu 2, Dengke Yao 1, Liangbo Xie <sup>1</sup> and Mu Zhou <sup>1</sup>**


**Abstract:** With the development of human–computer interaction (HCI), hand gestures are playing increasingly important roles in our daily lives. With hand gesture recognition (HGR), users can play virtual games together, control smart equipment, etc. This paper presents a multi-hand gesture recognition system using automotive frequency modulated continuous wave (FMCW) radar. Specifically, we first constructed the range-Doppler map (RDM) and range-angle map (RAM), and then suppressed the spectral leakage as well as the dynamic and static interferences. Since the received echo signals of multi-hand gestures are mixed together, we propose a spatiotemporal path selection algorithm to separate the mixed multi-hand gestures. A dual 3D convolutional neural network-based feature fusion network is proposed for feature extraction and classification. We developed an FMCW radar-based platform to evaluate the performance of the proposed multi-hand gesture recognition method; the experimental results show that the proposed method can achieve an average recognition accuracy of 93.12% when eight gestures with two hands are performed simultaneously.

**Keywords:** frequency modulated continuous wave radar; gesture recognition; multi-hand; deep learning

#### **1. Introduction**

With the development of wireless sensing [1], human–computer interaction (HCI) [2] has been widely applied in daily life. Hand gesture recognition (HGR), an important HCI modality, is used in smart homes, robot control, virtual games, etc. For example, with HGR, users can control smart devices and play interactive virtual games. More importantly, with the development of intelligent vehicles, gesture recognition is becoming particularly important in intelligent-assisted driving. The driver can control various in-car functions through gestures, such as adjusting the in-car entertainment system or switching the air conditioner on or off, which helps the driver concentrate and improves driving safety. As a result, HGR is receiving a lot of attention, and it is the focus of this paper.

According to the acquisition method of hand gesture data, HGR can be divided into three types: (i) wearable sensor-based HGR [3], (ii) vision-based HGR [4], and (iii) radar-based HGR [5,6]. Using wearable sensors, the wearable-based HGR in [3] can acquire the motion information of hand gestures and achieve recognition accuracy as high as 99.3%. However, because users must wear the sensors, this approach is often uncomfortable and inconvenient. Vision-based HGR [4] applies cameras to capture RGB or depth images of hand gestures, in combination with image processing or computer vision for gesture recognition. Although the recognition accuracy is relatively high, vision-based HGR usually fails under poor lighting or non-line-of-sight conditions. Radar-based HGR applies the radar sensor to collect the hand gesture motion in a device-free manner. It is contactless, is not affected by lighting conditions, and has attracted much attention in both academia and industry [7,8].

**Citation:** Wang, Y.; Wang, D.; Fu, Y.; Yao, D.; Xie, L.; Zhou, M. Multi-Hand Gesture Recognition Using Automotive FMCW Radar Sensor. *Remote Sens.* **2022**, *14*, 2374. https://doi.org/10.3390/rs14102374

Academic Editors: Zhihuo Xu, Jianping Wang and Yongwei Zhang

Received: 20 April 2022 Accepted: 13 May 2022 Published: 14 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In general, radar-based HGR mainly contains two stages: (i) signal preprocessing and (ii) gesture feature extraction and classification. In the first stage, the raw radar signal is collected and processed to obtain the motion parameters (such as range or Doppler) of hand gestures. In References [7,8], the authors used the two-dimensional fast Fourier transform (2D-FFT) to estimate the range-Doppler map (RDM) of hand gestures. To improve the quality of the RDM, Wang et al. [9] adopted a gradient threshold to filter out the step peak in the RDM, and then used the wavelet transform to further enhance the RDM. To reduce the interferences, the authors in [10] adopted the neighbor threshold detection method instead of the constant false alarm rate (CFAR) [11] to filter out the interference targets and detect the hand gesture targets. After suppressing the interference signal with a first-order recursive filter, the authors in [12] used the lognormal CFAR to detect the hand gestures. Although these works focused on interference suppression and hand gesture detection, they only considered the range and Doppler information of hand gestures. In fact, the angle information of hand gestures provides an additional dimension to describe the hand gesture motion. As a result, the motion parameters (range, Doppler, and angle) in combination with interference suppression should be fully utilized to improve the performance of HGR.

In the stage of feature extraction and classification, deep learning methods, such as the convolutional neural network (CNN) [13,14], have proven to be an effective means for HGR. The deep convolutional neural network (DCNN) was applied to extract micro-Doppler gesture features and recognize 14 types of hand gestures [15]. However, the time features of the hand gesture motion were ignored, restricting the recognition accuracy of HGR. In References [16,17], 3D convolutional neural networks (3D-CNN) were presented to extract the motion features of continuous gestures. However, the 3D-CNN only focused on the local time features of hand gestures, and ignored the global time features. Zhang et al. [18] applied the long short-term memory (LSTM) network [19] to learn the global time features. In [12], Yang et al. proposed a reused LSTM network to extract the trajectory features of the range, angle, and Doppler of hand gestures.

The aforementioned researchers conducted a lot of work and promoted the development of HGR. However, all of them focused on HGR with a single hand gesture. In many applications, such as virtual games and collaborative control, multi-hand gestures should be recognized simultaneously. When there are multiple dynamic hand gestures in front of the radar, the echo signals of the gestures mix together, and multi-hand gesture recognition becomes more difficult. Peng et al. [20] explored multi-hand gesture recognition using the different ranges that corresponded to different hand gestures. Unfortunately, this method is not applicable when the multi-hand gestures have the same range. By using the range and angle information of multi-hand gestures with the beamforming technique, Wang et al. [21] extracted each gesture signal successively and then carried out a 2D-FFT operation to obtain the Doppler spectrum of each hand gesture. Although range and angle parameters are used for multi-hand gesture recognition, the Doppler information is missing. More importantly, when the multi-hand gestures have the same range and different hand gestures have similar Doppler characteristics, the recognition accuracies of the above two multi-hand gesture recognition methods degrade. Therefore, this paper applies the automotive frequency modulated continuous wave (FMCW) radar to design a multi-hand gesture recognition system that makes full use of the three-dimensional parameters of range, Doppler, and angle, together with a novel deep learning network. The main contributions of this paper are summarized as follows.

*Firstly*, we applied 3D-FFT on the fast-time domain, slow-time domain, and antenna domain to estimate the range, Doppler, and angle parameters. Specifically, the 2D-FFT was applied to construct the RDM, and the angle FFT was applied on the results of range FFT among multiple antennas to construct the range-angle map (RAM). We applied the Hanning window to suppress the spectral leakage, two-dimensional CA-CFAR (2D-CA-CFAR) to suppress the dynamic interference, and the average power of several continuous frames in feature maps to suppress the static interference in RDMs and RAMs of multi-hand gestures.

*Secondly*, we propose a spatiotemporal path selection algorithm to separate the multi-hand gestures in RDMs and RAMs. We designed a dual 3D-CNN-based feature fusion network (D-3D-CNN-FN) for feature extraction and classification of the separated hand gestures. Specifically, the dual 3D-CNN network was presented to extract the features of RDMs and RAMs, and the extracted features from multiple frames were then fused and input into the LSTM. The output feature sequence is classified by softmax.

*Finally*, we designed a platform and used eight types of multi-hand gestures to validate the effectiveness of the proposed multi-hand gesture recognition system; the experimental results verify the superiority of the proposed method.

The rest of this paper is organized as follows. Section 2 introduces the signal processing of the FMCW radar. Section 3 details the proposed multi-hand gesture recognition system, including interference suppression, multi-hand gesture separation, and hand gesture recognition. Experiments are carried out in Section 4, followed by a conclusion in Section 5.

#### **2. Radar Signal Processing**

In this section, we analyze the working principle of the automotive FMCW radar and construct the RDMs and RAMs of multi-hand gesture parameters in detail.

#### *2.1. IF Signal of FMCW Radar*

The FMCW radar generates a linear frequency modulation continuous signal through a waveform generator and transmits it by the transmitters. The transmitted signal is reflected by multi-hand gestures, and is then received by the receiving antennas. The intermediate frequency (IF) signal is obtained (shown in Figure 1) by mixing the transmitted and received signals, filtering out the high frequency part.

**Figure 1.** IF signal extraction.

The transmitted signal of the FMCW radar is

$$s\_{tx}(t) = A\_{tx} \cos(2\pi(f\_0 t + St^2/2 + \phi\_0(t))),\tag{1}$$

where *A<sub>tx</sub>* is the amplitude, *f*<sub>0</sub> is the initial frequency, and *S* = *B*/*T<sub>c</sub>* is the slope of the transmitted signal. *B* is the bandwidth of the radar, *T<sub>c</sub>* is the sweep period, and *φ*<sub>0</sub>(*t*) is the initial phase of the transmitted signal.

The received signal reflected by the *k*-th target (hand) can be expressed as

$$s\_{rx}(t,k) = A\_{rx} \cos(2\pi(f\_0(t-\tau\_k) + S(t-\tau\_k)^2/2 + \phi\_0(t-\tau\_k))),\tag{2}$$

where *A<sub>rx</sub>* and *τ<sub>k</sub>* = 2*R<sub>k</sub>*/*c*, respectively, represent the amplitude and flight time of the received echo signal, *R<sub>k</sub>* is the range from the *k*-th target (hand) to the radar, and *φ*<sub>0</sub>(*t* − *τ<sub>k</sub>*) is the phase of the received signal.

The received signal and the transmitted signal are sent to the mixer and passed through a low-pass filter (LPF) to obtain an IF signal. The IF signal is expressed as

$$s\_{IF}(t) = \sum\_{k=1}^{K} A\_k \cos(2\pi(S t \tau\_k + f\_0 \tau\_k)) + N(t),\tag{3}$$

where *K* is the number of hand gesture targets, *Ak* is the amplitude of the IF signal of the hand gesture target, and *N*(*t*) is the white Gaussian noise.
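As a rough numerical illustration of Equation (3), the following sketch synthesizes the noise-free IF signal of two static point targets and recovers their ranges from the beat frequencies *S*·*τ<sub>k</sub>*. The sweep period, sampling frequency, target ranges, amplitudes, and the helper name `if_signal` are illustrative assumptions and are not taken from the experimental configuration.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def if_signal(ranges_m, amps, f0, slope, fs, n_samples):
    """Noise-free IF signal of Eq. (3) for static point targets (hypothetical helper)."""
    t = np.arange(n_samples) / fs              # fast-time axis of one chirp
    s = np.zeros(n_samples)
    for r, a in zip(ranges_m, amps):
        tau = 2.0 * r / C                      # round-trip delay tau_k = 2 R_k / c
        s += a * np.cos(2 * np.pi * (slope * tau * t + f0 * tau))
    return s

# Assumed example values (not the experimental configuration).
f0, bandwidth, sweep = 77e9, 4e9, 50e-6
slope = bandwidth / sweep                      # S = B / Tc
fs, n_samples = 5.12e6, 256                    # 256 samples span one 50 us chirp

s_if = if_signal([0.3, 0.6], [1.0, 0.7], f0, slope, fs, n_samples)

spectrum = np.abs(np.fft.rfft(s_if))
peak_bins = np.sort(np.argsort(spectrum)[-2:])  # the two strongest beat frequencies
beat_freqs = peak_bins * fs / n_samples         # Hz
est_ranges = C * beat_freqs / (2 * slope)       # R = c * f_IF / (2 S)
print(beat_freqs, est_ranges)
```

With these assumed values, each target produces a single spectral line whose frequency is proportional to its range, which is the property exploited in the range estimation that follows.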

#### *2.2. Theory of Parameters Estimation*

To achieve multi-hand gesture recognition, in this paper, the range, Doppler, and angle parameters of the FMCW radar are applied. In this subsection, we analyze the estimation theories of these three parameters.

#### 2.2.1. Range Estimation

If we obtain the delayed time of the echo signal, the range between the radar and hand gesture target is calculated, i.e., *Rk* = *τkc*/2. In fact, the delayed time cannot be obtained directly. Fortunately, according to the principle of the IF signal extraction in Figure 1, we find that the delayed time determines the frequency of the IF signal, and the relationship is as follows.

$$f\_{IF} = S \cdot \tau\_k = \frac{B}{T\_c} \cdot \tau\_k. \tag{4}$$

Therefore, the delayed time is expressed as

$$
\tau\_k = \frac{T\_c}{B} f\_{IF}. \tag{5}
$$

Then, the range can be computed as

$$R(f\_{IF}) = \frac{cT\_c}{2B} f\_{IF} \,\tag{6}$$

Since the radar configuration parameters are predefined, the bandwidth *B* and the sweep period *T<sub>c</sub>* are fixed. Then, different hand gesture targets in front of the radar result in different IF frequencies, from which different ranges can be obtained. According to [22], the range resolution of the FMCW radar is *d<sub>res</sub>* = *c*/(2*B*). The maximum detection range of the FMCW radar is

$$R\_{\text{max}} = d\_{\text{res}} T\_{\text{c}} f\_{\text{s}}.\tag{7}$$

Here, *f<sub>s</sub>* is the sampling frequency. The range resolution is determined by the bandwidth, and the maximum detection range is affected by the bandwidth, the sweep period, and the sampling frequency. Therefore, to meet different range requirements, these three parameters should be carefully designed.
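To make the dependence on the radar parameters concrete, the short sketch below evaluates the range resolution, Equation (6), and Equation (7). The 4 GHz bandwidth is the value stated in the experimental setup; the sweep period and sampling frequency are assumed values chosen only for illustration, and the function names are hypothetical.

```python
C = 3e8  # speed of light (m/s)

def range_resolution(bandwidth_hz):
    """d_res = c / (2B)."""
    return C / (2.0 * bandwidth_hz)

def range_from_if(f_if_hz, bandwidth_hz, sweep_s):
    """Eq. (6): R = c * Tc * f_IF / (2B)."""
    return C * sweep_s * f_if_hz / (2.0 * bandwidth_hz)

def max_range(bandwidth_hz, sweep_s, fs_hz):
    """Eq. (7): R_max = d_res * Tc * fs."""
    return range_resolution(bandwidth_hz) * sweep_s * fs_hz

B = 4e9        # bandwidth stated in the experimental setup
TC = 50e-6     # assumed sweep period
FS = 5.12e6    # assumed sampling frequency

print(range_resolution(B))          # about 0.0375 m
print(max_range(B, TC, FS))         # about 9.6 m
print(range_from_if(160e3, B, TC))  # about 0.3 m
```

Note that *T<sub>c</sub>f<sub>s</sub>* is the number of samples per chirp, so Equation (7) states that the maximum range equals the range resolution times the number of range bins.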

#### 2.2.2. Doppler Estimation

To measure the Doppler of a moving hand gesture target, at least two chirps are required. Specifically, the phase difference of two continuous chirp signals is first calculated [23]

$$
\Delta\phi = \phi\_2 - \phi\_1 = \frac{4\pi\upsilon T\_c}{\lambda}.\tag{8}
$$

where *φ*<sub>1</sub> and *φ*<sub>2</sub> are the phases of the two chirp signals, *λ* is the wavelength, and *υ* is the Doppler (radial velocity) of the moving target, which is expressed as *υ* = *λ*Δ*φ*/(4π*T<sub>c</sub>*).

The Doppler resolution is *υ<sub>res</sub>* = *λ*/(2*M<sub>n</sub>T<sub>c</sub>*), where *M<sub>n</sub>* is the chirp number in a frame. The maximum detectable Doppler is *υ<sub>max</sub>* = *λ*/(4*T<sub>c</sub>*). The Doppler resolution is determined by the chirp number and the sweep period, whereas the maximum detectable Doppler is determined only by the sweep period.
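The relations above can be checked with a few lines of arithmetic. In the sketch below, the wavelength follows from the 77 GHz start frequency, while the sweep period and the number of chirps per frame are assumptions made for illustration; the function names are hypothetical.

```python
import math

C = 3e8  # speed of light (m/s)

def velocity_from_phase(delta_phi, wavelength, sweep_s):
    """Eq. (8) solved for v: v = lambda * delta_phi / (4 * pi * Tc)."""
    return wavelength * delta_phi / (4.0 * math.pi * sweep_s)

def doppler_resolution(wavelength, n_chirps, sweep_s):
    """v_res = lambda / (2 * Mn * Tc)."""
    return wavelength / (2.0 * n_chirps * sweep_s)

def max_doppler(wavelength, sweep_s):
    """v_max = lambda / (4 * Tc)."""
    return wavelength / (4.0 * sweep_s)

wavelength = C / 77e9      # about 3.9 mm at the 77 GHz start frequency
TC = 50e-6                 # assumed sweep period
N_CHIRPS = 64              # assumed number of chirps per frame

v_res = doppler_resolution(wavelength, N_CHIRPS, TC)
v_max = max_doppler(wavelength, TC)
print(v_res, v_max)
```

The maximum detectable Doppler corresponds to a phase difference of π between two consecutive chirps, beyond which the phase becomes ambiguous; with *M<sub>n</sub>* chirps, the unambiguous interval from −*υ<sub>max</sub>* to *υ<sub>max</sub>* is divided into *M<sub>n</sub>* Doppler bins.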

#### 2.2.3. Angle Estimation

Since the FMCW radar has multiple transmitting and receiving antennas, we can estimate the angle of the hand gesture target using the phase differences of multiple receiving antennas. The multiple receiving antennas cause path differences of the echo signal from the same target, resulting in the phase difference. The phase difference between two adjacent receiving antennas is

$$
\Delta \varphi = 2\pi \Delta d / \lambda, \tag{9}
$$

where Δ*d* = *l* sin *θ* is the path difference, *l* is the spacing between two adjacent receiving antennas, and *θ* is the arrival angle of the signal.

Then, the arrival angle can be expressed as

$$\theta = \sin^{-1}(\frac{\lambda \Delta \varphi}{2\pi l}).\tag{10}$$

Then, we can estimate the arrival angle by searching the spectral peak.
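As a small worked example of Equations (9) and (10), the sketch below converts between the inter-antenna phase difference and the arrival angle. Half-wavelength antenna spacing is assumed here for illustration; the source does not state the spacing, and the function names are hypothetical.

```python
import math

def angle_from_phase(delta_phi, spacing, wavelength):
    """Eq. (10): theta = asin(lambda * delta_phi / (2 * pi * l)), in radians."""
    return math.asin(wavelength * delta_phi / (2.0 * math.pi * spacing))

def phase_from_angle(theta, spacing, wavelength):
    """Eq. (9) with delta_d = l * sin(theta)."""
    return 2.0 * math.pi * spacing * math.sin(theta) / wavelength

wavelength = 3e8 / 77e9
spacing = wavelength / 2.0          # assumed half-wavelength spacing

theta = angle_from_phase(math.pi / 2.0, spacing, wavelength)
print(math.degrees(theta))          # about 30 degrees
```

Under the half-wavelength assumption, the phase difference stays within ±π for all arrival angles between −90° and 90°, so the angle is unambiguous.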

#### *2.3. 3D-FFT-Based RDM and RAM Construction*

Based on the range, Doppler, and angle estimation analysis, this subsection gives the RDM and RAM construction process in detail. Assume that *M* frames are transmitted, each frame contains *N* chirps, and the number of sampling points of each chirp is *N<sub>adc</sub>*. Since the three parameters can be estimated by 3D-FFT [24], we performed 3D-FFT to construct the RDM and RAM. The process of 3D-FFT for range, Doppler, and angle estimation is shown in Figure 2.

**Figure 2.** 3D-FFT process.

To obtain the three parameters, an FFT was first carried out on the sampling points of each chirp (such an operation is called range-FFT) to obtain the range of the multi-hand gestures. Based on the results of the first FFT, an FFT was further carried out over the *N* chirps of a frame to obtain the Doppler frequency (called Doppler-FFT). By performing the 2D-FFT on different frames, we obtained the RDMs of a complete multi-hand gesture. RDMs can be obtained using a single transmitting and a single receiving antenna. With multiple transmitting and receiving antennas, the results of the 2D-FFT (RDM) from different antennas were summed, resulting in less clutter and a higher SNR [25]. On the other hand, with multiple transmitting and receiving antennas, we can estimate the angle of the multi-hand gestures based on the results of the RDM. Generally, the third FFT is carried out on the results of the RDM over different antennas. Then, we searched the spectral peak on the results of the third FFT to obtain the range-angle maps of multi-hand gestures.
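The 3D-FFT processing described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the processing chain used in the experiments: the data cube layout (antennas, chirps, samples), the eight-element virtual array, the non-coherent (magnitude) summation over antennas and over Doppler bins, the 128-point zero-padded angle FFT, and the function name `build_rdm_ram` are all choices made here.

```python
import numpy as np

def build_rdm_ram(cube, n_angle_bins=128):
    """Build one RDM and one RAM from a single-frame data cube.

    cube: complex array of shape (n_antennas, n_chirps, n_adc).
    Returns rdm with shape (n_adc, n_chirps) and ram with shape (n_adc, n_angle_bins).
    """
    range_fft = np.fft.fft(cube, axis=2)                          # range-FFT over fast time
    rd = np.fft.fftshift(np.fft.fft(range_fft, axis=1), axes=1)   # Doppler-FFT over chirps
    rdm = np.abs(rd).sum(axis=0).T                                # sum magnitudes over antennas
    rda = np.fft.fftshift(np.fft.fft(rd, n=n_angle_bins, axis=0), axes=0)  # angle-FFT
    ram = np.abs(rda).sum(axis=1).T                               # collapse the Doppler axis
    return rdm, ram

# Synthetic single target at range bin 10, Doppler bin 5, and a phase step of
# 0.25 cycles between adjacent antennas.
n_ant, n_chirps, n_adc = 8, 64, 64
a = np.arange(n_ant)[:, None, None]
c = np.arange(n_chirps)[None, :, None]
n = np.arange(n_adc)[None, None, :]
cube = np.exp(2j * np.pi * (0.25 * a + 5 * c / n_chirps + 10 * n / n_adc))

rdm, ram = build_rdm_ram(cube)
rd_peak = np.unravel_index(np.argmax(rdm), rdm.shape)
ra_peak = np.unravel_index(np.argmax(ram), ram.shape)
print(rdm.shape, ram.shape, rd_peak, ra_peak)
```

Because of the `fftshift` calls, zero Doppler and zero angle sit at the center of their axes, so the synthetic target appears at Doppler index 5 + 32 and angle index 32 + 64.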

#### **3. Proposed Multi-Hand Gesture Recognition System**

In this section, the proposed multi-hand gesture recognition system is presented, shown in Figure 3. Since there are a lot of interferences in the parameter maps (i.e., RDM and RAM), to achieve a satisfactory recognition accuracy, we have to suppress the interferences. The multi-hand gesture is then separated by the proposed spatiotemporal path selection algorithm. Finally, the separated hand gestures are trained and tested by the designed deep learning network.

**Figure 3.** Proposed multi-hand gesture recognition flowchart.

#### *3.1. Interference Suppression*

There are many interferences in the IF signal. To suppress the interference from interfering radars, the authors in [26] designed novel orthogonal noise waveforms; the key idea is that the signal in the current transmission pulse is orthogonal to the next transmission signal. In [27], the authors proposed a tunable Q-factor wavelet transform to suppress the mutual interference between automotive FMCW radars. In [28], Wang utilized a one-dimensional CFAR to detect interferences, and the detection map was dilated to generate a mask for interference suppression. For the case of radar self-motion, the authors in [29] investigated a range alignment processing approach for breathing estimation that considers both the radar self-motion and the target motion. This work can be applied in the field of assistive devices for the disabled. In [30], Haggag et al. introduced a reliable and robust probabilistic method to estimate the radar self-motion. Different from the above cases, our scenario does not involve interfering radars; the interferences in the IF signal mainly come from the thermal noise inside the radar system, moving targets (such as the torso and arms), and static targets (such as walls or static objects) in the test environment. These interferences in RDMs and RAMs have significant impacts on the separation of multi-hand gestures and the recognition accuracy of hand gestures. To this end, we suppressed the interferences in RDMs and RAMs from the following three aspects.

#### 3.1.1. Spectral Leakage Cancellation

The spectral leakage mainly comes from the system noise and the aperiodicity of signal truncation in the ADC sampling. The system noise causes ghost targets near the radar, while the aperiodic truncation causes spectrum tailing over the whole frequency band, resulting in pseudo peaks around the target. Therefore, we applied the Hanning window [31] along all three dimensions of the data cube. Specifically, the Hanning window was applied to the sampling points of each chirp, the same sampling points of multiple chirps, and the sampling points from different antennas. The results in the next section show that such an operation reduces the spectral leakage.
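A minimal sketch of the three-dimensional windowing is given below, assuming a data cube laid out as (antennas, chirps, samples). The helper names `window_cube` and `leakage_ratio` and the toy tone are hypothetical; the example only illustrates that windowing lowers the spectral level far from a tone that does not fall on an FFT bin.

```python
import numpy as np

def window_cube(cube):
    """Apply a separable Hanning window along all three axes of the data cube."""
    n_ant, n_chirps, n_adc = cube.shape
    w = (np.hanning(n_ant)[:, None, None]
         * np.hanning(n_chirps)[None, :, None]
         * np.hanning(n_adc)[None, None, :])
    return cube * w

def leakage_ratio(samples, far_bin=30):
    """Magnitude at a bin far from the tone, relative to the spectral peak."""
    spec = np.abs(np.fft.fft(samples))
    return spec[far_bin] / spec.max()

# A tone that falls halfway between range bins 10 and 11 leaks into every bin.
n_ant, n_chirps, n_adc = 4, 16, 64
tone = np.exp(2j * np.pi * 10.5 * np.arange(n_adc) / n_adc)
cube = np.broadcast_to(tone, (n_ant, n_chirps, n_adc))

raw = leakage_ratio(cube[1, 8])
windowed = leakage_ratio(window_cube(cube)[1, 8])
print(raw, windowed)
```

One practical caveat of this sketch: `np.hanning` is zero at both end points, so with only a few antennas the outermost channels are weighted to zero. Whether the original processing used this exact window definition along the antenna axis is not stated in the text.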

#### 3.1.2. Dynamic Interference Suppression

Moving targets, such as the arms and torso, cause dynamic interferences in RDMs and RAMs. Therefore, we applied the 2D-CA-CFAR algorithm [32] to suppress these dynamic interferences in RDMs and RAMs. The 2D-CA-CFAR algorithm computes the average interference power in the reference window in the RDM (or RAM), and obtains an adaptive power threshold. Cells in the RDM (or RAM) whose power exceeds the threshold are marked as hand gesture targets; all other cells are marked as interference. Finally, we moved the reference window along the range and Doppler domain (or range and angle domain) in turn to suppress the dynamic interference of the RDM (or RAM) and extract the multi-hand target.
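The cell-averaging idea can be sketched as follows. This is a generic 2D CA-CFAR written for illustration; the guard-cell and training-cell counts, the threshold scale factor, the handling of map edges (left unprocessed), and the function name `ca_cfar_2d` are assumptions, since the text does not give these details.

```python
import numpy as np

def ca_cfar_2d(power, guard=2, train=4, scale=5.0):
    """Return a boolean detection mask for a 2D power map (RDM or RAM).

    For every cell under test, the noise level is the mean power of the training
    cells that surround a guard region; the cell is kept if it exceeds
    scale * noise. Cells too close to the border are left undetected.
    """
    n_rows, n_cols = power.shape
    half = guard + train
    mask = np.zeros((n_rows, n_cols), dtype=bool)
    for i in range(half, n_rows - half):
        for j in range(half, n_cols - half):
            outer = power[i - half:i + half + 1, j - half:j + half + 1]
            inner = power[i - guard:i + guard + 1, j - guard:j + guard + 1]
            n_train = outer.size - inner.size
            noise = (outer.sum() - inner.sum()) / n_train   # average training-cell power
            mask[i, j] = power[i, j] > scale * noise        # adaptive threshold
    return mask

# Toy map: flat background with a single strong cell.
power = np.ones((32, 32))
power[16, 16] = 100.0
mask = ca_cfar_2d(power)
cleaned = np.where(mask, power, 0.0)    # keep detections, zero everything else
print(int(mask.sum()), cleaned[16, 16])
```

The guard cells keep the energy of an extended target out of its own noise estimate, which matters for hand targets that occupy several adjacent cells.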

#### 3.1.3. Static Interference Suppression

Although the 2D-CA-CFAR algorithm suppresses the dynamic interferences from the torso or arms, there still exist static interferences caused by the wall or static objects. The Doppler of a static target in the RDM is zero, and the range and angle of static targets in the RAM remain stable. Therefore, we computed the average power of RDMs (or RAMs) over several continuous frames and subtracted the average power from the RDMs (or RAMs) obtained after dynamic interference suppression. In the experiments, the averaged power was computed using five continuous frames. The authors in [33] proposed a range and Doppler cell migration correction (RDCMC) algorithm to solve the range and Doppler cell migration problem. In the future, we will apply this algorithm in combination with our interference suppression scheme on RDM and RAM to obtain better feature maps of multi-hand gestures.
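The frame-averaging step can be sketched as below. The text specifies that five continuous frames are averaged but not how the averaging window is aligned or how negative residuals are handled; the trailing window, the clipping at zero, and the function name `suppress_static` are therefore assumptions of this sketch.

```python
import numpy as np

def suppress_static(maps, n_avg=5):
    """Subtract a running mean over the last n_avg frames from each map.

    maps: array of shape (n_frames, rows, cols) holding RDM or RAM power.
    """
    maps = np.asarray(maps, dtype=float)
    out = np.empty_like(maps)
    for f in range(maps.shape[0]):
        start = max(0, f - n_avg + 1)
        background = maps[start:f + 1].mean(axis=0)      # average power of recent frames
        out[f] = np.clip(maps[f] - background, 0.0, None)
    return out

# Toy sequence: one static cell and one target that moves one column per frame.
maps = np.zeros((8, 4, 8))
maps[:, 0, 0] = 10.0
for f in range(8):
    maps[f, 2, f] = 5.0

out = suppress_static(maps)
print(out[5, 0, 0], out[5, 2, 5])
```

In this toy case the static cell is removed completely, while the moving target loses only the fraction of its power that it contributes to its own background estimate. With a trailing window, the first frame is its own background and is zeroed entirely.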

#### *3.2. Spatiotemporal Path Selection Algorithm*

The parameter maps of multi-hand gestures were obtained after map construction and interference suppression. Since one frame constructs one RDM and one RAM, we collected *M* frames to ensure that the RDMs and RAMs contained complete multi-hand gestures. Because the hands differ in range, Doppler, and angle, different hand gesture targets appear as different highlights in the RDM and RAM. The same hand gesture needs to be matched in the RDM and RAM over different frames. However, since the range, Doppler, and angle of the multi-hand gesture are changing, the strengths of the same hand gesture in the RDM and RAM differ from frame to frame. As a result, we cannot simply use the strength to match the same hand gesture over different frames. Thus, based on the continuous variation characteristics in space and time, we propose a spatiotemporal path selection algorithm to separate the multi-hand gesture.

To show the separation procedure, we took the RDM as an example (the RAM follows the same procedure). Firstly, we found the hand gesture target that had the maximum amplitude in the RDM of the first frame. We denoted by *Y*<sub>1</sub>(*r*<sup>1</sup>, *d*<sup>1</sup>) the maximum amplitude of the first hand gesture, and by (*r*<sup>1</sup>, *d*<sup>1</sup>) the corresponding coordinate. Then, we calculated the cost function from (*r*<sup>1</sup>, *d*<sup>1</sup>) in the first frame to all the points in the second frame. The cost function is defined as

$$
C\_{ij}^1 = -|Y\_1(r^1, d^1)| - |Y\_2(r\_{ij}^2, d\_{ij}^2)| + \omega\_{ij}^1, \tag{11}
$$

where *Y*<sub>2</sub>(*r*<sup>2</sup><sub>*ij*</sub>, *d*<sup>2</sup><sub>*ij*</sub>) is the amplitude of the hand gestures in the second frame, *i* = 1, ···, *N<sub>adc</sub>*, *j* = 1, ···, *N*, *r*<sup>2</sup><sub>*ij*</sub> and *d*<sup>2</sup><sub>*ij*</sub> are, respectively, the range and Doppler coordinates in the second frame, and *ω*<sup>1</sup><sub>*ij*</sub> is the weight function between (*r*<sup>1</sup>, *d*<sup>1</sup>) and (*r*<sup>2</sup><sub>*ij*</sub>, *d*<sup>2</sup><sub>*ij*</sub>). The weight function is defined as

$$\omega\_{ij}^1 = \alpha\_i^1 \|r^1 - r\_{ij}^2\|\_2 + \alpha\_j^1 \|d^1 - d\_{ij}^2\|\_2, \tag{12}$$

where *α*<sup>1</sup><sub>*i*</sub> and *α*<sup>1</sup><sub>*j*</sub> are the weight factors of the range and Doppler domain, respectively, and ‖·‖<sub>2</sub> is the Euclidean norm.

We found the minimum value of the cost function by searching all of the computed cost values; the corresponding coordinate is taken as the position of the hand gesture target in the next frame. Repeating this search frame by frame, we can obtain the coordinates of the first hand gesture target in all the *M* RDMs. Since the hand gesture target occupied several consecutive coordinates, we used a rectangular window to select the hand gesture target based on the obtained coordinates. To extract the rest of the hand gesture targets, we subtracted the obtained hand gesture targets from the RDMs until all the hand gestures were separated. The detailed separation process is summarized in Algorithm 1.

**Algorithm 1** Spatiotemporal path selection algorithm for multi-hand gesture separation.


5: Compute the maximum amplitude *Y<sub>f</sub>*(*r<sup>f</sup>*, *d<sup>f</sup>*) and the corresponding coordinate (*r<sup>f</sup>*, *d<sup>f</sup>*) at the *f*-th frame, and record the cost function *C*.

*Y<sub>f</sub>*(*r<sup>f</sup>*, *d<sup>f</sup>*) at the (*f* + 1)-th frame.

12: Construct a window centered on (*r*<sup>*f*+1</sup>, *d*<sup>*f*+1</sup>), and obtain the RDM of the *N<sub>n</sub>*-th hand gesture at the (*f* + 1)-th frame.

15: Subtract the RDM of the *N<sub>n</sub>*-th hand gesture target from the RDM of multi-hand gesture, and update *N<sub>n</sub>* = *N<sub>n</sub>* − 1.

16: **end for**

The input of Algorithm 1 is the *M* RDMs of multi-hand gestures and the output is the separated *M* RDMs of the *Nn* single hand gestures. The hand gesture target in the first RDM was selected randomly, and the selected hand gesture was then subtracted. Then, by applying Algorithm 1, we obtained the *M* RDMs of each hand gesture. Similarly, the RAMs of each hand gesture can be obtained.
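The separation procedure can be sketched in a few lines of NumPy. This is an interpretation of the description in this subsection, not a transcription of Algorithm 1: the starting point is the maximum-amplitude cell of the first frame, the cost follows Equations (11) and (12) with unit weight factors, and the window half-width, the function names `track_target` and `separate_hands`, and the toy data are assumptions.

```python
import numpy as np

def track_target(maps, alpha_r=1.0, alpha_d=1.0):
    """Follow the strongest target of the first frame through all frames."""
    n_frames, n_r, n_d = maps.shape
    rr, dd = np.meshgrid(np.arange(n_r), np.arange(n_d), indexing="ij")
    r, d = np.unravel_index(np.argmax(np.abs(maps[0])), (n_r, n_d))
    path = [(int(r), int(d))]
    for f in range(n_frames - 1):
        weight = alpha_r * np.abs(rr - r) + alpha_d * np.abs(dd - d)    # Eq. (12)
        cost = -np.abs(maps[f, r, d]) - np.abs(maps[f + 1]) + weight     # Eq. (11)
        r, d = np.unravel_index(np.argmin(cost), (n_r, n_d))             # minimum-cost cell
        path.append((int(r), int(d)))
    return path

def separate_hands(maps, n_hands, half_win=2, **weights):
    """Extract n_hands single-target map sequences from the mixed maps."""
    residual = np.array(maps, dtype=float)
    results = []
    for _ in range(n_hands):
        path = track_target(residual, **weights)
        single = np.zeros_like(residual)
        for f, (r, d) in enumerate(path):
            rs = slice(max(0, r - half_win), r + half_win + 1)
            ds = slice(max(0, d - half_win), d + half_win + 1)
            single[f, rs, ds] = residual[f, rs, ds]   # keep the windowed target
            residual[f, rs, ds] = 0.0                 # subtract it from the mixture
        results.append((path, single))
    return results

# Toy example: hand A is strongest in frame 0 but weaker than hand B afterwards.
maps = np.zeros((6, 32, 32))
for f in range(6):
    maps[f, 5 + f, 10] = 10.0 if f == 0 else 6.0   # hand A moves in range
    maps[f, 20, 20 - f] = 8.0                      # hand B moves in Doppler

(path_a, rdm_a), (path_b, rdm_b) = separate_hands(maps, n_hands=2)
print(path_a, path_b)
```

In the toy data, hand A is weaker than hand B from the second frame on, yet the path stays on hand A because the distance term in the cost penalizes the jump to hand B. This is the situation in which matching by strength alone would fail.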

#### *3.3. D-3D-CNN-FN for HGR*

After the multi-hand gestures were separated by the proposed spatiotemporal path selection algorithm, we input the separated hand gesture dataset of RDMs and RAMs into the designed deep learning network for HGR. A total of 32 frames contained a single complete multi-hand gesture motion. To realize real-time applications, we had to segment the continuous data flow into a single complete gesture motion by finding the start and end times of the continuous multi-hand gesture motion. Since the motion features of each hand gesture are described by multiple continuous RDMs and RAMs, the features and continuity of each hand gesture should be carefully considered. Therefore, we propose a dual 3D-CNN-based feature fusion network for HGR, shown in Figure 3. The dual 3D-CNN is applied to extract the features of RDMs and RAMs, and the LSTM network is further applied to extract the continuous features of the gesture motion. The detailed procedure is described as follows.

#### 3.3.1. 3D-CNN-Based Feature Extraction

Since each hand gesture has *M* RDMs and *M* RAMs, we applied two 3D-CNNs to respectively extract the feature of RDMs and RAMs. Then, the extracted feature sequences of range-Doppler and range-angle were fused. The detailed architecture of the dual 3D-CNN is shown in Figure 4.

The sizes of RDM and RAM were 64 × 64 and 64 × 128, respectively, and the sizes were determined by the sampling points of each chirp and the chirp number in each frame. To extract the features of 32 RDMs, the designed network contained five 3D convolution and pooling layers, and one fully connected layer. We carried out one 3D convolution and one maximum pooling operation in each of the first three 3D convolution and pooling layers. In the fourth and fifth layers, we carried out two 3D convolutions and one maximum pooling operation. To reduce the network parameters and improve the generalization ability of the network, all the convolution operations adopted a 3 × 3 × 3 convolution kernel, followed by the rectified linear unit (ReLU) activation function to reduce the interdependence between parameters and alleviate the overfitting phenomenon of the network. To accommodate the characteristics of RAMs, the 3D-CNN for feature extraction of RAMs contained six 3D convolution and pooling layers; each layer applied maximum pooling. Since the size of RAM is 64 × 128, we used 1 × 1 × 2 and 1 × 2 × 2 pooling kernels in the first two pooling layers to maintain the size of RDM. To extract subtle local features of hand gestures, we carried out two 3D convolution operations in the final 3D convolution layer.
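The effect of the two asymmetric pooling kernels on the RAM branch can be traced with simple shape arithmetic. The sketch assumes an input layout of (frames, range bins, angle bins), a pooling stride equal to the kernel size, and convolutions that preserve the spatial size; none of these details are stated explicitly in the text, and `pooled_shape` is a hypothetical helper.

```python
def pooled_shape(shape, kernel):
    """Output shape of max pooling with stride equal to the kernel size."""
    return tuple(s // k for s, k in zip(shape, kernel))

ram_shape = (32, 64, 128)                            # (frames, range bins, angle bins)
after_pool1 = pooled_shape(ram_shape, (1, 1, 2))     # (32, 64, 64)
after_pool2 = pooled_shape(after_pool1, (1, 2, 2))   # (32, 32, 32)
print(after_pool1, after_pool2)
```

Under these assumptions, the first pooling layer halves only the angle axis, so the RAM feature maps reach the 64 × 64 size of the RDM input before any temporal pooling takes place.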

#### 3.3.2. LSTM-Based Time Sequential Feature Extraction

The dual 3D-CNN extracts the dynamic hand gesture features of RDMs and RAMs over *M* frames, and obtains the range-Doppler and range-angle features as two 1 × 1024 feature sequences. To obtain the time sequential features of the complete dynamic hand gesture, the LSTM network was applied. Since the two feature sequences were fused sequentially to obtain a feature sequence of size 2 × 1024, the step of the applied LSTM network is 2. Then, the feature with size 2 × 1 is sent to each LSTM, and the time sequential feature with size 1 × 1024 is the output. The LSTM for the time sequential feature extraction is shown in Figure 5. In Figure 5, *f<sub>t</sub>* is the forget gate, *i<sub>t</sub>* is the input gate, *o<sub>t</sub>* is the output gate, *c<sub>t</sub>* is the storage unit, and *h<sub>t</sub>* is the hidden layer state. At each time step, *i<sub>t</sub>*, *f<sub>t</sub>*, *c<sub>t</sub>*, *o<sub>t</sub>*, and *h<sub>t</sub>* are expressed as

$$\begin{cases} \begin{aligned} i\_t &= \sigma(W\_{mi} \cdot \mathbf{TF}\_t + W\_{hi} \cdot h\_{t-1} + W\_{ci} c\_{t-1} + b\_i) \\ f\_t &= \sigma(W\_{mf} \cdot \mathbf{TF}\_t + W\_{hf} \cdot h\_{t-1} + W\_{cf} c\_{t-1} + b\_f) \\ \tilde{c}\_t &= \tanh(W\_{mc} \cdot \mathbf{TF}\_t + W\_{hc} \cdot h\_{t-1} + b\_c) \\ c\_t &= f\_t \odot c\_{t-1} + i\_t \odot \tilde{c}\_t \\ o\_t &= \sigma(W\_{mo} \cdot \mathbf{TF}\_t + W\_{ho} \cdot h\_{t-1} + W\_{co} c\_{t-1} + b\_o) \\ h\_t &= o\_t \odot \tanh(c\_t) \end{aligned} \end{cases} \tag{13}$$

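where *σ*(·) is the sigmoid function, *σ*(*x*) = 1/(1 + *e*<sup>−*x*</sup>), tanh(·) is the hyperbolic tangent function, ⊙ denotes element-wise multiplication, **TF**<sub>*t*</sub> is the input feature at time step *t*, *c̃<sub>t</sub>* is the candidate state of the storage unit, *W<sub>mi</sub>*, *W<sub>hi</sub>*, *W<sub>ci</sub>*, *W<sub>mf</sub>*, *W<sub>hf</sub>*, *W<sub>cf</sub>*, *W<sub>mc</sub>*, *W<sub>hc</sub>*, *W<sub>mo</sub>*, *W<sub>ho</sub>*, and *W<sub>co</sub>* are the weights in the LSTM unit, and *b<sub>i</sub>*, *b<sub>f</sub>*, *b<sub>c</sub>*, and *b<sub>o</sub>* are the corresponding biases.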
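For readers who prefer code to notation, one update of such a cell can be sketched in NumPy. The sketch assumes the common peephole LSTM form with sigmoid input, forget, and output gates and element-wise (diagonal) peephole weights. The small feature dimensions, the random initialization, the way the two fused features are fed as two time steps, and the names `lstm_step` and `init_params` are illustrative assumptions rather than details of the trained network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a peephole LSTM cell; p is a dict of weights and biases."""
    i = sigmoid(p["W_mi"] @ x + p["W_hi"] @ h_prev + p["W_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_mf"] @ x + p["W_hf"] @ h_prev + p["W_cf"] * c_prev + p["b_f"])
    c_tilde = np.tanh(p["W_mc"] @ x + p["W_hc"] @ h_prev + p["b_c"])   # candidate state
    c = f * c_prev + i * c_tilde                                       # new cell state
    o = sigmoid(p["W_mo"] @ x + p["W_ho"] @ h_prev + p["W_co"] * c_prev + p["b_o"])
    h = o * np.tanh(c)                                                 # new hidden state
    return h, c

def init_params(n_in, n_hidden, rng):
    """Random small weights and zero biases (illustrative initialization)."""
    p = {}
    for gate in "ifco":
        p[f"W_m{gate}"] = 0.1 * rng.standard_normal((n_hidden, n_in))
        p[f"W_h{gate}"] = 0.1 * rng.standard_normal((n_hidden, n_hidden))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    for gate in "ifo":
        p[f"W_c{gate}"] = 0.1 * rng.standard_normal(n_hidden)   # diagonal peephole weights
    return p

rng = np.random.default_rng(0)
n_in, n_hidden = 16, 8                      # small stand-ins for the 1024-dimensional features
params = init_params(n_in, n_hidden, rng)
features = rng.standard_normal((2, n_in))   # assumed: RDM feature, then RAM feature
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in features:
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)
```

Because the hidden state is the product of a gate in (0, 1) and a tanh output, every element of `h` stays strictly between −1 and 1.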

**Figure 5.** LSTM network-based time series feature extraction.

#### 3.3.3. Gesture Classification

The output of the LSTM network is the time sequential feature of the hand gesture with size 1 × 1024. To classify the hand gesture using the time sequential features, we first normalized the sequences and input them into the fully connected layer. The feature sequences were then input into the following normalized exponential function

$$softmax\_{v\_i} = \frac{\exp(w\_i^T v\_i)}{\sum\_{j=1}^k \exp(w\_j^T v\_j)}\,\tag{14}$$

where *i* denotes the *i*-th hand gesture, *v<sub>i</sub>* is the *i*-th element of the feature sequence, *w<sub>i</sub>* is the weight corresponding to *v<sub>i</sub>*, and *k* is the number of hand gesture types.
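A numerically stable softmax over the eight class scores can be sketched as follows. The scores are made-up numbers standing in for the output of the fully connected layer; the class labels are the eight single-hand gestures defined in the experimental setup, and subtracting the maximum score before exponentiation is a standard stabilization that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential function with max-subtraction for stability."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

labels = ["LSL", "LSR", "LSU", "LSD", "RSL", "RSR", "RSU", "RSD"]
scores = np.array([0.2, 1.5, -0.3, 0.0, 3.1, 0.4, -1.2, 0.9])   # made-up class scores
probs = softmax(scores)
predicted = labels[int(np.argmax(probs))]    # "RSL"
print(predicted, probs.round(3))
```

The predicted gesture is the class with the largest probability, which is also the class with the largest score, since the softmax is monotonic.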

#### **4. Experiments and Analysis**

#### *4.1. Experimental Setup*

In this paper, we designed and built a multi-hand gesture recognition platform using the automotive FMCW radar and data collection card provided by Texas Instruments (TI). The FMCW radar sensor is the AWR1642 [34], and the data collection card is the DCA1000 [35], shown in Figure 6. The collected data were processed by a personal computer (PC), and the recognition results were displayed with a graphical user interface (GUI). The radar had two transmitting and four receiving antennas; the start frequency was 77 GHz and the bandwidth was 4 GHz. We collected 32 frames to acquire the complete multi-hand gestures; the detailed radar parameters are listed in Table 1. In the experiment, an Intel-6700K processor and an NVIDIA-GTX1080 graphics card were used.

**Figure 6.** The adopted radar sensor and data collection card.
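As a rough check on what the quoted radar parameters imply, the theoretical range resolution of an FMCW radar is c/(2B). The sketch below assumes the full 4 GHz bandwidth is swept within the sampled part of each chirp, and that both transmitters are used in a time-division MIMO scheme so that the virtual array has one channel per transmit-receive pair. Neither assumption is stated in the text, and the function names are illustrative.

```python
C = 299_792_458.0  # speed of light in m/s

def range_resolution(bandwidth_hz):
    """Theoretical FMCW range resolution c / (2B), in metres."""
    return C / (2.0 * bandwidth_hz)

def wavelength(freq_hz):
    """Carrier wavelength in metres."""
    return C / freq_hz

def virtual_channels(n_tx, n_rx):
    """Virtual receive channels of a MIMO array (assumes all TX-RX pairs are used)."""
    return n_tx * n_rx

delta_r = range_resolution(4e9)   # about 0.0375 m, i.e. roughly 3.75 cm
lam = wavelength(77e9)            # about 3.9 mm at the 77 GHz start frequency
n_virtual = virtual_channels(2, 4)
```

Under these assumptions, two hands separated by a few centimetres in range fall into different range bins, which is consistent with the statement that the radar can distinguish the hands by range and angle.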



We designed eight types of multi-hand gestures performed with two hands, shown in Figure 7a. The method presented in this paper is also applicable to more than two hands; we leave that test for future work. Since the FMCW radar can distinguish the two hands by range and angle, the presented method can also be applied to recognize the gestures of the driver and the passenger in the front row of a car. The arrows indicate the movement directions of the hand gestures. Although there were eight types of multi-hand gestures, each hand performed only four types of hand gestures. After multi-hand gesture separation, there were eight hand gestures in total, namely left hand slides to left (LSL), left hand slides to right (LSR), left hand slides up (LSU), left hand slides down (LSD), right hand slides to left (RSL), right hand slides to right (RSR), right hand slides up (RSU), and right hand slides down (RSD), shown in Figure 7a. Since the left hand and right hand perform similar actions, the similar features increase the recognition difficulty. To improve the robustness of the designed deep learning network, we collected multi-hand gestures from three men and two women, and each person performed each multi-hand gesture 100 times. As a result, each multi-hand gesture had 500 samples, and there were 4000 samples in total for the eight types of multi-hand gestures.

**Figure 7.** Hand gesture types.

#### *4.2. Results and Analysis*

#### 4.2.1. Effect of Interference Suppression

To show the effect of interference suppression, we used the RDM of the multi-hand gesture G1 in Figure 7 at the 10th frame; the results are shown in Figure 8a–d. It can be seen from Figure 8b that the spectral leakage in the original RDM was suppressed by the Hanning window. Moreover, the dynamic interferences (caused by the torso and arms as well as micro-motion in the environment), marked by black boxes in Figure 8b, were suppressed in Figure 8c. The static interferences caused by the wall or other static objects were also suppressed in Figure 8d.

**Figure 8.** Effect of interference suppression for RDM. (**a**) Original. (**b**) With the added Hanning Window. (**c**) Dynamic interference suppression. (**d**) Static interference suppression.
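The effect of the Hanning window on spectral leakage can be reproduced on a synthetic signal. The sketch below is not the authors' processing chain: it uses a single complex tone placed halfway between two FFT bins (a worst case for leakage) and compares the level at a distant bin with and without the window. The signal length, tone position, and the function `leakage_db` are illustrative choices.

```python
import numpy as np

def leakage_db(window, tone_bin=20.5, far_bin=50):
    """Level at a distant FFT bin relative to the spectral peak, in dB."""
    n = len(window)
    t = np.arange(n)
    x = np.exp(2j * np.pi * tone_bin * t / n)   # off-bin tone leaks into all bins
    spectrum = np.abs(np.fft.fft(x * window))
    return 20.0 * np.log10(spectrum[far_bin] / spectrum.max())

n_samples = 128
rect_db = leakage_db(np.ones(n_samples))        # no window (rectangular)
hann_db = leakage_db(np.hanning(n_samples))     # Hanning window
```

With these settings the Hanning-windowed spectrum shows far lower leakage at the distant bin than the unwindowed one, at the cost of a wider main lobe.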

We also used the RAM of the same hand gesture as in Figure 8a–d to show the effect of interference suppression; the results are shown in Figure 9a,b. We can see from Figure 9a that the two hand gesture targets are located at −30° and 40°. Due to the spectral leakage and interferences, there were interference targets in the original RAM, marked by black and white boxes. Figure 9b shows that the two hand gesture targets are clearly visible after interference suppression.

**Figure 9.** Effect of interference suppression for RAM. (**a**) Original. (**b**) After interference suppression.

#### 4.2.2. Impact of Training Dataset Size

In this HGR experiment, we first analyzed the influence of the ratio of the training dataset to the testing dataset (2:8, 5:5, 7:3 and 8:2) on the results, shown in Figure 10. It can be seen that when the ratio is 2:8, the generalization ability and recognition accuracy of the D-3D-CNN-FN network are very poor, mainly because the small training dataset is insufficient to fit the network. As the ratio increases, the training set contains more samples, which raises the recognition accuracy of the proposed network. When the ratio is 7:3, the D-3D-CNN-FN network has the best performance, so we used a ratio of 7:3 in the following experiments.

**Figure 10.** Accuracy under different dataset ratios.
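For illustration, the four split ratios can be realized with a per-class (stratified) random split. Whether the authors stratified the split, and whether they split the 4000 multi-hand samples or the separated single-hand gestures, is not stated; the sketch assumes 8 classes with 500 samples each, and `stratified_split` is a hypothetical helper.

```python
import numpy as np

def stratified_split(labels, train_fraction, seed=0):
    """Split sample indices so each class keeps the same train/test proportion."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Assumed layout: 8 gesture types x 500 samples each.
labels = np.repeat(np.arange(8), 500)
split_sizes = {}
for name, fraction in [("2:8", 0.2), ("5:5", 0.5), ("7:3", 0.7), ("8:2", 0.8)]:
    train, test = stratified_split(labels, fraction)
    split_sizes[name] = (len(train), len(test))
```

Under this assumption, the 7:3 ratio corresponds to 2800 training samples and 1200 test samples, with 350 training samples per class.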

#### 4.2.3. Impact of Learning Rate

Since the learning rate determines how strongly the weights are adjusted according to the loss function during network training, Figure 11 compares the recognition performance of the network under different learning rates. It can be seen that with a large learning rate, such as 0.001 or 0.005, the D-3D-CNN-FN network fails to converge or falls into a local optimum. If the learning rate is too small (such as 0.00001), the D-3D-CNN-FN network converges very slowly. As a result, we set the learning rate to 0.00008 in the following experiments.

**Figure 11.** Accuracy under different learning rates.
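The qualitative behaviour described above can be seen on a toy problem. The sketch below runs plain gradient descent on a one-dimensional quadratic loss; it is unrelated to the D-3D-CNN-FN network, and the curvature, step counts, and learning rates are arbitrary illustrative values rather than the ones tested in Figure 11.

```python
def gradient_descent(lr, curvature=10.0, w0=1.0, steps=50):
    """Minimise f(w) = 0.5 * curvature * w**2 and return the final |w|."""
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w   # the gradient of f is curvature * w
    return abs(w)

# Each step multiplies w by (1 - lr * curvature).
too_large = gradient_descent(lr=0.25)     # factor -1.5: the iterate grows without bound
moderate = gradient_descent(lr=0.05)      # factor 0.5: converges quickly
too_small = gradient_descent(lr=0.0001)   # factor 0.999: barely moves in 50 steps
```

The same trade-off motivates choosing an intermediate learning rate: large enough to make progress, small enough to remain stable.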

#### 4.2.4. Recognition Accuracy Comparison

To verify the effectiveness of the D-3D-CNN-FN, we took 3D-CNN [17] with RDMs or RAMs and dual 3D-CNN with RDMs and RAMs for comparison. The comparison methods are denoted by 3D-CNN+RDM, 3D-CNN+RAM and D-3D-CNN+RDM and RAM, respectively. The dual 3D-CNN (D-3D-CNN) applied two 3D-CNN networks to extract features of RDMs and RAMs, directly followed by the softmax classifier. The ratio of the training dataset to the test dataset was 7:3, the initial learning rate was set to 0.0008, and the number of training iterations was 10,000.

The accuracy of each network on the test dataset during training is shown in Figure 12. Since the proposed D-3D-CNN-FN applied dual 3D-CNN for local feature extraction of both RDMs and RAMs and used LSTM for global time feature extraction, the proposed D-3D-CNN-FN has higher recognition accuracy than D-3D-CNN. Moreover, 3D-CNN with only RDMs or only RAMs has poor recognition accuracy, mainly because of the limited motion parameters.

**Figure 12.** Training process comparison of each network.

The recognition accuracies of each type of gesture are summarized in Table 2, where 'Ave.' denotes the average recognition accuracy. Compared to 3D-CNN with only RDM or RAM, D-3D-CNN uses two network branches to learn the local features of RDM and RAM, respectively, and then fuses the features and inputs them into a fully connected layer for classification. As a result, the average recognition accuracy (86.95%) of dual 3D-CNN is improved by 4.2 percentage points compared to the single-branch 3D-CNN. With the fused features input into the LSTM for global feature extraction, the average recognition accuracy of the proposed D-3D-CNN-FN is about 6.2 percentage points higher than that of the D-3D-CNN network.


**Table 2.** Recognition accuracy comparison (%).
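A per-gesture accuracy table of this kind can be computed from true and predicted labels as sketched below. The labels here are toy values, not the paper's results, and the function `per_class_accuracy` is illustrative. The average is taken over classes, which equals the overall accuracy when every class has the same number of test samples; whether 'Ave.' was computed this way is an assumption. The last line only checks the arithmetic of the reported gain, using the two averages quoted in the text.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes):
    """Percentage of correctly recognized samples for each class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = np.empty(n_classes)
    for c in range(n_classes):
        mask = y_true == c
        acc[c] = np.mean(y_pred[mask] == c) * 100.0
    return acc

# Toy labels for three classes with four samples each (not experimental data).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 0, 1])
accuracies = per_class_accuracy(y_true, y_pred, n_classes=3)
average = accuracies.mean()

# Gain of D-3D-CNN-FN (93.12%) over D-3D-CNN (86.95%), in percentage points.
gain = round(93.12 - 86.95, 2)
```

The gain evaluates to 6.17 percentage points, which the text rounds to 6.2.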

#### **5. Conclusions**

In this paper, we proposed a multi-hand gesture recognition method using an automotive FMCW radar sensor. The 3D-FFT was applied to construct the RDM and RAM. Then, the interference was suppressed and the multi-hand gestures were separated by the proposed spatiotemporal path selection algorithm. The dual 3D-CNNs were proposed to extract the features of RDM and RAM, and the fused features were input into the LSTM network. The performance of the presented system was validated on the self-built dataset. The results showed that the average recognition accuracy of the proposed method was 93.12%, an improvement of 6.2 percentage points compared with the state-of-the-art. In the future, we will design an efficient and robust interference suppression scheme, as well as a deep learning network that takes radar self-motion into consideration. Moreover, we will carry out real-time tests with more hands and more multi-hand gesture types. The multi-hand gesture recognition system uses the radar to recognize multi-hand gestures in a device-free manner. It has potential applications in intelligent driving, Industry 4.0, HVAC systems, etc.

**Author Contributions:** Y.W. conceived the original idea and wrote the paper; D.W. and D.Y. collected and tested the multi-hand gesture recognition system; Y.F., L.X. and M.Z. analyzed the data and revised the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Natural Science Foundation of China under grant 61901076; in part by the National Science Foundation of Chongqing under grant cstc2020jcyjmsxmX0865; in part by the China Postdoctoral Science Foundation under grant 2021M693773; and in part by the Science and Technology Research Program of Chongqing Education Commission, under grant KJQN201900603.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors would like to thank the reviewers and editor for their valuable comments and suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

