Article

Hand Trajectory Recognition by Radar with a Finite-State Machine and a Bi-LSTM

1 School of Electronic and Information Engineering, Beihang University, Beijing 100191, China
2 Key Laboratory of Intelligent Sensing Materials and Chip Integration Technology of Zhejiang Province, Hangzhou Innovation Institute, Beihang University, Hangzhou 310052, China
3 Gaode-Ride Sharing Business, Alibaba (Beijing) Software Services Co., Ltd., Beijing 100012, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6782; https://doi.org/10.3390/app14156782
Submission received: 26 June 2024 / Revised: 27 July 2024 / Accepted: 1 August 2024 / Published: 3 August 2024

Abstract:
Gesture plays an important role in human–machine interaction. However, insufficient accuracy and high complexity have blocked the widespread application of gesture recognition. A gesture recognition method that combines a finite-state machine and a bidirectional long short-term memory (Bi-LSTM) fusion neural network is proposed to improve accuracy and efficiency. Firstly, gestures with large movements are categorized in advance into simple trajectory gestures and complex trajectory gestures. Afterwards, different recognition methods are applied to the two categories of gestures, and the final recognition result is obtained by combining the outputs of the two methods. Specifically, a state machine recognizes six simple trajectory gestures, and a bidirectional LSTM fusion neural network recognizes four complex trajectory gestures. Finally, the experimental results show that the proposed simple trajectory gesture recognition method has an average accuracy of 99.58% and the bidirectional LSTM fusion neural network has an average accuracy of 99.47%, so the method can efficiently and accurately recognize 10 gestures with large movements. In addition, by collecting more gesture data from untrained participants, it was verified that the proposed neural network has good generalization performance and can adapt to the various operating habits of different users.

1. Introduction

Gesture is an important approach in human–machine interaction. Gestures are the most frequently used body language and help people exchange information and express emotions in daily communication. Therefore, gestures can be applied to many scenarios in daily life [1], such as virtual reality [2], somatosensory games [3], smart homes [4], and car control [5]. To date, existing gesture recognition methods have been implemented with hardware such as wearable sensors [6,7,8,9] and visual image sensors [10,11,12], which have achieved good results in specific situations. However, these gesture recognition technologies still have some limitations, such as inconvenience in use, susceptibility to environmental interference, and privacy disclosure.
With the improvement of semiconductor manufacturing and the development of signal processing algorithms, millimeter-wave radar is gradually being applied everywhere [13,14,15]. Gesture recognition based on millimeter-wave radar can effectively compensate for the shortcomings of wearable and visual image sensors. Gesture recognition based on radar has the following advantages: (1) There is no need to wear additional equipment. (2) It is not affected by light intensity, dust, haze, etc., and can work normally all day and in all weather conditions. (3) Due to the penetrability of electromagnetic waves, the radar can be integrated into the equipment to increase stability. (4) It cannot obtain the user's personal information, ensuring privacy.
At present, some institutions and scholars have studied several relevant methods of gesture recognition based on millimeter-wave radar [16,17,18,19,20,21]. In [16], Google systematically studied the problem of radar gesture recognition and customized the entire set of hardware and algorithms. Relying on its integrated miniaturized radar chip, gesture recognition was implemented efficiently and successfully integrated into the Pixel 4 phone, which took the first step towards commercialization. In [17], Nvidia fused millimeter-wave radar, optical, and depth-of-field sensors in order to exploit the complementary advantages of the different sensors, and then used a convolutional neural network (CNN) to classify and recognize the fused data. The recognition accuracy for 10 kinds of gestures reached 94.1%, and the system was successfully applied to a vehicle system. In [18], Infineon proposed a two-dimensional CNN long short-term memory (2D CNN-LSTM) network for gesture recognition. Considering that gesture actions are far from the radar in some applications, this research mainly focused on gesture recognition in long-distance scenarios and achieved an accuracy of 94.75% in recognizing five gestures at long distances.
To improve the effectiveness and the ability of gesture recognition in complex scenes, some scholars have applied different deep learning methods in gesture recognition and achieved good results.
In [22], Thomas Stadelmayer et al. proposed a deep metric learning approach based on a variational autoencoder architecture, which is optimized using a novel loss function combining a statistical-distance triplet loss and a center loss. This method can better learn the nonlinear characteristics in the data and is less sensitive to training strategies. The experimental results show that, compared with existing deep metric learning methods, the proposed gesture recognition system using radar spectrograms improves the classification accuracy and the ability to suppress random gesture actions.
In [23], Jaehoon Jung et al. proposed a method for recognizing air-writing digits based on ultra-wideband (UWB) and frequency-modulated continuous wave (FMCW) radar using transfer learning. Firstly, this method processes the radar echo signal to remove clutter signals and converts the radar echo signal into appropriate data formats. Then, the range time map (RTM) and Doppler time map (DTM) of the target are used as inputs of the CNN to classify the radar signal image based on learning the nonlinear characteristics of each digit. The experimental results show that the proposed digit recognition method can recognize air-writing digits from 0 to 9, and the accuracy is greater than 98%. In addition, compared with the standard CNN method, the proposed transfer learning method exhibits superior performance in distinguishing each digit.
In [24], Peijun Zhao et al. proposed a CubeLearn model that directly extracted features from raw radar signals and constructed an end-to-end deep neural network for millimeter-wave FMCW radar gesture recognition. The experimental results show that the proposed CubeLearn model improves classification accuracy and can be used in edge computing devices due to its high computational efficiency.
In [25], Myoungseok Yu et al. proposed an optimal detection method for segmenting hand gesture data in real-time recognition systems. Their approach achieved an impressive accuracy of 96.88%, which represents a significant improvement of 11.84% over traditional methods. This enhanced accuracy is critical for the effective segmentation of frames, thereby improving the overall performance of real-time gesture recognition systems.
In [26], Jae-Woo Choi et al. proposed a hand gesture recognition system for real-time applications using 60 GHz FMCW radar. The system includes both signal processing and machine learning components, with an LSTM encoder designed to learn the temporal characteristics of the RDM sequences. The proposed system successfully distinguishes between 10 gestures collected from 10 participants, achieving a high classification accuracy of 99.10%. It also maintains a high accuracy of 98.48% for gestures from new participants.
However, there are still some shortcomings in the existing research. Firstly, because multiple gesture types must be classified, using a single neural network to recognize all of the gestures results in insufficient recognition accuracy. Secondly, because the complexity of different gesture actions is not considered and all gesture data are fed into training, the input data are large, the network model is complex, and the recognition efficiency is low. Furthermore, most studies only collect all gesture data and compile them into a dataset for neural network training and classification without considering gesture segmentation, so they cannot be practically applied in real-time gesture recognition scenarios.
Therefore, to solve the above issues, a radar gesture recognition method based on a finite-state machine combined with a bidirectional LSTM fusion neural network is proposed in this paper. By categorizing gesture actions in advance as either simple trajectory gestures or complex trajectory gestures, six simple trajectory gestures are recognized using a state machine, and four complex trajectory gestures are recognized using a bidirectional LSTM fusion neural network. For simple trajectory gestures, as the hand only moves in a straight line, there is no need to use a complex neural network; instead, a state machine determines the motion status of the gesture at different moments to recognize the current gesture in real time. When gestures cannot be recognized by the state machine, a bidirectional LSTM fusion neural network is used to classify the collected data and recognize complex trajectory gestures. Finally, by combining the classification results of the state machine and the neural network, the recognized gesture type is output. The main contributions of this article are as follows.
(i) A simple trajectory gesture recognition method based on a state machine is proposed, which determines the gesture status using motion parameters and identifies the current gesture type based on the status;
(ii) A complex trajectory gesture recognition method based on bidirectional LSTM fusion neural network is proposed, which inputs complex gesture data that cannot be recognized by the state machine into the neural network for classification and recognition of complex gesture types.
The remainder of this paper is organized as follows. Related works are introduced in Section 2. The methodology is presented in Section 3. The experimental results and discussion are shown in Section 4. Finally, the conclusion is provided in Section 5.

2. Related Works

In this section, related works on hand gesture recognition based on FMCW radar are introduced.

2.1. Radar Systems

A low-cost commercial FMCW millimeter-wave radar is used for the gesture recognition experiments. A picture of the radar is shown in Figure 1a. The frequency range is 75–81 GHz with a maximum sweep bandwidth of 3 GHz. It has four transmit and four receive antennas, which can form sixteen virtual receiving channels by operating in a Time-Division Multiplexed Multiple-Input Multiple-Output (TDM-MIMO) mode. The radar chip is packaged using Antenna-in-Package (AiP) technology, which greatly reduces the radar size so that it can be applied in compact scenarios such as automotive sensing, gesture recognition, and human detection [27].
The radar hardware is mainly formed by an RF front-end chip, a controller chip, and some peripheral circuits. The structural diagram of the simplified FMCW radar with a single transmitter and receiver is shown in Figure 1b.
This radar is designed with a compact AiP antenna, which can be easily integrated in small devices. The configuration of the transmit and receive antennas is illustrated in Figure 2a, with the transmit antenna depicted in red and the receive antenna in blue. The receiver antenna spacing is set at half of the carrier wavelength. By employing the TDM-MIMO mode, it is possible to obtain 16 virtual array elements, as illustrated in Figure 2b.
The radar system parameters are shown in Table 1. According to the parameters, the range resolution is 5 cm, the velocity resolution is 1.14 m/s, and the angle resolution is 1/6 rad.
The millimeter-wave radar emits an FMCW waveform, which is reflected by the target and received by the receive antenna. After mixing, filtering, and other operations, the radar extracts the intermediate frequency signal from the echo, which contains the target information regarding range, velocity, and angle. Finally, by collecting multiple frames of radar data, the continuous motion trajectory is obtained to achieve gesture recognition.

2.2. Gesture Design

In daily life, gestures play a ubiquitous role, exemplified by the directive motions of traffic officers. Gesture protocols should therefore align with everyday habits so that users can understand and apply them intuitively.
The three-dimensional spatial model utilized for millimeter-wave radar to detect human gesture motion is illustrated in Figure 3a, and the real scene to collect human gesture data is shown in Figure 3b. The radar is located at the coordinate origin, and the antenna array plane is parallel to the XOZ plane, enabling the detection of gesture motion across the three-dimensional coordinates of X, Y, and Z. The radar can obtain target parameters such as range, azimuth, elevation angle, radial velocity, and scattering cross-section. However, due to the dispersion of electromagnetic power after being scattered by the hands, there is no obvious characteristic in the target scattering cross-section of gestures in actual measurements, rendering it ineffective for recognition. Therefore, the scattering cross-section parameter is not considered a feature for gesture motion.
According to the complexity of commonly used gesture motion trajectories, gestures can be divided into two categories: simple trajectory gestures and complex trajectory gestures. Simple trajectory gestures refer to gestures that move in a single direction along a straight line, which are easily recognizable. Complex trajectory gestures, on the other hand, involve not only linear motion but also reciprocating and circular arc movements, making their motion trajectories relatively complex and not directly recognizable. Therefore, for these two different categories of gestures, simple trajectory gestures can be recognized directly based on the solution of trajectory parameters, while deep learning algorithms are employed to recognize complex trajectory gestures.
According to Figure 3, the positive direction of the Z-axis is defined as upward and the negative direction is defined as downward. The positive direction of the X-axis is defined as leftward and the negative direction as rightward. The positive direction of the Y-axis is defined as backward and the negative direction as forward. Therefore, simple trajectory gestures are designed as six types of gestures: upward, downward, leftward, rightward, forward, and backward. All these gestures are the most frequently used ones in human daily life. The schematic diagram of these six gesture motions is shown in Figure 4.
Considering the practical and easily distinguishable features, complex trajectory gestures are designed as four types of gestures: waving to the left and right, drawing circles counterclockwise, drawing a check mark, and drawing a cross mark. For example, in applications, the users can switch file views by waving left or right, adjust the progress bar or volume by drawing circles, and operate the confirmation and cancellation in certain scenarios by drawing check or cross marks. The schematic diagram of these four gesture motions is shown in Figure 5.

2.3. Gesture Parameter Estimation

To classify and recognize 10 gestures, range, velocity, azimuth, and elevation information need to be estimated from the raw echo data of radar. Because gestures have different motion trajectories, their classification and recognition can be achieved by analyzing and processing these trajectories.

2.3.1. Range Estimation

The millimeter-wave radar can measure the target distance using electromagnetic waves. The transmission signal of the FMCW radar is a linearly modulated wave whose frequency varies linearly with time. This radar uses up-chirp (sawtooth) frequency-modulated pulses, and the transmitted signal $s_{TX}(t)$ can be expressed as

$$s_{TX}(t) = \exp\left[ j 2\pi \left( f_c t + \tfrac{1}{2} K t^2 \right) \right]$$

where $f_c$ and $K$ denote the carrier frequency and the modulation rate, respectively.
The emitted electromagnetic wave propagates through space and is reflected by the target. The received echo signal $s_{RX}(t)$ can be expressed as

$$s_{RX}(t) = \exp\left[ j 2\pi \left( f_c (t - \tau) + \tfrac{1}{2} K (t - \tau)^2 \right) \right]$$

where $\tau$ is the two-way propagation delay of the electromagnetic wave from the radar to the target.
When the target is stationary, the time delay $\tau$ can be expressed as

$$\tau = \frac{2R}{c}$$

where $R$ denotes the range between the radar and the target, and $c$ denotes the speed of light.
By mixing the echo signal with the transmission signal, an intermediate frequency (IF) signal can be obtained, which contains the distance information of the target. The IF signal can be expressed as

$$s_{IF}(t) = s_{RX}(t)\, s_{TX}^{*}(t) = \exp\left[ -j 2\pi \left( K \tau t + f_c \tau - \tfrac{1}{2} K \tau^2 \right) \right]$$

So, the target range can be obtained, which is expressed as

$$R = \frac{c f_0}{2K}$$

where $f_0$ denotes the IF frequency.

2.3.2. Velocity Estimation

When the target is moving, the millimeter-wave radar can measure the target radial velocity. After the electromagnetic wave is reflected by the moving target, the two-way propagation delay $\tau$ from the target to the radar can be expressed as

$$\tau = \frac{2(R + v t)}{c}$$

where $v$ denotes the radial velocity of the target.
Because the frequency variation caused by the target velocity is small, the phase of the IF signal is used to obtain the velocity information. Ignoring the quadratic term in the phase of the IF signal, the phase of the IF signal can be expressed as

$$\varphi = 2\pi f_c \tau$$

To measure the target velocity, two adjacent pulses are used to calculate the phase difference, which can be expressed as

$$\Delta\varphi_v = \varphi_2 - \varphi_1 = 2\pi f_c (\tau_2 - \tau_1)$$

where $\tau_1$ and $\tau_2$ denote the time delays corresponding to the first and second pulses, respectively, which can be expressed as

$$\tau_1 = \frac{2(R + v t)}{c}, \qquad \tau_2 = \frac{2\left(R + v (t + T_c)\right)}{c}$$

where $T_c$ represents the pulse repetition period.

Substituting $\tau_1$ and $\tau_2$ into the phase difference expression, the phase difference can be expressed as

$$\Delta\varphi_v = \frac{4\pi f_c T_c v}{c} = \frac{4\pi T_c v}{\lambda}$$

Therefore, the target radial velocity can be obtained as

$$v = \frac{\lambda \Delta\varphi_v}{4\pi T_c}$$

2.3.3. Angle Estimation

Millimeter-wave radar utilizes multi-channel receiving antennas to measure the target angle. Because the spacing between adjacent receiving antennas is $d$, the difference in echo propagation distance between two adjacent receiving antennas is $d \sin\theta$. Based on this distance difference, the angle information can be calculated.
In the IF signal, the phase difference generated by the distance difference can be expressed as

$$\Delta\varphi_a = \frac{2\pi d \sin\theta}{\lambda}$$

Therefore, the target angle can be obtained as

$$\theta = \arcsin\left( \frac{\lambda \Delta\varphi_a}{2\pi d} \right)$$
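To make the closed-form relations in Sections 2.3.1–2.3.3 concrete, the following minimal Python sketch evaluates them numerically; the carrier frequency, modulation rate, and pulse repetition period used here are illustrative assumptions rather than the configuration in Table 1.

```python
import numpy as np

# Illustrative sketch of the closed-form estimates in Sections 2.3.1-2.3.3.
# The numeric parameter values below are placeholders, not the radar configuration in Table 1.
c = 3e8            # speed of light (m/s)
fc = 77e9          # carrier frequency (Hz), assumed
lam = c / fc       # wavelength (m)
K = 60e12          # modulation rate (Hz/s), assumed
Tc = 100e-6        # pulse repetition period (s), assumed
d = lam / 2        # receive antenna spacing (half wavelength)

def range_from_if(f0):
    """Target range from the IF beat frequency: R = c*f0 / (2K)."""
    return c * f0 / (2.0 * K)

def velocity_from_phase(delta_phi_v):
    """Radial velocity from the pulse-to-pulse phase difference: v = lam*dphi / (4*pi*Tc)."""
    return lam * delta_phi_v / (4.0 * np.pi * Tc)

def angle_from_phase(delta_phi_a):
    """Angle from the inter-channel phase difference: theta = arcsin(lam*dphi / (2*pi*d))."""
    return np.arcsin(lam * delta_phi_a / (2.0 * np.pi * d))

# Example: a 200 kHz beat frequency maps to 0.5 m with the assumed modulation rate.
print(range_from_if(200e3), velocity_from_phase(0.1), np.degrees(angle_from_phase(0.5)))
```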

2.4. Gesture Segmentation

Because radar echo signals are continuously collected, preprocessing is necessary to obtain information on gesture movements for the subsequent classification and recognition.
Moving Target Indication (MTI) is a technique that employs clutter suppression methods to enhance the signal-to-noise ratio (SNR) of moving targets and separate them from stationary background clutter.
Because of the frequency difference between the signal from moving targets and stationary clutter, the MTI filter suppresses stationary clutter by forming a stopband at direct current and integer multiples of the pulse repetition frequency while retaining the information of the moving targets in radar systems. To suppress clutter, a double delay canceller is utilized for extracting moving targets in gesture recognition.
The transfer function of the double delay canceller can be expressed as

$$H(z) = (1 - z^{-1})^2 = 1 - 2z^{-1} + z^{-2}$$
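As an illustration of how the double delay canceller can be applied, the sketch below filters a radar data cube along the slow-time (pulse) axis; the array shape convention is an assumption.

```python
import numpy as np
from scipy.signal import lfilter

def mti_double_delay(iq_cube):
    """
    Apply the double delay canceller H(z) = 1 - 2 z^-1 + z^-2 along the slow-time
    (pulse) axis of a radar data cube.  iq_cube is assumed to be a complex array
    shaped (num_pulses, num_samples); the shape convention is an assumption.
    """
    b = np.array([1.0, -2.0, 1.0])   # numerator coefficients of H(z)
    a = np.array([1.0])              # FIR filter, no feedback
    return lfilter(b, a, iq_cube, axis=0)

# Example: stationary clutter (a constant return across pulses) is removed.
pulses = np.ones((64, 128), dtype=complex)           # pure stationary return
print(np.allclose(mti_double_delay(pulses)[2:], 0))  # True after filter start-up
```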
To compare the radar echo signal of a gesture before and after the MTI process, it is necessary to use the high-resolution range profile (HRRP) of the radar echo signal. Additionally, the HRRP of the gesture will be used to segment the gesture data sequence. The HRRP of the gesture before and after MTI processing is shown in Figure 6. It can be observed that after MTI processing, the echo signal of the gesture becomes more distinct, which benefits the improvement of gesture detection performance in the subsequent steps.
After the gesture HRRP is obtained, noise in the receiver output still affects target detection. Therefore, constant false alarm rate (CFAR) detection is performed on the HRRP to extract the gesture target. By using order-statistics CFAR (OS-CFAR) to detect targets in the gesture HRRP, complete actions are obtained, as shown in Figure 7.
By using OS-CFAR detection results, the start and end positions of the gesture can be determined for segmenting movements. Firstly, a sliding window is used along the time direction to detect gesture targets with a fixed step length. If the duration of the gesture’s existence exceeds a threshold, the corresponding time of the gesture’s occurrence is set as the start position of the gesture. If the gesture target continuously appears and disappears within the sliding window, and the duration of the gesture disappearance exceeds a threshold, the disappearance position of the gesture is set as the end position. At this point, a complete gesture action is collected. The start and end positions of a complete gesture action after segmentation are illustrated by the red lines in Figure 7.
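The segmentation logic described above can be summarized by the following sketch, which scans per-frame detection flags with sliding criteria; the specific duration thresholds are assumptions, since the paper does not state their values.

```python
import numpy as np

def segment_gesture(detections, start_len=5, end_len=5):
    """
    Determine the start/end frame indices of a gesture from a boolean per-frame
    detection sequence (e.g., OS-CFAR hits on the HRRP).  The thresholds
    start_len/end_len (frames of sustained presence/absence) are assumptions,
    not values stated in the paper.
    """
    start, end = None, None
    run_present, run_absent = 0, 0
    for i, hit in enumerate(detections):
        if start is None:
            run_present = run_present + 1 if hit else 0
            if run_present >= start_len:
                start = i - start_len + 1          # gesture start position
        else:
            run_absent = run_absent + 1 if not hit else 0
            if run_absent >= end_len:
                end = i - end_len + 1              # gesture end position
                break
    return start, end

# Example: frames 10..39 contain a gesture.
flags = np.zeros(60, dtype=bool); flags[10:40] = True
print(segment_gesture(flags))   # (10, 40)
```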

2.5. Data Collection

Different users perform different motions when using the same gesture to interact with a machine, which may cause the gesture to be misunderstood. Therefore, it is important to consider the diversity of data from different users when collecting gesture data. This helps the deep learning algorithm generalize even with limited samples from real scenarios.
Because there is no open-source dataset of millimeter-wave radar gesture data, the gesture dataset was collected from 15 individuals, each contributing 200 sets of data, to ensure diversity and comprehensiveness. In total, 3000 samples were collected, with 750 samples for each type of gesture.
The dataset is typically divided into training, validation, and test sets. The partitioning strategy for a dataset usually depends on its size. For small-scale datasets (below 10,000 samples), traditional methods, such as a 6:2:2 ratio, are suitable for partitioning. For large-scale datasets (over 1 million samples), a common practice is to allocate 1% for the test set and 1% for the validation set. Since the dataset is limited in size, a 6:2:2 ratio is adopted to split the training, validation, and test sets.

3. Methodology

At present, gesture recognition based on FMCW radar mainly uses deep learning neural networks for classification and recognition. However, there are still some shortcomings. Using a single neural network for the classification and recognition of multiple gesture types often results in insufficient accuracy. Because all gesture data are input simultaneously during training, the input data volume is large, the network model is complex, and the recognition efficiency is low. Conventional research typically emphasizes direct training and recognition using collected gesture datasets and overlooks gesture segmentation, resulting in a lack of practicality.
To overcome the shortcomings, the commonly used gestures are categorized into two types: simple trajectory gestures and complex trajectory gestures. Simple trajectory gestures which have linear motions can be directly recognized by solving parameters. Therefore, a state machine method can be employed for recognizing simple trajectory gestures. Complex trajectory gestures cannot be directly recognized, so the neural network method can be employed for recognizing them. Finally, by combining the recognition results of the two methods, the final recognized gesture is obtained. The diagram of recognition process is illustrated in Figure 8.
The specific processing flow is as follows. Initially, the millimeter-wave radar acquires the IF signal of the hand target through the transmission and reception of electromagnetic waves; this signal contains target information such as distance, velocity, azimuth, and elevation angle. Secondly, the IF signal is processed to estimate the gesture parameters. Thirdly, MTI processing and OS-CFAR detection are conducted on the IF signal to detect the start and end positions of the gesture in order to segment the complete gesture movements. Fourthly, preprocessing is applied to the acquired gesture trajectory, which involves median filtering and uniform resampling to accommodate the various operating habits of different users. After preprocessing, the gesture trajectory is recognized to obtain the result. For simple trajectory gestures, using a neural network may lead to complexity and inefficiency; directly judging the motion trajectory is more suitable for recognizing such gestures. In addition, trained neural networks can only recognize pre-set gestures, making it difficult to expand the gesture types. For complex trajectory gestures, direct classification using parameters can lead to confusion. Therefore, neural networks are employed to identify specific gestures in three dimensions: distance, azimuth, and elevation. Finally, the classification results of the two methods are combined and the recognized gesture type is output.

3.1. Simple Trajectory Gesture Recognition

3.1.1. Gesture Trajectory Extraction

The gesture trajectory parameters include distance, azimuth, and elevation angle, representing the three-dimensional trajectory in the coordinate system during the gesture movement. The process of extracting gesture trajectory involves the measurement of target range and velocity, target detection, and estimation of target azimuth and elevation angles. The specific signal processing flowchart is shown in Figure 9.
Firstly, 2D FFT processing is conducted on the radar IF echo signal to generate a range–velocity map. The specific procedure involves arranging the IF signals from multiple pulses into a two-dimensional data matrix and conducting FFT calculations along both the sampling point and pulse accumulation dimensions. In FMCW radar, according to the principle of range and velocity measurement, targets with different distances and velocities will be distinguished and displayed at different positions on the range–velocity map after 2D FFT processing.
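A minimal sketch of this 2D FFT step is given below; the Hanning window is an assumption, since the paper does not specify a window function.

```python
import numpy as np

def range_velocity_map(iq_frame):
    """
    Sketch of the 2D FFT step in Figure 9: an FFT over fast time (samples) gives
    range, and an FFT over slow time (chirps) gives velocity.  iq_frame is assumed
    to be a complex array shaped (num_chirps, num_samples).
    """
    win = np.hanning(iq_frame.shape[1])
    range_profiles = np.fft.fft(iq_frame * win, axis=1)                    # fast time -> range
    rv_map = np.fft.fftshift(np.fft.fft(range_profiles, axis=0), axes=0)   # slow time -> velocity
    return np.abs(rv_map)

# Example: 100 chirps x 256 samples for one frame -> a 100 x 256 range-velocity map.
print(range_velocity_map(np.zeros((100, 256), dtype=complex)).shape)
```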
Secondly, target detection is performed on the range–velocity map to extract the hand target, while retaining the phase information of the corresponding position on the image for the subsequent calculation of the target’s azimuth and elevation angles. The specific process involves using detection methods to filter out arm information and extracting hand information. This is because the obtained range–velocity map contains both hand and arm information, which may interfere with gesture trajectory extraction. In general, the amplitude of hand signals is greater than that of the arm signals. Therefore, OS-CFAR detection can be employed for hand target detection while filtering out arm targets. The steps of OS-CFAR involve performing one-dimensional CFAR detection along the distance direction on the range–velocity map, followed by extracting the CFAR detection results along the velocity dimension. If the radiation value in the distance dimension exceeds the predefined threshold, a hand target at that distance is identified, thus obtaining target distance and velocity information.
In Figure 10, the range–velocity map displays two target peaks, with the peak of higher amplitude representing the hand target and the peak of lower amplitude representing the arm target. After OS-CFAR detection, only the information related to the hand target was retained in the range–velocity map, while the information concerning the arm target was filtered out.
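The OS-CFAR detection step can be sketched in one dimension as follows; the guard-cell, training-cell, order, and scaling parameters are illustrative assumptions rather than the values used in the paper.

```python
import numpy as np

def os_cfar_1d(power, guard=2, train=8, k=12, scale=3.0):
    """
    One-dimensional order-statistics CFAR along the range dimension.  For each
    cell, the k-th smallest value among the training cells (guard cells excluded)
    estimates the noise level; a target is declared when the cell exceeds
    scale * estimate.  guard/train/k/scale are illustrative assumptions.
    """
    n = len(power)
    hits = np.zeros(n, dtype=bool)
    for i in range(n):
        lo = max(0, i - guard - train)
        hi = min(n, i + guard + train + 1)
        window = np.concatenate([power[lo:max(0, i - guard)],
                                 power[min(n, i + guard + 1):hi]])
        if len(window) < k:
            continue
        noise = np.sort(window)[k - 1]        # order statistic of the training cells
        hits[i] = power[i] > scale * noise
    return hits

# Example: a strong hand peak and a weaker arm peak in a noisy range profile.
rng = np.random.default_rng(0)
profile = rng.rayleigh(1.0, 128); profile[40] += 30; profile[45] += 8
print(np.flatnonzero(os_cfar_1d(profile)))
```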
Afterward, the azimuth and elevation of the hand target are estimated to obtain its angular information. The specific process involves using the Capon algorithm to process the complex data retained at the range and velocity units corresponding to the target's location.
The core of the Capon algorithm is filtering time-domain echo data, suppressing frequency components and noise unrelated to the target through adaptive filters. This allows the target signal frequency to pass through without distortion, facilitating the estimation of signal power at the target frequency.
The signal power $P_{\mathrm{Capon}}(\theta)$ at angle $\theta$ in the Capon algorithm can be expressed as

$$P_{\mathrm{Capon}}(\theta) = \frac{1}{\boldsymbol{\alpha}^{H}(\theta)\, \hat{\mathbf{R}}_x^{-1}\, \boldsymbol{\alpha}(\theta)}$$
where $\boldsymbol{\alpha}(\theta)$ represents the steering vector and $\hat{\mathbf{R}}_x$ represents the correlation matrix of the received signal. Then, the observation angle is scanned with a certain angular step, and the power value at each angle is calculated. Subsequently, a peak search is conducted on the power spectrum, and the angle corresponding to the peak is the desired target angle.
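A minimal numerical sketch of this spectrum search is shown below; the diagonal loading and the scan grid are assumptions added for numerical robustness and are not taken from the paper.

```python
import numpy as np

def capon_spectrum(snapshots, angles_deg, d_over_lambda=0.5, loading=1e-3):
    """
    Capon (MVDR) angular power spectrum for a uniform linear array.
    snapshots: complex array (num_channels, num_snapshots) taken at the target's
    range-velocity cell.  Diagonal loading and the scan grid are assumptions.
    """
    m = snapshots.shape[0]
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]   # correlation matrix
    R += loading * np.trace(R).real / m * np.eye(m)           # regularization
    R_inv = np.linalg.inv(R)
    powers = []
    for theta in np.deg2rad(angles_deg):
        a = np.exp(1j * 2 * np.pi * d_over_lambda * np.arange(m) * np.sin(theta))
        powers.append(1.0 / np.real(a.conj() @ R_inv @ a))    # P(theta) = 1 / (a^H R^-1 a)
    return np.array(powers)

# Example: a single source at +20 degrees; the spectrum peaks near that angle.
m, snaps = 8, 64
a0 = np.exp(1j * 2 * np.pi * 0.5 * np.arange(m) * np.sin(np.deg2rad(20)))
rng = np.random.default_rng(1)
x = np.outer(a0, rng.standard_normal(snaps)) \
    + 0.1 * (rng.standard_normal((m, snaps)) + 1j * rng.standard_normal((m, snaps)))
grid = np.arange(-60, 61)
print(grid[np.argmax(capon_spectrum(x, grid))])   # close to 20
```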
For gesture targets, the correlation matrix $\hat{\mathbf{R}}_x$ is computed from the data at the target's range and velocity units, and a peak search is then conducted on the power spectrum to determine the azimuth of the gesture target. The elevation is estimated in the same way as the azimuth. Combined with the previously obtained target distance and velocity information, the three-dimensional spatial position of the gesture motion trajectory is fully depicted.
Finally, the data corresponding to the minimum distance between the gesture and the radar are selected, the three-dimensional spatial coordinates of the gesture trajectory are retrieved for each frame, and the target motion trajectory is plotted.

3.1.2. Median Filtering of Gesture Trajectory

After obtaining the motion trajectory of gestures, due to the interference and noise present in the target detection, there are false alarms in the detection results which affect subsequent recognition. Therefore, median filtering can be used to filter out discrete points and smooth the gesture trajectory. Median filtering is a nonlinear signal processing technique based on statistical sorting theory, which effectively suppresses noise and is commonly used in noise reduction processing. The basic principle involves replacing the value at a certain point in a digital sequence with the median value of the data adjacent to that point. This approach not only brings the value closer to the true data value but also smooths the sequence data by eliminating isolated noise points.
The measurement values of $n$ points are $h_1, h_2, h_3, \ldots, h_n$. To detect whether $h_i$ is an outlier, the filter window length is set to $2m+1$ ($2m+1 < n$). This means that $2m+1$ measurement values $h_{i-m}, \ldots, h_i, \ldots, h_{i+m}$ are sequentially taken from the $n$-point measurement sequence and sorted in descending order. The median value after sorting is taken as the filtering output, denoted as $\mathrm{median}(h_j)$. The pseudocode for median filtering is presented in Algorithm 1.
Algorithm 1. Median filtering
MEDIAN(points, m)
   N ← LENGTH(points)
   newPoints ← points
   for i ← m + 1 to N − m do
      window ← points[i − m … i + m]
      SORT(window)
      newPoints[i] ← window[m + 1]   // median of the 2m + 1 sorted values
   return newPoints
To verify the effectiveness of median filtering, a set of data for a counterclockwise circular hand gesture in the xoy plane was collected for testing. Median filtering was applied to the gesture trajectory, and the filtered results of the x, y, and z coordinates are shown in Figure 11. Based on the filtered results, a relatively smooth gesture trajectory can be obtained through median filtering in the three dimensions of range, azimuth, and elevation. To assess the results of median filtering, the mean-squared error (MSE) was used to quantify the difference between the curves before and after filtering. In this example, the MSE values for X, Y, and Z are 0.0012 m, 0.0016 m, and 0.0013 m, respectively, which illustrates that median filtering preserves the curve information effectively.
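For reference, a runnable counterpart of Algorithm 1 can be written with a standard sliding-window median; the window length used here (2m + 1 = 5) is an illustrative assumption.

```python
import numpy as np
from scipy.signal import medfilt

# Minimal sketch of Algorithm 1: a sliding-window median over each trajectory
# coordinate.  The window length (2m + 1 = 5) is an illustrative assumption.
def median_filter_trajectory(traj_xyz, m=2):
    """traj_xyz: array of shape (num_frames, 3) holding x, y, z per frame."""
    return np.column_stack([medfilt(traj_xyz[:, k], kernel_size=2 * m + 1)
                            for k in range(traj_xyz.shape[1])])

# Example: an isolated outlier in x is suppressed.
traj = np.zeros((9, 3)); traj[:, 0] = [0, 0.01, 0.02, 0.5, 0.04, 0.05, 0.06, 0.07, 0.08]
print(median_filter_trajectory(traj)[3, 0])   # outlier replaced by a neighboring value
```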

3.1.3. Uniform Resampling of Gesture Trajectory

After smoothing and filtering the gesture trajectory, the time length and frame rate of different gestures vary due to different user habits, which can affect the recognition performance. Therefore, it is necessary to resample the gesture trajectory data. Resampling ensures that the gesture trajectory is not influenced by different users or gesture actions, and ensures that the data points of all gesture trajectories are consistent. Trajectory resampling primarily involves interpolation based on the length of the trajectory, ensuring that the data points along each segment of the gesture trajectory are evenly distributed [28].
The specific method involves first calculating the total length of the data point paths, then dividing the total length by the interval length between new points, and finally adding new points through linear interpolation to achieve uniform resampling of the trajectory. The pseudocode for trajectory resampling is presented in Algorithm 2.
Algorithm 2. Trajectory resampling
RESAMPLE(points, n)
   I ← PATH-LENGTH(points) / (n − 1)
   D ← 0
   newPoints ← [points[0]]
   for each point p_i (i ≥ 1) in points do
      d ← DISTANCE(p_{i−1}, p_i)
      if D + d ≥ I then
         q.x ← p_{i−1}.x + ((I − D) / d) × (p_i.x − p_{i−1}.x)
         q.y ← p_{i−1}.y + ((I − D) / d) × (p_i.y − p_{i−1}.y)
         APPEND(newPoints, q)
         INSERT(points, i, q)   // q will be the next p_i
         D ← 0
      else
         D ← D + d
   return newPoints

PATH-LENGTH(A)
   d ← 0
   for i ← 1 to |A| − 1 do
      d ← d + DISTANCE(A_{i−1}, A_i)
   return d
The filtered gesture motion trajectory is resampled, and the trajectory before and after resampling is plotted in the xoy plane. The images are illustrated in Figure 12, where the blue dots represent the starting positions of the gesture trajectory and the yellow dots represent the ending positions. It can be observed that after resampling, the previously non-uniformly distributed points of the gesture trajectory are evenly spaced. The resampling process converts a gesture trajectory traced at varying speeds into a uniformly sampled trajectory.

3.1.4. Gesture Trajectory Recognition

After extracting the target parameters, segmenting the gesture movements, and preprocessing the gesture trajectories, the motion data collected by the millimeter-wave radar are normalized for all gestures. Gesture recognition based on target trajectory parameters amounts to determining the current motion state of a gesture. Typically, most research uses neural networks for gesture recognition. However, neural network methods still have some shortcomings.
Neural networks, due to their large-scale architectures and numerous parameters, often require substantial computational resources and time for processing. The extensive matrix operations and complex calculations involved in training and deploying these networks can be both resource-intensive and slow, particularly for real-time applications. In contrast, finite-state machines offer a more efficient alternative by utilizing a simplified computational model. The finite-state machines operate with a fixed number of states and transitions, performing only basic arithmetic operations and conditional logic. This reduced complexity translates to significantly lower computational demands, making finite-state machines a valuable tool for applications where efficiency and speed are critical. By leveraging finite-state machines, we can enhance computational efficiency, reduce resource consumption, and achieve faster processing times compared to the resource-heavy neural network approaches. Therefore, a recognition method that uses finite-state machines to determine gesture actions based on gesture trajectory parameters is proposed to improve the computational efficiency.
Firstly, to determine the starting and ending moments, as well as different gesture states during the movement, a complete gesture action can be decomposed into 10 motion states, as outlined in Table 2.
The various motion states are explained as follows. When the hand target has not entered the radar detection range, the state is No Target. When the motion trajectory of the hand target is not a straight line, the state is Random Motion. When the hand target gradually approaches the radar, the state is Gesture Forward. If the current state is Gesture Forward and the distance between the hand target and the radar is less than 0.2 m, the state becomes Gesture Activated. In the Gesture Activated state, the specific direction of the gesture movement can be judged, such as left and right swings and up and down swings, which correspond to the states Gesture Leftward, Gesture Rightward, Gesture Upward, and Gesture Downward. If the distance between the hand target and the radar gradually increases, the gesture state is determined to be Gesture Backward. In the Gesture Backward state, if the distance between the hand target and the radar is greater than 0.25 m, the state is judged as Gesture Deactivated.
The state transition diagram of the finite-state machine is depicted in Figure 13. The threshold for Gesture Activated is set to 0.2 m based on actual daily requirements, while the threshold for Gesture Deactivated is set to 0.25 m to prevent state oscillation when the gesture is close to 0.2 m.
The linearity of the trajectory is assessed to determine whether the current gesture contains linear motion and therefore whether the current motion constitutes an effective gesture action. To analyze the trajectory, a sliding window is used to divide it into segments, where the length of each segment is determined by a predefined threshold, such as five points per segment. The assessment calculates the ratio of the segment length to the entire trajectory length: if this ratio exceeds 0.9, the trajectory is classified as a straight line; otherwise, it is classified as a curve.
Judging the direction of the trajectory involves determining the direction of the current action in order to distinguish the upward, downward, leftward, and rightward directions of the gesture. The three-dimensional trajectory of the gesture motion is projected onto the xy, yz, and xz planes separately. The direction of the gesture motion is then determined by calculating the ratios of the trajectory projections and analyzing the trend of the trajectory on the three coordinate planes.
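The state logic of Table 2 and Figure 13 can be sketched as a transition function over per-frame trajectory segments, as below. Apart from the 0.2 m, 0.25 m, and 0.9 thresholds quoted above, the helper logic (how linearity is computed and how the dominant axis is chosen) is a simplifying assumption.

```python
import numpy as np

# Minimal sketch of the finite-state machine in Section 3.1.4.  State names follow
# Table 2; the axis conventions follow Figure 3 (positive X = leftward, positive
# Z = upward, negative Y = toward the radar).  Everything beyond the 0.2 m / 0.25 m
# / 0.9 thresholds is an assumption.
ACTIVATE_RANGE, DEACTIVATE_RANGE, LINE_RATIO = 0.20, 0.25, 0.9

def is_straight(segment):
    """Treat a segment as a straight line if the end-to-end distance is at least
    LINE_RATIO of the accumulated path length (an assumed interpretation)."""
    path = np.sum(np.linalg.norm(np.diff(segment, axis=0), axis=1))
    chord = np.linalg.norm(segment[-1] - segment[0])
    return path > 0 and chord / path > LINE_RATIO

def step(state, segment, rng_to_radar):
    """One FSM transition given the latest trajectory segment (N x 3, x/y/z)."""
    if segment is None:
        return "No Target"
    if not is_straight(segment):
        return "Random Motion"
    dx, dy, dz = segment[-1] - segment[0]
    if state in ("No Target", "Random Motion") and dy < 0:
        state = "Gesture Forward"                  # hand approaching the radar
    if state == "Gesture Forward" and rng_to_radar < ACTIVATE_RANGE:
        state = "Gesture Activated"
    if state == "Gesture Activated":
        if abs(dx) > max(abs(dy), abs(dz)):
            state = "Gesture Leftward" if dx > 0 else "Gesture Rightward"
        elif abs(dz) > max(abs(dx), abs(dy)):
            state = "Gesture Upward" if dz > 0 else "Gesture Downward"
        elif dy > 0:
            state = "Gesture Backward"             # hand moving away
    if state == "Gesture Backward" and rng_to_radar > DEACTIVATE_RANGE:
        state = "Gesture Deactivated"
    return state
```

Calling `step` once per radar frame with the latest trajectory segment reproduces the continuous, interruption-free recognition behavior discussed in Section 4.1.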

3.2. Complex Trajectory Gesture Recognition

Considering the limitations of simple trajectory gesture recognition methods in accurately identifying complex gesture actions, the adoption of deep learning neural networks becomes imperative. Consequently, a bidirectional LSTM fusion neural network has been introduced to effectively recognize these complex gestures.
This neural network is mainly composed of convolutional layers and a bidirectional LSTM network. The input data are a three-dimensional data block containing target distance, azimuth, and elevation information. Since gesture actions are continuous motion processes, the bidirectional LSTM network is used to extract temporal features for recognition. Simultaneously, the convolutional layers are employed to extract image features from the three-dimensional input data. By integrating the two network structures, the gesture features in the radar data are extracted as fully as possible, improving the accuracy of gesture recognition.
The algorithm flow of the bidirectional LSTM fusion neural network is as follows: firstly, the radar IF signal data are preprocessed and split into time series data. Each frame of radar IF data is then transformed into a data block containing information in three dimensions: distance, azimuth, and elevation. Subsequently, feature extraction is conducted on the gesture data block sequence using a three-dimensional convolutional network. Finally, the bidirectional LSTM network module is utilized to predict and classify gestures from a time series.

3.2.1. Data Preprocessing

The radar IF signal can be divided into a fast-time dimension, a slow-time dimension, and a receiving-channel dimension. The fast-time dimension corresponds to a single chirp, whose phase contains the target distance information, which can be obtained by performing a one-dimensional FFT along this dimension. The slow-time dimension is the accumulation of multiple chirps, and the phase difference between chirps contains the target velocity information, which can be obtained by performing a one-dimensional FFT along this dimension. Finally, the phase difference between channels contains the target angle information; applying Capon processing along the channel dimension yields the azimuth and elevation information of the gesture. Therefore, the selected gesture features, namely the distance, azimuth, and elevation information, are fused into a three-dimensional feature block as the input to the neural network for classification and recognition.
Since the duration between the start and end positions of the gestures varies, the number of chirps collected for different gestures also differs. To ensure that the size of the data input into the neural network remains consistent, a sliding window is employed to segment the gesture data. Each intercepted sliding window corresponds to a frame of data, with the length of the sliding window corresponding to the number of pulses within a frame. During the experiment, the frame rate of a complete gesture motion was set to 10 frames, each containing 100 chirp pulses. The process of segmentation using sliding windows is illustrated in Figure 14.
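A minimal sketch of this framing step is given below; the evenly spaced (possibly overlapping) window starts are an assumption for gestures whose chirp count does not divide exactly into 10 frames of 100 chirps.

```python
import numpy as np

def frame_chirps(chirp_cube, num_frames=10, chirps_per_frame=100):
    """
    Split a segmented gesture (num_chirps, num_samples, num_channels) into
    fixed-size frames with a sliding window, as in Figure 14.  The stride is
    chosen so that exactly num_frames windows cover the gesture; overlapping
    windows for short gestures are an assumption.
    """
    num_chirps = chirp_cube.shape[0]
    if num_chirps < chirps_per_frame:
        raise ValueError("gesture shorter than one frame")
    # Evenly spaced start positions (possibly overlapping) across the gesture.
    starts = np.linspace(0, num_chirps - chirps_per_frame, num_frames).astype(int)
    return np.stack([chirp_cube[s:s + chirps_per_frame] for s in starts])

# Example: 1350 chirps collected for one gesture -> (10, 100, 64, 16) frames.
cube = np.zeros((1350, 64, 16), dtype=complex)
print(frame_chirps(cube).shape)
```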

3.2.2. Bidirectional LSTM Fusion Neural Network

The bidirectional LSTM fusion neural network mainly consists of three-dimensional convolutional layers and a bidirectional LSTM neural network. The schematic diagram of the neural network is shown in Figure 15. The three-dimensional convolutional layer is mainly used to extract image features, while the bidirectional LSTM network is mainly used to extract temporal features from input data.
In the proposed neural network, the input data are a 10-frame 3D data cube containing information about distance, azimuth, and elevation. A 3D convolutional neural network is employed to extract features from this input data cube. Subsequently, the extracted gesture features are fed into the bidirectional LSTM neural network via a fully connected (FC) layer and a sequence unfolding layer to capture temporal features of the gestures. Finally, an FC layer is utilized to classify and recognize the features that are output by the bidirectional LSTM.
Because of the correlation within a gesture during its movement, it is difficult to infer the trend of the future trajectory solely from the gesture data before the current moment, as different gesture actions may be similar within the same time period. For example, the check and cross gestures defined earlier both start with the hand waving downwards, so the changes in distance, azimuth, and elevation are initially consistent. Therefore, in addition to the hand movement information at the current moment, more gesture movement information before and after that moment is needed to determine whether the gesture is a check or a cross. Incorporating a bidirectional LSTM network into gesture recognition effectively addresses this issue by leveraging data from both before and after the current moment simultaneously, which enhances the accuracy of gesture recognition.

3.2.3. Neural Network Parameter Design

The proposed bidirectional LSTM fusion neural network comprises two components: a three-dimensional convolutional neural network and a bidirectional LSTM neural network. The network parameter is outlined in Table 3.
In the input layer, the input data cube is resampled into 16 × 16 × 16 distance, azimuth, and elevation data blocks, with a time series length of 10. Afterwards, the sequence folding module is used for processing, as a convolutional layer is required on the time series to extract high-dimensional features. Therefore, the folded sequence needs to be fed into a three-dimensional convolutional neural network for feature extraction.
In the three-dimensional convolutional neural network, there are two convolution–pooling layers. In the first, the kernel size is 5 × 5 × 5 with a stride of 1, and same padding is used so that the output dimensions remain unchanged after convolution. The pooling layer size is 2 × 2 × 2 with a stride of 2, and max pooling is used to downsample the features. In the second convolution–pooling layer, the kernel size is 3 × 3 × 3 with a stride of 1, again with same padding; the pooling layer is the same as in the previous layer. The extracted features form a four-dimensional feature matrix of 4 × 4 × 4 × 10. After the FC layer and the sequence unfolding module, the features are input into the bidirectional LSTM network.
In the bidirectional LSTM neural network, there are 128 LSTM cell units, and due to the bidirectional sequence output, the output dimension is 256. Finally, gesture classification prediction is completed through the FC layer and Softmax layer.
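The architecture described above can be sketched in PyTorch as follows; the kernel and pooling sizes, the 10 × 16 × 16 × 16 input, and the 128-unit bidirectional LSTM follow the text, while the convolution channel counts and the FC width before the LSTM are assumptions, since they are not stated in the paper.

```python
import torch
import torch.nn as nn

class BiLSTMFusionNet(nn.Module):
    """Sketch of the bidirectional LSTM fusion network in Section 3.2.3."""
    def __init__(self, num_classes=4, conv_ch=(8, 16), fc_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv3d(1, conv_ch[0], kernel_size=5, stride=1, padding=2),  # "same" padding
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2, stride=2),                         # 16 -> 8
            nn.Conv3d(conv_ch[0], conv_ch[1], kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2, stride=2),                         # 8 -> 4
        )
        self.fc_in = nn.Linear(conv_ch[1] * 4 * 4 * 4, fc_dim)
        self.bilstm = nn.LSTM(fc_dim, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, num_classes)                  # 256 -> classes

    def forward(self, x):
        # x: (batch, time=10, 1, 16, 16, 16) range-azimuth-elevation cubes
        b, t = x.shape[:2]
        feats = self.cnn(x.reshape(b * t, *x.shape[2:]))       # "sequence folding"
        feats = self.fc_in(feats.reshape(b * t, -1))
        feats = feats.reshape(b, t, -1)                        # "sequence unfolding"
        out, _ = self.bilstm(feats)
        return self.classifier(out[:, -1])                     # logits for the 4 gestures

# Example forward pass with a dummy batch of two gesture samples.
net = BiLSTMFusionNet()
print(net(torch.zeros(2, 10, 1, 16, 16, 16)).shape)   # torch.Size([2, 4])
```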

3.2.4. Neural Network for Comparison

To verify the gesture recognition performance of the proposed network, a three-channel convolutional neural network was selected and compared. The effectiveness and robustness of the proposed neural network in recognizing gesture actions were evaluated based on comparison results across various parameters. The schematic diagram of the structure of the three-channel convolutional neural network is shown in Figure 16.
The input data for each channel of the three-channel convolutional neural network are two-dimensional image data, which include the distance time map, azimuth time map, and elevation time map of gesture. Each channel contains a two-layer convolutional pooling structure. Due to the mean normalization of the input data, the input data size is 32 × 32. Therefore, the convolution kernel size and hyperparameter settings for each channel are the same for reducing network complexity. After completing the high-dimensional feature extraction of the data between the three channels, a self-attention layer is introduced to fuse the three-channel data. By calculating the similarity and Softmax processing of the outputs of the three channels, the attention spectrum between the channels can be obtained. Afterwards, the attention spectrum is fused with the initial gesture features to obtain dependency information between the three channels. Finally, the prediction results of the classification are output through two FC layers to achieve the classification and recognition of gestures.
The input data size of the three-channel neural network is 32 × 32, and the parameter settings of the convolutional neural network and fusion classification module are shown in Table 4.

4. Results

4.1. Simple Trajectory Gesture Recognition Results

To verify the effectiveness of the proposed simple trajectory gesture recognition method, a millimeter-wave radar system was set up to collect multiple sets of gesture data. During the experiment, the radar was installed on a tripod with its antenna boresight facing straight ahead. The azimuth antenna array was arranged in the horizontal direction and the elevation antenna array in the vertical direction, as shown in Figure 3.
This experiment aims to verify the six simple straight-line trajectories: forward, backward, leftward, rightward, upward, and downward. Experimental subjects continuously performed dynamic gestures in front of the radar antenna. During the gesture movement, the method assesses the current status of the gesture in real time and outputs the corresponding gesture, so the type of gesture action can be determined without waiting for the entire gesture to be completed.
The measured data of six gesture actions are shown in Figure 17, where the blue part of the trajectory represents the starting position, the yellow part represents the ending position, and the image title is the corresponding gesture action automatically recognized based on the gesture trajectory. According to Figure 17, the proposed method can effectively recognize six simple trajectory gesture actions.
To further verify the recognition accuracy and robustness of the proposed method, 80 sets of data were collected for each gesture, totaling 480 sets for testing. By employing the proposed method to analyze the collected data, the recognition results and accuracy for the six gestures can be obtained. The recognition accuracy is presented in Table 5.
The experimental results show that under actual conditions the proposed method has an average recognition accuracy of 99.58% for the six gestures, achieving good recognition results. Reference [29] employs 2D CNN and LSTM neural networks to classify and recognize a total of ten gestures, which include the six simple gestures listed above, by fusing radar images and optical images as inputs to the neural network; the average recognition accuracy is 95.36%. Compared to the referenced method, the proposed simple trajectory gesture recognition method demonstrates superior recognition performance. By directly extracting the gesture trajectory parameters, gesture recognition is achieved by discriminating the gesture motion status, which reduces the complexity of feature extraction, simplifies the recognition process, and enhances the recognition accuracy. Moreover, by employing a finite-state machine to recognize gestures, the method can continuously recognize gesture actions without interruption, which aligns more closely with practical operating habits.

4.2. Complex Trajectory Gesture Recognition Results

For complex trajectory gestures that cannot be recognized by the state machine method, a neural network is constructed for recognition. To verify the recognition effectiveness of neural networks for four complex trajectory gesture actions, we utilized MATLAB 2022b software to construct and implement the designed neural network. Additionally, a dataset was collected for training to assess the classification and recognition performance of the neural network.

4.2.1. Network Training

The gesture dataset consists of 3000 samples, with an average of 750 samples collected for each type of gesture. Since the sample size is limited, the dataset is divided into training, validation, and test sets in a ratio of 6:2:2. The training set is used to train the neural network and adjust the model parameters. The validation set aids the training process by assisting in hyperparameter tuning and in addressing underfitting or overfitting. Finally, the test set evaluates the performance of the trained neural network model.
To compare the three-channel CNN with the bidirectional LSTM fusion neural network, the two networks were trained under the same conditions. The batch size for training is set to 32, so the number of iterations in a single round is 50, and the maximum number of training iterations is set to 40. The Adam optimizer is selected, with cross-entropy as the loss function.
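A generic training-loop sketch matching these settings is shown below; the dummy tensors and the linear stand-in classifier are placeholders, since the gesture dataset is not public and the full network is sketched separately in Section 3.2.3.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Sketch of the stated training setup: batch size 32, Adam, cross-entropy loss,
# initial learning rate 1e-4, early cut-off when accuracy stops improving.
x = torch.zeros(64, 10, 1, 16, 16, 16)                     # placeholder gesture cubes
y = torch.randint(0, 4, (64,))                             # placeholder labels (4 gestures)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(10 * 16 * 16 * 16, 4))   # stand-in classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

best_acc = 0.0
for _ in range(40):                                        # maximum number of training rounds
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        acc = (model(x).argmax(dim=1) == y).float().mean().item()
    if acc <= best_acc:                                    # simplified early cut-off
        break
    best_acc = acc
```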
Firstly, the impact of the initial learning rate on network training is compared. The initial learning rate is set to 1 × 10−5, 5 × 10−5, 1 × 10−4, 5 × 10−4, and 1 × 10−3 in turn to train the networks. The training accuracy and loss curves for the three-channel CNN are shown in Figure 18a, and the results for the bidirectional LSTM fusion neural network are shown in Figure 18b.
According to the experimental results, the optimal initial learning rate is 1 × 10−4 for both the three-channel CNN and the bidirectional LSTM fusion neural network.
Afterwards, the initial learning rates of the two neural networks are set to the optimal parameters obtained previously for training, and an early cut-off is set when the training accuracy no longer improves. The recognition accuracy and convergence speed of the two neural networks are compared. By training two types of neural networks using datasets, the training accuracy and loss curves are obtained as shown in Figure 19.
According to the training results, the recognition accuracy of the three-channel CNN on the validation set after training is 97.35%, while the recognition accuracy of the bidirectional LSTM fusion neural network is 98.94%. The experimental results show that the bidirectional LSTM fusion neural network has better recognition performance in the validation set, and under the same initial learning rate conditions, the bidirectional LSTM fusion neural network converges faster and has better convergence characteristics.

4.2.2. Recognition Results

To evaluate the recognition performance of the trained neural network, the test dataset is input into the network for recognition. Accuracy and F1-score are selected as the evaluation criteria, where the accuracy is the ratio of the number of correctly identified samples $N_{\mathrm{correct}}$ to the total number of samples $N_{\mathrm{total}}$, defined as

$$\mathrm{Accuracy} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}}$$

The F1-score is an evaluation metric that considers both precision and recall. For a single category $k$, the score $F1_k$ is defined as

$$F1_k = \frac{2\, \mathrm{Precision}_k\, \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k}$$

where $\mathrm{Precision}_k$ represents the precision of the classification, defined as

$$\mathrm{Precision}_k = \frac{TP_k}{TP_k + FP_k}$$

and $\mathrm{Recall}_k$ represents the recall of the classification, defined as

$$\mathrm{Recall}_k = \frac{TP_k}{TP_k + FN_k}$$

where $TP_k$ indicates that category $k$ is predicted correctly, $FP_k$ represents predicting other categories as category $k$, and $FN_k$ represents predicting category $k$ as other categories.
By calculating and averaging $F1_k$ over all categories, the overall F1-score is obtained, defined as

$$F1 = \frac{1}{K} \sum_{k=0}^{K-1} F1_k$$
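These metrics can be computed directly from the predicted and true labels, as in the following sketch.

```python
import numpy as np

def accuracy_and_macro_f1(y_true, y_pred, num_classes):
    """Compute accuracy and the macro-averaged F1-score defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)
    f1s = []
    for k in range(num_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return acc, np.mean(f1s)

# Example with four gesture classes.
print(accuracy_and_macro_f1([0, 1, 2, 3, 3], [0, 1, 2, 3, 2], num_classes=4))
```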
The test set is input into two types of trained neural networks, and the corresponding recognition accuracy and F1-score are calculated. The results are presented in Table 6. The experimental results show that the average recognition accuracy of the proposed neural network model on the test set is 99.47%, representing a 3.19% increase compared to the accuracy of the three-channel CNN, which stands at 96.28%. Furthermore, the F1-score of the proposed neural network model is 0.996, surpassing the 0.968 F1-score of the three-channel CNN, indicating superior recognition performance.
To better observe the classification and recognition performance of two neural networks on the test set, confusion matrices depicting the classification and recognition results of four gestures are presented in Figure 20. The confusion matrix of the three-channel CNN is shown in Figure 20a, while that of the bidirectional LSTM fusion neural network is displayed in Figure 20b.
The test set data and training set data of the above experimental results are both from the same dataset. To verify the generalization and robustness of the proposed neural network, a total of 406 sets of gesture data were collected from two new subjects, with an average of about 100 sets collected for each gesture. Since the data from the two new subjects were not included in the training process of the neural network and were sourced from different datasets, they were utilized to evaluate the performance of the proposed model when encountering data not included in the training dataset.
The confusion matrix for gesture recognition of the new participants is shown in Figure 21. It can be seen that the proposed neural network achieves an average recognition accuracy of 98.77% for the four types of gestures, which is comparable to the results obtained on the original dataset and still represents good recognition performance. This demonstrates that the proposed bidirectional LSTM fusion neural network generalizes well to different datasets.

4.3. Combined Results

By combining the recognition results of simple trajectory gestures and complex trajectory gestures, the total recognition results are summarized in Table 7. The confusion matrix of the 10 gestures and their total recognition results are presented in Figure 22.
It can be seen that the proposed method has an average recognition accuracy of 99.64% for the 10 types of gestures. Reference [30] proposed a gesture recognition method for complex scenes based on a lightweight multi-CNN-LSTM model, which includes three CNNs that extract features from the RTM, DTM, and ATM and one LSTM that captures temporal features; its recognition accuracy for 14 experimental gestures reached 97.28%. To facilitate comparison, the similar gestures BFB, FBF, RLR, LRL, DUD, UDU, circle, Z, FP, and CWF were extracted from the reference, for which the average accuracy was 97.3%. Compared to the accuracy reported in the reference, the proposed method demonstrates superior recognition performance. Additionally, thanks to the finite-state machine, the proposed method is also more efficient than the reference method.

5. Discussion

The experimental results show that good recognition performance can be achieved by dividing gestures into simple trajectory gestures and complex trajectory gestures and classifying them separately. Given the distinct motion characteristics of the two gesture types, tailoring the recognition method to each type makes effective use of the information specific to each gesture and thereby improves both recognition efficiency and accuracy. Therefore, a joint recognition method is proposed that treats the two types separately: a finite-state machine recognizes the simple trajectory gestures, and a bidirectional LSTM fusion neural network recognizes the complex trajectory gestures.
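One way to picture how the two recognizers work together at run time is a simple dispatcher that consults the finite-state machine first and hands unresolved ("random motion") trajectories to the network. The following is a minimal sketch under that assumption; the callables and gesture labels are hypothetical stand-ins, not the paper's implementation.

```python
from typing import Callable, Sequence

SIMPLE_GESTURES = {"Forward", "Backward", "Leftward", "Rightward",
                   "Upward", "Downward"}

def recognize(trajectory: Sequence,
              fsm_state: Callable[[Sequence], str],
              classify_complex: Callable[[Sequence], str]) -> str:
    """Hypothetical dispatcher: the finite-state machine handles simple
    (linear) trajectories, and the Bi-LSTM classifier is consulted only
    when the FSM reports an unresolved ("Random Motion") state."""
    state = fsm_state(trajectory)          # e.g. "Leftward" or "Random Motion"
    if state in SIMPLE_GESTURES:
        return state
    return classify_complex(trajectory)    # e.g. "Wave", "Circle", "Check", "Cross"

# Toy usage with stand-in callables
print(recognize([(0, 0, 0), (0.1, 0, 0)],
                fsm_state=lambda t: "Rightward",
                classify_complex=lambda t: "Check"))
```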
For simple trajectory gestures, the motion trajectory is essentially a straight line, and the gestures differ only in the direction of motion. The key to recognizing them with a finite-state machine is therefore to judge the linearity and the direction of the trajectory: first determine whether the trajectory is sufficiently linear, then determine its direction, and combine the two judgments to identify the corresponding gesture. This direct, rule-based judgment yields higher recognition accuracy than conventional neural networks for these gestures and, because of its low computational complexity, high recognition efficiency. In addition, the finite-state machine can follow continuous gesture motion and output a real-time judgment of the gesture state even when several well-defined gestures are performed in succession, whereas traditional gesture recognition methods must wait for the current action to finish before recognition. The proposed method therefore solves the problem of recognizing continuously changing gestures.
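As an illustration of those two judgments, the sketch below checks linearity through the share of variance along the first principal axis of the resampled trajectory and then reads the dominant displacement direction. The threshold, the axis conventions, and the sign-to-gesture mapping are assumptions, not the exact criteria used in the paper.

```python
import numpy as np

def classify_simple_gesture(points, linearity_thresh=0.9):
    """points: N x 3 array of (x, y, z) hand positions after filtering/resampling.
    Returns one of six simple gestures, or None if the track is not linear enough."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # Linearity check: share of variance along the first principal axis
    _, s, _ = np.linalg.svd(centered, full_matrices=False)
    if (s[0] ** 2) / np.sum(s ** 2) < linearity_thresh:
        return None                       # not a straight-line motion
    # Direction check: dominant start-to-end displacement component
    disp = pts[-1] - pts[0]
    axis = int(np.argmax(np.abs(disp)))
    names = [("Leftward", "Rightward"),   # x axis (assumed convention)
             ("Backward", "Forward"),     # y axis (toward radar = forward, assumed)
             ("Downward", "Upward")]      # z axis (assumed convention)
    return names[axis][int(disp[axis] > 0)]
```

For instance, `classify_simple_gesture([[0, 0, 0], [0.05, 0, 0], [0.10, 0, 0]])` returns "Rightward" under these conventions.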
For complex trajectory gestures, the motion trajectories are too complicated for a finite-state machine to resolve, and different gestures share similar motion segments; for example, both the check and the cross gestures begin with a motion to the right and downward, which can confuse the state machine. Neural networks are therefore used, and a bidirectional LSTM fusion neural network is proposed for the four gestures. Convolutional layers are used to extract image features from the input data, while the bidirectional LSTM structure captures correlations both before and after each time point and thus fully exploits the temporal features of the gesture motion. The proposed network combines both kinds of features and makes full use of the information collected by the radar; accordingly, the experimental results show that it achieves better recognition and classification results than the three-channel CNN.
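A simplified PyTorch sketch of this fusion idea is given below: a small 3D CNN is applied to each time step's 16 × 16 × 16 feature cube, the per-step features are fed to a bidirectional LSTM, and a fully connected head classifies the four complex gestures. The layer sizes are loosely based on Table 3, but this is a sketch of the idea, not an exact reproduction of the proposed network.

```python
import torch
import torch.nn as nn

class BiLSTMFusionNet(nn.Module):
    """Simplified sketch: a 3D-CNN feature extractor applied per time step,
    followed by a bidirectional LSTM over the sequence and a small classifier."""
    def __init__(self, num_classes=4, feat_dim=32, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                        # per-step 16x16x16 cube
            nn.Conv3d(1, 10, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(10, 10, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Flatten(),
            nn.Linear(10 * 4 * 4 * 4, feat_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 32), nn.ReLU(),
                                  nn.Dropout(0.5), nn.Linear(32, num_classes))

    def forward(self, x):               # x: (batch, time, 1, 16, 16, 16)
        b, t = x.shape[:2]
        feats = self.cnn(x.reshape(b * t, *x.shape[2:])).reshape(b, t, -1)
        out, _ = self.bilstm(feats)     # (batch, time, 2 * hidden)
        return self.head(out[:, -1])    # logits for the 4 complex gestures
```

A dummy forward pass such as `BiLSTMFusionNet()(torch.zeros(2, 16, 1, 16, 16, 16))` produces a (2, 4) logit tensor, one score per complex gesture.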
During training, the initial learning rate of the three-channel CNN and of the bidirectional LSTM fusion neural network is tuned independently, and the optimal value is selected experimentally. The experiments show that when the learning rate is set too low, training converges slowly and reaching the best recognition takes too long, lowering training efficiency; when it is set too high, the accuracy and loss curves oscillate and fail to converge to the optimum. Based on these results, an initial learning rate of 1 × 10−4 allows the accuracy and loss curves of both networks to converge well to the optimal recognition state.
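For completeness, a minimal training loop with the reported initial learning rate of 1 × 10−4 might look as follows. The choice of the Adam optimizer, the number of updates, and the dummy batch are assumptions for illustration only, and the code reuses the BiLSTMFusionNet sketch above.

```python
import torch

model = BiLSTMFusionNet()                                  # sketch from the previous listing
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 1e-4
criterion = torch.nn.CrossEntropyLoss()

# Dummy batch standing in for the real radar gesture data:
# 8 samples, 16 time steps, one 16 x 16 x 16 cube per step, 4 gesture classes
cubes = torch.randn(8, 16, 1, 16, 16, 16)
labels = torch.randint(0, 4, (8,))

for step in range(3):                                      # a few illustrative updates
    optimizer.zero_grad()
    loss = criterion(model(cubes), labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss = {loss.item():.4f}")
```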
Finally, the generalization performance of the proposed neural network model was validated by collecting gesture data from new participants as a new test set. The final experimental results show that for the new test set, the proposed neural network model still achieves good recognition performance, with an average recognition accuracy of 98.77%, verifying that the proposed neural network model has good generalization performance and can be applied to the recognition of different user gestures.

6. Conclusions

A radar gesture recognition method based on a finite-state machine combined with a bidirectional LSTM fusion neural network is proposed to address the issue of gesture recognition in human–computer interaction applications.
Firstly, the characteristics and motion parameters of gestures are discussed. Based on the complexity of gesture motion trajectories, ten typical gestures are divided into six simple trajectory gestures and four complex trajectory gestures. Afterwards, for simple trajectory gestures, the gesture motion trajectories are extracted by processing the radar echo signals, and the six simple trajectory gestures are classified and recognized by solving the trajectory parameters with a finite-state machine. For complex trajectory gestures, which cannot be distinguished directly by a state machine, deep learning is used: a bidirectional LSTM fusion neural network is proposed to classify and recognize the four complex trajectory gestures. Finally, multiple sets of gesture data are collected to validate the effectiveness of the proposed algorithm, including 480 sets of data for the six simple trajectory gestures and 3000 sets of data for the four complex trajectory gestures. After experimental verification, the average accuracy of simple trajectory gesture recognition based on the finite-state machine reaches 99.58%, and the accuracy of complex trajectory gesture recognition based on the bidirectional LSTM fusion neural network reaches 99.47%. In addition, the proposed neural network achieves an accuracy of 98.77% on gesture data not included in the training dataset. The experimental results show that the proposed gesture recognition method has good recognition and generalization performance.

Author Contributions

Conceptualization, P.C.; methodology, Y.B. and Z.G.; validation, Y.B. and Z.G.; formal analysis, P.C.; writing—original draft preparation, Z.G. and Y.B.; writing—review and editing, Y.B. and P.C.; visualization, Y.B. and Z.G.; supervision, P.C.; project administration, Q.X.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Open Fund of Qianjiang Laboratory, Hangzhou Innovation Institute, Beihang University, grant number 2020-Y7-A-010.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We thank all the participants who helped with data collection during the experiment.

Conflicts of Interest

Ziwei Gong was with the School of Electronic and Information Engineering, Beihang University, Beijing, China, and now is employed by Gaode-Ride Sharing Business, Alibaba (Beijing) Software Services Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Williams, E.H.; Bilbao-Broch, L.; Downing, P.E.; Cross, E.S. Examining the Value of Body Gestures in Social Reward Contexts. NeuroImage 2020, 222, 117276. [Google Scholar] [CrossRef] [PubMed]
  2. Herbert, O.M.; Pérez-Granados, D.; Ruiz, M.A.O.; Cadena Martínez, R.; Gutiérrez, C.A.G.; Antuñano, M.A.Z. Static and Dynamic Hand Gestures: A Review of Techniques of Virtual Reality Manipulation. Sensors 2024, 24, 3760. [Google Scholar] [CrossRef]
  3. Bachmann, D.; Weichert, F.; Rinkenauer, G. Review of Three-Dimensional Human-Computer Interaction with Focus on the Leap Motion Controller. Sensors 2018, 18, 2194. [Google Scholar] [CrossRef] [PubMed]
  4. Alabdullah, B.I.; Ansar, H.; Mudawi, N.A.; Alazeb, A.; Alshahrani, A.; Alotaibi, S.S.; Jalal, A. Smart Home Automation-Based Hand Gesture Recognition Using Feature Fusion and Recurrent Neural Network. Sensors 2023, 23, 7523. [Google Scholar] [CrossRef] [PubMed]
  5. Khan, F.; Leem, S.K.; Cho, S.H. Hand-Based Gesture Recognition for Vehicular Applications Using IR-UWB Radar. Sensors 2017, 17, 833. [Google Scholar] [CrossRef]
  6. Toro-Ossaba, A.; Jaramillo-Tigreros, J.; Tejada, J.C.; Peña, A.; López-González, A.; Castanho, R.A. LSTM Recurrent Neural Network for Hand Gesture Recognition Using EMG Signals. Appl. Sci. 2022, 12, 9700. [Google Scholar] [CrossRef]
  7. Colli Alfaro, J.G.; Trejos, A.L. User-Independent Hand Gesture Recognition Classification Models Using Sensor Fusion. Sensors 2022, 22, 1321. [Google Scholar] [CrossRef]
  8. Jiang, Y.; Song, L.; Zhang, J.; Song, Y.; Yan, M. Multi-Category Gesture Recognition Modeling Based on sEMG and IMU Signals. Sensors 2022, 22, 5855. [Google Scholar] [CrossRef]
  9. Vásconez, J.P.; Barona López, L.I.; Valdivieso Caraguay, Á.L.; Benalcázar, M.E. Hand Gesture Recognition Using EMG-IMU Signals and Deep Q-Networks. Sensors 2022, 22, 9613. [Google Scholar] [CrossRef]
  10. Al Farid, F.; Hashim, N.; Abdullah, J.; Bhuiyan, M.R.; Shahida Mohd Isa, W.N.; Uddin, J.; Haque, M.A.; Husen, M.N. A Structured and Methodological Review on Vision-Based Hand Gesture Recognition System. J. Imaging 2022, 8, 153. [Google Scholar] [CrossRef]
  11. Mujahid, A.; Awan, M.J.; Yasin, A.; Mohammed, M.A.; Damaševičius, R.; Maskeliūnas, R.; Abdulkareem, K.H. Real-Time Hand Gesture Recognition Based on Deep Learning YOLOv3 Model. Appl. Sci. 2021, 11, 4164. [Google Scholar] [CrossRef]
  12. Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network. Sensors 2022, 22, 706. [Google Scholar] [CrossRef] [PubMed]
  13. Wei, Z.; Zhang, F.; Chang, S.; Liu, Y.; Wu, H.; Feng, Z. MmWave Radar and Vision Fusion for Object Detection in Autonomous Driving: A Review. Sensors 2022, 22, 2542. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, Y.; Li, W.; Dou, Z.; Zou, W.; Zhang, A.; Li, Z. Activity Recognition Based on Millimeter-Wave Radar by Fusing Point Cloud and Range–Doppler Information. Signals 2022, 3, 266–283. [Google Scholar] [CrossRef]
  15. Jing, H.; Li, S.; Miao, K.; Wang, S.; Cui, X.; Zhao, G.; Sun, H. Enhanced Millimeter-Wave 3-D Imaging via Complex-Valued Fully Convolutional Neural Network. Electronics 2022, 11, 147. [Google Scholar] [CrossRef]
  16. Lien, J.; Gillian, N.; Karagozler, M.E.; Amihood, P.; Schwesig, C.; Olson, E.; Raja, H.; Poupyrev, I. Soli: Ubiquitous Gesture Sensing with Millimeter Wave Radar. ACM Trans. Graph. 2016, 35, 142:1–142:19. [Google Scholar] [CrossRef]
  17. Molchanov, P.; Gupta, S.; Kim, K.; Pulli, K. Multi-Sensor System for Driver’s Hand-Gesture Recognition. In Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 4–8 May 2015; Volume 1, pp. 1–8. [Google Scholar]
  18. Hazra, S.; Santra, A. Radar Gesture Recognition System in Presence of Interference Using Self-Attention Neural Network. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1409–1414. [Google Scholar]
  19. Yu, J.-T.; Tseng, Y.-H.; Tseng, P.-H. A mmWave MIMO Radar-Based Gesture Recognition Using Fusion of Range, Velocity, and Angular Information. IEEE Sens. J. 2024, 24, 9124–9134. [Google Scholar] [CrossRef]
  20. Li, B.; Yang, Y.; Yang, L.; Fan, C. Sign Language/Gesture Recognition on OOD Target Domains Using UWB Radar. IEEE Trans. Instrum. Meas. 2023, 72, 2529711. [Google Scholar] [CrossRef]
  21. Kern, N.; Paulus, L.; Grebner, T.; Janoudi, V.; Waldschmidt, C. Radar-Based Gesture Recognition Under Ego-Motion for Automotive Applications. IEEE Trans. Radar Syst. 2023, 1, 542–552. [Google Scholar] [CrossRef]
  22. Stadelmayer, T.; Santra, A.; Weigel, R.; Lurz, F. Radar-Based Gesture Recognition Using a Variational Autoencoder With Deep Statistical Metric Learning. IEEE Trans. Microw. Theory Tech. 2022, 70, 5051–5062. [Google Scholar] [CrossRef]
  23. Jung, J.; Lim, S.; Kim, J.; Kim, S.-C. Digit Recognition Using FMCW and UWB Radar Sensors: A Transfer Learning Approach. IEEE Sens. J. 2023, 23, 18776–18784. [Google Scholar] [CrossRef]
  24. Zhao, P.; Lu, C.X.; Wang, B.; Trigoni, N.; Markham, A. CubeLearn: End-to-End Learning for Human Motion Recognition from Raw mmWave Radar Signals. IEEE Internet Things J. 2023, 10, 10236–10249. [Google Scholar] [CrossRef]
  25. Yu, M.; Kim, N.; Jung, Y.; Lee, S. A Frame Detection Method for Real-Time Hand Gesture Recognition Systems Using CW-Radar. Sensors 2020, 20, 2321. [Google Scholar] [CrossRef] [PubMed]
  26. Choi, J.-W.; Ryu, S.-J.; Kim, J.-H. Short-Range Radar Based Real-Time Hand Gesture Recognition Using LSTM Encoder. IEEE Access 2019, 7, 33610–33618. [Google Scholar] [CrossRef]
  27. Ma, T.; Deng, W.; Chen, Z.; Wu, J.; Zheng, W.; Wang, S.; Qi, N.; Liu, Y.; Chi, B. A CMOS 76–81-GHz 2-TX 3-RX FMCW Radar Transceiver Based on Mixed-Mode PLL Chirp Generator. IEEE J. Solid-State Circuits 2020, 55, 233–248. [Google Scholar] [CrossRef]
  28. Wobbrock, J.O.; Wilson, A.D.; Li, Y. Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes. In Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology, Newport, RI, USA, 7–10 October 2007; Association for Computing Machinery: New York, NY, USA, 2007; pp. 159–168. [Google Scholar]
  29. Liu, H.; Liu, Z. A Multimodal Dynamic Hand Gesture Recognition Based on Radar–Vision Fusion. IEEE Trans. Instrum. Meas. 2023, 72, 1–15. [Google Scholar] [CrossRef]
  30. Hao, Z.; Sun, Z.; Li, F.; Wang, R.; Peng, J. Millimeter Wave Gesture Recognition Using Multi-Feature Fusion Models in Complex Scenes. Sci. Rep. 2024, 14, 13758. [Google Scholar] [CrossRef]
Figure 1. Physical image (a) and structural diagram (b) of millimeter-wave radar.
Figure 2. Schematic diagram of radar antennas. (a) L-shaped distribution of transmitting and receiving antennas. (b) Schematic diagram of 16-channel virtual array elements.
Figure 3. The spatial position model and real scene for gesture motion. (a) The spatial position model. (b) The real scene for collecting gesture data.
Figure 4. The schematic diagram of 6 simple trajectory gestures. (a) Forward. (b) Backward. (c) Upward. (d) Downward. (e) Leftward. (f) Rightward.
Figure 5. The schematic diagram of four complex trajectory gestures. (a) Waving hand to the left and right. (b) Drawing circles counterclockwise. (c) Check mark. (d) Cross mark.
Figure 6. Gesture HRRP before and after MTI. (a) HRRP before MTI. (b) HRRP after MTI.
Figure 7. Gesture HRRP after CFAR detection. The yellow parts are the detected gesture signals.
Figure 8. Gesture recognition processing flowchart.
Figure 9. Diagram for extracting gesture trajectory parameters.
Figure 10. Hand target detection results.
Figure 11. Median filtering results of gesture trajectory. (a) X-dimensional median filtering. (b) Y-dimensional median filtering. (c) Z-dimensional median filtering.
Figure 12. Resampling results of gesture trajectory. (a) Before resampling. (b) After resampling. The gesture trajectories begin with the dark blue points and end with the light yellow points.
Figure 13. State diagram of the gesture finite-state machine.
Figure 14. Schematic diagram of using a sliding window to capture time series for different gestures.
Figure 15. Schematic diagram of the bidirectional LSTM fusion neural network.
Figure 16. Schematic diagram of the three-channel convolutional neural network.
Figure 17. Measurement data and recognition results of 6 gesture actions. (a) Gesture Forward recognition. (b) Gesture Backward recognition. (c) Gesture Leftward recognition. (d) Gesture Rightward recognition. (e) Gesture Upward recognition. (f) Gesture Downward recognition. The gesture trajectories begin with the dark blue points and end with the light yellow points.
Figure 18. Training accuracy and loss curves of the three-channel CNN and the bidirectional LSTM fusion neural network. (a) The validation accuracy of the three-channel CNN. (b) The validation loss of the three-channel CNN. (c) The validation accuracy of the bidirectional LSTM. (d) The validation loss of the bidirectional LSTM.
Figure 19. The training accuracy and loss function curves of the two networks. (a) The validation accuracy of the two networks. (b) The validation loss of the two networks.
Figure 20. Confusion matrices for classification and recognition on the test set. The background color of the grids corresponds to the data values.
Figure 21. Confusion matrix for the test set of new participants. The background color of the grids corresponds to the data values.
Figure 22. Confusion matrix for all 10 gestures. The background color of the grids corresponds to the data values.
Table 1. Radar system parameters.
Parameter | Value
Pulse Repetition Time (PRT) | 110 μs
Sweep Bandwidth (BW) | 3 GHz
Carrier Frequency (fc) | 75 GHz
ADC Sample Rate (SR) | 3.63 MHz
Sweep Slope | 42.5 MHz/μs
Sample Numbers per Pulse | 256
Pulse Numbers per Frame | 16
Virtual Element Numbers | 16
Table 2. Gesture state definition.
State Definition | State Description
No Target | No target detected in 3 frames.
Random Motion | Undefined gesture action.
Gesture Forward | Target approaching in 3 frames.
Gesture Backward | Target moving away in 3 frames.
Gesture Activated | During approaching, target distance is less than 0.2 m.
Gesture Deactivated | During moving away, target distance is greater than 0.25 m.
Gesture Leftward | Target moving left in 3 frames.
Gesture Rightward | Target moving right in 3 frames.
Gesture Upward | Target moving up in 3 frames.
Gesture Downward | Target moving down in 3 frames.
Table 3. Bidirectional LSTM fusion neural network parameters.
Layer | Layer Size | Output Size
Input layer | 16 × 16 × 16 × 1 | 16 × 16 × 16 × 1
3D CNN
Conv1 + ReLU | 5 × 5 × 5 | 16 × 16 × 16 × 10
Maxpooling1 | 2 × 2 × 2 | 8 × 8 × 8 × 10
Conv2 + ReLU | 3 × 3 × 3 | 8 × 8 × 8 × 10
Maxpooling2 | 2 × 2 × 2 | 4 × 4 × 4 × 10
FC1 | 256 | 1 × 1 × 1 × 256
Dropout + ReLU | 0.5 | 1 × 1 × 1 × 128
FC2 | 64 | 1 × 1 × 1 × 64
Dropout + ReLU | 0.5 | 1 × 1 × 1 × 32
Flatten | \ | 32
Bidirectional LSTM neural network
Bi-LSTM | 256 | 256
FC3 | 32 | 32
Dropout | 0.5 | 16
FC4 | 4 | 4
Softmax | \ | 4
Table 4. Three-channel CNN parameters.
Layer | Layer Size | Output Size
Input layer | 32 × 32 × 1 | 32 × 32 × 1
CNN
Conv1 + ReLU | 5 × 5 | 28 × 28 × 6
Maxpooling1 | 2 × 2 | 14 × 14 × 6
Conv2 + ReLU | 5 × 5 | 10 × 10 × 6
Maxpooling2 | 2 × 2 | 5 × 5 × 16
FC1 | 84 | 84
Fusion classification module
Self-attention layer | 3 × 84 | 3 × 84
FC2 | 84 | 84
FC3 | 4 | 4
Softmax | \ | 4
Table 5. Simple trajectory gesture recognition results.
Gesture Type | Accuracy
Gesture Forward | 100%
Gesture Backward | 98.75%
Gesture Leftward | 100%
Gesture Rightward | 100%
Gesture Upward | 98.75%
Gesture Downward | 100%
Avg. Accuracy | 99.58%
Table 6. Complex trajectory gesture recognition results.
Network | Avg. Accuracy | F1-Score | Wave Precision | Wave Recall | Circle Precision | Circle Recall | Check Precision | Check Recall | Cross Precision | Cross Recall
TriCH-CNN | 96.28% | 0.968 | 94.7% | 96.4% | 96.0% | 93.2% | 97.0% | 99.3% | 97.7% | 100.0%
Bi-LSTMFN | 99.47% | 0.996 | 99.2% | 100.0% | 99.3% | 99.3% | 99.4% | 100.0% | 100% | 100.0%
Table 7. The recognition results of all the 10 gestures.
Avg. Accuracy | Gesture Forward | Gesture Backward | Gesture Leftward | Gesture Rightward | Gesture Upward | Gesture Downward | Wave | Circle | Check | Cross
99.64% | 100.0% | 98.75% | 100% | 100% | 98.75% | 99.58% | 100.0% | 99.3% | 100.0% | 100.0%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
