1. Introduction
Gesture recognition is a research topic in computer science and language technology [1]. As a highly efficient non-verbal interaction method, gesture interaction provides strong technical support for emerging fields such as smart medical devices, assistive devices for the disabled, smart homes, and smart military operations [2,3]. Most current research on gesture recognition focuses on machine-vision-based methods, which face many limitations in practical applications, while inertial sensor-based gesture recognition focuses mainly on improving recognition algorithms; this limits the application of gesture recognition algorithms in practical products [4,5]. Inertial sensor-based recognition methods play an important role in improving the accuracy of gesture recognition. Qiu et al. [6] introduced the devices and key applications of common wearable sensors and discussed further research directions, suggesting that wearable device technology could become the main direction of gesture recognition research.
Gesture recognition technology can be divided into static and dynamic gesture recognition according to whether gestures are recognized on the basis of time series [4]. At present, there are two main ways to collect gesture data: non-contact, based on machine vision sensors, and contact, based on data gloves [7]. Mainstream data-acquisition gloves fall into three categories, as shown in Figure 1. Vision-based action recognition has been widely used and generally consists of four steps: gesture detection and segmentation, gesture tracking, feature extraction, and gesture classification [8]. Kinect [9] is a depth vision sensor released by Microsoft in 2010; based on its built-in algorithms, it can automatically identify and track the dynamic skeletal structure of the human body, which can be applied to the hand to study human gestures. Researchers use Kinect for gesture recognition in two ways: (1) recognition based on the dynamic skeleton of the human body [10]; and (2) recognition based on spatial depth sensing [5]. Following the first approach, Ren et al. [11] obtained skeleton data for 25 joint points of the human body with Kinect, obtained their coordinates in 3D space in real time, and investigated the importance of each joint in dynamic gesture expression; however, because the visual sensor processes information from the whole body, the gesture data were not expressed in sufficient detail. Following the second approach, Wang et al. [12] achieved a higher accuracy rate by studying people's gesture habits and using depth data from the Kinect depth sensor to control a 3D-printed robot; however, the results indicated that the recognition speed still needed improvement. The Leap Motion controller [13] is a somatosensory controller released by Leap Motion in 2013; unlike Kinect, it mainly performs skeletal motion tracking of the hand. Li et al. [14] generated finger-motion data with the Leap Motion controller and used it to calculate finger-joint angles. These are typical applications of vision-based recognition; beyond them, research on inertial sensor-based recognition also needs further development.
Research on gesture recognition based on data gloves generally uses inertial, myoelectric, pressure, and bending sensors mounted on the glove to obtain various gesture signals during hand movement. Alemayoh et al. [15] used inertial force sensors to capture motion data and then trained four neural networks using deep learning methods; the vision transformer (ViT) network performed best, with 99.05% recognition accuracy. Lin et al. [16] designed a data glove with multi-channel data transmission based on hand poses and emotion recognition to achieve simultaneous control of a robotic hand and a virtual hand. Zhao et al. [17] designed a motion-capture device based on a human sensor network with 15 sensor nodes and used the gradient descent method to fuse sensor data, improving the localization accuracy of the motion-capture system. Liu et al. [18] proposed a novel gesture-recognition device consisting of a data glove with bending and inertial sensors and a data arm ring with myoelectric sensors; however, the myoelectric sensors must be used with electrode patches, which are very inconvenient to wear and replace, an area in urgent need of improvement. Fu et al. [19] proposed a gesture-recognition method based on a data glove and a back-propagation neural network, but only gesture data for the numbers 0–10 were used in the experiments, which lacked the recognition of dynamic gestures. Gałka et al. [20] introduced the construction of an accelerometer glove and its application to sign language gesture recognition, along with the basics of inertial motion sensors and the design of the gesture-acquisition system; for a selected set of sign language gestures, they presented recognition results using a description method based on the hidden Markov model (HMM) and parallel HMMs, and using parallel HMMs for sensor-fusion modelling substantially reduced the error rate while maintaining high recognition accuracy. Qiu et al. [21,22] used inertial sensors and data-fusion algorithms to calculate the joint angles of kayakers; four machine learning algorithms were applied to investigate the effect of different data combinations on phase classification, and extended Kalman filtering was used to fuse the sensor information, all of which showed good classification accuracy. Tai et al. [23] studied the continuous recognition of six types of gestures using smartphones combined with long short-term memory (LSTM) neural networks, but the gestures used were too simple. The LSTM algorithm in [24] can be combined with convolutional neural networks for VGR-based gesture recognition; although these algorithms have been found to be effective, they still recognize only a single continuous gesture, and the problem of multi-class dynamic gesture recognition remains to be solved. Yuan et al. [25] designed a wearable device with two arm loops and a data glove with integrated flexible sensors to capture fine arm and joint movements, and introduced an LSTM model with fused feature vectors as input, verifying that the contextual information of gestures can be integrated into the gesture-recognition task to achieve excellent recognition results. Fan et al. [26] proposed a two-stage multi-head-attention human-interaction recognition model based on inertial measurement units, which can accurately recognize seven interaction actions with an average recognition accuracy of 98.73%.
To recognize two types of gestures based on inertial sensors in indoor scenes, we propose recognition and analysis algorithms based on machine learning and deep learning. We apply traditional machine learning algorithms to static gesture recognition and propose a bidirectional long short-term memory network with an attention mechanism (attention-BiLSTM) for the recognition of 10 dynamic sign language gestures. The raw data are collected with a homemade data-collection glove, and the gesture information is predicted using different machine learning algorithms. The aim is to improve the accuracy and reduce the time of gesture recognition by building a gesture model, thus expanding the application scenarios of gesture recognition. The main contributions of this paper are as follows.
The raw data were filtered using a Butterworth low-pass filter, the magnetometer data were corrected using an ellipsoid-fitting method, and the dataset was constructed using a gesture-assisted segmentation algorithm (a filtering sketch follows this list).
We used four machine learning algorithms to identify static gesture data and evaluated their prediction performance by cross-validation.
We constructed a hidden Markov model and an attention-mechanism-based neural network model to design recognition methods for dynamic gestures.
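To make the preprocessing step concrete, the following is a minimal sketch of zero-phase Butterworth low-pass filtering of raw inertial data. The filter order (4) and cutoff frequency (5 Hz) are illustrative assumptions, not the parameters used in this paper; only the 100 Hz sampling rate matches the acquisition system described in Section 2.

```python
# Minimal sketch: zero-phase Butterworth low-pass filtering of raw IMU data.
# Order (4) and cutoff (5 Hz) are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100.0        # sampling frequency of the glove nodes (Hz)
CUTOFF = 5.0      # assumed cutoff frequency (Hz)

def lowpass(data, cutoff=CUTOFF, fs=FS, order=4):
    """Filter each sensor channel (columns of `data`) without phase lag."""
    b, a = butter(order, cutoff / (0.5 * fs), btype="low")
    return filtfilt(b, a, data, axis=0)

# Example: filter a (n_samples, 9) array of accel/gyro/mag readings.
raw = np.random.randn(500, 9)          # stand-in for one node's raw stream
smoothed = lowpass(raw)
```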
This paper is structured as follows. Section 2 describes the hardware and data-acquisition information of the system. Section 3 describes the research methodology. Section 4 presents the algorithm design for gesture recognition. Section 5 explains the results of this study. Finally, Section 6 presents the discussion and conclusions.
2. Systematic Data Collection and Participants
2.1. System Setup
The gesture data acquired by the inertial sensors were processed by gesture segmentation, filtering, and gesture-fusion algorithms; the flowchart is shown in Figure 2a. A homemade data-acquisition system was the main source of gesture data; its composition is shown in Figure 2. It mainly consists of a pair of data gloves, a WiFi transceiver node, and a personal computer (PC) host, which together complete the collection and storage of gesture data.
Each glove contains 16 sensor nodes, and the data they collect can be sent to the PC host in real time through the wireless module. The hardware of the glove comprises 15 inertial nodes and 1 sink node, as shown in Figure 3a. The sink node uses an STM32F407VGT6 microcontroller unit (MCU) as the main controller and is equipped with an ESP8266 WiFi module (Espressif) with a serial peripheral interface (SPI). The sampling frequency of the nodes was set to 100 Hz, which meets the needs of the gesture recognition algorithm; because transmission is wireless, too high a sampling rate could increase the packet loss rate. The data of each sensor node are collected and processed by the sink node, and the PC receives them synchronously via WiFi. The positions of the 15 inertial nodes correspond to the 15 finger bones, so the data of each finger can be detected; the sink node is located on the back of the palm and detects the data of the palm. The glove material is elastic and can stretch within a certain range after the sensors are installed, so it accommodates most common hand sizes. The wearing effect is shown in Figure 3b.
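The wireless frame format is not specified in this paper, but conceptually the PC host receives one frame per sampling tick containing the nine-axis readings of all 16 nodes. The following purely hypothetical sketch shows how such a frame might be unpacked; the little-endian int16 layout and all names are illustrative assumptions, not the actual protocol.

```python
# Hypothetical sketch of unpacking one WiFi frame from the sink node.
# The real frame layout is not published; this assumes 16 nodes x 9 axes
# of little-endian int16 raw readings per 100 Hz sampling tick.
import struct

N_NODES, N_AXES = 16, 9

def parse_frame(frame: bytes):
    """Return 16 nine-value tuples (ax, ay, az, gx, gy, gz, mx, my, mz)."""
    expected = N_NODES * N_AXES * 2          # int16 payload size in bytes
    assert len(frame) == expected, "unexpected frame length"
    values = struct.unpack(f"<{N_NODES * N_AXES}h", frame)
    return [values[i * N_AXES:(i + 1) * N_AXES] for i in range(N_NODES)]
```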
Each inertial node consists of an MPU9250 9-axis sensor: a 3-axis accelerometer, a 3-axis gyroscope, and a 3-axis magnetometer. Taking node 1 of the dynamic one-hand gesture “Sorry” as an example, the nine-axis raw data are shown in Figure 4b. All gesture data were obtained from the authors and their classmates wearing the data gloves and performing the specified gestures. Before data acquisition, initial calibration was required, i.e., calibration of the subject's hand position through specific movements, because of variation in glove-wearing position between participants as well as the effects of the system's duty cycle and the external environment. During collection, the subject faces due north with the hands initially drooping naturally. When the system starts acquisition, the host-computer data can be observed, and after 3 s the acquisition system completes the coordinate-system calibration that facilitates the conversion of gesture data; during subsequent acquisition, the subject can face in any direction. The gesture collection is shown in Figure 4a.
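To make the magnetometer correction of Section 1 concrete, here is a minimal sketch of axis-aligned ellipsoid fitting by linear least squares. The paper does not restate its exact formulation here, so the parameterization below, solving a*x^2 + b*y^2 + c*z^2 + d*x + e*y + f*z = 1 and deriving a hard-iron offset and per-axis scale, is one standard variant, not necessarily the authors' implementation.

```python
# Sketch of axis-aligned ellipsoid fitting for magnetometer calibration
# (hard-iron offset + per-axis soft-iron scale); one standard formulation.
import numpy as np

def fit_ellipsoid(mag):                      # mag: (N, 3) raw readings
    x, y, z = mag[:, 0], mag[:, 1], mag[:, 2]
    D = np.column_stack([x*x, y*y, z*z, x, y, z])
    coeffs, *_ = np.linalg.lstsq(D, np.ones(len(mag)), rcond=None)
    a, b, c, d, e, f = coeffs
    center = np.array([-d/(2*a), -e/(2*b), -f/(2*c)])   # hard-iron offset
    # Radii of the fitted ellipsoid along each axis.
    g = 1 + (d*d)/(4*a) + (e*e)/(4*b) + (f*f)/(4*c)
    radii = np.sqrt(np.array([g/a, g/b, g/c]))
    scale = radii.mean() / radii                        # soft-iron scale
    return center, scale

def calibrate(mag, center, scale):
    return (mag - center) * scale            # corrected readings
```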
In this paper, we used different recognition and classification algorithms for static and dynamic gestures. Static gestures are hand patterns that are fixed at a given moment, and each frame captured by the data glove can serve as a gesture sample. Much of the dynamic-feature information can be removed during feature extraction, after which traditional machine learning algorithms are used for gesture recognition, specifically support vector machines, back-propagation neural networks, decision trees, and random forests. Dynamic gesture features are clearly more complex than static ones: the sign language actions change over time, and the execution time varies between signs. Therefore, we used an HMM and an attention-based bidirectional long short-term memory (attention-BiLSTM) model for the recognition of 10 dynamic sign language gestures, the latter validating dynamic sign language recognition with deep learning methods. Minimal sketches of these approaches follow.
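For the static gestures, a comparison of the four classifiers under cross-validation might look as follows (scikit-learn; the hyperparameters and the 5-fold split are illustrative assumptions, and MLPClassifier stands in for the back-propagation network).

```python
# Minimal sketch: comparing the four static-gesture classifiers with
# cross-validation. Hyperparameters and the 5-fold split are assumptions.
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier   # stand-in for the BP network
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

models = {
    "SVM": SVC(kernel="rbf"),
    "BP-NN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=100),
}

def evaluate(X, y):
    """X: (n_frames, n_features) static-gesture features; y: letter labels."""
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```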
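For the dynamic gestures, one common HMM formulation trains a Gaussian HMM per sign and classifies a sequence by maximum log-likelihood; the sketch below assumes this scheme and a five-state topology, neither of which is restated in this section.

```python
# Sketch: per-class Gaussian HMMs for dynamic gestures (hmmlearn).
# The number of hidden states (5) is an illustrative assumption.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_hmms(sequences_by_class, n_states=5):
    """sequences_by_class: {label: list of (T_i, n_features) arrays}."""
    hmms = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                  # concatenated observations
        lengths = [len(s) for s in seqs]     # per-sequence lengths
        m = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50)
        m.fit(X, lengths)
        hmms[label] = m
    return hmms

def classify(hmms, seq):
    """Pick the sign whose HMM gives the sequence the highest log-likelihood."""
    return max(hmms, key=lambda label: hmms[label].score(seq))
```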
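Finally, a minimal PyTorch sketch of the attention-BiLSTM idea; the hidden size, dropout rate, and the additive-attention pooling are illustrative assumptions, with the actual design given in Section 4.

```python
# Minimal sketch of a BiLSTM with attention pooling for 10 dynamic signs.
# Hidden size, dropout rate, and the attention form are assumptions.
import torch
import torch.nn as nn

class AttentionBiLSTM(nn.Module):
    def __init__(self, n_features=9, hidden=64, n_classes=10, p_drop=0.5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)     # scores each time step
        self.drop = nn.Dropout(p_drop)           # guards against overfitting
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, time, features)
        h, _ = self.lstm(x)                      # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time
        ctx = (w * h).sum(dim=1)                 # weighted context vector
        return self.head(self.drop(ctx))

model = AttentionBiLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, as in the paper
```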
2.2. Participant and Gesture Acquisition Actions
Four students recruited from the school participated in the preliminary study. Their average weight was 70.5 ± 2.3 kg and their average height was 1.74 ± 0.58 m. The participants comprised three male students and one female student, and each student sampled 50 sets of gesture data, for a total of 200 samples per gesture. All participants recorded their height and weight, received adequate information, and gave their consent.
A total of 20 static letter gestures and 10 dynamic one-handed gestures were selected for this study. The 20 static letter gestures were: “A”, “C”, “D”, “E”, “F”, “G”, “H”, “K”, “L”, “M”, “N”, “O”, “Q”, “R”, “S”, “U”, “V”, “W”, “X”, “Y”. The 10 dynamic one-handed gestures were: “Sorry”, “Angry”, “Sad”, “You”, “Hello”, “Effort”, “They”, “Me”, “Thanks”, “Goodbye”. All signs follow the “Sign Language Dictionary | SpreadTheSign” reference; the Chinese sign language standard is used.
6. Discussion and Conclusions
Gesture-recognition technology can be applied to many scenarios, such as virtual reality, robot control, and remote operation. The main sensors currently used for gesture recognition are inertial measurement units (IMUs), video-based optical capture, and surface electromyography sensors; their main problems are the inconvenience of wearing and vulnerability to environmental interference. Some studies focus on the structural design of the data glove and ignore the influence of the recognition algorithm on recognition accuracy.
In this paper, an inertial sensor-based gesture data-acquisition system is used, with the goal of constructing a gesture-recognition model from the collected static and dynamic gesture datasets. Traditional machine learning algorithms can perform gesture recognition and classification; we evaluated the prediction effectiveness of four such algorithms. For static gestures, model prediction performance was evaluated by cross-validation, and we concluded that the random forest algorithm has the highest recognition accuracy and the shortest recognition time. For dynamic gestures, we used an HMM and a deep-learning-based attention-BiLSTM model; according to the results, the latter achieved higher recognition accuracy. The model integrates the time-series information of sign language acceleration, angular velocity, and hand posture to predict the sign language category, introduces a dropout layer to avoid overfitting, and uses the Adam optimization algorithm to accelerate convergence.
However, this does not indicate that deep learning methods are superior to traditional machine learning algorithms in the field of gesture recognition. With traditional methods, we can obtain a more comprehensive understanding of the data and of the model's underlying algorithm, in contrast to the black-box structure of deep models. Moreover, in practical engineering, traditional machine learning methods often require far less computational cost than deep learning methods. Gesture recognition based on wearable devices must consider portability, power consumption, cost, comfort, and so on; under these constraints, it is difficult to add the computational units required for deep learning, and hence difficult to guarantee the performance of deep learning models. In contrast, traditional machine learning models are fast to train and simple to deploy, and the required engineering effort is concentrated on data processing and feature optimization before the model, allowing faster update iterations in hardware products and the ability to try different modelling approaches in a short period. These advantages are not attainable by deep learning at this stage.
In addition, participants reported that prolonged wear caused hand discomfort. There is therefore an urgent need for more comfortable gesture-recognition solutions, or for using fewer miniature inertial sensor nodes while guaranteeing recognition performance. The gesture data collected in this study followed no strict criteria, and the participants had no prior gesture-learning experience; these factors are worth considering in the future. We will also consider designing more lightweight, miniature wearable modules that can be integrated into existing electronics, such as watches and rings, to create a more comprehensive gesture-capture interaction system.