1. Introduction
During the early stages of the Second Industrial Revolution, humans controlled and regulated the speed and direction of trains, as well as the rotational speed and power output of steam engines, by manipulating buttons and levers, marking the earliest form of human–machine interaction. With the advent of computers, this mode of information transmission gradually gave way to keyboard input and mouse control. In recent years, the rapid development of signal acquisition technology and machine learning has increased the speed of information transmission and improved the accuracy of information recognition: complex tasks can now be accomplished precisely by driving a computer with simple signals. As technology has advanced, researchers have developed various human–machine interaction technologies, including voice control [1,2,3], brain–computer interfaces [4,5,6,7], facial expression control [8,9,10,11], and gesture recognition, further enhancing the freedom, applicability, and efficiency of human–machine interaction.
Among the numerous human–machine interaction technologies, gesture recognition holds a significant position because humans frequently use hand movements to convey and receive information. Studies have shown that gestures account for 55% of the importance in information transmission, while sound and text together account for the remaining 45% [12]. This highlights the crucial role body language plays in expressing emotions and in teaching, making gesture recognition a core technology in human–machine interaction, with advantages such as simplicity, flexibility, and rich connotations [13]. This article primarily discusses gesture recognition technologies related to the palm and its surrounding area. Depending on the method of implementation, these technologies can be broadly classified into four categories: electromagnetic wave sensing recognition, strain sensing recognition, electromyography sensing recognition, and visual sensing recognition. Because each approach offers unique advantages, extensive research has been conducted worldwide to explore diverse implementation methods.
After this introduction, the article is structured as follows: In the second section, the article provides an in-depth introduction to the principles and implementation methods of various gesture recognition technologies. It categorizes and exemplifies the advancements made by researchers and scholars in recent years, focusing on areas such as feature extraction methods, artificial intelligence algorithms, and sensor material structure characteristics. The third section compiles and compares typical cases of four implementation methods, comprehensively discussing the advantages, disadvantages, and applicable scenarios of each approach from the perspectives of dataset size and accuracy. In the fourth section, the article delves into the applications and development of gesture recognition technology in modern production and daily life, encompassing areas such as improving traditional control methods, medical applications, and sports training. The fifth and sixth sections explore the existing problems and challenges of current gesture recognition technology, considering factors such as the biocompatibility and wearability of sensor structures, as well as the adaptability, stability, robustness, and cross-functionality of signal acquisition and analysis algorithms. Finally, these sections provide a summary and future outlook on the development directions within this field.
This paper provides a systematic summary and analysis of the current state and developmental trajectory of gesture recognition-based human–computer interaction technology. It identifies prevailing issues and proposes future directions for development. The findings are expected to facilitate the advancement and practical application of gesture recognition technology while aiding researchers and scholars in selecting implementation methods that align with their research objectives and application requirements. Moreover, this work serves as a foundation for enhancing and innovating this technology.
2. Research Methods and Current Situation
2.1. Electromagnetic Wave Sensing Recognition
The principle underlying gesture recognition using electromagnetic wave sensing is based on the physical phenomena of reflection, refraction, and scattering that occur when electromagnetic waves encounter obstacles, specifically human hands, in their path. These phenomena lead to changes in the intrinsic parameters of the original electromagnetic waves. By analyzing the variations in the transmitted and received signals and utilizing demodulation techniques, gesture poses can be identified. Currently, electromagnetic wave sensing for gesture recognition can be categorized into two main types: Wi-Fi-based recognition and radar-based recognition.
In a static propagation model, as electromagnetic waves propagate, they experience not only a direct path but also reflection, refraction, and scattering due to the presence of human hands. Consequently, the receiving end captures multiple signals from different paths, resulting in multipath effects [14]. For the direct path, the Friis free-space propagation equation [15] can be employed to determine the received signal strength:

$$P_r(d) = \frac{P_t G_t G_r \lambda^2}{(4\pi)^2 d^2} \quad (1)$$

where $P_t$ represents the transmission power, $P_r(d)$ the received power, $G_t$ and $G_r$ the transmission and reception gains, respectively, $\lambda$ the wavelength, and $d$ the propagation distance. When a human hand is present in the propagation path, Equation (1) becomes

$$P_r(d) = \frac{P_t G_t G_r \lambda^2}{(4\pi)^2 (d + \Delta d)^2} \quad (2)$$

where $\Delta d$ approximates the path-length change caused by the disturbance of the human hand, and $h$ denotes the distance between other reflection points (excluding the hand) and the direct path [16]. It can be observed that the received power decreases as the propagation path lengthens: a hand obstructing the path introduces a disturbance and creates a new propagation path.
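As a concrete illustration, the following minimal Python sketch evaluates Equations (1) and (2) for a 2.4 GHz Wi-Fi link; the parameter values (unity gains, a 3 m link, a 0.4 m path-length perturbation) are illustrative assumptions, not values drawn from the cited studies.

```python
import math

def friis_received_power(p_t, g_t, g_r, wavelength, d):
    """Received power from Equation (1), the Friis free-space model."""
    return p_t * g_t * g_r * wavelength ** 2 / ((4 * math.pi) ** 2 * d ** 2)

def perturbed_received_power(p_t, g_t, g_r, wavelength, d, delta_d):
    """Equation (2): the hand lengthens the effective path by delta_d."""
    return friis_received_power(p_t, g_t, g_r, wavelength, d + delta_d)

# Illustrative numbers: 2.4 GHz carrier, unity gains, 3 m direct path.
lam = 3e8 / 2.4e9
p_clear = friis_received_power(1.0, 1.0, 1.0, lam, 3.0)
p_hand = perturbed_received_power(1.0, 1.0, 1.0, lam, 3.0, 0.4)
print(f"{10 * math.log10(p_clear / p_hand):.2f} dB drop on the perturbed path")
```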
The dynamic propagation model primarily relies on the Doppler effect, which describes the change in the frequency of radiation perceived by an observer due to relative motion between the source and the observer. Assume the original wavelength of the source is $\lambda$, the wave speed is $u$, and the velocity of the observer is $v$. When the observer approaches the source, the observed frequency $f_1$ is

$$f_1 = \frac{u + v}{\lambda} \quad (3)$$

Otherwise, when the observer recedes from the source, the observed frequency $f_2$ is

$$f_2 = \frac{u - v}{\lambda} \quad (4)$$

When the wave source itself moves toward the observer, the waves ahead of it are compressed, shortening the wavelength and raising the observed frequency; conversely, when the source moves away, the wavelength lengthens and the frequency decreases [17].
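A short numerical sketch of Equations (3) and (4) follows; the 24 GHz carrier and 0.5 m/s hand speed are illustrative assumptions. Note that this is the one-way shift for a moving observer; a radar echo from a moving hand accumulates the shift on both legs of the round trip.

```python
def observed_frequency(wavelength, wave_speed, observer_speed, approaching=True):
    """Observed frequency for a moving observer and a stationary source,
    following Equations (3) and (4); all quantities in SI units."""
    if approaching:
        return (wave_speed + observer_speed) / wavelength
    return (wave_speed - observer_speed) / wavelength

c = 3e8                      # propagation speed of the electromagnetic wave
f0 = 24e9                    # illustrative 24 GHz carrier
lam = c / f0
shift = observed_frequency(lam, c, 0.5) - f0
print(f"one-way Doppler shift for a 0.5 m/s hand: {shift:.1f} Hz")
```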
By considering these two effects, Wi-Fi signals at the Medium Access Control (MAC) layer can be characterized by the Received Signal Strength Indication (RSSI), which accounts for the accumulated propagation delay, amplitude attenuation, and phase shift along different propagation paths [18,19,20]. However, this representation has limitations, such as low ranging accuracy and recognition deviations caused by RSSI fluctuations under static multipath propagation. With the continuous improvement of Wi-Fi protocols, the development of Orthogonal Frequency Division Multiplexing (OFDM) technology has enabled the use of Channel State Information (CSI) at the physical layer to reflect the signal state during propagation [21]. With OFDM, the channel between the transmitter and receiver is partitioned into several subcarriers, as depicted in Figure 1. These subcarriers capture characteristics such as signal scattering and multipath attenuation. CSI signals are consistent for the same gesture, while variations among the CSI signals of different gestures enable differentiation between gesture types. Compared to the RSSI representation at the MAC layer, the CSI representation at the physical layer is less susceptible to multipath interference and provides a more precise characterization at each subcarrier.
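Because CSI arrives as a subcarrier-by-packet matrix, it maps naturally onto image-like input for the deep networks discussed below. The following sketch is a simplified stand-in for a real CSI pipeline: it normalizes a synthetic amplitude stream and resamples it to a fixed size; the dimensions and data are assumptions.

```python
import numpy as np

def csi_to_image(csi_amplitude, out_size=(64, 64)):
    """Normalize a CSI amplitude stream (n_subcarriers x n_packets) into a
    fixed-size grayscale matrix suitable as input to a CNN classifier."""
    a = np.asarray(csi_amplitude, dtype=float)
    a = (a - a.min()) / (np.ptp(a) + 1e-9)               # scale to [0, 1]
    rows = np.linspace(0, a.shape[0] - 1, out_size[0]).astype(int)
    cols = np.linspace(0, a.shape[1] - 1, out_size[1]).astype(int)
    return a[np.ix_(rows, cols)]                         # nearest-neighbour resample

# Synthetic stand-in: 30 subcarriers observed over 500 packets.
fake_csi = np.abs(np.random.randn(30, 500))
print(csi_to_image(fake_csi).shape)                      # (64, 64)
```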
Building upon this representation, researchers have attempted to overcome the inherent limitations of Wi-Fi signal sensing, such as low time resolution, vulnerability to interference, and narrow bandwidth. Qirong Bu et al. [22] extracted gesture segments based on changes in CSI amplitude and transformed Wi-Fi-based gesture recognition into an image classification task by representing CSI streams as image matrices and feeding them into deep learning networks. Zhanjun Hao et al. [23] established a correlation mapping between the amplitude and phase-difference information of subcarriers in wireless signals and sign language gestures; they combined an effective denoising method to filter environmental interference and efficiently selected optimal subcarriers, thereby reducing system computation costs. Li Tao et al. [24], using the nexmon firmware, obtained 256 CSI subcarriers from the underlying layer of smartphones operating in IEEE 802.11ac mode with an 80 MHz bandwidth. They fused the extracted time- and frequency-domain CSI features using a cross-correlation method and recognized gestures with an improved dynamic time warping (DTW) algorithm, overcoming the limitation that gestures can only be recognized within relatively fixed regions along the transmission link.
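For readers unfamiliar with DTW, the following sketch implements the classic algorithm on one-dimensional amplitude traces; it is the textbook baseline, not the improved variant of [24], and the gesture templates are synthetic.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])              # local mismatch
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m]

# A trace matches the template of the same shape despite a different speed.
trace = np.sin(np.linspace(0, 3, 80))
template_a = np.sin(np.linspace(0, 3, 100))              # same gesture, slower
template_b = np.sin(np.linspace(0, 9, 100))              # different gesture
print(dtw_distance(trace, template_a) < dtw_distance(trace, template_b))  # True
```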
Figure 1. CSI signal samples during two human activities detected by Wi-Fi [25].
Compared to Wi-Fi gesture recognition, continuous wave radar gesture recognition, which exploits Doppler information, demonstrates superior performance in dynamic recognition [26]. The general process is illustrated in Figure 2. Backscattered radar signals contain abundant gesture-related information, including range-Doppler, angle-Doppler, range-angle, and time-frequency information, enabling the differentiation of various gesture types. Skaria Sruthy et al. [27] employed a low-cost dual-antenna Doppler radar to generate the in-phase and quadrature components of the gesture signal, mapping them to three input channels of a deep convolutional neural network (DCNN); this approach yielded two spectrograms and an angle-of-arrival matrix for recognition, achieving an accuracy exceeding 95%. Wang Yong et al. [28] developed a gesture recognition platform based on frequency-modulated continuous wave (FMCW) radar, which extracted features from the obtained range-Doppler map (RDM) and range-angle map (RAM); the accuracy for simultaneous recognition of two-handed gestures surpassed 93.12%. Yan Baiju et al. [29] harnessed the privacy-preserving and non-contact advantages of millimeter-wave radar to develop a gesture recognition system. They divided the collected data, including range-Doppler images (RDI), range-angle images (RAI), Doppler-angle images (DAI), and micro-Doppler spectra, into training and testing sets; using a semi-supervised learning (SSL) model, they achieved average accuracy rates of 98.59% for new users, 96.72% for new locations, and 97.79% for new environments.
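The range-Doppler maps mentioned above are typically obtained with two FFTs over one frame of FMCW chirps. The sketch below shows that standard processing chain on synthetic data; the window choice and frame dimensions are illustrative assumptions.

```python
import numpy as np

def range_doppler_map(iq_frame):
    """Compute a range-Doppler map (RDM) from one FMCW radar frame.

    iq_frame: complex array of shape (n_chirps, n_samples_per_chirp).
    An FFT along fast time gives range bins; an FFT along slow time
    gives Doppler bins.
    """
    window = np.hanning(iq_frame.shape[1])
    range_fft = np.fft.fft(iq_frame * window, axis=1)                # range
    doppler = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0) # Doppler
    return 20 * np.log10(np.abs(doppler) + 1e-12)                    # dB magnitude

# Synthetic frame: 64 chirps of 256 samples each.
frame = np.random.randn(64, 256) + 1j * np.random.randn(64, 256)
print(range_doppler_map(frame).shape)   # (64, 256): Doppler bins x range bins
```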
2.2. Strain-Sensing Recognition
Strain-sensing recognition technology is based on percolation theory and tunneling theory. It converts the changes in stress on the sensing element into corresponding electrical signals, which are then used for gesture recognition through machine learning algorithms. In the early stages, strain-sensing recognition devices often utilized rigid substrates, which resulted in mechanical incompatibility with the flexible human skin. Additionally, these devices were typically bulky and inconvenient to carry. However, with technological advancements, these devices have undergone miniaturization and are now widely integrated into data gloves and smart wristbands.
There are three main types of strain-sensing devices based on the method of signal extraction: piezoelectric [31,32], capacitive, and resistive. Piezoelectric sensors are self-generating electromechanical transducers: when subjected to stress, the material develops charges on its surface, which are converted into electrical signals and amplified. The strength of the electrical signal is generally proportional to the applied pressure, enabling strain-sensing recognition. Capacitive sensing technology commonly employs parallel-plate capacitors. When the excitation electrode and the sensing electrode face each other in parallel and are slowly pulled apart, the electric field lines between them spread out from the region between the plates. As shown in Figure 3, when the two electrodes reach a coplanar state, the fringe electric field between them becomes dominant. Capacitive sensors exploit this effect to recognize dynamic gestures within the sensitive area [33].
Resistive sensors transform non-electrical variations into changes in resistance by deforming a stress-sensitive elastic conductor. The typical structure of a resistive sensor, illustrated in Figure 4, comprises flexible contact layers (the first and fifth layers), a first and a second electrode layer (the second and fourth layers), and a sensing layer typically positioned in the middle (the third layer). Compared to other sensor types, resistive sensors have garnered significant attention due to their high sensitivity, wide measurement range, excellent reusability, and simple structure, making them a primary focus of research.
In recent years, researchers have investigated coupling resistive sensors with additional hierarchical structures or microstructures, aiming to enhance biocompatibility and sensing accuracy. These structures introduce an intermediate functional layer, a composite that combines dielectric materials with specific components. Compared to pure dielectric materials, this intermediate layer possesses a lower effective Young's modulus, making it more susceptible to deformation. When external pressure is applied, the microstructured air–composite functional layer expels air, increasing the proportion of components with a high dielectric constant. This produces a more pronounced change in capacitance than conventional hierarchical structures, along with a higher signal-to-noise ratio and improved stability. Zhu Penghua et al. [34] proposed a stretchable resistive thread sensor based on a composite of silver-coated glass microspheres (Ag@GMs) and solid rubber (SR). Wang Shuai et al. [35] introduced a gradient porous/pyramid hybrid structure (GPPHS) conductive composite film with gradient compression characteristics and superior structural compressibility, simultaneously enhancing the detection accuracy and range of resistive sensors. Liu Caixia et al. [36] developed a flexible resistive sensor with a crack structure and a high gauge factor, composed of a biodegradable, stretchable gelatin composite integrated with a fabric substrate. Additionally, specific materials can be plated onto the contact surface between the conductive composite structure and the conductive layer, so that the device's conductivity responds to changes in the distance between the composite material and its contact area with the conductive layer. As a result, the sensor exhibits an expanded measurement range and improved measurement accuracy [37].
For traditional resistive sensors, the most common method of signal classification is the fixed-threshold judgment approach. This method is simple and fast, but it exhibits poor resistance to interference and limited accuracy. Introducing efficient machine learning algorithms has therefore become a key focus of research in this field. Fan Tianyi et al. [38] used binary neural networks and convolutional neural networks to process the collected signals; for a dataset of 10 × 200 gesture samples, they achieved a recognition accuracy of 98.5%. Liu Caixia et al. [36] employed support vector machines to process data from nine types of gestures, raising the average recognition accuracy to 99.5%. Wang Ziyi et al. [39] devised a gesture recognition system based on a planar capacitive array; in experiments on a dataset of 6 × 25 gesture samples, they achieved a recognition rate exceeding 95% using hidden Markov models. These results demonstrate the significant application potential of strain-sensing technology combined with artificial intelligence algorithms.
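The contrast between the fixed-threshold approach and a learned classifier can be sketched as follows; the five "strain channel" features, thresholds, and class structure are all synthetic assumptions, and the SVM stands in for whichever learner a given system adopts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic data: 5 strain-channel features per sample, 3 gesture classes.
means = rng.normal(scale=2.0, size=(3, 5))        # one mean vector per gesture
labels = np.repeat(np.arange(3), 200)
X = means[labels] + rng.normal(scale=0.5, size=(600, 5))

# Fixed-threshold baseline: a channel counts as "bent" if it exceeds 0.
codebook = {tuple((m > 0.0).astype(int)): g for g, m in enumerate(means)}
def threshold_classify(x):
    return codebook.get(tuple((x > 0.0).astype(int)), -1)   # -1 = unknown pattern

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
thr_acc = np.mean([threshold_classify(x) == y for x, y in zip(X_te, y_te)])
svm_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
print(f"threshold accuracy: {thr_acc:.3f}   SVM accuracy: {svm_acc:.3f}")
```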
2.3. EMG Sensing Recognition
When skeletal muscles are in a resting, relaxed state, the membrane potential of each muscle cell (muscle fiber) is approximately −80 mV [40]. During muscle contraction, a potential difference arises between the muscle cells and the motor neurons that together form a motor unit, leading to the conduction of electrical signals. The sum of the action potentials of all muscle cells within a motor unit is referred to as the motor unit action potential. When a skeletal muscle contracts, the electromyographic (EMG) signal is the linear sum of the action potentials of the participating motor units. The strength of the EMG signal is strongly correlated with the state and extent of muscle contraction, which makes EMG-based gesture recognition possible.
Based on the location of EMG signal generation, there are two types: surface EMG (sEMG) signals and intramuscular EMG (iEMG) signals, each with its own characteristics, as shown in Table 1. Surface EMG sensing collects signals by placing non-invasive electrodes on the skin over the muscles, providing information about muscle movement. It is widely used in medical rehabilitation and human–computer interaction because it records actual action potentials while being unaffected by factors such as lighting conditions and occlusions. Furthermore, studies have shown that this method can capture relevant electrical signals approximately 200 ms before physical movement [40], giving the technology potential for action prediction as well. In contrast, intramuscular EMG acquisition requires inserting electrodes into specific points within muscle tissue to capture muscle activity signals. This invasive method unavoidably carries a slight risk of tissue damage, infection, and discomfort. It also suffers from high equipment costs and the need for frequent maintenance, making it unsuitable for flexible and dynamic human–computer interaction environments.
The application of sEMG signals can be traced back to the 1990s, when researchers used muscle signals to control robot movements [41] for human–computer interaction. Over time, sEMG acquisition technology has been continuously improved and extended, finding applications in gesture recognition. As with other gesture recognition approaches, sEMG-based recognition relies primarily on improved feature extraction methods and machine learning algorithms to enhance accuracy. Because EMG signals are nonlinear, stochastic, and highly variable, processing and analyzing the entire signal waveform is particularly challenging. Lv Zhongming et al. [42] proposed a feature selection and classification algorithm based on a Self-Organizing Map (SOM) and a Radial Basis Function (RBF) network, combined with Principal Component Analysis (PCA) to reduce the dimensionality of the feature vectors. Anastasiev Alexey et al. [43] used a novel four-component multi-domain feature set with feature vector weight addition for signal segmentation and feature extraction, enabling a more accurate investigation of features and patterns during muscle contraction. Mahmoud Tavakoli et al. [44], Vimal Shanmuganathan et al. [45], and Guo Weiyu [46] employed Support Vector Machines, R-CNN, and Long Exposure Neural Networks, respectively, to process the collected EMG signals, achieving recognition accuracies exceeding 95%.
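As an illustration of the windowing-plus-features pipeline that such studies build on, the sketch below computes four classic time-domain sEMG features per sliding window; the window length, step, and synthetic signal are assumptions, and the resulting matrix would feed a downstream classifier such as an SVM.

```python
import numpy as np

def emg_features(window):
    """Four classic time-domain sEMG features for one analysis window."""
    w = np.asarray(window, dtype=float)
    mav = np.mean(np.abs(w))                         # mean absolute value
    rms = np.sqrt(np.mean(w ** 2))                   # root mean square
    zc = np.sum(np.abs(np.diff(np.sign(w))) > 0)     # zero crossings
    wl = np.sum(np.abs(np.diff(w)))                  # waveform length
    return np.array([mav, rms, zc, wl])

def sliding_windows(signal, size=200, step=100):
    """Segment a 1-D sEMG stream into overlapping analysis windows."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

# Synthetic stand-in: one second of sEMG sampled at 1 kHz.
sig = np.random.randn(1000) * np.hanning(1000)
feats = np.stack([emg_features(w) for w in sliding_windows(sig)])
print(feats.shape)   # (n_windows, 4) feature matrix for a downstream classifier
```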
Furthermore, EMG-based gesture recognition is also applicable to individuals with disabilities, since paralysis or amputation of the elbow or palm does not prevent muscle contractions and EMG signal generation within the arm's neural and muscular system. By capturing and analyzing EMG signals, gestures and movements can be recognized and controlled for people with disabilities. Gu Guoying et al. [47] developed a neuroprosthetic hand, shown in Figure 5, which uses non-invasive electrodes placed on the amputee's elbow to capture EMG signals and control the prosthesis. In addition, pressure sensors on the prosthetic fingertips generate electrical stimulation to simulate tactile feedback. Such technology not only helps individuals with disabilities perform normal hand movements but also assists doctors in better understanding the muscular condition of patients, facilitating their treatment.
2.4. Visual Sensing Recognition
Gesture recognition systems based on visual sensing have reached a relatively advanced stage of development. They recognize gestures by converting images containing human hands into color channel data and keypoint position information. Early research primarily focused on sign language translation and gesture control technology. With the continuous improvement and expansion of computer hardware and image processing techniques, visual sensing and recognition technology has become more precise and standardized.
During the image acquisition and preprocessing stage, human hand images are captured using image capture devices. Higher-resolution cameras generally yield higher accuracy for visual gesture recognition but also increase the computational load. Preprocessing of the captured hand image is therefore essential, as illustrated in Figure 6. The image undergoes processing techniques such as filtering, grayscale conversion, and binarization to extract the hand's contour information and finger coordinates while simplifying the data.
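A minimal OpenCV sketch of this preprocessing chain is shown below; the kernel size, Otsu thresholding, and the input file name are illustrative choices rather than a prescription from the cited works.

```python
import cv2

def preprocess_hand(bgr_image):
    """Filter -> grayscale -> binarize -> largest contour, mirroring the
    preprocessing chain described above (parameter values illustrative)."""
    blurred = cv2.GaussianBlur(bgr_image, (5, 5), 0)           # noise filtering
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)           # grayscale
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarization
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    hand = max(contours, key=cv2.contourArea) if contours else None
    return binary, hand                              # mask plus hand contour

frame = cv2.imread("hand.jpg")                       # hypothetical input image
if frame is not None:
    mask, contour = preprocess_hand(frame)
```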
Given the variations in hand size, skin color, and shape among individuals, as well as the potential interference from scene lighting and hand motion speed, effectively segmenting a meaningful hand gesture from a complex background remains an urgent challenge; relying solely on color segmentation is clearly insufficient. In 2019, Manu Martin et al. [48] proposed seven methods for extracting pixel- or neighborhood-related features from color images, evaluating and combining them using random forests. Danilo Avola et al. [49] normalized the extracted images and employed a multi-task semantic feature extractor to derive 2D heatmaps and hand contours from RGB images; a viewpoint encoder was then used to predict hand parameters. With the advancement of structured light technology and depth image capture, gesture recognition based on depth images has gained considerable attention, and mature commercial products such as Kinect and Leap Motion are available in this field. Depth images use grayscale values to represent the relative distances of pixels from the capture system, reducing the impact of complex backgrounds on gesture segmentation to a certain extent. Moreover, the data structure of depth images is well suited as input to artificial intelligence algorithms such as random forests and neural networks, effectively transforming complex feature extraction problems into simpler pixel classification tasks. Consequently, this approach is gradually becoming the mainstream trend in gesture segmentation. Xu et al. [50] reviewed traditional depth-based gesture segmentation algorithms, including RDF, R-CNN, YOLO, and SegNet, and enhanced the baseline SegNet algorithm by incorporating class weights, transposed convolutions, hybrid dilated convolution combinations, and concatenation-and-merge skip connections between the encoder and decoder. These enhancements yielded F2-Score improvements of 7.6% and 5.9% for the left and right hand, respectively, over the baseline method.
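The pixel-classification view of depth-based segmentation can be reduced to its simplest form, a depth band around the hand. The band limits below are assumptions; real systems typically track the nearest blob rather than using fixed limits.

```python
import numpy as np

def segment_hand_depth(depth_mm, near=300, far=600):
    """Label pixels whose depth falls inside a near-range band as 'hand'.

    depth_mm: 2-D array of per-pixel distances in millimetres.
    near/far: illustrative fixed band around the expected hand distance.
    """
    mask = (depth_mm >= near) & (depth_mm <= far)
    return mask.astype(np.uint8)

# Synthetic 240x320 depth frame: background at 1.5 m, hand-like patch at 0.45 m.
depth = np.full((240, 320), 1500, dtype=np.uint16)
depth[80:160, 120:200] = 450
print(segment_hand_depth(depth).sum(), "pixels classified as hand")
```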
Figure 6. Hand boundary detection and hand key point detection [51].
After completing the aforementioned steps, the next phase involves extracting gesture-related signals from the preprocessed images, such as the number of fingers, finger length, finger opening angle, and more. These signals are then classified using machine learning algorithms trained on large numbers of samples to achieve accurate recognition. Khan Fawad Salam et al. [52] employed Mask R-CNN combined with the grasshopper optimization algorithm to classify the obtained RGB images and hand keypoints. Jaya Prakash Sahoo et al. [53] developed an end-to-end fine-tuning method for pre-trained CNN models using score-level fusion, demonstrating excellent gesture prediction performance even with a limited training dataset; building on this technology, the team designed and developed a real-time American Sign Language (ASL) recognition system [54].
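One common way to derive the finger-count signal mentioned above is via convexity defects of the hand contour. The sketch below shows that heuristic; all thresholds are illustrative, and OpenCV stores defect depths in fixed point (multiples of 1/256 pixel).

```python
import cv2
import numpy as np

def count_fingers(binary_mask):
    """Estimate the number of extended fingers from a binary hand mask
    using convexity defects (all thresholds illustrative)."""
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    hand = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(hand, returnPoints=False)
    defects = cv2.convexityDefects(hand, hull)
    if defects is None:
        return 0
    valleys = 0
    for start_i, end_i, far_i, depth in defects[:, 0]:
        start, end, far = hand[start_i][0], hand[end_i][0], hand[far_i][0]
        a, b = start - far, end - far          # vectors from the valley point
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        if depth > 256 * 40 and cos > 0:       # deep valley, angle < 90 degrees
            valleys += 1                       # one valley between two fingers
    return valleys + 1 if valleys else 0
```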
3. Comparison and Analysis
Table 2 includes representative examples and relevant parameters of various detection methods. By analyzing these typical cases, we can infer the strengths and limitations of each detection method. Typically, larger datasets lead to higher accuracy and better performance of gesture recognition methods. However, it is crucial to consider the variations in experimental conditions, such as diverse environments and dataset formats used in different methods. As a result, the final results may exhibit certain biases. Therefore, this paper provides a comprehensive analysis of the pros and cons of each detection method.
After comparison, it is evident that gesture recognition technologies employing machine learning and artificial intelligence algorithms generally achieve higher accuracy. Among them, electromagnetic wave-based gesture recognition stands out with its relatively high accuracy, non-contact nature, and independence from line-of-sight and lighting conditions. It is suitable for recognizing gestures of different sizes and shapes. However, this approach demands high hardware requirements and comes with a higher cost, sometimes necessitating multiple signal transmission devices to achieve high-precision results. Strain-based gesture recognition also demonstrates a high level of accuracy and strong reliability, allowing for operation in various environments. However, its gesture recognition range is limited, requiring the deployment of sensor hardware devices on the hand, which can impact accuracy based on the specific deployment environment. On the other hand, surface electromyography (sEMG) gesture recognition exhibits relatively lower accuracy due to significant noise in the electromyographic signals. It requires advanced denoising algorithms and inconveniently involves the placement of electrodes on the skin. Moreover, specific muscle training is necessary, and accuracy is influenced by muscle fatigue. However, sEMG has the capability to detect subtle muscle changes and enables gesture recognition for individuals with disabilities, thus holding significant importance in the medical field. Visual sensing-based gesture recognition achieves the highest accuracy, thanks to the rapid development of color and depth imaging technologies. It offers advantages such as low cost, ease of implementation, and non-contact operation. However, it is susceptible to signal distortions caused by background, lighting, and occlusion effects, necessitating image preprocessing. Furthermore, the recognition accuracy for random samples not present in the database requires further investigation.
Existing recognition technologies also face common issues that urgently require solutions. Despite the tremendous potential of gesture recognition technology, there are significant differences between static and dynamic gesture recognition, as well as variations in the semantic-syntactic structures of different gestures. Insufficient analysis algorithms, datasets, visual corpora, and other factors hinder in-depth semantic analysis, and there is currently no fully automated model or method that is widely applicable to multiple static or dynamic gesture recognition systems. In fact, as demonstrated in Table 3, numerous unimodal and multimodal corpora are already accessible to researchers. The STF-LSTM [59] deep neural network architecture effectively incorporates gesture and lip movement features, achieving a gesture recognition rate of 98.56% on the extensive multimodal Turkish Sign Language dataset AUTSL [60]. Notably, that work introduces a model compression technique using ONNX Runtime, which significantly reduces the computational and memory requirements of the model, enabling smooth and efficient operation on common mobile devices such as the Samsung Galaxy S22. Another notable framework, the SAM-SLR architecture [61] proposed by Songyao Jiang et al., attained recognition rates of 98.42% and 98.53% on the RGB and RGB-D tracks of AUTSL, respectively. Nonetheless, existing corpora still face certain challenges, including limited accessibility, insufficient data volume and diversity, and the inclusion of only one type of dynamic or static gesture; these constraints hinder progress in deep semantic analysis. Furthermore, a research gap exists concerning cutting-edge technologies such as femtosecond laser recognition, fiber optic sensing recognition, and acoustic recognition. Considering practical application scenarios, there is also room to improve the comfort and portability of certain wearable devices.
4. Technology Application
Hand gesture recognition technology is characterized by convenience, intuitiveness, and intelligence, and it holds tremendous significance and potential for human production and daily life. This section aims to provide a comprehensive overview of the applications of hand gesture recognition technology in various domains of modern production and daily life.
4.1. Improved Traditional Control Mode
Gesture recognition-based human–computer interaction control methods offer several advantages over traditional approaches. First, they align more naturally with human habits, eliminating the need to learn additional input devices. Second, gesture recognition technology allows diverse gestures to be designed to meet the specific requirements of different application scenarios, enabling more flexible and intuitive interaction in fields such as gaming, healthcare, and education. Extensive research and applications across domains have demonstrated the significant potential of these methods. For example, Strazdas Dominykas et al. [67] developed a non-contact multimodal human–computer interaction system called Rosa, which integrates gesture recognition, facial recognition, and speech technologies to control mechanical systems efficiently and securely. Su Mu Chun et al. [68] proposed a gesture recognition-based home appliance control system that achieved a recognition accuracy of 91% and allowed wireless control of household appliances using a small set of gestures. In the field of unmanned aerial vehicles (UAVs), gesture recognition has been employed as an alternative to traditional joystick controls. Lee JiWon et al. [69] implemented a UAV gesture control system based on inertial measurement unit (IMU) recognition components; during control, information about obstacles on the UAV's heading is conveyed to the user through vibration feedback, enhancing safety. Konstantoudakis Konstantinos et al. [70] developed an AR-based single-hand UAV control system, addressing the visual fatigue caused by prolonged focus on joysticks and control screens. Moreover, for highly dynamic underwater environments, Yu Jiang et al. [71] proposed a gesture interaction system for autonomous underwater vehicles (AUVs) that employs fuzzy control to overcome challenges such as fluctuation interference and light attenuation. These studies collectively highlight the efficiency, safety, and speed advantages of gesture recognition over traditional methods, positioning it as a core technology for emerging human–computer interaction control systems.
4.2. Medical Applications
Gesture recognition technology has found extensive applications in the medical field, playing a crucial role in improving the efficiency and quality of healthcare services, facilitating convenient patient care by medical professionals, and aiding patients in their rehabilitation and return to normal life. A noteworthy example is the work of Korayem M.H. et al. [72], who designed a high-precision remote laparoscopic surgery system based on the Leap Motion platform, enabling skilled surgeons to perform laparoscopic procedures regardless of geographical constraints. Xie Baao et al. [64] and Gu Guoying et al. [47] utilized electromyography (EMG) recognition to achieve gesture recognition from the residual limbs of disabled individuals; by accurately controlling prosthetic hands through EMG signals, they empowered patients to independently perform actions such as grasping and placing objects, significantly enhancing their quality of life (Figure 7b). Stroh Ashley et al. [73] addressed the challenges faced by individuals with conditions such as cerebral palsy or muscular dystrophy, who struggle with the precise muscle control required to operate electric wheelchairs with traditional joysticks; they devised an electric wheelchair control system based on EMG gesture recognition, improving mobility for individuals with limited muscle control. Additionally, Nourelhoda M. Mahmoud et al. [74] developed a remote patient monitoring system that tracks patients' hand movements, detects pain levels, and monitors muscle function recovery, which holds great significance for disease monitoring and the adjustment of subsequent treatment plans.
4.3. Physical Training
In sports training, hand gestures can effectively reflect an athlete's performance. By calibrating hand postures and analyzing movements, gesture recognition technology helps coaches gain a comprehensive understanding of athletes' skill levels and provide targeted training recommendations; it therefore plays a crucial role in sports. Li Guangjing et al. [75] analyzed static images and video sequences and proposed a method that uses multi-scale feature approximation to speed up hand feature extraction, providing a theoretical foundation for subsequent analysis of athletes' gesture movements. Rong Ji [76] introduced an approach based on image feature extraction and machine learning for recognizing basketball shooting gestures; by classifying hand movements during shooting, this method provides a scientific basis for training shooting techniques. Shuping Xu et al. [77] conducted similar work for table tennis, significantly improving the efficiency of analyzing match recordings of highly skilled players during training. Furthermore, gesture recognition can be applied to the professional training of referees, improving the accuracy with which they interpret rules and assess game situations in real time, thereby helping to ensure the fairness of competitions. Tse-Yu Pan et al. [78] developed a referee training system in which trainees wear MYO electromyography (EMG) gesture recognition armbands while watching pre-recorded match videos; the system provides training and corrective feedback based on the consistency between the trainee's EMG signals and the official EMG signals from the recorded videos. Paulo Trigueiros et al. [79] created an application that recognizes the main referee's hand gestures in real-time matches, assisting assistant referees and video assistant referees (VARs) in making real-time judgments on the game situation.
4.4. Other Areas
In addition to the three domains mentioned above, gesture recognition technology has found wide application in various other technical fields. As shown in Figure 7d, Wang Xin et al. [80] developed a robotic system to address issues such as low productivity, low safety, and labor shortages on construction sites; by analyzing workers' hand gestures through visual analysis, they laid the technological foundation for the subsequent development of robotic hands for construction work. Alexander Riedel et al. [81] used visual gesture recognition to analyze the hand movements of a large number of workshop workers, enabling the prediction of industry-standard production times for assembly line design and product cost estimation; through such quantitative analysis of hand gestures, accurate predictions of future data trends can be made.
Figure 7. Application of gesture recognition technology in modern production and life. (a) Gesture recognition screen of an autonomous underwater vehicle [71]. (b) Artificial hand based on EMG gesture recognition helps the disabled live normally [47]. (c) Flow chart of a referee training system based on gesture recognition [79]. (d) Robot system based on gesture recognition of construction workers [80].
5. Future Outlook
Gesture recognition-based human–computer interaction technology is an emerging field with immense potential for development. In recent years, it has attracted considerable attention as a core technology in human–computer interaction. The performance of gesture recognition technology is expected to make significant advancements in the future. In this section, we will analyze its potential directions of development in four parts.
5.1. Biocompatibility and Wearability
The wearability of gesture recognition devices will become a key focus of future development. This is due to the current technological limitations that result in the mismatch between rigid materials and flexible skin, as well as the drawbacks of large volume and weight in existing sensing devices. The advancement of new material technologies brings new possibilities to gesture recognition technology. Some materials not only exhibit good biocompatibility but also enhance the sensing performance of sensors. Combining sensing systems with lightweight materials is becoming a necessary trend in order to select appropriate materials based on the specific usage environment.
5.2. Adaptability
Virtually all gesture recognition technologies rely on machine learning or artificial intelligence algorithms to classify gestures. From numerous research findings, it is evident that the combination of different machine learning techniques and gesture recognition implementation methods leads to varying levels of accuracy. In future research, in addition to advancing more efficient algorithms, researchers should also investigate the compatibility between different technologies, algorithms, and application domains. This includes efforts to reduce usage and manufacturing costs, enhance product usability, and identify the optimal combinations.
5.3. Stability and Robustness
The current gesture recognition technology is significantly influenced by the surrounding environment. Gesture recognition in practical applications needs to take into account factors such as lighting, medium properties, and occlusions. It goes beyond the recognition of static gestures from a specific database. Improving the stability and reliability of gesture recognition technology requires mitigating the impact of variations in gesture speed and motion trajectories during movement. The next step in the research will involve optimizing both hardware and algorithms to address these challenges effectively.
5.4. Cross-Functionality
The effectiveness of gesture recognition varies in different environments depending on the chosen implementation approach. Gesture recognition technology should not be confined to a single sensing method. By combining two or more methods, such as electromagnetic wave sensing, strain sensing, electromyography (EMG) sensing, and visual sensing, or by exploring emerging sensing methods such as fiber optic sensing and acoustic sensing, the reliability and robustness of the technology can be significantly improved. The integration of multiple signals in applications will greatly enhance the overall performance of the technology.
6. Conclusions
This article comprehensively examined the principal implementation methods employed in recent years for gesture recognition-based human–computer interaction technologies. Specifically, it focused on electromagnetic wave sensing technology, strain sensing technology, EMG sensing technology, and visual sensing technology, and highlighted the latest advancements in these implementation methods and their associated technologies. Our study presented, analyzed, and discussed findings derived from an extensive review of 73 publications pertaining to this technology. Below are our findings regarding the four implementation methods.
Electromagnetic wave sensing recognition: We primarily focused on two categories of implementation methods: Wi-Fi sensing technology and radar sensing technology. We provided an overview of the disturbance model and the Doppler effect in the electromagnetic wave transmission process. Building upon this foundation, we delved into the characteristics of commonly used RSSI channel description methods and CSI channel description methods in Wi-Fi sensing technology. Furthermore, we explored various channel description methods based on Doppler information utilized in radar sensing technology. In comparison to other recognition technologies, electromagnetic wave sensing gesture recognition demonstrates relatively higher accuracy, non-contact nature, and immunity to line-of-sight and lighting conditions. It is suitable for gestures of varying sizes and shapes. However, achieving high recognition accuracy heavily relies on the quality of collected channel information, thereby imposing stringent requirements on acquisition devices and entailing higher costs. In certain cases, multiple signal-transmitting devices may be necessary to obtain highly precise results.
Strain sensing recognition: Drawing upon percolation theory and tunneling theory, the change in stress applied to sensing elements is effectively converted into variations in electrical signals to facilitate recognition. We elucidated the underlying principles of three stress-based gesture recognition techniques: piezoelectric, capacitive, and resistive methods. Furthermore, we delved into the enhancements introduced by researchers in terms of sensing layer materials, structural considerations, and pertinent algorithms for electrical signal processing. This technology demonstrates robust recognition reliability and can be effectively deployed in diverse operational environments. However, its gesture recognition capabilities are constrained, necessitating the deployment of sensor hardware on the hand, thereby influencing accuracy. Aspects such as sensor reusability, mitigation of hair growth interference, and resilience to electromagnetic interference warrant further exploration and discussion by future scholars.
EMG sensing recognition: Implemented primarily by exploiting the potential differences generated during muscle movement, EMG technology can be classified into two categories: sEMG and iEMG. A comparative analysis of their respective features reveals that the non-invasive nature of sEMG makes it more suitable for practical gesture recognition applications. Given the substantial noise inherent in EMG signals, demanding noise reduction algorithms are imperative. We explored researchers' endeavors in signal segmentation, highlighting that the integration of artificial intelligence algorithms significantly enhances the accuracy of EMG signal recognition. Nonetheless, successful implementation of this technology still requires specific muscle training, and precision correlates to a degree with muscle fatigue. Furthermore, we discussed relevant efforts in employing EMG signal-based gesture recognition for individuals with disabilities, emphasizing its potential not only for restoring natural hand movements but also for providing healthcare practitioners with invaluable insights into patients' muscle conditions, thereby aiding treatment and holding significant implications for the medical field.
Visual sensing recognition: By converting images containing human hands into color data or depth data, we explored the advancements made by researchers in various aspects of gesture recognition, including hand segmentation methods and coordinate representation techniques. With the rapid development of depth imaging technology and the introduction of devices such as Kinect, this technology has exhibited additional features such as cost-effectiveness, ease of implementation, and non-contact operation. However, it is worth noting that the literature predominantly focuses on discussing the recognition accuracy of samples from existing databases, while further investigation is needed to assess the recognition accuracy of random samples not included in the database.
Technological applications and future work: We outlined some of the transformative effects that gesture recognition technology has brought to modern life and industrial production. Furthermore, we explored the potential future directions of this technology. In the coming years, there will be an increasing demand for gesture recognition technology, accompanied by higher performance expectations. To delve deeper, our next steps will involve thorough investigations into both materials and algorithms. The development of novel materials technology has given rise to high-quality structures that are lightweight, possess high electrical conductivity, and have low Young’s modulus. The maturation of 3D printing has made it possible to fabricate sensing layers with microstructured surfaces, enabling sensors to achieve broader detection ranges and heightened sensitivity. The advent of unsupervised learning methods, such as deep learning and cluster learning, allows for the formation of models even when target data lack labels. This provides a means of data fitting for emerging gesture recognition technologies, including fiber optic sensing recognition and acoustic sensing recognition, which deal with complex signals. Based on these insights, we can anticipate the emergence of gesture recognition systems that are highly wearable, exhibit exceptional stability, and demonstrate robustness.