1. Introduction
Unmanned aerial vehicles (UAVs) have dozens of applications in areas such as transport, wildlife monitoring, healthcare, and the military. One area where they have become particularly useful is search and rescue operations [2]. Rescuers use them to try to spot victims from above. In such a case, the popular hand-held controller might not be the method of choice: when moving in difficult terrain, hands-free equipment is preferred.
One way to enable an operator to control a drone while walking is a gesture recognition system. A small device placed on a hand, in the form of a glove, ring, or similar, would send control signals to the vehicle, provided the gestures are recognized quickly and effectively. Such a system would also have plenty of applications besides unmanned aerial vehicle control, including assistance for disabled people [4] or gaming and virtual reality [
The main goal of the research presented in this article was to find out whether a simple hand gesture recognition system might be implemented on a microcontroller unit. Three machine learning algorithms were inspected for this purpose: neural networks, Support Vector Machine (SVM), and Random Forest. As mentioned above, the developed solutions were planned to be implemented in a microcontroller-based system for controlling a UAV with the use of hand gestures. The final device, planned to be developed in future works, should allow the user to specify their own gesture classes; for the user’s convenience, the algorithm should therefore not need a lot of input data.
Recent works related to gesture recognition systems include the research presented in [6]. This paper introduced an Accelerometer-Based System, utilizing an ARM processor, five accelerometers located on fingers, a wireless communications module, and the LABVIEW software. The system was aimed at robotic gripper movement through gestures. A different approach to the development of a gesture recognition system was presented in [7], where a Computer Vision System was used for this purpose. The proposed system is capable of tracking both static and dynamic hand gestures. The detected gestures allow for opening websites, launching applications like VLC Player and PowerPoint, and switching the slides in a presentation. An approach based on an Arduino and machine learning implemented using Support Vector Machines (SVMs) for hand gesture classification was introduced in [8]. In the proposed system, hand movements were captured with the use of a sensor combining a three-axis accelerometer and a gyroscope, and also an Internet of Things (IoT) device providing six-axis readings. Measured data were sent to a computer using Bluetooth technology. The hand gesture recognition system was developed for use by people with disabilities. In [9], approaches based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) were applied for hand gesture recognition. An overview of hand gesture recognition systems can be found in [
The rest of this paper is organized as follows. Section 2 describes the developed gesture recognition system. Section 3 describes the data gathering and the model training processes. Section 4 presents the tests performed with the use of the developed gesture recognition system and discusses obtained results. The presented research is summarized in Section 5.
2. Materials and Methods
2.1. Principle of Operation
In order to build a gesture recognition system, two vital aspects must be discussed: how the hand gestures are captured and how the captured data are recognized as specific gestures.
There are many possible ways of capturing hand gestures, including Computer Vision, electromyography, or Time-of-Flight cameras. However, in this research, an inertial measurement unit (IMU) was chosen as the most suitable method. It does not need any external objects to refer to, is small and lightweight, and can thus be placed right on a hand or a finger [11]. This sensor provides readings from an accelerometer and a gyroscope over time, which give an idea of how the hand moves and turns in the air.
The gesture is then stored in a digital form that can be processed, but it would be very challenging to hand-craft an algorithm that distinguishes between specific movements. Thanks to machine learning, this is not necessary. Instead, a sufficient amount of data must be collected. Then, a neural network is trained and deployed on the device, performing classification afterwards [12]. The training requires an external computer; only the trained model is transferred to the microcontroller [13]. The details of the chosen model are discussed in the following part of the paper.
2.2. Machine Learning on Microcontrollers
Machine learning algorithms, especially deep learning models, such as neural networks, are complex and require significant computing power [14]. The first version of the TensorFlow library, used in this research, was released by Google in 2015. Later, TensorFlow Lite was introduced, following the trend toward miniaturization. This version supports popular operating systems for smartphones, such as iOS and Android, as well as Raspberry Pi platforms. A few years ago, Google announced TensorFlow Lite for Microcontrollers with a spot listing possible applications such as tracking motion in sports, maintaining crops, or running underwater vehicles. The spot also emphasizes low power consumption and low cost [
Compared with single-board computers, microcontroller units (MCUs) are superior in a few aspects. They boot up almost instantly, so the user can turn the device on and off as often as needed and save even more energy. Cutting off power does not pose any problems, whereas a single-board computer takes time to shut down safely. MCUs are also usually smaller and more lightweight.
Apart from lower clock speeds, machine learning on MCUs has other significant disadvantages, such as limited memory, which may not be able to accommodate more complex models. Language support is also restricted: C/C++ is the language of choice for embedded systems, while many machine learning engineers and data analysts prefer to work with Python.
2.3. Choice of Components
The developed gesture recognition system is composed of the following elements:
Central unit: an MCU—Raspberry Pi Pico;
Input: an IMU sensor—Adafruit LSM6DS3TR-C;
Power supply: 3.7 V lithium polymer battery;
Output: radio communication module—there is a Pico version with Wi-Fi/Bluetooth available, and while it does not offer a long range, it offers enough to evaluate the project.
The IMU provides accelerometer and gyroscope readings, which are processed by the MCU using a machine learning model. Then, a radio module informs the UAV which gesture was performed by the user. These components are small enough to be incorporated into a glove or something similar, thus allowing the user to operate the system hands-free.
2.4. The Sensor
The IMU is equipped with an accelerometer, a gyroscope, and a magnetometer. Incorporating these three sensors into one results in a 9-Degrees-of-Freedom sensor. The number 9 is obtained by multiplying the number of sensors by the number of axes along which the sensors measure quantities. These are the orthogonal X, Y, and Z axes of a classic Cartesian coordinate system. Each of the three sensors measures a different quantity; thus, there are nine values that describe the motion and orientation of the IMU and, when it is attached to the hand, of the hand as well. In order to explain how the IMU carries out the measurements, the principle of operation for each sensor is briefly described below.
Magnetometer: it measures the strength and direction of the magnetic field and can work based on either electromagnetic induction or the Hall effect. In the first case, the magnetic field induces voltage in a coil according to Faraday’s law. The Hall effect is the occurrence of voltage across an electrical conductor when exposed to a magnetic field.
Accelerometer: it measures linear acceleration and works in accordance with the law of inertia, while using microelectromechanical system (MEMS) technology. These are tiny mechanical structures, usually based on silicon, combined with microelectronic circuits. In such a sensor, a displacement of the proof mass is measured and converted to an electric signal.
Gyroscope: it measures angular velocity, works in line with the Coriolis effect, and also utilizes MEMS technology. Such a small gyroscope is likely to be of the vibrating structure type: a vibrating object continues to vibrate in the same plane, even if its support changes position. Vibrating structure gyroscopes tend to be cheaper and less complex than rotating gyroscopes, which utilize a spinning wheel in a gimbal and are applied in compasses on ships.
The IMU used in the experiments was the Adafruit LSM6DS3TR-C module from Adafruit Industries, New York, NY, USA, presented below in Figure 1 next to a 10 eurocent coin for size reference. It is small enough to be placed on a palm or a finger, and even smaller models are available.
The sensors included in the IMU are prone to errors, such as the following:
Magnetic interference: magnetometer (although not used at this stage) readings might be affected by magnetic fields generated by circuits or ferrous materials. In order to avoid this, the magnetometer to be used must be placed far from sources of such interference.
Offset error: there might be an offset in the sensor’s output, which may affect further calculations. This problem should be addressed by the manufacturer, and the sensor should be calibrated using the provided software.
Gyroscope drift: this is the gradual deviation of the gyroscope’s reading. It cannot really be avoided in the long term, but choosing a high-quality product may help, as this phenomenon depends on the mechanical design.
As described above, some of these errors can easily be compensated for at the software level, while others require careful construction of the system.
3. Results
3.1. Gathering and Preprocessing the Data
Each gesture must be performed a number of times for the neural network to work. While defining desired classes of gestures, it is important to specify how much time it takes to perform them and derive the number of readings and the interval of each recording. For instance, 150 readings taken at an interval of three milliseconds describe a gesture lasting 0.45 s, approximately half a second, which seems consistent with how long such gestures take in practice. The output for each class of gestures can be saved as a .csv file for convenience.
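As a rough illustration of the arithmetic above, the sketch below derives the recording duration from the number of readings and the sampling interval and serializes one recording as CSV text. The column names follow the axis labels used in the figures (aX to gZ); the function names, the six-column layout, and the sample values are assumptions made for this example and do not reproduce the logging code used in the research.

```python
import csv
import io

READINGS_PER_GESTURE = 150   # number of readings in one recording
INTERVAL_MS = 3              # sampling interval in milliseconds


def recording_duration_s(readings=READINGS_PER_GESTURE, interval_ms=INTERVAL_MS):
    """Return the length of one recording in seconds."""
    return readings * interval_ms / 1000


def recording_to_csv(samples):
    """Serialize a list of (aX, aY, aZ, gX, gY, gZ) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["aX", "aY", "aZ", "gX", "gY", "gZ"])
    writer.writerows(samples)
    return buffer.getvalue()


# Hypothetical recording: the hand at rest, palm parallel to the ground.
samples = [(0.0, 0.0, 1.0, 0.0, 0.0, 0.0)] * READINGS_PER_GESTURE
csv_text = recording_to_csv(samples)
duration = recording_duration_s()
```

Keeping the number of readings and the interval as named constants makes it easy to check that a longer gesture still fits into one recording.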
Preprocessing the data is crucial for any machine learning project to work properly. Gesture recordings that are not representative should be removed from the training set; otherwise, the model might perform poorly [16]. After preprocessing, the data are ready to be used for training the model. A single gesture recording is presented below in Figure 2 for accelerometer data and in Figure 3 for gyroscope data.
The user kept a palm parallel to the ground and moved it slightly to the left and to the right, twice. The “aY” yellow plot in the first figure presents this move clearly. The “gX” and “gY” gyroscope readings may look similar to those of “aY”, as the user, perhaps unintentionally, moved their palm in a slightly circular manner.
Accelerometer readings are affected by gravity, as clearly presented in Figure 2: the green aZ reading has an offset of 1 g (one unit of gravitational acceleration). This happened because the user’s hand and the IMU were parallel to the ground, so the z axis was the only one affected by gravity. This situation, however, may change as the user moves their hand. If there is a need to distinguish between similar gestures performed at different angles to the ground, there exist ways to remove the gravity component [17]. Sensor fusion algorithms can be applied for improved accuracy; these would provide noise reduction and reduced drift [
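One common way to separate gravity from hand motion, shown here only as a minimal sketch, is to track the slowly changing gravity vector with a low-pass filter and subtract it from each reading. This is not necessarily the method of the cited work; the function name and the smoothing factor are assumptions chosen for illustration.

```python
def remove_gravity(samples, alpha=0.9):
    """Subtract a low-pass gravity estimate from accelerometer samples.

    samples: list of (aX, aY, aZ) readings in units of g.
    alpha:   smoothing factor of the low-pass filter (assumed value).
    Returns the motion part of every sample with the gravity estimate removed.
    """
    gravity = list(samples[0])  # start from the first reading
    linear = []
    for sample in samples:
        # The exponential moving average follows the slow gravity component.
        gravity = [alpha * g + (1.0 - alpha) * a for g, a in zip(gravity, sample)]
        linear.append(tuple(a - g for a, g in zip(sample, gravity)))
    return linear


# Hand at rest with the z axis pointing up: only gravity is measured.
at_rest = [(0.0, 0.0, 1.0)] * 50
motion = remove_gravity(at_rest)
```

For a hand at rest, the remaining motion component stays close to zero on all three axes, whereas quick movements pass through the filter largely unchanged.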
3.2. Splitting Data for Training
Inputs must be evenly distributed among the training, testing, and validation sets. The testing set contains data that have not been processed by the neural network during training, and it is used for assessing the performance of the network. The validation set, on the other hand, helps to prevent overfitting, a phenomenon in which the neural network simply memorizes the data instead of learning the patterns. The data are first split into testing data and the remainder; then, the remainder is divided into validation and training data. It is common to dedicate 10–20% of data for testing and another 10–20% for validation. As a result, there were three disjoint sets: randomly chosen recordings were moved to the testing and validation sets and excluded from the training input, which is vital for proper model training.
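The split described above can be sketched in a few lines of plain Python. The function name, the 15% fractions, and the fixed random seed are assumptions for this example; any values within the 10–20% ranges mentioned above would work the same way.

```python
import random


def split_dataset(recordings, test_fraction=0.15, validation_fraction=0.15, seed=42):
    """Shuffle recordings and split them into disjoint train/validation/test lists."""
    shuffled = list(recordings)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


# Integers stand in for 100 gesture recordings.
recordings = list(range(100))
train, validation, test = split_dataset(recordings)
```

Because the three lists are slices of one shuffled list, no recording can appear in more than one set.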
3.3. Training the Model
Python 3.13.0, being the language of choice for such projects, was used to train a TensorFlow/Keras 2.15.0.post1 model. For a rather easy machine learning task, a sequential interface was used, which is the simplest of all interfaces offered by Keras. It consists of layers of neurons connected sequentially, without any recursion. The first layer must match the input shape and size, while the last one is equal in size to the number of classes of gestures. Everything in between is a matter of trial and error, depending on the number of classes, similarity between the gestures, and available memory of the device as well.
The design is based on models aimed at solving similar tasks. The first layer is of an InputLayer type, and an input shape parameter is specified. The next two layers are of a Dense type and use the Rectified Linear Unit (ReLU) activation function. The size of the last layer is set to the number of gesture classes, and it uses the Softmax activation function, which is often used in such cases.
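To make the layer layout concrete, the sketch below reproduces the forward pass of such a network in plain Python rather than in Keras. The layer widths, the input size of 150 readings times six values, and the random placeholder weights are assumptions; in practice the weights result from training, and the actual model was built with the Keras sequential interface.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Rectified Linear Unit: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Untrained placeholder weights; real values come from training."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


INPUT_SIZE = 150 * 6          # assumed: 150 readings of aX, aY, aZ, gX, gY, gZ
HIDDEN_1, HIDDEN_2 = 16, 8    # assumed layer widths
NUM_GESTURES = 3              # twirling, shaking, pointing

rng = random.Random(0)
layers = [random_layer(INPUT_SIZE, HIDDEN_1, rng),
          random_layer(HIDDEN_1, HIDDEN_2, rng),
          random_layer(HIDDEN_2, NUM_GESTURES, rng)]


def predict(recording):
    """Forward pass: Dense + ReLU, Dense + ReLU, Dense + Softmax."""
    hidden = relu(dense(recording, *layers[0]))
    hidden = relu(dense(hidden, *layers[1]))
    return softmax(dense(hidden, *layers[2]))


probabilities = predict([0.0] * INPUT_SIZE)
```

The output always contains one probability per gesture class, which is why the size of the last layer must equal the number of classes.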
There are three components to consider: a loss function, an optimizer function, and metrics. The loss function, known also as the cost function, measures how far the model’s predictions are from the actual values. Sparse categorical cross-entropy is used in multi-class classification in which each input recording belongs to exactly one class and the labels are stored as integer class indices. If the labels are one-hot encoded instead, categorical cross-entropy is the corresponding choice. In both cases, the Softmax output provides a probability for each class, and the recording is assigned to the most probable one. These probabilities would also be useful if the user performs gestures that should be labeled as unrecognized.
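A minimal sketch of how such a loss can be computed is given below, assuming Softmax outputs and integer class labels. The one-hot variant is included only to show that both forms yield the same value for the same data; the function names and the example numbers are illustrative.

```python
import math

EPS = 1e-12  # guards against log(0)


def sparse_categorical_cross_entropy(probabilities, labels):
    """Mean negative log-probability of the true class (integer labels)."""
    losses = [-math.log(max(p[label], EPS)) for p, label in zip(probabilities, labels)]
    return sum(losses) / len(losses)


def categorical_cross_entropy(probabilities, one_hot_labels):
    """The same loss for one-hot encoded labels."""
    losses = [-sum(t * math.log(max(q, EPS)) for q, t in zip(p, target))
              for p, target in zip(probabilities, one_hot_labels)]
    return sum(losses) / len(losses)


# Two hypothetical Softmax outputs for a three-class problem.
outputs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
sparse_loss = sparse_categorical_cross_entropy(outputs, [0, 1])
dense_loss = categorical_cross_entropy(outputs, [[1, 0, 0], [0, 1, 0]])
```

The loss shrinks toward zero as the probability assigned to the correct class approaches one.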
The optimizer function is used to train (adjust) the weights and biases of the model. RMSProp is one of the preferred algorithms for such tasks, but many others, like Adam, are available, and again, the most suitable one can be chosen by trial and error. The metrics parameter was set to accuracy, which provides the ratio of correctly classified samples to all samples.
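For orientation, the sketch below shows a single RMSProp update and the accuracy metric in plain Python. It is a simplified illustration, not the Keras implementation; the hyperparameter values and the toy objective are assumptions made for this example.

```python
import math


def rmsprop_step(weights, gradients, cache, learning_rate=0.01, rho=0.9, eps=1e-7):
    """One RMSProp update: scale each step by a running average of squared gradients."""
    new_cache = [rho * c + (1.0 - rho) * g * g for c, g in zip(cache, gradients)]
    new_weights = [w - learning_rate * g / (math.sqrt(c) + eps)
                   for w, g, c in zip(weights, gradients, new_cache)]
    return new_weights, new_cache


def accuracy(predicted, actual):
    """Ratio of correctly classified samples to all samples."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy objective f(w) = w^2 with gradient 2w, starting from w = 1.
weights, cache = [1.0], [0.0]
for _ in range(50):
    gradients = [2.0 * w for w in weights]
    weights, cache = rmsprop_step(weights, gradients, cache)

score = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
```

In the toy example the weight moves steadily toward the minimum at zero, while the accuracy of the four hypothetical predictions is three out of four.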
3.4. Training Feedback
An epoch is a single pass through the entire training dataset by the algorithm. Each epoch is characterized by four parameters: loss, metrics, val_loss, and val_metrics. These are defined as follows:
Loss: the training loss of the model. The function used to perform this task was specified as a parameter for compilation.
Metrics: the metrics were specified as a parameter for compilation; they provide additional insight into the performance.
val_loss: the validation loss is essentially the same as loss but calculated for the validation set rather than for training. It is helpful in preventing overfitting.
val_metrics: this denotes the metrics for the validation set. Likewise, it is used to prevent overfitting.
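The goal is to minimize loss and val_loss and, with accuracy chosen as the metric, to bring metrics and val_metrics as close to 1 as possible. The closer the losses get to 0 and the accuracies to 1, the better the performance of the model. If some of these parameters do not converge, the following steps may be undertaken: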
Model architecture adjustment: adding more layers, increasing the number of neurons, or changing activation functions;
Parameter optimization: choosing different loss or optimizer functions;
Providing more input data or working on preprocessing.
The progress may be plotted to see whether the values converge, as presented below in Figure 4 and Figure 5.
Such plots provide a lot of feedback. Different shapes might indicate cases of overfitting, underfitting, bad learning rate, etc.; however, thorough testing may uncover issues that cannot be diagnosed with a plot.
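Part of this feedback can also be checked numerically. The sketch below flags a typical overfitting pattern, in which the validation loss stops improving while the training loss keeps falling. The loss values, the function names, and the patience of three epochs are assumptions made for this example.

```python
def best_epoch(val_loss):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)


def looks_overfitted(loss, val_loss, patience=3):
    """True if val_loss has not improved for `patience` epochs while loss kept falling."""
    best = best_epoch(val_loss)
    stalled = len(val_loss) - 1 - best >= patience
    still_learning = loss[-1] < loss[best]
    return stalled and still_learning


# Hypothetical per-epoch values, shaped like a training history.
history = {
    "loss":     [1.10, 0.70, 0.45, 0.30, 0.20, 0.12, 0.08],
    "val_loss": [1.15, 0.80, 0.60, 0.55, 0.58, 0.63, 0.70],
}
best = best_epoch(history["val_loss"])
overfitted = looks_overfitted(history["loss"], history["val_loss"])
```

In this example the validation loss is lowest at the fourth epoch (index 3) and rises afterwards, so training beyond that point mostly memorizes the training set.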
4. Discussion
4.1. Gesture Descriptions
To validate the effectiveness of the IMU-based gesture recognition system, a series of controlled experiments was conducted, focusing on three specific gestures: twirling, shaking, and pointing.
4.1.1. Twirling
Twirling involves making a circular motion with the hand or wrist, similar to holding a pen or a small object and rotating it along a circular path. This gesture consists of smooth, continuous motions, unlike shaking, and can involve different speeds and angles. The user’s hand points upward, performs two or three circular movements in the same direction, and ends in the starting position, as presented in Figure 6 below.
4.1.2. Shaking
Shaking refers to rapidly moving the hand back and forth or side to side. For instance, when someone shakes their hand to indicate “no” or shakes a container to mix its contents, this gesture involves quick and abrupt movements. This gesture challenges the model to distinguish between actual motion and the noise. The user’s hand points upward and then quickly moves alternately to the left and to the right a few times and ends in the starting position, as presented in Figure 7 below.
4.1.3. Pointing
Pointing is the act of extending a finger or hand in a specific direction, often to indicate an object or location. For example, when one points at something to draw attention to it, they might extend their index finger while keeping the rest of their fingers curled. This gesture typically involves less movement than twirling or shaking but requires precision in recognizing the direction and position. The user’s hand points up and then rather quickly changes position in order to point more or less to the horizon (with the hand parallel to the ground), as presented in Figure 8 below.
4.2. The System
Pointing, along with a magnetometer reading, may indicate the direction that the controlled UAV should follow. Two other gestures may be assigned to command the UAV to change altitude or perform an emergency landing. These three predefined gesture classes should be enough to design a proper, although not sophisticated, UAV control system. More gestures may be introduced; however, they would probably require a more complex model to work. Thus, more computing power would be necessary, and a question arises as to whether an MCU would handle these tasks effectively.
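As an illustration only, the assignment of gestures to UAV commands could be sketched as a lookup table applied to the classifier output. The command names, the order of the classes, the pairing of twirling and shaking with particular commands, and the confidence threshold are all assumptions made for this example; the text above does not fix them.

```python
# Hypothetical class order of the classifier output.
GESTURES = ["twirling", "shaking", "pointing"]

# Hypothetical assignment; only the kinds of commands are named in the text.
COMMANDS = {
    "pointing": "FOLLOW_HEADING",
    "twirling": "CHANGE_ALTITUDE",
    "shaking": "EMERGENCY_LANDING",
}


def select_command(probabilities, threshold=0.8):
    """Return the command for the most probable gesture, or None if unsure."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < threshold:
        return None  # treat the movement as unrecognized
    return COMMANDS[GESTURES[best]]


command = select_command([0.05, 0.05, 0.90])
```

Returning no command for low-confidence outputs is one simple way to avoid reacting to movements that were not meant as gestures.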
4.3. Metrics for Performance Evaluation
Each gesture was performed multiple times in a consistent manner to ensure that variations in execution were minimized. This approach allowed for a more precise evaluation of the model’s capability to recognize these gestures.
To assess the system’s performance for each gesture, key metrics were used, including accuracy, precision, recall, and F1 score, defined as follows:
Accuracy measures the proportion of correctly identified gestures relative to the total number of gestures performed. This metric provides an overall indicator of how well the model is performing, but it does not show for which gestures the model fails.
Precision assesses the percentage of true positive predictions relative to all instances classified as positive. Precision indicates how many of the recognized gestures are actually correct, helping to identify false positives, where the model might mistakenly classify a different gesture as twirling.
Recall, sometimes called sensitivity as well, calculates the ratio of true positive predictions to the actual positive instances. This parameter indicates how many times the model successfully identified a certain gesture out of all actual instances of this gesture. This metric is particularly important when the goal is to minimize false negatives, ensuring that all actual occurrences of a gesture are detected.
The F1 score is the harmonic mean of precision and recall. It provides a single metric that combines the two, making it useful for situations where there is a trade-off between precision and recall. For gesture recognition, a high F1 score indicates that the model is at the same time accurately identifying gestures and not missing them.
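The definitions above can be summarized in a short sketch that computes precision, recall, and the F1 score (the harmonic mean of the two) for one gesture class. The label lists are hypothetical examples and not results from the experiments.

```python
def per_class_metrics(actual, predicted, positive):
    """Return (precision, recall, f1) for the class named `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels for six recordings.
actual = ["twirling", "twirling", "shaking", "shaking", "pointing", "pointing"]
predicted = ["twirling", "shaking", "shaking", "shaking", "pointing", "twirling"]

twirling_scores = per_class_metrics(actual, predicted, "twirling")
shaking_scores = per_class_metrics(actual, predicted, "shaking")
```

In this example, every shaking gesture is found (recall of 1), but one twirling gesture is mistaken for shaking, which lowers the precision for shaking to two thirds.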
Since only one user is performing the gestures, variability typically introduced by different users was controlled, allowing us to focus solely on the model’s response to the gestures themselves.
4.4. Factors Affecting Performance
Several factors were examined during the testing process, including sensor noise, gesture speed, and execution consistency. Understanding these factors is essential for interpreting the performance:
Sensor Noise: This term describes unwanted variations in the IMU data caused by environmental conditions or sensor inaccuracies (which usually appear after the sensor has been used for some time). High sensor noise can significantly affect the model’s accuracy, especially for rapid gestures like shaking, where the system must differentiate between actual movements and the unwanted noise.
Gesture Speed: The speed at which a gesture is performed can also impact recognition accuracy. Faster gestures may generate high-frequency signals that challenge the model’s ability to capture data accurately, potentially leading to misclassification. Slower gestures may be less prone to error but could be confused with other static or slow movements, which should not affect the classification.
Execution Consistency: Even with a single user, slight variations in how each gesture is performed can affect the model’s ability to recognize them accurately. Consistency in the amplitude, angle, and duration of each gesture is crucial for the model to work; deviations could introduce problems that the model may not be able to resolve.
4.5. Data Collection Strategy
For each gesture, the following steps were carried out:
Performing each gesture: Execute each gesture for a predefined number of trials (e.g., 50 trials per gesture) at a consistent speed and amplitude.
Data logging: Collect IMU data at each trial: accelerometer and gyroscope readings.
Model evaluation: After each set of trials, evaluate the model’s performance using the predefined metrics (one of those listed above).
Factor analysis: Assess the impact of sensor noise, gesture speed, and execution consistency on the overall performance.
In order to improve the model and increase the overall success rate, a few methods may be considered:
Providing more input data for training;
Including the magnetometer;
Including a second IMU;
Using sensor fusion algorithms.
4.6. Model Evaluation
There are other algorithms, besides neural networks, that may be able to perform the given task, including Support Vector Machine (SVM) and Random Forest.
These (and other) algorithms, optimized for microcontroller units, are available in the EloquentTinyML library. SVM and Random Forest models were trained with the same data as the neural network model, and their results are compared below.
The first user performed gestures according to the descriptions, collecting approximately 50–70 instances for each class. This dataset was used to train three models: neural network, SVM and Random Forest. The results are presented in Table 1, Table 2 and Table 3.
The rows presenting best F1 scores are colored green, while the worst scores are marked with red. In this case, the best scores were achieved by the Random Forest model. The neural network and SVM models achieved worse results.
After this evaluation, another user was asked to perform the gestures as described above, and a second training set was created in this way. The results are presented in Table 4, Table 5 and Table 6.
Random Forest turned out to be the most effective of the three applied methods again. The remaining two models, however, performed better than in the previous experiment: the lowest value achieved with the second dataset was 0.92 (recall for pointing, in both models). The differences between the results for the two datasets are presented in Table 7, Table 8 and Table 9.
The smallest differences were achieved by the Random Forest model. These did not exceed 0.02 (precision for pointing and shaking), as this model turned out to perform well with both datasets. For SVM, however, significant differences were observed, up to 0.16 (precision for pointing). Differences in the case of the neural network model were moderate, with 0.08 being the largest (precision for twirling). More research must be carried out to understand the different results produced by the models.
The three algorithms performed well overall and achieved scores that were generally above 0.8, indicating successful gesture recognition. Random Forest, with the lowest score of 0.96 (recall for pointing, first user), turned out to be the most effective. The differences in the classification results were also the lowest for Random Forest.
5. Conclusions
An effective gesture recognition system was designed, built, and programmed. The overall success rate is acceptable, and potential methods for improvement are available.
In order to improve the model and increase the overall success rate, a few methods may be considered:
Providing more input for training;
Including a magnetometer;
Including a second IMU;
Using sensor fusion algorithms.
The central unit works well, and the output is delivered almost instantly. The possibility to deploy machine learning models on microcontrollers brings new opportunities. These devices are usually small, lightweight, and energy-efficient. While they are not superior in terms of clock speed and have limited memory, they may be capable enough to perform simple machine learning tasks, which may find many applications.