1. Introduction
Gestures are key elements in interaction, understanding, and communication with machines. In situations where other communication channels fail, such as speech recognition under environmental noise, gesture processing has proven to be a valid strategy, with the additional benefit that it does not require extra elements or components to acquire additional data.
Gestures can be defined as natural movements made by humans, and they present many variations: the same person performs the same gesture differently across repetitions, and different people perform it differently as well. In addition, when images are used, variable environmental conditions add further challenges to the gesture detection process.
A gesture recognition system must also be flexible enough to recognize newly defined gestures and must support training them; because of these requirements, complex processes such as motion modelling or pattern recognition are needed. Early efforts on gesture recognition started in 1993, when techniques were adapted from other fields such as speech or handwriting recognition. In this way, Darrell and Pentland [1] adapted Dynamic Time Warping (DTW) to recognize dynamic gestures. Later, Starner et al. [2] proposed classifying the orientation, shape, and trajectory information of the gesture using Hidden Markov Models (HMMs).
In specific gesture-detection applications, different wearable sensors have been developed, such as gyroscopes or accelerometers. In [
3], the authors use magnetic and inertial sensors placed in a glove to capture the movements of the hand and arm. In [
4], a stretch-sensing soft glove is presented for interactive hand-pose capture. This method achieves high accuracy without any external optical system, and the authors also demonstrate how to build a calibrated low-cost version using components available in most manufacturing labs. On the other hand, the use of image-processing algorithms for gesture and posture recognition has become possible with 3D cameras, sensors providing point clouds, or conventional color images. In [
5], segmented skeleton joints are combined with a precise segmentation of depth images to extract gesture information, while in [
6], the authors use color segmentation in 2D images to detect the hands, head, or tags. Ge et al. [
7,
8] introduce a method for real-time hand pose estimation using three-dimensional Convolutional Neural Networks (CNNs). This approach builds a 3D volumetric representation of the hand from a depth image, extracts three-dimensional features that capture the 3D spatial structure of the hand, and obtains the 3D hand pose in a single pass. Similarly, the Google MediaPipe framework uses deep learning to track and detect gestures performed with both hands [
9].
There are some methods that use a random forest algorithm [
10,
11] or autoencoders [
12] to estimate the structure of the hand. More recently, some works have adopted more computationally efficient methods based on these hierarchical structures [
13].
In recent years, several works have tried to detect gestures by locating the hands and head of the user. In [
14], the head and hands of the users are tracked in three-dimensional space. Similarly, in [
15], color segmentation in 2D images is used to detect the user's hands. In [
16], hand detection is performed by combining both approaches.
Another field of interest in this research area is sign language recognition, a sub-field of communicative gestures. Since this type of language is highly structured, it is frequently addressed with computer vision algorithms. As an example, Cheok et al. [
17] proposed an alignment network with iterative optimization to detect specific hand gestures for weakly supervised continuous sign language recognition. This solution combines two approaches: a 3D convolutional residual network for feature learning and an encoder-decoder network for sequence modelling. In addition, the works presented in [
18,
19] provide complete reviews of the different algorithms and specific approaches intended for gesture detection applications.
As discussed above, deep learning algorithms represent the latest major advancement in the artificial intelligence and pattern recognition fields. These techniques build highly complex neural network architectures with the objective of replicating the inference and abstraction capacities of human beings in varied tasks such as object recognition, scene interpretation, or text and speech recognition and generation.
AlexNet, the Convolutional Neural Network that popularized deep learning for image processing tasks, was presented by the research group led by Geoffrey E. Hinton at the University of Toronto. With it, they won the ImageNet Large Scale Visual Recognition Challenge [
20].
In this sense, CNNs have become one of the best solutions for intricate image processing problems across a great variety of application fields [
21]. Nowadays, many different architectures for image processing tasks are available, such as GoogLeNet [
22] developed by Google, VGG [
23], ResNet [
24] engineered by Microsoft, or RCNNs (Region-based CNNs).
The latter is the most common CNN-based approach for detecting objects in images. More concretely, object detection approaches can be grouped into region proposal-based methods, such as RCNN and its derived algorithms, SPP-net, or FPN, and regression/classification methods, such as YOLO, MultiBox, G-CNN, or AttentionNet [
25].
Region proposal methods include multiple internal steps, such as proposal generation, feature extraction, classification, and bounding box regression. The RCNN-based models are examples of this approach [
26]. This family has three main evolutions aimed at optimum object detection results: RCNN, Fast RCNN, and Faster RCNN.
The original RCNN approach first generates region proposals using image processing algorithms such as Edge Boxes [
27]. In the paper defining the RCNN architecture, the authors use approximately 2000 regions per image. Each of the generated regions is cropped and resized to be processed by a CNN. In the last step, the proposed bounding boxes are scored by support vector machines (SVMs) previously trained on the CNN features and then refined by regression. A schema of this system can be observed in
Figure 1.
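To make the per-region processing of RCNN concrete, the following minimal Python/NumPy sketch illustrates the crop-and-warp step applied to each proposal before it is fed to the CNN. The image, the proposal list, and the output size are placeholders for illustration only, not values from the original pipeline.

```python
import numpy as np

def warp_region(image, box, out_size=224):
    """Crop one proposal (x, y, w, h) and warp it to a fixed square input
    using nearest-neighbour resampling, as RCNN does before the CNN pass."""
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    rows = np.linspace(0, h - 1, out_size).astype(int)
    cols = np.linspace(0, w - 1, out_size).astype(int)
    return crop[rows][:, cols]

# Hypothetical usage: every proposal is warped and processed independently,
# so overlapping proposals are re-computed one by one.
image = np.random.rand(720, 1280, 3)                      # stand-in for a frame
proposals = [(100, 50, 200, 300), (400, 120, 150, 150)]   # (x, y, w, h) boxes
warped = [warp_region(image, b) for b in proposals]
print([w.shape for w in warped])                          # each is (224, 224, 3)
```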
A refinement of the RCNN detector is the Fast-RCNN detector [
28].
Figure 2 shows the new detector architecture. The principal difference with respect to the basic architecture is that, after the region proposal step, instead of cropping and resizing each proposal, the Fast-RCNN detector processes the complete image with the CNN once and then pools the features of the proposed regions from the resulting feature map. This makes Fast-RCNN more efficient, because there are no redundant computations in overlapping regions.
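A minimal NumPy sketch of the region-of-interest (RoI) max-pooling idea behind Fast-RCNN is shown below; it assumes a feature map already computed once for the whole image, and the bin partitioning is simplified with respect to real implementations.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one RoI (x, y, w, h), given in feature-map coordinates,
    into a fixed output grid. Simplified illustration of RoI pooling."""
    x, y, w, h = roi
    region = feature_map[y:y + h, x:x + w]            # shared features, no re-computation
    out_h, out_w = output_size
    ys = np.linspace(0, h, out_h + 1).astype(int)     # bin edges along each axis
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.zeros((out_h, out_w, feature_map.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            bin_ = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = bin_.max(axis=(0, 1))
    return pooled

feature_map = np.random.rand(45, 80, 256)             # one forward pass for the full image
print(roi_max_pool(feature_map, (10, 5, 20, 16)).shape)  # (7, 7, 256)
```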
The most recent evolution of the RCNN algorithm is the Faster-RCNN [
29], shown in
Figure 3. In this case, a region proposal network (RPN) generates the proposals directly, so the proposal mechanism is learned from the training data, offering both faster and better results.
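As an illustration of how a region proposal network operates, the short sketch below generates the dense grid of anchor boxes that an RPN scores and refines; the stride, scales, and aspect ratios are illustrative values, not the configuration used in this work.

```python
import numpy as np

def make_anchors(fmap_h, fmap_w, stride=16,
                 scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Generate (x_center, y_center, w, h) anchors for every feature-map cell.
    The RPN then predicts an objectness score and box offsets per anchor."""
    anchors = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            cx, cy = j * stride + stride / 2, i * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w, h = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(make_anchors(45, 80).shape)   # (45 * 80 * 9, 4) candidate boxes
```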
We selected the different variants of the RCNN object detection algorithm in this work because their structure relies on an internal CNN backbone that can be designed according to the available computation capability, making this approach fully adaptable to limited hardware.
Section 2 exposes the precise use case presented in this article.
Section 3 explains the dataset used for the training stage of the proposed solution.
Section 4 explains the proposed approach in this paper in terms of network architecture.
Section 5 presents the training procedure of the different RCNN alternatives, and in
Section 6, the results obtained after the evaluation tests are presented. Finally, the last section,
Section 7, summarizes the conclusions obtained from this research, highlighting the strong points of the proposed method.
Main Motivation and Contributions
The main motivation of this work is to design an effective CNN architecture that can be deployed on computationally limited devices, offering real-time gesture detection capabilities despite the previously cited limitations.
Following this motivation, this article analyzes the performance of the different RCNN approaches for the detection of hand gestures in real time, with the final objective of obtaining a lightweight system that not only fits on a desktop PC, but can also run on IoT (Internet of Things) devices such as the Google Coral or NVIDIA Jetson platforms, providing a real-time response.
After testing different existing networks, the designed CNN presents a very small architecture, both in terms of number of layers and total number of neurons, while offering real-time capabilities and a gesture detection rate of almost 97%, improving on state-of-the-art results. This is the main contribution of this work.
2. Problem Statement
One of the biggest problems of Deep Learning techniques in general, and CNNs in particular, is their high computational requirements. Several researchers have pointed out this problem [30,31].
When facing a problem that requires a short response time, as in this case, small CNNs are good candidates as the internal models of the RCNN detectors. Compared with other lightweight architectures present in the state of the art (e.g., SqueezeNet [
32], GoogLeNet [
22], and MobileNetV2 [
33]), small CNNs also provide a smaller footprint. In
Table 1, we present some of the principal features of these networks. The depth column in the table reports the largest number of layers (fully connected or convolutional) on the path from the first to the last layer.
The RCNN detectors detect the different hand positions of the user and return as output the class of the gesture detected with the highest confidence.
Finally, as a summary, we can highlight three relevant features needed when developing a gesture detection system. These are the following:
Capability to define new gestures quickly. This feature provides flexibility to the system.
Flexibility to identify the same gesture made by different people, with different orientations, skin colors, and hand positions, and under different environmental conditions, such as lighting or distance to the camera.
Due to the real-time constraints in several scenarios, the system has to provide a fast response.
Although embedded systems have enough computation power to deploy small-sized CNNs, their biggest issue is related to software. These devices only support reduced versions of the traditional Deep Learning frameworks, such as TensorFlow Lite, or new frameworks developed to maximize their performance, such as OpenVINO. These reduced versions do not support all the CNN layers available in the scientific literature.
Given these limitations, the developed architecture must fulfil several conditions to be compatible with these devices:
Use standard CNN layers to avoid the limitations of the lite versions of the frameworks (a minimal conversion sketch is shown after this list). Due to the resource constraints, a reduced model size is also essential.
Provide a fast response time despite the hardware constraints of these platforms.
Be compatible with transfer learning techniques and pre-trained models, to minimize the training step.
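As an example of the kind of compatibility check implied by these conditions, the following sketch converts a Keras model built only from standard convolutional, pooling, and dense layers to TensorFlow Lite. The layer sizes are placeholders and the snippet is not the exact export procedure used in this work.

```python
import tensorflow as tf

# A toy model made only of widely supported, standard layers, so the lite
# converter does not reject it on an embedded target.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),   # placeholder output head
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("gesture_model.tflite", "wb") as f:
    f.write(tflite_model)
```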
3. Specific Dataset for Hand-Gesture Detection
There are many gesture datasets already at researchers' disposal, but many of them, due to their research purposes, do not correctly represent real-world conditions, especially in tasks that involve human-machine interaction. Examples of these datasets are the one created by the University of Padova using Kinect and Leap Motion devices [
34], The Hand-Gesture Detection Dataset, created by the Video Processing and Understanding Lab [
35], or the Hand Gesture Database [
36]. In these datasets, the images are already preprocessed and the hands performing the different gestures are segmented, so researchers do not need to locate the hands in the overall scene. Hence, these datasets are not feasible for systems that must detect gestures in real time while continuously monitoring the environment.
The main application presented in this article belongs to the area known as “human-machine interaction” (HMI). In this type of application, the number of gestures that the system needs to identify is usually limited. For example, manual control of a robot operation needs four or five principal commands, such as “stop”, “resume”, “start”, or “reset”. Other applications are similar: a gesture-based television control only needs a few commands, such as “switch on/off”, “volume up/down”, or “program up/down”. Using a large number of classes (i.e., gestures) makes these systems hard to understand and decreases their practical usability.
In this work, we present a dataset that consists of four different gesture commands, defined assuming the typical interaction between a collaborative robot and a worker in an industrial setup [
37]. The four gestures are ‘Agree’, ‘Halt’, ‘Ok’, and ‘Run’, and each one is defined by the position of the hand and fingers.
Table 2 presents the hand positions for each gesture.
For each gesture, we recorded and labelled 100 different images. In each image, both hands perform the same gesture, so we labelled 200 gestures per class, for a total of 800 gestures in our training dataset. Using both hands in each image improves generalization and adds more samples to the dataset. We also labelled the gestures at different distances from the camera, adding more variability.
Figure 4 shows an example of an image that has been labelled for the Halt gesture.
The dataset stores, for each labelled gesture, the image name and the bounding box coordinates, following the schema (x, y, w, h), where x and y are the coordinates of the upper left corner of the box, w is its width, and h its height. All 400 images (100 images for each of the 4 classes) have the same resolution of 1280 × 720 pixels.
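To illustrate this annotation layout, a minimal parsing sketch is shown below; the CSV file name and column order are hypothetical, since the original dataset format is only described conceptually in the text.

```python
import csv

def load_annotations(path):
    """Read gesture annotations: image name, class label and (x, y, w, h) box,
    where (x, y) is the top-left corner in pixels of the 1280 x 720 image."""
    samples = []
    with open(path, newline="") as f:
        for image_name, label, x, y, w, h in csv.reader(f):
            x, y, w, h = map(int, (x, y, w, h))
            samples.append({
                "image": image_name,
                "label": label,                       # 'Agree', 'Halt', 'Ok' or 'Run'
                "box_xywh": (x, y, w, h),
                "box_corners": (x, y, x + w, y + h),  # (x_min, y_min, x_max, y_max)
            })
    return samples

# Hypothetical usage: 400 images, two labelled hands per image -> 800 boxes.
# annotations = load_annotations("gesture_labels.csv")
```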
One issue that must be considered is the overfitting that can occur when a small number of classes and a CNN with few trainable parameters are used. While this is an important problem in generic gesture detection applications, it can be accepted in HMI applications. Overfitting implies an over-adjustment to the dataset; in an HMI application configured with a specific dataset created by the future users of the system, this over-adjustment is not necessarily a drawback and can even be an advantage in scenarios where operational safety is critical.
4. Proposed Network Architecture
We propose a new optimized CNN architecture, evolving the Darknet reference model presented in [
38]. The main features of this Darknet architecture are its speed in the detection stage, its relatively small number of trainable neurons, and its simple network structure. Despite the limitations exposed in
Section 2, this model suits our application. It also allows us to minimize the training process by using a transfer learning approach with a pre-trained model of the network [
39].
The Darknet network is composed of 32 layers.
Figure 5 shows the schema we have used, while
Table 3 shows the different filters located in the eight convolutional layers.
The CNN proposed in this work simplifies the Darknet model and minimizes the training step, using a transfer learning approach that keeps the trained weights of the Darknet convolutional layers. The changes carried out are the following:
Simplification of intermediate convolutional layers (shown in
Figure 5). This simplification is justified by the framework limitations of the embedded platforms. Thanks to this adaptation, the network becomes smaller and more portable to many different systems.
Modification of the output layer of the network. We used a fully connected layer with 5 outputs (the number of gestures to detect plus one extra output required by the RCNN detectors) (shown in
Figure 6).
Due to these modifications, the proposed network has only 25 layers instead of the 32 layers in the original network.
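The sketch below outlines, under stated assumptions, what such a reduced Darknet-style backbone with a 5-output head could look like in Keras; the filter counts, input size, and layer ordering are placeholders and do not reproduce the exact architecture of Figure 6 or Table 3.

```python
import tensorflow as tf

def conv_block(filters):
    """Standard conv + batch norm + leaky ReLU block, typical of Darknet backbones."""
    return [
        tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LeakyReLU(0.1),
    ]

model_layers = [tf.keras.Input(shape=(256, 256, 3))]
for filters in (16, 32, 64, 128, 256):        # placeholder filter progression
    model_layers += conv_block(filters)
    model_layers.append(tf.keras.layers.MaxPooling2D())
model_layers += [
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # 4 gestures + 1 extra output
]

model = tf.keras.Sequential(model_layers)
model.summary()
```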
Table 4 shows a comparison between the proposed solution and the state of the art in small CNNs explicitly created for embedded devices.
The network proposed in this work has a similar number of parameters to other small CNNs, but a significantly lower number of layers. When the calculations of each layer are parallelized, the proposed network is much more efficient in computation time, as stated in
Section 6. SqueezeNet has almost three times more sequential layers than our proposed network (68 layers versus 24), and MobileNetV2 has even more, with a total of 155 sequential layers, 6.45 times more than our proposed approach. Although these two well-known architectures are optimized in terms of trainable parameters, they present a more sequential structure, offering fewer opportunities for parallelization and hence requiring more computation time than our optimized network.
7. Conclusions and Future Work
CNNs have proven to be a very powerful option for solving complex image processing problems that present high variability. Partial occlusions, perspective distortions, variability in the size of the same object due to distance, and changes in color and lighting are examples of these types of situations.
These networks have had an outstanding performance in many proposed challenges and competitions, such as ILSVRC [
39], the challenges presented in the Kaggle platform [
42] or The Low-Power Image Recognition Challenge [
43].
On the other hand, the principal drawback of this technology is that high computation capacity is needed for the majority of problems addressed with CNNs. This leaves systems such as the embedded systems of industrial equipment out of the potential areas that can take advantage of modern neural computation.
This work has proved that it is possible to use RCNN-based algorithms to detect gesture commands in real time, processing each gesture as a different object to be detected by the network. Moreover, it has been shown that, when the problem to solve is not very intricate and does not require the detection of a high number of output classes, the use of a smaller CNN offers many advantages over state-of-the-art published networks in terms of training time, computation speed, and accuracy. In addition, the use of small CNNs makes it possible to run these types of algorithms on embedded systems, which is an important point when developing this kind of system.
The results have also been promising. The system detects 96.92% of the gestures correctly, and it is robust to variations that can make alternative algorithms fail. As an example, the proposed approach is robust to lighting variations and different skin colors, provided that the training dataset is well built and generalized; this kind of variability makes skin segmentation approaches for gesture detection very difficult. In addition, using a small CNN with a Faster-RCNN architecture, the mean computation time for a 1 Mpix image is 0.14 s with GPU computation, a time that can be considered real time for systems that use gestures as a form of interaction.
Another point of interest, related to the results explained in the paragraph above, is that, given the low computational resources that our system needs for real-time gesture detection, it could also be adapted very effectively to areas beyond industrial applications, such as domestic devices or smart TVs.