1. Introduction
Communication, whether verbal or through gestures, is a necessity in one’s life for conveying messages and interacting with others. When deaf and mute persons interact with hearing people who are not familiar with sign language, a communication barrier arises. This gap can be bridged by interpreters who convert sign language into spoken language and vice versa. However, an interpreter, whether a person or a device, is expensive and may not be available throughout a deaf person’s life. As a result, advancements in hand gesture recognition for sign languages will benefit the deaf and mute community by bridging the communication gap that currently exists.
Most sign language lexicons are made up of hand gestures, which are usually combined with facial expressions and body movements that emphasize words or phrases. Due to this inherent trait, a hand gesture can be either static or dynamic in nature. Dynamic gestures are made up of a series of moving hand gestures, whereas static hand gestures consist of various hand shapes and orientations that do not convey any motion information.
The vision-based approach can be divided into two categories: the handcrafted machine learning approach and the deep learning approach (depicted in
Figure 1). The handcrafted approach, also known as traditional machine learning, has a separate stage in which features are defined and extracted before being fed into the machine learning algorithm. Examples of such pre-defined features include edge detection, corner detection and histograms. On the other hand, a deep learning approach such as a convolutional neural network (CNN) does not need a specific manual feature extraction process, as the algorithm itself learns which features best discriminate the images. The major difference between deep learning and machine learning techniques lies in the problem-solving approach: deep learning techniques tend to solve the problem end to end, whereas machine learning techniques break the problem down into parts that are solved separately, with their results combined at the final stage.
Lately, more studies have been carried out to propose models that can classify datasets captured under different conditions, such as varying illumination levels and complex backgrounds, through CNNs. By employing a CNN, the hand-crafted feature extraction stage can be avoided, especially when the dataset comes with complex backgrounds. However, whenever a CNN is involved, the dataset size is one of the crucial considerations for classification. Generally, deep neural networks require a very large amount of training data to avoid overfitting, whereas traditional machine learning approaches are less data-hungry and have shorter execution times. In order to achieve better accuracy, researchers have tried adding deeper convolutional layers but have reported that computation resources such as computer memory are a major stumbling block, not to mention the time taken to perform training.
In this paper, a hybrid hand gesture recognition model that combines a CNN from deep learning for feature extraction with an ensemble classifier is introduced. The performance of a model depends heavily on how accurately features are studied and extracted. Hence, feature extraction via the CNN approach avoids the complexity of manual feature extraction, especially when the features must be crafted for each individual dataset. However, the dataset size and execution time, which have constantly been a source of concern with regard to CNNs, are addressed using a machine learning method in the classification stage; a minimal sketch of this pipeline is shown after the contribution list below. The key contributions of this paper are as follows:
A hybrid model using deep learning techniques for feature extraction and an ensemble classifier for classification (Lightweight VGG16 and Random Forest) is devised for hand gesture recognition;
Reduced burden on the computation resources required for VGG16 feature extraction through architecture depth optimization;
Improved execution time of the lightweight VGG16-RF compared with a full-fledged deep learning architecture for hand gesture recognition.
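To make the hybrid pipeline concrete, the following is a minimal Python sketch of the general idea: a VGG16 network truncated at an intermediate block produces the feature vectors, which are then classified by a Random Forest. The truncation point (block3_pool), input size, placeholder data and Random Forest settings are illustrative assumptions only; the actual configuration of the proposed model is detailed in Section 3.

```python
# Minimal sketch of the hybrid pipeline (illustrative only): a truncated VGG16 acts as
# the feature extractor and a Random Forest performs the classification.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def build_feature_extractor(cut_layer="block3_pool", input_shape=(64, 64, 3)):
    """Truncate VGG16 at an intermediate block; the cut point is an assumption."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    return Model(inputs=base.input, outputs=base.get_layer(cut_layer).output)

def extract_features(extractor, images):
    """images: float array of shape (N, 64, 64, 3) with values in [0, 255]."""
    maps = extractor.predict(preprocess_input(images.copy()), verbose=0)
    return maps.reshape(maps.shape[0], -1)          # flatten the feature maps per image

# Placeholder data; real usage would load the hand gesture images and labels here.
x_train = np.random.rand(16, 64, 64, 3) * 255
y_train = np.random.randint(0, 10, size=16)
x_test = np.random.rand(4, 64, 64, 3) * 255
y_test = np.random.randint(0, 10, size=4)

extractor = build_feature_extractor()
rf = RandomForestClassifier(n_estimators=100, random_state=42)   # illustrative settings
rf.fit(extract_features(extractor, x_train), y_train)
y_pred = rf.predict(extract_features(extractor, x_test))
print("accuracy:", accuracy_score(y_test, y_pred))
```

The CNN is used only as a fixed feature extractor here, so no back-propagation is required at the classification stage, which is where the execution time savings of the hybrid design come from.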
The remainder of this paper is organized as follows:
Section 2 reviews the related work pertaining to hand gesture recognition;
Section 3 presents the proposed model;
Section 4 covers the datasets used, the experiments carried out and the results recorded; and
Section 5 concludes the paper.
2. Related Works
Before the deep learning approach became popular, the hand-crafted approach was the standard for vision-based image recognition. A hand-crafted approach often consists of several image pre-processing steps and specifically crafted feature extraction modules. Vishwakarma (2017) [
1] proposed hand gesture recognition using shape and texture evidence in complex backgrounds. The National University of Singapore (NUS) Hand Posture dataset was subjected to segmentation and morphological operations for image pre-processing, which mainly targeted internal noise before a Gabor filter was applied to retrieve the texture features of the images. A differentiable intensity profile was created through the Gabor filter and smoothed with a Gaussian filter, and the resulting intensity information was then fed into the classifier.
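As an illustration of this style of texture extraction, the sketch below builds a small Gabor filter bank, smooths the responses with a Gaussian filter and summarizes them into a feature vector; the kernel parameters and the use of simple mean/standard-deviation statistics are assumptions for demonstration, not the exact settings of [1].

```python
# Illustrative Gabor-filter texture extraction with Gaussian smoothing
# (parameters are assumptions, not those of [1]).
import cv2
import numpy as np

def gabor_texture_features(gray, orientations=4, ksize=21, sigma=4.0,
                           lambd=10.0, gamma=0.5):
    """Filter a grayscale image with a bank of Gabor kernels and smooth the responses."""
    features = []
    for i in range(orientations):
        theta = i * np.pi / orientations
        kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma)
        response = cv2.filter2D(gray, cv2.CV_32F, kernel)
        smoothed = cv2.GaussianBlur(response, (5, 5), 1.0)   # smooth the intensity profile
        features.append(smoothed.mean())
        features.append(smoothed.std())
    return np.array(features)   # compact texture descriptor to be fed to a classifier

gray = np.random.randint(0, 256, (120, 160), dtype=np.uint8)  # placeholder image
print(gabor_texture_features(gray))
```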
Sadeddine et al. (2018) [
2] proposed hand posture recognition using several descriptors on three different databases, namely the American Sign Language (ASL), Arabic Sign Language (ArSL) and NUS Hand Posture datasets. The system architecture was organized into three phases: hand detection, feature extraction and classification. Several descriptors, such as Hu’s Moment Descriptor (Hu’s MD), the Zernike Moments Descriptor (ZMD), the Generic Fourier Descriptor (GFD) and the Local Binary Pattern Descriptor (LBPD), were used to detect the hand posture region. In Hu’s MD, the moment invariants were computed from the information provided by both the external shape and the internal edges. For LBPD, the image was divided into several non-overlapping blocks, LBP histograms were computed for each individual block, and finally the local binary pattern (LBP) histograms were concatenated into a single vector. As for ZMD, a statistical measure of the pixel distribution around the centre of gravity of the shape was used to detect the hand in the image and construct a bounding box around it to eliminate the unwanted surrounding background.
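The block-wise LBP descriptor described above can be sketched as follows; the block size and LBP parameters are assumptions chosen for illustration.

```python
# Illustrative block-wise LBP descriptor: compute an LBP histogram per non-overlapping
# block and concatenate them into a single feature vector.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_block_histogram(gray, block=32, points=8, radius=1):
    lbp = local_binary_pattern(gray, points, radius, method="uniform")
    n_bins = points + 2                       # number of uniform LBP codes
    hists = []
    for r in range(0, gray.shape[0] - block + 1, block):
        for c in range(0, gray.shape[1] - block + 1, block):
            patch = lbp[r:r + block, c:c + block]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins), density=True)
            hists.append(hist)
    return np.concatenate(hists)              # single descriptor vector

gray = np.random.randint(0, 256, (128, 128)).astype(np.uint8)  # placeholder image
print(lbp_block_histogram(gray).shape)
```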
Zhang et al. (2018) [
3] proposed a hand gesture recognition system based on the Histogram of Oriented Gradients (HOG) and LBP using the NUS dataset. In the proposed algorithm, feature extraction was performed separately and in parallel via HOG and LBP, the collected features were fused into a single vector, and a Support Vector Machine (SVM) was then used for classification. HOG features were used to acquire edge and local shape information, while LBP features were used to extract texture features that are robust to grey-level transforms as well as to rotational changes. In the final stage, an SVM with the radial basis function (RBF) kernel was used to classify the fused features.
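A minimal sketch of this HOG–LBP fusion followed by an RBF-kernel SVM is given below; the descriptor parameters, the global (rather than block-wise) LBP histogram and the placeholder data are simplifying assumptions.

```python
# Illustrative HOG + LBP feature fusion followed by an RBF-kernel SVM, in the spirit
# of [3]; descriptor parameters and SVM settings are assumptions.
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.svm import SVC

def fused_descriptor(gray):
    hog_feat = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))             # edge / local-shape information
    lbp = local_binary_pattern(gray, 8, 1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)  # texture
    return np.concatenate([hog_feat, lbp_hist])        # fused feature vector

# Placeholder data: real usage would loop over dataset images and labels.
X = np.stack([fused_descriptor(np.random.randint(0, 256, (64, 64)).astype(np.uint8))
              for _ in range(20)])
y = np.random.randint(0, 2, size=20)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))
```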
Gajalakshmi and Sharmila (2019) [
4] proposed hand gesture recognition using an SVM with the chain code histogram (CCH) for feature extraction on the NUS dataset. The process began with thresholding as part of the pre-processing to produce binary hand posture images for feature extraction. Ridler and Calvard thresholding (RCT) was used to segment the region of interest: the average pixel intensity served as the initial threshold, and the foreground and background classes were then separated based on the computed foreground and background means. For feature vector extraction, CCH segregated the binary image into grid blocks according to cluster-based thresholds, and the histogram was then calculated from the frequently occurring discrete values.
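The iterative RCT scheme described above can be sketched as follows, assuming the standard Ridler–Calvard update in which the threshold is repeatedly set to the average of the foreground and background means.

```python
# Minimal sketch of Ridler–Calvard (iterative) thresholding: start from the mean
# intensity, then repeatedly set the threshold to the average of the foreground and
# background means until it stabilizes.
import numpy as np

def ridler_calvard_threshold(gray, tol=0.5, max_iter=100):
    t = gray.mean()                                    # initial threshold
    for _ in range(max_iter):
        foreground = gray[gray > t]
        background = gray[gray <= t]
        if foreground.size == 0 or background.size == 0:
            break
        new_t = (foreground.mean() + background.mean()) / 2.0
        if abs(new_t - t) < tol:
            return new_t
        t = new_t
    return t

gray = np.random.randint(0, 256, (120, 160)).astype(np.float32)  # placeholder image
t = ridler_calvard_threshold(gray)
binary = (gray > t).astype(np.uint8)                   # binary hand posture image
print("threshold:", t)
```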
In order to obtain good classification, feature extraction becomes an even more crucial task, especially with complex or noisy backgrounds. Researchers therefore started to adopt deep learning approaches to ease the creation of the feature extraction module. Gao et al. (2017) [
5] proposed a static hand gesture recognition model with parallel CNNs for space human–robot interaction on the ASL dataset. The network includes two subnetworks, an RGB-CNN and a Depth-CNN, which run in parallel and are merged to obtain the final result. Each subnetwork has seven layers: the first four are convolutional layers, while the fully connected layers have 144 and 72 neurons, respectively. The prediction probabilities were generated by the SoftMax classification layers at the end of each subnetwork. The RGB-CNN and Depth-CNN subnetworks achieved accuracies of 90.3% and 81.4% individually, but when combined, the network achieved a test accuracy of 93.3%.
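A simplified sketch of such a two-stream network is shown below; the convolutional layer counts, input sizes and the 24-class output are assumptions, and only the 144- and 72-neuron fully connected layers are taken from the description above. The way the two SoftMax outputs are combined (simple averaging here) is likewise an assumption.

```python
# Simplified two-stream sketch: an RGB subnetwork and a depth subnetwork run in parallel,
# each ending in its own SoftMax layer, and their prediction probabilities are merged.
from tensorflow.keras import layers, Input, Model

def stream(input_shape, name):
    inp = Input(shape=input_shape, name=name)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(144, activation="relu")(x)      # 144- and 72-neuron dense layers
    x = layers.Dense(72, activation="relu")(x)
    return inp, layers.Dense(24, activation="softmax")(x)   # 24 classes (assumption)

rgb_in, rgb_prob = stream((64, 64, 3), "rgb")
depth_in, depth_prob = stream((64, 64, 1), "depth")
merged = layers.Average()([rgb_prob, depth_prob])    # merge the parallel subnetworks
model = Model(inputs=[rgb_in, depth_in], outputs=merged)
model.summary()
```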
Adithya and Rajesh (2020) [
6] proposed a method for the automatic recognition of hand postures using convolutional neural networks with a deep parallel architecture. The proposed model avoided the need for hand segmentation, which is a very difficult task in images with cluttered backgrounds. Two datasets were used, namely the NUS dataset and the ASL dataset. The training images were subjected to three convolutional layers with different filter sizes for feature extraction, with zero padding applied in each layer to ensure that the input and output sizes remained the same. The dimension of the feature map was reduced through a max pooling layer after each convolutional layer. Stochastic gradient descent with momentum (SGDM) was used as the optimization function. The proposed model achieved accuracies of 99.96% and 94.7% for the ASL and NUS datasets, respectively.
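The sketch below illustrates the ingredients described above: convolutional layers with different filter sizes, ‘same’ zero padding so the spatial size is preserved, max pooling after each convolution and an SGDM optimizer. The filter counts, filter sizes, input size and class count are assumptions, and the layers are arranged sequentially for brevity, whereas [6] uses a parallel arrangement.

```python
# Illustrative network with 'same' zero padding (each convolution preserves spatial
# size), max pooling after every convolution, and an SGD-with-momentum optimizer.
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Conv2D(32, 7, padding="same", activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(24, activation="softmax"),            # number of classes is an assumption
])
model.compile(optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),  # SGDM
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```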
Bheda and Radpour (2017) [
7] presented a method to classify both letters and digits in ASL using deep convolutional networks. Three datasets were used in the research: a self-acquired dataset, ASL Alphabets and ASL Digits. The authors proposed a common CNN architecture consisting of three groups of two convolutional layers, each followed by a max-pooling layer and a dropout layer, connected to two groups of fully connected layers, each followed by a dropout layer. The authors observed that the size of the training data was critical in ensuring better accuracy at the validation stage. Data augmentation techniques such as rotation and transformations, including flipping, were applied to the self-acquired dataset to increase the sample size and yielded an improvement of 20% in the overall performance. On top of that, the background of each image was removed using a background-subtraction method to minimize the impact of noise on the overall accuracy. Accuracies of 82.5% and 97% were recorded for ASL Alphabets and ASL Digits, respectively, while the self-acquired dataset only recorded 67% and 70% for the ASL Alphabet and Digits, respectively.
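Rotation and flipping augmentation of this kind can be sketched as follows; the parameter ranges and placeholder data are assumptions, not the exact settings used in [7].

```python
# Illustrative rotation/flip augmentation used to enlarge a small training set.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,          # random rotation in degrees
    horizontal_flip=True,       # flipping transformation
    width_shift_range=0.1,
    height_shift_range=0.1,
)

x = np.random.rand(8, 64, 64, 3)             # placeholder images
y = np.random.randint(0, 24, size=8)         # placeholder labels
aug_x, aug_y = next(augmenter.flow(x, y, batch_size=8, seed=1))
print(aug_x.shape, aug_y.shape)
```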
In the deep learning-based approach, researchers have come to realize that the size of the dataset plays a role in determining a good classification rate. Hence, researchers now either perform data augmentation on the datasets or import weights from a pre-trained model trained on a larger dataset. Ozcan and Basturk (2019) [
8] proposed a hand gesture recognition method for digits using a transfer learning-based CNN structure with heuristic optimization. Two datasets were used in this proposal, the ASL Digits dataset and the ASL dataset. The datasets were loaded into the system together with AlexNet, a pre-trained CNN model with eight learnable layers, of which the first five are convolutional layers and the last three are fully connected layers, as part of the transfer learning. The final three layers of the CNN were modified and optimized using the Artificial Bee Colony (ABC) algorithm.
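A hedged sketch of the transfer-learning part is shown below, using the AlexNet implementation available in torchvision purely as an example (an implementation choice, not necessarily that of [8]). Only the final fully connected layer is swapped here, whereas [8] modifies the last three layers and further optimizes them with the ABC algorithm, which is not reproduced.

```python
# Illustrative transfer-learning setup: load a pretrained AlexNet and replace its final
# fully connected layer for the new gesture classes (simplified; ABC optimization omitted).
import torch.nn as nn
from torchvision import models

num_classes = 10                                     # ASL digits (assumption)
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)

for param in alexnet.features.parameters():          # keep the pre-trained convolutional weights
    param.requires_grad = False

alexnet.classifier[6] = nn.Linear(4096, num_classes)  # re-learn only the new output head
print(alexnet.classifier)
```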
Tan et al. (2021) [
9] proposed a customized network architecture called the Enhanced Densely Connected Convolutional Neural Network (EDenseNet). In the experiment, the ASL dataset and the NUS Hand Posture dataset were used. The datasets were subjected to nine data augmentation techniques to mitigate the effect of data scarcity. The proposed model had three dense blocks, each containing four convolutional layers, with transition layers connecting the dense blocks. The dense block was set up with three layers at a growth rate of 24 (the number of feature maps to be produced) and a filter size of three; within a single dense block, the feature maps of preceding convolutional layers were concatenated and served as input to the succeeding convolutional layer. As for the transition layer, it consisted of a bottleneck layer of four convolutional layers, also with a growth rate of 24 and a filter size of three, followed by a pooling layer. Max pooling was deployed in the first two transition layers to extract extreme features such as curves and edges, while average pooling was used in the final transition layer to smooth out the extracted features.
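The dense connectivity described above, where each convolution’s output is concatenated with all preceding feature maps before the next convolution, can be sketched as follows; apart from the growth rate of 24 and the 3 × 3 filters, the block depth and input shape are assumptions, and the full EDenseNet layout is not reproduced.

```python
# Illustrative dense block: the feature maps of every preceding convolution are
# concatenated and fed to the next convolution.
from tensorflow.keras import layers, Input, Model

def dense_block(x, num_layers=3, growth_rate=24):
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)   # produces `growth_rate` maps
        x = layers.Concatenate()([x, y])                       # concatenate with all previous maps
    return x

inp = Input(shape=(32, 32, 24))
out = dense_block(inp)
Model(inp, out).summary()
```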
In further optimizing the approach for image classification, combinations of deep learning models and machine learning models have emerged. Wang et al. (2021) [
10] proposed a gesture image recognition method based on transfer learning called MobileNet-RF. The proposed model combined a CNN for feature extraction with machine learning for classification. The structure processed images through a standard convolution followed by stacked depth-wise and point-wise convolutions for feature extraction. Batch normalization (BN) and ReLU activation functions were added after each depth-wise and point-wise convolution, where BN compensates for the slow convergence of the neural network while ReLU offers computational advantages that allow the network design to go deeper. The entire MobileNet has 28 layers when the depth-wise and point-wise convolutions are counted separately. These 28 layers were used to extract the gesture image features, which were then input directly into the random forest model for classification.
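A minimal sketch of one such depth-wise separable unit, with batch normalization and ReLU after both the depth-wise and the point-wise convolution, is given below; the filter counts, strides and the pooling used to produce the feature vector for the random forest are assumptions.

```python
# Illustrative depth-wise separable block as used in MobileNet-style networks: a
# depth-wise convolution followed by a point-wise (1x1) convolution, each with BN + ReLU.
from tensorflow.keras import layers, Input, Model

def separable_block(x, filters, stride=1):
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
    x = layers.BatchNormalization()(x)                  # eases slow convergence
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same")(x)    # point-wise convolution
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inp = Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)  # standard conv
x = separable_block(x, 64)
x = separable_block(x, 128, stride=2)
features = layers.GlobalAveragePooling2D()(x)           # features handed to the RF classifier
Model(inp, features).summary()
```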
Sahoo et al. (2022) [
11] proposed a score-level fusion technique between AlexNet and VGG16 for hand gesture recognition. To fine-tune both CNN models, weights were transferred from the pre-trained models for initialization instead of training from scratch. The score vectors generated by the two fine-tuned CNN models were first normalized and then fused using the sum-rule-based method to form a single output vector. On the Massey University (MU) dataset and the HUST American Sign Language (HUST-ASL) dataset, the accuracy of the proposed model was recorded at 90.26% and 56.18%, respectively.
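The score-level fusion step can be sketched as follows; the min–max normalization and the 36-class score vectors are assumptions used purely for illustration.

```python
# Minimal sketch of score-level fusion: normalize the score vectors of two CNNs and
# combine them with the sum rule before taking the arg-max class.
import numpy as np

def min_max(scores):
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)

def sum_rule_fusion(scores_alexnet, scores_vgg16):
    fused = min_max(scores_alexnet) + min_max(scores_vgg16)   # sum-rule fusion
    return np.argmax(fused, axis=-1)                          # predicted class

# Placeholder score vectors for one image over 36 classes (class count is an assumption).
s1 = np.random.rand(36)
s2 = np.random.rand(36)
print(sum_rule_fusion(s1, s2))
```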
Wang et al. (2022) [
12] proposed an improved lightweight CNN model, called E-MobileNetv2, by adding adaptive channel attention (an ECA module) to the existing MobileNetv2 model. The newly added module helped reduce the interference of unrelated information and enhanced the model’s feature refining ability, especially in capturing cross-channel interactions. The proposed model was also equipped with a new activation function, R6-SELU (instead of ReLU6), for better feature extraction and to prevent the loss of negative-valued feature information. The proposed model achieved an accuracy of 96.82% while reducing the number of parameters by 30%.
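An ECA-style channel attention block of the kind added in E-MobileNetv2 can be sketched as follows; the kernel size and placement are assumptions, and the R6-SELU activation is not reproduced here.

```python
# Illustrative efficient channel attention (ECA-style) block: global average pooling,
# a 1-D convolution across channels and a sigmoid gate that re-weights the feature maps.
from tensorflow.keras import layers, Input, Model

def eca_block(x, kernel_size=3):
    channels = x.shape[-1]
    gap = layers.GlobalAveragePooling2D()(x)                   # (batch, C)
    attn = layers.Reshape((channels, 1))(gap)                  # treat channels as a sequence
    attn = layers.Conv1D(1, kernel_size, padding="same", use_bias=False)(attn)
    attn = layers.Activation("sigmoid")(attn)
    attn = layers.Reshape((1, 1, channels))(attn)              # per-channel weights
    return layers.Multiply()([x, attn])                        # cross-channel re-weighting

inp = Input(shape=(32, 32, 64))
out = eca_block(inp)
Model(inp, out).summary()
```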
Furthering the effort to improve the classification of hand gestures, Gadekallu et al. (2022) [
13] proposed a method in which the Harris hawks optimization (HHO) algorithm was utilized to fine-tune the hyperparameters of a CNN model. The algorithm mimics how Harris hawks hunt prey in nature. HHO comprises two exploration stages and four exploitation stages in the effort to locate the optimal solution within a given search space. The effectiveness of the algorithm contributed to the 100% accuracy achieved when tested on the hand gesture dataset from Kaggle, which consists of non-alphabetical hand gesture actions such as a fist, palm and thumb.
Li et al. (2022) [
14] proposed an algorithm that is robust to multi-scale and multi-angle variation against a complex background during feature extraction. Features are extracted from the complex background using a Gaussian model and the K-means algorithm and are then subjected to HOG and 9ULBP. The fused features are not only invariant to scale and rotation but also rich in texture information. The proposed method used an SVM for classification to locate the optimal separation between classes and achieved accuracies of 99.01%, 97.5% and 98.72% when tested on a self-collected dataset, the NUS dataset and the MU Hand Images ASL dataset, respectively. Table 1 presents a summary of the related works.