1. Introduction
Communication, whether verbal or through gestures, is a necessity in one’s life for conveying messages and interacting with others. When deaf and mute persons interact with hearing people who are not familiar with sign language, a communication barrier arises. This gap can be bridged by interpreters who convert sign language into spoken language and vice versa. However, an interpreter, whether a person or a device, is expensive and may not be available throughout a deaf person’s life. As a result, advancements in hand gesture recognition for sign languages will benefit the deaf and mute community by bridging the communication gap that currently exists.
Most sign language lexicons are made up of hand gestures, which are usually combined with facial expressions and body movements that emphasize words or phrases. Due to this inherent trait, a hand gesture can be either static or dynamic in nature. Dynamic gestures are made up of a series of moving hand gestures, whereas static hand gestures consist of various hand shapes and orientations that do not convey any motion information.
The vision-based approach can be divided into two categories: the handcrafted machine learning approach and the deep learning approach (depicted in
Figure 1). The handcrafted approach, also known as traditional machine learning, has a separate stage in which features are defined and extracted before being fed into the machine learning algorithm. Examples of such pre-defined features include edge detection, corner detection and histograms. On the other hand, a deep learning approach such as a convolutional neural network (CNN) does not need a specific manual feature extraction process, as the algorithm itself learns which features best discriminate the images. The major difference between deep learning and machine learning techniques lies in the problem-solving approach: deep learning techniques tend to solve the problem end to end, whereas machine learning techniques break the problem down into parts that are solved separately, with their results combined at the final stage.
Lately, more studies have been carried out to propose models that can classify datasets captured under different conditions, such as varying illumination levels and complex backgrounds, through CNNs. By employing a CNN, the hand-crafted feature extraction stage can be avoided, especially when the dataset comes with complex backgrounds. However, whenever a CNN is involved, the dataset size is one of the crucial considerations for classification. Generally, deep neural networks require a very large amount of training data to avoid overfitting, whereas traditional machine learning approaches are less data-hungry and have shorter execution times. In order to achieve better accuracy, researchers have tried adding deeper convolutional layers but have reported that computation resources such as computer memory are a major stumbling block, not to mention the time taken to perform training.
In this paper, a hybrid hand gesture recognition model that combines a CNN from deep learning for feature extraction with an ensemble classifier is introduced. The performance of a model depends heavily on how accurately features are studied and extracted. Hence, feature extraction via the CNN approach avoids the complexity of manual feature extraction, especially when the features must be crafted for each individual dataset. However, the dataset size and execution time, which have constantly been a source of concern with regard to CNNs, are addressed using a machine learning method in the classification stage; a minimal sketch of this pipeline is shown after the contribution list below. The key contributions of this paper are as follows:
A hybrid model using deep learning techniques for feature extraction and an ensemble classifier for classification (Lightweight VGG16 and Random Forest) is devised for hand gesture recognition;
Reduced burden on the computation resources required for VGG16 feature extraction through architecture depth optimization;
Improved execution time of the lightweight VGG16-RF compared with a full-fledged deep learning architecture for hand gesture recognition.
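To make the hybrid pipeline concrete, the following is a minimal Python sketch of the general idea: a VGG16 network truncated at an intermediate block produces the feature vectors, which are then classified by a Random Forest. The truncation point (block3_pool), input size, placeholder data and Random Forest settings are illustrative assumptions only; the actual configuration of the proposed model is detailed in Section 3.

```python
# Minimal sketch of the hybrid pipeline (illustrative only): a truncated VGG16 acts as
# the feature extractor and a Random Forest performs the classification.
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def build_feature_extractor(cut_layer="block3_pool", input_shape=(64, 64, 3)):
    """Truncate VGG16 at an intermediate block; the cut point is an assumption."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    return Model(inputs=base.input, outputs=base.get_layer(cut_layer).output)

def extract_features(extractor, images):
    """images: float array of shape (N, 64, 64, 3) with values in [0, 255]."""
    maps = extractor.predict(preprocess_input(images.copy()), verbose=0)
    return maps.reshape(maps.shape[0], -1)          # flatten the feature maps per image

# Placeholder data; real usage would load the hand gesture images and labels here.
x_train = np.random.rand(16, 64, 64, 3) * 255
y_train = np.random.randint(0, 10, size=16)
x_test = np.random.rand(4, 64, 64, 3) * 255
y_test = np.random.randint(0, 10, size=4)

extractor = build_feature_extractor()
rf = RandomForestClassifier(n_estimators=100, random_state=42)   # illustrative settings
rf.fit(extract_features(extractor, x_train), y_train)
y_pred = rf.predict(extract_features(extractor, x_test))
print("accuracy:", accuracy_score(y_test, y_pred))
```

The CNN is used only as a fixed feature extractor here, so no back-propagation is required at the classification stage, which is where the execution time savings of the hybrid design come from.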
The remainder of this paper is organized as follows:
Section 2 reviews the related work pertaining to hand gesture recognition;
Section 3 presents the proposed model;
Section 4 covers the datasets used, the experiments carried out and the results recorded; and
Section 5 concludes the paper.
2. Related Works
Before the deep learning approach became popular, the hand-crafted approach was the standard for vision-based image recognition. A hand-crafted approach often consists of several image pre-processing steps and specifically crafted feature extraction modules. Vishwakarma (2017) [
1] proposed hand gesture recognition using shape and texture evidence in complex backgrounds. The National University of Singapore (NUS) Hand Posture dataset was subjected to segmentation and morphological operations for image pre-processing, which mainly targeted internal noise before a Gabor filter was applied to retrieve the texture features of the images. A differentiable intensity profile was created through the Gabor filter and smoothed with a Gaussian filter, and the resulting intensity information was then fed into the classifier.
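As an illustration of this style of texture extraction, the sketch below builds a small Gabor filter bank, smooths the responses with a Gaussian filter and summarizes them into a feature vector; the kernel parameters and the use of simple mean/standard-deviation statistics are assumptions for demonstration, not the exact settings of [1].

```python
# Illustrative Gabor-filter texture extraction with Gaussian smoothing
# (parameters are assumptions, not those of [1]).
import cv2
import numpy as np

def gabor_texture_features(gray, orientations=4, ksize=21, sigma=4.0,
                           lambd=10.0, gamma=0.5):
    """Filter a grayscale image with a bank of Gabor kernels and smooth the responses."""
    features = []
    for i in range(orientations):
        theta = i * np.pi / orientations
        kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd, gamma)
        response = cv2.filter2D(gray, cv2.CV_32F, kernel)
        smoothed = cv2.GaussianBlur(response, (5, 5), 1.0)   # smooth the intensity profile
        features.append(smoothed.mean())
        features.append(smoothed.std())
    return np.array(features)   # compact texture descriptor to be fed to a classifier

gray = np.random.randint(0, 256, (120, 160), dtype=np.uint8)  # placeholder image
print(gabor_texture_features(gray))
```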
Sadeddine et al. (2018) [
2] proposed hand posture recognition using several descriptors on three different databases, namely the American Sign Language (ASL), Arabic Sign Language (ArSL) and NUS Hand Posture datasets. The system architecture was organized into three phases: hand detection, feature extraction and classification. Several descriptors, such as Hu’s Moment Descriptor (Hu’s MD), the Zernike Moments Descriptor (ZMD), the Generic Fourier Descriptor (GFD) and the Local Binary Pattern Descriptor (LBPD), were used to detect the hand posture region. In Hu’s MD, the moment invariants were computed from the information provided by both the external shape and the internal edges. For LBPD, the image was divided into several non-overlapping blocks, LBP histograms were computed for each individual block, and finally the local binary pattern (LBP) histograms were concatenated into a single vector. As for ZMD, a statistical measure of the pixel distribution around the centre of gravity of the shape was used to detect the hand in the image and construct a bounding box around it to eliminate the unwanted surrounding background.
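The block-wise LBP descriptor described above can be sketched as follows; the block size and LBP parameters are assumptions chosen for illustration.

```python
# Illustrative block-wise LBP descriptor: compute an LBP histogram per non-overlapping
# block and concatenate them into a single feature vector.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_block_histogram(gray, block=32, points=8, radius=1):
    lbp = local_binary_pattern(gray, points, radius, method="uniform")
    n_bins = points + 2                       # number of uniform LBP codes
    hists = []
    for r in range(0, gray.shape[0] - block + 1, block):
        for c in range(0, gray.shape[1] - block + 1, block):
            patch = lbp[r:r + block, c:c + block]
            hist, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins), density=True)
            hists.append(hist)
    return np.concatenate(hists)              # single descriptor vector

gray = np.random.randint(0, 256, (128, 128)).astype(np.uint8)  # placeholder image
print(lbp_block_histogram(gray).shape)
```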
Zhang et al. (2018) [
3] proposed a hand gesture recognition system based on the Histogram of Oriented Gradients (HOG) and LBP using the NUS dataset. In the proposed algorithm, feature extraction was performed separately and in parallel via HOG and LBP, the collected features were fused into a single vector, and a Support Vector Machine (SVM) was then used for classification. HOG features were used to acquire edge and local shape information, while LBP features were used to extract texture features that are robust to grey-level transforms as well as to rotational changes. In the final stage, an SVM with the radial basis function (RBF) kernel was used to classify the fused features.
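A minimal sketch of this HOG–LBP fusion followed by an RBF-kernel SVM is given below; the descriptor parameters, the global (rather than block-wise) LBP histogram and the placeholder data are simplifying assumptions.

```python
# Illustrative HOG + LBP feature fusion followed by an RBF-kernel SVM, in the spirit
# of [3]; descriptor parameters and SVM settings are assumptions.
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.svm import SVC

def fused_descriptor(gray):
    hog_feat = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))             # edge / local-shape information
    lbp = local_binary_pattern(gray, 8, 1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)  # texture
    return np.concatenate([hog_feat, lbp_hist])        # fused feature vector

# Placeholder data: real usage would loop over dataset images and labels.
X = np.stack([fused_descriptor(np.random.randint(0, 256, (64, 64)).astype(np.uint8))
              for _ in range(20)])
y = np.random.randint(0, 2, size=20)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))
```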
Gajalakshmi and Sharmila (2019) [
4] proposed hand gesture recognition using an SVM with the chain code histogram (CCH) for feature extraction on the NUS dataset. The process began with thresholding as part of the pre-processing to produce binary hand posture images for feature extraction. Ridler and Calvard thresholding (RCT) was used to segment the region of interest: the average pixel intensity served as the initial threshold, and the foreground and background classes were then separated based on the computed foreground and background means. For feature vector extraction, CCH segregated the binary image into grid blocks according to cluster-based thresholds, and the histogram was then calculated from the frequently occurring discrete values.
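The iterative RCT scheme described above can be sketched as follows, assuming the standard Ridler–Calvard update in which the threshold is repeatedly set to the average of the foreground and background means.

```python
# Minimal sketch of Ridler–Calvard (iterative) thresholding: start from the mean
# intensity, then repeatedly set the threshold to the average of the foreground and
# background means until it stabilizes.
import numpy as np

def ridler_calvard_threshold(gray, tol=0.5, max_iter=100):
    t = gray.mean()                                    # initial threshold
    for _ in range(max_iter):
        foreground = gray[gray > t]
        background = gray[gray <= t]
        if foreground.size == 0 or background.size == 0:
            break
        new_t = (foreground.mean() + background.mean()) / 2.0
        if abs(new_t - t) < tol:
            return new_t
        t = new_t
    return t

gray = np.random.randint(0, 256, (120, 160)).astype(np.float32)  # placeholder image
t = ridler_calvard_threshold(gray)
binary = (gray > t).astype(np.uint8)                   # binary hand posture image
print("threshold:", t)
```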
In order to obtain good classification, feature extraction becomes an even more crucial task, especially with complex or noisy backgrounds. Researchers therefore started to adopt deep learning approaches to ease the creation of the feature extraction module. Gao et al. (2017) [
5] proposed a static hand gesture recognition model with parallel CNNs for space human–robot interaction on the ASL dataset. The network includes two subnetworks, an RGB-CNN and a Depth-CNN, which run in parallel and are merged to obtain the final result. Each subnetwork has seven layers: the first four are convolutional layers, while the fully connected layers have 144 and 72 neurons, respectively. The prediction probabilities were generated by the SoftMax classification layers at the end of each subnetwork. The RGB-CNN and Depth-CNN subnetworks achieved accuracies of 90.3% and 81.4% individually, but when combined, the network achieved a test accuracy of 93.3%.
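A simplified sketch of such a two-stream network is shown below; the convolutional layer counts, input sizes and the 24-class output are assumptions, and only the 144- and 72-neuron fully connected layers are taken from the description above. The way the two SoftMax outputs are combined (simple averaging here) is likewise an assumption.

```python
# Simplified two-stream sketch: an RGB subnetwork and a depth subnetwork run in parallel,
# each ending in its own SoftMax layer, and their prediction probabilities are merged.
from tensorflow.keras import layers, Input, Model

def stream(input_shape, name):
    inp = Input(shape=input_shape, name=name)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(144, activation="relu")(x)      # 144- and 72-neuron dense layers
    x = layers.Dense(72, activation="relu")(x)
    return inp, layers.Dense(24, activation="softmax")(x)   # 24 classes (assumption)

rgb_in, rgb_prob = stream((64, 64, 3), "rgb")
depth_in, depth_prob = stream((64, 64, 1), "depth")
merged = layers.Average()([rgb_prob, depth_prob])    # merge the parallel subnetworks
model = Model(inputs=[rgb_in, depth_in], outputs=merged)
model.summary()
```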
Adithya and Rajesh (2020) [
6] proposed a method for the automatic recognition of hand postures using convolutional neural networks with a deep parallel architecture. The proposed model avoided the need for hand segmentation, which is a very difficult task in images with cluttered backgrounds. Two datasets were used, namely the NUS dataset and the ASL dataset. The training images were subjected to three convolutional layers with different filter sizes for feature extraction, with zero padding applied in each layer to ensure that the input and output sizes remained the same. The dimension of the feature map was reduced through a max pooling layer after each convolutional layer. Stochastic gradient descent with momentum (SGDM) was used as the optimization function. The proposed model achieved accuracies of 99.96% and 94.7% for the ASL and NUS datasets, respectively.
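The sketch below illustrates the ingredients described above: convolutional layers with different filter sizes, ‘same’ zero padding so the spatial size is preserved, max pooling after each convolution and an SGDM optimizer. The filter counts, filter sizes, input size and class count are assumptions, and the layers are arranged sequentially for brevity, whereas [6] uses a parallel arrangement.

```python
# Illustrative network with 'same' zero padding (each convolution preserves spatial
# size), max pooling after every convolution, and an SGD-with-momentum optimizer.
from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Conv2D(32, 7, padding="same", activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 5, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(24, activation="softmax"),            # number of classes is an assumption
])
model.compile(optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),  # SGDM
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```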
Bheda and Radpour (2017) [
7] presented a method to classify both letters and digits in ASL using deep convolutional networks. Three datasets were used in the research: a self-acquired dataset, ASL Alphabets and ASL Digits. The authors proposed a common CNN architecture consisting of three groups of two convolutional layers, each followed by a max-pooling layer and a dropout layer, connected to two groups of fully connected layers, each followed by a dropout layer. The authors observed that the size of the training data was critical in ensuring better accuracy at the validation stage. Data augmentation techniques such as rotation and transformations, including flipping, were applied to the self-acquired dataset to increase the sample size and yielded an improvement of 20% in the overall performance. On top of that, the background of each image was removed using a background-subtraction method to minimize the impact of noise on the overall accuracy. Accuracies of 82.5% and 97% were recorded for ASL Alphabets and ASL Digits, respectively, while the self-acquired dataset only recorded 67% and 70% for the ASL Alphabet and Digits, respectively.
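Rotation and flipping augmentation of this kind can be sketched as follows; the parameter ranges and placeholder data are assumptions, not the exact settings used in [7].

```python
# Illustrative rotation/flip augmentation used to enlarge a small training set.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,          # random rotation in degrees
    horizontal_flip=True,       # flipping transformation
    width_shift_range=0.1,
    height_shift_range=0.1,
)

x = np.random.rand(8, 64, 64, 3)             # placeholder images
y = np.random.randint(0, 24, size=8)         # placeholder labels
aug_x, aug_y = next(augmenter.flow(x, y, batch_size=8, seed=1))
print(aug_x.shape, aug_y.shape)
```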
In the deep learning-based approach, researchers have come to realize that the size of the dataset plays a role in determining a good classification rate. Hence, researchers now either perform data augmentation on the datasets or import weights from a pre-trained model trained on a larger dataset. Ozcan and Basturk (2019) [
8] proposed a hand gesture recognition method for digits using a transfer learning-based CNN structure with heuristic optimization. Two datasets were used in this proposal, the ASL Digits dataset and the ASL dataset. The datasets were loaded into the system together with AlexNet, a pre-trained CNN model with eight learnable layers, of which the first five are convolutional layers and the last three are fully connected layers, as part of the transfer learning. The final three layers of the CNN were modified and optimized using the Artificial Bee Colony (ABC) algorithm.
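A hedged sketch of the transfer-learning part is shown below, using the AlexNet implementation available in torchvision purely as an example (an implementation choice, not necessarily that of [8]). Only the final fully connected layer is swapped here, whereas [8] modifies the last three layers and further optimizes them with the ABC algorithm, which is not reproduced.

```python
# Illustrative transfer-learning setup: load a pretrained AlexNet and replace its final
# fully connected layer for the new gesture classes (simplified; ABC optimization omitted).
import torch.nn as nn
from torchvision import models

num_classes = 10                                     # ASL digits (assumption)
alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)

for param in alexnet.features.parameters():          # keep the pre-trained convolutional weights
    param.requires_grad = False

alexnet.classifier[6] = nn.Linear(4096, num_classes)  # re-learn only the new output head
print(alexnet.classifier)
```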
Tan et al. (2021) [
9] proposed a customized network architecture called the Enhanced Densely Connected Convolutional Neural Network (EDenseNet). In the experiment, the ASL dataset and the NUS Hand Posture dataset were used. The datasets were subjected to nine data augmentation techniques to mitigate the effect of data scarcity. The proposed model had three dense blocks, each containing four convolutional layers, with transition layers connecting the dense blocks. The dense block was set up with three layers at a growth rate of 24 (the number of feature maps to be produced) and a filter size of three; within a single dense block, the feature maps of preceding convolutional layers were concatenated and served as input to the succeeding convolutional layer. As for the transition layer, it consisted of a bottleneck layer of four convolutional layers, also with a growth rate of 24 and a filter size of three, followed by a pooling layer. Max pooling was deployed in the first two transition layers to extract extreme features such as curves and edges, while average pooling was used in the final transition layer to smooth out the extracted features.
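The dense connectivity described above, where each convolution’s output is concatenated with all preceding feature maps before the next convolution, can be sketched as follows; apart from the growth rate of 24 and the 3 × 3 filters, the block depth and input shape are assumptions, and the full EDenseNet layout is not reproduced.

```python
# Illustrative dense block: the feature maps of every preceding convolution are
# concatenated and fed to the next convolution.
from tensorflow.keras import layers, Input, Model

def dense_block(x, num_layers=3, growth_rate=24):
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)   # produces `growth_rate` maps
        x = layers.Concatenate()([x, y])                       # concatenate with all previous maps
    return x

inp = Input(shape=(32, 32, 24))
out = dense_block(inp)
Model(inp, out).summary()
```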
In further optimizing the approach for image classification, combinations of deep learning models and machine learning models have emerged. Wang et al. (2021) [
10] proposed a gesture image recognition method based on transfer learning called MobileNet-RF. The proposed model combined a CNN for feature extraction with machine learning for classification. The structure processed images through a standard convolution followed by stacked depth-wise and point-wise convolutions for feature extraction. Batch normalization (BN) and ReLU activation functions were added after each depth-wise and point-wise convolution, where BN compensates for the slow convergence of the neural network while ReLU offers computational advantages that allow the network design to go deeper. The entire MobileNet has 28 layers when the depth-wise and point-wise convolutions are counted separately. These 28 layers were used to extract the gesture image features, which were then input directly into the random forest model for classification.
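A minimal sketch of one such depth-wise separable unit, with batch normalization and ReLU after both the depth-wise and the point-wise convolution, is given below; the filter counts, strides and the pooling used to produce the feature vector for the random forest are assumptions.

```python
# Illustrative depth-wise separable block as used in MobileNet-style networks: a
# depth-wise convolution followed by a point-wise (1x1) convolution, each with BN + ReLU.
from tensorflow.keras import layers, Input, Model

def separable_block(x, filters, stride=1):
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
    x = layers.BatchNormalization()(x)                  # eases slow convergence
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same")(x)    # point-wise convolution
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inp = Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)  # standard conv
x = separable_block(x, 64)
x = separable_block(x, 128, stride=2)
features = layers.GlobalAveragePooling2D()(x)           # features handed to the RF classifier
Model(inp, features).summary()
```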
Sahoo et al. (2022) [
11] proposed a score-level fusion technique between AlexNet and VGG16 for hand gesture recognition. To fine-tune both CNN models, weights were transferred from the pre-trained models for initialization instead of training from scratch. The score vectors generated by the two fine-tuned CNN models were first normalized and then fused using the sum-rule-based method to form a single output vector. On the Massey University (MU) dataset and the HUST American Sign Language (HUST-ASL) dataset, the accuracy of the proposed model was recorded at 90.26% and 56.18%, respectively.
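The score-level fusion step can be sketched as follows; the min–max normalization and the 36-class score vectors are assumptions used purely for illustration.

```python
# Minimal sketch of score-level fusion: normalize the score vectors of two CNNs and
# combine them with the sum rule before taking the arg-max class.
import numpy as np

def min_max(scores):
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)

def sum_rule_fusion(scores_alexnet, scores_vgg16):
    fused = min_max(scores_alexnet) + min_max(scores_vgg16)   # sum-rule fusion
    return np.argmax(fused, axis=-1)                          # predicted class

# Placeholder score vectors for one image over 36 classes (class count is an assumption).
s1 = np.random.rand(36)
s2 = np.random.rand(36)
print(sum_rule_fusion(s1, s2))
```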
Wang et al. (2022) [
12] proposed an improved lightweight CNN model, called E-MobileNetv2, by adding adaptive channel attention (an ECA module) to the existing MobileNetv2 model. The newly added module helped reduce the interference of unrelated information and enhanced the model’s feature refining ability, especially in capturing cross-channel interactions. The proposed model was also equipped with a new activation function, R6-SELU (instead of ReLU6), for better feature extraction and to prevent the loss of negative-valued feature information. The proposed model achieved an accuracy of 96.82% while reducing the number of parameters by 30%.
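An ECA-style channel attention block of the kind added in E-MobileNetv2 can be sketched as follows; the kernel size and placement are assumptions, and the R6-SELU activation is not reproduced here.

```python
# Illustrative efficient channel attention (ECA-style) block: global average pooling,
# a 1-D convolution across channels and a sigmoid gate that re-weights the feature maps.
from tensorflow.keras import layers, Input, Model

def eca_block(x, kernel_size=3):
    channels = x.shape[-1]
    gap = layers.GlobalAveragePooling2D()(x)                   # (batch, C)
    attn = layers.Reshape((channels, 1))(gap)                  # treat channels as a sequence
    attn = layers.Conv1D(1, kernel_size, padding="same", use_bias=False)(attn)
    attn = layers.Activation("sigmoid")(attn)
    attn = layers.Reshape((1, 1, channels))(attn)              # per-channel weights
    return layers.Multiply()([x, attn])                        # cross-channel re-weighting

inp = Input(shape=(32, 32, 64))
out = eca_block(inp)
Model(inp, out).summary()
```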
Furthering the effort to improve the classification of hand gestures, Gadekallu et al. (2022) [
13] proposed a method in which the Harris hawks optimization (HHO) algorithm was utilized to fine-tune the hyperparameters of a CNN model. The algorithm mimics how Harris hawks hunt prey in nature. HHO comprises two exploration stages and four exploitation stages in the effort to locate the optimal solution within a given search space. The effectiveness of the algorithm contributed to the 100% accuracy achieved when tested on the hand gesture dataset from Kaggle, which consists of non-alphabetical hand gesture actions such as a fist, palm and thumb.
Li et al. (2022) [
14] proposed an algorithm that is robust to multi-scale and multi-angle variation against a complex background during feature extraction. Features are extracted from the complex background using a Gaussian model and the K-means algorithm and are then subjected to HOG and 9ULBP. The fused features are not only invariant to scale and rotation but also rich in texture information. The proposed method used an SVM for classification to locate the optimal separation between classes and achieved accuracies of 99.01%, 97.5% and 98.72% when tested on a self-collected dataset, the NUS dataset and the MU Hand Images ASL dataset, respectively. Table 1 presents a summary of the related works.