Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network
Abstract
1. Introduction
- End-to-end fine-tuning of deep CNNs, namely AlexNet and VGG-16, is performed on the training gesture samples of the target dataset. A score-level fusion technique is then applied to the output scores of the fine-tuned CNNs.
- Recognition accuracy is evaluated on two publicly available benchmark American Sign Language (ASL) datasets with large numbers of gesture classes.
- A real-time gesture recognition system based on the proposed technique is developed and tested in subject-independent mode.
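The score-level fusion named in the contributions can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; the function names are hypothetical, and the weighted-sum rule with min-max normalization follows the description given later for the real-time system (w = 0.5).

```python
import numpy as np

def min_max(scores):
    # Rescale a score vector to [0, 1] (min-max normalization).
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-8)

def fuse_and_classify(scores_a, scores_b, w=0.5):
    """Weighted-sum score-level fusion of two classifiers' score vectors.
    `w` balances the two models; the paper's real-time system uses w = 0.5."""
    fused = w * min_max(scores_a) + (1.0 - w) * min_max(scores_b)
    # The recognized class is the index of the maximum fused score.
    return int(np.argmax(fused)), fused

# Example: two 3-class score vectors that disagree on the top class.
label, fused = fuse_and_classify(np.array([0.2, 0.8, 0.5]),
                                 np.array([0.9, 0.4, 0.1]))
```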
2. Related Works
2.1. Hand Gesture Recognition Using RGB Sensor Input
2.2. Hand Gesture Recognition Using RGB-D Sensor Input
3. Proposed Methodology
3.1. Data Acquisition
3.2. Preprocessing
3.3. Architecture of Pre-Trained CNNs and Fine-Tuning
3.4. Normalization
3.5. Score-Level Fusion Technique between Two Fine-Tuned CNNs
4. Experimental Evaluation
4.1. Benchmark Datasets
4.1.1. Massey University (MU) Dataset
4.1.2. HUST American Sign Language (HUST-ASL) Dataset
4.2. Data Analysis Using Validation Technique
4.3. Setting of Hyperparameters for Fine-Tuning
5. Results and Analysis
5.1. Performance Evaluation
5.2. Comparison with Earlier Methods
5.2.1. Comparison Results on MU Dataset
5.2.2. Comparison Results on HUST-ASL Dataset
5.3. Error Analysis on Both Datasets
5.4. Computational Time of the Proposed Method on Real-Time Data
6. Recognition of ASL Gestures in Real Time
- (1) Segmentation of hand region: pixel values above d + 10 cm are set to zero in the depth map, where d is the depth of the pixel closest to the Kinect camera (the hand is assumed to be the nearest object in the frame).
- (2) Conversion from one to three channels: the pixel values of the hand-segmented image are normalized and converted into three channels using a jet color map.
- (3) Image resize: the three-channel color image is resized to 227 × 227 and 224 × 224 pixels, matching the input image sizes of the fine-tuned AlexNet and VGG-16 CNN models, respectively.
- (4) Fine-tuned CNNs: the resized gesture image is fed to both fine-tuned CNNs to obtain their output scores. The fine-tuned CNNs are the models trained on the HUST-ASL dataset.
- (5) Score fusion: both score vectors are normalized using min-max normalization, and the normalized scores are combined using (3) with the weight value w set to 0.5.
- (6) Recognized hand gesture: the output gesture pose corresponds to the maximum value in the fused score vector.
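The preprocessing in steps (1)-(3) can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the paper's implementation: the coarse five-anchor jet approximation and the nearest-neighbour resize stand in for library colormap and resize routines (e.g., OpenCV or MATLAB), which the paper does not specify, and all function names are hypothetical.

```python
import numpy as np

def segment_hand(depth_map, margin_cm=10.0):
    """Step 1: keep only pixels within `margin_cm` of the nearest point.
    Pixels deeper than d + margin are zeroed, where d is the smallest
    nonzero depth (the hand is assumed closest to the camera)."""
    valid = depth_map[depth_map > 0]
    if valid.size == 0:
        return np.zeros_like(depth_map)
    d = valid.min()
    out = depth_map.copy()
    out[out > d + margin_cm] = 0
    return out

def to_three_channels(depth_map):
    """Step 2: normalize depth to [0, 1] and map through a jet-like
    colormap (a coarse five-anchor approximation of jet)."""
    lo, hi = depth_map.min(), depth_map.max()
    norm = (depth_map - lo) / (hi - lo + 1e-8)
    anchors = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
    red   = np.interp(norm, anchors, [0.0, 0.0, 0.0, 1.0, 1.0])
    green = np.interp(norm, anchors, [0.0, 1.0, 1.0, 1.0, 0.0])
    blue  = np.interp(norm, anchors, [1.0, 1.0, 0.0, 0.0, 0.0])
    return np.stack([red, green, blue], axis=-1)

def resize_nearest(img, size):
    """Step 3: nearest-neighbour resize to the CNN input resolution
    (227x227 for AlexNet, 224x224 for VGG-16)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

# Example: a synthetic 40x40 depth map, hand at 50 cm, background at 200 cm.
depth = np.full((40, 40), 200.0)
depth[10:20, 10:20] = 50.0
rgb = resize_nearest(to_three_channels(segment_hand(depth)), 227)
```

The resized three-channel image is then passed to the fine-tuned CNNs (step 4) exactly as a regular RGB input.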
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mitra, S.; Acharya, T. Gesture Recognition: A Survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2007, 37, 311–324.
- Wachs, J.P.; Kölsch, M.; Stern, H.; Edan, Y. Vision-based hand-gesture applications. Commun. ACM 2011, 54, 60–71.
- McNeill, D. Hand and Mind; De Gruyter Mouton: Berlin, Germany, 2011.
- Pugeault, N.; Bowden, R. Spelling it out: Real-time ASL fingerspelling recognition. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 1114–1119.
- Sharma, A.; Mittal, A.; Singh, S.; Awatramani, V. Hand Gesture Recognition using Image Processing and Feature Extraction Techniques. Procedia Comput. Sci. 2020, 173, 181–190.
- Lian, S.; Hu, W.; Wang, K. Automatic user state recognition for hand gesture based low-cost television control system. IEEE Trans. Consum. Electron. 2014, 60, 107–115.
- Ren, Z.; Yuan, J.; Meng, J.; Zhang, Z. Robust Part-Based Hand Gesture Recognition Using Kinect Sensor. IEEE Trans. Multimed. 2013, 15, 1110–1120.
- Wang, C.; Liu, Z.; Chan, S.C. Superpixel-Based Hand Gesture Recognition with Kinect Depth Camera. IEEE Trans. Multimed. 2015, 17, 29–39.
- Feng, B.; He, F.; Wang, X.; Wu, Y.; Wang, H.; Yi, S.; Liu, W. Depth-Projection-Map-Based Bag of Contour Fragments for Robust Hand Gesture Recognition. IEEE Trans. Hum.-Mach. Syst. 2017, 47, 511–523.
- Pisharady, P.K.; Saerbeck, M. Recent methods and databases in vision-based hand gesture recognition: A review. Comput. Vis. Image Underst. 2015, 141, 152–165.
- Suarez, J.; Murphy, R.R. Hand gesture recognition with depth images: A review. In Proceedings of the 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication, Paris, France, 9–13 September 2012.
- Han, J.; Shao, L.; Xu, D.; Shotton, J. Enhanced Computer Vision with Microsoft Kinect Sensor: A Review. IEEE Trans. Cybern. 2013, 43, 1318–1334.
- Modanwal, G.; Sarawadekar, K. Towards hand gesture based writing support system for blinds. Pattern Recognit. 2016, 57, 50–60.
- Plouffe, G.; Cretu, A.M. Static and Dynamic Hand Gesture Recognition in Depth Data Using Dynamic Time Warping. IEEE Trans. Instrum. Meas. 2016, 65, 305–316.
- Sharma, P.; Anand, R.S. Depth data and fusion of feature descriptors for static gesture recognition. IET Image Process. 2020, 14, 909–920.
- Patil, A.R.; Subbaraman, S. A spatiotemporal approach for vision-based hand gesture recognition using Hough transform and neural network. Signal Image Video Process. 2019, 13, 413–421.
- Tao, W.; Leu, M.C.; Yin, Z. American Sign Language alphabet recognition using Convolutional Neural Networks with multiview augmentation and inference fusion. Eng. Appl. Artif. Intell. 2018, 76, 202–213.
- Oyedotun, O.K.; Khashman, A. Deep learning in vision-based static hand gesture recognition. Neural Comput. Appl. 2017, 28, 3941–3951.
- Tajbakhsh, N.; Shin, J.Y.; Gurudu, S.R.; Hurst, R.T.; Kendall, C.B.; Gotway, M.B.; Liang, J. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans. Med. Imaging 2016, 35, 1299–1312.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Chevtchenko, S.F.; Vale, R.F.; Macario, V.; Cordeiro, F.R. A convolutional neural network with feature fusion for real-time hand posture recognition. Appl. Soft Comput. 2018, 73, 748–766.
- Lee, D.L.; You, W.S. Recognition of complex static hand gestures by using the wristband-based contour features. IET Image Process. 2018, 12, 80–87.
- Chevtchenko, S.F.; Vale, R.F.; Macario, V. Multi-objective optimization for hand posture recognition. Expert Syst. Appl. 2018, 92, 170–181.
- Fang, L.; Liang, N.; Kang, W.; Wang, Z.; Feng, D.D. Real-time hand posture recognition using hand geometric features and fisher vector. Signal Process. Image Commun. 2020, 82, 115729.
- Barbhuiya, A.A.; Karsh, R.K.; Jain, R. CNN based feature extraction and classification for sign language. Multimed. Tools Appl. 2021, 80, 3051–3069.
- Dadashzadeh, A.; Targhi, A.T.; Tahmasbi, M.; Mirmehdi, M. HGR-Net: A fusion network for hand gesture segmentation and recognition. IET Comput. Vis. 2019, 13, 700–707.
- Guo, L.; Lu, Z.; Yao, L. Human-machine interaction sensing technology based on hand gesture recognition: A review. IEEE Trans. Hum.-Mach. Syst. 2021, 51, 300–309.
- Eitel, A.; Springenberg, J.T.; Spinello, L.; Riedmiller, M.; Burgard, W. Multimodal deep learning for robust RGB-D object recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 681–687.
- Ding, J.; Chen, B.; Liu, H.; Huang, M. Convolutional Neural Network with Data Augmentation for SAR Target Recognition. IEEE Geosci. Remote Sens. Lett. 2016, 13, 364–368.
- Han, D.; Liu, Q.; Fan, W. A new image classification method using CNN transfer learning and web data augmentation. Expert Syst. Appl. 2018, 95, 43–56.
- Akcay, S.; Kundegorski, M.E.; Willcocks, C.G.; Breckon, T.P. Using Deep Convolutional Neural Network Architectures for Object Classification and Detection within X-Ray Baggage Security Imagery. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2203–2215.
- He, M.; Horng, S.J.; Fan, P.; Run, R.S.; Chen, R.J.; Lai, J.L.; Khan, M.K.; Sentosa, K.O. Performance evaluation of score level fusion in multimodal biometric systems. Pattern Recognit. 2010, 43, 1789–1800.
- Taheri, S.; Toygar, Ö. Animal classification using facial images with score-level fusion. IET Comput. Vis. 2018, 12, 679–685.
- Barczak, A.; Reyes, N.; Abastillas, M.; Piccio, A.; Susnjak, T. A new 2D static hand gesture colour image dataset for ASL gestures. Res. Lett. Inf. Math. Sci. 2011, 15, 12–20.
Data Acquisition Sensor | Wearable | Advantages | Limitations
---|---|---|---
Data glove | Yes | Low cost, robust | Less comfortable and less user-friendly
Leap Motion | No | Tracks the hand with high precision | Hand must be held above the sensor; small coverage area
Vision sensor (web camera) | No | Free to use | Affected by background clutter and human noise
Depth sensor (Kinect) | No | No color marker needed; hand segmentation is easier | Hand must be the closest object in the camera frame
Fine-Tuned CNN | MU Dataset, LOO CV, % | MU Dataset, Regular CV, % | HUST-ASL Dataset, LOO CV, % | HUST-ASL Dataset, Regular CV, %
---|---|---|---|---
AlexNet | 87.10 ± 1.67 | 97.88 ± 1.71 | 49.26 ± 3.66 | 61.05 ± 1.39
VGG-16 | 88.11 ± 1.44 | 97.80 ± 1.72 | 54.71 ± 3.34 | 62.51 ± 1.04
Proposed | 90.26 ± 1.35 | 98.14 ± 1.68 | 56.18 ± 3.13 | 64.55 ± 0.99
Test Methods | Mean Accuracy (LOO CV), % |
---|---|
CNN [24] | 73.86 ± 1.04 |
FFCN [24] | 84.02 ± 0.59 |
AlexNet + SVM [28] | 70.00 |
VGG-16 + SVM [28] | 70.00
Proposed | 90.26 ± 1.35 |
Test Methods | Mean Accuracy (Holdout CV), % |
---|---|
GB-ZM | 97.09 ± 0.80 |
GB-HU | 97.63 ± 0.76 |
Proposed | 98.93 ± 0.68 |
Test Methods | LOO CV, % | Regular CV, %
---|---|---
Front-view-based BCF | 50.4 ± 6.1 | 56.5 ± 0.6 |
Proposed | 56.27 ± 3.13 | 64.51 ± 0.99 |
ASL Gesture Class | MU Dataset, AlexNet, % | MU Dataset, VGG-16, % | MU Dataset, Proposed, % | HUST-ASL Dataset, AlexNet, % | HUST-ASL Dataset, VGG-16, % | HUST-ASL Dataset, Proposed, %
---|---|---|---|---|---|---
0 | 62.9 | 61.4 | 67.1 | 45.0 | 51.9 | 53.1
a | 97.1 | 100 | 100 | 24.4 | 36.3 | 33.8
e | 100 | 100 | 100 | 16.9 | 29.4 | 29.4
m | 75.7 | 58.6 | 64.3 | 38.1 | 41.3 | 43.1
n | 70.0 | 88.6 | 88.6 | 30.0 | 37.5 | 38.8
o | 77.1 | 78.6 | 84.3 | 32.5 | 33.1 | 35.0
s | 95.7 | 98.6 | 98.6 | 23.1 | 26.3 | 28.8
t | 95.4 | 100 | 100 | 38.8 | 36.9 | 40.0
Sahoo, J.P.; Prakash, A.J.; Pławiak, P.; Samantray, S. Real-Time Hand Gesture Recognition Using Fine-Tuned Convolutional Neural Network. Sensors 2022, 22, 706. https://doi.org/10.3390/s22030706