A Speech Command Control-Based Recognition System for Dysarthric Patients Based on Deep Learning Technology
Abstract
1. Introduction
2. Method
2.1. Material
2.2. The Proposed CNN–PPG SCR System
2.3. The Classical SCR Systems
2.3.1. CNN–MFCC SCR System
2.3.2. ASR-Based SCR System
2.4. Experiment Design
3. Results and Discussion
3.1. The Analysis of Speech Features between MFCC and PPG
3.2. Recognition Performance of Each SCR System
3.3. Existing Applications of Deep Learning Technology in Healthcare
4. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. The Setting of CNN–PPG System
Input: 120 D; output: 19 classes.

| Hidden Layer | Configuration |
|---|---|
| Layer 1 | filters: 10, kernel size: 3 × 3, strides: 2 × 2, ReLU |
| Layer 2 | filters: 8, kernel size: 3 × 3, strides: 2 × 2, ReLU |
| Layer 3 | filters: 10, kernel size: 3 × 3, strides: 2 × 2, ReLU |
| Output | Global average pooling, Dense (19), softmax |
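For readers who want to reproduce this configuration, a minimal Keras sketch follows. The input tensor layout (time × feature × one channel) and the `same` padding are assumptions not stated in the table; with one input channel, the trainable-parameter count (1767) matches model size F in Appendix F.

```python
# Minimal Keras sketch of the CNN-PPG classifier in Appendix A.
# Assumptions (not stated in the table): input laid out as
# (time, feature, 1 channel) and "same" padding.
from tensorflow.keras import layers, models

def build_cnn_ppg(input_shape=(120, 33, 1), n_classes=19):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(10, (3, 3), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(8, (3, 3), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(10, (3, 3), strides=(2, 2), padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_cnn_ppg()
model.summary()  # 1767 trainable parameters, matching size F in Appendix F
```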
Appendix B. The 33-Dimensional Data of PPG
| Class Index | Phone | Class Index | Phone | Class Index | Phone |
|---|---|---|---|---|---|
| 1 | SIL | 12 | h | 23 | o1 |
| 2 | a1 | 13 | i1 | 24 | o3 |
| 3 | a2 | 14 | i2 | 25 | o4 |
| 4 | a3 | 15 | i3 | 26 | q |
| 5 | a4 | 16 | i4 | 27 | s |
| 6 | b6 | 17 | ii4 | 28 | sh |
| 7 | d7 | 18 | j | 29 | u1 |
| 8 | e4 | 19 | l | 30 | u3 |
| 9 | err4 | 20 | ng4 | 31 | u4 |
| 10 | f | 21 | nn1 | 32 | x |
| 11 | g | 22 | nn2 | 33 | z |
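The ASR-based front end produces per-frame posteriors over these 33 classes, so an index-to-phone lookup is handy when inspecting PPG output. A minimal sketch, transcribed from the table above (the helper name and decoding step are illustrative, not from the paper):

```python
import numpy as np

# Class index -> phone, transcribed from the Appendix B table.
PPG_PHONES = {
    1: "SIL", 2: "a1", 3: "a2", 4: "a3", 5: "a4", 6: "b6", 7: "d7",
    8: "e4", 9: "err4", 10: "f", 11: "g", 12: "h", 13: "i1", 14: "i2",
    15: "i3", 16: "i4", 17: "ii4", 18: "j", 19: "l", 20: "ng4",
    21: "nn1", 22: "nn2", 23: "o1", 24: "o3", 25: "o4", 26: "q",
    27: "s", 28: "sh", 29: "u1", 30: "u3", 31: "u4", 32: "x", 33: "z",
}

def decode_frames(ppg):
    """Map a (frames, 33) posteriorgram to the most likely phone per frame."""
    return [PPG_PHONES[i + 1] for i in np.argmax(ppg, axis=1)]
```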
Appendix C. The Setting of CNN–MFCC System
Input: 120 D; output: 19 classes.

| Hidden Layer | Configuration |
|---|---|
| Layer 1 | filters: 12, kernel size: 3 × 3, strides: 2 × 2, PReLU |
| Layer 2 | filters: 12, kernel size: 3 × 3, strides: 2 × 2, PReLU |
| Layer 3 | filters: 24, kernel size: 3 × 3, strides: 2 × 2, PReLU |
| Layer 4 | filters: 24, kernel size: 3 × 3, strides: 1 × 1, PReLU |
| Layer 5 | filters: 48, kernel size: 3 × 3, strides: 1 × 1, PReLU |
| Layer 6 | filters: 48, kernel size: 3 × 3, strides: 1 × 1, PReLU |
| Layer 7 | filters: 96, kernel size: 3 × 3, strides: 1 × 1, PReLU |
| Layer 8 | filters: 96, kernel size: 3 × 3, strides: 1 × 1, PReLU |
| Layer 9 | filters: 192, kernel size: 3 × 3, strides: 1 × 1, PReLU, dropout (0.4) |
| Layer 10 | filters: 192, kernel size: 3 × 3, strides: 1 × 1, PReLU, dropout (0.3) |
| Output | Global average pooling, Dense (19), Dropout (0.2), softmax |
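A corresponding Keras sketch of the Appendix C network, under the same assumptions about input layout and padding, is below. The PReLU parameter sharing here (one slope per channel) is also an assumption; the totals printed in Appendix E suggest the authors counted PReLU parameters somewhat differently, so `model.count_params()` will land near, not exactly on, those figures.

```python
# Minimal Keras sketch of the CNN-MFCC classifier in Appendix C.
# Assumed: (time, feature, 1) input, "same" padding, per-channel PReLU.
from tensorflow.keras import layers, models

def build_cnn_mfcc(input_shape=(50, 120, 1), n_classes=19):
    # (filters, stride, dropout-after-layer) per Appendix C
    cfg = [(12, 2, None), (12, 2, None), (24, 2, None), (24, 1, None),
           (48, 1, None), (48, 1, None), (96, 1, None), (96, 1, None),
           (192, 1, 0.4), (192, 1, 0.3)]
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters, stride, drop in cfg:
        model.add(layers.Conv2D(filters, (3, 3), strides=(stride, stride),
                                padding="same"))
        model.add(layers.PReLU(shared_axes=[1, 2]))  # one slope per channel
        if drop is not None:
            model.add(layers.Dropout(drop))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(n_classes))
    model.add(layers.Dropout(0.2))  # placed before softmax, as in Appendix C
    model.add(layers.Activation("softmax"))
    return model
```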
Appendix D. The Setting of ASR-Based Model
Input: 120 D; output: 33 classes.

| Hidden Layer | Configuration |
|---|---|
| Layers 1–11 (each) | dims: 128, context_size = 3, dilation = 14, ReLU |
| Output | Dense (33), softmax |
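TDNN layers are commonly realized as dilated 1-D convolutions, which yields the following sketch of the Appendix D model. The `context_size = 3, dilation = 14` setting is reproduced as printed in the table; the input layout and `same` padding are assumptions.

```python
# Minimal sketch of the TDNN-style acoustic model in Appendix D,
# realized with dilated 1-D convolutions: kernel_size plays the role
# of the context size; the dilation value is taken from the table.
from tensorflow.keras import layers, models

def build_asr_tdnn(feat_dim=120, n_classes=33, n_layers=11):
    inp = layers.Input(shape=(None, feat_dim))  # (frames, features)
    x = inp
    for _ in range(n_layers):
        x = layers.Conv1D(128, kernel_size=3, dilation_rate=14,
                          padding="same", activation="relu")(x)
    # Framewise posteriors over the 33 phone classes of Appendix B.
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```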
Appendix E. The Layers and Parameter Number Setting of CNN–MFCC System
Filters per layer and total parameter count for each model size (A–F).

| Layer | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| Layer 1 | 8 | 9 | 10 | 12 | 14 | 16 |
| Layer 2 | 8 | 9 | 10 | 12 | 14 | 16 |
| Layer 3 | 16 | 18 | 20 | 24 | 28 | 32 |
| Layer 4 | 16 | 18 | 20 | 24 | 28 | 32 |
| Layer 5 | 32 | 36 | 40 | 48 | 56 | 64 |
| Layer 6 | 32 | 36 | 40 | 48 | 56 | 64 |
| Layer 7 | 64 | 72 | 80 | 96 | 112 | 128 |
| Layer 8 | 64 | 72 | 80 | 96 | 112 | 128 |
| Layer 9 | 128 | 144 | 160 | 192 | 224 | 256 |
| Layer 10 | 128 | 144 | 160 | 192 | 224 | 256 |
| Output | 19 | 19 | 19 | 19 | 19 | 19 |
| Total | 303,355 | 382,663 | 471,169 | 675,775 | 917,173 | 1,195,363 |
Appendix F. The Layers and Parameter Number Setting of CNN–PPG System
Filters per layer and total parameter count for each model size (A–G).

| Layer | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| Layer 1 | 5 | 6 | 7 | 8 | 9 | 10 | 12 |
| Layer 2 | 3 | 4 | 5 | 5 | 6 | 8 | 8 |
| Layer 3 | 5 | 6 | 7 | 8 | 9 | 10 | 12 |
| Output | 19 | 19 | 19 | 19 | 19 | 19 | 19 |
| Total | 442 | 635 | 864 | 984 | 1267 | 1767 | 2115 |
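These totals follow from standard parameter counting for 3 × 3 convolutions with one input channel plus the final dense layer (biases included). A quick check that reproduces all seven columns:

```python
# Sanity check: Appendix F totals from Conv2D/Dense parameter counting
# (3x3 kernels, 1 input channel, 19 output classes, biases included).
def cnn_ppg_params(f1, f2, f3, n_classes=19):
    conv1 = 9 * 1 * f1 + f1             # 3x3 kernel on 1 channel, + bias
    conv2 = 9 * f1 * f2 + f2
    conv3 = 9 * f2 * f3 + f3
    dense = f3 * n_classes + n_classes  # after global average pooling
    return conv1 + conv2 + conv3 + dense

sizes = {"A": (5, 3, 5), "B": (6, 4, 6), "C": (7, 5, 7), "D": (8, 5, 8),
         "E": (9, 6, 9), "F": (10, 8, 10), "G": (12, 8, 12)}
for name, f in sizes.items():
    print(name, cnn_ppg_params(*f))  # 442, 635, 864, 984, 1267, 1767, 2115
```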
Appendix G. The Layers and Parameter Number Setting of ASR-Based SCR System
Hidden-layer dims and total parameter count for each model size (A–H); a 0 entry means the layer is not present at that size.

| Layer | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| Layer 1 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Layer 2 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Layer 3 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Layer 4 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Layer 5 | 0 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Layer 6 | 0 | 0 | 128 | 128 | 128 | 128 | 128 | 128 |
| Layer 7 | 0 | 0 | 0 | 128 | 128 | 128 | 128 | 128 |
| Layer 8 | 0 | 0 | 0 | 0 | 128 | 128 | 128 | 128 |
| Layer 9 | 0 | 0 | 0 | 0 | 0 | 128 | 128 | 128 |
| Layer 10 | 0 | 0 | 0 | 0 | 0 | 0 | 128 | 128 |
| Layer 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 128 |
| Output | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 |
| Total | 237,104 | 278,192 | 319,280 | 360,368 | 396,336 | 427,184 | 468,272 | 509,360 |
Recognition accuracy (%) over ten runs, in the training and application phases, for the convolutional neural network with Mel-frequency cepstral coefficients (CNN–MFCC), the convolutional neural network with a phonetic posteriorgram (CNN–PPG), and the automatic speech recognition (ASR)-based SCR systems.

| Run | CNN–MFCC, Training | CNN–MFCC, Application | CNN–PPG, Training | CNN–PPG, Application | ASR, Training | ASR, Application |
|---|---|---|---|---|---|---|
| 1 | 97.9% | 57.9% | 95.4% | 95.3% | 100% | 89.5% |
| 2 | 98.2% | 67.3% | 95.7% | 94.2% | 100% | 94.2% |
| 3 | 98.2% | 63.7% | 93.2% | 95.3% | 100% | 89.5% |
| 4 | 97.9% | 67.8% | 96.7% | 96.5% | 100% | 74.9% |
| 5 | 96.7% | 69.6% | 95.7% | 93.6% | 100% | 87.1% |
| 6 | 92.2% | 71.9% | 96.9% | 92.9% | 100% | 94.2% |
| 7 | 95.9% | 64.9% | 95.2% | 90.0% | 99.7% | 94.2% |
| 8 | 99.2% | 64.9% | 98.9% | 90.0% | 100% | 88.9% |
| 9 | 97.9% | 62.0% | 97.2% | 91.2% | 100% | 95.2% |
| 10 | 96.7% | 66.7% | 96.2% | 95.9% | 100% | 88.9% |
| Average | 97.1% | 65.7% | 96.1% | 93.4% | 99.9% | 89.6% |
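As a sanity check, the phase averages can be recomputed from the per-run values; a minimal sketch using the application-phase columns (±0.1% differences from the printed averages are expected, since the per-run entries are themselves rounded):

```python
import numpy as np

# Application-phase accuracies transcribed from the table above.
app = {
    "CNN-MFCC": [57.9, 67.3, 63.7, 67.8, 69.6, 71.9, 64.9, 64.9, 62.0, 66.7],
    "CNN-PPG":  [95.3, 94.2, 95.3, 96.5, 93.6, 92.9, 90.0, 90.0, 91.2, 95.9],
    "ASR":      [89.5, 94.2, 89.5, 74.9, 87.1, 94.2, 94.2, 88.9, 95.2, 88.9],
}
for name, acc in app.items():
    print(f"{name}: mean {np.mean(acc):.1f}%, std {np.std(acc, ddof=1):.1f}%")
# -> roughly 65.7%, 93.5%, 89.7% vs. printed averages 65.7%, 93.4%, 89.6%
```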
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).