Voice Interaction Recognition Design in Real-Life Scenario Mobile Robot Applications
Abstract
1. Introduction
2. Human–Voice Interface System
2.1. Speech Pre-Processing
2.2. Speaker Separation System
- (1) Speaker encoder
- (2) Voice Filter model
- (3) Evaluation and Training in Voice Filter model
2.3. Automatic Speech Recognition (ASR)
2.3.1. Conformer Model Architecture
2.3.2. Convolution Module Design
2.3.3. Feed-Forward Module
2.3.4. Model Testing Result
- (1) Speaker Separation Model Testing
- (2) Evaluations for Automatic Speech Recognition Model
3. Environment Map Generation through VSLAM
4. Voice Interactive Robot Control in Real-Robot Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Lin, P.H.; Lin, C.Y.; Hung, C.T.; Chen, J.J.; Liang, J.M. The Autonomous Shopping-Guide Robot in Cashier-Less Convenience Stores. Proc. Eng. Technol. Innov. 2020, 14, 9–15.
- Wuth, J.; Correa, P.; Núñez, T.; Saavedra, M.; Yoma, N.B. The Role of Speech Technology in User Perception and Context Acquisition in HRI. Int. J. Soc. Robot. 2021, 13, 949–968.
- Hershey, J.R.; Chen, Z.; Roux, J.L.; Watanabe, S. Deep Clustering: Discriminative Embeddings for Segmentation and Separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 31–35.
- Luo, Y.; Mesgarani, N. TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 696–700.
- Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266.
- Yu, D.; Kolbæk, M.; Tan, Z.H.; Jensen, J. Permutation Invariant Training of Deep Models for Speaker-Independent Multi-Talker Speech Separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 241–245.
- Rabiner, L.R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 1989, 77, 257–286.
- Graves, A. Sequence Transduction with Recurrent Neural Networks. arXiv 2012, arXiv:1211.3711.
- Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4960–4964.
- Narayanan, A.; Chiu, C.C.; O’Malley, T.; Wang, Q.; He, Y. Cross-Attention Conformer for Context Modeling in Speech Enhancement for ASR. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 312–319.
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 5036–5040.
- Yeh, C.F.; Mahadeokar, J.; Kalgaonkar, K.; Wang, Y.; Le, D.; Jain, M.; Schubert, K.; Fuegen, C.; Seltzer, M.L. Transformer-Transducer: End-to-End Speech Recognition with Self-Attention. arXiv 2019, arXiv:1910.12977.
- Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; et al. A Comparative Study on Transformer vs RNN in Speech Applications. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 449–456.
- Chen, S.; Wu, Y.; Chen, Z.; Wu, J.; Li, J.; Yoshioka, T.; Wang, C.; Liu, S.; Zhou, M. Continuous Speech Separation with Conformer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 5749–5753.
- Kohlbrecher, S.; Stryk, O.V.; Meyer, J.; Klingauf, U. A Flexible and Scalable SLAM System with Full 3D Motion Estimation. In Proceedings of the IEEE International Symposium on Safety, Security, and Rescue Robotics, Kyoto, Japan, 1–5 November 2011; pp. 155–160.
- Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
- Labbé, M.; Michaud, F. RTAB-Map as an Open-Source Lidar and Visual Simultaneous Localization and Mapping Library for Large-Scale and Long-Term Online Operation. J. Field Robot. 2019, 36, 416–444.
- Boll, S.F. Suppression of Acoustic Noise in Speech Using Spectral Subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
- Reynolds, D.A. Gaussian Mixture Models. Encycl. Biom. 2009, 741, 659–663.
- Wang, Q.; Muckenhirn, H.; Wilson, K.; Sridhar, P.; Wu, Z.; Hershey, J.R.; Saurous, R.A.; Weiss, R.J.; Jia, Y.; Moreno, I.L. VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 2728–2732.
- Wan, L.; Wang, Q.; Papir, A.; Moreno, I.L. Generalized End-to-End Loss for Speaker Verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4879–4883.
- Sahidullah, M.; Saha, G. Design, Analysis and Experimental Evaluation of Block Based Transformation in MFCC Computation for Speaker Recognition. Speech Commun. 2012, 54, 543–565.
- Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 1086–1090.
- Vincent, E.; Gribonval, R.; Fevotte, C. Performance Measurement in Blind Audio Source Separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1462–1469.
- Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. Learning Deep Transformer Models for Machine Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1810–1822.
- Lu, Y.; Li, Z.; He, D.; Sun, Z.; Dong, B.; Qin, T.; Wang, L.; Liu, T.Y. Understanding and Improving Transformer from a Multi-Particle Dynamic System Point of View. arXiv 2019, arXiv:1906.02762.
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 933–941.
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
- Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 2613–2617.
- Park, D.S.; Zhang, Y.; Chiu, C.C.; Chen, Y.; Li, B.; Chan, W.; Le, Q.V.; Wu, Y. SpecAugment on Large Scale Datasets. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6879–6883.
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 404–417.
- Zhang, Y.; Park, D.S.; Han, W.; Qin, J.; Gulati, A.; Shor, J.; Jansen, A.; Xu, Y.; Huang, Y.; Wang, S.; et al. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition. IEEE J. Sel. Top. Signal Process. 2022, 16, 1519–1532.
- Novoa, J.; Wuth, J.; Escudero, J.P.; Fredes, J.; Mahu, R.; Yoma, N.B. DNN-HMM Based Automatic Speech Recognition for HRI Scenarios. In Proceedings of the 2018 13th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Chicago, IL, USA, 5–8 March 2018; pp. 150–159.
- Stuede, M.; Wilkening, J.; Tappe, S.; Ortmaier, T. Voice Recognition and Processing Interface for an Interactive Guide Robot in a University Scenario. In Proceedings of the 2019 19th International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 15–18 October 2019; pp. 1238–1242.
- Pleshkova, S.G.; Bekyarski, A.B.; Zahariev, Z.T. LabVIEW Model of Voice Commands for Mobile Robot Motion Control Using Internet of Thinks Module. In Proceedings of the 2019 X National Conference with International Participation (ELECTRONICA), Sofia, Bulgaria, 16–17 May 2019; pp. 1–4.
- Heracleous, P.; Even, J.; Sugaya, F.; Hashimoto, M.; Yoneyama, A. Exploiting Alternative Acoustic Sensors for Improved Noise Robustness in Speech Communication. Pattern Recognit. Lett. 2018, 112, 191–197.
- Lin, H.I.; Nanda, S. 6 DOF Pose Estimation for Efficient Robot Manipulation. In Proceedings of the IEEE Conference on Industrial Cyberphysical Systems (ICPS), Tampere, Finland, 10–12 June 2020; pp. 279–284.
- Labbé, M.; Michaud, F. Long-Term Online Multi-Session Graph-Based SPLAM with Memory Management. Auton. Robot. 2018, 42, 1133–1150.
- Wong, C.C.; Chien, S.Y.; Feng, H.M.; Aoyama, H. Motion Planning for Dual-Arm Robot Based on Soft Actor-Critic. IEEE Access 2021, 9, 26871–26885.
- Li, S.A.; Chou, L.H.; Chang, T.H.; Wong, C.C.; Feng, H.M. Design and Implementation of an Autonomous Service Robot Based on Cyber Physical Modeling Systems. Proc. Inst. Mech. Eng. B J. Eng. Manuf. 2022, 1–15, advance online publication.
- Available online: https://youtu.be/6-bWdy5DG8A (accessed on 16 January 2022).
Layer | Kernel Size (Time) | Kernel Size (Freq) | Dilation (Time) | Dilation (Freq) | Filters/Nodes
---|---|---|---|---|---
Convolution Layer 1 | 1 | 7 | 1 | 1 | 64
Convolution Layer 2 | 7 | 1 | 1 | 1 | 64
Convolution Layer 3 | 5 | 5 | 1 | 1 | 64
Convolution Layer 4 | 5 | 5 | 2 | 1 | 64
Convolution Layer 5 | 5 | 5 | 4 | 1 | 64
Convolution Layer 6 | 5 | 5 | 8 | 1 | 64
Convolution Layer 7 | 5 | 5 | 16 | 1 | 64
Convolution Layer 8 | 1 | 1 | 1 | 1 | 8
LSTM | | | | | 400
FC 1 | | | | | 600
FC 2 | | | | | 600
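A minimal PyTorch sketch of the CNN + LSTM + FC stack the table describes is shown below. The table does not specify padding, normalization, activations, or how the speaker embedding enters the network, so the "same" padding, batch normalization, ReLU, sigmoid soft mask, and the d-vector concatenation before the LSTM are assumptions borrowed from the VoiceFilter design [20]; `n_freq_bins` and `dvec_dim` are hypothetical parameters.

```python
# Sketch of the speaker-separation network summarized in the table above.
import torch
import torch.nn as nn

class VoiceFilterSketch(nn.Module):
    def __init__(self, n_freq_bins=600, dvec_dim=256):
        super().__init__()
        # (kernel_time, kernel_freq), (dilation_time, dilation_freq) per the table
        specs = [((1, 7), (1, 1)), ((7, 1), (1, 1)), ((5, 5), (1, 1)),
                 ((5, 5), (2, 1)), ((5, 5), (4, 1)), ((5, 5), (8, 1)),
                 ((5, 5), (16, 1))]
        layers, in_ch = [], 1
        for k, d in specs:
            pad = ((k[0] - 1) * d[0] // 2, (k[1] - 1) * d[1] // 2)  # "same" padding
            layers += [nn.Conv2d(in_ch, 64, k, dilation=d, padding=pad),
                       nn.BatchNorm2d(64), nn.ReLU()]
            in_ch = 64
        layers += [nn.Conv2d(64, 8, (1, 1)), nn.BatchNorm2d(8), nn.ReLU()]  # layer 8
        self.cnn = nn.Sequential(*layers)
        # LSTM consumes CNN features concatenated with the speaker d-vector
        self.lstm = nn.LSTM(8 * n_freq_bins + dvec_dim, 400, batch_first=True)
        self.fc1 = nn.Linear(400, 600)
        self.fc2 = nn.Linear(600, n_freq_bins)  # 600 nodes when n_freq_bins = 600

    def forward(self, spec, dvec):
        # spec: (batch, time, freq) magnitude spectrogram; dvec: (batch, dvec_dim)
        x = self.cnn(spec.unsqueeze(1))               # (B, 8, T, F)
        x = x.permute(0, 2, 1, 3).flatten(2)          # (B, T, 8*F)
        dvec = dvec.unsqueeze(1).expand(-1, x.size(1), -1)
        x, _ = self.lstm(torch.cat([x, dvec], dim=2))
        mask = torch.sigmoid(self.fc2(torch.relu(self.fc1(x))))
        return mask * spec                            # masked target-speaker spectrogram
```

Note from the dilation columns that only the time-axis dilation grows (1, 2, 4, 8, 16), so the temporal receptive field widens layer by layer while frequency resolution is preserved.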
Symbol | Meaning
---|---
$\hat{s}$ | Vector of output audio
$s$ | Vector of target audio
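These two vectors enter the signal-to-distortion ratio (SDR) reported in the next table. A common simplified form is shown below for reference; the notation ($s$ for the target, $\hat{s}$ for the separated output) and this exact form are assumptions, since the article's own equation is not reproduced here, and the full BSS_EVAL definition of Vincent et al. [24] further decomposes the error into interference and artifact terms.

```latex
% Signal-to-distortion ratio between target s and separated output \hat{s}
\mathrm{SDR}(s, \hat{s}) = 10 \log_{10} \frac{\lVert s \rVert^{2}}{\lVert s - \hat{s} \rVert^{2}}\ \text{dB}
```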
Training Set | SDR | Training Steps | Batch Size
---|---|---|---
Train-100 | 3.2 dB | 65 k | 8
Train-360 | 5.4 dB | 65 k | 8
Train-100 and train-360 | 7.5 dB | 30 k | 8
Train-100 and train-360 | 8.6 dB | 530 k | 4
Training Set | WER Test Result
---|---
Train-100 | 14.1%
Train-360 | 11.5%
Train-100 and train-360 | 10.3%
Train-100, train-360, and train-500 | 5.3%
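The word error rates reported above and in the comparison table below follow the standard definition, stated here for reference; the symbols are the usual edit-distance counts and are not taken from the article.

```latex
% S substitutions, D deletions, I insertions against N reference words
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
```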
Methods | WER | Speech Recognition Accuracy | Fitting Type
---|---|---|---
Zhang et al. [34] | 7.7% | X | Universal word type
Novoa et al. [35] | 11.62% | X | Universal word type
Google API [35] | 15.79% | X | Universal word type
IBM API [35] | 40.74% | X | Universal word type
Stuede et al. [36] | 5.8% | 83.3% | Single word type
Pleshkova et al. [37] | X | 82% | Single word type
Heracleous et al. [38] | X | 88.5% | Single word type
The proposed method | 5.3% | 89.3% | Universal word type

X: not reported.
Share and Cite
Li, S.-A.; Liu, Y.-Y.; Chen, Y.-C.; Feng, H.-M.; Shen, P.-K.; Wu, Y.-C. Voice Interaction Recognition Design in Real-Life Scenario Mobile Robot Applications. Appl. Sci. 2023, 13, 3359. https://doi.org/10.3390/app13053359