Real-Time End-to-End Speech Emotion Recognition with Cross-Domain Adaptation
Abstract
1. Introduction
- Real-time end-to-end speech emotion recognition with cross-domain adaptation (E2ESER-CD) is proposed. E2ESER-CD transfers knowledge from the speech recognition domain to the speech emotion recognition domain through a speech recognition front-end network coupled with a speech emotion recognition back-end network, thus achieving better performance than the baselines.
- A comparative study is conducted across pretrained and fine-tuned models and against different baseline models.
- The proposed speech emotion recognition back-end network in E2ESER-CD is built around two designs: convolution time reduction (CTR) and linear mean encoding transformation (LMET); a minimal sketch of both follows this list.
- Network and error analyses are conducted to explain what the front-end and back-end networks learn, based on the model's attention weight patterns and the correctness of its predictions.
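CTR and LMET are only named in the list above, so the following PyTorch sketch is an assumption-laden illustration, not the authors' implementation: a CTR head strides 1-D convolutions over the time axis to shrink the frame sequence before classification, while an LMET head applies a linear transform to each frame encoding and then averages over time. The class names, kernel sizes, strides, and four-emotion output dimension are invented for the example; only the 512-dimensional encoding width comes from the parameter table below.

```python
import torch
import torch.nn as nn

class CTRHead(nn.Module):
    """Convolution time reduction (CTR) sketch: strided 1-D
    convolutions shrink the time axis, then the reduced frames
    are pooled and classified. Kernel/stride values are assumed."""
    def __init__(self, dim=512, n_emotions=4):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        self.classify = nn.Linear(dim, n_emotions)

    def forward(self, feats):                    # feats: (batch, time, dim)
        x = self.reduce(feats.transpose(1, 2))   # Conv1d expects (batch, dim, time)
        return self.classify(x.mean(dim=2))      # pool reduced frames -> logits

class LMETHead(nn.Module):
    """Linear mean encoding transformation (LMET) sketch: a linear
    transform of every frame encoding, a mean over time, then a
    classifier."""
    def __init__(self, dim=512, n_emotions=4):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.classify = nn.Linear(dim, n_emotions)

    def forward(self, feats):                    # feats: (batch, time, dim)
        return self.classify(self.transform(feats).mean(dim=1))

# Either head maps front-end encodings to emotion logits:
feats = torch.randn(2, 100, 512)   # stand-in for wav2vec 2.0/XLSR output
logits = CTRHead()(feats)          # shape: (2, 4)
```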
2. Related Work
3. Proposed Method
3.1. Raw Speech Preparation
3.2. Speech Recognition Front-End Network
3.3. Speech Emotion Recognition Back-End Network
4. Experimental Setup
4.1. Datasets and Preprocessing
4.2. Parameter Setting
4.3. Evaluation Metrics
5. Results and Discussions
5.1. Results
5.2. Network Analysis
5.2.1. Front-End Network Analysis
5.2.2. Back-End Network Analysis
5.3. Error Analysis
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
SER | Speech emotion recognition
CTR | Convolution time reduction
LMET | Linear mean encoding transformation
Emo-DB | Berlin database of German emotional speech
E2ESER-CD | Real-time end-to-end speech emotion recognition with cross-domain adaptation
VTLP | Vocal tract length perturbation
VAD | Voice activity detection
WER | Word error rate
WA | Weighted accuracy
UA | Unweighted accuracy
Appendix A. Details of Common Voice 7.0 German and Thai
Language\Data Split | Dev | Other | Test | Train | Validated
---|---|---|---|---|---
German (de) | 15,907 | 8836 | 15,907 | 360,664 | 684,794
Thai (th) | 9712 | 90,315 | 9712 | 23,332 | 107,747
References
1. Singkul, S.; Woraratpanya, K. Vector Learning Representation for Generalized Speech Emotion Recognition. Heliyon 2022, 8, e09196.
2. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323.
3. Singkul, S.; Chatchaisathaporn, T.; Suntisrivaraporn, B.; Woraratpanya, K. Deep Residual Local Feature Learning for Speech Emotion Recognition. In Neural Information Processing; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; pp. 241–252.
4. Lech, M.; Stolar, M.; Best, C.; Bolia, R. Real-time speech emotion recognition using a pre-trained image classification network: Effects of bandwidth reduction and companding. Front. Comput. Sci. 2020, 2, 14.
5. Protopapas, A.; Lieberman, P. Fundamental frequency of phonation and perceived emotional stress. J. Acoust. Soc. Am. 1997, 101, 2267–2277.
6. Lee, S.; Bresch, E.; Adams, J.; Kazemzadeh, A.; Narayanan, S. A study of emotional speech articulation using a fast magnetic resonance imaging technique. In Proceedings of the Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, 17–21 September 2006.
7. Samantaray, A.K.; Mahapatra, K.; Kabi, B.; Routray, A. A novel approach of speech emotion recognition with prosody, quality and derived features using SVM classifier for a class of North-Eastern Languages. In Proceedings of the 2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), Kolkata, India, 9–11 July 2015; pp. 372–377.
8. Wang, W.; Watters, P.A.; Cao, X.; Shen, L.; Li, B. Significance of phonological features in speech emotion recognition. Int. J. Speech Technol. 2020, 23, 633–642.
9. Breitenstein, C.; Van Lancker, D.; Daum, I. The contribution of speech rate and pitch variation to the perception of vocal emotions in a German and an American sample. Cogn. Emot. 2001, 15, 57–79.
10. Dieleman, S.; Schrauwen, B. End-to-end learning for music audio. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 6964–6968.
11. Li, Y.; Zhao, T.; Kawahara, T. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 2803–2807.
12. Yuenyong, S.; Hnoohom, N.; Wongpatikaseree, K.; Singkul, S. Real-Time Thai Speech Emotion Recognition with Speech Enhancement Using Time-Domain Contrastive Predictive Coding and Conv-Tasnet. In Proceedings of the 2022 7th International Conference on Business and Industrial Research (ICBIR), Bangkok, Thailand, 19–20 May 2022; pp. 78–83.
13. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
14. Conneau, A.; Baevski, A.; Collobert, R.; Mohamed, A.; Auli, M. Unsupervised cross-lingual representation learning for speech recognition. arXiv 2020, arXiv:2006.13979.
15. Soekhoe, D.; Van Der Putten, P.; Plaat, A. On the impact of data set size in transfer learning using deep neural networks. In Proceedings of the International Symposium on Intelligent Data Analysis, Stockholm, Sweden, 13–15 October 2016; Springer: Cham, Switzerland, 2016; pp. 50–60.
16. Singkul, S.; Khampingyot, B.; Maharattamalai, N.; Taerungruang, S.; Chalothorn, T. Parsing Thai Social Data: A New Challenge for Thai NLP. In Proceedings of the 2019 14th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Chiang Mai, Thailand, 7–9 November 2019; pp. 1–7.
17. Singkul, S.; Woraratpanya, K. Thai Dependency Parsing with Character Embedding. In Proceedings of the 2019 11th International Conference on Information Technology and Electrical Engineering (ICITEE), Pattaya, Thailand, 10–11 October 2019; pp. 1–5.
18. Chaksangchaichot, C. Vistec-AIS Speech Emotion Recognition. 2021. Available online: https://github.com/vistec-AI/vistec-ser (accessed on 1 November 2021).
19. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005.
20. Farooq, M.; Hussain, F.; Baloch, N.K.; Raja, F.R.; Yu, H.; Zikria, Y.B. Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network. Sensors 2020, 20, 6008.
21. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587.
22. Anagnostopoulos, C.N.; Iliou, T.; Giannoukos, I. Features and classifiers for emotion recognition from speech: A survey from 2000 to 2011. Artif. Intell. Rev. 2015, 43, 155–177.
23. Shaneh, M.; Taheri, A. Voice Command Recognition System Based on MFCC and VQ Algorithms. Int. J. Comput. Inf. Eng. 2009, 3, 2231–2235.
24. Xu, M.; Zhang, F.; Cui, X.; Zhang, W. Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6319–6323.
25. Kim, C.; Shin, M.; Garg, A.; Gowda, D. Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 739–743.
26. Venkataramanan, K.; Rajamohan, H.R. Emotion Recognition from Speech. arXiv 2019, arXiv:1912.10458.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 630–645.
29. Shanahan, T. Everything You Wanted to Know about Repeated Reading. Reading Rockets. 2017. Available online: https://www.readingrockets.org/blogs/shanahan-literacy/everything-you-wanted-know-about-repeated-reading (accessed on 10 December 2021).
30. Silero Team. Silero VAD: Pre-Trained Enterprise-Grade Voice Activity Detector (VAD), Number Detector and Language Classifier. 2021. Available online: https://github.com/snakers4/silero-vad (accessed on 2 March 2022).
31. Jaitly, N.; Hinton, G.E. Vocal tract length perturbation (VTLP) improves speech recognition. In Proceedings of the ICML Workshop on Deep Learning for Audio, Speech and Language, Atlanta, GA, USA, 16 June 2013; Volume 117, p. 21.
32. Sefara, T.J. The effects of normalisation methods on speech emotion recognition. In Proceedings of the 2019 International Multidisciplinary Information Technology and Engineering Conference (IMITEC), Vanderbijlpark, South Africa, 21–22 November 2019; pp. 1–8.
33. Markitantov, M. Transfer Learning in Speaker’s Age and Gender Recognition. In Speech and Computer; Karpov, A., Potapova, R., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 326–335.
34. Limkonchotiwat, P.; Phatthiyaphaibun, W.; Sarwar, R.; Chuangsuwanich, E.; Nutanong, S. Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Online, 2020; pp. 3841–3847.
35. Ardila, R.; Branson, M.; Davis, K.; Kohler, M.; Meyer, J.; Henretty, M.; Morais, R.; Saunders, L.; Tyers, F.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4218–4222.
36. Lee, J.; Tashev, I. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
38. Schober, P.; Boer, C.; Schwarte, L.A. Correlation coefficients: Appropriate use and interpretation. Anesth. Analg. 2018, 126, 1763–1768.
Parameter | Value
---|---
Embedding size | 512
Learning rate (ASR) |
Learning rate (SER) |
Optimizer | AdamW
Batch size | 8
Number of epochs | 20
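As a rough illustration only, the table above maps onto a standard PyTorch training setup as sketched below; the two learning-rate cells are not recoverable from the extraction, so the value here is an explicitly hypothetical stand-in, and the linear layer merely stands in for the SER back-end head.

```python
import torch
import torch.nn as nn

EMBEDDING_SIZE = 512   # from the table
BATCH_SIZE = 8         # from the table
NUM_EPOCHS = 20        # from the table
LEARNING_RATE = 1e-4   # HYPOTHETICAL: the real ASR/SER values are not legible above

head = nn.Linear(EMBEDDING_SIZE, 4)  # stand-in for the back-end head
optimizer = torch.optim.AdamW(head.parameters(), lr=LEARNING_RATE)
```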
Model | Language | WER (%)
---|---|---
Wav2Vec2 | German | 15.6
Wav2Vec2 | Thai | 44.46
XLSR | German | 18.5
XLSR | Thai | 28.64
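WER in the table above is the standard word error rate: the word-level edit (Levenshtein) distance between the recognizer's hypothesis and the reference transcript, divided by the number of reference words. A minimal framework-free sketch, with an invented example:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = (substitutions + deletions + insertions)
    / number of reference words, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("das ist gut", "das war gut"))  # 0.333...: one substitution in three words
```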
Model | Learning | Back-End | ThaiSER UA (%) | ThaiSER WA (%)
---|---|---|---|---
1DLFLB+LSTM [2] | Scratch | - | 58.07 | 58.38
DeepResLFLB [3] | Scratch | - | 60.73 | 60.60
RT-AlexNet [4] | Scratch | - | 61.58 | 64.96
Wav2Vec2 | Transfer learning | CTR | 69.25 | 68.89
Wav2Vec2 | Transfer learning | LMET | 69.34 | 71.11
XLSR | Transfer learning | CTR | 66.61 | 66.57
XLSR | Transfer learning | LMET | 67.57 | 68.21
Wav2Vec2 | Fine-tuning | CTR | 59.98 | 62.56
Wav2Vec2 | Fine-tuning | LMET | 66.60 | 68.38
XLSR | Fine-tuning | CTR | 65.30 | 65.81
XLSR | Fine-tuning | LMET | 70.73 | 71.27
Model | Learning | Back-End | Emo-DB UA (%) | Emo-DB WA (%)
---|---|---|---|---
1DLFLB+LSTM [2] | Scratch | - | 78.30 | 79.41
DeepResLFLB [3] | Scratch | - | 79.02 | 82.35
RT-AlexNet [4] | Scratch | - | 83.20 | 85.29
Wav2Vec2 | Transfer learning | CTR | 88.69 | 91.18
Wav2Vec2 | Transfer learning | LMET | 81.55 | 85.29
XLSR | Transfer learning | CTR | 54.93 | 58.82
XLSR | Transfer learning | LMET | 85.11 | 88.24
Wav2Vec2 | Fine-tuning | CTR | 56.25 | 61.76
Wav2Vec2 | Fine-tuning | LMET | 53.13 | 58.82
XLSR | Fine-tuning | CTR | 33.70 | 38.24
XLSR | Fine-tuning | LMET | 78.42 | 82.35
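The UA and WA columns in the two tables above follow the usual SER convention: weighted accuracy (WA) is the overall fraction of correct predictions, while unweighted accuracy (UA) averages per-class recall so every emotion counts equally regardless of class size. A minimal sketch under that assumption (the paper's own definitions take precedence if they differ):

```python
from collections import defaultdict

def wa_ua(y_true, y_pred):
    """WA: overall fraction correct. UA: per-class recall averaged
    over classes (balanced accuracy)."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    wa = sum(correct.values()) / sum(total.values())
    ua = sum(correct[c] / total[c] for c in total) / len(total)
    return wa, ua

# Invented toy labels: "ang" recall 1/2, "sad" recall 1/1
print(wa_ua(["ang", "ang", "sad"], ["ang", "sad", "sad"]))  # (0.666..., 0.75)
```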
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).