Speech Emotion Recognition Using Multi-Scale Global–Local Representation Learning with Feature Pyramid Network
Abstract
1. Introduction
1. To the best of our knowledge, this is the first detailed exploration of feature pyramids for speech emotion recognition. We enhance the multi-scale CNN (MSCNN) with a convolutional self-attention (CSA) module to better capture local emotional correlations (a minimal illustrative sketch follows this list).
2. We improve the CSA module with a multi-scale convolutional block, avoiding the degradation problem of convolutional attention networks.
3. We design a backward fusion path that captures features across different levels of detail, preserving both the local dynamics and the deep semantics of the emotional representation.
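To make these ideas concrete, below is a minimal PyTorch sketch of the three components named above: a multi-scale convolutional block, a convolutional self-attention (CSA) layer, and an FPN-style backward (top-down) fusion path. All class names, channel widths, kernel sizes, and the four-class output head are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch (assumptions: module names, widths, kernel sizes are ours).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleConv(nn.Module):
    """Parallel convolutions with different kernel sizes, fused by a 1x1 conv."""
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(out_ch * len(kernel_sizes), out_ch, 1)

    def forward(self, x):
        return F.relu(self.fuse(torch.cat([b(x) for b in self.branches], dim=1)))


class ConvSelfAttention(nn.Module):
    """Self-attention whose query/key/value projections are convolutions,
    so the attention weights are computed from local (windowed) context."""
    def __init__(self, ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.q = nn.Conv2d(ch, ch, kernel_size, padding=pad)
        self.k = nn.Conv2d(ch, ch, kernel_size, padding=pad)
        self.v = nn.Conv2d(ch, ch, kernel_size, padding=pad)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # (B, HW, C)
        k = self.k(x).flatten(2)                        # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)        # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                  # residual connection


class MSFPNSketch(nn.Module):
    """Three pyramid levels built bottom-up, then fused top-down (backward fusion)."""
    def __init__(self, n_classes=4, ch=32):
        super().__init__()
        self.level1 = MultiScaleConv(1, ch)             # finest resolution
        self.level2 = MultiScaleConv(ch, ch)
        self.level3 = MultiScaleConv(ch, ch)            # coarsest, most semantic
        self.csa = ConvSelfAttention(ch)
        self.pool = nn.MaxPool2d(2)
        self.classifier = nn.Linear(ch, n_classes)

    def forward(self, spec):                            # spec: (B, 1, mels, frames)
        c1 = self.level1(spec)
        c2 = self.level2(self.pool(c1))
        c3 = self.csa(self.level3(self.pool(c2)))
        # Backward (top-down) fusion: upsample deep semantics onto shallow detail.
        p2 = c2 + F.interpolate(c3, size=c2.shape[-2:], mode="nearest")
        p1 = c1 + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        pooled = F.adaptive_avg_pool2d(p1, 1).flatten(1)
        return self.classifier(pooled)


if __name__ == "__main__":
    logits = MSFPNSketch()(torch.randn(2, 1, 64, 128))  # dummy log-mel batch
    print(logits.shape)                                 # torch.Size([2, 4])
```

The design point the sketch tries to capture is that the deepest, most semantic level is refined by CSA and then propagated back onto the finer levels, so the fused representation retains both local dynamics and deep semantics.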
2. Related Work
2.1. Background of Speech Emotion Recognition
2.2. Multi-Scale Network Model
2.3. Feature Pyramid Network Model
3. Methodology
3.1. Multi-Scale Feature Pyramid Network
3.2. Convolutional Self-Attention
3.3. Global–Local Representation Learning Module
4. Experiments
4.1. Corpora Description
4.2. Implementation Details
4.3. Experiment Results and Discussion
4.4. Ablation Study
5. Conclusions
6. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Korsmeyer, C. Rosalind W. Picard, Affective Computing. Minds Mach. 1999, 9, 443–447. [Google Scholar] [CrossRef]
- Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
- Low, L.-S.A.; Maddage, N.C.; Lech, M.; Sheeber, L.; Allen, N.B. Detection of clinical depression in adolescents’ speech during family interactions. IEEE Trans. Biomed. Eng. 2011, 58, 574–586. [Google Scholar] [CrossRef]
- Yoon, W.-J.; Cho, Y.-H.; Park, K.-S. A study of speech emotion recognition and its application to mobile services. In Ubiquitous Intelligence and Computing; Springer: Berlin/Heidelberg, Germany, 2007; pp. 758–766. [Google Scholar]
- Tawari, A.; Trivedi, M. Speech based emotion classification framework for driver assistance system. In Proceedings of the 2010 IEEE Intelligent Vehicles Symposium, La Jolla, CA, USA, 21–24 June 2010; pp. 174–178. [Google Scholar]
- Ma, H.; Yarosh, S. A review of affective computing research based on function-component-representation framework. IEEE Trans. Affect. Comput. 2021, 14, 1655–1674. [Google Scholar] [CrossRef]
- Deshmukh, S.; Gupta, P.; Mane, P. Investigation of results using various databases and algorithms for music player using speech emotion recognition. In International Conference on Soft Computing and Pattern Recognition; Springer International Publishing: Cham, Switzerland, 2021; pp. 205–215. [Google Scholar]
- Basu, S.; Chakraborty, J.; Aftabuddin, M. Emotion recognition from speech using convolutional neural network with recurrent neural network architecture. In Proceedings of the 2017 2nd International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 19–20 October 2017; pp. 333–336. [Google Scholar]
- Etienne, C.; Fidanza, G.; Petrovskii, A.; Devillers, L.; Schmauch, B. CNN+LSTM architecture for speech emotion recognition with data augmentation. arXiv 2018, arXiv:1802.05630. [Google Scholar]
- Li, R.; Wu, Z.; Jia, J.; Zhao, S.; Meng, H. Dilated residual network with multi-head self-attention for speech emotion recognition. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6675–6679. [Google Scholar]
- Peng, Z.; Lu, Y.; Pan, S.; Liu, Y. Efficient speech emotion recognition using multi-scale CNN and attention. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3020–3024. [Google Scholar]
- Liu, J.; Liu, Z.; Wang, L.; Guo, L.; Dang, J. Speech emotion recognition with local-global aware deep representation learning. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7174–7178. [Google Scholar]
- Zhu, W.; Li, X. Speech emotion recognition with global-aware fusion on multi-scale feature representation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6437–6441. [Google Scholar]
- Xu, M.; Zhang, F.; Cui, X.; Zhang, W. Speech emotion recognition with multiscale area attention and data augmentation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6319–6323. [Google Scholar]
- Chen, M.; Zhao, X. A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 374–378. [Google Scholar]
- Yang, B.; Wang, L.; Wong, D.F.; Chao, L.S.; Tu, Z. Convolutional self-attention networks. arXiv 2019, arXiv:1904.03107. [Google Scholar]
- Lee, J.-H.; Kim, J.-Y.; Kim, H.-G. Emotion Recognition Using EEG Signals and Audiovisual Features with Contrastive Learning. Bioengineering 2024, 11, 997. [Google Scholar] [CrossRef] [PubMed]
- Liu, G.; Hu, P.; Zhong, H.; Yang, Y.; Sun, J.; Ji, Y.; Zou, J.; Zhu, H.; Hu, S. Effects of the Acoustic-Visual Indoor Environment on Relieving Mental Stress Based on Facial Electromyography and Micro-Expression Recognition. Buildings 2024, 14, 3122. [Google Scholar] [CrossRef]
- Das, A.; Sarma, M.S.; Hoque, M.M.; Siddique, N.; Dewan, M.A.A. AVaTER: Fusing Audio, Visual, and Textual Modalities Using Cross-Modal Attention for Emotion Recognition. Sensors 2024, 24, 5862. [Google Scholar] [CrossRef]
- Udahemuka, G.; Djouani, K.; Kurien, A.M. Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review. Appl. Sci. 2024, 14, 8071. [Google Scholar] [CrossRef]
- Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects. Expert Syst. Appl. 2024, 237, 121692. [Google Scholar] [CrossRef]
- Wang, Y.; Li, Y.; Cui, Z. Incomplete multimodality-diffused emotion recognition. Adv. Neural Inf. Process. Syst. 2024, 36, 17117–17128. [Google Scholar]
- Meng, T.; Shou, Y.; Ai, W.; Yin, N.; Li, K. Deep imbalanced learning for multimodal emotion recognition in conversations. IEEE Trans. Artif. Intell. 2024. [Google Scholar]
- Xie, Y.; Liang, R.; Liang, Z.; Zhao, X.; Zeng, W. Speech emotion recognition using multihead attention in both time and feature dimensions. IEICE Trans. Inf. Syst. 2023, 106, 1098–1101. [Google Scholar] [CrossRef]
- Gan, C.; Wang, K.; Zhu, Q.; Xiang, Y.; Jain, D.K.; García, S. Speech emotion recognition via multiple fusion under spatial–temporal parallel network. Neurocomputing 2023, 555, 126623. [Google Scholar] [CrossRef]
- Li, Z.; Xing, X.; Fang, Y.; Zhang, W.; Fan, H.; Xu, X. Multi-scale temporal transformer for speech emotion recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Dublin, Ireland, 20–24 August 2023; pp. 3652–3656. [Google Scholar]
- Yu, L.; Xu, F.; Qu, Y.; Zhou, K. Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion. Appl. Acoust. 2024, 216, 109752. [Google Scholar] [CrossRef]
- Andayani, F.; Theng, L.B.; Tsun, M.T.; Chua, C. Hybrid LSTM-transformer model for emotion recognition from speech audio files. IEEE Access 2022, 10, 36018–36027. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Liu, G.; Gong, K.; Liang, X.; Chen, Z. CP-GAN: Context pyramid generative adversarial network for speech enhancement. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6624–6628. [Google Scholar]
- Luo, S.; Feng, Y.; Liu, Z.J.; Ling, Y.; Dong, S.; Ferry, B. High precision sound event detection based on transfer learning using transposed convolutions and feature pyramid network. In Proceedings of the 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2023; pp. 1–6. [Google Scholar]
- Basbug, A.M.; Sert, M. Acoustic scene classification using spatial pyramid pooling with convolutional neural networks. In Proceedings of the 2019 IEEE 13th International Conference on Semantic Computing (ICSC), Newport Beach, CA, USA, 30 January–1 February 2019; pp. 128–131. [Google Scholar]
- Gupta, S.; Karanath, A.; Mahrifa, K.; Dileep, A.D.; Thenkanidiyoor, V. Segment-level probabilistic sequence kernel and segment-level pyramid match kernel based extreme learning machine for classification of varying length patterns of speech. Int. J. Speech Technol. 2019, 22, 231–249. [Google Scholar] [CrossRef]
- Ren, Y.; Peng, H.; Li, L.; Xue, X.; Lan, Y.; Yang, Y. A voice spoofing detection framework for IoT systems with feature pyramid and online knowledge distillation. J. Syst. Archit. 2023, 143, 102981. [Google Scholar] [CrossRef]
- Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
- Prechelt, L. Early stopping—But when? In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 1998; pp. 55–69. [Google Scholar]
- Li, Y.; Zhao, T.; Kawahara, T. Improved end-to-end speech emotion recognition using self-attention mechanism and multitask learning. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2803–2807. [Google Scholar]
- Chakhtouna, A.; Sekkate, S.; Abdellah, A. Unveiling embedded features in Wav2vec2 and HuBERT models for Speech Emotion Recognition. Procedia Comput. Sci. 2024, 232, 2560–2569. [Google Scholar] [CrossRef]
- Ullah, R.; Asif, M.; Shah, W.A.; Anjam, F.; Ullah, I.; Khurshaid, T.; Wuttisittikulkij, L.; Shah, S.; Ali, S.M.; Alibakhshikenari, M. Speech emotion recognition using convolution neural networks and multi-head convolutional transformer. Sensors 2023, 23, 6212. [Google Scholar] [CrossRef] [PubMed]
- Manelis, A.; Miceli, R.; Satz, S.; Suss, S.J.; Hu, H.; Versace, A. The Development of Ambiguity Processing Is Explained by an Inverted U-Shaped Curve. Behav. Sci. 2024, 14, 826. [Google Scholar] [CrossRef] [PubMed]
- Arslan, E.E.; Akşahin, M.F.; Yilmaz, M.; Ilgın, H.E. Towards Emotionally Intelligent Virtual Environments: Classifying Emotions Through a Biosignal-Based Approach. Appl. Sci. 2024, 14, 8769. [Google Scholar] [CrossRef]
- Sun, L.; Yang, H.; Li, B. Multimodal Dataset Construction and Validation for Driving-Related Anger: A Wearable Physiological Conduction and Vehicle Driving Data Approach. Electronics 2024, 13, 3904. [Google Scholar] [CrossRef]
| Model | UA | WA |
|---|---|---|
| GLAM [13] | 69.70% | 68.75% |
| E2ESA [39] | 70.86% | 69.25% |
| Xie [24] | 70.0% | 68.8% |
| DRN [10] | 71.59% | 70.23% |
| MSFPN | 73.39% | 71.79% |
| Method | UA | WA |
|---|---|---|
| w/o CSA | 66.89% | 65.35% |
| w/o MSCNN | 70.95% | 69.05% |
| w/o forward fusion | 71.85% | 70.09% |
| w/o backward fusion | 72.72% | 71.26% |
| MSFPN | 73.39% | 71.79% |
| Method | UA |
|---|---|
| w/o CSA | 68.8% |
| w/o MSCNN | 80.6% |
| w/o forward fusion | 80.9% |
| w/o backward fusion | 83.0% |
| MSFPN | 86.5% |
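The UA and WA columns follow the metrics conventionally used in the speech emotion recognition literature: WA (weighted accuracy) is plain accuracy over all test utterances, while UA (unweighted accuracy) is the macro-average of per-class recalls and is therefore insensitive to class imbalance. A small self-contained check of both definitions (the helper name is ours, for illustration):

```python
# WA = overall accuracy; UA = mean of per-class recalls (macro-average recall).
import numpy as np

def wa_ua(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))                      # weighted accuracy
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = float(np.mean(recalls))                               # unweighted accuracy
    return wa, ua

print(wa_ua([0, 0, 0, 1], [0, 0, 1, 1]))                       # (0.75, 0.8333...)
```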
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, Y.; Huang, J.; Zhao, Z.; Lan, H.; Zhang, X. Speech Emotion Recognition Using Multi-Scale Global–Local Representation Learning with Feature Pyramid Network. Appl. Sci. 2024, 14, 11494. https://doi.org/10.3390/app142411494