MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers
Abstract
1. Introduction
1.1. Challenges
- Emotional subtleties: Detecting subtle emotional nuances in speech is a significant challenge. Emotions are often expressed through minor variations in vocal attributes such as tone, pitch, and intensity. These cues are easily overshadowed by more overt emotional expressions or obscured by background noise and other distortions, so distinguishing fine-grained emotional differences requires methods that are sensitive to small acoustic variations. Addressing this challenge is essential for improving the sensitivity and accuracy of emotion recognition systems.
- Noisy background: Speech is often recorded in complex and varied acoustic environments. Such environments feature multiple overlapping sounds, which complicates isolating the target voice from the background noise. This interference not only reduces the clarity of the speech signal but also hinders accurate identification of the speaker’s emotional state. Addressing this challenge is crucial for improving the effectiveness of speech emotion recognition (SER) systems in noisy conditions.
1.2. Observation and Insights
1.3. Contributions
- An efficient MelTrans model is developed to leverage our observations on speech emotion signals. To the best of our knowledge, this is the first work to reveal the critical cues and long-range semantic relationships in voice signals. A Transformer is then used to exploit the relationships among mel-spectrograms.
- A dual-stream model is proposed to exploit crucial cues and long-distance relationships. Specifically, the crucial cue stream extracts the core representations in speech signals, while the relationship stream captures the long-distance relationship information of speech. Together, the two streams make full use of the information in voice signals to form a core cue-aware neural network.
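The dual-stream idea described in the contributions can be illustrated with a toy sketch. This is not the authors' implementation: the energy-based frame selection, the identity attention projections, the mean pooling, and all function names (`crucial_cue_stream`, `relationship_stream`, `dual_stream_features`) are assumptions made purely for illustration. The input is a small list of frames standing in for a mel-spectrogram.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(frames):
    """Average a list of equal-length frames into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def crucial_cue_stream(frames, keep):
    """Assumed cue selection: mask all but the `keep` highest-energy frames."""
    energy = [sum(v * v for v in frame) for frame in frames]
    top = sorted(range(len(frames)), key=lambda i: energy[i], reverse=True)[:keep]
    mask = [i in top for i in range(len(frames))]
    kept = [frame for frame, m in zip(frames, mask) if m]
    return mean_pool(kept), mask

def relationship_stream(frames):
    """Single-head self-attention with identity projections (Q = K = V = frames)."""
    scale = math.sqrt(len(frames[0]))
    attended = []
    for query in frames:
        # Similarity of this frame to every frame, including distant ones.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        attended.append([sum(w * frame[d] for w, frame in zip(weights, frames))
                         for d in range(len(query))])
    return mean_pool(attended)

def dual_stream_features(frames, keep=2):
    """Concatenate the pooled outputs of the two streams."""
    cue_vec, mask = crucial_cue_stream(frames, keep)
    rel_vec = relationship_stream(frames)
    return cue_vec + rel_vec, mask

# Toy "mel-spectrogram": 4 frames x 3 mel bins (made-up values).
toy_mel = [
    [0.1, 0.0, 0.2],
    [0.9, 1.1, 0.8],
    [0.2, 0.1, 0.0],
    [1.0, 0.7, 1.2],
]
features, mask = dual_stream_features(toy_mel, keep=2)
print(mask)           # [False, True, False, True]
print(len(features))  # 6
```

In this sketch the fused vector would feed a classifier. A trained model would learn the attention projections and the cue mask rather than fixing them by hand.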
2. Related Work
2.1. Speech Emotion Recognition
2.2. Attention Mechanism
2.3. Transformer-Based SER
2.4. Summary
3. Proposed Method
3.1. Crucial Cue Stream
3.2. Relationship Stream
- (1) Word encoder
- (2) Object encoder
- (3) Jointing block
3.3. Loss Function
4. Experimental Results
4.1. General Setting
4.2. Implementation Details
4.3. Experimental Results and Analysis
- (1) Performance comparison on the IEMOCAP dataset
- (2) Performance comparison on the EmoDB dataset
4.4. Analysis of the Dual-Stream Design and Discussion
5. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Seinfeld, S.; Feuchtner, T.; Maselli, A.; Müller, J. User Representations in Human-Computer Interaction. Hum.-Comput. Interact. 2020, 36, 400–438.
- Agarla, M.; Bianco, S.; Celona, L.; Napoletano, P.; Petrovsky, A.; Piccoli, F.; Schettini, R.; Shanin, I. Semi-supervised cross-lingual speech emotion recognition. Expert Syst. Appl. 2024, 237, 121368.
- Gao, R.; Grauman, K. Visualvoice: Audio-visual speech separation with cross-modal consistency. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15490–15500.
- Rong, J.; Li, G.; Chen, Y.-P.P. Acoustic feature selection for automatic emotion recognition from speech. Inf. Process. Manag. 2009, 45, 315–328.
- Wu, C.-H.; Liang, W.-B. Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels. IEEE Trans. Affect. Comput. 2010, 2, 10–21.
- Tawari, A.; Trivedi, M.M. Speech emotion analysis: Exploring the role of context. IEEE Trans. Multimed. 2010, 12, 502–509.
- Hozjan, V.; Kačič, Z. Context-independent multilingual emotion recognition from speech signals. Int. J. Speech Technol. 2003, 6, 311–320.
- Doulamis, N. An adaptable emotionally rich pervasive computing system. In Proceedings of the 2006 14th European Signal Processing Conference, Florence, Italy, 4–8 September 2006; pp. 1–5.
- Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; Jesus, L.; Berriel, R.; Paixão, T.M.; Mutz, F.; et al. Self-driving cars: A survey. Expert Syst. Appl. 2021, 165, 113816.
- Bekmanova, G.; Yergesh, B.; Sharipbay, A.; Mukanova, A. Emotional Speech Recognition Method Based on Word Transcription. Sensors 2022, 22, 1937.
- Mamyrbayev, O.Z.; Oralbekova, D.O.; Alimhan, K.; Nuranbayeva, B.M. Hybrid end-to-end model for Kazakh speech recognition. Int. J. Speech Technol. 2023, 26, 261–270.
- Zhao, Z.; Bao, Z.; Zhang, Z.; Deng, J.; Cummins, N.; Wang, H.; Tao, J.; Schuller, B. Automatic Assessment of Depression From Speech via a Hierarchical Attention Transfer Network and Attention Autoencoders. IEEE J. Sel. Top. Signal Process. 2020, 14, 423–434.
- Abibullaev, B.; Keutayeva, A.; Zollanvari, A. Deep learning in EEG-based BCIs: A comprehensive review of transformer models, advantages, challenges, and applications. IEEE Access 2023, 11, 127271–127301.
- Liu, J.; Wang, H. Graph Isomorphism Network for Speech Emotion Recognition. In Proceedings of the Interspeech, Brno, Czechia, 30 August–3 September 2021; pp. 3405–3409.
- Liu, Y.; Sun, H.; Guan, W.; Xia, Y.; Li, Y.; Unoki, M.; Zhao, Z. A Discriminative Feature Representation Method Based on Cascaded Attention Network with Adversarial Strategy for Speech Emotion Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1063–1074.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Lei, Y.; Yang, S.; Wang, X.; Xie, L. MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 853–864.
- Makiuchi, M.R.; Uto, K.; Shinoda, K. Multimodal emotion recognition with high-level speech and text features. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 350–357.
- Li, Y.; Zhang, S.; Wang, Z.; Yang, S.; Yang, W.; Xia, S.-T.; Zhou, E. Tokenpose: Learning keypoint tokens for human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11313–11322.
- Chen, Z.; Li, J.; Liu, H.; Wang, X.; Wang, H.; Zheng, Q. Learning multi-scale features for speech emotion recognition with connection attention mechanism. Expert Syst. Appl. 2023, 214, 118943.
- Feng, K.; Chaspari, T. Few-shot learning in emotion recognition of spontaneous speech using a siamese neural network with adaptive sample pair formation. IEEE Trans. Affect. Comput. 2021, 14, 1627–1633.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- Bekmanova, G.; Yelibayeva, G.; Yergesh, B.; Orynbay, L.; Sairanbekova, A.; Kaderkeyeva, Z. Emotional Coloring of Kazakh People’s Names in the Semantic Knowledge Database of “Fascinating Onomastics” Mobile Application. In Proceedings of the 2022 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Niagara Falls, ON, Canada, 17–20 November 2022; pp. 666–671.
- Jothimani, S.; Premalatha, K. MFF-SAug: Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network. Chaos Solitons Fractals 2022, 162, 112512.
- Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
- Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Interspeech, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520.
- Gideon, J.; McInnis, M.G.; Provost, E.M. Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG). IEEE Trans. Affect. Comput. 2021, 12, 1055–1068.
- Khurana, Y.; Gupta, S.; Sathyaraj, R.; Raja, S.P. RobinNet: A Multimodal Speech Emotion Recognition System With Speaker Recognition for Social Interactions. IEEE Trans. Comput. Soc. Syst. 2024, 11, 478–487.
- Zhu, X.; Lei, Y.; Li, T.; Zhang, Y.; Zhou, H.; Lu, H.; Xie, L. METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1506–1518.
- Dong, G.-N.; Pun, C.-M.; Zhang, Z. Temporal Relation Inference Network for Multimodal Speech Emotion Recognition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6472–6485.
- Zheng, Q.; Chen, Z.; Liu, H.; Lu, Y.; Li, J.; Liu, T. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios. Expert Syst. Appl. 2023, 217, 119511.
- Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech Emotion Recognition with Co-Attention Based Multi-Level Acoustic Information. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7367–7371.
- Chen, W.; Xing, X.; Xu, X.; Pang, J.; Du, L. SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 775–788.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2015, arXiv:1409.0473.
- Kwon, S. MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst. Appl. 2021, 167, 114177.
- Thanh, P.V.; Huyen, N.T.T.; Quan, P.N.; Trang, N.T.T. A Robust Pitch-Fusion Model for Speech Emotion Recognition in Tonal Languages. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12386–12390.
- Liu, Z.; Kang, X.; Ren, F. Dual-TBNet: Improving the Robustness of Speech Features via Dual-Transformer-BiLSTM for Speech Emotion Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2193–2203.
- Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; Schuller, B.W. Dawn of the transformer era in speech emotion recognition: Closing the valence gap. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10745–10759.
- Liu, K.; Wang, D.; Wu, D.; Liu, Y.; Feng, J. Speech Emotion Recognition via Multi-Level Attention Network. IEEE Signal Process. Lett. 2022, 29, 2278–2282.
- Mao, K.; Wang, Y.; Ren, L.; Zhang, J.; Qiu, J.; Dai, G. Multi-branch feature learning based speech emotion recognition using SCAR-NET. Connect. Sci. 2023, 35, 2189217.
- Shen, S.; Gao, Y.; Liu, F.; Wang, H.; Zhou, A. Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition. In Proceedings of the ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10111–10115.
- Ma, H.; Wang, J.; Lin, H.; Zhang, B.; Zhang, Y.; Xu, B. A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations. IEEE Trans. Multimed. 2024, 26, 776–788.
- Jiang, P.; Xu, X.; Tao, H.; Zhao, L.; Zou, C. Convolutional-Recurrent Neural Networks With Multiple Attention Mechanisms for Speech Emotion Recognition. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 1564–1573.
- Soltani, R.; Benmohamed, E.; Ltifi, H. Newman-Watts-Strogatz topology in deep echo state networks for speech emotion recognition. Eng. Appl. Artif. Intell. 2024, 133, 108293.

Performance comparison on the IEMOCAP dataset.

Methods | Backbone | WA (%) | UA (%) | Acc (%) |
---|---|---|---|---|
FENT [43] | CNN | 71.84 | 73.88 | 72.86 |
MLT-DNet [37] | CNN | 73.22 | 72.88 | 73.00 |
Zheng et al. [33] | ResNet | 71.64 | 72.70 | 72.17 |
AMSNet | ResNet | 69.22 | 70.51 | 69.87 |
ISNet | ResNet | 70.43 | 65.02 | 67.73 |
SpeechFormer++ | Transformer | 70.50 | 71.5 | 71.00 |
SDT [44] | Transformer | 73.82 | 74.08 | 73.95 |
ICAnet | Transformer | 82.68 | 82.67 | 82.68 |
MelTrans (Ours) | Transformer | 76.50 | 76.54 | 76.52 |

Performance comparison on the EmoDB dataset.

Methods | Backbone | WA (%) | UA (%) | Acc (%) |
---|---|---|---|---|
MLT-DNet | CNN | 90.90 | 89.10 | 90.00 |
Jiang et al. [45] | CNN | 87.9 | 86.7 | 87.30 |
ICAnet | CAN | 91.58 | 88.76 | 90.17 |
AMSNet | ResNet | 88.34 | 88.56 | 88.45 |
SCAR-NET | Transformer | - | - | 96.45 |
DeepESN [46] | Transformer | 87.89 | 87.14 | 87.51 |
HuBERT [2] | Transformer | - | - | 89.00 |
MelTrans (Ours) | Transformer | 92.47 | 92.50 | 92.52 |

Analysis of the dual-stream design on the EmoDB and IEMOCAP datasets.

Dataset | Model Variants | SpeechFormer | Mask | Recall | F1 | Accuracy |
---|---|---|---|---|---|---|
EmoDB dataset | ST stream | × | × | 0.864 | 0.844 | 0.854 |
EmoDB dataset | S-former stream | √ | × | 0.900 | 0.899 | 0.899 |
EmoDB dataset | ST+crucial cue stream | × | √ | 0.910 | 0.905 | 0.906 |
EmoDB dataset | MelTrans | √ | √ | 0.927 | 0.926 | 0.925 |
IEMOCAP dataset | ST stream | × | × | 0.741 | 0.729 | 0.717 |
IEMOCAP dataset | S-former stream | √ | × | 0.758 | 0.740 | 0.732 |
IEMOCAP dataset | Transformer-mask | × | √ | 0.754 | 0.755 | 0.743 |
IEMOCAP dataset | MelTrans | √ | √ | 0.776 | 0.775 | 0.766 |

Per-emotion results on the IEMOCAP dataset.

Method | Angry (%) | Neutral (%) | Happy (%) | Sad (%) |
---|---|---|---|---|
S-former | 82.95 | 74.55 | 52.28 | 93.60 |
MelTrans | 78.15 | 79.00 | 65.32 | 88.54 |

Per-emotion results on the EmoDB dataset.

Method | Bored (%) | Disgusted (%) | Neutral (%) | Hateful (%) | Afraid (%) | Happy (%) | Sad (%) |
---|---|---|---|---|---|---|---|
S-former | 100.00 | 100.00 | 74.02 | 95.65 | 97.10 | 95.77 | 100.00 |
MelTrans | 100.00 | 100.00 | 100.00 | 97.78 | 95.65 | 45.40 | 100.00 |
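
The WA and UA columns in the comparison tables above are not defined in this excerpt. A minimal sketch follows, assuming the convention common in SER work: weighted accuracy (WA) is the overall fraction of correctly classified utterances, and unweighted accuracy (UA) is the mean of the per-class recalls. The function names and toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: every utterance counts equally, so large classes dominate."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """Fraction of each class's utterances that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class counts equally, whatever its size."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Toy, imbalanced label set (made-up data).
y_true = ["sad"] * 6 + ["happy"] * 2 + ["angry"] * 2
y_pred = ["sad"] * 6 + ["happy", "sad"] + ["angry", "sad"]

print(f"WA = {weighted_accuracy(y_true, y_pred):.3f}")    # WA = 0.800
print(f"UA = {unweighted_accuracy(y_true, y_pred):.3f}")  # UA = 0.667
```

On imbalanced data the two metrics diverge, as in the toy example, which is why SER papers often report both.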
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, H.; Li, J.; Liu, H.; Liu, T.; Chen, Q.; You, X. MelTrans: Mel-Spectrogram Relationship-Learning for Speech Emotion Recognition via Transformers. Sensors 2024, 24, 5506. https://doi.org/10.3390/s24175506