Speech Emotion Recognition Using RA-Gmlp Model on Time–Frequency Domain Features Extracted by TFCM
Abstract
1. Introduction
- (1) We propose a residual attention-based gated multilayer perceptron (RA-GMLP) structure, in which the GMLP gating determines which information is retained and the attention mechanism further weights the retained emotion-relevant information. This structure adds the attention that the plain GMLP mechanism lacks, improving the accuracy of SER.
- (2) We propose TFCM, a new method for fusing MFCC features across the time and frequency domains. The experimental results show that this fusion method outperforms traditional feature extraction methods.
- (3) To enlarge the receptive field of neurons and increase the diversity of speech emotion features, hybrid dilated convolution (HDC) is introduced to the SER task for the first time. The experimental results show that it improves speech emotion classification performance.
- (4) We propose a time–frequency SER model based on the RA-GMLP module, which achieves higher classification accuracy than the compared models (75.31% WA and 75.09% UA in the model comparison table).
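The gating-plus-attention idea in contribution (1) can be illustrated with a small NumPy sketch. This is only one plausible arrangement under stated assumptions, not the authors' exact block: the spatial gating unit follows the gMLP formulation of Liu et al. (split the channels, project one half along the time axis, multiply), while the single-head attention, the way it is added back to the gated features, the omission of normalization and activation layers, and all names (`ra_gmlp_block`, `w_in`, `w_seq`, and so on) are hypothetical. The input is read as 57 frames of 26 coefficients, which is an interpretation of the (26, 57) feature input listed in the configuration table.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_gating_unit(z, w_seq, b_seq):
    """gMLP-style gate: split channels, project one half along time, multiply."""
    z1, z2 = np.split(z, 2, axis=-1)       # each half has shape (T, h/2)
    v = w_seq @ z2 + b_seq[:, None]        # (T, T) @ (T, h/2) -> (T, h/2)
    return z1 * v                          # element-wise gate decides what is kept

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over the time axis."""
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def ra_gmlp_block(x, p):
    """Hypothetical block: gate, re-weight with attention, add residuals."""
    gated = spatial_gating_unit(x @ p["w_in"], p["w_seq"], p["b_seq"])
    attended = self_attention(gated, p["wq"], p["wk"], p["wv"])
    # Assumption: attention output is added to the gated features,
    # then projected back and added to the block input.
    return x + (gated + attended) @ p["w_out"]

rng = np.random.default_rng(0)
T, d, h = 57, 26, 32                       # h (hidden width) is an arbitrary choice
params = {
    "w_in": rng.normal(scale=0.1, size=(d, h)),
    "w_seq": np.zeros((T, T)),             # zero weights and unit bias: gate starts as identity
    "b_seq": np.ones(T),
    "wq": rng.normal(scale=0.1, size=(h // 2, h // 2)),
    "wk": rng.normal(scale=0.1, size=(h // 2, h // 2)),
    "wv": rng.normal(scale=0.1, size=(h // 2, h // 2)),
    "w_out": rng.normal(scale=0.1, size=(h // 2, d)),
}
x = rng.normal(size=(T, d))
out = ra_gmlp_block(x, params)
print(out.shape)  # (57, 26)
```

With `w_seq` set to zeros and `b_seq` to ones, the gate initially passes the first channel half through unchanged, which is the initialization suggested in the gMLP paper; training would then learn which time steps to mix.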
2. Related Work
2.1. Traditional Features
2.2. Popular SER Neural Network
2.3. Hybrid Dilated Convolution
3. Architecture Design
3.1. Feature Extraction Block
3.2. Dilated Convolution
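As stated in the contributions, hybrid dilated convolution is used to enlarge the receptive field. A short sketch of the underlying arithmetic, assuming stride-1 layers with an odd kernel size: each layer with kernel size k and dilation rate r adds (k - 1) * r to the receptive field, and mixing rates avoids the "gridding" gaps that appear when one rate is repeated. The rates (1, 2, 5) and (2, 2, 2) below are illustrative values only, not the configuration used in this paper, and both function names are hypothetical.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field (in input samples) of stacked stride-1 dilated conv layers."""
    rf = 1
    for rate in dilation_rates:
        rf += (kernel_size - 1) * rate
    return rf

def covered_offsets(kernel_size, dilation_rates):
    """Input offsets that actually reach one output unit (odd kernel sizes only)."""
    half = kernel_size // 2
    offsets = {0}
    for rate in dilation_rates:
        taps = [rate * t for t in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets

for rates in [(1, 2, 5), (2, 2, 2)]:
    rf = receptive_field(3, rates)
    used = len(covered_offsets(3, rates))
    print(rates, "receptive field:", rf, "inputs actually used:", used)
# (1, 2, 5) receptive field: 17 inputs actually used: 17
# (2, 2, 2) receptive field: 13 inputs actually used: 7
```

The mixed rates cover every position inside the receptive field, whereas the repeated rate skips almost half of them.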
3.3. Residual Attention Perceptron Block
4. Dataset Analysis and Experimental Configuration
4.1. Analysis of the Dataset
4.2. Configuration of Model Parameters
5. Experimental Results and Discussion
5.1. Comparison with Current Advanced Models
5.2. Impact of Dilation Rates in Hybrid Dilated Convolution
5.3. Impact of the Number of Attention Heads
5.4. Comparison of the GMLP and RA-GMLP Mechanisms
5.5. Ablation Studies
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- [1] Schelinski, S.; Von Kriegstein, K. The relation between vocal pitch and vocal emotion recognition abilities in people with autism spectrum disorder and typical development. J. Autism Dev. Disord. 2019, 49, 68–82.
- [2] Paris, M.; Mahajan, Y.; Kim, J.; Meade, T. Emotional speech processing deficits in bipolar disorder: The role of mismatch negativity and P3a. J. Affect. Disord. 2018, 234, 261–269.
- [3] Hsieh, Y.H.; Chen, S.C. A decision support system for service recovery in affective computing: An experimental investigation. Knowl. Inf. Syst. 2020, 62, 2225–2256.
- [4] Lampropoulos, A.S.; Tsihrintzis, G.A. Evaluation of MPEG-7 descriptors for speech emotional recognition. In Proceedings of the 2012 Eighth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Piraeus-Athens, Greece, 18–20 July 2012; pp. 98–101.
- [5] Virvou, M.; Tsihrintzis, G.A.; Alepis, E.; Stathopoulou, I.O.; Kabassi, K. Emotion recognition: Empirical studies towards the combination of audio-lingual and visual-facial modalities through multi-attribute decision making. Int. J. Artif. Intell. Tools 2012, 21, 1240001.
- [6] Makiuchi, M.R.; Uto, K.; Shinoda, K. Multimodal emotion recognition with high-level speech and text features. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 350–357.
- [7] Zhang, X.; Wang, M.J.; Guo, X.D. Multi-modal emotion recognition based on deep learning in speech, video and text. In Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 23–25 October 2020; pp. 328–333.
- [8] Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech emotion recognition with co-attention based multi-level acoustic information. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7367–7371.
- [9] Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231.
- [10] Tao, F.; Liu, G. Advanced LSTM: A study about better time dependency modeling in emotion recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2906–2910.
- [11] Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444.
- [12] Arjun, A.; Rajpoot, A.S.; Panicker, M.R. Introducing attention mechanism for EEG signals: Emotion recognition with vision transformers. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Mexico City, Mexico, 1–5 November 2021; pp. 5723–5726.
- [13] Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An all-MLP architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272.
- [14] Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- [15] Ding, X.; Xia, C.; Zhang, X.; Chu, X.; Han, J.; Ding, G. RepMLP: Re-parameterizing convolutions into fully-connected layers for image recognition. arXiv 2021, arXiv:2105.01883.
- [16] Qiu, Z.; Jiao, Q.; Wang, Y.; Chen, C.; Zhu, D.; Cui, X. rzMLP-DTA: GMLP network with ReZero for sequence-based drug-target affinity prediction. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 9–12 December 2021; pp. 308–313.
- [17] Yan, J.; Ando, K.; Yu, J.; Motomura, M. TT-MLP: Tensor Train Decomposition on Deep MLPs. IEEE Access 2023, 11, 10398–10411.
- [18] Zhu, W.; Li, X. Speech emotion recognition with global-aware fusion on multi-scale feature representation. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6437–6441.
- [19] Pepino, L.; Riera, P.; Ferrer, L.; Gravano, A. Fusion approaches for emotion recognition from speech using acoustic and text-based features. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6484–6488.
- [20] Laukka, P.; Neiberg, D.; Forsell, M.; Karlsson, I.; Elenius, K. Expression of affect in spontaneous speech: Acoustic correlates and automatic detection of irritation and resignation. Comput. Speech Lang. 2011, 25, 84–104.
- [21] Bou-Ghazale, S.E.; Hansen, J.H. A comparative study of traditional and newly proposed features for recognition of speech under stress. IEEE Trans. Speech Audio Process. 2000, 8, 429–442.
- [22] Han, Z.; Wang, J. Speech emotion recognition based on Gaussian kernel nonlinear proximal support vector machine. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 2513–2516.
- [23] Hsiao, P.W.; Chen, C.P. Effective attention mechanism in dynamic models for speech emotion recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2526–2530.
- [24] Yuan, Z.; Li, S.; Zhang, W.; Du, R.; Sun, X.; Wang, H. Speech Emotion Recognition Based on Secondary Feature Reconstruction. In Proceedings of the 2021 6th International Conference on Computational Intelligence and Applications (ICCIA), Xiamen, China, 11–13 June 2021; pp. 149–154.
- [25] Liu, Z.T.; Han, M.T.; Wu, B.H.; Rehman, A. Speech emotion recognition based on convolutional neural network with attention-based bidirectional long short-term memory network and multi-task learning. Appl. Acoust. 2023, 202, 109178.
- [26] Wang, J.; Xue, M.; Culhane, R.; Diao, E.; Ding, J.; Tarokh, V. Speech emotion recognition with dual-sequence LSTM architecture. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6474–6478.
- [27] Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323.
- [28] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- [29] Lian, Z.; Li, Y.; Tao, J.; Huang, J. Improving speech emotion recognition via transformer-based predictive coding through transfer learning. arXiv 2018, arXiv:1811.07691.
- [30] Chen, W.; Xing, X.; Xu, X.; Yang, J.; Pang, J. Key-sparse transformer for multimodal speech emotion recognition. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6897–6901.
- [31] Liu, H.; Dai, Z.; So, D.; Le, Q.V. Pay attention to MLPs. Adv. Neural Inf. Process. Syst. 2021, 34, 9204–9215.
- [32] Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
- [33] Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
- [34] Lin, G.; Wu, Q.; Qiu, L.; Huang, X. Image super-resolution using a dilated convolutional neural network. Neurocomputing 2018, 275, 1219–1230.
- [35] Noh, K.J.; Jeong, C.Y.; Lim, J.; Chung, S.; Kim, G.; Lim, J.M.; Jeong, H. Multi-path and group-loss-based network for speech emotion recognition in multi-domain datasets. Sensors 2021, 21, 1579.
- [36] Wu, X.; Liu, S.; Cao, Y.; Li, X.; Yu, J.; Dai, D.; Ma, X.; Hu, S.; Wu, Z.; Liu, X.; et al. Speech emotion recognition using capsule networks. In Proceedings of the ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6695–6699.
Parameter | Baseline (W/O TF) | Baseline (W/TF) | HDC (W/O TF) | HDC (W/TF) | HDC+RA-MLP (W/O TF) | HDC+RA-MLP (W/TF) |
---|---|---|---|---|---|---|
Segment length | 2 s | 2 s | 2 s | 2 s | 2 s | 2 s |
Overlap | 1.6 s | 1.6 s | 1.6 s | 1.6 s | 1.6 s | 1.6 s |
Batch size | 32 | 32 | 32 | 32 | 32 | 32 |
Epochs | 50 | 50 | 50 | 50 | 50 | 50 |
Learning rate | | | | | | |
Feature input | (26, 57) | (78, 57) | (26, 57) | (78, 57) | (26, 57) | (78, 57) |
Optimizer | Adam | Adam | Adam | Adam | Adam | Adam |
Feature output | (32, 4) | (32, 4) | (32, 4) | (32, 4) | (32, 4) | (32, 4) |
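The segmentation settings in the table (2 s segments with 1.6 s overlap) imply a hop of 0.4 s between segment starts. A minimal sketch of that windowing follows. The 16 kHz sample rate, the zero-padding of utterances shorter than one segment, the dropping of trailing samples that do not fill a segment, and the function name `segment_signal` are all assumptions made for illustration and are not taken from the source.

```python
import numpy as np

def segment_signal(signal, sample_rate, segment_s=2.0, overlap_s=1.6):
    """Cut a 1-D waveform into fixed-length, overlapping segments."""
    seg_len = int(round(segment_s * sample_rate))
    # round() guards against floating-point error in (2.0 - 1.6) * sample_rate
    hop = int(round((segment_s - overlap_s) * sample_rate))
    if len(signal) < seg_len:
        # Assumption: short utterances are zero-padded to one full segment
        signal = np.pad(signal, (0, seg_len - len(signal)))
    starts = range(0, len(signal) - seg_len + 1, hop)
    return np.stack([signal[s:s + seg_len] for s in starts])

sample_rate = 16000                          # assumed sample rate
signal = np.arange(5 * sample_rate, dtype=np.float32)   # stand-in for a 5 s utterance
segments = segment_signal(signal, sample_rate)
print(segments.shape)  # (8, 32000)
```

A 5 s utterance therefore yields 8 segments, since floor((5 - 2) / 0.4) + 1 = 8; each segment would then be passed to feature extraction separately.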
Model | WA (%) | UA (%) | Time/Epoch |
---|---|---|---|
SPSL-MFCC - Mel-spec [35] | 60.00 | 58.00 | 1.2 s |
SEQCAP-Spectrogram [36] | 72.73 | 59.71 | 2.7 s |
APCNN-MFCC [18] | 69.00 | 67.00 | 1.5 s |
MHCNN-MFCC [18] | 69.80 | 70.09 | 3.5 s |
AACNN-MFCC [18] | 70.94 | 71.04 | 1.3 s |
Ours | 75.31 | 75.09 | 1.5 s |
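The two accuracy columns can be computed as follows, assuming the definitions commonly used in SER: weighted accuracy (WA) is the overall fraction of correctly classified samples, and unweighted accuracy (UA) is the mean of the per-class recalls, so every emotion counts equally regardless of its sample size. The toy labels below are invented purely to show how the two metrics diverge on imbalanced data; they are not results from this paper.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all samples classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    """UA: recall computed per class, then averaged with equal weight."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy example: six samples of class 0, two of class 1, one class-1 error
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(wa, ua)  # 0.875 0.75
```

Here a single error on the minority class costs 12.5 points of WA but 25 points of UA, which is why both metrics are usually reported together.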
Emotion Type | Sample Size (Total) | Sample Size (Male) | Sample Size (Female) |
---|---|---|---|
neu | 929 | 466 | 463 |
hap | 855 | 430 | 425 |
sad | 653 | 251 | 402 |
ang | 1071 | 531 | 540 |
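The class distribution in the table can be checked and summarized with a few lines of arithmetic. The sketch below verifies that the male and female counts add up to the listed totals and computes each emotion's share of the data. As a reference point, it also reports the WA and UA that a trivial classifier always predicting the largest class would obtain; that baseline is added here for illustration and is not an experiment from the source.

```python
# (male, female) sample counts per emotion, copied from the table
counts = {"neu": (466, 463), "hap": (430, 425), "sad": (251, 402), "ang": (531, 540)}

totals = {emotion: male + female for emotion, (male, female) in counts.items()}
n_samples = sum(totals.values())
shares = {emotion: total / n_samples for emotion, total in totals.items()}

# Trivial baseline: always predict the most frequent class
majority_wa = max(totals.values()) / n_samples   # share of the largest class
majority_ua = 1 / len(totals)                    # recall is 1 for one class, 0 for the rest

print(totals)       # {'neu': 929, 'hap': 855, 'sad': 653, 'ang': 1071}
print(n_samples)    # 3508
for emotion, share in shares.items():
    print(f"{emotion}: {share:.1%}")
print(f"majority baseline: WA {majority_wa:.2%}, UA {majority_ua:.2%}")
```

The classes are moderately imbalanced (the smallest class, sad, has about 61% as many samples as the largest, ang), so WA and UA can differ and both are informative.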
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Sha, M.; Yang, W.; Wei, F.; Lu, Z.; Chen, M.; Ma, C.; Zhang, L.; Shi, H. Speech Emotion Recognition Using RA-Gmlp Model on Time–Frequency Domain Features Extracted by TFCM. Electronics 2024, 13, 588. https://doi.org/10.3390/electronics13030588