A New Network Structure for Speech Emotion Recognition Research
Abstract
1. Introduction
- We propose a new SER model that combines a double Bi-GRU with multi-head attention to learn speech representations from spectrograms extracted from the speech data.
- During training, CNN layers are first applied to learn local associations from the spectrograms, and the two Bi-GRU layers are used to capture long-term correlations and contextual feature information. Multi-head attention is then used to focus on the features most related to emotion, and a final softmax layer outputs the emotion classes, improving overall performance. Experimental verification shows that the proposed model achieves better classification performance: its unweighted accuracies on the speaker-independent IEMOCAP and Emo-DB emotion datasets reach 75.04% and 88.93%, respectively. Compared with the methods proposed in references [38,39], our model achieves better classification performance in speech emotion recognition. Moreover, the gated recurrent unit (GRU) is more efficient to train than long short-term memory [40].
- We also conducted training on different tasks and datasets to verify the generalization performance and stability of the model. Finally, we analyzed and summarized the experimental results and suggested possible directions for future research.
2. Materials and Methods
2.1. Spectrogram Extraction
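As an illustration of this step, a minimal spectrogram extraction sketch using librosa follows. One common choice is a log-mel spectrogram; the sample rate, FFT size, hop length, and mel-band count below are assumed values, the file name is a placeholder, and the paper's exact settings may differ.

```python
import librosa
import numpy as np

# Assumed parameters; the paper's exact settings may differ. "speech.wav" is a placeholder file.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)   # log-scaled mel spectrogram, shape (128, frames)
```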
2.2. The Bi-GRU Layer
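A minimal PyTorch sketch of a stacked bidirectional GRU over frame-level features is shown below; the feature and hidden dimensions are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 64 frame-level features, 128 GRU hidden units, two stacked layers.
bigru = nn.GRU(input_size=64, hidden_size=128, num_layers=2,
               batch_first=True, bidirectional=True)

frames = torch.randn(4, 200, 64)   # (batch, time steps, features)
out, h_n = bigru(frames)           # out: (4, 200, 256) = concatenated forward/backward states
```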
2.3. The Multi-Head Attention
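A minimal sketch of applying multi-head self-attention to Bi-GRU outputs with torch.nn.MultiheadAttention follows; the 256-dimensional features are an assumption, and the 16 heads mirror the IEMOCAP head-number setting reported in the experiments below.

```python
import torch
import torch.nn as nn

# Hypothetical 256-dim sequence features split across 16 attention heads.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=16, dropout=0.2, batch_first=True)

x = torch.randn(4, 200, 256)       # (batch, time steps, feature dim)
attended, weights = mha(x, x, x)   # self-attention: query = key = value = x
```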
Algorithm 1. The pseudocode of the model.
Input: Time-frequency characteristics
Output: Emotion categories and probabilities Y, P
// Bi-GRU algorithm (forward and backward)
// Multi-head attention algorithm
// Linear layer and output layer
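Algorithm 1 only names the processing stages; the following is a minimal PyTorch sketch of how they might be composed (CNN, two Bi-GRU layers, multi-head attention, and a linear/softmax output). The layer sizes, convolution configuration, temporal mean pooling, and dropout placement are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SpectrogramSER(nn.Module):
    """Sketch of Algorithm 1: CNN -> 2x Bi-GRU -> multi-head attention -> linear/softmax."""
    def __init__(self, n_mels=128, n_classes=4, hidden=128, heads=16):
        super().__init__()
        # Local feature extraction from the spectrogram (channels, mel bins, frames)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # halves both mel and time axes
        )
        feat_dim = 32 * (n_mels // 2)
        # Two stacked bidirectional GRU layers for long-term, contextual dependencies
        self.bigru = nn.GRU(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Multi-head self-attention over the Bi-GRU output sequence
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=heads,
                                          dropout=0.2, batch_first=True)
        self.drop = nn.Dropout(0.3)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                      # spec: (batch, 1, n_mels, frames)
        z = self.cnn(spec)                        # (batch, 32, n_mels/2, frames/2)
        z = z.permute(0, 3, 1, 2).flatten(2)      # (batch, frames/2, 32 * n_mels/2)
        z, _ = self.bigru(z)                      # (batch, frames/2, 2 * hidden)
        z, _ = self.attn(z, z, z)                 # self-attention, same shape
        z = self.drop(z.mean(dim=1))              # average pooling over time
        logits = self.classifier(z)
        p = torch.softmax(logits, dim=-1)         # emotion probabilities P
        return logits.argmax(dim=-1), p           # predicted categories Y and probabilities P

# Example: a batch of 8 spectrograms with 128 mel bins and 300 frames
model = SpectrogramSER()
y, p = model(torch.randn(8, 1, 128, 300))
```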
3. Results
3.1. Experiment Setup
3.2. Experimental Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A comprehensive review of speech emotion recognition systems. IEEE Access 2021, 9, 47795–47814. [Google Scholar] [CrossRef]
- Lee, S.W.; Sarp, S.; Jeon, D.J.; Kim, J.H. Smart water grid: The future water management platform. Desalination Water Treat. 2014, 55, 339–346. [Google Scholar] [CrossRef]
- Wu, Z.; Lu, Y.; Dai, X. An Empirical Study and Improvement for Speech Emotion Recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023; pp. 1–5. [Google Scholar]
- Mitkov, R.; Breck, E.; Cardie, C. Opinion Mining and Sentiment Analysis. In The Oxford Handbook of Computational Linguistics, 2nd ed.; Oxford Academic: Oxford, UK, 2017. [Google Scholar]
- Zeng, Y.; Li, Z.; Tang, Z.; Chen, Z.; Ma, H. Heterogeneous graph convolution based on in-domain self-supervision for multimodal sentiment analysis. Expert Syst. Appl. 2023, 213, 119240. [Google Scholar] [CrossRef]
- Kaur, K.; Singh, P. Trends in speech emotion recognition: A comprehensive survey. Multimed. Tools Appl. 2023, 82, 29307–29351. [Google Scholar] [CrossRef]
- Tang, H.; Zhang, X.; Cheng, N.; Xiao, J.; Wang, J. ED-TTS: Multi-Scale Emotion Modeling Using Cross-Domain Emotion Diarization for Emotional Speech Synthesis. arXiv 2024, arXiv:2401.08166. [Google Scholar]
- Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech emotion recognition with co-attention based multi-level acoustic information. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 7367–7371. [Google Scholar]
- Schuller, B.W. Speech emotion recognition. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
- Stuhlsatz, A.; Meyer, C.; Eyben, F.; Zielke, T.; Meier, G.; Schuller, B. Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 5688–5691. [Google Scholar]
- Rozental, A.; Fleischer, D. Amobee at SemEval-2018 Task 1: GRU Neural Network with a CNN Attention Mechanism for Sentiment Classification. In Proceedings of the 12th International Workshop on Semantic Evaluation, New Orleans, LA, USA, 5–6 June 2018; pp. 218–225. [Google Scholar]
- Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014. [Google Scholar] [CrossRef]
- Badshah, A.M.; Rahim, N.; Ullah, N.; Ahmad, J.; Muhammad, K.; Lee, M.Y.; Kwon, S.; Baik, S.W. Deep features-based speech emotion recognition for smart affective services. Multimed. Tools Appl. 2017, 78, 5571–5589. [Google Scholar] [CrossRef]
- Sak, H.; Senior, A.; Rao, K.; İrsoy, O.; Graves, A.; Beaufays, F.; Schalkwyk, J. Learning acoustic frame labeling for speech recognition with recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 4280–4284. [Google Scholar]
- Tao, F.; Liu, G. Advanced LSTM: A Study About Better Time Dependency Modeling in Emotion Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 2906–2910. [Google Scholar]
- Moritz, N.; Hori, T.; Roux, J.L. Triggered Attention for End-to-end Speech Recognition. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5666–5670. [Google Scholar]
- Chiu, C.C.; Sainath, T.N.; Wu, Y.; Prabhavalkar, R.; Nguyen, P.; Chen, Z.; Kannan, A.; Weiss, R.J.; Rao, K.; Gonina, E.; et al. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4774–4778. [Google Scholar]
- Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Vinola, C.; Vimaladevi, K. A Survey on Human Emotion Recognition Approaches, Databases and Applications. ELCVIA Electron. Lett. Comput. Vis. Image Anal. 2015, 14, 24. [Google Scholar] [CrossRef]
- Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
- Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80. [Google Scholar] [CrossRef]
- Lee, J.; Tashev, I.J. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015. [Google Scholar]
- Chauhan, K.; Sharma, K.K.; Varma, T. Speech Emotion Recognition Using Convolution Neural Networks. In Proceedings of the 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), Coimbatore, India, 25–27 March 2021; pp. 1176–1181. [Google Scholar]
- Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76. [Google Scholar] [CrossRef]
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204. [Google Scholar]
- Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [Google Scholar] [CrossRef]
- Sak, H.; Vinyals, O.; Heigold, G.; Senior, A.W.; McDermott, E.; Monga, R.; Mao, M.Z. Sequence discriminative distributed training of long short-term memory recurrent neural networks. In Proceedings of the Interspeech, Singapore, 14–18 September 2014. [Google Scholar]
- Mahjoub, M.A.; Raoof, K.; Mbarki, M.; Serrestou, Y.; Kerkeni, L. Speech Emotion Recognition: Methods and Cases Study. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence, Funchal, Portugal, 16–18 January 2018; pp. 175–182. [Google Scholar]
- Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323. [Google Scholar] [CrossRef]
- Sainath, T.N.; Vinyals, O.; Senior, A.; Sak, H. Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 4580–4584. [Google Scholar]
- Chen, M.; Zhao, X. A Multi-Scale Fusion Framework for Bimodal Speech Emotion Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 374–378. [Google Scholar]
- Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D Convolutional Recurrent Neural Networks With Attention Model for Speech Emotion Recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444. [Google Scholar] [CrossRef]
- Li, P.; Song, Y.; McLoughlin, I.; Guo, W.; Dai, L. An Attention Pooling Based Representation Learning Method for Speech Emotion Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3087–3091. [Google Scholar]
- Shen, J.; Wang, C.; Lai, C.-F.; Wang, A.; Chao, H.-C. Direction Density-Based Secure Routing Protocol for Healthcare Data in Incompletely Predictable Networks. IEEE Access 2016, 4, 9163–9173. [Google Scholar] [CrossRef]
- Neumann, M.; Vu, N.T. Cross-lingual and Multilingual Speech Emotion Recognition on English and French. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5769–5773. [Google Scholar]
- Feng, X.; Liu, X. Sentiment Classification of Reviews Based on BiGRU Neural Network and Fine-grained Attention. J. Phys. Conf. Ser. 2019, 1229, 012064. [Google Scholar] [CrossRef]
- Huang, P.-Y.; Chang, X.; Hauptmann, A. Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019. [Google Scholar]
- Hayashi, T.; Watanabe, S.; Toda, T.; Takeda, K. Multi-Head Decoder for End-to-End Speech Recognition. arXiv 2018, arXiv:1804.08050. [Google Scholar]
- Liang, Z.; Li, X.; Song, W. Research on speech emotion recognition algorithm for unbalanced data set. J. Intell. Fuzzy Syst. 2020, 39, 2791–2796. [Google Scholar] [CrossRef]
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Schuller, B. Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2022, 14, 1912–1926. [Google Scholar] [CrossRef]
- Mustaqeem; Sajjad, M.; Kwon, S. Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [Google Scholar] [CrossRef]
- Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3718–3727. [Google Scholar]
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.-P. Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv 2016, arXiv:1606.06259. [Google Scholar]
- Cai, Y.; Li, X.; Li, J. Emotion Recognition Using Different Sensors, Emotion Models, Methods and Datasets: A Comprehensive Review. Sensors 2023, 23, 2455. [Google Scholar] [CrossRef] [PubMed]
- Chung, J.; Gülçehre, Ç.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
- Lin, R.; Hu, H. Multi-Task Momentum Distillation for Multimodal Sentiment Analysis. IEEE Trans. Affect. Comput. 2023. [Google Scholar] [CrossRef]
Emotion | Emo-DB | IEMOCAP |
---|---|---|
neutral | 50 | 50 |
angry | 50 | 50 |
happy | 50 | 50 |
sad | 50 | 50 |
fear | 50 | - |
disgust | 50 | - |
bored | 50 | - |
Hyper-Parameter | IEMOCAP | Emo-DB |
---|---|---|
Learning rate | 1 × 10⁻⁵ | 1 × 10⁻⁴ |
Batch size | 32 | 32 |
Att dropout | 0.2 | 0.2 |
Head num | 16 | 8 |
Dropout (Output) | 0.3 | 0.2 |
Epochs | 100 | 80 |
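A hypothetical training configuration wiring the IEMOCAP column of the table above into a PyTorch loop is sketched below. The optimizer choice, loss function, stand-in model, and random placeholder data are assumptions; only the numeric values come from the table.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the CNN + Bi-GRU + attention network.
model = nn.Sequential(nn.Linear(256, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # learning rate from the table
criterion = nn.CrossEntropyLoss()
batch_size, epochs = 32, 100                                 # batch size and epochs from the table

for epoch in range(epochs):
    feats = torch.randn(batch_size, 256)                     # placeholder batch of pooled features
    labels = torch.randint(0, 4, (batch_size,))              # placeholder emotion labels
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()
```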
Model | IEMOCAP | Emo-DB |
---|---|---|
Bi-GRU | 71.01% | 83.22% |
Head num = 32 | 73.32% | 85.03% |
Head num = 16 | 75.04% * | 86.81% |
Head num = 8 | 73.89% | 88.93% * |
Head num = 4 | 74.47% | 84.66% |
Results | IEMOCAP | Emo-DB |
---|---|---|
Precision | 83.33% | 74.57% |
Accuracy | 75.04% | 88.93% |
Recall | 70.20% | 80.20% |
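For reference, metrics of this kind can be computed with scikit-learn; the snippet below uses toy labels, and macro averaging is shown as one common way to obtain "unweighted" per-class scores, which may not match the authors' exact protocol.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             precision_score, recall_score)

# Toy predictions for a 4-class task.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3]

print(accuracy_score(y_true, y_pred))                    # overall accuracy
print(balanced_accuracy_score(y_true, y_pred))           # unweighted accuracy (mean per-class recall)
print(precision_score(y_true, y_pred, average="macro"))  # macro-averaged precision
print(recall_score(y_true, y_pred, average="macro"))     # macro-averaged recall
```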
Emotion Category | IEMOCAP | Emo-DB |
---|---|---|
Neutral | 86.64% | 80.25% |
Angry | 65.50% | 97.11% |
Happy | 55.05% | 74.75% |
Sad | 88.64% | 83.21% |
Fear | - | 95.74% |
Disgust | - | 92.34% |
Bored | - | 95.55% |
Method | IEMOCAP | Emo-DB |
---|---|---|
CNN | 69.80% | 70.58% |
LSTM | 70.10% | 77.73% |
RNN | 70.22% | 79.91% |
CNN + LSTM | 72.17% | 83.79% |
Bi-GRU | 71.01% | 83.22% |
Self-Attention | 71.87% | 82.91% |
Bi-GRU + Self-Att | 73.21% | 84.58% |
Multi-head Attention | 72.13% | 83.37% |
Our Method | 75.04% | 88.93% |
Evaluation Index | CH-SIMS | MOSI |
---|---|---|
Acc-2 | 72.43% | 70.29% |
F1-score | 71.97 | 70.31 |
MAE | 0.648 | 0.974 |
Corr | 0.524 | 0.613 |
Acc-5 | 30.19% | 32.12% |
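The evaluation indices above follow the usual multimodal sentiment conventions (binary accuracy by polarity, mean absolute error, Pearson correlation, and a coarse multi-class accuracy). A small NumPy/SciPy sketch on toy scores in [-1, 1] is given below; the thresholds and binning are assumptions and may not match the paper's protocol (MOSI labels, for example, span [-3, 3]).

```python
import numpy as np
from scipy.stats import pearsonr

# Toy continuous sentiment scores in [-1, 1] (CH-SIMS style); all thresholds are assumptions.
y_true = np.array([-0.8, -0.2, 0.0, 0.4, 0.9])
y_pred = np.array([-0.6, -0.1, 0.2, 0.3, 0.7])

mae = float(np.mean(np.abs(y_true - y_pred)))             # MAE
corr, _ = pearsonr(y_true, y_pred)                        # Corr (Pearson correlation)
acc2 = float(np.mean((y_pred >= 0) == (y_true >= 0)))     # Acc-2: agreement on polarity
bins = np.array([-0.6, -0.2, 0.2, 0.6])                   # Acc-5: five evenly spaced bins over [-1, 1]
acc5 = float(np.mean(np.digitize(y_pred, bins) == np.digitize(y_true, bins)))
print(mae, corr, acc2, acc5)
```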