A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms
Abstract
1. Introduction
2. Related Works
2.1. Emotional Speech Database
2.2. Speech Emotion Recognition
- Time domain (3): zero crossing rate, energy, and entropy of energy
- Spectral domain (5): spectral centroid, spectral spread, spectral entropy, spectral flux, and spectral roll-off
- MFCCs (13)
- Chroma (13): 12-dimensional chroma vector and the standard deviation of the chroma vector
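The time-domain entries listed above are short-time features computed per analysis frame. A minimal NumPy sketch of how they are commonly computed (our illustration, not the paper's code; frame and sub-frame sizes are assumed):

```python
# Minimal NumPy sketch (our illustration, not the paper's code) of the
# time-domain features above; frame/sub-frame sizes are assumed.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (assumes len(x) >= frame_len)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def time_domain_features(x, n_sub=10):
    frames = frame_signal(x)
    # Zero crossing rate: fraction of sign changes within each frame.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Short-time energy of each frame.
    energy = np.sum(frames ** 2, axis=1)
    # Entropy of energy: entropy of the normalized sub-frame energies.
    sub = frames.reshape(frames.shape[0], n_sub, -1)
    p = np.sum(sub ** 2, axis=2)
    p = p / (p.sum(axis=1, keepdims=True) + 1e-10)
    entropy = -np.sum(p * np.log2(p + 1e-10), axis=1)
    return zcr, energy, entropy
```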
3. Korean Emotional Speech Database
4. Speech Emotion Recognition
4.1. Feature Selection
- Spectral domain (11): spectral centroid, spectral bandwidth, spectral contrast (7), spectral flatness, and spectral roll-off
- MFCCs (13)
- Chroma (12): 12-dimensional chroma vector
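This selection totals 36 features per frame (11 spectral + 13 MFCC + 12 chroma). A minimal sketch of how such a feature set could be extracted, assuming the librosa library (function choices and parameters are our illustration, not the authors' implementation):

```python
# Illustrative sketch (assumes librosa; parameters are ours, not the paper's):
# stack the 36 per-frame features listed above.
import numpy as np
import librosa

def extract_features(path):
    y, sr = librosa.load(path, sr=16000)
    feats = np.vstack([
        librosa.feature.spectral_centroid(y=y, sr=sr),   # 1 row
        librosa.feature.spectral_bandwidth(y=y, sr=sr),  # 1 row
        librosa.feature.spectral_contrast(y=y, sr=sr),   # 7 rows (6 bands + 1)
        librosa.feature.spectral_flatness(y=y),          # 1 row
        librosa.feature.spectral_rolloff(y=y, sr=sr),    # 1 row
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),     # 13 rows
        librosa.feature.chroma_stft(y=y, sr=sr),         # 12 rows
    ])
    return feats.T  # shape: (n_frames, 36)
```

Stacking per-frame descriptors this way yields a (frames × 36) matrix that can feed a sequence model such as the LSTM described in Section 4.3.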
4.2. Pre-Processing
4.2.1. Speech Segment Extraction
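Speech segment extraction removes silence so that only voiced material reaches the model. A plausible minimal sketch, assuming an energy-threshold approach via librosa (the 30 dB threshold is our assumption, not the paper's):

```python
# Plausible minimal sketch (our assumption, not necessarily the authors'
# method): keep only non-silent intervals using an energy threshold.
import numpy as np
import librosa

def extract_speech(y, top_db=30):
    # librosa.effects.split returns (start, end) sample indices of
    # intervals louder than (peak - top_db) dB.
    intervals = librosa.effects.split(y, top_db=top_db)
    return np.concatenate([y[s:e] for s, e in intervals])
```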
4.2.2. Feature Scaling
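Feature scaling keeps features with very different dynamic ranges (e.g., energy versus chroma) from dominating training. A minimal sketch assuming min-max scaling fitted on the training set (the choice of scaler is our assumption):

```python
# Minimal sketch assuming per-dimension min-max scaling; the scaler choice
# is our assumption. Statistics must come from the training set only.
import numpy as np

def fit_minmax(train_feats):          # train_feats: (n_frames, n_features)
    return train_feats.min(axis=0), train_feats.max(axis=0)

def apply_minmax(feats, fmin, fmax):
    return (feats - fmin) / (fmax - fmin + 1e-10)
```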
4.3. Emotion Recognition Model
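The comparison tables below pair feature sets with LSTM-based classifiers. As a hedged illustration of such a model (layer sizes, sequence length, and emotion count are assumptions, not taken from the paper), a Keras sketch:

```python
# Hedged Keras sketch of an LSTM emotion classifier over framewise feature
# sequences; hyperparameters here are hypothetical.
import tensorflow as tf

NUM_FRAMES, NUM_FEATURES, NUM_EMOTIONS = 300, 36, 4  # hypothetical sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(NUM_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```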
5. Experiments and Results
5.1. K-EmoDB
5.2. International DB
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases. Pattern Recognit. 2011, 44, 572–587.
- Shin, B.; Lee, S. A Comparison of Effective Feature Vectors for Speech Emotion Recognition. Trans. Korean Inst. Electr. Eng. 2018, 67, 1364–1369.
- Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech Emotion Recognition using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching. IEEE Trans. Multimed. 2017, 20, 1576–1590.
- Domínguez-Jiménez, J.A.; Campo-Landines, K.C.; Martínez-Santos, J.; Delahoz, E.J.; Contreras-Ortiz, S. A Machine Learning Model for Emotion Recognition from Physiological Signals. Biomed. Signal Process. Control. 2020, 55, 101646.
- Liu, M.; Li, S.; Shan, S.; Wang, R.; Chen, X. Deeply Learning Deformable Facial Action Parts Model for Dynamic Expression Analysis. In Proceedings of the Asian Conference on Computer Vision, Singapore, 1–5 November 2014; pp. 143–157.
- Xiong, X.; De la Torre, F. Supervised Descent Method and its Applications to Face Alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 532–539.
- Jia, X.; Li, W.; Wang, Y.; Hong, S.; Su, X. An Action Unit Co-Occurrence Constraint 3DCNN Based Action Unit Recognition Approach. KSII Trans. Internet Inf. Syst. 2020, 14, 924–942.
- He, J.; Li, D.; Bo, S.; Yu, L. Facial Action Unit Detection with Multilayer Fused Multi-Task and Multi-Label Deep Learning Network. KSII Trans. Internet Inf. Syst. 2019, 13, 5546–5559.
- Zhao, J.; Mao, X.; Chen, L. Speech Emotion Recognition using Deep 1D & 2D CNN LSTM Networks. Biomed. Signal Process. Control. 2019, 47, 312–323.
- Swain, M.; Routray, A.; Kabisatpathy, P. Databases, Features and Classifiers for Speech Emotion Recognition: A Review. Int. J. Speech Technol. 2018, 21, 93–120.
- Hutto, C.J.; Gilbert, E. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the Eighth International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI, USA, 1–4 June 2014; p. 82. Available online: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf (accessed on 20 April 2016).
- Ma, Y.; Hao, Y.; Chen, M.; Chen, J.; Lu, P.; Košir, A. Audio-Visual Emotion Fusion (AVEF): A Deep Efficient Weighted Approach. Inf. Fusion 2019, 46, 184–192.
- Scherer, K.R. Vocal Communication of Emotion: A Review of Research Paradigms. Speech Commun. 2003, 40, 227–256.
- Lee, C.; Lui, S.; So, C. Visualization of Time-Varying Joint Development of Pitch and Dynamics for Speech Emotion Recognition. J. Acoust. Soc. Am. 2014, 135, 2422.
- Wu, C.; Yeh, J.; Chuang, Z. Emotion Perception and Recognition from Speech. In Affective Information Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 93–110.
- Lotfian, R.; Busso, C. Curriculum Learning for Speech Emotion Recognition from Crowdsourced Labels. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 815–826.
- Song, P.; Zheng, W. Feature Selection Based Transfer Subspace Learning for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2018, 11, 373–382.
- Jing, S.; Mao, X.; Chen, L. Prominence Features: Effective Emotional Features for Speech Emotion Recognition. Digit. Signal Process. 2018, 72, 216–231.
- Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A Database of German Emotional Speech. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisboa, Portugal, 4–8 September 2005.
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English. PLoS ONE 2018, 13, e0196391.
- Busso, C.; Bulut, M.; Lee, C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Lang. Resour. Eval. 2008, 42, 335.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
- Bou-Ghazale, S.E.; Hansen, J.H. A Comparative Study of Traditional and Newly Proposed Features for Recognition of Speech Under Stress. IEEE Trans. Speech Audio Process. 2000, 8, 429–442.
- Lee, C.M.; Yildirim, S.; Bulut, M.; Kazemzadeh, A.; Busso, C.; Deng, Z.; Lee, S.; Narayanan, S. Emotion Recognition Based on Phoneme Classes. In Proceedings of the Eighth International Conference on Spoken Language Processing, Jeju Island, Korea, 4–8 October 2004.
- Schuller, B.; Rigoll, G.; Lang, M. Hidden Markov Model-Based Speech Emotion Recognition. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China, 6–10 April 2003.
- Lee, J.; Tashev, I. High-Level Feature Representation using Recurrent Neural Network for Speech Emotion Recognition. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
- Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic Speech Emotion Recognition using Recurrent Neural Networks with Local Attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231.
- Chen, R.; Zhou, Y.; Qian, Y. Emotion Recognition using Support Vector Machine and Deep Neural Network. In Proceedings of the National Conference on Man-Machine Speech Communication, Lianyungang, China, 11–13 October 2017; pp. 122–131.
- Wieman, M.; Sun, A. Analyzing Vocal Patterns to Determine Emotion. Available online: http://www.datascienceassn.org/content/analyzing-vocal-patterns-determine-emotion (accessed on 3 September 2016).
- Shaqra, F.A.; Duwairi, R.; Al-Ayyoub, M. Recognizing Emotion from Speech Based on Age and Gender using Hierarchical Models. Procedia Comput. Sci. 2019, 151, 37–44.
- Eyben, F.; Wöllmer, M.; Schuller, B. openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 1459–1462.
- Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202.
- Chernykh, V.; Prikhodko, P. Emotion Recognition from Speech with Recurrent Neural Networks. arXiv 2017, arXiv:1701.08071.
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376.
- Iliou, T.; Anagnostopoulos, C. Statistical Evaluation of Speech Features for Emotion Recognition. In Proceedings of the 2009 Fourth International Conference on Digital Telecommunications, Colmar, France, 20–25 July 2009; pp. 121–126.
- Kao, Y.; Lee, L. Feature Analysis for Emotion Recognition from Mandarin Speech Considering the Special Characteristics of Chinese Language. In Proceedings of the Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, 17–21 September 2006.
- Luengo, I.; Navas, E.; Hernáez, I.; Sánchez, J. Automatic Emotion Recognition using Prosodic Parameters. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisboa, Portugal, 4–8 September 2005.
- Rao, K.S.; Koolagudi, S.G.; Vempada, R.R. Emotion Recognition from Speech using Global and Local Prosodic Features. Int. J. Speech Technol. 2013, 16, 143–160.
- Schuller, B.; Batliner, A.; Steidl, S.; Seppi, D. Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge. Speech Commun. 2011, 53, 1062–1087.
- Schuller, B.; Steidl, S.; Batliner, A.; Vinciarelli, A.; Scherer, K.; Ringeval, F.; Chetouani, M.; Weninger, F.; Eyben, F.; Marchi, E. INTERSPEECH 2013 Computational Paralinguistics Challenge: Social Signals, Conflict, Emotion, Autism. In Proceedings of the 14th Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013.
- Russell, J.A. Is there Universal Recognition of Emotion from Facial Expression? A Review of the Cross-Cultural Studies. Psychol. Bull. 1994, 115, 102.
- Ortony, A.; Turner, T.J. What's Basic about Basic Emotions? Psychol. Rev. 1990, 97, 315–331.
- Barrett, L.F. Are Emotions Natural Kinds? Perspect. Psychol. Sci. 2006, 1, 28–58.
- Ekman, P. Pictures of Facial Affect; Consulting Psychologists Press: Palo Alto, CA, USA, 1976.
- Lundqvist, D.; Flykt, A.; Öhman, A. Karolinska Directed Emotional Faces; Database of Standardized Facial Images; Psychology Section, Department of Clinical Neuroscience, Karolinska Hospital: Stockholm, Sweden, 1998; Volume S-171, p. 76.
- Wang, L.; Markham, R. The Development of a Series of Photographs of Chinese Facial Expressions of Emotion. J. Cross-Cult. Psychol. 1999, 30, 397–410.
- Kanade, T.; Cohn, J.F.; Tian, Y. Comprehensive Database for Facial Expression Analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580), Grenoble, France, 28–30 March 2000; pp. 46–53.
- Tottenham, N.; Tanaka, J.W.; Leon, A.C.; McCarry, T.; Nurse, M.; Hare, T.A.; Marcus, D.J.; Westerlund, A.; Casey, B.; Nelson, C. The NimStim Set of Facial Expressions: Judgments from Untrained Research Participants. Psychiatry Res. 2009, 168, 242–249.
- Tracy, J.L.; Robins, R.W.; Schriber, R.A. Development of a FACS-Verified Set of Basic and Self-Conscious Emotion Expressions. Emotion 2009, 9, 554.
- Langner, O.; Dotsch, R.; Bijlstra, G.; Wigboldus, D.H.; Hawk, S.T.; Van Knippenberg, A. Presentation and Validation of the Radboud Faces Database. Cogn. Emot. 2010, 24, 1377–1388.
- Simon, D.; Craig, K.D.; Gosselin, F.; Belin, P.; Rainville, P. Recognition and Discrimination of Prototypical Dynamic Expressions of Pain and Emotions. Pain 2008, 135, 55–64.
- Castro, S.L.; Lima, C.F. Recognizing Emotions in Spoken Language: A Validated Set of Portuguese Sentences and Pseudosentences for Research on Emotional Prosody. Behav. Res. Methods 2010, 42, 74–81.
- Zhang, X.; Yin, L.; Cohn, J.F.; Canavan, S.; Reale, M.; Horowitz, A.; Liu, P.; Girard, J.M. BP4D-Spontaneous: A High-Resolution Spontaneous 3D Dynamic Facial Expression Database. Image Vis. Comput. 2014, 32, 692–706.
- LoBue, V.; Thrasher, C. The Child Affective Facial Expression (CAFE) Set: Validity and Reliability from Untrained Adults. Front. Psychol. 2015, 5, 1532.
- Emotion Classification. Available online: https://en.wikipedia.org/wiki/Emotion_classification (accessed on 29 January 2021).
- Gabrielsson, A.; Lindström, E. The Role of Structure in the Musical Expression of Emotions. In Handbook of Music and Emotion: Theory, Research, Applications; Series in Affective Science; Juslin, P.N., Sloboda, J.A., Eds.; Oxford University Press: Oxford, UK, 2010; pp. 367–400.
- Bogdanov, D.; Wack, N.; Gómez Gutiérrez, E.; Gulati, S.; Boyer, H.; Mayor, O.; Roma Trepat, G.; Salamon, J.; Zapata González, J.R.; Serra, X. Essentia: An Audio Analysis Library for Music Information Retrieval. In Proceedings of the 14th International Society for Music Information Retrieval Conference, Curitiba, Brazil, 4–8 November 2013; pp. 493–498.
- Zamil, A.A.A.; Hasan, S.; Baki, S.M.J.; Adam, J.M.; Zaman, I. Emotion Detection from Speech Signals using Voting Mechanism on Classified Frames. In Proceedings of the 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 10–12 January 2019; pp. 281–285.
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu Features? End-to-End Speech Emotion Recognition using a Deep Convolutional Recurrent Network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204.
| No. | Feature |
|---|---|
| 1 | Harmonic energy |
| 2 | Noise energy |
| 3 | Noiseness |
| 4 | F0 |
| 5 | Inharmonicity |
| 6 | Tristimulus |
| 7 | Harmonic spectral deviation |
| 8 | Odd-to-even harmonic ratio |
| | Positive | Negative | Harmonic Feature Set | Accuracy (%) |
|---|---|---|---|---|
| High Arousal | Happiness | Anger | Noise energy, Noiseness, Inharmonicity, Tristimulus | 65.5 |
| Low Arousal | Neutrality | Sadness | Noiseness, F0, Inharmonicity, Tristimulus | 76.0 |
| Paper | Features | Algorithm | Accuracy |
|---|---|---|---|
| Chernykh (2017) | | LSTM with CTC loss | 67.90% |
| Shaqra (2019) | | Multi-layer perceptron | 67.14% |
| Zamil (2019) | MFCC | Logistic Model Tree | 61.32% |
| George (2016) | - | 1D Conv layer + LSTM | 65.89% |
| Jianfeng (2019) | Log mel-spectrogram | 1D + 2D Conv layer + LSTM | 62.63% |
| Proposed | | LSTM + voting mechanism | 83.81% |
| Proposed | | LSTM | 75.46% |
| Proposed | | LSTM | 70.51% |
| Paper | Database | Emotion | Features | Algorithm | Accuracy |
|---|---|---|---|---|---|
| Chernykh (2017) | | | | LSTM with CTC loss | |
| Shaqra (2019) | RAVDESS | Neutral, calm, happy, sad, angry, fearful, disgust, surprised | | Multi-layer perceptron | 74% |
| Zamil (2019) | | | | Logistic Model Tree | |
| George (2016) | RECOLA | Arousal, valence | - | 1D Conv layer + LSTM | 65.89% |
| Jianfeng (2019) | | | Log mel-spectrogram | 1D + 2D Conv layer + LSTM | |
| Proposed | | | | | |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Byun, S.-W.; Lee, S.-P. A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms. Appl. Sci. 2021, 11, 1890. https://doi.org/10.3390/app11041890