Speech Emotion Recognition Based on Multiple Acoustic Features and Deep Convolutional Neural Network
Abstract
1. Introduction
- To present collaborative low-order and high-order features, combining acoustic descriptors such as MFCC, LPCC, WPT, ZCR, RMS, spectral centroid, spectral roll-off, spectral kurtosis, formants, pitch, jitter, and shimmer, to improve the distinctiveness of the speech signal's feature representation (a minimal extraction sketch follows this list);
- To develop a lightweight 1-D deep convolutional neural network (1-D DCNN) that reduces the computational complexity typical of deep learning frameworks for SER.
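As a rough illustration of what such a multi-feature front end can look like, the sketch below extracts a subset of the listed descriptors (MFCC, ZCR, RMS, spectral centroid, spectral roll-off, pitch) with librosa. The paper does not specify an implementation; the toolkit, the 13-coefficient MFCC setting, the mean pooling over frames, and the omission of the remaining descriptors (LPCC, WPT, formants, jitter, shimmer, kurtosis) are all illustrative assumptions.

```python
import numpy as np
import librosa

def extract_features(path: str, sr: int = 16000) -> np.ndarray:
    """Concatenate frame-averaged acoustic descriptors for one utterance.
    Covers only a subset of the paper's feature set; parameters are illustrative."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # spectral envelope
    zcr = librosa.feature.zero_crossing_rate(y)                 # voicing / noisiness
    rms = librosa.feature.rms(y=y)                              # frame energy
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)    # spectral "brightness"
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)      # 85% energy frequency
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)               # pitch track
    frames = [mfcc, zcr, rms, centroid, rolloff, f0[np.newaxis, :]]
    # Collapse each descriptor's frame sequence to its mean (one common pooling choice).
    return np.concatenate([f.mean(axis=1) for f in frames])
```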
2. Acoustic Features
2.1. MFCC
2.2. RMS
2.3. ZCR
2.4. Spectrum Centroid
2.5. Spectral Roll-off
2.6. LPCC
2.7. Spectral Kurtosis
2.8. Jitter and Shimmer
2.9. Pitch Frequency
2.10. Formants
2.11. Wavelet Packet Decomposition Features
3. Proposed Methodology
4. System Implementation and Results
4.1. Dataset
4.2. Results and Discussions
5. Conclusions and Future Scopes
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Lieskovská, E.; Jakubec, M.; Jarina, R.; Chmulík, M. A review on speech emotion recognition using deep learning and attention mechanism. Electronics 2021, 10, 1163.
2. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76.
3. Bhangale, K.B.; Kothandaraman, M. Survey of Deep Learning Paradigms for Speech Processing. Wirel. Pers. Commun. 2022, 125, 1913–1949.
4. Shah, F.M.; Ranjan, A.; Yadav, J.; Deepak, A. A survey of speech emotion recognition in natural environment. Digit. Signal Process. 2021, 110, 102951.
5. Papakostas, M.; Spyrou, E.; Giannakopoulos, T.; Siantikos, G.; Sgouropoulos, D.; Mylonas, P.; Makedon, F. Deep visual attributes vs. hand-crafted audio features on multidomain speech emotion recognition. Computation 2017, 5, 26.
6. Özseven, T. A novel feature selection method for speech emotion recognition. Appl. Acoust. 2019, 146, 320–326.
7. Abdel-Hamid, L.; Shaker, N.H.; Emara, I. Analysis of Linguistic and Prosodic Features of Bilingual Arabic–English Speakers for Speech Emotion Recognition. IEEE Access 2020, 8, 72957–72970.
8. Alex, S.B.; Mary, L.; Babu, B.P. Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features. Circuits Syst. Signal Process. 2020, 39, 5681–5709.
9. Khan, A.; Roy, U.K. Emotion recognition using prosodic and spectral features of speech and Naïve Bayes classifier. In Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 22–24 March 2017; pp. 1017–1021.
10. Likitha, M.S.; Gupta, S.R.R.; Hasitha, K.; Raju, A.U. Speech based human emotion recognition using MFCC. In Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 22–24 March 2017; pp. 2257–2260.
11. Renjith, S.; Manju, K.G. Speech based emotion recognition in Tamil and Telugu using LPCC and Hurst parameters—A comparative study using KNN and ANN classifiers. In Proceedings of the 2017 International Conference on Circuit, Power and Computing Technologies (ICCPCT), Kollam, India, 20–21 April 2017; pp. 1–6.
12. Feraru, S.M.; Zbancioc, M.D. Emotion recognition in Romanian language using LPC features. In Proceedings of the 2013 E-Health and Bioengineering Conference (EHB), Iasi, Romania, 21–23 November 2013; pp. 1–4.
13. Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80.
14. Li, X.; Li, X. Speech Emotion Recognition Using Novel HHT-TEO Based Features. J. Comput. 2011, 6, 989–998.
15. Drisya, P.S.; Rajan, R. Significance of TEO slope feature in speech emotion recognition. In Proceedings of the 2017 International Conference on Networks & Advances in Computational Technologies (NetACT), Thiruvananthapuram, India, 20–22 July 2017; pp. 438–441.
16. Bhangale, K.B.; Mohanaprasad, K. A review on speech processing using machine learning paradigm. Int. J. Speech Technol. 2021, 24, 367–388.
17. Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A comprehensive review of speech emotion recognition systems. IEEE Access 2021, 9, 47795–47814.
18. Sonawane, A.; Inamdar, M.U.; Bhangale, K.B. Sound based human emotion recognition using MFCC & multiple SVM. In Proceedings of the 2017 International Conference on Information, Communication, Instrumentation and Control (ICICIC), Indore, India, 17–19 August 2017; pp. 1–4.
19. Khalil, R.A.; Jones, E.; Babar, M.I.; Jan, T.; Zafar, M.H.; Alhussain, T. Speech emotion recognition using deep learning techniques: A review. IEEE Access 2019, 7, 117327–117345.
20. Jahangir, R.; Teh, Y.W.; Hanif, F.; Mujtaba, G. Deep learning approaches for speech emotion recognition: State of the art and research challenges. Multimed. Tools Appl. 2021, 80, 23745–23812.
21. Thakur, A.; Dhull, S. Speech Emotion Recognition: A Review. Adv. Commun. Comput. Technol. 2021, 4, 815–827.
22. Mustaqeem; Kwon, S. MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst. Appl. 2021, 167, 114177.
23. Mustaqeem; Kwon, S. 1D-CNN: Speech emotion recognition system using a stacked network with dilated CNN features. CMC-Comput. Mater. Contin. 2021, 67, 4039–4059.
24. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323.
25. Satt, A.; Rozenberg, S.; Hoory, R. Efficient Emotion Recognition from Speech Using Deep Learning on Spectrograms. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1089–1093.
26. Bhangale, K.; Mohanaprasad, K. Speech emotion recognition using mel frequency log spectrogram and deep convolutional neural network. In Futuristic Communication and Network Technologies; Springer: Singapore, 2022; pp. 241–250.
27. Zhao, J.; Mao, X.; Chen, L. Learning deep features to recognise speech emotion using merged deep CNN. IET Signal Process. 2018, 12, 713–721.
28. Er, M.B. A novel approach for classification of speech emotions based on deep and acoustic features. IEEE Access 2020, 8, 221640–221653.
29. Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444.
30. Meng, H.; Yan, T.; Yuan, F.; Wei, H. Speech Emotion Recognition from 3D Log-Mel Spectrograms with Deep Learning Network. IEEE Access 2019, 7, 125868–125881.
31. Farooq, M.; Hussain, F.; Baloch, N.K.; Raja, F.R.; Yu, H.; Zikria, Y.B. Impact of feature selection algorithm on speech emotion recognition using deep convolutional neural network. Sensors 2020, 20, 6008.
32. Sonawane, S.; Kulkarni, N. Speech emotion recognition based on MFCC and convolutional neural network. Int. J. Adv. Sci. Res. Eng. Trends 2020, 5, 18–22.
33. Mustaqeem; Sajjad, M.; Kwon, S. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 2020, 8, 79861–79875.
34. Mustaqeem; Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2020, 20, 183.
35. Vryzas, N.; Vrysis, L.; Matsiola, M.; Kotsakis, R.; Dimoulas, C.; Kalliris, G. Continuous speech emotion recognition with convolutional neural networks. J. Audio Eng. Soc. 2020, 68, 14–24.
36. Ho, N.-H.; Yang, H.-J.; Kim, S.-H.; Lee, G. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network. IEEE Access 2020, 8, 61672–61686.
37. Atila, O.; Şengür, A. Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition. Appl. Acoust. 2021, 182, 108260.
38. Liu, J.; Wang, H. A speech emotion recognition framework for better discrimination of confusions. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 4483–4487.
39. Tamulevičius, G.; Korvel, G.; Yayak, A.B.; Treigys, P.; Bernatavičienė, J.; Kostek, B. A study of cross-linguistic speech emotion recognition based on 2D feature spaces. Electronics 2020, 9, 1725.
40. Huang, S.; Dang, H.; Jiang, R.; Hao, Y.; Xue, C.; Gu, W. Multi-Layer Hybrid Fuzzy Classification Based on SVM and Improved PSO for Speech Emotion Recognition. Electronics 2021, 10, 2891.
41. Makhmudov, F.; Kutlimuratov, A.; Akhmedov, F.; Abdallah, M.S.; Cho, Y.-I. Modeling Speech Emotion Recognition via Attention-Oriented Parallel CNN Encoders. Electronics 2022, 11, 4047.
42. Bhangale, K.B.; Titare, P.; Pawar, R.; Bhavsar, S. Synthetic speech spoofing detection using MFCC and radial basis function SVM. IOSR J. Eng. 2018, 8, 55–62.
43. Chaturvedi, I.; Noel, T.; Satapathy, R. Speech Emotion Recognition Using Audio Matching. Electronics 2022, 11, 3943.
44. Tzanetakis, G.; Cook, P. Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 2002, 10, 293–302.
45. Schubert, E.; Wolfe, J.; Tarnopolsky, A. Spectral centroid and timbre in complex, multiple instrumental textures. In Proceedings of the International Conference on Music Perception and Cognition; Northwestern University: Evanston, IL, USA, 2004; pp. 112–116.
46. Gupta, H.; Gupta, D. LPC and LPCC method of feature extraction in speech recognition system. In Proceedings of the 2016 6th International Conference—Cloud System and Big Data Engineering (Confluence), Noida, India, 14–15 January 2016; pp. 498–502.
47. Olla, E.; Elbasheer, E.; Nawari, M. A comparative study of MFCC and LPCC features for speech activity detection using deep belief network. In Proceedings of the 2018 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), Khartoum, Sudan, 12–14 August 2018; pp. 1–5.
48. Makhoul, J. Linear prediction: A tutorial review. Proc. IEEE 1975, 63, 561–580.
49. Kawade, R.; Bhalke, D.G. Speech Emotion Recognition Based on Wavelet Packet Coefficients. In ICCCE 2021: Proceedings of the 4th International Conference on Communications and Cyber Physical Engineering; Springer Nature: Singapore, 2022; pp. 823–828.
50. Hamsa, S.; Shahin, I.; Iraqi, Y.; Werghi, N. Emotion recognition from speech using wavelet packet transform cochlear filter bank and random forest classifier. IEEE Access 2020, 8, 96994–97006.
51. Nainan, S.; Kulkarni, V. Enhancement in speaker recognition for optimized speech features using GMM, SVM and 1-D CNN. Int. J. Speech Technol. 2021, 24, 809–822.
52. Chowdhury, S.M.M.A.R.; Nirjhor, S.M.; Uddin, J. Bangla speech recognition using 1D-CNN and LSTM with different dimension reduction techniques. In International Conference for Emerging Technologies in Computing; Springer: Cham, Switzerland, 2020; pp. 158–169.
53. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Interspeech 2005, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520.
54. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391.
Layer-wise configuration of the 1-D DCNN for the raw-speech and MFCC input variants (output size per layer; all convolutions use stride 1):

| Network Layer | Raw Speech + 1-D DCNN: Size | Stride | MFCC + 1-D DCNN: Size | Stride |
|---|---|---|---|---|
| Input | 64,000 × 1 | - | 715 × 1 | - |
| Conv1 | 64,000 × 1 × 32 | 1 | 715 × 1 × 32 | 1 |
| ReLU-1 | 64,000 × 1 × 32 | 1 | 715 × 1 × 32 | 1 |
| Conv2 | 64,000 × 1 × 64 | 1 | 715 × 1 × 64 | 1 |
| ReLU-2 | 64,000 × 1 × 64 | 1 | 715 × 1 × 64 | 1 |
| Conv3 | 64,000 × 1 × 128 | 1 | 715 × 1 × 128 | 1 |
| ReLU-3 | 64,000 × 1 × 128 | 1 | 715 × 1 × 128 | 1 |
| FC-1 | 20 × 1 | - | 20 × 1 | - |
| FC-2 | 7 × 1 | - | 7 × 1 | - |
| Output | 7 × 1 | - | 7 × 1 | - |
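The table fixes the channel progression (32, 64, 128), the strides, and the two fully connected layers, but not the kernel sizes. Below is a minimal PyTorch sketch of the MFCC-input variant, assuming kernel size 3 with same-padding (to preserve the 715-sample length the table implies) and a flatten before the first fully connected layer; those choices are assumptions, not confirmed by the source.

```python
import torch
import torch.nn as nn

class SER1DCNN(nn.Module):
    """1-D DCNN following the layer table: Conv(32) -> Conv(64) -> Conv(128) -> FC(20) -> FC(7).
    Kernel size and padding are assumed; the table only fixes channels, strides, and lengths."""
    def __init__(self, input_len: int = 715, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, stride=1, padding=1),    # -> 715 x 32
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=1, padding=1),   # -> 715 x 64
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=1, padding=1),  # -> 715 x 128
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # 715 * 128 = 91,520 values
            nn.Linear(input_len * 128, 20),  # FC-1 -> 20
            nn.Linear(20, n_classes),        # FC-2 -> 7 emotion classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, input_len)
        return self.classifier(self.features(x))

model = SER1DCNN()
logits = model(torch.randn(8, 1, 715))  # -> shape (8, 7)
```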
Per-emotion sample counts and 70/30 train/test split for EMODB:

| Samples | Anger | Boredom | Disgust | Fear | Happiness | Neutral | Sadness | Total |
|---|---|---|---|---|---|---|---|---|
| Total | 126 | 79 | 46 | 70 | 73 | 78 | 63 | 535 |
| Training (70%) | 87 | 55 | 31 | 50 | 51 | 54 | 44 | 372 |
| Testing (30%) | 39 | 24 | 15 | 20 | 22 | 24 | 19 | 163 |
Per-emotion sample counts and 70/30 train/test split for RAVDESS:

| Samples | Anger | Calm | Disgust | Fear | Happy | Neutral | Sadness | Surprised | Total |
|---|---|---|---|---|---|---|---|---|---|
| Total | 192 | 192 | 192 | 192 | 192 | 96 | 192 | 192 | 1440 |
| Training (70%) | 134 | 134 | 134 | 134 | 134 | 67 | 134 | 134 | 1005 |
| Testing (30%) | 58 | 58 | 58 | 58 | 58 | 29 | 58 | 58 | 435 |
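Both corpora are split 70/30 within each emotion class (e.g., 87/39 of the 126 EMODB anger clips). A stratified split like this can be reproduced with scikit-learn; the file names, label construction, and random seed below are toy stand-ins, and per-class rounding may differ slightly from the paper's exact counts.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for (audio path, emotion label) pairs matching the EMODB counts above.
files = [f"clip_{i:03d}.wav" for i in range(535)]
labels = (["anger"] * 126 + ["boredom"] * 79 + ["disgust"] * 46 + ["fear"] * 70
          + ["happiness"] * 73 + ["neutral"] * 78 + ["sadness"] * 63)

# stratify keeps the 70/30 ratio within every emotion class, as in the tables above.
train_files, test_files, train_y, test_y = train_test_split(
    files, labels, test_size=0.30, stratify=labels, random_state=0)

print(len(train_files), len(test_files))  # 374 161 (vs. the paper's 372/163 per-class rounding)
```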
Class-wise results on EMODB. Raw = raw speech + 1-D DCNN; MFCC = MFCC + 1-D DCNN; MAF = multiple acoustic features + 1-D DCNN; accuracy in %, recall/precision/F1 in [0, 1]:

| Emotion | Raw: Acc. | Raw: Recall | Raw: Prec. | Raw: F1 | MFCC: Acc. | MFCC: Recall | MFCC: Prec. | MFCC: F1 | MAF: Acc. | MAF: Recall | MAF: Prec. | MAF: F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anger | 97.44 | 0.97 | 0.97 | 0.97 | 100.00 | 1.00 | 0.98 | 0.99 | 100.00 | 1.00 | 0.98 | 0.99 |
| Boredom | 87.50 | 0.88 | 0.78 | 0.82 | 87.50 | 0.88 | 0.84 | 0.86 | 87.50 | 0.88 | 0.84 | 0.86 |
| Disgust | 86.67 | 0.87 | 0.93 | 0.90 | 80.00 | 0.80 | 1.00 | 0.89 | 93.33 | 0.93 | 1.00 | 0.97 |
| Fear | 85.00 | 0.85 | 0.89 | 0.87 | 90.00 | 0.90 | 0.90 | 0.90 | 95.00 | 0.95 | 0.95 | 0.95 |
| Happiness | 86.36 | 0.86 | 0.95 | 0.90 | 90.91 | 0.91 | 1.00 | 0.95 | 90.91 | 0.91 | 1.00 | 0.95 |
| Neutral | 91.67 | 0.92 | 0.88 | 0.90 | 95.83 | 0.96 | 0.88 | 0.92 | 91.67 | 0.92 | 0.88 | 0.90 |
| Sadness | 89.47 | 0.89 | 0.89 | 0.89 | 94.74 | 0.95 | 0.90 | 0.92 | 94.74 | 0.95 | 0.95 | 0.95 |
| Overall | 89.16 | 0.89 | 0.90 | 0.89 | 91.28 | 0.91 | 0.93 | 0.92 | 93.31 | 0.93 | 0.94 | 0.94 |
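Note that the per-emotion "Accuracy" values coincide with class recall expressed as a percentage (e.g., anger under raw speech: 38 of 39 test clips, 97.44%). A sketch of how such class-wise metrics can be computed with scikit-learn follows; the predictions are toy data shaped like part of the EMODB test split, not the paper's actual model outputs.

```python
from sklearn.metrics import classification_report

# Toy ground truth/predictions for three of the seven classes (39 anger, 24 boredom, 19 sadness);
# a real evaluation would use the DCNN's predictions on the held-out 30%.
y_true = ["anger"] * 39 + ["boredom"] * 24 + ["sadness"] * 19
y_pred = (["anger"] * 38 + ["boredom"]         # 38/39 anger correct -> recall 0.97
          + ["boredom"] * 21 + ["anger"] * 3   # 21/24 boredom correct -> recall 0.88
          + ["sadness"] * 19)                  # all sadness correct

# Prints per-class precision, recall, and F1 in the same form as the tables here.
print(classification_report(y_true, y_pred, digits=2))
```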
Class-wise results on RAVDESS (same column layout and abbreviations as the EMODB table above):

| Emotion | Raw: Acc. | Raw: Recall | Raw: Prec. | Raw: F1 | MFCC: Acc. | MFCC: Recall | MFCC: Prec. | MFCC: F1 | MAF: Acc. | MAF: Recall | MAF: Prec. | MAF: F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anger | 94.83 | 0.95 | 0.93 | 0.94 | 98.28 | 0.98 | 0.89 | 0.93 | 98.28 | 0.98 | 0.93 | 0.96 |
| Calm | 91.38 | 0.91 | 0.90 | 0.91 | 91.38 | 0.91 | 0.93 | 0.92 | 93.10 | 0.93 | 0.93 | 0.93 |
| Disgust | 89.66 | 0.90 | 0.93 | 0.91 | 93.10 | 0.93 | 0.96 | 0.95 | 91.38 | 0.91 | 0.96 | 0.94 |
| Fear | 89.66 | 0.90 | 0.90 | 0.90 | 93.10 | 0.93 | 0.93 | 0.93 | 93.10 | 0.93 | 0.93 | 0.93 |
| Happy | 91.38 | 0.91 | 0.95 | 0.93 | 91.38 | 0.91 | 0.91 | 0.91 | 94.83 | 0.95 | 1.00 | 0.97 |
| Neutral | 82.76 | 0.83 | 0.77 | 0.80 | 82.76 | 0.83 | 0.86 | 0.84 | 93.10 | 0.93 | 0.82 | 0.87 |
| Sadness | 91.38 | 0.91 | 0.93 | 0.92 | 94.83 | 0.95 | 0.95 | 0.95 | 91.38 | 0.91 | 0.96 | 0.94 |
| Surprised | 93.10 | 0.93 | 0.92 | 0.92 | 91.38 | 0.91 | 0.95 | 0.93 | 96.55 | 0.97 | 0.95 | 0.96 |
| Overall | 90.52 | 0.91 | 0.90 | 0.90 | 92.03 | 0.92 | 0.92 | 0.92 | 94.18 | 0.94 | 0.94 | 0.94 |
Class-wise results for the four-emotion case (anger, happy, neutral, sadness); same column layout as above, with MAF = multi-feature + 1-D DCNN:

| Emotion | Raw: Acc. | Raw: Recall | Raw: Prec. | Raw: F1 | MFCC: Acc. | MFCC: Recall | MFCC: Prec. | MFCC: F1 | MAF: Acc. | MAF: Recall | MAF: Prec. | MAF: F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anger | 86.67 | 0.87 | 0.83 | 0.85 | 91.11 | 0.91 | 0.84 | 0.87 | 95.56 | 0.96 | 0.86 | 0.91 |
| Happy | 80.00 | 0.80 | 0.95 | 0.87 | 84.44 | 0.84 | 0.95 | 0.89 | 88.64 | 0.89 | 0.95 | 0.92 |
| Neutral | 84.44 | 0.84 | 0.75 | 0.79 | 86.67 | 0.87 | 0.81 | 0.84 | 88.89 | 0.89 | 0.93 | 0.91 |
| Sadness | 82.22 | 0.82 | 0.84 | 0.83 | 82.22 | 0.82 | 0.86 | 0.84 | 86.67 | 0.87 | 0.87 | 0.87 |
| Overall | 83.33 | 0.83 | 0.84 | 0.83 | 86.11 | 0.86 | 0.86 | 0.86 | 89.94 | 0.90 | 0.90 | 0.90 |
Comparison with state-of-the-art methods (accuracy on EMODB and RAVDESS; "-" where a value was not reported):

| Method | Features | EMODB Acc. (%) | RAVDESS Acc. (%) | Trainable Parameters | Training Time (s) |
|---|---|---|---|---|---|
| 1-D Dilated CNN [23] | Raw speech | 90.00 | - | - | 3150 |
| 1-D CNN + LSTM [24] | Raw speech | 86.73 | - | - | - |
| DCNN [31] | Mel log spectrogram | 85.57 | 77.02 | - | - |
| RBFN-BiLSTM [33] | STFT | 85.57 | 77.02 | >3 M | - |
| ACRNN [29] | 3-D Mel spectrogram | 82.82 | - | - | 6811 |
| ADRNN [30] | 3-D Mel spectrogram | 88.98 | - | - | 7187 |
| Merged DCNN [27] | Log Mel spectrogram | 91.78 | - | >10 M | - |
| ResNet101 [28] | MFCC, RMS, chroma features, spectral features, spectrogram | 90.21 | 79.41 | 44.5 M | - |
| Proposed method | Raw speech | 89.16 | 90.52 | 163 M | 8650 |
| Proposed method | MFCC (39 features) | 91.28 | 92.03 | 0.192 M | 2650 |
| Proposed method | Multiple acoustic features (spectral, time-domain, voice quality) | 93.31 | 94.18 | 1.77 M | 2980 |
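The parameter gap between the raw-speech and feature-based variants is dominated by the first fully connected layer, since the stride-1 convolutions in the layer table leave the sequence length unchanged. A back-of-the-envelope check (weights only, biases and convolution kernels ignored) reproduces the order of magnitude of the 163 M raw-speech figure; applying the same arithmetic to the 715-sample feature input is only indicative, since the tabulated 0.192 M and 1.77 M counts imply further reductions not spelled out here.

```python
# First FC layer: flattened (sequence length x 128 channels) -> 20 units.
def fc_params(input_len: int, channels: int = 128, units: int = 20) -> int:
    return input_len * channels * units  # weight count only; biases ignored

print(fc_params(64_000) / 1e6)  # 163.84 M -- matches the ~163 M reported for raw speech
print(fc_params(715) / 1e6)     # 1.83 M -- same order as the 1.77 M multi-feature model
```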
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Bhangale, K.; Kothandaraman, M. Speech Emotion Recognition Based on Multiple Acoustic Features and Deep Convolutional Neural Network. Electronics 2023, 12, 839. https://doi.org/10.3390/electronics12040839