1. Introduction
As the economy grows and people’s material standards improve, there is an increasing pursuit of spiritual enrichment through music, art, literature, and live performances. Music, in particular, serves as a powerful form of expression, conveying emotions through sound vibrations, melodies, and rhythms, and holds a vital place in our daily lives. Music Emotion Recognition (MER) uses computer technology to automatically identify the emotional states expressed in music, bridging the gap between the technical and emotional aspects of musical experience. According to the China Music Industry Development Report 2022, the scale of China’s digital music industry reached CNY 79.068 billion in 2021, a year-on-year growth of 10.3% [1]. Despite the challenges of the post-epidemic landscape and intense market competition, the digital music industry continues to grow robustly, demonstrating its vitality. This surge in digital music data, coupled with an increasing demand for music information retrieval, highlights the industry’s dynamic evolution. Research indicates that emotion-related vocabulary ranks among the most common terms used in music searches and descriptions. Consequently, there is a growing need for music retrieval systems that can categorize and recommend music based on its emotional attributes. Music emotion recognition draws on multiple fields, including musicology, psychology, music acoustics, audio signal processing, natural language processing, and deep learning [2,3]; it is therefore an inherently interdisciplinary research field [4].
Most researchers undertaking MER research use supervised machine learning methods to achieve music emotion recognition. Yang [5] proposed a CNN-based emotion recognition method that converts the raw audio into a spectrogram and then feeds the spectrogram into a CNN for emotion recognition. Liu et al. [6] used the spectrogram computed by the short-time Fourier transform of the audio signal as input; each piece’s spectrogram passes through convolutional layers, pooling layers, and hidden layers, and a final SoftMax layer produces the prediction. Coutinho et al. [7] added psychoacoustic features on top of the ComParE feature set and used an LSTM–RNN to model longer contexts, capturing the time-varying emotional features of music for emotion identification. Considering the strong contextual dependence among music feature sequences and the advantages of Bi-Directional Long Short-Term Memory (BLSTM) networks in capturing sequence information, Li et al. [8] proposed a multi-scale regression model based on a deep BLSTM fused with Extreme Learning Machines (ELM). Hizlisoy et al. [9] proposed a music emotion recognition method based on a Convolutional Long Short-Term Memory Deep Neural Network (CLDNN) architecture, in which log Mel filterbank energies and Mel-Frequency Cepstral Coefficients (MFCCs) are passed to convolutional neural network (CNN) layers, and an LSTM + DNN classifier is then used to mitigate problems such as the difficulty of model selection and model overfitting. Zheng Yan et al. [10] proposed a CGRU model that combines Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU); after extracting low-level and high-level emotional features from MFCC features, random forests were used for feature selection. Xie et al. [11] proposed a method that combines frame-level speech features with attention-based LSTM recurrent neural networks to maximize the difference in emotional saturation between time frames. To speed up model training, Wang Jingjing et al. [12] combined Long Short-Term Memory networks (LSTM) with Broad Learning Systems (BLS), using LSTM as the feature-mapping nodes of the BLS to build a new wide-and-deep learning network, LSTM–BLS. Considering the effectiveness of deep audio embeddings in compressing high-dimensional features into compact representations, Koh et al. [13] used the L3-Net and VGGish deep audio embedding methods to show that deep audio embeddings can be used for music emotion recognition. Huang et al. [14] used only the log Mel spectrogram as input, employing a modified VGGNet as a Spatial Feature Learning Module (SFLM) to obtain spatial features at different levels, which are then fed into a Time Feature Learning Module (TFLM) based on Squeeze-and-Excitation (SE) attention to obtain Multiple Level Emotion Spatiotemporal Features (MLESTF). To reduce the long-distance dependency problem of Long Short-Term Memory networks in music emotion recognition, Zhong Zhipeng [15] proposed a new network model, CBSA (CNN–BiLSTM–Self-Attention).
Given the complexity and challenge of obtaining substantial, valid emotional feedback in controlled experiments, there is a notable shortage of music datasets featuring emotional annotations, especially for musical instruments. This scarcity is particularly acute in the field of musical emotion recognition for instruments like the violin. To address this gap, we have developed the VioMusic dataset, a specialized collection of violin solo audio recordings with emotional annotations. This dataset aims to facilitate the development and evaluation of Music Emotion Recognition (MER) models. Furthermore, we have introduced a CNN–BiGRU–Attention (CBA) network specifically tailored to mimic human perception of violin music emotions. This model utilizes CNNs to capture the deep emotional features inherent in the music, employs BiGRUs to decode the contextual relationships among musical emotions, and incorporates an Attention mechanism to focus on the most emotionally expressive elements of the music. The experimental results validate the effectiveness of the VioMusic dataset and confirm the accuracy and utility of the CBA model for emotion recognition in violin music.
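To make the CNN → BiGRU → attention pattern concrete, the sketch below shows one way such a regressor can be wired up in PyTorch. It is not the CBA configuration reported in this paper: all layer counts, kernel sizes, hidden widths, and the input shape are placeholder assumptions chosen only to illustrate the data flow from spectrogram features to valence/arousal outputs.

```python
import torch
import torch.nn as nn

class CBASketch(nn.Module):
    """Illustrative CNN–BiGRU–Attention regressor (placeholder sizes)."""

    def __init__(self, n_mels=128, hidden=64):
        super().__init__()
        # CNN front end: the spectrogram is treated as a 1-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        feat_dim = 32 * (n_mels // 4)          # channels x pooled Mel bins
        # BiGRU models the temporal context of the CNN feature sequence.
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        # Additive attention pools the frame sequence into one vector.
        self.att = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, 2)    # valence and arousal

    def forward(self, spec):                   # spec: (batch, 1, n_mels, frames)
        f = self.cnn(spec)                     # (batch, 32, n_mels/4, frames/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, time, feat_dim)
        h, _ = self.bigru(f)                   # (batch, time, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)               # weighted context vector
        return self.out(ctx)                   # (batch, 2)

# Shape check with a random batch:
# model = CBASketch()
# pred = model(torch.randn(4, 1, 128, 256))    # -> (4, 2)
```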
In the following sections, we demonstrate the relevance of the dataset to the field of music emotion recognition, present the data collection and emotion annotation processes together with several experimental scenarios in detail, and analyze the performance of state-of-the-art music emotion recognition methods on this dataset.
2. Related Work
Datasets are the basis of music information retrieval research. Rich databases can improve the accuracy of algorithms in the field of music information retrieval, which is of great significance for algorithm improvement [16]. Since MER began attracting attention, many datasets have been designed for it. Below is a brief overview of several common public music emotion datasets (Table 1).
The CAL500 dataset consists of 500 music tracks covering a variety of musical styles, including rock, pop, jazz, classical, and more. This dataset is characterized by a label-based approach that categorizes each music track into multiple facets and assigns values to each facet. The CAL500 contains over 17,000 annotations in total.
The DEAP dataset contains physiological responses and subjective emotional reactions of volunteers to music and video stimuli. It includes data such as EEG, ECG, and skin conductance, as well as self-reported emotional states, making it valuable for emotion recognition and affective computing research.
The emoMusic dataset is specifically designed for emotion recognition in music. It consists of audio samples labeled with emotional categories such as happy, sad, angry, and relaxed.
The DEAM dataset is a multimodal music emotion dataset containing 120 songs covering a wide range of music genres such as rock, pop, and classical. The dataset contains not only audio and textual information but also raw emotion data from physiological signals and psychological questionnaires.
MagnaTagATune is a large-scale dataset of annotated music clips collected from the Magnatune online music store. It includes audio samples labeled with a wide range of descriptive tags, covering genres, instruments, moods, and more. This dataset is often used for tasks such as music tagging, recommendation, and genre classification.
The AMG1608 dataset contains 1608 thirty-second music excerpts annotated with continuous valence–arousal ratings, making it a widely used benchmark for training and evaluating dimensional music emotion recognition models.
The FMA is a collection of freely available music tracks with associated metadata, including genre labels, artist information, and track features. It is a popular resource for researchers and music enthusiasts interested in exploring and analyzing a diverse range of music styles and characteristics.
Emotify is a dataset designed for emotion recognition in music, similar to emoMusic. It consists of audio samples labeled with emotional categories, providing a resource for training and evaluating emotion detection algorithms in music.
The PMEmo dataset contains physiological signals and self-reported emotion annotations collected from participants listening to music excerpts. It is used for research in affective computing and emotion recognition, providing data for analyzing the relationship between physiological responses and perceived emotions in music.
Author Contributions
Conceptualization, R.Z.; methodology, S.M. and R.Z.; formal analysis, S.M. and R.Z.; investigation, S.M. and R.Z.; resources, S.M. and R.Z.; data curation, S.M.; writing—original draft preparation, S.M.; writing—review and editing, S.M. and R.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
Acknowledgments
The authors express their gratitude to the generous volunteers who contributed to the VioMusic dataset. Without their involvement, creating this audio dataset would not have been possible. The authors sincerely appreciate their participation and support.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Sun, H. China’s Music Industry to Total over 378.7 Billion Yuan by 2021; China Press, Publication, Radio and Television News: Beijing, China, 2023.
- Daneshfar, F.; Kabudian, S.J.; Neekabadi, A. Speech emotion recognition using hybrid spectral-prosodic features of speech signal/glottal waveform, metaheuristic-based dimensionality reduction, and Gaussian elliptical basis function network classifier. Appl. Acoust. 2020, 166, 107360.
- Matsunaga, M.; Kikusui, T.; Mogi, K.; Nagasawa, M.; Myowa, M. Breastfeeding dynamically changes endogenous oxytocin levels and emotion recognition in mothers. Biol. Lett. 2020, 16, 20200139.
- Zhao, H.; Ning, Y.E.; Wang, R. Improving cross-corpus speech emotion recognition using deep local domain adaptation. Chin. J. Electron. 2022, 32, 640–646.
- Yang, P.T.; Kuang, S.M.; Wu, C.C.; Hsu, J.L. Predicting music emotion by using convolutional neural network. In Proceedings of the 22nd HCI International Conference, Copenhagen, Denmark, 19–24 July 2020; pp. 266–275.
- Liu, X.; Chen, Q.; Wu, X.; Liu, Y.; Liu, Y. CNN based music emotion classification. arXiv 2017, arXiv:1704.05665.
- Coutinho, E.; Weninger, F.; Schuller, B.; Scherer, K.R. The Munich LSTM-RNN approach to the MediaEval 2014 “Emotion in Music” task. In Proceedings of the CEUR Workshop Proceedings, Crete, Greece, 27 May 2014; p. 1263.
- Li, X.; Xianyu, H.; Tian, J.; Chen, W. A deep bidirectional long short-term memory based multi-scale approach for music dynamic emotion prediction. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 544–548.
- Hizlisoy, S.; Yildirim, S.; Tufekci, Z. Music emotion recognition using convolutional long short term memory deep neural networks. Eng. Sci. Technol. Int. J. 2021, 24, 760–767.
- Yan, Z.; Jianan, C.; Fan, W.; Bin, F. Research and Implementation of Speech Emotion Recognition Based on CGRU Model. J. Northeast. Univ. (Nat. Sci. Ed.) 2020, 41, 1680–1685.
- Xie, Y.; Liang, R.; Liang, Z.; Huang, C.; Zou, C.; Schuller, B. Speech emotion classification using attention-based LSTM. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1675–1685.
- Jingjing, W.; Ru, H. Music Emotion Recognition Based on Wide and Deep Learning Network. J. East China Univ. Sci. Technol. (Nat. Sci. Ed.) 2022, 48, 373–380.
- Koh, E.; Dubnov, S. Comparison and analysis of deep audio embeddings for music emotion recognition. arXiv 2021, arXiv:2104.06517.
- Huang, Z.; Ji, S.; Hu, Z.; Cai, C.; Luo, J.; Yang, X. ADFF: Attention Based Deep Feature Fusion Approach for Music Emotion Recognition. arXiv 2022, arXiv:2204.05649.
- Zhong, Z.; Wang, H.; Su, G. Music emotion recognition fusion on CNN-BiLSTM and Self-Attention Model. Comput. Eng. Appl. 2023, 59, 94–103.
- Li, Z.; Yu, S.; Xiao, C. CCMusic: Construction of Chinese Music Database for MIR Research. J. Fudan Univ. (Nat. Sci. Ed.) 2019, 58, 351–357.
- Turnbull, D.; Barrington, L.; Torres, D.; Lanckriet, G. Semantic annotation and retrieval of music and sound effects. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 467–476.
- Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.-S. DEAP: A database for emotion analysis using physiological signals. IEEE Trans. Affect. Comput. 2011, 3, 18–31.
- Baum, D. Emomusic-Classifying music according to emotion. In Proceedings of the 7th Workshop on Data Analysis (WDA2006), Kosice, Slovakia, 1–3 July 2006.
- Aljanaki, A.; Yang, Y.H.; Soleymani, M. Developing a benchmark for emotional analysis of music. PLoS ONE 2017, 12, e0173392.
- Wolff, D.; Weyde, T. Adapting similarity on the MagnaTagATune database: Effects of model and feature choices. In Proceedings of the International Conference on World Wide Web, Lyon, France, 16–20 April 2012.
- Chen, Y.A.; Yang, Y.H.; Wang, J.C.; Chen, H. The AMG1608 dataset for music emotion recognition. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 693–697.
- Defferrard, M.; Benzi, K.; Vandergheynst, P.; Bresson, X. FMA: A dataset for music analysis. arXiv 2016, arXiv:1612.01840.
- Eerola, T.; Vuoskoski, J.K. A comparison of the discrete and dimensional models of emotion in music. Psychol. Music 2011, 39, 18–49.
- Zentner, M.; Grandjean, D.; Scherer, K.R. Emotions evoked by the sound of music: Characterization, classification, and measurement. Emotion 2008, 8, 494.
- Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161.
- Reynaldo, J.; Santos, A. Cronbach’s alpha: A tool for assessing the reliability of scales. J. Ext. 1999, 37, 1–5.
- McFee, B.; Raffel, C.; Liang, D.; Ellis, D. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, Austin, TX, USA, 6–12 July 2015; Volume 8, pp. 18–25.
- Fletcher, N.H. Vibrato in music–physics and psychophysics. In Proceedings of the International Symposium on Music Acoustics, Sydney, Australia, 30–31 August 2010; pp. 1–4.
- Grekow, J. Music emotion recognition using recurrent neural networks and pretrained models. J. Intell. Inf. Syst. 2021, 57, 531–546.
- Arandjelovic, R.; Zisserman, A. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 609–617.
- Verma, G.; Dhekane, E.G.; Guha, T. Learning affective correspondence between music and image. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 3975–3979.
Figure 1. Example of musical notation annotation.
Figure 2. Russell’s Circumplex Model of Emotion.
Figure 3. Distribution chart of music fragment annotations.
Figure 4. Emotion change curve for different songs.
Figure 5. The overall structure of CNN–BiGRU–Attention.
Figure 6. Structure diagram of the CNN.
Figure 7. Structure diagram of the GRU recurrent unit.
Figure 8. Structure diagram of the GRU recurrent unit.
Figure 9. Attention mechanism model.
Figure 10. Vibrato amplitude variation curves for different songs. (a) Music clips with strong emotional expression. (b) Music clips with low emotional fluctuation.
Figure 11. The trend of actual versus predicted values.
Figure 12. Comparison of RMSE for different models. (a) Comparison between CA and CBA. (b) Comparison between CB and CBA.
Figure 13. Comparison of MSE and MAE loss functions.
Table 1. Summary of public music emotion datasets.

| Dataset | Year | Raw Audio |
|---|---|---|
| CAL500 [17] | 2008 | No |
| DEAP [18] | 2012 | Yes |
| emoMusic [19] | 2013 | Yes |
| DEAM [20] | 2013 | Yes |
| MagnaTagATune [21] | 2013 | Yes |
| AMG1608 [22] | 2015 | No |
| FMA [23] | 2016 | Yes |
| Emotify [24] | 2017 | Yes |
| PMEmo [25] | 2018 | Yes |
Table 2. Cronbach’s alpha statistics for the emotion dimensions.

| Dimension | Mean | Standard Deviation |
|---|---|---|
| Arousal | 0.775 | 0.212 |
| Valence | 0.809 | 0.221 |
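For reference, Cronbach’s alpha in Table 2 is a standard inter-annotator consistency measure, commonly computed as alpha = k/(k-1) * (1 - sum of per-annotator variances / variance of the summed scores), where k is the number of annotators. The sketch below illustrates this standard formula on a purely hypothetical rating matrix; it is not the VioMusic annotation data.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for an (n_excerpts, n_annotators) rating matrix."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                          # number of annotators
    item_var = ratings.var(axis=0, ddof=1).sum()  # per-annotator variances
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of summed scores
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical arousal ratings: 5 excerpts rated by 4 annotators.
demo = [[0.7, 0.8, 0.6, 0.7],
        [0.2, 0.3, 0.2, 0.4],
        [0.9, 0.8, 0.9, 0.8],
        [0.5, 0.4, 0.6, 0.5],
        [0.1, 0.2, 0.1, 0.3]]
print(round(cronbach_alpha(demo), 3))
```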
Table 3. Evaluation metrics results for different models.

| Model | Arousal (r) | Arousal (MAE) | Valence (r) | Valence (MAE) |
|---|---|---|---|---|
| Linear Regression | 0.459 | 0.136 | 0.517 | 0.148 |
| CNN | 0.432 | 0.176 | 0.512 | 0.146 |
| CNN–Attention | 0.480 | 0.127 | 0.460 | 0.145 |
| CNN–SelfAttention | 0.483 | 0.125 | 0.468 | 0.139 |
| CNN–BiGRU | 0.502 | 0.127 | 0.570 | 0.136 |
| CNN–BiGRU–SelfAttention | 0.562 | 0.134 | 0.516 | 0.133 |
| CNN–BiGRU–Attention | 0.612 | 0.120 | 0.599 | 0.123 |
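Tables 3–7 report Pearson’s correlation coefficient (r) and the mean absolute error (MAE) between annotated and predicted valence/arousal values. As a point of reference, the sketch below shows how these two metrics are typically computed; the arrays are placeholders, not values from the experiments.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between annotations and predictions."""
    return np.corrcoef(np.asarray(y_true), np.asarray(y_pred))[0, 1]

def mae(y_true, y_pred):
    """Mean absolute error between annotations and predictions."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Placeholder arousal values for a handful of excerpts.
y_true = np.array([0.61, 0.12, 0.85, 0.40, 0.22])
y_pred = np.array([0.55, 0.20, 0.80, 0.47, 0.30])
print(round(pearson_r(y_true, y_pred), 3), round(mae(y_true, y_pred), 3))
```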
Table 4. CBA model performance after extracting deep music emotion features using different deep learning models.

| Model | Arousal (r) | Arousal (MAE) | Valence (r) | Valence (MAE) |
|---|---|---|---|---|
| VGG16 | 0.524 | 0.124 | 0.576 | 0.129 |
| ResNet50 | 0.544 | 0.133 | 0.494 | 0.129 |
| ResNet101 | 0.446 | 0.123 | 0.465 | 0.140 |
| ResNet152 | 0.570 | 0.126 | 0.573 | 0.124 |
| InceptionV3 | 0.457 | 0.151 | 0.401 | 0.152 |
| InceptionResNetV2 | 0.438 | 0.157 | 0.409 | 0.150 |
| DenseNet121 | 0.612 | 0.120 | 0.592 | 0.130 |
| DenseNet169 | 0.579 | 0.125 | 0.599 | 0.123 |
| DenseNet201 | 0.590 | 0.125 | 0.537 | 0.128 |
| Xception | 0.410 | 0.150 | 0.420 | 0.141 |
Table 5. A comparison of state-of-the-art methods.

| Model | Arousal (MAE) | Valence (MAE) |
|---|---|---|
| RNN (124, 124 LSTM) [30] | 0.150 | 0.170 |
| L3-Net [31] | 0.136 | 0.143 |
| ACP-Net [32] | 0.131 | 0.130 |
| CNN–BiGRU–Attention | 0.120 | 0.123 |
Table 6. Impact of different data processing methods on model performance.

| Data Processing | Arousal (r) | Arousal (MAE) | Valence (r) | Valence (MAE) |
|---|---|---|---|---|
| Raw data | 0.459 | 0.147 | 0.442 | 0.139 |
| Image enhancement | 0.527 | 0.139 | 0.498 | 0.138 |
| Data standardization | 0.532 | 0.136 | 0.533 | 0.133 |
| Image enhancement and data normalization | 0.612 | 0.120 | 0.599 | 0.123 |
Table 7. Evaluation metrics results for different feature fusion methods.

| Features | Arousal (r) | Arousal (MAE) | Valence (r) | Valence (MAE) |
|---|---|---|---|---|
| Mel | 0.524 | 0.124 | 0.576 | 0.129 |
| Mel + MFCC | 0.651 | 0.114 | 0.656 | 0.118 |
| LLDs | 0.673 | 0.106 | 0.710 | 0.108 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).