Article

FFA-BiGRU: Attention-Based Spatial-Temporal Feature Extraction Model for Music Emotion Classification

by Yuping Su, Jie Chen, Ruiting Chai, Xiaojun Wu and Yumei Zhang
1 School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
2 Key Laboratory of Intelligent Computing and Service Technology for Folk Song, Ministry of Culture and Tourism, Xi’an 710119, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 6866; https://doi.org/10.3390/app14166866
Submission received: 22 May 2024 / Revised: 27 July 2024 / Accepted: 4 August 2024 / Published: 6 August 2024

Abstract
Music emotion recognition is becoming an important research direction due to its great significance for music information retrieval, music recommendation, and so on. In the task of music emotion recognition, the key to accurate recognition lies in fully extracting affect-salient features. In this paper, we propose an end-to-end spatial-temporal feature extraction method named FFA-BiGRU for music emotion classification. Taking the log Mel-spectrogram of music audio as the input, this method employs an attention-based convolutional residual module named FFA, which serves as a spatial feature learning module to obtain multi-scale spatial features. In the FFA module, three group architecture blocks extract multi-level spatial features, each of which consists of a stack of multiple channel-spatial attention-based residual blocks. Then, the output features from FFA are fed into the bidirectional gated recurrent units (BiGRU) module to further capture the temporal features of the music. In order to make full use of the extracted spatial and temporal features, the output feature maps of FFA and those of the BiGRU are concatenated in the channel dimension. Finally, the concatenated features are passed through fully connected layers to predict the emotion classification results. Experimental results on the EMOPIA dataset show that the proposed model achieves better classification accuracy than the existing baselines. Meanwhile, the ablation experiments also demonstrate the effectiveness of each part of the proposed method.

1. Introduction

In recent years, music emotion recognition (MER) has attracted widespread interest, stimulated by the growing demand for the management of massive music resources. MER is considered a useful auxiliary tool for music information retrieval and organization [1], music recommendation systems [2,3], automatic music composition [4,5], and so on. Using manual methods to obtain music emotion labels is time-consuming, labor-intensive, and error-prone. Therefore, automatic recognition of music emotion labels has emerged as a research field.
Automatic MER constitutes a process of using computers to extract and analyze music features, form the mapping relations between music features and emotion space, and recognize the emotion that music expresses [6,7]. The existing MER methods can be divided into two categories: regression and classification, according to different emotion models. The former uses the spatial position of emotion space to express human internal emotions. The latter selects finite discrete emotional labels to classify music. In this paper, we focus on the music emotion classification task and use Russell’s circumplex emotional model [8] to label music emotions based on the four quadrants of the valence-arousal space.
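For illustration, a minimal mapping from continuous valence-arousal annotations to the four quadrant labels used later in this paper might look as follows; this is only a sketch, and the zero thresholds assume annotations centered on the origin of the valence-arousal plane.

```python
# Minimal sketch: mapping valence-arousal annotations to Russell's four
# quadrants (the 4Q labels used for the EMOPIA dataset in Section 4).
# The threshold of 0 assumes valence and arousal are centered; adjust for other scales.
def to_quadrant(valence: float, arousal: float) -> str:
    if valence >= 0 and arousal >= 0:
        return "HVHA"  # high valence, high arousal (Q1)
    if valence < 0 and arousal >= 0:
        return "LVHA"  # low valence, high arousal (Q2)
    if valence < 0 and arousal < 0:
        return "LVLA"  # low valence, low arousal (Q3)
    return "HVLA"      # high valence, low arousal (Q4)

print(to_quadrant(0.7, -0.3))  # -> "HVLA"
```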
Since MER requires establishing mapping relations between music features and emotion space, the effective extraction of emotion-related features is the key to accurate classification. In early research, most studies used handcrafted features (such as pitch, tone, intensity, rhythm, etc.) and traditional machine learning methods, such as the support vector machine (SVM) [9,10], k-nearest neighbor (KNN) [11], Bayesian network [12], Gaussian mixture model (GMM) [13], and decision tree (DT) [14], to classify music emotions. These methods perform feature extraction and emotion recognition separately, with no further processing of the original handcrafted musical features. Consequently, traditional machine learning-based MER methods require significant manual feature-engineering effort and achieve only limited classification accuracy.
With the rapid development of artificial intelligence, deep learning-based music emotion recognition is gradually becoming mainstream, and it significantly improves classification accuracy through multi-layer representation and abstract learning [6]. Compared with traditional machine learning methods, deep learning-based methods reduce the burden of manual feature extraction and learn music features automatically during training. In MER studies, most works are convolutional neural network (CNN)- or recurrent neural network (RNN)-based models [15,16,17,18,19]. CNN-based methods mimic the visual perception of living creatures and can learn feature representations from data effectively. The RNN is a sequential model that is good at processing sequence data, so it is widely used in dimensional MER tasks and dynamic emotion detection [17,18,20]. In order to combine the advantages of CNNs and RNNs, some works also propose frameworks that combine a CNN and an RNN (including variants such as the Bi-RNN and LSTM) to strengthen the ability to learn useful features [21,22].
Although deep learning methods have become increasingly popular in recent years, they still face some challenges and limitations when performing music emotion recognition. For example, traditional CNN and RNN models do not consider the influence of features with different spatial positions or time intervals, and they treat all the features equally. Fortunately, the emergence of the attention mechanism has largely addressed this issue. The main idea of the attention mechanism is to introduce a dynamic weighting mechanism within the network, allowing the model to perform weighted calculations for different parts of the input, thus enabling the network to focus more on important information while ignoring irrelevant information [23]. Thus far, various attention-based models have been proposed to learn emotion-related information about music by letting models focus on the important parts of features to achieve better performance [24,25,26,27,28]. Another challenge for traditional neural networks is the potential loss of key information and performance degradation as the network depth increases; therefore, residual learning is introduced to prevent the gradient from disappearing through skip connections [29] and has also been employed in emotion recognition tasks [30,31].
Motivated by existing works, we propose an attention-based spatial-temporal feature extraction approach for music emotion classification, called the FFA-BiGRU model. In the proposed method, we first input the log Mel-spectrogram of music clips into the FFA (feature fusion attention) [32] module to extract high-level spatial features of the music audio. The FFA module consists of three group architecture blocks, each of which is a stack of channel-spatial attention-based convolutional residual blocks. The three group architecture blocks can fully extract the multi-scale spatial features of music clips, and the channel-spatial attention mechanism pays more attention to information critical for emotion classification from both the channel and spatial aspects. The output features of the three group architecture blocks are then fused together through channel-spatial attention. In order to further capture the sequence characteristics of music audio, the bidirectional gated recurrent units (BiGRU) module is employed to learn the temporal features after the FFA module. Then, the output feature maps of FFA and those of the BiGRU are concatenated in the channel direction. Finally, the concatenated features are passed through fully connected layers to predict the emotion classification results.
In summary, the main contributions of this paper are as follows:
(1)
We propose an end-to-end spatial-temporal feature extraction method for MER called FFA-BiGRU. The proposed model fully considers the spatial and temporal properties of the log Mel-spectrogram features of music audio and extracts rich emotion-related spatial-temporal features through the combination of the FFA and BiGRU modules. The experimental results show that the integration of FFA and BiGRU is effective and can achieve a better classification performance than the existing baselines.
(2)
In the proposed FFA-BiGRU model, we extract multi-scale spatial features through three group architecture blocks, and each of them is a stack of multiple channel-spatial attention-based residual blocks. The channel-spatial attention mechanism can effectively highlight features that are critical to music emotion classification at both the channel and spatial levels. Moreover, the concatenation of spatial features from FFA and the temporal features from the BiGRU can fully retain the spatial and temporal features of music, which can discriminate emotions well in the final emotion space.
(3)
Finally, we conduct extensive comparison experiments and an ablation study on the EMOPIA dataset [33], which demonstrate the effectiveness of the network architecture as well as each component of the approach, including the effectiveness of the channel-spatial attention mechanism, the optimal number of group architecture blocks, and the optimal number of network layers in these blocks.
The remainder of this paper is organized as follows. Related works of this paper are given in Section 2. Section 3 illustrates the architectural details of the proposed model. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of our method in Section 4. Finally, Section 5 concludes this paper and discusses several future research directions.

2. Related Works

In general, MER methods are classified into two types: traditional machine learning approaches and deep learning approaches. Meanwhile, since its emergence, the attention mechanism has been widely adopted in deep learning models and improves MER accuracy. Therefore, machine learning methods and deep learning methods (including their integration with the attention mechanism) are discussed in detail in the following subsections.

2.1. Machine Learning Method for MER

For music emotion recognition, commonly used machine learning approaches include SVM, the Gaussian mixture model, KNN, naïve Bayes, and so on. The representative works using traditional machine learning methods are summarized in the first half of Table 1. Specifically, the authors in [11] used rhythm patterns as music features and applied KNN and a self-organizing map (SOM) to predict the emotions of a set of children's songs. Lu et al. [13] proposed a hierarchical Gaussian mixture model to automate the task of mood detection, in which three types of music features (intensity, timbre, and rhythm) are extracted to represent the characteristics of a music clip; the hierarchical framework can emphasize the most suitable features in different detection tasks. Kim et al. [34] performed music emotion classification based on lyrics using three machine learning methods: naïve Bayes, the hidden Markov model, and SVM; the results showed that the SVM method performed best. Malheiro et al. [35] used SVM to evaluate musical emotions based on three novel lyric features: slang presence, structural analysis features, and semantic features. Hu et al. [36] collected physiological signals from wearable devices, trained four classification models, and found that the KNN model achieved the best performance. Xu et al. [37] introduced source separation into a standard music emotion recognition system and extracted a combined 84-dimensional feature vector, consisting of a 72-dimensional acoustic feature vector and a 12-dimensional chroma feature vector, for music emotion recognition. An SVM classifier was then employed for emotion prediction, and the experimental results verified that source separation can effectively improve MER performance.
Each of these traditional machine learning methods has inherent strengths and limitations. On the one hand, machine learning methods explicitly extract features from music data, allowing for a clear understanding of which features are crucial for music emotion recognition tasks. On the other hand, machine learning methods also have a significant drawback: manually extracted features may not cover the full spectrum of relevant characteristics, and there is a lack of further feature extraction beyond the initial selection.

2.2. Deep Learning Method and Attention Mechanism for MER

With the rapid development of deep learning, significant work has been conducted for MER by constructing various deep learning network architectures, and the accuracy has been greatly improved in recent years. The most widely used network is the convolutional neural network, which emulates biological visual perception and can effectively extract feature representations from music spectrogram data. For example, a deep CNN on music spectrograms is proposed for music emotion classification in [15]. With the proposed method, no additional effort is required to extract specific features, as this is left to the training procedure of the CNN model. Keelawat et al. [38] used electroencephalogram (EEG) signals as the input feature to realize emotion recognition during music listening; CNNs with three to seven convolutional layers were employed, and performance was evaluated on a binary classification task. Yang et al. [19] applied the constant-Q transform on music objects to derive the spectrogram and then took the spectrogram as the input of the CNN model to predict the dimensional emotion of music objects.
In addition, the BiLSTM (bidirectional long short-term memory) model, as a two-way recurrent neural network with long short-term memory, is often used in emotion classification tasks due to its ability to process sequence data and maintain long-term memory. Weninger et al. [20] employed a deep RNN structure for online continuous-time music mood regression. The study first extracted a large set of segmental acoustic features and then performed multi-variate regression using deep recurrent neural networks. The results showed that the deep RNN outperformed SVR and feedforward neural networks in both continuous-time and static music mood regression. In [17], a deep bidirectional long short-term memory (DBLSTM)-based multi-scale regression method was proposed for dynamic music emotion prediction, in which a fusion component was introduced to integrate the outputs of all DBLSTM models with different scales. The experimental results show that the proposed method achieves a significant improvement when compared with state-of-the-art methods. The representative works using deep learning methods are also summarized in Table 1.
More recently, the attention mechanism has also been combined with deep learning methods, further improving accuracy on MER tasks. In [24], the authors proposed multi-scale context-based attention (MCA) using LSTM for dynamic music emotion prediction. The proposed MCA mechanism pays different attention to the previous contexts of music at different time scales, and multi-scale models fused with attention can dynamically learn deep representations of music structure, leading to better performance. In [39], a structure combining 3D convolutions and attention-based sliding recurrent neural networks (ASRNNs) was proposed for speech emotion recognition. The 3D convolution model captures both the local features and the periodicity information of emotional speech, while the ASRNN extracts continuous segment-level internal representations and focuses on salient emotion regions using a temporal attention model. The authors in [40] proposed two attention-based methods built on a VGG-ish architecture for music emotion recognition. The first method used self-attention to replace the spatial convolutions in the later layers of the VGG-ish network, and the second used element-wise attention-based rectified linear units (ReLUs) in all layers of the baseline VGG-ish network. The experimental results show that the first method can match the baseline performance with fewer computations and parameters, and the second can outperform the baseline without increasing the number of parameters.
In [25], a novel attention-based joint feature extraction model was proposed for static MER. It utilizes the CNN to learn emotion-related features through the filter bank and log Mel-spectrogram and further uses location-aware attention and self-attention [23] mechanisms to obtain salient emotion-related features. The authors in [27] proposed an end-to-end attention-based deep feature fusion (ADFF) approach for MER. The proposed model first uses an adapted VGGNet as a spatial feature learning module and then uses a squeeze-and-excitation (SE) attention-based [41] temporal feature learning module to obtain multi-level emotion-related spatial-temporal features. The experiments show that the combination of a spatial-temporal feature extractor and SE attention can achieve a better performance than the state-of-the-art model. In [28], a short-chunk CNN model with multi-head self-attention, called SCMA, and a BiLSTM model with multi-head self-attention, called BiLMA, are proposed for MER. It shows that the multi-head self-attention mechanism can effectively capture relevant information from features for emotion recognition tasks.
Moreover, there are also some studies using a multimodal network to predict music emotion by combining various features such as symbolic, acoustic, and lyric features. In particular, a multimodal neural network for MER is proposed in [26], where audio features, lyric features, and context features are extracted separately and fused by a cross-modal attention mechanism. Similarly, a multimodal multifaceted MER method is proposed in [30], where symbolic and acoustic features are extracted from both MIDI and audio data and integrated with a self-attention mechanism. In [42], the authors propose an end-to-end one-dimensional residual temporal and channel attention network (RTCAN-1D) to fuse the subject’s individual EDA features and the external evoked music features. The experiments show that the proposed method outperforms the existing state-of-the-art models. The representative works utilizing attention mechanisms are also summarized in Table 1.
Overall, compared with traditional machine learning models, deep learning methods, along with their combination with attention mechanisms, can automatically extract more useful features for MER and lead to significant improvements in MER accuracy.
Table 1. Representative works of MER.

| Method | Reference | Year | Input Features | Learning Model |
|---|---|---|---|---|
| Machine learning | [13] | 2006 | Intensity, timbre, and rhythm | GMM |
| | [11] | 2010 | Rhythm patterns | KNN, SOM |
| | [34] | 2011 | Emotion vocabulary | NB, HMM, SVM |
| | [37] | 2014 | Acoustic feature, chroma feature | SVM |
| | [35] | 2018 | Slang presence, structural analysis features, semantic features | SVM |
| | [36] | 2018 | Physiological signals | SVM, NB, KNN, and DT |
| Deep learning methods | [20] | 2014 | A large set of acoustic features | Deep RNN |
| | [17] | 2016 | Low-level acoustic features | BiLSTM |
| | [15] | 2017 | Spectrogram | CNN |
| | [38] | 2019 | EEG | CNN |
| | [19] | 2020 | Spectrogram | CNN |
| Deep learning + attention | [24] | 2017 | MFCCs, spectral flux, centroid, entropy, slope, etc. | LSTM + MCA |
| | [25] | 2021 | Log Mel-spectrogram and filter bank spectrogram | CNN + local attention and global self-attention + GRU-SVM |
| | [27] | 2022 | Log Mel-spectrogram | VGGNet + SE attention + BiLSTM |
| | [42] | 2022 | EDA, external evoked music features | Residual channel-temporal attention |
| | [26] | 2022 | Mel-spectrogram, lyrics, track name, and artist | CNN + cross-modal attention |
| | [28] | 2023 | Mel-spectrogram/MIDI-like representation | Short-chunk CNN/BiLSTM + multi-head self-attention |
| | [30] | 2023 | MIDI-like representation, MFCCs, chromagram | BiGRU + CNN + multi-head self-attention |

3. Proposed FFA-BiGRU Method

For music emotion recognition, past works have shown that spectral features play an important role in identifying emotion [27,28]; a spectrogram is a good representation of an audio clip and contains all of the physical information of the original audio [43]. Therefore, we choose the spectrogram as the model input, since it summarizes spectral information in a concise form. Furthermore, the frequency scale of the spectrogram is converted from a linear scale to a mel scale, because the mel scale better matches human auditory perception. As a result, the log Mel-spectrogram of music audio is used as the model input in this paper. The log Mel-spectrogram is a pictorial representation that, at the same time, reflects the temporal changes of frequency; that is, it has both spatial and temporal characteristics. Therefore, we propose a spatial-temporal feature extraction model, called FFA-BiGRU, for music emotion classification.
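As a concrete illustration, the snippet below computes such a log Mel-spectrogram with TorchAudio, using the settings reported later in Section 4.1.3 (22,050 Hz sampling rate, 128 Mel bins, FFT size 2048, hop size 1024); the audio path is a placeholder.

```python
# Minimal sketch of producing the log Mel-spectrogram input assumed in this section.
import torch
import torchaudio

waveform, sr = torchaudio.load("clip.wav")                      # placeholder path to a music clip
waveform = torchaudio.functional.resample(waveform, sr, 22050)  # resample to 22,050 Hz
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=2048, hop_length=1024, n_mels=128  # Hann window by default
)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)            # log-scaled Mel-spectrogram
print(log_mel.shape)                                            # (channels, 128, n_frames)
```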
The proposed model is shown in Figure 1, which consists of three modules: a multi-level spatial feature learning module, a temporal feature learning module, and an emotion prediction module. Specifically, the spatial feature learning module is an attention-based convolutional residual network called FFA [32], which mainly consists of three group architecture (GA) blocks to learn multi-scale spatial features. Each group architecture is a global residual block that combines 19 basic blocks with local residual learning, and each basic block combines residual learning with a channel-spatial attention module. To further capture the temporal features of music audio, the BiGRU module is introduced after the spatial feature learning module to capture the sequential features. Then, the output feature maps of FFA and those of the BiGRU are concatenated in the channel direction. Finally, the fully connected (FC) layer is used to provide the final emotion prediction result. In the following subsections, we will provide the network details of each module in the proposed model.

3.1. Multi-Level Spatial Feature Learning

In reference [32], the channel-spatial attention-based feature fusion module FFA was proposed and has shown great effectiveness in image processing. In order to extract spatial features from the log Mel-spectrogram of music audio, we utilize the architecture of the FFA module as the multi-level spatial feature learning module in our method.
As shown in Figure 1, the spatial feature learning module first passes the log Mel-spectrogram of music audio into a convolutional layer containing 64 filters with a filter size of 3 × 3 to perform shallow feature extraction. It then feeds the output into three group architecture blocks to extract multi-level deep spatial features. The output features from the three group architecture blocks are first concatenated in the channel direction and further fused together through channel-spatial attention. After that, the fused features are passed through two convolutional layers with a filter size of 3 × 3 to further extract the spatial features. To avoid losing useful information about the original log Mel-spectrogram, a global skip connection is introduced between the original input and the final output of the spatial feature learning module.
In the spatial feature learning module, group architecture is a key component for extracting multi-level deep spatial music features. In particular, each group architecture consists of 19 basic blocks with local residual learning, and each basic block is a convolutional residual subnet with channel-spatial attention. Therefore, in the following subsections, we will introduce channel-spatial attention, group architecture (including its basic blocks), and the spatial feature fusion strategy successively.

3.1.1. Channel-Spatial Attention

Since the log Mel-spectrogram of music is a pictorial representation, we believe that different spatial regions have varying degrees of importance for emotion classification. Meanwhile, different feature maps from the CNN-based spatial feature learning module also play distinct roles in emotion recognition. Therefore, the group architecture subnet employs channel-spatial attention to obtain importance at both the channel and spatial levels, as shown in Figure 2. The channel-spatial attention weights are calculated as follows.
Channel Attention (CA): Channel attention is mainly used to expand the representation ability of the spatial feature learning module by assigning different weights to each feature map. This allows it to learn the importance of each channel adaptively, thereby better capturing the emotion information. The realization of the CA block is shown in Figure 2. First, the input feature maps $x$ with dimension $H \times W \times C$ are fed into a global average pooling layer to squeeze the global spatial information into a channel descriptor $\mathbf{z} \in \mathbb{R}^{C}$, and the $c$-th element of $\mathbf{z}$ is given as:

$$z_c = F_{gp}(x_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j),$$

where $x_c$ refers to the $c$-th channel feature map of $x$, $x_c(i, j)$ represents the value of $x_c$ at position $(i, j)$, and $F_{gp}(\cdot)$ denotes the global pooling function. After global average pooling, the feature shape changes from $H \times W \times C$ to $1 \times 1 \times C$.
Next, to obtain the channel attention weights, the average pooled feature maps are then fed into two convolutional layers, followed by a ReLU activation function and a sigmoid activation function, respectively. The channel attention weights are finally given by:
$$s_{CA} = \sigma(\mathrm{Conv}(\delta(\mathrm{Conv}(\mathbf{z})))),$$

where $\sigma(\cdot)$ represents the sigmoid function and $\delta(\cdot)$ denotes the ReLU function. Both Conv layers use $1 \times 1$ convolutions, with $C'$ and $C$ kernels, respectively. For each basic block, the two Conv layers in its channel attention module have filter numbers $C' = 8$ and $C = 64$, respectively.
Finally, the output of the CA block is obtained by scaling the input feature maps $x$ with the attention weight vector $s_{CA}$ as follows:

$$x_{CA} = s_{CA} \otimes x,$$

where $\otimes$ denotes element-wise multiplication, and the attention values are broadcast along the spatial dimensions during multiplication.
Spatial Attention (SA): Considering that different regions of the music audio features may have different importance for emotion classification, the output from the CA block is subsequently fed into a spatial attention (SA) block. This allows the network to focus more on the important regions.
As shown in Figure 2, the SA weight is computed by feeding the output $x_{CA}$ of the CA block into two convolutional layers, where the first layer is followed by a ReLU activation function and the second by a sigmoid activation function. Both convolutional layers use $1 \times 1$ convolution filters, with $\hat{C} = 8$ and $1$ filters, respectively. Therefore, the dimension of the SA weight is $H \times W \times 1$. The SA weight is given by:

$$s_{SA} = \sigma(\mathrm{Conv}(\delta(\mathrm{Conv}(x_{CA})))).$$

Finally, the output of the SA block is obtained by scaling the input feature maps $x_{CA}$ with the attention weight $s_{SA}$ as follows:

$$x_{SA} = s_{SA} \otimes x_{CA},$$

where $\otimes$ again denotes element-wise multiplication, and the SA weight values are broadcast along the channel dimension during this multiplication.
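To make the computation concrete, the following is a minimal PyTorch sketch of the channel-spatial attention block described above; the module and variable names are ours, and the filter numbers follow the basic-block setting ($C' = 8$, $C = 64$, $\hat{C} = 8$).

```python
# Sketch of channel-spatial attention: channel weights s_CA, then spatial weights s_SA.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int = 64, reduced: int = 8):
        super().__init__()
        # Channel attention: global average pool -> 1x1 conv (C -> C') -> ReLU -> 1x1 conv (C' -> C) -> sigmoid
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, reduced, 1), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention: 1x1 conv (C -> C^) -> ReLU -> 1x1 conv (C^ -> 1) -> sigmoid
        self.sa = nn.Sequential(
            nn.Conv2d(channels, reduced, 1), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, 1, 1), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (B, C, H, W)
        x_ca = x * self.ca(x)             # channel weights broadcast over H, W
        return x_ca * self.sa(x_ca)       # spatial weights broadcast over channels

# quick shape check
attn = ChannelSpatialAttention()
print(attn(torch.randn(2, 64, 128, 64)).shape)  # torch.Size([2, 64, 128, 64])
```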

3.1.2. Group Architecture Block: Deep Spatial Feature Extractor

For the FFA module in Figure 1, the log Mel-spectrogram is first fed into a convolution layer to extract the shallow information of music; then, the output features are sent into three group architecture blocks to extract multi-level deep spatial features. Each group architecture block is a global residual block that contains 19 basic blocks with the same structure, as shown in Figure 1. The stacking of multiple basic blocks increases the depth and enhances the expressiveness of the network to learn high-level emotion-related information. A long shortcut connection is introduced between the first and the last basic blocks to avoid losing useful information. Each of the three group architecture modules can learn spatial features of different levels. Since each group architecture block consists of a stack of multiple basic blocks, we provide a detailed introduction of the basic block in the following.
Basic Block: The basic subnet in the group architecture is the basic block, and its detailed realization is given in Figure 1. It is a local residual learning structure consisting of two convolutional layers, followed by a ReLU layer and a channel-spatial attention module, respectively. Local residual learning improves training stability and allows less important information to be bypassed through multiple local residual connections while the main network focuses on effective information. In the basic block, two convolutional layers (each containing 64 filters with a filter size of 3 × 3) are used to extract local spatial features, and the channel-spatial attention mechanism allows the network to focus on important information at both the channel and spatial levels.
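Building on the attention sketch above, the following illustrative code assembles the basic block and a group architecture block as described in this subsection (two 3 × 3 convolutions with a ReLU and channel-spatial attention, a local residual connection, and a long skip over the stacked blocks). It reuses the ChannelSpatialAttention class from the previous snippet; the trailing convolution in the group block is an assumption borrowed from the FFA design in [32].

```python
# Sketch of the basic block and the group architecture (GA) block.
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = ChannelSpatialAttention(channels)   # defined in the previous sketch

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))      # conv -> ReLU -> conv
        out = self.attn(out)                            # channel-spatial attention
        return out + x                                  # local residual connection

class GroupArchitecture(nn.Module):
    def __init__(self, channels: int = 64, n_blocks: int = 19):
        super().__init__()
        self.blocks = nn.Sequential(*[BasicBlock(channels) for _ in range(n_blocks)])
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)   # assumed trailing conv, as in FFA-Net

    def forward(self, x):
        return self.conv(self.blocks(x)) + x            # long (global) skip over the whole group
```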

3.1.3. Spatial Feature Fusion Strategy

Channel-spatial attention-based feature fusion: Each GA block extracts spatial features at a different level. To make full use of all the emotion-related information, we first concatenate the output spatial feature maps of the three group architecture blocks in the channel direction, as shown in Figure 1. We then multiply the output feature maps of the three group architecture blocks by the corresponding channel attention weights and fuse the three weighted feature maps by element-wise summation. The fused features are further passed into the spatial attention module to assign different weights at the spatial level. The channel and spatial attention here employ the same structure as that in Figure 2, except that the numbers of filters in the two Conv layers of the channel attention module become 4 and 192, respectively. Finally, the output features of the channel-spatial attention are passed through two Conv layers (with 64 and 1 filters, respectively, each with a filter size of 3 × 3) to further learn high-level spatial features.
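The sketch below puts the pieces of the spatial feature learning module together, reusing the GroupArchitecture class above. The channel-attention weights in the fusion stage are computed over the 192-channel concatenation (reduced to 4 channels, as stated in the text), split per group, and used for a weighted element-wise sum; the spatial-attention filter numbers in this stage are assumed to match those of the basic blocks, since the text only specifies the channel-attention filters.

```python
# Sketch of the full FFA spatial feature learning module with attention-based fusion.
import torch
import torch.nn as nn

class FFA(nn.Module):
    def __init__(self, channels: int = 64, n_groups: int = 3):
        super().__init__()
        self.shallow = nn.Conv2d(1, channels, 3, padding=1)        # shallow feature extraction
        self.groups = nn.ModuleList([GroupArchitecture(channels) for _ in range(n_groups)])
        self.ca = nn.Sequential(                                   # channel attention over the concatenation
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels * n_groups, 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(4, channels * n_groups, 1), nn.Sigmoid(),
        )
        self.sa = nn.Sequential(                                   # spatial attention on the fused maps (assumed sizes)
            nn.Conv2d(channels, 8, 1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, 1), nn.Sigmoid(),
        )
        self.tail = nn.Sequential(                                 # two 3x3 convs with 64 and 1 filters
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, mel):                                        # mel: (B, 1, n_mels, T)
        x = self.shallow(mel)
        feats = [g(x) for g in self.groups]                        # multi-level spatial features
        w = self.ca(torch.cat(feats, dim=1))                       # (B, 3*C, 1, 1) channel weights
        w = w.chunk(len(feats), dim=1)                             # one weight vector per group
        fused = sum(wi * fi for wi, fi in zip(w, feats))           # weighted element-wise summation
        fused = fused * self.sa(fused)                             # spatial attention
        return self.tail(fused) + mel                              # global skip to the input spectrogram
```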

3.2. Temporal Feature Learning and Emotion Prediction

Temporal feature learning: Music audio sequences tend to be long and exhibit strong temporal dependencies. The time-dependent information in the sequences may not be captured using the spatial feature extractor alone. To further capture temporally related emotion features, we employed the BiGRU to learn the temporal features following the spatial feature learning module. The structure of the BiGRU is given in Figure 1. The BiGRU is a variant of BiLSTM with fewer parameters and higher learning efficiency. By incorporating the BiGRU layer, the model can learn to extract the temporal features in both the forward and backward directions, which helps to improve the overall performance of the model.
The BiGRU is a bidirectional GRU that constructs two GRU layers in opposite directions. The basic GRU unit in the BiGRU is depicted in Figure 3, and the main calculation formula for the GRU is as follows:
$$r_t = \sigma(W_r [h_{t-1}, x_t]),$$
$$z_t = \sigma(W_z [h_{t-1}, x_t]),$$
$$\hat{h}_t = \tanh(W_{\hat{h}} [r_t \times h_{t-1}, x_t]),$$
$$h_t = (1 - z_t) \times h_{t-1} + z_t \times \hat{h}_t,$$

where $r_t$, $z_t$, $\hat{h}_t$, and $h_t$ represent the outputs of the reset gate, the update gate, the candidate hidden state, and the GRU unit, respectively; $W_r$, $W_z$, and $W_{\hat{h}}$ denote the corresponding state weight matrices; and $\sigma$ denotes the sigmoid function.
For the BiGRU model, which superimposes two single-layer GRUs in opposite directions, the output is determined by the states of both GRUs. Specifically, the forward output of the BiGRU is $\overrightarrow{h_t}$ and the backward output is $\overleftarrow{h_t}$; the output of the BiGRU is then given as $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$.
After the BiGRU module processes the music features, the temporal features are obtained. To fully utilize both the spatial and temporal features, the output feature maps of the FFA and those from the BiGRU are concatenated in the channel direction. The concatenated spatial-temporal features are subsequently used to predict the emotion results.
Emotion prediction: By passing the log Mel-spectrogram into spatial and temporal feature learning modules, the spatial-temporal features are sufficiently extracted and concatenated. To predict the music emotion results, the concatenated features from the FFA and BiGRU are first flattened and then fed into two FC layers. These layers map the spatial-temporal features into the emotion space. The structure of the FC layer is given in Figure 4, where the first layer consists of 32 nodes and the second consists of 4 nodes.
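As an illustration of the full pipeline described in Sections 3.1 and 3.2, the sketch below feeds the FFA output to a bidirectional GRU along the time axis, concatenates the flattened FFA and BiGRU features, and applies the two FC layers (32 and 4 units). It reuses the FFA class sketched earlier; the GRU hidden size and input shapes are illustrative assumptions rather than the exact configuration.

```python
# Sketch of the overall FFA-BiGRU model: spatial features -> temporal features -> FC prediction.
import torch
import torch.nn as nn

class FFABiGRU(nn.Module):
    def __init__(self, n_mels: int = 128, hidden: int = 64, n_classes: int = 4):
        super().__init__()
        self.ffa = FFA()                                           # spatial feature learning (previous sketch)
        self.bigru = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.LazyLinear(32), nn.ReLU(), nn.Linear(32, n_classes))

    def forward(self, mel):                                        # mel: (B, 1, n_mels, T)
        spatial = self.ffa(mel)                                    # (B, 1, n_mels, T) spatial feature map
        seq = spatial.squeeze(1).transpose(1, 2)                   # (B, T, n_mels): frames as a sequence
        temporal, _ = self.bigru(seq)                              # (B, T, 2 * hidden) bidirectional features
        fused = torch.cat([spatial.flatten(1), temporal.flatten(1)], dim=1)   # concatenate and flatten
        return self.head(fused)                                    # logits over the four emotion classes

model = FFABiGRU()
logits = model(torch.randn(2, 1, 128, 256))                        # e.g., a batch of two log Mel-spectrograms
```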

4. Experiments and Results

4.1. Experimental Setup

4.1.1. Dataset

We evaluated our method on the EMOPIA dataset [33], which consists of both audio and corresponding symbolic music files for music emotion recognition. In this paper, we focused on the emotion recognition of the audio files. The dataset contains 1087 music clips from 387 pop piano music songs. These clips were labeled into four emotion classes using Russell’s 4Q model [8], including HVHA (high valence, high arousal), HVLA (high valence, low arousal), LVHA (low valence, high arousal), and LVLA (low valence, low arousal). The clips in the EMOPIA dataset are divided into train-validation-test splits with a ratio of 7:2:1.

4.1.2. Evaluation Metrics

The performance of our model is measured in terms of accuracy, precision, confusion matrix, and AUC (area under the receiver operating characteristic curve). Accuracy, the most common evaluation metric in classification tasks, calculates the proportion of correctly classified samples to the total number of samples. However, it can be misleading on unbalanced data, so precision is also introduced to evaluate the model's performance. Precision is the proportion of true-positive samples to the total number of samples predicted as positive.
The confusion matrix is an n × n matrix, where n is the number of categories for classification. The column labels represent the true category, and the row labels represent the predicted category. Each value in the confusion matrix denotes the probability (or the number of samples) of one category that is predicted as another specified class. It is helpful for understanding the performance of the classification models for each category of samples.
The AUC is a performance metric for classification tasks under different thresholds. It is calculated based on the relationship between the true-positive rate (TPR) and the false-positive rate (FPR) at different classification thresholds. The calculations of the TPR and FPR are given as follows:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN},$$
where TP stands for true positive and represents the number of positive instances correctly predicted as positive by the classification model; similarly, FN stands for false negative, FP for false positive, and TN for true negative. For a useful classifier, the AUC ranges from 0.5 (random guessing) to 1, and the closer the value is to 1, the better the performance. The AUC provides a comprehensive measure of a model's classification ability by considering its performance across different thresholds. Therefore, our study employed both the AUC and the confusion matrix, along with accuracy and precision, as evaluation metrics.
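For reference, these metrics can be computed as in the following sketch using scikit-learn; the label and score arrays are placeholders, and note that scikit-learn places the true classes on the rows of the confusion matrix.

```python
# Sketch of computing accuracy, precision, confusion matrix, and multi-class AUC.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, roc_auc_score

y_true = np.array([0, 1, 2, 3, 0, 2])          # placeholder ground-truth 4Q labels
y_pred = np.array([0, 1, 2, 1, 0, 2])          # placeholder predicted labels
y_score = np.random.dirichlet(np.ones(4), 6)   # placeholder class probabilities (rows sum to 1)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
cm = confusion_matrix(y_true, y_pred, normalize="true")   # row-normalized: rows are true classes
auc = roc_auc_score(y_true, y_score, multi_class="ovr")   # one-vs-rest AUC for the four classes
print(acc, prec, auc, cm, sep="\n")
```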

4.1.3. Implementation Details

We implemented the proposed model using Python 3.8 in the PyTorch framework on an NVIDIA RTX 3090 Ti GPU from Beijing Yuangui Interactive Technology Co., Ltd., Beijing, China. We utilized TorchAudio 2.1.0 to extract the Mel-spectrogram with 128 Mel bins using a 2048-size FFT (with a Hanning window) and a hop size of 1024 at a sampling rate of 22,050 Hz. Moreover, we employed the Adam optimizer for training with a learning rate of 1 × 10−3, a weight decay of 1 × 10−4, 200 training epochs, and a batch size of 16. We used cross-entropy loss as the training loss function.
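A minimal sketch of this training configuration is shown below; the dataset object is a random stand-in for the EMOPIA training split, and FFABiGRU refers to the model sketched in Section 3.

```python
# Sketch of the reported training setup: Adam (lr 1e-3, weight decay 1e-4),
# batch size 16, 200 epochs, cross-entropy loss.
import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy stand-in for the EMOPIA training split: random "log Mel-spectrograms" and 4Q labels
train_set = TensorDataset(torch.randn(32, 1, 128, 256), torch.randint(0, 4, (32,)))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = FFABiGRU().to(device)                            # model sketched in Section 3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()
loader = DataLoader(train_set, batch_size=16, shuffle=True)

for epoch in range(200):
    for log_mel, label in loader:
        log_mel, label = log_mel.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(log_mel), label)
        loss.backward()
        optimizer.step()
```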

4.2. Results and Analysis

To demonstrate the advantages of our proposed model and explore the role of different modules, we chose short-chunk CNN [44,45], CBAM [45], BiLSTM + Attention [33,46], ADFF [27], and the latest SCMA [28] for comparison. Moreover, to verify the effectiveness of the combination of FFA and BiGRU in our method, we also constructed two other models, called FFA + BiLSTM and CBAM + BiGRU, as baselines. The details of the baseline models are given as follows:
Short-chunk CNN: a seven-layer CNN with residual connections, whose output is obtained through max pooling, followed by two fully connected layers with ReLU activations for classification.
CBAM: a convolutional block attention module that includes a channel attention map and a spatial attention map.
BiLSTM + Attention: a combination of a bidirectional LSTM and a self-attention module.
ADFF: an attention-based deep feature fusion approach that sequentially uses an adapted VGGNet as the spatial feature learning module and a squeeze-and-excitation (SE) attention-based module as the temporal feature learning module.
SCMA: this is a short-chunk CNN with multi-head self-attention.
Table 2 summarizes the accuracy, precision, and AUC performance of the proposed model and the seven baselines. The proposed model outperforms all the baseline models, with an accuracy of 69.8%, a precision of 70.3%, and an AUC score of 0.91. This suggests that our model can effectively capture more relevant information from the features for emotion recognition tasks. Moreover, the proposed model also outperforms the FFA + BiLSTM and CBAM + BiGRU models, which indicates that the combination of FFA and the BiGRU in the proposed model is effective for extracting high-level emotion-related information.
To further check the performance of the proposed model for each class of samples in the dataset, the confusion matrix of accuracies is provided in Figure 5. It shows that the accuracies of the HVHA and LVLA categories are relatively high, while those of the HVLA and LVHA categories are relatively low. This coincides with our intuition, since music with extremely high or low emotional intensity is easier to distinguish, while clips with moderate emotional intensity are more prone to misclassification.

4.3. Ablation Study

In this subsection, we empirically demonstrate the effectiveness of our design choices. First, the impact of channel attention and spatial attention during spatial feature fusion is verified. Next, the performance of the group architecture blocks with varying numbers is analyzed, as well as that of the basic block structures. Finally, the effectiveness of combining FFA and BiGRU in our method is also assessed.
Effect of Channel Attention and Spatial Attention for Spatial Feature Fusion. For the spatial feature learning module, channel and spatial attention mechanisms are introduced in each basic block of the group architecture blocks to focus more on the important channels and spatial regions. Meanwhile, to integrate multi-level features from the three group architecture blocks, channel and spatial attention mechanisms are used again to further fuse the concatenated features from these blocks. Therefore, to demonstrate the necessity and effectiveness of the attention mechanism for spatial feature fusion, the ablation study results are presented in Table 3 by removing either channel attention (named Model 1) or spatial attention (named Model 2). It shows that the accuracy drops to 68.6% and 67.4% when removing channel attention or spatial attention, respectively. Similarly, precision and the AUC performance also decrease when either of the two attention mechanisms is removed. This suggests that the combination of channel and spatial attention in spatial feature fusion is necessary and beneficial, even though these attention mechanisms have already been used in the group architecture blocks.
Effect of varying the number of group architecture blocks and basic block structures. In the proposed model, we utilized three group architecture blocks to learn the multi-level spatial features, and each group architecture block consisted of 19 basic block structures. To verify that our selection is optimal, we explored the impact of varying the number of group architecture blocks and basic block structures on the classification performance, as shown in Table 4 and Table 5. Table 4 reveals that different numbers of group architecture blocks result in discrepancies in accuracy performance. For the EMOPIA dataset, the model achieves the best performance when the number of group architecture blocks is 3. To further determine the optimal number of basic blocks within each group architecture, Table 5 displays the performance of models with varying numbers of basic blocks in a group architecture, with the number of group architecture blocks fixed at 3. It indicates that the model performs best when the number of basic blocks is 19.
Effect of combining FFA and BiGRU. To further demonstrate the effectiveness of combining the spatial feature learning module (FFA) with the temporal feature learning module (BiGRU), we conducted an ablation study using only the FFA module or only the BiGRU module. As shown in Table 6, using either module alone results in performance degradation: the accuracy of the FFA-only and BiGRU-only models is 64.0% and 66.3%, respectively, while that of the proposed model, which combines the two, is 69.8%. This indicates that the combination of spatial and temporal feature extractors captures the underlying information in music audio more comprehensively and that extracting both spatial and temporal features is crucial for music emotion classification.

5. Conclusions

In this work, we propose an attention-based spatial-temporal feature extractor for music emotion classification. The proposed model employs the FFA module as a multi-scale spatial feature extractor to obtain comprehensive and effective spatial information and utilizes the BiGRU to further learn the temporal features of music sequences. A series of experiments verifies the effectiveness of the proposed network architecture, which outperforms the baseline models in terms of accuracy, precision, and AUC. Although the proposed model achieves some improvement in accuracy, there is still much room for improvement in the classification of certain categories, such as "HVLA" and "LVHA". In future work, we will further refine the model to enhance its classification accuracy for these specific categories. Moreover, the proposed method is only evaluated on the EMOPIA dataset, in which the emotion expression is not universal. In fact, different individuals, cultures, and customs may elicit varying emotional responses to the same piece of music; therefore, more diversified datasets and corresponding research methods are urgently needed for music emotion recognition.

Author Contributions

Conceptualization, Y.S. and J.C.; methodology, Y.S. and J.C.; software, J.C. and R.C.; validation, J.C. and R.C.; formal analysis, X.W. and Y.Z.; investigation, X.W. and Y.Z.; resources, J.C. and Y.S.; data curation, J.C. and X.W.; writing—original draft preparation, J.C.; writing—review and editing, Y.S., J.C. and R.C.; visualization, J.C. and R.C.; supervision, Y.S., X.W. and Y.Z.; project administration, Y.S.; funding acquisition, Y.S., X.W. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Xi’an Science and Technology Plan Project, China Grant (No. 24GXFW0009), the Fundamental Research Funds for the Central Universities (No. GK202205035, No. GK202101004, No. GK202407007), the National Natural Science Foundation of China (No. 62377034), and the Shaanxi Key Science and Technology Innovation Team Project (No. 2022TD-26).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are publicly available. The EMOPIA dataset is available at https://zenodo.org/records/5090631, accessed on 1 February 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yang, Y.H.; Chen, H.H. Music Emotion Recognition; CRC Press: Boca Raton, FL, USA, 2011; pp. 1–3.
2. Florence, S.M.; Uma, M. Emotional Detection and Music Recommendation System based on User Facial Expression. In IOP Conference Series: Materials Science and Engineering; IOP Publishing: Bristol, UK, 2020; Volume 912, p. 062007.
3. Moscato, V.; Picariello, A.; Sperlí, G. An Emotional Recommender System for Music. IEEE Intell. Syst. 2021, 36, 57–68.
4. Bao, C.; Sun, Q. Generating Music With Emotions. IEEE Trans. Multimed. 2023, 25, 3602–3614.
5. Huang, C.F.; Huang, C.Y. Emotion-based AI Music Generation System with CVAE-GAN. In Proceedings of the 2020 IEEE Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 23–25 October 2020; pp. 220–222.
6. Yang, X.Y.; Dong, Y.Z.; Li, J. Review of data features-based music emotion recognition methods. Multimed. Syst. 2018, 24, 365–389.
7. Han, D.; Kong, Y.; Han, J.; Wang, G. A survey of music emotion recognition. Front. Comput. Sci. 2022, 16, 166335.
8. Russell, J.A. A Circumplex Model of Affect. J. Personal. Soc. Psychol. 1980, 39, 1161–1178.
9. Chin, Y.H.; Lin, C.H.; Siahaan, E.; Wang, I.C.; Wang, J.C. Music emotion classification using double-layer support vector machines. In Proceedings of the 2013 1st International Conference on Orange Technologies (ICOT), Tainan, Taiwan, 12–16 March 2013; pp. 193–196.
10. Chiang, W.C.; Wang, J.S.; Hsu, Y. A Music Emotion Recognition Algorithm with Hierarchical SVM Based Classifiers. In Proceedings of the 2014 International Symposium on Computer, Consumer and Control, Taichung, Taiwan, 10–12 June 2014; pp. 1249–1252.
11. Dewi, K.C.; Harjoko, A. Kid’s song classification based on mood parameters using K-Nearest Neighbor classification method and Self Organizing Map. In Proceedings of the 2010 International Conference on Distributed Frameworks for Multimedia Applications, Jogjakarta, Indonesia, 2–3 August 2010; pp. 1–5.
12. Wu, W.; Xie, L. Discriminating Mood Taxonomy of Chinese Traditional Music and Western Classical Music with Content Feature Sets. In Proceedings of the 2008 Congress on Image and Signal Processing, Sanya, China, 27–30 May 2008; pp. 148–152.
13. Lu, L.; Liu, D.; Zhang, H.J. Automatic mood detection and tracking of music audio signals. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 5–18.
14. Lee, C.C.; Mower, E.; Busso, C.; Lee, S.; Narayanan, S. Emotion recognition using a hierarchical binary decision tree approach. Speech Commun. 2009, 53, 1162–1171.
15. Liu, X.; Chen, Q.; Wu, X.; Liu, Y.; Liu, Y. CNN based music emotion classification. arXiv 2017, arXiv:1704.05665.
16. Sarkar, R.; Choudhury, S.; Dutta, S.; Roy, A.; Saha, S.K. Recognition of emotion in music based on deep convolutional neural network. Multimed. Tools Appl. 2020, 79, 765–783.
17. Li, X.X.; Tian, J.S.; Xu, M.X.; Ning, Y.S.; Cai, L.H. DBLSTM-based multi-scale fusion for dynamic emotion prediction in music. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6.
18. Liu, H.P.; Fang, Y.; Huang, Q.H. Music emotion recognition using a variant of recurrent neural network. In Proceedings of the 2018 International Conference on Mathematics, Modeling, Simulation and Statistics Application (MMSSA 2018), Shanghai, China, 22–23 December 2018; pp. 15–18.
19. Yang, P.T.; Kuang, S.M.; Wu, C.C.; Hsu, J.L. Predicting music emotion by using convolutional neural network. In HCI in Business, Government and Organizations; Springer: Cham, Switzerland, 2020; pp. 266–275.
20. Weninger, F.; Eyben, F.; Schuller, B.W. On-line continuous-time music mood regression with deep recurrent neural networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 5412–5416.
21. Dong, Y.Z.; Yang, X.Y.; Zhao, X.; Li, J. Bidirectional convolutional recurrent sparse network (BCRSN): An efficient model for music emotion recognition. IEEE Trans. Multimed. 2019, 21, 3150–3163.
22. Chen, Z.; Liu, C. Music Audio Sentiment Classification Based on CNN-BiLSTM and Attention Model. In Proceedings of the 2021 4th International Conference on Robotics, Control and Automation Engineering (RCAE), Wuhan, China, 4–6 November 2021; pp. 156–160.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
24. Ma, Y.; Li, X.X.; Xu, M.; Jia, J.; Cai, L. Multi-scale Context Based Attention for Dynamic Music Emotion Prediction. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1443–1450.
25. Zhang, M.; Zhu, Y.; Ge, N.; Zhu, Y.; Feng, T.; Zhang, W. Attention-based Joint Feature Extraction Model For Static Music Emotion Classification. In Proceedings of the 2021 14th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 11–12 December 2021; pp. 291–296.
26. Zhao, J.; Ru, G.; Yu, Y.; Wu, Y.; Li, D.; Li, W. Multimodal Music Emotion Recognition with Hierarchical Cross-Modal Attention Network. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.
27. Huang, Z.; Ji, S.; Hu, Z.; Cai, C.; Luo, J.; Yang, X. ADFF: Attention Based Deep Feature Fusion Approach for Music Emotion Recognition. In Proceedings of the 23rd Interspeech Conference, Incheon, Republic of Korea, 18–22 September 2022; pp. 4152–4156.
28. Xiao, Y.; Ruan, H.; Zhao, X.; Jin, P.; Cai, X. Music Emotion Recognition Using Multi-head Self-attention-Based Models. In Proceedings of the International Conference on Intelligent Computing, Zhengzhou, China, 10–13 August 2023; pp. 101–114.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
30. Zhao, J.; Yoshii, K. Multimodal Multifaceted Music Emotion Recognition Based on Self-Attentive Fusion of Psychology-Inspired Symbolic and Acoustic Features. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 1641–1645.
31. Du, X.; Yang, J.; Xie, X. Multimodal emotion recognition based on feature fusion and residual connection. In Proceedings of the 2023 IEEE 2nd International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), Changchun, China, 24–26 February 2023; pp. 373–377.
32. Qin, X.; Wang, Z.; Bai, Y.; Xie, X.; Jia, H. FFA-Net: Feature Fusion Attention Network for Single Image Dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 11908–11915.
33. Hung, H.; Ching, J.; Doh, S.; Kim, N.; Nam, J.; Yang, Y. EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation. In Proceedings of the 22nd International Society for Music Information Retrieval Conference, Online, 7–12 November 2021; pp. 318–325.
34. Kim, M.; Kwon, H.C. Lyrics-based emotion classification using feature selection by partial syntactic analysis. In Proceedings of the 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence, Boca Raton, FL, USA, 7–9 November 2011; pp. 960–964.
35. Malheiro, R.; Panda, R.; Gomes, P.; Paiva, R.P. Emotionally-Relevant Features for Classification and Regression of Music Lyrics. IEEE Trans. Affect. Comput. 2018, 9, 240–254.
36. Hu, X.; Li, F.; Ng, J. On the Relationships between Music-Induced Emotion and Physiological Signals. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, 23–27 September 2018; pp. 362–369.
37. Xu, J.; Li, X.; Hao, Y.; Yang, G. Source separation improves music emotion recognition. In Proceedings of the International Conference on Multimedia Retrieval, Glasgow, UK, 1–4 April 2014; pp. 423–426.
38. Keelawat, P.; Thammasan, N.; Kijsirikul, B.; Numao, M. Subject-Independent Emotion Recognition During Music Listening Based on EEG Using Deep Convolutional Neural Networks. In Proceedings of the 2019 IEEE 15th International Colloquium on Signal Processing & Its Applications (CSPA), Penang, Malaysia, 8–9 March 2019; pp. 21–26.
39. Peng, Z.; Li, X.; Zhu, Z.; Unoki, M.; Dang, J.; Akagi, M. Speech emotion recognition using 3d convolutions and attention-based sliding recurrent networks with auditory front-ends. IEEE Access 2020, 8, 16560–16572.
40. Rajamani, S.T.; Rajamani, K.T.; Schuller, B. Emotion and Theme Recognition in Music Using Attention-Based Methods. In Proceedings of the MediaEval’20, Online, 14–15 December 2020.
41. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
42. Yin, G.; Sun, S.; Yu, D.; Li, D.; Zhang, K. A Multimodal Framework for Large-Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–23.
43. Deshpande, H.; Singh, R.; Nam, U. Classification of music signals in the visual domain. In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), Limerick, Ireland, 6–8 December 2001.
44. Won, M.; Ferraro, A.; Bogdanov, D.; Serra, X. Evaluation of CNN-based automatic music tagging models. In Proceedings of the 17th Sound and Music Computing Conference, Torino, Italy, 24–26 June 2020.
45. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
46. Lin, Z.; Feng, M.; Santos, C.N.D.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A structured self-attentive sentence embedding. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
Figure 1. An overview of the proposed FFA-BiGRU model. The model consists of a spatial feature learning module (FFA), a temporal feature learning module (BiGRU), and an emotion prediction module.
Figure 2. Channel-spatial attention mechanism. The input feature maps are first scaled by channel attention weights and then by spatial attention weights. The filter size of Conv layers in both the CA and SA blocks is 1 × 1.
Figure 3. The architecture of the GRU unit. It has two gates: the reset gate $r_t$ and the update gate $z_t$.
Figure 4. The FC layer for emotion prediction.
Figure 5. Confusion matrix of the proposed model.
Table 2. Performance comparison between different models.

| Method | Accuracy | Precision | AUC |
|---|---|---|---|
| Short-chunk CNN [44] | 63.1% | 63.9% | 0.88 |
| CBAM [45] | 64.0% | 65.9% | 0.86 |
| BiLSTM + Attention [33] | 65.1% | 65.8% | 0.88 |
| ADFF [27] | 67.4% | 68.0% | 0.86 |
| SCMA [28] | 67.4% | 69.3% | 0.89 |
| CBAM + BiGRU | 65.1% | 66.9% | 0.89 |
| FFA + BiLSTM | 68.6% | 69.8% | 0.89 |
| Ours (FFA + BiGRU) | 69.8% | 70.3% | 0.91 |
Table 3. Ablation study: effects of channel and spatial attention for spatial feature fusion.

| Method | Accuracy | Precision | AUC |
|---|---|---|---|
| Model 1 (remove CA) | 68.6% | 70.2% | 0.89 |
| Model 2 (remove SA) | 67.4% | 70.1% | 0.88 |
| Ours (FFA + BiGRU) | 69.8% | 70.3% | 0.91 |
Table 4. Ablation study: effect of varying the number of group architecture blocks.

| Number of Group Architecture Blocks | Accuracy | Precision | AUC |
|---|---|---|---|
| i = 1 | 68.6% | 69.4% | 0.90 |
| i = 2 | 65.1% | 67.8% | 0.88 |
| i = 3 (ours) | 69.8% | 70.3% | 0.91 |
| i = 4 | 54.7% | 57.4% | 0.84 |
Table 5. Ablation study: effect of varying the number of basic block structures.

| Number of Basic Block Structures | Accuracy | Precision | AUC |
|---|---|---|---|
| i = 17 | 60.5% | 60.9% | 0.86 |
| i = 18 | 62.8% | 64.1% | 0.87 |
| i = 19 (ours) | 69.8% | 70.3% | 0.91 |
| i = 20 | 62.8% | 64.5% | 0.87 |
| i = 21 | 65.1% | 68.5% | 0.88 |
Table 6. Ablation study: effect of combining FFA and BiGRU.

| Method | Accuracy | Precision | AUC |
|---|---|---|---|
| FFA | 64.0% | 66.3% | 0.89 |
| BiGRU | 66.3% | 67.8% | 0.89 |
| Ours (FFA + BiGRU) | 69.8% | 70.3% | 0.91 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
