Article

Verse1-Chorus-Verse2 Structure: A Stacked Ensemble Approach for Enhanced Music Emotion Recognition

by
Love Jhoye Moreno Raboy
* and
Attaphongse Taparugssanagorn
Department of Information and Communication Technologies, School of Engineering and Technology, Asian Institute of Technology, 58 Moo 9, Km. 42, Paholyothin Highway, Klong Luang, P.O. Box 4, Pathum Thani 12120, Thailand
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5761; https://doi.org/10.3390/app14135761
Submission received: 12 April 2024 / Revised: 26 June 2024 / Accepted: 28 June 2024 / Published: 1 July 2024

Abstract

In this study, we present a novel approach for music emotion recognition that utilizes a stacked ensemble of models integrating audio and lyric features within a structured song framework. Our methodology employs a sequence of six specialized base models, each designed to capture critical features from distinct song segments: verse1, chorus, and verse2. These models are integrated into a meta-learner, resulting in superior predictive performance, achieving an accuracy of 96.25%. A basic stacked ensemble model was also used in this study to independently run the audio and lyric features for each song segment. The six-input stacked ensemble model surpasses the capabilities of models analyzing song parts in isolation. The pronounced enhancement underscores the importance of a bimodal approach in capturing the full spectrum of musical emotions. Furthermore, our research not only opens new avenues for studying musical emotions but also provides a foundational framework for future investigations into the complex emotional aspects of music.

1. Introduction

Music, whether played or listened to, has been recognized as significantly contributing to a person’s emotional well-being and overall health [1]. In the era of digital transformation, advancements in music technology, particularly in music emotion recognition and retrieval, have progressed rapidly, opening innovative avenues for exploration. This encourages us to take note of the burgeoning field of music emotion recognition, where substantial research and development have been conducted, refining our understanding of how music conveys and elicits emotions.
Within this domain, numerous areas remain unexplored and invite further investigation. One such area is the utilization of stacked ensemble techniques for the simultaneous analysis of integrated audio and lyrical features, particularly within a segmented verse1-chorus-verse2 extraction framework. Stacking, as an advanced ensemble learning methodology, substantially augments the predictive prowess of machine learning models. By amalgamating the outputs from a multitude of base models, stacking not only effectively reduces bias and variance but also broadens model diversity, thereby enhancing the interpretability of the final predictions [2].
Spotify, a leading music streaming platform, boasts an extensive library of tracks, some of which have been organized into playlists labeled with emotional descriptors such as “angry”, “happy”, “sad”, and “relaxed”. The study conducted by [3] concluded that the Spotify API extractor provides some higher-level, emotionally relevant features, but that additional emotionally relevant features are still needed to improve MER.
This study aims to evaluate the effectiveness of stacked ensemble methods in recognizing emotions from structured parts of songs, specifically analyzing audio-lyrics features from verse1, chorus, verse2 and a combination thereof. By employing our own meticulously curated datasets where music labels are extracted from Spotify playlist titles, we seek to improve the accuracy and reliability of music emotion recognition. Additionally, our research aims to deepen the understanding of how different sections of a song contribute to emotional expression, providing new insights for future research in the field.

1.1. Theoretical Background from a Musical Perspective

Understanding the emotional content of music involves an intricate interplay between various musical elements, including both audio features and lyrics. These elements collectively contribute to the overall emotional impact of a song. This section leverages insights from recent studies in music theory and psychology to support the integration of audio and lyrics features in our study.
The emotional expression in music is significantly influenced by various audio features and lyrical content. Melodic contours, such as ascending patterns, often correlate with positive emotions, while descending melodies can evoke sadness or calmness [4]. The study by [5] investigates how different musical pieces, particularly through tempo and rhythmic unit, affect participants’ emotional states. Their findings show that high tempos elicit excitement while low tempos induce relaxation. Timbre, on the other hand, refers to the quality of sound and varies by instrument and playing technique, influencing emotional perception; a bright timbre might be perceived as happy while a dark timbre can convey sadness [6].
Lyrics provide explicit emotional cues through words and phrases, shape emotional responses through narrative context, and use imagery and metaphor to vividly convey emotions [7]. By integrating these audio and lyrical elements, a more comprehensive understanding of a song’s emotional content can be achieved.

1.2. Related Works on Music Emotion Recognition Domains and Challenges

Music emotion recognition (MER) is a research domain focused on developing computational frameworks to autonomously grasp and decipher the emotional essence embedded within music. It involves extracting relevant attributes from audio signals, lyrics, or other pertinent data sources, and subsequently employing machine learning techniques to categorize or predict the emotional states conveyed by the music. This facet of emotion recognition holds substantial implications, spanning domains such as music therapy, recommendation systems, and affective computing [8].
A study by [9] involves employing acoustic features extracted from audio signals, capturing features based on audio’s timbre, rhythm, and dynamics. These features are then fed into machine learning algorithms that enable the creation of models designed to classify music into distinct emotional categories. The authors in [9] proposed a hybrid model that combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for music emotion recognition. Their study demonstrated promising results on a benchmark dataset, highlighting the efficacy of deep learning in MER.
Another strand in MER involves methodologies based on lyrics, which delve into textual song content to extract emotional states. This avenue utilizes sentiment analysis and natural language processing techniques to extract emotional cues from lyrics. In [10], a hybrid deep learning model was introduced, merging recurrent neural networks and attention mechanisms to discern emotions in song lyrics. This combination significantly improved emotion classification performance, highlighting the pivotal role of lyrics in MER.
The authors in [11] proposed a multimodal deep learning framework intertwining audio and lyric features for music emotion recognition. Their study showcases the superb performance of multimodal features, therefore outperforming individual modality usage. This amalgamated approach shows the benefits of combining diverse information sources from both audio and lyrics data.
Yet, the accurate identification and classification of emotions in music confront intricate challenges arising from emotions’ inherent subjectivity and music’s multi-dimensional nature. This endeavor must address the interplay of subjectivity and inter-individual variance in emotional interpretation. Emotions embody subjective perceptions, and a piece’s emotional content may be construed differently by different individuals. This diversity in emotional responses, influenced by personal experiences, cultural contexts, and situational settings, adds complexity and defies the creation of all-encompassing models.

1.3. Related Works on Using Audio and Lyrics Features

According to [12], audio features refer to descriptions of sound or an audio signal that can be utilized as inputs for statistical or machine learning models to develop intelligent audio systems. These features play a crucial role in various audio applications, including audio classification, speech recognition, automatic music tagging, audio segmentation and source separation, audio fingerprinting, audio denoising, and music information retrieval. They provide the necessary information and characteristics for analyzing and processing audio data across a wide range of tasks and applications.
Various categories of musical signals serve as audio features, as described in [13]. These include high-level audio features, mid-level features, and low-level features. High-level audio features are abstract features appreciated by humans, such as tempo, which was extracted in this study. According to [14], temporal features of audio influence the pace, energy, and perceived arousal of the music, playing a significant role in evoking specific emotions.
Mid-level audio features, such as pitch, represented by chroma and Mel-frequency cepstral coefficients (MFCCs), have demonstrated their significance in capturing perceptible characteristics contributing to music emotion recognition. Furthermore, combining both chroma and MFCC features has been shown to improve the performance of music emotion recognition systems. The authors in [15] introduced a hybrid feature fusion approach combining chroma features and MFCCs for emotion recognition in music. Their results demonstrated the complementary nature of these features in capturing both tonal and timbral aspects of music, leading to enhanced emotion classification accuracy.
The seminal work of [16] explores the audio features relevant to music emotion recognition and highlights the role of spectral characteristics in MER. Their study provides an in-depth survey of existing, emotionally relevant computational audio features, underpinned by the music psychology literature and linking various musical dimensions, such as melody, harmony, rhythm, and particularly tone color, to specific emotions. Accordingly, we include spectral bandwidth and spectral centroid among the audio features used in this study.
Combining audio and lyrics features in MER tasks presents numerous advantages, fostering more comprehensive and accurate results. Various studies assert that this integration allows for a richer representation of emotional content in music. While audio features capture acoustic properties and musical elements, lyrics features offer insights into semantic and textual expressions. The synergy of both modalities facilitates a nuanced understanding of emotions.
Studies by [17,18,19,20,21] emphasize the potential benefits of merging acoustic cues with textual expressions. This holistic approach enhances the discrimination of emotional states, elevating the overall accuracy of emotion recognition models. By tapping into complementary information from both modalities, researchers can craft more comprehensive, accurate, and contextually aware emotion recognition models.

1.4. Related Works on Using Spotify and Emotion-Labeled Playlists

Spotify has revolutionized music discovery and consumption habits by leveraging sophisticated algorithms and machine learning to curate personalized playlists, analyze listening behaviors, and recommend music across diverse genres. This has led to increased music discovery, with users adding new songs to their libraries almost daily and embracing a wider range of musical styles. Spotify’s ability to analyze user data, including listening habits and song characteristics, allows it to create multiple listener identities for each user, providing a highly personalized experience that caters to diverse tastes and moods. This shift from rigid genre categorization to mood-based playlists has made music discovery more accessible and encouraged listeners to explore beyond their usual preferences, as claimed by [22].
Emotion-labeled playlists within the music streaming service realm encompass thoughtfully curated song collections, meticulously categorized based on their emotional nuances [23]. These playlists are crafted by Spotify users, music enthusiasts, or algorithms that dissect songs’ audio and/or lyrical attributes to affix distinct emotional descriptors. This repository of playlists holds remarkable significance for MER research.

1.5. Related Works on Audio and Lyrics Data for Using Stacking Techniques

The convergence of audio and lyrics features through deep learning models for music emotion classification has garnered substantial research interest. Diverse studies have explored distinct methodologies, architectures, and performance metrics, striving to attain precise and comprehensive music emotion classification. One noteworthy study in this area is by [17], which proposes a multimodal strategy amalgamating audio and lyrics attributes via attention-based fusion recurrent neural networks. MFCCs underpin audio features, while lyrics attributes involve term frequency-inverse document frequency (TF-IDF) representation. The attention mechanism pinpoints informative segments from both modalities. Their model comprises dual recurrent neural networks (RNNs) dedicated to audio and lyrics. Attention mechanisms refine each RNN’s output, with a fusion gate uniting the attended representations. The resultant fusion fuels a fully connected layer for emotion classification. Evaluation employs accuracy, precision, recall, and F1-score metrics alongside cross-validation and baseline model comparisons, validating the efficacy of their multimodal approach.
A study cited in [18] introduced a multimodal technique that merges audio and lyrics characteristics, fortified by an attention mechanism. OpenSMILE is employed to extract audio traits, encompassing spectral and rhythmic attributes, while lyrics features include word embeddings and sentiment analysis scores. The model consists of twin subnetworks: an audio subnetwork utilizing CNNs to extract audio features and a lyrics subnetwork employing LSTM networks to process lyrical traits. Attentive mechanisms operate in both subnetworks, leading to fused representations via weighted summation.
Another study referenced in [24] advocates the integration of audio and lyrics features using deep belief networks (DBNs). Audio features encompass statistical attributes derived from audio signals, while lyrics attributes involve semantic representations through word embeddings. DBNs decode hierarchical representations from both audio and lyrics traits, ultimately converging at a fusion layer. The synthesized representation undergoes classification for emotions via a softmax layer.
Stacking methods offer numerous advantages in MER, accurately capturing intricate patterns and representations from both audio and lyrics data. Authors cited in [25] emphasize deep learning’s capability to autonomously capture complex patterns without relying on prior assumptions. This process involves collecting raw data, enabling models to autonomously deduce relevant features from both audio and lyrics. In contrast, a study referenced from [26] introduces RNNs, which excel in modeling sequential data—a significant asset in MER, given that emotions hinge on the dynamic evolution of music and lyrics over time. These studies shed light on the intricate methodologies, architectures, and performance metrics that underlie the fusion of audio and lyrics attributes within ensemble models for music emotion classification. A wide array of deep neural network architectures is explored, spanning from convolutional to recurrent layers, and various fusion techniques are employed at different levels. The performance of these models is meticulously evaluated using established metrics, with comprehensive analysis and cross-validation efforts substantiating their effectiveness.

1.6. Bridging Gaps and Our Contributions

In our work, we define a “standard” in the context of MER as methodological standards that involve the methods and techniques used to recognize and classify emotions in music using both audio and lyrics. As it is methodological, it encompasses steps from bimodal feature extraction techniques to using stacked ensemble deep learning models, assessing model performance in terms of accuracy, precision, recall, and F1 score. Current methodological standards, as discussed in studies [17,18,24,25,26], predominantly focus on using bimodal ensemble model methods to handle lyrics and audio features. However, critical gaps and constraints persist, demanding focused attention and innovative solutions, particularly in addressing labeling subjectivity, fusion techniques, and the exploration of human-centric evaluation.
Our proposed standard addresses these gaps by implementing refined methodological standards that recognize and classify emotions in music using audio and lyrics from specific structural parts of a song, namely verse1, chorus, and verse2. This process involves extracting the audio and lyric features from each structural part of the song, feeding them individually into each of the six deep learning models that comprise our stacked ensemble. We then assess the models’ performance in terms of accuracy, precision, recall, and F1 score. For instance, our six deep learning stacking models and the extensive feature extraction tailored to different structural parts have demonstrated superior performance in terms of accuracy compared to existing benchmarks. Theoretically, our approach is grounded in ensemble modeling, which suggests that a more holistic understanding of musical emotion can be achieved by integrating both audio and lyrical content. This integration allows for a richer interpretation of emotional cues, as supported by our empirical findings where our model achieved higher accuracy in scenarios involving complex emotional expressions.

2. Methodology

The methodology employed in this study encompasses a multifaceted approach, comprising four key stages. In Section 2.1, we explain the meticulous process involved in establishing a comprehensive dataset sourced from emotion-labeled Spotify playlists. This dataset serves as the cornerstone of our research, providing the necessary foundation for subsequent analysis and model development. Subsequently, in Section 2.2, we delve into the intricate steps of constructing a stacked ensemble model, employing innovative techniques to amalgamate audio and lyrics features for enhanced music emotion recognition. Following this, Section 2.3 delves into the critical stage of training the model, outlining the iterative process of parameter optimization and performance enhancement. Lastly, in Section 2.4, we detail the selection and implementation of the optimizer and loss function, pivotal components in guiding the training process towards convergence and improving the accuracy of emotion label predictions. Together, these methodological stages form a comprehensive framework aimed at advancing the field of music emotion recognition.

2.1. Establishment of Dataset from Emotion-Labeled Spotify Playlist

At the heart of music lies the harmonious interplay of two vital elements: audio and lyrics. Audio encompasses the instrumental and vocal arrangements, while lyrics convey poetic expressions, emotions, messages, and narratives. In this study, we manually gathered a dataset comprising 400 samples of raw audio and lyric data.
Before collecting the data samples, we carried out preliminary activities that involved extracting Spotify playlist titles containing the words for the four basic emotions from Russell’s emotion plane, as mentioned by [27]: angry, happy, sad, and relax. Additionally, the extraction of playlist titles included synonyms of these four basic emotions to ensure comprehensive coverage. The collection of playlist titles was then subjected to a meticulous cleaning process to exclude playlists containing instrumental music or music lacking lyrics, playlists tailored for gym or yoga classes (since these may contain spoken instructions rather than lyrics), and non-English songs, to ensure consistency in the linguistic context of our dataset. Furthermore, to improve the accuracy of emotion labeling, we adopted a methodology in which only songs appearing in more than two playlists were considered for inclusion. Subsequently, raw audio (WAV) data and lyric data were extracted from online platforms, and feature extraction was performed on both types of data.
For the audio data, the following features were automatically extracted using the Python Librosa library: chroma, tempo, 20 mel-frequency cepstral coefficients (MFCCs), zero crossing rate (ZCR), spectral bandwidth, spectral centroid, root mean square value, and roll-off. These audio features were utilized because the studies in [13,14,15,16] report that they have an impact on the recognition of emotion in music. Two additional fields were also included, namely the emotion label and the filename. Meanwhile, the lyric data underwent term frequency-inverse document frequency (TF-IDF) extraction using the Natural Language Toolkit (NLTK) framework, coupled with stopword removal and the removal of punctuation and symbols for cleaning purposes.
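As an illustration, the audio features listed above can be obtained per song segment with Librosa roughly as follows; the function name, the averaging of frame-level features into single values, and the column layout are assumptions for this sketch rather than the exact pipeline used in this study.

```python
# Minimal sketch of per-segment audio feature extraction with librosa.
import numpy as np
import librosa

def extract_audio_features(wav_path, emotion_label):
    y, sr = librosa.load(wav_path, sr=None)              # load the segment at its native rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # 20 MFCCs (frame-level)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    features = {
        "chroma": np.mean(librosa.feature.chroma_stft(y=y, sr=sr)),
        "tempo": np.atleast_1d(tempo)[0],
        "zcr": np.mean(librosa.feature.zero_crossing_rate(y)),
        "spectral_bandwidth": np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr)),
        "spectral_centroid": np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)),
        "rms": np.mean(librosa.feature.rms(y=y)),
        "rolloff": np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)),
        "label": emotion_label,
        "filename": wav_path,
    }
    # store the mean of each of the 20 MFCCs as a separate column
    for k in range(20):
        features[f"mfcc_{k + 1}"] = np.mean(mfcc[k])
    return features
```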
The final step involved segmenting the dataset into verse1, chorus, and verse2 sections, storing audio and lyrics features separately in comma separated value (CSV) files as shown in Figure 1. These meticulous procedures were executed to establish a comprehensive and enriched dataset, facilitating in-depth exploration in the field of music emotion recognition.
Table 1 provides an overview of the raw datasets that were created in this study. The “Song Part” column represents different parts of the songs, namely Verse 1, Verse 2, Chorus, Whole Song, and a combined sequence of Verse 1-Chorus-Verse 2. Features for these “Song Parts” were manually extracted from segments that contain both audio and lyrics data. Furthermore, each of these parts was extracted to compare the performance of deep learning models, as they will be fed individually into the model.
The column “Audio Features” details the structure of the audio dataset, which comprises audio features and is represented by the vector notation (total number of songs/audio tracks, total number of audio features). For instance, the value (400, 29) indicates 400 songs/audio tracks, each characterized by 29 distinct audio features. These audio features include chroma, tempo, 20 mel-frequency cepstral coefficients, zero crossing rate, spectral bandwidth, spectral centroid, root mean square value, roll-off, the emotion label, and the corresponding filename. The audio features for Verse 1-Chorus-Verse 2 display different values because they represent the sum of the features across these specific parts of each song.
The columns “Document Frequency”, “Unigram”, and “Bigram” are parameters utilized in the extraction of lyric features using the TF-IDF algorithm. “Document Frequency” indicates the minimum number of documents in which a term appears within the dataset. “Unigram” and “Bigram” refer to the occurrences of unique single words and pairs of words in each document, respectively. As can be found in Table 1, the values 5, 10, and 30 represent the minimum number of documents in which a unique word or pair of words appears. For each specified minimum document frequency, a lyrics dataset was created. The structure of this dataset is denoted by the vector notation (total number of songs, lyrics TF-IDF features). The “Unigram” column represents the dataset that extracts a single word from the corpus, while the “Bigram” column pertains to the dataset extracting pairs of words.
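To make these TF-IDF parameters concrete, the following sketch shows one plausible way to build the unigram and bigram lyric datasets; scikit-learn’s TfidfVectorizer is used here as a stand-in for the NLTK-based pipeline, and the function and variable names are illustrative.

```python
# Sketch of lyric TF-IDF extraction with stopword and punctuation removal,
# parameterized by the minimum document frequency (5, 10, or 30) and n-gram range.
import string
from nltk.corpus import stopwords                      # requires nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf(lyrics, min_df=5, ngram_range=(1, 1)):
    table = str.maketrans("", "", string.punctuation)  # strip punctuation and symbols
    cleaned = [text.lower().translate(table) for text in lyrics]
    vectorizer = TfidfVectorizer(min_df=min_df,            # minimum document frequency
                                 ngram_range=ngram_range,  # (1, 1) unigrams, (2, 2) bigrams
                                 stop_words=stopwords.words("english"))
    X = vectorizer.fit_transform(cleaned)              # shape: (n_songs, n_tfidf_features)
    return X, vectorizer

# e.g., unigram features for the verse1 lyrics with MinDF = 5:
# X_v1, vec = build_tfidf(verse1_lyrics, min_df=5, ngram_range=(1, 1))
```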

2.2. Development of Stacked Ensemble Model

Based on the insights from [2], stacking emerges as a powerful machine learning technique that amalgamates predictions from multiple base models, or first-level models, to yield a final prediction. This strategy entails training several base models on a common dataset and subsequently leveraging their predictions as inputs for a higher-level model, often termed a meta-model or second-level model, to render the ultimate decision. The fundamental premise of stacking revolves around consolidating predictions from diverse base models to enhance predictive performance beyond what any single model can achieve in isolation.
In the development of our stacked ensemble model, meticulous attention has been devoted to each component to ensure optimal performance and robustness. Our approach involves employing individual network base models that cater to both audio and lyrics input data, resulting in six base models for the combined verse1-chorus-verse2 sections of the song. Figure 2 shows the integrated process for the development of the stacked ensemble model in this study, starting from the establishment of the datasets (preliminaries, preprocessing, feature extraction, and segmentation) through to the stacked ensemble model used for prediction.

2.2.1. Base Model Representation

Audio Data Representation

In the context of analyzing songs, the input audio features for the i-th song can be represented as
$x_{i,\text{audio}} = [\,a_{i,\text{verse1}},\ a_{i,\text{chorus}},\ a_{i,\text{verse2}}\,]$, (1)
where $x_{i,\text{audio}}$ denotes a grouping of the audio features associated with specific sections of a song. Thus, $a_{i,\text{verse1}}$, $a_{i,\text{chorus}}$, and $a_{i,\text{verse2}}$ represent the vectors of the audio features associated with the verse1 section, chorus section, and verse2 section of the i-th song, respectively. These vectors are stored as NumPy arrays with 400 rows, representing the total number of song tracks, and up to 29 features, as shown in Table 1.

Audio Base Model

Figure 3 represents the audio base model utilized for the stacked ensemble model. Individual datasets, as represented by $x_{i,\text{audio}}$, were used as inputs to the artificial neural network (ANN) model, which has up to six hidden layers utilizing the ReLU activation function. This model culminates in an output layer with softmax activation, a common choice for multi-class classification tasks.
The training of the audio base model uses the common 80–20 split of the dataset. Specifically, 320 data points, out of a total of 400 song tracks, are used for training. The remaining 80 data points are reserved for evaluation and for the evaluation of the meta-learner. This audio base model generates prediction values as probability distributions over the target classes for each data point in the dataset. Since there are four classes, four probability values are generated for each data point in the audio dataset. The prediction vector can be expressed as
$p_{i,\text{audio},\text{part}} = f_{\text{audio}}(x_{i,\text{audio}})$, (2)
which is obtained by training the audio base model $f_{\text{audio}}$. The vector $x_{i,\text{audio}}$ represents the segmented input values from the datasets defined in Table 1. Therefore, for the individual audio datasets, the prediction vectors are $p_{i,\text{audio},\text{verse1}}$, $p_{i,\text{audio},\text{chorus}}$, and $p_{i,\text{audio},\text{verse2}}$.
Finally, each segmented audio dataset, namely verse1, chorus, and verse2, generates an individual prediction vector, as captured in formula (2), with four values. These values will be used as one of the inputs for the stacked ensemble model’s meta-learner to perform another layer of predictions for music emotion classification.
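For concreteness, a minimal Keras sketch of one such audio base model is given below; the specific layer widths, and the choice of exactly six hidden layers, are illustrative assumptions consistent with the “up to six hidden layers” described above.

```python
# Minimal sketch of an audio base model: a dense network with ReLU hidden
# layers and a 4-way softmax output. Layer widths are assumed for illustration.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

def build_audio_base_model(n_features, n_classes=4,
                           hidden_units=(256, 128, 64, 64, 32, 32)):
    model = Sequential([Input(shape=(n_features,))])
    for units in hidden_units:                          # up to six hidden layers, ReLU
        model.add(Dense(units, activation="relu"))
    model.add(Dense(n_classes, activation="softmax"))   # softmax over the four emotions
    return model

# e.g., for the verse1 audio dataset (27 numeric features after dropping the
# emotion label and filename columns):
audio_base_verse1 = build_audio_base_model(n_features=27)
```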

Lyrics Data Representation

For the i-th song, the TF-IDF lyrics feature input is represented as
$x_{i,\text{lyrics}} = [\,l_{i,\text{verse1}},\ l_{i,\text{chorus}},\ l_{i,\text{verse2}}\,]$, (3)
where $l_{i,\text{verse1}}$, $l_{i,\text{chorus}}$, and $l_{i,\text{verse2}}$ represent the vectors of the TF-IDF features associated with the verse1 section, chorus section, and verse2 section of the i-th song, respectively.

Lyrics Base Model

Figure 4 represents the lyrics base model utilized for the stacked ensemble model. Individual datasets, as represented by $x_{i,\text{lyrics}}$, were used as inputs to the artificial neural network (ANN) model, which has up to six hidden layers utilizing the ReLU activation function. This model, in the same way as the audio model, culminates in an output layer with softmax activation, and its configuration allows the base model to generate its own features for prediction, capturing the distinct emotional cues inherent in the lyrics datasets.
The training and prediction of the lyrics base model follow the same approach as the audio base model. It also uses the common 80–20 split of the dataset, with 320 of the 400 song tracks used for training. This model also generates four prediction values as probability distributions over the target classes for each data point in the dataset. With the lyrics datasets, the prediction vector can be expressed as
$p_{i,\text{lyrics},\text{part}} = f_{\text{lyrics}}(x_{i,\text{lyrics}})$, (4)
which is obtained by training the lyrics base model $f_{\text{lyrics}}$. The vector $x_{i,\text{lyrics}}$ represents the segmented input values from the datasets defined in Table 1. Therefore, for the individual lyrics datasets, the prediction vectors are $p_{i,\text{lyrics},\text{verse1}}$, $p_{i,\text{lyrics},\text{chorus}}$, and $p_{i,\text{lyrics},\text{verse2}}$.
The prediction vector for lyrics, as shown in formula (4), will generate four prediction values like the audio representation. These values will also be used as inputs for the stacked ensemble model’s meta-learner to perform another layer of predictions for music emotion classification. The primary difference between the audio and lyrics models lies in the number of layers, with the optimal model being selected based on accuracy values.

2.2.2. Dual Input Stacked Ensemble Model for Datasets Verse1, Chorus, Verse2, and Whole Song

The dual input stacked ensemble model, shown in Figure 5, was created to determine the model performance for the verse1, chorus, verse2, and whole-song datasets for comparison purposes. Each network is an artificial neural network comprising up to six hidden layers with ReLU activation functions employed throughout, culminating in softmax activation at the output layer.
Figure 5 depicts the architecture of the basic stacked ensemble model, which was used to run the verse1, chorus, verse2, and whole-song datasets individually. In this model, predictions from two base models serve as input for a meta-learner, aiming to surpass the performance of any individual base model. The figure showcases a dual-base model architecture, in which the audio and lyrics datasets are processed independently. Each base model comprises an artificial neural network (ANN) with up to six hidden layers utilizing the ReLU activation function. These models culminate in an output layer with softmax activation, a common choice for multi-class classification tasks.

Concatenation through Predictions of Dual Input Stacked Ensemble Model

The predictions of the different base models are concatenated to form a new feature vector $Z_{i,\text{part}}$, expressed as
$Z_{i,\text{verse1}} = \text{concat}(p_{i,\text{audio},\text{verse1}},\ p_{i,\text{lyrics},\text{verse1}})$, (5)
$Z_{i,\text{chorus}} = \text{concat}(p_{i,\text{audio},\text{chorus}},\ p_{i,\text{lyrics},\text{chorus}})$, (6)
$Z_{i,\text{verse2}} = \text{concat}(p_{i,\text{audio},\text{verse2}},\ p_{i,\text{lyrics},\text{verse2}})$, (7)
$Z_{i,\text{wholesong}} = \text{concat}(p_{i,\text{audio},\text{wholesong}},\ p_{i,\text{lyrics},\text{wholesong}})$, (8)
where the $\text{concat}()$ function appends the prediction vectors for each part to form the new feature vector $Z_{i,\text{part}}$, which serves as the input to the meta-model. $Z_{i,\text{part}}$ contains eight probability values derived from the individual predictions for each dataset: four values from the audio segment and four values from the corresponding lyrics segment. These predictions are then fed into the meta-learner.
A sample prediction for the first entry in the verse1 audio dataset generates the following values: [0.0003, 0.0189, 0.0004, 0.9804]. The sequence of class values represents the “Angry”, “Happy”, “Relax”, and “Sad” classes, in that order. The high probability of 0.9804 indicates that the model predicts the song is “Sad”; the other probabilities represent the likelihoods of the other classes.
Another sample prediction for the first entry in the verse1 lyrics dataset generates the following values: [0.0053, 0.0146, 0.4644, 0.5157]. Each vector contains four values representing the four emotions in this study.
Concatenation then proceeds by appending the above values for the verse1 segment. Therefore, the concatenated vector is [0.0003, 0.0189, 0.0004, 0.9804, 0.0053, 0.0146, 0.4644, 0.5157], which is used as the input to the meta-learner model.
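In code terms, this concatenation step for a single segment could look roughly like the following sketch, reusing the sample prediction vectors above; the variable names are illustrative.

```python
# Sketch of the dual-input concatenation for one song segment (verse1): the
# 4-value audio and lyrics prediction vectors are appended into an 8-value
# feature vector Z_{i,verse1} that feeds the meta-learner.
import numpy as np

p_audio_verse1 = np.array([0.0003, 0.0189, 0.0004, 0.9804])   # audio base model output
p_lyrics_verse1 = np.array([0.0053, 0.0146, 0.4644, 0.5157])  # lyrics base model output

Z_verse1 = np.concatenate([p_audio_verse1, p_lyrics_verse1])  # shape (8,)
print(Z_verse1)
```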
The rationale for concatenating the probability distributions from both audio and lyrics models in our study stems from several key advantages. First, this approach enhances classification accuracy by leveraging the complementary strengths of each modality. By considering both audio and lyrics, the model can capture nuances and contextual cues that may be missed when analyzing either modality alone, thereby improving overall classification accuracy.
Second, concatenating probabilities enables comprehensive emotion detection by allowing the model to account for emotional cues present in both music and lyrics. This holistic understanding of the song’s emotional content leads to more nuanced and accurate predictions.
Third, by combining audio and lyrics data, we mitigate the weaknesses inherent in each modality. For example, lyrics may lack emotional intensity, while music may lack clear semantic content. By integrating both modalities, we ensure that the strengths of one can compensate for the weaknesses of the other, resulting in a more robust model. Additionally, combining probabilities from two distinct sources enhances the model’s robustness and generalizability. If one modality’s prediction is uncertain or noisy, the other modality’s prediction can provide additional context, leading to more reliable final predictions.
Finally, probabilistic concatenation offers an efficient fusion technique that is computationally lightweight and interpretable. This straightforward integration of information enables efficient processing without the need for complex fusion algorithms, making it a practical choice for our classification task. Overall, the rationale for concatenation lies in its ability to improve classification accuracy, enhance emotion detection, mitigate modality weaknesses, increase model robustness and generalizability, and offer an efficient fusion technique for combining audio and lyrics data.

2.2.3. The Six Input Stacked Ensemble Model

Figure 2 shows a 6-input stacked ensemble model tailored to discern emotional states from music by individually utilizing the audio and lyrics datasets from various song segments—Verse 1, Chorus, and Verse 2. This sophisticated model processes each segment through dedicated audio and lyrics base models as shown in Figure 3 and Figure 4, both being ANNs with up to six hidden layers featuring ReLU activation functions. This architecture adeptly captures complex patterns while addressing the vanishing gradient problem often encountered in deep learning tasks. The outputs from these base models undergo softmax activation, enabling multi-class classification to predict the emotional tone of the song accurately.

Concatenation Process for Six Input Stacked Ensemble Model

The predictions of the different base models are concatenated to form a new feature vector $Z_i$, expressed as
$Z_i = \text{concat}(p_{i,\text{audio},\text{verse1}},\ p_{i,\text{lyrics},\text{verse1}},\ p_{i,\text{audio},\text{chorus}},\ p_{i,\text{lyrics},\text{chorus}},\ p_{i,\text{audio},\text{verse2}},\ p_{i,\text{lyrics},\text{verse2}})$, (9)
where the $\text{concat}()$ function appends these six prediction vectors to form the new feature vector $Z_i$, which serves as the input to the meta-model. $Z_i$ contains 24 probability values derived from the individual predictions for each dataset: four values for each audio segment (verse1, chorus, verse2) and four values for each lyrics segment (verse1, chorus, verse2). These predictions are then fed into the meta-learner.
A sample prediction for the first entry in the audio dataset generates the following values for each segment of the song: [0.0003, 0.0189, 0.0004, 0.9804], [0.0000, 0.0000, 0.0004, 0.9996], [0.0118, 0.0004, 0.0029, 0.9849]. The sequence of class values represents the “Angry”, “Happy”, “Relax”, and “Sad” classes, in that order. The high probabilities of 0.9804, 0.9996, and 0.9849 indicate that the model predicts the song is “Sad”; the other probabilities represent the likelihoods of the other classes.
Another sample prediction for the first entry in the lyrics dataset generates the following values for each segment of the song: [0.0053, 0.0146, 0.4644, 0.5157], [0.0015, 0.0003, 0.9973, 0.0009], [0.0005, 0.9979, 0.0001, 0.0016]. Each vector contains four values representing the four emotions in this study.
Concatenation then proceeds by appending the above values in segment order (verse1, chorus, verse2), with the audio vector followed by the lyrics vector for each segment. Therefore, the concatenated vector is [0.0003, 0.0189, 0.0004, 0.9804, 0.0053, 0.0146, 0.4644, 0.5157, 0.0000, 0.0000, 0.0004, 0.9996, 0.0015, 0.0003, 0.9973, 0.0009, 0.0118, 0.0004, 0.0029, 0.9849, 0.0005, 0.9979, 0.0001, 0.0016], which is used as the input to the meta-learner model.
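The same step for the six-input model can be sketched as follows, using the sample prediction vectors above; the dictionary keys and variable names are illustrative.

```python
# Sketch of the six-input concatenation: the audio and lyrics predictions for
# verse1, chorus, and verse2 are appended in order into the 24-value vector Z_i.
import numpy as np

audio_preds = {
    "verse1": [0.0003, 0.0189, 0.0004, 0.9804],
    "chorus": [0.0000, 0.0000, 0.0004, 0.9996],
    "verse2": [0.0118, 0.0004, 0.0029, 0.9849],
}
lyrics_preds = {
    "verse1": [0.0053, 0.0146, 0.4644, 0.5157],
    "chorus": [0.0015, 0.0003, 0.9973, 0.0009],
    "verse2": [0.0005, 0.9979, 0.0001, 0.0016],
}

# audio vector followed by lyrics vector for each segment, in song order
Z_i = np.concatenate([np.concatenate([audio_preds[part], lyrics_preds[part]])
                      for part in ("verse1", "chorus", "verse2")])
print(Z_i.shape)  # (24,)
```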
By concatenating the probability distributions rather than raw data, we allow the meta-model to leverage the insights captured by each base model, which has been specifically trained to extract meaningful patterns from audio and lyrics data separately. This approach focuses on the most relevant information for classification.
From a musical perspective, combining audio and lyrics data is crucial because both elements contribute uniquely to the emotional expression of a song. Audio features capture the tonal and acoustic characteristics, such as melody, harmony, and rhythm, while lyrics provide semantic content and contextual meaning. By concatenating the probability distributions from both modalities and the segments of a song, our model can leverage the strengths of each, leading to a more accurate and nuanced understanding of the song’s emotional state. This integrated approach aligns with how humans perceive music, considering both the auditory and lyrical components in tandem.

2.2.4. The Meta-Learner for Final Classification

The meta-learner, shown in Figure 2 and Figure 5, is a simpler ANN with a single hidden layer comprising 64 units. The meta-learner’s purpose is to synthesize nuanced insights from both lyrical and audio aspects, offering a comprehensive emotional analysis through a softmax layer that discerns among four emotional categories: angry, happy, relax, and sad. This structured approach ensures precise emotion recognition, leveraging the varied emotional expressions conveyed across different parts of a song while employing a strategy that mitigates overfitting by reducing model complexity.
The meta-learner model, as shown in Figure 5 for dual input stacked ensemble, is represented as another neural network function g(), defined as
$y_{i,\text{part}} = g(Z_{i,\text{part}})$, (10)
which maps the concatenated prediction vector $Z_{i,\text{part}}$, as defined in Formulas (5)–(8), to a final prediction vector $y_{i,\text{part}}$. This final prediction vector $y_{i,\text{part}}$ provides probabilities for each of the four emotional categories (angry, happy, sad, and relax) for each segment. These probabilities indicate the likelihood of each emotion being expressed in the i-th song.
A sample data scenario for the dual input stacked ensemble meta-learner, as depicted in Figure 5, involves the use of the eight probability values generated by the two base models, as given by $Z_{i,\text{part}}$ in Formulas (5)–(7) for the verse1, chorus, and verse2 segments. The concatenated values [0.0003, 0.0189, 0.0004, 0.9804, 0.0053, 0.0146, 0.4644, 0.5157] are generated from the concatenation process in Section Concatenation through Predictions of Dual Input Stacked Ensemble Model. After feeding these concatenated values to the meta-model, another four probability values are generated, as represented by the vector $y_{i,\text{part}}$ in Formula (10), where “part” denotes the verse1, chorus, or verse2 segment individually.
The meta-learner model for six-input stacked ensemble, as shown in Figure 2, is represented as another neural network function g(), defined as
$y_i = g(Z_i)$, (11)
which maps the concatenated prediction vector $Z_i$ in Formula (9) to a final prediction vector $y_i$. This final prediction vector $y_i$ provides probabilities for each of the four emotional categories: angry, happy, sad, and relax. These probabilities indicate the likelihood of each emotion being expressed in the i-th song, given the combined predictions of the verse1, chorus, and verse2 segments, allowing for a nuanced understanding of the emotional content conveyed by the music.
A sample data scenario for Figure 2, showing how the six input stacked ensemble meta-learner produces the final classification, involves using the 24 probability values generated by the base models, as given by $Z_i$ in Formula (9): [0.0003, 0.0189, 0.0004, 0.9804, 0.0053, 0.0146, 0.4644, 0.5157, 0.0000, 0.0000, 0.0004, 0.9996, 0.0015, 0.0003, 0.9973, 0.0009, 0.0118, 0.0004, 0.0029, 0.9849, 0.0005, 0.9979, 0.0001, 0.0016], derived from the concatenation process in Section Concatenation Process for Six Input Stacked Ensemble Model. These values are then fed to the meta-learner model, which generates another four probability values for the final prediction.
The final classification, denoted as $C_{i,\text{part}}$ or $C_i$, is determined by selecting the emotion with the highest probability in the prediction vectors $y_{i,\text{part}}$ (10) and $y_i$ (11), represented as
$C_{i,\text{part}} = \operatorname{argmax}(y_{i,\text{part}})$, (12)
$C_i = \operatorname{argmax}(y_i)$, (13)
respectively, where $\operatorname{argmax}()$ returns the index of the maximum element in a vector. In the context of Equation (13), $\operatorname{argmax}(y_i)$ is used to find the index corresponding to the emotional category with the highest probability in the prediction vector $y_i$. Formula $C_{i,\text{part}}$ (12) represents the final classification for the dual input stacked ensemble model, while $C_i$ (13) represents the final classification for the six input stacked ensemble model used in this study.
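As a sketch of this final stage, the meta-learner can be written as a small Keras network applied to the concatenated vector; the ReLU activation on the hidden layer and the untrained placeholder input are assumptions made for illustration (in practice the meta-learner is trained on the base models’ held-out predictions before use).

```python
# Sketch of the meta-learner: a single 64-unit hidden layer and a 4-way softmax
# output, applied to the concatenated base-model predictions (Eqs. (10)-(13)).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

def build_meta_learner(input_dim, n_classes=4):
    return Sequential([
        Input(shape=(input_dim,)),
        Dense(64, activation="relu"),        # single hidden layer, 64 units (activation assumed)
        Dense(n_classes, activation="softmax"),
    ])

meta = build_meta_learner(input_dim=24)      # 24 inputs for the six-input model, 8 for dual input
Z_i = np.zeros((1, 24), dtype="float32")     # placeholder for a concatenated prediction vector
y_i = meta.predict(Z_i)                      # final probability vector y_i over the 4 emotions
C_i = int(np.argmax(y_i, axis=1)[0])         # Eq. (13): index of the predicted emotion class
```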
The meta-learner is a higher-level model designed to integrate the insights from the base models and make the final classification. By leveraging the combined information from both audio and lyrics predictions, the meta-learner can improve the accuracy of the emotion classification.
Overall, the described model represents a robust methodology for addressing the inherently multimodal nature of music. It aligns with research by [13,14,16] suggesting that combining audio and lyrical analysis outperforms models using only one data type. By stacking predictions from individual models dedicated to different data modalities and song parts, the model harnesses a broad range of features that provide a nuanced understanding of musical emotions.
The researchers identified the following reasons why this approach might be favored in experiments:
  • Holistic Understanding: Songs are more than the sum of their parts. By analyzing a song, researchers can better understand how different elements interact and contribute to the overall mood, theme, and emotional impact.
  • Contextual Relevance: Many aspects of a song, such as lyrical themes, musical motifs, and emotional dynamics, unfold over the course of the entire piece rather than within isolated sections. Examining the song in its entirety provides a richer context for interpreting these elements.
  • Structural Dynamics: Songs often exhibit structural patterns and dynamics that unfold across multiple sections. Analyzing the song globally allows researchers to identify recurring themes, variations, and transitions that shape the overall narrative arc.
  • Emotional Flow: Music has the power to evoke emotions and tell stories through its progression. Analyzing the song enables researchers to trace the emotional journey experienced by listeners from beginning to end.
  • Comparative Analysis: Analyzing songs globally facilitates comparisons between different compositions and genres. Researchers can assess how various structural and thematic elements contribute to the overall effectiveness and appeal of the music.
In summary, analyzing a song globally provides a comprehensive perspective that captures its interconnectedness and artistic integrity, shedding light on how its various elements work together to create a meaningful listening experience.
In practical applications, this model holds promise for dynamic playlist creation based on the collective audio and lyrics features of the structural parts of a song, thereby enhancing music discovery services. Furthermore, it could find utility in therapeutic settings where music is utilized to modulate mood, offering personalized interventions based on emotional preferences.

2.3. Training the Model

Our designed training process is underpinned by a rationale aimed at ensuring the robustness and generalization of our model. We begin by strategically partitioning the dataset into training and validation sets, employing a random seed of 42 and maintaining an 80-20 split ratio. This choice is deliberate, as it balances the need for providing sufficient data for effective model training while reserving a sizable portion for independent validation. By allocating 80% of the data for training, we aim to expose the model to a diverse range of examples, enabling it to learn and generalize patterns effectively. Simultaneously, the 20% reserved for validation allows us to assess the model’s performance on unseen data, providing valuable insights into its ability to generalize to real-world scenarios.
Throughout the training phase, we iteratively enhance the model’s capacity to capture intricate patterns in the data by employing multiple epochs. This approach allows the model to refine its understanding of the underlying relationships between input features and target labels over successive training iterations. Furthermore, the incorporation of early stopping, configured with a patience of 10 and a minimum delta of 0.0001, serves as a critical mechanism to prevent overfitting. By monitoring the model’s performance on the validation set, early stopping halts the training process when performance ceases to improve, thus preventing the model from learning noise or irrelevant patterns from the training data. This not only conserves computational resources but also ensures that the final model is optimized for generalization to unseen data.
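A minimal sketch of this training setup is given below, assuming a Keras model and scikit-learn’s train_test_split; the placeholder data, network shape, epoch ceiling, and the choice of validation loss as the monitored quantity are assumptions, while the split ratio, random seed, patience, and minimum delta follow the description above.

```python
# Sketch of the training procedure: 80-20 split with seed 42, early stopping
# with patience 10 and min_delta 0.0001, Adam optimizer, categorical cross-entropy.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping

# Placeholder data standing in for one feature set (e.g., verse1 audio features).
X = np.random.rand(400, 27).astype("float32")
y_onehot = np.eye(4)[np.random.randint(0, 4, 400)]      # one-hot emotion labels

model = Sequential([Input(shape=(27,)),
                    Dense(64, activation="relu"),
                    Dense(4, activation="softmax")])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

X_train, X_val, y_train, y_val = train_test_split(X, y_onehot,
                                                  test_size=0.20, random_state=42)

early_stop = EarlyStopping(monitor="val_loss", patience=10, min_delta=0.0001)

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=200, callbacks=[early_stop], verbose=0)
```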
The culmination of our training process involves a rigorous evaluation on a separate test dataset using a specified metric. This step serves to provide an unbiased assessment of the model’s generalization performance and validate its reliability and effectiveness in real-world scenarios. By rigorously testing the model on unseen data, we gain confidence in its ability to accurately predict emotional states in music across diverse contexts, thereby affirming its utility and practical applicability. Overall, our training methodology reflects a meticulous and systematic approach aimed at delivering a robust and reliable model for music emotion recognition.

2.4. Optimizer and Loss Function

The model employs the Adam optimization algorithm, a strategic choice for its adaptive learning rate that amalgamates the strengths of Adagrad and RMSprop. This algorithm is widely embraced in deep learning due to its capability to converge swiftly and exhibit robust performance across diverse tasks. Adam’s adaptive learning rate adjustment proves crucial in circumventing convergence issues and expediting the training process, enhancing the model’s efficiency. Additionally, the model utilizes categorical cross-entropy as its loss function, a well-suited metric for multi-class classification challenges where each input belongs to one of several classes. This loss function quantifies the dissimilarity between the predicted class probabilities and the actual class distribution, providing a reliable measure for the model’s performance in intricate classification tasks.

2.5. Model’s Performance Evaluation Criteria

In music emotion recognition (MER), the application of performance metrics such as accuracy, precision, recall, and the F1 score is essential for evaluating the effectiveness of classification models. The authors of [28] define each of these performance metrics. Accuracy provides a straightforward measure of how well a model performs its intended task. Precision and recall offer more granular insights: precision is crucial when the consequences of false positives are significant, while recall is critical when failing to detect true positives could have severe implications. The F1 score, which balances precision and recall through their harmonic mean, is particularly useful in scenarios where both false positives and false negatives carry high costs. Overall, these metrics are vital for navigating the complexities of MER, helping to enhance the robustness of systems that classify the wide array of human emotions expressed through music and ensuring effectiveness across varied musical emotional contexts.
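For reference, these four criteria can be computed from predicted and true class indices with scikit-learn as sketched below; the sample labels and the use of macro averaging over the four emotion classes are assumptions for illustration.

```python
# Sketch of computing accuracy, precision, recall, and F1 score for the
# four emotion classes (angry=0, happy=1, relax=2, sad=3).
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0, 3, 1, 2, 3, 0])   # placeholder ground-truth class indices
y_pred = np.array([0, 3, 2, 2, 3, 1])   # placeholder predictions (argmax of softmax outputs)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("f1 score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
```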

3. Results and Discussion

The results and discussion section of this study sheds light on the twenty-eight unique datasets created, each meticulously tailored to focus on specific structural parts of songs. This strategic approach aims to optimize the performance of the stacking model in music emotion recognition. The rationale behind this detailed dataset establishment stems from the recognition of the diverse structural elements present within songs, including Verse 1, Chorus, Verse 2, the Whole Song, and the amalgamation of Verse 1, Chorus, and Verse 2.
It is essential to note that each individual component of the song was processed using a dual-base model framework specifically designed for the stacking method. This decision was made because the network receives only two types of input data: one related to audio and the other to lyrics. The model utilized corresponds to the one outlined in Figure 5. Conversely, the stacking model for the structural sequence of Verse 1-Chorus-Verse 2 within the songs incorporates a six-input model, which represents discrete audio and lyrics data for each respective section, as illustrated in Figure 2.
Conducting model training and testing yields different results for different datasets. The performance of the above-mentioned models was measured following the common classification performance criteria, namely accuracy, precision, recall, and F1-score.
This study specifically aims to identify the datasets that yield the best performance when integrated with the stacking model. This systematic approach ensures a comprehensive exploration of the model’s capabilities given combined audio and lyrics features from the structured parts of a song, thereby providing valuable insights into the nuances of music emotion recognition.

3.1. Performance of Basic Stacked Ensemble Model

In this section, we discuss the performance of the basic stacked ensemble model applied to various structural parts of songs. After establishing the model’s parameters, rigorous training and testing were conducted on each structural segment of a song, utilizing both audio and lyric features. Accuracy, precision, recall, and F1 scores were subsequently collected to facilitate a comprehensive analysis of the model’s performance.

3.1.1. Performance of Basic Stacked Ensemble for Verse 1 Dataset

Figure 6 displays the performance metrics of the stacking method for the Verse1 dataset, presenting three different configurations indicated by the minimum document frequency (MinDF) settings of 5, 10, and 30. The metrics, as mentioned above, include accuracy, precision, recall, and F1 score—all crucial for evaluating a model’s performance in classification tasks.
The highest accuracy is observed with MinDF set to 5 (46.25%), declining as the MinDF increases. This suggests that including terms appearing in at least five documents enhances model performance. Lowering the MinDF threshold allows the model to capture more diverse and specific features from the dataset, thereby improving accuracy.
Highest precision is attained at the highest MinDF of 30 (48.72%), indicating that the model is more precise at this threshold. Focusing on terms more prevalent across the dataset potentially leads to more confident predictions, reducing false positives. Consequently, precision is optimized when the model prioritizes terms with higher frequencies.
Recall peaks with MinDF set to 5 (46.69%), akin to accuracy, suggesting that a lower threshold includes terms crucial for identifying positive instances accurately. Lowering the MinDF threshold enables the model to capture a broader range of terms, enhancing recall by effectively identifying more true positive instances.
The F1 Score mirrors trends observed in accuracy and recall, with the highest value at MinDF = 5 (47.24%). Achieving a balance between false positives and false negatives is optimal at this setting, as indicated by the harmonic mean of precision and recall. Lower MinDF thresholds allow the model to maintain this balance by capturing a broader range of significant terms, leading to improved overall performance in classifying emotional states.
Figure 6 illustrates a clear trade-off between precision and recall. As precision increases with higher MinDF settings, recall and accuracy decrease, highlighting the need to strike a balance based on the desired outcome of the model’s application. These findings offer valuable guidance for model tuning. For applications prioritizing recall, such as avoiding missing positive cases, a lower MinDF might be favored. Conversely, in scenarios where precision is paramount, such as when a false positive carries significant consequence, a higher MinDF could be more appropriate.
The impact of MinDF on model performance underscores its role as a feature selection technique in text processing, influencing which terms are included as features in the model. These results underscore the importance of considering the dataset and the problem space. If the dataset contains many rare but informative terms, opting for a lower MinDF threshold could capture this valuable information. Conversely, a higher MinDF threshold might be advantageous in focusing on more commonly occurring terms, potentially enhancing model performance in certain contexts. Therefore, understanding the dataset’s characteristics and the specific requirements of the problem space is crucial when determining the optimal MinDF setting for model training and application.

3.1.2. Performance of Basic Stacked Ensemble Model for Chorus Datasets

Figure 7 displays the performance metrics of the stacking method for the Chorus dataset, offering a comprehensive evaluation of model performance across varying MinDF settings—5, 10, and 30—specifically tailored to song choruses.
The model achieves its peak accuracy at MinDF = 5 (47.50%), indicating that lower thresholds, encompassing less frequent terms, are crucial for enhancing model performance on this dataset. Precision reaches its zenith at MinDF = 30 (51.51%), signifying that when the model predicts an instance to be positive at this threshold, it does so with a heightened level of confidence. Recall attains its maximum value at MinDF = 5 (41.52%), suggesting that broader term inclusion is advantageous for identifying true positives. Similarly, the F1 Score, balancing precision and recall, also peaks at MinDF = 5 (48.31%), indicating a more balanced performance between false positives and false negatives at this level.
These performance metrics underscore the influence of MinDF on model outcomes. Lower thresholds for MinDF generally lead to higher accuracy, recall, and F1 scores, indicative of a more effective model when considering a wider range of terms from the chorus. Conversely, as the MinDF increases, there is a notable gain in precision at the expense of recall and accuracy, highlighting a trade-off that hinges on the model’s intended application. These results offer valuable insights for optimizing the model, emphasizing the critical role of careful MinDF selection based on the desired outcome—whether prioritizing the avoidance of false positives (higher MinDF) or minimizing missed actual positives (lower MinDF).

3.1.3. Performance of Basic Stacked Ensemble Model for Verse2 Dataset

Figure 8 depicts the performance metrics of a model evaluated at different MinDF settings for the Verse 2 Dataset, offering insights into the effectiveness of varying feature selection parameters for the second verses of songs.
The model achieves its highest accuracy at the MinDF setting of 5 (52.50%), indicating that including terms appearing in a minimum of five documents is beneficial for enhancing model performance on the second verses of songs. Precision peaks at MinDF = 5 (53.67%), suggesting that at this setting, the model achieves the highest proportion of true positives out of all predicted positives, emphasizing the significance of less frequent terms for precision. Similarly, recall reaches its zenith at MinDF = 5 (53.31%), indicating the model’s effectiveness in identifying all relevant instances at this threshold. The F1 Score, reflecting a balance between precision and recall, also peaks at MinDF = 5 (53.05%), showcasing a well-balanced performance at this lower MinDF threshold.
The observed trend indicates that all performance metrics are highest at the lowest MinDF setting, suggesting that the second verses of songs may contain unique or less frequent lyrical content crucial for the emotion recognition task. Unlike the results for Verse 1 and Chorus, where a notable trade-off between precision and recall was observed at higher MinDF settings, Verse 2 results suggest that a lower MinDF threshold enhances both precision and recall simultaneously. The improved metrics across the board with a lower MinDF imply that for Verse 2, a broader inclusion of terms, even those less frequent, is critical for the accuracy of the model’s predictions.
Comparing these results to those of Verse 1 and Chorus reveals that each section of a song may influence the emotion recognition model differently. These insights can inform targeted feature selection and model tuning specific to different song parts. In conclusion, the “Verse2 Results” chart underscores that a lower MinDF setting is most effective for emotion recognition in the second verse of songs, highlighting how the choice of feature selection parameters significantly affects the performance of a machine learning model in the domain of music analysis.

3.1.4. Performance of Basic Stacked Ensemble Model for Whole Song Dataset

Figure 9 displays the performance metrics of a model that processes the entirety of songs at various MinDF settings. The accuracy of the model is consistently high across all MinDF settings, with a slight peak at MinDF = 5 and MinDF = 10 (50.00%), which implies that the model’s overall performance is relatively stable across different thresholds. The highest precision is observed at MinDF = 30 (51.48%), suggesting that the model, at this threshold, is slightly more precise in predicting true positives, potentially due to focusing on more frequently occurring terms. The recall is marginally higher at MinDF = 10 (50.05%), indicating that this setting might strike a better balance between including less frequent terms and filtering out noise for detecting true positives. The F1 Score, which indicates the balance between precision and recall, is notably higher at MinDF = 30 (50.12%). This suggests that the precision gains at this higher threshold are beneficial to the model’s harmonic mean performance.
In contrast to the analysis of individual song components such as verses and choruses, the evaluation of the entire song demonstrates less variation in performance metrics across MinDF settings. This observation suggests that the full context of a song may not be as sensitive to the frequency of term occurrence. The relative stability of accuracy across thresholds implies that the model remains robust when considering the entirety of a song’s content, encompassing both frequent and infrequent terms that contribute to emotion recognition.
A discernible trade-off between precision and recall becomes evident, particularly noticeable as the MinDF value increases. At MinDF = 30, higher precision is achieved, indicating fewer misclassified terms, but there is a slight decrease in recall, suggesting that some relevant terms may be missed. This trade-off underscores the importance of careful consideration when selecting the MinDF value based on the specific requirements of the application.
For applications where minimizing false positives is critical, a higher MinDF might be preferable due to its impact on precision and the F1 score. However, if capturing as many relevant instances as possible is important, a lower MinDF might be chosen, as indicated by the recall rates. Therefore, the choice of MinDF setting should be aligned with the priorities of the application, striking a balance between precision and recall that optimizes model performance for the given task.
In summary, the “Wholesong Results” chart suggests that for full song analysis, there is a nuanced impact of MinDF on performance metrics. The choice of MinDF should be guided by the specific balance of precision and recall that is desired for the model’s application in music emotion recognition.

3.2. Performance of 6-Input Stacked Ensemble Model

In this section, we discuss the performance of the 6-input stacked ensemble model applied to the combined structural parts of songs, analyzing the evaluation results of all the audio and lyric features in the verse1, chorus, and verse2 datasets. This stacked ensemble model has six inputs that process the audio and lyrics datasets separately; their outputs are then combined and fed into the meta-learner, as shown in Figure 2.
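A minimal sketch of the six-input wiring is shown below. The choice of random-forest base models and a logistic-regression meta-learner is an assumption for illustration; the actual base models (Figures 3 and 4) may differ, so this should be read as a sketch of the stacking structure rather than the exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# X_inputs holds six feature matrices aligned row-by-row, one per input:
# (verse1_audio, verse1_lyrics, chorus_audio, chorus_lyrics, verse2_audio, verse2_lyrics).
def fit_six_input_stack(X_inputs, y, n_folds=5):
    base_models, meta_features = [], []
    for X in X_inputs:
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        # Out-of-fold class probabilities keep the meta-learner from training on
        # predictions that the base model made for data it had already seen.
        oof = cross_val_predict(clf, X, y, cv=n_folds, method="predict_proba")
        meta_features.append(oof)
        base_models.append(clf.fit(X, y))      # refit on all data for inference
    meta_X = np.hstack(meta_features)          # concatenate the six probability blocks
    meta_learner = LogisticRegression(max_iter=1000).fit(meta_X, y)
    return base_models, meta_learner

def predict_six_input_stack(base_models, meta_learner, X_inputs):
    meta_X = np.hstack([m.predict_proba(X) for m, X in zip(base_models, X_inputs)])
    return meta_learner.predict(meta_X)
```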
Figure 10 shows remarkable consistency in accuracy across all MinDF settings, with a slight increase at MinDF = 5 (71.25%). This stability suggests that the model remains relatively robust, with lower MinDF values potentially enhancing the overall correctness of predictions. Precision reaches its peak at MinDF = 5 (72.37%), indicating that including terms appearing in at least five documents allows the model to make more accurate positive predictions.
The highest recall is achieved at MinDF = 30 (74.50%), suggesting that, at this threshold, the model excels at identifying all relevant instances despite potentially including fewer terms. The F1 Score, aiming to balance precision and recall, remains relatively stable across all settings, with a minor peak at MinDF = 30 (74.47%).
The model’s performance appears less sensitive to changes in MinDF settings compared to results observed in individual song parts, suggesting that the combination of multiple song elements provides a rich dataset benefiting from both frequent and infrequent terms. The chart indicates that for this combined song structure, a higher MinDF contributes to better recall, while a lower MinDF contributes to better precision. However, the stability of the F1 Score suggests a relatively balanced trade-off between these metrics. The high F1 Scores across all MinDF settings underscore the model’s strong predictive ability when evaluating the composite structure of Verse 1, Chorus, and Verse 2.
The choice of MinDF should align with the specific goals of the analysis. A higher MinDF may be preferred to capture most true positives (high recall), while a lower MinDF would be beneficial for more accurate positive predictions (high precision).
In summary, the “Verse1-Chorus-Verse2 Results” chart underscores the importance of MinDF in tuning model performance for composite song structures. It emphasizes that the inclusion of both frequent and infrequent terms is vital for the model’s ability to recognize emotions in music, and adjustments to the MinDF can fine-tune the model for different performance outcomes.

3.3. Song Parts Implications for Music Emotion Recognition

In the context of MER, examining the performance of machine learning models across various song structures—such as Verse 1, Verse 2, Chorus, Whole Song, and Verse1-Chorus-Verse2—provides valuable insights into how emotional content is conveyed and interpreted through both lyrics and potentially audio features.
Results from Verse 1 and Verse 2 indicate that these song segments contain distinct emotional cues captured by a diverse range of terms, including less frequent ones. As verses often establish narratives or scenes, the nuanced language within them is crucial for discerning the intended emotion. The improved performance metrics at lower MinDF settings suggest that the richness of emotional expression may be linked to the diversity of vocabulary used.
The Chorus typically encapsulates the primary theme or emotional climax of a song. The observed increase in precision at higher MinDF settings may reflect the repetition of emotionally charged phrases typical in choruses. This implies that the chorus’s repetitive nature, aimed at evoking or amplifying the song’s core emotional message, is best captured with more frequent terms.
The robustness of the model across different MinDF settings when analyzing whole songs suggests that both frequent and infrequent terms play significant roles in conveying emotion. The entirety of the song provides a comprehensive emotional narrative, which is not overly sensitive to the frequency of term usage, indicating that emotional content is broadly distributed throughout the song and not confined to specific sections or terms.
The composite structure of Verse1-Chorus-Verse2 exhibits high recall at higher MinDF settings, suggesting that capturing the full spectrum of emotional expression may necessitate focusing on the most common terms across all three sections. The consistent F1 Score across all settings underscores the effectiveness of combining these sections to achieve a balance between precision and recall.
Different parts of a song contribute uniquely to its overall emotional expression, with verses providing context and narrative, the chorus reinforcing the central emotional theme, and the combination of these elements offering a complete emotional picture. In practical applications, such as playlist creation or album analysis, understanding these nuances is crucial. For instance, in a recommendation system, models tuned to higher MinDF settings might be more effective if a user prefers songs with a specific emotional tone in the chorus. Conversely, for therapeutic or educational purposes where understanding the full narrative is vital, models with lower MinDF settings that capture finer details in verses could be more suitable. Overall, these analyses provide a comprehensive understanding of how to approach emotion recognition in music, emphasizing the importance of considering musical structure and desired outcomes when developing machine learning models for this purpose.

3.4. Effectiveness of Multi-Input Stacking Methods

The results from the various analyses of the structural elements of songs (Verse 1, Verse 2, Chorus, Whole Song, and the combined Verse1-Chorus-Verse2 structure) using different MinDF settings provide a compelling argument for the bimodal approach and the stacking method in music emotion recognition.
By utilizing both audio and lyrics data, the bimodal approach captures a more complete picture of the emotional cues present in music. Audio features might capture the tone, beat, and melody, while lyrics provide semantic and narrative context. As seen in the results, models that consider both modes tend to perform better in terms of accuracy and F1 scores, especially when processing the complete structure of Verse1-Chorus-Verse2. This indicates that combining these two types of data helps the model accurately identify the emotions conveyed by a song. Each mode compensates for the potential shortcomings of the other: where lyrics might be ambiguous, the clarity of emotion could be reinforced by the audio, and vice versa. A bimodal approach can therefore be more robust to songs with less clear emotional expression in one mode; for example, if the lyrics are not strongly indicative of emotion, the audio features can provide the necessary cues to classify the song's emotion correctly. The approach is thus most effective when the complete, carefully selected bimodal dataset is fed into the model, as illustrated in the sketch below.
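To make the bimodal idea concrete, the following sketch derives one audio feature vector and one lyric feature vector per segment. The specific librosa descriptors (MFCC means, chroma means, tempo) and the TF-IDF settings are assumptions chosen for illustration and are not the study's exact audio feature set; `segment_audio_paths` and `segment_lyrics` are hypothetical, aligned lists with one entry per song.

```python
import numpy as np
import librosa
from sklearn.feature_extraction.text import TfidfVectorizer

def audio_feature_vector(path, sr=22050):
    """Summarise a song segment with a few common spectral descriptors (illustrative choice)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # timbre
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # harmony
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)        # rhythm
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1), np.atleast_1d(tempo)])

# One audio matrix and one lyric matrix for a given segment (e.g., the chorus).
X_audio = np.vstack([audio_feature_vector(p) for p in segment_audio_paths])
X_lyrics = TfidfVectorizer(min_df=5, ngram_range=(1, 2)).fit_transform(segment_lyrics)
# X_audio feeds the audio base model and X_lyrics the lyrics base model for that segment.
```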
Stacking methods, which use the predictions of individual base models as inputs to a final meta-learner, have also proven effective in the results. This is likely because stacking allows the model to learn not only from the input data but also from the patterns in the predictions made by the base models. The stacking approach allows for complexity where necessary: the base models can be complex and tuned to capture the nuances in their specific modalities, while the meta-learner can be simpler, focusing on combining these nuances effectively. Because stacking uses predictions from base models as features, it can reduce the risk of overfitting by distilling the inputs into more robust meta-features that are less likely to be noise-driven. Stacking also inherently benefits from model diversity: if the audio and lyrics models make different kinds of errors, the meta-learner can mitigate these by learning to correct them when combining the models' outputs. A generic example of this mechanism is sketched below.
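As a point of reference, the snippet below shows generic stacking with scikit-learn's `StackingClassifier` on a single concatenated feature matrix, unlike the six separate inputs handled manually above. The base estimators are placeholders chosen for illustration; the point is only that cross-validated base-model predictions become the meta-learner's features.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Base estimators are trained with internal cross-validation, and their held-out
# predictions form the feature matrix passed to the final (meta) estimator.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",   # pass class probabilities, not hard labels
)
# Usage (with suitable train/test splits): stack.fit(X_train, y_train); stack.score(X_test, y_test)
```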
Based on the results above, the stacking method has proven particularly effective for the Verse1-Chorus-Verse2 structure, likely due to its ability to integrate and leverage the unique emotional content presented in each song part. This suggests that for tasks such as music emotion recognition, where the aim is to understand complex and subjective qualities like emotional tone, a stacked bimodal approach is advantageous. It allows the creation of a more nuanced and accurate emotion recognition system by combining the strengths of both audio- and lyrics-based features, as opposed to relying on a unimodal approach that might miss key emotional indicators.

3.5. Comparison with Other Existing Studies

Recent advancements in MER underscore the importance of integrating multimodal data to enhance predictive accuracy. In this study, our objective was to investigate the efficacy of a stacked generalization approach utilizing both audio and lyrics input data to model emotional content in music. This approach was compared against prominent methodologies in the field, as shown in Table 2, which covers models using audio-only, lyrics-only, and combined audio and lyrics features.
The results of our study are benchmarked against three existing bimodal models [29,30,31]. Our comparative analysis reveals that the bimodal stacking approach surpasses the highest benchmark and aligns with broader findings advocating for the synergy of lyrical and audio data in recognizing musical emotions. For instance, the CNN-LSTM model reported by [29], leveraging spectrogram and statistical features, achieved an average performance of 71.2%; our bimodal stacked model improves on this by 25.08% in accuracy. This suggests that incorporating structural song elements in the model may capture a more comprehensive emotional spectrum, similar to the diverse features utilized by [29].
Similarly, the approach conducted by [30], with an accuracy of 78.2%, utilized a multifaceted feature set including Word2Vec embeddings, highlighting the potential of sophisticated linguistic models in MER. Our bimodal stacked model supports this finding, indicating that nuanced language processing, modulated through MinDF parameterization, can enhance model performance significantly.
Furthermore, the high performance achieved by BERT and CNN Fusion [31] (92%) underscores the potency of state-of-the-art natural language processing and image recognition techniques in MER. Our bimodal stacked model exceeds this, reaching an accuracy of 96.28%, a difference of 4.28%. It also aligns conceptually with this approach, as it similarly seeks to harness the strengths of different data representations.
The models using audio-only features also exhibit strong performance in music emotion recognition. The study by [32] effectively utilized neural networks with the MediaEval dataset using audio-only features, demonstrating the potential of deep learning in this domain. A machine learning model such as the hierarchical support vector machine (SVM) classifier [33] demonstrated high accuracy with audio-only features, highlighting the effectiveness of well-tuned machine learning algorithms in capturing emotional content from audio signals. A similar approach was employed by [34], which added another classification layer to improve the accuracy of classifying songs into “relax” or “sad” categories. Lastly, the study by [35] utilized a genetic algorithm for feature optimization, achieving a maximum accuracy of 84.57%.
The highest accuracy achieved by an audio-only model is 92.33%, using a hierarchical SVM classifier [33]. Our proposed model achieves an accuracy of 96.28%, representing a 3.95% improvement over the best audio-only model. Furthermore, our bimodal stacked ensemble model demonstrates a substantial performance enhancement, with an average improvement of 12.49% in accuracy over the audio-only models. This significant improvement supports the claims of [29,30,31] that combining audio and lyric features leads to superior performance in music emotion recognition.
Models using lyrics-only features also perform notably in music emotion classification. The study by [36] utilizes Plutchik’s model of eight basic emotions to classify music based on lyrics, with a relatively low accuracy of 43.4%. This suggests that while the approach may capture some emotional content from the lyrics, it struggles with precise classification, possibly due to the limited size of the dataset and the complexity of accurately mapping words to emotional states. The Bi-LSTM model using GloVe word embeddings achieves a high accuracy of 91.0%; this performance highlights the effectiveness of advanced NLP techniques and deep learning models in processing and understanding lyrical content, and the large dataset size likely contributes to the robustness and generalizability of the model [37]. Lastly, the model that employs a classification and regression approach using Russell’s emotion model to analyze music lyrics [38] achieves an accuracy of 77.1%, a solid performance suggesting that quadrant-based emotional classification can effectively capture the emotional nuances in lyrics.
The highest accuracy achieved by a lyrics-only model is 91.0%, using a bidirectional long short-term memory model with GloVe embeddings [37]. Our proposed model achieves an accuracy of 96.28%, marking a 5.28% improvement over the best lyrics-only model. Furthermore, our bimodal stacked ensemble model significantly outperforms the lyrics-only models, with an average improvement of 25.78% in accuracy (the mean accuracy of the three lyrics-only models is (43.4% + 91.0% + 77.1%)/3 = 70.5%, and 96.28% − 70.5% = 25.78%). This substantial improvement once again supports the claims of [29,30,31] that integrating audio and lyric features leads to enhanced performance in music emotion recognition.
Our 6-input bimodal stacked model outperforms the most advanced audio-only, lyrics-only, and fusion-based approaches considered here. The combination of audio and lyric features, along with the Verse1-Chorus-Verse2 stacking method, contributes to higher accuracy and robust performance. This comparison clearly indicates the advancements achieved by our model, and its alignment with current research trends confirms the validity of its underlying principles. Our study contributes to the evolving discourse on MER by reinforcing the merit of composite analyses and the potential of stacking methods in harnessing multimodal information. Further investigation into feature extraction techniques and model architectures is warranted to fully realize the capabilities of bimodal approaches in MER.

4. Conclusions

In conclusion, our comprehensive examination of datasets integrating both audio and lyrics has uncovered a nuanced landscape of performance accuracy utilizing stacking methods. While the absence of a single consistently superior model underscores the intricate and multifaceted nature of music emotion recognition, our experiments have showcased a remarkable achievement of a maximum accuracy of 96.28%. This notable result highlights the significant potential of effectively integrating both audio and lyrics features, which represents a substantial contribution to the field of music emotion recognition.
A discernible pattern emerges from our findings, indicating that a minimum document frequency of 30 contributes to optimal results when extracting data from lyrics. However, it is crucial to acknowledge that other influencing factors, potentially linked to the subjective choices of playlist creators on platforms like Spotify, also impact performance.
Moving forward, we recommend future research to delve deeper into the performance dynamics of combined audio and lyrics datasets, exploring synchronized patterns in data collection and analysis. This approach holds the promise of unveiling more profound insights into the intricate interplay between audio and lyrics in music analysis. Ultimately, this will pave the way for the enhancement of feature engineering and the development of more robust and comprehensive models, such as the utilization of ensemble methods incorporating CNN and RNN–LSTM architectures in the realm of music emotion recognition, building upon the groundwork laid by our study.

5. Possible Future Works

In the realm of music emotion recognition (MER), future research holds promising avenues for further advancement and innovation. One direction involves exploring additional modalities beyond audio and lyrics, such as music metadata or user behavior data, to enrich the understanding of emotional nuances in music. Additionally, the development and exploration of novel model architectures, including ensemble methods, graph neural networks, and attention mechanisms, offer exciting opportunities for enhancing the accuracy and interpretability of MER systems. Fine-tuning existing model architectures, including convolutional neural networks (CNN) and recurrent neural networks with long short-term memory (RNN–LSTM), presents another opportunity for enhancing model performance and accuracy. Moreover, leveraging transfer learning techniques and pretrained models from related domains could expedite model training and improve efficiency. Integrating real-time data streams into model training processes offers the potential for dynamic and adaptive emotion recognition systems, enabling personalized music recommendations and immersive user experiences. Evaluating model performance across diverse music genres and cultural contexts will ensure robustness and applicability across different user demographics. Lastly, interdisciplinary collaboration between experts in musicology, psychology, and machine learning can foster innovative approaches and deepen our understanding of the emotional dimensions of music. Through these future research endeavors, we can push the boundaries of MER and unlock new insights into the profound impact of music on human emotions.

Author Contributions

The contributions of the authors to this research are as follows: Conceptualization, L.J.M.R. and A.T.; methodology, L.J.M.R. and A.T.; validation, L.J.M.R. and A.T.; formal analysis, L.J.M.R. and A.T.; data curation, L.J.M.R.; writing—original draft preparation, L.J.M.R.; writing—review and editing, L.J.M.R. and A.T.; supervision, A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Heshmat, S. Music, Emotion, and Well-Being. 2019. Available online: https://www.psychologytoday.com/intl/blog/science-choice/201908/music-emotion-and-well-being (accessed on 3 April 2024).
  2. Soon, B. Stacking to Improve Model Performance: A Comprehensive Guide to Ensemble Learning in Python. 2023. Available online: https://medium.com/@brijesh_soni/stacking-to-improve-model-performance-a-comprehensive-guide-on-ensemble-learning-in-python-9ed53c93ce28 (accessed on 3 April 2024).
  3. Panda, R.; Redinho, H.; Gonçalves, C.; Malheiro, R.; Paiva, R.P. How does the Spotify API compare to the Music Emotion Recognition State-of-the-art. In Proceedings of the 18th Sound and Music Computing Conference, Virtual, 20 June–1 July 2021. [Google Scholar]
  4. Hu, X.; Downie, J.S. Exploring Mood Metadata: Relationships with Genre, Artist, and Usage Metadata. In Proceedings of the 8th International Conference on Music Information Retrieval, ISMIR 2007, Vienna, Austria, 23–27 September 2007; pp. 67–72. [Google Scholar]
  5. Fernández-Sotos, A.; Fernández-Caballero, A.; Latorre, J.M. Influence of Tempo and Rhythmic Unit in Musical Emotion Regulation. Front. Comput. Neurosci. 2016, 10, 80. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  6. Eerola, T.; Vuoskoski, J.K. A comparison of the discrete and dimensional models of emotion in music. Psychol. Music 2011, 39, 18–49. [Google Scholar] [CrossRef]
  7. Strapparava, C.; Mihalcea, R. Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing (SAC ’08), Fortaleza, Ceará, Brazil, 16–20 March 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 1556–1560. [Google Scholar] [CrossRef]
  8. Cui, X.; Wu, Y.; Wu, J.; You, Z.; Xiahou, J.; Ouyang, M. A review: Music-emotion recognition and analysis based on EEG signals. Front. Neuroinform. 2022, 16, 997282. [Google Scholar] [CrossRef] [PubMed]
  9. Choi, K.; Fazekas, G.; Sandler, M. Music emotion recognition with CNN-LSTM recurrent neural networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR) 2017, Suzhou, China, 23–27 October 2017; pp. 76–83. [Google Scholar]
  10. Yang, Z.; Wang, X.; Ji, Q. Emotion recognition in lyrics with attention-based bidirectional LSTM. In Proceedings of the 28th ACM International Conference on Multimedia (MM) 2020, Seattle, WA, USA, 12–16 October 2020; pp. 2217–2225. [Google Scholar]
  11. Zhang, Y.; Wang, H.; Yang, Y. A multimodal deep learning approach for music emotion recognition. ACM Trans. Multimed. Comput. Commun. Appl. TOMM 2018, 14, 1–19. [Google Scholar]
  12. Padmanabhan, A.; Mahanta, S. Audio Feature Extraction. Available online: https://devopedia.org/audio-feature-extraction (accessed on 1 July 2023).
  13. Agashe, R. Building Intelligent Audio Systems—Audio Feature Extraction using Machine Learning. 2021. Available online: https://www.einfochips.com/blog/building-intelligent-audio-systems-audio-feature-extraction-using-machine-learning (accessed on 1 July 2023).
  14. Zhao, Y.; Xu, J.; Wu, X.; Yang, Z. Rhythm pattern analysis for music emotion classification. Multimed. Tools Appl. 2019, 78, 28677–28694. [Google Scholar]
  15. Li, Y.; Lu, K.; Zhang, Z. Hybrid feature fusion for music emotion recognition. IEEE Trans. Affect. Comput. 2018, 9, 572–585. [Google Scholar]
  16. Panda, R.; Malheiro, R.; Paiva, R.P. Audio Features for Music Emotion Recognition: A Survey. IEEE Trans. Affect. Comput. 2020, 14, 68–88. Available online: https://mir.dei.uc.pt/pdf/Journals/MOODetector/TAFFC_2023_Panda.pdf (accessed on 10 May 2024). [CrossRef]
  17. Kim, S.; Lee, J.H.; Lee, S. Multimodal music emotion recognition using audio and lyrics with attention-based fusion recurrent neural networks. IEEE Trans. Affect. Comput. 2020, 11, 109–124. [Google Scholar]
  18. Li, Z.; Li, M.; Zhu, W.; Wang, M. Multimodal music emotion recognition via fusion of audio and lyrics features with attention mechanism. Appl. Sci. 2020, 10, 2887. [Google Scholar]
  19. Hu, X.; Xie, Y.; Hu, X. Emotion recognition in music with lyrics using multitask learning. In Proceedings of the International Joint Conference on Neural Networks (IJCNN) 2019, Budapest, Hungary, 14–19 July 2019; pp. 1–6. [Google Scholar]
  20. Khadkevich, M.; Li, X.; Yang, Z.; Yang, Y. Cross-modal music emotion recognition with graph convolutional networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020, Barcelona, Spain, 4–8 May 2020; pp. 321–325. [Google Scholar]
  21. Yang, Z.; Yang, D.; Yang, Y. Music emotion recognition with hierarchical attention-based deep learning. IEEE Trans. Affect. Comput. 2020, 13, 54–67. [Google Scholar]
  22. Yönak, R. How Spotify Has Changed the Way We Listen to Music. Audioxide, 2019. Available online: https://audioxide.com/articles/how-spotify-has-changed-the-way-we-listen-to-music/ (accessed on 30 June 2024).
  23. Wolf, K. Hyper-Specific Playlists: A Tool for Emotional Connection and Expression, the Daily Universe. 2022. Available online: https://universe.byu.edu/2022/11/03/hyper-specific-playlists-a-tool-for-emotional-connection-and-expression/ (accessed on 8 April 2024).
  24. Yang, Z.; Yang, D.; Yang, Y. Music emotion recognition based on sequential patterns of emotional contours. Multimed. Tools Appl. 2019, 78, 24307–24323. [Google Scholar]
  25. Humphrey, E.; Mantel, B.; Kachele, M. Machine learning for music emotion recognition: Reviewing relevant work. J. Intell. Inf. Syst. 2013, 41, 455–489. [Google Scholar]
  26. Kim, Y.; Schmidt, E.M. Recurrent convolutional neural networks for music emotion recognition. IEEE Trans. Affect. Comput. 2018, 9, 511–518. [Google Scholar]
  27. Wang, J.; Yang, Y.; Chang, K.; Wang, H.; Jeng, S. Exploring the relationship between categorical and dimensional emotion semantics of music. In Proceedings of the MIRUM ’12: Second International ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies, Nara, Japan, 2 November 2012; pp. 63–68. [Google Scholar] [CrossRef]
  28. Alexander, B. Accuracy vs. Precision vs. Recall in Machine Learning: What Is the Difference? 2023. Available online: https://encord.com/blog/classification-metrics-accuracy-precision-recall/ (accessed on 13 May 2024).
  29. Jia, X. Music Emotion Classification Method Based on Deep Learning and Explicit Sparse Attention Network. Comput. Intell. Neurosci. 2022, 2022, 3920663. [Google Scholar] [CrossRef] [PubMed]
  30. Chen, C.; Li, Q. A Multimodal Music Emotion Classification Method Based on Multifeature Combined Network Classifier. Math. Probl. Eng. 2020, 2020, 4606027. [Google Scholar] [CrossRef]
  31. Revathy, V.R.; Pillai, A.S.; Daneshfar, F. LyEmoBERT: Classification of lyrics’ emotion and recommendation using a pre-trained model. Procedia Comput. Sci. 2023, 218, 1196–1208. [Google Scholar] [CrossRef]
  32. Medina, Y.O.; Beltrán, J.R.; Baldassarri, S. Emotional classification of music using neural networks with the MediaEval dataset. Pers. Ubiquitous Comput. 2022, 26, 1237–1249. [Google Scholar] [CrossRef]
  33. Chiang, W.C.; Wang, J.S.; Hsu, Y.L. A Music Emotion Recognition Algorithm with Hierarchical SVM Based Classifiers. In Proceedings of the 2014 International Symposium on Computer, Consumer and Control, Taichung, Taiwan, 10–12 June 2014; pp. 1249–1252. [Google Scholar] [CrossRef]
  34. Pouyanfar, S.; Sameti, H. Music emotion recognition using two level classification. In Proceedings of the 2014 Iranian Conference on Intelligent Systems (ICIS), Bam, Iran, 4–6 February 2014; pp. 1–6. [Google Scholar] [CrossRef]
  35. Bargaje, M. Emotion recognition and emotion based classification of audio using genetic algorithm—An optimized approach. In Proceedings of the 2015 International Conference on Industrial Instrumentation and Control (ICIC), Pune, India, 28–30 May 2015. [Google Scholar] [CrossRef]
  36. Oh, S.; Hahn, M.; Kim, J. Music Mood Classification Using Intro and Refrain Parts of Lyrics. In Proceedings of the 2013 International Conference on Information Science and Applications (ICISA) 2013, Pattaya, Thailand, 24–26 June 2013. [Google Scholar] [CrossRef]
  37. Abdillah, J.; Asror, I.; Wibowo, Y.F. Emotion Classification of Song Lyrics using Bidirectional LSTM Method with GloVe Word Representation Weighting. RESTI J. Syst. Eng. Inf. Technol. 2020, 4, 723–729. Available online: https://www.academia.edu/66996948/Emotion_Classification_of_Song_Lyrics_using_Bidirectional_LSTM_Method_with_GloVe_Word_Representation_Weighting (accessed on 29 May 2024).
  38. Malheiro, R.; Panda, R.; Gomes, P.J.S.; Paiva, R.P. Classification and Regression of Music Lyrics: Emotionally-Significant Features. Int. Conf. Knowl. Discov. Inf. Retr. 2016, 2, 45–55. [Google Scholar] [CrossRef]
Figure 1. Basic Feature Extraction Process.
Figure 2. Integration Diagram for Stacked Ensemble Model.
Figure 3. Audio Base Model.
Figure 4. Lyrics Base Model.
Figure 5. Dual Input Stacked Ensemble Model used for Verse1, Chorus, Verse2, and Whole song Datasets.
Figure 6. Verse 1 Performance Evaluation Results.
Figure 7. Chorus Performance Evaluation Results.
Figure 8. Verse 2 Performance Evaluation Results.
Figure 9. Whole Song Performance Evaluation Results.
Figure 10. Verse 1-Chorus-Verse 2 Performance Evaluation Results.
Table 1. Raw Audio and Lyrics Feature Datasets.

| Song Part | Audio Features | Document Frequency | Unigram | Bigram |
|---|---|---|---|---|
| Verse 1 | (400, 29) | 5 | (400, 369) | (400, 427) |
| | | 10 | (400, 189) | (400, 198) |
| | | 30 | (400, 45) | (400, 45) |
| Verse 2 | (400, 29) | 5 | (400, 363) | (400, 417) |
| | | 10 | (400, 170) | (400, 176) |
| | | 30 | (400, 45) | (400, 45) |
| Chorus | (400, 29) | 5 | (400, 237) | (400, 278) |
| | | 10 | (400, 127) | (400, 131) |
| | | 30 | (400, 29) | (400, 29) |
| Whole Song | (400, 29) | 5 | (400, 835) | (400, 1427) |
| | | 10 | (400, 520) | (400, 637) |
| | | 30 | (400, 183) | (400, 194) |
| Verse 1-Chorus-Verse 2 | (400, 87) | 5 | (400, 969) | (400, 1119) |
| | | 10 | (400, 486) | (400, 505) |
| | | 30 | (400, 119) | (400, 119) |
Table 2. Comparison with Related Works.

| Model | Dataset | Features | Evaluation Accuracy |
|---|---|---|---|
| CNN-LSTM and Sparse Attention Network [29] | Chinese Audio Lyrics | Audio and Lyrics | 71.2% |
| Multi Feature combined classifier and stacking fusion [30] | Million Song Dataset | Audio and Lyrics | 78.2% |
| LyEmoBERT [31] | MoodyLyrics | Audio and Lyrics | 92% |
| Our Proposed Verse1-Chorus-Verse2 Stacking Method | Verse1, Chorus, Verse2 datasets | Audio and Lyrics | 96.28% |
| Multilayer Perceptron (MLP) [32] | MediaEval Dataset | Audio Only | 73% |
| Hierarchical SVM Classifier [33] | 219 classical Music | Audio Only | 92.33% |
| Two-level Classification using SVM [34] | 280 Pop Music | Audio Only | 87.27% |
| Genetic Algorithm [35] | 200+ English and Hindi Songs | Audio Only | 84.57% |
| Plutchik’s emotion model [36] | 100 songs | Lyrics Only | 43.4% |
| Bidirectional Long-Short Term Memory Using GloVe [37] | 2189 songs from Moody Lyrics | Lyrics Only | 91.0% |
| Classification and Regression by quadrant [38] | 180 lyrics | Lyrics Only | 77.1% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
