Verse1-Chorus-Verse2 Structure: A Stacked Ensemble Approach for Enhanced Music Emotion Recognition
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The main weakness of the work is that it does not justify a new standard in the field, as claimed in Section 1.5. I agree that it contributes to the field, but it does not establish a new standard. To claim that it establishes a new standard, the literature review must be deepened, examining works in the same field with greater depth. It should be clarified what a standard means in this context and which standards currently exist, and the reasons for considering this work a contribution toward a new standard should be solidly supported.
- Table 1 is not clearly understood; it should be explained better.
- It is important to make clear whether the dataset is your own or taken from an external source. The dataset is mentioned among the contributions, but in the development of the paper it appears to be taken from Spotify.
- The overall model integrates all parts of the song structure, verses and choruses, through concatenation. It is important to clarify why, in the experiments, the song is analyzed globally rather than by structural parts.
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The study delves into music emotion recognition within artificial intelligence frameworks. While the methodology is well articulated, the primary objective of the research remains unclear to this reviewer.
The authors mention considering spectral bandwidth in their computations. Could you elaborate on how this frequency-domain feature relates to emotion recognition, if at all?
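For context, a minimal sketch of how spectral bandwidth is typically extracted, assuming a librosa-based pipeline (the manuscript does not state its feature-extraction toolkit here); the file name song.wav and the summary statistics are illustrative placeholders only.

```python
import numpy as np
import librosa

# Load the audio; "song.wav" is a placeholder path, 22.05 kHz is librosa's default rate.
y, sr = librosa.load("song.wav", sr=22050)

# Frame-level spectral bandwidth: how widely spectral energy spreads around the
# spectral centroid, a timbral cue often included in emotion-recognition feature sets.
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)  # shape (1, n_frames)

# Summary statistics turn the frame-level series into a fixed-length track feature,
# the usual way such a descriptor enters a classifier's input vector.
features = {
    "bandwidth_mean": float(np.mean(bandwidth)),
    "bandwidth_std": float(np.std(bandwidth)),
}
print(features)
```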
Regarding references, please include one for Russell's Emotion Plane at L209.
At L213, could you provide a scientific rationale for excluding gym and yoga songs?
Furthermore, it would be helpful to clarify whether all songs used in the study were in English at L225.
I suggest a minor revision.
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript entitled ‘Verse1-Chorus-Verse2 Structure: A Stacked Ensemble Approach for Enhanced Music Emotion Recognition’ describes emotion recognition from rhythm and lyrics within a structured framework. The authors have utilized Spotify playlists to construct a stacked ensemble model to improve emotion prediction.
Specific comments to improve:
1. The narrative may give readers the impression that the authors did not write the manuscript themselves. The description does not specifically address the dataset and instead reads like a review article. The manuscript needs to describe the methodology precisely.
2. The abstract does not reflect what is described in the manuscript.
3. The authors have provided a detailed introduction describing the referenced articles, which may be helpful to the readers.
4. The methodology sections need clarity; the description sounds hypothetical rather than an experimental/mathematical derivation. The description of Table 1 needs to be clear so that readers can appreciate how its numbers are derived and which song the authors are referring to. Otherwise, it reads as a generalized statement.
5. The maximum likelihood estimator and function described for the prediction vectors and output classification are generally adopted in many existing algorithms, as the generic sketch after this list illustrates; how are the methods described in this paper unique?
6. The results section needs to be improved. The comparative bar diagrams (Figures 4 to 8) do not provide an accurate scale; providing a scale on the y-axis will help.
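For reference, a minimal sketch of the conventional stacking arrangement alluded to in point 5, with assumed base learners and a synthetic dataset (none of these choices are taken from the manuscript): class-probability prediction vectors from the base models are concatenated and passed to a logistic-regression meta-learner, which is itself fitted by maximizing the likelihood of the training labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic 4-class data standing in for song-level features.
X, y = make_classification(n_samples=400, n_classes=4, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Arbitrary base models; any probabilistic classifiers would do.
base_models = [GaussianNB(), SVC(probability=True, random_state=0)]
for m in base_models:
    m.fit(X_tr, y_tr)

# Prediction vectors: one class-probability vector per base model, concatenated.
Z_tr = np.hstack([m.predict_proba(X_tr) for m in base_models])
Z_te = np.hstack([m.predict_proba(X_te) for m in base_models])

# Meta-learner fitted on the stacked prediction vectors (a maximum-likelihood fit).
meta = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print("stacked accuracy:", meta.score(Z_te, y_te))
```

In practice, out-of-fold predictions would normally be used to fit the meta-learner and avoid leakage; the sketch omits that step for brevity.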
Comments on the Quality of English Language
The description needs to correspond to the data being presented and should not be superfluous.
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
- This study uses a stacked ensemble model for music emotion recognition.
- The detailed approach of classifying each song into verse1, chorus, and verse2 segments through structural analysis is impressive.
Comments for author File: Comments.pdf
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The work has improved considerably in terms of its introduction and the inclusion of details that allow a better understanding of its purpose and development methodology.
However:
1. The concatenation process should be clarified in more detail, perhaps by explaining with a brief example how the audio of a song is concatenated with its lyrics (an illustrative sketch follows these comments). What type of information does this vector contain? What is its structure? Explain more clearly why this concatenation makes sense.
2. The data flow between the concatenated features and the meta-learner is not clear. This process should be detailed so that it can be understood.
3. The classification results of the proposed approach should be compared against a system that classifies by text only or by audio only.
4. The methodology should include, somewhere, a more practical explanation through a brief example that clarifies the concatenation process and justifies, from a musical perspective, the strengths of carrying it out.
The topic of the work is extremely interesting, but I consider that it requires rewriting to allow a better understanding of the contribution, including a more extensive review of the state of the art that supports the importance of performing classification with the concatenation of sound and lyrics. For this last part, a theoretical background that supports the contribution from a musical point of view is also important.
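The sketch below illustrates, under assumed feature choices (MFCC statistics for a segment's audio, TF-IDF for its lyrics, placeholder file names and text), what such a concatenated vector could look like and what would flow toward a meta-learner; it shows the kind of detail requested, not the authors' actual pipeline.

```python
import numpy as np
import librosa
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Audio side: a fixed-length vector of summary statistics per segment ---
y, sr = librosa.load("verse1.wav", sr=22050)              # placeholder segment file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, n_frames)
audio_vec = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])   # length 26

# --- Lyric side: a fixed-length text vector for the same segment ---
corpus = ["placeholder lyrics of verse 1", "other training lyrics ..."]
tfidf = TfidfVectorizer(max_features=300).fit(corpus)
lyric_vec = tfidf.transform([corpus[0]]).toarray().ravel()

# --- Concatenation: one flat feature vector, audio block followed by lyric block ---
segment_vec = np.concatenate([audio_vec, lyric_vec])
print(segment_vec.shape)   # first 26 dims are audio statistics, the rest are TF-IDF weights

# In a stacked design, such vectors (or the base models' prediction vectors derived
# from them) for verse1, chorus, and verse2 are what would flow into the meta-learner.
```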
Author Response
Please see the attachment.
Author Response File: Author Response.pdf