Article
Peer-Review Record

Augmentation Embedded Deep Convolutional Neural Network for Predominant Instrument Recognition

Appl. Sci. 2023, 13(18), 10189; https://doi.org/10.3390/app131810189
by Jian Zhang 1,* and Na Bai 2
Reviewer 1:
Reviewer 2:
Submission received: 11 August 2023 / Revised: 4 September 2023 / Accepted: 8 September 2023 / Published: 11 September 2023

Round 1

Reviewer 1 Report

The paper is an engaging examination of an essential topic. It is well organized and explicated. The novelty is satisfactory, but it is not a game-changer. I have only three observations that the authors should consider.

1. References that are highly pertinent to the topic are missing. For example, the research of Szeliga et al. (2022), Gururani et al. (2019), Seetharaman et al. (2019), Cramer et al. (2019), Hung et al. (2019), and Manilow et al. (2020). These papers present qualitative and quantitative findings that should be contrasted with the authors' findings.

2. As shown in Table 5, the overall accuracy of the authors' methods is nearly identical to that of many other methods. After adding the appropriate error bars, I am uncertain as to which method is the best. 

3. The works of Szeliga (2022) or Hung (2019) are more accurate overall. In addition, Hung presents a detailed accuracy for each instrument to be recognized, whereas the current paper does not. This could be a more effective method to compare the various approaches.

The English is acceptable.

 

Author Response

Reviewer#1, Concern # 1: References that are highly pertinent to the topic are missing. For example, the research of Szeliga et al. (2022), Gururani et al. (2019), Seetharaman et al. (2019), Cramer et al. (2019), Hung et al. (2019), and Manilow et al. (2020). These papers present qualitative and quantitative findings that should be contrasted with the authors' findings.

Author response: Thanks for the comment. The papers you mentioned have made significant contributions to improving the quality of our manuscript. In this version, the pre-training methods and pre-trained features proposed in the aforementioned papers are leveraged to optimize our proposed AEDCN model. Specifically, AEDCN adds a pre-training process based on the introduced datasets, similar to a transfer-learning approach, to improve the adaptability of AEDCN to the predominant instrument recognition task. AEDCN then integrates the data augmentation stage as a part of the instrument recognition neural network: briefly, AEDCN uses 2 fully connected layers following the convolutional layers in the multi-task backbone instrument recognition network and incorporates a constructed Adversarial Conditional Embedded Variational AutoEncoder (ACEVAE) between the added fully connected layers of this backbone network. Our manuscript focuses on the task of recognizing predominant musical instruments, where we detect whether specific instruments appear in a piece of music. In this version of our paper, we have revised the title and task to be "predominant instrument recognition." Accordingly, we use the IRMAS dataset, which has monophonic training data and polyphonic testing data, in our experiments. In the experiments, we also add qualitative and quantitative comparisons with the mentioned methods in our manuscript. However, the music source separation methods are commonly used for polyphonic training datasets, so we did not add them for comparison.
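
For illustration, a minimal PyTorch sketch of this layout follows; the layer sizes, the 11-class output, and the simplified label-conditioned generator standing in for ACEVAE are illustrative assumptions rather than the exact configuration in the manuscript.

```python
# Hedged sketch of the AEDCN layout described above: a convolutional backbone over the
# 2-channel input, two added fully connected layers, and a label-conditioned generator
# (a simplified stand-in for the embedded ACEVAE) placed between them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelConditionedGenerator(nn.Module):
    """Simplified stand-in for ACEVAE: maps (noise, one-hot label) to a feature vector."""
    def __init__(self, feat_dim, n_classes, z_dim=32):
        super().__init__()
        self.z_dim, self.n_classes = z_dim, n_classes
        self.net = nn.Sequential(nn.Linear(z_dim + n_classes, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))

    def generate(self, labels):
        z = torch.randn(labels.size(0), self.z_dim, device=labels.device)
        y = F.one_hot(labels, self.n_classes).float()
        return self.net(torch.cat([z, y], dim=1))

class AEDCNSketch(nn.Module):
    def __init__(self, n_classes=11, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(                    # convolutional feature extractor
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)))
        self.fc1 = nn.Linear(64 * 4 * 4, feat_dim)        # first added fully connected layer
        self.augmenter = LabelConditionedGenerator(feat_dim, n_classes)  # embedded augmentation module
        self.fc2 = nn.Linear(feat_dim, n_classes)         # second added fully connected layer (classifier)

    def forward(self, x, aug_labels=None):
        h = torch.relu(self.fc1(self.backbone(x).flatten(1)))
        if aug_labels is not None:                        # synthesize extra features for designated labels
            h = torch.cat([h, self.augmenter.generate(aug_labels)], dim=0)
        return self.fc2(h)                                # logits for real (and augmented) feature samples
```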

Thanks for this comment.

 

Reviewer#1, Concern # 2: As shown in Table 5, the overall accuracy of the authors' methods is nearly identical to that of many other methods. After adding the appropriate error bars, I am uncertain as to which method is the best.

Author response: Thanks for this comment. In this version, we add a pre-training process for the predominant instrument recognition task and improve the effectiveness of the proposed AEDCN; correspondingly, the experimental results of AEDCN now surpass those of the other methods. Moreover, compared with other data augmentation based methods, AEDCN generates features of the fully connected layer rather than directly generating the high-resolution input spectrograms. Therefore, its data augmentation process has a lower computational burden than the commonly used data augmentation methods.

Thanks for this comment.

Reviewer#1, Concern # 3: The works of Szeliga (2022) or Hung (2019) are more accurate overall. In addition, Hung presents a detailed accuracy for each instrument to be recognized, whereas the current paper does not. This could be a more effective method to compare the various approaches.

Author response: Thanks for the comment. We carefully read these two papers, and they are very helpful for our manuscript. The transfer-learning-based staged training method in Szeliga (2022) is introduced into our AEDCN model to improve its effectiveness. Regarding the experiments in their work, although they achieve better experimental results, they only use the monophonic training data rather than the polyphonic testing data. For comparison, we use the staged training method to train the deep neural networks in our experiments on the complete IRMAS dataset.

Hung et al. (2019) introduce multi-task learning for instrument recognition. However, the targets of multi-task learning in their work and in our manuscript differ. In their work, they propose a method to recognize both pitches and instruments, but the pitches in the IRMAS dataset are not labeled. In our manuscript, we leverage multi-task learning in AEDCN by grouping the instruments based on their onset types in addition to the instrument labels themselves. This approach allows us to effectively utilize the available labeled data for instrument recognition. By incorporating multi-task learning, we aim to enhance the performance of our model in recognizing the predominant instruments.
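
As an aside, a compact sketch of how such multi-task output heads over a shared feature vector might look (the group counts and head names are assumptions, not the manuscript's exact grouping):

```python
# Illustrative multi-task heads over a shared feature vector: the main instrument head
# plus an auxiliary head for onset-type groups. Group counts are assumptions.
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    def __init__(self, feat_dim=256, n_instruments=11, n_onset_groups=3):
        super().__init__()
        self.instrument = nn.Linear(feat_dim, n_instruments)    # main task: instrument labels
        self.onset_group = nn.Linear(feat_dim, n_onset_groups)  # auxiliary task: onset-type group

    def forward(self, h):
        # both heads share the same feature vector h; their losses are summed during training
        return self.instrument(h), self.onset_group(h)
```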

Thanks for this comment.

Author Response File: Author Response.doc

Reviewer 2 Report

The entire paper is very well structured and explained. Proofread it to correct typos and grammatical errors. The suggestions mentioned below may be addressed to improve the quality of the manuscript.

1. The problem and context are clearly explained in the abstract. But think about simply explaining why enhancing instrument representation consistency is crucial for precise recognition.

2. The creation of AEDCN as a novel technique is well highlighted in the abstract. Give a brief explanation of how the "combined 2-channel representation of instruments" adds to the innovation to make your point more clear.

3. The lack of consistency in current augmentation techniques is mentioned in the abstract. A phrase or two describing the possible effects of this inconsistency on recognition accuracy would be helpful.

4. Give a brief overview of the variety of models and augmentation approaches that were utilized for comparison while pointing out that AEDCN performs better than other deep neural networks and augmentation strategies.

5. Give an idea of the new research avenues that this study opens up, highlighting the potential wider impact of your approach on the subject of music information retrieval.

It is suggested to proofread the paper to correct typos and grammatical errors.

Author Response

Reviewer#2, Concern # 1: The problem and context are clearly explained in the abstract. But think about simply explaining why enhancing instrument representation consistency is crucial for precise recognition.

Author response: Thanks for the comment. We add a brief explanation of the inconsistency in the Abstract. Specifically, in commonly used data augmentation based instrument recognition models, the data generation process is independent of the recognition process; as a result, the generated data may not fit the requirements of the instrument recognition model. In particular, the types of instrument spectrograms generated through data augmentation may not necessarily align with the specific requirements of the instrument recognition models. This discrepancy can cause inconsistencies between the augmented data and the needs of the instrument recognition model, which can negatively impact its accuracy and performance. Moreover, generating the high-resolution spectrograms brings a high computational cost. In this manuscript, the proposed AEDCN can generate synthetic feature samples that specifically correspond to certain designated labels. This process enables the network to augment the training data by generating additional feature samples for particular classes, consequently improving the model's capacity to recognize and classify these classes.
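
To make the label-conditioned feature generation concrete, the following is a hedged PyTorch sketch of an ACEVAE-style module: a conditional VAE over fully connected features with an adversarial discriminator over real versus generated features. The dimensions and module layout are assumptions for illustration only, not the exact design in the manuscript.

```python
# Hedged sketch of an ACEVAE-style module: a conditional VAE over feature vectors with an
# adversarial discriminator, able to synthesize features for designated labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACEVAESketch(nn.Module):
    def __init__(self, feat_dim=256, n_classes=11, z_dim=32):
        super().__init__()
        self.n_classes, self.z_dim = n_classes, z_dim
        self.encoder = nn.Linear(feat_dim + n_classes, 128)
        self.mu, self.logvar = nn.Linear(128, z_dim), nn.Linear(128, z_dim)
        self.decoder = nn.Sequential(nn.Linear(z_dim + n_classes, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        self.discriminator = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                           nn.Linear(64, 1))  # scores real vs. generated features

    def forward(self, h, labels):
        # encode a real feature vector h conditioned on its label, then reconstruct it
        y = F.one_hot(labels, self.n_classes).float()
        e = torch.relu(self.encoder(torch.cat([h, y], dim=1)))
        mu, logvar = self.mu(e), self.logvar(e)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(torch.cat([z, y], dim=1)), mu, logvar

    def generate(self, labels):
        # sample synthetic feature vectors for the designated labels
        y = F.one_hot(labels, self.n_classes).float()
        z = torch.randn(labels.size(0), self.z_dim, device=labels.device)
        return self.decoder(torch.cat([z, y], dim=1))
```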

Thanks for this comment.

 

Reviewer#2, Concern # 2: The creation of AEDCN as a novel technique is well highlighted in the abstract. Give a brief explanation of how the "combined 2-channel representation of instruments" adds to the innovation to make your point more clear.

Author response: Thanks for this comment. In this version, we describe the innovation of the "combined 2-channel representation of instruments" in the main contributions in the Introduction: “Proposal of a combined 2-channel representation aimed at capturing distinctive rhythm patterns of various instrument types more effectively. This representation comprises a mel-spectrum and a tempogram in the two channels, allowing it to better capture the characteristic rhythmic patterns associated with specific instruments.”

Specifically, the effectiveness of the tempogram for instrument recognition can be attributed to its ability to capture the unique rhythm patterns exhibited by different instruments. The tempogram provides a detailed analysis of the rhythmic content of an audio signal, highlighting its temporal structure and variations in intensity. By utilizing the tempogram, instrument recognition models can extract valuable temporal information that is crucial for distinguishing between instruments, which enhances the discriminative power of the model. Furthermore, the tempogram helps in overcoming challenges such as variations in playing style, dynamics, and articulation: it provides a robust representation that is largely invariant to these factors, improving the accuracy and robustness of instrument recognition systems. Therefore, we introduce this representation into our predominant instrument recognition model.
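
For reference, a hedged sketch of how such a 2-channel input could be computed with librosa follows; the sampling rate, hop length, and 128-bin sizes are illustrative assumptions, not the manuscript's exact settings.

```python
# Hedged sketch of building a combined 2-channel input: a log mel-spectrogram and a
# tempogram computed with a matching number of bins so both can be stacked as channels.
import numpy as np
import librosa

def two_channel_representation(path, sr=22050, n_bins=128, hop_length=512):
    y, sr = librosa.load(path, sr=sr)

    # Channel 1: log-scaled mel-spectrogram
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop_length)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Channel 2: tempogram over the onset strength envelope (captures rhythm patterns)
    oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    tempo = librosa.feature.tempogram(onset_envelope=oenv, sr=sr,
                                      hop_length=hop_length, win_length=n_bins)

    # Align frame counts and stack into a (2, n_bins, n_frames) array
    n_frames = min(mel_db.shape[1], tempo.shape[1])
    return np.stack([mel_db[:, :n_frames], tempo[:, :n_frames]], axis=0)
```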

Thanks for this comment.

 

Reviewer#2, Concern # 3: The lack of consistency in current augmentation techniques is mentioned in the abstract. A phrase or two describing the possible effects of this inconsistency on recognition accuracy would be helpful.

Author response: Thanks for the comment. We revise the Abstract and describe the possible effects of this inconsistency:
Abstract: Instrument recognition is a critical task in the field of music information retrieval, and deep neural networks have become the dominant models for this task due to their effectiveness. Recently, incorporating data augmentation methods into deep neural networks has been a popular approach to improve instrument recognition performance. However, existing data augmentation processes are always based on simple instrument spectrogram representations and are typically independent of the predominant instrument recognition process. These approaches may result in a lack of coverage for certain required instrument types, leading to inconsistencies between the augmented data and the specific requirements of the recognition model. To build a more expressive instrument representation and address this inconsistency, this paper constructs a combined 2-channel representation for further capturing unique rhythm patterns of different types of instruments and proposes a new predominant instrument recognition strategy called Augmentation Embedded Deep Convolutional Neural Network (AEDCN). AEDCN adds 2 fully connected layers into the backbone neural network and integrates data augmentation directly into the recognition process by introducing a proposed Adversarial Conditional Embedded Variational AutoEncoder (ACEVAE) between the added fully connected layers of the backbone network. This embedded module aims to generate augmented data based on designated labels, thereby ensuring its compatibility with the predominant instrument recognition model. The effectiveness of the combined representation and AEDCN is validated through comparative experiments with other commonly used deep neural networks and data augmentation based predominant instrument recognition methods using a polyphonic music recognition dataset. The results demonstrate the superior performance of AEDCN in predominant instrument recognition tasks.

Thanks for this comment.

Reviewer#2, Concern # 4: Give a brief overview of the variety of models and augmentation approaches that were utilized for comparison while pointing out that AEDCN performs better than other deep neural networks and augmentation strategies.

Author response: Thanks for the comment. In this version, we introduce related models and augmentation approaches in the Related Works: “To address this issue, data augmentation methods have been introduced in instrument recognition models. Yu et al. [10] proposed constructing a network with an auxiliary classification based on onset groups and instrument families to generate valuable training data. Another study addressed predominant instrument recognition in polyphonic music using convolutional recurrent neural networks (CRNNs) [9]. Hung et al. (2019) introduce multi-task learning for instrument recognition and propose a method to recognize both pitches and instruments [16]. To augment the data, a Wave Generative Adversarial Network (WaveGAN) architecture has been employed to generate audio files [7-9]. These approaches demonstrate the utilization of various techniques, including feature extraction, deep learning, image processing, and data augmentation, to improve instrument recognition accuracy and handle challenges such as low-quality recordings and polyphonic music.” Moreover, in the experiments, we add for comparison a data augmentation based deep neural network that uses a different generative model (a VAE) to verify the effectiveness of the proposed AEDCN.

 Thanks for this comment.

Reviewer#2, Concern # 5: Give an idea of the new research avenues that this study opens up, highlighting the potential wider impact of your approach on the subject of music information retrieval.

Author response: Thanks for the comment. We add the research avenues to the Conclusions. This paper proposes an augmentation based deep convolutional neural network, AEDCN, for instrument recognition. Compared with other data augmentation based instrument recognition methods, AEDCN only generates features rather than directly generating the high-dimensional spectrograms. Specifically, AEDCN places an ACEVAE between the added fully connected layers in a multi-task label augmentation based backbone instrument recognition network. Guided by the recognition results, ACEVAE can augment specific data based on specially designated labels. Experiments verify the effectiveness of AEDCN.

In addition, the AEDCN model and the designed data augmentation methods offer a simplified approach to data augmentation for tasks that require more data but face challenges in generating original data due to high computational burden. By applying data augmentation techniques within the AEDCN model, we can effectively increase the size and diversity of the training dataset without relying solely on generating entirely new data. This approach reduces the computational complexity associated with generating original data while still achieving the benefits of augmented data. This simplified data augmentation approach may enable researchers and practitioners to enhance the performance and generalization ability of their models, even when access to large amounts of original data is limited or computationally expensive.

Thanks for this comment.

Author Response File: Author Response.doc
