Multi-Scale Features for Transformer Model to Improve the Performance of Sound Event Detection
Round 1
Reviewer 1 Report
1. I recommend expanding the introduction with sufficient background and a literature review. Add some recent work on noise/sound detection; the papers listed below are examples, but there are others as well. Also, discuss the techniques used by those researchers and compare the proposed work with theirs.
Suggested Papers:
a. Automatic Detection of Noise Events at Shooting Range Using Machine Learning
b. Detection and Identification of Background Sounds to Improvise Voice Interface in Critical Environments
c. Rare Sound Event Detection Using Deep Learning and Data Augmentation
2. I suggest that the authors clearly highlight the novelty of their work along with its contributions.
3. A proper diagram should be added to show the overall method of the proposed work, giving the reader a clear picture of the work at a glance. In addition, highlight the important parameters used for the different methods.
4. In the Results and Discussion section, add a table comparing the results of the proposed work with previous work by other researchers.
5. Were the parameters for the proposed architecture also decided based on reference 19? If so, please mention that as well, since it gives the reader a clear picture of how the architecture parameters, filters, and layers were decided. If not, please state how the parameters, activation layers, filters, and layers were decided.
6. Were the training, validation, and evaluation data drawn from the same block of data? If so, is there a risk of creating an overfitted model? Please clarify the test, validation, and evaluation sets; the current description is confusing.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Approved for publication.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
This manuscript, titled "Multi-scale Features for Transformer Model to Improve the Performance of Sound Event Detection", constructs a multi-scale feature extraction model using the Transformer encoder to classify and predict sound events. The paper also applies this model to the mean-teacher model, thereby demonstrating its effectiveness in semi-supervised learning. Finally, the paper uses the DCASE 2019 data set to verify the model, which shows a performance improvement over both the baseline Transformer and the baseline mean-teacher model. I recommend publication after some modifications.
Here are some comments and suggestions for this article:
- This article introduces the SED task well and mentions CRNN, a model architecture for SED, but it lacks an explanation of the internal architecture of the CRNN model. It is recommended to add an explanation and comparison of the internal architecture of CRNN to help illustrate the advantages of the Transformer architecture in the SED task.
- In this paper, it is very valuable to verify the effectiveness of the multi-scale model in semi-supervised learning from the perspective of the mean-teacher model, which can make the model better extend to a variety of learning mechanisms.
- The experimental part of this paper has sufficient data and detailed hyperparameters. If some ablation experiments could be designed for the Transformer encoders inside the multi-scale model, it would better reflect the value of the multi-scale extraction model.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
This paper can be accepted once the changes below are made.
1) In the ABSTRACT section, please state the problem first.
2) In the INTRODUCTION section, the term should be capitalized (deep neural networks (DNNs)).
3) The paper needs to be checked throughout by a native English speaker.
4) In the "Proposed Network Architecture" section, please state the action performed at every step and the logic behind it.
5) Please also write about the FUTURE WORK that should be done.
6) In the RESULTS section, clearly define the three parts of the training set and the reason for dividing it this way.
7) Please add a RELATED INFORMATION section with details about related papers.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
I have no further concerns. I accept the paper.