Article
Peer-Review Record

Sound Event Detection Using Derivative Features in Deep Neural Networks

Appl. Sci. 2020, 10(14), 4911; https://doi.org/10.3390/app10144911
by Jin-Yeol Kwak and Yong-Joo Chung *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 23 June 2020 / Revised: 13 July 2020 / Accepted: 15 July 2020 / Published: 17 July 2020
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

In this manuscript, the authors propose using the log-mel filterbank and its first and second derivatives for sound event detection with two different deep neural network architectures. During training, they also use weakly labeled or unlabeled audio data. They test their method on two datasets from the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 and 2019 challenges. They report an improvement of 16.9%. Here are my comments and questions:

  1. Could you compare your method with some other approaches? What are its advantages? I believe it is doable because you are using the publicly available challenge data.
  2. How do you select the hyperparameters controlling the network learning process?
  3. Remove the word 'are' in the sentence starting with 'They are can...' (row 49).

Author Response

Could you compare your method with some other approaches? What are its advantages? I believe it is doable because you are using the publicly available challenge data.

--> We compared the proposed methods with the baseline systems of DCASE 2018 (CRNN) and DCASE 2019 (mean-teacher model), which were officially announced at the challenges. They use the same data and quite similar network architectures to ours. Although an exact performance comparison was difficult due to differences in the implementation details of the neural networks, we confirmed that the proposed methods consistently outperform the baseline systems of DCASE 2018 and 2019 thanks to the use of the derivative features.

More details are given in the revised paper. See Lines 237~243, Lines 277~283, and Tables 3 and 5.
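For reference, the first- and second-derivative (delta and delta-delta) features can be obtained from the log-mel filterbank by a regression over neighboring frames. The following is a minimal NumPy sketch; the window half-length `n=2`, the frame/band counts, and the random toy input are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def delta(features, n=2):
    """Regression-based delta over 2n+1 frames.
    features: (num_frames, num_bands) array, e.g. log-mel energies."""
    denom = 2 * sum(i * i for i in range(1, n + 1))
    # Replicate edge frames so the output keeps the same length.
    padded = np.pad(features, ((n, n), (0, 0)), mode="edge")
    out = np.zeros_like(features, dtype=float)
    for i in range(1, n + 1):
        out += i * (padded[n + i:len(features) + n + i]
                    - padded[n - i:len(features) + n - i])
    return out / denom

# Toy example: 100 frames x 40 mel bands
logmel = np.random.randn(100, 40)
d1 = delta(logmel)   # first derivative (delta)
d2 = delta(d1)       # second derivative (delta-delta)
features = np.concatenate([logmel, d1, d2], axis=-1)  # (100, 120)
```

Stacking the static features with their two derivatives triples the input channel dimension while leaving the rest of the network unchanged, which is consistent with the small parameter increase the authors report below.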

How do you select the hyperparameters controlling the network learning process?

--> Since we used the same audio data and similar neural network architectures as the baseline systems of DCASE 2018 and 2019, we followed the hyperparameters used in those baseline systems. The hyperparameters not specified there were selected based on the performance on the validation data.

This fact is mentioned in the revised paper. See Lines 223~224.

Remove the word 'are' in the sentence starting with 'They are can...' (row 49).

--> Yes, we have corrected the typo.

Reviewer 2 Report

This paper deals with sound event detection based on neural networks using selected so-called derivative features. The authors claim that their model improves on state-of-the-art models.

I have several comments:


  1. References 1 to 7 should appear in a section called State-of-the-art or Related works. The main role of the Introduction is to describe current research issues and provide the authors' motivation behind the work.

  2. Line 75, log-mel filterbank, please explain how it works.

  3. What is a static feature? Please define it. Subsections 2.1 and 2.2 provide very limited information about the feature selection. Please improve readability.

  4. Line 114: you used a global average pooling layer, but the conv blocks contain max-pooling layers. Why did you decide to use GAP?

  5. Equations are too big and disrupt the reading, please modify them into correct format and size.

  6. Table 3 shows an average relative improvement of 11.6%, but in terms of error rate the model is better by only 0.5%. For a complete overview of the proposed model, the authors should also provide the training time for each compared model.

Author Response

References 1 to 7 should appear in a section called State-of-the-art or Related works. The main role of the Introduction is to describe current research issues and provide the authors' motivation behind the work.

  • It was not easy to create a separate section for the related works, considering both the volume of the content and its connection with the other parts of the Introduction. So, if the principle regarding the Introduction is not critical, we hope to keep the references as they are.

Line 75, log-mel filterbank, please explain how it works.

  • We added a figure that shows the extraction process of the log-mel filterbank. See Line 99.
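For readers without access to the added figure, the log-mel filterbank extraction pipeline is the standard one: frame the signal, window it, take the power spectrum, apply triangular mel-spaced filters, and take the logarithm. A minimal NumPy sketch follows; the sample rate, FFT size, hop, and band count are common illustrative defaults, not necessarily the authors' settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=40):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):          # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Frame -> window -> |STFT|^2 -> mel filterbank -> log."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    return np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)

x = np.random.randn(16000)  # 1 s of noise at 16 kHz
feat = log_mel(x)           # (num_frames, n_mels)
```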

What is a static feature? Please define it. Subsections 2.1 and 2.2 provide very limited information about the feature selection. Please improve readability.

  • We have explicitly defined the static feature and improved the readability. See Lines 104~111.

Line 114: you used a global average pooling layer, but the conv blocks contain max-pooling layers. Why did you decide to use GAP?

  • GAP is used to compute the clip-level output of the neural network by time-averaging the frame-level sigmoid outputs at the classification layer. This is necessary for training with the weakly labeled data. We mentioned this in the revised version. See Lines 120~121.
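A minimal sketch of this mechanism, with illustrative (assumed) frame and class counts: averaging frame-level sigmoid outputs over time yields one clip-level probability per class, which can be trained directly against a weak (clip-level) label:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical frame-level logits from the classification layer:
# (num_frames, num_classes)
frame_logits = np.random.randn(250, 10)

# Strong (frame-level) predictions, usable for event localization
frame_probs = sigmoid(frame_logits)

# Global average pooling over the time axis gives one clip-level
# probability per class, so a clip-level (weak) label suffices
# to compute the training loss.
clip_probs = frame_probs.mean(axis=0)   # shape: (10,)
```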

Equations are too big and disrupt the reading, please modify them into correct format and size.

  • We modified the equations to the correct format and size.

Table 3 shows an average relative improvement of 11.6%, but in terms of error rate the model is better by only 0.5%.

  • As is known, the error rate (ER) is less sensitive than the F-score in performance evaluation. We think that, if the performance gap in the F-score were much larger than presented, the performance difference in ER would have been more evident.

For a complete overview of the proposed model, the authors should also provide the training time for each compared model.

  • Since the proposed model only adds derivative features at the input layer, the increase in the number of weights and in computation time is very small. For example, the proposed CRNN model has about 127,000 weights, compared to 126,000 without the derivative features. We mentioned this in the revised version. See Lines 130~133.

Round 2

Reviewer 1 Report

All my comments and questions from the previous iteration have been addressed. I believe your manuscript has been improved and now warrants publication.

Reviewer 2 Report

The authors have incorporated my comments in the required form.
