Article
Peer-Review Record

Identifying the Acoustic Source via MFF-ResNet with Low Sample Complexity

Electronics 2022, 11(21), 3578; https://doi.org/10.3390/electronics11213578
by Min Cui 1,*, Yang Liu 1, Yanbo Wang 2 and Pan Wang 3
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 26 August 2022 / Revised: 28 October 2022 / Accepted: 30 October 2022 / Published: 1 November 2022
(This article belongs to the Special Issue Emerging Trends in Advanced Video and Sequence Technology)

Round 1

Reviewer 1 Report

Please read and attend to the general comments I wrote in the PDF file. I hope they will be useful for the structure of your work.

Comments for author File: Comments.pdf

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper presents a new model (MFF-ResNet) in which the authors propose fusing audio and MFCC features by means of an attention-based residual neural network. A transfer learning method is proposed in order to train with few samples, and results are shown that compare the authors' proposal with those provided by a pre-trained network (R-ResNet).
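For orientation only, the following is an illustrative sketch of attention-based fusion of hand-crafted MFCC features with latent CNN features, not the authors' implementation (their code is linked later in this record); the module name, layer sizes, and MFCC dimension are assumptions:

# Minimal sketch of attention-weighted fusion of two feature branches.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, mfcc_dim=40, latent_dim=128, fused_dim=128):
        super().__init__()
        self.mfcc_proj = nn.Linear(mfcc_dim, fused_dim)      # project MFCCs into a common space
        self.latent_proj = nn.Linear(latent_dim, fused_dim)  # project CNN latent features likewise
        self.attn = nn.Linear(fused_dim * 2, 2)              # one attention score per branch

    def forward(self, mfcc_feat, latent_feat):
        a = self.mfcc_proj(mfcc_feat)
        b = self.latent_proj(latent_feat)
        # Softmax weights decide how much each branch contributes to the fused vector.
        w = torch.softmax(self.attn(torch.cat([a, b], dim=-1)), dim=-1)
        return w[..., 0:1] * a + w[..., 1:2] * b

fusion = AttentionFusion()
fused = fusion(torch.randn(8, 40), torch.randn(8, 128))  # a batch of 8 samples
print(fused.shape)  # torch.Size([8, 128])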

 

The work presents a number of weaknesses that require rethinking before it can be scientifically sound for the community working in MIR (music information retrieval) or ESC (environmental sound classification).

 

1. I strongly recommend a literature review better adapted to the field of ESC. It is noteworthy that not even one of Piczak's early works is cited:

 

[1] K. J. Piczak, ESC: Dataset for environmental sound classification, in: Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 1015–1018. doi:10.1145/2733373.2806390.

 

A good starting point would be to review these two papers (among many others), which would put the authors' approach in context:

 

[2] Tripathi, A. M., & Mishra, A. (2021). Environment sound classification using an attention-based residual neural network. Neurocomputing, 460, 409-423.

[3] Bansal, A., & Garg, N. K. (2022). Environmental Sound Classification: A descriptive review of the literature. Intelligent Systems with Applications, 200115.

 

The first is a paper very similar to the one presented by the authors but with a much better set of references, methodological approach and presentation of results. The second is a general review necessary to adequately frame the work.

 

 

2. A methodological review of the research is essential. This involves:

a) adequately describing the database used, 

b) comparing the results with other databases such as ESC-10 or DCASE data, 

c) comparing the results obtained with state-of-the-art techniques, and 

d) providing a repository (e.g., on GitHub) that can be accessed to verify the obtained results.

I recommend the article [2] as one of many from which the authors can draw inspiration for the development of their work.

 

3. Finally, a correction of the English language is essential. The number of errors is very high and in many cases makes the work very difficult to read.

 

In my humble opinion, the work is at a very preliminary stage and needs additional time and work to be presented in a solid way.

 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

Explain what MVDR is (Minimum Variance Distortionless Response beamformer?) and why it is useful.

Please reformulate: "As shown in Figure 1, Sound is a kind of multi-channel waveform data, however, considers only the sound's time domain information that leaves out its frequency domain information" (lines 118–119).

Please be clearer regarding Fig. 3, i.e., the dimensions of F(X) and X.

Regarding Fig. 4, it is not clear whether the authors made the acquisition themselves or used an annotated database.

Please reformulate (Fig. 4): "Acoustic data is waveform data in essence."

Fig. 7: Please reformulate "1 × 1 convolutions are used to adjust channel number in" (line 287).

Please be clearer regarding the acquisition of the training set and the test set.

Perhaps such details could be included in the Abstract.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Please review the attached file.

Comments for author File: Comments.pdf

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

I understand that the authors' intended contribution to sound classification may be interesting, especially the merging of conventional audio features with the latent feature space provided by a deep neural network, all realized through an attention layer.

The authors have addressed some of the comments made in my previous review, which I appreciate.

However, the manuscript still contains many errors, some of which invalidate the work presented, in my humble opinion.

I thank the authors for including a comparison with other results, presented in Table 3. However, the results shown do not clarify the actual ability of the network to classify sounds, as only training results are presented. It is essential to provide results on test data; the same is true for Table 2. It is necessary to know the generalization capability of the developed models. Similarly, it is not specified with which network the data in Table 3 have been calculated (which data were used for training and which for testing), nor what the value presented in the table represents. No ranking measures, such as the F-score, are given.
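As a point of reference for the requested evaluation protocol, the following is an illustrative sketch, not code from the manuscript or its repository; the dataset, classifier, and feature dimensions below are placeholder assumptions. It shows accuracy and macro F-score reported on a held-out test split using scikit-learn:

# Minimal sketch: judge generalization on a held-out test split, not on training data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Placeholder features and labels standing in for an acoustic dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))    # e.g., 40-dimensional MFCC-like feature vectors
y = rng.integers(0, 5, size=500)  # five sound classes

# Stratified split so every class appears in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Only the test-split scores speak to generalization capability.
print("test accuracy:", accuracy_score(y_test, y_pred))
print("test macro F1:", f1_score(y_test, y_pred, average="macro"))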

In addition, the manuscript still has many problems with the wording in English. A revision of the language by a translator is mandatory.

Just to give a few examples:

- Page 3, line 136: "In waveform" -> "waveforms represented in the time domain".

- Page 4, line 141: "deep convolutional network can be used to high level feature representation" (the sentence lacks a verb).

- Page 5, line 173: "and completed work from signal acquisition to signal identification" is not clear.

- Page 5, line 191: "which are original in waveform": it is not clear what this means.

- Page 6: Essentia and Freesound must be properly referenced.

- The words in the boxes in Figure 5 are badly justified.

- Page 7, line 256: "start" -> "input".

- Figure 7 is hard to see.

- Page 7, line 262: "Due to the convolution is a low pass filter": the convolution can be whatever type of filter is desired.

- Page 8, line 292: "is consist" -> "consist".

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 2 Report

The manuscript is much improved over the original version; the methodology and results are more adequately presented. The results obtained seem relevant. It would be interesting if the code were available on a platform commonly accessed by the scientific community, such as GitHub, which provides a lot of visibility to this type of work.

Author Response

Dear Professor:

Thank you for your time and collegiality. We have uploaded the code to GitHub according to your suggestion. Here is the repository address:

https://github.com/nucmail/sound_ESC_rcnn

We look forward to hearing from you. Please feel free to contact me if there are any questions.
Best regards
