Article
Peer-Review Record

Environmental Sound Classification Based on Transfer-Learning Techniques with Multiple Optimizers

Electronics 2022, 11(15), 2279; https://doi.org/10.3390/electronics11152279
by Asadulla Ashurov 1,2, Yi Zhou 1,2,*, Liming Shi 1,2, Yu Zhao 1,2 and Hongqing Liu 1,2
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 24 June 2022 / Revised: 12 July 2022 / Accepted: 12 July 2022 / Published: 22 July 2022

Round 1

Reviewer 1 Report

Dear authors,

The title is interesting; however, there are a few points that could help you improve this early draft.

First of all, please add more facts and figures to highlight the research gap and your theoretical contribution. On page 2, it is mentioned that "The essential contributions of this study can be summarized as follows". It would be great to show the gap by adding relevant references; otherwise, it will not be clear to the readers.

It is mentioned that "In recent years, the quantity of work done in the domain of ESC has increased exponentially". Many fields have grown in this way; please compare the growth of ESC with that of other relevant fields instead of offering a general statement.

Please avoid general statements such as "Recent studies have shown this concept to be helpful in various applications". Such statements might be considered controversial.

Many recent works are overlooked. Please add more insights from the more recent works (2019 or later).

You could also use supervised or unsupervised machine learning techniques to reach deeper findings. It is just a suggestion.

The findings are precise and well discussed.

Please add the limitations of the research.

Best of luck!

Author Response

My coauthors and I much appreciated the reviewer's supportive, constructive, and critical remarks on this work. Let us begin by expressing our deep gratitude for the valuable time and effort that you and the reviewers have dedicated to providing feedback on the manuscript. The feedback has been detailed and helpful in enhancing the paper. We strongly believe that the comments and suggestions have increased the scientific value of the revised manuscript manyfold. In revising, we have taken this input into full consideration. We are submitting the corrected article with the recommendations incorporated. The manuscript has been changed based on the reviewer's suggestions, and the following are our replies to all of the comments.

Author Response File: Author Response.pdf

Reviewer 2 Report

Line 6: Which hyper-parameters and pre-trained models? I suggest listing them.

Line 8: It would be more correct to say the models were applied to the spectrogram data, not the other way around.

Line 12: The word ‘achieve’ should either be written as ‘achieved’ or ‘achieves’, depending on the tense.

Line 28: The word ‘being’ should be replaced with the word ‘is’.

Line 36: You claim that CNNs have overcome the limitations of conventional approaches. To which limitations are you referring? Can you be more specific? I would also suggest adding a citation here.

Line 37: The fact that a CNN can overcome certain challenges faced in image recognition does not imply it is ideal for 2D sound diagram classification. You need a stronger supporting statement here. Can you provide further evidence that a CNN is suitable for this task?

Line 43: Refined in what way? Can you elaborate?

Line 47: On which data were these models pretrained? This isn’t mentioned anywhere in the first couple of sections.

Line 49: This should read ‘a publicly accessible…’.

Line 64: I believe you intended to say ‘evaluate’ rather than ‘facilitate’.

Line 72: This statement should be in present tense (‘the following paragraphs discuss…’).

Line 89: What do you mean when you say conventional models do not outperform? To what are they being compared? In other words, what are they not outperforming?

Line 92: I like your stated goal, but you need to provide some context and a comparison with the techniques discussed in the previous paragraph. Are you trying to improve on their reported classification accuracy?

Line 98: You need to add a space between the word ‘model’ and reference ‘[21]’.

Line 100: The abstract reads as though yours is the first study to investigate the effectiveness of CNNs for ESC. However, this is clearly not the case. As such, you need to clarify your novel contributions. I suggest focusing on transfer learning, which has not been described in enough detail. Specifically, on which data were the models pretrained?

Line 116: Either the word ‘exploits’ or ‘performs’ should be removed as they are duplicates.

Line 123: The word ‘study’ should be removed.

Line 132: What was the result of this comparison? What is the most advanced method?

Line 135: This statement needs to be more definitive and specific. To what kind of improvements are you referring? Motivation and need are perhaps the greatest weaknesses of this paper. For example, in [9] a very similar technique was applied to the same dataset (Urbansound8k), yielding an accuracy of 93%. Do you hope to improve on these results in your study? Why is a new approach needed if these other models were so successful?

Line 136: This statement does not seem valid, based on the results reported above.

Line 140: The word ‘techniques’ should be pluralized, and the sentence should be in present tense.

Line 145: This reference to Figure 4 should be a reference to Figure 1 instead. Also, the following sentence should begin with the word ‘The’.

Line 153: There is a space missing after the period. The following sentence mistakenly includes two spaces after the period.

Line 158: This statement is confusing. Are you saying you normalized all of the sound clips to the same length, or you normalized the intensities, but only for clips that were of the same size? What was the standard ‘network input size’? Were all clips the same length?

Line 163: This statement requires further detail. Which optimizers and how many epochs? Also, what do you mean by ‘satisfactory accuracy’? Can you quantify this statement?

Line 168: Equation (1) is never explained in the text; I believe it was supposed to be placed on Line 179.

Line 188: Section 3.2 is overly vague and does not specify which methods were used in your study. For example, it states that Keras, Tensorflow, and Torch are popular; which was used in this study? Also, you state that transfer learning may result in a more efficient model. Was that the case in your study? This section needs to be rewritten with information specific to your technique.

Line 197: You state fine-tuning is applicable only when the two datasets are identical. However, that was not the case in this study, since you were using parameters from image recognition models (ImageNet) to identify sound spectrograms. Is this not a contradiction? How were you able to include fine-tuning in your study if this requirement was not met?
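For reference, the usual pattern when the source and target domains differ is to keep the ImageNet-pretrained convolutional base and replace only the classification head, optionally unfreezing part of the base afterwards. A minimal Keras sketch of that pattern follows; the head size and training settings are illustrative assumptions, not taken from the manuscript:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet201

# Load a backbone pretrained on ImageNet, without its 1000-class head.
base = DenseNet201(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # stage 1: freeze the pretrained features

# Attach a new head for the 10 UrbanSound8K classes (head size is illustrative).
model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Stage 2 (fine-tuning proper): unfreeze some or all of `base` and
# continue training with a much lower learning rate.
```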

Line 216: A space is needed after the word ‘Figure’.

Line 218: The word ‘is’ needs to be removed from the last sentence of the caption for Figure 2.

Line 220: This should read ‘This model utilizes…’.

Line 224: This should read ‘to achieve’.

Line 236: Pages 7 and 8 are too generalized, providing basic summaries of individual algorithms without clarifying how they were applied in your study. For example, on Line 239, you state that restricting network resolution can improve model accuracy and performance. Was this done as part of your study? On Line 270, you list a range of parameters from 1.7-6.9 million. How many did you include? You need to describe any modifications that were made to accommodate the Urbansound8k data. Where is the novelty in your approach?

Line 276: If this network can only accept images of size 224x224, does this imply you downsampled the sound spectrograms? This is never discussed.

Line 278: This paragraph repeats the same information twice, beginning on this line. The second instance should be removed entirely.

Line 309: The word ‘dimension’ is misspelled.

Line 321: A reference is made to several types of optimizers, but no detail is included. Which optimizers? Also, are you using the terms 'optimizers' and 'hyper-parameters' interchangeably? They are not necessarily equivalent.

Line 329: How many different values did you test as part of this optimization? You only report the final values, but it would be interesting to know the range of tested numbers.

Line 333: I believe your use of ‘implanted’ should be ‘implemented’.

Line 337: What do you mean by the ‘metric value’?

Line 345: This should either read ‘this experiment’ or ‘these experiments’.

Line 351: Was synthetic biasing a problem in this case? In other words, did you see any evidence of samples being included in both the training and test sets? You never specify.

Line 365: You need to quantify these statements. For example, what fraction of images were rotated, flipped, etc.? How many samples were available before and after augmentation?
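To make this request concrete: with an explicit augmentation configuration such as the sketch below (all parameter values and the directory path are illustrative assumptions, not taken from the manuscript), the transform ranges and the effective augmented sample counts are straightforward to report. Note also that horizontal flips reverse the time axis of a spectrogram, which may deserve justification.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation ranges; the manuscript should state the actual
# values and the sample counts before and after augmentation.
augmenter = ImageDataGenerator(
    rotation_range=15,        # degrees
    width_shift_range=0.1,    # fraction of image width
    height_shift_range=0.1,   # fraction of image height
    horizontal_flip=True,     # note: flips reverse the time axis
)

# Hypothetical directory layout: one sub-folder of spectrogram images
# per UrbanSound8K class.
train_gen = augmenter.flow_from_directory(
    "spectrograms/train",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```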

Line 372: Figure 4 shows 10 plots and is described as a visual representation of the dataset. However, your dataset included thousands of different files, correct? Is this simply a sample from each class?

Line 391: I don’t understand this statement; adjusted in what way? How does this new layer differ from the initial top layer?

Line 410: I believe this reference to Table 1 should actually be a reference to Table 2.

Line 438: This statement requires further clarification. Can you explain what you did specifically?

Line 444: The results are not ‘perfect’ since they are less than 100%.

Line 447: Is this supposed to be a reference to Table 3?

Line 457: Your stated goal was to determine whether transfer learning could be applied to sound classification, but this goal is never revisited. You provide a summary of individual results for each algorithm, but you don’t explain whether these results support your initial hypothesis. You need to add a statement here, evaluating the success of the study in the context of your primary objectives.

Line 456: The word ‘need’ should either be ‘needs’ or ‘needed’ in this case.

Line 462: The word ‘figure’ should be capitalized.

Line 492: You need to add a paragraph here providing the ‘big picture’ summary you want the reader to understand. You make several observations concerning the performance or runtime produced by a given algorithm, but you never explain why this might be the case. In addition, you don’t identify the significance of individual results. For example, having read this section, I am still uncertain as to which algorithm is superior. What is the ultimate result of your study?

Line 504: You may want to mention that ‘your method’ involves the use of DenseNet201.

Line 505: You reference a ‘different’ optimizer; to which are you referring?

Author Response

My coauthors and I much appreciated the reviewer's supportive, constructive, and critical remarks on this work. We strongly believe that the comments and suggestions have increased the scientific value of the revised manuscript manyfold. In revising, we have taken this input into full consideration. We are submitting the corrected article with the recommendations incorporated. The manuscript has been changed based on the reviewer's suggestions, and our replies to all of the comments are attached. Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

1. This article evaluates several pre-trained CNN models on the UrbanSound8K public environmental sound dataset. It identifies the model that achieves the most outstanding performance as the best fit for the sound spectrogram data and refines it for better classification.

The purpose of this work is to determine the effectiveness of pre-trained convolutional neural networks (CNNs) for audio categorization and the feasibility of repurposing them, by studying various hyper-parameters of several pre-trained models across several stages: raw sound signals are first converted into an image format (the log-Mel spectrogram), and the resulting spectrogram data are then fed to the selected pre-trained models. This work proposes a transfer-learning method for ESC based on log-Mel spectrograms of the UrbanSound8K dataset, which comprises several classes.
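For concreteness, a minimal sketch of this waveform-to-spectrogram conversion is given below; it assumes the librosa and OpenCV libraries, and the parameter values (sampling rate, Mel band count, 224x224 output size) are illustrative placeholders rather than the authors' exact settings:

```python
import numpy as np
import cv2          # assumption: OpenCV is used here only for resizing
import librosa

def wav_to_logmel_image(path, sr=22050, n_mels=128, out_size=(224, 224)):
    """Convert a raw audio file into a 3-channel log-Mel spectrogram image.

    All parameter values here are illustrative placeholders, not the
    authors' reported settings.
    """
    y, sr = librosa.load(path, sr=sr)                     # raw waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)        # log (dB) scaling

    # Normalize to [0, 255] and resize to the fixed resolution expected
    # by ImageNet-pretrained CNNs.
    img = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    img = cv2.resize((img * 255).astype(np.uint8), out_size)

    # Replicate to three channels, since the pretrained models expect RGB.
    return np.stack([img] * 3, axis=-1)
```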

 

2. The essential contributions of this study can be summarized as follows:

• A study of pre-trained CNNs and the use of transfer learning to accurately classify environmental sounds into ten categories, followed by a comparison of the classification performance of several pre-trained models on the ambient sound dataset and an analysis of the advantages and disadvantages of these strategies.

• Use of the publicly available audio dataset to test the effectiveness of transfer learning with the chosen CNNs in terms of classification accuracy and precision.

• Selection of the nine most widely used pre-trained models, in conjunction with fine-tuning, to reduce training time while maintaining output accuracy.

• Finally, an evaluation of the performance of the modified pre-trained models is conducted to study their advantages and to compare the results obtained with those of related methods.


3. The novelty and originality of this research lie mainly in the simulation aspect and in the experiments carried out by the authors to achieve the expected results and to identify pre-trained CNN models that work well for this task. The references are rich and well suited to the demands of the subject.

 

4. The improvements over established methodologies are obtained through the pre-trained CNN models. The work demonstrates the rigor of the authors, who have succeeded in showing that the pre-trained CNN models exhibit the best performance across all metrics while remaining efficient.

 

5. The results and conclusions are promising and suggest broad applications. Exploring the chosen dataset and various training parameters can help determine the most successful combinations. The classification technique represents a significant advance in developing accurate classifiers for increasingly complex environmental sound datasets, especially those incorporating distinct spectrogram representations generated from sound waveforms.

 

6. The selected pre-trained models performed well and achieved high accuracy in this research. In addition, the effect of key retraining factors on classification accuracy and processing time during CNN training is studied. Various optimizers are used to evaluate the proposed method on the publicly available sound dataset UrbanSound8K.

The proposed method achieves accuracy considerably higher than that of several methods cited in the references.
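As an illustration of such a multi-optimizer evaluation, the following sketch trains the same Keras architecture under several optimizers; the optimizer set, learning rates, stand-in model, and dummy data are all assumptions for demonstration, not the authors' reported configuration:

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

def build_model():
    # Minimal stand-in classifier; in the paper's setting this would be an
    # ImageNet-pretrained backbone (e.g., DenseNet201) with a 10-class head.
    return models.Sequential([
        layers.Input(shape=(224, 224, 3)),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])

# Dummy spectrogram batch so the sketch runs end to end.
x = np.random.rand(8, 224, 224, 3).astype("float32")
y = np.eye(10)[np.random.randint(0, 10, size=8)]

candidates = {
    "SGD": optimizers.SGD(learning_rate=1e-3, momentum=0.9),
    "Adam": optimizers.Adam(learning_rate=1e-3),
    "RMSprop": optimizers.RMSprop(learning_rate=1e-3),
}

for name, opt in candidates.items():
    model = build_model()  # fresh weights for each optimizer, for fairness
    model.compile(optimizer=opt, loss="categorical_crossentropy",
                  metrics=["accuracy"])
    hist = model.fit(x, y, epochs=2, verbose=0)
    print(name, hist.history["accuracy"][-1])
```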

Author Response

The authors greatly appreciate the reviewer's comments and constructive suggestions. We have carefully examined the manuscript and revised some passages accordingly. In revising, we have taken the reviewer's suggestions into full consideration and highlighted the contributions pointed out by the reviewer. Please see the revised manuscript for details.

 
