Article
Peer-Review Record

The Sound of Surveillance: Enhancing Machine Learning-Driven Drone Detection with Advanced Acoustic Augmentation

by Sebastian Kümmritz
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 30 January 2024 / Revised: 14 March 2024 / Accepted: 16 March 2024 / Published: 19 March 2024
(This article belongs to the Special Issue Advances in Detection, Security, and Communication for UAV)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents a comprehensive analysis of results for machine learning-based drone detection with advanced acoustic augmentation. However, there are notable limitations in this paper:

1.  The methodology, particularly the machine learning model, is described as relying on existing methods without adequate depth. This raises questions about the novelty and contribution of the research.  A more thorough explanation of the chosen methodology, its implementation, and any modifications or improvements is required.

2.  The absence of comparison with other machine learning approaches is a significant limitation.

3.  The exploration of audio augmentation techniques for enhancing the robustness of drone sound classification is valuable. However, their direct impact on model performance is unknown.

 

Comments on the Quality of English Language

Good.

Author Response

Thank you for your patience and the constructive remarks. Below I comment briefly on my revision and address your main points:
  1. Methodology Depth: I expanded the description of the machine learning model, detailing the choice of methodology, its implementation, and any modifications or improvements made. This includes a more thorough explanation of the VGGish network adaptation for acoustic applications and the rationale behind selecting specific augmentation techniques.
  2. Comparison with Other Approaches: I recognize the significance of comparing our method with other machine learning approaches for drone detection. While this version of the manuscript does not include a comparative analysis due to the focused nature of our research on augmentation techniques, I understand the value such a comparison would bring. My work primarily aimed to explore the efficacy of specific acoustic augmentation methods in improving the robustness of drone sound classification. I believe this contribution is a step towards a more nuanced understanding of augmentation's role, setting the groundwork for future studies that could incorporate a broader comparative analysis.
  3. Impact of Audio Augmentation Techniques: I clarified the direct impact of audio augmentation on model performance. This includes a detailed discussion of the effects of pitch shifting, harmonic distortion, and environmental noise on classification accuracy. Special emphasis was placed on the significant improvement observed with pitch augmentation for the C3 drone in outdoor measurements, demonstrating the practical relevance of my findings.
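The augmentation techniques discussed in points 1 and 3 can be illustrated with a minimal pure-Python sketch. This is not the author's actual MATLAB pipeline; the function names, the tanh waveshaper, and all parameter values are illustrative assumptions, and the pitch shift is a crude resampling that also changes duration, unlike a proper phase-vocoder shift.

```python
import math
import random

def add_noise(signal, noise_level):
    """Mix uniform white noise into the signal at a level relative to its peak."""
    peak = max(abs(s) for s in signal) or 1.0
    return [s + noise_level * peak * random.uniform(-1.0, 1.0) for s in signal]

def harmonic_distortion(signal, amount):
    """Introduce harmonics by soft-clipping the waveform with tanh waveshaping."""
    drive = 1.0 + 9.0 * amount  # map amount 0..1 to a gain of 1..10
    return [math.tanh(drive * s) / math.tanh(drive) for s in signal]

def pitch_shift_naive(signal, semitones):
    """Crude pitch shift by linear-interpolation resampling (changes duration too)."""
    factor = 2.0 ** (semitones / 12.0)
    n_out = int(len(signal) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        j = int(pos)
        frac = pos - j
        nxt = signal[j + 1] if j + 1 < len(signal) else signal[j]
        out.append(signal[j] * (1 - frac) + frac * nxt)
    return out

# Example: a 100 Hz test tone at an assumed 8 kHz sampling rate
sr = 8000
tone = [math.sin(2 * math.pi * 100 * n / sr) for n in range(sr)]
augmented = harmonic_distortion(add_noise(tone, 0.05), 0.3)
shifted = pitch_shift_naive(tone, 1.4)  # +1.4 semitones, as in the paper's pitch range
```

In a real pipeline the augmented copies would be added to the training set at controlled levels, which is exactly the parameter sweep the revised text describes.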

Kind regards
Sebastian Kümmritz

Reviewer 2 Report

Comments and Suggestions for Authors

Research in the field of acoustic drone detection looks relevant and interesting nowadays. The author proposes training-data augmentation to increase the performance of the recognition system, an approach that seems modern and quite effective. The author also proposes using two networks: the first performs binary classification ('no drone' vs. 'drone'); if a drone is detected, the second network classifies the type of drone.
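The two-stage scheme the reviewer summarizes can be sketched as follows. The stub classifiers below are hypothetical stand-ins for the two trained VGGish-based networks; the thresholds and class names are illustrative only.

```python
def detect_and_classify(features, binary_net, type_net):
    """Stage 1: binary drone/no-drone decision.
    Stage 2: run the type classifier only when a drone was detected."""
    if binary_net(features) != "drone":
        return "no drone"
    return type_net(features)  # one of the drone classes, e.g. C0..C3

# Toy stand-ins for the two trained networks (purely illustrative)
def binary_stub(features):
    return "drone" if sum(features) > 1.0 else "no drone"

def type_stub(features):
    return "C" + str(min(3, int(sum(features))))

result_quiet = detect_and_classify([0.2, 0.1], binary_stub, type_stub)  # "no drone"
result_loud = detect_and_classify([1.5, 1.0], binary_stub, type_stub)   # type net runs
```

The design advantage is that the cheap binary stage gates the more expensive type classifier, which matches the wake-up architecture discussed later in the review.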

 

The remarks could be formulated as follows.

1. The author should add relevant references that use the chosen VGGish architecture in similar cases.

2. The author should provide a more accurate and up-to-date analysis of the related work. For example, there is no related-work section. Adding papers concerning the same problem would also strengthen the scientific soundness of the submitted paper.

3. The author should explain the behavior of the accuracy vs. distortion level dependence (results in tables 3-5). The non-monotonic behavior may indicate an incorrect training process for the model (for example, overfitting). So, the author should describe the training step in detail.

4. The author should provide a description of the training dataset and/or some characteristics of the training dataset obtained after the proposed preprocessing (for example, the number of samples or the distribution of classes). The author adds a reference to the original paper describing the dataset, but it is advisable to include this description explicitly in the text for a better understanding of the underlying problems of the trained networks.

5. The author concluded that non-random initialization of the network weights contributes to obtaining more reliable results. But the problem of the weights reaching a poor position (a local minimum or plateau) is well known. So, the author should specify the number of training launches over which the statistical data, such as "mean" and "std", were obtained.

6. It would be nice to add some direct comparison of overall detection accuracy achieved with and without augmentation. It would be yet more informative to add a dependence of overall detection accuracy on the augmentation level.
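The statistics requested in comment 5 (mean and std over repeated launches) can be made concrete with a small sketch. The "training" function here is a hypothetical stand-in that returns a noisy accuracy; only the seeding-and-aggregation pattern is the point.

```python
import random
import statistics

def train_once(seed):
    """Stand-in for one training launch: a fixed seed makes the run reproducible."""
    rng = random.Random(seed)
    return 0.90 + rng.uniform(-0.02, 0.02)  # pretend accuracy around 90%

def repeated_launches(n_runs):
    """Run n_runs launches with distinct seeds and summarize the accuracies."""
    accs = [train_once(seed) for seed in range(n_runs)]
    return statistics.mean(accs), statistics.stdev(accs)

mean_acc, std_acc = repeated_launches(10)
```

Reporting the number of launches alongside mean and std is what lets a reader judge whether an augmentation effect exceeds the run-to-run fluctuation, which is the crux of this comment.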

Author Response

Thank you for your patience and the constructive remarks. Below I comment briefly on my revision and address your main points:
  1. During the revision process, an essential error in the dataset was identified, where some drone sounds were misclassified. Consequently, all evaluations were repeated, and the values in the paper were updated accordingly. It's important to note that these changes did not affect the overall conclusions of the paper. In fact, a slight improvement in classification accuracy for outdoor data was observed.
  2. Most of your suggested improvements have been incorporated. However, I did not make the requested enhancements regarding the introduction, the relevance of sources, and the State of the Art section due to a stark contrast with the second reviewer's feedback. Admittedly, a dedicated State of the Art section would have been included if not for the journal's prescribed structure (Introduction, Methods, Results, etc.), leading to its implicit coverage in the introduction. The sources mentioned there are mostly less than 5 years old (many under 3 years) and are closely related to the topic.
  3. Regarding your suggestion to add a direct comparison of overall detection accuracy with and without augmentation, and its dependence on the augmentation level: While valuable, the current research design does not support this. The influence of augmentation was demonstrated using a single drone model. Training adjustments are required to exclusively use augmented RAR measurements for training and outdoor measurements for validation, enabling the proposed display. This limitation is acknowledged and addressed in the "Limitations and Challenges" section and will be a focus of the subsequent project.

Reviewer 3 Report

Comments and Suggestions for Authors

A comprehensive article of substantial volume has been submitted for review. The literature review thoroughly examines recent advances in this research area, and the list of publications is sufficient and up to date. The article includes all the necessary components of a scientific paper. However, the article itself is large, and large volume is not always an advantage: it can hinder readers from fully understanding the essence of the article. Therefore, I suggest highlighting the most important parts and eliminating elements that do not contribute scientific value.

 Comments:

At first glance, there are appendices with MATLAB scripts at the end of the article. I believe they are unnecessary: you provided a link to the GitHub project, where these scripts are most likely available, and that should be sufficient. MATLAB scripts do not contribute scientific value to the article.

It is necessary to provide a more detailed explanation of how the test dataset was collected. In the "Data" section, it is briefly mentioned that different drones were used, but only towards the end of the article, in the results section (lines 522-523), is it noted that 40 different drone models were used. All information related to dataset collection should be presented in one place, preferably the "Data" section. It should clearly explain how the data were collected and in what quantity: How many drones from each class were used? What did the dataset look like, and what parameters were stored in the input and target matrices?

There is very little information about the VGGish network and its structure; a link to GitHub is provided, but the network structure should be discussed in more detail. Currently, two classification tasks are addressed: in the first step, the presence of a drone is classified; in the second step, drones are classified into four categories. The question remains: is the same network, with exactly the same structure, used to solve both tasks?

 What hardware was used for training the networks? How long did the training take? What hardware was used for classification? Was classification done in real-time? How long did the classification take? Was only the classification accuracy problem addressed or was speed also important?

 

You have written that "In a drone detection system, detecting a drone could trigger a wake-up process for a more power-intensive unit (e.g., an Arduino)". Is the Arduino the power-intensive unit that classifies drones into the C0-C3 categories? And what hardware was used in the first step to classify whether a drone is present at all?

Regarding the "Non-drone sounds," what recordings were used: sounds from the forest, city noises, or perhaps recordings from construction workers using an electric drill, whose engine spectrograms might essentially resemble drone engine spectrograms? It is necessary to provide a more detailed description of the dataset being formed.

Figure 2. There are four confusion matrices. What are Classifiers 3p101, 3p102, 3p103, and 3p104? What is the difference between these classifiers? Where are these classifiers described in the text?

Table 2. "91.4%" is correct, but many numbers use a comma instead of a dot as the decimal separator, e.g. "96,6%". The paper should be reviewed to fix the number formatting.

It would be interesting to see and compare more spectrograms like the one in Figure 3. Which drone model does this spectrogram belong to, and to which class (C0-C3) would it be assigned? It is necessary to be more precise.

Do I understand Figure 3 (bottom) correctly: the same drone flies over the area and is incorrectly classified into different classes depending on its distance from the microphone?

It is necessary to provide the criteria by which the boundary threshold values for each augmentation technique were chosen. For example, harmonic distortion up to 63%: why 63%? Why noise up to 72%? And so on…

Figure 4. It would be better to present the results of a) and b) in the same figure with different colors; that would make them easier to compare.

Figure 5. What should we see in this figure? The image is uninformative: we observe a distribution but cannot compare it to anything, so whether the results are good or bad cannot be determined. The purpose of this image is unclear.

 

Figure 6. Again, what is this image intended to depict? Does the recording belong to one class but appear to jump across other classes, or is there another goal? The same applies to Figure 7. A deeper explanation is necessary, or data should be added to the images to allow comparison with something.

Lines 457-459. "Challenges in Higher Drone Categories (C2 and C3): • In contrast, the classification of larger drones (C2, and especially C3) was less successful." From which data exactly is it evident that this was less successful? After such statements, the article lacks references to the figures and tables that provide this data.

Do you extract features from the collected dataset signals, or does the network do it? I found FFT analysis only in the discussion section. This needs to be explained.

I apologize if I asked questions whose answers are already in the text, but the article is extensive, and sometimes I lost the thread while reading. Therefore, I suggest reducing the volume of the article and presenting the most important information concisely.

Author Response

Thank you for your patience and the constructive remarks. Below I comment briefly on my revision and address your main points:
  1. Upon incorporating the suggested revisions, a critical error in the dataset was discovered, leading to incorrect drone sound classifications. This necessitated the repetition of all evaluations and the corresponding updates in the paper. These modifications did not alter the paper's fundamental conclusions; in fact, they resulted in improved classification accuracy for outdoor data.
  2. In response to the comment about the text's length, it has been significantly shortened, aiming to enhance clarity and focus on the essence of the work.
  3. Regarding the criteria for selecting boundary threshold values for each augmentation technique: The rationale behind the step sizes has been clarified in the text. However, it is worth asking whether concerns would have arisen had conventional increments of 5% or 10% been used.
  4. In response to the inquiry about the less successful classification of larger drones (C2 and C3) and the request for data support: The section discussing multiple augmentations has been entirely removed from the paper, as the results did not contribute value and require extensive further research.
  5. Concerning feature extraction: It was conducted prior to training, as detailed in the "Network and Training" section.
  6. On the request for more spectrograms like Fig. 3 and specifics about the drone model and classification: While appreciating the value of spectra and spectrograms, attempts to present a clear comparative illustration of different drone spectra were ultimately abandoned due to presentation challenges and deemed unnecessary for text comprehension.
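The pre-training feature extraction mentioned in point 5 can be illustrated with a minimal sketch: splitting each clip into overlapping frames and computing one feature per frame. This is a simplified stand-in (per-frame log energy), not the actual VGGish log-mel front end; the frame length, hop size, and sampling rate are assumed values.

```python
import math

def frame_log_energies(signal, frame_len, hop):
    """Split a signal into overlapping frames and compute log energy per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return feats

# Example: a 440 Hz test tone at an assumed 8 kHz sampling rate,
# 50 ms frames (400 samples) with 50% overlap (hop of 200 samples)
sr = 8000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
feats = frame_log_energies(tone, frame_len=400, hop=200)
```

Extracting such features once, before training, is what allows the same feature matrices to be reused across the many augmentation-level sweeps described in the paper.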

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for the response and the improvements to the manuscript. However, I suggest the author respond to the review comments one by one and mark the corrections in the paper; otherwise it is difficult to track them. I still have some questions:

Acoustic augmentation methods for enhancing the robustness of drone sound classification are dependent on the algorithm. This reliance represents a potential limitation in this study, as only one machine learning model is utilized.

Author Response

Thank you for the hints and suggestions for improvement. In the section "Limitations and Challenges", I discuss the limiting influence of using a single model. I have also compiled a list of the changes below to make them easier to follow (see the attached document).

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The author presents extensive research into the impact of distortions on drone detection accuracy. It is worth noting that the author addresses most of the comments; comments 1-4 are addressed adequately. However, some comments remain unresolved.

Concerning comment 5. The author's conclusion that non-random initialization of the network weights contributes to obtaining more reliable results still looks unclear. The common approach used in a great variety of practical tasks is based on random weight initialization; this approach has been shown to provide the best stability and robustness for most neural networks. At the same time, non-random initialization influences the performance deviation across multiple training runs, but the problem of the weights reaching a poor position (a local minimum or plateau) remains unresolved. The author should describe the advantage of non-random weight initialization in the general case, not only for the experiments provided. Perhaps random weight initialization with a smaller deviation would provide the best possible result.

Concerning the comment 6. The direct comparison of overall detection accuracy achieved with and without augmentation of the tested ’HP-X4 2020’ drone (some results from figures 3 (bottom), 4, 5, and 6) would provide more information about the proposed research. The noted figures don’t allow us to estimate the quality of the proposed method.

The added comments are listed below.

1. The text in lines 194-197 repeats the text in lines 181-185.

2. The names “True Class” and “Predicted Class” in Table 3 look unclear. Maybe such definitions as “Precision” and “Recall” would be correct?

3. The author should describe how he adds the augmented data to the training dataset. If the amount of data increases, then the author should specify how much. Otherwise, if the amount of data doesn’t change, then the author should determine the fraction of augmented data, for example. This information helps to estimate the resulting performance more adequately.

4. The author should specify some rules that he follows when choosing the augmentation level for further testing in a real-world scenario. For example, augmenting the pitch by +/- 1.4 semitones shows the worst performance in Table 6, but the author chooses this augmentation level for real-world testing in Figure 5. That choice looks unclear. So, how is the pitching level chosen to improve the classifier's performance in a real-world scenario?

5. It’s worth noting that the journal “Drones” doesn’t limit the ability to include additional sections, for example, “Related Work” or “State of the Art”.

Author Response

Thank you for the hints and suggestions for improvement. I have incorporated your suggestions and comments as far as possible. Please find attached a changelog with all the changes.

One of your main criticisms relates to my use of a non-random initialization of the network. I am not questioning the importance of random initialization. In this study, however, I am interested in the influence of subtle changes in the augmentation on the performance of the classifier, and these changes have sometimes proven smaller than the fluctuations caused by the random processes. Seed initialization therefore refers to the methodical procedure for investigating augmentation and is not intended to imply that it should be used as a general approach to training. For clarification, I have emphasized this point again at the relevant places in the paper.

Thank you for your observation regarding the inclusion of additional sections such as "Related Work" or "State of the Art" in the journal "Drones". I appreciate this flexibility and will certainly consider integrating these elements in future submissions. For the current version, however, integrating such sections would require extensive restructuring. Considering this, and in light of feedback from other reviewers, I have decided to maintain the current structure of the manuscript; I believe this best serves the coherence and focus of the work at this stage.

In response to your suggestion regarding the overall detection accuracy: I understand the importance of directly comparing detection accuracy with and without augmentation for the drone example, as indicated in the figures mentioned. I wish to clarify that each figure already incorporates a comparison to the scenario without augmentation, designed to illustrate the augmentation's effects on classification under various conditions. The impact of augmentation on detection accuracy is discussed extensively in the manuscript text.

Finally, I would like to add a personal note: this is my first journal paper of this scope. I have learned a great deal in the course of this project and the writing of this article, and your comments have played an important part in that. I would therefore like to thank you in particular for your extensive feedback.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The author took into account many of my comments, removed unnecessary redundancy, and discussed the dataset and the structure of the artificial neural networks in more detail.

The value of the article has significantly increased, and now it can be published.

 

Author Response

I highly appreciate your helpful comments from the first revision round. Not only did they contribute substantially to improving the paper from my point of view; they were also valuable lessons that will help me improve future publications. To keep you informed about the latest revisions, I have included a changelog for the current version.

Author Response File: Author Response.pdf
