Article
Peer-Review Record

Using YOLO Object Detection to Identify Hare and Roe Deer in Thermal Aerial Video Footage—Possible Future Applications in Real-Time Automatic Drone Surveillance and Wildlife Monitoring

by Peter Povlsen 1,*, Dan Bruhn 1, Petar Durdevic 2, Daniel Ortiz Arroyo 2 and Cino Pertoldi 1,3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 24 November 2023 / Revised: 21 December 2023 / Accepted: 22 December 2023 / Published: 24 December 2023
(This article belongs to the Special Issue Drone Advances in Wildlife Research)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for the interesting submission. My quick review showed the manuscript to be well-structured and written, novel, and on an important topic related to drone applicability. Due to time limitations, I have no specific revisions recommended at this time, but also don't see any obvious issues that need to be addressed or that would prevent publication in general.

Author Response

Dear reviewer. Thank you very much for your time and effort reviewing this paper. Best regards, Peter

Reviewer 2 Report

Comments and Suggestions for Authors

In the article, the authors present research on the use of YOLO (You Only Look Once) object detection technology for identifying hares and roe deer in thermal aerial video footage from drones. They focus on the possibilities of applying this technology in the automatic monitoring of wildlife, which is a new and promising direction in this field.

The greatest advantage of the study is its innovation and practical application. The use of drones with IR cameras and advanced machine learning techniques opens up new possibilities for monitoring and protecting wildlife. The methodology is described in detail, adding scientific value and allowing other researchers to replicate the experiment.

However, the proposed method also has its weaknesses. It requires manual verification of the results, indicating limitations in automation. The authors critically assess their findings, particularly the limited reliability of the mAP metric. They also point out the need for further research, such as on the impact of the background on animal identification. They emphasize the technical complexity of the proposed conceptual implementation of the method, which could be a barrier to its wider application.

In summary, this article is a significant contribution to the field of wildlife research, demonstrating the potential applications of modern technologies in environmental protection. However, it requires further research and development to fully exploit its potential.

Detailed remarks:

Line 131: Roboflow.com should be better described, indicating its purpose and basic functionalities.

Lines 148-160: What about the preparation of the IR camera for work? It usually requires some temperature stability with the environment. There may also be other technical recommendations related to camera preparation for work?

Line 195: Do the numbers of images in the training set mean original images and augmented images? This is not clear.

Line 210: Wandb.ai should be better described, indicating its purpose and basic functionalities.

Author Response

Thank you very much for your time and effort reviewing this paper. We have addressed your comments, please see details below.

Line 131: Roboflow.com should be better described, indicating its purpose and basic functionalities.

Added: Roboflow.com, an end-to-end computer vision platform, was used for manual annotation...

Answer: The functionalities are elaborated in Section 2.2, Image Annotation.
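For readers unfamiliar with the platform, a minimal sketch of how an annotated dataset is typically pulled from Roboflow into a YOLOv5 workflow; the workspace, project, version, and API key below are hypothetical placeholders, not those used in the study:

```python
# Minimal sketch: downloading an annotated dataset from Roboflow in YOLOv5
# format. Workspace, project, version, and API key are hypothetical.
from roboflow import Roboflow  # pip install roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("thermal-wildlife")
dataset = project.version(1).download("yolov5")
print(dataset.location)  # local folder with images, labels, and data.yaml
```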

Lines 148-160: What about the preparation of the IR camera for work? It usually requires some temperature stability with the environment. There may also be other technical recommendations related to camera preparation for work?

Answer: The thermal cameras on the M2EA and the Zenmuse H20N require no preparation besides the automatic warm-up of the drones when booting up. This takes mere seconds, regardless of whether the ambient temperature is +20 °C or −10 °C.

Line 195: Do the numbers of images in the training set mean original images and augmented images? This is not clear.

Answer: The first column “Number of images” does not include augmented images. The Training, Validation, and Test sets do include augmented images.

Added: The number of images before augmentations, approximate number of objects per image...

Line 210: Wandb.ai should be better described, indicating its purpose and basic functionalities.

Added: To assess the training, the plugin Weights & Biases, an AI developer platform, was used (wandb.ai). This automatically stores data about each training progress and calculates parameters such as precision, recall and mAP, to be accessed online later.
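As a brief illustration of what this integration does under the hood, a minimal sketch of Weights & Biases run tracking; YOLOv5 activates this automatically when wandb is installed, and the project name and metric values below are placeholders:

```python
# Minimal sketch of Weights & Biases run tracking. YOLOv5 logs to wandb
# automatically when the package is installed; this shows the mechanism.
import wandb  # pip install wandb

run = wandb.init(project="thermal-wildlife-yolov5")  # hypothetical project
for epoch in range(3):
    # ...one training epoch of the detector would run here...
    wandb.log(
        {"precision": 0.0, "recall": 0.0, "mAP_0.5": 0.0},  # placeholders
        step=epoch,
    )
run.finish()  # curves remain browsable later at wandb.ai
```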

Reviewer 3 Report

Comments and Suggestions for Authors

The paper "Using YOLO object detection to identify hare and roe deer in 2 thermal aerial video footage - possible future applications in 3 real-time automatic drone surveillance and wildlife monitoring" furthers a previous investigation by Povlsen et al. using drones equipped with thermal cameras to detect POI, hare and roe deer. The authors adjust the sampling method and apply object detection (YOLO5) methods and assess the mAP with an independent dataset. The paper discusses how mAP scores can be misleading and propose a conceptual automated AI-based method for wildlife detection. 

The paper overall is sound. I feel it discusses an important topic on model performance metrics and how these can be misleading. There are some grammatical errors throughout the manuscript and some sentences can be reworded to improve clarity (I have mentioned some examples below).

Further discussion comparing a test set subset from the training set with one sampled independently, with regard to inferring model performance, would be welcome (i.e., the importance of a test set separate in space and/or time). Additionally, whether there are better metrics or methods to assess the model during training as well as validation would be worth addressing.

Given that mAP is potentially not a great indicator of model performance, particularly in the validation set during training, are there better indicators to use to assess the model during training?

 

Intro:

Line 37 ‘reducing’ to ‘reduce’.

Line 35-40: This sentence is really long. Consider breaking up.

Line 102 – 105: Revise for clarity. The first sentence doesn’t describe what Povlsen showed, just what they did. The lines 104-105 describe this, which makes for a potentially awkward sentence split.

In paragraph starting 102, you don’t mention the animals being surveyed until near the end of the paragraph. I suggest mentioning earlier. Also were these the same animals sighted in Povlsen et al. (they were but you should mention here).

Lines 102-104: you need to mention either the drone (M2E), field of view, or ground sampling distance (such as cm/pix), as without any of this info the height is somewhat meaningless.

lines 108-110: this seems it is more suited to either methods or results

Lines 107-113: There is likely a better lead-in/argument for the actual aims of the paper, which are to explore the utility of using AI to improve post-processing efficiency and to provide further insights regarding model performance metrics. Lines 107-112 don't deal with the problem space, rather just state that you altered the method from the previous paper, which does not provide strong justification for the aims.

Line 126: fix grammar (‘fairly simple’).

 

Methods:

It is great to see the authors collected a test set independent of the training set, to test the model performance on. This is seldom done unfortunately, despite it giving a better indication of how well the model might perform in a monitoring situation, rather than giving an indication of how well the model knows its own data.

Are there other metrics (such as loss) that you used to monitor training in addition to the mAP, or was it just mAP?

It would be great to have a table on the breakdown of the test set (if possible), including the number of images taken at what distance and level of zoom. If, for instance, the test set was biased towards closer images, then this would influence the results of assessing model performance.

 

Discussion:

Paragraph starting 289: The proposed method is interesting, however, ‘wildlife monitoring’ is extremely broad. This method will not suit many situations, such as where fixed wing drones are needed to cover large survey areas, and may be complex in situations where animals may be in and out of sightability (such as marine environments or heavily forested areas), or perhaps where density estimates are required, or differentiation of similar shaped animals. Although the proposed method is conceptual, perhaps the authors could be a little more specific or include some more detail in the discussion with reference to their proposed method.

293-294: the height matters for regulations, but for wildlife monitoring, the focus should be on ground pixel resolution, which is dependent on focal length of the camera lens as well as height.

296: There is no real justification or discussion as to why the authors propose a gimbal angle of 45 degrees.

335-336: the authors make an excellent point.

343-346: I don’t get how image segmentation and unsupervised learning can increase the resolution of the gathered data?

 

Comments on the Quality of English Language

English is sound, but there are some sentence revisions and minor grammatical errors

Author Response

Thank you very much for your time and effort reviewing this paper. We have addressed your comments and conducted proof-reading, please see details below.

There are some grammatical errors throughout the manuscript and some sentences can be reworded to improve clarity (I have mentioned some examples below).

Answer: Examples addressed, and proof-reading conducted.

Further discussion comparing a test set subset from the training set with one sampled independently, with regard to inferring model performance, would be welcome (i.e., the importance of a test set separate in space and/or time). Additionally, whether there are better metrics or methods to assess the model during training as well as validation would be worth addressing.

Given that mAP is potentially not a great indicator of model performance, particularly in the validation set during training, are there better indicators to use to assess the model during training?

Answer: Object detection networks use several loss functions: (a) class loss (image classification): L2 loss (MSE); (b) localization loss (bounding box): Intersection over Union (IoU) loss; (c) confidence loss: binary cross-entropy loss. Given that mAP calculates precision and recall at several IoU threshold values, it considers both localization and classification aspects and is a good performance metric, but we show that it is not flawless. A trained model should be validated manually (or otherwise) before being taken into use.
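To make the relationship concrete, a minimal sketch of the ingredients behind mAP: box IoU, plus precision and recall at a single IoU threshold. This is illustrative only; real evaluators such as YOLOv5's val.py also rank detections by confidence, sweep thresholds, and average per class.

```python
# Minimal sketch of the ingredients behind mAP: box IoU, and precision/recall
# at one IoU threshold. Illustrative only, not a full mAP evaluator.

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(preds, truths, iou_thr=0.5):
    """Greedily match predictions (sorted by confidence) to ground truths."""
    matched, tp = set(), 0
    for p in preds:
        candidates = [(i, iou(p, t)) for i, t in enumerate(truths)
                      if i not in matched]
        best = max(candidates, key=lambda c: c[1], default=(None, 0.0))
        if best[1] >= iou_thr:  # a match localizes well enough to count as TP
            matched.add(best[0])
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(truths) if truths else 0.0
    return precision, recall
```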

 

Intro:

Line 37 ‘reducing’ to ‘reduce’.

Answer: Done

Line 35-40: This sentence is really long. Consider breaking up.

Answer: Sentence broken up:

The prospects of drones in wildlife monitoring have already been proven to save time, create better imagery and spatial data for especially cryptic and nocturnal animals [8,9], and reduce the risks and hazards for the observer [10,11]. However, the methods are still in the early stages and need further development to be truly superior and cost-saving compared to traditional monitoring methods.

Line 102 – 105: Revise for clarity. The first sentence doesn’t describe what Povlsen showed, just what they did. The lines 104-105 describe this, which makes for a potentially awkward sentence split.

In paragraph starting 102, you don’t mention the animals being surveyed until near the end of the paragraph. I suggest mentioning earlier. Also were these the same animals sighted in Povlsen et al. (they were but you should mention here).

lines 102-104: you need to either mention the drone (M2E), field of view, of ground sampling distance (such as cm/pix) as any of this info, the height is somewhat meaningless.

Revised and added: Povlsen et al [28] showed that by flying in predetermined flight paths at 60 meters altitude with a thermal camera pointing directly down (90°), covering the transects that were simultaneously surveyed. By traditional transect spotlight counting, it was possible to spot roughly the same number of animals as the ground-based spotlight count [28]. =>

Povlsen et al [28] flew in predetermined flight paths at 60 meters altitude with a DJI Mavic 2 Enterprise Advanced, with the thermal camera pointing directly down (90°), covering the transects that were simultaneously surveyed, monitoring hare, deer, and fox. By transect counting, it was possible to spot roughly the same number of animals as the traditional ground-based spotlight count [28]. However, this method covered a relatively small area per flight, and required post-processing of the captured imagery, still making it time-consuming. In the present study we tried a slightly different approach, by manually piloting the UAV continuously, using the scouring method which also had been shown to match and potentially surpass the traditional spotlight method [9].

 

lines 108-110: this seems it is more suited to either methods or results

Answer: This refers to the scouring method in the previous study (reference 9) compared to (reference 28). Hopefully this has been clarified in lines 102-108 now.

Lines 107-113: There is likely a better lead-in/argument for the actual aims of the paper, which are to explore the utility of using AI to improve post-processing efficiency and to provide further insights regarding model performance metrics. Lines 107-112 don't deal with the problem space, rather just state that you altered the method from the previous paper, which does not provide strong justification for the aims.

Answer: Added to line 112: To improve post-processing efficiency and possibly even collect data in real-time automatically, while the drone is airborne.

 

Line 126: fix grammar (‘fairly simple’).

Answer: simple -> simply

 

Methods:

It is great to see the authors collected a test set independent of the training set, to test the model performance on. This is seldom done unfortunately, despite it giving a better indication of how well the model might perform in a monitoring situation, rather than giving an indication of how well the model knows its own data.

Are there other metrics (such as loss) that you used to monitor training in addition to the mAP, or was it just mAP?

Answer: We only looked at mAP, which is derived from various loss metrics. In future studies it would be interesting to look closer into these and other metrics.

It would be great to have a table on the breakdown of the test set (if possible), including the number of images taken at what distance and level of zoom. If, for instance, the test set was biased towards closer images, then this would influence the results of assessing model performance.

Answer: This is a very good point and should definitely be taken into consideration when building a dataset for training in the future. Unfortunately, it was not possible for this study, since the images used were still frames from video footage, and the metadata (flight height, etc.) from the video files was therefore not transferred.

Discussion:

Paragraph starting 289: The proposed method is interesting, however, ‘wildlife monitoring’ is extremely broad. This method will not suit many situations, such as where fixed wing drones are needed to cover large survey areas, and may be complex in situations where animals may be in and out of sightability (such as marine environments or heavily forested areas), or perhaps where density estimates are required, or differentiation of similar shaped animals. Although the proposed method is conceptual, perhaps the authors could be a little more specific or include some more detail in the discussion with reference to their proposed method.

Answer: The proposed method could potentially be applied in a wide variety of monitoring situations, including in marine habitats and when using fixed-wing UAVs. Fixed-wing UAVs could switch to loitering mode when POIs are detected, until an observation/confirmation is made; movement patterns could be incorporated into the automatic species differentiation; and image segmentation NNs could be used with large groups of animals. This, however, needs further experimentation.
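As a rough illustration of what such a real-time pipeline could look like, a minimal sketch assuming a trained YOLOv5 checkpoint and an OpenCV-readable video stream; the weights path and stream URL are hypothetical:

```python
# Minimal sketch of real-time detection on a drone video feed. Assumes a
# trained YOLOv5 checkpoint and an OpenCV-readable stream; the weights path
# and stream URL are hypothetical.
import cv2    # pip install opencv-python
import torch  # pip install torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
cap = cv2.VideoCapture("rtmp://drone-feed/live")  # or a recorded video file

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model(frame[..., ::-1])      # BGR -> RGB, then inference
    detections = results.pandas().xyxy[0]  # boxes, confidences, class names
    if not detections.empty:
        print(detections[["name", "confidence"]])  # e.g., flag POIs to pilot
cap.release()
```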

293-294: the height matters for regulations, but for wildlife monitoring, the focus should be on ground pixel resolution, which is dependent on focal length of the camera lens as well as height.

Answer: This is true, but since this method uses a thermal camera with a powerful zoom, the ground pixel resolution varies continuously. The height is a limiting factor on how fast an area can be covered, to some degree of course.

296: There is no real justification or discussion as to why the authors propose a gimbal angle of 45 degrees.

Answer: The angle was proposed and discussed in the previous study, that this study refers to (reference 9).

335-336: the authors make an excellent point.

Answer: Thank you!

343-346: I don’t get how image segmentation and unsupervised learning can increase the resolution of the gathered data?

Answer: Image segmentation opens up the possibility of also detecting age, sex, and body condition, through size comparisons for one. Unsupervised learning methods will drastically increase the size of the training datasets, since there will be significantly less need for manual annotation.

Comments on the Quality of English Language: English is sound, but there are some sentence revisions and minor grammatical errors.

Answer: Proof-reading conducted.
