Article
Peer-Review Record

An Efficient Adjacent Frame Fusion Mechanism for Airborne Visual Object Detection

by Zecong Ye 1,2, Yueping Peng 1,*, Wenchao Liu 1, Wenji Yin 1, Hexiang Hao 1, Baixuan Han 1, Yanfei Zhu 1 and Dong Xiao 1,3
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 17 February 2024 / Revised: 25 March 2024 / Accepted: 6 April 2024 / Published: 7 April 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors proposed an efficient mechanism for frame fusion and showed the performance of this mechanism in comparison with pre-existing ones. 

I don't see any flaw in this manuscript and hence would like to suggest that the journal Drones consider this paper for publication.

Author Response

Comments and Suggestions for Authors: The authors proposed an efficient mechanism for frame fusion and showed the performance of this mechanism in comparison with pre-existing ones. I don't see any flaw in this manuscript and hence would like to suggest that the journal Drones consider this paper for publication.

Response: Thank you very much for your recognition of our work! We have tried our best to improve the manuscript and have made some changes accordingly.

Reviewer 2 Report

Comments and Suggestions for Authors

 

 

Summary

 

This paper proposes an efficient adjacent frame fusion mechanism for small object detection in airborne images. Drone-to-drone object detection requires a less complex mechanism and higher accuracy than ground-to-drone object detection. With a feature alignment fusion module and a background subtraction module, this mechanism maintains high accuracy at a faster speed with less computation.

 

General concept comments

 

The motion relationship between the adjacent frame features and the key frame features can be used to align the features. However, an object that is stationary relative to the camera has no motion information. Is it possible to align a stationary object correctly?

 

The experimental environment used an Intel Xeon W-2245 CPU and an NVIDIA RTX 3090 24 GB GPU. However, in a real environment the algorithm would run on an onboard computer that is not as powerful as the experimental setup. An experiment on a less powerful onboard computer is necessary to test the method.

 

The first row of Table 1 shows the results of YOLOv5s, which performs almost the same as the suggested mechanism. However, the lower rows show results of a different model, which is inconsistent. The set of algorithms compared with the suggested method should be the same for every dataset.

 

Specific comments

 

In Table 1, the highlighted numbers represent the best results. However, although the FPS of YOLOv5s is higher than the others, it is not highlighted.

 

In the ablation experiment and analysis, no FPS or parameter data are provided. As the method is focused on efficiency, it is necessary to provide these results.

Comments on the Quality of English Language

N/A

Author Response

Question 1: The motion relationship between the adjacent frame features and the key frame features can be used to align the features. However, an object that is stationary relative to the camera has no motion information. Is it possible to align a stationary object correctly?

Response: Thank you for your comments! We have redescribed the flaws of Tiny Airborne Object Detection (TAD) [1]. More precisely, that method cannot recognize targets whose trajectory is perpendicular to the imaging plane of the camera's field of view or targets that are hovering, as detailed in lines 66 to 68. Our method can obtain the changing state of the object in any situation, whether it is stationary or not, because it calculates the local similarity between the features of the key frame and those of the adjacent frame. When the target is stationary, the local similarity calculation finds that the key frame is most similar to the features at the center of the corresponding local window in the adjacent frame, which ensures that the target in the adjacent frame is correctly aligned with the target in the key frame.
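The idea can be illustrated with a small, self-contained sketch (our assumption of a cost-volume-style formulation in PyTorch; the local_similarity helper below is hypothetical and not code from the paper): a target that is stationary relative to the camera simply produces its similarity peak at the centre of the search window, i.e. zero displacement, so it is still aligned correctly.

```python
# Minimal sketch (not the paper's code): local similarity between key-frame
# and adjacent-frame features, computed over a (2r+1) x (2r+1) window.
import torch
import torch.nn.functional as F

def local_similarity(key_feat, adj_feat, radius=4):
    """key_feat, adj_feat: (B, C, H, W). Returns (B, (2r+1)**2, H, W);
    the argmax over dim=1 is the per-pixel displacement between frames."""
    B, C, H, W = key_feat.shape
    k = 2 * radius + 1
    # Every (2r+1) x (2r+1) neighbourhood of the adjacent frame, per location.
    patches = F.unfold(adj_feat, kernel_size=k, padding=radius)  # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H, W)
    # Dot product of each key-frame feature vector with each neighbour.
    return (key_feat.unsqueeze(2) * patches).sum(dim=1) / C ** 0.5

# A target that is stationary relative to the camera peaks at the window
# centre (index 40 in a 9x9 window), i.e. zero displacement.
key = torch.randn(1, 64, 32, 32)
sim = local_similarity(key, key.clone())
print((sim.argmax(dim=1) == 40).float().mean())  # ~1.0 for identical frames
```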

 

[1] Lyu, Y.; Liu, Z.; Li, H.; Guo, D.; Fu, Y. A Real-Time and Lightweight Method for Tiny Airborne Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 3016–3025.

 

Question 2: The experimental environment used an Intel Xeon W-2245 CPU and an NVIDIA RTX 3090 24 GB GPU. However, in a real environment the algorithm would run on an onboard computer that is not as powerful as the experimental setup. An experiment on a less powerful onboard computer is necessary to test the method.

Response: Thank you for your comments! As the reviewer suggests, an experiment on a less powerful onboard computer is necessary to test the method, and experimenting with and optimizing the method on an airborne computer is our next step. However, the relevant equipment has not yet been procured, so in the paper we have added the model's computational cost in GFLOPs to help the reader evaluate the computational complexity of the algorithm. In this way, our method can be evaluated by comparing its FLOPs with those reported in other papers along with the corresponding FPS on an onboard computer. At the same time, we use four different resolutions (1280, 800, 640, and 480) to compare the performance of the proposed method and examine the trade-off between accuracy and throughput, as detailed in lines 317 to 328.

For example, as mentioned in TransVisDrone [1]: “To show the deployment capability of our model on edge-computing devices, we use NVIDIA Jetson Xavier NX board. It has 7025Mb of GPU memory & 6 CPU cores. Our 640 resolution model obtains the real-time fps of 33 without any complex TensorRT optimizations & keeping the board temperature well below 50°C.” The computation of that algorithm at 640 resolution is 168.1 GFLOPs, which is similar to the computational cost of our algorithm at 800 resolution. At the same time, Table 2 shows that our algorithm also achieves high accuracy at this resolution.
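For readers who wish to make such comparisons themselves, GFLOPs and parameter counts of a PyTorch model can be obtained with a profiler such as the thop package (the tool used in the paper is not specified here; the snippet below is only a sketch with a stand-in torchvision backbone rather than the detector from the paper):

```python
# Illustrative only: profile a stand-in backbone at the four input
# resolutions discussed above (the paper's detector is not reproduced here).
import torch
from thop import profile                 # pip install thop
from torchvision.models import resnet50

for res in (480, 640, 800, 1280):
    model = resnet50()                   # fresh model per run to avoid reusing hooks
    x = torch.randn(1, 3, res, res)
    macs, params = profile(model, inputs=(x,), verbose=False)
    # thop counts multiply-accumulates; some papers double this when quoting FLOPs.
    print(f"{res:>4}px: {macs / 1e9:7.1f} G, {params / 1e6:.1f} M params")
```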

We hope the correction could meet with approval.

 

[1] Sangam, T.; Dave, I.R.; Sultani, W.; Shah, M. TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 6006–6013. https://doi.org/10.1109/ICRA48891.2023.10161433.

 

Question 3: The first row of Table 1 shows the results of YOLOv5s, which performs almost the same as the suggested mechanism. However, the lower rows show results of a different model, which is inconsistent. The set of algorithms compared with the suggested method should be the same for every dataset.

Response: Thank you for your comments! As the reviewer suggests, the set of algorithms compared with the proposed method should be the same for every dataset, so we have rerun a large number of experiments using YOLOv5l instead of YOLOv5s throughout. Meanwhile, we supplemented our experiments by comparing several other algorithms, as detailed in Table 1. We also found that an error in our method of calculating FPS led to that value being too low, so we retested it.

Additionally, in FL@480, YOLOv5s appears as a baseline and we add our mechanism to YOLOv5s. The main purpose is to highlight that the accuracy improvement after using the proposed method is large, while the decrease in FPS remains within an acceptable range. Specifically, after applying the proposed method, the AP on FL@480 improves by 8.9, while the number of parameters increases by only 0.2 M.

We hope the correction could meet with approval.

 

Question 4: In Table 1, the highlighted numbers represent the best results. However, although the FPS of YOLOv5s is higher than the others, it is not highlighted.

Response: Thank you very much for pointing out the error. We have proofread the full text to ensure there are no similar issues.

We hope the correction could meet with approval.

 

Question 5: In the ablation experiment and analysis, no FPS or parameter data are provided. As the method is focused on efficiency, it is necessary to provide these results.

Response: Thank you very much for your valuable comments. We have supplemented the FPS and parameter results in the ablation experiment and analysis.

We hope the correction could meet with approval.

 

Question 6: Extensive editing of English language required

Response: We apologize for the poor language of our manuscript. We worked on the manuscript for a long time, and the repeated addition and removal of sentences and sections evidently led to poor readability. We have now worked on both the language and the readability, and we hope that the flow and language have been substantially improved.

We hope the correction could meet with approval.

Reviewer 3 Report

Comments and Suggestions for Authors

drones-2898822 An Efficient Adjacent Frame Fusion Mechanism for Airborne Visual Object Detection

The paper proposes an efficient adjacent frame fusion mechanism to improve the performance of airborne visual object detection. Related works are well presented. The authors aim to develop an efficient framework that reduces the number of frames and parameters used. However, this study needs more work to support its conclusions.

 

1. Line 47, “TAD” needs to be expanded when it first occurs. It's the same with Line 192; “DCN” needs to be expanded when it first occurs.

2. In lines 56-57, the authors state, “…the method cannot identify the target that is relatively stationary with the camera.” I did not see any part of this study indicating a solution for identification. As the authors might know, detection and identification/classification are tasks at totally different levels in terms of complexity and difficulty. The same applies to Lines 60-62, “…the method does not have the ability to determine whether the target is a drone or other category.” The authors might need to focus only on the gaps they are solving in this study. Of course, this could still be mentioned and discussed at the end of the paper.

3. In lines 57-60, “…this method cannot identify the drone target with a slightly larger scale, because the motion features obtained by this method can only extract the edge information of the target with a slightly larger scale…” This sentence is totally confusing and unclear to me. To what degree have the authors defined “a slightly larger scale”? This needs clear clarification; a numerical example would help.

4. The authors state that the training and test settings are the same as Ref. 17. Why is there no benchmark comparison with existing studies? I saw there was one in Table 1 of Ref. 17 (even though I understand the hardware environment might not be the same). Also, in Lines 292-295 and 298-300, there is a lack of evidence (in your Table 1) or explanation to support that the proposed approach outperforms existing studies with better efficiency. The authors need to explain why, in the FL@480 case, the FPS improvement is not significant when compared to YOLOv5s. Also, why, in the NPS@1280 case, are the evaluation metrics not as good as those of TransVisDrone (5 frames)?

5. Figure 6 totally confuses me. The authors might need to draw a dashed line separating the two cases. Also, it would be clearer if the original images (fourth and fifth lines) were moved to the top and the detection images (first line) moved to the bottom, following the order “Original-Feature Heatmaps-Results.”

Author Response

Question 1: Line 47, “TAD” needs to be expanded when it first occurs. It's the same with Line 192; “DCN” needs to be expanded when it first occurs.

Response: Thank you for your comments! We have made the corrections in the paper, as detailed in lines 55 and 209.

We hope the correction could meet with approval.

 

Question 2: In lines 56-57, the authors state, “…the method cannot identify the target that is relatively stationary with the camera.” I did not see any part of this study indicating a solution for identification. As the authors might know, detection and identification/classification are tasks at totally different levels in terms of complexity and difficulty. The same applies to Lines 60-62, “…the method does not have the ability to determine whether the target is a drone or other category.” The authors might need to focus only on the gaps they are solving in this study. Of course, this could still be mentioned and discussed at the end of the paper.

Response: Thank you. We have redescribed the flaws of the method. More precisely, Tiny Airborne Object Detection (TAD) [1] cannot recognize targets whose trajectory is perpendicular to the imaging plane of the camera's field of view or targets that are hovering, and in the paper we add a description of why our study is able to cope with these problems. Specifically, the proposed mechanism can be inserted into a general object detector in the same way as an attention mechanism, so both targets whose motion trajectory is perpendicular to the imaging plane of the camera's field of view and hovering targets can be detected, and the target features are enhanced by the proposed mechanism. In addition, a general object detection algorithm can recognize objects at different scales. Our method thus copes with the shortcomings of the TAD method while still using pixel motion features. The modifications are located at lines 66 to 70 and 82 to 87 of the paper.

We understand that detection and recognition/classification are tasks at completely different levels in terms of complexity and difficulty, so we have removed “…the method does not have the ability to determine whether the target is a drone or other category.” The TAD method is innovative enough that we devote some space to it: it uses only local similarity computation to obtain motion inconsistencies and determine the target position. Inspired by it, we analyzed the scene (see Section 3.1) and found that it is perfectly feasible to use local similarity computation to obtain pixel motion in an airborne vision scene. This completely avoids the use of mechanisms with high space and time complexity, such as transformers, for feature enhancement. At the end of the paper, we revisit the TAD algorithm to inspire subsequent researchers, as detailed in lines 365 to 368.

It is worth mentioning that, during the implementation of the paper, we reproduced the core part of the TAD method and combined it with a general object detection algorithm, i.e., using motion inconsistency to regress the target in order to improve detection performance; the experiments showed that this is indeed effective compared to the baseline. We then tried combining the TAD method with the method proposed in this paper, but the combination performed worse than using our method alone, possibly because the information learned through the two local similarity computations is the same, and this redundancy degrades the performance of the network.

We hope the correction could meet with approval.

 

[1] Lyu, Y.; Liu, Z.; Li, H.; Guo, D.; Fu, Y. A Real-Time and Lightweight Method for Tiny Airborne Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 3016–3025.

 

Question 3: In lines 57-60, “…this method cannot identify the drone target with a slightly larger scale, because the motion features obtained by this method can only extract the edge information of the target with a slightly larger scale…” This sentence is totally confusing and unclear to me. To what degree have the authors defined “a slightly larger scale”? This needs clear clarification; a numerical example would help.

Response: Thank you for your comments! We are very sorry for the confusion, and we have rewritten this part according to the reviewer's suggestion. We changed it to “drone targets larger than 32×32”, because when such a target is detected by the TAD method, only the edges of the target can be located rather than its center, as detailed in lines 68 to 70. As for why the threshold is 32×32: in the TAD method the feature map is downsampled 3 times, so its side length becomes one eighth of the original, and a 32×32 target therefore occupies roughly 4×4 feature cells. The 2×2 cells at the center of the target do not receive high motion-inconsistency activation values; instead, it is the edge cells of the target that tend to receive high activation values. This leads to a localization error in which only the edges of the target can be located.
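The stride arithmetic behind the 32×32 threshold can be written out explicitly (an illustrative calculation only, not code from the paper):

```python
# Illustrative arithmetic for the 32x32 threshold: after three stride-2
# downsamplings the side length of a target shrinks by a factor of 8.
target_px = 32
downsample = 2 ** 3                  # three stride-2 stages
cells = target_px // downsample      # 4x4 feature cells remain
interior = (cells - 2) ** 2          # 2x2 centre cells with low motion inconsistency
edge = cells ** 2 - interior         # 12 edge cells that dominate the activation
print(cells, interior, edge)         # 4 4 12
```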

We hope the correction could meet with approval.

 

Question 4: The authors state that the training and test settings are the same as Ref. 17. Why is there no benchmark comparison with existing studies? I saw there was one in Table 1 of Ref. 17 (even though I understand the hardware environment might not be the same). Also, in Lines 292-295 and 298-300, there is a lack of evidence (in your Table 1) or explanation to support that the proposed approach outperforms existing studies with better efficiency. The authors need to explain why, in the FL@480 case, the FPS improvement is not significant when compared to YOLOv5s. Also, why, in the NPS@1280 case, are the evaluation metrics not as good as those of TransVisDrone (5 frames)?

Response: Thank you for your comments! We supplemented our experiments by comparing several other algorithms, as detailed in Table 1, and we also found that an error in our method of calculating FPS led to that value being too low, so we retested it.

We are very sorry that lines 292-295 and 298-300 were confusing for you, and we will explain further:

In FL@480, YOLOv5s appears as a baseline and we add our mechanism to YOLOv5s. The main purpose is to highlight that the accuracy improvement after using the proposed method is large, while the decrease in FPS remains within an acceptable range. Specifically, after applying the proposed method, the AP on FL@480 improves by 8.9, while the number of parameters increases by only 0.2 M. Meanwhile, another reviewer suggested using the same baseline throughout, so we have rerun the experiments with YOLOv5l as the baseline in all cases instead of YOLOv5s.

In NPS@1280, the accuracy metrics are not as good as those of TransVisDrone (5 frames). This is firstly because the total amount of information used by the algorithms differs: TransVisDrone (5 frames) utilizes the information of 4 adjacent frames, whereas our method only uses the information of 1 adjacent frame. Secondly, because of the large variability between datasets, the algorithms do not behave the same way on each of them, and Table 1 shows that the FL dataset is harder to learn than the NPS dataset. We explain this in lines 303 to 309 of the paper. Nevertheless, the FL@1280 and NPS@1280 experiments illustrate the efficiency of the proposed mechanism when only 1 adjacent frame is used. We will investigate how to efficiently improve performance using multiple adjacent frames in future work.

We hope the correction could meet with approval.

 

Question 5: Figure 6 totally confuses me. The authors might need to draw a dashed line separating the two cases. Also, it would be clearer if the original images (fourth and fifth lines) were moved to the top and the detection images (first line) moved to the bottom, following the order “Original-Feature Heatmaps-Results.”

Response: Thank you for your comments! We are very sorry for the confusion. We have reordered the figure according to your comments, following the order “Original-Feature Heatmaps-Results”. At the same time, we have drawn a dashed line separating the two cases and added labels on the far left of the image to make it easier to read.

We hope the correction could meet with approval.
