YOLO-HR: Improved YOLOv5 for Object Detection in High-Resolution Optical Remote Sensing Images
Round 1
Reviewer 1 Report
The submission proposes a lightweight object detection network for high-resolution remote sensing images, based on the YOLOv5 framework, that aims to balance detection accuracy, speed, and the number of model parameters while taking advantage of existing features. The manuscript also conducts a series of experiments and discussions on the SIMD dataset. However, I have a number of major concerns with the manuscript, which are outlined below.
1- In the introduction section, the submission lists various target detection methods but does not analyze the advantages, disadvantages, and bottlenecks of these methods. For example, in line 82, the issues of the aforementioned methods are summarized abruptly, which makes the motivation of the proposed method less obvious.
2- In section 2, the subsection “Datasets of Remote Sensing Image Object Detection” is not relevant to this paper. Table 2 has the same problem. A considerable number of object detection algorithms have been designed for high-resolution remote sensing imagery, but they are ignored in this paper. Please provide more descriptions and references for these algorithms in the Related Work section.
3- In section 2.2, it is recommended to add the reasons for choosing hybrid soft attention, i.e., its advantages and the problems it can solve.
4- In section 3, the scientific contributions of the proposed method should be stated more clearly. For example, the existing method flow (e.g., section 3.2; section 3.3) can be omitted or described briefly. The authors should devote more text to their own ideas.
5- In line 300, the paper states that “Figure 5 depicts a visual comparison of the heat map of some detection findings prior to and following the addition of the MAB module”. Meanwhile, Figure 5 labels the second and third rows as “YOLO-HR without MAB” and “YOLO-HR with MAB”. However, in my understanding, the YOLO-HR algorithm includes MAB, right? What do you mean by “YOLO-HR without MAB”? Does it refer to experiments with the “YOLOv5s+MPH” algorithm?
6- In Table 5, the ablation study on MAB seems to be missing. Why not add the experiments of "Yolov5s+MAB"? Moreover, since both “Yolov5s” and “YOLO-HR” have added the "Finetune" ablation experiments, why not add the results from “yolov5s+MPH+Finetune”?
7- Please replace Figure 7 with a clearer version.
8- Some presentation issues and mistakes should be corrected, including but not limited to:
1) Abbreviation inconsistency: Yolov5 (line 18) and YOLOv5 (line 1).
2) Font inconsistency: line 132; line 145; format of brackets in line 322.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
This paper proposes YOLO-HR for object detection in high-resolution remote sensing images; its balance between effectiveness and efficiency is verified on the SIMD dataset. Overall, this paper is interesting. However, some critical issues need to be addressed before publication.
1. Abstract:
1) Please use one or two sentences to point out the importance of the research, not half of the abstract.
2) Please add the specific techniques you propose, in detail.
3) I suggest revising “between detection effect and speed” to “between effectiveness and efficiency”, and making the same change throughout the main manuscript.
2. The first paragraph (lines 28-43)
The organization is confusing, and I cannot grasp its meaning or emphasis. I think the first paragraph should briefly introduce the background of this study and point out its importance. In this respect, it should be reorganized.
3. The deficiencies of existing works (lines 84-94)
Please add specific references to support your claims, e.g., “Some researchers employ a two-stage model for object recognition [XX-XX]”.
4. The subjects are missing in your contributions (lines 99-104).
5. Minor comment
Could you please clarify the differences between “the detection in remote sensing image” and “the detection in radar”? The latter can be seen in:
Joint detection threshold optimization and illumination time allocation strategy for cognitive tracking in a networked radar system, IEEE Trans. Signal Process., doi: 10.1109/TSP.2022.3188205.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
1. In Table 6, a comparison with the experimental results of state-of-the-art (SOTA) models is missing (Faster RCNN was proposed in 2016 and is no longer the best-performing model). Please provide a broader literature review and more experimental comparisons with SOTA methods.
2. Please fix some typographical problems, such as the centering of Figure 6 and the line break in Equation 8.
3. We would like to see the results of ablation experiments on the MAB module, which is one of the contributions of this paper, so as to prove the effectiveness of this attention module.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
I would like to thank the authors for their efforts on the additional experiments and the improved paper. However, I have some additional comments that need to be addressed:
Regarding Reply 2, the authors state that “the high resolution is actually the application of the network input to the high-resolution remote sensing data set” and “It is difficult to sort through thousands of high-resolution remote sensing image target detection network papers, but their basic network algorithms are relatively simple to summarize and compare”. However, I have a completely different view. First of all, the subject of this paper is the detection of targets in remote sensing images. Therefore, it is necessary to summarize the related literature on RS object detection. Second, there are obvious differences between objects in remote sensing images and objects in natural scene images, such as object distribution, object scale, perspective, etc. This is the reason why many baseline methods in the CV field are not very accurate when applied to remote sensing images. Since the authors insist that Tables 1 and 2 are relevant to the paper, the algorithms created for the datasets in Table 1 should be summarized in Table 2. The existing Table 2 seems to be a list of network backbones rather than of object detection networks for RS images. Finally, if a review of papers on remote sensing imagery is not important, what can this paper bring to the field?
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
The authors are advised to recheck the whole manuscript before publication. In addition, I suggest adding references to clarify the difference between remote sensing image detection and radar detection.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
The overall content is well revised, including the introduction and the method part. However, the experimental results part is still insufficient; please provide more comparisons with SOTA methods. Furthermore, the writing could be shortened and refined to avoid long sentences.
Author Response
Please see the attachment.
Author Response File: Author Response.docx