Small-Target Detection Based on Improved YOLOv8 for Infrared Imagery
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The main suggestions for revisions are as follows:
- When introducing the background of infrared small target detection, the description of the background and challenges of infrared image target detection is somewhat generic. It would be better to clearly emphasize the current challenges, highlight the shortcomings of existing methods, and introduce the contributions of the proposed model.
- In the section on related work, the paper mainly reviews the categories of different methods. It could further emphasize the connection between these methods and this work. There is a lack of detailed comparisons between the proposed method and other related works; it is recommended to add relevant content.
- The paper currently uses traditional object detection metrics such as Precision, Recall, F1 Score, mAP@0.5, and mAP@[.5:.95] for performance evaluation. It is recommended to include visual comparisons such as ROC curves and the corresponding AUC values to display the results more intuitively and comprehensively.
- In the experimental results comparing algorithmic performance, the proposed model is compared with existing similar methods, but there is a lack of more detailed discussion about the advantages of the proposed model over others.
- In the results of algorithmic performance comparison, more detection results from different scenarios can be added to comprehensively demonstrate the effectiveness of the proposed algorithm.
- Regarding the detection performance of the algorithm in practical scenarios, it is recommended to add original images to provide a clearer contrast. Section 3.6 needs a more detailed title. Additionally, it is suggested to include a discussion on the algorithm’s detection capability for multi-target scenarios.
The English still needs improvement to enhance reading fluency.
Author Response
Thank you for your constructive feedback. Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript presents an improved YOLOv8-based model (IRST-YOLO) for infrared small target detection. The proposed model integrates several novel components, including the Dual-Path Fusion Downsampling Convolution (WFDC) module, an Involution-based Spatial Pyramid Pooling module with Attention (SPPF-IA), and Deformable Convolutions (D-C2f). The experimental results on the SIRST-5K and IRSTD-1K datasets demonstrate performance gains over baseline and state-of-the-art methods. While the study is well-structured and methodologically sound, there are several weaknesses that should be addressed to improve its clarity, technical rigor, and scientific contribution.
1. While the paper claims to introduce novel components, each of the proposed improvements (WFDC, SPPF-IA, and D-C2f) is largely a combination of existing techniques (e.g., deformable convolutions, spatial pyramid pooling, and involution). The paper should clearly articulate how these modifications uniquely contribute beyond existing YOLO variants and previously developed infrared small target detection methods.
2. While deformable convolutions improve feature extraction for variable-scale objects, they may introduce high computational overhead. The manuscript does not analyze whether the increased computational cost is justified, or how a better trade-off between performance and efficiency could be achieved. Similarly, although Coordinate Attention (CA) is chosen, no clear justification is provided for why it is superior to other attention mechanisms.
3. While mAP@[.5:.95] and mAP@0.5 are reported, no discussion is provided on inference speed, model complexity, or computational cost. The proposed model introduces additional layers and computations (e.g., deformable convolutions), which likely increase latency. A detailed report of FLOPs and parameter counts should be included to assess practical deployment feasibility.
4. Many parts of the introduction and related work repeat background information on YOLO and CNN-based models without adding new insights. The writing should be more concise and focus on the unique contributions. In addition, figures demonstrating the failure cases of IRST-YOLO are missing. The paper should include failure case analysis to discuss where and why the method struggles.
5. Some works on multi-modal detection and attention mechanisms should be cited to make this submission more comprehensive, such as:
[a] Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection; DOI: 10.1109/tpami.2024.3511621
[b] Local Patch Network With Global Attention for Infrared Small Target Detection; DOI: 10.1109/TAES.2022.3159308
Author Response
Thank you for your constructive feedback. Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript proposes an improved YOLOv8-based model, IRST-YOLO, for infrared small target detection. New modules (WFDC, SPPF-IA, D-C2f) have been proposed to improve model performance. Experiments on SIRST-5K and IRSTD-1K datasets show significant performance gains over the baseline YOLOv8 and state-of-the-art alternatives. The manuscript presents solid engineering improvements to YOLOv8 for infrared small target detection, but it would benefit from stronger theoretical justifications, broader comparisons, and computational efficiency analyses. Please see the following for the major concerns.
1. Lack of Justification for Module Selection
• The proposed WFDC, SPPF-IA, and D-C2f modules seem effective, but the manuscript lacks a deeper theoretical justification or ablation studies comparing them to alternative approaches (e.g., different attention mechanisms, feature fusion strategies, or downsampling techniques).
• For instance, how does SPPF-IA compare with standard Transformer-based feature fusion methods?
2. Comparative Baseline Selection
• The manuscript only compares IRST-YOLO to other YOLO-based methods. However, Transformer-based and two-stage detection approaches (e.g., Faster R-CNN, DETR) should be included for a more comprehensive comparison.
• It would be insightful to see if ViTs or hybrid CNN-Transformer models perform better in this task.
3. Computational Efficiency and Deployment Feasibility
• While YOLOv8 is known for efficiency, the added modules (especially deformable convolutions) introduce extra computational costs. A thorough analysis of model complexity, inference time, and memory usage is missing.
• How does the improved model scale to real-world applications where real-time performance is critical?
4. Dataset Diversity and Generalization
• Both SIRST-5K and IRSTD-1K datasets focus on hollow infrared targets. How does the model perform on more diverse infrared datasets (e.g., mixed clutter environments, occlusions, or varying temperature conditions)?
• Additional testing on real-world scenarios would strengthen the claim of robustness.
5. Hyperparameter Justification and Reproducibility
• Learning rate choices, augmentation strategies, and loss weighting seem arbitrary. A brief justification or sensitivity analysis would add value.
• The paper should provide more implementation details, such as training time per epoch, hardware requirements, and availability of pretrained models.
Author Response
Thank you for your constructive feedback. Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
Comments
The same acronym is used for different spelled-out forms.
NOTE 1.
Line 160: Spatial Pyramid Pooling with Attention (SPPF-IA)
Line 161: Spatial Pyramid Pooling module with Attention (SPPF-IA)
Line 241: Spatial Pyramid Pooling Module with Attention (SPPF-IA)
Line 477: Spatial Pyramid Pooling module (SPPF- IA)
Similarly:
Line 151: Dual-Path Fusion Downsample Convolution (WFDC)
Line 153: Dual-Path Fusion Downsample Conv (WFDC)
Line 211: Dual-Path Fusion Downsample Convolution (WFDC) module
Line 476: Dual-Path Fusion Downsampling module (WFDC)
Question: Downsample Convolution or Downsampling module?
NOTE 2.
Line 342: Deformable Convolution C2f module (D-C2f)
Line 477: Deformable C2f module (D-C2f)
Line 329: C2f module with deformable convolutions.
Table 4: De-C2f
Is it the same module or not?
NOTE 3.
Fig. 10: The drawing is illegible.
NOTE 4.
Equation 2: The components [f_bn, σ, f_conv] must be described.
NOTE 5.
Line 105: SiLU activation function to perform initial feature extraction
Line 229: SiLU (CBS) module. Besides, what does CBS stand for: Computer-Based System? Component-Based Servicing? Core-Based System?
Line 232: SiLU(σ) activation function
How should this be understood?
NOTE 6.
Tables 2 and 3 (Comparison of algorithms on the SIRST-5K dataset / on the SIRST-1K dataset) include YOLOv8, IDD-YOLO, ASF-YOLO, YOLO-ANT, HIC-YOLOv5, and IRST-YOLOv8.
Figure 8 (Comparison of detection performance between IRST-YOLO, baseline models and other comparison models) includes IRST-YOLOv8, YOLOv8, HIC-YOLOv5, IDD-YOLO, and YOLO-ANT. Where is ASF-YOLO?
Recommendation
1. The authors present the known methods; it would be valuable to clearly pinpoint the flaws of these methods and where the proposed method is intended to outperform the current state of the art. The statements on this subject are too general, e.g.:
a. “Deformable Convolution: The backbone’s feature extraction module was refined by integrating deformable convolutions, enabling the network to adapt to targets with varying shapes, scales, and deformations, significantly improving detection precision”
2. The reviewed article may grab readers' attention and keep them interested throughout, but several aspects of the manuscript need minor clarification and revision.
CONCLUSION
The manuscript may be published in the journal "Electronics" after making appropriate corrections.
Author Response
Thank you for your constructive feedback. Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
No more comments.
Comments on the Quality of English Language
No more comments.
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you for the revision and response. The reviewer’s concerns have been addressed.