Peer-Review Record

Small-Target Detection Based on Improved YOLOv8 for Infrared Imagery

Electronics 2025, 14(5), 947; https://doi.org/10.3390/electronics14050947
by Huicong Wang 1,2, Kaijun Ma 3, Juan Yue 1, Yuhan Li 1,2, Jiaxin Huang 1,2, Jie Liu 1,2, Linhan Li 1,2, Xiaoyu Wang 1,2, Nengbin Cai 3 and Sili Gao 1,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 29 January 2025 / Revised: 20 February 2025 / Accepted: 24 February 2025 / Published: 27 February 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The main suggestions for revisions are as follows:

  1. When introducing the background of infrared small target detection, the description of the background and challenges of infrared image target detection is somewhat general. It would be better to clearly emphasize the current challenges, highlight the shortcomings of existing methods, and introduce the contributions of the proposed model.
  2. In the section on related work, the paper mainly reviews the categories of different methods. It could further emphasize the connection between these methods and this work. There is a lack of detailed comparisons between the proposed method and other related works; it is recommended to add relevant content.
  3. The paper currently uses traditional object detection metrics such as Precision, Recall, F1 Score, mAP@0.5, and mAP@[.5:.95] for performance evaluation. It is recommended to include visual comparisons such as ROC curves and their AUC values to display the results more intuitively and comprehensively.
  4. In the experimental results comparing algorithmic performance, the proposed model is compared with existing similar methods, but there is a lack of more detailed discussion about the advantages of the proposed model over others.
  5. In the results of algorithmic performance comparison, more detection results from different scenarios can be added to comprehensively demonstrate the effectiveness of the proposed algorithm.
  6. Regarding the detection performance of the algorithm in practical scenarios, it is recommended to add original images to provide a clearer contrast. Section 3.6 needs a more detailed title. Additionally, it is suggested to include a discussion on the algorithm’s detection capability for multi-target scenarios.
Comments on the Quality of English Language

The English still needs improvement to enhance reading fluency.

Author Response

Thank you for your constructive feedback. Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents an improved YOLOv8-based model (IRST-YOLO) for infrared small target detection. The proposed model integrates several novel components, including the Dual-Path Fusion Downsampling Convolution (WFDC) module, an Involution-based Spatial Pyramid Pooling module with Attention (SPPF-IA), and Deformable Convolutions (D-C2f). The experimental results on the SIRST-5K and IRSTD-1K datasets demonstrate performance gains over baseline and state-of-the-art methods. While the study is well-structured and methodologically sound, there are several weaknesses that should be addressed to improve its clarity, technical rigor, and scientific contribution.

1. While the paper claims to introduce novel components, each of the proposed improvements (WFDC, SPPF-IA, and D-C2f) is largely a combination of existing techniques (e.g., deformable convolutions, spatial pyramid pooling, and involution). The paper should clearly articulate how these modifications uniquely contribute beyond existing YOLO variants and previously developed infrared small target detection methods.

2. While deformable convolutions improve feature extraction for variable-scale objects, they may introduce high computational overhead. The manuscript does not analyze whether the increased computational cost is justified, or how a better trade-off between performance and efficiency could be achieved. Similarly, while Coordinate Attention (CA) is chosen, no clear justification is provided as to why it is superior to other attention mechanisms.

3. While mAP@[.5:.95] and mAP@0.5 are reported, no discussion is provided on inference speed, model complexity, or computational cost. The proposed model introduces additional layers and computations (e.g., deformable convolutions), which likely increase latency. Detailed FLOPs and model parameter counts should be included to assess practical deployment feasibility.

4. Many parts of the introduction and related work repeat background information on YOLO and CNN-based models without adding new insights. The writing should be more concise and focus on the unique contributions. In addition, figures demonstrating the failure cases of IRST-YOLO are missing. The paper should include failure case analysis to discuss where and why the method struggles.

5. Some works on multi-modal detection and attention mechanisms should be cited in this paper to make this submission more comprehensive, such as:

[a] Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection; DOI: 10.1109/tpami.2024.3511621

[b] Local Patch Network With Global Attention for Infrared Small Target Detection; DOI: 10.1109/TAES.2022.3159308

Author Response

Thank you for your constructive feedback. Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript proposes an improved YOLOv8-based model, IRST-YOLO, for infrared small target detection. New modules (WFDC, SPPF-IA, D-C2f) have been proposed to improve model performance. Experiments on SIRST-5K and IRSTD-1K datasets show significant performance gains over the baseline YOLOv8 and state-of-the-art alternatives. The manuscript presents solid engineering improvements to YOLOv8 for infrared small target detection, but it would benefit from stronger theoretical justifications, broader comparisons, and computational efficiency analyses. Please see the following for the major concerns.

1. Lack of Justification for Module Selection

The proposed WFDC, SPPF-IA, and D-C2f modules seem effective, but the manuscript lacks a deeper theoretical justification or ablation studies comparing them to alternative approaches (e.g., different attention mechanisms, feature fusion strategies, or downsampling techniques).

For instance, how does SPPF-IA compare with standard Transformer-based feature fusion methods?

2. Comparative Baseline Selection

The manuscript only compares IRST-YOLO to other YOLO-based methods. However, Transformer-based and two-stage detection approaches (e.g., Faster R-CNN, DETR) should be included for a more comprehensive comparison.

It would be insightful to see if ViTs or hybrid CNN-Transformer models perform better in this task.

3. Computational Efficiency and Deployment Feasibility

While YOLOv8 is known for efficiency, the added modules (especially deformable convolutions) introduce extra computational costs. A thorough analysis of model complexity, inference time, and memory usage is missing.

How does the improved model scale to real-world applications where real-time performance is critical?

4. Dataset Diversity and Generalization

Both SIRST-5K and IRSTD-1K datasets focus on hollow infrared targets. How does the model perform on more diverse infrared datasets (e.g., mixed clutter environments, occlusions, or varying temperature conditions)?

Additional testing on real-world scenarios would strengthen the claim of robustness.

5. Hyperparameter Justification and Reproducibility

Learning rate choices, augmentation strategies, and loss weighting seem arbitrary. A brief justification or sensitivity analysis would add value.

The paper should provide more implementation details, such as training time per epoch, hardware requirements, and availability of pretrained models.

Author Response

Thank you for your constructive feedback. Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Comments

The same acronym is used for different spelled-out forms of what it stands for.

NOTE 1.

Line 160: Spatial Pyramid Pooling with Attention (SPPF-IA)

Line 161: Spatial Pyramid Pooling module with Attention (SPPF-IA)

Line 241: Spatial Pyramid Pooling Module with Attention (SPPF-IA)

Line 477: Spatial Pyramid Pooling module (SPPF- IA)

More:

Line 151: Dual-Path Fusion Downsample Convolution (WFDC)

Line 153: Dual-Path Fusion Downsample Conv (WFDC)

Line 211: Dual-Path Fusion Downsample Convolution (WFDC) module

Line 476: Dual-Path Fusion Downsampling module (WFDC)

Question: Downsample Convolution or Downsampling module?

NOTE 2.

Line 342: Deformable Convolution C2f module (D-C2f)

Line 477: Deformable C2f module (D-C2f)

Line 329: C2f module with deformable convolutions.

Table 4: De-C2f

Is it the same module or not?

 

NOTE 3.

Fig. 10: The drawing is illegible.

NOTE 4.

Equation 2: The components [fbn, σ, fconv] must be described.

 

NOTE 5.

Line 105: SiLU activation function to perform initial feature extraction

Line 229: SiLU (CBS) module. Besides: CBS = Computer-Based System? Component-Based Servicing? Core-Based System?

Line 232: SiLU(σ) activation function

How should this be understood?

NOTE 6.

Tables 2 and 3 (Comparison of algorithms on the SIRST-5K dataset / on the SIRST-1K dataset) include YOLOv8, IDD-YOLO, ASF-YOLO, YOLO-ANT, HIC-YOLOv5, and IRST-YOLOv8.

Figure 8 (Comparison of detection performance between IRST-YOLO, baseline models and other comparison models) includes IRST-YOLOv8, YOLOv8, HIC-YOLOv5, IDD-YOLO, and YOLO-ANT. Where is ASF-YOLO?

 

Recommendation

1. The authors present the known methods; it would be valuable to pinpoint the flaws of these methods clearly and to state where the proposed method is intended to outperform the current state of the art. The statements on this subject are too general, e.g.:

a. “Deformable Convolution: The backbone’s feature extraction module was refined by integrating deformable convolutions, enabling the network to adapt to targets with varying shapes, scales, and deformations, significantly improving detection precision”

2. The reviewed article may grab readers' attention and keep them interested throughout, but several aspects of the manuscript need minor clarification and revision.

 

CONCLUSION 

The manuscript may be published in the journal "Electronics" after making appropriate corrections. 

Author Response

Thank you for your constructive feedback. Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

No more comments.

Comments on the Quality of English Language

No more comments.

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you for the revision and response. The reviewer’s concerns have been addressed.
