Article

Small-Target Detection Based on Improved YOLOv8 for Infrared Imagery

1 Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
2 University of Chinese Academy of Sciences, Beijing 100864, China
3 Shanghai Key Laboratory of Crime Scene Evidence, Shanghai 200083, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(5), 947; https://doi.org/10.3390/electronics14050947
Submission received: 29 January 2025 / Revised: 20 February 2025 / Accepted: 24 February 2025 / Published: 27 February 2025

Abstract: Infrared small-target detection plays a crucial role in applications such as public safety monitoring. However, it faces significant challenges due to the loss of target features, which weakens detection performance. To tackle this problem, this study proposes an improved infrared small-target detection model based on YOLOv8n. First, a Dual-Path Fusion Downsampling Convolution (WFDC) module enhances the backbone network’s ability to extract fine-grained target features, preventing the loss of image details as the depth of the convolutional neural network increases. Second, Involution and Coordinate Attention (CA) mechanisms are integrated into the spatial pyramid pooling module, where the attention and Involution operations aggregate contextual semantic information over a broader spatial range, enriching the channel information at each scale. Finally, deformable convolutions are incorporated into the backbone of the model, enabling better handling of target deformations across various scenarios. Experiments conducted on the SIRST-5K and IRSTD-1K datasets demonstrate that our method outperforms both the baseline YOLOv8 model and several state-of-the-art YOLOv8-based improved detection methods. Compared to the baseline model, our approach achieves mAP@[0.5:0.95] improvements of 12.3% and 16.4% on the two datasets, respectively. These results highlight the effectiveness of our proposed enhancements in improving detection accuracy and model robustness.

1. Introduction

Infrared search and tracking (IRST) systems are widely recognized for their all-weather operability, strong concealment, and high resistance to interference. These systems have been extensively utilized in diverse fields such as remote sensing, forest fire detection, and anti-drone technologies [1,2,3,4,5]. For example, in maritime rescue operations, they assist in identifying distress signals, while in geographic surveys, they enable precise tracking under complex environmental and adverse weather conditions. However, compared to general target detection, infrared small-target detection poses unique challenges. These include the diminutive size of the targets, low image contrast, low signal-to-clutter ratio, and blurred target boundaries, all of which complicate both research and practical applications in this domain.
The primary difficulties in infrared small-target detection arise from the small size of the targets, low contrast between the target and background, and low signal-to-clutter ratios. These issues are further compounded by blurred target boundaries and environmental factors such as varying lighting, temperature, and noise interference. As a result, detecting small infrared targets is significantly more challenging than detecting larger or higher-contrast objects. In these scenarios, traditional infrared target detection methods often struggle to detect small targets with high precision and accuracy.
To address these challenges, various detection methods have been developed, which can be broadly classified into model-driven and data-driven approaches. Model-driven methods are further categorized into three primary groups: those based on filter construction [6,7], data structure analysis [8,9], and methods inspired by the human visual system [10,11,12,13]. Filter-based methods aim to design specialized filters that enhance target regions while suppressing background noise and clutter. Data-structure analysis methods leverage geometric and statistical characteristics to identify targets. Meanwhile, human visual system-based methods emulate attention mechanisms in human vision to improve target saliency. Although effective in certain conditions, these model-driven methods heavily rely on handcrafted features and prior knowledge, which limits their adaptability to dynamic scenarios, especially when targets vary in shape or size or when background complexity increases. In particular, filter-based methods often fail in complex environments where target contrast is low and background clutter is high, particularly in infrared small-target detection. Data structure analysis methods struggle in the presence of varying environmental conditions that alter the target’s appearance, such as changes in temperature or lighting. Human visual system-based methods often fail to address the challenge of preserving the fine-grained details required for detecting small targets.
In contrast, data-driven approaches—facilitated by the advancements in convolutional neural networks (CNNs)—have transformed infrared small-target detection. These approaches leverage multilayer neural networks to automatically extract high-dimensional features from data, bypassing the need for manual feature engineering. Common deep learning frameworks for object detection include two-stage detectors, such as Faster R-CNN [14] and Mask R-CNN [15], as well as one-stage detectors like YOLOv3 [16] and YOLOv5. These methods often rely on aggressive downsampling to capture deep semantic information, leading to a loss of fine-grained details crucial for detecting small targets. Furthermore, CNNs can be sensitive to variations in target scale, rotation, and deformation, making them less robust in complex infrared environments.
Recent advancements have sought to address these challenges. For instance, Dai [17] introduced an Asymmetric Context Modulation (ACM) module, which combines global context features with channel-wise attention to enhance both semantic understanding and fine detail preservation. Li [18] proposed a Dense Nested Interaction Module (DNIM) and a Cascade Channel and Spatial Attention Module (CCSAM) to progressively integrate features and adaptively focus on critical multi-scale information, enhancing feature extraction and improving multi-scale information integration. Similarly, Xu [19] developed a Hierarchical Context Fusion (HCF) network, incorporating modules such as Parallel Patch-aware Attention (PPA), Dimension-aware Selective Integration (DASI), and Multi-Dilation Channel Refiner (MDCR) to significantly improve detection performance. Chen [20] proposed a local patch network (LPNet) with a global attention mechanism to improve infrared small-target detection by effectively integrating global and local features, addressing class imbalance and enhancing multi-scale feature fusion. Tang [21] introduced ConTriNet, a robust Confluent Triple-Flow Network that uses a “Divide-and-Conquer” strategy with modality-specific and modality-complementary flows, enhancing RGB-Thermal Salient Object Detection by minimizing inter-modality discrepancies and effectively aggregating multi-scale cues. While these approaches make strides in improving infrared small-target detection, they still struggle with the core issue of balancing the preservation of fine details with the need for deep semantic feature extraction, which is essential for small-target detection.
To tackle these issues, we propose an enhanced architecture based on YOLOv8, named IRST-YOLO, specifically designed for infrared small-target detection. This improved framework addresses the limitations of existing models when applied to such tasks. Comparative experiments conducted on publicly available datasets demonstrate that IRST-YOLO outperforms state-of-the-art models, achieving superior results in both detection accuracy and robustness. The key contributions of this study are as follows:
(a)
Dual-Path Fusion Downsampling Convolution: We introduce a novel downsampling module that effectively preserves both fine-grained details and semantic features during the downsampling process, addressing the crucial issue of detail loss in small-target detection.
(b)
Involution-based Spatial Pyramid Pooling with Attention (SPPF-IA): A new module is introduced between the backbone and neck of the network to enhance channel information and improve feature fusion, providing better adaptability to complex infrared scenes.
(c)
Deformable Convolution: We integrate deformable convolutions into the backbone, allowing the network to adapt to targets with varying shapes, sizes, and deformations, significantly improving detection precision in challenging scenarios.
Through these innovations, IRST-YOLO overcomes the limitations of existing CNN-based methods and achieves superior performance in infrared small-target detection. Our extensive experiments demonstrate that IRST-YOLO outperforms state-of-the-art models in terms of both detection accuracy and robustness.

2. Materials and Methods

2.1. YOLO (You Only Look Once) Object Detection Framework

YOLOv8, as one of the latest iterations in the YOLO series, is renowned for its efficiency in real-time object detection, combining fast inference speed with high accuracy. Its flexible and modular design allows it to adapt to a wide range of tasks, making it a leading choice for real-time detection applications. To meet varying computational and performance requirements, YOLOv8 is available in multiple versions, each tailored for specific use cases. Figure 1 provides an overview of its fundamental framework.

2.1.1. Backbone Network

The backbone network in YOLOv8 is responsible for extracting features from the input image. This process involves five downsampling operations and four feature-extraction modules, producing feature maps at four distinct scales: P2, P3, P4, and P5. These scales correspond to progressively reduced resolutions, obtained through 4×, 8×, 16×, and 32× downsampling, respectively. The feature maps are subsequently passed into the feature-fusion network, where they are integrated to form a unified representation. The final merged output is delivered to the detection head for object recognition.
Key components of the backbone network include:
Conv Module: Utilizes 2D convolutions, batch normalization, and the SiLU activation function to perform initial feature extraction.
C2f Module: An enhanced CSPNet-based bottleneck structure with two convolutional layers. It features two branches: one with a convolution layer and multiple bottlenecks, and another with a single convolution module. These branches are merged to retain residual features and enhance feature reuse.
SPPF [22] (Spatial Pyramid Pooling—Fast): Positioned between the backbone and the feature-fusion network, this module aggregates features from different receptive fields using small pooling layers, effectively enriching feature representation while reducing computational overhead.
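For reference, the SPPF structure described above can be written compactly in PyTorch. The sketch below mirrors the Ultralytics-style implementation, in which three chained 5 × 5 max-pooling layers emulate the 5/9/13 pooling pyramid of the original SPP at lower cost; the channel widths in the usage line are illustrative.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Conv2d + BatchNorm + SiLU (the CBS block used throughout YOLOv8)."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three chained 5x5 max-pools
    aggregate features from progressively larger receptive fields."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = Conv(c_in, c_hidden, 1, 1)
        self.cv2 = Conv(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)       # effective 5x5 receptive field
        y2 = self.pool(y1)      # effective 9x9
        y3 = self.pool(y2)      # effective 13x13
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))

print(SPPF(256, 256)(torch.randn(1, 256, 20, 20)).shape)  # [1, 256, 20, 20]
```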

2.1.2. Neck Network

The neck network in YOLOv8 combines elements from the Feature Pyramid Network [23] (FPN) and Path Aggregation Network [24] (PANet), designed to enhance multi-scale feature representation:
FPN: Employs a bottom-up pathway to progressively reduce feature map sizes and extract richer semantic details, followed by a top-down pathway that upsamples feature maps to enrich lower-level features with semantic information. Lateral connections ensure that feature maps at the same scale are integrated across layers, improving multi-scale object representation.
PANet: Builds upon the FPN by introducing an additional bottom-up path, which propagates positional information from lower layers to higher ones, enhancing localization capabilities across different scales.

2.1.3. Detection Head

YOLOv8’s base version incorporates three detection heads to identify objects at small, medium, and large scales, corresponding to feature maps P3 (80 × 80), P4 (40 × 40), and P5 (20 × 20). Each detection head divides the feature map into two branches:
Regression Branch: Outputs bounding box coordinates and sizes.
Classification Branch: Generates class confidence scores for detected objects.
To refine the results, YOLOv8 employs Non-Maximum Suppression (NMS), which eliminates redundant bounding boxes and outputs the most accurate detections based on location, size, and confidence scores.
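The NMS step can be illustrated with torchvision’s built-in operator; the boxes and threshold below are illustrative values only.

```python
import torch
from torchvision.ops import nms

# Boxes in (x1, y1, x2, y2) format with per-box confidence scores.
boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 12., 52., 52.],     # heavy overlap with box 0
                      [100., 100., 140., 140.]])
scores = torch.tensor([0.90, 0.75, 0.60])

# Keep the highest-scoring box among any group with IoU above the threshold.
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -> the overlapping lower-score box is suppressed
```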

2.1.4. Extended Versions: P2 and P6

To accommodate diverse detection scenarios, YOLOv8 offers additional variants:
P2 Version: Incorporates an extra detection head in the neck, targeting a 160 × 160 feature map. This modification is specifically designed for detecting smaller targets, improving performance in scenarios where high-resolution feature maps are crucial.
P6 Version: Extends the network depth with six downsampling operations, enabling the detection of larger objects in high-resolution imagery. This variant is particularly effective for handling large-scale and extra-large targets.

2.2. Improved YOLOv8 Object Detection Model: IRST-YOLO

The proposed IRST-YOLO, illustrated in Figure 2, is an enhanced version of the YOLOv8-P2 model. Designed to achieve an optimal balance between detection speed and precision, IRST-YOLO is particularly well-suited for real-time infrared small-target detection tasks, where both high accuracy and fast processing are critical. The architecture consists of three main components: the backbone for feature extraction, the neck for feature fusion, and the detection head for precise target identification. The baseline YOLOv8-P2 model has been augmented with three significant improvements, detailed below.
Dual-Path Fusion Downsampling Convolution (WFDC): To address the challenge of preserving fine-grained details and semantic information during the detection of small infrared targets, we developed a novel downsampling module called Dual-Path Fusion Downsampling Convolution (WFDC). Traditional pooling and downsampling techniques often lead to the loss of critical spatial details, which are essential for small-target detection. The WFDC module mitigates this issue by combining Space-to-Depth Convolution (SPD-Conv) [25] with a fusion mechanism, ensuring the retention of fine-grained spatial features during the downsampling process. This design not only preserves detail but also enhances semantic richness, contributing to more accurate target identification.
Involution-based Spatial Pyramid Pooling with Attention (SPPF-IA): The second enhancement is the introduction of the Involution-based Spatial Pyramid Pooling with Attention (SPPF-IA), strategically positioned between the backbone and neck. This module is designed to enhance the flow of semantic information throughout the network. By incorporating Involution operations, which dynamically generate pixel-adaptive convolution kernels, the SPPF-IA module enriches the feature maps with detailed contextual information. Additionally, an attention mechanism ensures that the network focuses on the most critical features, improving the quality of information passed to deeper layers. This modification significantly boosts the overall performance of the model by optimizing feature representation and fusion.
Deformable C2f (D-C2f): The final improvement involves optimizing the primary feature-extraction process within the backbone. We propose the Deformable C2f (D-C2f), which replaces standard convolutional layers with deformable convolution kernels (DCN) [26]. Unlike traditional convolutions with fixed sampling grids, deformable convolutions dynamically adjust their kernel shapes and sampling positions based on the input data. This adaptability allows the network to handle targets with varying scales, rotations, and deformations more effectively. As a result, the D-C2f module significantly improves the model’s ability to extract accurate features, even in challenging scenarios with complex target shapes or backgrounds.

2.3. Dual-Path Fusion Downsampling Convolution (WFDC)

2.3.1. Principle of the SPD Module

The Space-to-Depth Convolution (SPD-Conv) module [25] is designed to enhance the preservation of fine-grained features in shallow feature maps during the network’s downsampling process. Compared to traditional strided convolutions and pooling layers, SPD-Conv offers notable advantages in retaining critical spatial details. This module can be seamlessly integrated into any convolutional neural network to improve feature-extraction quality. Figure 3 shows the SPD-Conv.
Slicing Operation: Given an input feature map $X$ with dimensions $[C_1, S, S]$, where $C_1$ represents the number of channels and $S$ is the width and height, the slicing operation divides the input into smaller sub-feature maps according to the stride (step). The process can be expressed as follows:
$$
\begin{gathered}
f_{0,0} = X[0:S:step,\ 0:S:step],\quad f_{1,0} = X[1:S:step,\ 0:S:step],\quad \ldots,\quad f_{step-1,0} = X[step-1:S:step,\ 0:S:step];\\
f_{0,1} = X[0:S:step,\ 1:S:step],\quad f_{1,1} = X[1:S:step,\ 1:S:step],\quad \ldots,\quad f_{step-1,1} = X[step-1:S:step,\ 1:S:step];\\
\vdots\\
f_{0,step-1} = X[0:S:step,\ step-1:S:step],\quad \ldots,\quad f_{step-1,step-1} = X[step-1:S:step,\ step-1:S:step].
\end{gathered}
$$
This operation produces $step^2$ sub-feature maps. For example, when step = 2, the input is divided into four sub-feature maps $f_{0,0}$, $f_{1,0}$, $f_{0,1}$, $f_{1,1}$, each with dimensions $[C_1, \frac{S}{2}, \frac{S}{2}]$, achieving a 2× downsampling.
Channel Concatenation: The resulting sub-feature maps are concatenated along the channel dimension to form a new feature map $X_0$ with dimensions $[C_1 \times step^2, \frac{S}{step}, \frac{S}{step}]$. For instance, when step = 2, the new feature map has half the width and height of the original map, but the channel count increases fourfold.
Convolution Operation: A convolution operation with a stride of 1 is applied to $X_0$, yielding an output feature map $X_1$ with dimensions $[C_2, \frac{S}{step}, \frac{S}{step}]$. This step retains discriminative information while refining the number of output channels.
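A minimal PyTorch sketch of the SPD-Conv operation for step = 2 is given below; the kernel size and channel counts are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-Depth convolution (step = 2): slice even/odd rows and columns,
    stack the four sub-maps along channels, then refine with a stride-1 conv."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, k, stride=1, padding=k // 2)

    def forward(self, x):
        # Four interleaved sub-feature maps, each [C, S/2, S/2].
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

x = torch.randn(1, 64, 32, 32)
print(SPDConv(64, 128)(x).shape)  # torch.Size([1, 128, 16, 16])
```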

2.3.2. Design of WFDC Module

Detecting small targets in infrared images presents unique challenges, such as low resolution, blurred boundaries, and limited color features. These characteristics hinder the model’s ability to extract fine-grained details and semantic features effectively, often leading to information loss.
In a convolutional neural network, shallow layers close to the input capture textures, edges, and spatial details, but these features lack semantic richness. Conversely, deeper layers enhance semantic information through repeated convolutions and pooling operations, but at the cost of reduced resolution and detail. This trade-off is particularly detrimental to small-target detection, where fine-grained details are crucial.
To address these issues, we propose the WFDC module, which integrates two complementary branches:
SPD Branch: Uses SPD-Conv to retain fine-grained spatial information during downsampling.
Strided Convolution Branch: Applies traditional strided convolution for effective semantic feature extraction.
These branches independently process the input feature map, and their outputs are fused through point-wise convolution to preserve both semantic richness and spatial detail, as illustrated in Figure 4.
Comparison of SPD-Conv and Strided Convolution: While SPD-Conv retains fine-grained details, its discontinuous receptive field may disrupt the integrity of the input feature map, leading to semantic information loss. In contrast, strided convolution maintains continuous receptive fields, better capturing contextual and global features. By combining these two methods, the WFDC module balances fine-grained and semantic feature extraction, mitigating feature loss during downsampling.
Mathematical Representation of the WFDC Module:
Given an input feature map $x_i$ with dimensions $[C_{in}, W, H]$, the WFDC module processes the input as follows:
Convolution–Batch Normalization–SiLU (CBS) Module:
$$f_{CBS}(k, s, c, g, x_i) = \sigma\left(f_{bn}\left(f_{conv}(k, s, c, g, x_i)\right)\right)$$
SPD-Conv:
$$f_{SPD}(x_i) = x_i[:,\ 0:W:2,\ 0:H:2] \oplus x_i[:,\ 1:W:2,\ 0:H:2] \oplus x_i[:,\ 0:W:2,\ 1:H:2] \oplus x_i[:,\ 1:W:2,\ 1:H:2]$$
$$F_{SPDConv}(k, c_{out}, g, x_i) = f_{CBS}\left(k, 1, c_{out}, g, f_{SPD}(x_i)\right)$$
WFDC Module Output:
$$F_{WFDC}(k, c_{out}, x_i) = f_{CBS}\left(1, 1, c_{out}, g,\ f_{CBS}(k, 2, c_{out}, g, x_i) \oplus F_{SPDConv}(k, c_{out}, g, x_i)\right)$$
Here, $f_{CBS}$ represents a convolution operation ($f_{conv}$) followed by batch normalization ($f_{bn}$) and SiLU activation ($\sigma$); $k$ denotes the kernel size, $s$ the stride, $c$ the desired output channel count, $g$ the number of groups, and $x_i$ the input feature map; $\oplus$ denotes channel-wise concatenation. The final output feature map has dimensions $[C_{out}, \frac{W}{2}, \frac{H}{2}]$.
The WFDC module effectively balances the strengths of SPD-Conv and strided convolution, enabling it to retain fine-grained spatial details while extracting rich semantic information. Unlike traditional downsampling techniques, WFDC enables the model to retain both detailed features and high-level semantic information simultaneously, a feature that traditional methods like max-pooling or standard convolutions do not provide. This is especially important in infrared small-target detection, where fine details and spatial accuracy are crucial for identifying small targets in complex backgrounds. The unique contribution of WFDC is its ability to mitigate the information loss that occurs in traditional downsampling operations, enhancing the detection performance of small targets without compromising the model’s ability to learn global semantic patterns.
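A plausible PyTorch sketch of the WFDC module, assembled from the equations above, is shown below. The 3 × 3 kernel size and the channel widths of the two branches are illustrative assumptions; the point-wise fusion follows the description in this subsection.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3, s=1):
    """Conv + BatchNorm + SiLU, corresponding to f_CBS in the equations above."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.SiLU())

class WFDC(nn.Module):
    """Dual-path downsampling: a stride-2 CBS branch (continuous receptive
    field, semantics) and an SPD branch (fine detail), fused point-wise."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.strided = cbs(c_in, c_out, k=3, s=2)        # semantic branch
        self.spd_conv = cbs(4 * c_in, c_out, k=3, s=1)   # after space-to-depth
        self.fuse = cbs(2 * c_out, c_out, k=1, s=1)      # point-wise fusion

    def forward(self, x):
        spd = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                         x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.fuse(torch.cat([self.strided(x), self.spd_conv(spd)], dim=1))

x = torch.randn(1, 64, 64, 64)
print(WFDC(64, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```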

2.4. Involution-Based Spatial Pyramid Pooling with Attention (SPPF-IA)

2.4.1. Design of SPPF-IA Module

The Spatial Pyramid Pooling—Fast (SPPF) module, an improvement over the original spatial pyramid pooling structure proposed by He et al. [22], is utilized in YOLOv8 to integrate feature maps from three different pooling scales. This approach enables the fusion of local and global features, thereby enhancing the semantic richness of the feature maps and improving overall model performance.
However, to improve computational efficiency, the SPPF module employs a 1 × 1 convolution for channel compression after feature concatenation. While this reduces computational overhead, it also results in a loss of valuable channel information, limiting the module’s effectiveness in retaining detailed feature representations.
To address this limitation, we propose the SPPF-IA module by introducing Involution [27] and a Coordinate Attention (CA) [28] mechanism into the SPPF module. These enhancements are designed to further optimize feature extraction and fusion, particularly for small-target detection tasks.
Involution dynamically generates pixel-adaptive convolution kernels, allowing each pixel to receive unique weights based on its spatial context. This capability enhances the representation of small-scale targets by mitigating the channel information loss caused by traditional 1 × 1 convolutions.
Coordinate Attention (CA) is integrated at the beginning of the SPPF module to assign higher weights to critical features while suppressing irrelevant information. By highlighting key features, the CA mechanism ensures that the network focuses on the most salient aspects of small targets, significantly improving detection accuracy and robustness in complex scenes.
The SPPF-IA module achieves a balance between computational efficiency and feature richness. It is particularly effective in detecting small-sized targets, offering more accurate localization and classification. Figure 5 illustrates the structure of the improved module, demonstrating its superior performance in feature fusion and its ability to substantially reduce information loss.
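As a reference for the attention component, the sketch below implements Coordinate Attention roughly as proposed by Hou et al. [28]: global pooling is factorized into per-row and per-column pooling so that the resulting gates retain positional information. The reduction ratio is an illustrative choice.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate Attention: pool along H and W separately, encode jointly,
    then gate the input with direction-aware, position-sensitive weights."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # [n, c, h, 1]
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # [n, c, w, 1]
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = self.conv_h(y_h).sigmoid()                       # row gates
        a_w = self.conv_w(y_w.permute(0, 1, 3, 2)).sigmoid()   # column gates
        return x * a_h * a_w

print(CoordAtt(64)(torch.randn(1, 64, 32, 32)).shape)  # [1, 64, 32, 32]
```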

2.4.2. Principle of the Involution Module

The Involution module, as shown in Figure 6, is designed to enhance the receptive field by introducing inverse transformations in both the spatial and channel domains. Unlike traditional convolution kernels with fixed weights, Involution kernels are dynamically generated from the input feature map, enabling them to adapt to local and contextual variations.
Kernel Design: The Involution kernel, denoted as $\mathcal{H} \in \mathbb{R}^{H \times W \times K \times K \times G}$, is defined by the input feature map’s height ($H$) and width ($W$), the kernel size ($K$), and the number of groups ($G$). Each group shares the same Involution kernel. For a pixel at spatial coordinate $(i, j)$, the corresponding kernel $\mathcal{H}_{i,j}$ is applied to the feature map.
Kernel Generation: The kernel $\mathcal{H}_{i,j}$ is dynamically generated from the input feature map $X$ using a mapping function $\phi$:
$$\mathcal{H}_{i,j} = \phi\left(X_{\Psi_{i,j}}\right)$$
Here, $X_{\Psi_{i,j}}$ represents the local context around pixel $(i, j)$. This generation process implicitly disperses the channel information of a single pixel across its spatial neighborhood, enriching the receptive field.
Output Computation: The output feature map $Y$ is computed by performing multiply–add operations between the input feature map $X$ and the Involution kernel $\mathcal{H}$:
$$Y_{i,j,k} = \sum_{(u,v) \in \Delta_K} \mathcal{H}_{i,\ j,\ u+\lfloor K/2 \rfloor,\ v+\lfloor K/2 \rfloor,\ \lceil kG/C \rceil}\ X_{i+u,\ j+v,\ k}$$
Here, $\Delta_K$ denotes the kernel’s spatial extent, and the indices $(u, v)$ represent positions within the kernel. This operation allows the kernel to adapt dynamically to spatial and contextual variations, capturing both local details and global patterns.
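A compact PyTorch sketch of the Involution operation defined by the two equations above follows. The kernel-generation branch (a channel-reduction convolution followed by a span convolution) mirrors Li et al. [27]; the kernel size, group count, and reduction ratio are illustrative hyperparameters.

```python
import torch
import torch.nn as nn

class Involution(nn.Module):
    """Involution: a side branch predicts a K x K kernel per spatial location,
    shared by the C/G channels within each of the G groups."""
    def __init__(self, channels, k=7, groups=4, reduction=4):
        super().__init__()
        self.k, self.g = k, groups
        self.reduce = nn.Sequential(                    # phi, part 1: reduce
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction), nn.ReLU())
        self.span = nn.Conv2d(channels // reduction,    # phi, part 2: span
                              k * k * groups, 1)
        self.unfold = nn.Unfold(k, padding=k // 2)      # gather K x K patches

    def forward(self, x):
        n, c, h, w = x.shape
        # Per-pixel kernels H_{i,j}: [n, G, 1, K*K, h, w].
        kernel = self.span(self.reduce(x)).view(n, self.g, 1, self.k**2, h, w)
        # Local neighborhoods of X: [n, G, C/G, K*K, h, w].
        patches = self.unfold(x).view(n, self.g, c // self.g, self.k**2, h, w)
        # Multiply-add over the kernel extent (the sum over (u, v) above).
        return (kernel * patches).sum(dim=3).view(n, c, h, w)

print(Involution(64)(torch.randn(1, 64, 20, 20)).shape)  # [1, 64, 20, 20]
```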
By integrating Involution and Coordinate Attention, the SPPF-IA module significantly enhances the feature-fusion process. Involution, unlike traditional convolutions, dynamically adapts its kernel based on input features, which makes it better suited for small-target detection where spatial details vary significantly. The SPPF-IA module combines multi-scale pooling with Involution, allowing the model to effectively capture both local and global context while focusing attention on critical features that distinguish small targets from the background.
The integration of the Involution with spatial pyramid pooling is particularly advantageous in infrared small-target detection, as it enhances the model’s spatial awareness across scales and adaptability to varying target shapes. This fusion of techniques allows SPPF-IA to retain rich feature information across different resolutions while dynamically focusing on relevant features.
The dynamic, pixel-adaptive nature of Involution improves small-scale target representation, while the CA mechanism ensures that the model prioritizes critical features. These enhancements enable the SPPF-IA module to outperform the traditional SPPF module in both accuracy and robustness, making it an essential component of the IRST-YOLO framework.

2.5. Deformable C2f

Deformable convolution [26,29] is an advanced convolutional technique designed to dynamically adjust the sampling positions and shapes of convolution kernels. It was introduced to overcome the limitations of traditional convolution, which struggles to handle complex transformations such as the scaling, rotation, and deformation of objects. Conventional convolution operations rely on a fixed geometric structure, which can result in reduced accuracy and stability when dealing with irregular shapes or dynamically changing targets. To address these challenges, Zhu et al. [29] introduced deformable convolution, which enhances the flexibility of convolution by adaptively modifying the sampling positions of kernels. This modification allows the convolution to better capture target transformations, improving its performance in scenarios with varying object shapes.
The core mechanism behind deformable convolution is its ability to predict offset values for the convolution kernel’s sampling points. These offsets enable the kernel to adaptively adjust its sampling positions, allowing it to better capture variations in target shapes and positions. Unlike traditional convolution, which uses a fixed grid for sampling, deformable convolution dynamically adjusts the sampling positions according to the predicted offsets. This adaptability allows the kernel to handle deformations and fine details more effectively, particularly in challenging environments with complex backgrounds.
Mathematical Representation: The mathematical formulation for deformable convolution is as follows [29]:
$$y(p) = \sum_{k=1}^{K} w_k \cdot x\left(p + p_k + \Delta p_k\right) \cdot \Delta m_k$$
In this equation, $y(p)$ represents the value at position $p$ in the output feature map, $x$ is the input feature map, $w_k$ is the convolution kernel weight, $p_k$ is the sampling position of the kernel in a regular convolution operation, $\Delta p_k$ is the offset predicted by the deformable convolution, and $\Delta m_k$ is a modulation weight that adjusts the contribution of each sampling point.
This equation illustrates how deformable convolution dynamically adjusts the convolutional kernel’s sampling positions, computing a weighted sum of values from the input feature map at the adjusted positions. This mechanism allows deformable convolution to capture more complex and irregular target shapes, thereby improving model accuracy and stability, particularly in complex environments where target deformations occur frequently.
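Modulated deformable convolution, as formulated above, is available in torchvision. The sketch below pairs DeformConv2d with side branches that predict the offsets ($\Delta p_k$) and modulation weights ($\Delta m_k$); the block layout is a generic DCNv2-style pattern, not the paper’s exact module.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Modulated deformable convolution: side branches predict per-position
    offsets (delta p_k) and modulation masks (delta m_k)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # 2 offsets (x, y) and 1 mask value per kernel sampling point.
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        self.mask = nn.Conv2d(c_in, k * k, k, padding=k // 2)
        self.dcn = DeformConv2d(c_in, c_out, k, padding=k // 2)

    def forward(self, x):
        offset = self.offset(x)
        mask = torch.sigmoid(self.mask(x))  # modulation weights in (0, 1)
        return self.dcn(x, offset, mask)

x = torch.randn(1, 64, 40, 40)
print(DeformableBlock(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```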

Application of Deformable Convolution in IRST-YOLO

In the enhanced IRST-YOLO model, we have refined the feature-extraction process within the YOLOv8 backbone to address the specific challenges associated with detecting small infrared targets. Traditional convolutional layers are limited by fixed receptive fields, which hinder the detection of small or deformed targets. To address this, we integrated deformable convolutions into the backbone’s feature-extraction module. This allows the network to adaptively adjust the sampling positions of convolution kernels based on the input features, enabling IRST-YOLO to better handle targets with varying shapes, scales, and deformations. This refinement significantly enhances the network’s ability to accurately detect small targets, improving detection precision and localization accuracy, particularly in complex infrared environments.
By integrating deformable convolutions into the C2f module, we enable the network to better adjust its convolutional kernels to handle irregular target shapes and dynamic movements. This flexibility leads to more precise feature extraction, which is critical for detecting small and often distorted infrared targets. The improved model’s ability to dynamically adjust to complex target variations results in better localization and classification accuracy, especially in environments with varied background conditions.
Figure 7 illustrates the structure of the deformable C2f, demonstrating how the integration of deformable convolutions enhances the feature-extraction process. These improvements directly contribute to the model’s ability to detect small targets more effectively in infrared images.
The deformable convolution C2f (D-C2f) introduces significant improvements to the YOLOv8 backbone by replacing conventional convolutions with deformable convolutions. This modification greatly enhances the network’s flexibility in handling targets with varying shapes, scales, and deformations, which are common in infrared small-target detection. By dynamically adjusting the kernel’s sampling positions, deformable convolutions enable the model to more accurately capture the intricate details of targets, leading to improved detection performance, especially for small and irregular targets in complex environments.

3. Results

3.1. Experimental Setup

Experimental equipment:
The experiments were conducted on a system equipped with the following hardware and software configurations:
CPU: 13th Gen Intel® Core™ i5-13490F (10 cores, 16 logical processors, 2.5 GHz base clock speed);
GPU: NVIDIA GeForce RTX 4060 Ti with 16 GB of graphics memory;
Framework: PyTorch (version 2.1.2), accelerated using CUDA 11.8.
This setup provided sufficient computational power to train and evaluate the IRST-YOLO model efficiently, ensuring robust and reliable results.
Dataset:
1.
SIRST-5K Dataset: The SIRST-5K dataset [30] was the primary dataset used to validate the effectiveness of the proposed IRST-YOLO model. This dataset was generated using a negative sample enhancement method proposed by Lu [30], designed to produce a large number of negative samples for self-supervised learning. It consists of a substantial amount of synthetic data with corresponding labels, offering a wide variety of typical infrared small-target scenarios. Key characteristics of the dataset include:
Rich background diversity to simulate complex real-world conditions.
Inclusion of interference samples, which enhances the model’s generalization capabilities.
The SIRST-5K dataset serves as a reliable foundation for model training and evaluation, particularly in challenging infrared small-target detection scenarios.
2.
IRSTD-1K Dataset: The IRSTD-1K dataset [31] is a specialized dataset tailored for infrared small-target detection tasks. It contains 1000 images specifically curated for detecting small infrared targets, reflecting real-world complexities. Notable features of the dataset include:
Varied backgrounds, introducing challenges akin to those in practical applications.
Low contrast and small target sizes, which closely mimic real-world conditions in infrared imaging.
These datasets collectively provide a robust experimental foundation for assessing the performance of IRST-YOLO in detecting small infrared targets across diverse and complex scenarios.
3.
Data Splitting: Both datasets were divided into three subsets:
Training set: 60% of the images;
Validation set: 20% of the images;
Testing set: 20% of the images.
This data-splitting strategy ensures that the model is rigorously trained, validated, and tested, enabling a comprehensive evaluation of its performance.
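A minimal script for the 60/20/20 split might look as follows; the directory layout, file format, and random seed are hypothetical.

```python
import random
from pathlib import Path

# Hypothetical 60/20/20 split of an image directory into train/val/test lists.
images = sorted(Path("datasets/SIRST-5K/images").glob("*.png"))
random.seed(0)
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.6 * n)],
    "val":   images[int(0.6 * n): int(0.8 * n)],
    "test":  images[int(0.8 * n):],
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```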

3.2. Training Settings

To expedite training, the input image size was set to 640 × 640 pixels. A batch size of 8 was used, and training was conducted over 2000 epochs, with early stopping (patience of 100) applied to avert overfitting. Stochastic Gradient Descent (SGD) was selected as the optimization method, with an initial learning rate of 0.001, a final learning rate of 0.001 × lrf (as shown in Table 1), and a momentum of 0.937, following common practice in deep learning models for object detection tasks; these settings were found to work well in initial experiments with the YOLOv8 backbone. The loss-function weights were set to 0.5 for the classification loss, 7.5 for the bounding box loss, and 1.5 for the distribution focal loss. We applied common data-augmentation techniques such as mosaic, copy–paste, random affine transformations, blending, and image cropping (as detailed in Table 1). These augmentations were selected to simulate real-world variability in infrared images and improve the model’s generalization ability, and the final set was chosen based on the performance stability it offered across different datasets. Training started directly from randomly initialized parameters, without using a pre-trained model.
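Assuming the standard Ultralytics training API, the configuration above could be reproduced roughly as follows. The dataset YAML path is hypothetical, and the stock yolov8n config stands in for the modified IRST-YOLO architecture, whose custom modules would need to be registered in the model YAML.

```python
from ultralytics import YOLO

# Build from a YAML config (random initialization, no pre-trained weights).
model = YOLO("yolov8n.yaml")

model.train(
    data="sirst5k.yaml",   # hypothetical dataset config path
    imgsz=640,
    batch=8,
    epochs=2000,
    patience=100,          # early stopping
    optimizer="SGD",
    lr0=0.001,             # initial learning rate
    momentum=0.937,
    cls=0.5,               # classification loss weight
    box=7.5,               # bounding box loss weight
    dfl=1.5,               # distribution focal loss weight
    pretrained=False,
)
```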
YOLOv8 offers five model scales: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. The depth and width of the model increase progressively across these scales, with larger models providing higher accuracy but requiring more computational resources. For this study, the YOLOv8n variant was selected due to its balance between efficiency and performance. This version has a depth and width of 0.33 and 0.25, respectively, with a maximum channel count of 1024.

3.3. Evaluation Criteria

The performance of the object detection model was evaluated using standard metrics, including IoU (Intersection over Union), Precision, Recall, F1 Score, AP, and mAP. These metrics are defined as follows:
IoU (Intersection over Union): This metric quantifies the overlap between the predicted bounding box (A) and the ground-truth box (B). It is calculated as the ratio of their intersection to their union:
$$IoU = \frac{|A \cap B|}{|A \cup B|}$$
Precision: Precision measures the proportion of true-positive samples (TP) among all samples predicted as positive:
$$Precision = \frac{TP}{TP + FP}$$
Recall: Recall indicates the proportion of true-positive samples correctly identified by the model out of all actual positive samples:
$$Recall = \frac{TP}{TP + FN}$$
F1 score: The F1 score is the harmonic mean of Precision and Recall, balancing the trade-off between the two metrics:
$$F1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$
AP (Average Precision): AP represents the area under the Precision–Recall (PR) curve at a specified IoU threshold:
$$AP = \int_0^1 P(R)\, dR$$
mAP (Mean Average Precision): mAP is the average of AP across all categories. “mAP@0.5” indicates the mAP at an IoU threshold of 0.5, whereas “mAP@[0.5:0.95]” refers to the mean AP over IoU thresholds from 0.5 to 0.95:
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP(i)$$
In these formulas, TP (true positive) denotes correct detection results, FP (false positive) refers to incorrect detection outcomes, and FN (false negative) indicates missed detections. N represents the total number of detection task categories. These metrics comprehensively evaluate the model’s performance in both localization and classification, providing a clear assessment of its effectiveness in detecting infrared small targets.
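For concreteness, the IoU and AP definitions above can be computed as follows; this is a generic all-point-interpolation AP, not the exact evaluation code used in the experiments.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(recall, precision):
    """Area under the PR curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(round(iou([10, 10, 50, 50], [30, 30, 70, 70]), 3))  # 0.143
```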

3.4. Experimental Results and Analysis

3.4.1. Experimental Results

Due to its effective balance between accuracy and speed, YOLOv8n has become a popular choice for various object detection tasks. Building on this foundation, this study adopted YOLOv8n as the base network to explore potential improvements in infrared small-target detection. The performance of the enhanced model, IRST-YOLO, was evaluated using the SIRST-5K and IRSTD-1K datasets. Figure 8 shows the PR curves of our proposed model and the baseline YOLOv8 model on the SIRST-5K and IRSTD-1K datasets. These curves show that IRST-YOLO consistently outperforms YOLOv8, especially in the high-precision region, indicating that IRST-YOLO achieves a lower false-positive rate in infrared small-target detection.
As shown in Table 2 and Table 3, the proposed model achieves superior performance compared to the YOLOv8n baseline and other state-of-the-art algorithms.

3.4.2. Analysis

The results in Table 2 and Table 3 demonstrate that the proposed IRST-YOLO model significantly outperforms the YOLOv8n baseline and other comparison models. Key observations include:
SIRST-5K Dataset: mAP@[0.5:0.95] increased by 12.3%, and mAP@0.5 improved by 7.6% compared to YOLOv8n. Consistent improvements were observed in Precision, Recall, and F1 Score.
IRSTD-1K Dataset: mAP@[0.5:0.95] increased by 16.4%, and mAP@0.5 improved by 6.3% compared to YOLOv8n. Notable gains in classification metrics further validate the robustness of the proposed model.
In addition to the performance improvements, we also assessed the model’s efficiency in terms of parameters and FPS (frames per second), both of which are crucial for real-time applications. Table 4 provides a comparative analysis between YOLOv8n and IRST-YOLO. While IRST-YOLO incorporates several architectural innovations, such as the WFDC, SPPF-IA, and deformable convolution modules, it still maintains a competitive parameter count relative to YOLOv8n, ensuring its feasibility for real-time use. Despite the added complexity from these advanced modules, IRST-YOLO achieves a high FPS, demonstrating its capability to perform real-time detection without sacrificing accuracy. In contrast, although YOLOv8n delivers solid performance, IRST-YOLO strikes a superior balance between high detection accuracy and efficiency, making it better suited for scenarios that demand fast inference.
The superior performance can be attributed to the following key innovations:
The WFDC module ensures the retention of fine-grained details, preserving critical small-target features during feature extraction.
The SPPF-IA module enhances channel information across multiple scales, improving semantic richness and feature fusion.
Deformable convolution improves the backbone network’s adaptability to varying scales, rotations, and deformations, ensuring a better perception of small-target features.

3.4.3. Comparison with Other Methods

In addition to outperforming YOLOv8n, IRST-YOLO demonstrates strong performance relative to other advanced models:
IDD-YOLO [32]: Incorporates an attention mechanism (LCSA) into the GhostNet backbone, with GSConv and C3Ghost modules in the neck, achieving improved performance while maintaining a lightweight structure. While IDD-YOLO’s lightweight design is beneficial for resource-constrained environments, it struggles with small-target detection in infrared images, especially in complex backgrounds. IRST-YOLO surpasses IDD-YOLO by introducing the Dual-Path Fusion Downsampling Convolution, which ensures that fine-grained details are preserved during the downsampling process—critical for small-target detection. Furthermore, IRST-YOLO’s ability to handle small targets in cluttered environments is enhanced by the deformable convolution module, which allows for adaptive feature extraction across varying scales, shapes, and deformations—something that IDD-YOLO does not fully address.
ASF-YOLO [33]: Combines spatial and scale features in the feature-fusion network, integrating attention mechanisms for enhanced detection and segmentation, particularly for medical imaging tasks. While this approach improves feature selection, IRST-YOLO outperforms ASF-YOLO by providing a specialized SPPF-IA module (Involution-based Spatial Pyramid Pooling with Attention), which enhances feature fusion across multiple scales while preserving channel information. This enables IRST-YOLO to detect and localize small targets more effectively, even under challenging infrared imaging conditions. ASF-YOLO’s design focuses more on segmentation, but its ability to handle small-target detection in dynamic environments is limited compared to the robust detection capabilities of IRST-YOLO.
YOLO-ANT [34]: Utilizes large kernel convolution for dynamic kernel selection, improving multi-scale detection and the extraction of small-target features. Although this approach helps capture diverse target scales, IRST-YOLO delivers superior results in small-target detection by incorporating deformable convolutions in the backbone. This feature allows for more flexible and precise adaptation to varying target shapes and scales, improving detection accuracy in real-world infrared environments where target characteristics can change rapidly. Furthermore, IRST-YOLO’s Dual-Path Fusion Downsampling Convolution ensures that fine-grained details crucial for small-target detection are retained, which is a challenge for large kernel-based approaches like YOLO-ANT.
HIC-YOLOv5 [35]: Enhances the YOLOv5 framework with an additional detection head and self-convolution layers, combined with CBAM attention, for better small-target detection. However, IRST-YOLO improves on this by introducing deformable convolution to the backbone, which provides a more adaptable feature-extraction process that can accommodate varying target scales, rotations, and deformations. Moreover, IRST-YOLO addresses the loss of fine-grained details during downsampling—a common issue in many YOLO-based models—through its novel Dual-Path Fusion Downsampling Convolution. This improvement ensures that shallow, detailed information is retained, which is essential for accurate small-target localization in infrared images.
Despite the advancements in these models, IRST-YOLO achieves superior performance across key metrics, particularly in mAP@0.5 and mAP@[0.5:0.95], which emphasize accurate target localization and classification. This indicates that IRST-YOLO not only excels in classifying targets but also provides precise localization, which is crucial for infrared small-target detection tasks.
The experimental results confirm that IRST-YOLO achieves a notable balance between fine-grained detail retention and robust target identification. The proposed improvements allow the model to adapt effectively to complex backgrounds and small-target scenarios, providing a significant advantage over existing YOLO-based algorithms. Figure 9 illustrates representative detection results on real infrared images, highlighting the enhanced accuracy and robustness of the proposed method.

3.5. Ablation Study

3.5.1. Impact of Key Modules

To evaluate the contributions of the three proposed modules—WFDC, SPPF-IA, and D-C2f—we conducted a series of ablation experiments. The results, summarized in Table 5, demonstrate the effectiveness of each module in enhancing the model’s performance.
The WFDC module delivers the most substantial performance improvement: on the validation set, mAP@0.5 improved by 7.1%, and mAP@[0.5:0.95] increased by 10.7%. On the test set, mAP@0.5 improved by 6.5%, and mAP@[0.5:0.95] increased by 10.6%. These results highlight the module’s ability to retain fine-grained details during downsampling, significantly enhancing the detection of small infrared targets.
The SPPF-IA module also provides notable gains: on the validation set, mAP@0.5 increased by 5.7%, and mAP@[0.5:0.95] improved by 9.6%. On the test set, mAP@0.5 increased by 6.1%, and mAP@[0.5:0.95] improved by 9.7%. These improvements are attributed to the module’s ability to enhance multi-scale feature representation and semantic richness while focusing on critical features via attention mechanisms.
The D-C2f module demonstrates a strong contribution to mAP@0.5 but a relatively smaller impact on mAP@[0.5:0.95]. This suggests deformable convolutions effectively adapt to target shape transformations and improve fine-grained feature perception. However, their contribution to the extraction of high-level semantic features is limited, resulting in a less pronounced improvement at higher IoU thresholds.
When the three modules are integrated, the validation set shows an additional 1.0% increase in mAP@[0.5:0.95], while the test set shows improvements of 0.2% in mAP@0.5 and 1.5% in mAP@[0.5:0.95]. These results confirm that the combination of WFDC, SPPF-IA, and D-C2f modules delivers a synergistic improvement in the model’s overall performance.

3.5.2. Comparative Experiments on Different Downsampling Modules

To further evaluate the performance of the proposed WFDC module, we conducted comparative experiments against the SPD-Conv module and the baseline stride-2 convolution layer (s2-Conv). Using YOLOv8 and HIC-YOLOv5 as benchmark models, we trained and tested these models on the SIRST-5K dataset with different downsampling modules. The results, summarized in Table 6, show that:
For YOLOv8, incorporating SPD-Conv improved mAP@[0.5:0.95] by 9.9%, while WFDC delivered a slightly higher improvement of 10.6%.
For HIC-YOLOv5, SPD-Conv increased mAP@[0.5:0.95] by 1.6%, and WFDC further enhanced it by 2.3%.
These findings confirm the superior performance of the WFDC module in improving small-infrared-target detection accuracy.
While SPD-Conv excels at preserving fine-grained details and improving small-target detection accuracy, it has notable limitations. As shown in Figure 10c,d, the feature maps generated by the model using SPD-Conv retain more detailed target information than those produced by the s2-Conv layer. The heatmaps reveal that SPD-Conv captures points of interest related to small targets more effectively. However, the discontinuous receptive fields in SPD-Conv disrupt the semantic feature information extracted by the backbone network. This reduces the model’s sensitivity to target boundaries and limits its ability to distinguish between noise and actual targets.
For example, Figure 10a shows an original image with its corresponding target mask in Figure 10b. In Figure 10c, the model using s2-Conv perceives multiple objects but suffers from significant noise interference. In Figure 10d, the model using SPD-Conv better identifies small-target points of interest but struggles with local semantic information, making it less effective at differentiating targets from noise.
The WFDC module addresses the limitations of SPD-Conv by combining fine-grained and semantic feature extraction. Figure 10e demonstrates that the WFDC module retains detailed small-target information while effectively capturing semantic features, ensuring precise target localization. Compared to Figure 10d, the WFDC-enhanced model eliminates most noise interference while accurately extracting small targets. This highlights its robustness in managing complex backgrounds and improving target boundary identification.
The WFDC module’s dual-path design effectively resolves the issue of discontinuous receptive fields in SPD-Conv. One path focuses on retaining detailed spatial information critical for small-target detection. The other path emphasizes capturing contextual and semantic information to improve target differentiation.
This synergy enables the WFDC module to sustain high detection accuracy while boosting the model’s overall robustness, particularly in scenarios with challenging backgrounds and ambiguous target boundaries.
The comparative experiments demonstrate the clear advantages of the WFDC module over SPD-Conv and s2-Conv. The WFDC module achieves superior performance in both mAP@[0.5:0.95] and robustness, making it highly effective for small-infrared-target detection tasks. Its ability to integrate fine-grained and semantic features ensures precise target localization, even in complex environments. These findings validate the importance of the WFDC module in enhancing the detection capabilities of infrared small-target detection models, as shown by the results across different benchmarks and heatmap analyses.

3.5.3. Comparative Experiments on Different Attention Mechanisms in the SPPF-IA Module

In the final model, the Coordinate Attention (CA) mechanism was implemented within the SPPF-IA module. To evaluate the superiority of the CA mechanism over alternative attention mechanisms, such as Squeeze-and-Excitation (SE) and Convolutional Block Attention Module (CBAM), comparative experiments were conducted. The CA mechanism in the SPPF-IA module was replaced with SE and CBAM mechanisms, respectively, while keeping all other experimental conditions consistent. These models were trained and tested on the SIRST-5K dataset, and the results are summarized in Table 7.
The findings reveal that the model incorporating the CA mechanism achieved the highest performance in both mAP@0.5 and mAP@[0.5:0.95], demonstrating its significant advantage over SE and CBAM.
The SE (Squeeze-and-Excitation) mechanism recalibrates the importance of each channel by learning channel-specific weights. While this enhances channel-wise feature importance, it does not consider spatial information within feature maps. Consequently, the SE mechanism’s inability to exploit spatial dimensional details limits its effectiveness in infrared small-target detection, particularly in scenarios requiring precise spatial localization.
The CBAM (Convolutional Block Attention Module) introduces a spatial attention mechanism to complement channel-wise attention. This improves the model’s ability to capture positional information, enabling better spatial feature extraction. However, CBAM primarily focuses on local features and struggles to model long-range dependencies, reducing its overall performance in complex detection tasks.
SE focuses solely on channel attention, which helps recalibrate feature maps but does not consider spatial dependencies, making it less effective for small-target detection where spatial context is crucial. CBAM combines both channel and spatial attention, but its spatial attention primarily focuses on local regions and may struggle to capture global spatial dependencies, which are important for detecting small targets that can appear at various locations within an image. In contrast, the CA mechanism simultaneously captures long-range spatial dependencies and channel-wise attention, making it particularly suitable for small-target detection in infrared images, where targets may vary in position and scale. The experimental results demonstrate that the CA mechanism outperforms both SE and CBAM in all metrics, particularly in mAP@[0.5:0.95], where it achieved an improvement of 1.5% over SE and 0.9% over CBAM. This highlights the CA mechanism’s superior ability to balance spatial and channel attention, which is critical for infrared small-target detection.
The results confirm that the SPPF-IA module with the CA mechanism delivers superior detection performance compared to SE and CBAM mechanisms. By simultaneously capturing long-range spatial positional information and channel dependencies, the CA mechanism enhances the model’s ability to detect small infrared targets with greater precision and robustness. This makes the CA-enhanced SPPF-IA module a critical component in achieving state-of-the-art performance in infrared small-target detection tasks.

3.6. Failure Case Analysis

Although IRST-YOLO demonstrates strong overall performance, there are certain cases where the model’s detection accuracy could be improved. To better understand its limitations, we present a detailed analysis of two failure cases, as shown in the figures below.
Figure 11a: This image illustrates a case where the model failed to detect a target due to low signal-to-noise ratio (SNR) and low contrast between the target and the background. Despite the target being present in the infrared image, the poor visibility, compounded by environmental noise, led to a missed detection. This issue is common in infrared imaging, particularly when the target’s thermal signature is weak or obscured by complex backgrounds. Enhancing the model’s robustness in such conditions remains an important direction for future research.
Figure 11b: In this example, the model erroneously detected a portion of a large object as a small target. Specifically, a false detection occurred where a small section of the object, which was not part of the actual target, was mistakenly identified as the target. This highlights a potential weakness in the model’s ability to differentiate between foreground and background features, especially in scenarios where objects have irregular shapes or ambiguous edges.
These failure cases underscore the challenges in infrared small-target detection, particularly in noisy or ambiguous conditions. While IRST-YOLO excels in many scenarios, future improvements may focus on better handling of low-contrast environments and reducing false detections, particularly in complex or crowded scenes.

3.7. Evaluation of IRST-YOLO in Practical Infrared Small-Target Detection Scenarios

To further assess the proposed IRST-YOLO model’s practical performance, we tested it on a small set of self-collected infrared images. These images were captured under diverse real-world conditions characterized by complex backgrounds, low contrast, and varying target sizes.
Figure 12 presents the detection results, including several examples that highlight the model’s ability to accurately identify small infrared targets, even in cluttered, low-contrast, or otherwise challenging scenes.
As shown in Figure 12a–c, the IRST-YOLO model demonstrates remarkable performance in detecting small infrared targets across different scenarios. The detection boxes indicate that the model is capable of identifying targets with high confidence scores, even under adverse conditions such as poor visibility or significant background noise.
Figure 12a: The model successfully detects multiple targets in a cluttered environment, showcasing its robustness in distinguishing small targets from complex and noisy backgrounds.
Figure 12b: This example highlights the model’s ability to detect targets at varying distances, illustrating its adaptability to changes in target size and scale.
Figure 12c: In a scenario with minimal contrast between the target and background, the model still achieves accurate detections, underscoring its enhanced sensitivity to fine-grained features and subtle variations in infrared intensity.
These qualitative results validate the model’s effectiveness in real-world applications, particularly for small-infrared-target detection tasks in complex environments. The robust performance of IRST-YOLO can be attributed to the contributions of the WFDC and SPPF-IA modules, which enhance feature representation and preserve critical spatial details. This further confirms the model’s suitability for practical deployments in scenarios such as surveillance, reconnaissance, and public safety monitoring.

4. Conclusions

This study introduces an enhanced infrared small-target detection model, IRST-YOLO, which builds upon the YOLOv8 framework and incorporates several key innovations. The experimental results demonstrate that IRST-YOLO achieves significant improvements over the baseline YOLOv8 model. On the SIRST-5K dataset, IRST-YOLO achieved a 12.3% increase in mAP@[0.5:0.95] and a 7.6% improvement in mAP@0.5. On the IRSTD-1K dataset, IRST-YOLO outperformed YOLOv8 with a 16.4% increase in mAP@[0.5:0.95] and a 6.3% improvement in mAP@0.5.
These results underscore the effectiveness of the proposed enhancements, particularly the WFDC module and the SPPF-IA module, in improving feature extraction and fusion. By retaining fine-grained details and enabling adaptive feature learning, the redesigned downsampling module significantly enhances detection accuracy, showcasing the potential of convolutional neural networks for infrared small-target detection tasks.
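As an illustration of why the redesigned downsampling retains fine-grained detail, here is a minimal sketch of the SPD-Conv-style space-to-depth step that WFDC draws on [25]; the class name and the width of the fusing convolution are assumptions for illustration, not the exact WFDC definition.

```python
import torch
import torch.nn as nn

class SpaceToDepthDown(nn.Module):
    """Downsample by rearranging a 2x2 pixel grid into channels and fusing with
    a non-strided convolution, so no pixels are discarded (unlike stride-2 conv)."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(4 * in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Four interleaved sub-maps, each of shape (n, c, h/2, w/2)
        parts = [x[..., i::2, j::2] for i in range(2) for j in range(2)]
        return self.fuse(torch.cat(parts, dim=1))  # (n, 4c, h/2, w/2) -> (n, out, h/2, w/2)
```

Because every input pixel survives into the channel dimension, weak small-target responses are carried forward to deeper layers instead of being averaged or skipped away.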
The advancements achieved by IRST-YOLO have significant implications for infrared small-target detection, particularly in applications such as public safety monitoring. By improving detection capabilities, the proposed model provides more reliable support for real-time monitoring systems and enhances target recognition in complex environments.
While the proposed model delivers substantial improvements, it has certain limitations. First, it was validated on datasets focusing primarily on small hollow targets in infrared imagery; future research will extend this validation to more diverse scenarios to assess the model’s robustness. Second, the diversity of the training data is limited; integrating generative methods such as Generative Adversarial Networks (GANs) could enrich it and further strengthen the model in complex environments. Another promising direction is optimizing the model structure to reduce computational cost and improve real-time performance, particularly for resource-constrained applications.
In summary, IRST-YOLO extends the YOLOv8 architecture with three key improvements designed to address the limitations of existing object-detection algorithms in infrared small-target detection. First, the WFDC module, a novel downsampling module that builds on the strengths of SPD-Conv to preserve fine-grained features within the backbone network. Second, the SPPF-IA module, an Involution-based spatial pyramid pooling with attention that enriches feature maps with additional semantic information. Third, deformable convolution integrated into the backbone network, enabling the model to adapt to targets of various shapes and sizes and improving both detection and classification accuracy.
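For the third improvement, a generic deformable block built on torchvision’s DeformConv2d looks roughly like the sketch below; this is a simplified stand-in for intuition, not the exact deformable C2f block of IRST-YOLO.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """A 3x3 deformable convolution whose sampling offsets are predicted
    from the input, letting the kernel follow irregular target shapes."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # Two offset values (dx, dy) per kernel sampling location
        self.offset = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)  # start out behaving like a regular convolution
        nn.init.zeros_(self.offset.bias)
        self.dcn = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dcn(x, self.offset(x))
```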
Finally, the experimental results on the SIRST-5K and IRSTD-1K datasets validate the effectiveness of IRST-YOLO, demonstrating consistent superiority over state-of-the-art models in terms of detection accuracy and robustness. These findings highlight IRST-YOLO as a powerful and reliable solution for infrared small-target detection, offering substantial benefits for real-world applications.

Author Contributions

H.W., K.M., J.Y., Y.L., J.H., J.L., L.L., X.W., N.C. and S.G. contributed to this study. Conceptualization, H.W.; methodology, H.W.; software, H.W.; validation, H.W., J.H. and Y.L.; data curation, J.L., L.L. and X.W.; writing—original draft preparation, H.W.; writing—review and editing, H.W., K.M., J.Y., N.C. and S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly funded by the Youth Innovation Promotion Association CAS (2014216), the Opening Project of Shanghai Key Laboratory of Crime Scene Evidence (2024XCWZK12), and the National Pre-research Program during the 14th Five-Year Plan (514010405).

Data Availability Statement

The original contributions presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Eysa, R.; Hamdulla, A. Issues on Infrared Dim Small Target Detection and Tracking. In Proceedings of the 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA), Zhangjiajie, China, 10–11 August 2019; pp. 452–456. [Google Scholar]
  2. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-Frame Infrared Small-Target Detection: A Survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119. [Google Scholar] [CrossRef]
  3. Hu, Y.; Wang, K.; Chen, L.; Li, N.; Lei, Y. Visualization of invisible near-infrared light. Innov. Mater. 2024, 2, 100067. [Google Scholar] [CrossRef]
  4. DATA, M. Multimodal artificial intelligence foundation models: Unleashing the power of remote sensing big data in earth observation. Innovation 2024, 2, 100055. [Google Scholar]
  5. Shi, K.; Ma, J.; Chen, Z.; Cui, Y.; Yu, B. Nighttime light remote sensing in characterizing urban spatial structure. Innov. Geosci. 2023, 1, 100043. [Google Scholar] [CrossRef]
  6. Lv, P.Y.; Sun, S.L.; Lin, C.Q.; Liu, G.R. Space moving target detection and tracking method in complex background. Infrared Phys. Technol. 2018, 91, 107–118. [Google Scholar] [CrossRef]
  7. Gao, X.; Zhang, Y.; Zhang, L.; Jiang, Y.; Xi, Y.; Tan, F.; Hou, Q. Infrared small target detection algorithm based on filter kernel combination optimization learning method. Infrared Phys. Technol. 2024, 139, 105346. [Google Scholar] [CrossRef]
  8. Gao, C.; Meng, D.; Yang, Y.; Wang, Y.; Zhou, X.; Hauptmann, A.G. Infrared Patch-Image Model for Small Target Detection in a Single Image. IEEE Trans. Image Process. 2013, 22, 4996–5009. [Google Scholar] [CrossRef]
  9. Wang, X.; Peng, Z.; Kong, D.; He, Y. Infrared Dim and Small Target Detection Based on Stable Multisubspace Learning in Heterogeneous Scene. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5481–5493. [Google Scholar] [CrossRef]
  10. Kim, S.; Lee, J. Scale invariant small target detection by optimizing signal-to-clutter ratio in heterogeneous background for infrared search and track. Pattern Recognit. 2012, 45, 393–406. [Google Scholar] [CrossRef]
  11. Chen, C.L.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A Local Contrast Method for Small Infrared Target Detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581. [Google Scholar] [CrossRef]
  12. Han, J.; Ma, Y.; Zhou, B.; Fan, F.; Liang, K.; Fang, Y. A Robust Infrared Small Target Detection Algorithm Based on Human Visual System. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2168–2172. [Google Scholar]
  13. He, S.; Pan, S.; An, B. Infrared small target detection based on variance difference weighted three-layer local contrast measure. Infrared Phys. Technol. 2024, 139, 105315. [Google Scholar] [CrossRef]
  14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  16. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  17. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 950–959. [Google Scholar]
  18. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef]
  19. Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Los Alamitos, CA, USA, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
  20. Chen, F.; Gao, C.; Liu, F.; Zhao, Y.; Zhou, Y.; Meng, D.; Zuo, W. Local patch network with global attention for infrared small target detection. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 3979–3991. [Google Scholar] [CrossRef]
  21. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1958–1974. [Google Scholar] [CrossRef]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  23. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  24. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  25. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2022; pp. 443–459. [Google Scholar]
  26. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
  27. Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–26 June 2021; pp. 12321–12330. [Google Scholar]
  28. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–26 June 2021; pp. 13713–13722. [Google Scholar]
  29. Xiong, Y.; Li, Z.; Chen, Y.; Wang, F.; Zhu, X.; Luo, J.; Wang, W.; Lu, T.; Li, H.; Qiao, Y.; et al. Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5652–5661. [Google Scholar]
  30. Lu, Y.; Lin, Y.; Wu, H.; Xian, X.; Shi, Y.; Lin, L. SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised Learning for Robust Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5002911. [Google Scholar] [CrossRef]
  31. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 877–886. [Google Scholar]
  32. Lu, Y.; Li, D.; Li, D.; Li, X.; Gao, Q.; Yu, X. A Lightweight Insulator Defect Detection Model Based on Drone Images. Drones 2024, 8, 431. [Google Scholar] [CrossRef]
  33. Kang, M.; Ting, C.M.; Ting, F.F.; Phan, C.W. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
  34. Tang, X.; Chen, X.; Cheng, J.; Wu, J.; Fan, R.; Zhang, C.; Zhou, Z. YOLO-Ant: A Lightweight Detector via Depthwise Separable Convolutional and Large Kernel Design for Antenna Interference Source Detection. IEEE Trans. Instrum. Meas. 2024, 73, 5016018. [Google Scholar] [CrossRef]
  35. Tang, S.; Zhang, S.; Fang, Y. HIC-YOLOv5: Improved YOLOv5 for small object detection. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 6614–6619. [Google Scholar]
Figure 1. The basic framework of the YOLOv8 model.
Figure 2. The main structure of IRST-YOLO.
Figure 3. Illustration of SPD-Conv when step = 2 (see text for more details). (a) The input feature map. (b) Division of the input into smaller sub-feature maps. (c) Channel concatenation. (d) The resulting feature map, downsampled by a factor of 2. (e) The final output feature map.
Figure 4. Details of the Dual-Path Fusion Downsampling Convolution (WFDC) module.
Figure 5. Details of the Involution SPPF with Attention (SPPF-IA) module.
Figure 6. Details of the Involution operation proposed by Li et al. [27].
Figure 7. Details of the deformable C2f module.
Figure 8. Comparison of the PR curves of IRST-YOLO and YOLOv8.
Figure 9. Comparison of detection performance between IRST-YOLO, the baseline model, and several of the comparison models.
Figure 10. (a,b) An example of an original image and the corresponding target mask. (c–e) Comparison of the visualized P5 feature maps produced by three different downsampling modules.
Figure 11. (a,b) Two failure cases of IRST-YOLO.
Figure 12. (a–c) Detection results of the IRST-YOLO model on self-collected infrared images with complex backgrounds.
Table 1. Hyper-parameter settings.

| Name | Value |
| --- | --- |
| lr0 | 0.001 |
| lrf | 0.005 |
| momentum | 0.937 |
| box | 7.5 |
| cls | 0.5 |
| dfl | 1.5 |
| translate | 0.1 |
| mosaic | 1.0 |
| close_mosaic | 400 |
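For reference, these settings map directly onto the Ultralytics training interface; the sketch below is illustrative, with the dataset YAML and epoch count as placeholders rather than the authors’ exact training script.

```python
from ultralytics import YOLO

# Hypothetical training call mirroring Table 1 (dataset path and epochs are placeholders)
model = YOLO("yolov8n.yaml")
model.train(
    data="sirst5k.yaml",        # placeholder dataset config
    epochs=500,                 # not listed in Table 1; assumed larger than close_mosaic
    lr0=0.001, lrf=0.005, momentum=0.937,
    box=7.5, cls=0.5, dfl=1.5,  # loss gains
    translate=0.1, mosaic=1.0,
    close_mosaic=400,           # disable mosaic for the final 400 epochs
)
```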
Table 2. Comparison of algorithms on the SIRST-5K dataset.

| Model | Precision | Recall | F1 | mAP@0.5 | mAP@[0.5:0.95] |
| --- | --- | --- | --- | --- | --- |
| YOLOv8 | 0.936 | 0.762 | 0.840 | 86.4 | 73.9 |
| IDD-YOLO | 0.947 | 0.922 | 0.934 | 93.8 | 74.3 |
| ASF-YOLO | 0.925 | 0.779 | 0.845 | 88.3 | 76.7 |
| YOLO-ANT | 0.943 | 0.916 | 0.929 | 93.3 | 79.0 |
| HIC-YOLOv5 | 0.936 | 0.891 | 0.912 | 93.0 | 82.1 |
| RT-DETR | 0.889 | 0.905 | 0.896 | 92.9 | 68.0 |
| IRST-YOLOv8 ¹ | 0.947 | 0.900 | 0.922 | 94.0 | 86.2 |

¹ Our method.
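As a quick consistency check, the F1 column is the harmonic mean of precision and recall; for the IRST-YOLOv8 row:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.947 \times 0.900}{0.947 + 0.900}
\approx 0.922
```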
Table 3. Comparison of algorithms on the IRSTD-1K dataset.

| Model | Precision | Recall | F1 | mAP@0.5 | mAP@[0.5:0.95] |
| --- | --- | --- | --- | --- | --- |
| YOLOv8 | 0.890 | 0.801 | 0.843 | 88.1 | 53.6 |
| IDD-YOLO | 0.923 | 0.897 | 0.909 | 92.9 | 59.3 |
| ASF-YOLO | 0.904 | 0.886 | 0.894 | 90.9 | 54.6 |
| YOLO-ANT | 0.927 | 0.895 | 0.910 | 93.5 | 60.4 |
| HIC-YOLOv5 | 0.935 | 0.913 | 0.923 | 93.4 | 65.6 |
| RT-DETR | 0.892 | 0.853 | 0.872 | 87.0 | 45.4 |
| IRST-YOLOv8 ¹ | 0.952 | 0.933 | 0.942 | 94.4 | 70.0 |

¹ Our method.
Table 4. Model parameters and FPS.

| Model | Parameters (M) | FPS |
| --- | --- | --- |
| YOLOv8 | 3.01 | 110.15 |
| IRST-YOLOv8 | 6.12 | 68.59 |
Table 5. Ablation study of different modules on SIRST-5K (√ indicates that the original module was replaced with the proposed module). Baseline: YOLOv8n.

| WFDC | SPPF-IA | D-C2f | Val mAP@0.5 | Val mAP@[0.5:0.95] | Test mAP@0.5 | Test mAP@[0.5:0.95] |
| --- | --- | --- | --- | --- | --- | --- |
| – | – | – | 86.3 | 74.3 | 86.4 | 73.9 |
| √ | – | – | 93.4 | 85.0 | 92.9 | 84.5 |
| – | √ | – | 92.0 | 83.9 | 92.5 | 83.6 |
| – | – | √ | 93.1 | 79.1 | 92.4 | 77.8 |
| √ | √ | – | 93.7 | 86.0 | 92.9 | 84.7 |
| √ | √ | √ | 93.9 | 87.0 | 94.0 | 86.2 |
Table 6. The effects of different downsampling modules on detection models (√ indicates the downsampling module used in each configuration).

| Baseline | S2-Conv | SPD-Conv | WFDC | mAP@[0.5:0.95] |
| --- | --- | --- | --- | --- |
| YOLOv8n | √ | – | – | 73.9 |
| YOLOv8n | – | √ | – | 83.8 |
| YOLOv8n | – | – | √ | 84.5 |
| HIC-YOLOv5 | √ | – | – | 82.1 |
| HIC-YOLOv5 | – | √ | – | 83.7 |
| HIC-YOLOv5 | – | – | √ | 84.4 |
Table 7. Experimental results comparing the performance of three attention mechanisms in the SPPF-IA module.

| Method | Precision | Recall | F1 | mAP@0.5 | mAP@[0.5:0.95] |
| --- | --- | --- | --- | --- | --- |
| SE | 0.943 | 0.906 | 0.924 | 93.0 | 84.7 |
| CBAM | 0.942 | 0.904 | 0.923 | 93.3 | 85.3 |
| CA | 0.947 | 0.900 | 0.922 | 94.0 | 86.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
