Article

Feature Multi-Scale Enhancement and Adaptive Dynamic Fusion Network for Infrared Small Target Detection

1 National Key Laboratory of Optical Field Manipulation Science and Technology, Chinese Academy of Sciences, Chengdu 610209, China
2 Key Laboratory of Optical Engineering, Chinese Academy of Sciences, Chengdu 610209, China
3 Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
4 University of Chinese Academy of Sciences, Beijing 101408, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(9), 1548; https://doi.org/10.3390/rs17091548
Submission received: 23 January 2025 / Revised: 20 April 2025 / Accepted: 22 April 2025 / Published: 26 April 2025
(This article belongs to the Special Issue Deep Learning Innovations in Remote Sensing)

Abstract

This study aims to address a series of challenges in infrared small target detection, particularly in complex backgrounds and high-noise environments. In response to these issues, we propose a deep learning model called the Feature Multi-Scale Enhancement and Adaptive Dynamic Fusion Network (FMADNet). This model is based on a U-Net architecture and incorporates a Residual Multi-Scale Feature Enhancement (RMFE) module and an Adaptive Feature Dynamic Fusion (AFDF) module. The RMFE module not only achieves efficient feature extraction but also adaptively adjusts feature responses across multiple scales, further enhancing the detection capabilities for small targets. Additionally, the AFDF module effectively integrates features from the encoder and decoder during the upsampling phase, enabling dynamic learning of upsampling and focusing on spatially important features, significantly improving detection accuracy. Evaluated on the NUDT-SIRST and IRSTD-1k datasets, our model achieves strong performance, demonstrating accurate and robust detection of infrared small targets across diverse complex environments.

1. Introduction

Infrared Small Target Detection (IRSTD) technology uses infrared imaging systems to identify discrepancies in thermal energy distribution across the target area and neighboring zones, facilitating precise target localization. This technology is immune to changes in environmental lighting and exhibits excellent anti-interference capabilities. It employs a passive measurement approach, ensuring better concealment, which makes it well suited for target detection in various complex environments [1]. IRSTD is crucial in multiple application scenarios, including border security, nighttime surveillance, and disaster warning systems [2,3,4,5,6,7,8]. These applications rely on high-precision target detection technologies to ensure the rapid and accurate identification of potential threats or critical events from infrared images. The precise, reliable, and robust detection of infrared small targets under complex environments with multiple interference sources constitutes a pivotal technological cornerstone for building modern infrared monitoring systems. The characteristics of IRSTD include the following: (1) Infrared imaging provides a broad detection range, though its resolution is relatively limited. In infrared images, the target is often minuscule, covering only a small portion of the image and typically lacking well-defined edges or other distinguishing characteristics. (2) Infrared small targets exhibit extremely low observable emission levels when captured by imaging systems with poor signal-to-noise ratios, particularly in non-stationary, complex, and fluctuating background interference. In such cases, the broad distribution of gray values and significant noise interference prevent the direct extraction of the target using information based on gray levels, size, or shape. (3) The detected target may change its state with variations in detection distance, angle, and movement, which adds complexity to the detection process. Due to these factors of interference, IRSTD remains a challenging problem [9,10,11,12].
IRSTD methods are primarily divided into single-frame and multi-frame detection of small targets. Traditional multi-frame IRSTD relies on analyzing time-series data, enhancing detection performance by combining the target’s motion consistency and the background change characteristics. The main detection methods include frame differencing [13,14], optical flow analysis [15], multi-frame accumulation [16,17], and spatiotemporal filtering [18]. Nonetheless, these approaches face challenges such as high computational demands and limited real-time efficiency. Furthermore, in various practical applications, the movement between the image sensor and the target can lead to reduced detection accuracy. Single-frame IRSTD, by contrast, relies solely on spatial information to separate small target signals from a single infrared image while suppressing background interference, offering higher real-time performance. However, it is more susceptible to interference from complex environments. This paper examines strategies for enhancing the performance of single-frame IRSTD in difficult environmental conditions.
The techniques for single-frame IRSTD mainly include two categories: model-driven methods and data-driven methods. Model-driven methods detect small targets by constructing a physical model to distinguish between targets and backgrounds. These include filtering methods, background estimation and suppression approaches, human visual system (HVS) approaches, and data structure-based strategies. Filtering methods enhance target intensity and achieve target–background separation by suppressing stray noise and background [19,20]. Background estimation and suppression methods assume that background variations are smooth and continuous, estimating the background and suppressing it to enhance the target signal [21,22]. HVS-based methods assume that targets have higher contrast within local regions and enhance the target by calculating contrast or differences in these regions [23,24]. Methods based on data structures employ a sparse representation framework to separate the infrared image into a target component with sparse characteristics and a background component with low rank, using convex optimization techniques to extract target information [25,26,27,28]. However, these approaches often depend on predefined assumptions, which makes them less adaptable to complex backgrounds and irregular target shapes.
Unlike model-based methods, data-driven approaches offer better adaptability and feature learning abilities, particularly demonstrating notable advantages in handling complex scenarios. These methods can learn deep feature representations through training, which helps better distinguish targets from backgrounds. Through end-to-end learning, data-driven methods reduce manual intervention, adapt to complex nonlinear relationships, and demonstrate stronger robustness and higher detection accuracy when facing dynamic targets and complex backgrounds. Propelled by dual thrusts of algorithmic innovation in deep learning and hyperscale computing resource deployment, data-driven methods have a promising future in IRSTD. The Asymmetric Context Modulation (ACM) [29] method effectively encodes small-scale visual details into deeper layers of the network through context modulation, thereby significantly improving detection accuracy. The Attentional Local Contrast Network (ALCNet) [30] features a feature map cyclic shifting mechanism that enhances the detection ability for small targets. The Infrared Shape Network (ISNet) [31] has an edge module inspired by Taylor finite differences and a bidirectional attention aggregation module, which improves target edge information and reduces noise. The Runge–Kutta Transformer (RKformer) [32], inspired by the Runge–Kutta numerical method, combines the Runge–Kutta method with Transformer. Unlike traditional Transformers, RKformer iteratively updates features during the feature extraction process, progressively optimizing feature representations and improving detection accuracy. The Dense Nested Attention Network (DNA-Net) [33] improves detection accuracy and robustness through multi-layer feature fusion and attention mechanisms. “U-Net in U-Net” (UIU-Net) [34] embeds a secondary U-Net module into the core network, enabling synergistic fusion of macro-contextual patterns and micro-structural details via cross-scale feature interaction. The Attention-Guided Pyramid Context Network (AGPCNet) [35] enhances IRSTD performance in complex backgrounds through multi-scale context information fusion and attention mechanisms. Attention with Bilinear Correlation (ABC) [36] introduced a U-shaped Convolution-Dilated Convolution (UCDC) module and a Convolutional Linear Fusion Transformer (CLTF) module, combining local and global feature extraction to capture target–background feature differences more effectively.
Despite the breakthroughs of advanced models in various fields, numerous challenges remain when applying deep learning to IRSTD. The tiny size of infrared small targets, along with the typical absence of distinct texture and structural features, means that iterative subsampling layers in neural networks can result in the degradation of critical discriminative features. Additionally, designing an effective strategy to integrate deep and shallow features within the network presents a significant challenge. Deep features are strong in capturing semantic information, but they often lack the necessary spatial details. In contrast, shallow features offer detailed spatial information, but this information may be too scattered and lack sufficient semantic depth.
Ref. [37] uses pooling operations of different scales to capture global contextual information of the image and fuses multi-scale contextual information. Ref. [38] enhances the network’s contextual awareness by performing cross-scale and cross-layer feature fusion. Ref. [39] utilizes multiple dilated convolutions of different scales to capture multi-scale spatial context. Inspired by these studies, we propose a Feature Multi-Scale Enhancement and Adaptive Dynamic Fusion Network (FMADNet) for IRSTD. This network primarily consists of a Residual Multi-Scale Feature Enhancement (RMFE) module and an Adaptive Feature Dynamic Fusion (AFDF) module. The RMFE module effectively processes complex and abstract data representations, while also handling multi-scale information. The multi-scale capability enhances the model’s flexibility in detecting targets of various sizes, ensuring that the detection process is not limited to any particular scale. By operating at different scales, the RMFE module ensures that important target features are not obscured by irrelevant background information. Furthermore, the distinctive design of the RMFE module significantly reduces background noise and enhances the contrast between the target and its surroundings, making it more effective at emphasizing the critical features of infrared small targets in challenging environments. This module’s design can effectively alleviate the issue where iterative subsampling layers in neural networks often cause the loss or degradation of critical discriminative features in complex environments. The AFDF module is specifically designed to tackle the challenge of integrating both deep and shallow features, which is particularly important in small target detection. The AFDF module enables dynamic upsampling learning and efficiently merges features from different scales. This network architecture not only gathers detailed semantic information at different scales but also carefully extracts spatial nuances, thereby boosting the model’s overall ability to interpret and process target features. It effectively avoids the noise introduced by shallow features while providing spatial information. By effectively combining these two modules, FMADNet demonstrates strong robustness and efficiency in handling complex backgrounds, small targets, and uneven lighting conditions.
To conclude, the key contributions of this work are outlined as follows:
  • We propose FMADNet, a deep learning network that effectively distinguishes between targets and background. The network enhances the feature representation of infrared small targets and implements weighted fusion across multiple scales, significantly improving detection efficiency and stability, especially in complex backgrounds where targets may be occluded or obscured.
  • The RMFE module analyzes complex and abstract data representations through a deep feature encoding architecture and processes multi-scale information.
  • We designed the AFDF module, which facilitates dynamic upsampling learning and adaptively assigns weights to different features, optimizing feature fusion and improving detection performance.

2. Related Work

2.1. Single-Frame IRSTD

Model-driven methods for IRSTD primarily include filtering methods [19,20], background estimation and suppression methods [21,22], HVS-based approaches [23,24], and data structure approaches [25,26,27,28]. These approaches rely on predefined physical models and prior knowledge for target detection, offering good interpretability and low data dependency. However, such methods often depend on specific prior assumptions, such as smooth background variations or high target contrast, which makes them less adaptable to target detection in complex scenarios. Furthermore, they are highly sensitive to noise, which may lead to missed detections and false positives.
Data-driven approaches for IRSTD, which automatically extract the characteristics of infrared small targets from extensive datasets, address the shortcomings of traditional model-based techniques in handling complex situations, while notably enhancing detection precision and the ability to generalize. UIU-Net [34] enhances both the performance and reliability of IRSTD by leveraging its nested U-Net architecture. The computational cost of UIU-Net is relatively high, which may limit its application in scenarios with limited computational resources. ISNet [31] features innovative edge information enhancement and noise suppression techniques applied to IRSTD. ISNet exhibits constrained performance in intricate scenarios and relies heavily on the dataset’s quality. The interior attention-aware network (IAANet) [40] adopts a coarse-to-fine detection architecture: it first extracts potential target regions using a region proposal network to remove most background noise, then uses a shallow semantic generator to extract fine-grained local semantic features. If the target’s intensity closely matches that of the background and blends into a complex scene, IAANet may struggle to detect it. AGPCNet [35] achieves a combination of local semantic association and global contextual attention through an attention-guided context block, enhancing the saliency of target features while suppressing noise. It employs a contextual pyramid module to capture multi-scale contextual data, improving the ability to detect targets of various sizes. During the decoding stage, an Asymmetric Fusion Module (AFM) effectively combines shallow detail information with deep semantic information, further optimizing feature representation. AGPCNet experiences increased runtime because of the computational complexity involved in pixel correlation and the fusion of multi-scale features. ABC [36] introduces a CLFT module that uses a bilinear correlation attention mechanism to strengthen target features and reduce background interference, while minimizing target loss due to deep downsampling. In the decoding stage, it adopts a UCDC module to further extract refined semantic features, improving target detection accuracy. The ABC model has a complex structure and high computational cost. Based on prior work, this paper presents FMADNet, a model that effectively isolates targets from their backgrounds, ensuring the accurate identification of infrared small targets.

2.2. Multi-Frame IRSTD

Multi-frame IRSTD approaches build upon single-frame detection by further leveraging the motion characteristics of targets in the temporal domain. These methods extend local contrast-based approaches and sparse low-rank decomposition techniques into three-dimensional space, integrating temporal information to adapt to the spatiotemporal domain. Ref. [41] proposes a novel spatiotemporal tensor model incorporating saliency filter regularization to address the challenges of IRSTD in low-altitude complex backgrounds. This method constructs the infrared image sequence as a comprehensive spatiotemporal tensor, fully preserving the spatiotemporal information of the original data. However, the method requires performing full singular value decomposition in each iteration, resulting in low computational efficiency. Ref. [42] introduces an enhanced multi-modal nuclear norm alongside a local weighted entropy contrast approach to simultaneously boost target detectability and reduce background interference in complex situations. This approach formulates the IRSTD task as an optimization problem of decomposing a three-component tensor in the spatiotemporal domain. Nevertheless, the model optimization is complex, and the solution incurs high computational cost. Ref. [43] puts forward an approach based on a Convolutional Neural Network (CNN). It integrates multi-frame information to address issues such as background clutter interference and weak target features in IRSTD. The designed CNN model is optimized for IRSTD and leverages its strong feature extraction capabilities to improve detection accuracy. However, the method has limitations in multi-frame information extraction. Ref. [15] introduces a method combining multi-scale optical flow estimation with temporal differencing. This approach captures small target trajectories by analyzing motion information at multiple scales and calculates pixel changes between frames using temporal differencing, achieving accurate motion estimation and emphasizing the motion characteristics of targets. However, the method has high computational cost, which affects its real-time performance. Ref. [44] introduces an innovative convolutional Long Short-Term Memory node that captures both intra-frame and inter-frame motion features, overcoming the challenges in detecting infrared small targets. Additionally, they introduce a new loss function specifically designed to optimize the model’s learning of small target motion characteristics. Nevertheless, the method contains multiple hyperparameters that require careful tuning. These advancements demonstrate that the gradual integration of temporal domain information with spatial feature extraction has significantly improved the performance of IRSTD in dynamic and complex environments.

2.3. Cross-Layer Feature Fusion

Cross-layer feature fusion combines features from different network layers, allowing the model to retain fine-grained details while also capturing more comprehensive semantic context. This methodology significantly augments the representational capacity of deep learning architectures, finding extensive utilization across various computer vision tasks. In U-Net [45], shallow features are transferred to symmetric decoder layers through skip connections. However, its network architecture may be limited in capturing global semantic information. Ref. [46] constructs a multi-level feature pyramid to fuse features at different resolutions. It is limited in capturing contextual information, and its fusion method is relatively simple. DenseNet [47] employs dense connections to achieve cross-layer feature fusion, enabling direct communication between layers and promoting feature reuse. As the network depth increases, DenseNet may result in higher computational cost and memory overhead. AGPCNet [35] introduces an AFM to retain more effective information by prioritizing significant features. AFM also suffers from information loss when facing complex backgrounds. In our work, we propose an Adaptive Feature Dynamic Fusion Module (AFDF), which consists of DySample [48] and an Adaptive Feature Fusion module. This module achieves lightweight dynamic learning for upsampling and performs adaptive feature weighting and fusion. This design enhances the integration of information across various layers, leading to a deeper understanding of the data. By seamlessly combining both abstract semantic features and finer details, it establishes a robust framework for enhancing recognition accuracy of small-scale targets in noisy environments, particularly in highly complex and difficult environments.

3. Materials and Methods

3.1. Overall Architecture

The structure of the proposed FMADNet is shown in Figure 1. Specifically, FMADNet is a network architecture based on U-Net, integrating the RMFE module and the AFDF module.
The RMFE module consists of CNNs, the Efficient Multi-Scale Attention (EMA) [49], and residual connections [50]. Referring to ResNet-18, we build the encoder backbone network using RMFE. As the network undergoes continuous downsampling, the characteristics of small infrared targets are at risk of being lost. The EMA employs cross-spatial learning combined with a multi-scale parallel sub-network structure, significantly improving the model’s ability to represent small infrared target features while maintaining the channel dimensions. Residual connections “skip” certain layers by passing the input features to subsequent layers of the network. This approach effectively mitigates the gradient vanishing or explosion problems in deep networks. The AFDF mainly comprises DySample and the Adaptive Feature Fusion module. This module implements lightweight dynamic learning upsampling and feature fusion. Inspired by the position attention network [51], it incorporates coordinate attention (CA) [52] to emphasize spatially significant features, thus greatly enhancing detection precision. By combining these methods, our model is designed to identify small infrared targets with both efficiency and accuracy, while maintaining robustness against varied and complex backgrounds.

3.2. Residual Multi-Scale Feature Enhancement Module

The RMFE primarily aims to improve the feature representation of weak and small targets in infrared images by combining CNNs, the EMA mechanism, and residual connections, as depicted in Figure 2.
Assume the input feature map is $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ represents the batch size, $C$ the number of channels, and $H$ and $W$ the height and width of the feature map, respectively. RMFE adopts a typical residual structure, comprising two 3 × 3 convolutional layers, batch normalization layers, ReLU activation functions, and an EMA mechanism. The first convolutional layer uses a 3 × 3 kernel to perform convolution transformations on the input feature map $X$, generating a new feature map with the same spatial dimensions (stride is adjustable). The convolution result is then processed with batch normalization and ReLU activation.
$$X_1 = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(X))).$$
The second convolution layer is similar to the first one, using a 3 × 3 convolution kernel again to extract deeper features:
$$X_2 = \mathrm{BN}(\mathrm{Conv}_{3 \times 3}(X_1)).$$
Residual connections are used to ensure efficient gradient flow. If the input channels and output channels are not consistent, the dimensions are adjusted using a 1 × 1 convolution.
$$X_{\mathrm{residual}} = \begin{cases} \mathrm{Conv}_{1 \times 1}(X), & \text{if } C_{\mathrm{in}} \neq C_{\mathrm{out}}, \\ X, & \text{otherwise}. \end{cases}$$
Following the convolution process, the feature map is fed into the EMA mechanism, as illustrated in Figure 3.
The attention mechanism assigns varying weights to the input features, emphasizing important information and reducing the influence of irrelevant details. This mechanism mimics the human attention selection process, where, when handling complex information, we naturally focus attention on the task-relevant parts and ignore unrelated sections. Compared to traditional attention mechanisms such as Squeeze-and-Excitation Networks [53], CA [52], and the Convolutional Block Attention Module [54], the EMA adopts a significantly different strategy in its design. Traditional methods simplify computations by reducing the channel dimension, which may improve efficiency in some scenarios. But they often struggle to preserve structural details of small targets. The EMA mechanism achieves full interaction and fusion of channel information through a multi-scale parallel subnet structure, without increasing the channel dimension compression. Meanwhile, EMA improves fine-grained features across multiple scales by leveraging multi-scale extraction and cross-spatial learning, especially in terms of target details and structural information. By focusing on features at different scales, it is better able to capture the subtle variations of small targets.
Specifically, EMA divides the channel dimension of the input feature map into T sub-groups. This guarantees that each group retains essential key information. The grouped representation is as follows:
$$X_2 = [X_{2,1}, X_{2,2}, \ldots, X_{2,T}], \quad X_{2,i} \in \mathbb{R}^{(B \times T) \times \frac{C}{T} \times H \times W}.$$
EMA employs a parallel structure consisting of two 1 × 1 convolution kernels and one 3 × 3 convolution kernel, each responsible for handling different levels of information. Each 1 × 1 convolution branch interacts across channels using 1D global average pooling along two spatial directions. On the other hand, the 3 × 3 convolution branch, with a larger receptive field, captures multi-scale spatial information, especially useful for handling complex spatial dependencies. The parallel structure of these two branches allows EMA to model both local and global dependencies, effectively enhancing the diversity and accuracy of feature representations while maintaining computational efficiency.
Then, through a 2D global average pooling operation, EMA extracts global spatial information along the spatial dimensions. It assists the model in grasping extensive contextual information within the image, thus better capturing long-range spatial dependencies. During the feature fusion process, EMA employs a matrix multiplication operation to integrate the output features from the 1 × 1 and 3 × 3 branches. Through this cross-spatial information fusion, EMA effectively captures the interactions between different spatial locations. The two spatial attention maps are merged and activated with the sigmoid function, yielding the final cross-dimensional spatial attention map $A_{\mathrm{final}}$. This weight map captures the pixel-level relative relationships and focuses on the global contextual information, generating enhanced output features:
$$X_{\mathrm{ema}} = X_2 \odot A_{\mathrm{final}},$$
where $\odot$ denotes element-wise multiplication.
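For concreteness, the following is a compact PyTorch sketch of the EMA stage just described: channel grouping, a 1 × 1 branch driven by directional average pooling, a parallel 3 × 3 branch, and cross-spatial fusion via matrix multiplication. The group count `factor` and the use of group normalization are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Sketch of Efficient Multi-Scale Attention: groups channels, runs a
    pooled 1x1 branch and a 3x3 branch in parallel, and fuses them through
    cross-spatial matrix multiplication."""
    def __init__(self, channels, factor=8):
        super().__init__()
        self.groups = factor
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))        # 2D global average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over the width axis
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over the height axis
        self.gn = nn.GroupNorm(channels // factor, channels // factor)
        self.conv1x1 = nn.Conv2d(channels // factor, channels // factor, 1)
        self.conv3x3 = nn.Conv2d(channels // factor, channels // factor, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.size()
        g = x.reshape(b * self.groups, -1, h, w)                      # split into T sub-groups
        x_h = self.pool_h(g)                                          # (B*T, C/T, H, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)                      # (B*T, C/T, W, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))               # joint 1x1 branch
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        b1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        b2 = self.conv3x3(g)                                          # larger-receptive-field branch
        # cross-spatial fusion: global descriptors of one branch weight the other
        a1 = self.softmax(self.agp(b1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        a2 = self.softmax(self.agp(b2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        m1 = b2.reshape(b * self.groups, c // self.groups, -1)
        m2 = b1.reshape(b * self.groups, c // self.groups, -1)
        attn = (torch.matmul(a1, m1) + torch.matmul(a2, m2)).reshape(b * self.groups, 1, h, w)
        return (g * attn.sigmoid()).reshape(b, c, h, w)               # A_final applied to X_2
```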
Finally, the enhanced features $X_{\mathrm{ema}}$ from the EMA module are added to the output of the residual connection $X_{\mathrm{residual}}$, followed by the ReLU activation function:
$$X_{\mathrm{output}} = \mathrm{ReLU}(X_{\mathrm{ema}} + X_{\mathrm{residual}}).$$
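Putting these pieces together, a minimal sketch of the RMFE block (two 3 × 3 conv–BN layers, the EMA stage, and the residual branch with an optional 1 × 1 projection) could look as follows. It reuses the EMA sketch above, and the stride handling for downsampling stages is an assumption.

```python
class RMFE(nn.Module):
    """Sketch of the Residual Multi-Scale Feature Enhancement block."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.ema = EMA(out_ch)                       # EMA sketch from above
        # 1x1 projection when the channel count (or resolution) changes
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                         if (in_ch != out_ch or stride != 1) else nn.Identity())

    def forward(self, x):
        x1 = self.relu(self.bn1(self.conv1(x)))      # first 3x3 conv + BN + ReLU
        x2 = self.bn2(self.conv2(x1))                # second 3x3 conv + BN
        x_ema = self.ema(x2)                         # multi-scale attention enhancement
        return self.relu(x_ema + self.shortcut(x))   # residual addition + ReLU
```

As a quick shape check, `RMFE(64, 128, stride=2)(torch.rand(2, 64, 64, 64))` yields a tensor of shape `(2, 128, 32, 32)`, i.e., the block can also serve as a downsampling stage of the encoder backbone.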

3.3. Adaptive Feature Dynamic Fusion Module

The AFDF module is primarily composed of DySample and the Adaptive Feature Fusion module, as shown in Figure 4.
This module achieves lightweight dynamic learning-based upsampling, as well as adaptive feature weighting and fusion. DySample is a method designed for dynamic upsampling, which performs spatial transformations of feature maps by generating a set of sampling points. This approach replaces traditional dynamic convolution with point sampling, optimizing performance and reducing computational costs through a simplified network architecture while maintaining or even improving the quality of upsampling. The core of the DySample method lies in its innovative sampling point generator and offset adjustment strategy. A detailed explanation of the AFDF module and this method is provided below.
The input feature map X is transformed into an offset L through a linear projection, calculated as follows:
$$L = \mathrm{linear}(X).$$
The offset L is then added to a predefined regular grid Y to generate the sampling points D:
$$D = Y + L.$$
To avoid the issues caused by offset overlap, DySample introduces a static scope factor. By multiplying a constant factor (e.g., 0.25) during the offset generation process, the distribution range of the sampling points can be controlled, preventing excessive overlap of the sampling points and thereby improving the quality of upsampling. The following formula represents this process:
$$L = 0.25 \cdot \mathrm{linear}(X).$$
After introducing the DySample method, the network is able to better recover image details, especially demonstrating significant advantages in detecting small infrared targets. The static scope factor optimizes the distribution of offsets, ensuring effective sampling points, which in turn enhances the performance of the network in IRSTD tasks. This approach enhances detection accuracy while minimizing the consumption of computational resources, offering significant computational efficiency compared to traditional upsampling methods.
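As a rough illustration of this idea, the sketch below implements a DySample-style ×2 upsampler: a linear (1 × 1) projection predicts per-position offsets, the static scope factor of 0.25 keeps them small, and the offsets are added to a regular grid before point-sampling with `grid_sample`. Offset grouping, initialization, and other details of the reference implementation are omitted, and treating the predicted offsets as input-pixel units is an assumption made here for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleLite(nn.Module):
    """Simplified sketch of dynamic upsampling by point sampling."""
    def __init__(self, channels, scale=2, range_factor=0.25):
        super().__init__()
        self.scale = scale
        self.range_factor = range_factor
        # predicts an (x, y) offset for each of the scale^2 sub-positions
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        # offsets kept small by the static scope factor
        offset = self.range_factor * self.offset(x)               # (B, 2*s*s, H, W)
        offset = F.pixel_shuffle(offset, s)                       # (B, 2, sH, sW)
        # base sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, s * h, device=x.device)
        xs = torch.linspace(-1, 1, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # convert pixel offsets to normalized coordinates and add to the grid
        norm = torch.tensor([2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)], device=x.device)
        grid = grid + offset.permute(0, 2, 3, 1) * norm
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```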
In the conventional U-Net model, the decoder focuses on upsampling low-resolution features to restore the finer details of the input image, whereas the encoder extracts abstract, high-level features that often hold more semantic information. Inspired by the position attention network [51], we designed an Adaptive Feature Fusion module and integrated the CA [52]. The original position attention network combines the two input features through addition, whereas we adopt a method of concatenation followed by convolution for dimensionality reduction. This approach preserves the independence of each feature, enabling a more thorough use of the input information while preventing the loss of details or data that could arise during the addition process. Specifically, the input features $X_1$ and $X_2$ are concatenated along the channel dimension to integrate their semantic information: $X_{\mathrm{concat}} = \mathrm{Concat}(X_1, X_2)$, where $X_1, X_2 \in \mathbb{R}^{N \times C \times H \times W}$ and $X_{\mathrm{concat}} \in \mathbb{R}^{N \times 2C \times H \times W}$. Then, a 1 × 1 convolution is applied to the concatenated features to reduce their channel dimensions: $X_{\mathrm{fusion}} = \mathrm{Conv}_{1 \times 1}(X_{\mathrm{concat}})$. Next, the width and height of the input feature map are divided into two separate axes, and the average values are computed. The formula is as follows:
$$m_w(i) = \frac{1}{H} \sum_{0 \le j < H} K(i, j),$$
$$m_h(j) = \frac{1}{W} \sum_{0 \le i < W} K(i, j),$$
where $H$, $W$, and $K(i, j)$ represent the height, width, and value at position $(i, j)$ of the input feature map, respectively.
After concatenating the height feature $m_h$ and the width feature $m_w$, channel reduction is performed using a 1 × 1 convolution:
$$x_{\mathrm{reduced}} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(m_h, m_w)))),$$
where $x_{\mathrm{reduced}} \in \mathbb{R}^{N \times C_{\mathrm{reduction}} \times H \times W}$. Next, attention weights for the height and width directions, $m_h$ and $m_w$, are generated, respectively:
$$m_h = \sigma(\mathrm{Conv}_{1 \times 1}(x_{\mathrm{reduced}}^{h})), \quad m_h \in \mathbb{R}^{N \times C \times H \times 1},$$
$$m_w = \sigma(\mathrm{Conv}_{1 \times 1}(x_{\mathrm{reduced}}^{w})), \quad m_w \in \mathbb{R}^{N \times C \times 1 \times W},$$
where $x_{\mathrm{reduced}}^{h}$ represents the reduced features for the height direction, $x_{\mathrm{reduced}}^{w}$ represents the reduced features for the width direction, and $\sigma$ represents the Sigmoid function.
Finally, the initial fused feature $x_{\mathrm{fusion}}$ is reweighted to generate the enhanced output feature:
$$x_{\mathrm{output}} = x_{\mathrm{fusion}} \cdot m_h \cdot m_w.$$
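A minimal sketch of this adaptive fusion step is given below, assuming a 1 × 1 convolution for the concatenation reduction, a shared reduction branch for the pooled height and width features, and a channel-reduction ratio chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureFusion(nn.Module):
    """Sketch of the adaptive fusion: concatenate the two inputs, reduce with
    a 1x1 conv, derive coordinate-attention-style weights along height and
    width, and reweight the fused feature."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(channels // reduction, 8)                  # assumed reduction ratio
        self.fuse = nn.Conv2d(2 * channels, channels, 1)     # concat + 1x1 reduction
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.attn_h = nn.Conv2d(mid, channels, 1)
        self.attn_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x_dec, x_enc):
        # x_dec: upsampled decoder feature (e.g., from DySample); x_enc: encoder skip
        x = self.fuse(torch.cat([x_dec, x_enc], dim=1))
        n, c, h, w = x.shape
        m_h = F.adaptive_avg_pool2d(x, (h, 1))                        # average over width
        m_w = F.adaptive_avg_pool2d(x, (1, w)).permute(0, 1, 3, 2)    # average over height
        y = self.reduce(torch.cat([m_h, m_w], dim=2))                 # joint channel reduction
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.attn_h(y_h))                         # (N, C, H, 1)
        a_w = torch.sigmoid(self.attn_w(y_w.permute(0, 1, 3, 2)))     # (N, C, 1, W)
        return x * a_h * a_w                                          # reweighted output
```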

4. Experimental Results

4.1. Evaluation Metrics

In our comparative experiments, we adopted Intersection over Union (IoU), normalized Intersection over Union (nIoU), and F1 score (F1) as our evaluation metrics.
(1) IoU: Quantifies segmentation accuracy by measuring the overlap between predictions (P) and ground truth (T):
$$\mathrm{IoU} = \frac{TP}{T + P - TP},$$
where $TP$ denotes true positives.
(2) nIoU: nIoU is a modified version of IoU, specifically developed to tackle class imbalance and overcome the challenges in IRSTD. By normalizing the IoU with class-specific weights, it assigns different weights to each class, ensuring a more balanced model performance across classes and uneven datasets. Its computation is defined as follows:
$$\mathrm{nIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{T_i + P_i - TP_i},$$
where N represents the total number of samples.
(3) F1: The F1 score assesses the classification performance of a model by balancing precision and recall, particularly in situations involving imbalanced positive and negative samples. The formula is as follows:
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN},$$
where $TP$, $FP$, and $FN$ represent the true positives, false positives, and false negatives, respectively.
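As a concrete reference, the snippet below computes these three metrics from binary prediction and ground-truth masks; the dataset-level aggregation for IoU and F1 and the per-image average for nIoU follow the common convention and may differ in detail from the exact evaluation code used in this paper.

```python
import numpy as np

def evaluate_masks(preds, targets, eps=1e-6):
    """Compute IoU, nIoU, and F1 for lists of binary masks (values in {0, 1})."""
    tp = fp = fn = inter = union = 0.0
    per_image_iou = []
    for p, t in zip(preds, targets):
        tp_i = np.logical_and(p == 1, t == 1).sum()
        fp_i = np.logical_and(p == 1, t == 0).sum()
        fn_i = np.logical_and(p == 0, t == 1).sum()
        union_i = (p == 1).sum() + (t == 1).sum() - tp_i
        per_image_iou.append(tp_i / (union_i + eps))
        tp, fp, fn = tp + tp_i, fp + fp_i, fn + fn_i
        inter, union = inter + tp_i, union + union_i
    iou = inter / (union + eps)                    # dataset-level IoU
    niou = float(np.mean(per_image_iou))           # sample-averaged (normalized) IoU
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return iou, niou, f1
```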

4.2. Dataset

We conducted experiments using the NUDT-SIRST [33] and IRSTD-1k [31] datasets. The targets in these datasets are usually defined by their small dimensions, weak signal strengths, and minimal contrast against the background, which makes them ideal for investigating target detection challenges in complex situations. Additionally, the datasets feature diverse scenes, including various background clutters (such as clouds, ground, and sea surfaces) and complex target characteristics, which help evaluate the robustness of the algorithms. For the NUDT-SIRST dataset, the data were equally divided into training and testing phases using a 50-50 split protocol. For the IRSTD-1k dataset, we kept the initial division, dedicating 80% of the images to training and the remaining 20% to testing.

4.3. Implementation Details

(1) Loss Function: The Cross-Entropy Loss (CE LOSS) function has limitations in IRSTD tasks. It mainly emphasizes pixel-level classification accuracy, overlooking the overlap of shapes between the predicted and target regions. Furthermore, infrared targets typically comprise only a small fraction of the total image area, resulting in negligible gradients when computed with standard CE Loss. This makes it less effective for detecting small targets, as their influence on the loss function is significantly reduced. To overcome these difficulties, the paper introduces the SoftIoU Loss, a loss function that directly optimizes the IoU of the target region, making it well suited for detecting small targets. The calculation formula for SoftIoU Loss is as follows:
$$\mathrm{SoftIoU\ Loss}(O, M) = 1 - \frac{O \cdot M}{O + M - O \cdot M + \epsilon},$$
where $O$ is the predicted probability map from the model, $M$ is the binarized ground truth, and $\epsilon$ is a small constant added to avoid division by zero.
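In code, a minimal PyTorch version of this loss could look as follows; summing the intersection and union terms over all pixels of the batch is an assumed (and common) convention.

```python
import torch

def soft_iou_loss(pred, target, eps=1e-6):
    """SoftIoU loss: pred is the sigmoid probability map O, target the binary mask M."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return 1.0 - inter / (union + eps)
```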
(2) Experimental Setup: FMADNet was trained using the AdamW optimizer, with the learning rate and batch size set to 0.01 and 16, respectively, over a total of 1500 epochs. Input images were sized to 256 × 256 for the NUDT-SIRST dataset and 512 × 512 for the IRSTD-1k dataset. To verify the superiority of the FMADNet method, we compared it with current state-of-the-art (SOTA) methods, including the model-driven algorithm NTFRA and data-driven algorithms such as ACM, ALCNet, UIUNet, DNANet, ISNet, and AGPCNet. The experimental environment consists of Python 3.10 + PyTorch 2.4.1 + CUDA 12.1. The hardware platform comprises an Intel Xeon Platinum 8276L CPU and eight Nvidia GeForce RTX 3090 GPUs.
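For orientation, a hedged sketch of a training loop under the reported settings (AdamW, learning rate 0.01, batch size 16, 1500 epochs) is shown below. The stand-in model and synthetic data are placeholders to be replaced by FMADNet and the actual dataset loaders; the loss is the SoftIoU form given above.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

def soft_iou(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - inter / (pred.sum() + target.sum() - inter + eps)

# Stand-in model; substitute FMADNet in practice.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 1, 3, padding=1))
optimizer = AdamW(model.parameters(), lr=0.01)

# Synthetic 256x256 single-channel data standing in for NUDT-SIRST patches.
images = torch.rand(32, 1, 256, 256)
masks = (torch.rand(32, 1, 256, 256) > 0.99).float()
loader = DataLoader(TensorDataset(images, masks), batch_size=16, shuffle=True)

for epoch in range(1500):
    for img, msk in loader:
        pred = torch.sigmoid(model(img))
        loss = soft_iou(pred, msk)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```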

4.4. Comparison with the SOTA Methods

(1) Quantitative Results: A systematic performance evaluation is conducted, where FMADNet is quantitatively compared against leading competitors across multiple public datasets, as summarized in Table 1. Our algorithm outperforms the competitors on both datasets. The reliance of classical methods on a priori assumptions and hand-crafted parameter selection tends to reduce algorithm robustness while generally producing inferior performance. Deep learning approaches demonstrate superior capabilities through automated feature learning and parameter optimization, substantially decreasing manual configuration requirements and consequently outperforming conventional techniques.
The NUDT-SIRST dataset is primarily used to assess the small-scale feature capturing ability in object detection and segmentation tasks. This dataset contains images with many small targets, which is a huge challenge for most traditional deep learning methods, as these methods typically lose small-scale target details due to downsampling. FMADNet, by integrating the RMFE and AFDF modules, successfully preserves these small target details, significantly improving object detection and segmentation accuracy. On the NUDT-SIRST dataset, FMADNet outperforms the second-best method by 4.18% in IoU, 4.32% in nIoU, and 2.38% in F1 score. This result indicates that FMADNet effectively captures important detail information when processing complex images with small-scale targets. The IRSTD-1k dataset is mainly used to evaluate object detection performance on large-scale datasets, where the targets typically exhibit high complexity and variability. Due to the varying positions, sizes, and shapes of targets in images, traditional methods often struggle to address these challenges. On the IRSTD-1k dataset, FMADNet outperforms the second-best method by 5.46% in IoU and 3.75% in the F1 score. This result highlights FMADNet’s advantage in handling large-scale complex datasets, especially in preserving small target details, demonstrating its superior multi-scale feature fusion capabilities.
We plotted the ROC curves of FMADNet and other SOTA methods across different datasets to evaluate their performance, as illustrated in Figure 5.
The analysis shows that FMADNet demonstrates exceptional detection capability on the NUDT-SIRST dataset, effectively separating infrared small targets from background noise while achieving a balanced trade-off between high true positive rates and low false positives. Similarly, on the more challenging IRSTD-1k dataset, FMADNet exhibits strong generalization ability and stable detection performance, accurately identifying targets in complex infrared environments. Compared with other methods, FMADNet not only outperforms most of them in overall detection performance but also displays a more rapid and smoother rise in its ROC curves across all false positive rate intervals. Overall, the outstanding performance of FMADNet’s ROC curves across different datasets highlights its reliability and efficiency in the task of IRSTD, laying a solid foundation for its broader application in diverse infrared scenarios.
We have added a comparative analysis of computational efficiency and performance, highlighting the advantages of FMADNet over other SOTA methods, as shown in Table 2. Although algorithms such as ACM and ALCNet exhibit lower computational complexity, FMADNet significantly improves infrared small target detection accuracy through its innovative RMFE module and AFDF module. Compared to models with higher computational complexity, such as ISNet, FMADNet significantly improves inference speed and computational efficiency while maintaining detection accuracy. FMADNet’s computational efficiency (6.98 GFLOPs and 47.23 FPS) is far superior to ISNet’s (31.26 GFLOPs and 36.81 FPS), while also showing a slight edge in accuracy. Overall, FMADNet demonstrates a good balance across multiple evaluation metrics. It achieves lower computational complexity and a moderate number of parameters, indicating efficient resource usage. At the same time, FMADNet maintains high inference speed, surpassing most existing methods, and reaches the highest detection performance (89.66% IoU) on the NUDT-SIRST dataset. By achieving an efficient balance between accuracy and performance, FMADNet proves to be highly effective for real-time applications in practical IRSTD scenarios.
(2) Visual Results: The visualization results of the NUDT-SIRST and IRSTD-1k datasets are shown in Figure 6 and Figure 7, respectively. FMADNet significantly outperforms other methods, which is attributed to its ability to effectively enhance target features and distinguish them from background features. The RMFE module, by processing complex and multi-scale data, effectively enhances the feature representation of infrared small targets, making them more prominent visually. This is especially evident in comparison images, where the processed images show clearer and more focused target features. The introduction of the AFDF module not only achieves effective fusion of cross-scale features but also optimizes the integration of features through dynamic learning upsampling and feature weighting techniques. In the visualized images, images processed by AFDF retain important features while eliminating unnecessary background noise, making the targets more distinct. It demonstrates consistent efficacy at diverse resolutions through adaptive coordination of conceptual abstractions and fine-grained feature representations. This is particularly important in scenarios where there is a significant variation in target sizes.

4.5. Ablation Study

To explore whether each module in FMADNet contributes to the improvement of model performance, we conducted experiments and analysis on FMADNet and its three variants. Using U-ResNet as the baseline model, we designed ablation experiments to study the impact of improved or newly added modules on the network’s performance. Each module was added sequentially to the original U-ResNet, and the hyperparameters were kept consistent throughout the experiments, as shown in Table 3. In Experiment (1), serving as our baseline, we removed the RMFE and AFDF modules, keeping only the original U-ResNet. In Experiment (2), we added the RMFE module to the baseline model. The detection metrics improved; on the IRSTD-1k dataset in particular, IoU and F1 increased by 5.12% and 3.55%, respectively. The incorporation of the RMFE module enhanced all evaluation metrics relative to the baseline architecture. By combining residual blocks with the EMA, the RMFE module effectively enhanced the feature representation of small and weak targets in infrared images, thereby improving detection performance. In Experiment (3), we added the AFDF module to the baseline model. The addition of the AFDF module also improved all metrics compared to the baseline model. Particularly on the NUDT-SIRST dataset, the detection metrics IoU, nIoU, and F1 improved by 2.34%, 2.38%, and 1.35%, respectively. The AFDF module achieved dynamic learning-based upsampling and focused more on spatially important features, significantly improving detection accuracy. Experiment (4) represents our proposed FMADNet, which combines the RMFE and AFDF modules. Compared to Experiments (2) and (3), the detection performance of FMADNet was further improved.
To assess the comparative performance of different loss functions, we performed ablation experiments on different datasets, comparing CE Loss, Binary Cross-Entropy Loss (BCE Loss), Dice Similarity Coefficient Loss (Dice Loss), and the introduced SoftIoU Loss, as shown in Table 4. The results demonstrate that SoftIoU Loss exhibits comprehensive advantages. On the NUDT-SIRST dataset, SoftIoU Loss achieves superior performance in IoU (89.66%), nIoU (90.21%), and F1 (94.55%), outperforming CE Loss and Dice Loss, while closely matching BCE Loss (IoU gap: 0.07%, F1 gap: 0.04%). On the IRSTD-1k dataset, SoftIoU Loss shows significant improvements: its IoU reaches 73.47%, which is 3.11 percentage points higher than Dice Loss (70.36%).

5. Discussion

In our experiments, we found that infrared images may contain significant noise or artifacts. These disruptive factors frequently induce feature aliasing between critical targets and extraneous background elements, where small targets often exhibit low contrast and are susceptible to being obscured by background clutter. Such inherent characteristics of infrared imagery significantly increase the complexity of accurate target identification during detection processes. Therefore, future research could further explore noise suppression techniques and combine data from different types of sensors (such as visible light, radar, lidar, etc.). The combination of multimodal data can effectively compensate for the limitations of single infrared images in certain scenarios, thereby improving the reliability of small target detection. We also have observed significant differences in target types, background environments, noise levels, and capturing conditions across different datasets of infrared images. These discrepancies result in the model performing well on one dataset but failing to achieve expected performance on another. To improve model adaptability across diverse data sources, future studies could prioritize refining domain adaptation strategies. By bridging feature distribution gaps in infrared imagery, this approach would strengthen small target detection algorithms for real-world deployment.

6. Conclusions

Recently, deep learning, with CNNs in particular, has emerged as a powerful solution for complex visual computing tasks. Nonetheless, these advanced models still encounter substantial challenges when applied to IRSTD. Standard CNN architectures are not specifically optimized for handling targets of small size and high noise, and the hierarchical subsampling process inherent in convolutional feature extraction may erode discriminative small-target features. To address these issues, this study introduces FMADNet for IRSTD. The RMFE module integrates an EMA mechanism based on a ResNet-UNet architecture. U-Net’s characteristic encoder–decoder configuration provides a systematic approach for hierarchical feature fusion at multiple scales. Residual networks enhance the propagation of gradients through deep residual connections, effectively increasing the training depth and performance of the network. We introduce the EMA mechanism in this structure to adaptively enhance or suppress feature responses at multiple scales, further enhancing the detection capability for tiny targets. Additionally, the AFDF module enables dynamic upsampling fusion of features from the encoder with those from the decoder. This design allows the network to dynamically learn upsampling while focusing more on spatially important features. Our method was tested on the NUDT-SIRST and IRSTD-1k datasets and compared with several SOTA methods. Quantitative analysis reveals that our architecture effectively and accurately detects infrared small targets, while remaining robust in the presence of diverse and complex backgrounds.

Author Contributions

Conceptualization, Z.X. and Z.S.; methodology, Z.X.; software, Z.X. and Z.S.; validation, Z.X. and Z.S.; investigation, Y.M.; writing—original draft preparation, Z.X. and Z.S.; writing—review and editing, Z.X., Z.S. and Y.M.; funding acquisition, Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62271109).

Data Availability Statement

The data presented in this study are cited within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheng, Y.; Lai, X.; Xia, Y.; Zhou, J. Infrared Dim Small Target Detection Networks: A Review. Sensors 2024, 24, 3885. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, W.; Li, Z.; Siddique, A. Infrared Maritime Small-Target Detection Based on Fusion Gray Gradient Clutter Suppression. Remote Sens. 2024, 16, 1255. [Google Scholar] [CrossRef]
  3. Guo, F.; Ma, H.; Li, L.; Lv, M.; Jia, Z. FCNet: Flexible Convolution Network for Infrared Small Ship Detection. Remote Sens. 2024, 16, 2218. [Google Scholar] [CrossRef]
  4. Ying, X.; Wang, Y.; Wang, L.; Sheng, W.; Liu, L.; Lin, Z.; Zhou, S. Local Motion and Contrast Priors Driven Deep Network for Infrared Small Target Superresolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5480–5495. [Google Scholar] [CrossRef]
  5. Deng, H.; Zhang, Y. FMR-YOLO: Infrared Ship Rotating Target Detection Based on Synthetic Fog and Multiscale Weighted Feature Fusion. IEEE Trans. Instrum. Meas. 2024, 73, 1–17. [Google Scholar] [CrossRef]
  6. Wang, F.; Qian, W.; Qian, Y.; Ma, C.; Zhang, H.; Wang, J.; Wan, M.; Ren, K. Maritime Infrared Small Target Detection Based on the Appearance Stable Isotropy Measure in Heavy Sea Clutter Environments. Sensors 2023, 23, 9838. [Google Scholar] [CrossRef]
  7. Kim, J.; Huh, J.; Park, I.; Bak, J.; Kim, D.; Lee, S. Small Object Detection in Infrared Images: Learning from Imbalanced Cross-Domain Data via Domain Adaptation. Appl. Sci. 2022, 12, 1201. [Google Scholar] [CrossRef]
  8. Sun, Y.; Yang, J.; An, W. Infrared Dim and Small Target Detection via Multiple Subspace Learning and Spatial-Temporal Patch-Tensor Model. IEEE Trans. Geosci. Remote Sens. 2021, 59, 3737–3752. [Google Scholar] [CrossRef]
  9. Zhong, Y.; Shi, Z.; Zhang, Y.; Zhang, Y.; Li, H. CSAN-UNet: Channel Spatial Attention Nested UNet for Infrared Small Target Detection. Remote Sens. 2024, 16, 1894. [Google Scholar] [CrossRef]
  10. Pan, X.; Jia, N.; Mu, Y.; Gao, X. Survey of small object detection. J. Image Graph. 2023, 28, 2587–2615. [Google Scholar] [CrossRef]
  11. Xiang, Y.; Gong, C.; Ge, L.; Wei, D.; Yin, W.; Feng, Y.; Xiwen, Y.; Huang, Z.; Xian, S.; Han, J. Progress in small object detection for remote sensing images. J. Image Graph. 2023, 28, 1662–1684. [Google Scholar]
  12. Wang, X.; Peng, Z.; Kong, D.; He, Y. Infrared Dim and Small Target Detection Based on Stable Multisubspace Learning in Heterogeneous Scene. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5481–5493. [Google Scholar] [CrossRef]
  13. Zhao, Y.; Li, Y.; Zhu, C.; Wang, S.; Lan, Z.; Qiao, Y. An Adaptive Spatial-Temporal Local Feature Difference Method For Infrared Small-Moving Target Detection. In Proceedings of the 2023 8th IEEE International Conference on Network Intelligence and Digital Content (IC-NIDC), Beijing, China, 3–5 November 2023; pp. 346–351. [Google Scholar] [CrossRef]
  14. Yao, C.; Zhao, H. Adaptive Frame Sampling and Feature Alignment for Multi-Frame Infrared Small Target Detection. Appl. Sci. 2024, 14, 6360. [Google Scholar] [CrossRef]
  15. Luo, Y.; Ying, X.; Li, R.; Wan, Y.; Hu, B.; Ling, Q. Multi-scale Optical Flow Estimation for Video Infrared Small Target Detection. In Proceedings of the 2022 2nd International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Nanjing, China, 23–25 September 2022; pp. 129–132. [Google Scholar] [CrossRef]
  16. Hu, Y.; Ma, Y.; Pan, Z.; Liu, Y. Infrared Dim and Small Target Detection from Complex Scenes via Multi-Frame Spatial–Temporal Patch-Tensor Model. Remote Sens. 2022, 14, 2234. [Google Scholar] [CrossRef]
  17. Deng, H.; Zhang, Y.; Li, Y.; Cheng, K.; Chen, Z. BEmST: Multiframe Infrared Small-Dim Target Detection Using Probabilistic Estimation of Sequential Backgrounds. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  18. Sun, X.; Xiong, W.; Shi, H. A Novel Spatiotemporal Filtering for Dim Small Infrared Maritime Target Detection. In Proceedings of the 2022 International Symposium on Electrical, Electronics and Information Engineering (ISEEIE), Chiang Mai, Thailand, 25–27 February 2022; pp. 195–201. [Google Scholar] [CrossRef]
  19. Wang, X.; Peng, Z.; Zhang, P.; He, Y. Infrared Small Target Detection via Nonnegativity-Constrained Variational Mode Decomposition. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1700–1704. [Google Scholar] [CrossRef]
  20. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. In Proceedings of the Optics & Photonics, Denver, CO, USA, 4 October 1999. [Google Scholar]
  21. Pang, D.; Shan, T.; Li, W.; Ma, P.; Tao, R.; Ma, Y. Facet Derivative-Based Multidirectional Edge Awareness and Spatial-Temporal Tensor Model for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  22. Hao, X.; Liu, X.; Liu, Y.; Cui, Y.; Lei, T. Infrared Small-Target Detection Based on Background-Suppression Proximal Gradient and GPU Acceleration. Remote Sens. 2023, 15, 5424. [Google Scholar] [CrossRef]
  23. Liu, J.; He, Z.; Chen, Z.; Shao, L. Tiny and Dim Infrared Target Detection Based on Weighted Local Contrast. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1780–1784. [Google Scholar] [CrossRef]
  24. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared Small Target Detection Based on the Weighted Strengthened Local Contrast Measure. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1670–1674. [Google Scholar] [CrossRef]
  25. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared Small Target Detection via Non-Convex Rank Approximation Minimization Joint l2,1 Norm. Remote Sens. 2018, 10, 1821. [Google Scholar] [CrossRef]
  26. He, Y.; Li, M.; Zhang, J.; Yao, J. Infrared Target Tracking Based on Robust Low-Rank Sparse Learning. IEEE Geosci. Remote Sens. Lett. 2016, 13, 232–236. [Google Scholar] [CrossRef]
  27. Kong, X.; Yang, C.; Cao, S.; Li, C.; Peng, Z. Infrared Small Target Detection via Nonconvex Tensor Fibered Rank Approximation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–21. [Google Scholar] [CrossRef]
  28. Zhang, L.; Peng, Z. Infrared Small Target Detection Based on Partial Sum of the Tensor Nuclear Norm. Remote Sens. 2019, 11, 382. [Google Scholar] [CrossRef]
  29. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric Contextual Modulation for Infrared Small Target Detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 949–958. [Google Scholar] [CrossRef]
  30. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Attentional Local Contrast Networks for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9813–9824. [Google Scholar] [CrossRef]
  31. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape Matters for Infrared Small Target Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 867–876. [Google Scholar] [CrossRef]
  32. Zhang, M.; Bai, H.; Zhang, J.; Zhang, R.; Wang, C.; Guo, J.; Gao, X. RKformer: Runge-Kutta Transformer with Random-Connection Attention for Infrared Small Target Detection. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, Lisboa, Portugal, 10–14 October 2022; pp. 1730–1738. [Google Scholar] [CrossRef]
  33. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense Nested Attention Network for Infrared Small Target Detection. IEEE Trans. Image Process. 2023, 32, 1745–1758. [Google Scholar] [CrossRef] [PubMed]
  34. Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for Infrared Small Object Detection. IEEE Trans. Image Process. 2023, 32, 364–376. [Google Scholar] [CrossRef]
  35. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-Guided Pyramid Context Networks for Detecting Infrared Small Target Under Complex Background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261. [Google Scholar] [CrossRef]
  36. Pan, P.; Wang, H.; Wang, C.; Nie, C. ABC: Attention with Bilinear Correlation for Infrared Small Target Detection. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2381–2386. [Google Scholar] [CrossRef]
  37. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  38. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid attention network for semantic segmentation. arXiv 2018, arXiv:1805.10180. [Google Scholar]
  39. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  40. Wang, K.; Du, S.; Liu, C.; Cao, Z. Interior Attention-Aware Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  41. Pang, D.; Ma, P.; Shan, T.; Li, W.; Tao, R.; Ma, Y.; Wang, T. STTM-SFR: Spatial–Temporal Tensor Modeling With Saliency Filter Regularization for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  42. Luo, Y.; Li, X.; Chen, S.; Xia, C.; Zhao, L. IMNN-LWEC: A Novel Infrared Small Target Detection Based on Spatial–Temporal Tensor Model. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–22. [Google Scholar] [CrossRef]
  43. Du, J.; Li, D.; Deng, Y.; Zhang, L.; Lu, H.; Hu, M.; Shen, X.; Liu, Z.; Ji, X. Multiple Frames Based Infrared Small Target Detection Method Using CNN. In Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, ACAI ’21, Sanya, China, 22–24 December 2021. [Google Scholar] [CrossRef]
  44. Chen, S.; Ji, L.; Zhu, J.; Ye, M.; Yao, X. SSTNet: Sliced Spatio-Temporal Network With Cross-Slice ConvLSTM for Moving Infrared Dim-Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  45. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  46. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  47. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  48. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6004–6014. [Google Scholar] [CrossRef]
  49. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  50. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  51. Kang, M.; Ting, C.M.; Ting, F.F.; Phan, R.C.W. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
  52. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  53. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  54. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Figure 1. Overview of the proposed FMADNet method. FMADNet adopts a basic encoder–decoder structure, consisting of the RMFE and AFDF modules.
Figure 2. The detailed structure of RMFE.
Figure 3. The detailed structure of EMA.
Figure 4. The detailed structure of AFDF.
Figure 5. ROC curves of all evaluated methods on the different datasets; the x-axis denotes the False Positive Rate (FPR) and the y-axis the True Positive Rate (TPR).
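The ROC curves in Figure 5 trace TPR against FPR as the detection threshold applied to the predicted score maps is varied. The evaluation code is not part of this excerpt, so the snippet below is only a minimal sketch of such a per-threshold, pixel-level computation; the function name, threshold grid, and array-based inputs are illustrative assumptions.

```python
import numpy as np

def roc_points(score_maps, gt_masks, num_thresholds=50):
    """Sketch of a pixel-level ROC computation for IRSTD outputs.

    score_maps : list of 2-D float arrays in [0, 1] (network outputs)
    gt_masks   : list of 2-D {0, 1} arrays (ground-truth target masks)
    Returns (fpr, tpr) arrays, one point per threshold.
    """
    scores = np.concatenate([s.ravel() for s in score_maps])
    labels = np.concatenate([g.ravel() for g in gt_masks]).astype(bool)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    tpr, fpr = [], []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        tn = np.sum(~pred & ~labels)
        tpr.append(tp / max(tp + fn, 1))   # True Positive Rate at this threshold
        fpr.append(fp / max(fp + tn, 1))   # False Positive Rate at this threshold
    return np.array(fpr), np.array(tpr)
```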
Figure 6. The visual results of various methods on the NUDT-SIRST dataset are presented, with an enlarged view of the target area shown in the top-right corner. Subfigures (1)–(4) correspond to different test cases. The cyan rectangles denote zoomed-in areas to provide a clearer view of target detection performance across methods. Targets that are correctly identified, false positives, and missed regions are marked using red, yellow, and green circles, respectively.
Figure 7. The visual results of various methods on the IRSTD-1k dataset are presented, with an enlarged view of the target area shown in the top-right corner. Subfigures (1)–(4) correspond to different test cases. The cyan rectangles denote zoomed-in areas to provide a clearer view of target detection performance across methods. Targets that are correctly identified, false positives, and missed regions are marked using red, yellow, and green circles, respectively.
Table 1. The performance of various SOTA methods on the different datasets, evaluated using IoU (%), nIoU (%), and F1 (%) scores. The arrows (↑) indicate that higher values represent better performance. The fonts in red and blue mark the first- and second-ranked methods, respectively. A minimal sketch of how these metrics can be computed is given after the table.

Dataset     | Method       | Ref. | IoU (%) ↑ | nIoU (%) ↑ | F1 (%) ↑
NUDT-SIRST  | NTFRA [27]   | TGRS | 8.61      | 17.22      | 15.86
NUDT-SIRST  | ACM [29]     | WACV | 58.89     | 57.20      | 74.13
NUDT-SIRST  | ALCNet [30]  | TGRS | 80.49     | 81.37      | 89.19
NUDT-SIRST  | UIUNet [34]  | TIP  | 85.26     | 84.75      | 92.04
NUDT-SIRST  | DNANet [33]  | TIP  | 85.48     | 85.89      | 92.17
NUDT-SIRST  | ISNet [31]   | CVPR | 66.03     | 70.29      | 79.54
NUDT-SIRST  | AGPCNet [35] | TAES | 85.00     | 86.88      | 91.89
NUDT-SIRST  | Ours         | -    | 89.66     | 90.21      | 94.55
IRSTD-1k    | NTFRA [27]   | TGRS | 2.40      | 21.18      | 4.69
IRSTD-1k    | ACM [29]     | WACV | 62.21     | 55.61      | 76.70
IRSTD-1k    | ALCNet [30]  | TGRS | 66.17     | 64.21      | 79.64
IRSTD-1k    | UIUNet [34]  | TIP  | 61.70     | 57.64      | 76.31
IRSTD-1k    | DNANet [33]  | TIP  | 68.01     | 68.98      | 80.96
IRSTD-1k    | ISNet [31]   | CVPR | 65.44     | 63.00      | 79.11
IRSTD-1k    | AGPCNet [35] | TAES | 65.73     | 63.31      | 79.32
IRSTD-1k    | Ours         | -    | 73.47     | 68.13      | 84.71
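IoU and F1 in Table 1 follow their standard pixel-level definitions, while nIoU is widely reported as the IoU averaged over individual samples. The paper's evaluation code is not reproduced here, so the snippet below is only a minimal sketch of these common definitions; the function names and the sample-averaged nIoU formulation are assumptions rather than the authors' exact implementation.

```python
import numpy as np

def iou(pred, gt, eps=1e-6):
    """Pixel-level IoU between a binary prediction and the ground-truth mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.sum(pred & gt)
    union = np.sum(pred | gt)
    return (inter + eps) / (union + eps)

def f1(pred, gt, eps=1e-6):
    """Pixel-level F1 score: harmonic mean of precision and recall."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    precision = (tp + eps) / (tp + fp + eps)
    recall = (tp + eps) / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall)

def niou(preds, gts):
    """Sample-averaged ("normalized") IoU over a list of (prediction, mask) pairs."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```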
Table 2. A comparison of the computational complexity and time efficiency on the NUDT-SIRST dataset. The first- and second-ranked methods are highlighted in red and blue fonts, respectively. Arrows indicate direction: ↓ means lower is better, ↑ means higher is better. A sketch of one way to obtain such complexity and speed figures is given after the table.

Method       | Ref. | FLOPs (G) ↓ | Params (M) ↓ | FPS ↑ | IoU (%) ↑
ACM [29]     | WACV | 0.38        | 0.29         | 86.45 | 58.89
ALCNet [30]  | TGRS | 7.08        | 0.38         | 26.18 | 80.49
UIUNet [34]  | TIP  | 54.46       | 50.54        | 44.18 | 85.26
DNANet [33]  | TIP  | 14.09       | 4.70         | 57.69 | 85.48
ISNet [31]   | CVPR | 31.26       | 1.09         | 36.81 | 66.03
AGPCNet [35] | TAES | 43.11       | 12.36        | 14.11 | 85.00
Ours         | -    | 6.98        | 4.31         | 47.23 | 89.66
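Table 2 reports FLOPs, parameter counts, and FPS, but the measurement tooling is not specified in this excerpt. The snippet below is a hedged sketch of one common way to obtain comparable numbers in PyTorch; the 256 × 256 single-channel input, the warm-up/run counts, and the optional use of the third-party thop profiler are illustrative assumptions, not the authors' protocol.

```python
import time
import torch

def count_params(model):
    """Total number of learnable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, input_size=(1, 1, 256, 256), runs=100, device="cuda"):
    """Average single-image inference speed (frames per second)."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(10):              # warm-up iterations before timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)

# FLOPs can be estimated with an external profiler such as thop, e.g.:
# from thop import profile
# flops, _ = profile(model, inputs=(torch.randn(1, 1, 256, 256),))
```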
Table 3. Results of the ablation study on the different datasets.

              | NUDT-SIRST                     | IRSTD-1k
Configuration | IoU (%) | nIoU (%) | F1 (%)    | IoU (%) | nIoU (%) | F1 (%)
U-Res         | 85.36   | 85.93    | 92.10     | 67.17   | 66.96    | 80.36
U-RMFE        | 85.92   | 86.68    | 92.43     | 72.29   | 67.05    | 83.91
U-Res + AFDF  | 87.70   | 88.31    | 93.45     | 68.87   | 67.07    | 81.57
FMADNet       | 89.66   | 90.21    | 94.55     | 73.47   | 68.13    | 84.71
Table 4. Comparison of different loss functions on the different datasets. The fonts in red and blue represent the methods in the first and second positions, respectively. A minimal sketch of the SoftIoU loss is given after the table.

              | NUDT-SIRST                     | IRSTD-1k
Loss          | IoU (%) | nIoU (%) | F1 (%)    | IoU (%) | nIoU (%) | F1 (%)
CE Loss       | 88.68   | 89.83    | 94.00     | 68.74   | 67.57    | 81.48
BCE Loss      | 89.59   | 89.79    | 94.51     | 68.15   | 68.47    | 81.06
Dice Loss     | 88.71   | 89.52    | 94.02     | 70.36   | 66.20    | 82.60
SoftIoU Loss  | 89.66   | 90.21    | 94.55     | 73.47   | 68.13    | 84.71
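Table 4 shows the SoftIoU loss performing best on both datasets. A common soft-IoU formulation penalizes one minus the ratio of the soft intersection to the soft union between the predicted probability map and the binary mask; the PyTorch snippet below is a minimal sketch of that common form and is not necessarily the authors' exact implementation.

```python
import torch

def soft_iou_loss(logits, target, eps=1e-6):
    """Soft-IoU loss: 1 - (soft intersection / soft union), averaged over the batch.

    logits : raw network output, shape (N, 1, H, W)
    target : binary ground-truth mask of the same shape, values in {0, 1}
    """
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()
```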
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
