Article

MDDFA-Net: Multi-Scale Dynamic Feature Extraction from Drone-Acquired Thermal Infrared Imagery

by Zaixing Wang, Chao Dang, Rui Zhang, Linchang Wang, Yonghuan He and Rong Wu
1 School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730073, China
2 Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu 611756, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(3), 224; https://doi.org/10.3390/drones9030224
Submission received: 1 February 2025 / Revised: 12 March 2025 / Accepted: 16 March 2025 / Published: 20 March 2025

Abstract

UAV infrared sensor technology plays an irreplaceable role in various fields. High-altitude infrared images present significant challenges for feature extraction due to their uniform texture and color, fragile and variable edge information, numerous background interference factors, and low pixel occupancy of small targets such as humans, bicycles, and diverse vehicles. In this paper, we propose a Multi-scale Dual-Branch Dynamic Feature Aggregation Network (MDDFA-Net) specifically designed to address these challenges in UAV infrared image processing. Firstly, a multi-scale dual-branch structure is employed to extract multi-level and edge feature information, which is crucial for detecting small targets in complex backgrounds. Subsequently, features at three different scales are fed into an Adaptive Feature Fusion Module for feature attention-weighted fusion, effectively filtering out background interference. Finally, the Multi-Scale Feature Enhancement and Fusion Module integrates high-level and low-level features across three scales to eliminate redundant information and enhance target detection accuracy. We conducted comprehensive experiments using the HIT-UAV dataset, which is characterized by its diversity and complexity, particularly in capturing small targets in high-altitude infrared images. Our method outperforms various state-of-the-art (SOTA) models across multiple evaluation metrics and also demonstrates strong inference speed capabilities across different devices, thereby proving the advantages of this approach in UAV infrared sensor image processing, especially for multi-scale small target detection.

1. Introduction

As an important monitoring technology, UAV data collection is widely used in fields such as agriculture [1], the military [2], industry [3], and public welfare owing to its efficiency and flexibility. Compared with ground-based methods, UAV monitoring offers high coverage, flexibility, multiple perspectives, and low cost. UAVs can quickly cover large areas, making them especially suitable for regions that are difficult to access from the ground, such as remote mountainous areas [4], post-disaster rubble [5], or hazardous environments. UAVs can be equipped with various sensors to capture different types of remote sensing data, such as high-definition images, infrared remote sensing images, and LiDAR data. Compared to visible light images, infrared remote sensing images can effectively capture the thermal radiation of targets and backgrounds in low-light or no-light environments, offering all-weather monitoring capability and strong anti-interference performance. However, the following challenges remain.
(1) Feature extraction of target points of interest: From Figure 1a, in thermal infrared images captured by UAVs in urban environments, the points of interest are predominantly vehicles (cars, motorcycles), pedestrians, and other transport units. Constrained by the grayscale nature of infrared imagery, such targets lack color and texture information; combined with inherent defects such as morphological blurring and indistinct contour boundaries, their key identifying features are significantly weakened.
(2) Multi-scale variation: From Figure 1b, the dynamic movement between the UAV and the target results in non-uniform target scales, and the target edges become fragile. Current methods are unable to effectively accommodate scenarios where regular and multi-scale structures appear simultaneously.
(3) Complex background interference issues: From Figure 1c, in addition to the points of interest, infrared image backgrounds lack texture and color information. Furthermore, the thermal radiation characteristics of some targets exhibit similarities with the background (particularly under variable climatic conditions or at different time periods, such as during periods with significant temperature differences between morning and evening), which can lead to errors in feature analysis.
In past research, various infrared image extraction methods have been proposed, including pixel-based methods (threshold segmentation [6], adaptive threshold methods [7]), geometric feature methods (edge detection [8], contour analysis [9]), region segmentation (region growing [10], region merging and segmentation [11]), and statistical feature methods (grayscale statistical features [12], texture analysis [13]). Although these methods possess strong theoretical value, they exhibit significant limitations in practice because accurate parameters are difficult to obtain. For example, as image resolution increases, pixel-based methods must handle a higher proportion of infrared target pixels [14].
This may lead to an excessive focus on single-point pixels, thereby neglecting the spatial correlations between targets. These methods overly rely on pixel-level information [15], making it challenging to handle complex background variations over large areas [16]. Additionally, in high-resolution images, they may produce excessive noise in detection results [17]. Geometric feature methods often require precise target shapes, and their effectiveness significantly diminishes when dealing with irregular or complex-shaped targets. They depend on the clarity of the boundaries between the target and the background [18], whereas in infrared images, target boundaries may be blurred. In summary, these traditional methods generally lack generalizability, typically being applicable only to single types of targets. Their performance is considerably limited when confronted with complex backgrounds, varying lighting conditions, and dynamic targets.
In recent years, with the advancement of computational power and computer vision technologies, deep learning has emerged as a dominant research direction in the field of image science [19,20,21,22]. In 2006, Hinton and colleagues first introduced the concept of deep learning, a groundbreaking development that rapidly propelled the evolution of deep learning methodologies and gave rise to various network architectures, such as Deep Belief Networks (DBNs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs) [23]. These approaches are now widely applied in drone image processing. For instance, dense nested networks have been used to detect small infrared targets in drone-captured imagery [24]. Lu et al. proposed a hybrid Transformer model to achieve efficient object detection in drone images [25], while Fu et al. introduced a lighter and more accurate network for drone remote sensing images, achieving average precision improvements of 4.7% and 2.3% on the VisDrone-2021DET and HIT-UAV datasets, respectively [26].

Deep learning also plays a pivotal role in the processing of thermal infrared images. Nie et al. proposed a cross-modal feature fusion interaction strategy to enhance the diversity of feature fusion in visual and infrared remote sensing images [27]. Ou et al. improved the detection speed for infrared images of electrical equipment by leveraging deep learning algorithms with multiple aspect ratios [28]. Chang et al. addressed the significant impact of temperature on infrared image quality by proposing a multi-scale residual algorithm that demonstrates superior performance in non-uniformity correction [29]. Despite these notable advancements in drone and infrared image processing, several challenges remain [30]. More broadly, deep learning has achieved significant progress in computer vision tasks such as object detection [31], image generation [32], path planning in autonomous driving [33], and motion planning in robot control [34].

At the same time, various deep learning methods based on UAV-mounted infrared images have been proposed. Liang et al. [35] introduced a Transformer-based hierarchical embedded CNN hybrid network that simultaneously extracts local and global pedestrian information at different scales, thereby mitigating the adverse effects of feature discrepancies between modalities on recognition. Xu et al. [36] incorporated a multi-scale attention mechanism into the upsampling module of YOLOv9 to improve the extraction of small-target feature information. To tackle the small target problem, existing methods generally increase the network depth [34]. Deeper structures can extract richer features and perform better on small targets within multi-layer features, but bulky serial networks typically entail a large number of parameters and slow inference [35]. To address the difficulty of feature extraction, current methods often simply add attention mechanisms to enhance feature extraction in important regions while reducing interference from irrelevant backgrounds [36]. However, attention mechanisms inherently focus on enhancing local features, which can be a limitation for small target detection.
In many cases, especially in scenarios where targets are extremely small or the difference between the target and background is minimal, local attention may not effectively distinguish between the target and the background, thereby failing to significantly improve detection accuracy [37].
To overcome the aforementioned challenges, this paper proposes a Multi-scale Dual-Branch Dynamic Feature Aggregation Network for UAV-mounted infrared image sensor detection. The contributions are as follows:
(1) Designed a Multi-scale Dynamic Dual-Branch Encoding Structure with Multi-level Dynamic Feature Extraction Capability: This structure effectively extracts complex edge features and target features with low pixel occupancy from infrared images.
(2) Established an Adaptive Feature Fusion Module (AFFM): The AFFM performs weighted fusion of features from the same-scale branches and filters out background interference features. This structure not only generalizes well in fusion performance but also enhances the network's ability to recover detailed information.
(3) Designed the Multi-Scale Feature Enhancement and Fusion Module (MFEFM): The MFEFM is better suited to the comprehensive extraction of high-altitude infrared images. To compensate for the loss of multi-dimensional information in such images, it finely fuses features from different hierarchical levels in infrared target regions. By capturing local information without degrading global feature extraction, it effectively integrates complete multi-dimensional target information and efficiently filters out redundant information.
(4) Performance Evaluation of MDDFA-Net on the HIT-UAV Public Dataset: On the HIT-UAV dataset for UAV-mounted infrared sensor images, MDDFA-Net achieved precision of 94.26%, recall of 93.42%, and mAP@0.5 of 95.43%, significantly outperforming six other state-of-the-art (SOTA) network models. Under PC and RK3588 hardware conditions, it reached FPS of 41.1 and 29.7, respectively, demonstrating a balanced and comprehensive capability that far exceeds other SOTA models.

2. Methodology

2.1. Network Overall

The overall architecture of MDDFA-Net is illustrated in Figure 2, comprising an encoder and a decoder as its core components. To process high-resolution input images (512 × 512), the framework progressively reduces spatial dimensions through 3 × 3 convolutional layers with a stride of 2. Residual blocks and max-pooling layers are then combined to downscale the image to 256 × 256. The input feature maps subsequently enter a dual-branch encoder architecture. One branch employs a four-level encoder based on ResNet34, while the other utilizes a dynamic snake convolution (DSC) block encoder that integrates dynamic snake-shaped convolutions at each layer to extract detailed target shape and edge information. Feature maps from both branches at corresponding scales are concatenated to form feature-sharing layers, which are fed into a cross-channel information interaction enhancement module composed of three 1 × 1 convolutional layers. This module dynamically fuses semantic information from different branches through weight matrices. Similarly, residual blocks and DSC blocks containing four distinct scales and semantic features are simultaneously processed by the Adaptive Feature Fusion Module (AFFM) to precisely capture critical target characteristics. This adaptive mechanism proves particularly effective for object detection in complex environments, significantly enhancing feature extraction capabilities for small targets, distant objects, and morphologically variable instances. In the decoder section, the Multi-scale Feature Enhancement and Fusion Module (MFEFM) receives three AFFM-fused features (128 × 128, 64 × 64, 32 × 32) through its triple input interfaces. Feature fusion occurs between adjacent-scale features in deeper layers through hierarchical interactions. Subsequently, these fused features undergo further integration with features from preceding layers to attenuate redundant information. This process effectively combines high-level semantic information with low-level detail features, thereby expanding the model's receptive field and strengthening its capacity to capture contextual associations. Ultimately, the fused features from three branches are delivered to the output network for final target feature extraction. The following sections will elaborate on the network's innovative contributions to UAV infrared object detection tasks and the detailed functionalities of each module.
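To make the data flow concrete, the following minimal PyTorch sketch wires up a stem, a ResNet34 branch, a stand-in for the DSConv branch, and the 1 × 1 feature-sharing convolutions. The channel widths, stem configuration, and the use of plain convolutions in place of dynamic snake convolutions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torchvision

class DualBranchEncoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Stem: stride-2 3x3 convolution followed by max pooling (512 -> 256 -> 128)
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # Branch A: the four ResNet34 stages act as the semantic encoder
        resnet = torchvision.models.resnet34(weights=None)
        self.res_stages = nn.ModuleList(
            [resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        # Branch B: stand-in for the DSConv block encoder (plain convs used for brevity)
        widths = [64, 128, 256, 512]
        self.dsc_stages = nn.ModuleList()
        in_ch = 64
        for i, w in enumerate(widths):
            self.dsc_stages.append(nn.Sequential(
                nn.Conv2d(in_ch, w, 3, stride=1 if i == 0 else 2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True)))
            in_ch = w
        # 1x1 convolutions forming the cross-branch feature-sharing layers
        self.share = nn.ModuleList([nn.Conv2d(2 * w, w, 1) for w in widths])

    def forward(self, x):
        x = self.stem(x)
        a, b, fused = x, x, []
        for res, dsc, mix in zip(self.res_stages, self.dsc_stages, self.share):
            a, b = res(a), dsc(b)
            # Same-scale features from both branches are concatenated and mixed
            fused.append(mix(torch.cat([a, b], dim=1)))
        return fused  # multi-scale shared features, later fed to the AFFM


if __name__ == "__main__":
    feats = DualBranchEncoderSketch()(torch.randn(1, 1, 512, 512))
    print([tuple(f.shape) for f in feats])
```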

2.2. Double Branch Structure

Targets in UAV-mounted infrared remote sensing images are typically densely arranged and highly variable. Popular single-chain backbone networks in deep learning, owing to their simple structure, perform well only in straightforward feature extraction tasks. They often fail to adequately capture complex feature relationships in scenarios involving multi-scale feature fusion, cross-layer information transmission, or complex environments, leading to background confusion and incomplete feature extraction. Dual-branch structures, as parallel processing architectures, possess strong information fusion capabilities: through two parallel branches, they can simultaneously handle different types of features or acquire information from various scales and perspectives, overcoming the limitations of single-chain feature extraction.
In this paper, MDDFA-Net consists of a dual-branch encoder composed of ResNet34 blocks and dynamic snake convolution (DSConv) blocks. The feature fusion layer is formed by branches of the same scale, and the features extracted by both branches are input into the Adaptive Feature Fusion Module (AFFM) to obtain fused multi-source features. Through multi-level feature extraction with ResNet34, the network can progressively capture deeper detailed information, significantly improving target recognition and detection accuracy. Furthermore, the deep structure of ResNet34 allows the network to learn more abstract high-level features. The adaptive characteristics of the DSConv branch enable it to dynamically adjust convolution kernels based on the features of different regions, thereby better distinguishing subtle differences between targets of different categories. For targets with minimal interclass differences, dynamic convolution can finely capture more subtle features, reducing confusion caused by similar categories. This flexibility aids in the precise identification and classification of targets in infrared images, especially when targets have similar shapes, textures, or thermal characteristics.
Infrared targets, as grayscale images, possess fragile edges and variable structures, and are prone to false alarms and missed detections when targets overlap. Standard convolutions in ordinary branch structures, such as those in Darknet, offer insufficient feature extraction capability and limited receptive fields. In this paper, we therefore design a DSConv branch composed of DSConv blocks.
In Figure 3, we illustrate the distinct mechanisms of dynamic snake convolution compared to the standard 3 × 3 convolution. The core idea is to simulate the trajectory of snake-like motion by dynamically adjusting the sampling path of the convolutional kernel to align with the target morphology. Unlike the fixed square receptive field of the standard 3 × 3 convolution (Figure 3(1)), the sampling points of dynamic snake convolution are incrementally offset along a predefined “snake-like” direction (e.g., horizontal, vertical, or diagonal) (Figure 3(2)), with the offset determined adaptively by the local gradient of the input features. This mechanism allows the convolutional kernel to flexibly adapt its shape according to the geometric deformations of the target (e.g., vascular curvature, road bifurcations), forming a continuous and coherent sampling path, thereby effectively capturing the features of slender and fragile tubular structures. In contrast, the rigid structure of the standard 3 × 3 convolution struggles to accommodate complex topological variations and is prone to feature fragmentation when the edges of tubular targets are blurred or exhibit low contrast with the background. Dynamic snake convolution demonstrates excellent adaptability in ensuring complete boundary constraints and structural integrity.
As shown in Figure 4, the input with dimensions C1 × H × W is fed into three parallel branches, which use two dynamic snake convolutions and one standard convolution, respectively, to extract multi-scale features from the feature maps. This further enhances the encoder's perception of irregular boundaries in grayscale targets and strengthens the constraint on the geometric shapes of features, achieving comprehensive extraction of multi-scale features. Finally, the outputs D1, D2, and D3 from the three parallel branches are concatenated, passed through a 1 × 1 standard convolution, and then upsampled through transposed convolution and related operations before being input into the DBR for fusion with the features extracted by ResNet34. The fusion process is as follows:
$D_1 = D_3 = \gamma(BN(DSConv(input)))$
$D_2 = \gamma(BN(Conv_{1\times 1}(input)))$
$D = \gamma(Conv_{1\times 1}(Concat(D_1, D_2, D_3)))$
$output = \gamma(BN(DeConv_{1\times 1}(D)))$
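A minimal PyTorch sketch of this three-branch fusion follows. Plain 3 × 3 convolutions stand in for the two dynamic snake convolutions, and the transposed-convolution settings and channel widths are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DSCBlockSketch(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def branch(kernel):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel, padding=kernel // 2),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
        self.dsconv_x = branch(3)  # D1: snake convolution along one axis (stand-in)
        self.dsconv_y = branch(3)  # D3: snake convolution along the other axis (stand-in)
        self.conv1x1 = branch(1)   # D2: standard 1x1 convolution branch
        self.fuse = nn.Sequential(             # D = gamma(Conv1x1(Concat(D1, D2, D3)))
            nn.Conv2d(3 * out_ch, out_ch, 1),
            nn.ReLU(inplace=True))
        self.up = nn.Sequential(               # output = gamma(BN(DeConv(D)))
            nn.ConvTranspose2d(out_ch, out_ch, 2, stride=2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True))

    def forward(self, x):
        d1, d2, d3 = self.dsconv_x(x), self.conv1x1(x), self.dsconv_y(x)
        d = self.fuse(torch.cat([d1, d2, d3], dim=1))
        return self.up(d)


print(DSCBlockSketch(64, 128)(torch.randn(1, 64, 32, 32)).shape)  # (1, 128, 64, 64)
```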

2.3. Adaptive Feature Fusion Module (AFFM)

When dealing with images that lack texture and color and have low pixel occupancy, background interference factors often arise. Traditional networks typically neglect channel and spatial constraints. Additionally, infrared images are easily affected by environmental noise, and a high proportion of invalid regions and noise can impact the quality of feature fusion. This paper introduces a multi-fusion driven Adaptive Feature Fusion Module for the multi-scale feature integration of the ResNet34 branch and DSConv branch, as shown in Figure 5. The module consists of three components: feature channel embedding, attention weight computation, and feature weighted fusion. Firstly, global average pooling (GAP) is applied to local features (LFs) and global features (GFs), embedding them into the channel vector space to obtain local embedding vectors (LFVs) and global embedding vectors (GLVs), respectively. Then, the two are merged (e.g., by concatenation or summation) to obtain a local–global embedding vector, which integrates feature information at different scales. Next, a fully connected layer (FC) generates attention weights for each channel, which are normalized using a nonlinear activation function (e.g., Sigmoid), dynamically adjusting the focus on different channel features. Finally, the attention weights are used to perform channel-wise weighting of local and global features, enhancing high-weight features and suppressing low-weight features. The weighted features are then fused by pixel-wise summation or concatenation to generate the final output feature F.
$LFV = GAP(LF) \oplus GAP(GF)$
Next, the embedding vector is fed into two fully connected layers (FC), generating local attention weights (LWs) and global attention weights (GWs), respectively. Finally, the two sets of attention weights are used to reweight and fuse the corresponding features, yielding the fused feature FF, as shown below.
$LW = FC_1(LFV)$
$GW = FC_2(LFV)$
$FF = (LW \otimes LF) \oplus (GW \otimes GF)$
Compared to traditional feature fusion modules, AFFM uses channel attention to map the fused features into embedding vectors and perform weighted fusion of the different branch features, thereby achieving adaptive fusion and effectively addressing the challenge of branch feature fusion.
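The following PyTorch sketch mirrors the AFFM equations above: GAP channel embedding, two FC heads producing the local and global attention weights, and channel-wise weighted fusion. The summation used to merge the two channel embeddings, the single-layer FC heads, and the Sigmoid normalization are assumptions consistent with the description but not necessarily identical to the authors' implementation.

```python
import torch
import torch.nn as nn

class AFFMSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc_local = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.fc_global = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, lf, gf):
        # LFV = GAP(LF) (+) GAP(GF): merge the two channel embeddings by summation
        lfv = (self.gap(lf) + self.gap(gf)).flatten(1)
        lw = self.fc_local(lfv)[..., None, None]   # LW = FC1(LFV), reshaped to (B, C, 1, 1)
        gw = self.fc_global(lfv)[..., None, None]  # GW = FC2(LFV)
        # FF = (LW (*) LF) (+) (GW (*) GF): channel-wise reweighting, then fusion
        return lw * lf + gw * gf


lf = torch.randn(1, 128, 64, 64)  # e.g., ResNet34-branch features at one scale
gf = torch.randn(1, 128, 64, 64)  # e.g., DSConv-branch features at the same scale
print(AFFMSketch(128)(lf, gf).shape)  # (1, 128, 64, 64)
```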

2.4. Multi-Scale Feature Enhancement and Fusion Module (MFEFM)

In UAV-mounted infrared images, multi-view sampling by the UAV and textureless edges lead to discontinuous target representations and excessive redundant information. To mitigate these interferences, we designed a decoder structure that incorporates a multi-scale feature enhancement and fusion module with Enhanced-Scale Fusion. This design aims to address issues such as multi-scale information loss, gradient explosion, and insufficient boundary information.
Inspired by the Feature Pyramid Network (FPN), the design of the multi-scale enhancement module focuses on addressing scale discrepancies in features across different layers of deep neural networks. Shallow-layer features in traditional convolutional networks retain high-resolution details, whereas deeper layers encode stronger semantics at lower resolution. As shown in Figure 6, the MFEFM takes the outputs from three different layers of the encoder as inputs. Initially, feature fusion occurs between adjacent-scale features at deeper layers to combine high-level semantic information with low-level detailed features. Subsequently, these fused features are further merged and interact with features from the preceding layer, effectively reducing redundant features and enhancing feature expression capabilities. This process facilitates the fine integration of multi-scale features, achieving effective collaboration and optimization of information across different scales. First, low-level feature details of the image are preserved and combined with the outputs of multi-scale fusion. Next, the results obtained by fusing semantic information from the deepest layer are concatenated, thereby emphasizing the model's focus on deep semantic information. Finally, a 1 × 1 convolution operation is applied to transform the fusion results into a final feature map of size 1 × 256 × 256.
As shown in Figure 7, the ESF module concatenates the two input feature maps along the channel dimension, thereby fusing high-level and low-level information. The fused result is then subjected to dual pooling fusion, a 1 × 1 convolution, and Sigmoid activation, achieving unified integration of information from different semantic levels. Subsequently, guided by the obtained global context weights, these weights are used to perform weighted fusion with the features aggregated in the initial stage. This operation helps to enrich cross-layer interactions and highlight the diversity and comprehensiveness of multi-scale semantic information. The resulting fused output feature is the final representation after scale fusion. Here, V1 and V2 denote the two input subfeature maps, and ∀ represents the initial fusion result.
$\forall = Sigmoid\left[Conv_{1\times 1}\left(Cat\left(GAP(Cat(V_1, V_2)),\, GMP(Cat(V_1, V_2))\right)\right)\right]$
$Output = CBL(V_1) \times CBL(V_2)$
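A hedged PyTorch sketch of the ESF operation is given below. Reading "CBL" as a Conv-BN-LeakyReLU block and multiplying the sigmoid context weight into the final product are assumptions, since the extracted formula leaves the weight's placement implicit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbl(channels):
    # Conv-BN-LeakyReLU block; the "CBL" reading is an assumption
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.BatchNorm2d(channels),
        nn.LeakyReLU(0.1, inplace=True))

class ESFSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.cbl1 = cbl(channels)
        self.cbl2 = cbl(channels)
        # Cat(GAP, GMP) of the 2C-channel concatenation gives 4C channels
        self.weight = nn.Sequential(nn.Conv2d(4 * channels, channels, 1), nn.Sigmoid())

    def forward(self, v1, v2):
        cat = torch.cat([v1, v2], dim=1)                      # Cat(V1, V2)
        pooled = torch.cat([F.adaptive_avg_pool2d(cat, 1),    # GAP
                            F.adaptive_max_pool2d(cat, 1)],   # GMP
                           dim=1)
        w = self.weight(pooled)                               # sigmoid(Conv1x1(...)) -> context weight
        # Weight placement is an assumption; the extracted formula gives Output = CBL(V1) x CBL(V2)
        return w * self.cbl1(v1) * self.cbl2(v2)


v1 = torch.randn(1, 64, 64, 64)
v2 = torch.randn(1, 64, 64, 64)
print(ESFSketch(64)(v1, v2).shape)  # (1, 64, 64, 64)
```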

3. Data and Experimental Settings

3.1. Dataset Specification

In terms of dataset selection, we employed the HIT-UAV dataset [38], a specialized high-altitude infrared thermal imaging dataset constructed for drone-based object detection tasks. This dataset comprises 2898 infrared thermal images, which were extracted and curated from 43,470 frames derived from hundreds of videos captured by drones in various environments, such as campuses, parking lots, roads, and playgrounds. Additionally, the HIT-UAV dataset provides supplementary metadata for each image, including flight altitude, camera angle, date, and illumination intensity. To address the challenge of high target overlap in high-altitude imagery and to validate the algorithm’s effectiveness in detecting multi-scale targets, the dataset specifically incorporates oblique photography samples covering flight altitudes ranging from 60 to 130 m (corresponding to a ground target scale variation of 116%) and camera angles spanning 30 to 90 degrees (constructing multi-angle observation scenarios). Furthermore, it includes comparative data on illumination intensity between day and night to enhance environmental adaptability. To the best of our knowledge, HIT-UAV is the first publicly available drone dataset tailored for high-altitude infrared thermal imaging scenarios and suitable for detecting pedestrians as well as various vehicles. We partitioned the dataset into training, validation, and test sets in a ratio of 7:2:1.
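For reference, a small helper such as the following can reproduce the 7:2:1 partition; the file-name pattern and random seed are placeholders rather than details of the released dataset split.

```python
import random

def split_dataset(image_paths, ratios=(0.7, 0.2, 0.1), seed=42):
    # Shuffle deterministically, then slice into train/validation/test subsets
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]

# Example with the dataset's reported size of 2898 images
train_set, val_set, test_set = split_dataset([f"img_{i:04d}.jpg" for i in range(2898)])
print(len(train_set), len(val_set), len(test_set))  # 2028 579 291
```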

3.2. Experiment Setting

All experiments in this study were conducted under strictly controlled hardware and software conditions to ensure reproducibility of the MDDFA-Net object detection algorithm evaluation. The experimental platform comprised an Intel Xeon Silver 4210R CPU (2.40 GHz), an NVIDIA RTX 3080 GPU with CUDA 12.2 acceleration, and 64 GB DDR4 RAM. Software execution was managed through Python 3.8.18 on a Windows 10 operating system. For the training protocol, we adopted stochastic gradient descent (SGD) optimization with 200 epochs, a batch size of 8, and an initial learning rate of 0.01. To validate deployment feasibility in resource-constrained scenarios, we further benchmarked the model's inference speed on the Rockchip RK3588 edge AI (EA) computing platform. Complete experimental configurations are systematically documented in Table 1.
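A hedged sketch of this training protocol is given below; the momentum value, the loss interface of the detection model, and the data loader are assumptions, since only the optimizer, epoch count, batch size, and initial learning rate are reported.

```python
import torch

def train(model, train_loader, epochs=200, lr=0.01, device="cuda"):
    """Training loop matching the reported protocol: SGD, 200 epochs, batch size 8, lr 0.01."""
    model = model.to(device)
    # Momentum 0.9 is a common default and an assumption here (not stated in the text)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for images, targets in train_loader:  # batch size 8 is set in the DataLoader
            optimizer.zero_grad()
            # Placeholder interface: the detection model is assumed to return its loss
            loss = model(images.to(device), targets)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"epoch {epoch + 1}/{epochs} loss {running_loss / max(len(train_loader), 1):.4f}")
```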

3.3. Evaluation Criteria

To comprehensively evaluate the model's performance, we employ a variety of metrics, including precision, recall, Mean Average Precision (mAP), the number of parameters, GFLOPs, and Frames Per Second (FPS). Precision is defined as the ratio of correctly predicted positive samples to the total number of samples predicted as positive by the model. Recall measures the proportion of actual positive samples that are accurately identified by the model. Mean Average Precision (mAP) serves as a key performance indicator for multi-class detection tasks by aggregating the precision-recall curves across different categories and averaging their per-class average precision values. Specifically, mAP@0.5 indicates that the Intersection over Union (IoU) threshold is set at 0.5 during the mAP calculation. GFLOPs quantifies the number of billions of floating-point operations required by the model, reflecting its computational complexity. FPS refers to the number of images the model can process per second, i.e., the reciprocal of the time required to process a single image, thereby assessing detection speed, as shown in the formulas below.
$Precision = \frac{TP}{TP + FP}$
$Recall = \frac{TP}{TP + FN}$
$mAP = \frac{1}{M}\sum_{j=1}^{M} AP_j$
$FPS = \frac{1}{t}$
In this framework, True Positives (TPs) denote the number of positive-class instances correctly identified as positive by the model. False Positives (FPs) refer to negative-class instances incorrectly classified as positive. False Negatives (FNs) represent positive-class instances erroneously classified as negative. True Negatives (TNs) correspond to negative-class instances correctly identified as negative. The variable M denotes the total number of object classes, AP_j is the average precision of class j, and t is the time required to process a single image.
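The following short Python sketch mirrors these formulas; the per-class AP values and the timing figure used in the example call are illustrative inputs.

```python
def precision(tp, fp):
    # Precision = TP / (TP + FP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Recall = TP / (TP + FN)
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    # mAP = (1/M) * sum_j AP_j over the M object classes
    return sum(ap_per_class) / len(ap_per_class)

def fps(seconds_per_image):
    # FPS = 1 / t
    return 1.0 / seconds_per_image

# Illustrative inputs only
print(precision(90, 6), recall(90, 7), mean_average_precision([0.96, 0.95, 0.94]), fps(0.0243))
```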

4. Results

4.1. Experimental Results

We compared six methods to validate the reliability of our approach, selecting the models RCYOLO [39], BEMRF-Net [40], FIRENET [41], YOLOv11, RetinaNet [42], and Faster R-CNN [43] for comparison. The first three are currently the leading network algorithms in the field of remote sensing target detection, incorporating various feature extraction techniques such as multi-scale fusion, and are of significant reference value. The latter three are among the most advanced algorithms in public repositories. Table 2 presents the comparison results.
In all test results, our method achieved the highest precision, recall, and mAP@0.5, at 94.26%, 93.42%, and 95.43%, respectively. Compared to existing state-of-the-art (SOTA) deep learning models, these three metrics improved by 1.13% to 7.93%, 1.10% to 8.20%, and 1.23% to 8.11%, respectively. Specifically, our method outperformed the second-best model (FIRENET [41]) by 1.13% in precision, 1.10% in recall, and 1.23% in mAP@0.5. Additionally, our model is efficient in terms of computational complexity, with 17.1 M parameters and 46.1 GFLOPs, which are significantly lower than many competing models such as Faster R-CNN [43] (41.2 M parameters, 156.3 GFLOPs) and comparable to others such as RCYOLO [39] (20.9 M parameters, 48.2 GFLOPs). Despite this reduced complexity, our method maintains competitive inference speeds of 41.1 FPS (PC) and 29.7 FPS (EA), highlighting its balance between performance and efficiency. This underscores the effectiveness of our approach in achieving superior accuracy while optimizing resource utilization.

4.2. Ablation Experiments

MDDFA-Net consists of multiple modules and a complex connection structure. To fully demonstrate the roles of the dual-branch structure and the MFEFM and AFFM modules, we conducted detailed ablation experiments, as shown in Table 3, and also examined the impact of the different modules on the number of parameters and the running frame rate. Notably, for the dual-branch structure we conducted two independent comparisons: one comparing the dual-branch structure with the single-chain ResNet34 structure, and another evaluating the effectiveness of the DSConv branch for infrared image feature extraction. For the latter, we replaced the DSConv branch with the currently popular Transformer branch, which has strong global feature extraction capabilities. The original MDDFA-Net achieves optimal performance, with mAP@0.5, precision, and recall of 95.43%, 94.26%, and 93.42%, respectively. While the single-branch structure has fewer parameters, it sacrifices 4.56% in mAP@0.5. The global feature extraction capability of the Transformer branch does not benefit this task and instead introduces a large number of redundant parameters.
As illustrated in Figure 8, five ablation studies were conducted to evaluate five distinct architectures. The single-branch architecture exhibited deficiencies in edge information analysis and suboptimal multi-scale feature extraction performance (as shown in (1)). Redundant weight distributions for small targets and missed detections were observed in configurations (2), (3), and (5). Furthermore, Swin Transformer demonstrated negligible effectiveness in contextual connectivity for infrared sensor image detection tasks, failing to resolve the aforementioned limitations of single-branch architectures (evident in (1) and (2)). Ablation experiments on AFFM revealed that single-scale fusion strategies yielded inadequate feature integration with substantial redundancy (as seen in (2) and (5)). The omission of the MFEFM structure introduced significant background interference redundancy (demonstrated in (2), (3), and (4)) and compromised semantic integration between high-level and low-level features (as in (2)), resulting in insufficient weighting for regions of interest (observed in (1) and (3)).

4.3. Visualization of UAV-Mounted Infrared Image Target Detection

To test the reliability of our model’s actual detection performance, as shown in Figure 9, we conducted visualization tests on three cutting-edge public network models: YOLOv11, Faster R-CNN, and RetinaNet.
In the visualization comparison of multiple algorithms, the MDDFA-Net proposed in this study still maintains the best performance in actual detection visualization. In comparison (1), both YOLOv11 and RetinaNet exhibit varying degrees of missed detections. Specifically, the person in the middle of the resulting image was severely missed, while YOLOv11 missed the car on the far left due to its inability to analyze high-level semantics. Additionally, all three compared models incorrectly fused features of different targets when processing adjacent objects. In comparison (2), due to the lack of texture and color information in grayscale images, Faster R-CNN missed the car in the lower left corner because the background and the target blended together, and the car shared similar features with the adjacent shadow. In comparison (3), the most significant issue was the blurred and fragile edges of some car targets, with parts of the vehicle edges almost merging into the background, leading to a large number of missed detections. In particular, YOLOv11 and RetinaNet missed the bicycle target in comparison (3). The results demonstrate that the proposed method performs well in drone infrared image detection.

5. Conclusions

This paper introduces MDDFA-Net, a novel network deployed on UAVs for extracting targets from infrared images. Initially, it employs a multi-scale dynamic dual-branch structure to extract multi-level features across three scales, where the DSConv branch effectively extracts fragile dynamic edge information and global target information. Next, the same-scale branch blocks enter the AFFM, where channel embedding, attention weighting, and feature weighting are adaptively fused to obtain adaptive features at three scales; these features integrate both high-level and low-level information. Finally, through the MFEFM in the decoder, fusion between adjacent scales in the deeper layers effectively resolves the discontinuity caused by feature occlusion and attenuates redundant features, achieving refined integration of multi-scale features. Quantitative and qualitative comparisons with six SOTA networks on UAV infrared image data demonstrated the significant advantages of this method, and its generalization was verified on the public dataset. However, this work has certain limitations: (1) While the multi-scale fusion improves detection accuracy, the computational complexity of the dual-branch structure and attention mechanisms may limit real-time performance in scenarios requiring high frame rates (e.g., fast-moving UAVs). (2) The detection of extremely small targets (e.g., 10 × 10 pixels) remains challenging due to information loss in deep feature downsampling. Discussions of the method's efficiency and scalability, together with the ablation studies, further confirmed its effectiveness. Future directions will focus on lightweight model optimization (e.g., network pruning and quantization) to balance accuracy and inference speed, while exploring hybrid architectures that combine super-resolution modules to enhance small target detection. Given the diversity of infrared remote sensing images, future work will also prioritize feasibility analysis across various infrared modalities (e.g., multispectral and hyperspectral) while maintaining high accuracy and addressing domain adaptation challenges.

Author Contributions

Conceptualization, Z.W. and C.D.; methodology, Z.W. and R.Z.; investigation, Y.H. and L.W.; data curation, R.W.; writing—original draft preparation, C.D.; writing—review and editing, Z.W. and C.D.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This project is jointly funded by the Gansu Provincial Science and Technology Department Project (Grant No. 23ZDGE001), the Lanzhou Science and Technology Bureau Project (Grant No. 2023-3-19), the Major Cultivation Project of Scientific Research Platforms in Universities by the Gansu Provincial Department of Education (Grant No. 2024CXPT-17), and the Gansu Integrated Circuit Industry Research Institute.

Data Availability Statement

In accordance with the journal’s policy, we have supplemented the public access links to the data and code in the declaration section: https://github.com/DC9874/data1.git (accessed on 18 March 2025).

Acknowledgments

We are grateful to the Gansu Provincial Science and Technology Department, the Gansu Provincial Science and Technology Bureau, the Gansu Integrated Circuit Industry Research Institute, and the Gansu Microelectronics Industry Research Institute for their financial support of this research. In addition, we sincerely thank the editors and all anonymous reviewers for their constructive and excellent reviews of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sarkar, S.; Dey, A.; Pradhan, R.; Sarkar, U.M.; Chatterjee, C.; Mondal, A.; Mitra, P. Crop Yield Prediction Using Multimodal Meta-Transformer and Temporal Graph Neural Networks. IEEE Trans. AgriFood Electron. 2024, 2, 545–553. [Google Scholar]
  2. Wang, S.; Du, Y.; Zhao, S.; Gan, L. Multi-Scale Infrared Military Target Detection Based on 3X-FPN Feature Fusion Network. IEEE Access 2023, 11, 141585–141597. [Google Scholar]
  3. Han, C.; Li, N.; Zhang, T.; Dai, J. A Dual-Parameter Interrogation Fano Resonance Sensor Based on All-Oxide Multilayer Film for Biomolecule Detection. IEEE Sens. J. 2024, 24, 40725–40731. [Google Scholar]
  4. Yue, L.; Shen, H.; Yu, W.; Zhang, L. Monitoring of Historical Glacier Recession in Yulong Mountain by the Integration of Multisource Remote Sensing Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 388–400. [Google Scholar]
  5. Ji, Z.; Wang, X.; Wang, Z.; Li, G. An Enhanced and Unsupervised Siamese Network with Superpixel-Guided Learning for Change Detection in Heterogeneous Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19451–19466. [Google Scholar]
  6. Xia, C.; Li, X.; Yin, Y.; Chen, S. Multiple Infrared Small Targets Detection Based on Hierarchical Maximal Entropy Random Walk. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar]
  7. Yang, W.; Ping, J. Infrared small target detection based on local significance and multiscale. Digit. Signal Process. 2024, 155, 104721. [Google Scholar]
  8. Wu, J.; He, Y.; Zhao, J. An Infrared Target Images Recognition and Processing Method Based on the Fuzzy Comprehensive Evaluation. IEEE Access 2024, 12, 12126–12137. [Google Scholar]
  9. Wang, J.-G.; Sung, E. Facial Feature Extraction in an Infrared Image by Proxy with a Visible Face Image. IEEE Trans. Instrum. Meas. 2007, 56, 2057–2066. [Google Scholar]
  10. Huang, S.; Peng, Z.; Wang, Z.; Wang, X.; Li, M. Infrared Small Target Detection by Density Peaks Searching and Maximum-Gray Region Growing. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1919–1923. [Google Scholar]
  11. Cao, L.; Wang, Q.; Luo, Y.; Hou, Y.; Cao, J.; Zheng, W. YOLO-TSL: A lightweight target detection algorithm for UAV infrared images based on Triplet attention and Slim-nec. Infrared Phys. Technol. 2024, 41, 105487. [Google Scholar] [CrossRef]
  12. Rao, W.; Gao, L.; Qu, Y.; Sun, X.; Zhang, B.; Chanussot, J. Siamese Transformer Network for Hyperspectral Image Target Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  13. Zhou, D.; Wang, X. Robust Infrared Small Target Detection Using a Novel Four-Leaf Model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1462–1469. [Google Scholar] [CrossRef]
  14. Xie, Y.; Feng, D.; Chen, H.; Liao, Z.; Zhu, J.; Li, C.; Baik, S.W. An Omni-scale Global-Local Aware Network for Shadow Extraction in Remote Sensing Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 193, 29–44. [Google Scholar] [CrossRef]
  15. Wu, R.; Liu, G.; Lv, J.; Bao, X.; Hong, R.; Yang, Z.; Wu, S.; Xiang, W.; Zhang, R. DEM-Based Radar Incidence Angle Tracking for Distortion Analysis Without Orbital Data. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  16. Xie, Y.; Feng, D.; Chen, H.; Liu, Z.; Mao, W.; Zhu, J.; Hu, Y.; Baik, S.W. Damaged Building Detection from Post-earthquake Remote Sensing Imagery Considering Heterogeneity Characteristics. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  17. Lv, J.; Zhang, R.; Yu, B.; Pang, J.; Liao, M.; Liu, G. A GPS-IR Method for Retrieving NDVI From Integrated Dual-Frequency Observations. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  18. Xie, Y.; Zhu, J.; Lai, J.; Wang, P.; Feng, D.; Cao, Y.; Hussain, T.; Baik, S.W. An Enhanced Relation-Aware Global-Local Attention Network for Escaping Human Detection in Indoor Smoke Scenarios. ISPRS J. Photogramm. Remote Sens. 2022, 186, 140–156. [Google Scholar] [CrossRef]
  19. Pirasteh, S.; Rashidi, P.; Rastiveis, H.; Huang, S.; Zhu, Q.; Liu, G.; Li, Y.; Li, J.; Seydipour, E. Developing an algorithm for buildings extraction and determining changes from airborne LiDAR, and comparing with R-CNN method from drone image. Remote Sens. 2019, 11, 1272. [Google Scholar] [CrossRef]
  20. Feng, D.; Chen, H.; Liu, S.; Liao, Z.; Shen, X.; Xie, Y.; Zhu, J. Boundary-semantic collaborative guidance network with dual-stream feedback mechanism for salient object detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17. [Google Scholar] [CrossRef]
  21. Hang, R.; Xu, S.; Yuan, P.; Liu, Q. AANet: An ambiguity-aware network for remote-sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar]
  22. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [PubMed]
  23. Liu, W.; Quijano, K.; Crawford, M.M. YOLOv5-Tassel: Detecting Tassels in RGB UAV Imagery with Improved YOLOv5 Based on Transfer Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8085–8094. [Google Scholar] [CrossRef]
  24. Hao, M.; Zhai, R.; Wang, Y.; Ru, C.; Yang, B. A Stained-Free Sperm Morphology Measurement Method Based on Multi-Target Instance Parsing and Measurement Accuracy Enhancement. Sensors 2025, 25, 592. [Google Scholar] [CrossRef]
  25. Lu, W.; Lan, C.; Niu, C.; Liu, W.; Lyu, L.; Shi, Q.; Wang, S. A CNN-Transformer Hybrid Model Based on CSWin Transformer for UAV Image Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1211–1231. [Google Scholar] [CrossRef]
  26. Fu, Q.; Zheng, Q.; Yu, F. LMANet: A Lighter and More Accurate Multiobject Detection Network for UAV Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  27. Nie, J.; Sun, H.; Sun, X.; Ni, L.; Gao, L. Cross-Modal Feature Fusion and Interaction Strategy for CNN-Transformer-Based Object Detection in Visual and Infrared Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  28. Ou, J.; Wang, J.; Xue, J.; Wang, J.; Zhou, X.; She, L.; Fan, Y. Infrared Image Target Detection of Substation Electrical Equipment Using an Improved Faster R-CNN. IEEE Trans. Power Deliv. 2023, 38, 387–396. [Google Scholar] [CrossRef]
  29. Chang, Y.; Yan, L.; Liu, L.; Fang, H.; Zhong, S. Infrared Aerothermal Nonuniform Correction via Deep Multiscale Residual Network. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1120–1124. [Google Scholar] [CrossRef]
  30. Zhang, M.; Zhang, R.; Zhang, J.; Guo, J.; Li, Y.; Gao, X. Dim2Clear Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2127–2139. [Google Scholar]
  32. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Nat. Commun. 2014, 5, 2677. [Google Scholar]
  33. Bai, Z.; Pang, H.; He, Z.; Zhao, B.; Wang, T. Path Planning of Autonomous Mobile Robot in Comprehensive Unknown Environment Using Deep Reinforcement Learning. IEEE Internet Things J. 2024, 11, 22153–22166. [Google Scholar] [CrossRef]
  34. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [PubMed]
  35. Liang, S.; Lu, J.; Zhang, K.; Chen, X. Multi-Scale Transformer Hierarchically Embedded CNN Hybrid Network for Visible-Infrared Person Re-Identification. IEEE Internet Things J. 2024, 1. [Google Scholar]
  36. Xu, K.; Song, C.; Xie, Y.; Pan, L.; Gan, X.; Huang, G. RMT-YOLOv9s: An Infrared Small Target Detection Method Based on UAV Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. Available online: https://github.com/suojiashun/HIT-UAV-Infrared-Thermal-Dataset (accessed on 18 March 2025).
  38. Dang, C.; Wang, Z.X. RCYOLO: An Efficient Small Target Detector for Crack Detection in Tubular Topological Road Structures Based on Unmanned Aerial Vehicles. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 12731–12744. [Google Scholar]
  39. Cao, S.; Feng, D.; Liu, S.; Xu, W.; Chen, H.; Xie, Y.; Zhang, H.; Pirasteh, S.; Zhu, J. BEMRF-Net: Boundary Enhancement and Multiscale Refinement Fusion for Building Extraction from Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16342–16358. [Google Scholar]
  40. Ultralytics. YOLOv5: Version 6.0. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 January 2023).
  41. Ultralytics. YOLOv8: Version 8.0.0. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 January 2023).
  42. Wang, L.; Zhang, S.; Li, J. YOLOv10: Real-Time End-to-End Object Detection with Enhanced Performance-Efficiency Trade-Off. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  43. Jiang, M.; Gu, L.; Li, X.; Gao, F.; Jiang, T. Ship Contour Extraction From SAR Images Based on Faster RCNN and Chan–Vese Model. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar]
Figure 1. Challenges in UAV-based infrared image detection.
Figure 2. Overview of the proposed MDDFA-Net.
Figure 3. Sampling comparison between standard convolution and dynamic snake convolution.
Figure 4. DSConv block structure.
Figure 5. Adaptive feature fusion module structure.
Figure 6. MFEFM structure.
Figure 7. Enhanced-Scale Fusion.
Figure 8. Visualization of weight heatmaps in ablation studies.
Figure 9. Comparison with the visualization results of the existing SOTA model.
Table 1. Experimental environment settings.
Hardware environment | CPU | Intel(R) Xeon(R) Silver 4210R CPU @ 2.40 GHz
 | GPU | NVIDIA GeForce RTX 3080
 | RAM | 64 GB
 | Edge AI device | RK3588
Software environment | OS | Windows 10
 | CUDA Toolkit | 12.2
 | Python | 3.8.18
Training information | Optimizer | SGD
 | Epochs | 200
 | Batch size | 8
 | Learning rate | 0.01
Table 2. Comparison of 6 SOTA algorithms on HIT-UAV dataset.
Network | Precision% | Recall% | mAP@0.5% | Params (M) | FLOPs (G) | FPS (PC) | FPS (EA)
RCYOLO [39] | 91.56 | 91.12 | 93.32 | 20.9 | 48.2 | 40.3 | 29.7
BEMRF-Net [40] | 90.52 | 91.25 | 91.43 | 23.3 | 48.2 | 39.7 | 25.8
FIRENET [41] | 93.13 | 92.32 | 94.20 | 23.3 | 48.9 | 39.5 | 27.3
YOLOv11 | 92.31 | 91.33 | 93.37 | 12.5 | 37.7 | 43.7 | 30.5
RetinaNet [42] | 90.62 | 90.15 | 91.44 | 21.5 | 47.6 | 40.6 | 29.6
Faster R-CNN [43] | 86.25 | 85.22 | 87.32 | 41.2 | 156.3 | 37.6 | 23.5
Ours | 94.26 | 93.42 | 95.43 | 17.1 | 46.1 | 41.1 | 29.7
Table 3. Ablation experiments on multi-level and multi-branch structures were conducted on the HIT-UAV dataset.
Transformer Branch / Double Branch / Single Branch / AFFM / MFEFM | mAP@0.5% | Precision% | Recall% | Param (M)
– – | 95.43 | 94.26 | 93.42 | 17.12
– – – | 93.45 | 92.38 | 92.38 | 16.98
– – – | 94.24 | 93.41 | 93.85 | 16.67
– – – | 89.25 | 85.12 | 89.92 | 22.21
– – – | 90.81 | 90.33 | 90.21 | 16.73