Article

DVIF-Net: A Small-Target Detection Network for UAV Aerial Images Based on Visible and Infrared Fusion

The National Key Laboratory of Optical Engineering, The Rocket Force University of Engineering, Xi’an 710025, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(20), 3411; https://doi.org/10.3390/rs17203411
Submission received: 26 August 2025 / Revised: 26 September 2025 / Accepted: 9 October 2025 / Published: 11 October 2025


Highlights

What are the main findings?
  • A novel visible-infrared fusion network for small-target detection in UAV aerial images, DVIF-Net, is proposed, achieving mAP50 of 85.8% and 62.0% on the two cross-modal datasets DroneVehicle and VEDAI, respectively, with only 2.49 M parameters.
  • The proposed P4-level cross-modal feature fusion strategy, dual context-guided fusion module (DCGF), and edge information enhancement module (CSP-EIE) significantly enhance visible-infrared feature fusion and the representation of small-target edge features.
What is the implication of the main finding?
  • The study provides a lightweight and high-precision solution for real-time UAV-based small-target detection in complex environments such as low illumination, fog, and occlusion.
  • The proposed fusion strategy and module design offer a valuable reference for deploying multimodal detection models on resource-constrained embedded platforms.

Abstract

During UAV aerial photography tasks, influenced by flight altitude and imaging mechanisms, targets in the captured images often exhibit small size, complex backgrounds, and small inter-class differences. Under a single optical modality, the weak and poorly discriminative feature representation of targets in drone-captured images makes them easily overwhelmed by complex background noise, leading to low detection accuracy and high missed- and false-detection rates in current object detection networks. Moreover, such methods struggle to meet all-weather, all-scenario application requirements. To address these issues, this paper proposes DVIF-Net, a visible-infrared fusion network for small-target detection in UAV aerial images, which leverages the complementary characteristics of visible and infrared images to enhance detection capability in complex environments. Firstly, a dual-branch feature extraction structure is designed based on the YOLO architecture to extract features from visible and infrared images separately. Secondly, a P4-level cross-modal fusion strategy is proposed to effectively integrate features from both modalities while reducing computational complexity. Meanwhile, we design a novel dual context-guided fusion module that captures complementary features through channel attention over the visible and infrared images during fusion and enhances inter-modal interaction via element-wise multiplication. Finally, an edge information enhancement module based on a cross stage partial structure is developed to improve sensitivity to small-target edges. Experimental results on two cross-modal datasets, DroneVehicle and VEDAI, demonstrate that DVIF-Net achieves detection accuracies of 85.8% and 62%, respectively. Compared with YOLOv10n, mAP50 improves by 21.7% and 10.5% in the visible modality and by 7.4% and 30.5% in the infrared modality on the two datasets, while the model contains only 2.49 M parameters. Furthermore, compared with 15 other algorithms, the proposed DVIF-Net attains SOTA performance. These results indicate that the method significantly enhances small-target detection in UAV aerial images, offering a high-precision and lightweight solution for real-time applications in complex aerial scenarios.

1. Introduction

With the rapid advancement of unmanned aerial vehicle (UAV) platforms and sensor technologies, UAV-based target detection has demonstrated significant application potential in fields such as military reconnaissance [1], search and rescue [2], agricultural crop monitoring [3], power line inspection [4], and security monitoring [5]. However, when UAVs perform aerial photography tasks, their flight altitude usually varies with the specific mission environment and requirements, causing targets in the captured images to exhibit diverse scale characteristics. Furthermore, in most application scenarios, targets in UAV aerial images also present characteristics such as small size, large quantity, and dense distribution, which undoubtedly increases the difficulty of small-target detection. Especially in complex scenes, factors such as low resolution and environmental interference further increase the missed- and false-detection rates of small targets [6]. Images obtained by a single sensor usually cannot capture the complete information of complex scenes, and multiple sensors or imaging devices are often required to generate complementary information. Visible images are formed by reflected visible light and feature rich textures and high spatial resolution; however, under extreme conditions such as insufficient lighting and target occlusion, their imaging quality degrades, which adversely affects detection performance. In contrast, infrared images rely on the thermal radiation of the target itself, so they are less affected by lighting and weather conditions and enable all-weather operation; however, they suffer from severe loss of texture detail and low contrast between target and background, and therefore cannot fully reflect scene information. Relying solely on one type of image for object detection thus has significant limitations, especially for UAV missions that depend on visual information. Given the complementary characteristics of visible and infrared images, effectively fusing the two can retain the color, texture, and edge information of visible images while highlighting the thermal radiation of infrared targets relative to their backgrounds, thereby improving the perception and discrimination of small targets in UAV aerial images in complex environments. Therefore, building an efficient end-to-end dual-modal fusion detection network is not only the key to improving small-target detection performance but also an inevitable choice for advancing the practical engineering application of UAV visual perception systems.
Over the past few decades, visible and infrared image fusion (VIIF) technology has undergone continuous advancement. It can be roughly divided into two categories: traditional VIIF methods and deep learning-based VIIF methods. Traditional VIIF mainly includes multi-scale transform-based methods [7], sparse representation-based methods [8], subspace decomposition-based methods [9], and hybrid methods [10]. These methods rely heavily on manually designed prior knowledge, which not only leads to inefficient performance and high consumption of time and labor but also results in limited detection accuracy [11]. In contrast, deep learning-based VIIF methods can automatically learn and extract key information from source images, eliminating the cumbersome feature-design process that traditional methods require. Deep learning-based VIIF methods can generally be classified into three types: CNN-based methods [12,13], Transformer-based methods [14,15], and GAN-based methods [16]. For instance, Cai et al. [17] designed DCSFuse, an unsupervised image fusion network guided by the correlation of image features, which can adaptively integrate complementary and long-range context information. Park et al. [18] proposed a Cross-Modal Transformer Fusion (CMTFusion) algorithm for infrared and visible image fusion, which captures global interactions by extracting complementary information from the source images. Qi et al. [19] proposed a pseudo-supervised generative adversarial network (GAN) with single-scale retinal resolution (SSR) embedding for infrared and visible fusion; this method significantly enhances both contrast and naturalness in the fused image by utilizing a residual GAN embedded with the SSR module and a dense hybrid modal input strategy. Although these existing VIIF methods have achieved advanced performance, some issues still need to be addressed. Specifically, most fusion methods adopt elementary fusion strategies, such as simple concatenation or addition. This overly simplistic approach neglects the differences between modalities and leads to insufficient interaction between them, creating a clear boundary between single-modal image processing and feature fusion; as a result, cross-modal information is under-utilized and it is difficult to improve the performance of the object detection model. In addition, most current models either build a four-channel input or keep visible and infrared images in two separate branches and then merge their features in the downstream network. This leads to a large amount of redundant information in the fusion results, bringing high computational complexity and limited generalization ability, and posing a severe challenge for efficient object detection on embedded devices with limited computing resources.
To address these challenges, we propose DVIF-Net, a small-target detection method for UAV aerial images based on the fusion of visible and infrared images, to enhance the detection of small targets under the complex scene conditions of UAV aerial photography. The main contributions of this paper are as follows:
  • This paper proposes DVIF-Net, which performs cross-modal feature fusion between the backbone network and the neck network. Unlike other methods that use “Concat” or “Add” to fuse bimodal images, DVIF-Net fuses bimodal information in feature space and focuses on small-target detection tasks for UAV aerial images.
  • A P4-level cross-modal feature fusion strategy is designed, which achieves a better fusion effect by performing a single fusion at an appropriate position (rather than multiple fusions) and requires fewer model parameters.
  • A novel dual-context information-guided fusion module (DCGF) is proposed, which uses an SE attention mechanism to capture and exploit important context information during dual-modal feature fusion and then guides the model to learn this information, thereby enhancing detection performance. In addition, DCGF employs an element-wise multiplication strategy to strengthen the interaction between visible and infrared feature information.
  • An edge information enhancement module, CSP-EIE, based on a cross stage partial structure is proposed, which employs AdaptiveAvgPool to expand the receptive field of the network, enabling the extraction of high-frequency image information across different scales. Meanwhile, an edge enhancer is incorporated to extract target edge features, thereby improving the backbone network’s capability to extract features of small targets.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the proposed DVIF-Net and explains the structural framework and working principle of the model. Section 4 verifies the effectiveness and generalization of the proposed method through experiments. Section 5 discusses the limitations and future work, and Section 6 concludes the paper.

2. Related Work

2.1. Small Object Detection Method Based on Deep Learning

Early small-target detection methods mainly relied on manually designed feature detectors. In the field of visible image detection, traditional methods typically employed the Histogram of Oriented Gradient (HOG) [20] and Support Vector Machines (SVM) [21,22,23] for detection. These methods largely depend on manual labor for feature design and selection, and their accuracy, objectivity, robustness, and generalization are all restricted to a certain extent; as a result, detection accuracy and speed are not ideal, which limits practical application. In the field of infrared image detection, traditional methods include filter-based methods [24,25], local contrast enhancement-based methods [26], human visual system (HVS)-based methods [27,28], and low-rank methods [29,30], all of which are widely used in infrared small-target detection. These methods are often limited by manually tuned hyperparameters; they are not robust to variations in target size and shape and can only be used in specific environments and tasks, lacking adaptability and objectivity.
With the development of deep learning, target-detection algorithms based on convolutional neural networks have gradually become a research hotspot. They can automatically learn image features without human intervention, greatly improving both the accuracy and efficiency of detection. Deep learning object detection algorithms are mainly divided into two categories according to their processing paradigm: two-stage algorithms and one-stage algorithms. Two-stage detection algorithms first generate candidate regions and then use convolutional neural networks to classify and regress them. Two-stage algorithms offer high accuracy and robustness in complex scenes and small-target detection, but their main disadvantages are high computational complexity and poor real-time performance. Common two-stage algorithms include R-CNN [31], SPP-Net [32], Fast R-CNN [33], and Faster R-CNN [34]. In contrast, one-stage detection algorithms such as YOLO [35,36,37] and SSD [38] transform the localization problem into a regression problem, directly predicting the bounding box and category of the target from the input image, and thus offer faster detection speed and lower computational complexity. Among them, the YOLO family is widely popular for small object detection in UAV aerial images due to its outstanding detection capability. For instance, Zhang et al. [39] proposed a lightweight feature enhancement, fusion, and context-aware YOLO detector, which addressed the low detection accuracy of small targets in aerial images. Di et al. [40] proposed UAV-YOLO, a lightweight and high-precision model based on YOLOv8s, which incorporates Dual Separable Convolution (DSC) and an SPPL module to improve multi-scale target robustness through cross-level feature correlation. In the field of infrared target detection, Luo et al. [41] designed the YOLO-SMUG model to address the high computational requirements and accuracy limitations of UAV infrared target detection. Wu et al. [42] enhanced infrared small-target feature extraction by designing a Swin-convolution backbone and created a multi-scale atrous spatial pyramid pooling module (MASPPM) to improve feature representation, thereby achieving more accurate detection. Zhang et al. [43] proposed NOC-YOLO, which integrates attention mechanisms and multi-scale feature fusion to boost detection accuracy for small vehicles in aerial infrared images. Although the above methods have made significant progress in the accuracy and lightweight design of single-modal object detection, their ability to detect objects under extreme lighting and weather conditions still has fundamental limitations, especially for small targets. Therefore, researchers have gradually extended single-modal detection methods to multimodal image fusion detection.

2.2. Infrared and Visible Light Image Fusion Method

Image fusion is a process that exploits the complementary characteristics of multimodal image features to fuse effective information from multi-source data, thereby obtaining fused images that incorporate multimodal image features. The fused images usually conform better to the visual characteristics of humans and machines and are more conducive to subsequent target recognition and detection [44]. Infrared and visible images exhibit different features because they are captured in different bands. Fusing infrared and visible images, whose information is complementary, provides richer features than a single source and can enhance target detection accuracy. Based on the fusion level, methods for fusing infrared and visible images can currently be classified into three types: pixel-level fusion, feature-level fusion, and decision-level fusion.
Pixel-level fusion methods first register the infrared and visible images, then apply a fusion algorithm at the pixel level to obtain the fused image, which is subsequently fed into a detector for target detection. Schnelle et al. [45] studied spatial-domain and pyramid-based pixel-level visible and infrared fusion methods, demonstrating that fusion based on the ratio of low-pass pyramids has lower computational cost and superior detection performance. Zhao et al. [46] proposed an infrared and visible image fusion method based on target-detection meta-feature embedding; the core idea is to design a meta-feature embedding model that generates target semantic features according to the capability of the fusion network, making the semantic features naturally compatible with the fused features, while the adaptability of the model is further enhanced by simulating the optimization strategy of meta-learning. Que et al. [47] proposed a “fast local salient region detection” fusion algorithm. This algorithm first adopts a structured decomposition method to decompose the infrared and visible images into a detail layer and a base layer, then uses the weight map obtained by visual saliency detection (VSD) to guide base-layer fusion, and finally fuses the detail layer and base layer according to the principle of entropy maximization to obtain the final fused image. This method retains and highlights detailed information in the image, enriching the image data. Although pixel-level fusion retains the most complete feature information, its computational complexity is relatively high, making it unsuitable for embedded devices with limited resources.
Feature-level fusion methods first convert the input infrared and visible data into high-dimensional feature representations and then fuse them under different fusion rules; the fused high-dimensional features are then fed into the detection model for target detection. In general, fused features offer richer semantic and more detailed information than single-modality features, thereby enhancing the detection accuracy and robustness of the model. Cheng et al. [48] proposed SLBAF-Net, a lightweight dual-modal fusion network based on feature-level fusion for UAV target detection under complex lighting and weather conditions; this network uses a dual-modal adaptive fusion module (BAFM) to adaptively fuse visible and infrared feature maps, enhancing robustness for small-target detection. Wang et al. [49] proposed a cross-scale dynamic convolution-driven YOLO fusion network (CDC-YOLOFusion), which introduces a cross-scale dynamic convolution fusion module (CDCF) to achieve adaptive bimodal feature extraction and fusion based on the data distribution. Sun et al. [50] proposed the TSFusion image fusion framework, which uses a teacher-student learning approach to achieve higher-quality fusion with high computational performance; they also introduced multi-scale loss functions to guide the student network to learn from the teacher at the semantic and pixel levels. Feature-level fusion methods not only strike a balance between accuracy and speed but also allow flexible adjustment of fusion levels and methods, and have thus become a key research direction in infrared and visible fusion target detection.
Decision-level fusion is the highest-level fusion approach. It extracts features from each modality separately to obtain deep semantic information, trains detectors to obtain per-modality detection results, and then scores the results of the different modalities to determine the final detection output. Yao et al. [51] proposed IVF-Mask R-CNN, a decision-level fusion object detection model based on Mask R-CNN; this model adopts strategies such as reducing the learning rate and magnifying the test image to enhance accuracy and stability, compensating for the limitations of a single modality in object detection. Chen et al. [52] proposed a non-learning multimodal detection-result fusion method, the probabilistic ensemble technique ProbEn; ProbEn does not require any multimodal data for training and can handle missing modalities through probabilistic marginalization. Compared with the other two approaches, decision-level fusion offers good fault tolerance and high flexibility, but it incurs a large loss of information during fusion, which is not conducive to the precise localization of small targets.

3. Methodology

3.1. Overall Framework

Building upon YOLOv10 [53], this paper proposes DVIF-Net, a visible and infrared image fusion network for small-target detection in UAV aerial images. Its structure is shown in Figure 1. The network effectively integrates the features of visible and infrared images and focuses on small-target detection in UAV aerial images. Firstly, a novel P4-level cross-modal feature fusion strategy is proposed for fusing visible and infrared images; compared with traditional mid-term fusion methods, this strategy not only achieves better fusion effects but also requires fewer model parameters. Secondly, to better utilize the complementary information between the two modalities, we introduce a dual-context information guide fusion module, DCGF (see Appendix A). The element-wise multiplication strategy in DCGF not only enhances the original features of the input image but also introduces the feature information of the other modality. Finally, we propose an edge information enhancement module, EIE, and apply it within the cross stage partial structure. This module significantly enhances the boundary perception of small-target features through the two-stage processing of AdaptiveAvgPool and EdgeEnhancer. In addition, it features cross-modal compatibility: it not only enhances the detailed information of visible images but also strengthens the thermal radiation contours of infrared images. Furthermore, Figure 2 presents the detailed structure of the DVIF-Net backbone network and each component module.

3.2. P4-Level Cross-Modal Feature Fusion Strategy

Most fusion networks typically employ traditional mid-fusion methods for visible and infrared image fusion [54,55,56], as illustrated in Figure 3a. This approach first uses two separate backbone networks to extract features from the visible and infrared images, generating multi-scale feature pyramids. In the downstream network, feature maps from the two modalities are fused at stages P3 to P5 (in some cases across all backbone stages) using operations such as “Concat” or “Add”. The fused features are then passed to the detection head for prediction. Although this fusion strategy can leverage features from different levels, it still has certain limitations. Due to the physical differences between visible and infrared images, excessive fusion across multiple levels may introduce interfering information, thereby reducing fusion effectiveness. Additionally, multi-level fusion can lead to significant redundancy in the fused outputs, increasing computational costs and making it difficult to meet the real-time detection requirements of UAVs.
To address these issues, this paper proposes a P4-level cross-modal feature fusion strategy, which effectively utilizes P4 features to enhance interaction between the two modalities while reducing redundant information. As shown in Figure 3b, fusion is performed specifically at the P4 level, because earlier fusion may result in insufficient feature extraction, while later fusion could lose information that is sensitive to small targets. The P4-level feature maps retain both rich semantic information and fine-grained details. Specifically, after feature extraction via two independent backbone networks, the visible and infrared features are processed and enhanced by a fusion module at the P4 level before being passed to the neck network. This fusion approach not only improves detection accuracy and performance but also reduces model parameters and computational complexity.
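To make the strategy concrete, the following is a minimal PyTorch sketch of the P4-level fusion routing, assuming each backbone exposes its {P3, P4, P5} features as a dictionary; the class and key names are hypothetical, and which branch supplies the non-fused levels to the neck is an assumption, since the text above only specifies that cross-modal fusion happens once, at P4.

```python
# Minimal sketch of P4-level cross-modal fusion (names are hypothetical).
import torch
import torch.nn as nn


class DualBackboneP4Fusion(nn.Module):
    def __init__(self, backbone_vi: nn.Module, backbone_ir: nn.Module, fuse_p4: nn.Module):
        super().__init__()
        self.backbone_vi = backbone_vi  # visible branch, returns {"P3", "P4", "P5"}
        self.backbone_ir = backbone_ir  # infrared branch, returns {"P3", "P4", "P5"}
        self.fuse_p4 = fuse_p4          # fusion module applied once, at the P4 level only

    def forward(self, x_vi: torch.Tensor, x_ir: torch.Tensor) -> dict:
        feats_vi = self.backbone_vi(x_vi)
        feats_ir = self.backbone_ir(x_ir)
        fused_p4 = self.fuse_p4(feats_vi["P4"], feats_ir["P4"])  # single cross-modal fusion
        # Assumption: the non-fused pyramid levels are taken from the visible branch;
        # only P4 carries cross-modal information into the neck.
        return {"P3": feats_vi["P3"], "P4": fused_p4, "P5": feats_vi["P5"]}
```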
Figure 1. Overview of DVIF-Net.
Figure 2. Detailed diagram of the backbone network structure of DVIF-Net.
Figure 3. Comparison between traditional mid-term feature fusion and P4 mid-term feature fusion strategies.

3.3. Dual-Context Information Guide Fusion Module

Typically, in aerial target detection tasks performed by UAV, images captured by visible cameras contain rich texture and detail information but are sensitive to illumination changes. In contrast, images captured by infrared cameras are insensitive to illumination changes and can highlight thermal targets, yet they suffer from low spatial resolution and a lack of texture details. Therefore, integrating the high-resolution texture information from visible images with the thermal radiation information from infrared images can enhance the performance of UAV-based target detection. However, most existing methods simply concatenate or add the two modalities, neglecting the differences between them and resulting in insufficient interaction of information across modalities.
To more effectively utilize and integrate the complementary information from visible and infrared images, this paper proposes a dual context-guided fusion module (DCGF) based on a channel attention mechanism. This module enhances the original features of the input feature maps while introducing complementary information from the other modality. The structure of DCGF is illustrated in Figure 4. Specifically, we first concatenate the input visible features $F_{vi}$ and infrared features $F_{ir}$ along the channel dimension. The concatenated features are then fed into a global average pooling module to compress the spatial information, enlarging the receptive field of the network so that it can learn the context information of each channel. Next, the compressed features pass through two cascaded pointwise convolutional layers and a non-linear Sigmoid function to generate attention weights. These weights are applied to the original features via element-wise multiplication, and the resulting weighted features are added to the original features of the other branch to enhance their representational capacity. The feature enhancement process can be expressed as
$$\hat{F}_{ir} = F_{ir} \oplus \left( F_{vi} \otimes \delta\!\left( \mathrm{Pw\text{-}Conv}_n\!\left( \mathrm{GAP}\!\left( C(F_{ir}, F_{vi}) \right) \right) \right) \right)$$
$$\hat{F}_{vi} = F_{vi} \oplus \left( F_{ir} \otimes \delta\!\left( \mathrm{Pw\text{-}Conv}_n\!\left( \mathrm{GAP}\!\left( C(F_{ir}, F_{vi}) \right) \right) \right) \right)$$
where $\hat{F}_{vi}$ and $\hat{F}_{ir}$ represent the enhanced visible and infrared features, $\oplus$ denotes element-wise summation, and $\otimes$ denotes element-wise multiplication. The element-wise multiplication operation adds the important infrared features from the channel attention output to the visible features, while also injecting the important visible features into the infrared features. Compared with plain feature addition, this operation allows the important channels in the infrared features to be enhanced by the visible features, while the key information in the visible features is guided by the infrared features, thereby improving the interaction between the two modalities. $\delta$ represents the Sigmoid function, $\mathrm{Pw\text{-}Conv}_n$ represents $n$ cascaded pointwise convolutional layers ($n = 2$), $\mathrm{GAP}$ represents the global average pooling operation, and $C$ represents concatenation along the channel dimension.
Finally, the enhanced visible and infrared features are concatenated for subsequent multi-scale feature fusion. The concatenation process can be summarized as
$$F_{fusion} = C\!\left( \hat{F}_{ir}, \hat{F}_{vi} \right)$$
where $F_{fusion}$ represents the features after concatenation.
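As a concrete illustration, the following is a minimal PyTorch sketch of the DCGF computation in the two equations above; the intermediate channel width of the two cascaded pointwise convolutions and the activation between them are assumptions, since the text only fixes their kernel size (k = 1, Appendix A) and the final Sigmoid.

```python
# Minimal sketch of DCGF: channel attention over the concatenated features guides
# a cross-modal element-wise multiplication, followed by residual addition.
import torch
import torch.nn as nn


class DCGF(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):  # reduction ratio is an assumption
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # GAP over the concatenated features
        self.pw = nn.Sequential(            # two cascaded pointwise (1x1) convolutions
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),          # assumed non-linearity between the two Pw-Convs
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                   # delta(.): attention weights in (0, 1)
        )

    def forward(self, f_vi: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        w = self.pw(self.gap(torch.cat([f_ir, f_vi], dim=1)))  # channel attention weights
        f_ir_hat = f_ir + f_vi * w  # infrared branch enhanced by weighted visible features
        f_vi_hat = f_vi + f_ir * w  # visible branch enhanced by weighted infrared features
        return torch.cat([f_ir_hat, f_vi_hat], dim=1)           # F_fusion
```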

3.4. Edge Information Enhancement Module

In actual tasks, when UAVs capture images, their fast maneuvering speed, high flight altitude, and complex shooting scenes often cause targets to appear small and with blurred edges, which affects detection accuracy. To this end, we propose an edge information enhancement module (EIE) and embed it into the cross stage partial (CSP) structure, naming it CSP-EIE (see Appendix A). This module enhances the network’s ability to extract edge features from small targets through a multi-scale edge enhancement module and a cross-scale connection structure. To fully leverage the advantages of the EIE module, we integrate it into the small-object feature extraction stage of DVIF-Net, specifically the P2 stage. The EIE module not only effectively enhances the texture and edge details in visible images but also strengthens the thermal radiation contours in infrared images, demonstrating excellent cross-modal compatibility. By simultaneously improving the edge feature representation of both modalities, EIE enables the network to more accurately capture and fuse critical information from visible and infrared images, thereby enhancing the model’s performance in detecting small objects in complex scenes. The structure of the EIE module is shown in Figure 5, and its specific process is as follows.
Given a feature map $X \in \mathbb{R}^{C \times H \times W}$, let $S = \{s_1, s_2, s_3, \ldots, s_n\}$ denote a set of $n$ pooling scales. EIE first performs an AdaptiveAvgPool operation on the input feature map $X$ to generate feature maps of size $s \times s$ (here $s \in \{3, 6, 9, 12\}$), enabling EIE to capture features at different levels. Each scale branch independently processes information of a different granularity: small scales (such as 3 × 3) capture the global context of the feature map, while large scales (such as 12 × 12) retain local details. Each branch then passes through a 1 × 1 convolution that compresses the channels to C/4 to reduce the computational load of the module, followed by a 3 × 3 group convolution to enhance local features. Subsequently, the feature maps of different scales are aligned to a unified size through bilinear interpolation, and the EdgeEnhancer is used to highlight feature edges, enabling the network to better perceive the boundaries and details of small targets.
Meanwhile, in another branch, the input feature map undergoes a local convolution to obtain local information containing the initial detailed features. This local information is then concatenated with the edge-enhanced features, and a convolutional layer fuses them into a unified feature representation.
The EdgeEnhancer in Figure 5 enhances the edge information of the feature map. It applies AvgPool2d to smooth the input feature map and extract its low-frequency information. The enhanced edge information (high-frequency information) is then obtained by subtracting the smoothed feature map from the original input. A convolution operation further processes this edge information, and the result is added to the original input feature map to form the enhanced output. The process of EdgeEnhancer is expressed as follows:
$$E(X) = X \oplus \mathrm{Conv}\!\left( X - \mathrm{AvgPool}(X) \right)$$
where $\oplus$ denotes element-wise summation.
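For clarity, a minimal PyTorch sketch of the EIE multi-scale branch and the EdgeEnhancer operation described above is given below; the EdgeEnhancer pooling kernel, the exact placement of the edge enhancement (applied here per scale), and the channel width of the local branch are assumptions, while the pooling scales {3, 6, 9, 12}, the C/4 compression, and the group convolution (g = 4) follow the text and Appendix A.

```python
# Minimal sketch of EdgeEnhancer and the EIE multi-scale pooling branch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeEnhancer(nn.Module):
    """E(X) = X + Conv(X - AvgPool(X)): keep and re-inject the high-frequency residual."""

    def __init__(self, channels: int, pool_kernel: int = 3):
        super().__init__()
        # Stride-1 average pooling with padding keeps the spatial size unchanged.
        self.smooth = nn.AvgPool2d(pool_kernel, stride=1, padding=pool_kernel // 2)
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv(x - self.smooth(x))  # add processed edge information back


class EIE(nn.Module):
    """Multi-scale pooled branches plus a local branch, concatenated and fused."""

    def __init__(self, channels: int, scales=(3, 6, 9, 12)):
        super().__init__()
        mid = channels // 4  # assumes channels divisible by 16 so that groups=4 works
        self.scales = scales
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, mid, kernel_size=1),                  # compress to C/4
                nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=4),  # 3x3 group conv
            )
            for _ in scales
        )
        self.edge = EdgeEnhancer(mid)
        self.local = nn.Conv2d(channels, mid, kernel_size=3, padding=1)   # local branch
        self.fuse = nn.Conv2d(mid * (len(scales) + 1), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = []
        for s, branch in zip(self.scales, self.branches):
            y = branch(F.adaptive_avg_pool2d(x, s))                       # pool to s x s
            y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
            outs.append(self.edge(y))                                     # highlight edges
        outs.append(self.local(x))                                        # initial details
        return self.fuse(torch.cat(outs, dim=1))
```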

3.5. Loss Function

The design of the DVIF-Net loss function is consistent with that of YOLOv10 and mainly consists of three parts: the distribution focal loss, the object classification loss, and the object localization loss, denoted by $L_{dfl}$, $L_{cls}$, and $L_{box}$, respectively. The overall loss function is expressed as follows:
$$L_{total} = \lambda_1 L_{dfl} + \lambda_2 L_{cls} + \lambda_3 L_{box}$$
where each $\lambda$ is a hyperparameter representing the weight of the corresponding term. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 1.5, 0.5, and 7.5, respectively, and can be adjusted according to the actual task during training.
$L_{dfl}$ is the Distribution Focal Loss (DFL), which enables the network to quickly focus on values near the labeled positions and maximize their probabilities. Its calculation formula is as follows:
$$L_{dfl} = \sum_{i=0}^{K \times K} \sum_{p=0}^{3} I_{ij}^{obj} \, \mathrm{DFL}(s_i, s_{i+1})$$
$$\mathrm{DFL}(s_i, s_{i+1}) = -\left[ (y_{i+1} - y)\log(s_i) + (y - y_i)\log(s_{i+1}) \right]$$
where $K \times K$ represents the number of grid cells on the feature maps of different scales (e.g., for a 640 × 640 input image, the grids are 20 × 20, 40 × 40, and 80 × 80, respectively), and $p$ indexes the four predicted coordinate values. DFL regresses the prediction box in a probabilistic manner and requires the hyperparameter reg_max (default 16) to be set in advance, so the output channel of this network branch is 4 × reg_max. Sixteen fixed reference values $A \in \{0, 1, 2, \ldots, 15\}$ are set, corresponding to each position of reg_max. The softmax function is applied over these reg_max values, treating the regression as a 16-class classification problem. Target position coordinates obtained from the feature map generally do not fall exactly on grid corners, but the labels require integers. Taking the prediction $x_{min}$ as an example, its true value is $y$, the left integer is $y_i$, and the right integer is $y_{i+1}$; $y_{i+1} - y$ and $y - y_i$ are the distance weights to the true value, and $s_i$ and $s_{i+1}$ are the predicted probabilities of $y_i$ and $y_{i+1}$, respectively.
The object classification loss uses Binary Cross-Entropy (BCE) loss, and its calculation formula is as follows:
$$L_{cls} = -\sum_{i=0}^{K \times K} I_{ij}^{obj} \sum_{c \in \mathrm{classes}} \left[ P_{ij}(c) \log \hat{P}_{ij}(c) + \left(1 - P_{ij}(c)\right) \log\!\left(1 - \hat{P}_{ij}(c)\right) \right]$$
where $K \times K$ has the same meaning as in the DFL formula above, and $I_{ij}^{obj}$ indicates whether a target exists in the $j$-th prediction box of the $i$-th grid cell (1 for existence, 0 otherwise). $c$ represents the target category, and $P_{ij}(c)$ and $\hat{P}_{ij}(c)$ represent the true probability and the predicted probability, respectively, that the target belongs to a specific category.
The object localization loss adopts the CIoU loss function, which measures the differences based on the overlapping area, the distance between the bounding-box centers, and the similarity of the aspect ratios, making bounding-box regression more stable. The calculation formula for CIoU is as follows:
$$L_{box} = L_{CIoU} = 1 - \frac{\left| B_{gt} \cap B_{prd} \right|}{\left| B_{gt} \cup B_{prd} \right|} + \frac{\rho^2\!\left( B_{gt}, B_{prd} \right)}{(w^{c})^2 + (h^{c})^2} + \alpha v$$
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$
$$v = \frac{4}{\pi^2} \left( \arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_{prd}}{h_{prd}} \right)^2$$
Some of the parameters involved in the formula are shown in Figure 6; $B_{gt}$ is the area of the ground-truth box and $B_{prd}$ is the area of the predicted box. $\rho^2(B_{gt}, B_{prd})$ denotes the squared distance between the centers of the two boxes, and $w^{c}$ and $h^{c}$ are the width and height of the minimum enclosing box of the two boxes. $\alpha$ is a balance term used to trade off the center-point position of the bounding box against its size, and $v$ measures the consistency of the aspect ratios of the two boxes. The heights and widths of the predicted box and the ground-truth box are $h_{prd}$, $w_{prd}$, $h_{gt}$, and $w_{gt}$, respectively.
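For reference, the CIoU term above can be computed as in the following sketch for axis-aligned boxes given in (x1, y1, x2, y2) form; this is an illustrative reimplementation of the standard CIoU formula, not the training code of DVIF-Net.

```python
# Illustrative CIoU loss for boxes of shape (..., 4) in (x1, y1, x2, y2) format.
import math
import torch


def ciou_loss(box_prd: torch.Tensor, box_gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # IoU term: intersection over union of the two boxes.
    x1 = torch.max(box_prd[..., 0], box_gt[..., 0])
    y1 = torch.max(box_prd[..., 1], box_gt[..., 1])
    x2 = torch.min(box_prd[..., 2], box_gt[..., 2])
    y2 = torch.min(box_prd[..., 3], box_gt[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    w_p, h_p = box_prd[..., 2] - box_prd[..., 0], box_prd[..., 3] - box_prd[..., 1]
    w_g, h_g = box_gt[..., 2] - box_gt[..., 0], box_gt[..., 3] - box_gt[..., 1]
    iou = inter / (w_p * h_p + w_g * h_g - inter + eps)

    # rho^2 / c^2: squared center distance over the enclosing-box diagonal.
    cx_p, cy_p = (box_prd[..., 0] + box_prd[..., 2]) / 2, (box_prd[..., 1] + box_prd[..., 3]) / 2
    cx_g, cy_g = (box_gt[..., 0] + box_gt[..., 2]) / 2, (box_gt[..., 1] + box_gt[..., 3]) / 2
    rho2 = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2
    cw = torch.max(box_prd[..., 2], box_gt[..., 2]) - torch.min(box_prd[..., 0], box_gt[..., 0])
    ch = torch.max(box_prd[..., 3], box_gt[..., 3]) - torch.min(box_prd[..., 1], box_gt[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its balance weight alpha.
    v = (4 / math.pi ** 2) * (torch.atan(w_g / (h_g + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```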

4. Experiment

4.1. Dataset

The DroneVehicle cross-modal dataset [57] is a large-scale visible-infrared remote sensing vehicle dataset created by Tianjin University in 2020. It contains 56,878 aerial images, including 452,570 visible vehicle targets and 500,517 infrared vehicle targets, for a total of 953,087 annotated vehicle targets. The images cover vehicles captured under various weather conditions (sunny, rainy, and foggy), from different angles (15°, 30°, and 45°) and heights (80 m, 100 m, and 120 m), and in different environmental scenes such as highways, urban roads, residential areas, and parking lots. Figure 7 shows some representative visible and infrared images from the DroneVehicle dataset. The images not only present diverse scenes but also span different time periods, from day to night, including low-light conditions. The target vehicles differ in size and exhibit characteristics such as small size, dense occlusion, and similar appearance, which greatly increase the difficulty for target detection models.
The DroneVehicle dataset is divided into a training set, a validation set, and a test set according to the ratio 6.3:0.5:3.2. Furthermore, since each image in this dataset has a 100-pixel-wide white border at the top, bottom, left, and right, we preprocess the original images to remove the surrounding white border. Table 1 details the number of labels for each vehicle type in both the visible and infrared images of the DroneVehicle dataset, covering five categories: car, truck, bus, van, and freight car. The dataset provides rich training data for vehicle recognition and classification tasks, and in particular offers data support for identifying multiple vehicle types.

4.2. Experimental Platform and Parameters

To evaluate the effectiveness of the DVIF-Net method in UAV aerial image detection tasks, ablation and comparative experiments are designed based on the DroneVehicle dataset. Table 2 and Table 3 present the versions and configuration of the experimental environment, as well as the relevant parameter settings used during training, to ensure the repeatability and fairness of the experiments.

4.3. Assessment Indicators

This article uses Precision (P), Recall (R), mAP (mean Average Precision), and Parameters as evaluation metrics to assess the model’s performance. mAP is the mean of the average precision over all categories (mAP50 uses an IoU threshold of 0.5, and mAP50-95 averages over IoU thresholds from 0.5 to 0.95). Params is the number of parameters in the model and quantifies the computing and storage resources required. The specific formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R) \, dR$$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, and N indicates the number of categories.
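As a small worked example of the Precision and Recall definitions above (mAP additionally averages the area under the precision-recall curve over the N classes and is normally computed by the evaluation toolkit rather than by hand):

```python
# Tiny illustration of the Precision/Recall formulas from the counts TP, FP, FN.
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # fraction of detections that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # fraction of ground-truth objects found
    return precision, recall


# Example: 90 true positives, 10 false positives, 30 missed targets.
p, r = precision_recall(90, 10, 30)  # p = 0.90, r = 0.75
```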

4.4. Comparison Experiment

To verify the performance of the DVIF-Net method, comparative experiments are conducted on the DroneVehicle cross-modal dataset against 15 widely adopted and state-of-the-art methods. These include classic detection algorithms (RetinaNet [58], Faster R-CNN, Mask R-CNN [59], Cascade Mask R-CNN [60], and RoITransformer [61]), lightweight YOLO-series models (YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, and YOLOv12n), and state-of-the-art visible and infrared fusion object detection models (CFT [54], UA-CMDet [48], RemoteDet-Mamba [62], CDC-YOLOFusion [63], and DE-YOLO [56]). The specific experimental results are shown in Table 4. Under identical conditions, we conducted single-modal (visible or infrared images) and multimodal (visible-infrared fusion) comparative experiments to analyze the performance metrics of each model.
As shown in Table 4, among single-modal detection models, infrared-based models generally outperform visible-based models. This is because visible images provide limited feature information under nighttime or occluded conditions, whereas infrared data, being less affected by lighting, can clearly capture target characteristics. Furthermore, multimodal detection methods exhibit superior performance compared to single-modal approaches, as they integrate the textural details of visible images and the thermal radiation information of infrared images. Among all multimodal detection methods, our proposed DVIF-Net achieves the best performance, delivering the highest detection accuracy with the fewest parameters. Specifically, compared to YOLOv10n, DVIF-Net improves mAP50 by 21.7% for the visible modality and 7.4% for the infrared modality, with only a 0.23 M increase in Params, demonstrating its effective integration of complementary features from both modalities. Additionally, compared with algorithms such as RetinaNet, Faster R-CNN, Mask R-CNN, Cascade Mask R-CNN, and RoITransformer, our method shows significant improvements in both accuracy and lightweight design. Compared with lightweight YOLO-series models such as YOLOv5n, YOLOv8n, YOLOv11n, and YOLOv12n, the proposed DVIF-Net achieves the highest accuracy, despite having slightly more parameters than YOLOv10n. Moreover, our method improves detection performance across all target categories, indicating strong robustness in detecting objects at various scales. Similarly, compared with other state-of-the-art fusion models including CFT, UA-CMDet, RemoteDet-Mamba, CDC-YOLOFusion, and DE-YOLO, DVIF-Net achieves the best results in mAP50, model parameters, and FLOPs, outperforming CFT (+13.3, −42.04, −115.3), UA-CMDet (+21.8, −137.21, −89.9), RemoteDet-Mamba (+4.0, −68.65, −76.3), CDC-YOLOFusion (+5.6, −151.11, −40.7), and DE-YOLO (+2.2, −3.51, −10.5).
In summary, the proposed DVIF-Net demonstrates high detection accuracy and strong adaptability and robustness for small-target detection in UAV aerial images. Meanwhile, for UAV embedded devices with limited resources, our method effectively alleviates the problem that the large computational load of detection models hinders their practical application.

4.5. Ablation Experiment

To validate the practicality of the three proposed improvements, we use the DroneVehicle cross-modal dataset to evaluate the impact of each improvement on the model in sequence. The ablation results for the proposed DVIF-Net are presented in Table 5 and Figure 8. As shown in Table 5, we design five experimental settings. In the first experiment, all improvements are removed, and the backbone networks are fused directly with “Concat” at stages P3–P5, i.e., the traditional fusion method. In the second experiment, we train with the proposed P4-level cross-modal feature fusion method, with the DCGF and CSP-EIE modules still removed. In the third experiment, “Concat” is replaced with the DCGF fusion module on top of P4-level fusion. In the fourth experiment, within the DVIF-Net framework, the DCGF fusion module is replaced with “Concat” for feature fusion. In the fifth experiment, all improvements are added to the model, forming the final visible and infrared fusion framework DVIF-Net. From the results in Table 5, we can draw the following conclusions:
  • Compared with the conventional P3–P5 multi-stage fusion approach, the proposed P4-level cross-modal fusion strategy significantly enhances model performance, increasing mAP50 from 81.0% to 81.5% while reducing model parameters by 0.96 M. This improvement demonstrates the effectiveness of the strategy: performing a single fusion solely at the P4 level not only decreases computational complexity but also achieves better fusion performance.
  • By replacing the “Concat” operation with the DCGF fusion module for feature integration, we observe an overall upward trend in the accuracy metrics P, R, mAP50, and mAP50-95, with a notable improvement of 2.7% in mAP50. This indicates that the DCGF module effectively combines complementary information from visible and infrared images, thereby enhancing feature fusion performance.
  • Even without the DCGF module, using CSP-EIE alone still improves model performance, raising mAP50 from 81.9% to 83.4% while keeping the Params at only 2.477 M. This improvement can be attributed to the combined effect of the AdaptiveAvgPool layer and the EdgeEnhancer in CSP-EIE, which enhance the backbone network’s ability to extract edge information from small targets.
  • The combination of P4Fusion, DCGF, and CSP-EIE achieves the best performance, yielding an mAP50 of 85.8% with a total of 2.485 M model parameters. This result confirms the effectiveness of the proposed strategies and demonstrates a significant improvement in detection accuracy for small targets in UAV aerial images under complex scenarios.
In addition, Figure 8 presents the performance of each metric of DVIF-Net and the overall performance of the model (all data in the figure are normalized; the closer the Params metric is to 0 the better, and the closer P, R, mAP50, and mAP50-95 are to 1 the better). A to E denote the five models in Table 5. Figure 8a shows that model E, proposed in this article, demonstrates the most balanced performance: it improves the mean average precision by 2.4% compared with Model D (enhanced with CSP-EIE) while increasing the Params by only 0.008 M. In particular, the radar chart in Figure 8b illustrates that model E has the best overall performance; it not only improves accuracy but also enhances detection speed.

4.6. Visual Experiment

To further validate the effectiveness and reliability of the proposed DVIF-Net for small-target detection in UAV aerial images under complex environmental conditions, we conducted visualization experiments on the DroneVehicle cross-modal dataset. The results are illustrated in Figure 9, Figure 10, Figure 11 and Figure 12. The experiments cover various detection scenarios (nighttime, heavy fog, low contrast, small targets, occlusion, and dense distributions) to compare the detection performance of the single-modal YOLOv10n model and our DVIF-Net on both visible and infrared images.
Specifically, Figure 9 depicts a nighttime scenario. As shown in the original images, the targets in the infrared image exhibit clear contours, while those in the visible image are nearly invisible due to darkness. When detecting targets in the visible image, YOLOv10n not only missed a significant number of objects but also misclassified air conditioning units on buildings as vehicles. In contrast, our proposed fusion network accurately identified vehicles in the dark and simultaneously improved detection accuracy for infrared targets. Figure 10 presents a heavy fog scenario in which targets display limited features in both visible and infrared images, leading YOLOv10n to miss multiple occluded and corner-located small targets. DVIF-Net effectively alleviates this issue and reduces the missed-detection rate, owing to the element-wise multiplication strategy in the DCGF fusion module, which provides rich multimodal differential information and enhances the expressiveness of the fused features. Figure 11 illustrates a low-contrast scenario. As can be seen from the original image, the difference in gray level and brightness between the targets and the background is small, while the texture features of different vehicle categories are highly similar; as a result, YOLOv10n recognizes only a limited number of targets. Our method, by integrating the clear texture features of the visible images, not only significantly improves the detection accuracy of infrared targets but also correctly identifies the target categories. Figure 12 depicts a parking lot scenario in which the targets are small, densely overlapping, and partially occluded. While YOLOv10n fails to detect multiple occluded small targets, the edge information enhancement module in DVIF-Net strengthens the contour features of these targets, enabling the model to accurately identify more of them and demonstrating its strong adaptability under diverse environmental conditions. Overall, the proposed DVIF-Net performs better on the small-target detection task, effectively reducing both false and missed detections, and achieves good performance in detecting small targets in UAV aerial images with complex backgrounds.
Additionally, we further evaluate the detection capability of DVIF-Net through the confusion matrix, as shown in Figure 13. The confusion matrix helps us understand the model’s performance across different classes by dividing the classification results into correct and incorrect categories, from which performance metrics such as accuracy, precision, recall, and F1 score can be calculated. The values along the diagonal of the confusion matrix represent the proportion of correctly classified instances for each category, while the off-diagonal values indicate the proportion of misclassified instances. From Figure 13a–c, we can see that in the DVIF-Net confusion matrix all the diagonal values are higher than those of YOLOv10n, indicating that the improved method delivers better classification performance and higher detection accuracy on UAV aerial images.

4.7. Comparative Experiments on VEDAI Dataset

To further validate the generalization capability and robustness of the proposed method, we conducted comparative experiments with DVIF-Net on the cross-modal remote sensing dataset VEDAI. The VEDAI dataset [63], captured by drones equipped with both visible and infrared cameras, is primarily used for vehicle target detection. It contains 1250 strictly aligned visible-infrared image pairs collected from diverse perspectives, scenarios, and lighting conditions, and includes nine categories such as Car, Boat, Bicycle, and Tractor. The dataset is provided at two resolutions (512 × 512 and 1024 × 1024); our experiments use the 1024 version for training and testing. VEDAI encompasses multiple scenarios, flight altitudes, and vehicle categories. Beyond small-sized targets, it presents challenges such as multi-directionality, illumination variations, shadow changes, specular reflections, and occlusions, which makes it both suitable and challenging for verifying our algorithm.
To demonstrate the generalization capability of DVIF-Net, this paper further compares its performance with other state-of-the-art algorithms on the VEDAI dataset. Unlike the previous selection, the comparison primarily focuses on lightweight YOLO-series models: YOLOv5n, YOLOv8n, YOLOv10n, YOLOv11n, and YOLOv12n. During model training and validation, the VEDAI dataset is divided into training, validation, and test sets in a ratio of 7:2:1, with the same training parameters and evaluation metrics as above.
As evidenced by the experimental results in Table 6, unlike on the DroneVehicle dataset, the visible-based single-modal detection model outperforms its infrared-based counterpart on the VEDAI dataset. This phenomenon can be attributed to two primary factors: (1) the VEDAI dataset is predominantly captured during daytime with sufficient illumination, resulting in clearly discernible target textures in the visible images; and (2) the vast majority of images in this dataset contain repetitive similar textures (such as lakes, mountains, and forests), which interfere with target detection in the infrared images and lead to relatively low contrast between targets and background. Compared with single-modal detection methods, the multimodal detection method still demonstrates superior performance: mAP50 improves by 10.5% over YOLOv10n in the visible modality and by 30.5% in the infrared modality, indicating that the proposed DVIF-Net also achieves outstanding performance on the VEDAI dataset. Overall, the experimental results fully demonstrate that our method is not limited to a specific dataset but exhibits excellent detection performance on different bimodal datasets, demonstrating strong generalization and applicability.
Similarly, to more intuitively validate the effectiveness of the proposed improvement strategy, we conducted a visualization experiment on the test set, as shown in Figure 14, Figure 15 and Figure 16. The experiments select various detection scenarios, such as small targets, low contrast, occlusion, and dense distributions, as detection samples to compare the detection performance of the single-modal YOLOv10n model and DVIF-Net. As shown in Figure 14 and Figure 15, the proposed method not only effectively detects smaller targets missed by YOLOv10n but also accurately identifies their categories. Furthermore, in Figure 16, YOLOv10n misidentifies cylindrical structures on rooftops as vehicles, whereas DVIF-Net avoids such false detections, demonstrating that our approach also significantly reduces the false-detection rate in complex scenarios.

5. Discussion

Thanks to the introduction of the P4-level cross-modal fusion strategy, the DCGF structure, and the CSP-EIE module, the proposed DVIF-Net achieves higher adaptability and better detection performance in visible and infrared small-target detection, effectively avoiding problems such as missed and false detections. Meanwhile, compared with 15 other algorithms, the proposed DVIF-Net attains SOTA performance. Although the proposed method performs well on cross-modal UAV aerial datasets, some limitations remain. Specifically, the effectiveness of DVIF-Net on low-power platforms such as embedded devices needs further testing and improvement. In addition, when facing more complex scenarios (such as storms and smog), the robustness of the model still needs to be improved.
In future work, we will attempt to employ pruning or distillation techniques to further reduce the size and increase the processing speed of the fusion network, enabling more lightweight and rapid target detection when deployed on embedded devices. In addition, to further enhance the target detection capabilities of UAVs in more complex scenarios, we will consider introducing more modalities into the detection network for fusion to achieve broader applications.

6. Conclusions

In this study, we propose DVIF-Net, a visible-infrared fusion network for small-target detection in UAV aerial images, aiming to improve detection performance for small targets under complex environmental conditions. In the feature extraction stage, DVIF-Net adopts a parallel dual-branch structure to process the features of the visible and infrared images separately. For the fusion of visible and infrared features, we first introduce the P4-level cross-modal fusion strategy, which reduces model parameters by decreasing the number of fusion stages while avoiding the information redundancy caused by multi-level fusion. Meanwhile, the DCGF module is introduced at the fusion layer; this module adaptively enhances the complementary features between the two modalities and suppresses redundant information through a dynamic weight distribution mechanism. Finally, a novel EIE module is proposed, which significantly improves the representation of edge features for small targets via an adaptive average pooling layer and an edge enhancement mechanism. Experimental results on two cross-modal UAV aerial datasets show that our method significantly improves the detection accuracy of small targets, reduces both missed- and false-detection rates, and enhances the adaptability and stability of the detection model in complex scenarios. Furthermore, compared with 15 other state-of-the-art methods, the proposed DVIF-Net achieves superior performance in visible-infrared small-target detection. Our method not only brings a new perspective to the field of multimodal fusion but also provides a more effective solution for the high-precision detection of small targets in UAV aerial images.

Author Contributions

Conceptualization, X.Z.; validation, H.Z.; formal analysis, H.Z.; data curation, H.Z.; writing—review and editing, H.Z.; software, C.L.; visualization, C.L.; supervision, K.W.; resources, Z.Z.; project administration, X.Z.; funding acquisition, X.Z. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 41404022 and in part by the National Foundation for Enhancing Fundamental Sciences in China under Grant 2021-JCJQ-JJ-0871.

Data Availability Statement

Data and code related to the current study are available from the corresponding author upon reasonable request. The code for preprocessing the DroneVehicle dataset is available at https://github.com/pythonzhanghui/Datasetpreprocessing.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: Unmanned Aerial Vehicle
VIIF: Visible and Infrared Image Fusion
CNN: Convolutional Neural Network
GAN: Generative Adversarial Network
DCGF: Dual-Context Information Guide Fusion
CSP-EIE: Cross Stage Partial-Edge Information Enhancement
HOG: Histogram of Oriented Gradient
SVM: Support Vector Machine
YOLO: You Only Look Once
SSD: Single-Shot MultiBox Detector
DFL: Distribution Focal Loss
BCE: Binary Cross-Entropy
CIoU: Complete Intersection over Union
IR: Infrared

Appendix A. Parameter Settings

The specific parameters of all modules in the manuscript are as follows:
Model | Parameter Settings
DCGF | Pw-Conv (k = 1)
CSP-EIE | Group Convolution (g = 4)
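As a rough, assumption-based illustration of how these settings could appear in code, the sketch below uses an adaptive average pooling layer to obtain a smoothed copy of the feature map, treats the residual as edge information, and refines it with a grouped convolution (g = 4), mirroring the spirit of the CSP-EIE edge branch; the pooled size, channel count, and module name are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeEnhance(nn.Module):
    """Illustrative edge-information enhancement branch: a pooled (smoothed)
    copy of the feature map is upsampled and subtracted from the input, so the
    residual highlights high-frequency (edge) detail, which is then refined by
    a grouped convolution (g = 4) and added back to the input."""
    def __init__(self, channels, pooled_size=8, groups=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)
        self.refine = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=groups)

    def forward(self, x):
        smooth = F.interpolate(self.pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        edges = x - smooth              # high-frequency residual
        return x + self.refine(edges)   # edge-enhanced features

# Example: enhance a 64-channel feature map (channel count is an assumption).
out = EdgeEnhance(64)(torch.randn(1, 64, 80, 80))
```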

References

  1. Wang, Q.; Zhan, Y.; Zou, Y. UAV recognition algorithm for ground military targets based on improved Yolov5n. Comput. Meas. Control 2024, 32, 189–197. [Google Scholar]
  2. Hayat, S.; Yanmaz, E.; Brown, T.X.; Bettstetter, C. Multi-objective UAV path planning for search and rescue. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 5569–5574. [Google Scholar]
  3. Bhadra, S.; Sagan, V.; Sarkar, S.; Braud, M.; Mockler, T.C.; Eveland, A.L. PROSAIL-Net: A transfer learning-based dual stream neural network to estimate leaf chlorophyll and leaf angle of crops from UAV hyperspectral images. ISPRS J. Photogramm. Remote Sens. 2024, 210, 1–24. [Google Scholar] [CrossRef]
  4. Duo, C.; Li, Y.; Gong, W.; Li, B.; Qi, G.; Zhang, J. UAV-aided distribution line inspection using double-layer offloading mechanism. IET Gener. Transm. Distrib. 2024, 18, 2353–2372. [Google Scholar] [CrossRef]
  5. Qiu, Z.; Bai, H.; Chen, T. Special Vehicle Detection from UAV Perspective via YOLO-GNS Based Deep Learning Network. Drones 2023, 7, 117. [Google Scholar] [CrossRef]
  6. Peng, X.; Zeng, L.; Zhu, W.; Zeng, Z. A Small Object Detection Model for Improved YOLOv8 for UAV Aerial Photography Scenarios. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 29–31 March 2024; pp. 2099–2104. [Google Scholar]
  7. Kong, W.; Wang, B.; Lei, Y. Technique for infrared and visible image fusion based on non-subsampled shearlet transform and spiking cortical model. Infrared Phys. Technol. 2015, 71, 87–98. [Google Scholar] [CrossRef]
  8. Yang, B.; Li, S. Multifocus image fusion and restoration with sparse representation. IEEE Trans. Instrum. Meas. 2010, 59, 884–892. [Google Scholar] [CrossRef]
  9. Lu, X.; Zhang, B.; Zhao, Y.; Liu, H.; Pei, H. The infrared and visible image fusion algorithm based on target separation and sparse representation. Infrared Phys. Technol. 2014, 67, 397–407. [Google Scholar] [CrossRef]
  10. Ma, J.; Zhou, Z.; Wang, B.; Zong, H. Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Phys. Technol. 2017, 82, 8–17. [Google Scholar] [CrossRef]
  11. Liu, J.; Lin, R.; Wu, G.; Liu, R.; Luo, Z.; Fan, X. CoCoNet: Coupled Contrastive Learning Network with Multi-level Feature Ensemble for Multi-modality Image Fusion. Int. J. Comput. Vision 2024, 132, 1748–1775. [Google Scholar] [CrossRef]
  12. Li, H.; Wu, X.J.; Kittler, J. Infrared and visible image fusion using a deep learning framework. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 2705–2710. [Google Scholar]
  13. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An infrared and visible image fusion network based on salient target detection. IEEE Trans. Instrum. Meas. 2021, 70, 1–13. [Google Scholar] [CrossRef]
  14. Vs, V.; Valanarasu, J.M.J.; Oza, P.; Patel, V.M. Image fusion transformer. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3566–3570. [Google Scholar]
  15. Rao, D.; Wu, X.J.; Xu, T. TGFuse: An infrared and visible image fusion approach based on transformer and generative adversarial network. arXiv 2022, arXiv:2201.10147. [Google Scholar] [CrossRef]
  16. Yang, Y.; Liu, J.; Huang, S.; Wan, W.; Wen, W.; Guan, J. Infrared and visible image fusion via texture conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4771–4783. [Google Scholar] [CrossRef]
  17. Cai, Z.; Ma, Y.; Huang, J.; Mei, X.; Fan, F. Correlation-guided discriminative cross-modality features network for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2024, 73, 5002718. [Google Scholar] [CrossRef]
  18. Park, S.; Vien, A.G.; Lee, C. Cross-modal transformers for infrared and visible image fusion. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 770–785. [Google Scholar] [CrossRef]
  19. Qi, J.; Abera, D.E.; Cheng, J. PS-GAN: Pseudo Supervised Generative Adversarial Network with Single Scale Retinex Embedding for Infrared and Visible Image Fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 1766–1777. [Google Scholar] [CrossRef]
  20. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; pp. 886–893. [Google Scholar]
  21. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  22. Zhao, Y.; Shi, H.; Chen, X.; Li, X.; Wang, C. An overview of object detection and tracking. In Proceedings of the 2015 IEEE International Conference on Information and Automation, Lijiang, China, 8–10 August 2015; pp. 280–286. [Google Scholar]
  23. Huang, S.; Cai, N.; Pacheco, P.P.; Narrandes, S.; Wang, Y.; Xu, W. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genom. Proteom. 2018, 15, 41–51. [Google Scholar]
  24. Deshpande, S.D.; Er, M.H.; Venkateswarlu, R.; Chan, P. Max-mean and max-median filters for detection of small targets. Proc. SPIE 1999, 3809, 74–83. [Google Scholar]
  25. Bai, X.; Zhou, F. Analysis of new top-hat transformation and the application for infrared dim small target detection. Pattern Recogn. 2010, 43, 2145–2156. [Google Scholar] [CrossRef]
  26. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616. [Google Scholar] [CrossRef]
  27. Chen, C.L.P.; Li, H.; Wei, Y.; Xia, T.; Tang, Y.Y. A local contrast method for small infrared target detection. IEEE Trans. Geosci. Remote Sens. 2013, 52, 574–581. [Google Scholar] [CrossRef]
  28. Han, J.; Moradi, S.; Faramarzi, I.; Zhang, H.; Zhao, Q.; Zhang, X.; Li, N. Infrared small target detection based on the weighted strengthened local contrast measure. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1670–1674. [Google Scholar] [CrossRef]
  29. Zhang, L.; Peng, L.; Zhang, T.; Cao, S.; Peng, Z. Infrared Small Target Detection via Non-Convex Rank Approximation Minimization Joint l2,1 Norm. Remote Sens. 2018, 10, 1821. [Google Scholar] [CrossRef]
  30. Zhang, L.; Peng, Z. Infrared small target detection based on partial sum of the tensor nuclear norm. Remote Sens. 2019, 11, 382. [Google Scholar] [CrossRef]
  31. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  32. Song, X.; Fang, X.; Meng, X.; Fang, X.; Lv, M.; Zhuo, Y. Real-time semantic segmentation network with an enhanced backbone based on Atrous spatial pyramid pooling module. Eng. Appl. Artif. Intel. 2024, 133, 107988. [Google Scholar] [CrossRef]
  33. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  34. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  35. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  36. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  37. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar] [CrossRef]
  38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. arXiv 2016, arXiv:1512.02325. [Google Scholar] [CrossRef]
  39. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  40. Di, X.; Cui, K.; Wang, R.-F. Toward Efficient UAV-Based Small Object Detection: A Lightweight Network with Enhanced Feature Fusion. Remote Sens. 2025, 17, 2235. [Google Scholar] [CrossRef]
  41. Luo, X.; Zhu, X. YOLO-SMUG: An Efficient and Lightweight Infrared Object Detection Model for Unmanned Aerial Vehicles. Drones 2025, 9, 245. [Google Scholar] [CrossRef]
  42. Wu, H.; Huang, X.; He, C.; Xiao, H.; Luo, S. Infrared Small Target Detection with Swin Transformer-Based Multiscale Atrous Spatial Pyramid Pooling Network. IEEE Trans. Instrum. Meas. 2025, 74, 1–14. [Google Scholar] [CrossRef]
  43. Zhang, Y.; Dai, Z.; Pan, C.; Zhang, G.; Xu, J. NOC-YOLO: An exploration to enhance small-target vehicle detection accuracy in aerial infrared images. Infrared Phys. Technol. 2025, 149, 105905. [Google Scholar] [CrossRef]
  44. Ren, Z.; Wang, Z.; Ke, Z.; Li, Z.; Wushour, S. Survey of Multimodal Data Fusion. Comput. Eng. Appl. 2021, 57, 49–64. [Google Scholar]
  45. Schnelle, S.R.; Chan, A.L. Enhanced target tracking through infrared-visible image fusion. In Proceedings of the 14th International Conference on Information Fusion, Chicago, IL, USA, 5–8 July 2011; pp. 1–8. [Google Scholar]
  46. Zhao, W.; Xie, S.; Zhao, F.; He, Y.; Lu, H. MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding from Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13955–13965. [Google Scholar]
  47. Que, L.; Zhou, T.; Tao, M.; Che, K.; Zhou, Y.; Liang, J. An efficient method of infrared-visible image fusion. In Proceedings of the 2025 IEEE 34th Wireless and Optical Communications Conference (WOCC), Taipa, Macao, 20–22 May 2025; pp. 52–55. [Google Scholar]
  48. Cheng, X.; Geng, K.; Wang, Z.; Wang, J.; Sun, Y.; Ding, P. SLBAF-Net: Super-Lightweight bimodal adaptive fusion network for UAV detection in low recognition environment. Multimed. Tools Appl. 2023, 82, 47773–47792. [Google Scholar] [CrossRef]
  49. Wang, Z.; Liao, X.; Yuan, J.; Yao, Y.; Li, Z. CDC-YOLOFusion: Leveraging Cross-Scale Dynamic Convolution Fusion for Visible-Infrared Object Detection. IEEE Trans. Intell. Veh. 2024, 10, 1–14. [Google Scholar] [CrossRef]
  50. Sun, D.; Wang, C.; Wang, T.; Gao, Q.; Li, Z. Efficient fusion network with label generation and branch transformations for visible and infrared images fusion. Infrared Phys. Technol. 2025, 150, 105916. [Google Scholar] [CrossRef]
  51. Yao, J.; Zhang, Y.; Liu, F.; Liu, Y.-C. Object Detection Based on Decision Level Fusion. In Proceedings of the 2019 Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019; pp. 3257–3262. [Google Scholar]
  52. Chen, Y.-T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal object detection via probabilistic ensembling. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 139–158. [Google Scholar]
  53. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  54. Fang, Q.; Han, D.; Wang, Z. Cross-modality fusion transformer for multispectral object detection. arXiv 2021, arXiv:2111.00273. [Google Scholar] [CrossRef]
  55. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83–84, 79–92. [Google Scholar] [CrossRef]
  56. Chen, Y.; Wang, B.; Guo, X.; Zhu, W.; He, J.; Liu, X.; Yuan, J. DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection. arXiv 2025, arXiv:2412.04931. [Google Scholar]
  57. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-Based RGB-Infrared Cross-Modality Vehicle Detection Via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
  58. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  59. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2017, arXiv:1703.06870. [Google Scholar]
  60. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  61. Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, June 2019; pp. 2844–2853. [Google Scholar]
  62. Ren, K.; Wu, X.; Xu, L.; Wang, L. RemoteDet-Mamba: A Hybrid Mamba-CNN Network for Multi-modal Object Detection in Remote Sensing Images. arXiv 2025, arXiv:2410.13532. [Google Scholar]
  63. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
Figure 4. The architecture of the DCGF module based on the channel attention mechanism.
Figure 5. Structure of the EIE module.
Figure 6. Bounding-box regression.
Figure 7. Sample images from the DroneVehicle dataset.
Figure 8. Normalization effect diagram of all indicators. (a) Normalized histogram of the ablation experiment. (b) Performance demonstration diagram of the ablation experiment.
Figure 9. Night scenario. The green dotted box marks the comparison area; (a) original image, (b) detection results of YOLOv10n, (c) detection results of the proposed DVIF-Net.
Figure 10. Thick fog scene. The green dotted box marks the comparison area; (a) original image, (b) detection results of YOLOv10n, (c) detection results of the proposed DVIF-Net.
Figure 11. Low-contrast scene. The green dotted box marks the comparison area; (a) original image, (b) detection results of YOLOv10n, (c) detection results of the proposed DVIF-Net.
Figure 12. Parking lot scene. The targets to be detected are small, densely overlapping, and some vehicles are occluded. The green dotted box marks the comparison area; (a) original image, (b) detection results of YOLOv10n, (c) detection results of the proposed DVIF-Net.
Figure 13. Confusion matrices: (a) YOLOv10n, visible images; (b) YOLOv10n, infrared images; (c) DVIF-Net.
Figure 14. An open space beside farmland, where many very small target vehicles of similar appearance are densely parked. The green dotted box marks the comparison area; (a) original image, (b) detection results of YOLOv10n, (c) detection results of the proposed DVIF-Net.
Figure 15. An empty meadow in the suburbs where various types of vehicles are parked; the vehicle colors are extremely similar to the background and their outlines are blurry. The green dotted box marks the comparison area; (a) original image, (b) detection results of YOLOv10n, (c) detection results of the proposed DVIF-Net.
Figure 16. A road scene with a single moving vehicle. The green dotted box marks the comparison area; (a) original image, (b) detection results of YOLOv10n, (c) detection results of the proposed DVIF-Net.
Table 1. The number of labels of different categories in the DroneVehicle dataset.
Modality | Car | Truck | Bus | Van | Freight Car
RGB | 339,779 | 22,123 | 15,333 | 11,935 | 13,400
IR | 428,086 | 25,960 | 16,590 | 12,708 | 17,173
Table 2. Configuration of the experimental environment.
Environment | Parameters
CPU | Intel(R) Xeon(R) Platinum 8488C
GPU | NVIDIA A100
GPU memory size | 80 GB
Operating system | Windows 10
Language | Python 3.10.17
Framework | PyTorch 2.1.1
CUDA version | CUDA 11.8
Table 3. Training parameter settings.
Parameter | Setup
Epochs | 300
Input image size | 640 × 512
Batch size | 32
Workers | 8
Learning rate | 0.01
Momentum | 0.937
Weight decay | 0.0005
Optimizer | SGD
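For convenience, the optimizer-related entries in Table 3 map onto a standard PyTorch SGD configuration roughly as follows; the stand-in model and the remaining constants are placeholders rather than the actual training script.

```python
import torch

# Placeholder model; only the hyperparameters mirror Table 3.
model = torch.nn.Conv2d(3, 16, kernel_size=3)  # stand-in for DVIF-Net

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,             # initial learning rate
    momentum=0.937,      # SGD momentum
    weight_decay=0.0005  # L2 regularization
)

EPOCHS = 300
BATCH_SIZE = 32
NUM_WORKERS = 8
IMG_SIZE = (640, 512)    # input width x height
```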
Table 4. Performance comparison between DVIF-Net and different algorithms (per-class and overall mAP50 in %).
Method | Modality | Car | Truck | Bus | Van | Freight Car | All | Params (M) | FLOPs (G)
RetinaNet | Visible | 78.5 | 34.4 | 69.8 | 28.8 | 24.1 | 47.1 | 145.0 | 99.9
Faster R-CNN | Visible | 79.0 | 49.0 | 77.0 | 37.0 | 37.2 | 55.9 | 58.30 | 169.5
Mask R-CNN | Visible | 68.5 | 39.8 | 66.8 | 25.4 | 26.8 | 45.5 | 242.0 | 93.5
Cascade R-CNN | Visible | 68.0 | 44.7 | 69.3 | 29.8 | 27.3 | 47.8 | 368.0 | 153.8
RoITransformer | Visible | 61.6 | 55.1 | 85.5 | 27.6 | 42.3 | 61.5 | 273.0 | 123.8
YOLOv5n | Visible | 90.5 | 61.3 | 90.6 | 52.7 | 47.1 | 68.4 | 2.50 | 7.2
YOLOv8n | Visible | 91.2 | 62.3 | 90.2 | 56.0 | 49.2 | 69.8 | 3.01 | 8.1
YOLOv10n | Visible | 91.5 | 63.9 | 90.7 | 55.4 | 51.0 | 70.5 | 2.26 | 6.5
YOLOv11n | Visible | 91.3 | 63.0 | 90.2 | 55.2 | 48.8 | 69.9 | 2.58 | 6.3
YOLOv12n | Visible | 91.3 | 64.5 | 90.5 | 52.4 | 53.0 | 70.3 | 2.51 | 6.6
RetinaNet | Infrared | 88.8 | 35.4 | 76.5 | 32.1 | 39.5 | 54.5 | 145.0 | 99.9
Faster R-CNN | Infrared | 89.4 | 53.5 | 87.0 | 42.6 | 48.3 | 64.2 | 58.30 | 169.5
Mask R-CNN | Infrared | 88.8 | 48.9 | 78.4 | 32.2 | 36.6 | 57.0 | 242.0 | 93.5
Cascade R-CNN | Infrared | 81.0 | 47.2 | 79.3 | 33.0 | 39.0 | 55.9 | 368.0 | 153.8
RoITransformer | Infrared | 89.6 | 51.0 | 88.9 | 44.5 | 53.4 | 65.5 | 273.0 | 123.8
YOLOv5n | Infrared | 97.6 | 67.7 | 94.5 | 50.3 | 70.0 | 75.2 | 2.50 | 7.2
YOLOv8n | Infrared | 98.1 | 72.1 | 96.3 | 61.9 | 71.3 | 79.9 | 3.01 | 8.1
YOLOv10n | Infrared | 98.2 | 71.7 | 95.7 | 63.9 | 70.1 | 79.9 | 2.26 | 6.5
YOLOv11n | Infrared | 98.0 | 71.7 | 95.5 | 61.9 | 70.4 | 79.5 | 2.58 | 6.3
YOLOv12n | Infrared | 98.1 | 70.4 | 95.8 | 62.8 | 69.8 | 79.4 | 2.51 | 6.6
CFT | VI + IR | 91.9 | 67.0 | 92.1 | 56.4 | 55.2 | 72.5 | 44.53 | 123.5
UA-CMDet | VI + IR | 87.5 | 60.7 | 87.1 | 38.0 | 46.8 | 64.0 | 139.70 | 98.1
RemoteDet-Mamba | VI + IR | 98.2 | 81.2 | 95.7 | 52.9 | 67.9 | 81.8 | 71.34 | 84.5
CDC-YOLOFusion | VI + IR | 98.4 | 71.9 | 96.2 | 64.8 | 69.7 | 80.2 | 153.60 | 48.9
DE-YOLO | VI + IR | 98.3 | 78.4 | 96.8 | 68.5 | 76.0 | 83.6 | 6.00 | 18.7
Ours | VI + IR | 98.6 | 82.3 | 97.4 | 71.9 | 78.9 | 85.8 | 2.49 | 8.2
Note: Bold values indicate the best results.
Table 5. Ablation experiment results.
P4Fusion | DCGF | CSP-EIE | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M)
  |   |   | 79.4 | 75.5 | 81.0 | 58.2 | 3.436
✓ |   |   | 80.6 | 75.2 | 81.5 | 59.7 | 2.479
✓ | ✓ |   | 80.4 | 80.3 | 84.2 | 61.1 | 2.487
✓ |   | ✓ | 79.4 | 78.6 | 83.4 | 63.1 | 2.477
✓ | ✓ | ✓ | 81.6 | 80.3 | 85.8 | 64.4 | 2.485
Note: Bold values indicate the best results.
Table 6. Comparative experimental results on the VEDAI dataset (per-class and overall mAP50 in %).
Method | Modality | Car | Truck | Boat | Tractor | Camping Van | Pickup | Plane | Van | Others | All | Params (M) | FLOPs (G)
YOLOv5n | Visible | 80.5 | 46.1 | 34.7 | 48.6 | 65.4 | 62.6 | 69.1 | 25.2 | 33.2 | 51.7 | 2.50 | 7.2
YOLOv8n | Visible | 79.9 | 55.5 | 48.3 | 47.4 | 67.9 | 65.4 | 70.9 | 25.5 | 31.6 | 54.7 | 3.01 | 8.1
YOLOv10n | Visible | 81.1 | 56.5 | 48.1 | 46.0 | 67.5 | 65.6 | 79.5 | 26.8 | 33.6 | 56.1 | 2.26 | 6.5
YOLOv11n | Visible | 81.1 | 53.8 | 46.4 | 47.0 | 67.7 | 65.1 | 81.2 | 25.6 | 34.3 | 55.8 | 2.58 | 6.3
YOLOv12n | Visible | 80.5 | 58.3 | 35.0 | 48.1 | 61.4 | 69.6 | 69.5 | 15.9 | 34.6 | 52.5 | 2.51 | 6.6
YOLOv5n | Infrared | 76.6 | 43.5 | 36.6 | 45.7 | 58.9 | 59.5 | 31.7 | 15.3 | 24.5 | 43.6 | 2.50 | 7.2
YOLOv8n | Infrared | 74.7 | 57.0 | 40.0 | 38.4 | 56.5 | 60.4 | 26.8 | 19.0 | 26.0 | 44.3 | 3.01 | 8.1
YOLOv10n | Infrared | 77.4 | 52.7 | 45.1 | 44.9 | 56.8 | 57.6 | 46.8 | 18.6 | 27.5 | 47.5 | 2.26 | 6.5
YOLOv11n | Infrared | 77.3 | 53.0 | 45.7 | 45.2 | 54.3 | 63.6 | 46.7 | 18.3 | 25.9 | 47.8 | 2.58 | 6.3
YOLOv12n | Infrared | 77.7 | 50.2 | 39.2 | 46.5 | 62.1 | 59.9 | 44.2 | 18.4 | 24.7 | 47.0 | 2.51 | 6.6
Ours | VI + IR | 81.3 | 60.0 | 58.8 | 58.5 | 67.6 | 70.5 | 92.0 | 27.7 | 42.0 | 62.0 | 2.49 | 8.2
Note: Bold values indicate the best results.
