Article

FQDNet: A Fusion-Enhanced Quad-Head Network for RGB-Infrared Object Detection

by Fangzhou Meng 1,2, Aoping Hong 1,2, Hongying Tang 1 and Guanjun Tong 1,*

1 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1095; https://doi.org/10.3390/rs17061095
Submission received: 22 January 2025 / Revised: 7 March 2025 / Accepted: 17 March 2025 / Published: 20 March 2025
(This article belongs to the Special Issue Advances in Deep Fusion of Multi-Source Remote Sensing Images)

Abstract

RGB-IR object detection provides a promising solution for complex scenarios, such as remote sensing and low-light environments, by leveraging the complementary strengths of visible and infrared modalities. Despite significant advancements, two key challenges remain: (1) effectively integrating multi-modal features within lightweight frameworks to enable real-time performance and (2) fully utilizing multi-scale features, which are crucial for detecting objects of varying sizes but are often underexploited, leading to suboptimal accuracy. To address these challenges, we propose FQDNet, a novel RGB-IR object detection network that integrates an optimized fusion strategy with a Quad-Head detection framework. To enhance multi-modal feature fusion, we introduce a Channel Swap SCDown Block (CSSB) for initial feature interaction and a lightweight Spatial Channel Attention Fusion Module (SCAFM) to further refine the integration of complementary RGB-IR features. To improve multi-scale feature utilization, we design the Dynamic-Weight-based Quad-Head Detector (DWQH), which dynamically assigns weights to different scales, enabling adaptive fusion and enhancing multi-scale feature representation. This mechanism significantly improves detection performance, particularly for small objects. Furthermore, to ensure real-time applicability, we incorporate lightweight optimizations, including the Partial Cross-Stage Pyramid (PCSP) and SCDown modules, which reduce computational complexity while maintaining high detection accuracy. FQDNet was evaluated on three public RGB-IR datasets—M3FD, VEDAI, and LLVIP—achieving mAP@[0.5:0.95] gains of 4.4%, 3.5%, and 3.1% over the baseline, with only a 0.4 M increase in parameters and 5.5 GFLOPs overhead. Compared to state-of-the-art RGB-IR object detection algorithms, our method strikes a better balance between detection accuracy and computational efficiency while exhibiting strong robustness across diverse detection scenarios.

1. Introduction

1.1. Background

Object detection is a fundamental task in computer vision, with extensive applications across various domains, including remote sensing imagery analysis [1], video surveillance [2], and autonomous driving [3]. Despite significant progress in detection methods, the majority of advanced techniques still heavily rely on RGB images [4,5,6,7]. However, their performance degrades in challenging conditions, such as low-light environments, intense glare, or adverse weather. This limitation stems from RGB sensors’ inherent reliance on ambient light, rendering them ineffective in poorly lit or highly variable lighting conditions. Furthermore, RGB imaging struggles in scenarios where targets are obscured by environmental factors such as smoke, fog, or physical obstructions, often resulting in inaccurate detection. In contrast, infrared (IR) sensors capture thermal radiation emitted by objects, enabling reliable detection across varying lighting conditions and through occlusions that impede RGB-based detection [8,9]. The integration of the complementary strengths of both RGB and IR modalities, as demonstrated in multi-modal object detection techniques, has shown great promise in enhancing target characterization and improving detection accuracy across diverse environments [10,11]. Specifically, while IR imaging excels in low-light and adverse weather conditions, RGB imaging provides detailed spatial information and color cues in optimal lighting, making their integration highly beneficial for robust and accurate detection.
To enhance the accuracy and robustness of multi-modal object detection, various architectures have been proposed. Wagner et al. [12] pioneered the use of a two-branch convolutional network (ConvNet) for multi-modal detection, demonstrating the effectiveness of the Halfway Fusion model, where mid-level convolutional features are fused through concatenation. However, simple fusion strategies like concatenation, addition, or multiplication, while easy to implement, often result in redundancy and fail to fully leverage the complementary strengths of different modalities. This leads to suboptimal detection performance and limited generalization ability. To address these challenges, more advanced methods with complex multi-modal fusion mechanisms have been introduced. For instance, Fang et al. [13] proposed a Cross-Modality Fusion Transformer (CFT) based on dual-stream YOLOv5, leveraging the self-attention mechanism of Transformers for seamless intra- and inter-modal fusion. Building on this, Shao et al. [14] introduced MOD-YOLO, integrating the Cross-Stage Partial CFT module to further enhance detection performance. Meng et al. [15] enriched target representations by adding a pre-trained semantic branch to guide detailed feature fusion, while Zhang et al. [16] proposed an illumination-guided feature weighting module to improve the network’s ability to learn reliable modality-specific features. Additionally, Zhang et al. [17] addressed cross-modal misalignment by designing a region alignment module that adaptively aligns features to mitigate positional bias during fusion.
Nevertheless, substantial challenges persist in the domain of multi-modal object detection. Many existing methods rely on intricate fusion strategies after dual-stream feature extraction to enhance multi-modal integration. However, these approaches tend to increase network complexity and computational cost, which limits their suitability for real-time applications. Furthermore, the complementarity of multi-scale features, which is essential for detecting objects of varying sizes, is often underutilized. Low-level features provide fine-grained details that are critical for detecting small targets, while high-level features offer strong semantic representations better suited for larger targets. Failing to fully leverage this scale complementarity can lead to misdetections or missed detections, particularly for small-sized targets in remote sensing images. Additionally, improperly merging features across scales may introduce redundancy or lead to information loss, further impacting detection performance.

1.2. Main Contributions

To address these limitations, we propose FQDNet, an end-to-end RGB-IR object detection network that combines an optimized feature fusion strategy with a dynamic-weighted Quad-Head detection framework. This architecture is specifically designed to overcome challenges in multi-modal integration, multi-scale feature utilization, and real-time applicability. For enhanced multi-modal integration, we introduce a Channel Swap mechanism for initial feature fusion, which preserves modality-specific information while integrating complementary cross-modal features for richer data representation. This process is further refined by a lightweight spatial and channel attention mechanism, improving the fusion of complementary features while maintaining computational efficiency. To fully leverage multi-scale features, particularly low-level features for small object detection, we designed a Quad-Head Detector with an additional detection head at shallower layers. This configuration effectively captures fine-grained details for small targets while utilizing high-level semantic features for larger objects. Additionally, we employ dynamic feature weighting, which adaptively assigns weights to multi-scale features to reduce redundancy, mitigate information loss, and ensure the seamless integration of low-level and high-level information, leading to significant improvements in detection performance. Lastly, to meet the real-time requirements of practical applications, we optimize FQDNet with lightweight modules, reducing computational complexity while maintaining accuracy, achieving an effective balance between performance and efficiency.
The main contributions of this paper include the following:
  • We propose FQDNet, an end-to-end RGB-IR object detection network that integrates optimized feature fusion with a dynamic-weighted Quad-Head detection framework. Extensive experiments on benchmark RGB-IR datasets (M3FD, VEDAI, LLVIP) demonstrate that FQDNet achieves state-of-the-art performance in multi-modal detection.
  • To improve feature integration, we introduce the Channel Swap SCDown Block (CSSB) for initial pre-fusion and a Spatial Channel Attention Fusion Module (SCAFM) to refine multi-modal fusion, resulting in more effective feature representation.
  • We design the Dynamic-Weight-based Quad-Head Detector (DWQH) to address multi-scale object detection, dynamically adjusting multi-scale feature weights to improve detection accuracy for objects of various sizes.
  • To meet real-time application needs, we optimize the network with lightweight modules, including the Partial Cross-Stage Pyramid (PCSP) and SCDown, reducing computational complexity while preserving detection accuracy.
The remainder of this paper is organized as follows. Section 2 reviews related work in detail. Section 3 provides an in-depth description of the FQDNet architecture. Section 4 presents experimental results on multiple datasets and compares our method with state-of-the-art approaches. Finally, Section 5 concludes the paper, summarizing our contributions and outlining directions for future work.

2. Related Works

2.1. Traditional Object Detection Algorithms

With the rapid advancement of deep learning, vision-based object detection methods have gained widespread adoption across various domains due to their cost-effectiveness and ease of implementation. Object detection frameworks powered by deep learning are generally classified into two main paradigms: two-stage detectors and single-stage detectors. Two-stage detectors, such as R-CNN [18], Fast R-CNN [19], and Faster R-CNN [20], follow a region-proposal-based approach. These methods first generate region proposals and then classify and refine them using separate neural networks. While two-stage detectors achieve state-of-the-art detection accuracy, their complex architectures and high computational costs make them less suitable for resource-constrained devices and real-time applications. In contrast, single-stage detectors, such as YOLO [21], SSD [22], and RetinaNet [23], have gained popularity in practical applications due to their efficient feature extraction, reduced model complexity, and balanced trade-off between inference speed and detection accuracy. Additionally, recent advancements in Transformer-based architectures, such as DETR (Detection Transformer) [24], RT-DETR [25], and DINO [26], have introduced a new paradigm in object detection. These methods leverage self-attention mechanisms to model global contextual relationships within images, achieving competitive performance in end-to-end detection tasks. However, their slower convergence rates and higher computational demands remain significant challenges.
Among single-stage detectors, the You Only Look Once (YOLO) series [7,27,28,29] has been a cornerstone in the field since its inception in 2015. Known for its real-time performance, open-source availability, and ease of deployment, YOLO has undergone continuous evolution. For instance, Ultralytics released YOLOv8 in 2023, which introduced the C2f (Cross-Stage Partial Fusion) module to enhance the CSPLayer (Cross-Stage Partial Layer) while maintaining a backbone network similar to YOLOv5 [7,30]. Unlike YOLOv7 [31], which primarily focused on optimizing detection performance, YOLOv8 expanded its functionality to include support for instance segmentation [32]. In 2024, Wang et al. introduced YOLOv9 [28], which featured the Generalized Efficient Layer Aggregation Network (GELAN) architecture. GELAN significantly improved model efficiency and usability by enhancing feature retention during information propagation. Subsequently, a research team from Tsinghua University released YOLOv10 [29], which introduced architectural innovations to reduce dependency on non-maximum suppression (NMS). This advancement effectively addressed deployment bottlenecks caused by post-processing delays, further enhancing the framework’s practicality in real-world applications.

2.2. Visible-Infrared Object Detection Methods

Traditional object detection algorithms primarily rely on visible spectrum unimodal data. However, their performance deteriorates significantly under challenging conditions, such as low light, overexposure, and nighttime occlusion, limiting their adaptability to variable environments. In contrast, infrared imaging effectively identifies targets in low-light conditions by providing complementary information absent in visible imaging. Consequently, integrating the complementary properties of visible and infrared images into a unified object detection framework has become crucial for enhancing detection accuracy and model robustness [11,33].
Wagner et al. [12] pioneered multispectral pedestrian detection using a two-branch convolutional network (ConvNet), demonstrating that the Halfway Fusion model, which fuses mid-convolutional features, was the most effective approach. Since then, researchers have proposed various strategies to improve detection performance, predominantly leveraging fusion methods based on two-stream YOLO networks [33]. For instance, Fang et al. [13] introduced a cross-modality feature fusion approach based on YOLOv5, called the Cross-Modality Fusion Transformer (CFT). This method utilizes the Transformer’s self-attention mechanism to achieve seamless fusion within and between modalities. Similarly, Shao et al. [14] developed MOD-YOLO, incorporating the Cross-Stage Partial CFT (CSP-CFT) module, along with enhancements such as the VoV-GSCSP module for optimizing the network heads and the SIoU loss function to improve detection accuracy. Further advancements include Dual-YOLO [34] by Bao et al. and GMD-YOLO [11] by Sun et al., which employ attention fusion modules and multilevel feature modulation mechanisms, respectively, to enhance the complementarity of infrared and visible modalities and boost the detection of small targets. Zhang et al. [35] proposed SuperYOLO, integrating multi-modal data with assisted super-resolution learning to effectively improve the detection of multi-scale objects, particularly for small targets.
Building on this prior research, we propose an improved dual-stream architecture that fully extracts intra-modal information while enhancing key complementary information through a channel exchange strategy and a Spatial Channel Attention Fusion Module. Additionally, our dynamic-weight-based four-head detector significantly improves detection performance across multi-scale targets.

2.3. Lightweight Models for Object Detection

In practical applications, improving the real-time performance and efficiency of object detection tasks is essential. As a result, the design of lightweight networks has become a key area of research in recent years. Existing methods can be broadly categorized into three approaches: lightweight backbone network design [36,37], module integration optimization [38,39], and fusion strategy optimization [40,41].
The first approach focuses on reducing model complexity by employing lightweight backbone networks. For example, Cheng et al. [36] optimized YOLOv4 by incorporating MobileViT [42] and coordinate attention, achieving improved detection accuracy while reducing model size. Zuo et al. [43] proposed an anchor-free multispectral pedestrian detection framework based on MobileNetV2, which significantly reduces computational complexity and improves inference efficiency. Li and Ye [37] proposed Edge-YOLO, a lightweight infrared object detection model optimized for deployment on edge devices. Edge-YOLO substitutes the traditional backbone network with a lightweight ShuffleBlock and incorporates a strip depthwise convolutional attention module, significantly lowering computational complexity while maintaining high accuracy.
The second approach optimizes network architecture by integrating efficient modules. Huang et al. [38] introduced ghost convolution and a multi-scale attention mechanism based on YOLOv8, which substantially reduced the model size. Qiu et al. [39] proposed a novel RGELAN structure that employs re-parameterized convolutional modules to alleviate the computational load during feature extraction. Deng et al. [44] integrated lightweight depthwise separable convolution modules, utilizing channel- and point-wise convolutions to reduce the parameter count and computational load in multi-modal pedestrian detection models. Yan et al. [45] developed a compact coal identification and detection system using YOLOv8n and multispectral imaging technology, designing the C2f-Faster module to minimize model size and improve detection speed.
The third approach focuses on optimizing multi-modal fusion strategies, particularly for lightweight multi-modal detection tasks. Fan and Wang [46] proposed Cross-Modal Attention Feature Fusion, a lightweight multispectral fusion strategy that jointly models common-mode and differential-mode attention for adaptive feature enhancement and selection. MOD-YOLO [14] is a lightweight yet powerful dual-stream Transformer-based network that splits fused feature maps into cross-stage outputs and next-stage integrations, optimizing the trade-off between speed, memory, and accuracy. Wu et al. [40] developed a lightweight cross-modal feature fusion method called gated weighted normative feature fusion, which significantly improved the speed and accuracy of multispectral object detection. Yuan et al. [41] introduced C2Former, an efficient calibration and complementary Transformer module that addresses modal calibration errors and fusion inaccuracies.
In addition, our approach is lightweight and optimized in the design of the fusion strategy, detection head, and key modules, effectively reducing the model size and computational requirements while achieving the best balance between detection accuracy and model efficiency.

3. Method

3.1. Overall Architecture

The overall architecture of the FQDNet network is illustrated in Figure 1. FQDNet is an end-to-end detection framework specifically designed for visible and infrared object detection. Its architecture consists of four main components: a dual-stream image feature extraction backbone, multi-modal feature fusion modules, a neck, and a Quad-Head Detector. The backbone, derived from YOLOv8, has been extended to a dual-stream configuration to support dual-modal object detection, allowing for the independent extraction of multi-scale features from RGB and IR modalities. To enhance the extraction and integration of complementary features across modalities, we introduce the Channel Swap SCDown Block (CSSB) for initial pre-fusion, ensuring effective modality-specific feature extraction. Furthermore, the Spatial Channel Attention Fusion Module (SCAFM) is proposed to enhance the fusion of these complementary features. Once the multi-scale features are extracted and fused, they are passed through the neck for further processing and integration, before being forwarded to the detection heads for final predictions. To address challenges such as feature loss during cross-scale feature fusion, a dynamic-weight-based four-head detection structure is designed to improve the network’s capability for detecting multi-scale targets. Additionally, the architecture is optimized to reduce the computational complexity and parameter count by employing the SCDown and Partial Cross-Stage Pyramid (PCSP) modules. Detailed design specifications of the network are presented in the following sections.

3.2. Dual-Stream Feature Extraction and Fusion Backbone

In multi-modal tasks, properly integrating features from different modalities is critical. In our dual-stream fusion detection network, RGB and IR source images are processed through a parallel feature extraction network to perform multi-scale feature extraction independently. Initially, low-level texture features are extracted using a shallow convolutional kernel of size 3 × 3 with a stride of 2, providing a foundational basis for subsequent information fusion. To fully leverage the potential of features from both modalities, we adopt the Channel Swap strategy proposed by Wang et al. [47] and design the Channel Swap SCDown Block (CSSB) for the initial pre-fusion of bimodal features, as depicted in Figure 2. This strategy evaluates the importance of information within each channel by analyzing the parameter slopes of the batch-normalized (BN) layer. For channels with low slope values (indicative of insufficient information), the corresponding channel information from the other modality is used as a replacement. This process effectively fills in missing data and reduces redundancy. By introducing cross-modal information while maintaining modality-specific feature extraction, the CSSB enriches the input features for subsequent module designs, laying a robust foundation for effective fusion.
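For illustration, the channel-exchange step can be summarized by the following PyTorch-style sketch. The threshold on the BN scaling factors and the hard (non-differentiable) swap are simplifying assumptions made for this example, not the exact implementation used in FQDNet.

```python
import torch
import torch.nn as nn

def channel_swap(x_vi, x_ir, bn_vi: nn.BatchNorm2d, bn_ir: nn.BatchNorm2d, thresh: float = 1e-2):
    """Exchange uninformative channels between the two modalities.

    A channel whose BN scaling factor (slope) falls below `thresh` is treated as
    carrying little information and is replaced by the corresponding channel from
    the other modality. `thresh` and the hard swap are illustrative assumptions.
    """
    gamma_vi = bn_vi.weight.detach().abs()            # per-channel BN slopes, shape (C,)
    gamma_ir = bn_ir.weight.detach().abs()

    swap_vi = (gamma_vi < thresh).view(1, -1, 1, 1)   # weak RGB channels -> take IR
    swap_ir = (gamma_ir < thresh).view(1, -1, 1, 1)   # weak IR channels -> take RGB

    out_vi = torch.where(swap_vi, x_ir, x_vi)
    out_ir = torch.where(swap_ir, x_vi, x_ir)
    return out_vi, out_ir
```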
To further integrate and enhance the complementary information from RGB and IR modalities, we designed the Spatial Channel Attention Fusion Module (SCAFM), as illustrated in Figure 3. This compact and efficient module integrates shared convolution, an attention mechanism, multi-scale convolution, and a residual structure to achieve optimal fusion.
The visible and infrared input features are denoted by $x_{vi}$ and $x_{ir}$, respectively. Initially, depthwise separable convolutions with shared weights are applied to extract features independently for each modality. This process ensures inter-modal consistency by eliminating redundant or non-informative features, enabling the fusion to focus on meaningful and complementary attributes. The operation is defined as
$$x_{vi} = \mathrm{ReLU}\big(\mathrm{Conv}_{dw}(x_{vi})\big), \qquad x_{ir} = \mathrm{ReLU}\big(\mathrm{Conv}_{dw}(x_{ir})\big)$$
where $\mathrm{Conv}_{dw}(\cdot)$ denotes the shared depthwise convolution with a 3 × 3 kernel and $\mathrm{ReLU}(\cdot)$ represents the ReLU activation function. Subsequently, the modality-specific features are concatenated along the channel dimension to generate residual features for downstream processing:
$$x_{res} = \mathrm{Concat}(x_{vi}, x_{ir})$$
where $\mathrm{Concat}(\cdot,\cdot)$ denotes the channel concatenation operation. To emphasize salient regions in each modality, a lightweight spatial attention module generates weighting matrices $A_{vi}$ and $A_{ir}$:
$$A_{vi} = \sigma\big(\mathrm{Conv}_{1\times1}(x_{vi})\big), \qquad A_{ir} = \sigma\big(\mathrm{Conv}_{1\times1}(x_{ir})\big)$$
where $\sigma(\cdot)$ represents the sigmoid activation function and $\mathrm{Conv}_{k\times k}$ denotes a convolution with kernel size $k$. Unless otherwise specified, the stride is set to its default value of 1. The weighted modal features are then concatenated to form the spatially weighted fusion feature:
$$x^{s}_{fused} = \mathrm{Concat}\big(A_{vi}\odot x_{vi},\; A_{ir}\odot x_{ir}\big)$$
where $\odot$ represents element-wise multiplication. Next, multi-scale convolution operations with kernel sizes $k \in \{1, 3, 5\}$ are applied to $x^{s}_{fused}$, enhancing the module’s capability to capture features across varying receptive fields. The fused features are summed pixel-wise as follows:
$$x_{multiscale} = \sum_{k \in \{1,3,5\}} \mathrm{Conv}_{k\times k}(x^{s}_{fused})$$
Global contextual information is extracted from $x_{multiscale}$ via Global Average Pooling (GAP), followed by a fully connected (FC) layer to generate channel attention weights:
$$\bar{A} = \sigma\big(\mathrm{FC}(\mathrm{GAP}(x_{multiscale}))\big)$$
The weights $\bar{A}$ are applied to $x^{s}_{fused}$ for channel-wise interaction and enhancement:
$$x^{cs}_{fused} = \bar{A} \odot x^{s}_{fused}$$
Finally, the residual features $x_{res}$ are added to $x^{cs}_{fused}$ to preserve critical information and stabilize training. The fused feature outputs are obtained through a channel split operation:
$$X_{vi},\, X_{ir} = \mathrm{Split}\big(x^{cs}_{fused} + x_{res}\big)$$
The final outputs, $X_{vi}$ and $X_{ir}$, encapsulate fused complementary features with enhanced robustness and multi-scale perception, ensuring effective object detection across modalities.
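The following PyTorch sketch traces the SCAFM computation described above. The channel widths, the single-channel spatial-attention maps, and the use of one shared depthwise convolution for both modalities are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SCAFM(nn.Module):
    """Sketch of the Spatial Channel Attention Fusion Module (illustrative)."""

    def __init__(self, c: int):
        super().__init__()
        # shared-weight depthwise convolution applied to both modalities
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.act = nn.ReLU(inplace=True)
        # lightweight spatial attention via 1x1 convolutions (1-channel maps assumed)
        self.sa_vi = nn.Conv2d(c, 1, 1)
        self.sa_ir = nn.Conv2d(c, 1, 1)
        # multi-scale convolutions with kernel sizes 1, 3, 5 on the fused map
        self.ms = nn.ModuleList(nn.Conv2d(2 * c, 2 * c, k, padding=k // 2) for k in (1, 3, 5))
        # channel attention: GAP -> FC -> sigmoid
        self.fc = nn.Linear(2 * c, 2 * c)

    def forward(self, x_vi, x_ir):
        x_vi = self.act(self.dw(x_vi))                         # shared depthwise conv + ReLU
        x_ir = self.act(self.dw(x_ir))
        x_res = torch.cat([x_vi, x_ir], dim=1)                 # residual features
        a_vi = torch.sigmoid(self.sa_vi(x_vi))                 # spatial attention maps
        a_ir = torch.sigmoid(self.sa_ir(x_ir))
        x_fused_s = torch.cat([a_vi * x_vi, a_ir * x_ir], 1)   # spatially weighted fusion
        x_multi = sum(conv(x_fused_s) for conv in self.ms)     # multi-scale conv, pixel-wise sum
        w = torch.sigmoid(self.fc(x_multi.mean(dim=(2, 3))))   # GAP + FC channel weights
        x_fused_cs = w[:, :, None, None] * x_fused_s           # channel-wise enhancement
        out = x_fused_cs + x_res                               # residual addition
        return out.chunk(2, dim=1)                             # split into X_vi, X_ir
```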

3.3. Dynamic-Weight-Based Quad-Head Detector (DWQH)

The multi-scale multi-modal fusion features extracted in the earlier stages are fed into the network’s neck for further processing and integration. These refined multi-scale features are subsequently passed to the detection head, where they are mapped to the final output space to generate the network’s predictions. Features at different scales play distinct roles in the detection process: low-level features capture intricate details that are essential for detecting small targets, while high-level features provide robust semantic information suitable for identifying larger objects. However, directly merging features across scales may introduce redundancy or result in information loss. To address this challenge, we developed the Dynamic-Weight-based Quad-Head Detector (DWQH), which assigns adaptive dynamic weights to features at each scale, ensuring effective fusion and enhancing the complementarity of multi-scale features.
To optimize detection performance, particularly for small-sized targets, the DWQH employs four detection heads corresponding to four feature levels: Level1, Level2, Level3, and Level4. Inspired by ASFF [48], each detection head fuses three neighboring feature levels. For example, the Level4 detection head takes input feature maps from Level2, Level3, and Level4, denoted by $X_2$, $X_3$, and $X_4$, respectively. Since these features vary in resolution and channel dimensions, alignment operations are applied to ensure compatibility for fusion. Specifically, the resolutions of the features are adjusted as follows:
$$\hat{X}_2 = \mathrm{Maxpool}\big(\mathrm{Conv}_{3\times3,\,s=2}(X_2)\big), \qquad \hat{X}_3 = \mathrm{Conv}_{3\times3,\,s=2}(X_3), \qquad \hat{X}_4 = \mathrm{Upsample}(X_4)$$
where $\mathrm{Conv}_{3\times3,\,s=2}$ represents a 3 × 3 convolution with a stride of 2, which downsamples $X_3$ by half to produce $\hat{X}_3$. Similarly, $\mathrm{Maxpool}$ represents the max-pooling operation applied after the convolution, performing an additional two-fold downsampling. This results in a four-fold downsampling of $X_2$, producing $\hat{X}_2$. The resolution of $X_4$ is preserved, denoted by $\hat{X}_4$. When upsampling high-level features, bilinear interpolation is used, either by two-fold or four-fold increments. To unify channel dimensions and reduce computational complexity, all input features undergo a 1 × 1 convolution.
Dynamic weights play a critical role in the multi-scale feature fusion process. Independent learnable weights are assigned to each feature layer, along with channel-specific weights for fine-grained control over feature importance. These weights allow the model to emphasize relevant features while suppressing redundant ones. A set of learnable parameters $\{W_\alpha, W_\beta, W_\gamma\}$ is normalized using the Softmax function to ensure the weights sum to 1, facilitating selective feature enhancement. For the Level4 detection head, the dynamic weights are computed as
$$\alpha_4 = \frac{\exp(W_\alpha)}{\sum_{i\in\{\alpha,\beta,\gamma\}}\exp(W_i)}, \qquad \beta_4 = \frac{\exp(W_\beta)}{\sum_{i\in\{\alpha,\beta,\gamma\}}\exp(W_i)}, \qquad \gamma_4 = \frac{\exp(W_\gamma)}{\sum_{i\in\{\alpha,\beta,\gamma\}}\exp(W_i)}$$
where $\alpha_4$, $\beta_4$, and $\gamma_4$ are the normalized dynamic weights. Their generation is fully based on the learnable weight matrix $W$, whose parameters are optimized through gradient updates. The weights are applied independently to the features of each layer, enabling the mechanism to adapt effectively to different input features. This design not only suppresses redundant features and highlights critical information but also relies solely on low-complexity matrix operations, making it well suited for lightweight, real-time tasks.
After generating dynamic weights, feature fusion is performed through weighted summation:
$$F_{\mathrm{DWQH}\text{-}4} = \alpha_4 \cdot \hat{X}_2 + \beta_4 \cdot \hat{X}_3 + \gamma_4 \cdot \hat{X}_4$$
where $F_{\mathrm{DWQH}\text{-}4}$ denotes the fused multi-scale features, which serve as the input for the Level4 detection head. This fusion process retains both detailed low-level information and high-level semantic features, effectively addressing the detection requirements for multi-scale targets.
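As a concrete illustration, the fusion feeding the Level4 head might be sketched as follows in PyTorch. The exact channel counts, the handling of $X_4$ (here simply resized to the common resolution), and the omission of the additional channel-wise weights are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DWQHLevel4Fusion(nn.Module):
    """Dynamic-weight fusion feeding the Level4 detection head (illustrative)."""

    def __init__(self, c2: int, c3: int, c4: int, c_out: int):
        super().__init__()
        # resolution alignment: 4x downsampling for X2, 2x for X3
        self.down2 = nn.Sequential(nn.Conv2d(c2, c2, 3, stride=2, padding=1), nn.MaxPool2d(2))
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        # 1x1 convolutions unify the channel dimensions
        self.proj = nn.ModuleList(nn.Conv2d(c, c_out, 1) for c in (c2, c3, c4))
        # one learnable scalar per level, normalized with Softmax
        self.w = nn.Parameter(torch.zeros(3))

    def forward(self, x2, x3, x4):
        x2 = self.proj[0](self.down2(x2))
        x3 = self.proj[1](self.down3(x3))
        # bring X4 to the common resolution (a no-op if it already matches)
        x4 = self.proj[2](F.interpolate(x4, size=x3.shape[-2:], mode="bilinear", align_corners=False))
        alpha, beta, gamma = torch.softmax(self.w, dim=0)
        return alpha * x2 + beta * x3 + gamma * x4         # weighted summation
```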

3.4. Lightweight Optimization

To enhance the performance of dual-modal feature fusion detection, the proposed feature extraction network incorporates a dual-stream backbone, multi-modal fusion modules, and four detection heads. While these components improve detection accuracy, they also increase the number of parameters and computational overhead. To meet the demands of real-time detection tasks, we have implemented lightweight designs for key network modules, such as the SCDown module for effective downsampling and the Partial Cross-Stage Pyramid (PCSP) module for efficient feature propagation, reducing computational complexity while preserving detection accuracy. The specific optimization strategies are outlined as follows.
The SCDown module, inspired by YOLOv10 [29], is utilized for downsampling operations. This module first adjusts the channel dimensions via a convolution operation with a stride of 1, followed by spatial downsampling using a depthwise convolution with a stride of 2. Compared to traditional 3 × 3 standard convolutions with a stride of 2, SCDown separates spatial downsampling from channel adjustments, reducing the computational complexity from $\mathcal{O}\!\left(\frac{9}{2}HWC^2\right)$ to $\mathcal{O}\!\left(2HWC^2 + \frac{9}{2}HWC\right)$ and the parameter count from $\mathcal{O}(18C^2)$ to $\mathcal{O}(2C^2 + 18C)$. This design minimizes computational costs while preserving key information during downsampling, achieving competitive performance with reduced latency.
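A minimal sketch of such a block is shown below, assuming a 1 × 1 pointwise convolution for the channel adjustment and a 3 × 3 depthwise convolution for the stride-2 downsampling; the BatchNorm/SiLU placement is our own assumption rather than a detail given in the text.

```python
import torch.nn as nn

class SCDown(nn.Module):
    """Spatial-channel decoupled downsampling: a 1x1 pointwise convolution adjusts
    the channel dimension at stride 1, then a 3x3 depthwise convolution with
    stride 2 performs the spatial downsampling."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.pw = nn.Conv2d(c_in, c_out, kernel_size=1, stride=1, bias=False)
        self.dw = nn.Conv2d(c_out, c_out, kernel_size=3, stride=2, padding=1,
                            groups=c_out, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.dw(self.pw(x))))
```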
Multi-scale feature extraction often involves redundant or duplicate features across channels, which do not contribute meaningful information but increase computational and memory overhead. To address this, we propose the Partial Cross-Stage Pyramid (PCSP) module, an enhancement of the CSP module in the neck portion of the network. As illustrated in Figure 4, the PCSP introduces a partial convolution mechanism [49], splitting features along the channel dimension and performing convolution operations on a subset of channels while leaving others untouched. This approach reduces computational complexity and captures rich semantic information through cross-stage fusion, resulting in a lightweight and efficient feature representation.
Consider the input feature map $X \in \mathbb{R}^{B\times C_1\times H\times W}$, where $B$ is the batch size, $C_1$ is the number of input channels, and $(H, W)$ are the spatial dimensions. The PCSP operates as follows: The input feature map $X$ is first extended to $2C$ channels via a 1 × 1 convolution:
$$Y_1 = \mathrm{Conv}_{1\times1}(X), \qquad Y_1 \in \mathbb{R}^{B\times 2C\times H\times W}$$
Then, the extended feature $Y_1$ is divided into two parts along the channel dimension:
$$Y_{1a},\, Y_{1b} = \mathrm{Split}(Y_1), \qquad Y_{1a}, Y_{1b} \in \mathbb{R}^{B\times C\times H\times W}$$
In the main path, multi-scale features are extracted by cascading $n$ PCSP bottleneck blocks. Each block selectively performs convolutions on some channels while leaving others unchanged, efficiently extracting higher-order semantic features while preserving fine-grained information:
$$\mathrm{PCSP\_Bottleneck}(F) = \mathrm{Partial\_Conv}\big(\mathrm{Partial\_Conv}(F)\big), \qquad F \in \mathbb{R}^{B\times C\times H\times W}$$
Each partial convolution is defined as
$$\mathrm{Partial\_Conv}(F) = \mathrm{Concat}\big(\mathrm{Conv}(F_1), F_2\big), \qquad F_1, F_2 = \mathrm{Split}(F)$$
After cascading, all features ($Y_{1a}$ and the outputs from the main path) are concatenated along the channel dimension:
$$Z = \mathrm{Concat}\Big(Y_{1a},\, \{\mathrm{PCSP\_Bottleneck}(Z_{i-1})\}_{i=1}^{n}\Big), \qquad Z \in \mathbb{R}^{B\times 2C\times H\times W}$$
where $Z_0 = Y_{1b}$. Finally, the concatenated features are aggregated using a 1 × 1 convolution to reduce the number of channels to $C_2$, producing the final output:
$$Y = \mathrm{Conv}_{1\times1}(Z), \qquad Y \in \mathbb{R}^{B\times C_2\times H\times W}$$
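The PCSP computation above can be summarized by the following sketch. The 50% partial-convolution split, the choice $C = C_2/2$ for the hidden channels, and the concatenation of every intermediate bottleneck output (a C2f-style reading of the expression for $Z$) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Convolve only a fraction of the channels and pass the rest through untouched."""

    def __init__(self, c: int, ratio: float = 0.5):
        super().__init__()
        self.c1 = int(c * ratio)                       # channels that are convolved
        self.conv = nn.Conv2d(self.c1, self.c1, 3, padding=1)

    def forward(self, f):
        f1, f2 = f[:, :self.c1], f[:, self.c1:]        # Split(F)
        return torch.cat([self.conv(f1), f2], dim=1)   # Concat(Conv(F1), F2)


class PCSP(nn.Module):
    """Partial Cross-Stage Pyramid block (illustrative sketch)."""

    def __init__(self, c1: int, c2: int, n: int = 1):
        super().__init__()
        self.c = c2 // 2
        self.cv1 = nn.Conv2d(c1, 2 * self.c, 1)        # expand to 2C channels
        self.blocks = nn.ModuleList(
            nn.Sequential(PartialConv(self.c), PartialConv(self.c)) for _ in range(n)
        )                                              # cascaded PCSP bottlenecks
        self.cv2 = nn.Conv2d((n + 1) * self.c, c2, 1)  # aggregate to C2 channels

    def forward(self, x):
        y1a, z = self.cv1(x).chunk(2, dim=1)           # Y_1a kept, Y_1b enters the main path
        outs = [y1a]
        for blk in self.blocks:
            z = blk(z)
            outs.append(z)
        return self.cv2(torch.cat(outs, dim=1))        # Concat + 1x1 convolution
```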

3.5. Focaler–Inner IoU (FI-IoU)

The bounding box regression loss function is essential in object detection tasks, as it optimizes the position and size of bounding boxes to improve detection accuracy. The Intersection over Union (IoU) [50] metric is commonly used to evaluate the overlap between predicted and ground-truth boxes, providing a measure of prediction accuracy. However, IoU performs poorly when bounding boxes do not overlap, resulting in decreased accuracy, especially for objects with complex shapes. While the CIoU used in YOLOv8 [7] effectively incorporates IoU, centroid distance, and aspect ratio consistency, it has several shortcomings: (1) insufficient optimization for low-IoU cases (e.g., small or occluded objects) due to the minimal gradient contribution; (2) gradient saturation in high-IoU cases, which limits further refinement; (3) a lack of gradient information when IoU is zero, causing stagnation in training; (4) neglecting internal bounding box features, which restricts localization accuracy.
To address these limitations, this paper introduces a hybrid loss function that integrates Focaler-IoU [51] and Inner-IoU [52], termed FI-IoU. This new loss function aims to enhance the optimization of bounding box regression, particularly in challenging scenarios involving occlusion and multi-scale targets, which are prevalent in visible and infrared fusion detection tasks.
Focaler-IoU is a modified bounding box regression loss derived from the traditional IoU loss. It adjusts the loss values through a linear interval mapping, thereby addressing the CIoU limitations in specific cases. This mechanism prioritizes challenging samples, such as small or occluded targets, while reducing the emphasis on simpler cases during training. The Focaler-IoU formula is given by
$$IoU_{focaler} = \begin{cases} 0, & IoU < d \\ \dfrac{IoU - d}{u - d}, & d \le IoU \le u \\ 1, & IoU > u \end{cases}$$
where $IoU_{focaler}$ is the adjusted Focaler-IoU value, $IoU$ is the original IoU, and $d$ and $u$ are lower and upper thresholds within the range $[0, 1]$. By adjusting these thresholds, $IoU_{focaler}$ can be tailored to focus on specific regression samples. The corresponding loss function is defined as
$$L_{Focaler\text{-}IoU} = 1 - IoU_{focaler}$$
Inner-IoU enhances the bounding box regression by incorporating the internal geometry of the overlap region, improving the regression process compared to traditional IoU-based losses, which focus only on the external box properties. Inner-IoU explicitly models the internal overlap feature, defined as
$$IoU_{Inner} = \frac{\mathrm{InnerArea}(B_p \cap B_g)}{\mathrm{Area}(B_p) + \mathrm{Area}(B_g) - \mathrm{Area}(B_p \cap B_g)}$$
where $\mathrm{InnerArea}(B_p \cap B_g)$ represents the effective overlap area between the predicted box $B_p$ and the ground-truth box $B_g$, potentially weighted by spatial attributes like proximity to the center. Even when the IoU is zero, this loss ensures meaningful gradient updates, which keeps the optimization process moving forward and helps improve predictions in high-IoU cases.
The proposed FI-IoU loss function combines the benefits of both Focaler-IoU and Inner-IoU to mitigate the gradient issues associated with CIoU in low-, zero-, and high-IoU cases while incorporating internal bounding box features for improved localization accuracy. The final hybrid loss is formulated as
$$L_{FI\text{-}IoU} = L_{Focaler\text{-}IoU} + IoU - IoU_{Inner}$$
This combined approach effectively mitigates gradient problems in challenging scenarios, such as occlusion, small objects, and multi-scale detection, while enhancing the overall localization accuracy. As a result, the FI-IoU loss function improves the performance of our FQDNet model, particularly in the context of infrared and visible fusion detection tasks.
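To make the combination concrete, a hedged PyTorch sketch of the FI-IoU loss is given below for axis-aligned boxes in (x1, y1, x2, y2) format. The thresholds $d$ and $u$, the center-shrink ratio used to form the inner boxes, and the normalization of the inner overlap by the full-box union are illustrative choices, not the paper's exact settings.

```python
import torch

def _area(b):
    return (b[..., 2] - b[..., 0]).clamp(min=0) * (b[..., 3] - b[..., 1]).clamp(min=0)

def _intersection(a, b):
    lt = torch.maximum(a[..., :2], b[..., :2])
    rb = torch.minimum(a[..., 2:], b[..., 2:])
    wh = (rb - lt).clamp(min=0)
    return wh[..., 0] * wh[..., 1]

def _shrink(b, ratio):
    # auxiliary "inner" box: same center, sides scaled by `ratio`
    c = (b[..., :2] + b[..., 2:]) / 2
    half = (b[..., 2:] - b[..., :2]) * ratio / 2
    return torch.cat([c - half, c + half], dim=-1)

def fi_iou_loss(box_p, box_g, d=0.0, u=0.95, ratio=0.75):
    """FI-IoU sketch for (x1, y1, x2, y2) box tensors.

    The inner overlap is taken as the intersection of center-shrunken boxes,
    normalized by the full-box union as in the text; d, u, and `ratio` are
    illustrative values rather than the paper's settings.
    """
    inter = _intersection(box_p, box_g)
    union = _area(box_p) + _area(box_g) - inter
    iou = inter / (union + 1e-7)
    iou_inner = _intersection(_shrink(box_p, ratio), _shrink(box_g, ratio)) / (union + 1e-7)
    iou_focaler = ((iou - d) / (u - d)).clamp(0.0, 1.0)   # linear interval mapping
    return (1.0 - iou_focaler) + iou - iou_inner          # Focaler term + Inner-IoU correction
```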

4. Experiments

4.1. Datasets and Evaluation Metrics

4.1.1. Datasets

Our proposed method was evaluated on three publicly available RGB-IR object detection datasets: M3FD [53], VEDAI [54], and LLVIP [55]. These datasets provide paired RGB-IR images along with annotations, enabling a comprehensive performance assessment.
The M3FD dataset, introduced by Liu et al. [53] in 2022, was collected using a synchronized system consisting of a binocular optical camera and a binocular infrared sensor. It comprises 4200 aligned RGB-IR image pairs (8400 individual images), with most images having a resolution of 1024 × 768. The dataset contains 34,407 annotated objects categorized into six classes: People, Car, Bus, Motorcycle, Lamp, and Truck. M3FD covers a diverse range of environments, including urban streets, woodlands, and grasslands, as well as challenging scenarios such as tunnels, coastal areas, and foggy conditions. Following standard practice, we randomly split the dataset into training, validation, and test sets with a ratio of 4:1:1.
The VEDAI dataset, introduced by Razakarivony et al. [54] in 2016, is designed for vehicle and aircraft detection in aerial imagery. It includes small objects and diverse scenarios, such as fields, mountains, highways, construction sites, and urban areas. The dataset was collected from Utah AGRC satellite imagery and provides aligned RGB and near-infrared (NIR) image pairs. In total, it contains 1246 pairs of images with resolutions of 1024 × 1024 and 512 × 512, covering 11 vehicle categories. In this study, we focus on the detection of eight vehicle categories in 512 × 512 images. Classes with fewer than 50 instances, such as airplanes, motorcycles, and buses, were excluded. The selected categories include Car, Pickup, Camping, Truck, Other, Tractor, Boat, and Van. The dataset is designed for 10-fold cross-validation, where each split consists of 1089 training image pairs and 121 testing image pairs. Our final results are reported as the average performance across all 10 folds.
The LLVIP dataset, introduced by Jia et al. [55] in 2021, was collected using a binocular camera system. The captured images were precisely cropped and aligned to ensure pixel-wise correspondence. This dataset consists of 15,488 pairs (30,976 images) of high-resolution RGB-IR images, each with a resolution of 1280 × 1024. Most images were captured in extremely low-light conditions, and pedestrian instances were carefully annotated. We follow the official dataset partitioning, which includes 12,025 training pairs and 3463 testing pairs.

4.1.2. Evaluation Metrics

To comprehensively assess the performance of the proposed model, we employ several key evaluation metrics, including precision, recall, mean average precision (mAP), parameter count (Param), and floating-point operations (FLOPs). These metrics collectively provide a thorough evaluation of the model’s accuracy, robustness, and computational efficiency.
Precision quantifies the proportion of correctly predicted positive instances among all predicted positive instances, reflecting the model’s ability to minimize false positives. It is formally defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ and $FP$ represent the numbers of true positive and false positive predictions, respectively.
Recall (or sensitivity) measures the proportion of correctly predicted positive instances among all actual positive instances, reflecting the model’s ability to minimize false negatives. It is expressed as
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $FN$ denotes the number of false negative predictions.
Mean average precision (mAP) assesses the trade-off between precision and recall across multiple object classes. Specifically, mAP@0.5 calculates the average precision at an Intersection over Union (IoU) threshold of 0.5, whereas mAP@[0.5:0.95] represents the mean average precision computed over IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. The latter provides a more holistic and robust evaluation by accounting for varying localization strictness levels.
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad AP_i = \int_{0}^{1} P(i)\,\mathrm{d}R(i)$$
where $N$ is the number of object classes, and $P(i)$ and $R(i)$ denote the precision and recall at a given threshold for class $i$, respectively.
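For reference, the precision/recall and per-class AP computations can be approximated as in the short NumPy sketch below; it mirrors the common all-point interpolation used by most detection toolkits and is not necessarily the authors' exact evaluation code.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from raw true/false positive and false negative counts."""
    return tp / (tp + fp + 1e-16), tp / (tp + fn + 1e-16)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """All-point interpolated AP, i.e. the integral of precision over recall for one class.

    `recall` and `precision` are assumed to be cumulative curves sorted by
    descending detection confidence.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]                # recall levels where the curve changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```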
The parameter count (Param) represents the total number of trainable parameters in the model, which serves as an indicator of model complexity and capacity. A lower parameter count generally enhances computational efficiency; however, excessive reduction may compromise the model’s ability to learn intricate feature representations, potentially degrading performance.
Floating-point operations (FLOPs) are the number of floating-point operations required per forward pass. This metric provides an estimate of the computational cost associated with running the model. Reducing FLOPs while maintaining high accuracy is crucial for real-time and resource-constrained applications.

4.2. Implementation Details

All experiments were conducted using the training sets of the aforementioned datasets, with evaluations performed on the corresponding test sets to ensure consistency and comparability. The input image resolution was set to 640 × 640, and the network was optimized using Stochastic Gradient Descent (SGD) with a momentum of 0.937 and a weight decay of 0.0005. The model was trained for 300 epochs on the M3FD and VEDAI datasets and for 200 epochs on the LLVIP dataset, with a learning rate of 0.035, which was empirically determined for optimal performance. For the baseline models, training epochs were set to the same values, and their default hyperparameters were used for a fair comparison. To ensure an independent evaluation, all models were trained from scratch without pre-trained weights. The implementation was carried out using PyTorch 2.0.1 with CUDA 11.7, and all computations were performed on a single NVIDIA RTX 2080 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA).

4.3. Algorithm Performance Experiment

Our algorithm was trained and tested on the M3FD dataset, achieving 88.8% mAP@0.5 and 59.9% mAP@[0.5:0.95] on the test set, while requiring only 4.7M parameters and 16.9 GFLOPs. As illustrated in Figure 5, under low-light conditions, RGB images often suffer from a loss of color and texture information. The presence of artifacts, such as those caused by vehicle headlights, leads the single-modal RGB detection algorithm to misclassify shadows as people. In contrast, IR images are invariant to lighting variations, effectively avoiding such interference and enabling more accurate object detection. Our algorithm effectively exploits complementary information from both modalities, mitigating artifact interference and improving detection accuracy.
In complex background scenes, the single-modal RGB detection algorithm fails to detect objects due to occlusion and background clutter. However, in IR images, the objects appear more prominent, while the background is less complex, making detection easier. By leveraging the advantages of both modalities, our method enhances object detection in such scenarios.
Under rainy road conditions, RGB images exhibit low contrast, causing blurred object edges and hindering precise localization. In particular, the left person appears indistinct, and the right person is occluded, leading to missed detections by the single-modal RGB detection algorithm. Conversely, IR images offer high contrast between the pedestrian and the vehicle, even in the presence of occlusion, thereby facilitating better detection. However, IR images lack fine-grained texture details, causing the single-modal IR algorithm to misclassify the left car wheel as a pedestrian. Our algorithm effectively addresses both missed and false detections by mitigating the effects of occlusion and texture loss, ensuring accurate object identification.
Under sunny road conditions, the RGB image retains rich foreground and background details, enabling the single-modal detection algorithm to detect multiple objects accurately. However, the IR image struggles to differentiate objects in the rightmost crowd due to the absence of texture information, resulting in missed detections. By integrating both modalities, our algorithm ensures the accurate detection of all objects. Furthermore, our fusion strategy, combined with a Quad-Head Detector, effectively leverages multi-modal and multi-scale features. This enables the detection of small, distant objects—such as the light at the center of the image—which single-modal detection algorithms fail to identify.

4.4. Ablation Experiment

To quantify the contribution of each module to the overall model performance, we conducted a series of ablation experiments on the M3FD dataset. The results are summarized in Table 1. The baseline model is a dual-stream YOLOv8 network, which fuses the two feature streams using simple element-wise addition, achieving 81.9% mAP@0.5 and 55.5% mAP@[0.5:0.95]. Method 1 incorporates our proposed fusion strategy, which integrates the CSSB and SCAFM, leading to a performance improvement of +0.7% mAP@0.5 and +0.5% mAP@[0.5:0.95]. Method 2 introduces the DWQH, resulting in a significant increase of +6.5% mAP@0.5 and +4.7% mAP@[0.5:0.95]. Method 4 combines both the fusion strategy and the DWQH, achieving a notable performance gain of +7.1% mAP@0.5 and +5.0% mAP@[0.5:0.95]. However, this improvement comes at the cost of a 1.4M increase in parameters and an additional 7.9 GFLOPs of computation. Methods 3, 5, and 6 demonstrate that incorporating our lightweight modules (SCDown and PCSP) effectively reduces the number of parameters and FLOPs, though it results in a slight decrease in performance. Method 7 integrates all three enhancements—the fusion strategy, the DWQH, and the lightweight modules—yielding a +6.9% improvement in mAP@0.5 and +4.4% in mAP@[0.5:0.95] compared to the baseline. Notably, this is achieved with only a modest increase of 0.4M in parameters and 5.5 GFLOPs in computational cost. These results demonstrate that the combined integration of the three proposed modules achieves the most efficient balance between performance and computational complexity.

4.5. Comparative Experiments

To evaluate the performance of the proposed FQDNet for RGB-IR object detection, we conducted comprehensive comparative experiments with several state-of-the-art real-time detection networks, extending them into two-stream architectures and fusing the two modalities using a simple addition operation. Figure 6 presents a comparison of the model complexity (measured by the number of parameters) and detection accuracy (mAP@[0.5:0.95]) between our algorithm and dual-stream YOLOv5, YOLOv7, YOLOv8, YOLOv9, and YOLOv10 on the M3FD dataset. The results indicate that our algorithm significantly outperforms the compared methods, achieving higher detection accuracy with a similar number of parameters. Among the compared algorithms, dual-stream YOLOv8 delivers the second-best performance at a relatively low parameter count, which motivated our decision to select YOLOv8 as the baseline for further improvements. Detailed performance metrics are provided in Table 2. Among all evaluated algorithms, our proposed FQDNet_s achieved the highest precision (91.3%), while FQDNet_l achieved the best recall (86.9%), mAP@0.5 (92.1%), and mAP@[0.5:0.95] (64.1%). Notably, compared to other lightweight models (Dual-YOLOv5n, Dual-YOLOv7tiny, Dual-YOLOv8n, Dual-YOLOv9t, Dual-YOLOv10n), our FQDNet_n also achieved the highest recall (82.1%), mAP@0.5 (88.8%), and mAP@[0.5:0.95] (59.9%). These findings demonstrate the superiority of our method in balancing model efficiency and detection performance, making it a promising solution for real-time RGB-IR object detection applications.
We also provide the visual detection results of lightweight methods from the M3FD dataset in Figure 7a–e, illustrating scenarios with occluded, blurred, and distant objects in dense scenes, as well as multi-scale targets under various conditions, including cloudy, foggy, daytime, nighttime, and tunnel environments. In these scenarios, some objects, especially occluded or small targets, are either misidentified or completely missed by the comparative methods due to their insufficient ability to utilize multi-modal or multi-scale features. These results highlight that our fusion strategy effectively integrates the complementary features of both modalities, while the Quad-Head Detector better utilizes the multi-scale fused features to significantly improve detection accuracy, ultimately leading to more robust classification and localization performance.
To further demonstrate the advantages of our dual-modal fusion detection network, we conducted experiments on the M3FD dataset, as shown in Table 3. The results clearly indicate that the dual-modal detection method significantly outperforms single-modal approaches. Our model surpasses YOLOv8n trained on single-modal data by at least 10.3% in mAP@0.5 and at least 8.5% in mAP@[0.5:0.95], demonstrating a significant improvement. Our FQDNet_n improves mAP@0.5 and mAP@[0.5:0.95] by 6.9% and 4.4%, respectively, compared to the baseline algorithm (Dual-YOLOv8n). Compared to the state-of-the-art dual-modal object detection algorithms CFT [13], YOLOFusion [46], SuperYOLO [35], and MOD-YOLO [14], our FQDNet_n achieves the highest mAP@0.5, mAP@[0.5:0.95], and recall while requiring fewer parameters and computational resources. Our precision is slightly lower, likely because the network detects more small objects, whose limited texture information makes them difficult to distinguish accurately. By appropriately scaling the size of our network, FQDNet_s achieves consistent improvements across all four metrics, with an mAP@0.5 of 91.9%, mAP@[0.5:0.95] of 63.0%, precision of 91.3%, and recall of 86.2%. These results further demonstrate the effectiveness of our method in enhancing object detection. For practical applications, different versions of FQDNet can be selected to meet specific requirements.

4.6. Generalization Experiments

In our experiments, FQDNet demonstrated strong potential in complex scenes across diverse datasets. We conducted generalization experiments on the widely used public datasets VEDAI and LLVIP. The visual detection results of our FQDNet_s model on these datasets are presented in Figure 8 and Figure 9, respectively.
In the VEDAI dataset, objects in the remote sensing images are typically small, which further increases the difficulty of detecting them in complex backgrounds. While our algorithm successfully detects the vast majority of small and medium-sized objects, certain small targets are inevitably missed due to the lack of distinctive texture information and interference from similar backgrounds, as shown in Figure 8c. Table 4 presents a comparison of the results of our algorithm with other methods on the VEDAI dataset. It is clear that the single-modal YOLOv8n trained on the RGB modality outperforms its counterpart trained on the IR modality in both mAP@0.5 and mAP@[0.5:0.95], indicating that visible images provide richer target information than infrared images in this dataset. Dual-modal detection algorithms, benefiting from the complementary nature of both modalities, show improved performance, with our FQDNet achieving the best results. Specifically, FQDNet_n achieves a 5.9% and 3.5% improvement in mAP@0.5 and mAP@[0.5:0.95], respectively, over the baseline (Dual-YOLOv8n). Compared to state-of-the-art dual-modal detection algorithms, our model attains the highest mAP@[0.5:0.95] with significantly fewer parameters and lower computational complexity, while its mAP@0.5 is only slightly lower than that of YOLOFusion. This demonstrates the lightweight nature of our approach and its effectiveness in remote sensing scenarios, which can be attributed to the Quad-Head Detector DWQH and the specialized loss function FI-IoU incorporated into our design.
In the nighttime pedestrian detection scenario of the LLVIP dataset, the lack of ambient light results in less distinct object features in the visible images compared to daylight conditions, and some objects may even be completely obscured by darkness. To demonstrate the detection capability of our model in extremely low-light environments, we compared the detection results of multiple algorithms on the LLVIP dataset, as shown in Table 5. The single-modality YOLOv8n trained on the IR modality surpasses its RGB-trained counterpart by approximately 6% in mAP@0.5 and 10% in mAP@[0.5:0.95], highlighting the superior effectiveness of infrared images in capturing target features within this dataset. In contrast, Dual-YOLOv8n exhibits lower mAP@[0.5:0.95] than the single-modality YOLOv8n trained on the IR modality, indicating that simply combining two modalities does not necessarily lead to improved performance. However, our fusion strategy, which incorporates channel exchange and a Spatial Channel Attention Fusion Module, effectively extracts and integrates complementary features from both modalities. As a result, FQDNet_s achieves the highest mAP@0.5 of 96.4%, mAP@[0.5:0.95] of 64.1%, and recall of 91.3%, demonstrating significant performance improvements. Moreover, FQDNet_n surpasses YOLOFusion, SuperYOLO, and MOD-YOLO in performance while utilizing substantially fewer parameters and computations. Notably, it delivers a performance comparable to that of CFT but with approximately one-ninth of the parameters. These results on the LLVIP dataset further validate the effectiveness and robustness of our approach.

5. Discussion

The experiments conducted on the M3FD, VEDAI and LLVIP datasets demonstrate that the proposed FQDNet achieves strong detection performance, robust generalization capability, and a clear lightweight advantage, highlighting its adaptability across diverse detection scenarios. However, certain challenges remain. The model struggles to accurately detect extremely small or weak objects in remote environments and has yet to be thoroughly evaluated for robustness under spatial misalignment conditions. Future research will focus on enhancing the network’s sensitivity to weak signals to improve detection accuracy for small and low-contrast objects, strengthening robustness against spatial misalignment errors to ensure reliable performance in datasets with imperfect registration, and further exploring real-world deployment scenarios to validate its practical applicability in dynamic environments. These efforts will contribute to refining FQDNet and expanding its usability in real-world applications.

6. Conclusions

In this study, we proposed FQDNet, a novel end-to-end RGB-IR object detection network designed to address challenges in multi-modal integration, multi-scale feature utilization, and real-time applicability. Firstly, we introduced the Channel Swap SCDown Block (CSSB) and the Spatial Channel Attention Fusion Module (SCAFM) to enhance multi-modal fusion by effectively extracting and integrating complementary features from visible and infrared modalities. Secondly, to address the underutilization of multi-scale features, particularly for small objects, we developed the Dynamic-Weight-based Quad-Head Detector (DWQH). This module seamlessly integrates low-level details with high-level semantic information. Combined with the hybrid loss function Focaler–Inner IoU (FI-IoU), FQDNet achieves significant improvements in detection accuracy across various object sizes, especially small ones. Thirdly, to meet real-time requirements, we incorporated lightweight optimizations, including the Partial Cross-Stage Pyramid (PCSP) and SCDown modules, reducing computational complexity while maintaining high detection performance. Extensive experiments show that FQDNet improves mAP@[0.5:0.95] by 4.4% on M3FD, 3.5% on VEDAI, and 3.1% on LLVIP over the baseline, with only a 0.4M increase in parameters and 5.5 GFLOPs overhead. It also outperforms other mainstream dual-modal algorithms, showcasing robustness under challenging conditions such as remote sensing, low-light environments, densely populated scenes, and varying object scales.

Author Contributions

Conceptualization, F.M., G.T. and A.H.; methodology, F.M.; validation, F.M. and A.H.; investigation, F.M.; resources, F.M. and G.T.; data curation, A.H.; writing—original draft preparation, F.M.; writing—review and editing, F.M. and H.T.; supervision, G.T. and H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the project program of the Science and Technology on Micro-system Laboratory, No. 6142804230103.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote Sensing Object Detection in the Deep Learning Era—A Review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
  2. Payghode, V.; Goyal, A.; Bhan, A.; Iyer, S.S.; Dubey, A.K. Object Detection and Activity Recognition in Video Surveillance using Neural Networks. Int. J. Web Inf. Syst. 2023, 19, 123–138. [Google Scholar]
  3. Yang, B.; Li, J.; Zeng, T. A Review of Environmental Perception Technology Based on Multi-Sensor Information Fusion in Autonomous Driving. World Electr. Veh. J. 2025, 16, 20. [Google Scholar] [CrossRef]
  4. Zhao, H.; Chu, K.; Zhang, J.; Feng, C. YOLO-FSD: An Improved Target Detection Algorithm on Remote Sensing Images. IEEE Sens. J. 2023, 23, 30751–30764. [Google Scholar]
  5. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27706–27716. [Google Scholar]
  6. Li, Y.; Hu, Z.; Zhang, Y.; Liu, J.; Tu, W.; Yu, H. DDEYOLOv9: Network for Detecting and Counting Abnormal Fish Behaviors in Complex Water Environments. Fishes 2024, 9, 242. [Google Scholar] [CrossRef]
  7. Hussain, M. YOLOv1 to v8: Unveiling Each Variant–A Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833. [Google Scholar]
  8. Krišto, M.; Ivasic-Kos, M.; Pobar, M. Thermal Object Detection in Difficult Weather Conditions using YOLO. IEEE Access 2020, 8, 125459–125476. [Google Scholar]
  9. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-Frame Infrared Small-Target Detection: A Survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119. [Google Scholar] [CrossRef]
  10. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep Learning in Multimodal Remote Sensing Data Fusion: A Comprehensive Review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926. [Google Scholar]
  11. Sun, J.; Yin, M.; Wang, Z.; Xie, T.; Bei, S. Multispectral Object Detection Based on Multilevel Feature Fusion and Dual Feature Modulation. Electronics 2024, 13, 443. [Google Scholar] [CrossRef]
  12. Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks. In Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 27–29 April 2016; pp. 509–514. [Google Scholar]
  13. Fang, Q.; Han, D.; Wang, Z. Cross-modality Fusion Transformer for Multispectral Object Detection. arXiv 2021, arXiv:2111.00273. [Google Scholar]
  14. Shao, Y.; Huang, Q. MOD-YOLO: Multispectral Object Detection based on Transformer Dual-stream YOLO. Pattern Recognit. Lett. 2024, 183, 26–34. [Google Scholar]
  15. Meng, F.; Chen, X.; Tang, H.; Wang, C.; Tong, G. B2MFuse: A Bi-branch Multi-scale Infrared and Visible Image Fusion Network based on Joint Semantics Injection. IEEE Trans. Instrum. Meas. 2024, 73, 1–17. [Google Scholar]
  16. Zhang, Y.; Yu, H.; He, Y.; Wang, X.; Yang, W. Illumination-Guided RGBT Object Detection with Inter- and Intra-Modality Fusion. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar]
  17. Fu, L.; Gu, W.b.; Ai, Y.b.; Li, W.; Wang, D. Adaptive Spatial Pixel-level Feature Fusion Network for Multispectral Pedestrian Detection. Infrared Phys. Technol. 2021, 116, 103770. [Google Scholar]
  18. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  19. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar]
  20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar]
  21. Vijayakumar, A.; Vairavasundaram, S. YOLO-Based Object Detection Models: A Review and Its Applications. Multimedia Tools Appl. 2024, 83, 83535–83574. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  23. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  24. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 3–6. [Google Scholar]
  25. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  26. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query Denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 13619–13627. [Google Scholar]
  27. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  28. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21. [Google Scholar]
  29. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  30. Zhang, Q.; Wang, C.; Li, H.; Shen, S.; Cao, W.; Li, X.; Wang, D. Improved YOLOv8-CR network for detecting defects of the automotive MEMS pressure sensors. IEEE Sens. J. 2024, 24, 26935–26945. [Google Scholar]
  31. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-freebies Sets New state-of-the-art for Real-time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  32. Yang, C.; Dong, X.; Cheng, C.; Ou, X.; He, X. Research on Improved Semantic SLAM Adaptation to Dynamic Environment Based on YOLOv8. In Proceedings of the 13th IEEE Data Driven Control and Learning Systems Conference, Kaifeng, China, 17–19 May 2024; pp. 772–776. [Google Scholar]
  33. Gallagher, J.E.; Oughton, E.J. Surveying You Only Look Once (YOLO) Multispectral Object Detection Advancements, Applications and Challenges. IEEE Access 2025, 13, 7366–7395. [Google Scholar]
  34. Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO Architecture From Infrared and Visible images for Object Detection. Sensors 2023, 23, 2934. [Google Scholar] [CrossRef] [PubMed]
  35. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar]
  36. Cheng, Q.; Li, X.; Zhu, B.; Shi, Y.; Xie, B. Drone Detection Method based on MobileViT and CA-PANet. Electronics 2023, 12, 223. [Google Scholar] [CrossRef]
  37. Li, J.; Ye, J. Edge-YOLO: Lightweight Infrared Object Detection Method Deployed on Edge Devices. Appl. Sci. 2023, 13, 4402. [Google Scholar] [CrossRef]
  38. Huang, M.; Mi, W.; Wang, Y. EDGS-YOLOv8: An Improved YOLOv8 Lightweight UAV Detection Model. Drones 2024, 8, 337. [Google Scholar] [CrossRef]
  39. Qiu, X.; Chen, Y.; Cai, W.; Niu, M.; Li, J. LD-YOLOv10: A Lightweight Target Detection Algorithm for Drone Scenarios based on YOLOv10. Electronics 2024, 13, 3269. [Google Scholar] [CrossRef]
  40. Wu, X.; Jiang, X.; Dong, L. Gated Weighted Normative Feature Fusion for Multispectral Object Detection. Vis. Comput. 2024, 40, 6409–6419. [Google Scholar]
  41. Yuan, M.; Wei, X. C²Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar]
  42. Mehta, S.; Rastegari, M. MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  43. Zuo, X.; Wang, Z.; Liu, Y.; Shen, J.; Wang, H. LGADet: Light-weight Anchor-free Multispectral Pedestrian Detection with Mixed Local and Global Attention. Neural Process. Lett. 2023, 55, 2935–2952. [Google Scholar]
  44. Deng, L.; Fu, R.; Li, Z.; Liu, B.; Xue, M.; Cui, Y. Lightweight Cross-Modal Multispectral Pedestrian Detection Based on Spatial Reweighted Attention Mechanism. Comput. Mater. Contin. 2024, 78, 4071–4089. [Google Scholar] [CrossRef]
  45. Yan, P.; Wang, W.; Li, G.; Zhao, Y.; Wang, J.; Wen, Z. A Lightweight Coal Gangue Detection Method Based on Multispectral Imaging and Enhanced YOLOv8n. Microchem. J. 2024, 199, 110142. [Google Scholar] [CrossRef]
  46. Fang, Q.; Wang, Z. Cross-modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery. Pattern Recognit. 2022, 130, 108786. [Google Scholar]
  47. Wang, Y.; Huang, W.; Sun, F.; Xu, T.; Rong, Y.; Huang, J. Deep Multimodal Fusion by Channel Exchanging. Adv. Neural Inf. Proces. Syst. 2020, 33, 4835–4845. [Google Scholar]
  48. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  49. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  50. Zhou, D.; Fang, J.; Song, X.; Guan, C.; Yin, J.; Dai, Y.; Yang, R. Iou Loss for 2D/3D Object Detection. In Proceedings of the 7th International Conference on 3D Vision, Quebec, QC, Canada, 15–18 September 2019; pp. 85–94. [Google Scholar]
  51. Zhang, H.; Zhang, S. Focaler-IoU: More Focused Intersection over Union Loss. arXiv 2024, arXiv:2401.10525. [Google Scholar]
  52. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More Effective Intersection over Union Loss With Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
  53. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware Dual Adversarial Learning and a Multi-scenario Multi-modality Benchmark to Fuse Infrared and Visible for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5802–5811. [Google Scholar]
  54. Razakarivony, S.; Jurie, F. Vehicle Detection in Aerial Imagery: A Small Target Detection Benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar]
  55. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A Visible-Infrared Paired Dataset for Low-Light Vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3496–3504. [Google Scholar]
Figure 1. The overall architecture of the proposed RGB-IR object detection network FQDNet. It consists of a two-stream backbone using RGB and IR inputs, multi-modal feature fusion modules, a neck, and a Quad-Head Detector.
Figure 2. Channel Swap SCDown Block (CSSB).
Figure 3. Spatial Channel Attention Fusion Module (SCAFM).
Figure 4. Schematic diagram of PCSP (Partial Cross-Stage Pyramid) module. The symbol * in the figure represents a convolution operation.
Figure 5. Detection results on the M3FD dataset across four representative scenarios: low-light, complex background, rainy road, and sunny road. The first two columns show the ground truth in RGB and IR images, the next two present YOLOv8 results for single-modal RGB and IR, and the last two display our dual-modal model's predictions. Objects in red circles highlight cases where the single-modal detection algorithms miss or misclassify objects, whereas our approach successfully detects them.
Figure 6. Performance comparison between FQDNet and other state-of-the-art real-time algorithms in terms of mAP and number of parameters.
Figure 7. Visualization of detection results for lightweight methods on the M3FD dataset. Subfigures (a–e) represent different typical scenarios in the dataset. The first row shows the ground truth for RGB and IR images. Rows 2–7 display the detection results of Dual-YOLOv5n, Dual-YOLOv7tiny, Dual-YOLOv8n, Dual-YOLOv9t, Dual-YOLOv10n, and FQDNet_n, respectively. Detections are visualized on RGB images with a confidence threshold of 0.3.
Figure 8. Visualization of FQDNet_s on the VEDAI dataset. (a–e) represent different typical scenarios in the dataset. The first row shows the ground-truth annotations on RGB and IR images. The second and third rows display the detection results of FQDNet_s, annotated on RGB and IR images, respectively.
Figure 9. Visualization of FQDNet_s on the LLVIP dataset. (a–d) represent different typical scenarios in the dataset. The first row shows the ground-truth annotations on RGB and IR images. The second and third rows display the detection results of FQDNet_s annotated on RGB and IR images, respectively.
Table 1. Ablation experiments on the M3FD dataset for the fusion strategy (FT), DWQH, and lightweight modules (LW). The optimal values are highlighted in bold, and the second-best values are underlined. Changes relative to the baseline are shown in parentheses.

| Method | FT | DWQH | LW | Param (M) | FLOPs (G) | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | - | - | - | 4.3 | 11.4 | 89.1 | 74.1 | 81.9 | 55.5 |
| 1 | ✓ | - | - | 4.4 | 12.1 | 85.7 | 76.5 | 82.6 (+0.7) | 56.0 (+0.5) |
| 2 | - | ✓ | - | 5.6 | 18.6 | 87.7 | 81.6 | 88.4 (+6.5) | 60.2 (+4.7) |
| 3 | - | - | ✓ | 3.0 | 9.5 | 88.3 | 73.1 | 81.1 | 53.4 |
| 4 | ✓ | ✓ | - | 5.7 (+1.4/32.6%) | 19.3 (+7.9/69.3%) | 87.4 | 82.0 | 89.0 (+7.1) | 60.5 (+5.0) |
| 5 | ✓ | - | ✓ | 3.4 | 10.2 | 88.2 | 74.8 | 82.4 | 54.8 |
| 6 | - | ✓ | ✓ | 4.3 | 16.2 | 86.8 | 80.9 | 87.4 | 59.4 |
| 7 | ✓ | ✓ | ✓ | 4.7 (+0.4/9.3%) | 16.9 (+5.5/48.2%) | 87.2 | 82.1 | 88.8 (+6.9) | 59.9 (+4.4) |
Table 2. Comparison of detection accuracy and model complexity of high-precision and real-time object detectors on the M3FD dataset. Test image resolution is 640 × 640. Optimal values are highlighted in bold, and second-best values are underlined.

| Method | Param (M) | FLOPs (G) | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
|---|---|---|---|---|---|---|
| Dual-YOLOv5n | 2.8 | 7.0 | 85.7 | 71.3 | 78.2 | 47.0 |
| Dual-YOLOv7tiny | 8.5 | 20.4 | 85.7 | 73.6 | 80.4 | 50.0 |
| Dual-YOLOv8n | 4.3 | 11.4 | 89.1 | 74.1 | 81.9 | 55.5 |
| Dual-YOLOv9t | 3.5 | 14.9 | 88.0 | 71.9 | 79.6 | 53.3 |
| Dual-YOLOv10n | 3.9 | 11.8 | 83.5 | 70.2 | 77.8 | 50.8 |
| FQDNet_n (Ours) | 4.7 | 16.9 | 87.2 | 82.1 | 88.8 | 59.9 |
| Dual-YOLOv5s | 11.2 | 26.5 | 89.7 | 77.1 | 84.6 | 53.7 |
| Dual-YOLOv8s | 16.2 | 41.0 | 87.6 | 77.3 | 83.6 | 58.3 |
| Dual-YOLOv9s | 12.6 | 51.5 | 90.2 | 77.6 | 84.4 | 57.0 |
| Dual-YOLOv10s | 12.0 | 36.6 | 88.5 | 74.5 | 83.1 | 55.5 |
| FQDNet_s (Ours) | 17.6 | 53.8 | 91.3 | 86.2 | 91.9 | 63.0 |
| Dual-YOLOv5m | 33.1 | 79.6 | 89.7 | 79.0 | 86.1 | 56.7 |
| Dual-YOLOv7 | 50.6 | 171.9 | 89.0 | 81.9 | 87.8 | 57.8 |
| Dual-YOLOv8m | 37.7 | 117.6 | 89.0 | 77.7 | 85.8 | 59.6 |
| Dual-YOLOv9m | 40.3 | 165.5 | 90.3 | 78.6 | 85.4 | 59.5 |
| Dual-YOLOv10m | 25.0 | 100.1 | 88.1 | 78.3 | 84.0 | 57.2 |
| Dual-YOLOv10b | 31.6 | 159.2 | 88.8 | 77.4 | 84.4 | 57.2 |
| FQDNet_m (Ours) | 37.2 | 134.3 | 90.5 | 85.4 | 91.3 | 64.0 |
| Dual-YOLOv5l | 72.7 | 178.1 | 91.1 | 80.9 | 87.6 | 59.2 |
| Dual-YOLOv7x | 91.1 | 290.4 | 90.8 | 82.8 | 87.3 | 58.9 |
| Dual-YOLOv8l | 63.4 | 251.0 | 88.9 | 80.0 | 86.3 | 60.7 |
| Dual-YOLOv9c | 69.1 | 330.8 | 91.1 | 80.8 | 88.1 | 61.1 |
| Dual-YOLOv9e | 99.0 | 362.4 | 89.3 | 82.4 | 88.6 | 61.7 |
| Dual-YOLOv10l | 40.7 | 209.0 | 91.0 | 77.4 | 85.9 | 58.1 |
| Dual-YOLOv10x | 46.5 | 271.2 | 88.8 | 80.3 | 86.4 | 58.3 |
| FQDNet_l (Ours) | 59.6 | 267.5 | 89.6 | 86.3 | 92.1 | 64.1 |
Table 3. Comparison of experimental results on M3FD. Optimal values are highlighted in bold, and second-best values are underlined. Changes relative to the baseline are shown in parentheses.

| Method | Data Modality | Param (M) | FLOPs (G) | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
|---|---|---|---|---|---|---|---|
| YOLOv8n | RGB | 3.0 | 8.2 | 84.0 | 71.8 | 78.3 | 51.4 |
| YOLOv8n | IR | 3.0 | 8.2 | 79.0 | 68.6 | 74.6 | 49.1 |
| Dual-YOLOv8n | RGB+IR | 4.3 | 11.4 | 89.1 | 74.1 | 81.9 | 55.5 |
| CFT [13] | RGB+IR | 44.9 | 17.9 | 90.3 | 77.7 | 85.1 | 53.9 |
| YOLOFusion [46] | RGB+IR | 12.5 | 28.6 | 86.0 | 82.1 | 87.7 | 53.8 |
| SuperYOLO [35] | RGB+IR | 4.9 | 56.3 | 90.1 | 80.1 | 88.0 | 54.8 |
| MOD-YOLO [14] | RGB+IR | 24.9 | 35.7 | 88.1 | 79.6 | 84.5 | 57.7 |
| FQDNet_n (Ours) | RGB+IR | 4.7 | 16.9 | 87.2 | 82.1 | 88.8 (+6.9) | 59.9 (+4.4) |
| FQDNet_s (Ours) | RGB+IR | 17.6 | 53.8 | 91.3 | 86.2 | 91.9 (+10.0) | 63.0 (+7.5) |
Table 4. Comparison of experimental results on the VEDAI dataset. Optimal values are highlighted in bold, and the second-best values are underlined. Changes relative to the baseline are shown in parentheses.

| Method | Data Modality | Param (M) | FLOPs (G) | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
|---|---|---|---|---|---|---|---|
| YOLOv8n | RGB | 3.0 | 8.2 | 59.0 | 67.8 | 66.8 | 39.5 |
| YOLOv8n | IR | 3.0 | 8.2 | 61.3 | 56.1 | 62.4 | 37.4 |
| Dual-YOLOv8n | RGB+IR | 4.3 | 11.4 | 71.0 | 63.6 | 67.1 | 40.8 |
| CFT [13] | RGB+IR | 44.9 | 17.9 | 68.6 | 65.7 | 70.5 | 42.6 |
| YOLOFusion [46] | RGB+IR | 12.5 | 28.6 | 75.8 | 64.8 | 73.3 | 43.8 |
| SuperYOLO [35] | RGB+IR | 4.9 | 56.3 | 67.2 | 71.1 | 72.4 | 44.2 |
| MOD-YOLO [14] | RGB+IR | 24.9 | 35.7 | 70.1 | 65.3 | 71.8 | 41.9 |
| FQDNet_n (Ours) | RGB+IR | 4.7 | 16.9 | 62.3 | 77.0 | 73.0 (+5.9) | 44.2 (+3.5) |
| FQDNet_s (Ours) | RGB+IR | 17.6 | 53.8 | 73.9 | 65.0 | 75.9 (+8.8) | 47.7 (+6.9) |
Table 5. Comparison of experimental results on the LLVIP dataset. Optimal values are highlighted in bold, and the second-best values are underlined. Changes relative to the baseline are shown in parentheses.

| Method | Data Modality | Param (M) | FLOPs (G) | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@[0.5:0.95] (%) |
|---|---|---|---|---|---|---|---|
| YOLOv8n | RGB | 3.0 | 8.2 | 87.4 | 79.3 | 86.9 | 48.5 |
| YOLOv8n | IR | 3.0 | 8.2 | 90.3 | 88.5 | 92.6 | 58.9 |
| Dual-YOLOv8n | RGB+IR | 4.3 | 11.4 | 92.4 | 89.4 | 94.6 | 58.2 |
| CFT [13] | RGB+IR | 44.9 | 17.9 | 93.5 | 90.7 | 95.4 | 61.5 |
| YOLOFusion [46] | RGB+IR | 12.5 | 28.6 | 89.7 | 90.5 | 93.1 | 57.9 |
| SuperYOLO [35] | RGB+IR | 4.9 | 56.3 | 89.2 | 89.2 | 93.2 | 58.5 |
| MOD-YOLO [14] | RGB+IR | 24.9 | 35.7 | 94.0 | 90.3 | 95.2 | 60.6 |
| FQDNet_n (Ours) | RGB+IR | 4.7 | 16.9 | 94.6 | 89.7 | 95.5 (+0.9) | 61.3 (+3.1) |
| FQDNet_s (Ours) | RGB+IR | 17.6 | 53.8 | 94.1 | 91.3 | 96.4 (+1.8) | 64.1 (+5.9) |