Article

Drone-Based Visible–Thermal Object Detection with Transformers and Prompt Tuning

by Rui Chen 1, Dongdong Li 1,*, Zhinan Gao 1, Yangliu Kuai 2 and Chengyuan Wang 3

1 College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
2 College of Intelligent Science and Technology, National University of Defense Technology, Changsha 410073, China
3 Information and Communication College, National University of Defense Technology, Wuhan 430010, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(9), 451; https://doi.org/10.3390/drones8090451
Submission received: 31 July 2024 / Revised: 21 August 2024 / Accepted: 29 August 2024 / Published: 1 September 2024

Abstract: The use of unmanned aerial vehicles (UAVs) for visible–thermal object detection has emerged as a powerful technique to improve accuracy and resilience in challenging contexts, including dim lighting and severe weather conditions. However, most existing research relies on Convolutional Neural Network (CNN) frameworks, limiting the application of the Transformer’s attention mechanism to mere fusion modules and neglecting its potential for comprehensive global feature modeling. In response to this limitation, this study introduces an innovative dual-modal object detection framework called Visual Prompt multi-modal Detection (VIP-Det) that harnesses the Transformer architecture as the primary feature extractor and integrates vision prompts for refined feature fusion. Our approach begins with the training of a single-modal baseline model to solidify robust model representations, which is then refined through fine-tuning that incorporates additional modal data and prompts. Tests on the DroneVehicle dataset show that our algorithm achieves remarkable accuracy, outperforming comparable Transformer-based methods. These findings indicate that our proposed methodology marks a significant advancement in the realm of UAV-based object detection and holds considerable promise for enhancing autonomous surveillance and monitoring capabilities in varied and challenging environments.

1. Introduction

Object detection, a central challenge in computer vision, necessitates algorithms that possess robust classification capabilities and precise spatial localization for the identification and location of various targets, such as humans, animals, and vehicles, in images and videos. The performance of detection has been markedly improved by the rapid advancement of deep learning, particularly Convolutional Neural Networks (CNNs) [1], fueling progress in the field and spurring interest in downstream tasks [2,3,4]. The rise of unmanned aerial vehicles (UAVs), with their agility and efficient data collection capabilities, has given birth to the task of drone-based object detection [5]. However, the significant scale variations and variable angles in UAV imagery pose challenges to object detection. Existing algorithms for rotated object detection [6,7,8,9,10], often designed for remote sensing images, struggle to meet these demands.
In the field of drone-based object detection, current algorithms primarily depend on visible light imagery, which inherently limits their effectiveness in complex environments such as nighttime, rainy conditions, dense fog, and instances of occlusion (see Figure 1). With the advancement of sensor technology, modern drones are equipped with a variety of sensors, including infrared payloads, vastly expanding their range of applications and making dual-modal object detection a hot topic of research in the drone sector. The distinct imaging mechanism of infrared, which captures thermal energy, complements visible light imagery, markedly improving the precision and robustness of object detection. However, existing dual-modal detection algorithms often employ dual-stream backbone networks to process each modality separately, neglecting the issue of information imbalance between the two modalities. This leads to a substantial amount of parameter redundancy, highlighting the need for research into more efficient fusion strategies.
The Transformer architecture has achieved great success in the field of natural language processing [11,12] and has since been adopted by researchers in the realm of computer vision [13,14,15]. Its efficiency in processing long-range dependencies and its parallelization capabilities have established it as a new paradigm. Dealing with issues such as lighting variations and target occlusions in dual-modal object detection poses challenges for CNNs. CNNs excel at local feature extraction via convolutions but struggle with lighting changes that alter pixel values and occlusion that disrupts these local patterns. Their limited global context understanding and multi-modal interaction hinder performance. Transformers, on the other hand, leverage global context capture capabilities, enabling better generalization across different lighting conditions. For occlusions, Transformers utilize pre-trained masking mechanisms to handle obscured regions, and their self-attention mechanism tracks information changes before and after occlusion, facilitating robust multi-modal global information interaction. However, in the domain of dual-modal object detection, the application of Transformers is limited, with their attention mechanisms often confined to the fusion module [16], not fully harnessing their potential for understanding global context. Additionally, with the emergence of efficient self-supervised learning methods like Masked Autoencoders (MAEs) [17], the Vision Transformer (ViT) [13] architecture can leverage a wealth of pre-trained weights, offering superior feature extraction and generalization for downstream tasks. Therefore, employing a ViT for dual-modal object detection is a promising and innovative approach.
Considering the fine-tuning of models pre-trained on extensive datasets, visual prompt tuning has emerged as a dominant approach. It significantly lightens the computational load and storage requirements of model fine-tuning by introducing only a few parameters. The VPT [18] integrates prompts into pre-trained networks through embeddings, yielding favorable results across 24 downstream tasks in fine-grained classification. The ViPT [19] creatively employs prompts as a dual-modal fusion tool, expanding visible light object tracking to include infrared, depth, and event-based image tracking. Drawing from this insight, we can conceptualize dual-modal object detection as a fine-tuning task. By refining single-modal benchmark models with prompts, we can transition them to dual-modal detection, thereby improving their versatility and robustness for complex scenarios.
To this end, we develop a Transformer-based algorithm for visible–infrared object detection, named Visual Prompt multi-modal Detection (VIP-Det). To fully exploit the capabilities of Transformer-based dual-modal algorithms, the Vision Transformer is utilized as the backbone for feature extraction, leveraging its strength in capturing long-range dependencies and global context. To simplify the complex architecture of dual-modal object detection, a prompt-based fusion module is devised that introduces prompts for fusion within a single-stream network, significantly reducing the number of parameters. To optimize pre-trained models and balance modal information, a stage-wise optimization strategy is introduced that commences with training single-modal benchmark models and subsequently refines features with additional modalities, fostering more effective modal integration and refined feature extraction. Our algorithm is tested on the DroneVehicle dataset, and the results demonstrate that it achieves high precision and adeptly accommodates the demands of object detection in intricate settings.
In summary, the contributions of this paper are as follows:
  • We propose a novel Transformer-based framework for dual-modal object detection, which incorporates the Vision Transformer (ViT) as a backbone, capable of efficiently extracting features and enhancing the precision of object detection;
  • We introduce a prompt-based fusion module and a stage-wise optimization strategy, utilizing prompts to guide feature fusion and enhance the aggregation capabilities of dual-modal information. Additionally, we employ a phased fine-tuning approach to guide parameter optimization, thereby better transferring the feature representation capabilities of the original model;
  • We assess the performance of our proposed framework on the DroneVehicle dataset and showcase its superior accuracy when compared to other comparable Transformer-based methods.

2. Related Work

2.1. Visible–Thermal Object Detection

Visible–thermal object detection algorithms stand as prime examples of image fusion technology, overcoming the limitations of single-modality images in complex environments by integrating complementary data from visible and thermal imagery. This synergy significantly enhances the precision and robustness of object detection. Researchers have not only compiled diverse datasets such as KAIST [20], DVTOD [21], and DroneVehicle [22] but have also proposed various cutting-edge algorithmic frameworks. Halfway Fusion [23] excels in merging visible and thermal information at the midlevel feature stage through a unique ConvNet architecture. UA-CMDet [22] introduces an uncertainty-aware mechanism that dynamically assesses the uncertainty of each modality and proposes a novel light-aware cross-modal non-maximum suppression algorithm to further improve detection. C2Former [16] focuses on cross-modal attention learning, facilitating interaction between RGB and thermal data via the ICA module while enhancing computational efficiency with the AFS module. TSFADet [24] offers the TSRA module for precise alignment of features.
However, current visible–thermal object detection algorithms, often based on dual-backbone networks, grapple with high complexity and a large number of parameters. The unequal significance of visible and thermal information under different environmental conditions challenges the assumption of equal importance, necessitating the development of more efficient, lightweight fusion strategies and intelligent mechanisms for adjusting modal weights as a critical research focus.

2.2. Vision Transformer for Object Detection

Inspired by the way humans process information, attention mechanisms in deep learning models dynamically adjust the weights of different parts to enable the models to focus on the critical portions of the input data, thereby enhancing their performance [25,26]. The Transformer is one of the best examples that showcases the power of attention mechanisms. The Transformer model, renowned for its global modeling and parallel processing capabilities in NLP [11,12], has intrigued the field of computer vision. The Vision Transformer (ViT) [13] revolutionized image processing by treating image patches as core processing elements. In the realm of object detection, DETR [15] introduced a new approach by discarding conventional anchor boxes and non-maximum suppression, utilizing an attention-based encoder–decoder framework for direct bounding box and category prediction. The Swin Transformer [14,27] significantly accelerates computation through the use of a shifted window mechanism and hierarchical structure. The innovation of Masked Autoencoders (MAEs) [17] for ViT pre-training advanced the field, facilitating self-supervised learning through the prediction of masked pixels, leading to the emergence of models like ViTDet [28], MIMDet [29], and ImTed [30] that enhance detection with MAE pre-trained weights. For remote sensing, RVSA [31,32] tailored a ViT for detecting rotating objects by adjusting attention mechanisms, while STD [33] employed separate network branches to predict bounding box attributes, harnessing ViT’s spatial transformation abilities. In the field of drone-based object detection, a Hybrid Convolutional–Transformer framework [34] was proposed to address the challenge of weak supervision in drone-view imagery.
Nevertheless, the full capacity of Transformers in visible–thermal object detection remains untapped. Currently, Transformers are predominantly used as fusion components alongside CNNs [16], rather than independently harnessing their global modeling and spatial transformation strengths. Future research should focus on the explicit and customized application of Transformer models to visible–thermal object detection. This necessitates developing Transformer architectures that cater to the unique aspects of visible and infrared imagery, propelling advancements in this domain.

2.3. Vision Prompt Tuning

Fine-tuning large-scale pre-trained models on downstream tasks has become a prevalent training strategy for numerous NLP and CV tasks. The essence of this approach is to perform a comprehensive update of the model parameters on a specific dataset. However, this method is less efficient in terms of parameter utilization, as it necessitates creating unique model replicas for each new task and requires storing the enormous pre-trained models. In contrast to this conventional approach, prompt tuning has emerged as a novel training paradigm and is increasingly becoming a dominant approach for fine-tuning in computer vision. This involves training a large foundational visual model with extensive data and then using different prompts to accomplish various tasks. The image inpainting algorithm of [35] trains a model with an inpainting objective, allowing it to rely on visual prompts to perform various tasks. The SAM [36] algorithm uses repeated prompts to direct the model’s output, with prompt formats such as points, bounding boxes, masks, and text, which describe target objects for segmentation. VPT [18] outperforms full fine-tuning in classification tasks by embedding prompt parameters before the input. ViPT [19] learns modality-specific prompts to adapt frozen pre-trained foundational models to a range of downstream dual-modal tracking tasks, including RGB + Depth, RGB + Thermal, and RGB + Event tracking.

3. Models and Methods

In this section, we introduce VIP-Det (Visual Prompt dual-modal Detection), an innovative algorithm for drone-based visible–thermal object detection that leverages the Vision Transformer architecture. The section commences with an exposition of the motivations that drove the development of the algorithm and an elucidation of its overarching framework. Subsequently, it delves into the technical nuances of the implementation of the prompt-based fusion module. It concludes with a description of the algorithm’s stage-wise training optimization strategy.

3.1. Overview

Traditional drone-based object detection algorithms are often limited to visible light imagery and may fail under complex environmental conditions such as nighttime, rainy weather, fog, and occlusions. Existing visible–thermal object detection algorithms typically rely on dual-stream backbone networks for feature extraction, which significantly increases the number of parameters and is hindered by the imbalance between the two modalities, thereby limiting the efficiency of their fusion. Vision Transformers (ViTs) have demonstrated impressive performance across a wide range of visual tasks; however, in the domain of visible–thermal object detection, their attention mechanisms are often confined to the fusion module, and the potential of their feature modeling has not been fully exploited.
To address these limitations, our VIP-Det, designed for visible–thermal object detection, introduces the Vision Transformer as its backbone. The algorithm adopts a single-stream network architecture to concurrently extract features from visible and infrared images. A novel prompt mechanism is employed to introduce a small set of learnable parameters for feature-level integration. During training, the algorithm first establishes a baseline model on single-modal data and then refines the model parameters using dual-modal data. The overall network architecture is designed to efficiently integrate the information from both modalities, aiming to enhance the algorithm’s capability in object detection. The overall architecture is illustrated in the accompanying Figure 2.
Our VIP-Det algorithm is composed of four main components: a data preprocessing module, a prompt-based fusion module, a feature extraction module, and a rotated detection head. The data preprocessing module processes both visible light and infrared imagery by patching them into tokenized form and stacking them. The prompt-based fusion module introduces prompts as learnable parameters, guiding feature fusion through training iterations. These prompt-embedded tokens are then inputted into the feature extraction module. This module employs an MAE pre-trained Vision Transformer model, which features 12 layers of Transformer blocks, as its backbone network instead of ResNet-50. The extracted features are then fed into the rotated detection head for classification and regression. In our experiments, we utilized the rotated detection head of STD [33] to achieve the most precise detection results. During the training process, we initially selected one modality as the baseline model and trained it to establish a foundation. Subsequently, we integrated the other modality to facilitate fusion, achieving dual-modal object detection with minimal parameter adjustments for efficient fine-tuning.

3.2. Vision Transformer Architecture

The Vision Transformer (ViT) represents a significant advancement in computer vision, reshaping the traditional approach of Convolutional Neural Networks (CNNs). Instead of sliding convolutional kernels across an image to extract features, the ViT divides the input image into a grid of nonoverlapping patches. Each of these patches is then flattened and converted into a vector, effectively transforming the two-dimensional image data into a sequence of one-dimensional vectors. To encode the spatial relationships between these patches, positional encodings are added to the vector representations, ensuring that the model can distinguish and utilize the positional information. These enriched embeddings serve as the input to a stack of Transformer encoder layers, which form the core of ViT’s architecture. Each encoder layer leverages self-attention mechanisms to allow each patch to attend to and interact with every other patch in the sequence, capturing long-range dependencies and contextual information. This is complemented by feedforward neural networks, which introduce non-linearities and enable the model to learn complex feature representations. As the embeddings traverse through the stacked encoder layers, they are progressively transformed and enriched, ultimately encoding a rich semantic understanding of the input image. In the context of this specific task, the output of the final encoder layer, now enriched with features extracted from both visible and infrared modalities, serves as the foundation for subsequent dual-modal target detection. These features, reflecting the unique properties of both spectra, empower the model to detect and identify objects with unprecedented accuracy and robustness, demonstrating the versatility and power of the Vision Transformer framework in addressing complex computer vision challenges.
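To make this pipeline concrete, the following is a minimal PyTorch-style sketch, not the authors’ implementation, of the steps just described: non-overlapping patch embedding, added positional encodings, and a stack of Transformer encoder layers. The class name and default hyperparameters are illustrative.

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT backbone sketch: patchify, embed, add positions, encode."""
    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12, heads=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution flattens each non-overlapping patch into a d-dimensional vector.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                           # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                # (B, dim, H/ps, W/ps)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, m, dim) patch tokens
        tokens = tokens + self.pos_embed            # inject positional information
        return self.encoder(tokens)                 # (B, m, dim) enriched features

feats = TinyViT(depth=2)(torch.randn(1, 3, 224, 224))  # shallow depth for a quick check
print(feats.shape)                                      # torch.Size([1, 196, 768])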

3.3. Prompt-Based Fusion

3.3.1. Overview

To fine-tune a single-modal object detection model based on prompts, we first need a pre-trained baseline model for the specific modality, where the embedding layer and Vision Transformer layers are already equipped with relevant parameters. During subsequent fine-tuning, these layers are frozen to preserve their feature extraction capabilities. When pre-training the baseline model, the embedding layer and Transformer layers for feature extraction of that modality, along with the detection head, are trained, while the prompt layer remains untrained.
For simplicity, let us assume there is a pre-trained baseline model for the visible light modality; hence, the visible embedding layer is frozen, and a certain number of Vision Transformer layers are also frozen as per the requirement. Since the infrared modality has not been trained, the infrared embedding layer requires fine-tuning.
For input images, visible light and infrared images are separately fed into their corresponding embedding layers, where they undergo patch partitioning and encoding to obtain visible tokens and infrared tokens. This step is performed in the data preprocessing module. Subsequently, the prompt layer is initialized to generate a certain number of prompt tokens, which are then combined with the previously extracted visible and infrared tokens to form fused tokens. As training progresses, the parameters of the prompt layer are iteratively fine-tuned to produce prompt tokens with lower losses.
These fused tokens are then fed into the Transformer layers of the feature extraction module, where the output from the previous layer, including the prompt tokens, serves as input for the next layer. This process continues through all Transformer layers, from which the visible and infrared tokens are extracted to obtain feature maps. These feature maps are then input into the rotated detection head. For the specific code process, refer to Appendix A.

3.3.2. Introduction

Given a pair of prealigned and coregistered visible and thermal images, denoted $x_v \in \mathbb{R}^{3 \times H \times W}$ (where $v$ stands for visible) and $x_t \in \mathbb{R}^{3 \times H \times W}$ (where $t$ stands for thermal), respectively, with $H$ and $W$ being the height and width of the images, and assuming a batch size of 1 for simplicity, we explore the integration of these modalities within a Vision Transformer framework for object detection.

3.3.3. Image Patch Embedding

A typical ViT with $N$ layers divides the input images into $m$ fixed-size patches $I_v^j \in \mathbb{R}^{3 \times h \times w}$ and $I_t^j \in \mathbb{R}^{3 \times h \times w}$ for $j \in \mathbb{N}$, $1 \le j \le m$, where $h$ and $w$ are the height and width of each patch. These patches are then embedded into a $d$-dimensional latent space, and position encodings are added:

$$e_{v,0}^{j} = \mathrm{Embed}(I_v^j), \quad e_{v,0}^{j} \in \mathbb{R}^d, \quad j = 1, 2, \ldots, m$$
$$e_{t,0}^{j} = \mathrm{Embed}(I_t^j), \quad e_{t,0}^{j} \in \mathbb{R}^d, \quad j = 1, 2, \ldots, m$$

The sets of patch tokens at layer $i$ are represented as:

$$E_v^i = \{ e_{v,i}^{j} \in \mathbb{R}^d \mid j \in \mathbb{N},\ 1 \le j \le m \}, \qquad E_t^i = \{ e_{t,i}^{j} \in \mathbb{R}^d \mid j \in \mathbb{N},\ 1 \le j \le m \}$$

3.3.4. Prompt-Based Feature Fusion

To facilitate dual-modal feature fusion, we introduce a set of continuous prompt tokens $P = \{ p^k \in \mathbb{R}^d \mid k \in \mathbb{N},\ 1 \le k \le p \}$, initialized randomly and inserted before the first encoder layer $L_1$ of the pre-trained Transformer. During fine-tuning, only the task-relevant prompts are updated, while the main Transformer parameters are frozen. This leads to:

$$[E_v^1, E_t^1, Z^1] = L_1([E_v^0, E_t^0, P])$$
where $Z$ represents the prompt parameters after iteration within the network. The forward pass through the Transformer layers can be expressed as:

$$[E_v^i, E_t^i, Z^i] = L_i([E_v^{i-1}, E_t^{i-1}, Z^{i-1}]), \quad i = 1, 2, \ldots, N$$

Each layer $L_i$ consists of multi-head self-attention (MSA) and a feedforward network (FFN), accompanied by Layer Normalization (LayerNorm) and residual connections.
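As a concrete illustration of the forward pass above, the sketch below shows how prompt tokens can be concatenated with the visible and thermal tokens and propagated through the encoder stack, with only the prompts left trainable. It is a simplified PyTorch-style sketch under our reading of the method, not the released code; the class and argument names (PromptFusionEncoder, num_prompts) are illustrative.

import torch
import torch.nn as nn

class PromptFusionEncoder(nn.Module):
    """Sketch of [E_v^i, E_t^i, Z^i] = L_i([E_v^{i-1}, E_t^{i-1}, Z^{i-1}])."""
    def __init__(self, dim=768, depth=12, heads=12, num_prompts=10):
        super().__init__()
        # Learnable prompt tokens P, inserted before the first encoder layer.
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)])
        # In stage-wise fine-tuning, these layers would be loaded from the pre-trained
        # single-modal model and (partially) frozen, so mainly the prompts are updated.

    def forward(self, e_v, e_t):                 # (B, m, d) visible / thermal tokens
        B, m, _ = e_v.shape
        z = self.prompts.expand(B, -1, -1)       # broadcast prompts over the batch
        x = torch.cat([e_v, e_t, z], dim=1)      # [E_v^0, E_t^0, P]
        for layer in self.layers:                # one pass per Transformer layer L_i
            x = layer(x)
        return x[:, :m], x[:, m:2 * m]           # fused visible / thermal features

enc = PromptFusionEncoder(depth=2)               # shallow depth just for the demo
f_v, f_t = enc(torch.randn(1, 196, 768), torch.randn(1, 196, 768))
print(f_v.shape, f_t.shape)                      # both (1, 196, 768), fed to the rotated head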

3.3.5. Detection Head

Finally, a rotated object detection head, denoted as $\mathrm{Head}$, processes the fused features from the last layer to predict rotated bounding boxes and categories:

$$y = \mathrm{Head}(E_v^N, E_t^N)$$

3.4. Stage-Wise Training Optimization

The typical approach to dual-modal object detection adheres to a standardized process: Initially, two separate backbone networks are employed to extract features from paired visible and infrared images independently. Subsequently, these dual-modal features are fed into a feature fusion module, integrating information from the two distinct modalities. Ultimately, an object detection head is leveraged for regression prediction, enabling the localization and classification of objects within the images.
However, a notable issue arises from this methodology: During a single training cycle, all network architecture parameters must be learned, resulting in a substantial parameter count and sluggish training speed. To address this challenge, we propose a stage-wise training optimization strategy. See Figure 3.
First, we individually train the mono-modal visible and infrared images using a Vision Transformer backbone network, aiming to develop the capacity to extract fundamental and generic features. This phase targets the establishment of benchmark mono-modal models. Next, we proceed with dual-modal image inputs for modal fine-tuning, freezing partial weights within the backbone networks and introducing prompt parameters for fine-tuning. This promotes efficient dual-modal feature fusion.
By adopting this training paradigm, we not only drastically reduce the parameter count but also simplify the overall model architecture. As there is no need for a separate feature fusion module, our approach relies solely on a minimal set of prompt parameters to achieve dual-modal feature fusion. This not only decreases model complexity but also renders the model more concise and interpretable.
Furthermore, our method harnesses the power of pre-trained models, facilitating seamless migration to dual-modal object detection tasks. By maintaining the invariance of selected weights from the pre-trained models during the fine-tuning phase, our approach effectively leverages the rich feature representations already learned, further enhancing the performance of dual-modal object detection.
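The freezing logic of the second stage can be sketched as follows. This is a hedged illustration assuming a PyTorch model with submodules named embed_v (visible embedding), encoder.layers (Transformer blocks), prompts, and head; these names are ours, not taken from the released code.

def stage_two_parameters(model, num_frozen_blocks=6):
    """Freeze the pre-trained visible embedding and the first encoder blocks;
    the prompt tokens, thermal embedding, and detection head remain trainable."""
    for p in model.parameters():
        p.requires_grad = True
    for p in model.embed_v.parameters():
        p.requires_grad = False
    for block in model.encoder.layers[:num_frozen_blocks]:
        for p in block.parameters():
            p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

# Only the surviving parameters are handed to the optimizer, e.g.:
# torch.optim.SGD(stage_two_parameters(model), lr=0.001, momentum=0.9, weight_decay=0.0001)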

4. Results

In this section, we commence by detailing the datasets and evaluation metrics employed in our experiments. Subsequently, we provide the pertinent setup and configuration details. We proceed with a series of ablation studies to validate the efficacy of our algorithm. Finally, we conduct comparative experiments against related algorithms.

4.1. Datasets and Evaluation Metric

The DroneVehicle [22] dataset is a comprehensive and diverse collection of RGB–infrared (RGB-IR) images captured by drones. This dataset encompasses a wide range of scenarios, including urban roads, residential areas, parking lots, and other environments, spanning various times of the day and night. The dataset consists of 28,439 image pairs, each pair containing corresponding RGB and infrared images that have been precisely aligned to ensure accurate representation of the scene. The annotations provided by the dataset authors are extensive and include oriented bounding boxes for five distinct vehicle categories: cars, buses, trucks, vans, and freight cars. The dataset is organized into a training set and a test set, with the training set comprising 17,990 image pairs and the test set consisting of 1469 image pairs. Our experiments were conducted on this DroneVehicle dataset, leveraging its richness and variability to test and refine our vehicle detection algorithm.
We utilized the mean average precision (mAP) as the primary evaluation metric for our detection algorithm, applied to the validation set. To ensure accurate detections, we employed an Intersection over Union (IoU) threshold of 0.5, which helped filter out false positives and contributed to a reliable assessment of the algorithm’s performance.
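For illustration, the check below applies the 0.5 threshold to axis-aligned boxes in (x1, y1, x2, y2) form; the benchmark itself scores oriented boxes, whose IoU is computed on rotated polygons, but the thresholding logic is the same.

def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive only if its IoU with a ground-truth box is >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) >= 0.5)  # False: IoU = 50 / 150 ≈ 0.33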

4.2. Implementation Details

Utilizing pre-trained weights initialized from the MAE, we embarked on training our network specifically for the DroneVehicle dataset, leveraging the computational prowess of an NVIDIA RTX 4090 GPU. Our training strategy commenced with initializing a single-modal base model through 12 epochs, subsequently transitioning into a 12-epoch fusion model training phase, where prompts were integrated to enhance performance. The optimization process employed stochastic gradient descent (SGD) equipped with a momentum factor of 0.9 and a weight decay rate of 0.0001.
During each training iteration, we processed batches containing two images apiece, initiating with a learning rate of 0.001. This learning rate underwent a strategic halving at epochs 8 and 11 to facilitate a smoother convergence. To augment the training data and bolster the model’s generalization capabilities, we applied various image transformations such as flipping, cropping, and splicing.
Post-training, during the inference phase, we utilized non-maximum suppression (NMS) with an Intersection over Union (IoU) threshold set at 0.3 to effectively eliminate redundant bounding boxes, ensuring the precision of our detections. Throughout these endeavors, we leveraged customized versions of the MMRotate and MMDetection frameworks.
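For reference, the schedule above could be written as an MMDetection/MMRotate 2.x-style config fragment along the following lines; the field names follow the public 2.x conventions and are not copied from the authors’ configuration files.

# Hypothetical config fragment mirroring the text: SGD with momentum 0.9 and weight
# decay 1e-4, lr 0.001 halved at epochs 8 and 11, 12 epochs per stage, 2 images per batch.
optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', gamma=0.5, step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
data = dict(samples_per_gpu=2)
# At inference, duplicate detections are suppressed with rotated NMS at an IoU threshold of 0.3.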

4.3. Ablation Experiment

4.3.1. Ablation on Prompt-Based Fusion

To validate the efficacy of our prompt-based fusion module in enhancing the quality of fusion outcomes, we conducted a rigorous ablation study. This investigation entailed a comparative analysis between two experimental setups: the baseline approach, which directly stacked modalities without utilizing the fusion module, and our algorithm augmented with the integrated prompt-based fusion module.
The outcomes of this study, tabulated in Table 1, reveal a notable improvement. Specifically, the inclusion of the prompt-based fusion module resulted in a marked 1.3% increase in mean average precision (mAP). This substantial gain underscores the pivotal role played by our fusion module in bolstering the overall performance of the algorithm, highlighting its effectiveness in fostering seamless and effective modality integration.

4.3.2. Ablation on the Number of Frozen Layers

In the two-stage optimization training strategy, we strategically froze the Transformer encoder layers to safeguard the model’s foundational representation and generalization capabilities. The experiment compared freezing the first 6 layers, fine-tuning the last 6 layers, and fine-tuning all 12 layers, assessing the trade-off between accuracy and efficiency.
The results tabulated in Table 2 reveal that freezing the first six layers results in a minimal 0.4% decrease in mAP compared to fine-tuning all 12 layers, showcasing the effectiveness of partial freezing in reducing training parameters without compromising accuracy. This approach accelerates training and reduces computational demands, facilitating large-scale deployments and iterations.
In conclusion, partial freezing of the backbone network layers in two-stage optimization training is an efficient and practical method, allowing us to balance speed and accuracy by adjusting the number of frozen layers. This discovery offers an innovative approach to optimizing deep learning model training workflows.

4.3.3. Ablation on Stage-Wise Optimization

We conducted experiments to train both a single-stage mono-modal baseline model and a two-stage dual-modal model to validate the effectiveness of the staged training optimization for the detection of dual-modal objects. In our setup, we initially trained mono-modal models for visible light and infrared data, and then we introduced the data of the other modality to fine-tune the corresponding models. The results, as tabulated in Table 3, show that the fine-tuned models exhibited impressive improvements in mAP: the visible light model saw a remarkable 15.6% increase, and the infrared model experienced a 3% increase. The dual-modal object detection algorithms outperformed their mono-modal counterparts on the dataset, which contains challenging environments such as nighttime. The introduction of infrared data mitigates the limitations of using only visible light for object detection, enhancing performance in complex scenarios.

4.4. Performance Comparison

For the purpose of comparison, our experiment involved the implementation of a dual-modal object detection algorithm that underwent fine-tuning on the visible light base model, referred to as VIP-Det. This was set against a range of baseline mono-modal object detection algorithms, including the single-stage R3Det [8], the two-stage Oriented R-CNN [9], and the anchor-free SASM [10]. To ensure a fair comparison, these baseline algorithms were trained separately on either the visible light or infrared datasets. Furthermore, we conducted a meticulous re-implementation of four established RGB + T multispectral methodologies—Halfway Fusion [23], UA-CMDet [22], TSFADet [24], and C2Former [16]—with the objective of rigorously assessing their efficacy in the realm of RGB-IR object detection.

4.4.1. Comparison with Single-Modal Algorithm

The experimental results, as presented in Table 4, offer profound insights upon analysis. Specifically, in the comparison of single-modality algorithms for visible light, Oriented R-CNN emerges as the top performer, surpassing the single-modality benchmark VIP-Det algorithm in terms of precision. This underscores the advanced detection framework and optimization strategies employed by Oriented R-CNN in handling complex scenes and recognizing intricate features.
However, when shifting our focus to infrared data, the narrative shifts. VIP-Det, when trained solely on infrared datasets, demonstrates a remarkable superiority over other single-modality object detection algorithms, including the formidable Oriented R-CNN in visible light. Its precision advantage over Oriented R-CNN reaches a significant 6.84%, highlighting VIP-Det’s unique strengths in processing infrared imagery, possibly attributed to its sensitivity and adaptability to spectral characteristics.
Moreover, in the head-to-head comparison between single-modality and dual-modality algorithms, VIP-Det claims the highest precision level. This achievement not only validates the inherent superiority of the VIP-Det algorithm but also underscores the profound impact of multi-modality information fusion on enhancing object detection performance. By integrating information from both visible and infrared spectra, VIP-Det is able to comprehensively capture target features, mitigating information loss and interference inherent in single-modality approaches. Consequently, it achieves more precise and robust target detection in complex environments.

4.4.2. Comparison with Dual-Modal Algorithm

As shown in Table 5, when compared with the current state-of-the-art dual-modal object detection algorithms, our algorithm has demonstrated remarkable performance, achieving a significant mAP of 75.5%. This achievement not only surpasses other relevant dual-modal algorithms but also validates the effectiveness of our algorithmic innovations in multi-source information fusion and efficient feature extraction.

4.4.3. Comparison of Visual Detection Results

In this experiment, we aimed to validate the robustness of our algorithm and explore the efficacy of multi-modal object detection in complex environments. To this end, we selected Oriented R-CNN, the top-performing algorithm under single-modality conditions, as a benchmark for comparison. Our objective was to demonstrate the advantages of dual-modal object detection in the same environments where Oriented R-CNN is typically applied.
To comprehensively assess performance, we chose four distinct scenarios: daylight, nighttime, rainy/foggy conditions, and scenes with occlusion. Each of these scenarios poses unique challenges to object detection systems, requiring robust algorithms that can overcome factors such as illumination variations, poor visibility, and partial visibility of targets. The results are shown in Figure 4.
In daylight conditions, visible light predominates. Regarding the section highlighted in the red frame, the single-modality Oriented R-CNN misclassifies it in infrared imagery, whereas VIP-Det accurately determines the target category, effectively addressing the issues of low resolution and lack of clarity in infrared images.
Under nighttime conditions, the Oriented R-CNN fails to detect the red-framed area entirely in the single visible light modality due to insufficient information. Conversely, VIP-Det supplements visible light information with infrared imagery, mitigating the inability of visible light-based target detection at night.
In rainy or foggy environments, visible light images tend to blur due to light reflection by raindrops or fog particles, whereas infrared imaging, relying on thermal conduction, is less affected. For the red-framed section, Oriented R-CNN, operating solely on visible light, misses the detection, while VIP-Det, leveraging infrared imagery as an aid, adeptly resolves the issue of unclear textures and blurred contours in visible light images under adverse weather conditions.
In cases of occlusion, such as by trees or other objects, visible light imaging suffers from information loss due to reflection. However, thermal radiation from targets can penetrate certain obstructions. For the red-framed region, Oriented R-CNN, using only visible light, experiences missed detections, whereas VIP-Det, by successfully fusing infrared and visible light information, is capable of detecting occluded targets. For a more extensive showcase of visual results, please refer to Appendix B.

5. Discussion

Our research is primarily focused on the dual-modal object detection task within the UAV field. We have conducted a comprehensive set of ablation studies to validate the reliability of our proposed modules, and we have compared our algorithm with relevant state-of-the-art methods, showcasing its superior detection accuracy and impressive visual results. In contrast to single-modal object detection algorithms, our approach ingeniously fuses features through the use of prompts, endowing it with the capability of dual-modal complementarity and heightened robustness. Compared to existing dual-modal detection algorithms, our method fully exploits the representation and modeling power of Vision Transformers, achieving even better dual-modal feature extraction.
Looking ahead, we envision numerous avenues for further exploration to enhance the performance and practicality of our algorithm in real-world applications. Beyond developing more effective fusion modules and simplifying network architectures, we aim to optimize our model for seamless integration with UAV edge devices, enabling real-time, accurate detections under diverse environmental conditions. Additionally, we will investigate the potential of leveraging both visual and thermal data for battlefield reconnaissance and target identification, paving the way for safer and more efficient drone operations in the field.

6. Conclusions

In this work, our main contribution lies in the introduction of a Transformer-based algorithm for visible–thermal object detection tailored for applications of unmanned aerial vehicles (UAVs), named VIP-Det (Visual Prompt dual-modal Detection). VIP-Det employs a Vision Transformer as its backbone network, innovatively incorporates a prompt-based fusion module for refined feature integration, and adopts a stage-wise optimization strategy for efficient fine-tuning. Through a series of quantitative and qualitative experiments conducted on the DroneVehicle dataset, we demonstrate that VIP-Det surpasses existing dual-modal object detection algorithms, effectively tackling complex UAV-to-ground target detection scenarios, including rainy conditions, nighttime environments, and occlusion, with remarkable performance. This underscores the significant advancement of our proposed methodology in the realm of UAV-based object detection, which has immense potential to improve autonomous surveillance and monitoring capabilities in diverse and challenging environments.   

Author Contributions

Conceptualization, R.C. and D.L.; methodology, R.C.; software, R.C.; validation, R.C., D.L. and Z.G.; formal analysis, D.L.; investigation, R.C.; resources, R.C.; data curation, R.C.; writing—original draft preparation, R.C.; writing—review and editing, Y.K. and D.L.; visualization, R.C.; supervision, C.W. and Y.K.; project administration, D.L.; funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (NSFC) (No. 62102426) and the scientific research project of National University of Defense Technology (No. ZK21-29).

Data Availability Statement

All data utilized in this study originate from the publicly available “DroneVehicle” dataset, released by the VisDrone project team and hosted on the GitHub platform at the following link: https://github.com/VisDrone/DroneVehicle. This comprehensive dataset encompasses images and video footage specifically designed for tasks such as drone and vehicle detection, tracking, and beyond, thereby forming a solid foundation for our experimental endeavors. Throughout our experiments, we directly leveraged the image and video frames within this dataset, engaging in data preprocessing, model training, and result validation. All reported experimental outcomes are solely based on this dataset, adhering strictly to the data usage protocols and terms of the VisDrone project. It is worth noting that as the dataset has undergone public processing and sharing, we are exempted from concerns related to data privacy or ethical implications. Furthermore, we wholeheartedly encourage fellow researchers to harness this valuable resource, fostering collaborative efforts towards advancing the field of drone and vehicle detection technologies.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To better elucidate the prompt-based fusion module, we offer a streamlined pseudo-code flow of the algorithm, facilitating comprehension. Our algorithmic module encompasses two pivotal steps: pre-training and fine-tuning. During pre-training, the focus lies in provisioning initial embedding weights and relevant Transformer layer parameters. The fine-tuning phase, on the other hand, introduces prompt parameters for optimization. By integrating these two steps, our prompt-based fusion module efficiently leverages pre-trained knowledge while flexibly adapting to various tasks through optimized prompts, enhancing overall performance and versatility.
Algorithm A1 Prompt-based fusion

procedure Pre-train(x_v)
    Divide x_v into patches I_v^j ∈ R^{3×h×w}, 1 ≤ j ≤ m
    for j = 1 to m do
        e_{v,0}^j = Embed_v(I_v^j), e_{v,0}^j ∈ R^d
    end for
    E_v^0 = Concatenate(e_{v,0}^1, …, e_{v,0}^m)
    for i = 1 to N do
        E_v^i = Transformer_i(E_v^{i-1})
    end for
    y = Head(E_v^N)
    Loss(y)
    update(parameters)
end procedure

Retrieve weights of the visible light embedding and Transformer layers

procedure Fine-tune(x_v, x_t)
    Freeze weights of the visible light embedding and Transformer layers
    Divide x_v and x_t into patches I_v^j, I_t^j ∈ R^{3×h×w}, 1 ≤ j ≤ m
    for j = 1 to m do
        e_{v,0}^j = Embed_v(I_v^j), e_{v,0}^j ∈ R^d
        e_{t,0}^j = Embed_t(I_t^j), e_{t,0}^j ∈ R^d
    end for
    E_v^0 = Concatenate(e_{v,0}^1, …, e_{v,0}^m); E_t^0 = Concatenate(e_{t,0}^1, …, e_{t,0}^m)
    Initialize prompt tokens P = {p^k ∈ R^d | k = 1, 2, …, p}
    [E_v^0, E_t^0, Z^0] = Concatenate(E_v^0, E_t^0, P)
    for i = 1 to N do
        [E_v^i, E_t^i, Z^i] = Transformer_i([E_v^{i-1}, E_t^{i-1}, Z^{i-1}])
    end for
    y = Head(E_v^N, E_t^N)
    Loss(y)
    update(parameters) (excluding parameters of the visible light embedding and Transformer layers)
end procedure

Appendix B

In this supplementary section, we incorporate four comprehensive sets of visual comparison graphs to showcase the detection outcomes of our proposed method under diverse and challenging environmental conditions. These include scenarios of daytime, nighttime, foggy weather, and occlusion, providing a more prominent demonstration of the superiority of our approach. Within each set of images, we include four pairs of images specific to that environmental scenario. Each image pair comprises a visible light image on the left and its corresponding infrared image on the right. Presenting the two modalities side by side more effectively demonstrates the inherent differences between them and underscores the algorithm’s adept utilization of their complementary information.
Figure A1. The additional visualization results obtained using VIP-Det in the daytime scenarios. This set showcases the baseline performance under optimal lighting conditions.
Figure A2. The additional visualization results obtained using VIP-Det in the nighttime scenarios. The nighttime set reveals the effectiveness of our algorithm in low-light environments.
Figure A3. The additional visualization results obtained using VIP-Det in the foggy scenarios. This set highlights the ability of our method to penetrate visual obscurities and accurately detect objects, demonstrating its resilience against atmospheric disturbances.
Figure A4. The additional visualization results obtained using VIP-Det in the occlusion scenarios. This set underscores the capability of our approach to recognize objects even when partially hidden or obstructed, illustrating its robustness against occlusion challenges.

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  2. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  5. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Liu, Z.M. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the ICCV Visdrone Workshop, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  6. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  7. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  8. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  9. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529. [Google Scholar]
  10. Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-adaptive selection and measurement for oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 923–932. [Google Scholar]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  15. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  16. Yuan, M.; Wei, X. C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403712. [Google Scholar] [CrossRef]
  17. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  18. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 709–727. [Google Scholar]
  19. Zhu, J.; Lai, S.; Chen, X.; Wang, D.; Lu, H. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9516–9526. [Google Scholar]
  20. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
  21. Song, K.; Xue, X.; Wen, H.; Ji, Y.; Yan, Y.; Meng, Q. Misaligned Visible-Thermal Object Detection: A Drone-based Benchmark and Baseline. IEEE Trans. Intell. Veh. 2024; early access. [Google Scholar] [CrossRef]
  22. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
  23. Liu, J.; Zhang, S.; Wang, S.; Metaxas, D.N. Multispectral deep neural networks for pedestrian detection. arXiv 2016, arXiv:1611.02644. [Google Scholar]
  24. Yuan, M.; Wang, Y.; Wei, X. Translation, scale and rotation: Cross-modal alignment meets RGB-infrared vehicle detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 509–525. [Google Scholar]
  25. Zhang, W.; Zhao, W.; Li, J.; Zhuang, P.; Sun, H.; Xu, Y.; Li, C. CVANet: Cascaded visual attention network for single image super-resolution. Neural Netw. 2024, 170, 622–634. [Google Scholar] [CrossRef] [PubMed]
  26. Zhang, W.; Li, Z.; Li, G.; Zhuang, P.; Hou, G.; Zhang, Q.; Li, C. Gacnet: Generate adversarial-driven cross-aware network for hyperspectral wheat variety identification. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5503314. [Google Scholar] [CrossRef]
  27. Cui, L.; Jing, X.; Wang, Y.; Huan, Y.; Xu, Y.; Zhang, Q. Improved swin transformer-based semantic segmentation of postearthquake dense buildings in urban areas using remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 369–385. [Google Scholar] [CrossRef]
  28. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 280–296. [Google Scholar]
  29. Fang, Y.; Yang, S.; Wang, S.; Ge, Y.; Shan, Y.; Wang, X. Unleashing vanilla vision transformer with masked image modeling for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6244–6253. [Google Scholar]
  30. Liu, F.; Zhang, X.; Peng, Z.; Guo, Z.; Wan, F.; Ji, X.; Ye, Q. Integrally migrating pre-trained transformer encoder-decoders for visual object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6825–6834. [Google Scholar]
  31. Wang, D.; Zhang, Q.; Xu, Y.; Zhang, J.; Du, B.; Tao, D.; Zhang, L. Advancing plain vision transformer toward remote sensing foundation model. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–15. [Google Scholar] [CrossRef]
  32. Zhang, Q.; Xu, Y.; Zhang, J.; Tao, D. Vsa: Learning varied-size window attention in vision transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 466–483. [Google Scholar]
  33. Yu, H.; Tian, Y.; Ye, Q.; Liu, Y. Spatial transform decoupling for oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6782–6790. [Google Scholar]
  34. Li, S.; Xue, L.; Feng, L.; Yao, C.; Wang, D. Hybrid Convolutional-Transformer framework for drone-based few-shot weakly supervised object detection. Comput. Electr. Eng. 2022, 102, 108154. [Google Scholar] [CrossRef]
  35. Bar, A.; Gandelsman, Y.; Darrell, T.; Globerson, A.; Efros, A. Visual prompting via image inpainting. Adv. Neural Inf. Process. Syst. 2022, 35, 25005–25017. [Google Scholar]
  36. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
Figure 1. Difficulties and challenges faced by dual-modal object detection of UAV. (a) Low resolution of infrared images, and unclear target textures. (b) Under occlusion conditions, parts of target in visible light images are covered by trees. (c) Under night conditions, visible light imaging completely fails. (d) Under heavy fog conditions, visibility of targets in visible light is obstructed. (e) Flight height of drones is unstable, resulting in uneven target scales.
Figure 2. The overarching architecture of VIP-Det encompasses several principal components: a data preprocessing module, a prompt-based fusion module, a feature extraction module, and a rotated detection head. First, the dual-modal images are fed into the data preprocessing module to generate visible light tokens and infrared tokens separately. Then, the prompt-based fusion module initializes and generates prompt tokens, which are merged with the tokens from both modalities and jointly input into the feature extraction module. The feature extraction module, comprising multiple Transformer layers, performs feature extraction on the merged tokens. Finally, the extracted feature maps are fed into the rotated detection head to obtain the detection results.
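For concreteness, the token flow described in the Figure 2 caption can be sketched in PyTorch as follows. This is a minimal illustration only: the module names, patch size, embedding dimension, prompt-token count, and encoder depth are assumptions rather than the released VIP-Det implementation, positional embeddings are omitted, and the rotated detection head is only indicated by a comment.

```python
# Minimal sketch of the Figure 2 token flow (illustrative assumptions only).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project them to tokens."""
    def __init__(self, in_ch=3, embed_dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                  # x: (B, C, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)     # tokens: (B, N, D)

class VIPDetSketch(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_prompts=100):
        super().__init__()
        self.rgb_embed = PatchEmbed(3, embed_dim)          # visible-light tokens
        self.ir_embed = PatchEmbed(3, embed_dim)           # infrared tokens
        # learnable prompt tokens used by the prompt-based fusion module
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=12,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, rgb, ir):
        tokens_rgb = self.rgb_embed(rgb)                   # (B, N, D)
        tokens_ir = self.ir_embed(ir)                      # (B, N, D)
        prompts = self.prompts.expand(rgb.size(0), -1, -1)
        # merge prompt tokens with both modalities and encode them jointly
        merged = torch.cat([prompts, tokens_rgb, tokens_ir], dim=1)
        feats = self.encoder(merged)
        # a rotated detection head (not shown) would consume the patch features
        # reshaped back into 2-D feature maps
        return feats

if __name__ == "__main__":
    model = VIPDetSketch()
    out = model(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
    print(out.shape)   # torch.Size([1, 492, 768]) = 100 prompts + 196 + 196 patches
```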
Figure 3. A comparison between the stage-wise training optimization strategy and common dual-modal object detection algorithms. (a) shows the common dual-modal object detection framework. (b) represents the prompt-based fusion and stage-wise training optimization strategy, which is divided into two stages: (b1) shows the process of training the baseline model, and (b2) illustrates the process of training the fusion model.
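In code, the stage-wise optimization of Figure 3(b) boils down to training a single-modal baseline first (b1), then building the dual-modal model, loading the baseline weights into it, and fine-tuning with the newly added prompt parameters (b2). The toy linear "detectors", random data, and hyper-parameters below are placeholders meant only to make that two-stage pattern concrete; they do not reproduce the actual VIP-Det training recipe.

```python
# Toy, self-contained illustration of the two training stages in Figure 3(b).
import torch
import torch.nn as nn

class SingleModal(nn.Module):                    # stand-in for the baseline model
    def __init__(self, dim=64):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 5)            # five vehicle classes

    def forward(self, x):
        return self.head(self.backbone(x))

class DualModal(SingleModal):                    # stand-in for the fusion model
    def __init__(self, dim=64, num_prompts=8):
        super().__init__(dim)
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))

    def forward(self, rgb, ir):
        # the prompt acts here as a learnable fusion term (toy stand-in only)
        fused = self.backbone(rgb) + self.backbone(ir) + self.prompts.mean(dim=0)
        return self.head(fused)

def fit(model, batches, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for inputs, target in batches:
        loss = loss_fn(model(*inputs), target)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage (b1): train the single-modal baseline (e.g., on infrared data only).
baseline = SingleModal()
thermal_batches = [((torch.randn(4, 64),), torch.randint(0, 5, (4,)))
                   for _ in range(10)]
fit(baseline, thermal_batches, lr=1e-3)

# Stage (b2): the dual-modal model inherits the baseline weights; the prompt
# parameters are new, so strict=False skips them. Then fine-tune on image pairs.
fusion = DualModal()
fusion.load_state_dict(baseline.state_dict(), strict=False)
paired_batches = [((torch.randn(4, 64), torch.randn(4, 64)),
                   torch.randint(0, 5, (4,))) for _ in range(10)]
fit(fusion, paired_batches, lr=1e-4)
```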
Figure 4. A comparison of visual detection results. In this figure, we present a visual comparison of the detection results between the single-modal algorithm, Oriented R-CNN, and our dual-modal algorithm, VIP-Det, across different scene environments. Each set of images encapsulates the detection outcomes from the same pair of visible and infrared images within a given scene. The red bounding boxes highlight the performance differences between the algorithms under those specific scenarios, providing a clear visualization of their respective strengths and capabilities.
Table 1. Ablation on prompt-based fusion. Compared to direct feature stacking and fusion, the introduction of prompts simplifies task complexity by minimizing direct modifications to model parameters. This approach mitigates the risk of overfitting and, through the incorporation of additional parameters, enables the model to adapt more flexibly to feature transformations and weight adjustments. Consequently, it enhances the model's generalization capability, making it more robust and versatile across diverse scenarios. The value in parentheses in the Param column indicates the increase in the number of parameters.
Method            | Car  | Truck | Freight Car | Bus  | Van  | mAP  | Param
baseline          | 90.3 | 68.1  | 62.2        | 90.0 | 56.3 | 73.4 | 70.02 M
baseline + prompt | 90.4 | 78.5  | 61.4        | 89.8 | 57.5 | 75.5 | 70.10 M (+0.08 M)
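As a rough consistency check on the Param column, a set of learnable prompt tokens only adds num_prompts × embed_dim parameters. The embedding dimension (768, as in ViT-Base) and the prompt count (100) below are assumptions for illustration, not values quoted from the paper, but they land at roughly the +0.08 M reported above.

```python
# Back-of-the-envelope estimate of the extra parameters added by prompt tokens.
embed_dim = 768      # assumed token dimension (ViT-Base-like)
num_prompts = 100    # assumed number of learnable prompt tokens
extra_params = num_prompts * embed_dim
print(f"extra parameters: {extra_params / 1e6:.2f} M")   # -> 0.08 M
```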
Table 2. Ablation on the number of frozen layers. Due to the discrepancy between the pre-training task and the new task, some features in the pre-trained model may not be suitable for the new task. Freezing certain layers can limit the model's ability to represent features tailored to the new task, resulting in a certain degree of accuracy degradation; however, it also reduces the number of parameters and accelerates model training. The value in parentheses in the Param column indicates the increase in the number of parameters when all layers are fine-tuned.
Method              | Car  | Truck | Freight Car | Bus  | Van  | mAP  | Param
frozen 6 layers     | 90.3 | 71.2  | 63.2        | 90.1 | 57.8 | 74.5 | 59.45 M
fine-tune 12 layers | 90.4 | 78.5  | 61.4        | 89.8 | 57.5 | 75.5 | 70.10 M (+10.65 M)
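Freezing the first six Transformer layers, as in the ablation above, requires only disabling gradients for those blocks before fine-tuning. The 12-layer encoder below is an illustrative ViT-Base-like stand-in, not the actual VIP-Det backbone, so the printed parameter count will not match the Param column exactly.

```python
# Minimal sketch of freezing the first N encoder layers before fine-tuning.
import torch.nn as nn

def freeze_early_layers(encoder: nn.TransformerEncoder, num_frozen: int = 6):
    for layer in encoder.layers[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad_(False)      # frozen layers receive no gradient updates
    trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
    print(f"trainable encoder parameters: {trainable / 1e6:.2f} M")

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                   dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)
freeze_early_layers(encoder, num_frozen=6)
```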
Table 3. Ablation on stage-wise optimization. With the addition of information from another modality, the fine-tuned model can fully leverage the complementary nature of the data, achieving higher performance. Meanwhile, since the ground truth is uniformly adopted from the infrared annotations, the infrared detection performance tends to be better than that of visible light. The values in parentheses indicate the mAP improvement over the corresponding single-modal baseline.
Method              | Car  | Truck | Freight Car | Bus  | Van  | mAP          | Modality
visible baseline    | 78.3 | 54.9  | 38.8        | 83.8 | 43.8 | 59.9         | RGB
thermal baseline    | 90.3 | 72.5  | 57.8        | 88.8 | 52.9 | 72.5         | T
visible + fine-tune | 90.4 | 78.5  | 61.4        | 89.8 | 57.5 | 75.5 (+15.6) | RGB + T
thermal + fine-tune | 90.4 | 78.5  | 59.8        | 89.6 | 56.9 | 75.0 (+2.5)  | RGB + T
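The mAP values in these tables are the mean of the five per-class APs; for instance, the "visible + fine-tune" row above reproduces the reported 75.5:

```python
# Quick check that mAP is the mean of the per-class APs for one table row.
aps = {"Car": 90.4, "Truck": 78.5, "Freight Car": 61.4, "Bus": 89.8, "Van": 57.5}
print(f"mAP = {sum(aps.values()) / len(aps):.1f}")   # -> 75.5
```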
Table 4. Performance comparison of single-modal and dual-modal object detection algorithms. This table evaluates diverse object detection algorithms in single-modal (RGB or thermal IR) and dual-modal (RGB + thermal IR) setups. By reporting the average precision (AP) for cars, trucks, freight cars, buses, and vans, together with the mean average precision (mAP), it underscores VIP-Det's strength in harnessing dual-modal information. The Modality column indicates the input used, showing how modality choice impacts detection accuracy. The red mark indicates the maximum precision value in each column.
MethodCarTruckFreight CarBusVanmAPModality
R3Det [8]87.835.016.175.916.246.20RGB
Oriented R-CNN [9]88.961.739.787.940.563.74
SASM [10]88.652.431.682.032.057.32
VIP-Det (V)78.354.938.883.843.859.90
R3Det [8]89.529.522.373.116.246.12T
Oriented R-CNN [9]90.161.748.288.639.765.66
SASM [10]89.646.836.280.728.856.42
VIP-Det (T)90.357.861.488.852.972.50
VIP-Det (ours)90.478.561.489.857.575.50RGB + T
Table 5. A comparison with dual-modal algorithms. This table compares VIP-Det against leading dual-modal (RGB + thermal) object detection algorithms. By assessing per-class detection performance and the mean average precision (mAP), it shows how VIP-Det fares against the most advanced techniques in the field and clarifies its position within the current state of the art. The red mark indicates the maximum precision value in each column.
Method              | Car   | Truck | Freight Car | Bus   | Van   | mAP   | Modality
Halfway Fusion [23] | 89.85 | 60.34 | 55.51       | 88.97 | 46.28 | 68.19 | RGB + T
UA-CMDet [22]       | 87.51 | 60.70 | 46.80       | 87.00 | 38.00 | 64.00 | RGB + T
TSFADet [24]        | 90.01 | 69.15 | 65.45       | 89.70 | 55.19 | 73.90 | RGB + T
C2Former [16]       | 90.20 | 68.30 | 64.40       | 89.80 | 58.50 | 74.20 | RGB + T
VIP-Det (ours)      | 90.40 | 78.50 | 61.40       | 89.80 | 57.50 | 75.50 | RGB + T