1. Introduction
In the field of autonomous driving, precise identification and detection of driving environment targets are fundamental to achieving safe self-driving. However, in open and dynamic scenarios, autonomous driving systems frequently encounter challenges posed by complex lighting conditions and adverse weather (e.g., rain, fog, snow). These factors can introduce noise and occlusion issues into the captured visual information, significantly diminishing the detection performance of the autonomous driving system’s environmental perception, thereby directly jeopardizing driving safety.
An intuitive solution to mitigate the impact of adverse weather on autonomous driving is to employ image restoration techniques [1] to eliminate the detrimental effects caused by weather, thereby enhancing the quality of the captured information. However, image restoration algorithms may introduce issues such as pixel shifts or residual “weather traces” during processing, and they cannot fully recover the information lost to adverse weather. More comprehensive and effective approaches are therefore needed to counteract the negative influence of adverse weather on the visual perception of autonomous driving systems.
Another effective solution is to leverage the assistance of additional sensors. In the realm of autonomous driving, commonly utilized sensors include infrared imaging and radar, both of which exhibit resilience against adverse weather or complex lighting conditions. This paper opts to utilize infrared images to enhance the object detection capabilities of autonomous driving systems across various environmental conditions, primarily based on the following considerations: Firstly, infrared images can provide clear imaging results at nighttime or in low-light conditions, unaffected by light intensity, while also effectively penetrating adverse weather conditions such as smoke and heavy fog, which is crucial for the safety of autonomous driving in complex environments. Secondly, infrared imaging demonstrates exceptional sensitivity to heat source targets (e.g., pedestrians, animals), highlighting these targets and thereby offering higher detection robustness in intricate backgrounds and occluded environments. Lastly, compared to radar, infrared imaging devices are more cost-effective and impose fewer demands on hardware resources, making them more suitable for large-scale deployment. Consequently, the application of infrared images in autonomous driving object detection, particularly under nighttime and adverse weather conditions, can provide the system with more reliable object detection capabilities, making it an ideal choice for achieving the safety and efficiency of autonomous driving. Currently available public datasets [2,3,4] for object detection in autonomous driving scenarios using RGB and infrared images lack adverse weather conditions. Therefore, we synthesize a new Adverse Weather and Illumination Dataset (AWID) based on existing public datasets to provide a training and testing benchmark for object detection in various weather conditions, as shown in Figure 1.
Research on cross-modal fusion based on visible light and infrared images has a long history. Existing cross-modal object detection algorithms [5,6,7,8], with advanced feature extraction and fusion strategies, have made significant progress in this field and achieved high detection accuracy. However, current cross-modal detection techniques often focus on designing the optimal fusion algorithm while neglecting the extraction and enhancement of critical information in each modality's images before and after fusion. Building upon this foundation, this paper delves deeper into feature extraction and fusion strategies and proposes the CME-YOLO algorithm. Unlike existing algorithms, CME-YOLO focuses on whether key features are accurately extracted and whether effective information is sufficiently enhanced throughout the entire process, from cross-modal feature fusion to the input of the detection head. By continuously reinforcing key features during feature fusion and upsampling, and by combining the Transformer attention mechanism to provide optimal fusion strategies, the algorithm significantly improves detection effectiveness and robustness.
During the cross-modal feature fusion stage, this paper introduces the Cross-Perception Transformer Fusion module (CPTFusion). Specifically, after the dual-stream feature extraction network provides feature maps for each modality, we first employ a residual bottleneck structure that involves dimensionality reduction, followed by dimensionality expansion, and then dimensionality reduction again. This approach reduces the parameter count while preserving critical information. Subsequently, a multi-scale feature extraction module with global and local cross-perception is utilized for feature enhancement, strengthening the expression of key information. Ultimately, the Transformer attention mechanism is leveraged to identify the optimal fusion strategy, achieving the most effective feature fusion.
Upon entering the detection head, to address the issue of feature map quality degradation caused by the neighboring sampling mechanism of bilinear interpolation in traditional upsampling processes, we introduce the Adaptive Upsampling module (AdSample). This module employs a convolutional neural network to learn and select the optimal sampling regions, dynamically adjusting the sampling positions to expand the effective information within the feature map. Consequently, it achieves feature enhancement during the upsampling process, thereby improving the model’s learning and adaptability.
Through these innovative designs, the CME-YOLO algorithm can more effectively enhance and fuse features from visible light and infrared images, significantly improving the accuracy and robustness of object detection. Even in complex environments and adverse weather conditions, this algorithm demonstrates outstanding performance, providing autonomous driving systems with more reliable perception capabilities.
The primary contributions of this paper are as follows:
1. We establish a novel dataset named the Adverse Weather and Illumination Dataset (AWID) to simulate intricate real-world scenarios, as shown in Figure 1. The AWID dataset encompasses three adverse weather conditions (rain, snow, and fog) as well as two illumination scenarios (daytime and nighttime). This dataset offers abundant training samples and a dependable testing benchmark for object detection in adverse weather conditions.
2. We design a novel adaptive upsampling module called AdSample. This module, through convolutional neural network learning, automatically identifies the optimal sampling regions during the upsampling process, thereby expanding the effective information within the feature map. This innovative design significantly enhances the detection head’s comprehension and adaptability towards the input fused tensor, effectively improving the overall object detection performance.
3. We propose the Cross-Perception Transformer Fusion module (CPTFusion) to achieve the effective fusion of features from different modalities. CPTFusion extracts multi-scale detailed features from two source images and utilizes the Transformer attention mechanism to identify the optimal fusion strategy. This approach fully leverages the complementary advantages of infrared and visible light modalities, significantly enhancing the performance of cross-modal object detection.
4. By integrating the AdSample and CPTFusion modules, we introduce the Cross-Modal Enhanced YOLO (CME-YOLO) algorithm. This algorithm enhances the critical features of images from different modalities and achieves the effective fusion of cross-modal information. Consequently, it significantly improves the object detection performance of autonomous driving systems under complex lighting conditions and various weather scenarios.
5. Extensive experimental results demonstrate that our CME-YOLO algorithm achieves superior performance on both existing cross-modal object detection datasets and our newly created AWID dataset. Notably, on the FLIR dataset, CME-YOLO’s mAP50 value surpasses the current best algorithm, MFPT, by 6.9%, and its mAP value exceeds the second-best algorithm, CFT, by 10.86%. These outcomes thoroughly validate the superiority and effectiveness of CME-YOLO in object detection tasks within autonomous driving scenarios.
Through these contributions, this paper presents innovative solutions to address object detection challenges in adverse weather and complex lighting conditions within autonomous driving. It advances the development of cross-modal object detection technology and lays a foundation for achieving safer and more reliable autonomous driving systems.
2. Related Work
In this section, we briefly review the relevant work in the field of object detection for autonomous driving systems.
Object detection datasets for autonomous driving scenarios. Researchers are actively creating and utilizing synthetic datasets to enhance the performance of object detection algorithms in autonomous driving scenarios under various extreme conditions. For instance, the WEDGE [9] dataset leverages the DALL-E [10] generative model to create 3360 images encompassing 16 extreme weather conditions, with 16,513 bounding-box annotations provided. This dataset supports research on weather classification and 2D single-modal object detection. Additionally, datasets such as BDD100K [11] and the KITTI Object Detection dataset [12] contain a wealth of object image samples relevant to autonomous driving, covering common targets such as cars, pedestrians, and traffic signs. These datasets have undergone automatic orientation correction and resolution unification, thereby supporting the accuracy and generalization capabilities of models. They not only provide a rich resource of images but also offer new possibilities for simulating and understanding visual scenes under complex weather conditions, thereby strengthening the visual robustness of autonomous driving systems. However, there is currently no visible–infrared object detection dataset that encompasses various weather conditions. Therefore, we synthesized the Adverse Weather and Illumination Dataset (AWID) based on available datasets; it aims to provide training samples and a testing benchmark for cross-modal object detection in multi-weather scenarios.
State-of-the-art multimodal object detection technology. Multimodal target detection, an important research direction in the field of computer vision [13,14,15], aims to enhance the performance of target detection by utilizing data from different modalities (such as RGB images and infrared images). This field primarily consists of two categories of methods: feature-level fusion and pixel-level fusion. Feature-level fusion methods [5,6,7,8,16] integrate features from different modalities at various stages of the detector, including early fusion, deep fusion, and late fusion. Pixel-level fusion methods [17,18,19,20,21,22,23,24,25,26,27,28,29,30], on the other hand, focus on merging cross-modality images before inputting them into the detector, leveraging the information from the cross-modality input images through the reconstruction of a fused image.
Feature-level fusion has gradually become the mainstream approach in multimodal target detection algorithms. Its core idea is to enhance detection performance by fusing multimodal features while maintaining the characteristics of the different modality data. Specifically, feature-level fusion methods add specific fusion modules within the neural network to achieve cyclic fusion and refinement of multimodal features [5,6,7,8], thereby improving the consistency and quality of the features.
In recent years, dual-stream object detection models based on convolutional neural networks have made significant progress in enhancing recognition performance. By adopting advanced dual-stream backbones and fusion modules, these models can significantly enhance the expression of features within and between modalities. Zhang et al. [7] proposed Cyclic Fuse-and-Refine (CFR_1, CFR_2, CFR_3), adding a specific module within the neural network to achieve cyclic fusion and refinement of multispectral features and thereby improving the performance of multispectral object detection. Fang et al. [5] proposed the Cross-Modality Fusion Transformer (CFT), which applies the Transformer architecture to multispectral images; by learning long-distance dependencies and global contextual information, CFT achieves more accurate and robust feature extraction and significant performance improvements on multiple datasets. Althoupety et al. [31] proposed Dual Attentive Feature Fusion (DaFF), which performs dual attention fusion on the features of RGB and IR images through a dual attention mechanism and multimodal learning, likewise yielding significant performance gains on multiple datasets. Zhang et al. [8] proposed Guided Attentive Feature Fusion (GAFF), which applies a guiding attention mechanism to the features of RGB and IR images and achieves significant improvements in multispectral pedestrian detection. Zhu et al. [6] proposed the Multimodal Feature Pyramid Transformer (MFPT), which combines a multimodal feature pyramid with the Transformer architecture to obtain more accurate and robust feature extraction for RGB–infrared object detection. These advanced fusion methods are not only applicable to object detection tasks but can also be integrated into other downstream tasks, thereby enhancing overall detection accuracy and robustness.
Unlike traditional cross-modality algorithms, the research focus of the proposed CME-YOLO is on optimizing the entire process from cross-modality feature fusion to the input of the detection head, ensuring that key features are precisely extracted and effective information is fully enhanced. To achieve this, CME-YOLO introduces the Cross-Perception Transformer Fusion module (CPTFusion) and the Adaptive Upsampling module (AdSample). In the feature fusion stage, CPTFusion enhances the abstract expression of key features through multi-scale feature extraction and Transformer attention mechanisms, optimizing the fusion strategy for different modality features and fully leveraging the complementarity of the infrared and visible light modalities. In the upsampling stage of the detection head, AdSample utilizes convolutional neural networks to dynamically adjust the sampling region, expanding the effective information in the feature map and further enhancing feature expression. Through this comprehensive optimization design, CME-YOLO excels in feature extraction, fusion, and enhancement, demonstrating outstanding performance and robustness in handling complex scenes and adverse weather conditions. In the future, with the advancement of research and technology, the performance of multimodal target detection is expected to be further improved, playing an important role in more practical applications.
3. Methods
3.1. Dataset Creation
We surveyed currently available public datasets [26,32,33,34] for object detection in autonomous driving scenarios using RGB and infrared images, such as KAIST [2], FLIR [3], and LLVIP [4]. These datasets typically only include daytime and nighttime scenes with complex lighting variations and lack scenes with adverse weather conditions such as rain, snow, or fog. Therefore, we created a new Adverse Weather and Illumination Dataset (AWID) based on RGB and infrared images, utilizing synthetic means such as image translation in conjunction with public datasets.
We selected the SMOD [35] dataset as the foundational dataset and expanded it into the Adverse Weather and Illumination Dataset (AWID), as shown in Figure 2. The SMOD (SJTU Multispectral Object Detection) dataset [35] was released by a research team from Shanghai Jiao Tong University. It focuses on object detection in multispectral images within autonomous driving scenarios, particularly the detection of pedestrians, cyclists, bicycles, and cars under complex illumination conditions and low-light environments. The SMOD dataset comprises 5378 pairs of daytime scene images and 3298 pairs of nighttime scene images. These image pairs are strictly aligned in both time and space and cover a wide range of illumination variations, especially in nighttime scenes, which include both very low and high-intensity lighting conditions.
The AWID dataset we created consisted of 20,000 pairs of aligned visible–infrared images and annotations, covering three weather conditions (rain, snow, and fog) and two illumination scenarios (daytime and nighttime). The daytime and nighttime scenes were sourced from real data in the SMOD dataset, while the images for rainy, snowy, and foggy weather conditions were synthesized from the clean daytime scene images.
We synthesized paired rainy, foggy, and snowy images from the clean daytime images of the SMOD [35] dataset and collected real rainy, foggy, and snowy images from outdoor scenes. These images were combined with the daytime scene data from SMOD to form the training samples input into CycleGAN [36]. We obtained the translation weights for daytime-to-rainy, daytime-to-foggy, and daytime-to-snowy images, respectively, and used them to generate adverse weather images paired with the daytime scene data in SMOD. In CycleGAN training, unpaired training data can cause pixel displacement, which would compromise the validity of the object detection labels. To ensure high pixel alignment, we therefore used the weather images generated by the first round of CycleGAN training as new training samples and repeatedly retrained CycleGAN until stable daytime–adverse weather paired images were generated. The generated dataset, along with the original SMOD data and the corresponding labels, was input into YOLOv10 [37] for object detection to verify the effectiveness of the dataset generation. As shown in Table 1, the detection accuracy decreased under adverse weather conditions. However, since this decrease stayed within the range expected from the weather itself, we considered the pixel displacement caused by image translation to be negligible for the object detection evaluation task. In other words, the generated image data and the corresponding labels were deemed suitable for object detection tasks.
3.1.1. Synthetic Paired Weather Images
To begin with, for the addition of rain and snow effects, we employed a particle system approach. A particle system is a technique for simulating complex phenomena by generating a large number of small particles to mimic natural occurrences such as raindrops and snowflakes. In this study, we defined parameters such as the size, velocity, and tilt angle of the particles and wrote corresponding fragment shaders in the Shader language to achieve the visual effects of rain and snow. For instance, the simulation of raindrops was carried out using Equation (1). In this formulation, the tiltAngle parameter governs the inclination of the raindrops, the rainSize parameter dictates the size of the raindrops, and the rainSpeed parameter controls the velocity at which the raindrops fall. By adjusting these parameters, we can simulate realistic raindrop effects in the image.
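As a rough illustration of the particle-system idea (the exact fragment shader in Equation (1) is not reproduced here), the following NumPy sketch overlays rain streaks whose thickness, length, and inclination play the roles of rainSize, rainSpeed, and tiltAngle; all function names and parameter values are illustrative assumptions, not the shader actually used.

```python
import numpy as np

def add_rain(image, num_drops=500, rain_size=1, rain_speed=15,
             tilt_angle_deg=-10, brightness=0.8, seed=0):
    """Overlay simple rain streaks on an HxWx3 float image in [0, 1].

    rain_size      -> streak thickness (px), analogous to rainSize
    rain_speed     -> streak length (px), analogous to rainSpeed
    tilt_angle_deg -> streak inclination, analogous to tiltAngle
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    out = image.copy()
    tilt = np.deg2rad(tilt_angle_deg)
    dx, dy = np.sin(tilt), np.cos(tilt)            # unit direction of a falling drop
    xs = rng.integers(0, w, num_drops)
    ys = rng.integers(0, h, num_drops)
    for x0, y0 in zip(xs, ys):
        # Rasterize one streak of length `rain_speed` along the tilt direction.
        for t in range(rain_speed):
            x, y = int(x0 + dx * t), int(y0 + dy * t)
            if 0 <= x < w and 0 <= y < h:
                x1, y1 = min(x + rain_size, w), min(y + rain_size, h)
                out[y:y1, x:x1] = 0.5 * out[y:y1, x:x1] + 0.5 * brightness
    return np.clip(out, 0.0, 1.0)
```

Snow can be approximated in the same way by drawing shorter, slower, near-vertical particles.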
For the addition of fog effects, we employed an atmospheric scattering model. This model is based on the principles of light scattering and absorption in the atmosphere and calculates the scattered light intensity for each pixel in the image to simulate fog. The simulation of fog effects was conducted using the standard atmospheric scattering formulation in Equation (2):

I(x) = J(x) e^{-\beta d(x)} + A (1 - e^{-\beta d(x)}),   (2)

where J(x) represents the pixel intensity without fog, β is the atmospheric scattering coefficient, d(x) is the distance from the pixel to the camera, and A is the atmospheric light intensity. By adjusting the values of β and A, we could control the density and luminance of the fog, thereby adding varying degrees of fog effects to the image.
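For concreteness, here is a minimal NumPy sketch of Equation (2); since no depth map is assumed to be available for the source images, d(x) is approximated by a simple vertical gradient, and both that choice and the parameter values are illustrative assumptions.

```python
import numpy as np

def add_fog(image, beta=1.2, atmospheric_light=0.9):
    """Apply the atmospheric scattering model
    I(x) = J(x) * exp(-beta * d(x)) + A * (1 - exp(-beta * d(x))).

    image: HxWx3 float array in [0, 1] (the fog-free image J).
    beta:  scattering coefficient; larger values give denser fog.
    atmospheric_light: global atmospheric light intensity A.
    """
    h, w = image.shape[:2]
    # Without a real depth map, approximate distance with a vertical gradient:
    # pixels near the top of a road scene are treated as farther away.
    depth = np.linspace(1.0, 0.0, h).reshape(h, 1, 1)   # hypothetical d(x) in [0, 1]
    transmission = np.exp(-beta * depth)                # t(x) = exp(-beta * d(x))
    foggy = image * transmission + atmospheric_light * (1.0 - transmission)
    return np.clip(foggy, 0.0, 1.0)
```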
3.1.2. Generating a Dataset with CycleGAN [36]
CycleGAN (Cycle-Consistent Generative Adversarial Network) [36] is a powerful tool for image-to-image translation, particularly suitable for scenarios with unpaired training data. Its structure consists of two generator–discriminator pairs: Generator G with Discriminator D_Y, and Generator F with Discriminator D_X. The algorithm diagram of CycleGAN is shown in Figure 3. Here, G attempts to convert images from domain X (e.g., sunny images) to domain Y (e.g., rainy images), while F performs the reverse conversion. Discriminator D_Y is used to determine whether an image comes from domain Y, and Discriminator D_X whether an image comes from domain X; both try to distinguish between real images and those produced by the generators. Through cycle consistency and adversarial training mechanisms, CycleGAN can learn and understand the mapping relationships between the image domains.
The training of CycleGAN [36] relies on the combination of three main loss functions:
a. Adversarial loss: This loss is crucial for training, as it motivates generator G to synthesize images that are indistinguishable from real images in domain Y, while challenging discriminator D_Y to accurately identify the authenticity of the images it evaluates.
b. Cycle consistency loss: This loss ensures that information is not lost during the translation process, meaning that converting an image from one domain to the other and back should yield an image close to the original.
c. Identity loss: This loss encourages the generator to preserve certain features of the image during the translation, even after the image has been transformed into the target domain.
The training process of CycleGAN begins with the initialization of the parameters of generators G and F and discriminators D_X and D_Y. The discriminators D_X and D_Y are then trained by alternating between real and generated images to strengthen their ability to distinguish the two. Meanwhile, the generators G and F are alternately trained by minimizing the adversarial loss, cycle consistency loss, and identity loss, gradually improving the accuracy and robustness of their image transformations. These training steps are repeated until the model converges, ensuring that CycleGAN efficiently learns the mapping relationships between the image domains.
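The combination of these losses during a generator update can be sketched in PyTorch as follows; the least-squares adversarial form and the loss weights (lambda_cyc, lambda_id) are common CycleGAN defaults used here as assumptions rather than the exact settings of this paper.

```python
import torch
import torch.nn as nn

def generator_loss(G, F, D_Y, D_X, real_x, real_y,
                   lambda_cyc=10.0, lambda_id=5.0):
    """One CycleGAN generator objective: adversarial + cycle + identity terms.
    G: X -> Y (e.g., sunny -> rainy), F: Y -> X; D_X, D_Y are the discriminators.
    """
    mse, l1 = nn.MSELoss(), nn.L1Loss()

    fake_y, fake_x = G(real_x), F(real_y)

    # Adversarial loss: generated images should be classified as real.
    adv = mse(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) + \
          mse(D_X(fake_x), torch.ones_like(D_X(fake_x)))

    # Cycle consistency loss: X -> Y -> X (and Y -> X -> Y) should recover the input.
    cyc = l1(F(fake_y), real_x) + l1(G(fake_x), real_y)

    # Identity loss: feeding a target-domain image through its generator
    # should leave it (mostly) unchanged.
    idt = l1(G(real_y), real_y) + l1(F(real_x), real_x)

    return adv + lambda_cyc * cyc + lambda_id * idt
```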
In the generation of adverse weather datasets, CycleGAN [36] demonstrates its excellent capability. The specific application process is as follows: First, data preparation is conducted, where sunny images from the SMOD [35] dataset are assigned to domain X, while real rain, snow, and fog images, together with synthetic weather images created through special effects, are assigned to domain Y. Next, leveraging CycleGAN's training mechanism, the generator G learns to convert sunny images into images with rain, snow, and fog effects by optimizing the adversarial, cycle consistency, and identity losses. Finally, by adjusting the parameters of generator G and the training hyperparameters, the intensity and characteristics of the generated adverse weather effects can be flexibly controlled to meet various practical application needs.
To mitigate the impact of pixel misalignment, caused by the unpaired training data between real weather images and SMOD's [35] clear-sky images, on the accuracy of downstream tasks, we employed a cyclic training strategy. Specifically, we used the dataset generated by the first training iteration of CycleGAN [36] as new training samples and continued to train CycleGAN iteratively until stable paired adverse weather images were produced. This strategy effectively reduced pixel misalignment issues, ensuring the quality and consistency of the generated images. A sample of the dataset we created is shown in Figure 1.
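The cyclic training strategy can be summarized in pseudocode form; train_cyclegan, generate_pairs, and is_stable are placeholders standing in for standard CycleGAN training, inference, and a convergence check, not functions from the actual codebase.

```python
def build_paired_weather_set(clear_images, real_weather_images,
                             train_cyclegan, generate_pairs,
                             is_stable=None, max_rounds=3):
    """Iteratively retrain CycleGAN until the generated daytime/adverse-weather
    pairs are pixel-stable. All callables are placeholders for the usual
    CycleGAN training, translation, and convergence-check steps."""
    target_domain = real_weather_images
    generated = None
    for _ in range(max_rounds):
        model = train_cyclegan(domain_x=clear_images, domain_y=target_domain)
        generated = generate_pairs(model, clear_images)   # paired with clear_images
        if is_stable is not None and is_stable(generated, clear_images):
            break
        # Use the newly generated (already pixel-aligned) images as the
        # target domain for the next training round.
        target_domain = generated
    return generated
```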
3.2. CME-YOLO Algorithm
As illustrated in Figure 4, our proposed Cross-Modal Enhanced YOLO (CME-YOLO) consists of two stages: dual-modal feature extraction and detection. We extended the YOLOv10 [37] framework by integrating our Cross-Perception Transformer Fusion module (CPTFusion) and Adaptive Upsampling module (AdSample). These additions help the cross-modal object detection framework adapt to images from various scenarios and capture more detailed features during fusion, ultimately improving the adverse weather cross-modal object detection performance of autonomous driving systems.
Specifically, after the RGB and infrared images undergo feature extraction through their respective backbone networks, the resulting features are fused using the CPTFusion module. The fused tensor is then input into the detection head. During the upsampling process within the detection head, we incorporate the Adaptive Upsampling module (AdSample), which employs convolutional neural networks to learn the optimal sampling strategy and thereby enhances the quality of the upsampled feature maps. Finally, the object detection task is completed within the YOLOv10 detection head. We provide detailed explanations of the design rationale behind the CPTFusion and AdSample modules in Section 3.2.1 and Section 3.2.2, respectively.
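At a high level, the two-stage flow described above can be sketched as follows; every component is passed in as a placeholder module, so this is only a structural outline of CME-YOLO rather than the released implementation.

```python
import torch.nn as nn

class CMEYOLOSketch(nn.Module):
    """Dual-stream feature extraction followed by cross-modal fusion and detection."""

    def __init__(self, rgb_backbone, ir_backbone, cpt_fusion, detect_head):
        super().__init__()
        self.rgb_backbone = rgb_backbone    # YOLOv10-style backbone for RGB frames
        self.ir_backbone = ir_backbone      # parallel backbone for infrared frames
        self.cpt_fusion = cpt_fusion        # CPTFusion module (Section 3.2.1)
        self.detect_head = detect_head      # YOLOv10 head whose upsampling layers
                                            # are replaced by AdSample (Section 3.2.2)

    def forward(self, rgb, ir):
        feat_rgb = self.rgb_backbone(rgb)   # RGB features
        feat_ir = self.ir_backbone(ir)      # infrared features
        fused = self.cpt_fusion(feat_rgb, feat_ir)   # cross-modal fused tensor
        return self.detect_head(fused)      # boxes, classes, scores
```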
3.2.1. Cross-Perception Transformer Fusion (CPTFusion)
Existing deep learning-based image fusion algorithms [5,38,39,40,41] typically rely on deep network architectures to achieve a large receptive field for information integration. However, due to the inherent limitations of convolutional operations, these methods focus merely on achieving fusion between modalities, without considering the enhancement of detailed features in each modality's tensor before fusion, nor evaluating the interaction between the two modalities to find the optimal fusion strategy. In brief, although existing learning-based image fusion techniques [28,42,43,44,45,46,47,48,49] enhance information reception through deep networks, they fall short in aspects such as detailed feature enhancement for cross-modal fusion and multi-scale information interaction. These aspects are particularly crucial for acquiring meaningful supplementary information and tackling object detection under adverse weather conditions.
To address the challenges in cross-modal feature fusion, we developed the CPTFusion module, the architecture of which is depicted in Figure 5. Within the CPTFusion framework, the input infrared and RGB feature tensors were first subjected to dimensionality reduction and reshaped into embeddings. Subsequently, these embeddings were concatenated to form a unified tensor representation, laying the foundation for the subsequent fusion operations. During the fusion process, we introduced a residual bottleneck structure and a Bifurcated Fusion Module, which, while maintaining a low parameter count, preserved critical feature information and extracted multi-scale features. This design not only enhanced the interaction between the dual-stream feature tensors but also enriched and strengthened the feature representation.
Furthermore, we incorporated a variant of the Transformer module to explore the optimal fusion strategy between infrared and RGB feature tensors. By leveraging the synergy of self-attention mechanisms and multilayer perceptrons, this module efficiently captured the complex relationships between feature tensors, thereby achieving more precise feature fusion. We provide detailed descriptions of these three key modules in the subsequent sections. Through this design, CPTFusion not only enhanced the detailed features of both source images but also fully leveraged the complementary advantages of infrared and RGB modalities, significantly improving the performance of cross-modal object detection.
We extracted convolutional features from the middle layers of the network for the RGB and infrared images. Their dimensions were both B × C × H × W, where B is the batch size, C is the number of channels, and H and W are the height and width of the feature maps, respectively. Before performing the fusion operation on the feature maps, we adopted global average pooling (GAP) for dimensionality reduction to lower the computational complexity. First, we performed global average pooling on the feature maps of each modality (RGB and infrared) separately, calculating the average over all spatial positions for each channel; this reduced the size of the feature maps from B × C × H × W to B × C × 1 × 1. We then reshaped the reduced feature maps to retain some spatial information, which lowered the computational load while preserving the spatial cues of the feature maps. We concatenated the reduced feature maps of the RGB and infrared modalities along the channel dimension. Next, we flattened and rearranged the concatenated feature maps and introduced a set of learnable positional encodings of matching dimensions, forming the input sequence I and enhancing the model's ability to recognize the positions of different features. Through positional encoding, the model could distinguish the spatial relationships between different features during training. We passed the sequence I through a residual bottleneck structure, a Bifurcated Fusion Module, and a Transformer variant block to achieve feature fusion and enhancement. Finally, we reshaped the enhanced features back into the original feature-map format and input them into the detection head for the subsequent multispectral target detection tasks.
(a) Residual Bottleneck Structure
As depicted in Figure 6, the sequence I is fed into a residual bottleneck structure consisting of two 1 × 1 convolutional layers and one 3 × 3 convolutional layer. These convolutional layers are wrapped by Batch Normalization (BatchNorm) layers, non-linearity is introduced via the ReLU activation function, and the structure is completed by a residual connection.
The bottleneck design of this structure reduces the number of model parameters by employing dimensionality reduction followed by expansion. Compared to directly using a standard convolutional layer, the residual bottleneck structure can significantly reduce parameters and computational complexity without losing much information. The 1 × 1 convolutional layers perform feature fusion without altering the spatial dimensions of the feature maps; this fusion captures cross-channel correlations between modalities, thereby generating richer and more abstract feature representations. The reduction in parameter count also enhances computational efficiency, allowing the network to process information more rapidly and reducing memory usage, which makes it feasible to train deeper or wider networks with limited computational resources. Although the residual bottleneck structure reduces the number of channels in the intermediate layers, it effectively retains important feature information thanks to the non-linear activation functions and residual connections. The residual connections ensure that information can flow directly even in very deep networks, mitigating the problem of vanishing gradients.
Through the residual bottleneck structure, the sequence I achieves the preliminary fusion of dual-stream tensors and effective retention of critical information on the basis of a low parameter count and low computational complexity. This provides a solid parametric foundation and informational support for the subsequent enhancement of detailed feature representation and more efficient fusion.
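A minimal PyTorch sketch of such a block is given below, assuming the common 1 × 1 reduce, 3 × 3, 1 × 1 expand layout and a channel-reduction ratio of 4; both assumptions stand in for details not spelled out in the text, and the block is written for a standard feature-map tensor.

```python
import torch
import torch.nn as nn

class ResidualBottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, each followed by BatchNorm,
    with ReLU non-linearities and a residual connection."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction          # assumed reduction ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The residual connection keeps a direct path for gradients and information.
        return self.relu(x + self.block(x))
```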
(b) Bifurcated Fusion Module
In the field of image processing, and particularly in multimodal image fusion, effectively understanding local detail features and global contextual features across different modalities, and then achieving effective feature enhancement, is a critical challenge. The Bifurcated Fusion Module processes the input fused feature sequence by grouping it for convolution operations, capturing local and global features within the groups to enhance multi-scale features, and employs self-attention mechanisms to strengthen the interaction between global and local features. As shown in Figure 7, the local path uses a standard 3 × 3 convolutional kernel to capture detailed information, while the global path extracts broader contextual features through a larger convolutional kernel. This design allows the model to strengthen its understanding of global contextual features in multimodal information while maintaining computational efficiency, and it simultaneously reinforces important local detail features. The module's flexible adaptation to diverse feature fusion requirements and its parallel processing of local and global information effectively improve the model's ability to capture multi-scale features, providing robust informational support for handling images from various scenarios.
After passing through the Bifurcated Fusion Module, sequence I represents an enhanced representation of both local and global information, enriching the informational content of the feature tensor. This process transforms the initially naive feature representation into a more sophisticated and high-quality one. At that stage, we consider the majority of the information contained within the tensor to be useful and of high quality.
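The following PyTorch sketch illustrates the bifurcated idea under stated assumptions: the channels are split into two groups, the local branch uses a 3 × 3 kernel, the global branch uses a 7 × 7 kernel as a stand-in for the paper's larger kernel, and a simple squeeze-style gate approximates the attention interaction between the two paths.

```python
import torch
import torch.nn as nn

class BifurcatedFusionSketch(nn.Module):
    """Two-path (local / global) convolution with a simple channel-attention
    interaction; the 7x7 global kernel and the gating scheme are assumptions."""

    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.local_path = nn.Conv2d(half, half, kernel_size=3, padding=1)   # fine detail
        self.global_path = nn.Conv2d(half, half, kernel_size=7, padding=3)  # wider context
        self.gate = nn.Sequential(            # attention-style interaction between paths
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: B x C x H x W
        local_in, global_in = torch.chunk(x, 2, dim=1)   # group the channels
        local_feat = self.local_path(local_in)
        global_feat = self.global_path(global_in)
        fused = torch.cat([local_feat, global_feat], dim=1)
        return x + fused * self.gate(fused)   # reweight and keep a residual path
```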
(c) Transformer Variant Block
The Transformer variant block combines the multi-head self-attention mechanism with multi-layer perceptrons to enhance feature representation and capture complex relationships in sequential data. By inputting the concatenated tensor into the Transformer, we can avoid hand-designing a fusion strategy; instead, we leverage the computational mechanism of the Transformer to directly arrive at an optimal fusion strategy and thereby accomplish efficient fusion, as shown in Figure 4.
a. Self-Attention
The mechanism of self-attention enables the model to adaptively concentrate on various segments of the input sequence across its entirety. Given an input sequence represented as I, the self-attention module initially generates the query, key, and value matrices through linear projections:

Q = I W_Q,  K = I W_K,  V = I W_V.

Here, W_Q, W_K, and W_V denote the weight matrices associated with the query, key, and value matrices, which are adjusted during the learning process. Subsequently, the attention scores are derived from the following expression:

Attention(Q, K, V) = softmax(Q K^T / \sqrt{d_k}) V,

where d_k corresponds to the dimension of the key vectors. This scaling factor is crucial for normalizing the dot product, thereby aiding in the stabilization of the model's training phase. The resulting attention scores indicate the relative importance of each input element in relation to the others within the sequence. In this context, the Transformer employs a multi-head self-attention mechanism, where multiple parallel attention heads perform the aforementioned computations simultaneously. This approach enables the model to better identify the optimal weights for each position in sequence I during the fusion process, thereby facilitating the design of the optimal fusion strategy.
b. Multilayer Perceptron
Following the self-attention mechanism, the resulting output is fed into a multi-layer perceptron (MLP) module. This module is composed of a pair of linear transformations bookended by a non-linear activation function, which introduces complexity to the model:

MLP(I) = W_2 ReLU(W_1 I + b_1) + b_2.

Here, W_1 and W_2 are weight matrices, b_1 and b_2 are bias vectors, and I is the input. W_1 I + b_1 is the first linear transformation, with weight matrix W_1 and bias vector b_1; ReLU(·) is the activation function, which sets negative values to 0 and retains only positive values; W_2(·) + b_2 is the second linear transformation, with weight matrix W_2 and bias vector b_2. Through multiple fully connected layers and non-linear activation functions, an MLP can learn complex patterns and feature interactions in the input data, thereby enhancing the model's expressive ability and predictive performance.
The Transformer variant block, through the synergistic operation of multi-head self-attention mechanisms and multilayer perceptrons (MLPs), is capable of efficiently uncovering the intricate relationships between the fused tensors and identifying the optimal solutions. By capitalizing on the inherent strengths of the Transformer architecture, we can swiftly pinpoint the optimal fusion strategy for the high-quality feature tensor information obtained from sequence I in the initial stages. This allows for the completion of the final tensor fusion and its subsequent output to the detection head for object detection.
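A compact PyTorch sketch of such a block, built from nn.MultiheadAttention and a two-layer MLP with residual connections, is shown below; the pre-norm placement, head count, and expansion ratio are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransformerVariantBlock(nn.Module):
    """Multi-head self-attention followed by a two-layer MLP,
    each wrapped with LayerNorm and a residual connection."""

    def __init__(self, dim, num_heads=8, mlp_ratio=4.0, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(              # W2 * ReLU(W1 * x + b1) + b2
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
            nn.Dropout(dropout),
        )

    def forward(self, seq):                    # seq: B x N x dim (token sequence I)
        normed = self.norm1(seq)
        attn_out, _ = self.attn(normed, normed, normed)
        seq = seq + attn_out                   # residual around self-attention
        seq = seq + self.mlp(self.norm2(seq))  # residual around the MLP
        return seq
```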
3.2.2. Adaptive Upsampling (AdSample)
In this section, we detail the design of the adaptive upsampling module, AdSample, and demonstrate how it was progressively improved.
Upsampling inserts new data points between the original data points to increase the density of the data. These new points are calculated through mathematical methods with the aim of making the data appear smoother or more detailed without altering the original data trends.
Bilinear interpolation. When the feature map is upsampled, each output coordinate (x, y) is mapped back to (x/s, y/s) in the original feature map, and bilinear interpolation is then performed using the pixel values and position coordinates of the four nearest points, as shown in Figure 8. The upsampled pixel value at each position is therefore closely related to the four nearest neighbor points in the original feature map, but the selection of these four points depends only on the point coordinates and the upsampling scale factor. We represent this upsampling process with Equation (9), where x denotes the input feature map, X denotes the feature map obtained after upsampling by a factor of s, and s represents the upsampling scale factor.
The effectiveness of upsampling with bilinear interpolation is closely related to the selection of the sampling region. In images affected by adverse weather conditions, the traditional choice of sampling region may harm the quality of the upsampled feature map and, in turn, the final object detection task. To identify the optimal sampling region, we introduced a learnable convolution that acts on the sampling position x to determine the optimal displacement z relative to the original sampling position, as shown in Figure 9. The sampling position then becomes Z = x + z.
By introducing a learnable convolutional shift to identify the optimal sampling domain, we transformed the upsampling process into an adaptive learning mechanism. This approach was no longer confined to static, predefined mathematical models but instead allowed the network to dynamically adjust its sampling strategy based on the characteristics of the input data. This adaptive upsampling method enabled the network to better adapt to and capture key information from images in different scenarios, thereby enhancing the precision of image processing and analysis. It also improved the robustness and generalization of the model to images from diverse scenes.
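A possible PyTorch realization of this idea is sketched below: a small convolution predicts a per-position offset z that shifts the bilinear sampling grid used by F.grid_sample, and zero-initialization makes the module start out as plain bilinear upsampling; the offset-prediction layer, its input, and the offset scale are assumptions, not the released AdSample code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdSampleSketch(nn.Module):
    """Adaptive upsampling sketch: a learned convolution predicts an offset z for
    every output position, shifting the bilinear sampling location from x to Z = x + z."""

    def __init__(self, channels, scale=2, max_offset=0.1):
        super().__init__()
        self.scale = scale
        self.max_offset = max_offset                     # bound on the learned shift
        self.offset_conv = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_conv.weight)          # start as plain bilinear upsampling
        nn.init.zeros_(self.offset_conv.bias)

    def forward(self, x):                                # x: B x C x H x W
        b, _, h, w = x.shape
        out_h, out_w = h * self.scale, w * self.scale
        # Base bilinear sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, out_h, device=x.device)
        xs = torch.linspace(-1, 1, out_w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # Offsets are predicted from a bilinearly upsampled copy of the feature map.
        up = F.interpolate(x, size=(out_h, out_w), mode="bilinear", align_corners=True)
        offset = torch.tanh(self.offset_conv(up)) * self.max_offset   # B x 2 x H' x W'
        grid = grid + offset.permute(0, 2, 3, 1)          # Z = x + z on the sampling grid
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```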
The upsampling module was deployed within the detection head, as shown in Figure 4. We performed upsampling on the fused tensor at each step prior to detection, ensuring the retention of useful information. By leveraging convolutional neural networks to learn the optimal reference regions for upsampling, the entire network was able to better understand the input image information used for object detection and even to acquire effective reasoning capabilities.
4. Experiments
In this section, we commence by introducing the datasets employed in our experiments in Section 4.1. Subsequently, we elaborate on the training parameters and equipment details in Section 4.2 and the evaluation metrics in Section 4.3. To demonstrate the superiority of the proposed algorithm, we conduct comparative analyses with current state-of-the-art algorithms on our self-constructed AWID dataset as well as two RGB–infrared object detection datasets in Section 4.4 and Section 4.5. Furthermore, in Section 4.6, we validate the effectiveness and generalization capability of the proposed modules and algorithm through ablation experiments. Lastly, in Section 4.7, we perform a visual analysis to qualitatively showcase the outstanding performance of our algorithm.
4.1. Dataset
All experiments were evaluated on three benchmark datasets, namely, AWID, FLIR [3], and KAIST [2].
AWID: The Adverse Weather and Illumination Dataset (AWID) constructed in this paper comprised 20,000 pairs of aligned visible and infrared images along with their corresponding labels. The temporal interval between images was 0.4 s. It encompassed three weather conditions—rain, snow, and fog—as well as two scenarios—daytime and nighttime, with each scenario evenly distributed. The dataset included common road targets such as pedestrians, bicycles, and vehicles. Notably, the weather images and daytime scene images were identical in all aspects except for the weather conditions, making AWID suitable for tasks such as object detection, image restoration, and image fusion.
FLIR [3]: The FLIR dataset, provided by Teledyne FLIR, is a multi-channel image dataset specifically designed for research on infrared pedestrian detection and object detection algorithms. It comprises 26,442 fully annotated frames, covering 15 distinct object categories, including pedestrians, bicycles, cars, and more. The dataset consists of 9711 infrared images and 9233 RGB training/validation images, with a 1:1 matching relationship between the infrared and visible light frames. The images in the FLIR dataset were collected at various locations and under diverse lighting/weather conditions, offering a rich array of scenes and object types. This makes it well suited for training and testing computer vision algorithms, particularly in the fields of object detection and recognition.
KAIST [2]: The KAIST dataset, established by the Korea Advanced Institute of Science and Technology (KAIST), is a cross-modal image library designed for pedestrian detection research. It consists of approximately 95,000 pairs of synchronously captured visible and infrared images, encompassing various lighting conditions during daytime and nighttime to simulate real-world visual challenges. The images, with a resolution of 640 × 480, were recorded at a frequency of 20 Hz, offering a rich diversity of scenes including campuses, streets, and rural environments. Each image pair has been meticulously manually annotated, totaling over 100,000 pedestrian instances with labels for categories such as pedestrians, crowds, and cyclists. Due to the similarity between adjacent images, we performed data cleaning on the dataset. Following the cleaning and re-annotation process, the resulting training dataset comprised 7601 training images and 2257 validation images, which could be directly utilized for training and evaluating object detection models.
4.2. Training Details
In our experiments, we opted for the Stochastic Gradient Descent (SGD) algorithm as our optimization technique, setting the initial learning rate to , momentum to , and applying a weight decay of . All models were trained for 200 epochs on a single NVIDIA GeForce RTX 4090 GPU. Additionally, to enrich our training data, we implemented the mosaic [50] data augmentation technique, which combines four different training images into a single composite image.
4.3. Evaluation Metric
The mAP (mean Average Precision) is a key evaluation metric for object detection tasks, used to measure the average performance of a model across all categories. The mAP is calculated by taking the average of the AP (Average Precision) values for each category. AP measures the quality of detection for a particular category, and its calculation involves concepts such as TP (True Positives), FP (False Positives), FN (False Negatives), Precision, and Recall.
The calculation formula for Precision is:

Precision = TP / (TP + FP).

The calculation formula for Recall is:

Recall = TP / (TP + FN).

TP (True Positives) refers to the number of samples where the predicted result is positive and the actual value is also positive. FP (False Positives) refers to the number of samples where the predicted result is positive but the actual value is negative. FN (False Negatives) refers to the number of samples where the predicted result is negative but the actual value is positive. When calculating AP (Average Precision), the model's predictions are first sorted in descending order of confidence. Then, for each prediction, the corresponding Precision and Recall values are calculated. Next, a PR curve is plotted with Recall on the x-axis and Precision on the y-axis. Finally, the area under the PR curve is calculated to obtain the AP value, which represents the average precision for that category.

The calculation formula for the mAP is:

mAP = (1/N) Σ_{i=1}^{N} AP_i.

Here, N represents the number of categories, and AP_i represents the AP value for the ith category. Variants of mAP include mAP50 and mAP75, which evaluate detection quality using different IoU thresholds. These metrics facilitate fair comparisons among different models.
In this section, we introduce mAP50, mAP75, and mAP as evaluation metrics, which comprehensively consider the model’s performance under different IoU (Intersection-over-Union) thresholds. (mAP50: the mAP value when the IoU threshold is 0.5, indicating the model’s average performance at an IoU threshold of 0.5; mAP75: the mAP value when the IoU threshold is 0.75, indicating the model’s average performance at an IoU threshold of 0.75; mAP is the average of the AP values for all categories, measuring the model’s average performance across all categories).
In pedestrian detection, we typically choose the Miss Rate as the evaluation metric, which is the ratio of the number of true targets that are not detected to the total number of targets. The calculation formula is:

MR = FN / GT.

Here, FN represents the number of samples where the predicted result is negative but the actual value is positive (the number of missed detections), and GT represents the total number of true targets. A lower Miss Rate indicates better model detection performance.
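The metric definitions above can be condensed into a short NumPy sketch; the all-point (trapezoidal) integration of the PR curve used here is a simplification and may differ slightly from the COCO-style interpolation used by common evaluation toolkits.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class: sort by confidence, accumulate TP/FP, integrate the PR curve."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)
    # Area under the precision-recall curve (all-point interpolation).
    return float(np.trapz(precision, recall))

def mean_average_precision(per_class_aps):
    """mAP = mean of the per-class AP values."""
    return float(np.mean(per_class_aps))

def miss_rate(num_fn, num_gt):
    """Miss Rate = FN / GT (fraction of ground-truth targets not detected)."""
    return num_fn / max(num_gt, 1)
```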
4.4. Comparison with State-of-the-Art Methods on the AWID Dataset
On our self-constructed AWID dataset, we compared the proposed CME-YOLO algorithm with the following methods: the current state-of-the-art cross-modal detectors CFT [5] and MFPT [6], as well as a combined approach that first processes images using the advanced multi-weather image restoration algorithm Domain Translation and then performs unimodal object detection. In this experiment, we selected mean average precision (mAP) as the sole evaluation metric, with the results presented in Table 2.
It is important to note that since the Domain Translation algorithm [1] is only applicable to weather conditions such as rain, fog, and snow, and does not provide a solution for low-light scenarios, it offers no restoration benefit for ordinary daytime and nighttime images. Therefore, we only applied this algorithm to the AWID-Hazy, AWID-Rain, and AWID-Snow subsets for image preprocessing and performance comparison.
As evident from Table 2, our proposed algorithm outperformed both the multi-weather image restoration algorithm Domain Translation [1] and the current state-of-the-art cross-modal object detection algorithms (CFT [5] and MFPT [6]) across all subsets. This result robustly validated the superiority of our algorithm in cross-modal object detection tasks under multi-weather conditions.
Furthermore, we specifically compared our proposed algorithm with advanced unimodal object detectors, the current state-of-the-art cross-modal detectors, and baseline methods on the rainy scenes within the AWID dataset. Detailed results are presented in Table 3.
Table 3 provides a detailed comparison of the detection performance between our proposed CME-YOLO algorithm and existing unimodal and cross-modal object detection algorithms on the AWID-Rain dataset. Through the comparative data, we can clearly observe that the CME-YOLO algorithm demonstrates a significant advantage in handling adverse weather conditions.
Specifically, by enhancing the representation and efficient fusion of complementary features from different modalities, CME-YOLO achieved more accurate object detection in rainy environments. Its performance metrics were as follows: mAP50: 91.73%, mAP75: 77.64%, mAP: 66.07%. These data robustly demonstrate the exceptional detection capability of CME-YOLO under complex weather conditions such as rain, providing strong technical support for all-weather object detection in practical applications.
4.5. Comparison with State-of-the-Art Methods on FLIR [3] and KAIST [2]
In this section, we evaluate our proposed algorithm on two widely used RGB–infrared datasets, FLIR [3] and KAIST [2], and conduct a comparative analysis with the current state-of-the-art algorithms.
4.5.1. Evaluation on the FLIR [3] Dataset
Table 4 presents a comprehensive comparison of the performance of our proposed CME-YOLO algorithm against existing unimodal and cross-modal object detection methods on the FLIR dataset. Through these data, we can clearly observe that the CME-YOLO algorithm achieved state-of-the-art performance on this dataset, demonstrating significant performance advantages.
Specifically, in comparison to CFT [5], CME-YOLO achieved an 8.2% improvement in mAP50, a significant advantage of 16.27% in mAP75, and an overall gain of 10.86% in mAP. When benchmarked against the latest MFPT [6], CME-YOLO gained a 6.9% advantage in mAP50 at an IoU threshold of 0.5.
These comparative results thoroughly validated the outstanding performance and strong competitiveness of the CME-YOLO algorithm in cross-modal object detection tasks. Whether compared against the current state-of-the-art cross-modal detector MFPT [6] or other excellent algorithms such as CFT [5], CME-YOLO consistently delivered more accurate and reliable object detection outcomes. This achievement not only highlighted the effectiveness of our algorithm's feature fusion and representation design but also underscored CME-YOLO's potential to play a pivotal role in the field of cross-modal object detection.
4.5.2. Evaluation on the KAIST [2] Dataset
Table 5 provides a further comparison of the performance of our proposed CME-YOLO algorithm against other unimodal object detectors and the advanced cross-modal object detector CFT [5]. Through this comprehensive comparative analysis, we once again validated the excellence of the CME-YOLO algorithm. The specific performance metrics were as follows: mAP50: 98.02%; mAP75: 81.73%; mAP: 69.9%. These results reiterate the leading advantage of the CME-YOLO algorithm in object detection tasks, showcasing superior detection results compared to both unimodal detectors and the advanced cross-modal detector CFT [5]. This not only underscores the innovative nature of our algorithm in feature fusion and representation but also provides a solid technical foundation for its deployment in practical applications.
Table 6 presents a detailed comparison of the miss rate (MR) performance between our proposed CME-YOLO algorithm and other cross-modal object detection algorithms for pedestrian detection on the KAIST [2] dataset. From this table, it is evident that by effectively fusing complementary features from different modalities, the CME-YOLO algorithm significantly reduced the miss rate, achieving the best performance across all metrics.
Specifically, CME-YOLO achieved a miss rate (MR) of 7.13% in all-day scenarios, 8.01% in daytime scenarios, and an impressively low MR of 3.82% in nighttime scenarios. These data clearly demonstrate that the CME-YOLO algorithm outperformed the current state-of-the-art research across daytime, nighttime, and all-day combined scenarios. This outstanding performance further validates the robust capability of our algorithm in handling cross-modal pedestrian detection tasks, particularly highlighting its robustness and accuracy under complex lighting conditions.
4.5.3. Evaluation on Other Metrics
We also compared our proposed CME-YOLO with the most competitive CFT [5] and MFPT [6] algorithms in terms of parameter count, GFLOPs, and computational speed, as shown in Table 7. Our proposed CME-YOLO outperformed both algorithms in terms of parameter count, GFLOPs, and running speed, demonstrating the efficiency and practicality of our proposed algorithm.
4.6. Ablation Study
To comprehensively validate the effectiveness and generalization capability of the AdSample, CPTFusion modules, and the CME-YOLO algorithm proposed in this paper, we meticulously designed and conducted a series of ablation experiments. These experiments aimed to delve into the specific contributions of each component to the overall algorithm performance by gradually introducing and combining different modules and strategies. Additionally, they assessed the adaptability and robustness of the algorithm’s design structure and approach across various detectors.
Effectiveness of CME-YOLO. Table 8 compares the detection performance across different datasets (AWID-Rain, FLIR [3], and KAIST [2]), with the best results highlighted in bold and performance improvements denoted by ↑. On the FLIR [3] dataset, the YOLOv10 detector using only the RGB modality achieved an mAP value of 38.75%, whereas the YOLOv10 detector using only the infrared modality attained an mAP value of 48.81%. The infrared modality significantly outperformed the RGB modality on the FLIR dataset, likely due to the presence of numerous low-light scenes in which effective target regions are lost in the RGB modality. In contrast, the infrared modality provided clearer target information under low-light conditions and thereby achieved higher detection accuracy.
Similarly, the superiority of the infrared modality over the RGB modality was also observed on the KAIST [2] dataset and the AWID-Rain dataset. This further corroborates the advantage of the infrared modality in handling object detection tasks under low-light or complex weather conditions.
Further analysis of the mAP values comparing YOLOv10 using only the infrared modality with a simple two-stream baseline model revealed that the latter failed to fully leverage the inherent complementarity between different modalities. Worse still, these rudimentary approaches may increase the difficulty of network learning, exacerbate modality imbalance, and consequently lead to performance degradation.
As evident from Table 8, the deployment of our proposed CME-YOLO algorithm led to substantial performance improvements across all three datasets. Notably, on the FLIR [3] dataset, the enhancement in detection performance was particularly remarkable, with mAP75 increasing by 5.74% and overall mAP rising by 3.72%. These results robustly demonstrate that the CME-YOLO algorithm, through its effective cross-modal fusion and feature enhancement design, can maximally exploit the complementary advantages of different modalities, thereby significantly boosting the performance of cross-modal object detection.
Effectiveness of CPTFusion and AdSample. To thoroughly evaluate the effectiveness of the proposed AdSample and CPTFusion modules in cross-modality target detection tasks and their effectiveness under different conditions, we conducted ablation experiments on the AWID dataset under each weather condition (day, night, rain, fog, and snow) and on the publicly recognized FLIR [
3] dataset. The experimental results are shown in
Table 9,
Table 10,
Table 11,
Table 12,
Table 13 and
Table 14, where blue indicates the performance improvement over the simplest dual-stream combination after adding the AdSample and CPTFusion modules. We can see that on the AWID dataset under various weather conditions, the network’s performance improved after adding the AdSample and CPTFusion modules, with particularly significant improvements in nighttime scenarios. After adding the AdSample module, the performance of the dual-stream network improved by 0.31–12.26% in mAP50, 0.44–15.86% in mAP75, and 0.98–14.14% in mAP. After adding the CPTFusion module, the performance of the dual-stream network improved by 0.32–12.03% in mAP50, 0.66–13.64% in mAP75, and 0.42–14.19% in mAP. Our proposed CME-YOLO improved by 0.83–12.90% in mAP50, 1.54–16.32% in mAP75, and 1.37–14.26% in mAP compared to the simplest dual-stream combination. On the public FLIR dataset, as shown in
Table 14, the proposed modules and algorithm are also effective. These results demonstrate the significant role of the AdSample and CPTFusion modules in enhancing cross-modality target detection and show that their functions are complementary, together yielding a more comprehensive performance improvement.
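For clarity, the blue values in Tables 9–14 are simply the per-metric differences between each ablation variant and the dual-stream baseline. A minimal sketch of that computation, using hypothetical metric dictionaries, is shown below.

```python
def ablation_gain(baseline: dict, variant: dict) -> dict:
    """Compute the absolute gain (in percentage points) of an ablation
    variant over the dual-stream baseline for each metric, i.e. the kind
    of values reported in blue in Tables 9-14. The metric values below
    are hypothetical examples, not results from the paper."""
    return {k: round(variant[k] - baseline[k], 2) for k in baseline}

# Example (hypothetical numbers):
# ablation_gain({"mAP50": 78.1, "mAP75": 52.3, "mAP": 49.0},
#               {"mAP50": 80.4, "mAP75": 54.0, "mAP": 50.6})
# -> {"mAP50": 2.3, "mAP75": 1.7, "mAP": 1.6}
```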
Complexity of CPTFusion and AdSample. To quantitatively illustrate the computational cost and complexity of the modules proposed in this paper, as shown in
Table 15, we present a comparison of the GFLOPs after adding AdSample and CPTFusion, with all experiments conducted under the same settings. The results show that both AdSample and CPTFusion increased the computational cost, but the increase was within an acceptable range. Moreover, the final GFLOPs of CME-YOLO were still lower than those of the advanced algorithm CFT [
5], which used 224.40 GFLOPs.
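As a rough guide, GFLOPs figures like those in Table 15 can be estimated with an off-the-shelf profiler such as thop. The sketch below assumes a dual-stream model taking RGB and infrared tensors and a 640x640 input resolution; neither assumption is taken from the paper's settings.

```python
import torch
from thop import profile  # pip install thop; assumed available

def count_gflops(model: torch.nn.Module,
                 rgb_shape=(1, 3, 640, 640),
                 ir_shape=(1, 3, 640, 640)) -> float:
    """Rough sketch of how a GFLOPs comparison could be reproduced for a
    dual-stream detector; the two-input signature and input resolution
    are assumptions, not the paper's experimental settings."""
    rgb = torch.randn(*rgb_shape)
    ir = torch.randn(*ir_shape)
    macs, _ = profile(model, inputs=(rgb, ir), verbose=False)
    # thop reports multiply-accumulate operations; by the common
    # convention 1 MAC is counted as 2 FLOPs.
    return 2 * macs / 1e9
```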
Generalization of CME-YOLO. To comprehensively assess the effectiveness and generalization capability of our proposed framework, we conducted a crucial experiment: replacing the backbone and detector within our framework with other classical object detectors, such as YOLOv3 [
52] and YOLOv5 [
53], and testing them on the FLIR [
3] dataset. This experiment aimed to validate the modular design and versatility of our framework, ensuring that it not only performed exceptionally under specific configurations but also adapted to different detector architectures. The experimental results, as shown in
Table 16, included evaluation metrics such as mAP50, mAP75, and mAP, which measured the effectiveness of object detection.
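Structurally, this detector-swapping experiment relies on the fusion and upsampling modules being independent of the surrounding backbone and head. The schematic below sketches such a modular arrangement; the class and argument names are assumptions for illustration, not the released implementation.

```python
import torch.nn as nn

class CrossModalDetector(nn.Module):
    """Schematic of the modular structure exercised in the generalization
    study: any single-stream YOLO backbone/head pair can be dropped in
    around the fusion and upsampling modules. Names are illustrative."""

    def __init__(self, backbone_rgb: nn.Module, backbone_ir: nn.Module,
                 fusion: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone_rgb = backbone_rgb   # e.g., a YOLOv3/v5/v10 backbone
        self.backbone_ir = backbone_ir     # parallel backbone for infrared
        self.fusion = fusion               # e.g., a CPTFusion-style module
        self.head = head                   # detection head with AdSample-style upsampling

    def forward(self, rgb, ir):
        # The fusion module only sees feature maps, so swapping the
        # backbone or head does not require changing the fusion logic.
        fused = self.fusion(self.backbone_rgb(rgb), self.backbone_ir(ir))
        return self.head(fused)
```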
The experimental results in
Table 16 clearly demonstrate the exceptional generalization and robustness of our proposed algorithm. By replacing the backbone within our framework with different classical detectors (e.g., YOLOv3 [
52], YOLOv5 [
53], and YOLOv10 [
37]), we verified the adaptability and effectiveness of our algorithm structure and its modules across different detector architectures. The specific gains were as follows: when using YOLOv10 [
37] as the baseline, our method achieved a 2.6% performance improvement in mAP50, a 5.87% increase in mAP75, and a 3.76% enhancement in mAP. With YOLOv3 [
52] as the baseline, the performance improvements reached 3.3% (mAP50), 1.1% (mAP75), and 2.4% (mAP). When adopting YOLOv5 [
53] as the baseline, our method outperformed the dual-stream baseline by 6.1% on mAP50, 4.7% on mAP75, and 4.8% on mAP.
In summary, regardless of whether YOLOv3 [
52], YOLOv5 [
53], or YOLOv10 [
37] was used as the detector, our modular design and specific modules could significantly enhance their performance in cross-modal object detection tasks.
4.7. Visualization of Results
As shown in
Figure 10, the baseline often missed detections in low-light or adverse weather conditions, such as failing to detect cyclists in low light, cars in the rain, bicycles in fog, and cyclists in snow. After adding CPTFusion, the missed detections were largely eliminated, but the detection accuracy for various targets, especially small targets, still needed improvement. After adding AdSample, the model became more sensitive to small targets. For example, in the rainy scene, the car went from being missed entirely to being detected with a confidence score of 0.5 with CPTFusion and 0.8 with AdSample, demonstrating AdSample's enhancement of fine image details and its clear benefit for small-target detection accuracy. The combination of CPTFusion and AdSample, i.e., the CME-YOLO algorithm proposed in this paper, not only resolved the missed detections but also further improved detection accuracy; in the same rainy scene, the confidence score of the car detection rose to 0.9. Similar improvements were observed under other lighting and weather conditions. It is evident that CPTFusion and AdSample each improve different aspects of the model's detection performance, and when combined in CME-YOLO, they efficiently fuse the cross-modal features and dynamically upsample them to capture finer and smoother detail, significantly enhancing the model's detection performance and robustness.
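Qualitative figures such as Figure 10 are produced by overlaying each predicted box and its confidence score on the input image. A generic sketch of this rendering step is given below; the detection tuple format is a hypothetical convention rather than the paper's visualization tool.

```python
import cv2  # OpenCV; assumed available

def draw_detections(image, detections, color=(0, 255, 0)):
    """Minimal sketch of overlaying detections on an image. Each detection
    is assumed to be an (x1, y1, x2, y2, score, label) tuple in pixel
    coordinates; this is generic plotting code, not the paper's tool."""
    for x1, y1, x2, y2, score, label in detections:
        cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)), color, 2)
        cv2.putText(image, f"{label} {score:.2f}", (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1, cv2.LINE_AA)
    return image
```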
Figure 11 provides detailed visualizations of the object detection results achieved by our proposed CME-YOLO algorithm under various adverse weather conditions, including rain, snow, fog, and darkness. Through these visual results, we can intuitively observe the outstanding performance of the CME-YOLO algorithm in complex environments. As evident from the figures, the CME-YOLO algorithm accurately identified and localized all annotated objects across different adverse weather conditions. Whether in blurry rainy scenes, low-contrast snowy environments, or limited visibility foggy conditions, the algorithm demonstrated exceptional detection capabilities. Even in dark environments where human eyes struggle to distinguish objects, the CME-YOLO algorithm consistently performed robust object detection tasks. This was attributed to the algorithm’s efficient fusion of cross-modal information, enabling it to leverage supplementary information from the infrared modality when visible light modality information was insufficient. These experimental results fully validate the efficiency and robustness of the CME-YOLO algorithm in cross-modal object detection tasks. By innovatively addressing challenges in feature representation and modality fusion, the CME-YOLO algorithm delivers reliable and accurate object detection results under diverse complex weather conditions.
In this study, we not only focused on the effectiveness of the object detection algorithm but also provided a visual comparison of weather images before and after Domain Translation, with specific results displayed in
Figure 12.
From
Figure 12, it can be observed that although Domain Translation significantly mitigated the impact of adverse weather (rain, snow, and fog) on image quality, residual artifacts such as "rain streaks" and "snow streaks" persisted in the processed images. These artifacts indicate that the negative effects of adverse weather on RGB images were not entirely eliminated and, to some extent, continued to pose challenges for object detection.
These residual artifacts may lead to several issues: target occlusion, where rain or snow streaks partially obscure a target and make it difficult for the detector to identify and locate it accurately; feature interference, where the artifacts disrupt the extraction of target features and reduce the precision and robustness of the detector; and false positives or false negatives, where the detector mistakes an artifact for a target or overlooks genuine targets obscured by the artifacts.
Given the limitations of Domain Translation in handling complex scenarios, our proposed CME-YOLO algorithm demonstrates significant advantages in object detection tasks across various weather conditions. Through cross-modal information fusion, the CME-YOLO algorithm can fully leverage the supplementary information from the infrared modality to compensate for the deficiencies of the RGB modality under adverse weather conditions; enhance feature representation to improve the detector’s perception of targets, thereby reducing false positives and false negatives; and boost robustness, enabling the detector to perform object detection tasks stably under a wide range of complex weather conditions. Based on the above analysis, we believe that the CME-YOLO algorithm holds substantial practical value in multi-weather object detection scenarios for autonomous driving systems.
5. Discussion
This paper comprehensively evaluated the performance of our proposed CME-YOLO algorithm in multi-weather object detection and cross-modal object detection tasks through extensive experiments conducted on two RGB–infrared object detection datasets (KAIST [
2] and FLIR [
3]) as well as our self-constructed adverse weather dataset, AWID. The experimental results consistently demonstrated that the CME-YOLO algorithm achieved outstanding performance in both critical tasks.
To further validate the advantages of the CME-YOLO algorithm, we also compared it with the current state-of-the-art deweathering image restoration algorithm, Domain Translation [
1] (DT). Specifically, we first applied the DT algorithm to remove weather effects such as rain, snow, and fog from the images and then used a single-modality object detector for detection. The experimental results indicated that the DT [
1] algorithm could indeed mitigate the impact of weather on images to a certain extent, making the processed images appear clearer overall. However, through careful observation of the visualized images of the deweathering effects, we identified several issues. Although the DT [
1] algorithm removed most of the weather effects, some residual artifacts such as "rain streaks" and "snow streaks" remained, which could negatively affect object detection. During the deweathering process, the DT [
1] algorithm might cause sharpening distortions or other forms of detail loss in image regions with fine details, potentially interfering with target feature extraction. In contrast, cross-modal object detection, benefiting from the information provided by infrared images and the design of cross-modal algorithms, could better handle various complex weather conditions and provide more stable and reliable detection results.
Compared to existing cross-modal algorithms, we also explored fusion strategies but with a distinct approach. In our CME-YOLO algorithm, we performed more refined processing on feature tensors both before and after seeking the optimal fusion strategy. Prior to fusing the two-modality feature tensors, we employed dimension transformation and grouped convolution to achieve multi-scale feature extraction and enhancement of critical features. During the fusion process, we utilized a Transformer variant to identify the optimal fusion strategy. After obtaining the fused tensor, within the detection head, we introduced an adaptive upsampling module called AdSample. This module ensured that feature representation remained unchanged during the upsampling process and leveraged convolutional neural networks to find the best sampling regions, selecting the most favorable sampling reference range for object detection. Through these meticulous processing steps in the dual-stream interaction, our algorithm demonstrated superior performance advantages over existing algorithms in cross-modal detection tasks. These improvements not only enhanced the model’s ability to utilize multimodal information but also improved its detection accuracy and stability in complex scenarios.
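To summarize the processing order described above in one place, the following schematic sketches the dual-stream flow: per-modality enhancement with grouped convolution, Transformer-style cross-attention for fusion, and content-aware upsampling before the detection head. The module internals here are deliberately simplified stand-ins and do not reproduce the actual CPTFusion or AdSample designs.

```python
import torch
import torch.nn as nn

class FusionPipelineSketch(nn.Module):
    """Simplified stand-in for the described pipeline: grouped-convolution
    enhancement per modality, cross-attention fusion over flattened tokens,
    and upsampling before detection. Not the paper's implementation."""

    def __init__(self, channels: int = 256, groups: int = 4, heads: int = 8):
        super().__init__()
        # Pre-fusion enhancement: grouped 3x3 convs as a stand-in for the
        # group-wise, multi-scale refinement applied to each modality.
        self.enhance_rgb = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.enhance_ir = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        # Fusion: multi-head attention lets RGB tokens attend to IR tokens,
        # approximating a Transformer-based search for the fusion weighting.
        self.cross_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Post-fusion upsampling: nearest upsample + conv as a placeholder
        # for adaptive, learnable upsampling in the detection head.
        self.upsample = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f_rgb.shape
        q = self.enhance_rgb(f_rgb).flatten(2).transpose(1, 2)   # (B, HW, C)
        kv = self.enhance_ir(f_ir).flatten(2).transpose(1, 2)    # (B, HW, C)
        fused, _ = self.cross_attn(q, kv, kv)                    # cross-modal attention
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.upsample(fused)                              # passed to the detection head
```

In the actual modules, multi-scale feature pyramids would be preserved and the upsampling would be learned and content-adaptive rather than fixed nearest-neighbor, but the ordering of the three stages matches the description above.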
6. Future Work
In future research, we plan to improve and extend the multimodal target detection algorithm proposed in this paper for autonomous driving systems under adverse weather conditions in the following aspects:
Availability of infrared images. We will continue to explore other possible solutions, such as using radar or other modality data to supplement or replace infrared images. Additionally, we will further research how to improve algorithm performance without infrared images, for example, by enhancing visible light image processing techniques or developing new multimodal fusion strategies.
AWID dataset expansion. To enhance the diversity and representativeness of the AWID dataset, we will expand the dataset’s coverage by collecting and annotating more image data under extreme weather conditions, such as heavy snow, strong light, hail, and sleet. We will also increase the proportion of real-world data to reduce our reliance on synthetic data, thereby enhancing the dataset’s authenticity and diversity. Finally, we will collect real paired visible light and infrared images under various weather conditions to more accurately learn multimodal image features under different weather conditions, thereby improving the robustness and generalization of target detection algorithms.
Experimental setup improvement. To comprehensively evaluate the performance of the algorithm in the real world, we will make the following improvements. We will test the algorithm in a variety of experimental environments, including different lighting conditions, camera settings, and scene types, to assess its performance under various real-world conditions. Additionally, we will conduct dedicated robustness testing to evaluate the stability and reliability of the algorithm when faced with camera calibration errors, lighting changes, and other real-world interference. Finally, we will utilize data augmentation techniques to simulate real-world challenges, such as camera jitter and scene variations, to train more robust models.
User interaction and interface design. We will design user-friendly interfaces that enable users of autonomous driving systems to interact easily with the target detection algorithm. This will include providing clear feedback on detection results and allowing users to adjust algorithm parameters or behavior as needed. We will also consider integrating the target detection algorithm with other components of the autonomous driving system, such as path planning and decision-making modules, to ensure seamless collaboration and optimized user experience.
Through these future efforts, we aim to further refine and apply the proposed multimodal target detection algorithm, enabling it to play a more significant role in target detection tasks for autonomous driving systems under adverse weather conditions.
7. Conclusions
In this work, we developed a dataset named AWID (Adverse Weather and Illumination Dataset) for autonomous driving scenarios under adverse weather conditions, comprising 20,000 pairs of visible light and infrared images. This dataset encompassed two lighting scenarios—daytime and nighttime—as well as three complex weather conditions: rain, snow, and fog. It can provide abundant data support for various computer vision tasks, including object detection, image fusion, image translation, and image restoration.
Furthermore, we proposed a novel cross-modal object detection algorithm called Cross-Modal Enhanced YOLO (CME-YOLO). This algorithm aims to significantly enhance the feature extraction and detection capabilities of dual-stream CNNs in cross-modal object detection by learning detailed features from different modalities and fusing effective information. Specifically, CME-YOLO integrates our self-designed Adaptive Upsampling module (AdSample) and Cross-Perception Fusion module (CPTFusion). These modules enable the efficient extraction of effective information from different modalities and facilitate complementary fusion of cross-modal features. Consequently, CME-YOLO provides comprehensive performance assurance for object detection tasks under adverse weather conditions across all stages, including feature extraction, fusion representation, and effective detection.
Experimental results on two RGB-Infrared object detection datasets and our self-constructed AWID dataset demonstrated that CME-YOLO surpassed current state-of-the-art single-modal and cross-modal object detection algorithms. Even when compared to detection results processed by multi-weather image restoration algorithms like Domain Translation, CME-YOLO still achieved superior performance. Specifically, on the AWID dataset, CME-YOLO attained optimal performance with mAP values of 69.29% for day, 67.36% for night, 68.36% for fog, 66.07% for rain, and 68.97% for snow. On the AWID-Rain, KAIST [
2], and FLIR [
3] datasets, it achieved mAP50 metrics of 91.73%, 98.02%, and 86.9%, respectively, showcasing exceptional performance. Notably, on the FLIR [
3] dataset, CME-YOLO’s mAP value exceeded that of the current second-best algorithm, CFT [
5], by 10.86%, and its mAP50 value surpassed that of the previously most advanced algorithm, MFPT [
6], by 6.9%. These results fully validate its strong competitiveness in the field of cross-modal object detection.
To further validate the universality and effectiveness of the CME-YOLO algorithm, we integrated its core modules, CPTFusion and AdSample, with three classic detectors—YOLOv3 [
52], YOLOv5 [
53], and YOLOv10 [
37]—and conducted experiments on the FLIR [
3] dataset. The results indicated that after incorporating CPTFusion and AdSample modules and constructing dual-stream object detection algorithms following the design principles outlined in this paper, the performance of these classic detectors significantly improved compared to their single-stream or dual-stream counterparts. This demonstrates that not only does the CME-YOLO algorithm exhibit outstanding performance on its own, but its core modules and design principles also possess strong generalization and robustness. They can provide substantial performance enhancements for other detectors as well.
The experimental results demonstrate that the CME-YOLO algorithm proposed in this paper exhibits significant advantages in both traditional cross-modal object detection tasks and adverse weather cross-modal object detection tasks. It provides strong technical support for the all-weather visual perception of autonomous driving systems.