1. Introduction
Due to the theoretical and technical limitations of hardware equipment, information obtained from a single modal sensor or a single shooting setting is often insufficient for effectively and comprehensively capturing the nuances of the imaging scene [
1]. For instance, reflected light captures only the surface attributes of an object; it cannot reveal the object's interior or what lies beyond its immediate surface, offering only a partial view of its external characteristics. To overcome this limitation, image fusion has been proposed. Its aim is to combine data from multiple sources into a composite image, improving the accuracy of image analysis and interpretation by integrating the advantages of different sensors. In particular, the fusion of infrared and visible images provides complementary information and assistance for representation learning, thus enhancing scene depiction and visual perception. By leveraging this characteristic, image fusion technology significantly enhances the performance of diverse natural image tasks, including surveillance [
2], target recognition [
3], semantic delineation [
4], etc.
Over the past few decades, significant advancements have been made in the development of traditional methods, with a particular focus on fusion strategies and theories for integrating infrared and visible images. These methods can be largely categorized into five types: multi-scale transformation techniques [
5,
6], sparse representation strategies [
7], subspace clustering algorithms [
8], saliency-based methodologies [
9], and optimization approaches [
10]. Each category employs mathematical transformations to convert the source images into a specific domain, where activity levels are manually designed and fusion rules are established to achieve image fusion. Although these methods succeed in certain fusion tasks, the transformations they rely on are becoming increasingly complex. Moreover, the same transformations are frequently applied to extract features from both source images, which neglects the unique characteristics inherent in infrared and visible images. Additionally, the limited selection of fusion rules and the handcrafted features often fail to address complex scenarios [
11].
Currently, the development of deep learning is driving the creation of numerous deep fusion methods. These methods are primarily categorized into three types based on different network architectures. They are based on autoencoders (AEs) [
12,
13], convolutional neural networks (CNNs) [
14,
15], and generative adversarial networks (GANs) [
16,
17,
18]. Deep learning methods usually outperform traditional fusion methods in terms of fusion performance, but they also face new challenges. One critical issue is the difficulty of selectively utilizing pertinent information to guide the training of the fusion framework, owing to the lack of specific fusion labels. While some end-to-end models seek to address this by weighting the pixel values of the source and fused images, they struggle because all regions are treated equally when constructing the loss function [
19]. Furthermore, the field of image fusion largely overlooks the issue of unbalanced lighting. For instance, daytime visible images exhibit high resolution and rich texture, whereas nighttime visible images suffer from a loss of texture detail. When the same fusion model is applied to both scenarios, it may yield suboptimal results, thereby degrading fusion performance.
To address the previously mentioned challenges, we introduce a light-aware sub-network that utilizes a light intensity loss as a guiding mechanism for enhancing the fusion network's learning capabilities. In the fusion network, our proposed transformer technique integrates both contextual information mining and self-attention learning into a unified architecture. Subsequently, we propose a detail enhancement module that utilizes information from both infrared and visible images, effectively enhancing edge depiction by extracting common and complementary features. In summary, this paper presents three significant contributions:
We develop a progressive image fusion framework that explores the information contained within the two modalities. This framework can autonomously learn advanced image features based on lighting intensity and adaptively fuse complementary and common information, thereby effectively fusing meaningful information from the source images under all-weather conditions.
A multi-state joint feature extraction module is proposed to enable context mining between keys and self-attentive learning of feature maps in an intra-clustered manner. The module integrates these elements into a unified architecture that merges static and dynamic contextual representations into a consistent output.
We introduce a differential module with learnable parameters for the activation function. It can dynamically adapt to the distribution of input data, effectively extracting and integrating key information from the source images while supplementing differential information.
The rest of this paper is organized as follows:
Section 2 reviews related work on infrared and visible image fusion techniques.
Section 3 describes the proposed method in detail, including multi-state joint feature extraction, Difference-Aware Propagation Module and loss function. The experiments are thoroughly analyzed and discussed in
Section 4. Finally,
Section 5 provides a comprehensive summary of this work.
3. Method
This section introduces the overview of the proposed network, namely the Self-Attention Progressive Network. Then, the multi-state joint feature extraction, Difference-Aware Propagation Module and loss function are presented in detail.
3.1. Overview
Figure 1 illustrates the proposed framework. The network architecture comprises a multi-state joint feature extraction module (MSJFEM), a Difference-Aware Propagation Module (DAPM) and an image reconstruction module (IRM). To enhance the contextual perception capability of multi-modal images, we adopt enhanced self-attention learning to design the MSJFEM, which extracts deep features by leveraging additional contextual information between input keys. On this basis, the extracted deep features are processed differentially through the DAPM to better highlight the edge details of the image. Considering noise and artifacts in the fusion process, the IRM is proposed to further refine and reconstruct the details of the images. Additionally, a light-aware sub-network is integrated into the framework to improve the model's adaptability to lighting conditions. Accordingly, we introduce a light intensity loss to measure the probability of scene illumination. By combining Transformers and CNNs, this framework employs a designed loss function to autonomously learn critical image features based on light intensity, such as texture details and unique targets. As a result, it effectively fuses meaningful information from the source images under various conditions.
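To make the data flow concrete, the sketch below composes the modules in the order just described. It is only a structural sketch under several assumptions: the sub-module names, the use of separate MSJFEM branches per modality, and the fact that the light-aware weights enter only the loss are inferred from this overview, not taken from a released implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionProgressiveNetwork(nn.Module):
    """Structural sketch: MSJFEM -> DAPM -> IRM, plus a light-aware sub-network."""
    def __init__(self, msjfem_ir, msjfem_vi, dapm, irm, light_net):
        super().__init__()
        self.msjfem_ir, self.msjfem_vi = msjfem_ir, msjfem_vi
        self.dapm, self.irm, self.light_net = dapm, irm, light_net

    def forward(self, ir, vi):
        f_ir = self.msjfem_ir(ir)            # deep infrared features
        f_vi = self.msjfem_vi(vi)            # deep visible features
        f_fused = self.dapm(f_ir, f_vi)      # difference-aware propagation
        fused = self.irm(f_fused)            # image reconstruction
        # the light-aware sub-network outputs two logits; their softmax gives
        # day/night probabilities used as weights in the illumination loss
        w_day, w_night = self.light_net(vi).softmax(dim=1).unbind(dim=1)
        return fused, (w_day, w_night)
```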
3.2. Multi-State Joint Feature Extraction
In the traditional self-attention mechanism, individual queries and keys are employed at each location to generate the attention matrix. As a result, the mechanism does not fully exploit the rich contextual information available between neighboring keys. To address this issue, the multi-state joint feature extraction module is proposed. This module improves self-attention learning by utilizing extra contextual information between input keys and integrates context mining with the self-attention learning process within a unified architecture. Here, a new multi-head self-attention block is proposed, whose framework is shown in
Figure 2. Specifically, given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, this module transforms $X$ into the value $V$ and key $K$ through the embedding matrices $W_V$ and $W_K$, while the queries $Q$ are not processed. Meanwhile, the embedding matrix is implemented spatially as a 3 × 3 convolution. Then, the input key is context-encoded through a 3 × 3 convolution; this process reflects the static context between local neighboring keys. The learned contextual key content $m$ is utilized as a static representation and is then cascaded with $Q$. The concatenation is followed by depthwise separable convolutions to extract and integrate spatial and channel features, which reduces the number of parameters and the computational complexity of the module. In addition, we introduce reflection-padding convolutions to deal with the feature map boundaries more effectively. The process can be defined as

$$P = \mathrm{Conv}\,(m \odot Q),$$

where $P$ denotes the features obtained from the above process, ⊙ denotes the cascade operation and $\mathrm{Conv}(\cdot)$
denotes convolution.
To enhance self-attention learning by mining additional guidance from the static context $m$, a Softmax function is applied along the channel dimension, resulting in an attention matrix. Then, a matrix multiplication is carried out between the attention matrix and the value $V$ to generate the weighted attention feature map $n$. The process can be defined as

$$n = \mathrm{Softmax}(P) * V,$$

where ∗ denotes a local matrix multiplication operation. The output generated by the MSJFEM is the fusion of the static and dynamic contexts of the input feature map, combined through an additive residual connection.
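For illustration, a minimal PyTorch-style sketch of this block is given below. The kernel sizes of the embeddings, the channel widths, the single-head layout, and the element-wise weighting that stands in for the local matrix multiplication are assumptions made for the sketch rather than the exact configuration of the module.

```python
import torch
import torch.nn as nn

class MultiStateJointBlock(nn.Module):
    """Sketch of the multi-state joint feature extraction block."""
    def __init__(self, dim):
        super().__init__()
        # static context m: 3x3 reflection-padded convolution over the keys
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, padding_mode='reflect', bias=False),
            nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        # value embedding V
        self.value_embed = nn.Conv2d(dim, dim, 3, padding=1,
                                     padding_mode='reflect', bias=False)
        # depthwise separable convs on the [m, Q] cascade -> attention logits P
        self.attn = nn.Sequential(
            nn.Conv2d(2 * dim, 2 * dim, 3, padding=1,
                      padding_mode='reflect', groups=2 * dim, bias=False),
            nn.Conv2d(2 * dim, dim, 1, bias=False))

    def forward(self, x):
        m = self.key_embed(x)                    # static context between neighboring keys
        v = self.value_embed(x)                  # value V
        p = self.attn(torch.cat([m, x], dim=1))  # queries Q = x, cascaded with m
        n = v * p.softmax(dim=1)                 # dynamic context via channel Softmax
        return m + n                             # static + dynamic context fused by addition
```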
3.3. Difference-Aware Propagation Module
To better integrate complementary and common features, the Difference-Aware Propagation Module (DAPM) is proposed, whose flowchart is depicted in
Figure 3. Specifically, depth information obtained from infrared and visible images using the MSJFEM is used as input for this module. In this module, cross-modal difference calculation is performed to obtain complementary features. The channel weighting process is employed to integrate complementary information. The process can be defined as
$$F_{out} = \big(F_c \otimes \sigma(G(F_c))\big) \odot F_{in},$$

where $F_{in}$ denotes the input deep features, $F_c$ denotes the complementary features obtained by the cross-modal difference calculation, $F_{out}$ is the output of the module, ⊙ denotes channel cascade, ⊗ denotes channel multiplication, $\sigma$ represents the activation function, and $G$ represents global average pooling. In particular, the module uses a Swish activation function with learnable parameters, which can be dynamically adjusted according to the distribution of the input data, automatically adapts to the feature distributions of different modalities, and balances the information differences between them. First, the initial parameters and the active value range of the Swish activation function are given. Then, the complementary features are passed through global average pooling and compressed into a vector. In this process, the Swish function automatically adjusts its shape and self-optimizes based on the initial parameters and active values. Finally, the optimized channel weights are multiplied with the complementary features, and the resulting features are cascaded with the original ones to facilitate cross-modal information interaction and enhance the utilization of complementary information. Here, the complementary information reflects the complementary features of the different modalities. Meanwhile, through strategies such as global average pooling and feature cascading, the DAPM can effectively integrate information from different modalities while maintaining balance between them.
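The sketch below follows this verbal description: bidirectional cross-modal differences give complementary features, global average pooling followed by a Swish activation with a learnable parameter produces channel weights, and the re-weighted differences are cascaded back onto the original features. The 1 × 1 weighting convolution and the exact arrangement of the branches are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LearnableSwish(nn.Module):
    """Swish with a learnable slope beta: x * sigmoid(beta * x)."""
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(float(beta)))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

class DAPM(nn.Module):
    """Sketch of the Difference-Aware Propagation Module."""
    def __init__(self, dim):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)   # G: global average pooling
        self.act = LearnableSwish()          # learnable activation
        self.fc = nn.Conv2d(dim, dim, 1)     # channel weighting (assumed 1x1 conv)

    def forward(self, f_ir, f_vi):
        d_ir = f_ir - f_vi                   # bidirectional difference (ir - vi)
        d_vi = f_vi - f_ir                   # bidirectional difference (vi - ir)
        w_ir = self.act(self.fc(self.gap(d_ir)))   # channel weights, ir branch
        w_vi = self.act(self.fc(self.gap(d_vi)))   # channel weights, vi branch
        # weighted complementary features cascaded with the original features
        return torch.cat([f_ir, f_vi, d_ir * w_ir, d_vi * w_vi], dim=1)
```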
3.4. Light-Aware Sub-Network
The light-aware sub-network consists of four convolutional blocks and two fully connected blocks. The first two convolutional blocks each contain a 3 × 3 convolutional layer with batch normalization and a rectified linear unit (ReLU) activation layer. To simplify the network, max pooling is added in the subsequent convolutional blocks. The two fully connected blocks include max pooling, batch normalization and ReLU activation layers to improve the network's generalization ability and alleviate the vanishing-gradient problem. Since the goal of this network is to distinguish between nighttime and daytime conditions, the final layer contains two neurons, from which the illumination probabilities are obtained. To further enhance the network's sensitivity to different lighting conditions, we normalize the illumination probabilities to obtain light-aware weights, which are adapted to different lighting conditions through the illumination allocation mechanism.
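A minimal sketch of this classifier is given below. The channel widths, the placement of the pooling before the fully connected layers, and the exact block ordering are assumptions; only the overall structure (four convolutional blocks, two fully connected blocks, two output neurons with Softmax) follows the description above.

```python
import torch
import torch.nn as nn

class LightAwareSubNetwork(nn.Module):
    """Sketch: predicts day/night probabilities from a visible image."""
    def __init__(self, in_ch=3, width=16):
        super().__init__()
        def block(cin, cout, pool=False):
            layers = [nn.Conv2d(cin, cout, 3, padding=1),
                      nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
            if pool:
                layers.append(nn.MaxPool2d(2))
            return nn.Sequential(*layers)
        self.features = nn.Sequential(
            block(in_ch, width), block(width, width),       # first two conv blocks
            block(width, 2 * width, pool=True),              # later blocks add max pooling
            block(2 * width, 2 * width, pool=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Sequential(
            nn.Linear(2 * width, width), nn.BatchNorm1d(width), nn.ReLU(inplace=True),
            nn.Linear(width, 2))                             # two neurons: day / night

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)   # normalized illumination probabilities
```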
3.5. Loss Function
3.5.1. Loss Function of Light-Aware Sub-Network
The performance of the fusion process significantly depends on the precision of the light-aware sub-network. Illumination classification is a binary task that computes the probabilities of an image belonging to day and night. Therefore, to guide the training of the light-aware sub-network, we adopt a cross-entropy loss, which is defined as

$$\mathcal{L}_{light} = -\sum_{i \in \{d,\, n\}} z_i \log\big(\partial(y_i)\big),$$

where $z$ represents the labels of the day and night images, $y$ denotes the classification output of the light-aware sub-network ($y_d$, $y_n$), and $\partial$ represents the Softmax function that normalizes the output probabilities.
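In PyTorch terms (a framework assumption), this is the standard two-class cross-entropy; a minimal usage sketch, with an assumed label convention of 0 for day and 1 for night:

```python
import torch
import torch.nn as nn

# logits: (B, 2) raw outputs of the light-aware sub-network before Softmax
# labels: (B,) integer labels, 0 = day, 1 = night (assumed convention)
criterion = nn.CrossEntropyLoss()       # applies log-Softmax internally
logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = criterion(logits, labels)
```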
3.5.2. Loss Function of Self-Attention Progressive Network
The loss function dictates the kind of information preserved in the fused image, as well as the relative importance assigned to the various types of information. The overall objective function of the Self-Attention Progressive Network comprises the illumination loss, the auxiliary intensity loss, and the gradient loss, and is defined as

$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{illum} + \lambda_2 \mathcal{L}_{aux} + \lambda_3 \mathcal{L}_{grad},$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the balance factors. In the image fusion process, to alleviate the problem of detail loss in visible images under environments with large differences in light intensity, we introduce the illumination loss to enhance the detail content of visible images under adverse lighting conditions. The auxiliary intensity loss guides the network to dynamically maintain an optimal intensity distribution, ensuring a balanced intensity distribution between the infrared and visible images. Guided by the gradient loss, the fusion result maintains clear edge details while preserving the texture features inherent in the original images.
The illumination loss encourages the Self-Attention Progressive Network to adaptively integrate contextual information according to the lighting conditions. It is composed of the light-aware weights and the intensity loss. The light-aware weights adjust the weight constraints on the fused image, whereas the intensity loss quantifies the pixel-level disparities between the fused image and its source images. The illumination loss is defined as

$$\mathcal{L}_{illum} = W_{ir}\,\mathcal{L}_{int}^{ir} + W_{vi}\,\mathcal{L}_{int}^{vi},$$

where $W_{ir}$ and $W_{vi}$ denote the light-aware weights. Because the two source images exhibit different intensity distributions under varying lighting conditions, the weight constraints of the fused image are adjusted according to the lighting environment so that more contextual information can be obtained. $\mathcal{L}_{int}^{ir}$ and $\mathcal{L}_{int}^{vi}$ represent the intensity losses of the infrared and visible images, respectively, and the intensity loss is defined as

$$\mathcal{L}_{int}^{ir} = \left\| I_f - I_{ir} \right\|_1, \qquad \mathcal{L}_{int}^{vi} = \left\| I_f - I_{vi} \right\|_1,$$

where $\left\| \cdot \right\|_1$ represents the L1 norm of a vector, $I_f$ denotes the fused image, and $I_{ir}$ and $I_{vi}$ represent the source images.
In the illumination loss, although the intensity information from the source images is dynamically preserved according to the real-time lighting conditions, it cannot ensure that the fused image maintains an optimal intensity distribution. Therefore, the auxiliary intensity loss is introduced, which is defined as

$$\mathcal{L}_{aux} = \left\| I_f - \max(I_{ir}, I_{vi}) \right\|_1,$$

where $\max(\cdot)$ denotes element-wise selection of the maximum value.
While maintaining the optimal intensity distribution, the fused image should also retain rich texture details. We therefore incorporate a gradient loss to enhance the retention of texture detail, which is defined as

$$\mathcal{L}_{grad} = \left\| \, |\nabla I_f| - \max\!\big(|\nabla I_{ir}|,\, |\nabla I_{vi}|\big) \right\|_1,$$

where ∇ denotes the Sobel operator and $|\cdot|$ denotes the absolute value.
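To make the objective concrete, the snippet below sketches the loss terms as reconstructed above. The single-channel input assumption, the mean-based L1 reduction, and the use of the per-pixel maximum inside the gradient term are implementation assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(img):
    """Gradient magnitude approximation with Sobel kernels (input: B x 1 x H x W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx.to(img), padding=1)
    gy = F.conv2d(img, ky.to(img), padding=1)
    return gx.abs() + gy.abs()

def fusion_loss(fused, ir, vi, w_ir, w_vi, l1=3.0, l2=7.0, l3=50.0):
    """L_total = l1 * L_illum + l2 * L_aux + l3 * L_grad (defaults follow Section 4.3)."""
    l_int_ir = (fused - ir).abs().mean()                  # intensity loss vs. infrared
    l_int_vi = (fused - vi).abs().mean()                  # intensity loss vs. visible
    l_illum = w_ir * l_int_ir + w_vi * l_int_vi           # illumination loss
    l_aux = (fused - torch.max(ir, vi)).abs().mean()      # auxiliary intensity loss
    l_grad = (sobel_gradient(fused) -
              torch.max(sobel_gradient(ir), sobel_gradient(vi))).abs().mean()
    return l1 * l_illum + l2 * l_aux + l3 * l_grad
```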
4. Experiment
4.1. Datasets
4.1.1. MSRS Dataset [37]
The dataset is a multispectral collection for infrared-visible image fusion, derived from the MFNet dataset. It comprises 1444 image pairs, each consisting of a color visible image and a corresponding infrared image. These images are captured during the day and at night in various traffic scenes, such as campuses, streets and rural areas.
4.1.2. TNO Dataset [41]
The dataset contains 261 image pairs, which offer visible light, near-infrared and long-wave infrared nighttime images of various military and surveillance scenes. These images depict different object targets (such as people, vehicles) against various backgrounds (e.g., rural, urban).
4.1.3. RoadScene Dataset [42]
The dataset comprises 221 infrared-visible image pairs, which are generated from FLIR (Forward-Looking Infrared) video after preprocessing steps such as alignment and cropping. It includes a rich variety of scenes, such as roads, pedestrians, etc.
4.2. Evaluation Metrics and Comparison Methods
To showcase the superiority of the proposed method, nine approaches for infrared and visible image fusion are employed, including SSR-Laplacian [
22], FusionGAN [
18], RFN-Nest [
13], SEAFusion [
43], SwinFusion [
44], U2Fusion [
42], TarDAL [
45], UMF-CMGR [
46], and CDDFuse [
47]. Based on training methods, these approaches can be categorized into two groups: supervised and unsupervised. The supervised methods include RFN-Nest, SEAFusion, SwinFusion, and TarDAL, while the remaining techniques fall under the unsupervised category.
To evaluate the performance, six metrics are utilized, i.e., Entropy (EN), Mutual Information (MI), Average Gradient (AG), Structural Similarity Index (SSIM), Visual Information Fidelity (VIF) and $Q^{AB/F}$. From the perspective of statistical information measurement, EN measures the complexity or uncertainty of the information shared between the source images and the fused image, while MI quantifies the similarity between the source images and the fused image. From the perspective of mathematical statistics, AG assesses the sharpness of edges and the amount of detailed information present in the fused image. In terms of image quality assessment, SSIM measures the structural similarity between two images and VIF assesses the effect of image distortion on visual information content; both reflect the human visual system's perception of image quality. $Q^{AB/F}$ measures the degree to which detailed edge features from the source images are transferred to the fused image. Higher values of EN, MI, AG, SSIM, VIF and $Q^{AB/F}$ signify superior fusion performance.
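As an illustration, the snippet below computes two of these metrics, EN (Shannon entropy of the grayscale histogram) and AG (mean gradient magnitude), with NumPy; the remaining metrics require reference implementations and are omitted here.

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy of an 8-bit grayscale image."""
    hist, _ = np.histogram(img, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def average_gradient(img):
    """AG: mean magnitude of horizontal/vertical intensity differences."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))
```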
4.3. Implementation Details
The MSRS dataset serves as the basis for training both the light-aware sub-network and the Self-Attention Progressive Network. Given the limited number of training samples, we randomly crop 64 patches from each image. Each patch undergoes augmentation through random flipping, rotation, and rolling. Prior to inputting the patches into the network, all images are normalized to the range of [0, 1].
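A sketch of this patch preparation is given below, assuming aligned 8-bit inputs; the patch size is not stated in this section, so it is left as a parameter.

```python
import random
import numpy as np

def make_patches(ir, vi, num_patches=64, patch=128):
    """Randomly crop aligned patches, augment by flip/rotation/roll, scale to [0, 1]."""
    h, w = ir.shape[:2]
    pairs = []
    for _ in range(num_patches):
        y, x = random.randint(0, h - patch), random.randint(0, w - patch)
        p_ir, p_vi = ir[y:y + patch, x:x + patch], vi[y:y + patch, x:x + patch]
        if random.random() < 0.5:                 # random horizontal flip
            p_ir, p_vi = np.fliplr(p_ir), np.fliplr(p_vi)
        k = random.randint(0, 3)                  # random 90-degree rotation
        p_ir, p_vi = np.rot90(p_ir, k), np.rot90(p_vi, k)
        shift = random.randint(0, patch - 1)      # random roll along the width
        p_ir, p_vi = np.roll(p_ir, shift, axis=1), np.roll(p_vi, shift, axis=1)
        pairs.append((p_ir.astype(np.float32) / 255.0,
                      p_vi.astype(np.float32) / 255.0))
    return pairs
```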
Specifically, we train the light-aware sub-network and the Self-Attention Progressive Network in sequence. First, the cropped visible images are used to train the light-aware sub-network to obtain the light-aware weights. These light-aware weights are then used to construct the illumination loss during the training of the Self-Attention Progressive Network. We adopt the ADAM optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$ to optimize both networks. The learning rate is initialized and subsequently reduced through exponential decay during training. The light-aware sub-network is configured with a batch size of 128 and 100 training epochs. For the Self-Attention Progressive Network, the batch size is set to 32 with 30 training epochs. For Equation (6), we set $\lambda_1 = 3$, $\lambda_2 = 7$, and $\lambda_3 = 50$, which are obtained through cross-validation and iterative adjustment. To preserve color information, the visible image is first converted to the YCbCr color space, and the Y, Cb, and Cr channels are separated. Then, the various fusion methods are applied to combine the Y channel of the visible image with the infrared image. Finally, the fused Y channel is recombined with the original Cb and Cr channels and transformed back into the RGB color space [
48].
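The color-handling step can be sketched with OpenCV (an assumed dependency); `fuse_y` stands in for the trained fusion network operating on single-channel inputs.

```python
import cv2
import numpy as np

def fuse_color(vi_bgr, ir_gray, fuse_y):
    """Fuse the Y channel of the visible image with the infrared image,
    then restore color from the original Cb/Cr channels."""
    ycrcb = cv2.cvtColor(vi_bgr, cv2.COLOR_BGR2YCrCb)   # note OpenCV's Y, Cr, Cb order
    y, cr, cb = cv2.split(ycrcb)
    fused_y = fuse_y(y, ir_gray)                        # network output, uint8 grayscale assumed
    fused = cv2.merge([fused_y, cr, cb])
    return cv2.cvtColor(fused, cv2.COLOR_YCrCb2BGR)
```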
4.4. Performance Comparison with Existing Approaches
To fully evaluate the performance of the proposed method, we perform quantitative as well as qualitative comparisons with nine other methods on the MSRS dataset, RoadScene dataset, and TNO dataset. Specifically, this model is trained exclusively on the MSRS dataset for comparative experiments and then generalized to both the RoadScene and TNO datasets for assessing generalization capabilities.
4.4.1. Quantitative Evaluation
Table 1 presents the quantitative outcomes of the six evaluation metrics for 150 image pairs from the MSRS dataset. It can be seen that the proposed method shows superiority in some metrics. The best EN value means that the proposed method obtains more information from the source images, which indirectly reflects that the model can extract more information after adding the light-aware sub-network, especially in night scenes. The best AG and $Q^{AB/F}$ values indicate that more edge information is retained in the fusion results, benefiting from the use of reflection-padding convolutions and the proposed DAPM. In addition, the higher VIF and SSIM values indicate that the fused image has a satisfactory visual effect.
Thirty-five image pairs are randomly selected from the RoadScene dataset and twenty-five image pairs from the TNO dataset for direct testing within the proposed method. The quantitative results are presented in
Table 2 and
Table 3. It can be seen that the proposed method shows superiority in some metrics. The higher EN and MI values indicate greater extraction of information from the source images and better image quality, which indirectly reflects that the proposed model pays more attention to the coordination between EN and MI. The proposed method exhibits average performance in terms of the AG metric. This is relatively reasonable, as the two datasets mainly contain daylight scenes, and the proposed method tends to adjust the intensity of nighttime images to reduce the contrast of the fused images. In addition, the higher VIF and SSIM values indicate that the fused image is more similar to the source images and has a satisfactory visual effect.
4.4.2. Qualitative Evaluation
To demonstrate that the proposed network can fuse meaningful information from source images in different environments, one daytime image and one nighttime image on the MSRS dataset are selected for qualitative evaluation. The visual results are presented in
Figure 4 and
Figure 5.
In day scenes, visible images contain abundant texture information. Thermal radiation information from infrared images can be used as complementary information to visible images. This suggests that the fusion algorithm must maintain the texture details of the visible images, enhance the salience of targets, and minimize spectral distortion. As shown in
Figure 4, FusionGAN, SSR-Laplacian and TarDAL do not retain the detailed information of the visible images well. Although the comparison networks can fuse the texture details of the visible images with the target details in the infrared image, FusionGAN, RFN-Nest, SEAFusion, U2Fusion and TarDAL suffer from blurry or incomplete infrared target edges. Only SwinFusion, CDDFuse and the proposed method completely retain the contour information of the person, as shown by the red boxes. In addition, the green boxes are used to illustrate background areas affected by varying degrees of spectral pollution. While TarDAL is less affected by spectral pollution in the background, it also preserves the texture of the visible images to a lesser degree. Although CDDFuse and the proposed method have similar architectures, we achieve clearer outlines compared with CDDFuse.
In night scenes, thermal radiation information in infrared images contains distinct targets, while limited detail information from visible images can serve as a supplement to the infrared images. As illustrated in
Figure 5, all networks effectively maintain the prominent targets from the infrared image. However, SSR-Laplacian, RFN-Nest, U2Fusion and TarDAL are unable to clearly reveal background details in dark environments due to the absence of visible images detail information. Although FusionGAN and UMF-CMGR can reveal background information in the dark, they exhibit significant blur, as shown by the green box. Additionally, SEAFusion and SwinFusion excel at preserving background information. However, they are unable to retain the details of both the tree trunks and the road fences simultaneously. This phenomenon is shown by the red boxes.
The visual results of the different methods on the TNO dataset and the RoadScene dataset are shown in
Figure 6 and
Figure 7. Although all networks maintain the intensity distribution of critical targets, the person in SSR-Laplacian, FusionGAN, RFN-Nest and UMF-CMGR appears blurred, as shown by the red boxes. The images from FusionGAN, RFN-Nest, UMF-CMGR and TarDAL exhibit uneven contrast, with a significant difference between bright and dark regions. Additionally, the backgrounds in FusionGAN, SSR-Laplacian, RFN-Nest and TarDAL are affected by varying degrees of spectral pollution, which compromises the visual effect; this issue is shown by the green box. The proposed method not only adjusts the image contrast but also preserves more texture and edge details while reducing spectral pollution in the background.
In summary, our experimental results do not show issues such as poor contrast, thermal target degradation, texture blurring, or spectral contamination on the three datasets mentioned above. This indicates that the proposed network exhibits superior performance in light complementarity, as well as in preserving texture and edge details, which is attributed to the proposed MSJFEM and DAPM. However, under extremely low-light conditions or when images are severely blurred, the method does not achieve the best results, since the light-aware sub-network we introduce is designed to balance lighting rather than to restore severely degraded content.
4.5. Efficiency Comparison
Running efficiency is one of the important factors for evaluating model performance. In
Table 4, we provide the average running time of the 10 methods on the MSRS, RoadScene, and TNO datasets. From the table, it can be seen that deep learning methods have significant advantages in running time, while traditional methods take longer to fuse images. Although the Transformer-based method we use contains a large number of parameters, our technique still achieves the optimal fusion effect with competitive processing speed. In addition, because the static and dynamic contexts of the image must be processed during the multi-state joint feature extraction stage, our running time is slightly inferior to that of some other algorithms. In particular, CDDFuse has a structure similar to our method; however, our method still has a significant advantage in terms of computational speed.
4.6. Ablation Study
4.6.1. Study of Multi-State Joint Feature Extraction
The Self-Attention Progressive Network model designs a key module to improve the capability of feature extraction, namely, multi-state joint feature extraction. To verify the effectiveness of this module, we test the model by removing or adding this module on the MSRS dataset, TNO dataset and RoadScene dataset.
Table 5 describes the model performance with different module combinations on different datasets. The design integrates context information mining and self-attention learning into a single architecture through multi-state joint feature extraction. It is beneficial for feature representation and enhances the learning of contextual details. From
Table 5, it can be observed that all six evaluation metrics of the model with multi-state joint feature extraction are higher than those without this module, with the average improvements being most pronounced in MI and $Q^{AB/F}$. This suggests that the module effectively preserves information and enhances the learning of edge details.
4.6.2. Study of Difference-Aware Propagation Module
To enhance feature representation by incorporating complementary and differential information between different modalities, the Difference-Aware Propagation Module is designed to facilitate cross-modal complementary information interaction. To validate the effectiveness of this module, we conduct experiments on the MSRS dataset, TNO dataset and RoadScene dataset by either removing or adding the module to the model.
Table 5 reveals that the AG metric without the module is slightly higher than that with it, which is reasonable, as the module automatically balances with other modules during the fusion process, especially when there is no illumination loss constraint. In particular, the average improvement in SSIM indicates that our module performs well in retaining complementary and differential information.
4.6.3. Study of Bidirectional Difference-Aware Propagation Module
There are typically two methods for obtaining image differences: bidirectional subtraction (DAPM) and absolute value subtraction (BDA). To further validate the effectiveness of bidirectional subtraction in the Difference-Aware Propagation Module, we conduct experiments by applying both methods to three datasets while keeping other parameter settings constant. As shown in
Table 6, fusion images produced using bidirectional subtraction exhibit better visual quality, sharper object edges, and the retention of more information within the fused images. The slight decrease in the AG measurement can be attributed to bidirectional subtraction's ability to preserve the directional information of the differences. For instance, the visible images exhibit unclear targets in the nighttime scenes of the MSRS dataset, and more information from the infrared images is preserved, which results in lower AG values. On the TNO dataset, which contains more daytime scenes, the AG results slightly increase. Although there is a slight decrease in the information-quantity measurements on the RoadScene dataset, the visual quality of the images is significantly improved.
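The two difference schemes compared here can be expressed in a couple of lines; this snippet is only to make the distinction explicit.

```python
import torch

f_ir, f_vi = torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16)

# bidirectional subtraction: keeps the sign (direction) of the difference per branch
d_ir_to_vi, d_vi_to_ir = f_ir - f_vi, f_vi - f_ir

# absolute-value subtraction: a single unsigned difference map
d_abs = (f_ir - f_vi).abs()
```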
4.6.4. Study of Illumination Guidance Loss
Visible images often lack detail under low-light conditions, which diminishes the key information that fusion networks can extract. To address this issue, the Illumination Guidance Loss is designed to guide the training of the Self-Attention Progressive Network. To validate its effectiveness, we test the model on the MSRS, TNO and RoadScene datasets by removing or adding this loss while keeping all other parameters and configurations constant. As shown in
Table 5, the network performance in terms of information quantity and image quality metrics is superior when the Illumination Guidance Loss is included. This indicates that the loss function can adaptively extract important information for different lighting conditions, thereby effectively preventing spectral pollution and texture blurring. The slight decrease in AG measurement can be attributed to the Difference-Aware Propagation Module, which does not have the constraint of this loss and thus plays a more significant role.
Overall, the performance of this module combination surpasses all others, as evident from the experimental results presented in
Table 5 and
Table 6. It demonstrates the efficiency and excellence of the proposed module.