Article

DAGANFuse: Infrared and Visible Image Fusion Based on Differential Features Attention Generative Adversarial Networks

Yuxin Wen and Wen Liu
1 Xi’an Institute of Optics and Precision Mechanics of CAS, Xi’an 710119, China
2 School of Optoelectronics, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4560; https://doi.org/10.3390/app15084560
Submission received: 15 March 2025 / Revised: 16 April 2025 / Accepted: 16 April 2025 / Published: 21 April 2025

Abstract

The purpose of multi-modal visual information fusion is to integrate the data of multiple sensors into a single image of higher quality, richer information, and greater clarity, so that it contains more complementary information and fewer redundant features. Infrared sensors detect the thermal radiation emitted by objects, which is related to their temperature, whereas visible light sensors generate images by capturing the light that interacts with objects through reflection, diffusion, and transmission. Because the two sensor types rely on different imaging principles, infrared and visible images of the same scene differ substantially, which makes it difficult to extract complementary information. Existing methods generally fuse features at the fusion layer by simple concatenation or addition, without considering the intrinsic features of different modal images or the interaction of features between different scales; moreover, they consider only correlation, whereas the image fusion task needs to pay more attention to complementarity. For this reason, we introduce a cross-scale differential features attention generative adversarial fusion network, namely DAGANFuse. In the generator, we designed a cross-modal differential features attention module to fuse the intrinsic content of different modal images: attention weights are computed on parallel paths of differential features and fusion features, with spatial and channel attention weights calculated in parallel on each path. In the discriminator, a dual discriminator is used to maintain the information balance between different modalities and to avoid common problems such as information blurring and loss of texture details. Experimental results show that DAGANFuse achieves state-of-the-art (SOTA) performance and is superior to existing methods in terms of fusion quality.

1. Introduction

Infrared sensors generate images by capturing the thermal radiation information of objects. Even under extreme conditions, harsh weather, and partial occlusion, they can effectively highlight significant targets. However, infrared images cannot provide sufficient environmental information such as texture details and environmental illumination. In contrast, visible light sensors generate images by capturing the light that interacts with objects, including reflection, diffusion, and transmission. This interaction results in rich texture detail information, making visible images more consistent with human visual perception. The fusion of infrared and visible light images not only solves the problems of blurred backgrounds and missing details of infrared sensors but also overcomes the dependence of visible light sensors on environmental conditions, constructing images with both thermal radiation sources and multispectral information. The complementary characteristics of infrared images and visible images have enabled the fusion technology of these two types of images to be widely applied in engineering technology and scientific research [1].
Traditional image fusion algorithms typically perform activity level measurements in the spatial or transform domain and manually design fusion rules to achieve image fusion. Traditional image fusion frameworks mainly include fusion frameworks based on multi-scale transformation [2], sparse representation [3], subspace [4], saliency [5], and variational models [6]. Although existing traditional image fusion algorithms can produce relatively satisfactory results in most cases, there are still some problems that hinder their further development. First, existing methods usually use the same transformation or representation to extract features from source images but do not consider the inherent feature differences of multi-modal images. Second, manually designed activity level measurements and fusion rules cannot adapt to complex fusion scenarios. In order to pursue better fusion performance, the design of activity level measurements and fusion rules becomes more and more complex, limiting the practical application of traditional methods [7].
Consequently, image fusion algorithms based on deep learning have emerged, highlighting their numerous advantages over traditional algorithms. Existing deep learning-based image fusion algorithms primarily aim to address three critical issues in image fusion: feature extraction, feature fusion, and image reconstruction. To tackle these key problems and improve upon the shortcomings of traditional algorithms, researchers have designed various deep learning networks to achieve higher efficiency and better performance. Depending on the network architecture adopted, these algorithms can be categorized into three types: image fusion frameworks based on autoencoders (AE), convolutional neural networks (CNN), and generative adversarial networks (GAN). In AE-based methods, networks pre-trained on large datasets are commonly used to address the issues of feature extraction and image reconstruction. However, the fusion strategies for integrating deep features extracted from different modal images are still manually designed, which restricts the performance of these methods [8]. Compared to AE-based methods, CNN-based methods achieve end-to-end feature extraction, feature fusion, and image reconstruction by designing loss functions and network structures. Nevertheless, these methods often employ a simple serial structure and fail to consider the inherent features of different modal images [9]. GAN-based methods model the image fusion problem as an adversarial game between the generator and the discriminator. The generator aims to blend the underlying distributions of the source images into the fused image, while the discriminator determines whether the input is a source image or a fused image. This process enables the generator to produce a fused image that contains more features of the source images. Although GAN-based methods can achieve better fusion results, problems have gradually emerged during research. Existing methods only perform feature fusion at a single layer and do not consider the interaction of information between different scales, which to some extent limits the fusion performance of their networks. To address the limitations of existing methods, we propose a cross-scale differential features attention generative adversarial network for infrared and visible light image fusion. This method is an end-to-end model based on the Wasserstein generative adversarial network (WGAN) [10] and the dual-discriminator conditional generative adversarial network (DDcGAN) [11]. Firstly, unlike existing simple fusion strategies, we designed a cross-modal attention module to measure the corresponding activity levels at different scales separately. Secondly, to enable the interaction of feature information across different scales, we designed a cross-scale attention decoder, which fuses feature information across scales sequentially from deep features to shallow features. We propose parallel differential-feature and fusion-feature paths for computing attention weights, with spatial and channel attention calculated in parallel on each path, so that the fused image contains more complementary information and the activity levels of different modal images are continuously optimized.
DAGANFuse effectively fuses the cross-scale features of multi-modal images; extracts more complementary information from features of different scales and modalities; highlights infrared target perception and visible light texture details; and yields fused images of higher quality, richer information, and greater clarity. The main contributions of this paper are summarized as follows:
  • We designed a cross-modal differential features attention module to measure the activity level at the same scale. The proposed module increases the information and details contained in the fused image by effectively integrating the intermediate features of different modal images.
  • We proposed a new cross-scale attentional decoder, which can enable cross-modal features of different scales to interact. Through the parallel path of differential features and fusion features, it is beneficial to extract more complementary information on the basis of maintaining detailed information.
  • We proposed an end-to-end dual discriminator adversarial fusion network based on WGAN [10]. The experimental outcomes demonstrate that the proposed method achieves the SOTA performance. Compared with other conventional fusion techniques, it exhibits superiority in both qualitative visual depiction and quantitative index assessment, thereby offering a more efficient and robust approach for multi-modal image fusion tasks.
The remainder of this paper is organized as follows. Section 2 provides a comprehensive review of deep learning-based infrared and visible image fusion work, based on the adopted deep learning network framework. Section 3 elaborates on the proposed fusion framework in detail. Section 4 presents the experimental design, results, and analysis. Finally, Section 5 concludes the paper.

2. Related Work

Existing deep learning-based image fusion algorithms primarily adopt three architectural paradigms: AE-based frameworks, CNN-based frameworks and GAN-based frameworks. While these paradigms may overlap, this categorization reflects their dominant design principles and training objectives.

2.1. AE-Based Fusion Methods

Addressing the limitations and challenges inherent in traditional image fusion methods, researchers have increasingly embraced neural network-based approaches, which are highly regarded for their superior nonlinear fitting capabilities [12]. These advanced techniques have significantly enhanced the quality and efficiency of image fusion. Some researchers have proposed a variety of infrared and visible light image fusion methods utilizing autoencoders (AE). During the training phase, the autoencoder network is pre-trained using a large-scale dataset. The encoder in the network is specifically designed for feature extraction, while the decoder is responsible for image reconstruction. In the testing phase, the features extracted by the encoder are fused through manually designed fusion strategies to produce the final fused image. For instance, Li et al. [13] introduced DenseFuse, where dense connections were proposed in the encoder network to produce a powerful ability for feature extraction. Deep learning methods based on filters or optimization in the decomposition stage have certain limitations. To solve this problem, Zhao et al. designed the first deep image decomposition model under the AE network model, named DIDFuse [14]. Both the fusion and decomposition tasks are completed by the AE network. To alleviate the domain difference problem, Zhao et al. proposed a self-supervised IVIF feature adaptive network framework: SFA-Fuse [15]. This network includes a self-supervised feature adaptive network and an enhanced network coordinated by edge details and contrast-based loss functions.
To further improve the fusion performance, some researchers have introduced attention mechanisms into autoencoders. Li et al. [8] proposed the NestFuse fusion method, introducing a nested connection architecture and a spatial/channel attention model into the AE network. Subsequently, Jiang et al. proposed a symmetric encoder–decoder network with residual blocks, named SEDRFuse [16]. They designed a feature fusion strategy based on attention maps, thereby preserving more information from different modality images. Wang et al. [17] proposed a fusion network based on dense Res2net and dual non-local attention models. Thereafter, they introduced UNFusion [18]. Dense skip connections are introduced into the AE network to extract the feature correlation of the intermediate layer. The problem of insufficient fine features and coarse-scale features is solved by designing a multi-scale network structure. The Lp normalized attention model is used as the fusion strategy. These methods propose attention-based fusion strategies and use these attention models to refine and integrate features.

2.2. CNN-Based Fusion Methods

Liu et al. [19] proposed the first CNN-based IVIF method. This method uses Siamese CNN to obtain the weight map of infrared and visible light pixel activity information; uses image pyramids to fuse images at multiple scales; and uses a strategy based on local similarity to adjust the decomposition coefficients of the fusion mode automatically. Inspired by this, researchers have successively proposed a variety of CNN-based methods. Hou et al. proposed an end-to-end unsupervised CNN-based model called VIF-Net [20]. This method introduced a dual-channel dense network to achieve feature extraction and realized feature fusion by directly connecting deep features. Long et al. introduced an IVIF method—RXDNFuse [21]—using an aggregated residual dense network in the literature. This network combines the network structure advantages of ResNeXt [22] and DenseNet [23], effectively avoiding the limitations of manually designed fusion rules. Considering that lighting factors are important for the effects they have on image fusion quality, Tang et al. designed a light perception-based progressive image fusion network PIAFusion [24]. A light perception subnetwork is used to calculate light distribution and probability; in addition, a progressive feature extractor is developed, which contains a cross-modal differential perception fusion module and a mid-way fusion strategy to extract and fuse complementary information in multi-modal images adaptively. Tang et al. proposed a semantic-aware real-time infrared and visible image fusion network, SeAFusion [25]. This method is the first framework to integrate high-level vision tasks with image fusion. By introducing a semantic loss and a gradient residual dense block, SeAFusion significantly enhances the semantic information in the fused images, thereby improving visual quality and detail preservation while substantially boosting performance in high-level vision tasks. Additionally, it achieves efficient real-time processing. Xu et al. proposed an unsupervised end-to-end image fusion network, U2Fusion [26]. This method addresses the multi-modal image fusion problem by automatically evaluating the importance of corresponding source images to determine the adaptive degree of information retention.
By adopting various strategies to substitute for the ground truth label, these methods have greatly improved the quality of fused images. However, due to the absence of a standard ground truth label, the potential performance of CNNs in the image fusion task still cannot be fully realized.

2.3. GAN-Based Fusion Methods

In 2019, Ma et al. [27] first introduced GAN into the infrared and visible image fusion (IVIF) task with FusionGAN. Although FusionGAN achieved certain results, there are still some shortcomings: on the one hand, relying only on adversarial training to add detailed information may lead to the loss of some information; on the other hand, it ignores the edge information in infrared images, resulting in relatively blurred target edges in the fusion result. To solve these problems, Ma et al. proposed Detail-GAN [28], designing a detail loss and a target edge enhancement loss, respectively. However, these two methods failed to adequately consider the texture information of infrared images and the contrast information of visible light images, so the generated fused images resemble sharpened infrared images. Subsequently, Ma et al. proposed GANMcC [29], a fusion network whose discriminator imposes multiclassification constraints, thereby endowing the fused images with abundant texture information and significant contrast. Fu et al. proposed Perception-GAN [30], which uses dense blocks in the generator to connect the shallow layers and the detail-rich source images to the deeper layers. Le et al. proposed UIFGAN [31], an unsupervised continual-learning generative adversarial network that trains a single model with memory to handle multiple image fusion tasks.
A GAN with a single discriminator cannot simultaneously preserve infrared pixel intensity information and visible light texture detail well, and its training is not sufficiently stable. Therefore, Ma et al. proposed DDcGAN [11], which uses dual discriminators to increase the evaluation and feedback on the generated image, thereby enhancing training stability. The two discriminators each act on the generator, enabling the fused image to retain features from both modalities: it preserves the texture information of visible light while retaining the thermal radiation information of infrared images. Building on this, Li et al. introduced a multi-scale attention mechanism into the GAN, proposing AttentionFGAN [32]. Meanwhile, Zhou et al. proposed SDDGAN [33], a semantic-supervised dual-discriminator generative adversarial network that guides the fusion process through an information quantity discriminator block.
Distinguished from the methods previously mentioned, our DAGANFuse calculates attention weights in both the spatial and channel dimensions on two parallel paths of differential features and fused features, respectively, to measure the activity level of the source images at the same scale. It then effectively integrates cross-scale features through a cross-scale attention mechanism, thereby obtaining high-quality infrared and visible light fused images.

3. Methods

3.1. The Architecture of the Fusion Network

The principle of our DAGANFuse is presented in Figure 1, which includes a generator and dual discriminators.

3.1.1. Generator Architecture

In the feature extraction phase, two encoders process the infrared and visible light images, respectively. Both encoders contain four multi-scale convolution blocks (MSCB). Each MSCB has two convolution layers with a 3 × 3 kernel and strides of 1 and 2. The filter banks at scale $l$ are set to $16 \times l$. When the source images are fed into these blocks, four-scale features of each source image are obtained, denoted $\Phi_{ir}^{l}$ and $\Phi_{vis}^{l}$, $l = 1, 2, 3, 4$. In the feature reconstruction phase, a cross-scale attentional decoder (CSAD) is built, as shown in Figure 2. Features of different scales and modalities are fed into the decoder, and the proposed cross-modal differential features attention module (CDAM) reconstructs features from deep to shallow scales. The upsampled output of each CDAM, combined with the next-layer features, becomes the input of the next CDAM.
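To make this configuration concrete, the following PyTorch sketch builds one such encoder, interpreting "$16 \times l$" as $16 \cdot l$ filters at scale $l$ (i.e., 16, 32, 48, and 64 channels). The module names, padding, and the activation inside each MSCB are our assumptions, not the authors' released implementation.
```python
import torch
import torch.nn as nn

class MSCB(nn.Module):
    """Multi-scale convolution block: a stride-1 and a stride-2 3x3 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),   # activation choice is an assumption
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class Encoder(nn.Module):
    """Four MSCBs; scale l uses 16*l filters, yielding features Phi^l for l = 1..4."""
    def __init__(self, in_ch=1):
        super().__init__()
        channels = [16 * l for l in range(1, 5)]   # 16, 32, 48, 64
        blocks, prev = [], in_ch
        for c in channels:
            blocks.append(MSCB(prev, c))
            prev = c
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)   # Phi^1 ... Phi^4, each at half the previous resolution
        return feats

# One encoder per modality; the weights are not shared.
enc_ir, enc_vis = Encoder(), Encoder()
phi_ir = enc_ir(torch.randn(1, 1, 256, 256))
phi_vis = enc_vis(torch.randn(1, 1, 256, 256))
```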

3.1.2. Cross-Modal Differential Features Attention Module (CDAM)

To integrate the inherent features of different modal images, we propose a cross-modal differential features attention module. The architecture of the CDAM module is shown in Figure 3.
Different from CBAM [34], we use two convolutional layers to replace the multi-layer perceptron (MLP) in CBAM. On this basis, we propose an attention framework that uses a differential features path module (DFPM) and a fusion features path module (FFPM) in parallel. On the two parallel paths, the initial fusion feature $\Phi_f^{l} \in \mathbb{R}^{H \times W \times C}$ (where $H$ and $W$ denote the height and width of the source image and $C$ the number of channels) and the differential feature $\Phi_d^{l}$ are used to compute attention maps in both the spatial and channel dimensions. The results of the two paths are then combined, and the resulting attention weights measure the activity levels of the different modalities at the same scale. In DFPM, the differential feature $\Phi_d^{l}$ is the concatenation of $\Phi_{d1}^{l}$ and $\Phi_{d2}^{l}$, which are defined as follows:
$$\Phi_{d1}^{l} = \Phi_{ir}^{l} - \Phi_{vis}^{l}$$
$$\Phi_{d2}^{l} = \Phi_{vis}^{l} - \Phi_{ir}^{l}$$
The generated intermediate fused features, termed $\Phi_f^{l-1}$, serve as the initial fusion features of the next CDAM. In particular, at the fourth (deepest) scale, the initial fusion feature is the concatenation of $\Phi_{ir}^{l}$ and $\Phi_{vis}^{l}$.
In the spatial attention path, we utilize maximum and average pooling operations to obtain the initial spatial attention matrices. We then concatenate them and feed the result into a convolution layer to generate the spatial attention matrices $\Phi_f^{SA,l} \in \mathbb{R}^{1 \times H \times W}$ and $\Phi_d^{SA,l} \in \mathbb{R}^{1 \times H \times W}$, which are formulated as follows:
$$\Phi_f^{SA,l}(i,j) = \mathrm{Conv}\big(\big[\mathrm{MP}(\Phi_f^{l}(c,i,j)),\ \mathrm{AP}(\Phi_f^{l}(c,i,j))\big]\big)$$
$$\Phi_d^{SA,l}(i,j) = \mathrm{Conv}\big(\big[\mathrm{MP}(\Phi_d^{l}(c,i,j)),\ \mathrm{AP}(\Phi_d^{l}(c,i,j))\big]\big)$$
where $\mathrm{MP}(\cdot)$ and $\mathrm{AP}(\cdot)$ denote the global maximum and average pooling operations, and $\mathrm{Conv}$ represents the convolution.
Meanwhile, in the channel attention path, we first apply the maximum and average pooling operations. The resulting initial channel attention vectors then pass through two convolution layers with a PReLU activation, and the two branches are concatenated and fed into a convolutional layer to obtain the channel attention vectors $\Phi_f^{CA,l}(c) \in \mathbb{R}^{C \times 1 \times 1}$ and $\Phi_d^{CA,l}(c) \in \mathbb{R}^{C \times 1 \times 1}$, which are formulated as follows:
$$\Phi_f^{CA,l}(c) = \mathrm{Conv}\big(\big[\mathrm{Conv}(\mathrm{PReLU}(\mathrm{Conv}(\mathrm{MP}(\Phi_f^{l}(c,i,j))))),\ \mathrm{Conv}(\mathrm{PReLU}(\mathrm{Conv}(\mathrm{AP}(\Phi_f^{l}(c,i,j)))))\big]\big)$$
$$\Phi_d^{CA,l}(c) = \mathrm{Conv}\big(\big[\mathrm{Conv}(\mathrm{PReLU}(\mathrm{Conv}(\mathrm{MP}(\Phi_d^{l}(c,i,j))))),\ \mathrm{Conv}(\mathrm{PReLU}(\mathrm{Conv}(\mathrm{AP}(\Phi_d^{l}(c,i,j)))))\big]\big)$$
where $\mathrm{PReLU}$ represents the PReLU activation operation.
Then, we element-wise multiply the two pairs of spatial attention matrices and channel attention vectors, and normalize the results with the sigmoid activation function to obtain the corresponding attention weights. The attention weights $\alpha_f^{ir,l}$ / $\alpha_d^{ir,l}$ for the infrared path and $\alpha_f^{vis,l}$ / $\alpha_d^{vis,l}$ for the visible path are given by the following formulas:
$$\alpha_f^{ir,l}(c,i,j) = \sigma\big(\Phi_f^{CA,l}(c) \times \Phi_f^{SA,l}(i,j)\big)$$
$$\alpha_d^{ir,l}(c,i,j) = \sigma\big(\Phi_d^{CA,l}(c) \times \Phi_d^{SA,l}(i,j)\big)$$
$$\alpha_f^{vis,l}(c,i,j) = 1 - \alpha_f^{ir,l}(c,i,j) = 1 - \sigma\big(\Phi_f^{CA,l}(c) \times \Phi_f^{SA,l}(i,j)\big)$$
$$\alpha_d^{vis,l}(c,i,j) = 1 - \alpha_d^{ir,l}(c,i,j) = 1 - \sigma\big(\Phi_d^{CA,l}(c) \times \Phi_d^{SA,l}(i,j)\big)$$
where $\sigma(\cdot)$ denotes the sigmoid activation function.
Finally, the intermediate fused features $\Phi_f^{l-1} \in \mathbb{R}^{C \times H \times W}$ are given as follows:
$$\Phi_f^{l-1}(c,i,j) = \big[\alpha_f^{ir,l} \times \Phi_{ir}^{l} + \alpha_d^{ir,l} \times \Phi_{d1}^{l},\; \alpha_f^{vis,l} \times \Phi_{vis}^{l} + \alpha_d^{vis,l} \times \Phi_{d2}^{l}\big]$$
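As an illustration, the PyTorch sketch below follows these equations: the differential and fusion features are processed by two parallel paths (DFPM and FFPM), each computing spatial and channel attention, and the visible weights are the complements of the infrared weights. The kernel sizes, the channel-reduction ratio, and the 1 × 1 convolution that maps the pooled vectors to the feature channel count are assumptions rather than the authors' implementation.
```python
import torch
import torch.nn as nn

class PathAttention(nn.Module):
    """Spatial + channel attention for one path (FFPM or DFPM), as in the equations above."""
    def __init__(self, in_ch, out_ch, reduction=4):
        super().__init__()
        # Spatial attention: channel-wise max/avg maps -> convolution -> 1 x H x W map
        self.spatial = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # Channel attention: pooled vector -> Conv-PReLU-Conv per branch, then a fusing Conv
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch // reduction, kernel_size=1),
                nn.PReLU(),
                nn.Conv2d(in_ch // reduction, in_ch, kernel_size=1),
            )
        self.mp_branch, self.ap_branch = branch(), branch()
        self.channel_fuse = nn.Conv2d(2 * in_ch, out_ch, kernel_size=1)  # out_ch mapping is an assumption

    def forward(self, x):
        sa = self.spatial(torch.cat([x.max(dim=1, keepdim=True).values,
                                     x.mean(dim=1, keepdim=True)], dim=1))
        mp = self.mp_branch(torch.amax(x, dim=(2, 3), keepdim=True))
        ap = self.ap_branch(torch.mean(x, dim=(2, 3), keepdim=True))
        ca = self.channel_fuse(torch.cat([mp, ap], dim=1))
        return torch.sigmoid(ca * sa)   # infrared attention weight, broadcast to (B, out_ch, H, W)

class CDAM(nn.Module):
    """Cross-modal differential features attention module (sketch)."""
    def __init__(self, feat_ch, fused_ch):
        super().__init__()
        self.ffpm = PathAttention(fused_ch, feat_ch)      # fusion features path
        self.dfpm = PathAttention(2 * feat_ch, feat_ch)   # differential features path

    def forward(self, phi_ir, phi_vis, phi_f):
        d1, d2 = phi_ir - phi_vis, phi_vis - phi_ir       # differential features Phi_d1, Phi_d2
        phi_d = torch.cat([d1, d2], dim=1)
        a_f_ir = self.ffpm(phi_f)                         # alpha_f^{ir,l}
        a_d_ir = self.dfpm(phi_d)                         # alpha_d^{ir,l}
        a_f_vis, a_d_vis = 1 - a_f_ir, 1 - a_d_ir         # complementary visible weights
        return torch.cat([a_f_ir * phi_ir + a_d_ir * d1,
                          a_f_vis * phi_vis + a_d_vis * d2], dim=1)   # Phi_f^{l-1}
```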

3.1.3. Discriminator Architecture

In the discriminator, we adopt two discriminators, $D_{ir}$ and $D_{vis}$, to enable a more balanced adversarial game with the generator. The architecture of the discriminators is shown in Figure 4. The first discriminator, $D_{ir}$, distinguishes the fused results from the infrared images, and the second discriminator, $D_{vis}$, distinguishes the fused results from the visible images. The two discriminators have the same structure, but their parameters are not shared. Each consists of four convolutional layers with a 3 × 3 kernel and a stride of 2; the output channels are 16, 32, 64, and 128 in sequence. The activation function of the last layer is Tanh, and the remaining three layers use LeakyReLU [35].
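A minimal sketch of one such discriminator is given below, assuming the fully connected layer shown in Figure 4 maps the final feature map to a single critic score (no sigmoid, since the adversarial game follows WGAN); the FC size and input resolution are assumptions.
```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """One of the two critics (D_ir / D_vis): same structure, weights not shared."""
    def __init__(self, in_ch=1, img_size=256):
        super().__init__()
        chs = [16, 32, 64, 128]
        layers, prev = [], in_ch
        for i, c in enumerate(chs):
            layers.append(nn.Conv2d(prev, c, kernel_size=3, stride=2, padding=1))
            # LeakyReLU after the first three convolutions, Tanh after the last one
            layers.append(nn.Tanh() if i == len(chs) - 1 else nn.LeakyReLU(0.2))
            prev = c
        self.features = nn.Sequential(*layers)
        # Fully connected layer producing a scalar score; its exact size is an assumption.
        self.fc = nn.Linear(128 * (img_size // 16) ** 2, 1)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

d_ir, d_vis = Discriminator(), Discriminator()
score = d_ir(torch.randn(2, 1, 256, 256))   # shape (2, 1)
```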

3.1.4. Loss Function

Because the adversarial mechanism alone cannot guarantee high-quality fused images, we add intensity and texture loss terms to the generator loss. The generator loss function can therefore be expressed as follows:
$$L_{generator} = L_{adversarial} + \lambda_1 L_{intensity} + \lambda_2 L_{texture}$$
where $L_{generator}$ denotes the total generator loss, and $L_{adversarial}$, $L_{intensity}$, and $L_{texture}$ represent the adversarial, intensity, and texture loss functions, respectively. The parameters $\lambda_1$ and $\lambda_2$ control the balance among the three terms.
To make generated images contain more intensity information from infrared and visible light images, we define the intensity loss function as follows:
$$L_{intensity} = \frac{1}{HW}\Big(\big\| I_f - I_{ir} \big\|_2 + \lambda_3 \big\| I_f - I_{vis} \big\|_1\Big)$$
where $I_f$, $I_{ir}$, and $I_{vis}$ represent the fused image, the infrared image, and the visible image, respectively; $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the $\ell_1$-norm and $\ell_2$-norm; and $\lambda_3$ is a weighting coefficient.
To make the fusion image retain more texture information, we introduce texture loss to assist intensity loss. The texture loss is shown as follows:
$$L_{texture} = \frac{1}{HW}\big\| \nabla I_f - \max\!\big(\nabla I_{ir},\, \nabla I_{vis}\big) \big\|_1$$
where ∇ represents the gradient operator.
In DAGANFuse, the adversarial loss consists of two parts: the adversarial losses between the generator and each of the two discriminators, $D_{ir}$ and $D_{vis}$. The adversarial loss is formulated as follows:
$$L_{adversarial} = -\,\mathbb{E}\big[D_{ir}(I_f)\big] - \mathbb{E}\big[D_{vis}(I_f)\big]$$
In addition, we define the loss functions of the two discriminators based on the Wasserstein distance:
$$L_{D_{ir}/D_{vis}} = \mathbb{E}\big[D_{ir/vis}(I_f)\big] - \mathbb{E}\big[D_{ir/vis}(I_{ir/vis})\big] + \lambda_4\, \mathbb{E}\Big[\big(\big\|\nabla_{\tilde{I}_{ir/vis}} D_{ir/vis}(\tilde{I}_{ir/vis})\big\|_2 - 1\big)^2\Big]$$
In this formula, the first part estimates the Wasserstein distance, the second part is the gradient penalty computed on interpolated samples $\tilde{I}_{ir/vis}$, and $\lambda_4$ is the regularization parameter.
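The loss terms above translate directly into PyTorch. The sketch below mirrors the formulas; the Sobel-based gradient operator and the random interpolation used to form the gradient-penalty samples are common WGAN-GP choices and are assumptions here, not details taken from the paper.
```python
import torch
import torch.nn.functional as F

def gradient(img):
    """Image gradient via Sobel filters (the choice of operator is an assumption)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    return F.conv2d(img, kx, padding=1).abs() + F.conv2d(img, ky, padding=1).abs()

def intensity_loss(i_f, i_ir, i_vis, lam3=1.0):
    # L_intensity = (1/HW) * (||I_f - I_ir||_2 + lam3 * ||I_f - I_vis||_1)
    hw = i_f.shape[-2] * i_f.shape[-1]
    return (torch.norm(i_f - i_ir, p=2) + lam3 * torch.norm(i_f - i_vis, p=1)) / hw

def texture_loss(i_f, i_ir, i_vis):
    # L_texture = (1/HW) * ||grad(I_f) - max(grad(I_ir), grad(I_vis))||_1
    hw = i_f.shape[-2] * i_f.shape[-1]
    return torch.norm(gradient(i_f) - torch.maximum(gradient(i_ir), gradient(i_vis)), p=1) / hw

def generator_loss(d_ir, d_vis, i_f, i_ir, i_vis, lam1=1.0, lam2=10.0):
    # L_generator = L_adversarial + lam1 * L_intensity + lam2 * L_texture
    l_adv = -d_ir(i_f).mean() - d_vis(i_f).mean()
    return l_adv + lam1 * intensity_loss(i_f, i_ir, i_vis) + lam2 * texture_loss(i_f, i_ir, i_vis)

def critic_loss(d, i_f, i_real, lam4=10.0):
    # Wasserstein distance estimate plus gradient penalty on interpolated samples
    eps = torch.rand(i_real.size(0), 1, 1, 1, device=i_real.device)
    i_hat = (eps * i_real + (1 - eps) * i_f.detach()).requires_grad_(True)
    grad = torch.autograd.grad(d(i_hat).sum(), i_hat, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return d(i_f.detach()).mean() - d(i_real).mean() + lam4 * gp
```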

4. Experiments and Discussion

In this section, we first briefly introduce the experimental details. Next, we design ablation experiments to analyze the impact of each part of the proposed attention module. Finally, we evaluate and analyze the advantages of our method through comparative experiments and multiple performance indicators. Our network is implemented in PyTorch 1.13.1 (within the PyCharm environment) and runs on an NVIDIA GTX 4060 GPU.

4.1. Experimental Settings

During data preprocessing, to create a dataset capable of training a good model, 25 pairs of infrared and visible images are selected from the TNO Image Fusion Dataset [36] and cropped into 256 × 256 sub-images with a sliding step size of 12, yielding 16,986 image pairs. This practice can split some local features into smaller ones, thereby enhancing detail capture, improving robustness to variations, and increasing data diversity for better model training.
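A small sketch of the sliding-window cropping described above (256 × 256 patches with a 12-pixel step); image loading is omitted and the placeholder array stands in for one source image. Applying the same offsets to an aligned IR/VIS pair keeps the sub-images registered.
```python
import numpy as np

def crop_patches(img, size=256, step=12):
    """Slide a size x size window over a grayscale image with the given step."""
    h, w = img.shape[:2]
    return [img[y:y + size, x:x + size]
            for y in range(0, h - size + 1, step)
            for x in range(0, w - size + 1, step)]

img = np.random.randint(0, 256, (480, 640), dtype=np.uint8)   # placeholder source image
patches = crop_patches(img)
```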
In the training phase, we train the discriminators twice and then the generator once in each iteration, using the Adam optimizer. The number of epochs and the batch size are set to 14 and 4, respectively. The learning rates for the generator and the discriminators are set to $1 \times 10^{-5}$ and $1 \times 10^{-4}$, respectively. The weighting parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are set to 1, 10, 1, and 10, respectively.
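This schedule can be sketched as follows: two critic updates followed by one generator update per batch, with Adam and the learning rates above. The generator module, data loader, and loss functions refer to the hypothetical sketches given earlier, so this is only an outline of the procedure rather than the authors' training code.
```python
import torch

# generator combines the encoders and the CSAD/CDAM sketches above (hypothetical interface);
# d_ir, d_vis are the two critics; loader yields aligned (ir, vis) batches of size 4.
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-5)
opt_d = torch.optim.Adam(list(d_ir.parameters()) + list(d_vis.parameters()), lr=1e-4)

for epoch in range(14):
    for ir, vis in loader:
        fused = generator(ir, vis)
        # two discriminator (critic) steps per generator step
        for _ in range(2):
            opt_d.zero_grad()
            loss_d = critic_loss(d_ir, fused, ir) + critic_loss(d_vis, fused, vis)
            loss_d.backward()
            opt_d.step()
        # one generator step
        opt_g.zero_grad()
        loss_g = generator_loss(d_ir, d_vis, fused, ir, vis, lam1=1.0, lam2=10.0)
        loss_g.backward()
        opt_g.step()
```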
In the testing phase, we use the TNO Image Fusion Dataset [36] and the Roadscene Dataset [37]. The TNO Image Fusion Dataset contains multispectral nighttime imagery of different military-relevant scenarios. The Roadscene Dataset contains 221 aligned visible and infrared image pairs with rich scenes such as roads, vehicles, and pedestrians; these images are highly representative of FLIR video. Moreover, seven representative methods—FusionGAN [27], DDcGAN [11], MDLatLRR [38], DenseFuse [13], Res2Fusion [17], MFEIF [39], and SDDGAN [33]—are selected for comparison with DAGANFuse.
To comprehensively assess the quality of fused results, we conduct evaluations in both qualitative and quantitative dimensions. For quantitative verification, we use entropy (EN) [40], standard deviation (SD) [41], average gradient (AG), mutual information (MI), visual information fidelity (VIF) [42], and spatial frequency (SF) [43].
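For reference, straightforward NumPy implementations of three of these metrics (EN, SD, and SF) are sketched below; histogram binning and normalization conventions vary across the literature, so these are illustrative rather than the exact evaluation code used in the paper.
```python
import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy of the grayscale histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def standard_deviation(img):
    """SD: spread of pixel intensities around the mean."""
    return float(np.std(img))

def spatial_frequency(img):
    """SF: combined row and column intensity variation."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))   # row frequency
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))   # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))

fused = np.random.randint(0, 256, (256, 256), dtype=np.uint8)   # placeholder fused image
print(entropy(fused), standard_deviation(fused), spatial_frequency(fused))
```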

4.2. Ablation Study

In this section, to verify the effectiveness of our proposed attention mechanism, we conduct ablation experiments on four comparable models and analyze the results. The four ablation settings are: replacing CDAM with two convolutional layers, termed Without_CDAM; replacing CDAM with CBAM, termed CBAM; removing the differential features path, termed Without_DFPM; and keeping only the differential features path, termed Without_FFPM.
  • Qualitative analysis: As can be seen from the results in Figure 5, in the absence of CDAM, the Without_CDAM model not only reduces the brightness of infrared targets but also has the problem of blurred contours. In contrast, adding the CBAM model enables the fused image to retain some infrared target information but loses some texture details. In addition, the results of the Without_DFPM model, the Without_FFPM model, and our model are similar, with no obvious differences. All of them better retain the brightness information of infrared targets and the texture information of visible images.
  • Quantitative analysis: The quantitative analysis results are shown in Table 1. The red bold and underlined values indicate the optimal and sub-optimal values, respectively. Compared with the other models, our model obtains the optimal values under the evaluation indicators of entropy (EN), standard deviation (SD), average gradient (AG), and spatial frequency (SF), and obtains the sub-optimal values under the evaluation indicators of mutual information (MI) and visual information fidelity (VIF). In summary, every component of the proposed differential features attention module is effective.

4.3. Fusion Results on TNO Dataset

To test the performance of our DAGANFuse, we first employ the TNO Image Fusion Dataset. We select four pairs of typical images for intuitive qualitative comparison, as shown in Figure 6. The infrared targets and visible details are marked with red and blue frames, respectively. Judging from the fusion results, FusionGAN and DDcGAN inherit more infrared target information but suffer from blurred contours and a lack of texture information, while MDLatLRR and DenseFuse retain more texture information but lose some infrared target information. In contrast, Res2Fusion, MFEIF, and SDDGAN are superior to the previous four methods in comprehensive performance. In general, from the comparison of the results on the four typical image pairs, our method retains more visible light texture information and infrared target information.
To objectively evaluate the quality of fused images, we also conduct quantitative analysis. The index values are shown in Table 2. The best values are indicated in red and bold font, and the sub-optimal values are indicated by underlining. In Table 2, compared with other state-of-the-art fusion methods, our proposed method (DAGANFuse) achieves four best values (EN, AG, VIF, and SF) and two sub-optimal values (SD and MI). This means that DAGANFuse can retain more complementary information from the pixel level and feature level. In conclusion, our proposed DAGANFuse has a good fusion effect in both quantitative index evaluation and qualitative visual description.

4.4. Fusion Results on Roadscene Dataset

To further test the effectiveness of our DAGANFuse, we selected 40 pairs of images from the Roadscene dataset for experimental analysis.
We first selected two pairs of typical images, FLIR_07210 and FLIR_06570, for qualitative analysis. The fusion results are shown in Figure 7 and Figure 8. Compared with other methods, DAGANFuse has obvious advantages. First, for prominent infrared targets, such as the streetlight marked with the red box in Figure 7, DAGANFuse maintains their original shapes while retaining the original brightness as much as possible. Moreover, for visible texture information, such as the "STOP SIGN" and the house locally magnified in the blue boxes, DAGANFuse preserves clearer information than the other methods.
Meanwhile, we conducted a quantitative comparison between DAGANFuse and seven other methods on the Roadscene dataset. The index values are shown in Table 3. Among the six evaluation indicators, DAGANFuse has significant advantages in five of them, namely entropy (EN), average gradient (AG), mutual information (MI), visual information fidelity (VIF), and spatial frequency (SF). As for the standard deviation (SD) indicator, it ranks fourth, behind DDcGAN, SDDGAN, and MDLatLRR. DAGANFuse achieves the optimal values in three indicators, AG, VIF, and SF, which indicates that the fusion results generated by the proposed model DAGANFuse not only contain richer gradient and texture information but also have better visual effects. The optimal value obtained in the MI indicator shows that DAGANFuse can retain more modality features in the fused image. It also means that the cross-modal differential features attention module proposed in DAGANFuse can effectively distinguish the intrinsic content of different modal images and integrate them effectively into the fusion results. Although DAGANFuse does not achieve the highest value in the SD indicator, from the overall evaluation of the six indicators and the results of qualitative comparison, DAGANFuse has the best fusion effect.

5. Conclusions

In this paper, we proposed DAGANFuse, an infrared and visible image adversarial fusion network based on a cross-scale attention mechanism. Different from existing end-to-end fusion methods, we designed a cross-modal differential features attention module so that fused images contain more complementary information from the infrared and visible images while combining the intensity and texture information of the source images. A parallel path of differential features and fusion features is adopted to calculate the attention weights, and spatial and channel attention weights are computed in parallel on the two paths. Moreover, a cross-scale decoder framework is constructed to let feature information of different scales interact and to continuously optimize the activity levels in an iterative manner. In terms of the discriminator, a dual discriminator is used to establish a more balanced adversarial game with the generator. In addition, the effectiveness of the key parts of DAGANFuse is verified through ablation experiments, and comparisons with seven state-of-the-art methods are made on the public TNO and Roadscene datasets. Qualitative and quantitative analyses verify that DAGANFuse has significant advantages in fusion performance.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; software, Y.W.; validation, Y.W.; formal analysis, W.L.; investigation, Y.W.; resources, W.L.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W.; visualization, Y.W.; supervision, W.L.; project administration, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SOTA   State-of-the-art
IVIF   Infrared and visible image fusion
MSCB   Multi-scale convolution block
CSAD   Cross-scale attention decoder
CBAM   Convolutional block attention module
CDAM   Cross-modal differential features attention module
DFPM   Differential features path module
FFPM   Fusion features path module

References

  1. Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
  2. Liu, Y.; Jin, J.; Wang, Q.; Shen, Y.; Dong, X. Region level based multi-focus image fusion using quaternion wavelet and normalized cut. Signal Process. 2014, 97, 9–30. [Google Scholar] [CrossRef]
  3. Liu, Y.; Chen, X.; Ward, R.K.; Wang, Z.J. Image Fusion with Convolutional Sparse Representation. IEEE Signal Process. Lett. 2016, 23, 1882–1886. [Google Scholar] [CrossRef]
  4. Cvejic, N.; Bull, D.; Canagarajah, N. Region-Based Multimodal Image Fusion Using ICA Bases. IEEE Sens. J. 2007, 7, 743–751. [Google Scholar] [CrossRef]
  5. Ma, J.; Zhou, Z.; Wang, B.; Zong, H. Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Phys. Technol. 2017, 82, 8–17. [Google Scholar] [CrossRef]
  6. Ma, J.; Chen, C.; Li, C.; Huang, J. Infrared and visible image fusion via gradient transfer and total variation minimization. Inf. Fusion 2016, 31, 100–109. [Google Scholar] [CrossRef]
  7. Li, S. Pixel-level image fusion: A survey of the state of the art. Inf. Fusion 2017, 33, 100–112. [Google Scholar] [CrossRef]
  8. Li, H.; Wu, X.-J.; Durrani, T. NestFuse: An Infrared and Visible Image Fusion Architecture Based on Nest Connection and Spatial/Channel Attention Models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656. [Google Scholar] [CrossRef]
  9. Ma, J.; Tang, L.; Xu, M.; Zhang, H.; Xiao, G. STDFusionNet: An Infrared and Visible Image Fusion Network Based on Salient Target Detection. IEEE Trans. Instrum. Meas. 2021, 70, 5009513. [Google Scholar] [CrossRef]
  10. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 214–223. [Google Scholar]
  11. Ma, J.; Xu, H.; Jiang, J.; Mei, X.; Zhang, X.-P. DDcGAN: A Dual-Discriminator Conditional Generative Adversarial Network for Multi-Resolution Image Fusion. IEEE Trans. Image Process. 2020, 29, 4980–4995. [Google Scholar] [CrossRef]
  12. Liu, Y.; Wang, L.; Li, H.; Chen, X. Multi-focus image fusion with deep residual learning and focus property detection. Inf. Fusion 2022, 86–87, 1–16. [Google Scholar]
  13. Li, H.; Wu, X.-J. DenseFuse: A Fusion Approach to Infrared and Visible Images. IEEE Trans. Image Process. 2019, 28, 2614–2623. [Google Scholar] [CrossRef]
  14. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Li, P.; Zhang, J. DIDFuse: Deep Image Decomposition for Infrared and Visible Image Fusion. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Yokohama, Japan, 11–17 July 2020; pp. 970–976. [Google Scholar]
  15. Zhao, F.; Zhao, W.; Yao, L.; Liu, Y. Self-supervised feature adaption for infrared and visible image fusion. Inf. Fusion 2021, 76, 189–203. [Google Scholar]
  16. Jian, L.; Yang, X.; Liu, Z.; Jeon, G.; Gao, M.; Chisholm, D. SEDRFuse: A Symmetric Encoder–Decoder with Residual Block Network for Infrared and Visible Image Fusion. IEEE Trans. Instrum. Meas. 2021, 70, 5002215. [Google Scholar] [CrossRef]
  17. Wang, Z.; Wu, Y.; Wang, J.; Xu, J.; Shao, W. Res2Fusion: Infrared and Visible Image Fusion Based on Dense Res2net and Double Nonlocal Attention Models. IEEE Trans. Instrum. Meas. 2022, 71, 5005012. [Google Scholar] [CrossRef]
  18. Wang, Z.; Wang, J.; Wu, Y.; Xu, J.; Zhang, X. UNFusion: A Unified Multi-Scale Densely Connected Network for Infrared and Visible Image Fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 3360–3374. [Google Scholar] [CrossRef]
  19. Liu, Y.; Chen, X.; Cheng, J.; Peng, H.; Wang, Z. Infrared and visible image fusion with convolutional neural networks. Int. J. Wavelets Multiresolut Inf. Process. 2018, 16, 1850018. [Google Scholar] [CrossRef]
  20. Hou, R.; Zhou, D.; Nie, R.; Liu, D.; Xiong, L.; Guo, Y.; Yu, C. VIF-Net: An Unsupervised Framework for Infrared and Visible Image Fusion. IEEE Trans. Comput. Imaging 2020, 6, 640–651. [Google Scholar] [CrossRef]
  21. Long, Y.; Jia, H.; Zhong, Y.; Jiang, Y.; Jia, Y. RXDNFuse: A aggregated residual dense network for infrared and visible image fusion. Inf. Fusion 2021, 69, 128–141. [Google Scholar] [CrossRef]
  22. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5987–5995. [Google Scholar]
  23. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2261–2269. [Google Scholar]
  24. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83–84, 79–92. [Google Scholar]
  25. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42. [Google Scholar] [CrossRef]
  26. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 502–518. [Google Scholar] [CrossRef]
  27. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  28. Ma, J.; Liang, P.; Yu, W.; Chen, C.; Guo, X.; Wu, J.; Jiang, J. Infrared and visible image fusion via detail preserving adversarial learning. Inf. Fusion 2020, 54, 85–98. [Google Scholar] [CrossRef]
  29. Ma, J.; Zhang, H.; Shao, Z.; Liang, P.; Xu, H. GANMcC: A Generative Adversarial Network with Multiclassification Constraints for Infrared and Visible Image Fusion. IEEE Trans. Instrum. Meas. 2021, 70, 5005014. [Google Scholar] [CrossRef]
  30. Fu, Y.; Wu, X.-J.; Durrani, T. Image fusion based on generative adversarial network consistent with perception. Inf. Fusion 2021, 72, 110–125. [Google Scholar] [CrossRef]
  31. Le, Z.; Huang, J.; Xu, H.; Fan, F.; Ma, Y.; Mei, X.; Ma, J. UIFGAN: An unsupervised continual-learning generative adversarial network for unified image fusion. Inf. Fusion 2022, 88, 305–318. [Google Scholar] [CrossRef]
  32. Li, J.; Huo, H.; Li, C.; Wang, R.; Feng, Q. AttentionFGAN: Infrared and Visible Image Fusion Using Attention-Based Generative Adversarial Networks. IEEE Trans. Multimed. 2021, 23, 1383–1396. [Google Scholar] [CrossRef]
  33. Zhou, H.; Wu, W.; Zhang, Y.; Ma, J.; Ling, H. Semantic-Supervised Infrared and Visible Image Fusion Via a Dual-Discriminator Generative Adversarial Network. IEEE Trans. Multimed. 2023, 25, 635–648. [Google Scholar] [CrossRef]
  34. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  35. Xu, J.; Li, Z.; Du, B.; Zhang, M.; Liu, J. Reluplex made more practical: Leaky ReLU. In Proceedings of the IEEE Symposium on Computers and Communications, Rennes, France, 7–10 July 2020; pp. 1–7. [Google Scholar]
  36. TNO Image Fusion Dataset. Available online: https://figshare.com/articles/dataset/TNO_Image_Fusion_Dataset/1008029 (accessed on 12 August 2024).
  37. Roadscene Database. Available online: https://github.com/hanna-xu/RoadScene (accessed on 12 August 2024).
  38. Li, H.; Wu, X.-J.; Kittler, J. MDLatLRR: A Novel Decomposition Method for Infrared and Visible Image Fusion. IEEE Trans. Image Process. 2020, 29, 4733–4746. [Google Scholar] [CrossRef]
  39. Liu, J.; Fan, X. Learning a Deep Multi-Scale Feature Ensemble and an Edge-Attention Guidance for Image Fusion. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 105–119. [Google Scholar] [CrossRef]
  40. Roberts, J.W.; van Aardt, J.; Ahmed, F. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J. Appl. Remote Sens. 2008, 2, 023522. [Google Scholar]
  41. Rao, Y.-J. In-fibre Bragg grating sensors. Meas. Sci. Technol. 2008, 8, 355–375. [Google Scholar] [CrossRef]
  42. Han, Y. A new image fusion performance metric based on visual information fidelity. Inf. Fusion 2008, 14, 127–135. [Google Scholar] [CrossRef]
  43. Eskicioglu, A.M.; Fisher, P.S. Image quality measures and their performance. IEEE Trans. Commun. 1995, 8, 2959–2965. [Google Scholar] [CrossRef]
Figure 1. Architecture of the proposed DAGANFuse. Here, MSCB represents the multi-scale convolution block. CSAD represents the cross-scale attention decoder. D_IR and D_VIS are two discriminators.
Figure 2. Cross-scale attention decoder (CSAD).
Figure 3. Cross-modal differential features attention module (CDAM). Here, $\Phi_f^{l}$ is the initial fusion feature. $\Phi_{d1}^{l}$ ($\Phi_{d2}^{l}$) represents the difference between $\Phi_{ir}^{l}$ ($\Phi_{vis}^{l}$) and $\Phi_{vis}^{l}$ ($\Phi_{ir}^{l}$). Ⓒ denotes the concatenation operation, ⊗ denotes the multiplication operation, and ⊕ denotes the addition operation. The upper part is the fusion features path module (FFPM), and the lower part is the differential features path module (DFPM).
Figure 4. The architecture of the discriminators: 3 × 3, filter size; Conv, convolutional layer; FC, fully connected layer.
Figure 5. The results of ablation study. (a) Infrared image; (b) Visible image; (c) Without_CDAM; (d) CBAM; (e) Without_DFPM; (f) Without_FFPM; (g) Ours.
Figure 6. Qualitative comparison of our method with seven other state-of-the-art methods on four typical infrared and visible image pairs. From top to bottom: infrared images, visible images, the results of FusionGAN, DDcGAN, MDLatLRR, DenseFuse, MFEIF, Res2Fusion, SDDGAN, and our DAGANFuse.
Figure 7. The fusion results obtained by the compared fusion methods and the proposed method on Roadscene (FLIR_07210). These images, in turn, are infrared images, visible images, and the fusion results of FusionGAN, DDcGAN, MDLatLRR, DenseFuse, MFEIF, Res2Fusion, SDDGAN, and DAGANFuse.
Figure 8. The fusion results obtained by the compared fusion methods and the proposed method on Roadscene (FLIR_06570). These images, in turn, are infrared images, visible images, and the fusion results of FusionGAN, DDcGAN, MDLatLRR, DenseFuse, MFEIF, Res2Fusion, SDDGAN, and DAGANFuse.
Table 1. The objective results of ablation studies.
Model         EN      SD       AG      MI      VIF     SF
Without_CDAM  6.9969  37.1377  1.7484  1.0673  0.1998  3.1063
CBAM          7.0451  40.5360  2.1700  1.6596  0.4171  4.1848
Without_DFPM  7.1019  41.5828  3.9443  2.3943  0.5145  8.2949
Without_FFPM  7.0958  41.6206  3.9841  2.8534  0.5373  8.4275
Ours          7.1253  41.8938  4.0462  2.6519  0.5367  8.5184
Table 2. The average metrics values obtained by the existing fusion methods and the proposed network on TNO.
Method      EN      SD       AG      MI      VIF     SF
FusionGAN   6.4493  27.2618  3.0096  2.1065  0.2846  5.9125
DDcGAN      6.4783  41.6072  3.9393  1.2044  0.2432  8.1566
MDLatLRR    6.3292  24.0935  3.5845  1.9596  0.3669  7.0322
DenseFuse   6.2892  23.2344  3.1913  2.0259  0.3424  6.0231
MFEIF       6.6492  31.5080  3.4791  2.3943  0.3842  6.7707
Res2Fusion  6.9569  39.2702  4.0239  2.7047  0.4579  8.3721
SDDGAN      7.0352  45.8426  3.1325  1.3724  0.3987  6.4354
DAGANFuse   7.1253  41.8938  4.0462  2.6519  0.5367  8.5184
Table 3. The average metrics values obtained by the existing fusion methods and the proposed network on Roadscene.
Method      EN      SD       AG      MI      VIF     SF
FusionGAN   7.0087  37.1567  2.6824  1.8554  0.2914  5.8114
DDcGAN      7.5950  56.6306  3.5391  1.4670  0.1763  7.0693
MDLatLRR    7.3976  54.5890  5.3283  1.2881  0.3524  10.2970
DenseFuse   7.2343  42.0498  3.3448  1.7809  0.3531  6.8627
MFEIF       7.0695  38.8528  2.9874  2.1377  0.3787  6.3558
Res2Fusion  7.3308  46.6483  3.3744  2.3449  0.4073  7.1569
SDDGAN      7.5453  56.1106  3.2748  2.0076  0.3529  6.7498
DAGANFuse   7.8632  47.3390  6.4322  2.4970  0.4505  12.3760
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
