Article

Lightweight Infrared and Visible Image Fusion Based on Nested Connections and Res2Net

1 College of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650031, China
2 Yunnan Key Laboratory of Computer Science, Kunming University of Science and Technology, Kunming 650500, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4589; https://doi.org/10.3390/app14114589
Submission received: 18 March 2024 / Revised: 25 April 2024 / Accepted: 20 May 2024 / Published: 27 May 2024

Abstract
Image fusion is a pivotal image-processing technology designed to merge multiple images from various sensors or imaging modalities into a single composite image. This process enhances and extracts the information contained across the images, resulting in a final image that is more informative and of superior quality. This paper introduces a novel method for infrared and visible image fusion, utilizing nested connections and frequency-domain decomposition techniques to effectively solve the problem of lost image detail features. By incorporating depthwise separable convolution technology, the method reduces the computational complexity and model size, thereby increasing computational efficiency. A multi-scale residual fusion network, R2FN (Res2Net Fusion Network), has been designed to replace traditional manually designed fusion strategies, enabling the network to better preserve detail information in the image while improving the quality of the fused image. Moreover, a new loss function is proposed, which is aimed at enhancing important feature information while preserving more significant features. Experimental results on public datasets indicate that the method not only retains the detail information of visible-light images but also highlights the significant features of infrared images while maintaining a minimal number of parameters.

1. Introduction

Image fusion is a process whereby images from diverse sensors or modalities are amalgamated into a singular image. This crucial phase in image processing is designed to extract and enhance information from multiple sources, resulting in a composite image that is both more informative and of superior quality. Based on variations in imaging devices, image fusion tasks are primarily classified into three categories: multimodal image fusion, remote sensing image fusion, and digital photography image fusion [1]. Among these, the fusion of infrared and visible images, as a branch of multimodal image fusion, has been a topic of considerable interest. Visible images, which are captured following the imaging principles of the human eye, are rich in color and texture information, but they are susceptible to obstruction and cannot function in low-light or nighttime conditions. Infrared images emphasize the thermal distribution characteristics of targets and are suitable for low-light and adverse weather conditions, but they lack color and texture information. Therefore, the fusion of these two types of images can result in a composite image that possesses enriched information and enhanced visual perception, providing significant assistance in various application areas, such as target detection [2], medical diagnosis [3], and remote sensing localization [4]. Currently, infrared and visible image fusion methods are typically categorized into two types based on the representation learning techniques employed in their algorithms: traditional image fusion algorithms and those based on deep learning.
Most traditional fusion algorithms rely on signal processing techniques to accomplish fusion tasks, commonly employing multi-scale transformations or methods based on sparse or low-rank representations (SRs/LRRs). Among these, methods based on multi-scale transformations, such as the discrete wavelet transform [5], contourlet transform [6], and shearlet transform [7], are notable. Their strength lies in their ability to extract feature information in the frequency domain that is unattainable in the spatial domain, which helps enhance the performance of fusion algorithms. However, the efficacy of these algorithms is primarily contingent upon the multi-scale transformation operation, which complicates the task of identifying a suitable transformation applicable to diverse image types. On the other hand, the transformation between the spatial and frequency domains not only elevates computational complexity but also results in the loss of crucial image features.
Methods based on sparse/low-rank representations (SRs/LRRs), such as SR-HOG [8], DDL [9], JSR [10], and DLRR [11], are applied directly to source images in the spatial domain to extract features, thereby minimizing the loss of feature information typically incurred by transformations between spatial and frequency domains. However, when the source images contain complex information, the performance of these algorithms can drastically decrease.
In recent years, the rapid advancement of deep learning technology has ushered in new opportunities for the fusion of infrared and visible images. The introduction of convolutional neural networks and attention mechanisms has broadened the scope of image fusion, surpassing traditional algorithms and giving rise to a variety of deep learning-based methods. Li et al. [12] introduced the pretrained deep learning network VGG into the image fusion model, significantly improving fusion performance compared to traditional fusion networks. However, since deep learning processing was only added to a few branches and the pretrained structure was not specifically designed for fusion tasks, the features extracted may not necessarily contain complementary information for infrared and visible images.
To accomplish feature extraction and image reconstruction, the autoencoder-based image fusion method first pretrains an autoencoder on a sizable dataset. Then, for image fusion, a hand-crafted fusion approach is used to combine deep features extracted from the various source images. Li et al. [13] proposed a novel autoencoder-based image fusion network called DenseFuse, which adopts the network structure of DenseNet [14] to fully extract image features so that the extracted features carry richer information. A designed fusion method is then applied for feature-level fusion, and finally, four convolutional layers are used for feature reconstruction to produce the fused image. During the training phase, a large-scale dataset is used to train the autoencoder designed for the fusion task, improving the adaptability of feature extraction to various scenarios. However, this network structure is relatively simple and cannot extract multi-scale deep features. To address this issue, many improved autoencoder image fusion algorithms based on the DenseFuse framework have emerged. In 2019, Song et al. [15] proposed the MSDNet algorithm, which extracts multi-scale features and fuses data across all scales by adding convolutional kernels of varying sizes after the encoder. However, while introducing multi-scale features enriches the information of deep features, it also makes the overall fusion network more complex, increasing the computational complexity of the model. Subsequently, Li et al. [16] further improved the network structure based on DenseFuse and proposed the NestFuse image fusion method, which utilizes nest connections to construct the decoder network structure and achieves multi-scale feature extraction. Nevertheless, this model still requires the manual design of fusion methods and cannot perform fusion tailored to the unique information of infrared and visible images.
Although autoencoder-based image fusion methods significantly improve fusion performance compared to traditional methods, they lack specific datasets for multimodal images, resulting in limitations in their expressive power when dealing with complex multimodal images. With the emergence of more multimodal datasets, a plethora of end-to-end fusion methods have appeared, incorporating end-to-end training, a fusion strategy, and deep feature extraction as the three main components of the fusion process. In 2017, Prabhakar et al. [17] introduced the DeepFuse model, which was the first to apply an end-to-end network to image fusion. However, its overly simple network topology causes information loss, as it uses only the output of the final layer. Ma et al. [18] utilized a GAN [19] to fuse infrared and visible images by establishing an adversarial relationship between the two modalities. However, FusionGAN utilizes content loss and discriminator loss as loss functions, resulting in fused images with fewer texture details. Subsequently, Ma et al. [20] improved upon FusionGAN with FusionGANv2, introducing novel loss functions such as detail loss and target edge-enhancement loss to preserve the detailed information of target edges. To address multiple kinds of image fusion tasks, Zhang et al. [21] manually assembled a multi-focus image dataset and employed a pretrained CNN. However, because the network was trained on multi-focus images, its results were limited when applied to other image fusion tasks. Following this, Xu et al. [22] proposed a unified image fusion method, U2Fusion, which adaptively maintains the similarity between the fusion results and the source images. However, the loss function employed in U2Fusion, designed solely around gradient-based adaptiveness, fails to fully capture the significance of source images across various fusion subtasks. For instance, in the fusion task of infrared and visible-light images, visible-light images exhibit more texture details and dominant gradient clues than infrared images, resulting in fusion results biased toward visible-light images. Li et al. [23], building upon NestFuse, designed a fusion network called the residual fusion network (RFN) to replace manually designed fusion strategies. By employing a two-stage training method, the RFN retains detailed information and salient features in the fusion features, significantly enhancing the fusion performance of the network. However, the complexity of the network structure leads to a large number of model parameters.
To address the issues present in the aforementioned image fusion networks, we propose a lightweight multi-scale infrared and visible image fusion method based on nested connections and Res2Net. This innovative combination not only enhances the feature extraction process, thereby substantially enhancing the quality of the fused images, but also ensures low computational complexity. Compared to existing techniques, the nested connection structure that we introduce can integrate multi-scale information more deeply, a facet often overlooked in traditional image fusion methods. Furthermore, by designing the multi-scale residual fusion network R2FN to replace traditional manually designed fusion strategies, our method can effectively highlight key information in the images, thereby enhancing their expressiveness while preserving image details. The introduction of depthwise separable convolution significantly reduces computational complexity and memory requirements, making the algorithm suitable for resource-constrained mobile devices. The main contributions of our algorithm are summarized as follows:
  • We employ the frequency-domain decomposition technique to split the source image into detail and base layers, allowing the network to operate on the image with greater precision.
  • We incorporate depthwise separable convolution into the infrared and visible-light image fusion network. In comparison to existing classical fusion methods, our network achieves the lowest number of parameters without compromising performance.
  • We propose a multi-scale residual fusion module (R2FN) to replace manually designed fusion strategies, enabling the effective fusion of features across multiple scales.
  • We design a new loss function that preserves detail information while enhancing salient target features.
  • We conduct experiments on the TNO dataset to test the proposed fusion method. Comparative analysis with existing classical image fusion algorithms demonstrates that our method achieves optimal performance in these fusion tasks.

2. Related Works

2.1. Res2Net

To improve the multi-scale representation capabilities of CNNs, Gao et al. [24] proposed a novel multi-scale backbone architecture called Res2Net, which is utilized for object detection, class activation mapping, and salient object detection. This method divides the input features into multiple branches, with each branch responsible for extracting features at a different scale. These branches are connected in a manner similar to residual connections, enhancing the scale representation capability of the features. The framework of Res2Net is illustrated in Figure 1.
The internal connectivity of Res2Net is similar to that of ResNet [25], with the distinction that, in Res2Net, the 3 × 3 convolutions are decoupled. The input features are segmented into multiple groups, each of which is processed by a corresponding set of filters to extract features. Subsequently, the output features of the preceding group are combined with the input features of the subsequent group and processed by the next group of filters. This process is iterated multiple times until all groups of features have undergone processing. Finally, the feature maps from all groups are concatenated and subjected to a set of 1 × 1 filters and then concatenated with the original features to derive the final result. Through this approach, Res2Net enhances the network’s performance and representation capability by increasing the effective receptive field and generating multi-scale feature representations.
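To make the grouping-and-hierarchical-connection idea concrete, the following is a minimal PyTorch sketch of a Res2Net-style block. The class name, variable names, and the placement of the activations and residual connection are illustrative simplifications, not the exact configuration used in [24].

```python
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    """Minimal Res2Net-style block: the 3x3 convolution is split into
    `scales` groups that are processed hierarchically and re-concatenated."""

    def __init__(self, channels: int, scales: int = 4):
        super().__init__()
        assert channels % scales == 0, "channels must be divisible by scales"
        self.scales = scales
        width = channels // scales
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # One 3x3 convolution per subset except the first, which passes through.
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
             for _ in range(scales - 1)])
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.conv_in(x))
        xs = torch.chunk(out, self.scales, dim=1)   # split into s subsets
        ys = [xs[0]]                                 # first subset passes through
        prev = None
        for i, conv in enumerate(self.convs):
            inp = xs[i + 1] if prev is None else xs[i + 1] + prev
            prev = self.relu(conv(inp))              # hierarchical residual-like connection
            ys.append(prev)
        out = self.conv_out(torch.cat(ys, dim=1))    # fuse all scales with a 1x1 conv
        return self.relu(out + identity)             # residual connection with the input
```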

2.2. Depthwise Separable Convolution

Depthwise separable convolution (DSC), first proposed by Sifre et al. [26], gained widespread recognition when it was introduced in the MobileNet model by the Google team in 2017 [27]. The fundamental concept of MobileNet involves significantly reducing computational complexity and model size by employing DSC.
Depthwise separable convolution comprises two processes: Depthwise Convolution (DW) and Pointwise Convolution (PW). In DW, the number of convolutional kernels matches that of the input channels, thereby establishing a one-to-one correlation between channels and kernels. Consequently, in DW, the number of output feature maps matches that of the input channels. PW then convolves the output feature maps from DW with convolutional kernels, ensuring that each output feature map integrates information from all input feature maps. The schematic diagram of depthwise separable convolution is illustrated in Figure 2.
In the figure, the input feature map of size D_F × D_F × M is first convolved depthwise with M convolution kernels of size D_K × D_K (one kernel per channel) to obtain M-channel feature maps. Then, the M-channel feature maps are convolved with N pointwise kernels of size 1 × 1 × M, resulting in N-channel feature maps. The computational complexity of the entire process is as follows:
D_F \cdot D_F \cdot M \cdot D_K \cdot D_K + D_F \cdot D_F \cdot M \cdot N \qquad (1)
If the input image of size D_F × D_F × M is instead convolved using regular convolution with N kernels of size D_K × D_K × M to obtain the same N-channel feature maps as in the above process, the computational complexity is as follows:
D_K \cdot D_K \cdot D_F \cdot D_F \cdot M \cdot N \qquad (2)
The ratio between (1) and (2) is
\frac{1}{D_K^2} + \frac{1}{N} \qquad (3)
In feature extraction, the commonly chosen kernel size is 3 × 3. Therefore, theoretically, depthwise separable convolution reduces computation by a factor of 8–9 compared to regular convolution.
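As a concrete check of this ratio, the following PyTorch sketch builds a depthwise separable convolution from a grouped (depthwise) convolution and a 1 × 1 (pointwise) convolution and compares its parameter count with that of a regular convolution. The channel counts are arbitrary example values chosen for illustration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) convolution
    followed by a 1x1 (pointwise) convolution that mixes channels."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

if __name__ == "__main__":
    M, N, k = 64, 64, 3
    dsc = DepthwiseSeparableConv(M, N, k)
    std = nn.Conv2d(M, N, k, padding=1, bias=False)
    p_dsc = sum(p.numel() for p in dsc.parameters())   # M*k*k + M*N
    p_std = sum(p.numel() for p in std.parameters())   # N*M*k*k
    # For k = 3 and N = 64 the ratio is 1/9 + 1/64 ≈ 0.13, i.e. roughly an 8x reduction.
    print(p_dsc, p_std, p_dsc / p_std)
```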
MobileNetv3, introduced by Howard et al. [28], incorporates depthwise separable convolution, inverted residual blocks, and Squeeze-and-Excitation (SE) modules [29]. The input feature map is first expanded through a 1 × 1 convolutional layer to extract additional features. Depthwise Convolution (DW) is then applied, followed by the SE module, which adjusts the weight of each channel and thereby enhances the model’s accuracy. Finally, a 1 × 1 convolutional layer projects the features back to a lower channel dimension. When the numbers of input and output channels match, the Bottleneck (Bneck) uses a shortcut connection. The structure of the Bneck network is depicted in Figure 3.

3. Approach

3.1. An Overview of the Proposed Method

We propose a lightweight multi-scale infrared and visible image fusion method. The method first divides the original visible-light and infrared images into base and detail layers using mutually guided image filtering (muGIF) [30], which allows for extracting more hierarchical representations in high-frequency and low-frequency domains. The base layer encompasses information such as image content and spatial structure, whereas texture and local shape information are contained within the detail layer. Subsequently, sub-images at the same hierarchical level are input into the image fusion network for fusion. Finally, the fused images from both high-frequency and low-frequency components are merged to derive the final fused image. The flowchart of the proposed algorithm is illustrated in Figure 4.
The decomposition process of the source image primarily consists of two steps. Firstly, the image’s base layer is obtained through the muGIF method, which can be calculated using (4).
I_{\mathrm{base}} = \mathrm{muGIF}(I_i, \alpha, T) \qquad (4)
Here, I_base represents the base-layer image, muGIF denotes the mutually guided image filtering operation, I_i is the source image, α is the parameter controlling the extent of texture removal, and T is the number of iterations. We set α to 0.003 and T to 3.
After extracting the base layer, the detail layer image is obtained through the operation in (5):
I_{\mathrm{detail}} = I_i - I_{\mathrm{base}} \qquad (5)
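The two-scale decomposition of (4) and (5) can be sketched as follows. Since muGIF [30] has no standard library implementation, a plain box filter is used here purely as a stand-in smoother; the actual method uses mutually guided image filtering with α = 0.003 and T = 3.

```python
import torch
import torch.nn.functional as F

def box_smooth(img: torch.Tensor, kernel_size: int = 15) -> torch.Tensor:
    """Simple edge-unaware box filter, used here only as a stand-in for the
    mutually guided image filter (muGIF) of [30]."""
    pad = kernel_size // 2
    weight = torch.ones(img.shape[1], 1, kernel_size, kernel_size,
                        device=img.device) / (kernel_size ** 2)
    return F.conv2d(F.pad(img, (pad, pad, pad, pad), mode="reflect"),
                    weight, groups=img.shape[1])

def decompose(source: torch.Tensor):
    """Two-scale decomposition of Eqs. (4)-(5): base = filter(source),
    detail = source - base."""
    base = box_smooth(source)   # paper: base = muGIF(I_i, alpha=0.003, T=3)
    detail = source - base      # Eq. (5)
    return base, detail
```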
When fusing sub-images at the same hierarchical level, we propose a lightweight multi-scale infrared and visible fusion network. Taking the fusion process of the base layer as an example, its architecture is illustrated in Figure 5. We draw inspiration from the network structure of RFN-Nest [23], in which the fusion network consists of encoder, fusion, and decoder modules. We introduce depthwise separable convolution into the encoder and decoder networks, replacing the conventional convolutions of the original network to address its relatively large parameter size. Each encoder block comprises two improved Bneck layers and a max-pooling layer; through this combination, the encoder can extract multi-scale deep features at a smaller computational cost. The multi-scale fusion network R2FN is employed to integrate the multimodal deep features extracted at each scale. The fused features are then input into a decoder with a nested connection structure. The advantage of this structure is its ability to avoid the loss of information from previous layers during convolution operations, thereby fully utilizing multi-scale features for image reconstruction.
We adopted the Bottleneck (Bneck) structure from MobileNetv3. DSC is capable of reducing the number of model parameters and computational complexity, but it might also adversely affect the convolution’s capacity to extract features. Therefore, the attention module CBAM [31] is introduced to enhance the model’s attention concentration ability and improve the information-processing mechanism, effectively improving the quality of image fusion and the overall performance of the model. CBAM, as a lightweight and versatile attention mechanism, can be easily added to the convolutional layers of any network at a minimal cost. CBAM applies attention mechanisms simultaneously in both the channel and spatial dimensions, enhancing the model’s accuracy. Additionally, the parameter size of the improved network has been further decreased. The enhanced Bneck structure is illustrated in Figure 6.
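The following sketch illustrates one plausible arrangement of the enhanced bottleneck, combining a depthwise separable convolution with a CBAM block. The expansion ratio, activation functions, and exact position of the attention module in Figure 6 are assumptions made for this simplified version, not the authors' precise design.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module [31]: channel then spatial attention."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)                               # channel attention
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))                     # spatial attention

class EnhancedBneck(nn.Module):
    """Sketch of the modified bottleneck: expand (1x1) -> depthwise 3x3 -> CBAM
    -> project (1x1), with a shortcut when input and output channels match."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.ReLU(inplace=True),
            CBAM(hidden),
            nn.Conv2d(hidden, channels, 1, bias=False))

    def forward(self, x):
        return x + self.block(x)   # residual shortcut (same channel count)
```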

3.2. Fusion Network

The fusion network R2FN, tailored for the dual-modal image fusion task, is designed based on the Res2Net architecture. In the fusion network, the parameters of R2FN vary across different layers. The structure of the R2FN network is illustrated in Figure 7.
In the figure, \Phi_{ir}^{m} and \Phi_{vi}^{m} represent the infrared and visible deep features extracted by the encoder network, respectively. Initially, the outputs of Conv1 and Conv2 are concatenated and fed into a 1 × 1 convolution for channel transformation. The feature maps are subsequently partitioned into s subsets, each with the same spatial size and 1/s of the total channel count, where s serves as the scale control parameter. The first subset, X_1, remains unchanged and is directly propagated to Y_1, while each remaining subset undergoes a 3 × 3 convolution before being added to the next feature subset. The resulting feature maps are then fed into the SE module to adjust the weights of each channel, thereby improving the model’s accuracy; ReLU and h-sigmoid are used sequentially as activation functions. Finally, the result of the 1 × 1 convolution is added to the output of the fusion convolutional layer Conv3 to obtain the final outcome C. In our experiments, we set the scale control parameter to s = 4.
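A PyTorch sketch of the fusion block described above is given below. The routing of the Conv3 shortcut and the channel widths are assumptions inferred from the textual description of Figure 7, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class R2FN(nn.Module):
    """Sketch of the multi-scale residual fusion block of Figure 7.
    channels must be divisible by scales (s = 4 in the paper)."""
    def __init__(self, channels: int, scales: int = 4):
        super().__init__()
        self.scales = scales
        width = channels // scales
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)   # infrared branch
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)   # visible branch
        self.reduce = nn.Conv2d(2 * channels, channels, 1)         # channel transformation
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, 3, padding=1) for _ in range(scales - 1)])
        self.se = nn.Sequential(                                   # SE module [29]
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Hardsigmoid())
        self.conv_out = nn.Conv2d(channels, channels, 1)
        self.conv3 = nn.Conv2d(2 * channels, channels, 3, padding=1)  # fusion shortcut

    def forward(self, phi_ir, phi_vi):
        cat = torch.cat([self.conv1(phi_ir), self.conv2(phi_vi)], dim=1)
        x = self.reduce(cat)
        xs = torch.chunk(x, self.scales, dim=1)
        ys, prev = [xs[0]], None                   # first subset passes through
        for i, conv in enumerate(self.convs):
            inp = xs[i + 1] if prev is None else xs[i + 1] + prev
            prev = conv(inp)
            ys.append(prev)
        y = torch.cat(ys, dim=1)
        y = y * self.se(y)                         # re-weight channels (ReLU + h-sigmoid)
        return self.conv_out(y) + self.conv3(cat)  # final fused feature C
```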

4. Training Strategy

During the training phase, our image fusion network needs to possess superior performance based on two key factors: one is the feature extraction capability of the encoder network and the feature reconstruction capability of the decoder network, and the other is the capability of R2FN to extract dual-mode multi-scale features. Therefore, a two-stage training method is adopted in this study. Firstly, the encoder network and the decoder network are trained as a whole, with the objective of reconstructing the network input. Then, R2FN is trained using multimodal images, with the parameters of the encoder and decoder obtained from the first stage being fixed during this phase.
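The two-stage procedure can be summarized by the training skeleton below. The encoder, decoder, R2FN modules, data loaders, and loss functions are assumed to be defined elsewhere, and for brevity the multi-scale features are treated as a single call per network rather than one R2FN per scale.

```python
import itertools
import torch

def train_stage1(encoder, decoder, loader, loss_fn, lr=1e-4, epochs=20):
    """Stage 1: train encoder + decoder as an autoencoder to reconstruct the input."""
    opt = torch.optim.Adam(
        itertools.chain(encoder.parameters(), decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for img in loader:                       # single images (e.g. MS-COCO)
            recon = decoder(encoder(img))
            loss = loss_fn(recon, img)           # L = L_sim + lambda * L_grad, Eq. (6)
            opt.zero_grad(); loss.backward(); opt.step()

def train_stage2(encoder, decoder, r2fn, loader, loss_fn, lr=1e-4, epochs=20):
    """Stage 2: freeze the autoencoder and train only the R2FN fusion modules."""
    for p in itertools.chain(encoder.parameters(), decoder.parameters()):
        p.requires_grad = False                  # keep stage-1 weights fixed
    opt = torch.optim.Adam(r2fn.parameters(), lr=lr)
    for _ in range(epochs):
        for ir, vi in loader:                    # paired infrared / visible images (KAIST)
            fused = decoder(r2fn(encoder(ir), encoder(vi)))
            loss = loss_fn(fused, ir, vi)        # L_R2FN = beta * L_detail + L_feature, Eq. (11)
            opt.zero_grad(); loss.backward(); opt.step()
```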

4.1. Training in the First Stage

Because the fusion layer of the network is removed in this stage, we are essentially training an autoencoder network to reconstruct the input images. The formulation of the loss function is a critical factor affecting the quality of the image fusion results. In an autoencoder-based image fusion network, the loss function measures the discrepancy between the reconstructed image and the source image to supervise the learning process. We constrain the output image to remain consistent with the input image in texture detail while also sharing a similar structure and intensity distribution with the source image. Considering these factors, we combine a similarity loss L_sim and a gradient loss L_grad to form the total loss function L, defined as follows:
L = L_{\mathrm{sim}} + \lambda L_{\mathrm{grad}} \qquad (6)
where L_sim is used to retain important information from the source image. It constrains the fusion result to resemble the source image, ensuring that the result preserves the essential features of the source image to the greatest extent possible and thereby enhancing the quality and perceptual effect of the fusion result. L_grad constrains the fusion result to maintain gradient information and texture features consistent with the source image. λ is a balance parameter that adjusts the trade-off between L_sim and L_grad and keeps the two loss terms on the same scale. This enables the encoder and decoder to balance information from different modalities when dealing with infrared and visible images.
To determine L sim , we utilize two metrics, SSIM and MSE, to comprehensively assess the similarity of fusion results. The SSIM is the most commonly used metric to assess the similarity between two images. It assesses their similarity by comparing the brightness, contrast, and structural information of the two images. The SSIM values range from −1 to 1, with a value closer to 1 indicating a higher similarity between the two images. To minimize the loss, we use the dissimilarity between the two images to represent the structural similarity loss L ssim , which is calculated using (7):
L_{\mathrm{ssim}} = 1 - \mathrm{SSIM}(X, Y) \qquad (7)
where X represents the output image, Y represents the input image, and SSIM ( · ) represents the structural similarity operation. It is worth noting that the SSIM primarily focuses on changes in contrast and structure, with weaker constraints on intensity distribution differences. Therefore, we introduce MSE as a supplement. MSE is a metric that measures the error between two images. Using MSE as the loss function ensures that the distribution of pixel intensities in the input and output images are similar in image fusion tasks. L mse can be calculated using (8):
L_{\mathrm{mse}} = \frac{1}{HW} \sum_{i} \sum_{j} \left( X_{i,j} - Y_{i,j} \right)^2 \qquad (8)
where H and W represent the height and width of the image, respectively, X represents the output image, and Y represents the input image. X_{i,j} and Y_{i,j} denote the pixel values at row i and column j. Since L_ssim and L_mse differ in scale, we introduce a balance parameter, μ, to control the balance between the two terms. The final expression for L_sim is given in (9):
L_{\mathrm{sim}} = \mu L_{\mathrm{ssim}} + L_{\mathrm{mse}} \qquad (9)
We utilize gradient operators to compute the gradients of both the input image and the output image, followed by the calculation of the Euclidean distance between them. Gradient operators can compute the gradient values of each pixel in the image, representing the rate of color change at that pixel. Therefore, the gradient loss ensures that the output image has similar texture details to the input image, thereby improving the quality of the fusion result. L grad can be calculated using (10):
L_{\mathrm{grad}} = \frac{1}{HW} \sum_{i} \sum_{j} \left\| \nabla X_{i,j} - \nabla Y_{i,j} \right\|^2 \qquad (10)
where \nabla(\cdot) represents the gradient operator, which computes the gradient value of each pixel in the image, and \left\| \nabla X_{i,j} - \nabla Y_{i,j} \right\|^2 represents the squared Euclidean distance between the gradients of the output and input images at pixel (i, j). A smaller L_grad indicates that the texture details in the output image are more similar to those in the input image, leading to a higher-quality fusion result.
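A compact implementation of the first-stage loss (6)-(10) might look as follows. It assumes a differentiable SSIM such as the one in the pytorch_msssim package and approximates the gradient operator with fixed Sobel kernels; both choices are assumptions, since the paper does not specify a particular gradient operator.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim   # assumed third-party differentiable SSIM

def gradient(img):
    """Approximate image gradients with fixed Sobel kernels (one per channel)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3).repeat(img.shape[1], 1, 1, 1)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1, groups=img.shape[1])
    gy = F.conv2d(img, ky, padding=1, groups=img.shape[1])
    return gx, gy

def stage1_loss(output, target, lam=1.0, mu=100.0):
    """L = L_sim + lambda * L_grad with L_sim = mu * L_ssim + L_mse, Eqs. (6)-(10)."""
    l_ssim = 1.0 - ssim(output, target, data_range=1.0)       # Eq. (7)
    l_mse = F.mse_loss(output, target)                        # Eq. (8)
    gx_o, gy_o = gradient(output)
    gx_t, gy_t = gradient(target)
    l_grad = F.mse_loss(gx_o, gx_t) + F.mse_loss(gy_o, gy_t)  # Eq. (10)
    return mu * l_ssim + l_mse + lam * l_grad                 # Eqs. (6) and (9)
```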

4.2. Training in the Second Stage

In the fusion layer, the multimodal multi-scale feature fusion module R2FN is designed to replace the manually designed fusion strategies typically used in autoencoder-based fusion networks. In the second stage of training, the focus is on training the R2FN module to enhance its capability of extracting multimodal multi-scale features. The R2FN module is trained using multimodal images, aiming to optimize its performance in effectively fusing multimodal multi-scale features. The parameters of the encoder and decoder obtained from the first stage are kept fixed to ensure consistency in the features extracted and reconstructed by these networks. Subsequently, a loss function tailored for R2FN is designed to train the multi-scale depth feature fusion network.
The fixed encoder network is employed to extract multi-scale features from the source images, and the features at each scale are fused by the corresponding R2FN. The fused multi-scale features are then used as inputs to the decoder network to reconstruct the fused image. We define a loss function L_R2FN as the training loss for R2FN. L_R2FN consists of two components, a detail loss (L_detail) and a feature enhancement loss (L_feature), and is defined as follows:
L_{\mathrm{R2FN}} = \beta L_{\mathrm{detail}} + L_{\mathrm{feature}} \qquad (11)
where β represents the balancing parameter between L detail and L feature .
In infrared and visible image fusion, the visible image typically contributes the texture details of the background. Therefore, we define the detail preservation loss as the structural similarity loss between the fused output O and the visible-light image I_vi. It is defined as follows:
L_{\mathrm{detail}} = 1 - \mathrm{SSIM}(O, I_{vi}) \qquad (12)
In infrared images, more salient object features are typically present. Therefore, a feature enhancement loss function is designed to enhance salient feature information. It is defined as follows:
L_{\mathrm{feature}} = \sum_{m=1}^{M} \omega_1(m) \cdot \left( \left\| \phi_f^m - \phi_{ir}^m \right\|^2 \cdot \omega_{ir} + \left\| \phi_f^m - \phi_{vi}^m \right\|^2 \cdot \omega_{vi} \right) \qquad (13)
where M represents the number of multi-scale features obtained through downsampling, \phi_f^m, \phi_{ir}^m, and \phi_{vi}^m denote the fused, infrared, and visible-light deep features at the m-th scale, \omega_1(m) represents the balancing parameter for the m-th multi-scale feature, and \omega_{ir} and \omega_{vi} represent the balancing parameters controlling the contributions of the infrared and visible-light deep features, respectively.
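The second-stage loss (11)-(13) can be sketched as follows. The per-scale weights ω_1(m) are assumed uniform here, and the squared norms are implemented as mean squared differences for numerical stability; both are illustrative choices rather than the paper's exact settings.

```python
import torch
from pytorch_msssim import ssim   # assumed third-party differentiable SSIM

def stage2_loss(fused, ir, vi, fused_feats, ir_feats, vi_feats,
                beta=500.0, w_ir=5.0, w_vi=3.0, w_scale=None):
    """L_R2FN = beta * L_detail + L_feature, Eqs. (11)-(13).
    *_feats are lists of the M multi-scale deep features (fused, infrared, visible)."""
    l_detail = 1.0 - ssim(fused, vi, data_range=1.0)          # Eq. (12)
    if w_scale is None:
        w_scale = [1.0] * len(fused_feats)                    # omega_1(m), assumed uniform
    l_feature = 0.0
    for w, f, fi, fv in zip(w_scale, fused_feats, ir_feats, vi_feats):
        l_feature = l_feature + w * (w_ir * torch.mean((f - fi) ** 2)
                                     + w_vi * torch.mean((f - fv) ** 2))   # Eq. (13)
    return beta * l_detail + l_feature                        # Eq. (11)
```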

5. Experiments and Results Analysis

5.1. Dataset and Experimental Environment

In order to verify the efficacy of our method, in the training stage, we selected 80,000 images from the MS-COCO dataset [32] as the first-stage training set and utilized the KAIST dataset [33] as the second-stage training set. In the first-stage training, the balancing parameter λ between the similarity loss and the gradient loss in the loss function was set to 1, and μ was set to 100. In the second-stage training, β was set to 500, ω ir was set to 5, and ω vi was set to 3. The model training parameters were set as follows: epochs = 20; batch size = 4.
During the testing phase, to verify the effectiveness of our method, images were selected from the publicly available infrared and visible-light dataset TNO [34] for experimentation. Six sets of images were chosen for comparative analysis. These images, rich in detail and texture, are suitable for assessing the quality of image fusion. Having been widely used in previous studies, they provide a benchmark for comparing our results with existing methods.
Our experiments were conducted on a system running Windows 11 with hardware specifications that include an Intel(R) Core(TM) i5-12400F 2.50 GHz processor. The model was run on an NVIDIA GeForce RTX 3060 GPU. The software environment for the experiments included Python 3.8.3, PyTorch 1.10.1, CUDA 11.3, and the PyCharm 2020.1 IDE.

5.2. Evaluation Metrics

To objectively evaluate the algorithm, we selected Entropy (EN), the Standard Deviation (SD), and the Structural Similarity Measure (SSIM) as objective evaluation metrics to assess the amount of edge, texture, and contrast information in the fused images. Mutual Information (MI), Feature Mutual Information using the Discrete Cosine Transform (FMIdct), Feature Mutual Information using the Wavelet Transform (FMIw), and Visual Information Fidelity (VIF) were used to evaluate the distortion, noise, and artifacts introduced by fusion, the similarity between the fused and source images, and the transfer of complementary information. We also used the number of parameters ("params") to evaluate the model's size.
EN is used to assess the information content of fused images, with higher values indicating richer content, and is crucial for evaluating fusion effectiveness. The SD measures pixel dispersion, reflecting image contrast, which is important for enhancing visibility and details. The SSIM evaluates the structural similarity between the fused and original images, with high values showing the effective preservation of visual features. MI assesses the degree of information correlation between the fusion result and original images, indicating the preservation of original data. FMIdct and FMIw assess the Mutual Information of discrete cosine and wavelet features, respectively, reflecting the algorithm’s ability to retain significant original features. VIF, which assesses the visual quality of the fusion result, shows that higher values indicate greater fidelity to human visual perception, representing better quality. The model’s size and computational complexity are critical for practical applications, with models having fewer parameters being easier to deploy in resource-limited environments, reducing energy and operational costs.
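For reference, the two simplest metrics, EN and SD, follow the standard definitions sketched below for an 8-bit grayscale image; the remaining metrics (SSIM, MI, FMIdct, FMIw, VIF) follow their published definitions and are omitted here for brevity.

```python
import numpy as np

def entropy(img_u8):
    """Shannon entropy (EN) of an 8-bit grayscale image, in bits.
    Expects an integer array with values in [0, 255]."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img_u8):
    """Standard deviation (SD) of pixel intensities, a proxy for contrast."""
    return float(img_u8.astype(np.float64).std())
```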
In comparative experiments, we selected six classical image fusion algorithms as benchmarks: DeepFuse (CNN-based fusion), DenseFuse (DenseNet-based fusion with autoencoders), NestFuse (fusion with nested connections and spatial/channel attention models), FusionGAN (GAN-based fusion), U2Fusion (end-to-end unsupervised fusion), and IFCNN (fusion using multiple fusion strategies).

5.3. Ablation Study

To validate the optimization effects of our various strategies and assess the effectiveness of the proposed methods, we designed and conducted ablation experiments. These experiments aimed to further evaluate the impact of improved techniques on the performance of image fusion.

5.3.1. Frequency-Domain Decomposition

In this section, an analysis is conducted on the frequency-domain decomposition module, and the influence of different parameters of α in the guided filtering operation (muGIF) on the network is examined.
As discussed in Section 3.1, mutually guided image filtering is employed during frequency-domain decomposition to decompose the input source image into base and detail layers. The quality of filtering critically impacts the final image fusion performance. Therefore, in the experiments of this section, α is set to 0.0001, 0.001, 0.01, 0.002, 0.003, and 0.004 to analyze its influence on the filtering effect. (Only the experimental results of infrared image frequency-domain decomposition are listed in this paper, and it is observed that the trends in the decomposition effects of visible-light images are consistent with those of infrared images.) The experimental results are shown in Figure 8 and Figure 9.
From the figure, it can be observed that, with the increase in α , the base layer gradually becomes smoother, as larger α values result in the removal of more high-frequency details. When α increases to a certain extent, such as α = 0.01, the detail layer loses too much high-frequency information, leading to less prominent image details. At α = 0.003, the base layer demonstrates moderate smoothness, removing an appropriate amount of high-frequency detail information while still retaining sufficient structural information. The detail layer preserves more texture and edge information without excessive smoothing, indicating that the filter can better distinguish between the base content and detail content. Therefore, we select 0.003 as the value of α .
After determining the value of α , to verify the impact of the frequency-domain decomposition module on the performance of image fusion, we conducted ablation experiments targeting this module using the same image fusion network. The experimental results are presented in Table 1.
In the table, Experiment 1 corresponds to the fusion results without the frequency-domain decomposition module, while Experiment 2 corresponds to the fusion results with the inclusion of the frequency-domain decomposition module. Clearly, after incorporating the frequency-domain decomposition module, the numerical values of various evaluation metrics improved, validating the effectiveness of this module in enhancing the performance of the image fusion network.

5.3.2. The Loss Function during the First Stage of Training

The first stage of training focuses on the feature extraction capability of the encoder and the reconstruction ability of the decoder, independent of the fusion layer. To balance the magnitudes of different loss terms in the loss function, we introduce the balancing parameters λ and μ , where λ is used to balance the magnitude difference between L sim and L grad , and μ is used to balance the magnitude difference between L ssim and L mse . We assessed the average values of objective metrics under different magnitude combinations to validate the optimal combination of balancing parameters. The experimental results are presented in Table 2. The optimal values are highlighted in bold font.
From Table 2, it can be observed that when μ is set to 100 and λ is set to 1, the image fusion network exhibits better performance.

5.3.3. The Loss Function during the Second Stage of Training

We kept the numerical values obtained in the first stage unchanged and conducted an ablation study on the balancing parameters of the loss function during the second stage of training. Referring to the conclusions drawn in reference [23], we first set β to 700 and conducted ablation experiments for both ω vi and ω ir . The experimental results are presented in Table 3.
From Table 3, it can be observed that the combination of ω ir = 5 and ω vi = 3 performs the best across almost all key performance indicators. This combination not only maintains the richness of image information and contrast but also effectively preserves the structural similarity, feature information, and visual fidelity of the images. Therefore, this combination is considered the optimal parameter setting, providing the best image fusion results.
To find the optimal value of β , which controls the balance between L detail and L feature , we conducted an ablation study by setting ω ir to 5 and ω vi to 3. Due to the larger difference in magnitudes between L detail and L feature , we experimented with β values of 100, 300, 500, 700, and 1000. The experimental results are presented in Table 4.
From Table 4, it can be observed that when β is set to 500, all metrics except for FMI w are at their optimal values, with the value of FMI w being only slightly below the optimum. Considering all key performance indicators comprehensively, β = 500 offers the best overall performance. Therefore, we set β to 500 in our experiments.

5.4. Results Analysis and Comparison

5.4.1. Subjective Evaluation

To validate the effectiveness of our proposed method, we conducted a subjective comparative experiment on a subset of images from the TNO dataset, comparing them with various image fusion algorithms. The comparative results of each algorithm are shown in Figure 10. The comparison results show that the FusionGAN method fails to effectively preserve detailed texture information from the visible-light images. NestFuse and IFCNN methods demonstrate a good representation of the target contours but do not effectively retain the thermal radiation information from the infrared images. The DeepFuse, DenseFuse, and U2Fusion methods exhibit clear contour information and target features, but the introduction of excessive noise leads to poor fused image quality. Particularly in the yellow-boxed areas in Figure 10, none of these algorithms achieve the desired fusion results. In contrast, our algorithm maintains a better balance of information between the infrared and visible-light images in most fusion scenarios, resulting in superior fusion results.

5.4.2. Objective Evaluation

To further validate the effectiveness of our proposed algorithm, we selected eight objective evaluation metrics for comparative analysis, and the comparative results are shown in Table 5.
From Table 5, it is evident that our proposed method achieves the optimal values on all evaluation metrics except FMIw. Additionally, it demonstrates significant advantages in terms of model complexity and computational efficiency. This shows that the algorithm can effectively preserve detailed information from visible-light images while highlighting significant features from infrared images, and it does so with a relatively small parameter count. The reduced number of parameters implies a lighter-weight model and lower training and deployment costs, which is particularly suitable for resource-constrained environments such as mobile devices and real-time systems. This highlights the high practical value of our method: it is not only strong in theory and experiments but also highly applicable in real-world scenarios.

6. Conclusions

In this article, we propose a novel method for lightweight infrared and visible image fusion based on nested connections and Res2Net. This method combines frequency-domain decomposition, depthwise separable convolution, nested connection networks, and multi-scale residual networks, achieving multi-scale feature extraction while maintaining a small model parameter count. Prior to inputting the source images into the fusion network, a mutually guided filtering operation is applied to better extract hierarchical representations of the images' high- and low-frequency components. By improving depthwise separable convolution, the model reduces computational complexity while maintaining high fusion quality. Multi-scale feature extraction is realized through the use of nested connection structures. Through the designed R2FN network, image details are effectively preserved, and significant features of the infrared images are highlighted. Experimental comparisons with several classical image fusion algorithms, in terms of subjective and objective evaluations, demonstrate the superiority of the proposed method across multiple key performance indicators. Notably, the method exhibits advantages in lightweight design, significantly reducing the computational burden, while also enhancing feature extraction and fusion capabilities through the nested connection architecture and the R2FN fusion module. Consequently, this study not only advances the theoretical and practical aspects of image fusion technology but also opens up new pathways for its application in dynamic, resource-constrained environments.

Author Contributions

Conceptualization, X.T.; methodology, X.T.; software, X.T.; validation, X.T., Y.P., and Q.Y.; formal analysis, Y.P.; investigation, X.T.; resources, X.T.; data curation, X.T.; writing—original draft preparation, X.T.; writing—review and editing, X.T.; visualization, X.T.; supervision, Y.P.; project administration, X.T.; funding acquisition, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Projects of Basic Research Program in Yunnan Province (202401AS070105), the National Natural Science Foundation of China (61761025), and the Development Fund of Key Laboratory of Computer Technology Application in Yunnan Province (2021102).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author due to confidentiality restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336. [Google Scholar] [CrossRef]
  2. Cao, Y.; Guan, D.; Huang, W.; Yang, J.; Cao, Y.; Qiao, Y. Pedestrian detection with unsupervised multispectral feature learning using deep neural networks. Inf. Fusion 2019, 46, 206–217. [Google Scholar] [CrossRef]
  3. Bhatnagar, G.; Wu, Q.J.; Liu, Z. Directive contrast based multimodal medical image fusion in NSCT domain. IEEE Trans. Multimed. 2013, 15, 1014–1024. [Google Scholar] [CrossRef]
  4. Zhang, H.; Ma, J.; Chen, C.; Tian, X. NDVI-Net: A fusion network for generating high-resolution normalized difference vegetation index in remote sensing. ISPRS J. Photogramm. Remote Sens. 2020, 168, 182–196. [Google Scholar] [CrossRef]
  5. Sundararajan, D. Discrete Wavelet Transform: A Signal Processing Approach; John Wiley & Sons: Hoboken, NJ, USA, 2016. [Google Scholar]
  6. Xia, J.; Chen, Y.; Chen, A.; Chen, Y. Medical image fusion based on sparse representation and PCNN in NSCT domain. Comput. Math. Methods Med. 2018, 2018, 2806047. [Google Scholar] [CrossRef] [PubMed]
  7. Peng, H.; Li, B.; Yang, Q.; Wang, J. Multi-focus image fusion approach based on CNP systems in NSCT domain. Comput. Vis. Image Underst. 2021, 210, 103228. [Google Scholar] [CrossRef]
  8. Zong, J.J.; Qiu, T.S. Medical image fusion based on sparse representation of classified image patches. Biomed. Signal Process. Control 2017, 34, 195–205. [Google Scholar] [CrossRef]
  9. Mairal, J.; Bach, F.; Ponce, J.; Sapiro, G.; Zisserman, A. Discriminative learned dictionaries for local image analysis. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 24–26 June 2008; IEEE: New York, NY, USA, 2008; pp. 1–8. [Google Scholar]
  10. Zhang, Q.; Fu, Y.; Li, H.; Zou, J. Dictionary learning method for joint sparse representation-based image fusion. Opt. Eng. 2013, 52, 057006. [Google Scholar] [CrossRef]
  11. Li, H.; He, X.; Tao, D.; Tang, Y.; Wang, R. Joint medical image fusion, denoising and enhancement via discriminative low-rank sparse dictionaries learning. Pattern Recognit. 2018, 79, 130–146. [Google Scholar] [CrossRef]
  12. Li, H.; Wu, X.J.; Kittler, J. Infrared and visible image fusion using a deep learning framework. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; IEEE: New York, NY, USA, 2018; pp. 2705–2710. [Google Scholar]
  13. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef] [PubMed]
  14. Xu, H.; Ma, J.; Le, Z.; Jiang, J.; Guo, X. Fusiondn: A unified densely connected network for image fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7 February 2020; Volume 34, pp. 12484–12491. [Google Scholar]
  15. Song, X.; Wu, X.J.; Li, H. MSDNet for medical image fusion. In Proceedings of the Image and Graphics: 10th International Conference, ICIG 2019, Beijing, China, 23–25 August 2019; Part II 10. Springer: Berlin/Heidelberg, Germany, 2019; pp. 278–288. [Google Scholar]
  16. Li, H.; Wu, X.J.; Durrani, T. NestFuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656. [Google Scholar] [CrossRef]
  17. Ram Prabhakar, K.; Sai Srikar, V.; Venkatesh Babu, R. Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4714–4722. [Google Scholar]
  18. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  19. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
  20. Ma, J.; Liang, P.; Yu, W.; Chen, C.; Guo, X.; Wu, J.; Jiang, J. Infrared and visible image fusion via detail preserving adversarial learning. Inf. Fusion 2020, 54, 85–98. [Google Scholar] [CrossRef]
  21. Zhang, Y.; Liu, Y.; Sun, P.; Yan, H.; Zhao, X.; Zhang, L. IFCNN: A general image fusion framework based on convolutional neural network. Inf. Fusion 2020, 54, 99–118. [Google Scholar] [CrossRef]
  22. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef]
  23. Li, H.; Wu, X.J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86. [Google Scholar] [CrossRef]
  24. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  26. Sifre, L.; Mallat, S. Rigid-motion scattering for texture classification. arXiv 2014, arXiv:1403.1687. [Google Scholar]
  27. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  28. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  30. Guo, X.; Li, Y.; Ma, J. Mutually guided image filtering. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1283–1290. [Google Scholar]
  31. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  32. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  33. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
  34. Toet, A. TNO Image Fusion Dataset. 2014. Available online: https://figshare.com/articles/dataset/TNO_Image_Fusion_Dataset/1008029/2 (accessed on 14 November 2022).
Figure 1. Architecture of Res2Net.
Figure 2. Schematic diagram of depthwise separable convolution.
Figure 3. The Bneck structure of MobileNetv3.
Figure 4. Two-layer fusion framework of the proposed method.
Figure 5. Framework of the proposed method.
Figure 6. The structure of the enhanced Bneck.
Figure 7. The architecture of R2FN.
Figure 8. The infrared base-layer images corresponding to different α values of mutually guided filtering.
Figure 9. The infrared detail-layer images corresponding to different α values of mutually guided filtering.
Figure 10. Infrared and visible-light fusion results of each algorithm.
Table 1. The average values of objective metrics obtained without using the frequency-domain decomposition module and with the frequency-domain decomposition module included.

| | EN | SD | SSIM | MI | FMI_dct | FMI_w | VIF |
|---|---|---|---|---|---|---|---|
| Exp. 1 | 6.7597 | 42.4632 | 0.7348 | 14.0251 | 0.3783 | 0.4132 | 0.6531 |
| Exp. 2 | 7.0876 | 45.9123 | 0.7564 | 14.2398 | 0.3975 | 0.4371 | 0.7360 |
Table 2. The average values of objective metrics obtained by setting different balancing parameters for the loss functions during the first stage of training.

| μ | λ | EN | SD | SSIM | MI | FMI_dct | FMI_w | VIF |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.1 | 6.6176 | 41.9542 | 0.7021 | 13.4094 | 0.3125 | 0.4003 | 0.6629 |
| 1 | 1 | 6.6268 | 42.1055 | 0.7113 | 13.4598 | 0.3317 | 0.4206 | 0.6831 |
| 1 | 10 | 6.6341 | 42.2104 | 0.7168 | 13.4783 | 0.3354 | 0.4239 | 0.6876 |
| 10 | 0.1 | 6.6194 | 41.9845 | 0.7081 | 13.4297 | 0.3177 | 0.4056 | 0.6687 |
| 10 | 1 | 6.6331 | 42.1588 | 0.7170 | 13.4722 | 0.3380 | 0.4261 | 0.6888 |
| 10 | 10 | 6.6404 | 42.2558 | 0.7195 | 13.4984 | 0.3411 | 0.4291 | 0.6922 |
| 100 | 0.1 | 6.6430 | 42.2992 | 0.7129 | 13.4419 | 0.3233 | 0.4119 | 0.6742 |
| 100 | 1 | **6.6659** | **42.6404** | 0.7186 | **13.8028** | **0.3511** | **0.4387** | **0.7065** |
| 100 | 10 | 6.6593 | 42.4887 | 0.7217 | 13.5089 | 0.3461 | 0.4340 | 0.6969 |
| 1000 | 0.1 | 6.6465 | 42.3545 | 0.7082 | 13.4724 | 0.3294 | 0.4166 | 0.6805 |
| 1000 | 1 | 6.6567 | 42.5179 | 0.7192 | 13.5304 | 0.3394 | 0.4263 | 0.6899 |
| 1000 | 10 | 6.6654 | 42.6293 | **0.7254** | 13.5709 | 0.3448 | 0.4327 | 0.7019 |
Table 3. The average values of objective metrics obtained by setting β to 700 during the second stage of training and varying the values of ω_ir and ω_vi. The bold indicates the optimal value.

| ω_ir | ω_vi | EN | SD | SSIM | MI | FMI_dct | FMI_w | VIF |
|---|---|---|---|---|---|---|---|---|
| 2 | 2 | 6.8374 | 43.3473 | 0.7117 | 13.8329 | 0.3211 | 0.4121 | 0.6826 |
| 3 | 2 | 6.8479 | 43.5035 | 0.7322 | 13.8838 | 0.3415 | 0.4334 | 0.7038 |
| 3 | 3 | 6.8558 | 43.6115 | 0.7380 | 13.9025 | 0.3448 | 0.4366 | 0.7086 |
| 4 | 2 | 6.8397 | 43.3764 | 0.7172 | 13.8530 | 0.3264 | 0.4178 | 0.6886 |
| 4 | 3 | 6.8544 | 43.5587 | 0.7377 | 13.8955 | 0.3475 | 0.4389 | 0.7092 |
| 4 | 4 | 6.8624 | 43.6606 | 0.7412 | 13.9225 | 0.3505 | 0.4419 | 0.7131 |
| 5 | 2 | 6.8655 | 43.7051 | 0.7230 | 13.8641 | 0.3324 | 0.4240 | 0.6945 |
| 5 | 3 | **6.8911** | 43.9084 | **0.7561** | **14.2369** | **0.3607** | **0.4505** | 0.7240 |
| 5 | 4 | 6.8833 | 44.1212 | 0.7473 | 13.9342 | 0.3557 | 0.4475 | 0.7174 |
| 5 | 5 | 6.8694 | 43.7629 | 0.7278 | 13.8762 | 0.3382 | 0.4298 | 0.7003 |
| 6 | 2 | 6.8803 | 43.9310 | 0.7396 | 13.9350 | 0.3489 | 0.4408 | 0.7117 |
| 6 | 3 | 6.8905 | **44.1225** | 0.7515 | 13.9753 | 0.3543 | 0.4461 | 0.6980 |
| 6 | 4 | 6.8730 | 43.8484 | 0.7326 | 13.8878 | 0.3362 | 0.4274 | **0.7272** |
| 6 | 5 | 6.8612 | 43.6517 | 0.7197 | 13.8518 | 0.3240 | 0.4152 | 0.6860 |
| 6 | 6 | 6.8494 | 43.4549 | 0.7066 | 13.8164 | 0.3116 | 0.4030 | 0.6741 |
Table 4. The average values of objective metrics obtained by setting ω_ir to 5 and ω_vi to 3 during the second stage of training while varying the values of β. The bold indicates the optimal value.

| β | EN | SD | SSIM | MI | FMI_dct | FMI_w | VIF |
|---|---|---|---|---|---|---|---|
| 100 | 6.6453 | 43.7460 | 0.7438 | 13.9182 | 0.3405 | 0.4230 | 0.7141 |
| 300 | 6.7841 | 44.6525 | 0.7532 | 14.0203 | 0.3544 | 0.4366 | 0.7324 |
| 500 | **7.0876** | **45.9123** | **0.7564** | **14.2398** | **0.3975** | 0.4371 | **0.7360** |
| 700 | 6.8911 | 43.9084 | 0.7561 | 14.2369 | 0.3607 | **0.4505** | 0.7240 |
| 1000 | 6.7140 | 43.5906 | 0.7505 | 14.1427 | 0.3460 | 0.4312 | 0.7244 |
Table 5. The average quantitative value of each evaluation index. The bold indicates the optimal value.

| Method | EN | SD | SSIM | MI | FMI_dct | FMI_w | VIF | Params |
|---|---|---|---|---|---|---|---|---|
| DeepFuse | 6.7581 | 41.8988 | 0.7329 | 13.5161 | 0.3777 | 0.4146 | 0.6582 | 2.9653M |
| DenseFuse | 6.7634 | 41.9003 | 0.7335 | 13.5268 | 0.3810 | 0.4192 | 0.6573 | 3.1317M |
| NestFuse | 6.9355 | 43.5519 | 0.6942 | 13.8709 | 0.3321 | **0.4378** | 0.7249 | 10.9310M |
| FusionGAN | 6.5440 | 38.3732 | 0.6933 | 13.0673 | 0.2831 | 0.3534 | 0.5082 | 7.9873M |
| U2Fusion | 6.4722 | 30.0504 | 0.7526 | 12.9444 | 0.3077 | 0.3540 | 0.4998 | 5.9896M |
| IFCNN | 6.9521 | 44.5987 | 0.7054 | 13.9041 | 0.3574 | 0.4275 | 0.7279 | 8.4360M |
| Ours | **7.0876** | **45.9123** | **0.7564** | **14.2398** | **0.3975** | 0.4371 | **0.7360** | **2.0432M** |
