3.2. Edge-Consistent Detail Layers Fusion
Existing simple average or maximum fusion schemes operate on each pixel of the spatial image independently. However, the detail layers obtained from NSST contain rich directional texture information, which such simple schemes cannot fully exploit. To illustrate this effect, Figure 3 offers a one-dimensional signal example. The visible image predominantly contains regular edge signals, whereas the infrared image includes not only edge signals but also some noise. Employing a simple maximum or average strategy results in a fused outcome that contains noise: when the noise value exceeds the valuable signal in the other input image, the maximum strategy lets the noise completely submerge that signal, while the average strategy diminishes the valuable edge information. In contrast, our proposed method retains edge information from the input images even in the presence of noise interference. The proposed edge-consistent fusion module consists of three main parts: bilateral transpose correlation, smoothing, and activity-based fusion. It is important to note that the bilateral transpose correlation and smoothing are designed to obtain the activity map rather than to produce the images used for subsequent fusion. The edge-consistent fusion module preserves the integrity of edges and textures while avoiding artifacts in the fusion result.
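To make the effect illustrated in Figure 3 concrete, the following small NumPy sketch (our own illustrative example, not code or data from the paper) compares maximum and average fusion on two synthetic one-dimensional signals, one with a clean edge and one with a weaker edge plus a noise spike:

```python
import numpy as np

# Synthetic 1D signals in the spirit of Figure 3 (values are illustrative only).
visible = np.zeros(10)
visible[3:6] = 0.8            # a clean edge in the visible signal

infrared = np.zeros(10)
infrared[3:6] = 0.5           # a weaker edge in the infrared signal
infrared[8] = 1.0             # an isolated noise spike

max_fused = np.maximum(visible, infrared)   # the noise spike survives untouched
avg_fused = (visible + infrared) / 2.0      # the edge amplitude is halved

print(max_fused)   # edge kept, but so is the noise
print(avg_fused)   # noise attenuated, but the edge is weakened
```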
Firstly, the schematic diagram of the transpose correlation is shown in Figure 4. The larger absolute values in the detail layers correspond to sharper brightness changes and thus to the salient features in the image, such as edges, lines, and region boundaries. It can be observed that the overlapping regions during the kernel movement add up. When the pixels represented by red and blue belong to an edge, the accumulation further enhances the edge. When neither of these pixels belongs to an edge, the accumulated value is relatively smaller. The size of the output image of the transposed correlation can be calculated using Equation (1):

O = (I − 1) × s + k, (1)

where O represents the size of the output image, I represents the input size, k represents the kernel size, and s represents the stride size. In this study, we simply choose s as 1 and k as 3.
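As a minimal sketch of how overlapping kernel contributions accumulate during the transposed correlation, the one-dimensional NumPy example below uses the paper's settings s = 1 and k = 3 (the function and its all-ones kernel are our illustrative assumptions, not the bilateral kernel defined later):

```python
import numpy as np

def transpose_correlate_1d(signal, kernel, stride=1):
    """Transposed correlation: every input sample scatters a scaled copy of the
    kernel into the output, and overlapping contributions add up."""
    i, k = len(signal), len(kernel)
    out = np.zeros((i - 1) * stride + k)            # O = (I - 1) * s + k
    for idx, value in enumerate(signal):
        start = idx * stride
        out[start:start + k] += value * kernel      # overlapping regions accumulate
    return out

signal = np.array([0.0, 1.0, 1.0, 0.0])             # a small edge
kernel = np.ones(3)                                  # k = 3, s = 1 as in the paper
print(transpose_correlate_1d(signal, kernel))        # length (4 - 1) * 1 + 3 = 6
```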
The design of the kernel is crucial to the performance of the transpose correlation. Simple fixed kernels, while computationally efficient, often yield limited results. The commonly used Gaussian kernel considers only spatial distances and neglects the sharp changes at edges, leading to blurred edges. Bilateral filtering is a successful method, but its purpose is noise filtering rather than image fusion. Therefore, we propose a novel non-iterative kernel construction method suitable for transpose correlation. This method considers both value distances and spatial distances, enabling the adaptive enhancement of edges and textures.
To obtain a bilateral kernel that can enhance both edges and textures, the spatial distance matrix is first defined as in Equation (2). Since the pixel values of edges are themselves larger or smaller than their surroundings, they should be less influenced by the surrounding non-edge pixels; choosing the cube of the distance is therefore necessary to reduce the weight of the surrounding pixels. If the surrounding pixels all belong to edges, they instead mutually enhance each other's effects. In Equation (2), the matrix entries are computed from each pixel's distance to the current window's center coordinates, and a spatial standard deviation controls how quickly the weights decay with that distance.
To account for pixel value differences in the image, the value distance matrix is defined as given in Equation (3). As the variations in edge regions are significant, it is essential to emphasize the extent of the pixel value changes in addition to the spatial distances. Taking a grayscale image as an example, pixel value changes often exceed 100, which may lead to very small values in the numerator. To avoid an unnatural reduction in pixel values, normalization is performed. Inspired by softmax, we employ an exponential function for the normalization, which makes the value distance matrix more sensitive to pixel value changes. In Equation (3), each entry of the value distance matrix is computed from the difference between a pixel in the window Q corresponding to the current kernel and the pixel value at the kernel center, scaled by a value distance standard deviation.
With the spatial distance matrix and value distance matrix of the same size now defined, the bilateral kernel K can be formulated as shown in Equation (4). This kernel takes into account both the pixel value differences between pixels and the influence of their spatial distances, and it is the kernel used in the bilateral transpose correlation.
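The exact functional forms of Equations (2)–(4) are not reproduced here; the sketch below only illustrates the described construction, assuming a cubed spatial distance term, a softmax-style exponential value distance term, and an element-wise product of the two (all of these concrete choices are our assumptions):

```python
import numpy as np

def bilateral_kernel(window, sigma_s=1.0, sigma_v=0.1):
    """Hedged sketch of a bilateral kernel for the transpose correlation.

    `window` is the k x k patch of the detail layer centred on the current pixel.
    The cubed spatial distance, the exponential value term, and the element-wise
    combination are assumptions based on the textual description, not the paper's
    exact Equations (2)-(4).
    """
    k = window.shape[0]
    c = k // 2
    ys, xs = np.mgrid[0:k, 0:k]
    dist_cubed = ((xs - c) ** 2 + (ys - c) ** 2) ** 1.5        # cube of the distance
    spatial = np.exp(-dist_cubed / (2.0 * sigma_s ** 2))       # spatial distance matrix
    value = np.exp(-np.abs(window - window[c, c]) / sigma_v)   # value distance matrix
    return spatial * value                                     # bilateral kernel K
```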
After the bilateral transpose correlation, the output is larger than the source image, so it needs to be restored to the same size as the source image. The edges enhanced by the bilateral transpose correlation exhibit more pronounced distinctions between the infrared and visible images; however, accentuating the edges can introduce artifacts in their vicinity. Smoothing is therefore introduced to ensure smoother transitions between edges and surrounding pixels, preserving edges while achieving a more natural fusion outcome. This step accomplishes the consistency verification of edges and their neighboring regions. The values in the smoothing kernel are set to 1 in this paper. The smoothing is defined as Equation (5):

M(x, y) = Σ_{m=1..k} Σ_{n=1..k} K(m, n) · I(x + m − 1, y + n − 1), (5)

where M represents the result of the correlation, I denotes the input image, K refers to the correlation kernel, and k represents the size of the correlation kernel.
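A minimal sketch of this smoothing step is a valid cross-correlation with an all-ones k × k kernel. Note that, with s = 1, the transposed correlation enlarges each dimension by k − 1, and a valid k × k correlation shrinks it by k − 1 again, which is consistent with restoring the output to the source image size (the padding-free "valid" choice is our assumption):

```python
import numpy as np

def smooth(image, k=3):
    """Valid cross-correlation with an all-ones k x k kernel, as in Equation (5)."""
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = image[y:y + k, x:x + k].sum()   # K(m, n) = 1 everywhere
    return out
```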
After processing the infrared and visible detail layers as described above, we obtain corresponding images with enhanced edges and textures. To visually demonstrate the effects of the bilateral transpose correlation and smoothing, Figure 5 revisits the one-dimensional signal from Figure 3. It is evident that the bilateral transpose correlation amplifies the differences at the edges of the image; however, some noise in the infrared image still exceeds the edges in the visible image. The smoothing then raises all edges above the noise values, ensuring that the edges in the fusion result are continuous and intact.
To obtain the activity map, we define higher pixel values in the grayscale image as having higher activity and retain the positions with higher activity. The activity map is generated from the enhanced visible detail layers and the enhanced infrared detail layers by comparing them position by position and recording which of the two is more active at each position.
Guided by the activity map, the fused detail layer is obtained by taking, at each position, the coefficient from the source detail layer whose enhanced counterpart has the higher activity.
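One plausible realization of this activity-based selection is sketched below; the binary-map formulation and the ≥ comparison are our assumptions rather than the paper's exact equations:

```python
import numpy as np

def fuse_detail_layers(d_vis, d_ir, e_vis, e_ir):
    """Activity-based detail-layer fusion (assumed binary-selection form).

    d_vis, d_ir : original NSST detail layers of the visible/infrared images
    e_vis, e_ir : the same layers after bilateral transpose correlation and
                  smoothing; they are used only to build the activity map
    """
    activity = (e_vis >= e_ir).astype(d_vis.dtype)   # 1: visible is more active
    return activity * d_vis + (1.0 - activity) * d_ir
```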
Figure 6 presents the visualized results of the aforementioned process. The NSST decomposition yields multiple detail layers, each corresponding to a different scale and orientation of detail. We illustrate the effectiveness of our proposed approach on one of these detail layers. As observed in the figure, the source visible and infrared detail layers contain distinct information. After applying the bilateral kernel, the edge and texture information is enhanced. The coherence of the chosen pixels, evident in the activity map, underscores the advantage of regionally consistent fusion, ensuring the continuity and integrity of edges and textures. The final fused result not only retains the principal features of both the infrared and visible images but also captures intricate and comprehensive details.
Note that edge-consistent fusion is a generic multi-modal image fusion method; we take the IVIF task only as an example to explain its operation.
3.3. Correlation-Driven Base Layers Fusion
Both the infrared and visible images are captured from the same scene, thus containing statistically similar information such as background and objects. However, due to their different modalities, they also possess independent information, such as the rich textures in the visible images and the thermal radiation information in the infrared images. Therefore, our objective is to promote the extraction of modality-specific features and modality-shared features by constraining the correlation between different modality images.
The correlation-driven network contains three main modules, i.e., a transformer-based encoder for feature extraction, a fusion layer to fuse visible and infrared features, and a decoder for generating fusion images.
3.3.1. Encoder
The residual architecture design of the encoder draws inspiration from ResNet [25], enabling the global encoder to extract shared cross-modal features while also ensuring that the encoder captures modality-specific features for each modality.
First, we fix the setting for clarity in the formulation. The network takes a paired infrared image and visible image as input, which are processed by the global feature encoder and the transformer block described below.
Global encoder. The global encoder aims to extract global features from the visible and infrared images. The Restormer block can extract global features by applying self-attention across the feature dimension, so we employ it to extract cross-modality global features without adding too much computation. The detailed architecture of the Restormer block can be found in Appendix A or in the original paper [26].
Transformer block. The transformer block receives the output of the residual structure, retaining both the features shared across modalities and the characteristics specific to each modality. In addition, the transformer block employs the self-attention mechanism, enhancing the network's ability to focus on the features most relevant to fusion. Applying it to the visible branch yields the encoded feature of the visible image V, and the infrared feature is obtained in the same way. Because the balance between performance and computational efficiency is important, the LT block [27] is chosen as the basic unit of the transformer block. The LT block shrinks the embedding to reduce the number of parameters while preserving performance.
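To make the data flow concrete, a schematic PyTorch-style sketch of the encoder is shown below. RestormerBlock and LTBlock are simple stand-ins for the cited designs [26,27], and the shallow embedding, shared global stage, and residual connection are assumptions drawn from the description above, not the paper's exact architecture:

```python
import torch.nn as nn

# Simple stand-ins; the real Restormer block [26] and LT block [27] use
# channel-wise self-attention and shrunken embeddings, not reproduced here.
def RestormerBlock(dim):
    return nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())

def LTBlock(dim):
    return nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())

class Encoder(nn.Module):
    """Schematic data flow only: a shared global stage feeding modality-specific
    stages through a residual connection (architectural details are assumptions)."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Conv2d(1, dim, 3, padding=1)   # shallow embedding (assumed)
        self.global_enc = RestormerBlock(dim)          # cross-modality global features
        self.vis_enc = LTBlock(dim)                    # modality-specific (visible)
        self.ir_enc = LTBlock(dim)                     # modality-specific (infrared)

    def forward(self, vis, ir):
        g_vis = self.global_enc(self.embed(vis))       # global/shared features
        g_ir = self.global_enc(self.embed(ir))
        f_vis = g_vis + self.vis_enc(g_vis)            # residual keeps shared + specific
        f_ir = g_ir + self.ir_enc(g_ir)
        return g_vis, g_ir, f_vis, f_ir
```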
3.3.2. Feature Fusion Layer
The visible and infrared features are combined using an element-wise average strategy, and the result is used as the input to the fusion layer. Considering that the inductive bias for feature fusion should be similar to that for feature extraction in the encoder, we also employ LT blocks in the fusion layer.
3.3.3. Decoder
The decoder reconstructs the fused features into the fused base layer. Since the inputs here involve cross-modality features, we keep the decoder structure consistent with the design of the global encoder.
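Putting these pieces together, the base-layer fusion path can be sketched as below (again schematic; it reuses the placeholder Encoder, LTBlock, and RestormerBlock from the encoder sketch above, and the averaging and output projection are assumptions):

```python
import torch.nn as nn

class BaseFusionNet(nn.Module):
    """Schematic: encoder -> element-wise average of features -> LT-style fusion
    layer -> Restormer-style decoder producing the fused base layer."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = Encoder(dim)
        self.fusion = LTBlock(dim)                              # fusion layer
        self.decoder = nn.Sequential(RestormerBlock(dim),
                                     nn.Conv2d(dim, 1, 3, padding=1))

    def forward(self, base_vis, base_ir):
        _, _, f_vis, f_ir = self.encoder(base_vis, base_ir)
        fused = self.fusion((f_vis + f_ir) / 2.0)               # element-wise average
        return self.decoder(fused)                              # fused base layer
```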
3.3.4. Loss Function
There is no ground truth for the IVIF task. We therefore introduce an intensity loss, a gradient loss, and a correlation loss to constrain the visual quality of the fusion results.
The full objective function of the progressive fusion network is a weighted combination of the intensity loss, the gradient loss, and the correlation loss, with three tuning parameters controlling the weight of each term.
The intensity loss quantifies the disparity between the fused image and the more salient regions of the infrared and visible images. Specifically, it measures the norm of the difference between the fused image and the element-wise maximum of the infrared and visible images, averaged over the image, where H and W denote the height and width of the input image.
Moreover, we expect the fused image to maintain the optimal intensity distribution while preserving abundant texture details. The optimal texture of the fused image can be expressed as the maximum aggregate of the infrared and visible image textures. Therefore, a texture loss is introduced to force the fused image to contain more texture information: it penalizes the difference between the gradient of the fused image and the maximum aggregate of the source image gradients, where ∇ indicates the Sobel gradient operator.
The above losses aim to ensure that the fusion results closely resemble the source images. However, they do not explicitly exploit the prior knowledge that the two modalities correspond to the same scene. Given that both the infrared and visible images capture the same scene, the background and common large-scale features are evidently correlated. To address this, we introduce a correlation loss that encourages the global encoder to learn this related information while easing the subsequent modules' task of extracting modality-specific features. The correlation loss is defined with a correlation coefficient operator, and a constant in the loss is set to 2 to ensure that the term always remains positive.
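The three loss terms can be sketched as follows (a hedged PyTorch-style illustration: the ℓ1 choice for the intensity and texture terms, the gradient-magnitude form, and the "2 minus a correlation coefficient between the two modalities' global features" form of the correlation term are all our assumptions):

```python
import torch
import torch.nn.functional as F

def intensity_loss(fused, ir, vis):
    # Mean |fused - max(ir, vis)| over H x W; the l1 norm is an assumption.
    return torch.mean(torch.abs(fused - torch.maximum(ir, vis)))

def sobel_grad(x):
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    return torch.abs(F.conv2d(x, kx, padding=1)) + torch.abs(F.conv2d(x, ky, padding=1))

def texture_loss(fused, ir, vis):
    # Push the fused gradients toward the element-wise maximum of the source gradients.
    target = torch.maximum(sobel_grad(ir), sobel_grad(vis))
    return torch.mean(torch.abs(sobel_grad(fused) - target))

def correlation_loss(g_vis, g_ir):
    # 2 minus a correlation coefficient between the global features (assumed form):
    # always positive, and it decreases as the two feature sets become correlated.
    v = g_vis.flatten() - g_vis.mean()
    i = g_ir.flatten() - g_ir.mean()
    cc = (v * i).sum() / (v.norm() * i.norm() + 1e-8)
    return 2.0 - cc

def total_loss(fused, ir, vis, g_vis, g_ir, a=1.0, b=1.0, c=1.0):
    # Weighted combination; a, b, c stand in for the paper's tuning parameters.
    return (a * intensity_loss(fused, ir, vis)
            + b * texture_loss(fused, ir, vis)
            + c * correlation_loss(g_vis, g_ir))
```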
In conclusion, our correlation-driven fusion network effectively preserves salient regions and details from the source images while actively focusing on the correlation between the two modalities. Therefore, this network utilizes the shared and distinct information from both modalities, leading to meaningful and efficient fusion results.
After the fusion of the detail and base layers, the inverse NSST is employed to reconstruct the final fused image in the spatial domain.