In this section, we describe the framework for super-resolution-based semantic-aware infrared and visible image fusion, including its problem formulation, the content loss, and the semantic loss. We also detail the structure of the multi-branch hybrid attention module in the super-resolution network and the architecture of the fusion network built on the comprehensive information extraction module.
3.1. Problem Formulation
In source images with poor resolution, the shapes and contours of categories such as bicycles, pedestrians, and guardrails appear blurred or distorted, significantly diminishing the visual quality and realism of the fused images. To address this issue, we have designed a super-resolution network to enhance the quality and detail of the source images.
Given a pair of aligned visible images $I_{vi}$ and infrared images $I_{ir}$, the quality of the fused images $I_f$ generated by the fusion network largely depends on the customized loss function. To achieve high-quality fusion, we use content loss and semantic loss to jointly constrain the optimization of the fusion network. The overall framework of the super-resolution-based semantic-aware infrared and visible image fusion algorithm is shown in Figure 1.
First, we design a multi-branch hybrid attention module (MHAM) in the super-resolution network to efficiently handle the modal differences between visible and infrared images. We apply MHAM to extract the rich fine-grained detail features of infrared and visible images along the channel and spatial dimensions, which can be represented as follows:

$F' = M_c(F) \otimes F$ (1)

$F'' = M_s(F') \otimes F'$ (2)

where $F$ denotes the features of a visible image or an infrared image, and $\otimes$ denotes element-wise multiplication. $M_c(F)$ denotes the channel attention feature map and $M_s(F')$ denotes the spatial attention feature map. MHAM sequentially infers a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$. Equation (1) represents the optimization of the input feature $F$ by the dual-branch channel attention module to obtain the new feature $F'$. Equation (2) represents the optimization of the input $F'$ by the dual-branch spatial attention module to obtain the final fine-grained detail feature $F''$.
The channel attention module focuses on the correlation between different channels in the feature map, which can be expressed as follows:

$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$ (3)

where $\sigma$ denotes the sigmoid function, and $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote the average pooling and maximum pooling operations performed on the input feature map $F$, respectively; the pooled descriptors are then processed by a multi-layer perceptron (MLP) to obtain the attention weights for each channel. The MLP outputs are summed element-wise and passed through sigmoid activation to obtain the single-branch channel attention feature $M_c(F)$. The two channel attention branches are then multiplied with the input features for adaptive feature optimization to obtain the optimized feature $F'$. Compared to a single-branch channel attention module, the dual-branch channel attention module can identify and focus on important channel features more effectively, thus significantly improving the representation of the feature map.
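To make the channel attention branch concrete, the following PyTorch sketch implements one branch exactly as described above: average and maximum pooling, a shared MLP, element-wise summation, and sigmoid activation. The class name `ChannelAttention` and the reduction ratio `reduction=16` are illustrative assumptions rather than values taken from the paper; the module returns only the attention map $M_c(F)$, and the multiplication with the input feature is applied by the caller.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """One channel attention branch: M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""

    def __init__(self, channels: int, reduction: int = 16):  # reduction ratio is an assumption
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        # Shared MLP applied to both pooled channel descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Element-wise sum of the two MLP outputs, then sigmoid -> per-channel weights
        return self.sigmoid(self.mlp(self.avg_pool(f)) + self.mlp(self.max_pool(f)))

# Usage: adaptive feature optimization F' = M_c(F) ⊗ F
# f_prime = f * ChannelAttention(channels=64)(f)
```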
The spatial attention module focuses on the correlation between different spatial locations in the feature map, which can be expressed as follows:

$M_s(F') = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F'); \mathrm{MaxPool}(F')])\big)$ (4)

where $f^{7 \times 7}(\cdot)$ denotes a convolution operation with a filter size of 7 × 7. The two spatial attention branches are multiplied with the input feature $F'$ to obtain the final fine-grained detail feature $F''$. In this way, the network is able to focus on the detailed features at different spatial locations in the image, thus improving the quality of the image.
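Analogously, a minimal sketch of one spatial attention branch, assuming the formulation described above: the channel-wise average and maximum maps are concatenated and passed through a 7 × 7 convolution and a sigmoid. The module returns the spatial attention map $M_s$, which the caller multiplies with the input feature.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """One spatial attention branch: M_s(F) = sigmoid(f^{7x7}([AvgPool(F); MaxPool(F)]))."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        avg_map = f.mean(dim=1, keepdim=True)        # channel-wise average pooling
        max_map = f.max(dim=1, keepdim=True).values  # channel-wise maximum pooling
        return self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))

# Usage: F'' = M_s(F') ⊗ F'
# f_double_prime = f_prime * SpatialAttention()(f_prime)
```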
In addition, to fully extract the complementary information of infrared and visible images, we design the STDC module in the fusion network. More specifically, window-based multi-head self-attention (W-MSA) and depth-wise convolution (Dwconv) are made to interact bidirectionally inside the Swin Transformer block, so as to fully extract global and local detail information while efficiently processing high-resolution images. The process of extracting complementary information with the STDC module is represented as follows:
$\hat{F}_I^{\,l} = \mathrm{MIX}\big(\mathrm{W\text{-}MSA}(\mathrm{LN}(F_I^{\,l-1})),\ \mathrm{Dwconv}(F_I^{\,l-1})\big) + F_I^{\,l-1}$ (5)

$F_I^{\,l} = \mathrm{MLP}(\mathrm{LN}(\hat{F}_I^{\,l})) + \hat{F}_I^{\,l}$ (6)

$\hat{F}_I^{\,l+1} = \mathrm{MIX}\big(\mathrm{SW\text{-}MSA}(\mathrm{LN}(F_I^{\,l})),\ \mathrm{Dwconv}(F_I^{\,l})\big) + F_I^{\,l}$ (7)

$F_{\mathrm{STDC}} = \mathrm{MLP}(\mathrm{LN}(\hat{F}_I^{\,l+1})) + \hat{F}_I^{\,l+1}$ (8)

where $F_I^{\,l-1}$ denotes the input features of the infrared and visible images, and LN denotes the LayerNorm layer, which is used to normalize the input features. W-MSA is used to capture information within a local region, and Dwconv is used to extract features efficiently. Equation (6) indicates that, after normalization of $\hat{F}_I^{\,l}$, the feature transformation is performed by the MLP and added to the input $\hat{F}_I^{\,l}$ to obtain the updated feature $F_I^{\,l}$. SW-MSA denotes shifted-window-based multi-head self-attention, which captures a wider range of features. Pairing W-MSA and SW-MSA effectively saves computation when processing high-resolution images [47].
$F_{\mathrm{STDC}}$ denotes the multi-modal information output from the STDC module, and MIX denotes the feature mixing function that realizes the bidirectional interaction between the W-MSA block and the Dwconv block. The MIX function first projects the input features onto the parallel branches through the normalization layer, and then mixes the complementary features of the source image following the steps shown in Figure 2 and Figure 3.
In W-MSA, the image is partitioned into multiple small windows and the self-attention computation is performed independently within each window. Since the size of each window is much smaller than the entire image, the number of elements involved in each computation is greatly reduced, and thus the overall computation is significantly reduced. Combining W-MSA and Dwconv allows adequate and efficient extraction of local features from the source image. In SW-MSA, the window is shifted horizontally or vertically (e.g., shifted half a window's distance to the right and downward) so that each window overlaps a portion of its neighboring windows, thus realizing cross-window information exchange. Ultimately, this design not only improves processing efficiency, but also enhances the model's ability to capture global and local information.
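The computational saving of W-MSA comes from restricting self-attention to non-overlapping windows. The helpers below, adapted from the reference Swin Transformer implementation [47] and shown purely as an illustration, partition a feature map into windows and merge it back; self-attention cost then grows linearly with the number of windows instead of quadratically with the full image size.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (num_windows * B, window_size, window_size, C)."""
    b, h, w, c = x.shape
    x = x.view(b, h // window_size, window_size, w // window_size, window_size, c)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, c)

def window_reverse(windows: torch.Tensor, window_size: int, h: int, w: int) -> torch.Tensor:
    """Inverse of window_partition: merge windows back into a (B, H, W, C) feature map."""
    b = windows.shape[0] // ((h // window_size) * (w // window_size))
    x = windows.view(b, h // window_size, w // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(b, h, w, -1)

# Self-attention is computed independently inside each window; for SW-MSA the feature map is
# first shifted (e.g. torch.roll by window_size // 2 along H and W) so that the new windows
# overlap neighbouring windows of the previous layer, enabling cross-window information exchange.
```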
Meanwhile, considering the requirements of high-level vision tasks on fused images, semantic loss is employed to guide the fusion network to retain the semantic information in the source images to a greater extent. More specifically, a segmentation network $S$ is added to segment the fusion result $I_f$. The gap between the segmentation result $I_s \in \mathbb{R}^{H \times W \times C}$ and the semantic label $L_s \in \mathbb{R}^{H \times W}$ is denoted as the semantic loss $\mathcal{L}_{semantic}$, where $H$ and $W$ are the height and width of the source image, respectively, and $C$ denotes the number of object categories in the source image. The semantic segmentation process can be represented as follows:

$I_s = S(I_f)$

The size of the semantic loss $\mathcal{L}_{semantic}$ reflects the richness of the semantic information contained in the fused image, which can be expressed as follows:

$\mathcal{L}_{semantic} = \varphi(I_s, L_s)$

where $\varphi(\cdot)$ denotes the error function.
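As a hedged illustration of the semantic loss, the snippet below assumes that the segmentation network $S$ outputs per-pixel class scores for the fused image and that the error function $\varphi$ is the pixel-wise cross-entropy between the segmentation result and the label map; the authors' actual choice of error function may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_loss(seg_net: nn.Module, fused: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_semantic = phi(I_s, L_s), with phi assumed here to be pixel-wise cross-entropy.

    fused:  fused image I_f of shape (B, C_in, H, W)
    labels: integer semantic labels L_s of shape (B, H, W) with C object categories
    """
    logits = seg_net(fused)                 # I_s = S(I_f), shape (B, C, H, W)
    return F.cross_entropy(logits, labels)  # averaged over all pixels
```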
3.3. Super-Resolution Network Framework
To address the problem that many images in existing datasets suffer from blurred picture quality, we introduce a super-resolution network to enhance the quality and details of the source images, which are then fed into the fusion network for effective fusion. Meanwhile, considering the differences between multi-modal images, a multi-branch hybrid attention module (MHAM) is designed in the super-resolution network to enhance the network's ability to characterize multi-modal features. The MHAM-based super-resolution network for infrared and visible images is shown in
Figure 4.
As shown in
Figure 4, the super-resolution network consists of two parts: feature extraction and image reconstruction. The feature extraction part includes a convolutional layer with a kernel size of 3 × 3, a Leaky ReLU activation layer, and deep residual blocks. The convolutional layer and the Leaky ReLU activation function extract shallow features of the source image; multiple deep residual blocks then extract finer-grained features from these shallow features while ensuring the stability of the whole network.
The image reconstruction part consists of a convolutional layer with a kernel size of 3 × 3, a Leaky ReLU activation layer, an upsampling layer, and a sigmoid activation layer. The feature map output from the residual blocks passes through the convolutional layer and the Leaky ReLU activation layer, is enlarged by a factor of two in the upsampling layer, and then passes through the sigmoid activation layer to obtain the final high-quality, high-resolution image.
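A minimal sketch of this super-resolution pipeline, assuming ×2 upsampling and a hypothetical `ResidualBlock` placeholder (into which the MHAM module would be inserted); it mirrors only the layer ordering given in the text, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Placeholder deep residual block: conv-BN-LeakyReLU-conv-BN with a skip connection."""

    def __init__(self, features: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(features, features, 3, padding=1),
            nn.BatchNorm2d(features),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(features, features, 3, padding=1),
            nn.BatchNorm2d(features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)

class SuperResolutionNet(nn.Module):
    """Feature extraction (3x3 conv + LeakyReLU + residual blocks) followed by x2 reconstruction."""

    def __init__(self, in_channels: int = 1, features: int = 64, num_blocks: int = 8):
        super().__init__()
        self.shallow = nn.Sequential(                     # shallow feature extraction
            nn.Conv2d(in_channels, features, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.residual_blocks = nn.Sequential(*[ResidualBlock(features) for _ in range(num_blocks)])
        self.reconstruct = nn.Sequential(                 # image reconstruction
            nn.Conv2d(features, features, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Upsample(scale_factor=2, mode="nearest"),  # enlarge by a factor of two
            nn.Conv2d(features, in_channels, 3, padding=1),
            nn.Sigmoid(),                                 # final high-resolution output in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.shallow(x)
        f = self.residual_blocks(f)
        return self.reconstruct(f)
```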
A batch normalization layer is included in the residual module to address the problem of information loss. In particular, the multi-branch hybrid attention module (MHAM) is designed to enable the network to process multi-modal images effectively. The design of the MHAM module is shown in Figure 5. The MHAM module draws on the idea of CBAM (Convolutional Block Attention Module) [61] by adding an additional one-way channel attention module and an additional one-way spatial attention module; this design aims to obtain richer fine-grained features, thus increasing the network's ability to understand the variability of multi-modal features.
As can be observed from Figure 5, the shallow features of the source image are multiplied bitwise in the multiplier with the output features of the two channel attention modules to obtain the channel attention features; the channel attention features are then multiplied bitwise in the multiplier with the output features of the two spatial attention modules to obtain the rich fine-grained features. The internal structures of the channel attention module (CAM) and the spatial attention module (SAM) are shown in Figure 6 and Figure 7, respectively. Specifically, the shallow features $F$ of the source image undergo average pooling and maximum pooling operations and are then fed into the shared multi-layer perceptron (MLP) to obtain the attention weights of each channel. The features output from the MLP are summed element-wise and passed through sigmoid activation to obtain the channel attention feature map $M_c(F)$. Next, the channel attention feature map is input to the spatial attention module, where it again undergoes average pooling and maximum pooling operations, followed by a convolution operation with a kernel size of 7 × 7 and finally the sigmoid activation function, to obtain the final spatial attention feature map $M_s$.
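Putting the pieces together, the following sketch wires two channel attention branches and two spatial attention branches in the bitwise-multiplication arrangement just described; it reuses the `ChannelAttention` and `SpatialAttention` sketches from Section 3.1 and is an assumed composition, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MHAM(nn.Module):
    """Multi-branch Hybrid Attention Module: dual channel branches, then dual spatial branches."""

    def __init__(self, channels: int):
        super().__init__()
        self.cam1 = ChannelAttention(channels)  # two one-way channel attention modules
        self.cam2 = ChannelAttention(channels)
        self.sam1 = SpatialAttention()          # two one-way spatial attention modules
        self.sam2 = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Shallow features multiplied bitwise with the outputs of both channel attention modules
        f_c = f * self.cam1(f) * self.cam2(f)
        # Channel attention features multiplied bitwise with the outputs of both spatial modules
        return f_c * self.sam1(f_c) * self.sam2(f_c)
```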
3.4. Fusion Network Framework
In order to comprehensively acquire the fine-grained complementary information of the source image, as well as to efficiently process high-resolution images, we propose an STDC-based fusion network for infrared and visible images, as shown in
Figure 8. The fusion network consists of a feature extractor and an image reconstructor, where the feature extractor contains a patch partition layer, a linear embedding layer, a comprehensive information extraction module, and a patch merging layer.
As shown in
Figure 8, the patch partition layer specifies that each 4 × 4 neighborhood of pixels forms a patch, and the linear embedding layer applies a linear transformation to the channel data of each pixel; a source image of size $H \times W$ therefore yields a feature map of size $\frac{H}{4} \times \frac{W}{4} \times C$ after passing through the patch partition layer and the linear embedding layer. Except for Stage 1, in which the patch partition layer and linear embedding layer are passed first, the remaining three stages are first downsampled by the patch merging layer; after each such downsampling, the height and width of the feature map are halved and the depth is doubled. The designed STDC module can not only accelerate the training of the network but also fully obtain the complementary information in the infrared and visible images. The specific design of the STDC module is shown in Figure 2. The STDC module consists of the Swin Transformer block and the Dwconv (depth-wise convolution) block. The Swin Transformer block includes a normalization layer, a multi-layer perceptron (MLP) layer, a W-MSA (window-based multi-head self-attention) block, and an SW-MSA (shifted-window-based multi-head self-attention) block.
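Each pass through the patch merging layer halves the height and width of the feature map and doubles its depth, as described above. A minimal sketch, assuming the standard Swin Transformer patch merging scheme [47] in which every 2 × 2 group of neighbouring tokens is concatenated and the resulting 4C channels are reduced to 2C by a linear layer:

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsampling: (B, H, W, C) -> (B, H/2, W/2, 2C); H and W halved, depth doubled."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0 = x[:, 0::2, 0::2, :]  # top-left pixel of every 2x2 neighbourhood
        x1 = x[:, 1::2, 0::2, :]  # bottom-left
        x2 = x[:, 0::2, 1::2, :]  # top-right
        x3 = x[:, 1::2, 1::2, :]  # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```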
After downsampling by the patch merging layer, the feature map is divided into multiple disjoint regions (windows), and each window is processed by W-MSA and SW-MSA in pairs, which effectively reduces the amount of computation for high-resolution images. A detailed description of W-MSA and SW-MSA can be found in the original paper [47]. The Dwconv block is introduced into the Swin Transformer block to flexibly exploit the respective advantages of the Swin Transformer and depth-wise convolution, so as to effectively extract the global and local information of the source image. Meanwhile, the W-MSA block and the Dwconv block are designed to interact with each other bidirectionally, and the detailed design is shown in
Figure 3.
In
Figure 3, the Dwconv block and the W-MSA block interact bidirectionally through channel and spatial interaction. The local information extracted by the Dwconv block is fed into the multiplier for feature fusion and enhancement through channel interaction with the feature map, and the output is sent to the W-MSA block for window-based self-attention; the global features output from the W-MSA block are then fed into the multiplier through spatial interaction to fuse and enhance the local features. This design fully utilizes the interactions between the modules to achieve adequate extraction of fine-grained complementary information from the source image, so that the fusion network can generate fused images with superior visual perception.
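As a rough illustration of this bidirectional interaction, the sketch below assumes a channel-interaction path (local Dwconv features re-weight the feature map channel-wise before it enters the W-MSA block) and a spatial-interaction path (global W-MSA features re-weight the local features spatially before the two are combined). The module name, the gating layers, and the final additive combination are assumptions made for readability; the actual design follows Figure 3, which is not reproduced here.

```python
import torch
import torch.nn as nn

class BidirectionalMix(nn.Module):
    """Hypothetical sketch of the channel/spatial interaction between Dwconv and W-MSA branches."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise convolution branch
        self.channel_gate = nn.Sequential(                           # channel interaction (assumed)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, kernel_size=1), nn.Sigmoid()
        )
        self.spatial_gate = nn.Sequential(                           # spatial interaction (assumed)
            nn.Conv2d(dim, 1, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor, w_msa: nn.Module) -> torch.Tensor:
        local = self.dwconv(x)                          # local detail features
        x_gated = x * self.channel_gate(local)          # channel interaction -> W-MSA input
        global_feat = w_msa(x_gated)                    # window-based self-attention branch
        local = local * self.spatial_gate(global_feat)  # spatial interaction -> enhanced local
        return global_feat + local                      # mixed complementary features (assumed)
```

Here `w_msa` is assumed to be a window-based self-attention module operating on feature maps of shape (B, C, H, W).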