Article

MPFINet: A Multilevel Parallel Feature Injection Network for Panchromatic and Multispectral Image Fusion

1 Engineering Research Center of Cyberspace, Yunnan University, Kunming 650000, China
2 School of Software, Yunnan University, Kunming 650000, China
3 School of Information, Yunnan Normal University, Kunming 650000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(23), 6118; https://doi.org/10.3390/rs14236118
Submission received: 21 October 2022 / Revised: 27 November 2022 / Accepted: 28 November 2022 / Published: 2 December 2022
(This article belongs to the Special Issue Pansharpening and Beyond in the Deep Learning Era)

Abstract

The fusion of a high-spatial-resolution panchromatic (PAN) image and a corresponding low-resolution multispectral (MS) image can yield a high-resolution multispectral (HRMS) image; this process is known as pansharpening. Most previous methods based on convolutional neural networks (CNNs) have achieved remarkable results. However, information at different scales has not been fully mined and utilized, and the fused results still suffer from spectral and spatial distortion. In this work, we propose a multilevel parallel feature injection network that contains three scale levels and two parallel branches. In the feature extraction branch, a multi-scale perception dynamic convolution dense block is proposed to adaptively extract the spatial and spectral information. The resulting multilevel features are then injected into the image reconstruction branch, and an attention fusion module based on the spectral dimension is designed to fuse shallow contextual features and deep semantic features. In the image reconstruction branch, cascaded transformer blocks are employed to capture the similarities among the spectral bands of the MS image. Extensive experiments conducted on the QuickBird and WorldView-3 datasets demonstrate that MPFINet achieves significant improvements over several state-of-the-art methods in both spatial and spectral quality assessments.

1. Introduction

Remote sensing images record the electromagnetic radiation of objects on the Earth's surface; the information they contain not only indicates the distribution of ground objects but also records changes in the interactions between them. With the rapid development of satellite remote sensing technology, remote sensing has found increasingly diverse applications [1,2,3], including disaster warning, land cover classification, geological observation, and military reconnaissance. However, PAN and MS images of the same scene can only be captured separately due to the limitations of the sensors mounted on satellites. Single-band PAN images provide rich spatial information, which is helpful for differentiating the details of ground objects, at the cost of limited spectral information. In contrast, MS images, characterized by multiple spectral bands, are complementary to PAN images in terms of spectral and spatial information. To meet the demands of subsequent applications, research on pansharpening has attracted extensive attention from the scientific community [4,5].
Various pansharpening techniques have been proposed and developed over the past decades, which can roughly be divided into four major classes: component substitution (CS), multi-resolution analysis (MRA), variational optimization-based (VO), and deep learning (DL) methods.
CS methods convert the MS image into a specific domain to separate its spatial and spectral information. After histogram matching to obtain the same mean and variance, the spatial component of the MS image is replaced with the matched PAN band under certain substitution principles. Finally, an inverse transformation projects the substituted image back to the original domain. Classic examples of CS methods are intensity-hue-saturation (IHS) [6], principal component analysis (PCA) [7], and the adaptive Gram–Schmidt (GSA) method [8]. Although CS methods achieve high spatial fidelity at a low computational cost, the fused images often suffer from serious spectral distortion because of the mismatch between the spectral ranges of the MS and PAN images [9].
Numerous MRA methods, such as smoothing filter-based intensity modulation (SFIM) [10], the Laplacian pyramid (LP) [11], and the generalized LP based on Gaussian filters matching the modulation transfer function (MTF_GLP) [12], have been proposed to effectively preserve spectral information. The main idea is to inject details from the PAN image into the MS image. More specifically, spatial information is extracted from the PAN image through multi-resolution decomposition and is then injected into the corresponding scale of the MS image. Compared with CS methods, MRA methods provide higher spectral fidelity because the information in the MS image is completely preserved. However, MRA methods are quite sensitive to registration errors [13]. Moreover, the fused images may exhibit ringing artifacts, blurring, and aliasing effects due to the filtering operations and excessive detail injection.
VO methods design models with adaptive optimization algorithms to establish a mathematical mapping among the PAN image, the MS image, and the reference image; the fused high-resolution image is then predicted by the resulting model. Typical examples in this category include coupled non-negative matrix factorization (CNMF) [14], sparse representation-based methods [15,16], variational models [17], and Bayesian methods [18]. By and large, they achieve considerable improvements in performance and alleviate the insufficient feature extraction caused by the fixed parameters of conventional methods. Nevertheless, the large number of hyper-parameters also leads to complicated calculations.
Recently, with the help of excellent learning and nonlinear mapping capabilities, deep learning models have achieved outstanding performance in various image processing domains [19,20,21]. Motivated by great success in the field of single image super-resolution (SISR), a lot of advanced networks have emerged for pansharpening. A three-layer pansharpening network (PNN) was proposed in [22], which had a structure similar to the super-resolution convolutional neural network (SRCNN) in [23]. It was one of the early works on convolutional neural networks for pansharpening, whose results were far better than those of conventional methods. To address the problem of inadequate data in supervised training, the modality of target-adaptive fine-tuning was adopted in PNN (APNN) [24] to overcome the possible gap between the processes of training and testing. Wei et al. [25] imitated the well-known deep network (VDSR) adopted in SISR [26] and proposed a deep residual network (DRPNN) with a residual module to improve the fusion accuracy. Moreover, Cai [27] designed a progressive pansharpening neural network (SRPPNN) based on the strategy of SISR. Yang [28] proposed a deep network (PANNET) that learned details in the high-pass domain to further improve the spatial resolution of the fused images. In addition, a lot of pansharpening methods based on unsupervised training have emerged. The perceptual pansharpening method (PercepPan) [29] shows the superiority of the training paradigm in which supervised pre-training is performed first, followed by unsupervised fine-tuning. Ciotola [30] proposed a pansharpening method (Z-PNN) pre-trained on original data at full resolution that can effectively reduce loss of information.
Due to the difference in scale between PAN and MS images, many studies have explored pansharpening methods based on multi-scale models. Zhang [31] used a bidirectional pyramid network to process the MS and PAN images separately in two branches. In [32], cascaded multi-scale generative adversarial networks (ZeRGAN) were employed for pansharpening without requiring a large training dataset. Motivated by the MRA scheme, a triple-double network was proposed to progressively inject spatial details into the MS image [33]. Wang [34] proposed a dual-scale regression-based MRA method (MTF_GLP_HPM_DS) that combines coarse-scale and fine-scale information with different weight parameters, giving the model good adaptability to different scenarios.
We can reasonably infer that how fully features are exploited is closely related to the network structure. Although the methods mentioned above have exhibited excellent performance, there is still considerable room for improvement. Source images with different characteristics may not be well suited to the direct injection of spatial details, which can cause detail loss and spectral distortion. For multi-scale fusion methods, the process of information fusion at each scale is of considerable importance. Moreover, the spectral bands of the MS image share common characteristics that are usually neglected. In this work, we propose a pansharpening method with multiple levels for feature extraction and image reconstruction. In these two branches, dedicated modules are designed to accommodate the different tasks. Moreover, the complementary information of each scale is adequately integrated by a channel-wise self-attention mechanism. The major contributions of the proposed method can be summarized as follows:
  • A multilevel parallel feature injection network (MPFINet) is devised to concurrently learn spectral information and spatial details. It can leverage hierarchical features from PAN and MS images to balance spatial enhancement and spectral preservation.
  • In the feature extraction branch, a multi-scale dynamic convolutional dense block (MDCDB) is proposed to effectively extract four-stream features of different levels and take advantage of cross-scale information.
  • To reuse and supplement the feature information, cascade transformer blocks based on the channel self-attention mechanism (CSTB) are established to adaptively learn cross-channel dependencies and long-range details of PAN and MS images.

2. Background and Related Works

2.1. Attention Mechanism

The attention mechanism usually provides more faithful guidance for the model to focus on important areas. Depending on the domain of interest, attention mechanisms can be divided into three main categories: spatial domain, channel domain, and hybrid domain. Channel attention models channel-wise information in a computationally efficient manner. Squeeze-and-excitation (SE) attention [35] is a widely deployed channel attention mechanism that adaptively recalibrates channel features through squeeze, excitation, and scale operations. The spatial transformer network (STN) [36] is a classic spatial-domain attention mechanism, which applies an appropriate spatial transformation to locate the target. The dynamic capacity network [37] achieves low computational cost and high accuracy through two sub-networks: a coarse model that globally locates the region of interest in the image and a fine model that refines the corresponding region. As a hybrid-domain attention mechanism, the convolutional block attention module (CBAM) [38] investigates both spatial and channel-wise attention within an efficient architecture. Furthermore, the multi-context attentive block (MCAB) proposed in [39] focuses on contextual features and consists of a cutting-splicing block (CSB) and two-stage attention.
However, positional information is usually ignored by the methods mentioned above because they merely transform the input into a one-dimensional feature through a pooling operation, whereas accurately locating and identifying the regions of interest is crucial. Coordinate attention (CA) was designed to capture direction-aware and position-sensitive information by embedding coordinate information into the feature tensor [40]. The structure of CA is shown in Figure 1. The coordinate-aware feature maps are first derived from global average pooling in the vertical and horizontal directions; direction-specific pooling can capture more precise channel relationships and focus on large regions. Then, the pair of feature maps embedded with direction-specific positional information is processed by concatenation and convolution transformations. Finally, the two components separated from the synthetic feature along the spatial dimensions are encoded as two attention masks. As a plug-and-play module with a low computational burden, CA is embedded into the multi-scale dynamic convolutional dense block as one of its streams.
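For reference, the following PyTorch sketch illustrates the CA operation described above (directional pooling, joint encoding, and two attention masks). The module name, reduction ratio, and layer choices are illustrative assumptions rather than the exact configuration used in MPFINet.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention sketch: pool along H and W, encode jointly, split into two masks."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # joint encoding of both directions
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # mask along the vertical direction
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # mask along the horizontal direction

    def forward(self, x):
        b, c, h, w = x.shape
        # Direction-aware global average pooling.
        x_h = x.mean(dim=3, keepdim=True)                      # (B, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (B, C, W, 1)
        # Concatenate along the pooled axis and transform jointly.
        y = self.act(self.conv1(torch.cat([x_h, x_w], dim=2))) # (B, mid, H+W, 1)
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w                                   # re-weight features with both masks

# Example: attention over a 16-channel feature map.
feat = torch.randn(1, 16, 64, 64)
print(CoordAtt(16)(feat).shape)  # torch.Size([1, 16, 64, 64])
```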
Compared to the three types of attention mechanisms above, the self-attention mechanism [41] is better at capturing the internal relevance of features and relies less on external information. The Vision Transformer (ViT) has become a milestone of computer vision [42]. Based on (Query, Key, Value) triples, it provides an effective way to model global context information. Non-local neural networks [43] inherited this triple-based modeling and improved the performance of object detection and instance segmentation. Both illustrate the importance of the self-attention mechanism in modeling context information.

2.2. Dynamic Convolution

Attention mechanisms are not only embedded in modules to weight different features; they can also be applied to convolution kernels. Most approaches improve performance by increasing model depth, but this also incurs computational overhead. To improve the capability of the model without increasing the computation too much, dynamic convolutional neural networks were proposed [44]. In these networks, convolution kernels aggregated in a nonlinear way have more expressive power, without affecting the depth or the width of the network. First, the input features are compressed by global average pooling, producing convolution parameters that vary with the input: after two fully connected layers and a softmax, k attention weights are generated. The parameters of the k convolution kernels are then linearly combined according to these weights to obtain an aggregated kernel, which performs an adaptive convolution of the input features.
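As a concrete illustration of this aggregation, the following PyTorch sketch builds an input-conditioned convolution from k candidate kernels; the kernel count, reduction ratio, and initialization are assumptions made for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Dynamic convolution sketch: softmax attention over k kernels, aggregated per input sample."""
    def __init__(self, in_ch, out_ch, kernel_size=3, k=4, reduction=4):
        super().__init__()
        self.k, self.out_ch, self.kernel_size = k, out_ch, kernel_size
        # k candidate kernels and biases.
        self.weight = nn.Parameter(torch.randn(k, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.bias = nn.Parameter(torch.zeros(k, out_ch))
        # Attention branch: global average pooling -> two FC layers -> softmax over the k kernels.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, in_ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, k),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        alpha = F.softmax(self.attn(x), dim=1)                          # (B, k) attention weights
        # Aggregate the k kernels per sample, then run one grouped convolution over the batch.
        weight = torch.einsum('bk,koihw->boihw', alpha, self.weight)    # (B, out, in, ks, ks)
        bias = torch.einsum('bk,ko->bo', alpha, self.bias)              # (B, out)
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       weight.reshape(b * self.out_ch, c, self.kernel_size, self.kernel_size),
                       bias.reshape(b * self.out_ch),
                       padding=self.kernel_size // 2, groups=b)
        return out.reshape(b, self.out_ch, h, w)

# Example usage.
y = DynamicConv2d(16, 32)(torch.randn(2, 16, 64, 64))
print(y.shape)  # torch.Size([2, 32, 64, 64])
```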

3. The Proposed Network

In this section, we summarize the proposed multilevel parallel network in Section 3.1 and elaborate on its three key components in Section 3.2, Section 3.3 and Section 3.4. The loss function used to train the model is described in the last subsection. Let $P \in \mathbb{R}^{H \times W}$ denote a high-resolution (HR) PAN image of size $H \times W$, and let $\tilde{P} \in \mathbb{R}^{H \times W \times B}$ be obtained by replicating $P$ to $B$ bands along the spectral dimension. Let $\mathrm{LRMS} \in \mathbb{R}^{H/r \times W/r \times B}$ be a low-resolution (LR) MS image, where $r$ is the spatial resolution ratio between $P$ and $\mathrm{LRMS}$. $\uparrow i$ denotes the upsampling operation and $\downarrow i$ the downsampling operation with a scale of $i$; for example, $\mathrm{MS}{\uparrow}4$ denotes the LRMS image upsampled by a scale of 4. Furthermore, let $\mathrm{HRMS} \in \mathbb{R}^{H \times W \times B}$ represent the desired pansharpening result and $\mathrm{GT} \in \mathbb{R}^{H \times W \times B}$ denote the reference MS image.
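The notation above can be made concrete with a short PyTorch snippet; the bicubic interpolation mode and the band count B = 3 follow the experimental setup in Section 4, while the tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

B, r = 3, 4                                   # number of MS bands and resolution ratio (see Section 4)
pan = torch.rand(1, 1, 256, 256)              # P:    H x W panchromatic image
lrms = torch.rand(1, B, 64, 64)               # LRMS: (H/r) x (W/r) x B multispectral image

pan_rep = pan.repeat(1, B, 1, 1)              # P~: replicate P to B bands along the spectral dimension
ms_up4 = F.interpolate(lrms, scale_factor=r, mode='bicubic', align_corners=False)  # MS↑4
i0 = pan_rep - ms_up4                         # I0, the input of the feature extraction branch (Section 3.2)
print(pan_rep.shape, ms_up4.shape, i0.shape)  # all (1, 3, 256, 256)
```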

3.1. Overall Network Framework

The overall flowchart of MPFINet is graphically illustrated in Figure 2; it includes three main parts: the feature extraction branch, the feature fusion stage, and the image reconstruction branch. To obtain complete structural information, the feature extraction branch consists of several MDCDBs, which help preserve the intensity and contrast of the input image. Downsampling operations are applied to reduce the dimension of the features, except for the third level. Accordingly, the features of each level are fed into a dual-attention fusion module for fusion. In the image reconstruction branch, the spectral component of the input MS image is preserved as much as possible before upsampling. Based on the detail-injection scheme, spatial and spectral mappings are directly complemented at each level, making it easier for the network to recover fine images. Eventually, a convolution layer is employed to map the number of channels to $B$.

3.2. Feature Extraction Branch

The proposed network takes $I_0$ as the input and obtains shallow features through a convolution layer, which can be expressed as follows:

$F_{in} = \mathrm{Conv}_{3 \times 3}(I_0) \qquad (1)$

where $\mathrm{Conv}_{3 \times 3}(\cdot)$ represents a 3 × 3 convolution layer. Since the upsampled MS image has low spatial resolution, it can be regarded as the low-pass spatial content of the PAN image. $I_0$, which contains the details of the PAN image together with a small amount of spectral information, is obtained by subtracting $\mathrm{MS}{\uparrow}4$ from $\tilde{P}$. In the following three levels, the features are extracted from the rough resolution to the fine resolution. The third level starts with a pixel-unshuffle downsampling operation [45] that reduces the size of the feature map and doubles the number of filters, followed by $N_1$ consecutive MDCDBs. To alleviate the vanishing-gradient problem and accelerate information propagation in the network, a residual connection retains the downsampled feature and transmits it to the subsequent operation along with the output of the current level. The structure of the first and second levels is consistent with that of the third one, except that there is no downsampling operation in the first level. The output of the sub-branch $F_l$ is defined as follows:

$F_l = \begin{cases} \mathrm{MDCDB}_l^{i}\big(\mathrm{Down}(F_{l+1})\big) + \mathrm{Down}(F_{l+1}), & l = 2, 3 \\ \mathrm{MDCDB}_l^{i}\big(F_{in}\big) + F_{in}, & l = 1 \end{cases} \qquad (2)$

where $\mathrm{MDCDB}_l^{i}(\cdot)$ denotes the $i$th MDCDB for extracting the features of the input images at the $l$th level, and $\mathrm{Down}$ denotes the downsampling operation. Each level has $N_1$ such modules, i.e., $i$ ranges up to $N_1$. Three feature tensors of size $W/4 \times H/4 \times 4C$, $W/2 \times H/2 \times 2C$, and $W \times H \times C$ are then fed into the feature fusion stage.
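As a rough illustration of one level of this branch, the following PyTorch sketch combines pixel-unshuffle downsampling, a stack of N1 blocks, and the residual connection described above. The 1 × 1 channel adjustment after pixel-unshuffle and the wiring between levels (chosen to reproduce the listed tensor sizes) are assumptions, and plain convolutional blocks stand in for the MDCDBs described next.

```python
import torch
import torch.nn as nn

class ExtractionLevel(nn.Module):
    """One level of the feature extraction branch (sketch): optional pixel-unshuffle
    downsampling, N1 stacked blocks, and a residual connection around them."""
    def __init__(self, in_ch: int, n_blocks: int = 3, downsample: bool = True):
        super().__init__()
        if downsample:
            # PixelUnshuffle(2) halves H and W and multiplies channels by 4; a 1x1 conv then
            # brings them to 2 * in_ch, matching the 2C / 4C channel counts (an assumption).
            self.down = nn.Sequential(nn.PixelUnshuffle(2), nn.Conv2d(4 * in_ch, 2 * in_ch, 1))
            ch = 2 * in_ch
        else:
            self.down = nn.Identity()
            ch = in_ch
        # Plain convolutional blocks standing in for the MDCDBs.
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(n_blocks)])

    def forward(self, x):
        x = self.down(x)
        return self.blocks(x) + x       # residual connection around the stacked blocks

f_in = torch.randn(1, 16, 256, 256)                 # F_in with C = 16
f1 = ExtractionLevel(16, downsample=False)(f_in)    # W x H x C
f2 = ExtractionLevel(16)(f1)                        # W/2 x H/2 x 2C
f3 = ExtractionLevel(32)(f2)                        # W/4 x H/4 x 4C
print(f1.shape, f2.shape, f3.shape)
```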
Remote sensing images contain abundant information about ground objects, so static convolutions with fixed parameters may not satisfy the requirements of complex image characteristics. Inspired by the residual dense block (RDB) [46] used in SISR, the MDCDB, equipped with dynamic convolution and an attention mechanism, is devised to adaptively extract dense and complex hierarchical features. As illustrated in Figure 3, two submodules of the same structure are stacked to constitute the MDCDB, which strengthens the representational power of the model. Each submodule includes four streams, and residual dense connections are employed to strengthen feature propagation. The first three streams are multi-scale dynamic convolutional layers, and the last one is an attention layer for feature aggregation. Small convolution kernels of size 1 × 1 and 3 × 3 are used to limit the computational complexity. Except for the first stream, all streams use an ordinary static convolution to reduce the channel dimension and simplify the model. Both the static and dynamic convolutions are followed by a ReLU activation function. Batch normalization (BN) [47] is widely adopted before the nonlinearity, yet it may destroy the contrast of the image; by normalizing spectral information, BN affects the quality of the image to some extent [48]. As a result, BN is removed from the MDCDB. In addition to the three convolution scales, coordinate attention (CA) is embedded to extract features along the two spatial directions: CA captures dependency relationships in one direction and accurately retains location information in the other. Finally, a convolutional layer fuses the concatenated features, taking full advantage of all preceding layers.
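The four-stream dense structure of an MDCDB submodule can be sketched as follows; plain static convolutions stand in for the dynamic convolutions and coordinate attention sketched in Section 2, and the channel widths of the streams are assumptions.

```python
import torch
import torch.nn as nn

class MDCDBSub(nn.Module):
    """One submodule of the MDCDB (sketch): four parallel streams over the same input --
    three multi-scale convolutional streams (1x1 / 3x3) and one attention stream -- whose
    outputs are concatenated and fused by a 1x1 convolution, with a residual connection.
    No batch normalization is used, as discussed above."""
    def __init__(self, ch: int):
        super().__init__()
        # Stream 1: 3x3 convolution at full width (a dynamic convolution in MPFINet).
        self.s1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        # Streams 2-4 first reduce channels with a 1x1 static convolution.
        self.s2 = nn.Sequential(nn.Conv2d(ch, ch // 2, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch // 2, ch // 2, 1), nn.ReLU(inplace=True))
        self.s3 = nn.Sequential(nn.Conv2d(ch, ch // 2, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(ch // 2, ch // 2, 3, padding=1), nn.ReLU(inplace=True))
        # Stream 4: attention for feature aggregation (coordinate attention in MPFINet).
        self.s4 = nn.Sequential(nn.Conv2d(ch, ch // 2, 1), nn.ReLU(inplace=True))
        # 1x1 fusion of the concatenated streams back to ch channels.
        self.fuse = nn.Conv2d(ch + 3 * (ch // 2), ch, 1)

    def forward(self, x):
        y = torch.cat([self.s1(x), self.s2(x), self.s3(x), self.s4(x)], dim=1)
        return self.fuse(y) + x         # residual connection keeps earlier features flowing

class MDCDB(nn.Module):
    """Two stacked submodules, as described above."""
    def __init__(self, ch: int):
        super().__init__()
        self.body = nn.Sequential(MDCDBSub(ch), MDCDBSub(ch))

    def forward(self, x):
        return self.body(x)

print(MDCDB(16)(torch.randn(1, 16, 64, 64)).shape)  # torch.Size([1, 16, 64, 64])
```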

3.3. Feature Fusion Stage

The main task of the feature fusion stage is to fuse the complementary information of the features. More specifically, the synthetic features from the feature extraction branch are unidirectionally injected into the feature of the MS image level by level. Since the ground objects in remote sensing images usually cluster in large numbers over the same area, the underlying similarity will be embodied in the spectrum and the spatial structure. To capture the cross-modality long-range dependency, Luo [49] derived spatially related attention between low-frequency textures and high-frequency noises based on the self-attention mechanism. Inspired by Luo’s work, a dual-attention fusion module (DAFM) is devised for spectral band correlation learning, which can model the similarity between the channels and recover the spectral information well. Moreover, considering that the result is subsequently transferred to a single branch, two streams are concatenated to facilitate the feature learning and then fed into the image reconstruction branch.
The fusion process at the third level is illustrated in detail in Figure 4 as an example. The module takes two features as input: the shallow feature $F_1$ derived from the feature extraction branch and the deep feature $FS_3$ derived from the image reconstruction branch. First, $F_1$ is flattened into a query $Q_F \in \mathbb{R}^{C \times HW}$ and a value $V_F \in \mathbb{R}^{HW \times C}$. Accordingly, $Q_F$ is used to encode the attention coefficients, and $V_F$ embodies the information of its branch. Analogously, $FS_3$ is factored into $Q_{FS} \in \mathbb{R}^{HW \times C}$ and $V_{FS} \in \mathbb{R}^{HW \times C}$. The multi-head similarity matrix $M \in \mathbb{R}^{C \times C}$ is generated by multiplying the two query matrices and normalizing with softmax. A row of $M$ represents the inherent correlation between the current channel and the other channels. Transposing the matrix yields the attention matrix $M'$ of the same size as $M$. Both embody complementary information and transmit important information across the two branches. Next, the refined features are obtained as follows:

$F_{D3} = M \otimes V_{FS} + F_1 \qquad (3)$

$FS_{D3} = M' \otimes V_F + FS_3 \qquad (4)$

where $\otimes$ denotes matrix multiplication, and the subscript 3 stands for the third level.
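A minimal single-head sketch of this dual-attention fusion is given below; the channel-similarity matrix, its transpose for the second branch, and the final concatenation follow the description above, while the 1 × 1 projections and tensor layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAFM(nn.Module):
    """Dual-attention fusion sketch: a C x C channel-similarity matrix built from the two
    branches' queries re-weights each branch's values, and the refined features are added
    back to the inputs (single-head version)."""
    def __init__(self, ch: int):
        super().__init__()
        self.q_f = nn.Conv2d(ch, ch, 1)    # query/value projections for the extraction feature
        self.v_f = nn.Conv2d(ch, ch, 1)
        self.q_s = nn.Conv2d(ch, ch, 1)    # query/value projections for the reconstruction feature
        self.v_s = nn.Conv2d(ch, ch, 1)

    def forward(self, f, fs):
        b, c, h, w = f.shape
        qf = self.q_f(f).flatten(2)                     # (B, C, HW)
        vf = self.v_f(f).flatten(2)                     # (B, C, HW)
        qs = self.q_s(fs).flatten(2)                    # (B, C, HW)
        vs = self.v_s(fs).flatten(2)                    # (B, C, HW)
        m = F.softmax(qf @ qs.transpose(1, 2), dim=-1)  # (B, C, C) channel-similarity matrix M
        f_d = (m @ vs).reshape(b, c, h, w) + f          # inject reconstruction information into F
        fs_d = (m.transpose(1, 2) @ vf).reshape(b, c, h, w) + fs  # and vice versa, using M'
        return torch.cat([f_d, fs_d], dim=1)            # concatenated output (doubles the channels)

x1, x2 = torch.randn(1, 16, 64, 64), torch.randn(1, 16, 64, 64)
print(DAFM(16)(x1, x2).shape)  # torch.Size([1, 32, 64, 64])
```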

3.4. Image Reconstruction Branch

The shallow feature generated from the LRMS image is fed into the trunk branch to reconstruct the pansharpening result. The image reconstruction branch is also composed of three levels that correspond to those of the other branch. Each level starts with several CSTBs, followed by a channel-reducing gate and, finally, an upsampling operation. Deep features containing abstract semantic information and shallow features containing texture information complement each other to reconstruct the image. The structure of the third level is similar to that of the first two, except that there is no final upsampling layer. The process can be expressed as follows:

$FS_l = \begin{cases} \mathrm{Conv}(\mathrm{LRMS}), & l = 1 \\ \mathrm{Up}\big(\mathrm{CRG}\big(\mathrm{CSTB}_1^{u}(F_{D1} + F_3) + FS_1\big)\big), & l = 2 \\ \mathrm{Up}\big(\mathrm{CRG}\big(\mathrm{CSTB}_2^{u}(F_{D2} + F_1) + FS_2\big)\big), & l = 3 \end{cases} \qquad (5)$

where $\mathrm{Up}$ represents the pixel-shuffle upsampling operation, and $\mathrm{CRG}$ represents the channel-reducing operation. $\mathrm{CSTB}_l^{u}(\cdot)$ denotes the $u$th CSTB used to reconstruct the images at the $l$th level, and the maximum value of $u$ is $N_2$. After three levels of spatial enhancement and spectral recovery, a convolution layer reduces the channels of $F_{out}$ to $B$, which generates the desired HRMS image.
Since transformer blocks have been successfully applied to large-scale image reconstruction tasks [50], we designed the CSTB module by applying the self-attention mechanism across the feature (channel) dimension instead of the spatial dimension. Figure 5 illustrates the self-attention mechanism in the CSTB, which emphasizes contextual information and aggregates cross-channel pixel-wise features. The three basic elements, Query (Q), Key (K), and Value (V), are linear projections of the original feature vector. Self-attention is calculated from the dot product of Q and the transposed K, which models the similarity between channels over all pixels. The attention map is then applied to V, which retains the original features, to extract information. The common self-attention mechanism uses V directly in this calculation; however, it may attend to less informative spatial regions with low spectral representation. Thus, we modulate V before multiplying it with the attention map to obtain the masked V, which directs the model to focus on regions with high fidelity. After the multi-head attention mechanism, a standard feed-forward network consisting of two convolution layers is applied at each pixel location: the first layer expands the dimension by a factor of 4, and the second restores the original input dimension, with a GELU activation [51] between the two linear transformations. Overall, the CSTB models global relationships among pixels, which helps reconstruct the structural and spectral information.
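The following sketch illustrates the kind of channel self-attention block described above; the sigmoid gate used to realize the masked V, the learned temperature, and the head count are assumptions rather than the exact CSTB design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Self-attention across the channel dimension (sketch): a C x C attention map is
    applied to a modulated ('masked') V; the sigmoid gate is one possible realization."""
    def __init__(self, ch: int, heads: int = 1):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Conv2d(ch, 3 * ch, 1)
        self.mask = nn.Conv2d(ch, ch, 1)                 # produces the modulation applied to V
        self.proj = nn.Conv2d(ch, ch, 1)
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        v = v * torch.sigmoid(self.mask(x))              # masked V emphasizes informative regions

        def split(t):                                    # (B, heads, C/heads, HW)
            return t.reshape(b, self.heads, c // self.heads, h * w)

        q, k, v = split(q), split(k), split(v)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.temperature, dim=-1)  # channel attention map
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out)

class CSTB(nn.Module):
    """Channel self-attention transformer block: attention plus a feed-forward network that
    expands the channels by 4 with GELU in between, each with a residual connection."""
    def __init__(self, ch: int):
        super().__init__()
        self.attn = ChannelSelfAttention(ch)
        self.ffn = nn.Sequential(nn.Conv2d(ch, 4 * ch, 1), nn.GELU(), nn.Conv2d(4 * ch, ch, 1))

    def forward(self, x):
        x = x + self.attn(x)
        return x + self.ffn(x)

print(CSTB(16)(torch.randn(1, 16, 64, 64)).shape)  # torch.Size([1, 16, 64, 64])
```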
Since the feature channels are doubled in the DAFM, a channel-reducing gate is used to halve them. A common way to do this is to reduce the number of filters in a convolution, but this introduces extra parameters. To avoid this, a simpler operation is adopted [52]: the channel dimension is split into two halves, which are multiplied element-wise to produce the output. Compared with a convolution, the channel-reducing gate still reduces the dimension and introduces nonlinearity, without additional parameters. In addition, the features from the feature extraction branch and the image reconstruction branch are directly injected through residual connections to strengthen feature reuse.
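The gate itself reduces to a few lines; this sketch mirrors the split-and-multiply operation described above.

```python
import torch

def channel_reducing_gate(x: torch.Tensor) -> torch.Tensor:
    """Halve the channel dimension without extra parameters: split the channels into two
    halves and multiply them element-wise, which also introduces a nonlinearity."""
    x1, x2 = x.chunk(2, dim=1)
    return x1 * x2

print(channel_reducing_gate(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 16, 64, 64])
```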

3.5. Loss Function

To depict the difference between the HRMS and GT images, the mean absolute error (MAE) is adopted to optimize the proposed method. The loss function can be expressed as follows:

$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| M_{\theta}\!\left(\mathrm{MS}^{(i)};\; \tilde{P}^{(i)} - (\mathrm{MS}{\uparrow}4)^{(i)}\right) - \mathrm{GT}^{(i)} \right\|_1 \qquad (6)$

where $N$ denotes the number of training samples, $\theta$ represents the parameters of the model $M_{\theta}$, and $\| \cdot \|_1$ is the $\ell_1$ norm. Although the L2 loss amplifies image differences, which can help optimize the model, it sacrifices high-frequency information and results in excessively smooth textures [53]. Consequently, the L1 loss is used to promote model convergence.
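A minimal training-step sketch for this objective is shown below; the two-argument model signature follows Equation (6), while the argument names and optimizer handling are illustrative assumptions.

```python
import torch.nn as nn

l1 = nn.L1Loss()  # mean absolute error

def training_step(model, lrms, pan_rep, ms_up4, gt, optimizer):
    """One optimization step on the L1 objective in Equation (6)."""
    optimizer.zero_grad()
    pred = model(lrms, pan_rep - ms_up4)   # M_theta(MS; P~ - MS↑4)
    loss = l1(pred, gt)                    # compare the predicted HRMS with the reference image
    loss.backward()
    optimizer.step()
    return loss.item()
```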

4. Experimental Results and Analysis

4.1. Datasets and Experimental Setup

In this section, the effectiveness of the proposed network architecture is evaluated via images from two datasets, which were captured separately by the QuickBird (QB) and WorldView-3 (WV3) sensors.
QuickBird dataset. In the QB case, the spatial resolution of the MS bands is 2.4 m and that of the PAN channel is 0.6 m. The radiometric resolution of the QB data is 11 bits. We prepared 496 pairs of samples, which are divided into 451/23/23 for the training, validation, and testing datasets, respectively. The resolutions of the MS images and the PAN images are 64 × 64 pixels and 256 × 256 pixels in both the reduced-resolution and full-resolution data.
WorldView-3 dataset. The WV3 dataset was obtained from an open pansharpening dataset called "PanCollection" [54], which contains data from four sensors processed into reduced-resolution and full-resolution versions in different formats (h5 and mat files) for fair training and testing. The WV3 sensor provides MS bands with a spatial resolution of 1.2 m and a PAN band with a spatial resolution of 0.3 m. There are 9714 patch pairs for training, 1080 patch pairs for validation, and 20 patch pairs for testing. In the simulated experiments, each band of an LRMS patch contains 64 × 64 pixels, and the corresponding PAN patch is 256 × 256 pixels. For the full-resolution data, the sizes of the LRMS and PAN images are 128 × 128 and 512 × 512 pixels. The datasets are available at https://github.com/liangjiandeng/PanCollection (accessed on 27 October 2022).
Following Wald's protocol [55], the source images are degraded to obtain the LRMS images: the original MS and PAN images are filtered with a Gaussian kernel and downsampled with a spatial resolution ratio of 4, and the original MS images serve as the reference images. In addition, $\mathrm{MS}{\uparrow}4$ is obtained by bicubic interpolation. The LRMS, $\mathrm{MS}{\uparrow}4$, and GT images in all the datasets have three bands, i.e., $B$ is set to 3.
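A sketch of this degradation pipeline is given below; the Gaussian kernel size and standard deviation are illustrative assumptions, since in practice the filter would be matched to the sensor's modulation transfer function.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def wald_degrade(ms: torch.Tensor, pan: torch.Tensor, ratio: int = 4, sigma: float = 1.0):
    """Simulate reduced-resolution inputs following Wald's protocol (sketch): Gaussian blur,
    decimation by `ratio`, and bicubic re-upsampling of the MS image."""
    k = 2 * int(2 * sigma) + 1                                       # odd kernel covering ~2 sigma
    ms_lr = TF.gaussian_blur(ms, k, sigma)[..., ::ratio, ::ratio]    # LRMS
    pan_lr = TF.gaussian_blur(pan, k, sigma)[..., ::ratio, ::ratio]  # degraded PAN
    ms_up = F.interpolate(ms_lr, scale_factor=ratio, mode='bicubic', align_corners=False)  # MS↑4
    return ms_lr, pan_lr, ms_up

ms = torch.rand(1, 3, 256, 256)     # original MS, used as the reference (GT)
pan = torch.rand(1, 1, 1024, 1024)  # original PAN
lrms, pan_lr, ms_up4 = wald_degrade(ms, pan)
print(lrms.shape, pan_lr.shape, ms_up4.shape)  # (1,3,64,64) (1,1,256,256) (1,3,256,256)
```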
The proposed network is implemented in PyTorch 1.10.0 and runs on an NVIDIA GeForce RTX 3090 GPU. The same hyperparameters are used for the two datasets. The model is trained for 900 epochs with a batch size of 8. The AdamW optimizer with a learning rate of 0.001 is employed to minimize the objective function in (6), and the learning rate is decayed by a factor of 0.1 every 300 epochs. The network is configured as follows: the number of MDCDBs and CSTBs is $N_1 = N_2 = 3$, and the number of channels is $C = 16$.
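The optimizer and learning-rate schedule described above can be set up as follows; the stand-in module and the omitted data loop are placeholders, not part of the authors' code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Conv2d(3, 3, 3, padding=1)              # stand-in for MPFINet
optimizer = AdamW(model.parameters(), lr=1e-3)            # AdamW with initial learning rate 0.001
scheduler = StepLR(optimizer, step_size=300, gamma=0.1)   # decay by 0.1 every 300 epochs

for epoch in range(900):                                  # 900 epochs, batch size 8 in the paper
    # ... iterate over the training batches and call training_step(...) here ...
    scheduler.step()
```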

4.2. Compared Methods and Evaluation Metric

Several representative methods selected from the CS, MRA, VO, and DL categories are employed for comparison. The four conventional pansharpening methods include Brovey [56], MTF-GLP with high-pass modulation (MTF_GLP_HPM) [57], SFIM [10], and CNMF [14]. Meanwhile, six state-of-the-art DL methods are retrained, including PNN [22], PANNET [28], ZeRGAN [32], PanFormer [58], SRPPNN [27], and MUCNN [59]. The default parameters of these works are adopted.
The performance of all the methods is evaluated with quantitative metrics and qualitative comparisons. For the reduced-resolution assessment, six metrics are adopted: the peak signal-to-noise ratio (PSNR) [60], the structural similarity index measure (SSIM) [61], the correlation coefficient (CC) [62], the universal image quality index (UIQI) [63], the spectral angle mapper (SAM) [64], and the relative dimensionless global error in synthesis (ERGAS) [65]. For the full-resolution assessment, the spatial distortion index $D_\rho$ [66], the spectral distortion index $R\text{-}Q2^{n}$ [66], and the quality-with-no-reference (QNR) index [67] are commonly employed.
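For reference, SAM and ERGAS admit compact implementations; the formulations below are the common definitions (lower is better for both) and are not taken from the authors' evaluation code.

```python
import torch

def sam(pred: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spectral angle mapper in degrees, averaged over all pixels (tensors are (B, C, H, W))."""
    dot = (pred * ref).sum(dim=1)
    cos = torch.clamp(dot / (pred.norm(dim=1) * ref.norm(dim=1) + eps), -1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()

def ergas(pred: torch.Tensor, ref: torch.Tensor, ratio: int = 4) -> torch.Tensor:
    """ERGAS: 100 / ratio * sqrt(mean over bands of (RMSE_b / mu_b)^2)."""
    rmse = torch.sqrt(((pred - ref) ** 2).mean(dim=(0, 2, 3)))   # per-band RMSE
    mu = ref.mean(dim=(0, 2, 3))                                 # per-band mean of the reference
    return 100.0 / ratio * torch.sqrt(((rmse / mu) ** 2).mean())

pred, ref = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
print(float(sam(pred, ref)), float(ergas(pred, ref)))
```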

4.3. Experimental Results on the QB Dataset

The quantitative results of all the methods at reduced resolution are listed in Table 1. It is clear that the proposed MPFINet outperforms the comparative methods on the reduced-resolution metrics. Overall, the performance of the conventional methods, especially Brovey and SFIM, is inferior to that of most DL methods, which is corroborated by the visual assessment below.
Figure 6 presents the results on the QB dataset, which further corroborate the objective indexes. Varying degrees of spectral distortion are introduced by Brovey, MTF_GLP_HPM, PNN, and ZeRGAN; as revealed by the roof color, these methods turn it light blue or gray instead of dark blue. The fused images produced by PANNET, SRPPNN, and MUCNN suffer from blurred details on the edge of the roof. In the enlarged green boxes, the spectral quality of ZeRGAN is unsatisfactory, especially for the faint white car. To further analyze the performance of each method, the mean absolute error (MAE) residues are calculated to present the difference between the synthesized image and the corresponding ground truth. As shown in Figure 7, blue indicates small differences, and light colors indicate large reconstruction errors. The smallest differences in the enlarged green and red boxes of Figure 7k indicate that our fused result is closest to the ground truth.
The quantitative results at full resolution are reported in Table 2. The proposed method benefits from multi-scale feature injection, which reduces spectral distortion to some extent. Although the no-reference indexes of the proposed method are not the best, it is still competitive in terms of visual quality. Figure 8 shows the results of the full-resolution experiment on the QB dataset. Abnormal colors in the farmland can be noticed in the results of the Brovey, SFIM, CNMF, and ZeRGAN methods, and the MTF_GLP_HPM method produces obvious artifacts in most regions. MPFINet shows its superiority in detail preservation, which is clearly visible in the mountain road in the enlarged view. The CNMF method provides high spatial resolution and preserves the texture of the mountain top in the enlarged box, which is consistent with its performance on the $D_\rho$ index at full resolution. For a qualitative evaluation, Figure 8k achieves relatively high spectral fidelity among all the pansharpened images. However, the QNR index does not invariably reflect the quality of the fused image; for instance, although the MUCNN method fails to reconstruct the complete mountain road in Figure 8h, it still achieves the best value on the no-reference QNR index.

4.4. Experimental Results on the WV3 Dataset

To further verify the effectiveness of the proposed method, we provide the results of the quantitative and qualitative evaluations performed on the WV3 dataset. The reduced-resolution evaluation results for the WV3 dataset are presented in Table 3. It can be observed that MPFINet obtains the best values on the reduced-resolution metrics. Compared with SRPPNN, MPFINet is 0.3 dB higher in PSNR and 0.1 lower in ERGAS, indicating better image quality and higher spectral fidelity.
An example of an experiment conducted on the WV3 dataset is shown in Figure 9. The disadvantage of the MTF_GLP_HPM method is obvious as it presents a serious loss of spectral accuracy. Most conventional methods exhibit oversmoothing. Moreover, the images of PNN, MUCNN, and ZeRGAN are blurry to some extent due to insufficient spatial detail injection. For the detailed zoom part in the red box, the spectral information of the white roof is faithfully preserved in most of the comparative methods. The residual images between the pansharpened results and the reference images for the WV3 dataset are shown in Figure 10. Most errors occur in the high-frequency details such as on the edges of buildings. Since the discriminator only identifies the spatial information of PAN images, the ZeRGAN method fails to preserve the spectral content well. Overall, the result of the proposed MPFINet shows minor differences in the enlarged green box.
As can be seen in Table 4, the performance of most DL methods is inferior to that of the conventional methods in $D_\rho$, while the proposed method remains competitive among the DL methods. Although the $D_\rho$ and $R\text{-}Q2^{n}$ of our method are slightly worse than those of other methods, its QNR ranks first, which means that MPFINet achieves an equilibrium between the preservation of spectral and spatial information in the full-resolution measures. Figure 11 presents the fused results on the WV3 dataset at full resolution. It can be observed in Figure 11a–c that the fused results of Brovey, MTF_GLP_HPM, and SFIM suffer from slight spatial distortions, as the buildings appear blurry on visual inspection. Although the SFIM method achieves the best performance in $R\text{-}Q2^{n}$, poor spectral fidelity is visible in the amplified areas. At the same time, the result from SRPPNN has distorted spatial and spectral properties, with odd pixels around the white car. The ZeRGAN method can preserve the spectral features of the intermediate MS images, a benefit of its unsupervised training. It is obvious that the proposed MPFINet outperforms the comparison methods in both the spatial and spectral domains. In total, the experiments conducted on the QB and WV3 datasets demonstrate that MPFINet surpasses all the comparative methods at reduced resolution and is beneficial in balancing spectral preservation and spatial enhancement at full resolution.

4.5. Ablation Study

4.5.1. Ablation Study on the Input of the Feature Extraction Branch

To extract detail and edge information from the source images, a common strategy is to train the model in the high-pass domain [23,68,69], which mitigates the inconsistency between the MS and PAN images. However, spectral information is often lost when using high-pass images. To avoid losing information and to inject abundant information into the image reconstruction branch, experiments with different inputs to the feature extraction branch are conducted. Table 5 reports the comparison between a high-pass PAN image (abbreviated as "HP-PAN") and $I_0$. When $I_0$ is taken as the input, the proposed MPFINet achieves the best performance. On the one hand, the spectral indexes are significantly boosted since this input intrinsically introduces spectral information; for example, the method is lower by 0.1 in ERGAS with the input $I_0$, indicating lower spectral distortion and greater similarity to the ground truth. On the other hand, the spatial information of the pansharpened images is also enhanced to a certain extent. The results suggest that the spatial information of the PAN images is not adequately utilized when training in the high-pass domain. Consequently, the difference image between the PAN and upsampled MS images is verified to be a better input than the high-pass PAN image.

4.5.2. Ablation Study on MPFINet with Different Structures and Loss Function Settings

Different loss functions are evaluated to constrain the generated HRMS on the QB dataset. The widely used L1 loss is more tolerant of abnormal values than the L2 loss, while the L2 loss is more stable overall as the data change. Consequently, we conducted experiments with the L1 loss and the L2 loss under the default settings. Comparing the results presented in rows 1 and 2 of Table 6, the advantage of the L1 loss is obvious in all objective indexes, indicating that the L1 loss can effectively retain structural information since it is insensitive to outliers. Thus, the loss shown in (6) is set as the default loss function in this work.
To verify the effectiveness of the proposed MPFINet, different variants are considered. The network architecture is divided into three types: the feature extraction branch with three outputs (referred to as "Three levels"), the feature extraction branch with the remaining two outputs except for $F_1$ (referred to as "Double levels"), and the feature extraction branch with only the $F_3$ output (referred to as "Single levels"). Table 6 clearly shows that MPFINet performs better with the assistance of the multilevel structure. Under the default parameter settings, the DAFM is also replaced by concatenation to verify the effect of the fusion module; the results show that the proposed fusion strategy suits this structure. In addition, although the SSIM index decreases slightly, all the other indexes improve to some extent with the masked V, so its validity is confirmed.

4.5.3. Ablation Study on the Structure of MDCDB

The MDCDB, which integrates dynamic convolution and attention mechanisms, plays an important role in each level of the feature extraction branch. First, the effect of CA is analyzed. CBAM and SE attention, both classical attention mechanisms, are used to substitute the CA block in the proposed module, and a squeeze ratio of 4 is applied in all three attention mechanisms. Comparing the results presented in rows 1, 5, and 6 of Table 7, CA shows the best results, followed by SE and CBAM. Thus, the CA block is the appropriate choice for embedding into the MDCDB.
In addition, static convolution is used in place of dynamic convolution (referred to as "w/o DyConv") to assess the effect of dynamic convolution. The results shown in rows 1 and 2 of Table 7 indicate that static convolution with fixed parameters leads to a minor degradation of the model performance compared with dynamic convolution. As mentioned in Section 3.2, BN is one of the causes of smoothed edges; a comparison of the results in rows 1 and 3 confirms this conclusion. Moreover, two identical submodules are connected in series by a dense connection in the MDCDB. We reduce the MDCDB to one residual submodule with four streams (referred to as "one submodule") to verify the performance, and the results indicate that two submodules do benefit the pansharpening results. Hence, the effectiveness of each component in the MDCDB is verified.

4.5.4. Ablation Study on the Number of CSTBs

Intuitively, embedding more CSTBs in each level can improve the feature extraction ability. Hence, variant models with 2, 3, and 4 CSTBs were trained on the QB dataset. The results of the quantitative assessment are shown in Figure 12, where the vertical and horizontal axes represent the index values and the number of CSTBs, respectively. A significant improvement is achieved when increasing the number from 2 to 3. However, as the number of CSTBs continues to increase from 3 to 4, the disadvantage of superfluous modules becomes evident. The reason is that the spatial and spectral components of $I_0$ can be extracted and refined well by three CSTBs; further increasing the number of blocks may lead to over-fitting and insufficient training. Furthermore, Table 8 lists the network parameters and training time of these models with different numbers of CSTBs under the same number of epochs: a larger number of modules requires more training time. Therefore, we set three CSTBs in each level of the image reconstruction branch as a trade-off between computational burden and fusion performance.

4.6. Visualization of Feature Maps

To further analyze the process of fusion and reconstruction, heatmaps of the features are shown in Figure 13. Each column #n represents the index of the nth channel. As shown in Figure 2, the feature maps in the first and second rows are selected from $F_1$ of the feature extraction branch and $FS_3$ of the image reconstruction branch, which are both inputs of the DAFM. The last row shows the outputs of the third level, $F_{out}$, before the convolution that yields the final fused result. More specifically, #6 focuses on the open space next to roads and buildings, and the details learned there are collected through feature injection. In the second column, it can be observed that the edge and texture information of the roads is well enhanced. According to #13, the buildings in the lower right and upper left corners are well identified. Hence, we can conclude that the features of the two branches are complementary, and that the feature maps become richer after feature injection and feature reconstruction.

4.7. Computation Complexity

In this section, experiments on test time and network parameters are conducted on images with a spatial size of 256 × 256. As can be seen in Table 9, all methods have tolerable test times except for the ZeRGAN method. The reason is that ZeRGAN does not require training in advance: the PAN and LRMS images of the testing set are fused by optimizing multi-scale generators and discriminators, so ZeRGAN consumes much more time than the others. VO-based methods perform complex calculations and require more time at test time, which is confirmed by the result of CNMF. Compared with most methods, the PNN method has the fewest parameters due to its simple structure. Although the introduction of transformer blocks increases the number of network parameters, the increase is acceptable, and the test time of MPFINet remains competitive.
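The reported figures can be reproduced in spirit with a few lines; the snippet below counts trainable parameters and times a single forward pass on a 256 × 256 input, using a stand-in module in place of the actual networks (on a GPU, an additional synchronization call would be needed for accurate timing).

```python
import time
import torch

stand_in = torch.nn.Conv2d(3, 3, 3, padding=1)              # stand-in; replace with the trained model
n_params = sum(p.numel() for p in stand_in.parameters() if p.requires_grad)

x = torch.rand(1, 3, 256, 256)
with torch.no_grad():
    t0 = time.time()
    stand_in(x)                                              # one forward pass on a 256 x 256 image
    elapsed = time.time() - t0

print(f"trainable parameters: {n_params}, test time: {elapsed:.4f} s")
```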

5. Conclusions

In this article, we proposed a multilevel parallel network named MPFINet. The method consists of two branches and an intermediate fusion stage: the feature extraction branch, the image reconstruction branch, and the feature fusion stage. In particular, MDCDBs integrating dynamic convolution and the CA attention mechanism were used to extract abundant multi-scale features from the difference between the PAN image and the upsampled MS image. For spectral fidelity and spatial resolution enhancement, CSTBs based on the channel self-attention mechanism were embedded into the image reconstruction branch level by level. Furthermore, the fusion module and detail injection were employed to enrich the complementary features and suppress redundant information. The proposed MPFINet was compared with four conventional methods and six classical DL methods on the QB and WV3 datasets. The impressive results in visual appearance and objective indexes demonstrate the superiority of the proposed method.
Following Wald's protocol, the original MS images were used as reference images for supervised learning. The proposed method has significant advantages at reduced resolution, but its advantage is less pronounced at full resolution. Although models trained on down-scaled images can achieve impressive performance, they are limited in real-world applications and have difficulty generalizing to full-resolution images. In future work, we will adopt an unsupervised approach to learn directly from full-resolution images and, by optimizing the network architecture, narrow the gap between unsupervised and supervised learning as much as possible.
The code is available at https://github.com/Feng-yuting/MPFINet.

Author Contributions

Conceptualization, Y.F.; methodology, Y.F.; software, Y.F.; validation, Y.F.; formal analysis, Y.F., X.J. and Q.J.; investigation, X.J., Q.J. and Q.W.; data curation, Y.F. and X.J.; writing—original draft, Y.F.; writing—review and editing, X.J., Q.J., Q.W., L.L. and S.Y.; visualization, Q.W., L.L. and S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the National Natural Science Foundation of China (Grant Nos. 62101481, 62002313, 62261060, 62166047, 61862067), Basic Research Project of Yunnan Province (Grant Nos. 202201AU070033, 202201AT070112, 202001BB050076), Major Scientific and Technological Project of Yunnan Province (Grant No. 202202AD080002), Key Laboratory in Software Engineering of Yunnan Province (Grant No.2020SE408), and Research and Application of Object detection based on Artificial Intelligence.

Data Availability Statement

The data provided in this study can be made available upon request from the corresponding author. The data have not been made public because they are still being used for further research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wu, H.; Song, H.; Huang, J.; Zhong, H.; Zhan, R.; Teng, X.; Qiu, Z.; He, M.; Cao, J. Flood Detection in Dual-Polarization SAR Images Based on Multi-Scale Deeplab Model. Remote Sens. 2022, 14, 5181. [Google Scholar] [CrossRef]
  2. Grządziel, A. Application of Remote Sensing Techniques to Identification of Underwater Airplane Wreck in Shallow Water Environment: Case Study of the Baltic Sea, Poland. Remote Sens. 2022, 14, 5195. [Google Scholar] [CrossRef]
  3. Jiang, J.; Xing, Y.; Wei, W.; Yan, E.; Xiang, J.; Mo, D. DSNUNet: An Improved Forest Change Detection Network by Combining Sentinel-1 and Sentinel-2 Images. Remote Sens. 2022, 14, 5046. [Google Scholar] [CrossRef]
  4. Yilmaz, C.S.; Yilmaz, V.; Gungor, O. A theoretical and practical survey of image fusion methods for multispectral pansharpening. Inf. Fusion 2022, 79, 1–43. [Google Scholar] [CrossRef]
  5. Dadrass Javan, F.; Samadzadegan, F.; Mehravar, S.; Toosi, A.; Khatami, R.; Stein, A. A review of image fusion techniques for pan-sharpening of high-resolution satellite imagery. ISPRS J. Photogramm. Remote Sens. 2021, 171, 101–117. [Google Scholar] [CrossRef]
  6. Carper, W.; Lillesand, T.; Kiefer, R. The use of intensity-hue-saturation transformations for merging SPOT panchromatic and multispectral image data. Photogramm. Eng. Remote Sens. 1990, 56, 459–467. [Google Scholar]
  7. Kwarteng, P.; Chavez, A. Extracting spectral contrast in Landsat thematic mapper image data using selective principal component analysis. Photogramm. Eng. Remote Sens. 1989, 55, 339–348. [Google Scholar]
  8. Aiazzi, B.; Baronti, S.; Selva, M. Improving component substitution pansharpening through multivariate regression of MS+Pan data. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3230–3239. [Google Scholar] [CrossRef]
  9. Tu, T.; Su, S.; Shyu, H.; Huang, P.S. A new look at IHS-like image fusion methods. Inf. Fusion 2004, 2, 177–186. [Google Scholar] [CrossRef]
  10. Liu, J.G. Smoothing Filter-Based Intensity Modulation: A Spectral Preserve Image Fusion Technique for Improving Spatial Details. Int. J. Remote Sens. 2000, 21, 3461–3472. [Google Scholar]
  11. Burt, P.; Adelson, E. The Laplacian pyramid as a compact image code. IEEE Trans. Commun. 1983, 31, 532–540. [Google Scholar] [CrossRef]
  12. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A. Context-driven fusion of high spatial and spectral resolution images based on oversampled multiresolution analysis. IEEE Trans. Geosci. Remote Sens. 2002, 40, 2300–2312. [Google Scholar] [CrossRef]
  13. Yokoya, N.; Yairi, T.; Iwasaki, A. Coupled nonnegative matrix factorization unmixing for hyperspectral and multispectral data fusion. IEEE Trans. Geosci. Remote Sens. 2012, 50, 528–537. [Google Scholar] [CrossRef]
  14. Baronti, S.; Aiazzi, B.; Selva, M.; Garzelli, A.; Alparone, L. A Theoretical Analysis of the Effects of Aliasing and Misregistration on Pansharpened Imagery. IEEE J. Sel. Top. Signal Process. 2011, 5, 446–453. [Google Scholar] [CrossRef]
  15. Fei, R.; Zhang, J.; Liu, J.; Fang Du, P.C.; Hu, J. Convolutional sparse representation of injected details for pansharpening. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1595–1599. [Google Scholar] [CrossRef]
  16. Fei, R.; Zhang, J.; Liu, J.; Du, F.; Hu, J.; Chang, P.; Zhou, C.; Sun, K. Weighted manifold regularized sparse representation of featured injected details for pansharpening. Int. J. Remote Sens. 2021, 42, 4199–4223. [Google Scholar] [CrossRef]
  17. Palsson, F.; Sveinsson, J.R.; Ulfarsson, M.O. A new pansharpening algorithm based on total variation. IEEE Geosci. Remote Sens. Lett. 2014, 11, 318–322. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Duijster, A.; Scheunders, P. A Bayesian restoration approach for hyperspectral images. IEEE Trans. Geosci. Remote Sens. 2012, 50, 3453–3462. [Google Scholar] [CrossRef]
  19. Wang, P.; Bayram, B.; Sertel, E. A comprehensive review on deep learning based remote sensing image super-resolution methods. Earth-Sci. Rev. 2022, 232, 104110. [Google Scholar] [CrossRef]
  20. Xiang, H.; Zou, Q.; Nawaz, M.A.; Huang, X.; Zhang, F.; Yu, H. Deep learning for image inpainting: A survey. Pattern Recognit. 2022, 134, 109046. [Google Scholar] [CrossRef]
  21. Khan, S.; Khan, A. FFireNet: Deep Learning Based Forest Fire Classification and Detection in Smart Cities. Symmetry 2022, 14, 2155. [Google Scholar] [CrossRef]
  22. Masi, G.; Cozzolino, D.; Verdoliva, L.; Scarpa, G. Pansharpening by Convolutional Neural Networks. Remote Sens. 2016, 8, 594. [Google Scholar] [CrossRef] [Green Version]
  23. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  24. Scarpa, G.; Vitale, S.; Cozzolino, D. Target-Adaptive CNN-Based Pansharpening. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5443–5457. [Google Scholar] [CrossRef] [Green Version]
  25. Wei, Y.; Yuan, Q.; Shen, H.; Zhang, L. Boosting the Accuracy of Multispectral Image Pansharpening by Learning a Deep Residual Network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1795–1799. [Google Scholar] [CrossRef] [Green Version]
  26. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef] [Green Version]
  27. Cai, J.; Huang, B. Super-Resolution-Guided Progressive Pansharpening Based on a Deep Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5206–5220. [Google Scholar] [CrossRef]
  28. Yang, J.; Fu, X.; Hu, Y.; Huang, Y.; Ding, X.; Paisley, J. PanNet: A Deep Network Architecture for Pan-Sharpening. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1753–1761. [Google Scholar] [CrossRef]
  29. Zhou, C.; Zhang, J.; Liu, J.; Zhang, C.; Fei, R.; Xu, S. PercepPan: Towards Unsupervised Pan-Sharpening Based on Perceptual Loss. Remote Sens. 2020, 12, 2318. [Google Scholar] [CrossRef]
  30. Ciotola, M.; Vitale, S.; Mazza, A.; Poggi, G.; Scarpa, G. Pansharpening by Convolutional Neural Networks in the Full Resolution Framework. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  31. Zhang, Y.; Liu, C.; Sun, M.; Ou, Y. Pan-Sharpening Using an Efficient Bidirectional Pyramid Network. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5549–5563. [Google Scholar] [CrossRef]
  32. Diao, W.; Zhang, F.; Sun, J.; Xing, Y.; Zhang, K.; Bruzzone, L. ZeRGAN: Zero-Reference GAN for Fusion of Multispectral and Panchromatic Images. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–15. [Google Scholar] [CrossRef] [PubMed]
  33. Zhang, T.J.; Deng, L.J.; Huang, T.Z.; Chanussot, J.; Vivone, G. A Triple-Double Convolutional Neural Network for Panchromatic Sharpening. IEEE Trans. Neural Networks Learn. Syst. 2022, 1–14. [Google Scholar] [CrossRef] [PubMed]
  34. Wang, P.; Yao, H.; Li, C.; Zhang, G.; Leung, H. Multiresolution Analysis Based on Dual-Scale Regression for Pansharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  35. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  36. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025. [Google Scholar]
  37. Almahairi, A.; Ballas, N.; Cooijmans, T.; Zheng, Y.; Larochelle, H.; Courville, A. Dynamic Capacity Networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning (ICML’16), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 2549–2558. [Google Scholar]
  38. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision–ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  39. Zhang, J.; Long, C.; Wang, Y.; Piao, H.; Mei, H.; Yang, X.; Yin, B. A Two-Stage Attentive Network for Single Image Super-Resolution. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 1020–1033. [Google Scholar] [CrossRef]
  40. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 13708–13717. [Google Scholar] [CrossRef]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  42. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
  43. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 7794–7803. [Google Scholar]
  44. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic Convolution: Attention Over Convolution Kernels. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11027–11036. [Google Scholar] [CrossRef]
  45. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar] [CrossRef] [Green Version]
  46. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2480–2495. [Google Scholar] [CrossRef] [Green Version]
  47. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, ICML’15, Lille, France, 6–11 July 2015; Volume 37, pp. 448–456. [Google Scholar]
  48. Fu, X.; Wang, W.; Huang, Y.; Ding, X.; Paisley, J. Deep Multiscale Detail Networks for Multiband Spectral Image Sharpening. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2090–2104. [Google Scholar] [CrossRef] [PubMed]
  49. Luo, Y.; Zhang, Y.; Yan, J.; Liu, W. Generalizing Face Forgery Detection with High-Frequency Features. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 16312–16321. [Google Scholar] [CrossRef]
  50. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  51. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient Transformer for High-Resolution Image Restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–23 June 2022; pp. 5728–5739. [Google Scholar]
  52. Chen, L.; Chu, X.; Zhang, X.; Sun, J. Simple Baselines for Image Restoration (NAFNet). In Proceedings of the Computer Vision–ECCV 2022, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  53. Sajjadi, M.S.M.; Schölkopf, B.; Hirsch, M. EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4501–4510. [Google Scholar] [CrossRef]
  54. Deng, L.J.; Vivone, G.; Paoletti, M.E.; Scarpa, G.; He, J.; Zhang, Y.; Chanussot, J.; Plaza, A. Machine Learning in Pansharpening: A Benchmark, from Shallow to Deep Networks. IEEE Geosci. Remote Sens. Mag. 2022, 10, 279–315. [Google Scholar] [CrossRef]
  55. Wald, L.; Ranchin, T.; Mangolini, M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogramm. Eng. Remote Sens. 1997, 63, 691–699. [Google Scholar]
  56. Chang, C.I. An information-theoretic approach to spectral variability, similarity, and discrimination for hyperspectral image analysis. IEEE Trans. Inf. Theory 2000, 46, 1927–1932. [Google Scholar] [CrossRef] [Green Version]
  57. Vivone, G.; Restaino, R.; Dalla Mura, M.; Licciardi, G.; Chanussot, J. Contrast and Error-Based Fusion Schemes for Multispectral Image Pansharpening. IEEE Geosci. Remote Sens. Lett. 2014, 11, 930–934. [Google Scholar] [CrossRef] [Green Version]
  58. Zhou, H.; Liu, Q.; Wang, Y. PanFormer: A Transformer Based Model for Pan-Sharpening. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar] [CrossRef]
  59. Wang, Y.; Deng, L.J.; Zhang, T.J.; Wu, X. SSconv: Explicit Spectral-to-Spatial Convolution for Pansharpening. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 4472–4480. [Google Scholar] [CrossRef]
  60. Nezhad, Z.H.; Karami, A.; Heylen, R.; Scheunders, P. Fusion of Hyperspectral and Multispectral Images Using Spectral Unmixing and Sparse Coding. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2377–2389. [Google Scholar] [CrossRef]
  61. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  62. Palsson, F.; Sveinsson, J.R.; Benediktsson, J.A.; Aanaes, H. Classification of Pansharpened Urban Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 281–297. [Google Scholar] [CrossRef]
  63. Wang, Z.; Bovik, A. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84. [Google Scholar] [CrossRef]
  64. Alparone, L.; Wald, L.; Chanussot, J.; Thomas, C.; Gamba, P.; Bruce, L.M. Comparison of Pansharpening Algorithms: Outcome of the 2006 GRS-S Data-Fusion Contest. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3012–3021. [Google Scholar] [CrossRef]
  65. Wald, L. Quality of high-resolution synthesised images: Is there a simple criterion? In Proceedings of the 3rd Conference Fusion Earth Data, Sophia Antipolis, France, 26–28 January 2000; pp. 99–103. [Google Scholar]
  66. Scarpa, G.; Ciotola, M. Full-Resolution Quality Assessment for Pansharpening. Remote Sens. 2022, 14, 1808. [Google Scholar] [CrossRef]
  67. Vivone, G.; Alparone, L.; Chanussot, J.; Dalla Mura, M.; Garzelli, A.; Licciardi, G.A.; Restaino, R.; Wald, L. A Critical Comparison Among Pansharpening Algorithms. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2565–2586. [Google Scholar] [CrossRef]
  68. Xiao, S.S.; Jin, C.; Zhang, T.J.; Ran, R.; Deng, L.J. Progressive Band-Separated Convolutional Neural Network for Multispectral Pansharpening. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 4464–4467. [Google Scholar] [CrossRef]
  69. Li, W.; Liang, X.; Dong, M. MDECNN: A Multiscale Perception Dense Encoding Convolutional Neural Network for Multispectral Pan-Sharpening. Remote Sens. 2021, 13, 535. [Google Scholar] [CrossRef]
Figure 1. Structure of the CA.
Figure 2. Network architecture of MPFINet. The gray arrows point to the direction of increasing levels.
Figure 3. Structure of MDCDB.
Figure 4. The third level of the fusion process in DAFM.
Figure 5. The self-attention mechanism in CSTB.
Figure 6. Qualitative comparison of MPFINet with ten counterparts on a sample from the QB dataset. (a) Brovey. (b) MTF_GLP_HPM. (c) SFIM. (d) CNMF. (e) PNN. (f) PANNET. (g) SRPPNN. (h) MUCNN. (i) ZeRGAN. (j) PanFormer. (k) MPFINet (ours). (l) PAN. (m) GT.
Figure 7. The residual images between the pansharpened results and reference images in Figure 6. (a) Brovey. (b) MTF_GLP_HPM. (c) SFIM. (d) CNMF. (e) PNN. (f) PANNET. (g) SRPPNN. (h) MUCNN. (i) ZeRGAN. (j) PanFormer. (k) MPFINet (ours). (l) GT.
Figure 8. Qualitative comparison of MPFINet with ten counterparts on a typical satellite image pair from the QB dataset at full resolution. The enlarged view of the corresponding region is shown in the green box. (a) Brovey. (b) MTF_GLP_HPM. (c) SFIM. (d) CNMF. (e) PNN. (f) PANNET. (g) SRPPNN. (h) MUCNN. (i) ZeRGAN. (j) PanFormer. (k) MPFINet (ours). (l) PAN.
Figure 9. Qualitative comparison of MPFINet with ten counterparts on a sample from the WV3 dataset. (a) Brovey. (b) MTF_GLP_HPM. (c) SFIM. (d) CNMF. (e) PNN. (f) PANNET. (g) SRPPNN. (h) MUCNN. (i) ZeRGAN. (j) PanFormer. (k) MPFINet (ours). (l) PAN. (m) GT.
Figure 10. The residual images between the pansharpened results and reference images in Figure 9. (a) Brovey. (b) MTF_GLP_HPM. (c) SFIM. (d) CNMF. (e) PNN. (f) PANNET. (g) SRPPNN. (h) MUCNN. (i) ZeRGAN. (j) PanFormer. (k) MPFINet (ours). (l) GT.
Figure 11. Qualitative comparison of MPFINet with ten counterparts on a typical satellite image pair from the WV3 dataset at full resolution. The enlarged view of the corresponding region is shown in the red box. (a) Brovey. (b) MTF_GLP_HPM. (c) SFIM. (d) CNMF. (e) PNN. (f) PANNET. (g) SRPPNN. (h) MUCNN. (i) ZeRGAN. (j) PanFormer. (k) MPFINet (ours). (l) PAN.
Figure 12. Average quantitative results for different numbers of CSTBs on the QB dataset. (a) PSNR. (b) SSIM. (c) CC. (d) SAM. (e) ERGAS. (f) UIQI.
Figure 13. Visualization of feature maps.
Table 1. Quantitative results of compared methods on the QB dataset in reduced resolution. The best values are shown in bold, and those in second place are underlined.
Method         PSNR     SSIM    CC      SAM     ERGAS   UIQI
Brovey         27.3228  0.8034  0.7992  0.0360  5.1949  0.7417
MTF_GLP_HPM    25.1442  0.7797  0.8751  0.0395  6.0307  0.7645
SFIM           26.4810  0.7313  0.8490  0.0493  5.1755  0.7280
CNMF           27.7634  0.7680  0.8365  0.0481  4.7586  0.7640
PNN            30.2131  0.8220  0.9305  0.0478  3.4744  0.8553
PANNET         33.0751  0.8776  0.9638  0.0406  2.4413  0.9038
SRPPNN         34.4130  0.9009  0.9717  0.0404  2.1060  0.9209
MUCNN          34.8714  0.9097  0.9745  0.0383  2.0003  0.9264
ZeRGAN         24.5345  0.7124  0.7645  0.0923  6.7646  0.6578
PanFormer      35.8243  0.9207  0.9781  0.0362  1.8160  0.9340
Proposed       36.5424  0.9291  0.9812  0.0357  1.6624  0.9405
Ideal value    +∞       1       1       0       0       1
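For readers who wish to reproduce the reduced-resolution indexes in Tables 1 and 3, the sketch below implements commonly used definitions of PSNR, SAM, and ERGAS in Python/NumPy. It is an illustrative sketch only: the function names are ours, and details such as the dynamic range, border handling, and the unit convention for SAM in the reported numbers are not specified by this listing.

```python
import numpy as np

def psnr(gt, fused, data_range=1.0):
    """Peak signal-to-noise ratio over all bands (higher is better)."""
    mse = np.mean((gt.astype(np.float64) - fused.astype(np.float64)) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def sam(gt, fused, eps=1e-12):
    """Spectral angle mapper averaged over pixels, in radians (lower is better).
    gt, fused: arrays of shape (H, W, C)."""
    dot = np.sum(gt * fused, axis=-1)
    norms = np.linalg.norm(gt, axis=-1) * np.linalg.norm(fused, axis=-1) + eps
    return float(np.mean(np.arccos(np.clip(dot / norms, -1.0, 1.0))))

def ergas(gt, fused, ratio=4):
    """Relative dimensionless global error in synthesis (lower is better).
    ratio: PAN/MS spatial resolution ratio (4 for QuickBird and WorldView-3)."""
    gt = gt.astype(np.float64)
    fused = fused.astype(np.float64)
    mse_per_band = np.mean((gt - fused) ** 2, axis=(0, 1))
    mean_per_band = np.mean(gt, axis=(0, 1))
    return 100.0 / ratio * np.sqrt(np.mean(mse_per_band / mean_per_band ** 2))
```

The remaining columns (SSIM, CC, UIQI) follow standard formulations and are available in off-the-shelf packages, for example scikit-image's structural_similarity for SSIM.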
Table 2. Quantitative results of compared methods on the QB dataset in full resolution. The best values are shown in bold, and those in second place are underlined.
Method         Dρ      R-Q2n   QNR
Brovey         0.0205  0.7812  0.7960
MTF_GLP_HPM    0.0761  0.9003  0.8484
SFIM           0.1034  0.9327  0.8628
CNMF           0.0600  0.8353  0.8434
PNN            0.3337  0.9848  0.8962
PANNET         0.2723  0.9859  0.8874
SRPPNN         0.3166  0.9866  0.8965
MUCNN          0.3058  0.9861  0.8989
ZeRGAN         0.1590  0.7342  0.6881
PanFormer      0.8367  0.1328  0.8918
Proposed       0.3344  0.9862  0.8911
Ideal value    0       1       1
Table 3. Quantitative results of compared methods on the WV3 dataset in reduced resolution. The best values are shown in bold, and those in second place are underlined.
Method         PSNR     SSIM    CC      SAM     ERGAS   UIQI
Brovey         29.9676  0.8792  0.8851  0.0459  5.4111  0.7286
MTF_GLP_HPM    27.6733  0.8801  0.9029  0.0488  7.0395  0.7593
SFIM           32.0575  0.8943  0.9304  0.0405  4.3210  0.8046
CNMF           31.5958  0.9013  0.9140  0.0466  4.6167  0.7440
PNN            36.3034  0.9578  0.9688  0.0363  2.6760  0.8887
PANNET         35.8773  0.9537  0.9676  0.0365  2.7940  0.8728
SRPPNN         38.2118  0.9704  0.9798  0.0315  2.1081  0.9061
MUCNN          37.9780  0.9700  0.9792  0.0317  2.1768  0.9063
ZeRGAN         28.6615  0.8328  0.8604  0.1063  6.6978  0.6901
PanFormer      38.0079  0.9697  0.9792  0.0321  2.1484  0.8986
Proposed       38.5129  0.9739  0.9823  0.0298  2.0355  0.9132
Ideal value    +∞       1       1       0       0       1
Table 4. Quantitative results of compared methods on the WV3 dataset at full resolution. The best values are shown in bold, and those in second place are underlined.
Method         Dρ      R-Q2n   QNR
Brovey         0.0795  0.8370  0.8254
MTF_GLP_HPM    0.1241  0.8455  0.8441
SFIM           0.1330  0.9072  0.8812
CNMF           0.1153  0.8396  0.8040
PNN            0.2938  0.9024  0.8588
PANNET         0.2675  0.9019  0.8978
SRPPNN         0.3466  0.9009  0.8643
MUCNN          0.3359  0.9005  0.8801
ZeRGAN         0.2494  0.7832  0.7237
PanFormer      0.3834  0.9044  0.8882
Proposed       0.3355  0.9051  0.8991
Ideal value    0       1       1
Table 5. Ablation study of the input for the feature extraction branch on the QB dataset. The best performance is shown in bold.
Input     PSNR     SSIM    CC      SAM     ERGAS   UIQI
HP-PAN    35.9926  0.9244  0.9798  0.0360  1.7496  0.9380
I₀        36.5424  0.9291  0.9812  0.0357  1.6624  0.9405
Table 6. Ablation study of the proposed method with different structures on the QB dataset. The best performance is shown in bold.
L1 | L2 | Single Level | Double Levels | Three Levels | w/o DAFM | w/o Masked V | PSNR | SSIM | CC | SAM | ERGAS | UIQI
36.5424  0.9291  0.9813  0.0357  1.6624  0.9405
35.8549  0.9229  0.9792  0.0364  1.7757  0.9359
36.4230  0.9294  0.9811  0.0358  1.6791  0.9403
36.1944  0.9266  0.9802  0.0360  1.7188  0.9388
35.5495  0.9216  0.9776  0.0364  1.8557  0.9353
36.2956  0.9277  0.9804  0.0359  1.7080  0.9395
Table 7. Ablation study of the proposed MDCDB with different structures on the QB dataset. The best performance is shown in bold.
CA | CBAM | SE Block | BN | w/o DyConv | One Submodule | PSNR | SSIM | CC | SAM | ERGAS | UIQI
36.5424  0.9291  0.9813  0.0357  1.6624  0.9405
36.3036  0.9274  0.9804  0.0358  1.7103  0.9394
36.3949  0.9285  0.9809  0.0359  1.6860  0.9399
36.1250  0.9246  0.9794  0.0361  1.7724  0.9378
35.9982  0.9249  0.9792  0.0360  1.7694  0.9373
36.4488  0.9290  0.9811  0.0358  1.6777  0.9407
Table 8. Comparison of network parameters and training times under different numbers of CSTBs.
Number of CSTBs      2        3        4
Parameters           2.76 M   3.60 M   4.45 M
Training time (h)    4.7      5.75     6.8
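The parameter counts reported in Table 8 (and in Table 9 below) can be obtained directly from a model definition. A minimal sketch, assuming a PyTorch implementation, is given here; count_parameters is a hypothetical helper name, and the Sequential module is only a placeholder for the actual network instance.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Placeholder module for illustration; substitute the real network instance.
dummy = nn.Sequential(
    nn.Conv2d(4, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 4, kernel_size=3, padding=1),
)
print(f"{count_parameters(dummy):.2f} M parameters")
```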
Table 9. Parameters and average test time of different methods.
Method         Test Time (s)   Number of Parameters
Brovey         0.130           -
MTF_GLP_HPM    0.913           -
SFIM           0.087           -
CNMF           6               -
PNN            0.304           0.08 M
PANNET         0.348           0.15 M
SRPPNN         0.130           0.38 M
MUCNN          0.130           1.2 M
ZeRGAN         3409.018        0.9 M
PanFormer      0.216           1.5 M
MPFINet        0.087           3.6 M
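The average test time in Table 9 is typically obtained by timing repeated forward passes. The sketch below, again assuming a PyTorch model, shows one common measurement protocol; the call signature model(pan, ms) and the warm-up/run counts are assumptions rather than the setup actually used for the reported figures.

```python
import time
import torch

@torch.no_grad()
def average_test_time(model, pan, ms, runs=50, warmup=5):
    """Mean forward-pass time in seconds for a pansharpening model
    taking a PAN tensor and an (upsampled) MS tensor."""
    model.eval()
    for _ in range(warmup):          # warm-up passes, excluded from timing
        model(pan, ms)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # flush pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(runs):
        model(pan, ms)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```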