Article

A Swin Transformer with Dynamic High-Pass Preservation for Remote Sensing Image Pansharpening

College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(19), 4816; https://doi.org/10.3390/rs15194816
Submission received: 30 August 2023 / Revised: 29 September 2023 / Accepted: 2 October 2023 / Published: 3 October 2023
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Pansharpening is a technique used in remote sensing to combine high-resolution panchromatic (PAN) images with lower resolution multispectral (MS) images to generate high-resolution multispectral images while preserving spectral characteristics. Recently, convolutional neural networks (CNNs) have been the mainstream in pansharpening by extracting the deep features of PAN and MS images and fusing these abstract features to reconstruct high-resolution details. However, they are limited by the short-range contextual dependencies of convolution operations. Although transformer models can alleviate this problem, they still suffer from weak capability in reconstructing high-resolution detailed information from global representations. To this end, a novel Swin-transformer-based pansharpening model named SwinPAN is proposed. Specifically, a detail reconstruction network (DRNet) is developed in an image difference and residual learning framework to reconstruct the high-resolution detailed information from the original images. DRNet is developed based on the Swin Transformer with a dynamic high-pass preservation module with adaptive convolution kernels. The experimental results on three remote sensing datasets with different sensors demonstrate that the proposed approach performs better than state-of-the-art networks through qualitative and quantitative analysis. Specifically, the generated pansharpening results contain finer spatial details and richer spectral information than other methods.

1. Introduction

Given the inherent physical limitations of remote sensing satellite sensors, achieving imagery with both high spatial and spectral resolutions poses a significant challenge. Pansharpening has emerged as a pivotal preprocessing technique within numerous remote sensing applications. Its primary objective is to fuse multispectral (MS) imagery with panchromatic (PAN) data, ultimately yielding an MS image mirroring the spatial resolution of the PAN data. Numerous pansharpening methods have been developed in recent years, which can be roughly divided into two categories, i.e., traditional and deep learning-based.
Traditional methods mainly contain component substitution (CS), multiresolution analysis (MRA) and variational optimization (VO) techniques. In essence, the core concept behind CS methods revolves around projecting the low-resolution MS image into a compatible domain. This projection facilitates the replacement of the spatial information component within the low-resolution MS image with corresponding spatial details extracted from the PAN image. This process is undertaken while endeavouring to preserve the original spectral information to the greatest extent possible. Prominent techniques encompass intensity–hue–saturation, principal component analysis [1,2,3], Gram–Schmidt [4], the adaptive component substitution method [5], etc. Moreover, the fundamental principle underlying the MRA method involves the extraction of spatial components from low-resolution MS images through a process of multiresolution decomposition. Subsequently, these spatial components are substituted with panchromatic images containing intricate high-frequency details. Representative methods include the Laplacian pyramid decomposition [6], wavelet transform [7,8] and contourlet transform [9] methods. In contrast, the VO-based approach capitalizes on existing prior information to formulate regular terms that effectively constrain the model. This strategy leads to the derivation of the ultimate panchromatic sharpening outcome through a streamlined solution algorithm. Noteworthy methodologies within this category encompass the construction of sparse priors for regular terms [10], the utilization of image-based nonlocal similarity [11] and the integration of fragment smoothness [12] into the regularization model. These approaches distinctly enhance the accuracy of both spectral and spatial dimensions within the model.
With the development of artificial intelligence, machine learning methods [13,14,15] and deep learning methods have become dominant in remote sensing processing and interpretation [16,17,18,19]. Notably, convolutional neural networks (CNNs) have exhibited promising performance [20,21,22,23] owing to their adeptness in uncovering abstract features within remote sensing imagery. Their robust prowess in image reconstruction further contributes to their efficacy in this context. CNN-based pansharpening can be conceptualized as an image fusion and reconstruction network, trained in an end-to-end manner. The existing approaches can be categorized into five distinct frameworks, as illustrated in Figure 1.
The CNNs essentially attempt to extract deep representations of PAN and MS images and reconstruct high-resolution images while preserving spectral information. Recently, residual learning has been a popular choice in pansharpening networks due to its effectiveness in supplementing spectral information, as shown in Figure 1b–d. However, various detail injection paradigms exist because they adopt different convolutional networks with different detail injection and spectral preservation approaches. As shown in Figure 1, the interactions between the PAN and MS encompass concatenation [24], subtraction [25] and addition [26]. Residual blocks with shortcuts or skip connections have been utilized to ensure the input information can be sufficiently propagated through all layers. They can also effectively relieve the phenomenon of gradient disappearance and explosion. On this basis, numerous efforts have been made to improve the representation capability of CNNs, such as dual-branch CNN [27], local-context adaptive kernels [24], subpixel convolutional layers [27] and multiscale convolutions [26].
More recently, generative adversarial networks (GANs) have been introduced into pansharpening [28,29,30,31,32], using a discriminator to distinguish the generated high-resolution images from the ground truth. Although adversarial training can improve the discriminative capability of the pansharpening network, some fake textures could inevitably be produced.
In addition, some algorithm-unrolling-based networks have been proposed [33,34,35]. Such methods assume the prior distribution between the HR target image and the HR guidance image, and they use one or two algorithms to solve the loss function or optimization function. Eventually, the algorithm is unrolled into a deep learning network, which has the benefit of making the model more interpretable.
While CNN-based approaches have demonstrated remarkable performance compared to conventional methods, they still grapple with the inherent limitations of CNN networks, including their relatively short-range interactions and restricted receptive fields. This can lead to challenges in effectively preserving spatial details, ultimately resulting in notable spatial detail loss within the high-resolution MS images produced by these CNN networks.
Recently, self-attention-based transformer models have been used to model the long-distance dependencies of different pixels within feature maps [36]. They are conducive to capturing vital contextual information from remote sensing images for recovering high-resolution MS images [37,38,39,40]. The MS and PAN images are stacked directly or fed into a two-branch network to encode MS and PAN images and learn their interactions for the reconstruction of high-resolution fused images. However, they focus on global representations but still suffer from a weak capability to recover high-resolution detailed information.
In this paper, a framework named SwinPAN with a detail reconstruction network (DRNet) is proposed to reconstruct high-resolution spatial and spectral details from input images. The process of obtaining the fused image revolves around incorporating the high spatial structures from a high-resolution (HR) PAN image into a resampled MS image. These high spatial structures are typically derived from the difference between the HR PAN and low-resolution (LR) components. A more in-depth exploration is conducted by directly feeding the network with details extracted by differencing the single PAN image with each MS band.
In summary, the main contributions of our work are as follows.
  • The detail injection mechanism is further investigated in pansharpening networks. A dynamic high-pass preservation module is developed to enhance the high frequencies present in input shallow features. This module achieves its objective by adaptively learning to generate convolution kernels. Furthermore, it strategically employs distinct kernels for each spatial location, facilitating the effective amplification of high frequencies.
  • A subtraction framework with details directly extracted by differencing the single PAN image with each MS band is proposed. This solution avoids compromising the spatial information with a preprocessing step based on the detail extraction techniques proposed in classical pansharpening approaches, letting the framework spectrally adjust the extracted details through the estimation of a nonlinear and local injection model.
  • A full transformer network named SwinPAN is developed for pansharpening based on the Swin Transformer. The proposed network introduces content-based interactions between image content and attention weights, resembling spatially varying convolutions. This is achieved through a shifted-window mechanism, which enables effective long-range dependency modelling. Notably, the Swin Transformer boasts improved performance while utilizing fewer parameters in comparison to the Vision Transformer (ViT).
  • Experimental results on three remote sensing datasets, including QuickBird, GaoFen2 and WorldView3, demonstrate that the proposed method achieves superior performance compared with other state-of-the-art CNN-based methods.

2. Related Works

2.1. CNN-Based Methods

PNN [41], as the first CNN-based pansharpening model, took a three-layer convolutional network adapted from SRCNN [42] as its backbone. More specifically, the PNN upsamples a low-resolution MS image to the size of the PAN image. Then, the upsampled image is concatenated with the PAN map along the channel dimension to form the input of the network. Although PNN has fewer network parameters, it converges relatively slowly due to the simple network structure. The DiCNN [43] method first concatenates the PAN map and the upsampled low-resolution MS map. The convolutional layer is used to learn the residual details of the image, and, finally, the output of the network is directly added to the upsampled low-resolution MS image to obtain the final fused image output. Skip connections were adopted to alleviate gradient explosion and speed up the convergence of the network. Furthermore, PanNet [44] attempts to split the panchromatic sharpening task into two objectives: structural and spectral preservation. For structural preservation, detailed contents obtained via the high-pass filter are fed into the CNN. For spectral preservation, PanNet directly adds the upsampled MS map to the output of the spatial detail learning network, which can effectively propagate the spectral information directly to the output image. For MSDCNN [26], the authors designed convolution kernels of different sizes to extract features with different scales and receptive fields, thereby enhancing the network’s representation ability. For BDPN [26], the authors designed a pyramid-based bidirectional network architecture to process low-resolution MS maps and high-resolution PAN maps, respectively. Through this network, the multiscale details of the PAN map can be effectively extracted and injected into the MS map to obtain high-resolution output. Specifically, the entire network structure is converted between different scales. The network for extracting details uses several classic residual network blocks, and the network for reconstructing images uses subpixel convolutional layers to upsample MS images. FusionNet [25] directly uses the difference between the original upsampled MS image and the PAN image to extract image details. This method can effectively maintain the spatial information and potential spectral information of the image. The extracted details are input to several residual network blocks for feature extraction and detail learning, and the final output is added to the upsampled MS map to obtain a fused image. A convolution module named LAGConv [24] that can adapt to local content has been proposed, which mainly includes local adaptive convolution kernel generation and a global bias mechanism. The local adaptive convolution kernel is generated by multiplying the traditional convolution kernel with a learnable adaptive weight matrix. The global bias mechanism mainly compensates for the global information loss caused by the preceding local adaptive convolution, and it is implemented through two fully connected networks.
Although these networks have improved image reconstruction quality, it is challenging for them to obtain position-independent global information, and they cannot make full use of intrinsically similar texture information in images.

2.2. Transformer-Based Methods

Self-attention-based transformer models have been a popular choice in pansharpening. For example, Meng et al. [37] utilized a Vision Transformer [45] for pansharpening, in which MS and PAN are stacked together and then cropped into patches to reduce the sequence length. Therefore, the pansharpening task can be regarded as an image fusion and reconstruction process. However, the Vision Transformer is weak in extracting multiscale contextual information. In [38,39,46], CNN and transformer encoders are applied directly to the stacked MS and PAN images to exploit local and nonlocal features, respectively. Some works exploit two-branch structures to encode MS and PAN images separately, in which convolution and transformer modules are combined to explore spatial–spectral features [47,48]. The generated features are then fused for the reconstruction of the pansharpened image [40]. Recently, the Swin Transformer [49] has drawn attention in image restoration. It uses a hierarchical structure to extract information at different scales and develops a window-based self-attention mechanism to reduce computational costs. It focuses on content-based interactions between image content and attention weights, which can be interpreted as a spatially varying convolution within a local window. Therefore, the Swin Transformer has linear computational complexity with respect to the image size.
Although these transformer-based models have improved their performance in fetching global representations, they are limited in recovering high-resolution detailed information from abstract representations. In these models, pansharpening is regarded as a feature fusion task of MS and PAN images. The high-frequency information in the PAN image cannot be well preserved in the representation learning and image reconstruction process.

3. Methodology

3.1. Framework

As shown in Figure 2, the proposed approach adopts a subtraction framework with a DRNet to reconstruct the high-resolution spatial information and a residual module to preserve the spectral information.
Pre-processing. The MS and PAN images captured by the satellite are denoted as $I_{MS} \in \mathbb{R}^{h \times w \times C_{in}}$ and $I_{Pan} \in \mathbb{R}^{H \times W \times 1}$, respectively. The low-resolution MS image is upsampled to the same size as the PAN image. The PAN image is duplicated in the channel dimension so that it has the same number of channels as the MS image. The images after pre-processing are denoted as $I_{LMS} \in \mathbb{R}^{H \times W \times C_{in}}$ and $I_{PAN} \in \mathbb{R}^{H \times W \times C_{in}}$, respectively. After that, high-frequency spatial structures are obtained from the difference between the PAN and MS components as
$$F_{SUB} = H_{SUB}(I_{PAN}, I_{LMS})$$
where $F_{SUB} \in \mathbb{R}^{H \times W \times C_{in}}$ represents the generated image with spatial structures and $H_{SUB}(\cdot)$ denotes matrix subtraction of the two inputs.
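A minimal sketch of this pre-processing and subtraction step is given below, assuming PyTorch, a resolution ratio of 4 and bicubic upsampling of the MS image (the interpolation kernel used by the authors is not restated here).

```python
# A minimal sketch of the pre-processing and subtraction step; tensor names follow the
# paper's notation, and bicubic upsampling is an assumption for illustration.
import torch
import torch.nn.functional as F

def preprocess_and_subtract(ms, pan, ratio=4):
    """ms: (B, C_in, h, w) low-resolution MS; pan: (B, 1, H, W) PAN with H = ratio * h."""
    # Upsample the MS image to the PAN size.
    lms = F.interpolate(ms, scale_factor=ratio, mode='bicubic', align_corners=False)
    # Duplicate the single PAN band along the channel dimension to match the MS bands.
    pan_rep = pan.repeat(1, ms.shape[1], 1, 1)
    # Band-wise difference F_SUB = H_SUB(I_PAN, I_LMS).
    f_sub = pan_rep - lms
    return lms, pan_rep, f_sub

# Example with a 4-band MS patch and the corresponding PAN patch.
lms, pan_rep, f_sub = preprocess_and_subtract(torch.rand(2, 4, 16, 16), torch.rand(2, 1, 64, 64))
```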
Shallow feature extraction with high-pass preservation (SFE-HP). Rather than directly using $F_{SUB}$ as the input of the network, a dynamic high-pass preservation module is developed to retain the image’s spatial structure in the shallow features. Generally, the same weights are shared across all spatial locations in a convolution operation, which is called weight sharing. However, when a low-pass convolution kernel is applied to image edges, it flattens them and thereby loses spatial information. In the high-pass module, different convolution kernels are therefore generated for each pixel, as shown in Figure 3.
As different channels contribute different spatial information, the feature map is partitioned into several groups to better capture the spatial information; a similar operation is used in channel attention. Rather than assigning one channel to each group, several channels are placed in each group in order to accelerate training. Firstly, the feature maps are divided into g groups in the channel dimension, and each group is denoted as $X_i$, where $i = 1, 2, \ldots, g$. For each group, one convolution kernel map is generated as
$$w_i = G(X_i), \quad i = 1, 2, \ldots, g$$
where $G(\cdot)$ is a function that generates different convolution kernels. $G(\cdot)$ contains four sub-modules: a convolution module, a batch normalization module, a softmax module and an invert module. The structure of the $G(\cdot)$ module is shown in Figure 3. The invert module is added to obtain the high-pass information; otherwise, $G(\cdot)$ would capture the low-pass information. $w_i$ represents the convolution kernel map for each group. Usually, one feature map corresponds to one convolution kernel, which is the standard weight-sharing scheme. In the high-pass module, different convolution kernels with dynamically varying parameters are instead generated for each pixel. Each kernel has $K^2$ parameters, all of which are trainable. $w_i$ is convolved with $X_i \in \mathbb{R}^{H \times W \times \frac{C_{in}}{g}}$ to obtain $F^i_{HP} \in \mathbb{R}^{H \times W \times \frac{C_{in}}{g}}$ as
$$F^i_{HP} = H_{HP}(X_i, w_i), \quad i = 1, 2, \ldots, g$$
where $H_{HP}(\cdot)$ performs the convolution operation. Lastly, the g groups of $F^i_{HP}$ are concatenated along the channel dimension to form $F_{HP}$.
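The following PyTorch sketch illustrates one plausible realization of the dynamic high-pass preservation idea under stated assumptions: the per-pixel kernels are predicted by a convolution followed by batch normalization and a softmax, and the "invert" step is approximated as an identity kernel minus the predicted low-pass kernel. The layer sizes are illustrative and not the authors' exact configuration.

```python
# A hedged sketch of dynamic high-pass preservation: per-pixel K x K kernels are predicted
# for each channel group and applied with an unfold-based dynamic convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicHighPass(nn.Module):
    def __init__(self, channels, groups=4, k=3):
        super().__init__()
        assert channels % groups == 0
        self.g, self.k, self.cg = groups, k, channels // groups
        # G(.): predict one K*K kernel per pixel for every channel group.
        self.kernel_gen = nn.Sequential(
            nn.Conv2d(channels, groups * k * k, 3, padding=1),
            nn.BatchNorm2d(groups * k * k),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        logits = self.kernel_gen(x).view(b, self.g, self.k * self.k, h, w)
        kernels = F.softmax(logits, dim=2)              # per-pixel low-pass (smoothing) kernels
        # "Invert" to a high-pass response: identity kernel minus the low-pass kernel.
        identity = torch.zeros_like(kernels)
        identity[:, :, (self.k * self.k) // 2] = 1.0
        kernels = identity - kernels
        # Apply the per-pixel kernels group by group via unfold.
        patches = F.unfold(x, self.k, padding=self.k // 2)          # (B, C*k*k, H*W)
        patches = patches.view(b, self.g, self.cg, self.k * self.k, h, w)
        out = (patches * kernels.unsqueeze(2)).sum(dim=3)           # (B, g, C/g, H, W)
        return out.reshape(b, c, h, w)
```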
The shallow feature extraction is essentially a convolution layer. It is common to add a convolution layer before the transformer model: it not only leads to stable optimization and better results but also provides a simple way to map the input space to a higher-dimensional feature space. A $3 \times 3$ convolution layer $H_{SF}(\cdot)$ is used to extract the shallow features $F_0 \in \mathbb{R}^{H \times W \times C}$ from $F_{HP}$ as
$$F_0 = H_{SF}(F_{HP})$$
where C is the number of feature channels.
Deep feature extraction. A more complex network is used to extract spatial information. As shown in Figure 2, the input of DRNet consists of low-resolution high-frequency (LR-HF) features. These features are blurry at this stage because they are obtained simply by subtracting the upsampled MS image from the PAN image, without any spatial feature extraction. After the DRNet extracts the spatial information of the LR-HF features, high-resolution high-frequency (HR-HF) features are output, which have a higher effective resolution.
The deep feature $F_{DF} \in \mathbb{R}^{H \times W \times C}$ is extracted from $F_0$ as
$$F_{DF} = H_{DF}(F_0)$$
where $H_{DF}(\cdot)$ is the deep feature extraction module. It contains K DRNet blocks (DRBs) followed by a $3 \times 3$ convolutional layer at the end of the module. The intermediate features $F_1, F_2, \ldots, F_K$ and the module’s output $F_{DF}$ are extracted block by block as
$$F_i = H_{DRB_i}(F_{i-1}), \quad i = 1, 2, \ldots, K$$
$$F_{DF} = H_{CONV}(F_K)$$
where $H_{DRB_i}(\cdot)$ and $H_{CONV}(\cdot)$ denote the i-th DRB and the last convolution layer, respectively. Using a convolution layer at the end of the module brings the inductive bias of the convolution operation into the transformer-based network and lays a better foundation for the later reconstruction of the image from the shallow and deep features.
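A minimal sketch of the deep feature extraction path is shown below; the DRB implementation is assumed to be supplied separately (for instance, the residual block sketched in Section 3.2), so this only illustrates the composition $F_i = H_{DRB_i}(F_{i-1})$ and $F_{DF} = H_{CONV}(F_K)$.

```python
# A minimal sketch of the deep feature extraction module (K DRBs + a 3x3 convolution);
# drb_factory is an assumed callable that builds one DRB for a given channel width.
import torch.nn as nn

class DeepFeatureExtractor(nn.Module):
    def __init__(self, dim, num_blocks, drb_factory):
        super().__init__()
        self.blocks = nn.ModuleList([drb_factory(dim) for _ in range(num_blocks)])
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, f0):
        f = f0
        for block in self.blocks:          # F_i = H_DRB_i(F_{i-1})
            f = block(f)
        return self.conv(f)                # F_DF = H_CONV(F_K)
```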
Image reconstruction module. The feature map $F_{END} \in \mathbb{R}^{H \times W \times C}$ is constructed by aggregating the shallow and deep features as
$$F_{END} = H_{CONV}(F_0 + F_{DF})$$
where $H_{CONV}$ is essentially a $3 \times 3$ convolution layer. It reconstructs the image from the abstract features $F_0$ and $F_{DF}$, which are the outputs of the shallow and deep feature extraction modules, respectively.
Spectral information preservation. The high-pass module, the shallow feature extraction module and the deep feature extraction module aim at extracting spatial information, which contains both edge and detail information. For the spectral information, a residual connection is used:
$$I_{HRMS} = H_{ADD}(I_{LMS}, F_{END})$$
where $H_{ADD}(\cdot)$ performs matrix addition.
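Putting the pieces together, a hedged end-to-end sketch of the forward pass is given below. It follows the formulation above in aggregating the shallow feature $F_0$ with $F_{DF}$ before the reconstruction convolution; the sub-module implementations are assumptions rather than the authors' released code.

```python
# A hedged end-to-end sketch of the SwinPAN forward pass, wiring together the
# sub-modules described above.
import torch.nn as nn

class SwinPAN(nn.Module):
    def __init__(self, bands, dim, high_pass, deep_extractor):
        super().__init__()
        self.high_pass = high_pass                           # dynamic high-pass module
        self.shallow = nn.Conv2d(bands, dim, 3, padding=1)   # H_SF
        self.deep = deep_extractor                           # K DRBs + conv (H_DF)
        self.reconstruct = nn.Conv2d(dim, bands, 3, padding=1)  # H_CONV

    def forward(self, lms, pan_rep):
        f_sub = pan_rep - lms                 # F_SUB
        f_hp = self.high_pass(f_sub)          # dynamic high-pass preservation
        f0 = self.shallow(f_hp)               # F_0
        f_df = self.deep(f0)                  # F_DF
        f_end = self.reconstruct(f0 + f_df)   # F_END from shallow + deep features
        return lms + f_end                    # I_HRMS = H_ADD(I_LMS, F_END)
```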

3.2. Detail Reconstruction Block

As shown in Figure 2, the residual Detail Reconstruction Block (DRB) consists of several Detail Reconstruction Layers (DRLs) followed by a convolution layer. Given the input features $F_{i,0}$ of the i-th DRB, the intermediate features $F_{i,1}, F_{i,2}, \ldots, F_{i,L}$ are extracted by L DRLs as
$$F_{i,j} = H_{DRL_{i,j}}(F_{i,j-1}), \quad j = 1, 2, \ldots, L$$
where $H_{DRL_{i,j}}(\cdot)$ is the j-th DRL in the i-th DRB. A convolution layer is added before the residual connection, and the output of the i-th DRB is formulated as
$$F_{i,output} = H_{CONV_i}(F_{i,L}) + F_{i,0}$$
where $H_{CONV_i}(\cdot)$ is the last convolution layer in the i-th DRB.
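A minimal sketch of the DRB under these formulas is given below; any layout conversion required by the DRLs (channel-first versus channel-last tensors) is assumed to be handled inside the layers.

```python
# A minimal sketch of a residual Detail Reconstruction Block: L DRLs, a convolution and a
# residual connection, as in F_{i,output} = H_CONV_i(F_{i,L}) + F_{i,0}.
import torch.nn as nn

class DRB(nn.Module):
    def __init__(self, dim, num_layers, drl_factory):
        super().__init__()
        # drl_factory is an assumed callable building one Detail Reconstruction Layer.
        self.layers = nn.ModuleList([drl_factory(dim) for _ in range(num_layers)])
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):
        f = x
        for layer in self.layers:          # F_{i,j} = H_DRL_{i,j}(F_{i,j-1})
            f = layer(f)
        return self.conv(f) + x            # residual connection back to F_{i,0}
```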

3.3. Detail Reconstruction Layer

Self-attention in nonoverlapped windows. The Detail Reconstruction Layer (DRL) is based on multihead self-attention. For a given input $F_{i,j} \in \mathbb{R}^{H \times W \times C}$, the DRL first reshapes the input into a $\frac{HW}{M^2} \times M^2 \times C$ tensor by window partitioning, which splits the input into nonoverlapping $M \times M$ local windows, where $\frac{HW}{M^2}$ is the number of windows. The standard multihead self-attention is then computed within each window. For a feature $X \in \mathbb{R}^{M^2 \times C}$ in a window, the query Q, key K and value V are obtained as
$$Q = X P_Q, \quad K = X P_K, \quad V = X P_V$$
where $P_Q$, $P_K$ and $P_V$ are projection matrices shared across all windows. The multihead self-attention is then computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^T}{\sqrt{d}} + B\right)V$$
where B is a learnable relative position bias and d is the dimension of the query and key. This process is shown in Figure 4. The attention function is computed h times in parallel, and the concatenation of the h results forms the output of W-MSA. As for the MLP (multilayer perceptron), a network with two fully connected layers is used, each followed by a GELU activation for further feature transformation.
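The sketch below illustrates window-based multihead self-attention with a learnable relative position bias. The bias indexing follows the standard Swin Transformer construction, which is an assumption here rather than a detail stated in the paper.

```python
# A hedged sketch of window-based multihead self-attention (W-MSA) with a learnable
# relative position bias B, following the formulas above.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.dim, self.m, self.h = dim, window_size, num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)        # shared projections P_Q, P_K, P_V
        self.proj = nn.Linear(dim, dim)
        # Relative position bias table, indexed by relative offsets inside a window.
        self.bias_table = nn.Parameter(torch.zeros((2 * window_size - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing='ij'))
        coords = coords.flatten(1)                                   # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]                # (2, M*M, M*M)
        rel = rel.permute(1, 2, 0) + window_size - 1
        index = rel[..., 0] * (2 * window_size - 1) + rel[..., 1]
        self.register_buffer('bias_index', index)                    # (M*M, M*M)

    def forward(self, x):                     # x: (num_windows*B, M*M, C)
        b, n, c = x.shape
        qkv = self.qkv(x).view(b, n, 3, self.h, c // self.h).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]                             # (b, heads, n, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        bias = self.bias_table[self.bias_index.view(-1)].view(n, n, self.h)
        attn = attn + bias.permute(2, 0, 1).unsqueeze(0)             # add B
        attn = attn.softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(b, n, c))
```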
Shifted-window partitioning in successive layers. Patch merging between DRBs is not used, unlike in most other pansharpening methods, in order to preserve spatial information as much as possible. However, this leads to a lack of communication between different windows. Therefore, the shifted-window partitioning method is used, which alternates between two partitioning configurations in consecutive DRLs, as shown in Figure 4.
As shown in Figure 4, the upper module uses the regular window partitioning method (W-MSA), and the lower module uses the shifted-window partitioning method (SW-MSA). The regular partitioning divides the $8 \times 8$ feature map into a $2 \times 2$ arrangement of windows, each of size $4 \times 4$. In the shifted-window partitioning, each window is displaced towards the lower right by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels relative to the regular partitioning; the empty pixels in the lower right corner are filled with those from the upper left corner, as shown in Figure 2, and a masked (S)W-MSA is used when computing self-attention.
With the alternation of the regular and shifted-window partitioning methods, two successive DRNet layers are computed as
$$\hat{x}^l = \text{W-MSA}(\mathrm{LN}(x^{l-1})) + x^{l-1}, \quad x^l = \mathrm{MLP}(\mathrm{LN}(\hat{x}^l)) + \hat{x}^l$$
$$\hat{x}^{l+1} = \text{SW-MSA}(\mathrm{LN}(x^l)) + x^l, \quad x^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{x}^{l+1})) + \hat{x}^{l+1}$$
where $\hat{x}^l$ and $x^l$ denote the output features of the (S)W-MSA module and the MLP module in the l-th DRL, respectively, and W-MSA$(\cdot)$ and SW-MSA$(\cdot)$ denote window-based multihead self-attention with the regular and shifted window partitioning, respectively.
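A minimal sketch of one W-MSA/SW-MSA layer pair is given below. The cyclic shift is implemented with torch.roll, and the attention mask for wrapped-around regions is omitted for brevity, so this shows the data flow rather than a complete SW-MSA.

```python
# A minimal sketch of a Detail Reconstruction Layer; shift=False gives the W-MSA layer,
# shift=True the SW-MSA layer (without the wrap-around attention mask).
import torch
import torch.nn as nn

def window_partition(x, m):                    # x: (B, H, W, C) -> (B*nW, M*M, C)
    b, h, w, c = x.shape
    x = x.view(b, h // m, m, w // m, m, c).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, m * m, c)

def window_reverse(windows, m, h, w):          # inverse of window_partition
    b = windows.shape[0] // ((h // m) * (w // m))
    x = windows.view(b, h // m, w // m, m, m, -1).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h, w, -1)

class DRL(nn.Module):
    def __init__(self, dim, window_size, num_heads, shift):
        super().__init__()
        self.m, self.shift = window_size, shift
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = WindowAttention(dim, window_size, num_heads)  # sketched above
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x):                       # x: (B, H, W, C)
        b, h, w, c = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:                          # cyclic shift for SW-MSA
            x = torch.roll(x, shifts=(-(self.m // 2), -(self.m // 2)), dims=(1, 2))
        x = window_reverse(self.attn(window_partition(x, self.m)), self.m, h, w)
        if self.shift:                          # roll back to the original layout
            x = torch.roll(x, shifts=(self.m // 2, self.m // 2), dims=(1, 2))
        x = shortcut + x                        # x^l = (S)W-MSA(LN(x^{l-1})) + x^{l-1}
        return x + self.mlp(self.norm2(x))      # + MLP branch
```

Inside each DRB, such layers would be stacked in regular/shifted pairs, e.g., DRL(dim, 4, heads, shift=False) followed by DRL(dim, 4, heads, shift=True).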

4. Experiments

4.1. Datasets

Three large-scale remote sensing datasets are employed for the evaluation of the proposed approach on pansharpening.
The WorldView-3 dataset predominantly consists of panchromatic and multispectral images comprising eight bands. These images were captured by WorldView-3 satellites within the visible and near-infrared spectral ranges. Notably, the sampling intervals for the panchromatic and multispectral images are 0.3 m and 1.2 m, respectively. This results in a spatial resolution ratio of 4 between them.
The GaoFen2 dataset comprises panchromatic images and multispectral images encompassing four bands. These images were acquired by the Gaofen2 satellite in the visible and near-infrared spectral ranges. The spatial sampling intervals for panchromatic and multispectral images stand at 1 and 4 m, respectively. This spatial difference results in a spatial resolution scale of 4.
As for the QuickBird dataset, it contains panchromatic images and multispectral images encompassing four bands. These images were captured by the QuickBird satellite within the visible and near-infrared spectral ranges. The spatial sampling intervals for the panchromatic and multispectral images are 0.61 m and 2.44 m, respectively. This results in a spatial resolution scale of 4.

4.2. Experimental Settings

4.2.1. Data Preparation

It should be noted that real-world, ideal high-resolution multispectral (MS) images are not commonly available. Hence, training samples are often acquired following Wald’s protocol. The process begins by filtering the image block using the modulation transfer function (MTF) tailored to the specific satellite. Subsequently, the nearest interpolation method is employed to downsample the image block by a specified resolution factor. This results in the creation of both panchromatic and multispectral image blocks on a lower resolution scale. Further, a 23-tap polynomial interpolation is implemented to achieve the upsampling of the multispectral image blocks. The original MS image blocks, prior to downsampling, are utilized as the reference ground truths in this context.
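A simplified sketch of this reduced-resolution sample generation is given below. The sensor-specific MTF filter is approximated by a Gaussian blur and the 23-tap polynomial interpolation by a bicubic kernel; both are stand-ins for illustration only.

```python
# A hedged sketch of Wald's protocol: blur, decimate by the resolution ratio and
# upsample the degraded MS, keeping the original MS as the reference ground truth.
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def make_training_pair(ms, pan, ratio=4, blur_sigma=1.0):
    """ms: (B, C, H, W) original MS (kept as ground truth); pan: (B, 1, r*H, r*W) PAN."""
    gt = ms
    # Low-pass filter, then decimate both inputs by the resolution ratio.
    ms_lr = TF.gaussian_blur(ms, 5, blur_sigma)[..., ::ratio, ::ratio]
    pan_lr = TF.gaussian_blur(pan, 5, blur_sigma)[..., ::ratio, ::ratio]
    # Upsample the degraded MS back to the degraded-PAN scale for the network input.
    lms = F.interpolate(ms_lr, scale_factor=ratio, mode='bicubic', align_corners=False)
    return lms, pan_lr, gt
```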
All PAN, MS and ground truths were cropped into patches with the size 64 × 64 , 16 × 16 and 64 × 64 , respectively. Regarding the QuickBird dataset, it comprises a total of 17,139 pairs of MS and PAN images for training purposes. Additionally, 20 image pairs are allocated for testing. Similarly, for the GaoFen2 dataset, there are 19,809 pairs of MS and PAN images available for training, along with 20 images set aside for testing and validation. Furthermore, the WorldView-3 dataset encompasses 9714 pairs of MS and PAN images designated for training. Alongside this, there are 20 images designated for both testing and validation purposes.

4.2.2. Implementation Details

For the proposed model, the Adam optimizer was chosen for network training. The learning rate and the mini-batch size were set to $3 \times 10^{-4}$ and 32, respectively. The numbers of DRBs, DRLs, attention heads and feature dimensions in each DRL are displayed in Table 1. Furthermore, the parameters of the compared methods were set following the corresponding original articles. All experiments were implemented on the PyTorch platform under the Ubuntu operating system with NVIDIA GeForce RTX 3090 graphics cards.
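A minimal training-loop sketch matching these settings is shown below; the reconstruction loss is assumed to be an L1 loss and the number of epochs is illustrative, since neither is restated in this section.

```python
# A minimal training-loop sketch with the stated optimizer, learning rate and batch size.
import torch

def train(model, loader, epochs=500, device='cuda'):
    model = model.to(device)
    optim = torch.optim.Adam(model.parameters(), lr=3e-4)
    criterion = torch.nn.L1Loss()                 # assumed reconstruction loss
    for epoch in range(epochs):
        for lms, pan_rep, gt in loader:           # mini-batches of size 32
            lms, pan_rep, gt = lms.to(device), pan_rep.to(device), gt.to(device)
            pred = model(lms, pan_rep)
            loss = criterion(pred, gt)
            optim.zero_grad()
            loss.backward()
            optim.step()
```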

4.2.3. Metrics

For reduced-resolution tests, four commonly used indices were employed, including the spectral angle mapper (SAM) [50], Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS), the spatial correlation coefficient (SCC) [51] and the multiband extension of the Universal Image Quality Index, denoted by $Q2^n$ [52], where Q4 applies to four-band images and Q8 to eight-band images. Specifically, SAM evaluates spectral distortions in the resulting images compared with the ground truth. ERGAS represents the relative global dimensional synthesis error, while SCC evaluates the similarity of the spatial details between the results and the ground truth. Furthermore, $Q2^n$ is a composite measure comprising various factors to encompass correlations, the mean of each spectral band, intraband local variance and the spectral angle. Consequently, it incorporates both intraband and interband distortions within a unified index.
In the full-resolution evaluation, due to the lack of reference images, three popular nonreference indices, including $D_\lambda$, $D_s$ and the quality with no reference (QNR) [53], were used. Specifically, $D_\lambda$ measures the spectral distortion, while $D_s$ evaluates the spatial distortion. Furthermore, QNR is a combination of $D_s$ and $D_\lambda$ that measures global quality without a ground truth.
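For reference, the sketch below computes SAM and ERGAS from their standard definitions; the toolbox used for the reported numbers may differ in averaging and masking conventions.

```python
# Hedged reference implementations of two reduced-resolution metrics.
import torch

def sam(pred, ref, eps=1e-8):
    """Spectral angle mapper in degrees; pred, ref: (B, C, H, W)."""
    dot = (pred * ref).sum(dim=1)
    denom = pred.norm(dim=1) * ref.norm(dim=1) + eps
    angle = torch.acos((dot / denom).clamp(-1, 1))
    return torch.rad2deg(angle).mean()

def ergas(pred, ref, ratio=4, eps=1e-8):
    """Relative dimensionless global error in synthesis; pred, ref: (B, C, H, W)."""
    rmse = ((pred - ref) ** 2).mean(dim=(0, 2, 3)).sqrt()        # per-band RMSE
    mean_ref = ref.mean(dim=(0, 2, 3))
    return 100.0 / ratio * torch.sqrt(((rmse / (mean_ref + eps)) ** 2).mean())
```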

4.3. Comparison Analysis

The proposed approach is compared with ten state-of-the-art pansharpening methods, including GSA [54], MTF-GLP-HPM [55], SFIM [56], DiCNN, MSDCNN, DRPNN, FusionNet, PNN, PanNet and HyperTransformer. The evaluation of performance is conducted at both reduced and full resolutions.

4.3.1. Reduced-Resolution Experiment

Quantitative analysis. Table 2, Table 3 and Table 4 provide a quantitative assessment of the compared methods alongside our proposed approach across the three reduced-resolution datasets. Notably, our proposed method achieves top-tier performance across all metrics and datasets. Particularly in the case of the WorldView3 dataset, our approach demonstrates an impressive enhancement of around 25% in the measure. Excluding our method, MSDCNN and HyperTransformer emerge as noteworthy contenders. They yield promising results, with MSDCNN clinching the best performance in metrics such as SAM, Q4 and SCC for the QuickBird dataset, and HyperTransformer clinching the best performance in ERGAS, with its performance on the other metrics also close to that of MSDCNN for QuickBird. Additionally, their efficacies are supported by their dual-stream residual information enhancement structure and transformer-based structure, respectively. However, MSDCNN’s performance does not generalize effectively to the WorldView3 dataset. Among the compared methods, DRPNN showcases better performance in terms of ERGAS, as well as superior SCC, on the GaoFen2 dataset. This performance is attributed to its utilization of deep residual models as the underlying architecture. Given the inherent limitations of their neural network representation capacity, PNN and PanNet demonstrate suboptimal performance across all three datasets. This underscores the critical role of neural networks’ expressive power in achieving favourable outcomes. Traditional methods such as SFIM, MTF-GLP-HPM and GSA achieve poor performances. Upon reviewing Table 2, Table 3 and Table 4, it is evident that our proposed approach significantly outperforms CNN-based models, reiterating its superiority.
Visual comparisons. Figure 5, Figure 6 and Figure 7 provide visual insights through the comparison of sample results extracted from the test datasets. The traditional methods generally have color distortion and edge blurring among all the datasets. In the context of the QuickBird dataset, DiCNN, DRPNN and FusionNet exhibit slightly blurred edges in their outcomes. Moreover, MSDCNN, PNN, HyperTransformer and PanNet demonstrate spectral distortions. Our proposed approach, on the other hand, adeptly retains spectrum information and spatial structures. It yields outcomes featuring clean and high-frequency details that closely resemble the PAN images. Figure 6 reveals the prevalence of spectral distortions across all compared methods on the GaoFen2 dataset. Our proposed method stands out for effectively preserving spectrum information. Likewise, Figure 7 highlights spectral distortions among all the compared methods on the WorldView3 dataset. Our proposed approach maintains the bulk of the spectrum, achieving results akin to the ground-truth image. Collectively, the visual comparisons reinforce the notion that the proposed approach outperforms other state-of-the-art models.

4.3.2. Full-Resolution Experiment

Quantitative Analysis. A comprehensive evaluation is conducted of all methods using the full-resolution dataset. It is important to note that the full-resolution dataset lacks ground-truth images for testing purposes. Nevertheless, the results clearly demonstrate the superiority of our proposed method across all metrics and datasets, showcasing a substantial improvement, as shown in Table 5, Table 6 and Table 7. This underscores the robustness and strong generalization of our approach.
Among the compared methods, the traditional methods perform poorly in reconstructing high-resolution images, and PNN exhibits acceptable performance on the GaoFen2 and WorldView3 datasets. However, its performance suffers on the QuickBird dataset, indicating a lack of broad applicability. HyperTransformer is also unsatisfactory in terms of model generalization; it performs well on the QuickBird and WorldView3 datasets but suffers on the GaoFen2 dataset. Certain methods excel in specific metrics while lagging behind in others. For instance, DiCNN performs reasonably well in terms of $D_\lambda$ but falls significantly short in both $D_s$ and QNR, particularly on the QuickBird and GaoFen2 datasets. This suggests that while DiCNN’s detail injection might effectively preserve spectral information, its capability to extract spatial details is constrained by the limited feature representation capacity of CNNs. MSDCNN demonstrates solid performance in the context of reduced-resolution datasets. However, this achievement does not carry over effectively to the full-resolution datasets. In stark contrast, our proposed method consistently outperforms all other approaches across all datasets, reaffirming its resilience and generalization capability.
Visual comparisons. Figure 8, Figure 9 and Figure 10 visually illustrate the comparative analysis of the full-resolution experimental outcomes. An important observation is that distinguishing visual differences among the CNN methods is not straightforward. This challenge arises from the constraints of representing the images in 8-bit RGB format, which falls short of capturing the nuances in the original 11-bit MS data. Firstly, the images generated by the traditional methods have problems such as color distortion, and their blurred results have lower spectral consistency with the MS image. Specifically, DiCNN, PNN, PanNet and HyperTransformer exhibit notable color distortion. Although MSDCNN, DRPNN and FusionNet manage to retain a relatively higher amount of spectral information, they still exhibit a loss of intricate textures. Furthermore, DiCNN, MSDCNN, DRPNN and PNN produce regions of smoothing and blurring. While FusionNet, HyperTransformer and PanNet present outcomes with enhanced, detailed information and texture, they compromise spatial accuracy. In contrast, the proposed approach outperforms the others by producing more precise and coherent outlines of ground objects. This is indicative of better texture information extraction and reconstruction in our method.

5. Discussion

5.1. Ablation Study

5.1.1. Ablation Study of the Dynamic High-Pass Preservation Module

Both reduced-resolution and full-resolution metrics are evaluated on the QuickBird dataset with and without the high-pass module. As Figure 11 shows, the model with the high-pass module obtains better results in all seven metrics. This once again confirms that the proposed high-pass module can effectively extract spatial details, thereby improving the quality of the generated images.

5.1.2. Ablation Study of the Static High-Pass Preservation Module

Compared to the dynamic high-pass preservation module, the static high-pass preservation module is just a simple three-layer CNN. Both reduced-resolution and full-resolution metrics are evaluated on the QuickBird dataset with the static and with the dynamic high-pass preservation module. As Figure 12 shows, the model with the dynamic high-pass module obtains better results in all seven metrics than the one with the static high-pass preservation module; the static module even obtains worse results than the model without any high-pass preservation module. This confirms that static convolution loses some edge spatial information.

5.2. Parameter Analysis

5.2.1. Feature Dimension

The feature dimension parameter refers to the channel dimension of the feature map after being processed by the SFE-HP module. Experiments with different dimensions are conducted on the QuickBird dataset. Overall, the metric values are not sensitive to this dimension; when the dimension changes, the results do not fluctuate greatly. For the full-resolution dataset, the results become slightly better with increasing dimension, as shown in Figure 13. For the reduced-resolution tests, the results fluctuate slightly as the dimension increases, and the change in feature dimension shows no apparent correlation with the reduced-resolution metrics.

5.2.2. Number of DRLs

The number of DRLs in the DRB module affects the feature representation and the model complexity. As shown in Figure 14, a larger number of DRLs can enhance the representation capability of the model. As a consequence, this tends to yield larger values of metrics like Q4 and SCC on reduced-resolution tests. However, this improvement might come at the cost of reduced accuracies in full-resolution evaluation. Therefore, the acceptable number of DRLs should be carefully investigated to balance the performance of the proposed approach in the reduced- and full-resolution experiments.
The experimental results on three remote sensing datasets from different sensors have corroborated that the proposed approach is superior to other methods in the reduced- and full-resolution experiments through qualitative and quantitative analysis. The generated pansharpening results effectively reconstruct high-resolution details from PAN images while preserving the essential spectral information in the MS images. In addition, parameter analysis experiments and ablation studies were conducted. The ablation experiment particularly highlighted the efficacy of the high-pass module. The two sets of parameter analysis focused on investigating the dimension of SFE-HP and the structure of DRBs. The experimental results show that the model has strong robustness to the model’s hyperparameters. However, it is essential to acknowledge that including multihead attention mechanisms might contribute to increased model complexity, which could be considered a primary limitation of our proposed approach.

6. Conclusions

In this paper, a novel network called SwinPAN designed to address pansharpening tasks is proposed. In SwinPAN, a comprehensive transformer-based network named DRNet is proposed, building upon the Swin Transformer architecture. A dynamic high-pass preservation module is developed to enhance the high-frequency content present in shallow input features. Furthermore, it leverages content-based interactions between image content and attention weights, using a shifted-window mechanism to model long-range dependencies effectively. Comparative experiments are conducted on the QuickBird, GaoFen2 and WorldView3 datasets. The results of these experiments demonstrate that DRNet excels in generating images enriched with fine spatial details and robust spectral information. Additionally, it surpasses previously proposed models in terms of both rendered images and evaluation metrics. In future work, semi-supervised or unsupervised learning methods for pansharpening will be investigated to reduce the labeling efforts of training samples for deep models.

Author Contributions

Data curation, W.L.; formal analysis, W.L.; methodology, W.L. and Y.H.; validation, Y.H.; visualization, Y.H. and Y.P.; writing—original draft, Y.H.; writing—review and editing, Y.H., M.H. and Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (nos. 62331008, 61972060 and 62027827), the National Key Research and Development Program of China (no. 2019YFE0110800) and Special Funding for Postdoctoral Research Projects of Chongqing (no. 2022CQBSHTB3103).

Data Availability Statement

Data sharing is not applicable to this article.

Acknowledgments

The authors would like to thank all the reviewers for their valuable contributions to our work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chavez, P.S., Jr.; Kwarteng, A.Y. Extracting spectral contrast in Landsat Thematic Mapper image data using selective principal component analysis. In Proceedings of the 6th Thematic Conference on Remote Sensing for Exploration Geology, Houston, TX, USA, 16–19 May 1988. [Google Scholar]
  2. Shah, V.P.; Younan, N.H.; King, R.L. An Efficient Pan-Sharpening Method via a Combined Adaptive PCA Approach and Contourlets. IEEE Trans. Geosci. Remote Sens. 2008, 46, 1323–1335. [Google Scholar] [CrossRef]
  3. Shettigara, V.K. A generalized component substitution technique for spatial enhancement of multispectral images using a higher resolution data set. Photogram. Eng. Remote Sens. 1992, 58, 561–567. [Google Scholar]
  4. Tu, T.M.; Su, S.C.; Shyu, H.C.; Huang, P.S. A new look at IHS-like image fusion methods. Inf. Fusion 2001, 2, 177–186. [Google Scholar] [CrossRef]
  5. Choi, J.; Yu, K.; Kim, Y. A New Adaptive Component-Substitution-Based Satellite Image Fusion by Using Partial Replacement. IEEE Trans. Geosci. Remote Sens. 2010, 49, 295–309. [Google Scholar] [CrossRef]
  6. Burt, P.J. The Laplacian Pyramid as a Compact Image Code. In Readings in Computer Vision; Morgan Kaufmann: Burlington, MA, USA, 1987; pp. 671–679. [Google Scholar]
  7. Mallat, S.G. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674–693. [Google Scholar] [CrossRef]
  8. Antoniadis, A.; Oppenheim, G. The Stationary Wavelet Transform and some Statistical Applications. In Wavelets and Statistics; [Lecture Notes in Statistics]; Springer Science & Business Media: Berlin, Germany, 1995; Volume 103, pp. 281–299. [Google Scholar] [CrossRef]
  9. Do, M.N.; Vetterli, M. The contourlet transform: An efficient directional multiresolution image representation. IEEE Trans. Image Process. 2005, 14, 2091–2106. [Google Scholar] [CrossRef]
  10. Fang, F.; Li, F.; Shen, C.; Zhang, G. A Variational Approach for Pan-Sharpening. IEEE Trans. Image Process. A Publ. IEEE Signal Process. Soc. 2013, 22, 2822–2834. [Google Scholar] [CrossRef]
  11. Buades, A.; Coll, B.; Duran, J.; Sbert, C. Implementation of Nonlocal Pansharpening Image Fusion. Image Process. Line 2014, 4, 1–15. [Google Scholar] [CrossRef]
  12. Deng, L.J.; Feng, M.; Tai, X.C. The fusion of panchromatic and multispectral remote sensing images via tensor-based sparse modeling and hyper-Laplacian prior. Inf. Fusion 2018, 52, 76–89. [Google Scholar] [CrossRef]
  13. Devi, Y.A.S. Ranking based classification in hyperspectral images. J. Eng. Appl. Sci. 2018, 13, 1606–1612. [Google Scholar]
  14. Nayak, S.C.; Sanjeev Kumar Dash, C.; Behera, A.K.; Dehuri, S. An elitist artificial-electric-field-algorithm-based artificial neural network for financial time series forecasting. In Biologically Inspired Techniques in Many Criteria Decision Making: Proceedings of BITMDM 2021; Springer: Singapore, 2022; pp. 29–38. [Google Scholar]
  15. Merugu, S.; Tiwari, A.; Sharma, S.K. Spatial–spectral image classification with edge preserving method. J. Indian Soc. Remote Sens. 2021, 49, 703–711. [Google Scholar] [CrossRef]
  16. Zhang, X.; Yu, W.; Pun, M.O.; Shi, W. Cross-domain landslide mapping from large-scale remote sensing images using prototype-guided domain-aware progressive representation learning. ISPRS J. Photogramm. Remote Sens. 2023, 197, 1–17. [Google Scholar] [CrossRef]
  17. Zhang, X.; Zhang, B.; Yu, W.; Kang, X. Federated Deep Learning with Prototype Matching for Object Extraction From Very-High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  18. Dabbu, M.; Karuppusamy, L.; Pulugu, D.; Vootla, S.R.; Reddyvari, V.R. Water atom search algorithm-based deep recurrent neural network for the big data classification based on spark architecture. Int. J. Mach. Learn. Cybern. 2022, 13, 2297–2312. [Google Scholar] [CrossRef]
  19. Balamurugan, D.; Aravinth, S.; Reddy, P.C.S.; Rupani, A.; Manikandan, A. Multiview objects recognition using deep learning-based wrap-CNN with voting scheme. Neural Process. Lett. 2022, 54, 1495–1521. [Google Scholar] [CrossRef]
  20. Vitale, S. A cnn-based pansharpening method with perceptual loss. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 3105–3108. [Google Scholar]
  21. Vitale, S.; Scarpa, G. A detail-preserving cross-scale learning strategy for CNN-based pansharpening. Remote Sens. 2020, 12, 348. [Google Scholar] [CrossRef]
  22. He, L.; Zhu, J.; Li, J.; Plaza, A.; Chanussot, J.; Li, B. HyperPNN: Hyperspectral pansharpening via spectrally predictive convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3092–3100. [Google Scholar] [CrossRef]
  23. Scarpa, G.; Vitale, S.; Cozzolino, D. Target-adaptive CNN-based pansharpening. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5443–5457. [Google Scholar] [CrossRef]
  24. Jin, Z.R.; Zhang, T.J.; Jiang, T.X.; Vivone, G.; Deng, L.J. LAGConv: Local-context adaptive convolution kernels with global harmonic bias for pansharpening. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 1113–1121. [Google Scholar]
  25. Deng, L.J.; Vivone, G.; Jin, C.; Chanussot, J. Detail Injection-Based Deep Convolutional Neural Networks for Pansharpening. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6995–7010. [Google Scholar] [CrossRef]
  26. Yuan, Q.; Wei, Y.; Meng, X.; Shen, H.; Zhang, L. A Multiscale and Multidepth Convolutional Neural Network for Remote Sensing Imagery Pan-Sharpening. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 978–989. [Google Scholar] [CrossRef]
  27. Yang, Y.; Tu, W.; Huang, S.; Lu, H.; Wan, W.; Gan, L. Dual-stream convolutional neural network with residual information enhancement for pansharpening. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  28. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 13 December 2014; pp. 2672–2680. [Google Scholar]
  29. Liu, Q.; Zhou, H.; Xu, Q.; Liu, X.; Wang, Y. PSGAN: A generative adversarial network for remote sensing image pan-sharpening. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10227–10242. [Google Scholar] [CrossRef]
  30. Xu, Q.; Li, Y.; Nie, J.; Liu, Q.; Guo, M. UPanGAN: Unsupervised pansharpening based on the spectral and spatial loss constrained generative adversarial network. Inf. Fusion 2023, 91, 31–46. [Google Scholar] [CrossRef]
  31. Zhao, Z.; Zhan, J.; Xu, S.; Sun, K.; Huang, L.; Liu, J.; Zhang, C. FGF-GAN: A lightweight generative adversarial network for pansharpening via fast guided filter. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  32. Li, W.; Zhu, M.; Li, C.; Fu, H. PAN-GAN: A Generative Adversarial Network for Pansharpening. Remote Sens. 2020, 12, 1836. [Google Scholar]
  33. Xu, S.; Zhang, J.; Zhao, Z.; Sun, K.; Liu, J.; Zhang, C. Deep gradient projection networks for pan-sharpening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1366–1375. [Google Scholar]
  34. Mifdal, J.; Tomás-Cruz, M.; Sebastianelli, A.; Coll, B.; Duran, J. Deep unfolding for hyper sharpening using a high-frequency injection module. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 2106–2115. [Google Scholar]
  35. Zhou, M.; Yan, K.; Pan, J.; Ren, W.; Xie, Q.; Cao, X. Memory-augmented deep unfolding network for guided image super-resolution. Int. J. Comput. Vis. 2023, 131, 215–242. [Google Scholar] [CrossRef]
  36. Zhang, X.; Yu, W.; Pun, M.O. Multilevel deformable attention-aggregated networks for change detection in bitemporal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  37. Meng, X.; Wang, N.; Shao, F.; Li, S. Vision Transformer for Pansharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  38. Yin, J.; Qu, J.; Sun, L.; Huang, W.; Chen, Q. A Local and Nonlocal Feature Interaction Network for Pansharpening. Remote Sens. 2022, 14, 3743. [Google Scholar] [CrossRef]
  39. Li, S.; Guo, Q.; Li, A. Pan-Sharpening Based on CNN+ Pyramid Transformer by Using No-Reference Loss. Remote Sens. 2022, 14, 624. [Google Scholar] [CrossRef]
  40. Zhang, K.; Li, Z.; Zhang, F.; Wan, W.; Sun, J. Pan-Sharpening Based on Transformer with Redundancy Reduction. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  41. Masi, G.; Cozzolino, D.; Verdoliva, L.; Scarpa, G. Pansharpening by convolutional neural networks. Remote Sens. 2016, 8, 594. [Google Scholar] [CrossRef]
  42. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  43. He, L.; Rao, Y.; Li, J.; Plaza, A.; Zhu, J. Pansharpening via Detail Injection Based Convolutional Neural Networks. arXiv 2018, arXiv:1806.08898. [Google Scholar] [CrossRef]
  44. Yang, J.; Fu, X.; Hu, Y.; Huang, Y.; Paisley, J. PanNet: A Deep Network Architecture for Pan-Sharpening. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  45. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  46. Wang, N.; Meng, X.; Meng, X.; Shao, F. Convolution-Embedded Vision Transformer with Elastic Positional Encoding for Pansharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–9. [Google Scholar] [CrossRef]
  47. Zhang, F.; Zhang, K.; Sun, J. Multiscale Spatial-Spectral Interaction Transformer for Pan-Sharpening. Remote Sens. 2022, 14, 1736. [Google Scholar] [CrossRef]
  48. Zhu, W.; Li, J.; An, Z.; Hua, Z. Mutiscale Hybrid Attention Transformer for Remote Sensing Image Pansharpening. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  49. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  50. Yuhas, R.H.; Goetz, A.F.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Proceedings of the JPL, Summaries of the Third Annual JPL Airborne Geoscience Workshop. Volume 1: AVIRIS Workshop, Pasadena, CA, USA, 1–5 June 1992. [Google Scholar]
  51. Liu, X.; Liu, Q.; Wang, Y. Remote sensing image fusion based on two-stream fusion network. Inf. Fusion 2020, 55, 1–15. [Google Scholar] [CrossRef]
  52. Alparone, L.; Baronti, S.; Garzelli, A.; Nencini, F. A global quality measurement of pan-sharpened multispectral imagery. IEEE Geosci. Remote Sens. Lett. 2004, 1, 313–317. [Google Scholar] [CrossRef]
  53. Arienzo, A.; Vivone, G.; Garzelli, A.; Alparone, L.; Chanussot, J. Full-resolution quality assessment of pansharpening: Theoretical and hands-on approaches. IEEE Geosci. Remote Sens. Mag. 2022, 10, 168–201. [Google Scholar] [CrossRef]
  54. Aiazzi, B.; Baronti, S.; Selva, M. Improving Component Substitution Pansharpening Through Multivariate Regression of MS + Pan Data. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3230–3239. [Google Scholar] [CrossRef]
  55. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A.; Selva, M. An MTF-based spectral distortion minimizing model for pan-sharpening of very high resolution multispectral images of urban areas. In Proceedings of the 2003 2nd GRSS/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, Berlin, Germany, 22–23 May 2003; pp. 90–94. [Google Scholar]
  56. Liu, J. Smoothing filter-based intensity modulation: A spectral preserve image fusion technique for improving spatial details. Int. J. Remote Sens. 2000, 21, 3461–3472. [Google Scholar] [CrossRef]
Figure 1. (a–e) illustrate different frameworks of pansharpening.
Figure 2. Proposed SwinPAN framework for pansharpening.
Figure 3. Schematic illustration of the high-pass preservation module.
Figure 4. Illustration of the self-attention module in Swin Transformer.
Figure 5. Reduced-resolution results on the QuickBird imagery: (a) MS image. (b) PAN image. (c) Ground truth. (d–n) Pansharpening results obtained using (d) GSA, (e) MTF-GLP-HPM, (f) SFIM, (g) DiCNN, (h) MSDCNN, (i) DRPNN, (j) FusionNet, (k) PNN, (l) PanNet, (m) HyperTransformer and (n) the proposed approach.
Figure 6. Reduced-resolution results on the GaoFen2 imagery: (a) MS image. (b) PAN image. (c) Ground truth. (d–n) Pansharpening results obtained using (d) GSA, (e) MTF-GLP-HPM, (f) SFIM, (g) DiCNN, (h) MSDCNN, (i) DRPNN, (j) FusionNet, (k) PNN, (l) PanNet, (m) HyperTransformer and (n) the proposed approach.
Figure 7. Reduced-resolution results on the WorldView3 imagery: (a) MS image. (b) PAN image. (c) Ground truth. (d–n) Pansharpening results obtained using (d) GSA, (e) MTF-GLP-HPM, (f) SFIM, (g) DiCNN, (h) MSDCNN, (i) DRPNN, (j) FusionNet, (k) PNN, (l) PanNet, (m) HyperTransformer and (n) the proposed approach.
Figure 8. Full-resolution results on the QuickBird imagery: (a) MS image. (b) PAN image. (c–m) Pansharpening results obtained using (c) GSA, (d) MTF-GLP-HPM, (e) SFIM, (f) DiCNN, (g) MSDCNN, (h) DRPNN, (i) FusionNet, (j) PNN, (k) PanNet, (l) HyperTransformer and (m) the proposed approach.
Figure 9. Full-resolution results on the GaoFen2 imagery: (a) MS image. (b) PAN image. (c–m) Pansharpening results obtained using (c) GSA, (d) MTF-GLP-HPM, (e) SFIM, (f) DiCNN, (g) MSDCNN, (h) DRPNN, (i) FusionNet, (j) PNN, (k) PanNet, (l) HyperTransformer and (m) the proposed approach.
Figure 10. Full-resolution results on the WorldView3 imagery: (a) MS image. (b) PAN image. (c–m) Pansharpening results obtained using (c) GSA, (d) MTF-GLP-HPM, (e) SFIM, (f) DiCNN, (g) MSDCNN, (h) DRPNN, (i) FusionNet, (j) PNN, (k) PanNet, (l) HyperTransformer and (m) the proposed approach.
Figure 11. (a–g) Results of the ablation study on the high-pass preservation module.
Figure 12. (a–g) Results of the ablation study comparing static and dynamic high-pass preservation modules.
Figure 13. (a–g) Quantitative results with different values of the feature dimension.
Figure 14. (a–g) Quantitative results with different numbers of DRBs.
Table 1. Parameter settings for the proposed approach on the experimental datasets.
Parameters | DRB | DRL | Head | Dimension | Learning Rate | Batch Size
QuickBird | 6 | [2, 2, 2, 2, 2, 2] | 2 | 96 | 3 × 10⁻⁴ | 32
GaoFen2 | 6 | [2, 2, 2, 2, 2, 2] | 6 | 60 | 3 × 10⁻⁴ | 32
WorldView3 | 6 | [2, 2, 2, 2, 2, 2] | 2 | 60 | 3 × 10⁻⁴ | 32
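For readers reproducing the training setup, the Table 1 settings can be collected into a simple configuration object. The sketch below is illustrative only: the dictionary keys and the get_config helper are hypothetical names, not taken from the authors' released code; the values mirror Table 1.

```python
# Hypothetical configuration dictionaries mirroring Table 1.
# Key names are illustrative and do not come from the authors' code.
CONFIGS = {
    "QuickBird":  {"drb": 6, "drl": [2, 2, 2, 2, 2, 2], "heads": 2,
                   "dim": 96, "lr": 3e-4, "batch_size": 32},
    "GaoFen2":    {"drb": 6, "drl": [2, 2, 2, 2, 2, 2], "heads": 6,
                   "dim": 60, "lr": 3e-4, "batch_size": 32},
    "WorldView3": {"drb": 6, "drl": [2, 2, 2, 2, 2, 2], "heads": 2,
                   "dim": 60, "lr": 3e-4, "batch_size": 32},
}

def get_config(dataset: str) -> dict:
    """Return the Table 1 hyperparameters for one of the three datasets."""
    return CONFIGS[dataset]
```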
Table 2. Quantitative results at reduced resolution on the QuickBird dataset.
Methods | SAM ↓ | ERGAS ↓ | Q4 ↑ | SCC ↑
SFIM | 8.1925 ± 1.7282 | 8.8807 ± 2.1295 | 0.8495 ± 0.0788 | 0.9315 ± 0.0150
MTF-GLP-HPM | 8.3063 ± 1.5742 | 10.4731 ± 0.9394 | 0.8411 ± 0.0138 | 0.8796 ± 0.0194
GSA | 8.3497 ± 1.6728 | 9.3289 ± 2.7366 | 0.8289 ± 0.1119 | 0.9284 ± 0.0140
DiCNN | 5.6262 ± 0.9368 | 5.4730 ± 0.3720 | 0.9488 ± 0.0103 | 0.9712 ± 0.0054
MSDCNN | 4.9896 ± 0.8182 | 4.1383 ± 0.2411 | 0.9720 ± 0.0057 | 0.9814 ± 0.0038
DRPNN | 5.0111 ± 0.8288 | 4.1363 ± 0.2487 | 0.9719 ± 0.0057 | 0.9794 ± 0.0040
FusionNet | 5.1158 ± 0.8432 | 4.3962 ± 0.2662 | 0.9678 ± 0.0071 | 0.9797 ± 0.0039
PNN | 5.4115 ± 0.8705 | 4.7185 ± 0.3218 | 0.9630 ± 0.0087 | 0.9763 ± 0.0044
PanNet | 5.5462 ± 1.0085 | 5.4995 ± 0.8098 | 0.9487 ± 0.0186 | 0.9687 ± 0.0082
HyperTransformer | 4.9931 ± 0.7630 | 4.1189 ± 0.3429 | 0.9715 ± 0.0078 | 0.9813 ± 0.0075
Ours | 4.8653 ± 0.7909 | 4.0546 ± 0.2531 | 0.9726 ± 0.0063 | 0.9826 ± 0.0035
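Tables 2–4 use SAM and ERGAS as the reference-based spectral metrics, together with Q4/Q8 and SCC. As a reading aid, the sketch below shows one common way SAM and ERGAS are computed from a fused image and the ground-truth reference. It is a minimal NumPy illustration of the standard definitions, not the evaluation code used for these tables, and it assumes (H, W, B) arrays and the 4:1 PAN/MS resolution ratio of the three sensors.

```python
import numpy as np

def sam_degrees(fused: np.ndarray, ref: np.ndarray, eps: float = 1e-12) -> float:
    """Mean spectral angle (in degrees) between fused and reference pixels."""
    dot = np.sum(fused * ref, axis=-1)
    norms = np.linalg.norm(fused, axis=-1) * np.linalg.norm(ref, axis=-1)
    angles = np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0))
    return float(np.degrees(angles).mean())

def ergas(fused: np.ndarray, ref: np.ndarray, ratio: int = 4) -> float:
    """ERGAS with the usual 100/ratio scaling; lower is better."""
    rmse_sq = np.mean((fused - ref) ** 2, axis=(0, 1))   # per-band mean squared error
    mean_sq = np.mean(ref, axis=(0, 1)) ** 2             # per-band squared reference mean
    return float(100.0 / ratio * np.sqrt(np.mean(rmse_sq / mean_sq)))
```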
Table 3. Quantitative results at reduced resolution on the GaoFen2 dataset.
Methods | SAM ↓ | ERGAS ↓ | Q4 ↑ | SCC ↑
SFIM | 6.2068 ± 1.1050 | 12.4050 ± 2.1028 | 0.5631 ± 0.1187 | 0.9691 ± 0.0120
MTF-GLP-HPM | 5.1642 ± 1.0338 | 10.5863 ± 3.2607 | 0.6186 ± 0.1613 | 0.9416 ± 0.0142
GSA | 6.4668 ± 1.0011 | 12.8536 ± 2.1755 | 0.5391 ± 0.1253 | 0.9659 ± 0.0127
DiCNN | 1.1003 ± 0.2064 | 1.1222 ± 0.2184 | 0.9840 ± 0.0081 | 0.9862 ± 0.0059
MSDCNN | 0.9889 ± 0.1839 | 0.9679 ± 0.1777 | 0.9886 ± 0.0063 | 0.9901 ± 0.0043
DRPNN | 0.9118 ± 0.1634 | 0.8185 ± 0.1377 | 0.9916 ± 0.0045 | 0.9918 ± 0.0035
FusionNet | 1.0143 ± 0.1959 | 1.0551 ± 0.2079 | 0.9860 ± 0.0071 | 0.9889 ± 0.0048
PNN | 1.0907 ± 0.2105 | 1.1116 ± 0.2259 | 0.9842 ± 0.0085 | 0.9871 ± 0.0058
PanNet | 1.0248 ± 0.1724 | 0.9214 ± 0.1561 | 0.9893 ± 0.0056 | 0.9898 ± 0.0043
HyperTransformer | 0.9538 ± 0.1648 | 0.8506 ± 0.1283 | 0.9908 ± 0.0048 | 0.9905 ± 0.0038
Ours | 0.7996 ± 0.1441 | 0.7790 ± 0.1271 | 0.9922 ± 0.0041 | 0.9936 ± 0.0028
Table 4. Quantitative results at reduced resolution on the WorldView3 dataset.
Methods | SAM ↓ | ERGAS ↓ | Q8 ↑ | SCC ↑
SFIM | 5.5385 ± 1.4737 | 5.7839 ± 1.7049 | 0.8704 ± 0.4548 | 0.9531 ± 0.0142
MTF-GLP-HPM | 5.7246 ± 1.5042 | 6.5285 ± 1.3622 | 0.8716 ± 0.3886 | 0.9237 ± 0.0211
GSA | 5.6828 ± 1.5025 | 6.6567 ± 1.8083 | 0.8732 ± 0.3973 | 0.9496 ± 0.0137
DiCNN | 4.4534 ± 0.8643 | 3.2739 ± 0.8627 | 0.9228 ± 0.5535 | 0.9765 ± 0.0125
MSDCNN | 3.7875 ± 0.6942 | 2.7558 ± 0.6105 | 0.9373 ± 0.3181 | 0.9748 ± 0.0114
DRPNN | 3.5703 ± 0.6365 | 2.5916 ± 0.5484 | 0.8479 ± 0.4550 | 0.9778 ± 0.0119
FusionNet | 3.4672 ± 0.6286 | 2.5718 ± 0.5937 | 0.8516 ± 0.4376 | 0.9825 ± 0.0077
PNN | 3.9548 ± 0.7266 | 2.8655 ± 0.6668 | 0.8968 ± 0.4820 | 0.9750 ± 0.0109
PanNet | 3.7322 ± 0.6609 | 2.7974 ± 0.6270 | 0.8760 ± 0.4258 | 0.9729 ± 0.0128
HyperTransformer | 3.1275 ± 0.5250 | 2.6405 ± 0.5122 | 0.9210 ± 0.3970 | 0.9843 ± 0.1240
Ours | 2.9542 ± 0.5253 | 2.1558 ± 0.4439 | 0.9539 ± 0.4722 | 0.9890 ± 0.0046
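SCC in Tables 2–4 measures how well the high-frequency spatial structure of the fused image matches that of the reference. The following is a minimal sketch under the common convention of extracting high frequencies with a Laplacian filter and averaging the per-band correlation coefficients; the exact high-pass filter is an assumption, since it is not specified alongside the tables.

```python
import numpy as np
from scipy.ndimage import laplace

def scc(fused: np.ndarray, ref: np.ndarray) -> float:
    """Spatial correlation coefficient between high-pass versions of (H, W, B) images."""
    scores = []
    for b in range(ref.shape[-1]):
        hp_fused = laplace(fused[..., b].astype(np.float64))  # high-frequency component
        hp_ref = laplace(ref[..., b].astype(np.float64))
        scores.append(np.corrcoef(hp_fused.ravel(), hp_ref.ravel())[0, 1])
    return float(np.mean(scores))
```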
Table 5. Quantitative results at full resolution on the QuickBird dataset.
Methods | Dλ ↓ | Ds ↓ | QNR ↑
SFIM | 0.0512 ± 0.0113 | 0.1296 ± 0.0996 | 0.8243 ± 0.0974
MTF-GLP-HPM | 0.0506 ± 0.0234 | 0.1341 ± 0.1146 | 0.8217 ± 0.1095
GSA | 0.0465 ± 0.0207 | 0.2007 ± 0.1098 | 0.7614 ± 0.1033
DiCNN | 0.0416 ± 0.0300 | 0.0910 ± 0.0514 | 0.8723 ± 0.0711
MSDCNN | 0.0604 ± 0.0390 | 0.0524 ± 0.0137 | 0.8903 ± 0.0391
DRPNN | 0.0394 ± 0.0327 | 0.0409 ± 0.0241 | 0.9219 ± 0.0513
FusionNet | 0.0402 ± 0.0341 | 0.0543 ± 0.0410 | 0.9088 ± 0.0676
PNN | 0.0399 ± 0.0342 | 0.0500 ± 0.0393 | 0.9133 ± 0.0665
PanNet | 0.0409 ± 0.0347 | 0.0418 ± 0.0334 | 0.9200 ± 0.0618
HyperTransformer | 0.0424 ± 0.0376 | 0.0412 ± 0.0195 | 0.9210 ± 0.0499
Ours | 0.0370 ± 0.0333 | 0.0398 ± 0.0257 | 0.9253 ± 0.0536
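At full resolution (Tables 5–7) no ground truth is available, so the no-reference protocol is used: Dλ quantifies spectral distortion, Ds quantifies spatial distortion, and QNR combines the two, with higher QNR indicating better overall quality. Under the usual choice of exponents α = β = 1 (an assumption consistent with common practice, not stated explicitly in the tables), the relation is

```latex
\mathrm{QNR} = \left(1 - D_{\lambda}\right)^{\alpha} \left(1 - D_{s}\right)^{\beta}, \qquad \alpha = \beta = 1 .
```

As a rough check, the proposed method's QuickBird scores give (1 − 0.0370)(1 − 0.0398) ≈ 0.925, in line with the reported QNR of 0.9253; the small gap is expected because the table reports averages over test images, and the mean of per-image products need not equal the product of the means.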
Table 6. Quantitative results at full resolution on the GaoFen2 dataset.
Methods | Dλ ↓ | Ds ↓ | QNR ↑
SFIM | 0.0371 ± 0.0160 | 0.0647 ± 0.0460 | 0.9010 ± 0.0518
MTF-GLP-HPM | 0.0925 ± 0.0386 | 0.0805 ± 0.0531 | 0.8351 ± 0.0676
GSA | 0.0596 ± 0.0227 | 0.1027 ± 0.0542 | 0.8445 ± 0.0620
DiCNN | 0.0179 ± 0.0145 | 0.0590 ± 0.0262 | 0.9244 ± 0.0371
MSDCNN | 0.0121 ± 0.0144 | 0.0387 ± 0.0198 | 0.9499 ± 0.0317
DRPNN | 0.0158 ± 0.0152 | 0.0319 ± 0.0168 | 0.9530 ± 0.0298
FusionNet | 0.0215 ± 0.0191 | 0.0546 ± 0.0262 | 0.9255 ± 0.0419
PNN | 0.0113 ± 0.0130 | 0.0333 ± 0.0175 | 0.9560 ± 0.0285
PanNet | 0.0115 ± 0.0118 | 0.0412 ± 0.0191 | 0.9486 ± 0.0285
HyperTransformer | 0.0174 ± 0.0170 | 0.0414 ± 0.0218 | 0.9422 ± 0.0355
Ours | 0.0110 ± 0.0099 | 0.0309 ± 0.0135 | 0.9585 ± 0.0210
Table 7. Quantitative results at full resolution on the WorldView3 dataset.
Methods | Dλ ↓ | Ds ↓ | QNR ↑
SFIM | 0.0353 ± 0.0106 | 0.0565 ± 0.0288 | 0.9075 ± 0.0351
MTF-GLP-HPM | 0.0389 ± 0.0229 | 0.0523 ± 0.0332 | 0.9113 ± 0.0482
GSA | 0.0325 ± 0.0131 | 0.0603 ± 0.0293 | 0.9062 ± 0.0381
DiCNN | 0.0239 ± 0.0174 | 0.0575 ± 0.0339 | 0.9202 ± 0.0410
MSDCNN | 0.0267 ± 0.0146 | 0.0473 ± 0.0254 | 0.9275 ± 0.0351
DRPNN | 0.0266 ± 0.0168 | 0.0476 ± 0.0234 | 0.9274 ± 0.0369
FusionNet | 0.0320 ± 0.0253 | 0.0490 ± 0.0213 | 0.9207 ± 0.0354
PNN | 0.0249 ± 0.0139 | 0.0451 ± 0.0220 | 0.9313 ± 0.0312
PanNet | 0.0277 ± 0.0142 | 0.0603 ± 0.0237 | 0.9140 ± 0.0342
HyperTransformer | 0.0276 ± 0.0137 | 0.0487 ± 0.0212 | 0.9347 ± 0.0312
Ours | 0.0216 ± 0.0105 | 0.0326 ± 0.0194 | 0.9467 ± 0.0275
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
