Self-Supervised Multiscale Contrastive and Attention-Guided Gradient Projection Network for Pansharpening

Li, Qingping; Yang, Xiaomin; Li, Bingru; Wang, Jin

doi:10.3390/s25082560

Open AccessArticle

Self-Supervised Multiscale Contrastive and Attention-Guided Gradient Projection Network for Pansharpening

College of Electronic Information, Sichuan University, Chengdu 610017, China

^*

Author to whom correspondence should be addressed.

Sensors 2025, 25(8), 2560; https://doi.org/10.3390/s25082560

Submission received: 28 February 2025 / Revised: 9 April 2025 / Accepted: 14 April 2025 / Published: 18 April 2025

(This article belongs to the Section Sensor Networks)

Download

Browse Figures

Versions Notes

Abstract

:

Pansharpening techniques are crucial in remote sensing image processing, with deep learning emerging as the mainstream solution. In this paper, the pansharpening problem is formulated as two optimization subproblems with a solution proposed based on multiscale contrastive learning combined with attention-guided gradient projection networks. First, an efficient and generalized Spectral–Spatial Universal Module (SSUM) is designed and applied to spectral and spatial enhancement modules (SpeEB and SpaEB). Then, the multiscale high-frequency features of PAN and MS images are extracted using discrete wavelet transform (DWT). These features are combined with contrastive learning and residual connection to progressively balance spectral and spatial information. Finally, high-resolution multispectral images are generated through multiple iterations. Experimental results verify that the proposed method outperforms existing approaches in both visual quality and quantitative evaluation metrics.

Keywords:

pansharpening; multiscale contrastive; attention-guided gradient projection network

1. Introduction

Advances in remote sensing satellite technology have made Earth surface observation possible [1]. However, limited by sensor performance, satellites are unable to capture images that contain both rich spectral and spatial information simultaneously. Instead, they can only acquire low-resolution multispectral (MS) images and corresponding high-resolution panchromatic (PAN) images. The importance of high-resolution multispectral images in fields such as change detection [2], classification [3], and target identification [4] has led to the emergence of the pansharpening technique.

Traditional pansharpening methods are mainly divided into strategies based on component substitution (CS) and multiresolution analysis (MRA). CS methods, such as Brovey [5], principal component analysis (PCA) [6], IHS [7], and GSA [8], project the spectral information of the MS image into a new domain by replacing some or all of the spatial information with data from the PAN image, followed by back-projection Although histogram matching is performed before replacement to reduce spectral distortion, it is still difficult to completely avoid spectral aberrations. MRA methods, such as ATWT [9,10], SFIM based on smoothing filters [11], and the MTF-Matched Filtering-Based Generalized Laplace Pyramid (MTF-GLP) [12], extract spatial information from the PAN image through multiscale decomposition, subsequently injecting this information into the upsampled MS image. However, aliasing effects may cause spatial distortion.

Deep learning methods have become mainstream tools due to their powerful feature extraction capabilities and nonlinear mapping performance. Inspired by the Super-Resolution (SR) technique, Masi et al. [13] treated the pansharpening task as a super-resolution problem, using convolutional neural networks (CNNs) to address it. Subsequently, residual networks (RNs) [14,15], generative adversarial networks (GANs) [16,17,18,19,20] and MSDCNN [21], a multiscale deep convolutional network, were proposed.

Variational optimization methods, which lie between traditional CS/MRA methods and deep learning, consider generalized pansharpening as an optimization problem. The P+XS method [22] achieves pansharpening by extracting spatial information from a panchromatic (PAN) image, which is then injected into a multispectral (MS) image. Wu et al. [23] combined variational optimization with deep CNNs to enhance the model’s generalization ability, subsequently proposing a pansharpening framework based on low-rank tensor complementation [24]. In addition, meta-heuristic algorithms [25,26] are also widely used in generalized pansharpening tasks due to their superior performance in large-scale search spaces.

Under the variational optimization framework, it is assumed that a multispectral (MS) image is a reduced-quality version of a high-resolution multispectral (HRMS) image, while a panchromatic (PAN) image is a linear combination of the bands of the HRMS image. Based on this assumption, this paper proposes two optimization problems for HRMS image reconstruction, which constrain the generation of HRMS images using the information from both MS and PAN images, respectively.

Although existing variational optimization methods have shown significant effects on pansharpening, there are still several issues that urgently need to be addressed:

(1): Modal differences between spatial and spectral information lead to inconsistencies in information representation and extraction, resulting in poor fusion performance.
(2): During the optimization process of HRMS images, the high-frequency noise in MS images is not considered in spectral optimization, leading to an increase in artifacts in the reconstructed image.
(3): Balancing spectral and spatial information: Overemphasizing one aspect may lead to a decrease in the overall quality of the final reconstructed image.

To address these challenges, this paper applies contrastive learning to the pansharpening task by introducing an innovative method that combines self-supervised multiscale contrastive learning with attention-guided deep gradient projection (MCAGP).

The method first designs a Spectral–Spatial Universal Module (SSUM) for depth gradient projection networks, combining the depth prior to design spectral enhancement blocks (SpeEBs) and spatial enhancement blocks (SpaEBs). These blocks are applied serially and stacked alternately in the depth gradient projection network to solve the two optimization problems step by step.

Additionally, a multiscale contrastive learning strategy is applied to optimize the spatial information of PAN images. In this strategy, the high-frequency components of PAN images are considered positive samples, while those of MS images are treated as negative samples. This method strengthens the SpaEB’s focus on the spatial features of PAN images while also enhancing the SpeEB’s ability to preserve the spectral properties of MS images.

Finally, a contrastive loss function based on contrastive learning is applied to effectively balance spatial and spectral features by maximizing the similarity between positive samples while minimizing the similarity between negative samples, with model performance further enhanced by incorporating L1 loss.

The experimental results demonstrate that the MCAGP method surpasses both traditional and contemporary advanced methods in terms of visual quality and performance metrics, offering a novel approach to the pansharpening field.

The contributions of this paper are summarized as follows:

(1): Combining contrastive learning with deep gradient projection within a variational optimization framework: this method reduces modal differences by contrasting high-frequency features, strengthens the task focus of the spectral and spatial enhancement blocks, improves feature consistency and reconstruction quality, and overcomes conflicts between modalities through independent optimization strategies.
(2): Introducing a Spectral–Spatial Universal Module (SSUM) combined with depth priors: This module is extended to spectral and spatial enhancement blocks, effectively solving the dual optimization problem. Through channel-space attention guidance and multilevel residual connections, it balances spatial and spectral features.
(3): Designing a multiscale contrastive learning strategy: this strategy introduces contrast loss to filter out noise in MS images, allowing the model to perform well in both full-resolution and reduced-resolution tasks.

The structure of the paper is as follows: Section 2 provides a review of related work; Section 3 describes the MCAGP method in detail; Section 4 presents the experimental results; and Section 5 presents the conclusions.

2. Related Work

2.1. Self-Supervised Learning

Self-supervised learning (SSL) generates labels using the data themselves to train the model without manual labeling, thus providing a significant advantage in areas where labeling is costly. Its successful applications in natural language processing (NLP) and computer vision (CV), such as image colorization [27,28] and super-resolution [29], demonstrate that SSL is able to efficiently extract structural, contextual, and semantic features from data. In the field of pansharpening, SSL shows great potential. Xing et al. [30] proposed a cross-predictive diffusion model (CrossDiff) to explore self-supervised representations in panchromatic sharpening; Ruben et al. [31] designed a self-supervised double-U network (W-NetPan); and He et al. [32] developed a self-supervised pansharpening method based on spectral super-resolution (sSRPNet). These studies show that SSL offers innovative ideas for panchromatic sharpening tasks, significantly enhancing performance.

2.2. Contrastive Learning

Contrastive learning has garnered significant attention, with its core idea being to enhance the mutual information of learned representations by reducing the distance between anchor and positive samples in latent space while pushing negative samples away [33,34,35,36,37,38]. The construction of positive and negative samples is key to contrastive learning. In the field of image super-resolution, positive samples are typically real images, and negative samples are degraded or other images [39,40,41]. SimCLR [34] utilizes data augmentation (e.g., cropping, flipping, and color dithering) to generate pairs of positive samples and learns their similarity through contrast loss. MoCo [35] introduces a momentum encoder and dynamic queue, effectively addressing the problem of balancing positive and negative samples and making representation learning more robust. In the pansharpening field, Zhou et al. [42] enforce contrastive learning to constrain the distance between the restored features and the ground truth, performing distillation to promote the learning of consistent features.

3. Proposed Method

This subsection describes in detail the proposed pansharpening method MCAGP, whose overall framework is illustrated in Figure 1 and Algorithm 1. In this figure, ms denotes the low-resolution multispectral image, PAN denotes the high-resolution panchromatic image, and HRMS refers to the final high-resolution multispectral image.

Algorithm 1: MCAGP Forward Pass

The framework of MCAGP consists of three key components: a spectral enhancement block (SpeEB), a spatial enhancement block (SpaEB), and a Multiscale Contrastive Learning module (MCL), which are closely coupled through iterative residual learning.

Specifically, the process begins with interpolating the low-resolution MS image to the PAN resolution, obtaining the initial

{H R M S}^{0}

. Both the interpolated MS image and the original MS image are then input into the SpeEB, which is designed to enhance the spectral information by learning and compensating the spectral difference between the upsampled image and the original MS image. The output of SpeEB, denoted as

{H R M S}^{l - 1}

, is subsequently passed through the MCL module, where the multiscale contrastive loss is calculated by extracting high-frequency details and constructing positive and negative samples based on data augmentation and noise injection, effectively guiding the network to focus on fine-grained spatial–spectral consistency.

Afterwards, the contrastive-enhanced

{H R M S}^{l - 1}

and the PAN image are jointly fed into the SpaEB, which injects spatial details from the PAN image while preserving spectral consistency. A residual block is embedded after SpaEB to further refine the fused result and compensate for residual errors.

This procedure is repeated over L iterations, with residual connections linking the outputs at each stage to progressively refine the reconstructed HRMS. Through the interaction of spectral enhancement, spatial enhancement, and contrastive learning, the network gradually improves the fidelity of the pansharpened image. The detailed workflow is summarized in the pseudo-code provided, and the interconnection between modules is visually illustrated in Figure 1.

3.1. Attention-Guided Gradient Projection

Problem description: Suppose the LR image is a degraded version of the HR image, while the PAN image is a linear combination of the bands in the HR image. Therefore, the following formula can be obtained:

y_{lr} = D K x_{hr}

(1)

y_{pan} = F x_{hr}

(2)

where

D \in R^{m n \times M N}

denotes the downsampling matrix,

K

is the low-pass circular convolution matrix,

F \in S^{B \times b}

is the spectral response function, and

x_{hr}

represents the target high-resolution multispectral image. Since the process of reconstructing the HR image is a typical pathological inverse problem, the direct solution often faces instability. Therefore, in order to constrain the reasonableness of the solution, the following optimization problem with a regularization term is proposed:

min_{x_{hr}} L_{d a t a} (x_{hr}) + γ R (x_{hr})

(3)

where

R (x_{hr})

is the prior term, which is used to control the smoothness or structure of the

x_{hr}

image; traditional optimization is typically hand-tailored, while in deep learning, it is represented as an implicit prior.

L_{d a t a} (x_{hr}) = ∥ y_{lr} - D x_{hr} ∥_{F}^{2} + {∥ y_{pan} - F x_{hr} ∥}_{F}^{2}

is the data fidelity term, which is used to constrain the consistency between the

x_{hr}

image, the

y_{l r}

image, and the

y_{p a n}

image;

γ

is the trade-off parameter, which is used to regulate the relative importance between the regularization term and the data fidelity term.

In order to better utilize the deep learning framework, the generalized pansharpening problem is decomposed into two complementary subproblems: spectral optimization and spatial optimization. This decomposition allows for the independent optimization of spectral and spatial information, with the final goal of reconstructing the HR image formulated as follows:

min_{x_{hr}} f (y_{lr}, x_{hr}) + γ R_{l} (x_{hr})

(4)

min_{x_{hr}} f (y_{pan}, x_{hr}) + γ R_{P} (x_{hr})

(5)

Inspired by generative GAN algorithms, two generative modules were designed: the spectral enhancement block (SpeEB) and the spatial enhancement block (SpaEB). These two modules implicitly model regularization terms through deep learning in order to optimize both the spectral features and spatial details.

Spectral enhancement block (SpeEB): The focus of the spectral enhancement module is to optimize spectra by reconstructing spectral distributions that are consistent with low-resolution (LR) images.The optimization process of SpeEB consists of the following four steps:

\hat{{y_{lr}}^{(m)}} = D K {x_{hr}}^{(m - 1)}

(6)

{R_{l}}^{(m)} = y_{lr} - \hat{{y_{lr}}^{(m)}}

(7)

{R_{h}}^{(m)} = ρ {(D K)}^{T} {R_{l}}^{(m)}

(8)

{x_{hr}}^{(m)} {prox}_{h_{l}} = ({x_{hr}}^{(m - 1)} + {R_{h}}^{(m)})

(9)

where

ρ

is the step size, and

{prox}_{h_{l}}

is the proximal operator corresponding to the penalty term

h_{l} (\cdot)

.

Spatial enhancement block (SpaEB): the spatial enhancement block focuses on spatial optimization, which optimizes the details of spatial information by comparing the linear combination of an HR image and a PAN image. Its optimization steps are as follows:

\hat{{y_{pan}}^{(m)}} = F {x_{hr}}^{(m - 1)}

(10)

{R_{p}}^{(m)} = y_{pan} - \hat{{y_{pan}}^{(m)}}

(11)

{R_{h}}^{(m)} = ρ {R_{p}}^{(m)} F^{T}

(12)

{x_{hr}}^{(m)} {prox}_{h_{p}} = ({x_{hr}}^{(m - 1)} + {R_{h}}^{(m)})

(13)

where

ρ

is the step size, and

{prox}_{h_{p}}

is the proximal operator corresponding to the penalty term

h_{p} (\cdot)

.

Spectral–Spatial Universal Module (SSUM): The detailed structure of the Spectral–Spatial Universal Module (SSUM) is illustrated in Figure 2. To further enhance the fusion efficiency of spectral and spatial information, this paper introduces the SSUM module between the spectral enhancement block (SpeEB) and the spatial enhancement block (SpaEB), aiming to achieve the unified extraction and enhancement of spectral and spatial features. Specifically, SSUM incorporates both channel attention and spatial attention mechanisms, which effectively guide the network to selectively focus on spectral attributes and spatial details, thereby improving the feature representation capability. In the overall framework, SpeEB mainly leverages the residual information between the low-resolution multispectral (MS) image and the interpolated high-resolution MS image to compensate for the spectral distortion caused by upsampling. Conversely, SpaEB focuses on utilizing the spatial structural details contained in the PAN image and compensates for the spatial resolution loss via a residual back-projection strategy. Although both SpeEB and SpaEB share the same SSUM structure as the basic unit for feature mapping and residual feedback, they achieve functional decoupling and complementarity in terms of input design and residual information utilization. This ensures a well-balanced optimization between spectral fidelity and spatial detail enhancement. Furthermore, the structural versatility and efficiency of SSUM enable feature sharing and collaborative optimization between SpeEB and SpaEB, significantly improving the overall quality of feature representation and computational efficiency.

3.2. Multiscale Contrastive Learning

In the reconstruction of remote sensing images, MS images have poor spatial quality with significant high-frequency noise (Figure 3a). In contrast, PAN images have clear high-frequency spatial details (Figure 3b). Thus, MS images mainly contribute spectral information, while PAN images provide high-quality spatial details. This division prevents artifacts caused by mixing MS image noise with PAN image details.

To this end, discrete wavelet transform (DWT) is introduced in this paper to extract the multiscale high-frequency features of PAN and MS images. DWT is able to capture the spatial details in multiscale and multidirectional forms by decomposing the images into low and high-frequency subbands. The multiscale contrastive learning framework is shown in Figure 4. Specifically, the following applies.

Anchor sample: The reconstructed image generated via the SpeEB is used to extract its multiscale high-frequency features through DWT, with low-dimensional embedded features generated by global pooling and linear projection.

Z_{a n c h o r} = P (G (DWT (x_{hr}^{SpeEB})))

(14)

where

G (\cdot)

denotes global pooling,

G (\cdot)

denotes linear projection for mapping high-dimensional features to low-dimensional potential space, and

x_{hr}^{SpeEB}

denotes the HR image generated via the spectral enhancement module.

Positive sample: The PAN images are acquired through spatial matching, with multiscale high-frequency features extracted after data enhancement (e.g., random flip and rotation) and the embedded features generated using the same process as for the anchor samples.

Z_{p o s} = P (G (DWT (Augment (y_{pan}))))

(15)

A u g m e n t (\cdot)

represents data enhancement operations such as random flipping and rotation.

Negative sample: Extracted from the upsampled MS image, diverse negative samples are generated by adding Gaussian noise, extracting their high-frequency features and mapping them to the low-dimensional space. Through multiple negative samples, the distance between the anchor point and negative samples is enlarged to improve the discriminative ability.

Z_{n e g}^{i} = P (G (DWT (AddNoise (y_{lr}^{↑}))))

(16)

where

y_{lr}^{↑}

denotes the LR image after upsampling through the interpolation operation,

A d d N o i s e (\cdot)

denotes the addition of random Gaussian noise to the high-frequency portion of the MS image, and i denotes different instances of negative samples.

Multiscale contrastive learning (MCL): In the proposed MCAGP framework, a multiscale contrastive learning (MCL) module is introduced. As illustrated in Figure 4, the complete process of positive and negative sample construction, high-frequency feature extraction, and contrastive loss computation is clearly presented, providing readers with a detailed understanding of the implementation and functionality of this module.

The core idea of the MCL module is to guide the network to focus more on the consistency of spatial–spectral details during training by constructing positive and negative sample pairs. Specifically, multiscale high-frequency features are first extracted from the output

H R M S^{l - 1}

of the SpeEB module, which serves as the anchor samples. Subsequently, a data augmentation strategy—including rotation, flipping, color jittering, and other transformations—is applied to the PAN image to generate positive samples. Their multiscale high-frequency features are also extracted. In order to provide effective contrastive information, multiple negative samples are further generated by injecting Gaussian noise into the multispectral image MS, followed by high-frequency feature extraction.

In the feature space, the similarity between the anchor features and the positive features is maximized (i.e., bringing them closer), while the similarity between the anchor features and the negative features is minimized (i.e., pushing them apart). This forms the positive–negative contrastive training objective, where the similarity measurement is implemented using the InfoNCE loss function.

It is noteworthy that the high-frequency feature extraction in the MCL module not only focuses on single-scale texture information but also leverages multiscale spatial details obtained via discrete wavelet transform (DWT). This ensures the effectiveness of contrastive loss across different scales. Additionally, the generation process of positive and negative samples incorporates diverse data augmentation and noise injection strategies, effectively enhancing the model’s discriminative ability and robustness.

Residual connection and information balance: To avoid the loss of spectral information due to the over-reliance of the model on the spatial features of the PAN image and to improve the fusion efficiency of spectral and spatial features, this paper introduces a multi-stage residual connection mechanism between the SpeEB, the SpaEB, and the subsequent residual blocks, which progressively accrues the features of each stage and realizes the dynamic balance between the spectral and spatial information.

3.3. Loss Functions

L1 loss:

L_{L 1} = ∥x_{hr}^{pred} - x_{hr}^{gt}∥

(17)

Contrast loss: the InfoNCE loss [33,34,35] is used.

D = \sum_{i = 1}^{K} exp (\frac{sim (Z_{anchor}, Z_{neg}^{i})}{τ})

(18)

L_{InfoNCE} = - log (\frac{exp (\frac{sim (Z_{anchor}, Z_{pos})}{τ})}{exp (\frac{sim (Z_{anchor}, Z_{pos})}{τ}) + D})

(19)

where

Z_{a n c h o r}

is the feature representation of the anchor sample,

Z_{p o s}

is the feature representation of the positive sample,

Z_{n e g}^{i}

is the feature representation of the i negative sample, with a total of K negative samples,

τ

is a given temperature parameter, which is used to regulate the scaling range of similarity, and

s i m (a, b)

is the similarity function, which is commonly used to measure the similarity between feature vectors by dot product or cosine similarity.

In the implementation, the dot products of positive and negative samples are batch-processed and spliced by columns to form a logits matrix, where the first position is a positive sample and the rest are negative samples. Cross-entropy loss is a reliable and efficient loss function that is widely utilized in deep networks [43,44,45], and the final contrast loss is calculated by cross-entropy loss.

Thus, the total loss function of the model is as follows:

L = L_{L 1} + λ L_{I n f o N C E}

(20)

where

λ

for the weight hyperparameters, which are used to balance the proportion of the contribution of the

L_{L 1}

loss and the InfoNCE loss.

4. Experiments

4.1. Datasets and Metrics

To verify the superiority of the proposed method, we conduct experiments on the Rio dataset (source: WV3), Guangzhou dataset (source: GF2), and Indianapolis dataset (source: QB), which all have a scale factor of 4, and the test sets contain the reduced-resolution test set and the full-resolution test set, respectively. As shown in Table 1, The data can be found at GitHub-liangjiandeng/PanCollection.

For the reduced-resolution experiments, we used four commonly used metrics: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) [46], spectral angle mapping (SAM) [47], and relative unquantized global synthesis error (ERGAS) [48]. For the full-resolution experiments, we use the spectral distortion index (D

λ

), the spatial distortion index (Ds), and the quality of the reference-free mixing (QNR) to assess the quality of the results.

Our MCAGP is implemented in the PyTorch-3.8 framework with Adam optimizer, a learning rate of

5 \times 10^{- 5}

, L2 regularization, a weight decay factor of

1 \times 10^{- 4}

, a batch size of 4, and a network depth and width of 64 and 8, respectively. The experiments were performed on MATLAB 2019b and NVIDIA RTX 4050 GPU computers (NVIDIA, Santa Clara, CA, USA). For other deep learning pansharpening methods, we trained the network using the default settings from the relevant papers or code repositories, using the same equipment and PyTorch environment.

4.2. Comparison with SOTA Methods

In this section, we compare what we propose in this paper with several state-of-the-art methods, including three traditional methods, i.e., EXP [49], C-GSA [50], and BDSD-PC [51], TV [52], PWMPF [53], and nine deep learning-based methods, i.e., DaViT [54] and its variants, such as paDaViT and rDaViT, LeWin [55], MSDCNN [21], PanFormer [56], SSIN [57], PANNET [58], and PNN [13],. We conducted reduced-resolution and full-resolution experiments on three datasets, with the reduced-resolution following the Wald protocol.

Results on WV3 dataset: Table 2 shows the results of quantitative experiments on the WV3 dataset, while Figure 5 provides a visualization of the fused images. Overall, the deep learning-based approach shows significant advantages over the traditional approach. In the resolution reduction experiments, our method is 1.331 dB ahead of the suboptimal method in the PSNR metric, with 0.013 higher in the SSIM metric, indicating a significant improvement in image restoration quality. The restored images are clearer and more natural, with better preservation of details and structures. Our method reduces the spatial angle metric (SAM) by 0.01 and the ERGAS value by 0.458 compared to the suboptimal method, which further indicates that our method achieves a superior balance between preserving spatial details and spectral accuracy. These metrics show that our method effectively recovers the high-frequency details of the image during image reconstruction while reducing the recovery error and enhancing the realism of the image. In the full-resolution experiments, although our D

λ

(spectral distortion) is slightly higher than that of other methods, indicating a slight trade-off in spectral recovery, we succeeded in minimizing spatial distortion by optimizing Ds (spatial distortion). This optimization allowed us to achieve optimal performance in the recovery of spatial details, ensuring high-resolution image recovery. In terms of the final QNR (quality-to-noise ratio) value, our method achieves the best performance, indicating that we have achieved an ideal balance between image quality and noise control and thus ensured the detail and visual quality of the image. In terms of visual effect, our method significantly improves the clarity and detail performance of the image, especially in the detailed presentation of buildings and vegetation, with a sharper restoration effect.

Results on the QB dataset: Table 3 lists the quantitative results on the QB dataset, while Figure 6 demonstrates the corresponding visual effects. Overall, the deep learning-based method outperforms the traditional approach. In the resolution reduction experiments, our method surpasses the suboptimal method by 0.1635 dB in PSNR and 0.022 in SSIM, indicating superior performance in image noise suppression, detail retention, and structure restoration. SAM and ERGAS values are lower than the suboptimal method by 0.012 and 1.095, respectively, suggesting that our method maximizes spectral restoration, preserving the spectral features of the original image and effectively reducing reconstruction errors. In the full-resolution experiments, our method slightly sacrifices spectral distortion (D

λ

), but this does not affect overall performance. Our spatial distortion (Ds) is the lowest among all methods, demonstrating that we minimize spatial distortion during image restoration, ensuring accurate recovery of spatial structure and details. Notably, in the comprehensive QNR (quality-to-noise ratio) metric, our method achieves the best performance, indicating an ideal balance between image quality and noise control.

Results on the GF2 dataset: Table 4 summarizes the experimental results on the GF2 dataset, while Figure 7 presents a visual representation of the fused images. In the down-resolution experiments, our method outperforms the next best method by 0.075 dB in PSNR and 0.019 in SSIM, demonstrating its superiority in image restoration quality, particularly in detail and contrast preservation. SAM is the lowest among all methods, indicating better spatial restoration performance, and ERGAS is 1.238, which is 0.022 higher than the optimal paDaViT method but still shows better results. In the full-resolution experiments, our method continues to significantly outperform traditional methods, although it performs slightly lower than individual deep learning methods in some metrics, especially in spectral recovery. Overall, our method achieves a balance between spectral and spatial details in image restoration, with superior overall performance. Visually, the fused images exhibit lower noise, fewer artifacts, and sharper details with better contrast.

The performance on the GF2 dataset is not as good as that on the QB and WV3 datasets, mainly due to the noise level in the data, scene complexity, and the stringent demands of the unsupervised full-resolution evaluation protocol on the model’s generalization ability. The GF2 dataset contains more fragmented structures, a mix of vegetation and urban textures, and more pronounced edge aliasing effects, which increase the difficulty of image restoration. Additionally, the performance on the GF2 dataset in the full-resolution experiments is not as good as that on other datasets, partly because the Wald protocol we used has limited applicability to the GF2 dataset. While the Wald protocol works effectively for high-quality commercial sensors such as QB and WV3, it may not hold for GF2, as significant details and noise patterns are lost during the downsampling process, and the generated pseudo-GT exhibits substantial statistical deviation from the true full-resolution images in both spectral and texture domains. Although our method outperforms others in down-resolution experiments, the performance on the GF2 dataset in full-resolution evaluation is slightly worse than on other datasets due to these factors.

4.3. Ablation Experiments

To evaluate the contribution of each module in the proposed method, we conducted ablation experiments on the QB dataset by replacing or removing different modules, comparing the experimental results with the final model (Ours) and analyzing the impact of each module on the model performance. The experimental results are shown in Table 5 and analyzed in detail below:

(1): Replacing the SSUM module with regular convolution while removing the contrastive learning part.

In the experimental setup (1), the SSUM module is replaced with regular convolution, with the contrastive learning part removed. Compared with our final model (our approach), PSNR decreased by 8.54%, SSIM decreased by 3.99%, SAM increased by 22.09%, ERGAS increased by 39.50%, and QNR decreased by 0.66%. The results show that regular convolution cannot replace the efficient SSUM module, with the removal of contrastive learning significantly reducing the model’s performance in both down-resolution and full-resolution experiments.

(2): Replacing the SSUM module with regular convolution while retaining only the contrastive learning component.

In experimental setup (2), contrastive learning and its loss function are retained, but the SSUM module is replaced with ordinary convolution. Compared with our approach, PSNR decreased by 8.97%, SSIM decreased by 3.57%, SAM increased by 19.77%, ERGAS increased by 41.66%, and QNR decreased by 0.66%. The results demonstrate the key role of the SSUM module in the model, which can significantly improve the reconstruction quality of image details and effectively reduce errors.

(3): Retaining the SSUM module while deleting the contrastive learning part.

In experimental setup (3), only the SSUM module is used, and the contrastive learning part is removed. Compared with our approach, PSNR decreased by 3.78%, SSIM decreased by 1.05%, SAM increased by 8.14%, and ERGAS increased by 13.95%. Although the SSUM module improves the reconstruction quality, the removal of contrastive learning degrades the model’s performance in the high-resolution reconstruction task; the spectral and spatial properties especially cannot be fully optimized, further validating the importance of contrastive learning.

4.4. Discussion of the Loss Function Parameter, $λ$

To address the different optimization objectives of the two loss functions, we investigated the impact of introducing contrast loss at different stages on model performance, proposing a new strategy that adds contrast loss at a later stage to fine-tune the already established model. In our experiments, we compared two training strategies: one introduced the contrast loss in the whole process (i.e., the method in this paper, with

λ

= 1); the other trained the model using the L1 reconstruction loss initially to establish the basic image reconstruction capability, followed by gradually increasing the weight of contrast loss until it matched the L1 loss. The training results are shown in Table 6, Table 7 and Table 8.

On the WV3 dataset, our method prioritizes spectral retention, reflected by a lower SAM and D

λ

, but with a slight sacrifice in spatial consistency (indicated by the increase in Ds). In contrast, the two-stage training strategy balances spectral and spatial properties better, though at the cost of a slight reduction in PSNR. For tasks requiring high spectral fidelity, such as surface classification and hyperspectral analysis, our method is more suitable. For higher overall performance, the two-stage strategy can be considered. On the QB dataset, our method offers a better balance between spectral and spatial performance, achieving a superior overall performance index. On the GF2 dataset, the two-stage method strikes a better balance between spatial details and spectral consistency, effectively reducing global error (ERGAS); in full-resolution tests, our method shows better spatial detail recovery and noise suppression.

5. Conclusions

In this paper, we have proposed a generative network for depth gradient projection based on self-supervised multiscale contrastive learning and attention guidance that improves the balance of spectral and spatial information. We first proposed an efficient SSUM module based on channel and spatial attention, which was combined with a depth prior and generalized to the depth gradient projection network to form a spectral enhancement block and a spatial enhancement block, which is the basis of our network. Secondly, based on the two proposed optimization problems, we used contrastive learning by using the multiscale high-frequency component of PAN as the positive sample and the upsampled multiscale high-frequency information of MS as the negative samples. This enables the spectral enhancement block and the spatial enhancement block to focus more on their respective optimization tasks; in the end, contrastive loss was applied throughout the process to refine the model, leading to improved reconstruction quality. The experiments demonstrate the superiority of the proposed method in this paper. In the future, research on contrastive learning loss will continue to be strengthened, and it is believed that contrastive learning will have more space for development in the field of pansharpening.

Author Contributions

Conceptualization, Q.L., B.L. and J.W.; Methodology, Q.L.; Software, Q.L.; Investigation, Q.L. and B.L.; Resources, Q.L.; Data curation, Q.L., B.L. and J.W.; Writing—original draft, Q.L.; Writing—review & editing, X.Y.; Visualization, Q.L.; Supervision, X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Sichuan University, grant number: 24NSFSC2159.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Yilmaz, C.S.; Yilmaz, V.; Gungor, O. A theoretical and practical survey of image fusion methods for multispectral pansharpening. Inf. Fusion 2022, 79, 1–43. [Google Scholar] [CrossRef]
Bovolo, F.; Bruzzone, L.; Capobianco, L.; Garzelli, A.; Marchesi, S.; Nencini, F. Analysis of the effects of pansharpening in change detection on VHR images. IEEE Geosci. Remote Sens. Lett. 2009, 7, 53–57. [Google Scholar] [CrossRef]
Zhong, P.; Wang, R. Learning conditional random fields for classification of hyperspectral images. IEEE Trans. Image Process. 2010, 19, 1890–1907. [Google Scholar] [CrossRef] [PubMed]
Yu, X.; Hoff, L.E.; Reed, I.S.; Chen, A.M.; Stotts, L.B. Automatic target detection and recognition in multiband imagery: A unified ML detection and estimation approach. IEEE Trans. Image Process. 1997, 6, 143–156. [Google Scholar] [PubMed]
Hallada, W.A.; Cox, S. Image sharpening for mixed spatial and spectral resolution satellite systems. In Proceedings of the 1983 International Symposium on Remote Sensing of Environment, Ann Arbor, MI, USA, 9–13 May 1983. [Google Scholar]
Kwarteng, P.; Chavez, A. Extracting spectral contrast in Landsat Thematic Mapper image data using selective principal component analysis. Photogramm. Eng. Remote Sens. 1989, 55, 339–348. [Google Scholar]
Haydn, R. Application of the IHS color transform to the processing of multisensor data and image enhancement. In Proceedings of the International Symposium on Remote Sensing of Arid and Semi-Arid Lands, Cairo, Egypt, 19–25 January 1982. [Google Scholar]
Aiazzi, B.; Baronti, S.; Selva, M. Improving component substitution pansharpening through multivariate regression of MS + Pan data. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3230–3239. [Google Scholar] [CrossRef]
Nunez, J.; Otazu, X.; Fors, O.; Prades, A.; Pala, V.; Arbiol, R. Multiresolution-based image fusion with additive wavelet decomposition. IEEE Trans. Geosci. Remote Sens. 1999, 37, 1204–1211. [Google Scholar] [CrossRef]
Garzelli, A.; Nencini, F. PAN-sharpening of very high resolution multispectral images using genetic algorithms. Int. J. Remote Sens. 2006, 27, 3273–3292. [Google Scholar] [CrossRef]
Liu, J.; Basaeed, E. Smoothing Filter-based Intensity Modulation: A spectral preserve image fusion technique for improving spatial details. Int. J. Remote Sens. 2000, 21, 3461–3472. [Google Scholar] [CrossRef]
Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A.; Selva, M. MTF-tailored Multiscale Fusion of High-resolution MS and Pan Imagery. Photogramm. Eng. Remote Sens. 2006, 72, 591–596. [Google Scholar] [CrossRef]
Masi, G.; Cozzolino, D.; Verdoliva, L.; Scarpa, G. Pansharpening by convolutional neural networks. Remote Sens. 2016, 8, 594. [Google Scholar] [CrossRef]
Wei, Y.; Yuan, Q.; Shen, H.; Zhang, L. Boosting the accuracy of multispectral image pansharpening by learning a deep residual network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1795–1799. [Google Scholar] [CrossRef]
Benzenati, T.; Kallel, A.; Kessentini, Y. Two stages pan-sharpening details injection approach based on very deep residual networks. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4984–4992. [Google Scholar] [CrossRef]
Liu, Q.; Zhou, H.; Xu, Q.; Liu, X.; Wang, Y. PSGAN: A generative adversarial network for remote sensing image pan-sharpening. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10227–10242. [Google Scholar] [CrossRef]
Ozcelik, F.; Alganci, U.; Sertel, E.; Unal, G. Rethinking CNN-based pansharpening: Guided colorization of panchromatic images via GANs. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3486–3501. [Google Scholar] [CrossRef]
Zhou, H.; Hou, J.; Zhang, Y.; Ma, J.; Ling, H. Unified gradient-and intensity-discriminator generative adversarial network for image fusion. Inf. Fusion 2022, 88, 184–201. [Google Scholar] [CrossRef]
Ma, J.; Yu, W.; Chen, C.; Liang, P.; Guo, X.; Jiang, J. Pan-GAN: An unsupervised pan-sharpening method for remote sensing image fusion. Inf. Fusion 2020, 62, 110–120. [Google Scholar] [CrossRef]
Dong, W.; Hou, S.; Xiao, S.; Qu, J.; Du, Q.; Li, Y. Generative dual-adversarial network with spectral fidelity and spatial enhancement for hyperspectral pansharpening. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 7303–7317. [Google Scholar] [CrossRef]
Yuan, Q.; Wei, Y.; Meng, X.; Shen, H.; Zhang, L. A multiscale and multidepth convolutional neural network for remote sensing imagery pan-sharpening. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 978–989. [Google Scholar] [CrossRef]
Ballester, C.; Caselles, V.; Igual, L.; Verdera, J.; Rougé, B. A variational model for P+ XS image fusion. Int. J. Comput. Vis. 2006, 69, 43–58. [Google Scholar] [CrossRef]
Wu, Z.C.; Huang, T.Z.; Deng, L.J.; Vivone, G.; Miao, J.Q.; Hu, J.F.; Zhao, X.L. A new variational approach based on proximal deep injection and gradient intensity similarity for spatio-spectral image fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 6277–6290. [Google Scholar] [CrossRef]
Wu, Z.C.; Huang, T.Z.; Deng, L.J.; Huang, J.; Chanussot, J.; Vivone, G. LRTCFPan: Low-rank tensor completion based framework for pansharpening. IEEE Trans. Image Process. 2023, 32, 1640–1655. [Google Scholar] [CrossRef]
Saeedi, J.; Faez, K. A new pan-sharpening method using multiobjective particle swarm optimization and the shiftable contourlet transform. ISPRS J. Photogramm. Remote Sens. 2011, 66, 365–381. [Google Scholar] [CrossRef]
Yilmaz, V. A Non-Dominated Sorting Genetic Algorithm-II-based approach to optimize the spectral and spatial quality of component substitution-based pansharpened images. Concurr. Comput. Pract. Exp. 2021, 33, e6030. [Google Scholar] [CrossRef]
Larsson, G.; Maire, M.; Shakhnarovich, G. Colorization as a Proxy Task for Visual Understanding. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 840–849. [Google Scholar]
Zhang, R.; Isola, P.; Efros, A.A. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 June 2017; pp. 645–654. [Google Scholar]
Wang, Z.; Wang, J.; Liu, Z.; Qiu, Q. Energy-Inspired Self-Supervised Pretraining for Vision Models. arXiv 2023, arXiv:2302.01384. [Google Scholar]
Xing, Y.; Qu, L.; Zhang, S.; Zhang, K.; Zhang, Y.; Bruzzone, L. CrossDiff: Exploring Self-Supervised Representation of Pansharpening via Cross-Predictive Diffusion Model. IEEE Trans. Image Process. 2024, 33, 5496–5509. [Google Scholar] [CrossRef]
Fernandez-Beltran, R.; Fernandez, R.; Kang, J.; Pla, F. W-NetPan: Double-U network for inter-sensor self-supervised pan-sharpening. Neurocomputing 2023, 530, 125–138. [Google Scholar] [CrossRef]
He, J.; Yuan, Q.; Li, J.; Xiao, Y.; Zhang, L. A self-supervised remote sensing image fusion framework with dual-stage self-learning and spectral super-resolution injection. ISPRS J. Photogramm. Remote Sens. 2023, 204, 131–144. [Google Scholar] [CrossRef]
Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
Sermanet, P.; Lynch, C.; Chebotar, Y.; Hsu, J.; Jang, E.; Schaal, S.; Levine, S.; Brain, G. Time-contrastive networks: Self-supervised learning from video. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1134–1141. [Google Scholar]
Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Part XI. Springer International Publishing: Cham, Switzerland, 2020; pp. 776–794. [Google Scholar]
Cai, Q.; Wang, Y.; Pan, Y.; Yao, T.; Mei, T. Joint contrastive learning with infinite possibilities. Adv. Neural Inf. Process. Syst. 2020, 33, 12638–12648. [Google Scholar]
Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10551–10560. [Google Scholar]
Wang, Y.; Lin, S.; Qu, Y.; Wu, H.; Zhang, Z.; Xie, Y.; Yao, A. Towards compact single image super-resolution via contrastive self-distillation. arXiv 2021, arXiv:2105.11683. [Google Scholar]
Han, J.; Shoeiby, M.; Malthus, T.; Botha, E.; Anstee, J.; Anwar, S.; Wei, R.; Petersson, L.; Armin, M.A. Single underwater image restoration by contrastive learning. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 2385–2388. [Google Scholar]
Zhou, M.; Huang, J.; Yan, K.; Yang, G.; Liu, A.; Li, C.; Zhao, F. Normalization-based feature selection and restitution for pan-sharpening. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022. [Google Scholar]
Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
Baum, E.; Wilczek, F. Supervised learning of probability distributions by neural networks. In Proceedings of the Neural Information Processing Systems Conference, Denver, CO, USA, 8–12 November 1987. [Google Scholar]
Levin, E.; Fleisher, M. Accelerated learning in layered neural networks. Complex Syst. 1988, 2, 3. [Google Scholar]
Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
Yuhas, R.H.; Goetz, A.F.H.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Proceedings of the JPL, Summaries of the Third Annual JPL Airborne Geoscience Workshop, Volume 1: AVIRIS Workshop, Pasadena, CA, USA, 1–5 June 1992. [Google Scholar]
Alparone, L.; Wald, L.; Chanussot, J.; Thomas, C.; Gamba, P.; Bruce, L.M. Comparison of pansharpening algorithms: Outcome of the 2006 GRS-S data-fusion contest. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3012–3021. [Google Scholar] [CrossRef]
Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A. Context-driven fusion of high spatial and spectral resolution images based on oversampled multiresolution analysis. IEEE Trans. Geosci. Remote Sens. 2002, 40, 2300–2312. [Google Scholar] [CrossRef]
Restaino, R.; Mura, M.D.; Vivone, G.; Chanussot, J. Context-adaptive pansharpening based on image segmentation. IEEE Trans. Geosci. Remote Sens. 2017, 55, 753–766. [Google Scholar] [CrossRef]
Vivone, G. Robust band-dependent spatial-detail approaches for panchromatic sharpening. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6421–6433. [Google Scholar] [CrossRef]
Palsson, F.; Sveinsson, J.R.; Ulfarsson, M.O.; Benediktsson, J.A. Model-based fusion of multi- and hyperspectral images using PCA and wavelets. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2652–2663. [Google Scholar] [CrossRef]
Palsson, F.; Sveinsson, J.R.; Ulfarsson, M.O. A new pansharpening algorithm based on total variation. IEEE Geosci. Remote Sens. Lett. 2014, 11, 318–322. [Google Scholar] [CrossRef]
Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; Yuan, L. Davit: Dual attention vision transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general U-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Zhou, H.; Liu, Q.; Wang, Y. PanFormer: A Transformer-Based Model for Pan-Sharpening. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
Nie, Z.; Chen, L.; Jeon, S.; Yang, X. Spectral–spatial interaction network for multispectral image and panchromatic image fusion. Remote Sens. 2022, 14, 4100. [Google Scholar] [CrossRef]
Yang, J.; Fu, X.; Hu, Y.; Huang, Y.; Ding, X.; Paisley, J. PanNet: A deep network architecture for pan-sharpening. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1753–1761. [Google Scholar]

Figure 1. Overall framework diagram of MCAGP.

Figure 2. Spectral–Spatial Universal Module (SSUM) framework diagram.

Figure 3. Comparison of the high-frequency portion of an MS image and a PAN image: (a) shows the obvious noise in the high-frequency portion of the MS image; (b) demonstrates the rich and clear spatial details in the high-frequency portion of the PAN image. (a) High-frequency portion of a MS image (from left to right, they are as follows: MS-Original, MS-LH (Vertical High), MS-HL (Horizontal High), and MS-HH (Diagonal High)). (b) High-frequency portion of a PAN image (from left to right, they are as follows: PAN-Original, PAN-LH (Vertical High), PAN-HL (Horizontal High), and PAN-HH (Diagonal High)).

Figure 4. Multiscale contrastive learning framework. Positive samples are generated with multiscale high-frequency components from PAN, anchor samples are pairs of scale high-frequency components after reconstruction with spectral enhancement blocks, and negative samples are multiscale high-frequency components after LRMS interpolation to generate multiple negative samples.

Figure 5. Visualization on the WV3 dataset.

Figure 6. Visualization of the QB dataset.

Figure 7. Visualization of the GF2 dataset.

Table 1. Dataset Information. B is the number of channels in the multispectral image.

Data	B	MS-Resolution	PAN-Resolution
WV3	8	64	256
QB	4	64	256
GF2	4	64	256

Table 2. Test results for the WV3 dataset at reduced and full resolution. (Bold: best; underline: second best).

Method	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	D $λ ↓$	Ds ↓	QNR ↑
EXP	27.409	0.678	0.135	8.441	0.056	0.156	0.796
C-GSA	31.245	0.853	0.138	5.567	0.102	0.075	0.831
BDSD-PC	31.521	0.873	0.130	5.313	0.063	0.073	0.870
TV	32.381	0.905	0.094	4.855	0.078	0.102	0.829
PWMBF	32.130	0.919	0.097	3.932	0.028	0.078	0.894
DaViT	30.950	0.892	0.097	4.465	0.031	0.085	0.887
LeWin	30.591	0.882	0.096	4.694	0.029	0.082	0.892
MSDCNN	30.441	0.884	0.098	4.758	0.035	0.089	0.880
paDaViT	32.392	0.924	0.087	3.761	0.030	0.072	0.901
PanFormer	32.182	0.924	0.085	3.880	0.036	0.083	0.889
SSIN	34.252	0.955	0.070	3.036	0.029	0.075	0.898
PANNET	31.257	0.899	0.091	4.413	0.028	0.077	0.898
PNN	29.412	0.857	0.105	5.288	0.035	0.093	0.876
rDaViT	31.213	0.900	0.092	4.337	0.031	0.085	0.888
Ours	35.583	0.968	0.060	2.578	0.033	0.062	0.907

Table 3. Test results for the QB dataset at reduced and full resolution. (Bold: best; underline: second best).

Method	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	D $λ ↓$	Ds ↓	QNR ↑
EXP	28.038	0.682	0.145	11.927	0.079	0.186	0.750
C-GSA	32.057	0.861	0.125	7.530	0.080	0.137	0.794
BDSD-PC	31.920	0.855	0.136	7.648	0.029	0.069	0.904
TV	32.174	0.865	0.142	7.690	0.046	0.074	0.884
PWMBF	34.223	0.904	0.118	5.504	0.058	0.068	0.878
DaViT	32.880	0.909	0.114	6.313	0.038	0.115	0.852
LeWin	32.722	0.901	0.119	6.427	0.047	0.102	0.857
MSDCNN	33.060	0.917	0.113	6.183	0.039	0.110	0.855
paDaViT	33.404	0.921	0.108	5.951	0.045	0.098	0.862
PanFormer	33.552	0.914	0.102	5.857	0.048	0.114	0.843
SSIN	33.861	0.930	0.098	5.667	0.044	0.092	0.869
PANNET	33.496	0.920	0.108	5.888	0.053	0.073	0.878
PNN	32.506	0.899	0.121	6.601	0.035	0.106	0.864
rDaViT	33.125	0.912	0.111	6.145	0.039	0.110	0.856
Ours	35.858	0.952	0.086	4.572	0.052	0.045	0.906

Table 4. Test results for the GF2 dataset at reduced and full resolution. (Bold: best; underline: second best).

Method	PSNR ↑	SSIM ↑	SAM ↓	ERGAS↓	D $λ ↓$	Ds ↓	QNR ↑
EXP	31.094	0.794	0.035	2.645	0.019	0.167	0.816
C-GSA	33.944	0.895	0.033	1.924	0.053	0.134	0.820
BDSD-PC	33.882	0.894	0.032	1.911	0.049	0.139	0.819
TV	33.900	0.904	0.030	1.598	0.067	0.074	0.865
PWMBF	34.510	0.896	0.031	1.673	0.024	0.076	0.874
DaViT	36.897	0.933	0.022	1.269	0.036	0.054	0.913
LeWin	36.327	0.920	0.024	1.357	0.035	0.049	0.918
MSDCNN	36.166	0.923	0.024	1.368	0.033	0.045	0.923
paDaViT	37.232	0.934	0.021	1.216	0.038	0.056	0.908
PanFormer	36.483	0.929	0.024	1.315	0.034	0.049	0.919
SSIN	36.411	0.929	0.023	1.330	0.037	0.041	0.924
PANNET	36.478	0.926	0.022	1.318	0.038	0.043	0.921
PNN	35.616	0.914	0.028	1.461	0.034	0.049	0.917
rDaViT	37.023	0.935	0.021	1.247	0.034	0.056	0.912
Ours	37.307	0.954	0.021	1.238	0.046	0.064	0.892

Table 5. Results of ablation experiments on the QB dataset. (Bold: best; underline: second best).

	SSUM	CL	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	D $λ ↓$	Ds ↓	QNR ↑
$(1)$	×	×	32.795	0.914	0.105	6.380	0.043	0.060	0.900
$(2)$	×	✓	32.641	0.918	0.103	6.477	0.039	0.064	0.900
$(3)$	✓	×	34.504	0.942	0.093	5.210	0.045	0.080	0.879
Ours	✓	✓	35.858	0.952	0.086	4.572	0.052	0.045	0.906

Table 6. Test results of the WV3 dataset introducing contrast loss at different time periods. (Bold: best).

	SSUM	CL	Two-Stage	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	D $λ ↓$	Ds ↓	QNR ↑
	✓	✓	✓	34.952	0.966	0.062	2.752	0.036	0.052	0.914
Ours	✓	✓	×	35.583	0.968	0.060	2.578	0.033	0.062	0.907

Table 7. Test results of the QB dataset introducing contrast loss at different time periods. (Bold: best).

	SSUM	CL	Two-Stage	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	D $λ ↓$	Ds ↓	QNR ↑
	✓	✓	✓	35.661	0.952	0.088	4.650	0.059	0.079	0.867
Ours	✓	✓	×	35.858	0.952	0.086	4.571	0.052	0.045	0.906

Table 8. Test results of the GF2 dataset introducing contrast loss at different time periods. (Bold: best).

	SSUM	CL	Two-Stage	PSNR ↑	SSIM ↑	SAM ↓	ERGAS ↓	D $λ ↓$	Ds ↓	QNR ↑
	✓	✓	✓	37.394	0.954	0.020	1.203	0.045	0.071	0.887
Ours	✓	✓	×	37.307	0.954	0.021	1.238	0.046	0.064	0.892

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, Q.; Yang, X.; Li, B.; Wang, J. Self-Supervised Multiscale Contrastive and Attention-Guided Gradient Projection Network for Pansharpening. Sensors 2025, 25, 2560. https://doi.org/10.3390/s25082560

AMA Style

Li Q, Yang X, Li B, Wang J. Self-Supervised Multiscale Contrastive and Attention-Guided Gradient Projection Network for Pansharpening. Sensors. 2025; 25(8):2560. https://doi.org/10.3390/s25082560

Chicago/Turabian Style

Li, Qingping, Xiaomin Yang, Bingru Li, and Jin Wang. 2025. "Self-Supervised Multiscale Contrastive and Attention-Guided Gradient Projection Network for Pansharpening" Sensors 25, no. 8: 2560. https://doi.org/10.3390/s25082560

APA Style

Li, Q., Yang, X., Li, B., & Wang, J. (2025). Self-Supervised Multiscale Contrastive and Attention-Guided Gradient Projection Network for Pansharpening. Sensors, 25(8), 2560. https://doi.org/10.3390/s25082560

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Self-Supervised Multiscale Contrastive and Attention-Guided Gradient Projection Network for Pansharpening

Abstract

1. Introduction

2. Related Work

2.1. Self-Supervised Learning

2.2. Contrastive Learning

3. Proposed Method

3.1. Attention-Guided Gradient Projection

3.2. Multiscale Contrastive Learning

3.3. Loss Functions

4. Experiments

4.1. Datasets and Metrics

4.2. Comparison with SOTA Methods

4.3. Ablation Experiments

4.4. Discussion of the Loss Function Parameter, $λ$

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Self-Supervised Multiscale Contrastive and Attention-Guided Gradient Projection Network for Pansharpening

Abstract

1. Introduction

2. Related Work

2.1. Self-Supervised Learning

2.2. Contrastive Learning

3. Proposed Method

3.1. Attention-Guided Gradient Projection

3.2. Multiscale Contrastive Learning

3.3. Loss Functions

4. Experiments

4.1. Datasets and Metrics

4.2. Comparison with SOTA Methods

4.3. Ablation Experiments

4.4. Discussion of the Loss Function Parameter, λ

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

4.4. Discussion of the Loss Function Parameter, $λ$