1. Introduction
Advances in remote sensing satellite technology have made Earth surface observation possible [1]. However, limited by sensor performance, satellites are unable to capture images that contain both rich spectral and spatial information simultaneously. Instead, they can only acquire low-resolution multispectral (MS) images and corresponding high-resolution panchromatic (PAN) images. The importance of high-resolution multispectral images in fields such as change detection [2], classification [3], and target identification [4] has led to the emergence of the pansharpening technique.
Traditional pansharpening methods are mainly divided into strategies based on component substitution (CS) and multiresolution analysis (MRA). CS methods, such as Brovey [5], principal component analysis (PCA) [6], IHS [7], and GSA [8], project the spectral information of the MS image into a new domain by replacing some or all of the spatial information with data from the PAN image, followed by back-projection. Although histogram matching is performed before replacement to reduce spectral distortion, it is still difficult to completely avoid spectral aberrations. MRA methods, such as ATWT [9,10], SFIM based on smoothing filters [11], and the MTF-matched filtering-based Generalized Laplacian Pyramid (MTF-GLP) [12], extract spatial information from the PAN image through multiscale decomposition, subsequently injecting this information into the upsampled MS image. However, aliasing effects may cause spatial distortion.
Deep learning methods have become mainstream tools due to their powerful feature extraction capabilities and nonlinear mapping performance. Inspired by the super-resolution (SR) technique, Masi et al. [13] treated the pansharpening task as a super-resolution problem, using convolutional neural networks (CNNs) to address it. Subsequently, residual networks [14,15], generative adversarial networks (GANs) [16,17,18,19,20], and MSDCNN [21], a multiscale deep convolutional network, were proposed.
Variational optimization methods, which lie between traditional CS/MRA methods and deep learning, consider generalized pansharpening as an optimization problem. The P+XS method [22] achieves pansharpening by extracting spatial information from a panchromatic (PAN) image, which is then injected into a multispectral (MS) image. Wu et al. [23] combined variational optimization with deep CNNs to enhance the model's generalization ability, subsequently proposing a pansharpening framework based on low-rank tensor completion [24]. In addition, meta-heuristic algorithms [25,26] are also widely used in generalized pansharpening tasks due to their superior performance in large-scale search spaces.
Under the variational optimization framework, it is assumed that a multispectral (MS) image is a reduced-quality version of a high-resolution multispectral (HRMS) image, while a panchromatic (PAN) image is a linear combination of the bands of the HRMS image. Based on this assumption, this paper proposes two optimization problems for HRMS image reconstruction, which constrain the generation of HRMS images using the information from both MS and PAN images, respectively.
Although existing variational optimization methods have shown significant effects on pansharpening, there are still several issues that urgently need to be addressed:
- (1)
Modal differences between spatial and spectral information lead to inconsistencies in information representation and extraction, resulting in poor fusion performance.
- (2)
During the optimization process of HRMS images, the high-frequency noise in MS images is not considered in spectral optimization, leading to an increase in artifacts in the reconstructed image.
- (3)
Balancing spectral and spatial information: Overemphasizing one aspect may lead to a decrease in the overall quality of the final reconstructed image.
To address these challenges, this paper applies contrastive learning to the pansharpening task by introducing an innovative method that combines self-supervised multiscale contrastive learning with attention-guided deep gradient projection (MCAGP).
The method first designs a Spectral–Spatial Universal Module (SSUM) for the deep gradient projection network, combining deep priors to design spectral enhancement blocks (SpeEBs) and spatial enhancement blocks (SpaEBs). These blocks are applied serially and stacked alternately in the deep gradient projection network to solve the two optimization problems step by step.
Additionally, a multiscale contrastive learning strategy is applied to optimize the spatial information of PAN images. In this strategy, the high-frequency components of PAN images are considered positive samples, while those of MS images are treated as negative samples. This method strengthens the SpaEB’s focus on the spatial features of PAN images while also enhancing the SpeEB’s ability to preserve the spectral properties of MS images.
Finally, a contrastive loss function is applied to effectively balance spatial and spectral features by maximizing the similarity between the anchor and positive samples while minimizing that between the anchor and negative samples, with model performance further enhanced by incorporating the L1 loss.
The experimental results demonstrate that the MCAGP method surpasses both traditional and contemporary advanced methods in terms of visual quality and performance metrics, offering a novel approach to the pansharpening field.
The contributions of this paper are summarized as follows:
- (1)
Combining contrastive learning with deep gradient projection within a variational optimization framework: this method reduces modal differences by contrasting high-frequency features, strengthens the task focus of the spectral and spatial enhancement blocks, improves feature consistency and reconstruction quality, and overcomes conflicts between modalities through independent optimization strategies.
- (2)
Introducing a Spectral–Spatial Universal Module (SSUM) combined with deep priors: this module is extended into the spectral and spatial enhancement blocks, effectively solving the dual optimization problem. Through channel–spatial attention guidance and multilevel residual connections, it balances spatial and spectral features.
- (3)
Designing a multiscale contrastive learning strategy: this strategy introduces contrast loss to filter out noise in MS images, allowing the model to perform well in both full-resolution and reduced-resolution tasks.
The structure of the paper is as follows:
Section 2 provides a review of related work;
Section 3 describes the MCAGP method in detail;
Section 4 presents the experimental results; and
Section 5 presents the conclusions.
3. Proposed Method
This section describes in detail the proposed pansharpening method, MCAGP, whose overall framework is illustrated in
Figure 1 and Algorithm 1. In this figure, ms denotes the low-resolution multispectral image, PAN denotes the high-resolution panchromatic image, and HRMS refers to the final high-resolution multispectral image.
Algorithm 1: MCAGP Forward Pass.
The framework of MCAGP consists of three key components: a spectral enhancement block (SpeEB), a spatial enhancement block (SpaEB), and a Multiscale Contrastive Learning module (MCL), which are closely coupled through iterative residual learning.
Specifically, the process begins with interpolating the low-resolution MS image to the PAN resolution to obtain the initial HRMS estimate. Both the interpolated MS image and the original MS image are then input into the SpeEB, which enhances the spectral information by learning and compensating the spectral difference between the upsampled image and the original MS image. The SpeEB output is subsequently passed through the MCL module, where the multiscale contrastive loss is calculated by extracting high-frequency details and constructing positive and negative samples based on data augmentation and noise injection, effectively guiding the network to focus on fine-grained spatial–spectral consistency.
Afterwards, the contrastive-enhanced SpeEB output and the PAN image are jointly fed into the SpaEB, which injects spatial details from the PAN image while preserving spectral consistency. A residual block is embedded after the SpaEB to further refine the fused result and compensate for residual errors.
This procedure is repeated over L iterations, with residual connections linking the outputs at each stage to progressively refine the reconstructed HRMS. Through the interaction of spectral enhancement, spatial enhancement, and contrastive learning, the network gradually improves the fidelity of the pansharpened image. The detailed workflow is summarized in the pseudo-code provided, and the interconnection between modules is visually illustrated in
Figure 1.
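To make the iterative workflow concrete, the following PyTorch-style sketch mirrors the steps described above (interpolation, SpeEB, anchor collection for the MCL module, SpaEB, and residual refinement over L iterations). The function and argument names, the additive residual formulation, and the bicubic interpolation mode are illustrative assumptions rather than the exact implementation.

```python
import torch.nn.functional as F

def mcagp_forward(ms, pan, spe_blocks, spa_blocks, res_blocks):
    """Sketch of the MCAGP forward pass described above (illustrative, not the exact implementation).

    ms  : (B, C, h, w) low-resolution multispectral image
    pan : (B, 1, H, W) high-resolution panchromatic image
    """
    # Step 1: interpolate the LR MS image to the PAN resolution as the initial HRMS estimate.
    hrms = F.interpolate(ms, size=pan.shape[-2:], mode='bicubic', align_corners=False)

    anchors = []  # SpeEB outputs, later used as anchor samples by the MCL module
    for spe, spa, res in zip(spe_blocks, spa_blocks, res_blocks):  # L iterations
        hrms = hrms + spe(hrms, ms)    # spectral enhancement: compensate spectral differences w.r.t. MS
        anchors.append(hrms)           # anchor for the multiscale contrastive loss
        hrms = hrms + spa(hrms, pan)   # spatial enhancement: inject spatial detail from the PAN image
        hrms = hrms + res(hrms)        # residual block refines the fused result of this stage
    return hrms, anchors
```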
3.1. Attention-Guided Gradient Projection
Problem description: Suppose the LR image is a degraded version of the HR image, while the PAN image is a linear combination of the bands in the HR image. Therefore, the following observation model can be obtained:

$$\mathbf{M} = \mathbf{D}\mathbf{K}\mathbf{X}, \qquad \mathbf{P} = \mathbf{X}\mathbf{R},$$

where $\mathbf{D}$ denotes the downsampling matrix, $\mathbf{K}$ is the low-pass circular convolution matrix, $\mathbf{R}$ is the spectral response function, $\mathbf{X}$ represents the target high-resolution multispectral image, $\mathbf{M}$ is the observed low-resolution MS image, and $\mathbf{P}$ is the PAN image. Since reconstructing the HR image is a typical ill-posed inverse problem, the direct solution often faces instability. Therefore, in order to constrain the reasonableness of the solution, the following optimization problem with a regularization term is proposed:

$$\min_{\mathbf{X}} \; f(\mathbf{X},\mathbf{M},\mathbf{P}) + \lambda\,\phi(\mathbf{X}),$$

where $\phi(\mathbf{X})$ is the prior term, which is used to control the smoothness or structure of the $\mathbf{X}$ image; in traditional optimization this prior is typically hand-crafted, while in deep learning it is represented as an implicit prior. $f(\mathbf{X},\mathbf{M},\mathbf{P})$ is the data fidelity term, which is used to constrain the consistency between the $\mathbf{X}$, $\mathbf{M}$, and $\mathbf{P}$ images; $\lambda$ is the trade-off parameter, which regulates the relative importance between the regularization term and the data fidelity term.
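For illustration, the observation model can be simulated as follows; the Gaussian approximation of the low-pass filter $\mathbf{K}$ and the band-averaging spectral response $\mathbf{R}$ are stand-in assumptions for the sensor-specific operators.

```python
import torch
import torch.nn.functional as F

def observe(hrms, spectral_weights, scale=4, blur_sigma=1.0):
    """Simulate M = DKX and P = XR for a given HRMS image X (illustrative operators only)."""
    B, C, H, W = hrms.shape
    # K: low-pass circular convolution, approximated here by a depthwise Gaussian blur.
    k = torch.arange(5, dtype=hrms.dtype, device=hrms.device) - 2
    g = torch.exp(-(k ** 2) / (2 * blur_sigma ** 2))
    g = g / g.sum()
    kernel = (g[:, None] * g[None, :]).view(1, 1, 5, 5).repeat(C, 1, 1, 1)
    blurred = F.conv2d(F.pad(hrms, (2, 2, 2, 2), mode='circular'), kernel, groups=C)
    # D: downsampling by the scale factor (decimation).
    ms = blurred[:, :, ::scale, ::scale]
    # R: spectral response, modeled as a weighted sum of the HRMS bands.
    pan = (hrms * spectral_weights.view(1, C, 1, 1)).sum(dim=1, keepdim=True)
    return ms, pan
```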
In order to better utilize the deep learning framework, the generalized pansharpening problem is decomposed into two complementary subproblems: spectral optimization and spatial optimization. This decomposition allows for the independent optimization of spectral and spatial information, with the final goal of reconstructing the HR image formulated as follows:

$$\min_{\mathbf{X}} \; \|\mathbf{D}\mathbf{K}\mathbf{X} - \mathbf{M}\|_F^2 + \lambda_1\,\phi_1(\mathbf{X}), \qquad \min_{\mathbf{X}} \; \|\mathbf{X}\mathbf{R} - \mathbf{P}\|_F^2 + \lambda_2\,\phi_2(\mathbf{X}).$$
Inspired by generative adversarial network (GAN) algorithms, two generative modules were designed: the spectral enhancement block (SpeEB) and the spatial enhancement block (SpaEB). These two modules implicitly model the regularization terms through deep learning in order to optimize both the spectral features and the spatial details.
Spectral enhancement block (SpeEB): The focus of the spectral enhancement module is to optimize the spectra by reconstructing a spectral distribution consistent with the low-resolution (LR) image. The optimization of SpeEB alternates a gradient step on the data fidelity term with a proximal step:

$$\mathbf{X}^{(k+\frac{1}{2})} = \mathbf{X}^{(k)} - \eta_1\,(\mathbf{D}\mathbf{K})^{\top}\!\left(\mathbf{D}\mathbf{K}\mathbf{X}^{(k)} - \mathbf{M}\right), \qquad \mathbf{X}^{(k+1)} = \operatorname{prox}_{\lambda_1\phi_1}\!\left(\mathbf{X}^{(k+\frac{1}{2})}\right),$$

where $\eta_1$ is the step size, and $\operatorname{prox}_{\lambda_1\phi_1}$ is the proximal operator corresponding to the penalty term $\lambda_1\phi_1$.
Spatial enhancement block (SpaEB): The spatial enhancement block focuses on spatial optimization, refining spatial details by comparing the linear band combination of the HR image with the PAN image. Its optimization follows the same gradient projection scheme:

$$\mathbf{X}^{(k+\frac{1}{2})} = \mathbf{X}^{(k)} - \eta_2\,\left(\mathbf{X}^{(k)}\mathbf{R} - \mathbf{P}\right)\mathbf{R}^{\top}, \qquad \mathbf{X}^{(k+1)} = \operatorname{prox}_{\lambda_2\phi_2}\!\left(\mathbf{X}^{(k+\frac{1}{2})}\right),$$

where $\eta_2$ is the step size, and $\operatorname{prox}_{\lambda_2\phi_2}$ is the proximal operator corresponding to the penalty term $\lambda_2\phi_2$.
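A minimal sketch of one SpeEB/SpaEB iteration under these updates is given below, where the degradation operators and the learned proximal mappings (played by SSUM in this paper) are passed in as callables; the step sizes and function names are illustrative.

```python
def spe_step(X, M, degrade, degrade_adjoint, prox_spe, eta=0.1):
    """Spectral gradient projection: gradient step on ||DKX - M||^2, then a learned proximal map."""
    residual = degrade(X) - M                 # data fidelity residual in the LR domain (DKX - M)
    X = X - eta * degrade_adjoint(residual)   # gradient step; the adjoint maps the residual back to HR
    return prox_spe(X)                        # learned proximal operator (SpeEB, built on SSUM)


def spa_step(X, P, response, response_adjoint, prox_spa, eta=0.1):
    """Spatial gradient projection: gradient step on ||XR - P||^2, then a learned proximal map."""
    residual = response(X) - P                # compare the band combination of X with the PAN image
    X = X - eta * response_adjoint(residual)  # broadcast the PAN-domain residual back to all bands
    return prox_spa(X)                        # learned proximal operator (SpaEB, built on SSUM)
```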
Spectral–Spatial Universal Module (SSUM): The detailed structure of the Spectral–Spatial Universal Module (SSUM) is illustrated in
Figure 2. To further enhance the fusion efficiency of spectral and spatial information, this paper introduces the SSUM module between the spectral enhancement block (SpeEB) and the spatial enhancement block (SpaEB), aiming to achieve the unified extraction and enhancement of spectral and spatial features. Specifically, SSUM incorporates both channel attention and spatial attention mechanisms, which effectively guide the network to selectively focus on spectral attributes and spatial details, thereby improving the feature representation capability. In the overall framework, SpeEB mainly leverages the residual information between the low-resolution multispectral (MS) image and the interpolated high-resolution MS image to compensate for the spectral distortion caused by upsampling. Conversely, SpaEB focuses on utilizing the spatial structural details contained in the PAN image and compensates for the spatial resolution loss via a residual back-projection strategy. Although both SpeEB and SpaEB share the same SSUM structure as the basic unit for feature mapping and residual feedback, they achieve functional decoupling and complementarity in terms of input design and residual information utilization. This ensures a well-balanced optimization between spectral fidelity and spatial detail enhancement. Furthermore, the structural versatility and efficiency of SSUM enable feature sharing and collaborative optimization between SpeEB and SpaEB, significantly improving the overall quality of feature representation and computational efficiency.
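Since SSUM is described here only at the level of channel attention, spatial attention, and residual connections, the following CBAM-style sketch is one plausible realization; the channel count, reduction ratio, and kernel sizes are assumptions rather than the paper's exact configuration. Under this design, SpeEB and SpaEB would share the same block while differing in the inputs and residuals they operate on, consistent with the functional decoupling described above.

```python
import torch
import torch.nn as nn

class SSUMSketch(nn.Module):
    """Illustrative Spectral-Spatial Universal Module: channel attention + spatial attention + residual."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Channel attention: squeeze spatial dimensions, then re-weight each feature map.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention: pool across channels, then re-weight each pixel location.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        feat = self.body(x)
        feat = feat * self.channel_att(feat)
        pooled = torch.cat([feat.mean(dim=1, keepdim=True),
                            feat.max(dim=1, keepdim=True).values], dim=1)
        feat = feat * self.spatial_att(pooled)
        return x + feat  # residual connection keeps the input features flowing through
```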
3.2. Multiscale Contrastive Learning
In the reconstruction of remote sensing images, MS images have poor spatial quality with significant high-frequency noise (
Figure 3a). In contrast, PAN images have clear high-frequency spatial details (
Figure 3b). Thus, MS images mainly contribute spectral information, while PAN images provide high-quality spatial details. This division prevents artifacts caused by mixing MS image noise with PAN image details.
To this end, discrete wavelet transform (DWT) is introduced in this paper to extract the multiscale high-frequency features of PAN and MS images. DWT is able to capture the spatial details in multiscale and multidirectional forms by decomposing the images into low and high-frequency subbands. The multiscale contrastive learning framework is shown in
Figure 4. Specifically, the following applies.
Anchor sample: The reconstructed image generated via the SpeEB is used to extract multiscale high-frequency features through DWT, with the low-dimensional embedded features generated by global pooling and linear projection:

$$\mathbf{z}_a = W\!\left(\mathrm{GP}\!\left(\mathrm{DWT}_{H}\!\left(\mathbf{X}_{spe}\right)\right)\right),$$

where $\mathrm{GP}(\cdot)$ denotes global pooling, $W(\cdot)$ denotes the linear projection mapping high-dimensional features to the low-dimensional latent space, $\mathrm{DWT}_{H}(\cdot)$ extracts the multiscale high-frequency subbands, and $\mathbf{X}_{spe}$ denotes the HR image generated via the spectral enhancement module.
Positive sample: The PAN image is taken as the spatially matched counterpart; its multiscale high-frequency features are extracted after data augmentation (e.g., random flipping and rotation), and the embedded features are generated using the same process as for the anchor samples:

$$\mathbf{z}_p = W\!\left(\mathrm{GP}\!\left(\mathrm{DWT}_{H}\!\left(\mathrm{Aug}(\mathbf{P})\right)\right)\right),$$

where $\mathrm{Aug}(\cdot)$ represents data augmentation operations such as random flipping and rotation.
Negative sample: Diverse negative samples are generated from the upsampled MS image by adding Gaussian noise, extracting the high-frequency features, and mapping them to the low-dimensional space. Using multiple negative samples enlarges the distance between the anchor and the negatives, improving the discriminative ability:

$$\mathbf{z}_n^{i} = W\!\left(\mathrm{GP}\!\left(\mathrm{DWT}_{H}\!\left(\tilde{\mathbf{M}}\right) + \mathbf{n}_i\right)\right), \quad i = 1, \ldots, K,$$

where $\tilde{\mathbf{M}}$ denotes the LR image after upsampling through the interpolation operation, $\mathbf{n}_i$ denotes the random Gaussian noise added to the high-frequency portion of the MS image, and $i$ indexes the different negative sample instances.
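The construction of the anchor, positive, and negative embeddings can be sketched as below using PyWavelets for the multiscale decomposition; the Haar wavelet, two decomposition levels, the mean-based global pooling, the fixed projection matrix, and the noise level are illustrative assumptions (the paper learns the projection).

```python
import numpy as np
import pywt

def hf_embedding(img, proj):
    """Multiscale high-frequency embedding: DWT high-frequency subbands -> global pooling -> linear projection.

    img  : (C, H, W) numpy array
    proj : (6 * C, d) numpy projection matrix (assumed; two DWT levels give 6 subband statistics per band)
    """
    feats = []
    for band in img:
        coeffs = pywt.wavedec2(band, 'haar', level=2)        # multiscale wavelet decomposition
        for (cH, cV, cD) in coeffs[1:]:                       # keep only the high-frequency subbands
            feats.extend([np.abs(cH).mean(), np.abs(cV).mean(), np.abs(cD).mean()])  # global pooling
    return np.asarray(feats) @ proj                           # linear projection to the latent space

def build_samples(hrms_spe, pan, ms_up, proj, num_neg=4, seed=0):
    """Anchor from the SpeEB output, positive from the augmented PAN, negatives from noisy upsampled MS."""
    rng = np.random.default_rng(seed)
    anchor = hf_embedding(hrms_spe, proj)
    pan_aug = np.flip(pan, axis=-1)                           # simple augmentation: horizontal flip
    # Replicate the single PAN band so the same projection matrix can be reused (illustrative choice).
    positive = hf_embedding(np.repeat(pan_aug, hrms_spe.shape[0], axis=0), proj)
    negatives = [hf_embedding(ms_up + rng.normal(0, 0.05, ms_up.shape), proj)
                 for _ in range(num_neg)]
    return anchor, positive, negatives
```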
Multiscale contrastive learning (MCL): In the proposed MCAGP framework, a multiscale contrastive learning (MCL) module is introduced. As illustrated in
Figure 4, the complete process of positive and negative sample construction, high-frequency feature extraction, and contrastive loss computation is clearly presented, providing readers with a detailed understanding of the implementation and functionality of this module.
The core idea of the MCL module is to guide the network to focus more on the consistency of spatial–spectral details during training by constructing positive and negative sample pairs. Specifically, multiscale high-frequency features are first extracted from the output of the SpeEB module, which serves as the anchor samples. Subsequently, a data augmentation strategy—including rotation, flipping, color jittering, and other transformations—is applied to the PAN image to generate positive samples. Their multiscale high-frequency features are also extracted. In order to provide effective contrastive information, multiple negative samples are further generated by injecting Gaussian noise into the multispectral image MS, followed by high-frequency feature extraction.
In the feature space, the similarity between the anchor features and the positive features is maximized (i.e., bringing them closer), while the similarity between the anchor features and the negative features is minimized (i.e., pushing them apart). This forms the positive–negative contrastive training objective, where the similarity measurement is implemented using the InfoNCE loss function.
It is noteworthy that the high-frequency feature extraction in the MCL module not only focuses on single-scale texture information but also leverages multiscale spatial details obtained via discrete wavelet transform (DWT). This ensures the effectiveness of contrastive loss across different scales. Additionally, the generation process of positive and negative samples incorporates diverse data augmentation and noise injection strategies, effectively enhancing the model’s discriminative ability and robustness.
Residual connection and information balance: To avoid losing spectral information through the model's over-reliance on the spatial features of the PAN image, and to improve the fusion efficiency of spectral and spatial features, this paper introduces a multi-stage residual connection mechanism between the SpeEB, the SpaEB, and the subsequent residual blocks, which progressively accumulates the features of each stage and achieves a dynamic balance between spectral and spatial information.
3.3. Loss Functions
Contrastive loss: the InfoNCE loss [33,34,35] is used:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\left(\mathrm{sim}(\mathbf{z}_a, \mathbf{z}_p)/\tau\right)}{\exp\!\left(\mathrm{sim}(\mathbf{z}_a, \mathbf{z}_p)/\tau\right) + \sum_{i=1}^{K} \exp\!\left(\mathrm{sim}(\mathbf{z}_a, \mathbf{z}_n^{i})/\tau\right)},$$

where $\mathbf{z}_a$ is the feature representation of the anchor sample, $\mathbf{z}_p$ is the feature representation of the positive sample, $\mathbf{z}_n^{i}$ is the feature representation of the $i$-th negative sample, with a total of $K$ negative samples, $\tau$ is a given temperature parameter used to regulate the scaling range of the similarity, and $\mathrm{sim}(\cdot,\cdot)$ is the similarity function, commonly a dot product or cosine similarity between feature vectors.
In the implementation, the dot products of the anchor with the positive and negative samples are batch-processed and concatenated column-wise to form a logits matrix, in which the first position corresponds to the positive sample and the remaining positions to the negative samples. Cross-entropy loss is a reliable and efficient loss function that is widely utilized in deep networks [43,44,45], and the final contrastive loss is calculated via cross-entropy.
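This logits-plus-cross-entropy formulation can be sketched as follows; the cosine-similarity normalization and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE as cross-entropy over a logits matrix whose first column is the positive pair.

    anchor, positive : (B, d) embeddings
    negatives        : (B, K, d) embeddings (K negatives per anchor)
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True)     # (B, 1) anchor-positive similarity
    neg_logits = torch.einsum('bd,bkd->bk', anchor, negatives)    # (B, K) anchor-negative similarities
    logits = torch.cat([pos_logit, neg_logits], dim=1) / tau      # positive occupies column 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)                        # cross-entropy yields the InfoNCE loss
```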
Thus, the total loss function of the model is as follows:

$$\mathcal{L} = \mathcal{L}_{1} + \alpha\,\mathcal{L}_{\mathrm{InfoNCE}},$$

where $\alpha$ is the weight hyperparameter used to balance the contributions of the $\mathcal{L}_1$ loss and the InfoNCE loss.
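A corresponding sketch of the total objective, reusing info_nce_loss from the sketch above; the symbol alpha follows the notation of the equation, and the default value of 1 matches the setting discussed in Section 4.4.

```python
import torch.nn.functional as F

def total_loss(hrms_pred, hrms_gt, anchor, positive, negatives, alpha=1.0):
    # L1 reconstruction term plus the weighted InfoNCE term.
    return F.l1_loss(hrms_pred, hrms_gt) + alpha * info_nce_loss(anchor, positive, negatives)
```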
4. Experiments
4.1. Datasets and Metrics
To verify the superiority of the proposed method, we conduct experiments on the Rio dataset (source: WV3), the Guangzhou dataset (source: GF2), and the Indianapolis dataset (source: QB), all with a scale factor of 4; each dataset provides both a reduced-resolution test set and a full-resolution test set, as shown in
Table 1. The data can be found at GitHub-liangjiandeng/PanCollection.
For the reduced-resolution experiments, we used four commonly used metrics: the peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM) [46], the spectral angle mapper (SAM) [47], and the relative dimensionless global error in synthesis (ERGAS) [48]. For the full-resolution experiments, we use the spectral distortion index ($D_\lambda$), the spatial distortion index ($D_s$), and the quality with no reference (QNR) index to assess the quality of the results.
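SAM and ERGAS follow their standard definitions; a minimal NumPy sketch is given below, with the scale factor of 4 and the numerical-stability epsilon as assumptions.

```python
import numpy as np

def sam(ref, fused, eps=1e-8):
    """Mean spectral angle (in degrees) between reference and fused pixels; arrays are (H, W, C)."""
    dot = (ref * fused).sum(axis=-1)
    denom = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fused, axis=-1) + eps
    angles = np.arccos(np.clip(dot / denom, -1.0, 1.0))
    return np.degrees(angles.mean())

def ergas(ref, fused, scale=4):
    """Relative dimensionless global error in synthesis; arrays are (H, W, C)."""
    rmse = np.sqrt(((ref - fused) ** 2).reshape(-1, ref.shape[-1]).mean(axis=0))  # per-band RMSE
    means = ref.reshape(-1, ref.shape[-1]).mean(axis=0)                            # per-band mean
    return 100.0 / scale * np.sqrt(np.mean((rmse / means) ** 2))
```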
Our MCAGP is implemented in the PyTorch framework (Python 3.8) with the Adam optimizer, a learning rate of , L2 regularization, a weight decay factor of , a batch size of 4, and a network depth and width of 64 and 8, respectively. The experiments were performed in MATLAB 2019b on a computer with an NVIDIA RTX 4050 GPU (NVIDIA, Santa Clara, CA, USA). For the other deep learning pansharpening methods, we trained the networks using the default settings from the corresponding papers or code repositories, on the same equipment and in the same PyTorch environment.
4.2. Comparison with SOTA Methods
In this section, we compare the method proposed in this paper with several state-of-the-art methods, including five traditional methods, i.e., EXP [49], C-GSA [50], BDSD-PC [51], TV [52], and PWMPF [53], and nine deep learning-based methods, i.e., DaViT [54] and its variants paDaViT and rDaViT, LeWin [55], MSDCNN [21], PanFormer [56], SSIN [57], PANNET [58], and PNN [13]. We conducted reduced-resolution and full-resolution experiments on three datasets, with the reduced-resolution experiments following the Wald protocol.
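Under the Wald protocol, both inputs are degraded by the scale factor and the original MS image serves as the reference; the sketch below uses simple bicubic degradation as a stand-in for the sensor-specific MTF filters, which is an assumption rather than the exact preprocessing used for these datasets.

```python
import torch.nn.functional as F

def wald_reduced_resolution(ms, pan, scale=4):
    """Build a reduced-resolution evaluation pair following the Wald protocol (sketch).

    ms  : (B, C, h, w) original MS image  -> becomes the ground truth
    pan : (B, 1, H, W) original PAN image
    """
    ms_lr = F.interpolate(ms, scale_factor=1 / scale, mode='bicubic', align_corners=False)
    pan_lr = F.interpolate(pan, scale_factor=1 / scale, mode='bicubic', align_corners=False)
    return ms_lr, pan_lr, ms  # degraded inputs plus the original MS image as the reference
```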
Results on WV3 dataset:
Table 2 shows the results of quantitative experiments on the WV3 dataset, while
Figure 5 provides a visualization of the fused images. Overall, the deep learning-based approaches show significant advantages over the traditional approaches. In the reduced-resolution experiments, our method leads the second-best method by 1.331 dB in the PSNR metric and by 0.013 in the SSIM metric, indicating a significant improvement in image restoration quality. The restored images are clearer and more natural, with better preservation of details and structures. Our method reduces the spectral angle mapper (SAM) by 0.01 and the ERGAS value by 0.458 compared to the second-best method, which further indicates that our method achieves a superior balance between preserving spatial details and spectral accuracy. These metrics show that our method effectively recovers the high-frequency details of the image during reconstruction while reducing the recovery error and enhancing the realism of the image. In the full-resolution experiments, although our $D_\lambda$ (spectral distortion) is slightly higher than that of other methods, indicating a slight trade-off in spectral recovery, we succeeded in minimizing spatial distortion by optimizing $D_s$ (spatial distortion). This allowed us to achieve optimal performance in the recovery of spatial details, ensuring high-resolution image recovery. In terms of the final QNR (quality with no reference) value, our method achieves the best performance, indicating that we have achieved an ideal balance between image quality and noise control and thus ensured the detail and visual quality of the image. In terms of visual effect, our method significantly improves the clarity and detail of the image, especially in the rendering of buildings and vegetation, with a sharper restoration effect.
Results on the QB dataset:
Table 3 lists the quantitative results on the QB dataset, while
Figure 6 demonstrates the corresponding visual effects. Overall, the deep learning-based methods outperform the traditional approaches. In the reduced-resolution experiments, our method surpasses the second-best method by 0.1635 dB in PSNR and 0.022 in SSIM, indicating superior performance in image noise suppression, detail retention, and structure restoration. The SAM and ERGAS values are lower than those of the second-best method by 0.012 and 1.095, respectively, suggesting that our method maximizes spectral restoration, preserving the spectral features of the original image and effectively reducing reconstruction errors. In the full-resolution experiments, our method slightly sacrifices spectral fidelity ($D_\lambda$), but this does not affect overall performance. Our spatial distortion ($D_s$) is the lowest among all methods, demonstrating that we minimize spatial distortion during image restoration, ensuring the accurate recovery of spatial structure and details. Notably, in the comprehensive QNR (quality with no reference) metric, our method achieves the best performance, indicating an ideal balance between image quality and noise control.
Results on the GF2 dataset:
Table 4 summarizes the experimental results on the GF2 dataset, while
Figure 7 presents a visual representation of the fused images. In the reduced-resolution experiments, our method outperforms the next best method by 0.075 dB in PSNR and 0.019 in SSIM, demonstrating its superiority in image restoration quality, particularly in detail and contrast preservation. Our SAM is the lowest among all methods, indicating better spectral restoration performance, and our ERGAS of 1.238 is only 0.022 higher than that of the best-performing paDaViT method, remaining highly competitive. In the full-resolution experiments, our method continues to significantly outperform the traditional methods, although it scores slightly lower than some individual deep learning methods in certain metrics, especially in spectral recovery. Overall, our method achieves a balance between spectral and spatial details in image restoration, with superior overall performance. Visually, the fused images exhibit lower noise, fewer artifacts, and sharper details with better contrast.
The performance on the GF2 dataset is not as good as that on the QB and WV3 datasets, mainly due to the noise level in the data, scene complexity, and the stringent demands of the unsupervised full-resolution evaluation protocol on the model’s generalization ability. The GF2 dataset contains more fragmented structures, a mix of vegetation and urban textures, and more pronounced edge aliasing effects, which increase the difficulty of image restoration. Additionally, the performance on the GF2 dataset in the full-resolution experiments is not as good as that on other datasets, partly because the Wald protocol we used has limited applicability to the GF2 dataset. While the Wald protocol works effectively for high-quality commercial sensors such as QB and WV3, it may not hold for GF2, as significant details and noise patterns are lost during the downsampling process, and the generated pseudo-GT exhibits substantial statistical deviation from the true full-resolution images in both spectral and texture domains. Although our method outperforms others in down-resolution experiments, the performance on the GF2 dataset in full-resolution evaluation is slightly worse than on other datasets due to these factors.
4.3. Ablation Experiments
To evaluate the contribution of each module in the proposed method, we conducted ablation experiments on the QB dataset by replacing or removing different modules, comparing the experimental results with the final model (Ours) and analyzing the impact of each module on the model performance. The experimental results are shown in
Table 5 and analyzed in detail below:
- (1)
Replacing the SSUM module with regular convolution while removing the contrastive learning part.
In the experimental setup (1), the SSUM module is replaced with regular convolution, with the contrastive learning part removed. Compared with our final model (our approach), PSNR decreased by 8.54%, SSIM decreased by 3.99%, SAM increased by 22.09%, ERGAS increased by 39.50%, and QNR decreased by 0.66%. The results show that regular convolution cannot replace the efficient SSUM module, with the removal of contrastive learning significantly reducing the model’s performance in both down-resolution and full-resolution experiments.
- (2)
Replacing the SSUM module with regular convolution while retaining only the contrastive learning component.
In experimental setup (2), contrastive learning and its loss function are retained, but the SSUM module is replaced with ordinary convolution. Compared with our approach, PSNR decreased by 8.97%, SSIM decreased by 3.57%, SAM increased by 19.77%, ERGAS increased by 41.66%, and QNR decreased by 0.66%. The results demonstrate the key role of the SSUM module in the model, which can significantly improve the reconstruction quality of image details and effectively reduce errors.
- (3)
Retaining the SSUM module while deleting the contrastive learning part.
In experimental setup (3), only the SSUM module is used, and the contrastive learning part is removed. Compared with our approach, PSNR decreased by 3.78%, SSIM decreased by 1.05%, SAM increased by 8.14%, and ERGAS increased by 13.95%. Although the SSUM module improves the reconstruction quality, the removal of contrastive learning degrades the model’s performance in the high-resolution reconstruction task; the spectral and spatial properties especially cannot be fully optimized, further validating the importance of contrastive learning.
4.4. Discussion of the Loss Function Parameter α
To address the different optimization objectives of the two loss functions, we investigated the impact of introducing the contrastive loss at different stages on model performance, proposing a new strategy that adds the contrastive loss at a later stage to fine-tune the already established model. In our experiments, we compared two training strategies: one introduced the contrastive loss throughout the whole process (i.e., the method in this paper, with α = 1); the other first trained the model using the L1 reconstruction loss to establish basic image reconstruction capability, followed by gradually increasing the weight of the contrastive loss until it matched the L1 loss. The training results are shown in
Table 6,
Table 7 and
Table 8.
On the WV3 dataset, our method prioritizes spectral retention, reflected by lower SAM and $D_\lambda$ values, but with a slight sacrifice in spatial consistency (indicated by the increase in $D_s$). In contrast, the two-stage training strategy balances spectral and spatial properties better, though at the cost of a slight reduction in PSNR. For tasks requiring high spectral fidelity, such as surface classification and hyperspectral analysis, our method is more suitable. For higher overall performance, the two-stage strategy can be considered. On the QB dataset, our method offers a better balance between spectral and spatial performance, achieving a superior overall performance index. On the GF2 dataset, the two-stage method strikes a better balance between spatial details and spectral consistency, effectively reducing the global error (ERGAS); in the full-resolution tests, our method shows better spatial detail recovery and noise suppression.