1. Introduction
The production of high-resolution synthetic images is a major focus in remote sensing image processing. It serves as a critical preprocessing step for the subsequent extraction of information, whether through human interpretation or numerical algorithms. These techniques generate detailed images by leveraging image redundancies, as in super-resolution [1], or by combining complementary data from multiple images of the same area, as in data fusion [2,3]. This enables the creation of images with higher detail than what state-of-the-art sensors can directly capture.
One of the most widely studied data fusion techniques is pansharpening. Historically, it aims to merge a low-resolution (LR) multispectral (MS) image with a high-resolution (HR) panchromatic (PAN) image to create an HR version of the MS image [4,5,6]. The good results achieved in pansharpening can be attributed to the fact that many satellites carry both types of sensors, allowing near-simultaneous, co-located acquisition of surface images. This enables applications ranging from enhanced visualization in virtual globe software, like Google Earth and Bing Maps, to improved scene classification [7] and change detection [8].
Numerous pansharpening solutions have been proposed, with the most prominent methodologies evolving rapidly alongside advances in signal processing [5,9]. The objective is effectively described by Wald’s protocol, which provides a guiding framework for developing pansharpening solutions [10]. Given a multispectral sensor, the goal is to generate an image that mimics what the same sensor would capture at a higher resolution. To achieve this, a panchromatic image of the same area, acquired at the target resolution, is utilized to extract the missing high-resolution details.
Early developments were marked by the competition between spatial domain methods and spectral domain techniques [11]. Spatial domain methods, often called Multiresolution Analysis (MRA) techniques, rely on the multiscale decomposition of the images to be fused [12]. Such decompositions can be obtained through linear systems like Gaussian filters [13], wavelets [14,15], curvelets [16], and contourlets [17], or through nonlinear schemes [18]. Spectral domain methods, typically denoted as Component Substitution (CS) techniques, instead perform image fusion in a transformed domain, where the spatial component is enhanced using the PAN image. Examples include Principal Component Analysis (PCA) [19], Gram–Schmidt orthogonalization [20], Brovey transform (BT) [21], and the band-dependent spatial-detail-based (BDSD) approach [22,23].
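The difference between the two families can be illustrated with the commonly used detail-injection view, in which both add PAN-derived details to the upsampled MS bands and differ only in how the details are extracted. The following is a minimal sketch under stated assumptions: the function names, the equal-weight intensity component, and the unit injection gains are illustrative choices, not the formulation of any specific method cited above. Each band is represented as a flat list of pixel values.

```python
def cs_fuse(ms_up, pan, gains=None):
    """Component-substitution sketch: details = PAN minus an intensity component."""
    n_bands = len(ms_up)
    if gains is None:
        gains = [1.0] * n_bands  # unit gains: an illustrative assumption
    # Equal-weight intensity: an assumption; real CS methods estimate the weights.
    intensity = [sum(pixel) / n_bands for pixel in zip(*ms_up)]
    details = [p - i for p, i in zip(pan, intensity)]
    return [[m + g * d for m, d in zip(band, details)]
            for band, g in zip(ms_up, gains)]


def mra_fuse(ms_up, pan, pan_lowpass, gains=None):
    """Multiresolution-analysis sketch: details = PAN minus its low-pass version."""
    if gains is None:
        gains = [1.0] * len(ms_up)  # unit gains: an illustrative assumption
    details = [p - q for p, q in zip(pan, pan_lowpass)]
    return [[m + g * d for m, d in zip(band, details)]
            for band, g in zip(ms_up, gains)]


# Two bands, two pixels each; the PAN image is brighter than the MS intensity.
fused_cs = cs_fuse([[1.0, 2.0], [3.0, 4.0]], [4.0, 5.0])
fused_mra = mra_fuse([[1.0, 2.0], [3.0, 4.0]], [4.0, 5.0], [4.0, 5.0])
```

In the CS case the details depend on the MS bands themselves through the intensity component, whereas in the MRA case they depend only on the PAN image and the chosen low-pass filter.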
Classical methods remain relevant due to their simplicity and adaptability to various image types. Their continued use is further supported by the high performance achieved through ongoing refinements. In particular, detail injection techniques in both CS and MRA methods have seen significant improvements. For example, the optimization of high-pass modulation (HPM) [24] has resulted in state-of-the-art performance [25,26], while regression-based projective methods, like the Gram–Schmidt adaptive (GSA) algorithm [27] and the MTF-based generalized Laplacian pyramid with context-based decision (MTF-GLP-CBD) [28], have advanced the field. Physical considerations have also contributed, such as the introduction of haze correction in the Brovey transform (BT-H) and MTF-GLP with HPM injection model (MTF-GLP-HPM-H) [29], as well as the BDSD with physical constraints (BDSD-PC) [30]. Contextual adaptive techniques, often indicated by the prefix “C”, such as C-BDSD [23], C-GSA, and C-MTF-GLP-CBD [31], also compete with the most innovative pansharpening methods, including variational optimization (VO) and machine learning (ML) approaches [9].
Before ML techniques gained popularity, VO methods attracted significant attention. They delivered high performance but required dataset-specific parameter tuning and complex hyperparameter estimation [9]. Despite these obstacles, some VO approaches, such as semiblind deconvolution-based filter estimation [32], total variation [33], and low-rank representation [34,35], have achieved success. Sparse representation methods have achieved noteworthy results, beginning with seminal works where this representation was applied to images [36,37]. These methods have evolved to include techniques focused on details (SR-D) [38,39] and approaches utilizing dictionaries optimized for the spatial and spectral information of the images being combined [40].
Machine learning techniques, the main focus of this study, have revolutionized pansharpening. Two pioneering methods are the autoencoder-based approach [41] and the pansharpening neural network (PNN) based on a convolutional neural network (CNN) [42]. These approaches remain the foundation of many current implementations. A key innovation in recent years has been the introduction of residual learning, which accelerates and stabilizes the parameter learning process [43,44]. Deeper networks have been proposed to further enhance performance [43], though an increasing number of parameters can complicate the learning process [45]. Attention mechanisms [46], later integrated into more sophisticated transformer-based networks [47], offer a promising approach to improving performance without excessive parameter growth by focusing on the most relevant regions in the image. Generative methods, which are well suited to pansharpening, have evolved in parallel. They were first implemented using generative adversarial networks (GANs), in which a generator creates the fused image and a discriminator evaluates its quality [48,49,50]. Recently, diffusion models have emerged as an alternative, improving the stability of the training process, which is often a challenge with GANs [51].
Current research emphasizes methods that rely solely on available data, eliminating the need for reference images. These techniques, often categorized as nonsupervised, can be further divided into various subgroups [52]. While adaptable networks for different image types show promise [53], the most effective techniques today are either fully unsupervised (USL) [54,55] or involve fine-tuning models through transfer learning (TL) [45]. The fields of weakly supervised learning (WSL) and semi-supervised learning (Semi-SL) are still underexplored but have begun to see practical applications. Semi-SL often starts with supervised training at reduced resolution, followed by unsupervised fine-tuning [56,57]. Self-supervised learning (SSL), on the other hand, begins with a pretext task and then fine-tunes the model using pansharpening-specific data [58,59].
The primary challenge in pansharpening remains the difficulty of developing ready-to-use solutions for unseen datasets. Although current research is actively seeking ways to enhance the generalizability of pansharpening methods across diverse images, the performance of ML-based algorithms, including shallow networks, often falls short in this regard [44,45]. The most effective strategy continues to be a brief, dataset-specific training phase, in which pre-trained networks are fine-tuned on the target images with a technique that is simple to implement. This paper contributes to the ongoing research on unsupervised methods by focusing on the optimization of network parameters using reference-free cost functions. It examines key cost functions derived from well-established distortion metrics that rely solely on the images to be fused. Most of them had never been used for this purpose before. Special emphasis is placed on the simplicity of implementation, pursued through appropriate approximations, so that the optimization process remains efficient in terms of both speed and solution stability.
The article is organized as follows. Section 2 outlines the motivations behind this study and highlights its contributions to addressing gaps in the existing literature. Section 3 formalizes the pansharpening problem, describing the architectures employed in this work and the cost functions investigated. Section 4 details the experiments conducted, including descriptions of the datasets, quality metrics, algorithms used, and the configuration of the networks being compared. The experimental results are presented in Section 5 and discussed in Section 6. Finally, Section 7 summarizes the key findings and conclusions drawn from this study.
4. Experimental Tests
Few papers have assessed the generalization capability of ML-based pansharpening algorithms, revealing that it tends to be limited [44,45], even for shallow networks. Consequently, this study focuses on fine-tuning pre-trained networks using the same images to be fused, specifically employing a reference-free cost function. This approach removes the need for a downsampling step, making it suitable for smaller images. It is especially effective for shallow networks with fewer parameters, allowing for the acquisition of statistically meaningful results.
This section describes the experimental phase, detailing the dataset used, the image quality indexes employed for the assessment, the main operating conditions of the tests performed, and finally the performance obtained by the algorithms.
4.1. Testbed
In this experimental study, the whole PairMax dataset, provided by Maxar Technologies for benchmarking purposes [97], was utilized (Figure 2).
The dataset consists of nine scenes acquired from four different satellites: Geoeye-1 (GE-1), Worldview-4 (WV-4), Worldview-2 (WV-2), and Worldview-3 (WV-3). The MS sensors mounted on board the first two satellites (GE-1 and WV-4) capture images with four channels, while WV-2 and WV-3 collect MS images with eight channels. This distinction is critical, as the number of bands has a significant impact on performance, particularly regarding spectral coherence. In the case of eight-band sensors, the overlap between the Relative Spectral Response (RSR) of the MS and PAN sensors is typically reduced, influencing the analysis results.
All sensors provide very high spatial resolution, as shown in Table 1, with the Ground Sample Distance (GSD) of the PAN images being less than 0.5 m, and the GSD of the MS images around 2 m. However, these sensors belong to two different generations: the earlier generation mounted on the GE-1 and WV-2 satellites, and the more recent generation on WV-4 and WV-3. Although WV-4 is no longer operational, its data were included because its sensor technology is superior to that of GE-1’s four-channel sensor. All sensors provide images with an 11-bit radiometric resolution, which is preserved in the dataset. The original MS and PAN images have size of
and
, corresponding to a quite common resize factor
.
In this study, the reduced-resolution (RR) protocol was used to assess algorithm performance. The RR protocol, detailed in [62], retains the original MS image as the Ground Truth (GT) while the algorithms are applied to the degraded versions of the original MS and PAN images [9]. In line with the Wald protocol [10], the MS image is degraded using a filter whose frequency response matches the sensor’s MTF, and the PAN image is processed using an ideal filter [13]. Both images are then downsampled by a factor equal to the resize factor employed in this study.
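The MS degradation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the MTF is approximated by a separable Gaussian whose gain at the Nyquist frequency of the reduced scale equals `nyquist_gain`, the value 0.3 is only a placeholder (actual gains are sensor and band specific), the ratio of 4 is assumed from the approximate PAN and MS GSDs, and the function names are hypothetical. The ideal filtering of the PAN image is not sketched.

```python
import math


def gaussian_kernel(ratio, nyquist_gain):
    """1-D Gaussian taps whose frequency response is approximately
    `nyquist_gain` at 1/(2*ratio) cycles per pixel."""
    # For a Gaussian with standard deviation sigma, H(f) = exp(-2*pi^2*sigma^2*f^2).
    sigma = (ratio / math.pi) * math.sqrt(-2.0 * math.log(nyquist_gain))
    half = int(math.ceil(3 * sigma))
    taps = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-half, half + 1)]
    total = sum(taps)
    return [t / total for t in taps]  # unit DC gain


def filter_rows(image, kernel):
    """Filter every row with the kernel, replicating edge pixels."""
    half = len(kernel) // 2
    out = []
    for row in image:
        padded = [row[0]] * half + list(row) + [row[-1]] * half
        out.append([sum(k * padded[i + j] for j, k in enumerate(kernel))
                    for i in range(len(row))])
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def degrade_band(band, ratio=4, nyquist_gain=0.3):
    """Low-pass filter one band (rows, then columns) and decimate it by `ratio`."""
    kernel = gaussian_kernel(ratio, nyquist_gain)
    blurred = transpose(filter_rows(transpose(filter_rows(band, kernel)), kernel))
    return [row[::ratio] for row in blurred[::ratio]]


# An 8 x 8 band becomes 2 x 2 at reduced resolution.
low_res = degrade_band([[5.0] * 8 for _ in range(8)])
```

The fused product obtained from the degraded pair then has the size of the original MS image, which can serve as the GT.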
4.2. Image Quality Indexes
To assess the quality of the pansharpened images, three widely used reference-based indices were employed: the Spectral Angle Mapper [98], the Erreur Relative Globale Adimensionnelle de Synthèse [72], and the $Q2^n$ index [93,94].
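The Spectral Angle Mapper (SAM) [98] evaluates the spectral coherence between two images. It measures the angle between two spectral vectors, $\mathbf{x}$ and $\mathbf{y}$, corresponding to the pixel values across the spectral bands. The SAM between two vectors is defined as:
$$\mathrm{SAM}(\mathbf{x}, \mathbf{y}) = \arccos\left(\frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\| \, \|\mathbf{y}\|}\right)$$
where $\langle \cdot, \cdot \rangle$ represents the scalar product of the vectors. The overall SAM for two images $\mathbf{X}$ and $\mathbf{Y}$ is the average of the SAM values computed for each pixel.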
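As an illustration of the SAM computation described above, the following minimal sketch averages the per-pixel angles. The function name, the data layout (a list of pixels, each a list of band values), the use of degrees, and the choice to skip null vectors are assumptions made for this example.

```python
import math


def sam_degrees(image_x, image_y):
    """Mean spectral angle, in degrees, between two images given as lists of pixels."""
    angles = []
    for x, y in zip(image_x, image_y):
        dot = sum(a * b for a, b in zip(x, y))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
        if norm == 0.0:
            continue  # skip pixels with a null vector (a convention assumed here)
        cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding errors
        angles.append(math.degrees(math.acos(cosine)))
    return sum(angles) / len(angles) if angles else 0.0


# Orthogonal spectra are 90 degrees apart; proportional spectra have a zero angle.
angle_orthogonal = sam_degrees([[1.0, 0.0]], [[0.0, 1.0]])
angle_scaled = sam_degrees([[1.0, 2.0, 3.0]], [[2.0, 4.0, 6.0]])
```

Because the angle is insensitive to a common scaling of the spectral vector, SAM captures spectral shape differences rather than brightness differences.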
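The ERGAS (Erreur Relative Globale Adimensionnelle de Synthèse) [72] is a radiometric index that normalizes the Mean Square Errors (MSEs) of each spectral channel relative to the average value of the channels. It is calculated as:
$$\mathrm{ERGAS} = \frac{100}{R} \sqrt{\frac{1}{B} \sum_{b=1}^{B} \frac{\mathrm{MSE}_b}{\mu_b^2}}$$
where $R$ is the resize factor, $B$ is the number of bands, $\mathrm{MSE}_b$ is the MSE between the corresponding bands $\mathbf{X}_b$ and $\mathbf{Y}_b$ of $\mathbf{X}$ and $\mathbf{Y}$, and $\mu_b$ is the mean of the $b$-th band of the reference image.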
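The ERGAS computation can be sketched as below, following the standard definition of the index (the per-band MSE divided by the squared band mean, averaged over the bands, with a 100/ratio scaling). The function name, the data layout (a list of bands, each a flat list of pixel values), the use of the reference band for the mean, and the default ratio of 4 are assumptions made for this example.

```python
import math


def ergas(reference, fused, ratio=4):
    """ERGAS between a reference image and a fused image, given as lists of bands."""
    acc = 0.0
    for ref_band, fus_band in zip(reference, fused):
        n = len(ref_band)
        mse = sum((r - f) ** 2 for r, f in zip(ref_band, fus_band)) / n
        mean = sum(ref_band) / n  # mean of the reference band
        acc += mse / mean ** 2
    return (100.0 / ratio) * math.sqrt(acc / len(reference))


# One band with mean 10 and MSE 1: (100 / 4) * sqrt(1 / 100) = 2.5
value = ergas([[10.0, 10.0]], [[11.0, 9.0]])
```

Lower values indicate better radiometric accuracy, and a fused image identical to the reference gives zero.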
Lastly, we employed the $Q2^n$ index, defined in Section 3.2.2, as a global measure that evaluates both the spatial and spectral quality of the images. This index is particularly useful for assessing the overall performance of pansharpening methods.
4.3. Compared Algorithms
The algorithms selected for this evaluation include those under investigation, namely A-PNN, PanNet, and FusNet, each tested with various cost functions. Additionally, algorithms from key classes previously analyzed in earlier reviews [9,62] were incorporated for comparison. In particular, results for the same algorithms presented in the paper that introduced the PairMax dataset [62] have been reported here. It is also straightforward to obtain additional comparative values using the toolbox provided in [99], which accompanies the paper [9].
For CS approaches, the evaluation includes methods such as BDSD with physical constraints (BDSD-PC) [30], the Gram–Schmidt (GS) [20], and the Gram–Schmidt Adaptive (GSA) [27]. To represent the MRA class, three algorithms using the Generalized Laplacian Pyramid (GLP) with MTF-matched filters [13] were selected. These include the full-scale coefficient computation method (MTF-GLP-FS) [26], the HPM injection method with haze correction (MTF-GLP-HPM-H) [29], and a clustering-based implementation with projective injection (C-MTF-GLP-CBD) [31]. Additionally, the comparison considered a VO algorithm based on the sparse representation of details (SR-D) [38]. Three machine-learning-based algorithms were also evaluated. The A-PNN algorithm, fine-tuned for 50 epochs, represents the version available in the MATLAB toolbox (see Reference [99]) and is denoted as A-PNN-FT [44]. The bidirectional pansharpening network (BDPN) algorithm [100], one of the deep-learning-based methods included in the Python toolbox for recent deep learning algorithms [101], was fine-tuned at reduced resolution to prevent the use of images at the same resolution as those in the test set. The third method is the Z-PNN algorithm [68], which shares a methodological similarity with the approach explored in this study, as it employs a full-scale cost function without reference data. For completeness, values were calculated for MS image interpolation using a 23-tap polynomial kernel filter (EXP), providing a key baseline for comparison. The values associated with the reference method represent the ideal outcomes for the respective indices. These ideal values are achieved by using the reference image as the fused image in the case of reference-based indices. However, for indices without reference, the ideal values are rarely obtained.
The notation for the CNN algorithms consists of the network name followed by the fine-tuning technique used. For example, FusNet-HQNR indicates the FusNet network whose weights are optimized with the HQNR cost function. The performance obtained without fine-tuning (noFT) and with the reference image used during the training phase (GT) was also evaluated. The first case corresponds to the values obtainable with the networks produced by transfer learning, while the second provides a reference for the maximum performance obtainable with the method in question. The GT results were obtained using 20,000 fine-tuning epochs, a value chosen to ensure that the indices reach an almost steady state. In this case, performance never decreases as the iterations increase, since there is no risk of excessive specialization.
The other loss functions are obtained using the nonreference quality metrics described in Section 3.2.2 and are referred to by the same name in the acronyms. However, it is important to distinguish between metrics for evaluating image quality and those for optimizing network weights. The effectiveness of a network cost function is influenced by several factors that go beyond the accuracy of the quality metric. For example, the complexity of a metric affects both the calculation time (which can be a limiting factor, especially when fine-tuning is required for each image pair) and the difficulty of finding an optimal solution in parameter space.
Considering these factors, we opted to replace the $Q2^n$ index [93,94], which is used in the FQNR, HQNR, and RQNR quality metrics, with the average Q-index, defined in (8). The two indices show a very similar trend, as demonstrated by the scatterplot in Figure 3, which compares the Q and $Q2^n$ values calculated from the images produced by the classical algorithms used in this study, alongside the regression line between the two indices. Preliminary experiments confirmed that this substitution leads to comparable performance during network training, while also providing substantial computational savings. This modification is crucial for practical implementation and enhances the robustness of parameter estimation. The improved regularity of the error surface also enables more stable solutions, enhancing the network’s ability to handle complex illuminated scenes.
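To make the substitution concrete, the average Q-index can be sketched with the standard universal image quality index applied band by band. This is a simplified illustration: the statistics are computed globally over each band, whereas the definition in (8) may rely on local windows, and the function names and the handling of the degenerate case are assumptions.

```python
def q_index(x, y):
    """Universal image quality index between two bands (flat lists of pixel values)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denom == 0.0:
        return 1.0  # degenerate (flat or zero-mean) case: a convention assumed here
    return 4.0 * cov_xy * mean_x * mean_y / denom


def average_q(image_a, image_b):
    """Mean of the band-wise Q-index; each image is a list of bands."""
    return sum(q_index(a, b) for a, b in zip(image_a, image_b)) / len(image_a)


q_same = q_index([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])      # identical bands
q_reversed = q_index([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # anti-correlated bands
```

Each band contributes a scalar computed from first- and second-order statistics only, which is what makes this index cheaper to evaluate than a hypercomplex index that treats all bands jointly.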
Regarding the QNR metric, the alternative combination rule proposed by [64] was evaluated in preliminary tests. However, it yielded worse results than the multiplicative rule described in (11).
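For reference, the multiplicative combination used by the standard QNR index can be sketched as follows. The exponent names `alpha` and `beta`, their default values, and the conversion of the index into a loss as one minus QNR are assumptions made for this example, not values taken from this study.

```python
def qnr(d_lambda, d_s, alpha=1.0, beta=1.0):
    """Multiplicative combination of spectral (d_lambda) and spatial (d_s) distortions."""
    return (1.0 - d_lambda) ** alpha * (1.0 - d_s) ** beta


def qnr_loss(d_lambda, d_s, alpha=1.0, beta=1.0):
    """A loss to minimize, assuming the index is turned into a cost as 1 - QNR."""
    return 1.0 - qnr(d_lambda, d_s, alpha, beta)


score = qnr(0.1, 0.2)  # 0.9 * 0.8
```

With a product, a large distortion of either kind pulls the index toward zero, so neither term can fully compensate for the other.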
All learning algorithms were fine-tuned for 2000 epochs, starting from nets trained from scratch on completely different datasets (those used in [9]).
The combination coefficients
and
resulted in overall better performance, and thus, they were adopted for all algorithms. The overall pansharpening process is outlined in Algorithm 1.
Algorithm 1: Fine-tuning using reference-free metrics.
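The general shape of such a fine-tuning procedure can be sketched as a loop that starts from pre-trained parameters and repeatedly lowers a reference-free loss evaluated on the images to be fused. This is only a toy illustration and not a transcription of Algorithm 1: the parameters are a plain list of numbers, the gradient is estimated by finite differences instead of backpropagation, the best-so-far bookkeeping is an added convenience, and `loss_fn` stands in for any of the reference-free cost functions discussed above.

```python
def numerical_gradient(loss_fn, params, eps=1e-6):
    """Central finite-difference estimate of the gradient of loss_fn at params."""
    grad = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2.0 * eps))
    return grad


def fine_tune(pretrained, loss_fn, epochs=2000, lr=0.05):
    """Gradient descent from pre-trained parameters, keeping the best ones seen."""
    params = list(pretrained)
    best_params, best_loss = list(params), loss_fn(params)
    for _ in range(epochs):
        grad = numerical_gradient(loss_fn, params)
        params = [p - lr * g for p, g in zip(params, grad)]
        loss = loss_fn(params)
        if loss < best_loss:
            best_params, best_loss = list(params), loss
    return best_params, best_loss


# Toy stand-in for a reference-free loss, minimized at (1, -2).
def toy_loss(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


tuned, final_loss = fine_tune([0.0, 0.0], toy_loss, epochs=200, lr=0.1)
```

In a real setting the parameters would be the network weights, the loss would be computed from the fused image, the original MS image, and the PAN image, and a deep learning framework would supply the gradients.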
5. Experimental Results
The quality indices of the fused images obtained using the data collected by the various satellites are reported in Table 2, Table 3, Table 4 and Table 5.
The first two tables refer to satellites acquiring four-channel multispectral imagery. Among traditional methods, BDSD-PC and MTF-GLP-FS consistently show strong performance across both datasets, providing a good balance between spectral and spatial quality, as indicated by the $Q2^n$, SAM, and ERGAS metrics. GSA and C-MTF-GLP-CBD also perform well, closely trailing the two leaders in terms of overall quality. Very low ERGAS values, indicating better radiometric accuracy, are reported for MTF-GLP-HPM-H among the classical methods, and particularly for Z-PNN among the ML-based methods.
More specific observations related to the problem examined in this paper highlight the need to fine-tune the networks. Only the A-PNN without tuning has decent performance, indicating a greater generalization capability. Its results are reasonably competitive but clearly fall behind those obtained after tuning with cost functions such as HQNR and RQNR. PanNet and FusNet show significant degradation in all metrics across datasets without fine-tuning. Their SAM and ERGAS values are considerably higher, demonstrating that they struggle to maintain spatial and spectral fidelity. Across A-PNN, PanNet, and FusNet, RQNR consistently provides the best results in terms of $Q2^n$, SAM, and ERGAS across multiple datasets. It shows the best balance between spatial and spectral quality and can outperform all other methods in the Ge_Tren_Urb dataset.
The datasets with eight-channel MS images give very similar results, with some small differences. As for the classical methods, which obtain better results than CNNs, the highest performance is also obtained in this case by BDSD-PC, whereas for MRA techniques, the highest ranking is obtained by MTF-GLP-HPM-H. Once again, the Z-PNN method achieves the best performance among ML-based algorithms; however, the results indicate that the algorithm is more inclined to optimize radiometric distortion. For both WV-2 and WV-3 datasets, the performance of the pansharpening algorithms shows a clear trend where the application of fine-tuning enhances the quality of fused images. The improvement is seen across all metrics, with significant gains in spatial and spectral quality for fine-tuned versions of the models. This indicates that tuning to the specific characteristics of each dataset is critical for improving pansharpening performance. A-PNN shows consistent improvements with tuning, especially in spectral accuracy, making it competitive across both WV2 and WV3 datasets. PanNet performs well, especially in urban and mixed regions, with fine-tuning leading to substantial gains in both SAM and ERGAS. FusNet, while benefiting from fine-tuning, remains less effective compared to A-PNN and PanNet, especially in the natural region, where its SAM and ERGAS values indicate higher errors.
The figures enable the visual inspection of products generated by various algorithms in two scenarios: an urban scene with eight-channel MS data (W3_Muni_Mix), and a natural scene with four-channel MS data (W4_Mexi_Nat). For the urban scene, close-ups of the final product, detailed views, and MS maps are presented in Figure 4, Figure 5 and Figure 6. For the natural scene, close-ups of the final product and detailed views are shown in Figure 7 and Figure 8. Results from all combinations of the analyzed networks and tested cost functions are reported, while only representative examples are provided for classical reference algorithms, particularly those significant for the quality of the obtained results. Specifically, two examples from the CS methods (GSA and BDSD-PC) and two from the MRA methods (MTF-GLP-FS and MTF-GLP-HPM-H) are presented alongside the SR-D technique, which is notably effective among the VO approaches.
It is evident that CS techniques often yield spatially appealing results, although they can sometimes lead to overinjection of details. MRA techniques, on the other hand, tend to provide more spectrally consistent outcomes.
In both scenes, the differences between the various cost functions are evident in the merged products displayed in Figure 4 and Figure 7. Additionally, Figure 5 and Figure 8 show the injected details, i.e., the difference between the fused image and the original image upsampled to the scale of the fused image.
First, it is important to note the high quality of the merged images produced by the networks trained with the GT, indicating their ability to generate results close to the ideal. In contrast, the images produced by the networks before the tuning phase are of significantly lower quality.
The cost function based on QNR does not improve the results in this case; in fact, it tends to introduce artifacts, as shown in Figure 7, and produces spectrally incorrect images. This behavior occurred consistently across the tests, as reflected in the results shown in Table 2, Table 3, Table 4 and Table 5. In contrast, the other algorithms achieved significantly better outcomes, particularly after the tuning phase, enhancing performance compared to the unrefined cases. A more detailed examination reveals that the RQNR cost function makes a notable contribution in reproducing vegetation on the right side of Figure 4. This improvement is quantitatively confirmed by the reduction in MSE in that area, as shown in Figure 6. Furthermore, Figure 6 demonstrates the improvement that RQNR achieves on building contours, which was previously only visible in the A-PNN-FQNR results in Figure 4, such as the edges of the central building.
Although the fused images in Figure 4 and Figure 7 show several visible differences, the visual analysis is further aided by the images in Figure 5 and Figure 8. These figures highlight the net contribution of pansharpening to the final product, particularly in terms of the quantity of injected details and their spectral properties. The RQNR cost function tends to promote higher levels of detail injection (which may not always be beneficial). Moreover, optimal details are observed to be spectrally accurate, with the PanNet network yielding the highest quality results, particularly in the W3_Muni_Mix scene.
Figure 9 provides an example of pansharpening algorithm application to images at the original scale. Numerical results are not presented here because selecting a specific cost function for full-scale evaluation would inherently favor one of the tested functions, introducing bias. Therefore, the assessment relies on visual analysis, which remains valuable for evaluating the robustness of the techniques across scales. However, relying on visual analysis of full-scale images, similar to [68], presents challenges due to difficulties in analyzing images with a high number of channels and with radiometric resolution exceeding 8 bits.
A comparison between the images in Figure 9 and those in Figure 4 (depicting the same area) highlights the appearance of much smaller objects in the full-scale images. The improvements achieved by fine-tuning the networks over 1000 full-scale epochs are evident in the enhanced image quality, both spatially and spectrally, as observed through visual inspection. Among the tested cost functions, the one derived from the RQNR index consistently delivers the best performance across all three networks. In particular, the images produced by the PanNet network exhibit outstanding quality in both spatial and spectral dimensions.
The comparison between Figure 4 and Figure 9 is particularly compelling, as the ranking of the tested algorithms remains virtually unchanged. This consistency supports the validity of the numerical results obtained at a reduced scale, even when the algorithms are applied at the actual scale.
An important evaluation metric in this study is the computational burden associated with different cost functions. Table 6 presents the training times for a single epoch, as measured using an Nvidia Titan XP GPU. The cost functions used for training with ground-truth (GT) data involve the least computational overhead. In contrast, reference-free cost functions require additional calculations, leading to increased computation times. Among these, the QNR function is the most computationally demanding, especially in the eight-channel case, due to the high complexity of calculating spectral distortion, which involves comparisons between the Q-indexes of band pairs, whose number grows approximately as $B^2$ with the number of bands $B$. The FQNR and HQNR cost functions exhibit similar computation times, while the RQNR index offers a small reduction in computational burden. This efficiency gain is primarily due to the simpler evaluation of spatial distortion, with the savings becoming more pronounced as the number of bands $B$ increases.
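The quadratic growth of the QNR spectral term can be seen in a sketch of the standard QNR spectral distortion, which compares the Q-index of every ordered pair of distinct bands in the MS image with that of the same pair in the fused image. The function names, the global (rather than windowed) Q-index, and the exponent `p = 1` are assumptions made for this example.

```python
from itertools import permutations


def q_index(x, y):
    """Universal image quality index between two bands (flat lists), global statistics."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    return 4.0 * cov_xy * mean_x * mean_y / denom if denom else 1.0


def spectral_distortion(ms, fused, p=1):
    """QNR-style spectral distortion and the number of band pairs it compares."""
    pairs = list(permutations(range(len(ms)), 2))  # B * (B - 1) ordered pairs
    total = sum(abs(q_index(fused[b], fused[c]) - q_index(ms[b], ms[c])) ** p
                for b, c in pairs)
    return (total / len(pairs)) ** (1.0 / p), len(pairs)


# Eight synthetic bands of nine pixels each: 8 * 7 = 56 band pairs are compared.
bands = [[(b + 1) * k + (k * k) % (b + 2) for k in range(1, 10)] for b in range(8)]
distortion, n_pairs = spectral_distortion(bands, bands)
```

Since the Q-index is symmetric, an implementation can halve the work by visiting each unordered pair once, but the count still grows quadratically: 12 ordered pairs for four bands against 56 for eight.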
6. Discussion
The most significant evidence that emerged from this work is the great potential of ML-based algorithms, though effectively exploiting them remains challenging. The results obtained using GT far surpass those of traditional algorithms, highlighting the impressive ability of neural networks to address such problems and the clear advantage of nonlinear methods. However, tuning the parameters to reach these performance levels is impossible in practice because reference images are not available.
Most contributions in the literature report results comparable to those obtained using the GT-based cost function in this study. This similarity arises because, although the training and test images are not identical, subimages from the same dataset are often used for training. As a result, the issue of different resolutions in the reference images is left unresolved. Another persistent challenge, which also remains unaddressed in this work, is the absence of a validation set that mirrors the critical aspects of the training set. This absence makes it difficult to select optimal network hyperparameters and assess potential overfitting, a particularly crucial issue when generalizing to different scales. A more specific analysis is needed to explore suitable tools for mitigating this problem. Traditional techniques such as cross-validation and regularization must be adapted to account for varying scales of observation.
This paper contributes to the evaluation of neural-network-based pansharpening algorithms by assessing their practical performance when using no-reference quality measures as cost functions. The best results were obtained using the RQNR cost function. This function combines a well-established spectral distortion index to assess the consistency between the fused image and the original MS image, along with a spatial distortion index based on a linear regression model between the fused image channels and the PAN image. However, this choice does not resolve the debate over how best to evaluate spatial information quality, as the regression-based approach implements a different rationale from the Wald protocol [60]. Moreover, the regression model could potentially be improved by using a nonlinear approach, possibly through the implementation of neural networks [102,103].
Using cost functions based on high-performing no-reference quality indices improves neural network performance. However, this study shows that the initial performance of these networks, without specific fine-tuning, is often poor, even for models with few parameters. This makes the fine-tuning process more challenging. Approaches focused on generalization, such as the general image fusion framework [53] or foundation models [104], may offer a better starting point.
7. Conclusions
In this study, we analyzed the performances achievable with various simple CNN architectures proposed for pansharpening, focusing on the impact of different cost functions derived from key no-reference image quality indices.
Using a reduced resolution assessment protocol, we highlighted the gap between the theoretically excellent performance of CNN architectures and the more limited results achievable with cost functions that can be practically implemented. Despite the potential of these architectures, this discrepancy underscores the challenges associated with selecting effective cost functions that align with real-world constraints.
Our findings underscore the superiority of a reference-free cost function that incorporates a highly accredited spectral quality index, combined with an innovative spatial quality index. This combination proved effective in linking the channels of the fused image to the available panchromatic image through a simple yet powerful relationship. This approach not only enhances the fusion process but also offers a promising pathway for achieving better pansharpening results without relying on ground-truth references.
This work provides a starting point for evaluating the performance of ML-based pansharpening techniques while highlighting key questions for future research. The most critical challenges involve optimizing the trade-off between algorithm generalization and image quality for specific datasets. On the one hand, this requires designing architectures that deliver robust performance across a broad range of problems and, in the context of pansharpening, ensure sufficient scale invariance. On the other hand, developing reliable techniques to monitor product quality during fine-tuning is essential. This task can be supported by using simplified cost functions, like those introduced in this study, which produce smoother error surfaces and facilitate optimization.