4.1. Dataset Introduction and Generation
We tested the model on datasets constructed from the Ikonos and GeoEye-1 satellite sensors. GeoEye-1 operates in a sun-synchronous orbit with an orbital height of 681 km, an inclination of 98 degrees, and an orbital period of 98 min, and captures 0.41 m-resolution PAN images and 1.64 m-resolution LRMS images. Ikonos is a commercial satellite that captures 1 m-resolution PAN images and 4 m-resolution LRMS images. The characteristics of the two remote sensing satellites are shown in Table 1.
For GeoEye-1, the source PAN image had a size of 13,532 × 31,624 and the source LRMS image had a size of 3383 × 7906. For Ikonos, the size of the source PAN image was also 13,532 × 31,624, and the size of the source LRMS image was 3383 × 7906. For each type of remote sensing imagery, we first removed the black borders at the edges and then divided the images into non-overlapping 200 × 200 LRMS image patches and 800 × 800 PAN image patches. These image patches were then split into training samples and testing samples. When constructing the training set, we followed the Wald protocol to generate simulated data: we applied modulation transfer function (MTF) filtering to each pair of original LRMS and PAN images to perform spatial degradation, and the degraded data were used as the input images for the network. In this manner, the original LRMS images served as reference images, enabling supervision of the network during training and the calculation of loss values. The dataset division is shown in Table 2. Finally, we cropped the input and reference images into small 32 × 32 image patches with a specified stride in order to obtain a training set containing a larger volume of data.
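As a rough illustration of this preparation pipeline, the following sketch degrades an LRMS/PAN tile and crops reference patches. It is a minimal example rather than the exact implementation: the MTF-matched filter is approximated by a Gaussian low-pass filter, the resolution ratio is assumed to be 4, and the stride value, array shapes, and function names are illustrative only.

```python
# Minimal sketch of the Wald-protocol training-data generation described above.
# A Gaussian low-pass filter stands in for the sensor-specific MTF-matched filter.
import numpy as np
from scipy.ndimage import gaussian_filter

RATIO = 4                # PAN/LRMS resolution ratio (assumed)
PATCH, STRIDE = 32, 16   # reference patch size and crop stride (stride is an assumption)

def mtf_degrade(img, ratio=RATIO, sigma=1.0):
    """Low-pass filter (MTF approximation) and decimate by `ratio`."""
    blurred = gaussian_filter(img, sigma=(sigma, sigma, 0) if img.ndim == 3 else sigma)
    return blurred[::ratio, ::ratio]

def crop_patches(img, size, stride):
    """Split an image into size x size patches with the given stride."""
    h, w = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, h - size + 1, stride)
            for j in range(0, w - size + 1, stride)]

# lrms: one 200 x 200 x N original LRMS tile, pan: the matching 800 x 800 PAN tile
lrms = np.random.rand(200, 200, 4).astype(np.float32)
pan = np.random.rand(800, 800).astype(np.float32)

lrms_in = mtf_degrade(lrms)   # (50, 50, 4)  degraded LRMS -> network input
pan_in = mtf_degrade(pan)     # (200, 200)   degraded PAN  -> network input
reference = lrms              # original LRMS serves as the reference

ref_patches = crop_patches(reference, PATCH, STRIDE)   # 32 x 32 reference patches
```

The input LRMS and PAN patches are cropped with the same grid so that each reference patch has matching network inputs.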
4.3. Comparison Algorithms and Evaluation Metrics
To validate the effectiveness of the proposed strategy, this study followed the Wald protocol and selected six evaluation metrics for the reduced-resolution experiments: Spectral Angle Mapper (SAM) [35], Relative Dimensionless Global Error in Synthesis (ERGAS) [36], Relative Average Spectral Error (RASE) [37], Spatial Correlation Coefficient (SCC) [38], the Quality (Q) index [39], and the Structural Similarity (SSIM) index [40]. For the evaluation at full resolution, SAM, the Quality with No Reference (QNR) index [41], the spectral distortion index $D_\lambda$, and the spatial distortion index $D_S$ were used to ensure a comprehensive and precise evaluation.
SAM measures the spectral similarity between the pansharpened image and the corresponding reference image. The smaller the value of SAM, the higher the spectral similarity between the two images. It is defined by the following equation:
$$\mathrm{SAM}(v, \hat{v}) = \arccos\left(\frac{\langle v, \hat{v}\rangle}{\|v\|_2\,\|\hat{v}\|_2}\right),$$
where $v$ and $\hat{v}$ denote the spectral vectors of a pixel in the reference image and in the fused image, respectively.
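A minimal NumPy sketch of this computation is given below, assuming the fused and reference images are float arrays of shape (H, W, N); the function name and the averaging over all pixels are illustrative choices.

```python
import numpy as np

def sam_degrees(fused, ref, eps=1e-12):
    """Mean spectral angle (in degrees) between two (H, W, N) images."""
    dot = np.sum(fused * ref, axis=-1)
    norms = np.linalg.norm(fused, axis=-1) * np.linalg.norm(ref, axis=-1)
    angles = np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0))
    return float(np.degrees(angles).mean())
```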
ERGAS can measure the overall spatial and spectral quality of fused images, and is defined as
$$\mathrm{ERGAS} = 100\,\frac{h}{l}\sqrt{\frac{1}{N}\sum_{i=1}^{N}\frac{\mathrm{MSE}\!\left(\hat{F}_i, R_i\right)}{\mu_{R_i}^{2}}},$$
where $\mathrm{MSE}(\hat{F}_i, R_i)$ represents the mean square error between the $i$-th bands of the fused image $\hat{F}$ and the reference image $R$, $\mu_{R_i}$ is the mean of the $i$-th reference band, and $h$ and $l$ are the spatial resolutions of the PAN and LRMS images, respectively. The ideal value of ERGAS is 0.
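Under the same (H, W, N) array assumptions, ERGAS can be sketched as follows; the `ratio` argument is the LRMS-to-PAN scale factor (4 for both sensors used here), so $100\,h/l$ becomes `100 / ratio`.

```python
import numpy as np

def ergas(fused, ref, ratio=4, eps=1e-12):
    """ERGAS for (H, W, N) images; `ratio` is the LRMS/PAN scale factor."""
    mse = np.mean((fused - ref) ** 2, axis=(0, 1))   # per-band MSE
    mu = np.mean(ref, axis=(0, 1))                   # per-band reference mean
    return float(100.0 / ratio * np.sqrt(np.mean(mse / (mu ** 2 + eps))))
```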
RASE reflects the global spectral quality of the fused image, and is defined as
$$\mathrm{RASE} = \frac{100}{M}\sqrt{\frac{1}{N}\sum_{i=1}^{N}\mathrm{RMSE}^{2}\!\left(\hat{F}_i, R_i\right)},$$
where $M$ represents the average brightness of the $N$ spectral bands of the reference image $R$. The ideal value of RASE is 0.
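A corresponding sketch of RASE, under the same assumptions:

```python
import numpy as np

def rase(fused, ref):
    """RASE for (H, W, N) images; M is the mean brightness of the reference bands."""
    rmse_sq = np.mean((fused - ref) ** 2, axis=(0, 1))   # squared per-band RMSE
    m = ref.mean()                                       # average brightness over the N bands
    return float(100.0 / m * np.sqrt(rmse_sq.mean()))
```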
SCC measures the spatial detail similarity between the fused image and the reference image. Before calculating SCC, high-frequency information is extracted from the two images through high-pass filtering, and then the correlation coefficient between the high-frequency components of the two images is calculated. The closer the SCC value is to 1, the higher the spatial quality of the fused image.
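The sketch below follows this procedure, using a Laplacian operator as a stand-in for the high-pass filter (the original implementation may use a different filter).

```python
import numpy as np
from scipy.ndimage import laplace

def scc(fused, ref):
    """Mean correlation coefficient between high-pass-filtered bands."""
    coeffs = []
    for b in range(ref.shape[-1]):
        hf_fused = laplace(fused[..., b])   # high-frequency component of the fused band
        hf_ref = laplace(ref[..., b])       # high-frequency component of the reference band
        coeffs.append(np.corrcoef(hf_fused.ravel(), hf_ref.ravel())[0, 1])
    return float(np.mean(coeffs))
```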
Q can estimate the global quality of the fused image. It measures the correlation, average brightness similarity, and contrast similarity between the fused image and the reference image. The closer the value of Q is to 1, the higher the quality of the fused image. The Q index is defined as
$$Q = \frac{4\,\sigma_{xy}\,\bar{x}\,\bar{y}}{\left(\sigma_x^{2}+\sigma_y^{2}\right)\left(\bar{x}^{2}+\bar{y}^{2}\right)},$$
where $\sigma_{xy}$ represents the covariance between the two images, and $\sigma_x^{2}$, $\sigma_y^{2}$ and $\bar{x}$, $\bar{y}$ are the variances and means of the two images, respectively.
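For a single pair of bands, the formula above can be sketched as follows; in practice the index is usually computed over sliding windows and averaged, which is omitted here for brevity.

```python
import numpy as np

def q_index(x, y, eps=1e-12):
    """Universal image quality (Q) index between two single-band images."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(4 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2) + eps))
```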
The SSIM metric globally evaluates the similarity between two images from three aspects: brightness, contrast, and structure. The expression is as follows:
$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x\mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^{2}+\mu_y^{2}+C_1\right)\left(\sigma_x^{2}+\sigma_y^{2}+C_2\right)},$$
where $\mu_x$, $\mu_y$, $\sigma_x^{2}$, $\sigma_y^{2}$, and $\sigma_{xy}$ are the means, variances, and covariance of the two images, and $C_1$ and $C_2$ are small constants that stabilize the division.
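A global (single-window) sketch of this expression is shown below; the constants `c1` and `c2` follow the common convention for data scaled to [0, 1] and are an assumption.

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM between two single-band images scaled to [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```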
$D_\lambda$ is used to determine the spectral similarity between the fused image and the observed LRMS image. Its value is a positive number; the closer it is to 0, the more similar the spectral information of the fused image is to that of the original LRMS image. The $D_\lambda$ index is defined as
$$D_\lambda = \sqrt[p]{\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1 \\ j\neq i}}^{N}\left|Q\!\left(\hat{F}_i, \hat{F}_j\right) - Q\!\left(L_i, L_j\right)\right|^{p}},$$
where $Q(\cdot,\cdot)$ calculates the Q value between two images, $\hat{F}_i$ and $L_i$ denote the $i$-th bands of the fused image and the LRMS image, and the parameter $p$ is a positive exponent that emphasizes the degree of spectral distortion, which is usually taken as $p = 1$.
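Reusing the `q_index()` sketch from the Q metric above, $D_\lambda$ can be illustrated as follows; the band-pair loop follows the definition directly and is not optimized.

```python
def d_lambda(fused, lrms, p=1):
    """Spectral distortion index from inter-band Q values (uses q_index above)."""
    n = lrms.shape[-1]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += abs(q_index(fused[..., i], fused[..., j])
                             - q_index(lrms[..., i], lrms[..., j])) ** p
    return (total / (n * (n - 1))) ** (1.0 / p)
```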
$D_S$ is used to measure the spatial similarity between the fused image and the original PAN image. Its value is a positive number; the closer it is to 0, the smaller the spatial distortion of the fused image. $D_S$ is defined as
$$D_S = \sqrt[q]{\frac{1}{N}\sum_{i=1}^{N}\left|Q\!\left(\hat{F}_i, P\right) - Q\!\left(L_i, \tilde{P}\right)\right|^{q}},$$
where $P$ is the original PAN image, $\tilde{P}$ represents a reduced-resolution PAN image with the same size as the original LRMS image, and $q$ is a positive exponent that emphasizes the degree of spatial distortion in the fused image, which is usually taken as $q = 1$.
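The corresponding sketch of $D_S$ also reuses `q_index()`; the reduced-resolution PAN is obtained here by simple block averaging, which is only an approximation of the MTF-based degradation a full implementation would use.

```python
import numpy as np

def degrade_pan(pan, ratio=4):
    """Reduce the PAN image to the LRMS grid by block averaging (approximation)."""
    h = pan.shape[0] // ratio * ratio
    w = pan.shape[1] // ratio * ratio
    return pan[:h, :w].reshape(h // ratio, ratio, w // ratio, ratio).mean(axis=(1, 3))

def d_s(fused, lrms, pan, ratio=4, q=1):
    """Spatial distortion index (uses q_index above)."""
    pan_low = degrade_pan(pan, ratio)
    n = lrms.shape[-1]
    total = sum(abs(q_index(fused[..., i], pan) - q_index(lrms[..., i], pan_low)) ** q
                for i in range(n))
    return (total / n) ** (1.0 / q)
```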
QNR can jointly measure the degree of spectral distortion and spatial distortion of the fused image using the two indexes $D_\lambda$ and $D_S$. The QNR is defined as
$$\mathrm{QNR} = \left(1 - D_\lambda\right)^{\alpha}\left(1 - D_S\right)^{\beta},$$
where $\alpha$ and $\beta$ are positive exponents that control the relative weight of the spectral distortion and the spatial distortion, and are usually taken as $\alpha = \beta = 1$. The ideal value of QNR is 1.
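Finally, QNR simply combines the two indexes sketched above:

```python
def qnr(d_lambda_value, d_s_value, alpha=1, beta=1):
    """QNR from the spectral and spatial distortion indexes (alpha = beta = 1)."""
    return (1 - d_lambda_value) ** alpha * (1 - d_s_value) ** beta
```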
This study selected nine algorithms for experimentation on the GeoEye-1 and Ikonos datasets, including four traditional methods and five deep learning methods. The traditional comparison algorithms are the generalized Laplacian pyramid with an MTF-matched filter (MTF-GLP) [42], the Gram–Schmidt mode 2 algorithm with a generalized Laplacian pyramid (GS2-GLP) [42], the robust band-dependent spatial-detail algorithm (BDSD-PC) [43], and the additive wavelet luminance proportional method with haze correction (AWLP-H) [44]. The CNN-based methods are TFnet [26], PanNet [25], DiCNN1 [27], S2DBPN [28], and CIKANet [29].
4.4. Performance Evaluation on Remote Sensing Images at a Reduced Resolution
In this section, we compare the performance of the different pansharpening methods on the two simulated datasets.
Figure 4 provides a detailed comparison of the different image fusion algorithms on the simulated GeoEye-1 dataset.
Figure 5 shows the corresponding residual images. It is clear that the GS2-GLP algorithm exhibited significant spatial distortion in its fusion results, especially in the large grass area. Although the BDSD-PC, MTF-GLP, and AWLP-H methods achieved reasonable results, they also exhibited varying degrees of spatial and spectral distortion in the grass area. Among them, the BDSD-PC result differed most noticeably from the reference image in color and texture.
Compared to the traditional algorithms, the fused images produced by the deep learning methods exhibited higher similarity to the reference image in terms of spectral fidelity and spatial detail. By training on a large number of samples, these methods learn to extract useful information from the LRMS and PAN images, thus achieving higher-quality image fusion. However, although they obtain good fusion results in most cases, they may still suffer from spectral and spatial distortion. PanNet, TFnet, and S2DBPN showed spectral distortion and a lack of fineness in the central green lawn area, while DiCNN1 and CIKANet retained relatively more spectral and spatial information and therefore produced better fusion results. Beyond the lawn area, the fusion result of our method in the street and village areas was also closer to the actual scene. Although the subtle differences between the fused images are difficult to observe with the naked eye, the residual images in Figure 5 show that our fused image contained less residual information and was closer to the reference image.
In order to further quantify and compare the performance of the different fusion methods,
Table 4 lists the average evaluation metrics over the 25 test images for the different fusion methods. The best values are highlighted in bold and the second-best values are underlined. As can be seen from Table 4, among the traditional algorithms, AWLP-H achieved the best metric values. The deep learning methods produced better fusion performance than the traditional algorithms: TFnet, PanNet, and S2DBPN performed slightly worse, while DiCNN1 and CIKANet were highly competitive. Our method achieved the best overall results; except for a slightly poorer Q value, it excelled in all other metrics. These indicators show that our method maintains the high spatial resolution of the image while preserving high-quality spectral information, which significantly improves fusion performance under complex backgrounds.
Figure 6 shows the image fusion results obtained by applying different algorithms to the simulated Ikonos dataset.
Figure 7 shows the corresponding residual images. Both GS2-GLP and MTF-GLP exhibited spectral distortion on the lawn in the upper right corner, which significantly reduced the clarity and recognizability of the image; this was primarily due to their limited ability to handle details, edges, and other fine information in high-resolution images. Although the BDSD-PC and AWLP-H algorithms produced good spatial resolution, they showed a certain degree of spectral distortion. The six CNN-based methods, including the proposed algorithm, preserved spatial details significantly better than the traditional methods, and the distortion in their fused images was markedly reduced. However, the fused images of TFnet, PanNet, and S2DBPN exhibited color tone deviation at the intersection of the villages and the river, resulting in color distortion relative to the reference image. There are two main reasons for this: first, these models handle scene-specific colors weakly during the training phase; second, they cannot adapt to complex environments with varying illumination and reflection. Analysis of the residual images shows that the method proposed in this study generated residual images containing less information, indicating that it maintained higher fidelity to the spatial details and spectral features of the original images during fusion.
In order to quantify the performance of each algorithm on the Ikonos dataset, we also listed the average evaluation metrics of the 24 sets of simulation experimental results, as shown in
Table 5. Similarly, the best results are highlighted in bold and the second-best results are underlined. Among the CNN-based algorithms, the proposed algorithm achieved the best results in terms of SAM, SCC, and SSIM, but it performed suboptimally in the ERGAS and RASE metrics. This indicates that the proposed method is better suited to the GeoEye-1 dataset. Overall, however, its performance on the simulated datasets surpassed that of the traditional methods and the other deep learning algorithms.
4.5. Performance Evaluation on Remote Sensing Images at Full Resolution
In practical applications, full-resolution remote sensing images play a central role in determining the accuracy and reliability of experimental results. Therefore, we performed image fusion directly on the real test datasets and comprehensively assessed the performance of the various fusion algorithms through visual inspection and quantitative indicators.
Figure 8 shows the real experimental results of our algorithm and other comparative algorithms on the Ikonos dataset. The small image in the green box was enlarged and placed in the lower right corner for clearer observation of its spatial and spectral information. As can be seen from
Figure 8, the traditional algorithms exhibited significant deficiencies in detail processing. The MTF-GLP algorithm produced particularly unclear boundaries, while the distortion of the AWLP-H algorithm was the most severe, manifesting primarily as color shifts and distortions in the image. In contrast, the six deep learning algorithms produced visually similar results, with significant improvements in detail and better overall fusion quality. Examining the roof of the red house, we found that our proposed method produced a more delicate fusion effect: it retained excellent spatial detail while effectively restoring the original spectrum of the LRMS image.
In order to quantify the performance of various algorithms on the Ikonos dataset,
Table 6 lists the average evaluation metrics of the 24 sets of real experimental results. TFnet and CIKANet both performed well, with TFnet ranking second in the SAM metric and CIKANet ranking second in the QNR metric and one of the distortion indexes. However, our proposed algorithm achieved the best performance on all evaluation metrics. This result further validates the advantages of deep learning in remote sensing image fusion and demonstrates the potential of our algorithm in practical applications.
Figure 9 shows the real experimental results of all the algorithms on the GeoEye-1 dataset. The color distortion of GS2-GLP and MTF-GLP was particularly evident and affected the overall quality of the fused images. In the enlarged area, the GS2-GLP result appeared relatively blurry and the MTF-GLP result was the blurriest, while the fused image generated by the AWLP-H algorithm showed more saturated colors. The six deep learning methods, including our algorithm, generally performed better visually than the traditional methods, showing a more refined and comprehensive fusion of spatial and spectral information.
We also conducted statistical analysis on the average metrics of the 25 sets of real image fusion results on the GeoEye-1 dataset, as shown in
Table 7. Overall, the traditional methods performed poorly in terms of these metrics, with only BDSD-PC performing slightly better among them. Among the deep learning fusion algorithms, TFnet and CIKANet both performed well, with TFnet ranking second in the QNR metric and one of the distortion indexes, and CIKANet ranking second in the SAM metric and the other distortion index. However, our proposed algorithm achieved the best performance on all of the evaluation metrics.