1. Introduction
Remote sensing image fusion refers to the complementary operation and processing of multisource remote sensing image data in space, time and spectrum according to certain rules and algorithms. This yields more accurate and richer information than any single image and generates synthetic image data with new spatial, spectral and temporal characteristics [
1,
2]. In the remote sensing community, hyperspectral and panchromatic images are two important image types. Hyperspectral remote sensing images usually have high spectral resolution and can provide rich spectral information. However, because the energy acquired by the imaging sensor is limited, their spatial resolution must remain low to maintain high spectral resolution. This means that the spatial details of ground objects cannot be resolved in hyperspectral remote sensing images [
3]. Panchromatic remote sensing images usually have high spatial resolution and can provide many spatial details of the ground objects. However, the spectral resolution of panchromatic images is usually low. Thus, panchromatic images cannot provide enough spectral information [
4]. As a result, fusing hyperspectral and panchromatic images can produce images with both high spectral and high spatial resolution. This kind of image fusion therefore compensates for the deficiencies of the individual input images, and the fused images can be used in a variety of applications such as ground object classification [
5], spectral decomposition [
6], and urban target detection [
7], among others.
Remote sensing image fusion can be at the pixel-level, feature-level or decision-level [
8]. In pixel-level fusion, the pixels of all the image data are directly fused through various algebraic operations, and the feature information of ground objects is then extracted after processing and analysis. Pixel-level fusion requires multiple sensors to be placed on the same platform to achieve both accurate spatial registration of the sensors and strict correspondence between the pixels. Pixel-level image fusion methods are generally based on the spatial domain or the transform domain. Feature-level fusion consists of image spatial registration, feature extraction, feature fusion and description of the attributes according to the fusion results. When multiple image sensors report similar features at the same location, the likelihood that the feature actually occurs there increases, and the accuracy of the measured features can be improved. Feature-level image fusion is very important for target recognition and identity authentication. Decision-level fusion is the highest level of fusion. It first carries out spatial registration of the images; feature extraction and description of the attributes of the image information are then carried out using a large-scale database and an expert decision system to simulate the human process of analysis, reasoning, recognition and decision-making. Finally, the feature information and attributes are fused. Decision-level fusion mainly aims to fuse multisource information and has strong fault tolerance. The difference between decision-level and feature-level fusion is that feature-level fusion extracts features from remote sensing images and directly fuses them into new features through various algorithms, while decision-level fusion extracts features, recognizes ground objects and then combines the ground object information into new ground objects.
Remote sensing image fusion methods can be divided into the following categories: multiresolution analysis (MRA)-based methods, component substitution (CS)-based methods, matrix decomposition-based methods, Bayesian-based methods, and deep learning (DL)-based methods. MRA-based fusion methods first obtain spatial detail information by multiscale decomposition of the panchromatic image and then fuse these details into the multispectral or hyperspectral image. MRA fusion methods mainly include the undecimated wavelet transform (UWT) [9], the decimated wavelet transform (DWT) [10], the curvelet-based transform method [11], the Laplacian pyramid method [12], and the contourlet-based transform method [13]. These methods extract spatial details from panchromatic images through a spatial filter and then insert the extracted spatial details into the hyperspectral images.
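As a concrete illustration of this detail-injection idea, the following minimal sketch (not taken from any of the cited methods) uses a simple box low-pass filter in place of a wavelet or pyramid decomposition; `hsi_up` and `pan` are assumed to be a hyperspectral cube already up-sampled to the panchromatic grid and the panchromatic band, respectively.

```python
# Minimal MRA-style detail-injection sketch (illustrative, not a cited method).
import numpy as np
from scipy.ndimage import uniform_filter

def mra_detail_injection(hsi_up: np.ndarray, pan: np.ndarray, size: int = 5) -> np.ndarray:
    """hsi_up: (H, W, B) up-sampled hyperspectral cube; pan: (H, W) panchromatic band."""
    pan_low = uniform_filter(pan.astype(np.float64), size=size)  # coarse approximation of PAN
    details = pan - pan_low                                      # extracted spatial details
    # Inject the same spatial details into every spectral band.
    return hsi_up.astype(np.float64) + details[..., None]
```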
CS-based fusion methods replace components in the multispectral or hyperspectral images with panchromatic images. CS fusion methods include the intensity–hue–saturation (IHS) method [
14,
15,
16], principal component analysis (PCA) [
17,
18,
19], and the Gram–Schmidt (GS) method, among others. These methods rely on projecting the hyperspectral images into another spectral space to separate the spatial and spectral information, so that the spatial component of the transformed hyperspectral data can be replaced with the panchromatic image. The stronger the correlation between the panchromatic image and the replaced component, the less the spectral loss in the fused image. Therefore, before substitution, histogram matching of the panchromatic image is usually carried out. The fused images are then obtained through inverse spectral transformation.
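The following hedged sketch illustrates a generic PCA-based component-substitution step, assuming `hsi_up` is the up-sampled hyperspectral cube and `pan` the co-registered panchromatic band; histogram matching is simplified here to mean–variance matching, and the code is not the exact procedure of any cited CS method.

```python
# Generic PCA-based component-substitution sketch (illustrative only).
import numpy as np

def cs_fusion_pca(hsi_up: np.ndarray, pan: np.ndarray) -> np.ndarray:
    H, W, B = hsi_up.shape
    X = hsi_up.reshape(-1, B).astype(np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Spectral transform: PCA via SVD; the first component usually carries the spatial detail.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                              # projected components
    pc1 = scores[:, 0]
    # "Histogram matching" simplified to matching mean and standard deviation.
    p = pan.reshape(-1).astype(np.float64)
    p_matched = (p - p.mean()) / (p.std() + 1e-12) * pc1.std() + pc1.mean()
    scores[:, 0] = p_matched                        # component substitution
    fused = scores @ Vt + mean                      # inverse spectral transform
    return fused.reshape(H, W, B)
```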
Matrix decomposition-based image fusion methods assume that hyperspectral images can be decomposed into the product of the spectral primitives and the correlation coefficient matrix. The spectral primitives refer to abstract representations of the spectral information, including sparse representation [
20,
21,
22] and low-rank representation [
23,
The sparse representation form relies on an over-complete dictionary and assumes that each spectrum is a linear combination of a few dictionary items. The dictionary is usually learned from the low-spatial-resolution hyperspectral remote sensing image by sparse dictionary learning methods such as K-SVD [25], online dictionary learning [26], and non-negative dictionary learning. Sparse priors are then used to regularize the coefficients, which are usually estimated with sparse coding algorithms. The low-rank representation holds that spectral features can be represented by low-dimensional subspaces, so the matrix composed of spectral primitives is a low-rank matrix. The low-rank spectral primitives are usually obtained by vertex component analysis (VCA), simplex identification via split augmented Lagrangian (SISAL), principal component analysis, or truncated singular value decomposition (TSVD). Both sparse and low-rank representation methods aim to model the similarity and redundancy between spectral bands, so both maintain the spectral characteristics well. However, the low-rank representation method can greatly reduce the dimensionality of the spectral patterns and has lower computational complexity than the sparse representation method.
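As a minimal illustration of the factorization idea, the sketch below approximates a hyperspectral cube as spectral primitives times a coefficient matrix using a truncated SVD; the variable names and the choice of SVD are illustrative assumptions, not a specific cited decomposition method.

```python
# Low-rank spectral factorization sketch: HSI matrix X (B x N) ~ E (B x r) @ A (r x N).
import numpy as np

def low_rank_factorization(hsi: np.ndarray, rank: int = 10):
    H, W, B = hsi.shape
    X = hsi.reshape(-1, B).astype(np.float64).T      # B x N spectral matrix
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    E = U[:, :rank] * S[:rank]                       # spectral primitives (B x r)
    A = Vt[:rank, :]                                 # coefficient matrix (r x N)
    return E, A                                      # X is approximately E @ A
```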
The Bayesian method relies on the posterior distribution of the target image given the observed hyperspectral and panchromatic images. This posterior distribution is obtained by Bayesian reasoning and contains two factors: (a) a likelihood function, which is the probability density of the observed multispectral or hyperspectral and panchromatic images given the target image, and (b) the prior probability density of the target image, through which desired characteristics, such as piecewise smoothness, can be imposed on the target image. Selecting appropriate prior information can solve the ill-conditioned inverse problem in the fusion process [
27]. Hyperspectral images and images with a high spatial resolution can be described in the framework of Bayesian inference. This method can intuitively explain the fusion process through the posterior distribution of the Bayesian fusion model. Since fusion problems are often ill-conditioned, Bayesian methods provide a convenient way to regularize the problem by defining an appropriate prior distribution for the scenarios of interest. According to this strategy, many scholars have designed different Bayesian estimation methods for fusing images with high spatial resolution and hyperspectral images [
28,
29].
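In general form, and with notation assumed here rather than taken from any specific cited model, the Bayesian fusion estimate described above can be written as

$$p(Z \mid Y_H, Y_P) \propto p(Y_H \mid Z)\, p(Y_P \mid Z)\, p(Z), \qquad \hat{Z} = \arg\max_{Z}\; p(Z \mid Y_H, Y_P),$$

where $Z$ is the target image, $Y_H$ and $Y_P$ are the observed hyperspectral and panchromatic images, the likelihoods model the observation process, and the prior $p(Z)$ encodes the desired characteristics of the target image.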
DL has made achievements in many fields, such as natural language processing [
30], computer vision [
31,
32], speech recognition [
33], search engines [
34] and so on. The DL fusion method is considered to be a new trend, which trains a network model to describe the mapping among the hyperspectral images, panchromatic images and target fusion images [
35]. Current DL fusion methods include the Deep Residual Pansharpening Neural Network (DRPNN) [
36], the Pansharpening Neural Network (PNN) [
37], and the Multiscale and Multidepth Convolutional Neural Network (MSDCNN) [
38]. The DIP-HyperKite method proposed in [
39] defines the spatial-domain constraint as the
L1 distance between the predicted PAN image and the actual PAN image, proposes a learnable spectral response function (SRF), and also proposes a novel over-complete network, called HyperKite, which focuses on learning high-level features by constraining the receptive field from increasing in the deep layers. The RCNN method proposed in [
40] utilizes a network to map the differential information between the high spatial resolution panchromatic (HR-PAN) image and the low spatial resolution multispectral (LR-MS) image. Moreover, RCNN makes full use of the LR-MS image and utilizes the gradient information of the up-sampled LR-MS image (Up-LR-MS) as auxiliary data to assist the network. Furthermore, an attention module and residual blocks are incorporated into the network structure. The MARB-Net proposed in [
41] assigns multiple weights to each feature using multiple attention mechanism models. Then, MARB-Net deeply mines and integrates the features using a residual network. Finally, MARB-Net performs contextual semantic integration on the deep fusion features using a Bi-LSTM network. These methods can usually obtain good spectral fidelity, but the spatial enhancement in their fusion results is inadequate. Moreover, current DL-based image fusion methods usually lack richly formalized and diversified deep features and lack deep feature enhancement. Meanwhile, these methods usually regard spatial and spectral features as individual units.
In the late 1980s, the invention of the backpropagation algorithm for artificial neural networks brought new hope to machine learning and set off a boom in machine learning based on statistical models. In 2006, Geoffrey Hinton and Ruslan Salakhutdinov published an article in Science, a leading academic journal, that started a wave of deep learning in academia and industry. Since 2006, deep learning has been gaining momentum in academia. Today, Google, Microsoft, Baidu, and other well-known high-tech companies with large amounts of data are scrambling to invest their resources in deep learning technology.
In this study, we propose a deep learning model, an encoder–decoder with a residual network (EDRN), for fusing hyperspectral and panchromatic images (
Supplementary Materials). The advantage of an end-to-end neural network is that the model maps directly from the original input to the final output, reducing manual pretreatment and follow-up processing as much as possible. This gives the model more room for automatic adjustment according to the data and increases the overall fit of the model. The proposed method can be divided into three parts: (1) hyperspectral and panchromatic image combination; (2) establishment of the encoder–decoder network, and (3) residual enhancement of the encoded and decoded deep features. To overcome the independence of the hyperspectral and panchromatic image features adopted in the majority of fusion methods, we first combined the hyperspectral and panchromatic images so that their features could interact. This integration provides a concise combination mode in which the hyperspectral and panchromatic images exchange spectral and spatial information. Second, we established an encoder–decoder network for extracting representative encoded and decoded deep features. The model extracts richly formalized and diversified encoded and decoded deep features with different feature sizes, which allows image fusion with more effective and hierarchically variable feature levels. Finally, to address the lack of deep feature enhancement in current fusion methods, we established residual networks between the encoder network and the decoder network to enhance the extracted encoded and decoded deep features and attain an enhanced image fusion result. In summary, the proposed DL-based image fusion method enhances the extracted deep features and, at the same time, combines the hyperspectral and panchromatic images.
In this paper, we propose a novel encoder-decoder with residual network fusion model. The main contributions and novelties of the proposed method are as follows:
- (1)
Spatial-spectral information interaction. We first up-sampled the hyperspectral image to the size of the panchromatic image and then concatenated the panchromatic and up-sampled hyperspectral images for information interaction.
- (2)
One-to-one encoder and decoder construction. During construction, the earlier encoded layers correspond one to one with the later decoded layers. We then constructed a one-to-one encoder-decoder network to extract encoded and decoded deep features.
- (3)
Encoded and decoded deep feature enhancement. We utilized convolutional residual connections from the encoded layers to the corresponding decoded layers to enhance the encoded and decoded deep features. We computed the encoded deep features with a convolutional implementation and then added them to the corresponding decoded deep features to enhance the overall deep features.
The rest of this article is organized as follows:
Section 2 provides a detailed description of the proposed method.
Section 3 presents the experimental results.
Section 4 presents a discussion.
Section 5 summarizes the conclusions.
2. Proposed Method
In this article, we propose an end-to-end DL model for fusing hyperspectral and panchromatic images. “End-to-end” means that the input of the model is the original raw data and the output is the final result. Classical machine learning first extracts features from the raw data in a preprocessing step and then uses these features in a specific application. The results of the application depend to a certain degree on the quality of the features, and earlier machine learning methods spent most of their effort on feature design; machine learning at that time was more appropriately called feature engineering. Later, people found it better to use neural networks and let the network learn how to obtain features by itself. This led to the rise of representation learning, which is more flexible for data fitting. As networks deepened further, the multilayer concept of representation learning brought the accuracy of algorithms to a new height and led to multilevel feature extraction as well as unified training and prediction of the recognizer. An end-to-end neural network excels at reducing manual pretreatment and follow-up processing and maps directly from the original input to the final output as much as possible. It gives the model more room for automatic adjustment according to the data and increases the overall fit of the model. Because features can be learned automatically, feature extraction is integrated into the classification algorithm without human intervention. For the end-to-end DL model proposed in this research, the inputs are the hyperspectral and panchromatic images, while the output is the fusion result.
Current DL-based remote sensing image fusion methods usually regard spatial and spectral features as individual units. At the input end of the proposed end-to-end deep learning model, we instead regard the hyperspectral and panchromatic images as an entirety. This operation combines the spectral information of the hyperspectral image and the spatial information of the panchromatic image into a single entity, and this spectral–spatial combination allows feature interaction between the spectral and spatial information in the later stages of the DL model. In this operation, the hyperspectral image is up-sampled to the spatial size of the panchromatic image. The integration of the up-sampled hyperspectral and panchromatic images thus provides spatial–spectral information interaction for image fusion and leads to a concise combination mode in which the two images exchange spectral and spatial information. After up-sampling the hyperspectral image, we concatenated the panchromatic image with the up-sampled hyperspectral image according to Equation (1):
$$X = \mathrm{concat}\left(H_{up}, P\right) \quad (1)$$
where $P$ is the panchromatic image, $H_{up}$ is the up-sampled hyperspectral image, and $X$ represents the combination of the up-sampled hyperspectral and panchromatic images.
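A minimal PyTorch sketch of this combination step is shown below, assuming bicubic up-sampling and illustrative tensor names; it is not the authors' exact implementation.

```python
# Sketch of the spatial-spectral combination in Equation (1).
import torch
import torch.nn.functional as F

def combine(hsi: torch.Tensor, pan: torch.Tensor) -> torch.Tensor:
    """hsi: (1, B, h, w) hyperspectral tensor; pan: (1, 1, H, W) panchromatic tensor, H > h."""
    hsi_up = F.interpolate(hsi, size=pan.shape[-2:], mode="bicubic", align_corners=False)
    return torch.cat([hsi_up, pan], dim=1)   # combined input X of shape (1, B + 1, H, W)
```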
Next, we took the combined data $X$ as the input of the deep learning model. After combining the panchromatic and up-sampled hyperspectral images, we used an encoder–decoder network for the elementary deep feature representation of the spectral–spatial interactive input data. An encoder–decoder is an artificial neural network used in supervised learning. The encoder is the first half of an encoder–decoder; its function is to transform the input data into the middle layer to produce a hidden representation. This part transforms the deep representative features of the input data into a hidden representation, and these deep representative features are also called encoded deep features. The decoder is the back half of an encoder–decoder; its function is to reconstruct the output data from the hidden representation in the middle layer. This part extracts the deep representative features from the hidden representation to the output data, and these deep representative features are also called decoded deep features. There are several categories of encoder–decoder: (1) ordinary encoder–decoders, which are neural networks with three layers (i.e., a neural network with one hidden layer); (2) multilayer encoder–decoders, which extend ordinary encoder–decoders to multiple hidden layers, and (3) convolutional encoder–decoders, which extend the fully connected network to a convolutional network. We utilized a convolutional encoder–decoder in our proposed DL model. A convolutional neural network (CNN), a type of deep learning method, was adopted to establish the deep network fusion model for the intelligent fusion of hyperspectral and panchromatic images in this study.
CNN is one of the representative deep learning algorithms and is a feedforward neural network containing convolution computation [
42]. In the convolution layer of a CNN, a neuron is connected only to neurons in the adjacent layers, and a layer usually contains multiple feature planes. Neurons in the same feature plane share the same weights, i.e., convolution kernels. Subsampling, also known as pooling, usually takes two forms: average subsampling and maximum subsampling. Subsampling can be regarded as a special convolution process. The output of the convolution passes through an activation function to form a layer of neurons, i.e., the feature map of the layer.
Figure 1 illustrates a schematic diagram of the encoder–decoder network. The encoder–decoder network adopted in our proposed DL model comprised three operations: convolution, pooling and up-sampling. In
Figure 1,
HSI refers to the hyperspectral image,
PAN represents the panchromatic image,
conv represents the convolution operation,
pool represents the max-pooling operation and
up represents up-sampling.
The current DL-based image fusion methods usually lack richly formalized and diversified deep features, so we constructed an encoder-decoder network to extract richly formalized and diversified deep features. In
Figure 1, there are two parts in the encoder–decoder network: the left part is the encoder network, while the right part is the decoder network. The encoder network includes a series of convolution layers and pooling layers, and the feature sizes of its convolution layers diminish gradually because of the pooling operation. The decoder network consists of a series of up-sampling and convolution layers, and the feature sizes of its convolution layers increase gradually because of up-sampling. In the encoder network, we convolved and pooled the combined hyperspectral and panchromatic image data layer by layer to form the encoded deep features. In this way, we extracted the encoded deep features from the combined hyperspectral and panchromatic image data and completed the establishment of the encoder network. Equation (2) shows the convolution operation of the encoder network:
$$y_l^{E} = w_l^{E} * x_l^{E} + b_l^{E} \quad (2)$$
where $w_l^{E}$ is the convolution weight and $*$ is the convolution operation, $b_l^{E}$ is the deviation value, and $x_l^{E}$ and $y_l^{E}$ represent the input and output of the $l$-th level convolution in the encoder network, respectively. Equation (3) shows the pooling operation in the encoder network:
$$y_l^{P} = \mathrm{pool}\left(x_l^{P}; s, p\right) \quad (3)$$
where $\mathrm{pool}(\cdot)$ represents the pooling function, $x_l^{P}$ and $y_l^{P}$ refer to the input and output of the $l$-th level's pooling in the encoder network, the step length is $s$, and pixel $p$ has the same meaning as in the convolution layer. When $\mathrm{pool}(\cdot)$ takes the mean value in the pooled area, the operation is average pooling; when $\mathrm{pool}(\cdot)$ takes the maximum value in the region, the operation is max-pooling. At the same time, in the decoder network, we subjected the middle layer between the encoder and decoder networks to the up-sampling and convolution operation layer by layer to form the decoded deep features. The decoded deep features are then extracted from the middle layer between the encoder and decoder networks to complete the establishment for the decoder network. Equation (4) shows the convolution operation of the decoder network:
$$y_l^{D} = w_l^{D} * x_l^{D} + b_l^{D} \quad (4)$$
where $x_l^{D}$ and $y_l^{D}$ represent the input and output of the $l$-th layer's convolution in the decoder network.
There are the same number of levels in the encoder and decoder networks, and the layers of the encoder network correspond one to one with the layers of the decoder network. For the encoder network, we used the pooling operation between layers to obtain encoded convolution feature blocks with different sizes and dimensions for each layer. For the decoder network, we obtained decoded convolution feature blocks with the same sizes as the corresponding encoded convolution feature blocks by using up-sampling between layers. In this way, we established the encoder and decoder networks with corresponding feature blocks of the same size and dimension. To overcome the problem that current DL-based image fusion methods usually lack richly formalized and diversified deep features, our model extracts richly formalized encoded and decoded deep features with different feature sizes, which makes image fusion more effective with hierarchically variable feature levels.
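The following sketch shows a minimal convolutional encoder–decoder with such a one-to-one level correspondence; the depth, channel widths and activation functions are illustrative assumptions rather than the exact EDRN configuration.

```python
# Minimal convolutional encoder-decoder sketch with one-to-one level correspondence.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, in_ch: int, out_bands: int):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())  # encoder level 1
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())     # encoder level 2
        self.pool = nn.MaxPool2d(2)                                               # halves the feature size
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec2 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())     # decoder level 2
        self.dec1 = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())     # decoder level 1
        self.head = nn.Conv2d(32, out_bands, 3, padding=1)                        # B-band fusion result

    def forward(self, x):
        e1 = self.enc1(x)              # encoded features, full size (matches decoder level 1)
        e2 = self.enc2(self.pool(e1))  # encoded features, half size (matches decoder level 2)
        d2 = self.dec2(e2)             # decoded features, half size
        d1 = self.dec1(self.up(d2))    # decoded features, full size
        return self.head(d1)
```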
The current DL-based image fusion methods usually lack deep feature enhancement, so we utilized a residual network to enhance the extracted encoded and decoded deep features.
Figure 2 is a schematic diagram of residual enhancement for the encoder network and the decoder network. In
Figure 2, the plus sign represents the residual block. This study used a residual network structure to adjust and enhance the encoded and decoded deep features. For the residual network, with the desired underlying mapping function denoted as $H(x)$, we let the stacked nonlinear layers fit another mapping, as shown in Equation (5):
$$F(x) = H(x) - x \quad (5)$$
where $F(\cdot)$ is the residual part, $H(\cdot)$ is the mapping part, $F(x)$ is a residual variable and $H(x)$ is a mapping variable. The original mapping was recast into Equation (6):
$$H(x) = F(x) + x \quad (6)$$
The formulation of $F(x) + x$ can be realized by feedforward neural networks with shortcut connections. The shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers. Identity shortcut connections add neither extra parameters nor computational complexity [43]. For the convolutional residual network with Equation (6), there is a convolutional operation in the residual part $F(x)$ and an identity mapping with the mapping part $x$. We adopted a convolutional residual network in the proposed EDRN method. The residual connections of the network structure were established at each convolution level of the decoder network, joining the corresponding encoder convolution. When establishing the encoder and decoder networks, we formulated the corresponding one-to-one encoded and decoded convolutions. For establishing the residual enhancement structure between the encoder and decoder networks, we added each encoded convolution layer to the corresponding decoded convolution layer with the same convolution feature size, as shown in
Figure 2. Before the addition operation, a convolution operation was applied within the residual branch to enhance the residual network.
The residual network consists of a series of residual blocks, each comprising a mapping part and a residual part. Equation (7) shows the operation of the residual enhancement network structure:
$$\hat{y}_l^{D} = y_l^{D} + F_l, \quad l = 1, 2, \ldots, L \quad (7)$$
where $y_l^{D}$ is the result of the convolution at the $l$-th level in the decoder network, which is the mapping part for identity mapping in the residual block; $F_l$ is the residual part in the residual block according to Equation (6); $\hat{y}_l^{D}$ is the result of the enhanced residuals of the convolution of the $l$-th layer in the decoder network, and $L$ is the total number of convolution layers in the decoder network. For the convolution of the residual network, Equation (8) represents $F_l$ as a convolution operation:
$$F_l = w_l^{R} * y_l^{E} + b_l^{R} \quad (8)$$
where $w_l^{R}$ is the convolution weight of the $l$-th residual part and $b_l^{R}$ is the bias of the $l$-th residual part. By substituting Equation (8) into Equation (7), Equation (9) obtains $\hat{y}_l^{D}$:
$$\hat{y}_l^{D} = y_l^{D} + w_l^{R} * y_l^{E} + b_l^{R} \quad (9)$$
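A minimal sketch of this residual enhancement, assuming matching channel counts between corresponding encoded and decoded feature maps, is given below; the module name is illustrative.

```python
# Residual enhancement sketch following Equations (7)-(9): the encoded feature map is
# convolved and added to the decoded feature map of the same size.
import torch
import torch.nn as nn

class ResidualEnhancement(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.residual_conv = nn.Conv2d(channels, channels, 3, padding=1)  # w_l^R, b_l^R

    def forward(self, encoded: torch.Tensor, decoded: torch.Tensor) -> torch.Tensor:
        # Equation (9): enhanced decoded feature = decoded + conv(encoded)
        return decoded + self.residual_conv(encoded)
```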
To overcome the problem that current DL-based image fusion methods usually lack deep feature enhancement, our model residually enhances the encoded and decoded deep features to attain an enhanced image fusion result.
For the encoder and decoder networks, we set a specific spatial size and number of channels for each encoded and decoded layer. We constructed the final data cube from the final layer of the decoder network. For this final layer, the spatial size was the same as that of the panchromatic image, and the number of channels was set to the band number of the hyperspectral image. The spatial and spectral resolution of the final layer of the decoder network was therefore the same as that of the ground truth image, so the final data cube could be used in the loss function computed against the ground truth image.
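A brief sketch of this output construction and loss computation is given below; the mean-squared-error loss is an assumed placeholder, since the exact loss function is determined by the training setup rather than stated in this section.

```python
# Sketch of comparing the final decoder output with the ground truth: the output has
# the panchromatic spatial size and the hyperspectral band count.
import torch
import torch.nn.functional as F

def fusion_loss(fused: torch.Tensor, ground_truth: torch.Tensor) -> torch.Tensor:
    # Both tensors: (batch, B, H, W), with H x W the panchromatic spatial size.
    assert fused.shape == ground_truth.shape
    return F.mse_loss(fused, ground_truth)   # placeholder loss for illustration
```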
4. Discussion
We utilized a series of image fusion performance indices to precisely verify the spectral and spatial performance of the fusion results in our experiments. We adopted eight performance indices in this study, namely the structural similarity index (SSIM), the peak-signal to noise ratio (PSNR), spectral curve comparison, spatial correlation coefficient (SCC), the spectral angle mapper (SAM), the root mean squared error (RMSE), the relative dimensionless global error in synthesis (ERGAS) and the Q metric [
48,
49]. We computed the SSIM index as in Equation (10):
$$\mathrm{SSIM}_b = \frac{\left(2\mu_{F_b}\mu_{G_b} + c_1\right)\left(2\sigma_{F_b G_b} + c_2\right)}{\left(\mu_{F_b}^{2} + \mu_{G_b}^{2} + c_1\right)\left(\sigma_{F_b}^{2} + \sigma_{G_b}^{2} + c_2\right)} \quad (10)$$
where $F$ is the fused image, $G$ is the ground truth, $b = 1, \ldots, B$ with $B$ the band number of the hyperspectral image, $F_b$ is the $b$-th band of the fused image, $G_b$ is the $b$-th band of the ground truth, $\mu_{F_b}$ is the average value of $F_b$, $\mu_{G_b}$ is the average value of $G_b$, $\sigma_{F_b G_b}$ is the covariance value between $F_b$ and $G_b$, $\sigma_{F_b}^{2}$ is the variance of $F_b$, $\sigma_{G_b}^{2}$ is the variance of $G_b$, and $c_1$ and $c_2$ are small stabilizing constants. The SSIM index is a band-based performance index. We computed the PSNR index as in Equation (11):
$$\mathrm{PSNR}_b = 10\log_{10}\!\left(\frac{\max\left(G_b\right)^{2}}{\frac{1}{N}\sum_{i=1}^{N}\left(F_{b,i} - G_{b,i}\right)^{2}}\right) \quad (11)$$
where $N$ is the pixel number of the fused image, $i = 1, \ldots, N$, $F_{b,i}$ is the $i$-th value in $F_b$, and $G_{b,i}$ is the $i$-th value in $G_b$. The PSNR index is also a band-based performance index. We computed the SCC index as in Equation (12):
$$\mathrm{SCC}_b = \frac{\sum_{i=1}^{N}\left(F_{b,i} - \mu_{F_b}\right)\left(G_{b,i} - \mu_{G_b}\right)}{\sqrt{\sum_{i=1}^{N}\left(F_{b,i} - \mu_{F_b}\right)^{2}}\sqrt{\sum_{i=1}^{N}\left(G_{b,i} - \mu_{G_b}\right)^{2}}} \quad (12)$$
It is clear that the SCC index is also a band-based performance index. We computed the SAM index as in Equation (13):
$$\mathrm{SAM}_i = \arccos\!\left(\frac{\left\langle F_i, G_i\right\rangle}{\left\|F_i\right\|_2\left\|G_i\right\|_2}\right) \quad (13)$$
where $F_i$ is the $i$-th pixel in the fused image, $G_i$ is the $i$-th pixel in the ground truth, and $\|\cdot\|_2$ is the $L_2$ norm. The SAM index is a pixel-based performance index. We computed the RMSE index as in Equation (14):
$$\mathrm{RMSE} = \sqrt{\frac{1}{NB}\sum_{b=1}^{B}\sum_{i=1}^{N}\left(F_{b,i} - G_{b,i}\right)^{2}} \quad (14)$$
The RMSE index is a pixel-based performance index. We computed the ERGAS index as in Equation (15):
$$\mathrm{ERGAS} = 100\,d\,\sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\frac{\mathrm{RMSE}_b}{\mu_{G_b}}\right)^{2}} \quad (15)$$
where $d$ is the spatial downsampling factor and $\mathrm{RMSE}_b$ is the RMSE of the $b$-th band. The ERGAS index is a band-based performance index. We computed the Q metric index as in Equation (16):
$$Q_i = \frac{4\,\sigma_{F_i G_i}\,\mu_{F_i}\,\mu_{G_i}}{\left(\sigma_{F_i}^{2} + \sigma_{G_i}^{2}\right)\left(\mu_{F_i}^{2} + \mu_{G_i}^{2}\right)} \quad (16)$$
where the means, variances and covariance are computed over the spectral vector of the $i$-th pixel. The Q metric index is a pixel-based performance index.
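As hedged reference sketches, the following NumPy functions compute two of the indices defined above, SAM (Equation (13), averaged over pixels and reported in degrees) and ERGAS (Equation (15)); `fused` and `truth` are assumed to be (H, W, B) arrays and `d` is the spatial downsampling factor.

```python
# Sketches of the SAM and ERGAS indices for (H, W, B) image cubes.
import numpy as np

def sam(fused: np.ndarray, truth: np.ndarray) -> float:
    F = fused.reshape(-1, fused.shape[-1]).astype(np.float64)
    G = truth.reshape(-1, truth.shape[-1]).astype(np.float64)
    dots = np.sum(F * G, axis=1)
    norms = np.linalg.norm(F, axis=1) * np.linalg.norm(G, axis=1) + 1e-12
    angles = np.arccos(np.clip(dots / norms, -1.0, 1.0))
    return float(np.degrees(angles).mean())             # average spectral angle in degrees

def ergas(fused: np.ndarray, truth: np.ndarray, d: float) -> float:
    F = fused.reshape(-1, fused.shape[-1]).astype(np.float64)
    G = truth.reshape(-1, truth.shape[-1]).astype(np.float64)
    rmse_b = np.sqrt(np.mean((F - G) ** 2, axis=0))      # per-band RMSE
    mean_b = np.mean(G, axis=0) + 1e-12                  # per-band mean of the reference
    return float(100.0 * d * np.sqrt(np.mean((rmse_b / mean_b) ** 2)))
```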
Among the performance indices, SCC and SSIM are spatial quality metrics, SAM and the spectral curve comparison are spectral quality metrics, and RMSE, PSNR, ERGAS and the Q metric are comprehensive spatial–spectral quality metrics. We utilized SAM, the spectral curve comparison, RMSE, PSNR, ERGAS and the Q metric to test the spectral information enhancement in the high-resolution images, and SCC, SSIM, RMSE, PSNR, ERGAS and the Q metric to test the spatial information enhancement in the hyperspectral images. The spectral curve comparison compares the spectral curve of a pixel in the result of a fusion method with the corresponding spectral curve of the pixel in the original hyperspectral image. In our experiments, the reference pixel in the hyperspectral image was (30, 30) for all datasets; the corresponding pixel in the fusion results of all the compared and proposed methods was (360, 360) for the Baiyangdian, Chaohu and Dianchi datasets, (120, 120) for the Pavia Center and Chikusei datasets, and (90, 90) for the Botswana dataset. It is important to emphasize that PSNR, SSIM and the Q metric are better when larger, and RMSE, SAM and ERGAS are better when smaller. For SCC, the performance is better when most of the per-band SCC values are higher than those of the other fusion methods. For the spectral curve comparison, the performance is better when the spectral curve is closer to the original spectral curve in the hyperspectral image.
Figure 16 shows the quality of the compared and proposed fusion methods for the Baiyangdian dataset in terms of SCC and spectral curve comparison, while
Table 1 illustrates the quality of the compared and proposed fusion methods for the Baiyangdian dataset in terms of RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric. In
Table 1, the best performance for each index is shown in bold font. The AWLP, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the CNMF and GIHS methods to a great degree, while the proposed EDRN method achieved the best RMSE performance, with an RMSE lower than 100. The AWLP, CNMF, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the GIHS method to a great degree, while the EDRN method achieved the best performance for the SAM index. The GIHS and the proposed EDRN methods achieved better performance than the other methods, while the proposed EDRN method achieved the best SCC performance in most of the spectral bands. The proposed EDRN method also achieved the best performance for the spectral curve comparison. AWLP, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than CNMF and GIHS, while the proposed EDRN method achieved the best performance for the PSNR index. GIHS, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than AWLP and CNMF, while the proposed EDRN method achieved the best performance for the SSIM index. AWLP, MTF_GLP, HPF, DIP-HyperKite and the proposed EDRN methods achieved better performance than the CNMF, GIHS and SFIM methods, while the proposed EDRN method achieved the best performance for the ERGAS index. MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the AWLP, CNMF, GIHS, HPF and SFIM methods, while the proposed EDRN method had the best performance for the Q metric index. Hence, the proposed EDRN method achieved the best performance for all eight evaluation indices.
Figure 17 shows the quality of the compared and proposed methods for the Chaohu dataset in terms of SCC and spectral curve comparison, while
Table 2 illustrates the quality of the compared and proposed fusion methods for the Chaohu dataset in terms of RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric. In
Table 2, the best performance of each of the indices is shown in bold font. The AWLP, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the CNMF and GIHS methods to a great degree, while the proposed EDRN method achieved the best RMSE performance. The AWLP, CNMF, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the GIHS method to a great degree, while the EDRN method achieved the best performance for the SAM index. The proposed EDRN method achieved better performance than all the compared methods in most of the spectral bands for the SCC index. The proposed EDRN method also achieved the best performance for the spectral curve comparison. AWLP, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than CNMF and GIHS, and the proposed EDRN method achieved the best performance for the PSNR index. GIHS, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than AWLP and CNMF, and the proposed EDRN method achieved the best performance for the SSIM index. AWLP, CNMF, MTF_GLP, HPF, DIP-HyperKite and the proposed EDRN methods achieved better performance than the GIHS and SFIM methods, while the proposed EDRN method had the best performance for the ERGAS index. GIHS, MTF_GLP, HPF, SFIM, DIP-HyperKite and the proposed EDRN methods achieved better performance than the AWLP and CNMF methods, while the proposed EDRN method achieved the best performance for the Q metric index. Hence, the proposed EDRN method achieved the best performance for all eight evaluation indices.
Figure 18 shows the quality of the compared and proposed methods for the Dianchi dataset in terms of the SCC and spectral curve comparison.
Table 3 illustrates the quality of the compared and proposed fusion methods for the Dianchi dataset in terms of RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric. In
Table 3, the best performance for each index is shown in bold font. The AWLP, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the CNMF and GIHS methods to a great degree, and the proposed EDRN method achieved the best RMSE performance. The AWLP, CNMF, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the GIHS method to a great degree, and the EDRN method achieved the best performance for the SAM index. The proposed EDRN method achieved better performance than all the other methods in most of the spectral bands for the SCC index. AWLP and the proposed EDRN methods achieved better performance than the CNMF, GIHS and MTF_GLP methods for the spectral curve comparison. The proposed EDRN method achieved the best performance for the spectral curve comparison. AWLP, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than CNMF and GIHS, and the proposed EDRN method achieved the best performance for the PSNR index. GIHS, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than AWLP and CNMF, and the proposed EDRN method achieved the best performance for the SSIM index. MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the AWLP, CNMF, GIHS, HPF and SFIM methods, while the proposed EDRN method achieved the best performance for the ERGAS index. GIHS, MTF_GLP, HPF, SFIM, DIP-HyperKite and the proposed EDRN methods achieved better performance than the AWLP and CNMF methods, while the proposed EDRN method achieved the best performance for the Q metric index.
Figure 19,
Figure 20 and
Figure 21 show the quality of the compared and proposed methods for the Pavia Center, Botswana and Chikusei datasets in terms of SCC and spectral curve comparison, while
Table 4,
Table 5 and
Table 6 illustrate the quality of the compared and proposed fusion methods for the Pavia Center, Botswana and Chikusei datasets in terms of RMSE, SAM, PSNR, SSIM, ERGAS and the Q metric. In
Table 4,
Table 5 and
Table 6, the best performance for each index is shown in bold font. The AWLP, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the CNMF and GIHS methods to a great degree, while the proposed EDRN method achieved the best RMSE performance. The AWLP, CNMF, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than the GIHS method to a great degree, while the EDRN method achieved the best performance for the SAM index. The proposed EDRN method achieved better performance than all the compared methods in most of the spectral bands for the SCC index. The proposed EDRN method also achieved the best performance for the spectral curve comparison. AWLP, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than CNMF and GIHS, and the proposed EDRN method achieved the best performance for the PSNR index. GIHS, MTF_GLP, DIP-HyperKite and the proposed EDRN methods achieved better performance than AWLP and CNMF, and the proposed EDRN method achieved the best performance for the SSIM index. AWLP, CNMF, MTF_GLP, HPF, DIP-HyperKite and the proposed EDRN methods achieved better performance than the GIHS and SFIM methods, while the proposed EDRN method had the best performance for the ERGAS index. GIHS, MTF_GLP, HPF, SFIM, DIP-HyperKite and the proposed EDRN methods achieved better performance than the AWLP and CNMF methods, while the proposed EDRN method achieved the best performance for the Q metric index. Hence, the proposed EDRN method achieved the best performance for all eight evaluation indices.
In terms of the quality evaluation for the compared and proposed EDRN methods on the six datasets, the proposed EDRN method achieved better evaluation performance than all the other fusion methods.
We counted the time cost of taking the entire hyperspectral image as the up-sampling input and of taking hyperspectral image patches as the up-sampling input. The time costs are shown in
Table 7.
We set the size of the image patches for the hyperspectral image as 20 × 20 and the size of the hyperspectral image as 300 × 300. Then, we segmented the hyperspectral image into a total of 255 image patches. As shown in
Table 7, taking the entire hyperspectral image as the up-sampling input cost a little more time, roughly 37 s. We then computed the time cost of taking one image patch as the up-sampling input, which was roughly 0.11 s. We also computed the total time cost of taking all image patches as the up-sampling input, which is the time cost for one patch multiplied by 255, i.e., roughly 28 s. Taking image patches as the up-sampling input therefore saves time compared with taking the entire hyperspectral image as the up-sampling input.
We set up a series of experiments to compare how up-sampling with the nearest, bilinear and bicubic modes affects the downstream task. The fused results of up-sampling with the nearest, bilinear and bicubic modes on the Baiyangdian, Chaohu and Dianchi datasets are shown in
Figure 22,
Figure 23 and
Figure 24.
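For reference, the sketch below shows how the three up-sampling modes compared here can be produced with PyTorch's interpolation routine; the tensor shape and scale factor are illustrative assumptions.

```python
# Producing nearest, bilinear and bicubic up-sampled versions of a hyperspectral tensor.
import torch
import torch.nn.functional as F

def upsample_variants(hsi: torch.Tensor, scale: int = 10):
    """hsi: (1, B, h, w) hyperspectral tensor; returns the three up-sampled variants."""
    nearest = F.interpolate(hsi, scale_factor=scale, mode="nearest")
    bilinear = F.interpolate(hsi, scale_factor=scale, mode="bilinear", align_corners=False)
    bicubic = F.interpolate(hsi, scale_factor=scale, mode="bicubic", align_corners=False)
    return nearest, bilinear, bicubic
```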
As shown in
Figure 22,
Figure 23 and
Figure 24, the fusion results with nearest-mode up-sampling have a very fuzzy visual effect, the results with bilinear-mode up-sampling are slightly clearer than those with nearest-mode up-sampling, and the results with bicubic-mode up-sampling have the best visual effect. As shown in
Figure 22a, the fusion result with up-sampling of nearest mode achieved fuzzy fusion performance on the land covers of croplands, roads, buildings and shadows in the Baiyangdian region dataset. As shown in
Figure 22b, the fusion result with up-sampling of bilinear mode achieved fuzzy fusion performance on the land covers of cropland, buildings and shadows but a little clearer fusion performance on the land cover of road on the Baiyangdian region dataset. As shown in
Figure 22c, the fusion result with up-sampling of bicubic achieved clear fusion performance on all of the land covers in the Baiyangdian region dataset. As shown in
Figure 23a, the fusion result with up-sampling of nearest mode achieved fuzzy fusion performance on the land covers of croplands, mountains, roads and water in the Chaohu region dataset. As shown in
Figure 23b, the fusion result with up-sampling of bilinear mode achieved a fuzzy fusion performance on the land covers of croplands, mountains and roads but a little clearer fusion performance on the land cover of water in the Chaohu region dataset. As shown in
Figure 23c, the fusion result with up-sampling of bicubic mode achieved clear fusion performance on all land covers in the Chaohu region dataset. As shown in
Figure 24a, the fusion result with up-sampling of nearest mode achieved fuzzy fusion performance on the land covers of rivers, water and mountains in the Dianchi region dataset. As shown in
Figure 24b, the fusion result with up-sampling of bilinear mode achieved fuzzy fusion performance on the land covers of mountains but a little clearer fusion performance on the land covers of jungles, rivers and water in the Dianchi region dataset. As shown in
Figure 24c, the fusion result with up-sampling of bicubic mode achieved clear fusion performance for all of the land covers in the Dianchi region dataset. It can be concluded that up-sampling with the nearest and bilinear modes affected the downstream task through comparatively fuzzy up-sampling results, while up-sampling with the bicubic mode reached the sweet spot for the up-sampling step; up-sampling beyond this point caused the performance of the downstream task to diminish.