3.1. Motivation
The goal of pansharpening is to combine the abundant spectral information in the MS image with the rich spatial information in the PAN image to obtain the HRMS image. As mentioned above, the fused images obtained by CS-based methods and MRA-based methods suffer from spectral distortions and spatial distortions, respectively, whereas CNN-based methods can effectively improve both the spectral and spatial fidelity of the fused image. Given the powerful performance of CNNs in image feature extraction and in image reconstruction from the extracted features, a CNN is adopted to perform the pansharpening task in this work.
An original image contains abundant detail information. If the image is downsampled by a certain scale, much of the detail information is lost, but the downsampled image can still coarsely reflect the spatial structure of the original image. Therefore, we consider utilizing the coarse-level structure of an image to reconstruct its fine-level structure. This coarse-to-fine idea has been applied successfully in several works [35,36,37]. For example, in the image deblurring network proposed by Nah et al. [35], the authors divided the entire network into three levels. From the top level to the bottom level, the inputs of the three levels are the original image that needs to be deblurred, the medium-level blurred image obtained by downsampling the original image once, and the coarse-level blurred image obtained by downsampling it twice, respectively. Each level of the network uses the same structure, and each outputs the deblurred image at the corresponding level. To make full use of the information in the different levels of the blurred image, the restored image at the coarser level is concatenated with the next, relatively finer blurred input image to form the input of the finer-level network. Finally, a fine-level deblurred image with better performance is obtained. Inspired by such successful cases, the coarse-to-fine idea is introduced into the field of pansharpening in this paper, so that the detail information at different levels (coarse level, fine level, etc.) of the source images can be fully exploited to reconstruct a high-resolution MS image that retains more details. Besides, in order to give full play to the advantages of deep learning, residual learning is introduced and the depth of the network is increased, so that the mapping relationship between the input images and the pansharpened image can be approximated more accurately while the learning process is also simplified.
3.2. The Architecture of Proposed Network
The architecture of the proposed MSDRN is shown in Figure 1. For convenience of modeling, the original MS and PAN images are represented as $M$ and $P$ with sizes of $h \times w \times c$ and $rh \times rw$, respectively; $h$, $w$, and $c$ respectively denote the height, width, and number of channels of the MS image, and $r$ denotes the ratio of spatial resolution between the MS image and the PAN image. The MS image upsampled by a ratio of $r$ is denoted as $\widetilde{M}$; the images obtained by downsampling $\widetilde{M}$ once and twice are represented as $\widetilde{M}_{\downarrow 2}$ and $\widetilde{M}_{\downarrow 4}$, respectively; the images obtained by downsampling $P$ once and twice are represented as $P_{\downarrow 2}$ and $P_{\downarrow 4}$, respectively, and the scale of each downsampling is 2 × 2. From top to bottom, the three different levels of high-resolution pansharpened images to be predicted are represented as $\hat{M}_1$, $\hat{M}_2$, and $\hat{M}_3$, respectively. The concatenation of images along the band direction is denoted as “$\oplus$”. For example, the concatenation of the upsampled MS image and the PAN image can be represented as $C = \widetilde{M} \oplus P$, where $C$ indicates the concatenated data.
The network flowchart is drawn for a 4-band MS image, and the size ratio of the PAN image to the MS image is 4. In Figure 1, “↑” denotes the upsampling operation. Moreover, it can be seen from Figure 1 that the entire network is divided into three levels, i.e., the fine-level network, the medium-level network, and the coarse-level network, from top to bottom. The networks at all levels have a similar structure, which consists of a sub-network (“Net” in the figure) and a subsequent convolutional layer. The original data of the pansharpening problem consist of an MS image and a PAN image, whereas the input of the image deblurring network [35] is only a single RGB image. To this end, the original MS image is first interpolated to the same size as the PAN image and then stacked with the PAN image to form 5-band data, which can be expressed as $\widetilde{M} \oplus P$. Subsequently, the whole stacked data is downsampled twice in succession, and the results are used as the initial inputs of the medium-level network and the coarse-level network, respectively. Each level of the network outputs a high-resolution MS image at the corresponding level.
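To make this data preparation concrete, the following Python (PyTorch) sketch builds the stacked 5-band input and its two downsampled versions. The function name, the use of bicubic resampling, and the use of F.interpolate for the 2 × 2 downsampling are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def build_level_inputs(ms, pan, ratio=4):
    """Stack the upsampled MS image with the PAN image and derive the three
    levels of initial inputs by two successive 2x2 downsamplings.

    ms  : (B, 4, h, w)      low-resolution MS image
    pan : (B, 1, r*h, r*w)  PAN image
    """
    # Interpolate MS to the PAN size (bicubic is one common choice).
    ms_up = F.interpolate(ms, scale_factor=ratio, mode="bicubic", align_corners=False)
    x_fine = torch.cat([ms_up, pan], dim=1)  # 5-band data, fine-level input
    # Downsample the stacked data twice in succession (2x2 each time).
    x_medium = F.interpolate(x_fine, scale_factor=0.5, mode="bicubic", align_corners=False)
    x_coarse = F.interpolate(x_medium, scale_factor=0.5, mode="bicubic", align_corners=False)
    return x_fine, x_medium, x_coarse
```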
A “skip connection” is added to each level of the network, pointing from the input of the sub-network at that level to the output of the sub-network. Apart from allowing more layers to be stacked, skip connections can also transmit the network’s input to the output intact. In each level of the network, the inputs, including the MS and PAN images at the corresponding scale, are transmitted unchanged to the output of the corresponding sub-network (Net), so that the loss of spectral and spatial information in the fused image can be reduced. Since the element-wise addition operation requires the operands to have the same dimensions, a convolutional layer is added behind the skip connection to reduce the number of bands of the output data. Besides the initial inputs, the fused image from the next, relatively coarser level is taken as part of the inputs of the medium- and fine-level networks. Specifically, the medium-level network uses the coarse-level fused image as part of its inputs, and the fine-level network uses the fused image of the medium-level network as part of its inputs. In this way, different levels of detail information from the source images can be exploited.
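As a rough illustration of the per-level structure described above (a “Net” sub-network, a skip connection, and a band-reducing convolutional layer), a minimal PyTorch sketch is given below. The module name, channel width, and layer arrangement are hypothetical placeholders and do not reproduce the exact configuration listed in Table 1.

```python
import torch
import torch.nn as nn

class Level(nn.Module):
    """One network level: a residual sub-network ("Net") followed by a
    convolutional layer that reduces the band number back to 4."""
    def __init__(self, in_bands, mid_channels=64, num_layers=10):
        super().__init__()
        layers = [nn.Conv2d(in_bands, mid_channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(mid_channels, in_bands, 3, padding=1)]
        self.net = nn.Sequential(*layers)                    # the "Net" sub-network
        self.reduce = nn.Conv2d(in_bands, 4, 3, padding=1)   # final layer: band reduction

    def forward(self, x):
        y = self.net(x) + x     # skip connection: add the input back element-wise
        return self.reduce(y)   # 4-band HRMS prediction at this level
```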
In [26], the authors indicate that some class-specific radiometric indices that are highly correlated with feature maps from different layers can guide the learning process of the network. Hence, two well-known indices for 4-band MS images, i.e., the normalized difference water index (NDWI) and the normalized difference vegetation index (NDVI) [26,38], are used in our proposed network. Their definitions can be expressed as

$$\mathrm{NDWI} = \frac{M_{\mathrm{G}} - M_{\mathrm{NIR}}}{M_{\mathrm{G}} + M_{\mathrm{NIR}}}, \qquad \mathrm{NDVI} = \frac{M_{\mathrm{NIR}} - M_{\mathrm{R}}}{M_{\mathrm{NIR}} + M_{\mathrm{R}}}$$

where $M_{\mathrm{G}}$, $M_{\mathrm{R}}$, and $M_{\mathrm{NIR}}$ represent the green, red, and near-infrared bands of the MS image, respectively. The concatenation of these extended inputs is expressed as NDXI, i.e., $\mathrm{NDXI} = \mathrm{NDWI} \oplus \mathrm{NDVI}$.
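For reference, a minimal sketch of computing these indices from a 4-band MS tensor is given below. The assumed band order (blue, green, red, NIR) and the small stabilizing constant are illustrative assumptions.

```python
import torch

def ndxi(ms, eps=1e-6):
    """Compute NDWI and NDVI from a 4-band MS image and stack them (NDXI).

    Band order is assumed to be (blue, green, red, NIR); adjust the indices
    to match the actual sensor layout.  ms: (B, 4, h, w)."""
    green, red, nir = ms[:, 1:2], ms[:, 2:3], ms[:, 3:4]
    ndwi = (green - nir) / (green + nir + eps)
    ndvi = (nir - red) / (nir + red + eps)
    return torch.cat([ndwi, ndvi], dim=1)   # 2-band extended input
```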
In order to reduce the complexity of the network, these indices are only used as part of the inputs of the fine-level network, because the final fusion target of pansharpening is generated only by the fine-level network. Like the original MS image, NDXI is first interpolated to the same size as the PAN image, and the extended inputs interpolated by a factor of $r$ are denoted as $\widetilde{\mathrm{NDXI}}$. Then, they are concatenated with the initial inputs (the upsampled MS image and the PAN image) of the fine-level network and fed into the fine-level network. The network fusion process is divided into three stages, starting with coarse-level fusion, transitioning to medium-level fusion, and ending with the final fine-level fusion. Here, the number of layers in each level of the network is set to $l$. The three fusion stages are described in detail as follows.
Stage 1: The five-band data $X_3 = \widetilde{M}_{\downarrow 4} \oplus P_{\downarrow 4}$ obtained by 4 × 4 downsampling is input to the coarse-level network. The output after the skip connection should also be five-band data, and it can be expressed as

$$Y_3 = F_3\left(X_3; W_3^{1 \sim l-1}, b_3^{1 \sim l-1}\right) + X_3$$

where $F_3(\cdot)$ represents the mapping of the “Net3” sub-network from input to output; $W_3^{1 \sim l-1}$ and $b_3^{1 \sim l-1}$ represent the weights and biases of the sub-network, respectively; the superscript $1 \sim l-1$ indicates the 1st to the $(l-1)$th layer, and the subscript “3” represents the coarse-level network. To obtain the four-band coarse-level HRMS image, it is necessary to add another convolutional layer to reduce the spectral dimension; thus, the resulting coarse-level HRMS image can be expressed as

$$\hat{M}_3 = W_3^{l} * Y_3 + b_3^{l}$$

where $W_3^{l}$ and $b_3^{l}$, respectively, represent the weights and biases in the $l$th layer of the coarse-level network, and $*$ represents the convolution operation.
Stage 2: To match the input image size of the medium-level network, the coarse-level fused image needs to be upsampled. Commonly used upsampling methods include linear interpolation, bicubic interpolation, etc., but the up-convolution method is used in this paper. On the one hand, the up-convolution method has been reported in the literature [31,39] to have better performance. On the other hand, it can be used as a part of the entire network, so that all layers of the entire network learn together without other interventions. The coarse-level fused image upsampled by a scale of 2 with the up-convolution method is denoted as $\hat{M}_3^{\uparrow}$. The coarse-level fused image after up-convolution is concatenated with the initial inputs of the medium-level network to form inputs with nine bands, which can be expressed as $X_2 = \widetilde{M}_{\downarrow 2} \oplus P_{\downarrow 2} \oplus \hat{M}_3^{\uparrow}$. Then, it is fed into the medium-level network, and the output after the skip connection can be expressed as

$$Y_2 = F_2\left(X_2; W_2^{1 \sim l-1}, b_2^{1 \sim l-1}\right) + X_2$$

where $F_2(\cdot)$ represents the mapping of the “Net2” sub-network from input to output; $W_2^{1 \sim l-1}$ and $b_2^{1 \sim l-1}$, respectively, represent the weights and biases of the sub-network, and the subscript “2” represents the medium-level network. Similarly, a convolutional layer needs to be added behind the skip connection, so that the medium-level HRMS image can be obtained by the following formula

$$\hat{M}_2 = W_2^{l} * Y_2 + b_2^{l}$$

where $W_2^{l}$ and $b_2^{l}$, respectively, represent the weights and biases in the $l$th layer of the medium-level network.
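A possible PyTorch realization of this up-convolution and concatenation step is sketched below. The transposed-convolution kernel size and padding are common choices that give an exact 2× upsampling, not values taken from the paper, and the helper name is hypothetical.

```python
import torch
import torch.nn as nn

# Up-convolution (transposed convolution) that doubles the spatial size of the
# 4-band coarse-level fused image; kernel=4, stride=2, padding=1 gives exact 2x.
upconv = nn.ConvTranspose2d(in_channels=4, out_channels=4, kernel_size=4, stride=2, padding=1)

def medium_level_input(m3_hat, x_medium):
    """Form the 9-band medium-level input: up-convolved coarse fusion (4 bands)
    concatenated with the medium-level initial input (5 bands)."""
    m3_up = upconv(m3_hat)                    # (B, 4, 2h', 2w')
    return torch.cat([x_medium, m3_up], dim=1)  # (B, 9, 2h', 2w')
```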
Stage 3: The medium-level HRMS image is upsampled by a scale of 2 in the up-convolution manner, and the obtained image is denoted as $\hat{M}_2^{\uparrow}$. This image is concatenated with the initial inputs of the fine-level network, and the resulting data, used as the input of the fine-level network, can be expressed as

$$X_1 = \widetilde{M} \oplus P \oplus \widetilde{\mathrm{NDXI}} \oplus \hat{M}_2^{\uparrow}$$

Similar to the above two stages, the output after the skip connection of the fine-level network and the output after adding the convolutional layer can be respectively expressed as

$$Y_1 = F_1\left(X_1; W_1^{1 \sim l-1}, b_1^{1 \sim l-1}\right) + X_1$$
$$\hat{M}_1 = W_1^{l} * Y_1 + b_1^{l}$$

where $\hat{M}_1$ is the final pansharpened image; $F_1(\cdot)$, $W_1^{1 \sim l-1}$, and $b_1^{1 \sim l-1}$ represent the mapping of the “Net1” sub-network from input to output and the weights and biases of the sub-network, respectively; $W_1^{l}$ and $b_1^{l}$, respectively, represent the weights and biases in the $l$th layer of the fine-level network, and the subscript “1” represents the fine-level network.
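Putting the three stages together, the following structural sketch shows the coarse-to-fine data flow; it reuses the hypothetical Level, build_level_inputs, and ndxi helpers sketched earlier and is only an approximation of the actual MSDRN, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSDRNSketch(nn.Module):
    """Coarse-to-fine forward pass across the three levels (structural sketch)."""
    def __init__(self):
        super().__init__()
        self.coarse = Level(in_bands=5)    # MS_up + PAN, downsampled twice
        self.medium = Level(in_bands=9)    # + up-convolved coarse fusion
        self.fine   = Level(in_bands=11)   # + NDXI (2 bands) + up-convolved medium fusion
        self.up3 = nn.ConvTranspose2d(4, 4, kernel_size=4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(4, 4, kernel_size=4, stride=2, padding=1)

    def forward(self, ms, pan, ratio=4):
        x1, x2, x3 = build_level_inputs(ms, pan, ratio)            # fine/medium/coarse inputs
        ndxi_up = F.interpolate(ndxi(ms), scale_factor=ratio,
                                mode="bicubic", align_corners=False)
        m3 = self.coarse(x3)                                        # Stage 1
        m2 = self.medium(torch.cat([x2, self.up3(m3)], dim=1))      # Stage 2
        m1 = self.fine(torch.cat([x1, ndxi_up, self.up2(m2)], dim=1))  # Stage 3
        return m1, m2, m3
```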
Except for the up-convolution layers, each level of the network has a similar structure but different parameters, and the number of layers $l$ is set to 11. The detailed structural parameters of the network are listed in
Table 1.
3.3. Training of Network
The goal of pansharpening is to obtain an HRMS image with the same spatial resolution as the PAN image, so it is desired that the spatial resolution of the fused image be as close as possible to that of the PAN image. However, such an ideal reference image does not exist, which hampers both the training of the network and the quality evaluation of the fused image. These problems can be addressed with the Wald protocol [40]. Under the Wald protocol, the original MS and PAN images are first downsampled simultaneously by the ratio of their spatial resolutions, so that the downsampled PAN image has the same spatial resolution as the original MS image. In this case, the original MS image can be used as the reference, and the downsampled MS and PAN images can be used as the inputs of the network. After the network training is completed, the original MS and PAN images are used as the inputs of the network, and the optimized model parameters are used to predict the pansharpened image.
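A simple sketch of generating such a reduced-resolution training pair is given below. Bicubic downsampling is used purely for illustration; practical implementations may instead apply sensor-specific (e.g., MTF-matched) low-pass filters before decimation.

```python
import torch.nn.functional as F

def wald_training_pair(ms, pan, ratio=4):
    """Generate a reduced-resolution training sample following the Wald protocol:
    both source images are downsampled by the MS/PAN resolution ratio, and the
    original MS image serves as the reference."""
    ms_lr  = F.interpolate(ms,  scale_factor=1.0 / ratio, mode="bicubic", align_corners=False)
    pan_lr = F.interpolate(pan, scale_factor=1.0 / ratio, mode="bicubic", align_corners=False)
    return ms_lr, pan_lr, ms   # network inputs (reduced scale) and reference
```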
The proposed network has three different levels of input and output. To ensure that the network is fully trained, a reference image is set at each level of the network. According to the Wald protocol, the reference images from the fine-level network to the coarse-level network are the original MS image, the MS image downsampled at the scale of 2, and the MS image downsampled at the scale of 4, respectively. The mean square error (MSE) is chosen as the loss function. In the reduced-resolution case, if the inputs of the fine-level to coarse-level networks are simplified as $X_1$, $X_2$, and $X_3$, and the corresponding reference images are $R_1$, $R_2$, and $R_3$, respectively, then the loss function of the $k$th level network can be expressed as

$$L_k\left(W_k, b_k\right) = \frac{1}{N} \sum_{i=1}^{N} \left\| f_k\left(X_k^{(i)}; W_k, b_k\right) - R_k^{(i)} \right\|^2$$

where $k =$ 1, 2, 3, and $i$ represents the index of the sample within a batch; $W_k$ and $b_k$ represent the weights and biases of the $k$th level network, respectively; $f_k(\cdot)$ denotes the mapping of the $k$th level network, and $N$ is the training batch size. During training, the loss functions of the three different levels of the network are averaged, i.e., the total loss function is

$$L = \frac{1}{3} \sum_{k=1}^{3} L_k\left(W_k, b_k\right)$$
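In code, the averaged multi-level loss can be sketched as follows; PyTorch's mse_loss is used here, which differs from the formula above only by a constant normalization factor, and the argument names are illustrative.

```python
import torch.nn.functional as F

def total_loss(outputs, references):
    """Average the per-level MSE losses.

    outputs    : (m1_hat, m2_hat, m3_hat) predicted by the fine/medium/coarse levels
    references : (r1, r2, r3) reference MS images at the matching scales
                 (original MS, MS downsampled by 2, MS downsampled by 4)."""
    losses = [F.mse_loss(out, ref) for out, ref in zip(outputs, references)]
    return sum(losses) / len(losses)
```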
The optimal values of the parameters (weights and biases) are obtained by minimizing $L$. The Adam [41] method is used to update the network parameters. If all the parameters in the network are expressed as $\theta$, the update formulas can be expressed as

$$g_t = \nabla_{\theta} L\left(\theta_{t-1}\right)$$
$$m_t = \beta_1 m_{t-1} + \left(1 - \beta_1\right) g_t, \qquad v_t = \beta_2 v_{t-1} + \left(1 - \beta_2\right) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}$$
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $g_t$ represents the gradients at timestep $t$; $m_t$ and $v_t$ represent the biased first moment estimate and the biased second raw moment estimate, respectively; $\hat{m}_t$ and $\hat{v}_t$ represent the bias-corrected first moment estimate and the bias-corrected second raw moment estimate, respectively; $\alpha$ represents the learning rate; and $\epsilon$, $\beta_1$, and $\beta_2$ are usually taken as $10^{-8}$, 0.99, and 0.999, respectively.
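The update above can be expressed compactly in Python, as in the sketch below. The learning rate value is an illustrative placeholder; the remaining hyperparameters follow the values stated above.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.99, beta2=0.999, eps=1e-8):
    """One Adam update, following the equations above.

    theta : current parameters        grad : gradient at timestep t
    m, v  : running (biased) first and second raw moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second raw moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```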