**2. Related Work**

**Notations.** We denote the *low resolution multispectral* (LRMS) image by $MS \in \mathbb{R}^{H \times W \times N}$, where $H$, $W$, and $N$ are the height, the width, and the number of spectral bands of the LRMS image, respectively. We denote the high resolution PAN image by $P \in \mathbb{R}^{rH \times rW}$, where $r$ is the spatial resolution ratio between the MS and PAN images, and denote by $\widehat{MS} \in \mathbb{R}^{rH \times rW \times N}$ the reconstructed HRMS image. We let $MS_k$ represent the $k$th band of the LRMS image, $k = 1, \dots, N$, and let $\widetilde{MS}_k \in \mathbb{R}^{rH \times rW}$ represent the upsampled version of $MS_k$ by ratio $r$. For notational simplicity, we also denote by $P$ the histogram-matched PAN image. Based on these symbols, we next briefly introduce the main ideas of the CS-based, MRA-based, and learning-based methods.

#### *2.1. The CS-Based Methods*

The CS-based methods rest on the assumption that the spatial and spectral information of the LRMS image can be separated by a projection or transformation of the original LRMS image [3,37]. The CS class usually involves four steps: (1) upsample the LRMS image to the size of the PAN image; (2) use a linear transformation to project the upsampled LRMS image into another space; (3) replace the component containing the spatial information with the PAN image; (4) perform an inverse transformation to bring the transformed MS data back to their original space and thus obtain the pansharpened MS image (i.e., the estimated HRMS). Because the substitution changes the low spatial frequencies of the MS image, it usually introduces spectral distortion. Thus, a spectral matching procedure (i.e., histogram matching) is often applied before the substitution.

Mathematically, the above fusion process can be simplified so that the forward and backward transformations need not be computed explicitly, as shown in Figure 1, which gives the CS class the following equivalent form:

$$
\widehat{MS}_k = \widetilde{MS}_k + g_k \left( P - I_L \right), \quad k = 1, \dots, N, \tag{1}
$$

where $g_1, \dots, g_N$ are the injection gains, and $I_L$ is a linear combination of the upsampled LRMS image bands, often called the *intensity component*, defined as

$$I_L = \sum_{k=1}^{N} w_k \widetilde{MS}_k, \tag{2}$$

where $w_1, \dots, w_N$ usually correspond to the first row of the forward transformation matrix and measure the degree of spectral overlap between the MS and PAN channels.
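To make the injection form concrete, the following is a minimal NumPy sketch of Equations (1) and (2); the function name `cs_fuse`, the array layout, and the uniform handling of weights and gains are our own illustrative choices rather than part of any specific CS method.

```python
import numpy as np

def cs_fuse(ms_up, pan, weights, gains):
    """Generic CS fusion per Eqs. (1)-(2).

    ms_up:   upsampled LRMS cube, shape (rH, rW, N)
    pan:     histogram-matched PAN image, shape (rH, rW)
    weights: length-N array of w_k defining the intensity component
    gains:   length-N array of injection gains g_k
    """
    # Intensity component I_L: weighted sum of the upsampled MS bands (Eq. (2))
    i_l = np.tensordot(ms_up, weights, axes=([2], [0]))
    # Inject the PAN detail (P - I_L) into each band with its gain g_k (Eq. (1))
    detail = pan - i_l
    return ms_up + gains[None, None, :] * detail[:, :, None]
```

With uniform weights $w_k = 1/N$ and unit gains, this reduces to a GIHS-like additive scheme; other CS methods correspond to different choices of `weights` and `gains`.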

**Figure 1.** Flowchart of the CS-based methods for pansharpening.

Numerous CS-based methods have been proposed to sharpen LRMS images according to Equation (1) and the flowchart in Figure 1. The CS class includes IHS [9], which exploits the transformation into the IHS color space, and its generalized version GIHS [10]; PCA [11], based on the statistical decorrelation of the principal components; Brovey [12], based on a multiplicative injection scheme; *Gram-Schmidt* (GS) [13], which conducts the Gram-Schmidt orthogonalization procedure, and its *adaptive* version (GSA) [15], which computes the intensity component as a weighted average of the MS bands minimizing the *mean square error* (MSE) with respect to a low-pass filtered version of the PAN image; *band-dependent spatial detail* (BDSD) [14] and its enhanced version (i.e., *BDSD with physical constraints*: BDSD-PC) [16]; *partial replacement adaptive component substitution* (PRACS) [17], based on the concept of *partial replacement* of the intensity component; and so on. Each method differs from the others in the projection of the MS images used in the process and in the design of the injection gains. Although they perform excellently in improving the spatial quality of LRMS images, they often suffer from severe spectral distortion in some scenarios, due to local dissimilarity or to imperfect separation of the spatial structure from the spectral information. Refer to [3] for more detailed discussions.

#### *2.2. The MRA-Based Methods*

Unlike the CS-based methods, the MRA class is based on applying a multi-scale decomposition or a low-pass filter (equivalent to a single-scale decomposition) to the PAN image [3,37]. These methods first extract spatial details over a wide range of scales from the high resolution PAN image, or from the difference between the PAN image and its low-pass filtered version $P_L$, and then inject the extracted details into each band of the upsampled LRMS image. Figure 2 shows the general flowchart of the MRA-based methods.

Generally, for each band $k = 1, 2, \dots, N$, the MRA-based methods can be formulated as

$$
\widehat{MS}_k = \widetilde{MS}_k + g_k \left( P - P_L \right). \tag{3}
$$

As can be seen from Equation (3), different MRA-based methods are distinguished by the way $P_L$ is obtained and by the design of the injection gains $g_1, g_2, \dots, g_N$. Several methods belonging to this class have been proposed, such as HPF [18], which uses a box mask and additive injection; SFIM [19]; Indusion [23], based on the decimated wavelet transform with an additive injection model; AWLP [24]; GLP with a *modulation transfer function* (MTF)-matched filter (denoted by MTF-GLP) [21], along with its HPM injection version (MTF-GLP-HPM) [22] and context-based decision version (MTF-GLP-CBD) [38]; the *à trous wavelet transform using Model 3* (ATWT-TM3) [39]; and so on.

The MRA-based methods emphasize the extraction of multi-scale and local details from the PAN image; they perform well in reducing spectral distortion but may compromise spatial enhancement. To mitigate this problem, many approaches have been proposed that use different decomposition schemes (e.g., *morphological filters* [40]) and optimize the injection gains.
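As a counterpart to the CS sketch above, here is a minimal sketch of the generic MRA injection in Equation (3), with $P_L$ approximated by a Gaussian low-pass filter; actual methods such as MTF-GLP would instead use a sensor-matched (MTF) filter and possibly several decomposition levels, so both the filter choice and the `sigma` parameter here are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mra_fuse(ms_up, pan, gains, sigma=2.0):
    """Generic MRA fusion per Eq. (3).

    ms_up: upsampled LRMS cube, shape (rH, rW, N)
    pan:   PAN image, shape (rH, rW)
    gains: length-N array of injection gains g_k
    sigma: width of the illustrative Gaussian low-pass filter
    """
    p_low = gaussian_filter(pan, sigma=sigma)  # low-pass version P_L
    detail = pan - p_low                       # extracted spatial details
    return ms_up + gains[None, None, :] * detail[:, :, None]
```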

**Figure 2.** Flowchart of the MRA-based methods for pansharpening.

#### *2.3. The Learning-Based Methods*

Apart from the traditional CS-based and MRA-based methods, learning-based methods have been proposed for or applied to pansharpening, among which the CNN-based methods are the most popular [41]. The CNN-based methods are very flexible, since one can design a CNN with different architectures. Owing to their end-to-end and data-driven properties, they achieve state-of-the-art performance in some studies [32–35]. After a network architecture is designed, training image pairs, with *low resolution* MS (LRMS) images as network input and *high resolution* MS (HRMS) images as network output, are needed to learn the network parameters $\theta$. The learning procedure depends on the choice of loss function and optimization method, and different choices lead to different learned models. However, such ideal image pairs are unavailable in practice; they are usually simulated under the scale-invariance assumption by properly downsampling both the PAN and the original MS images to a reduced resolution. The resolution-reduced MS images and the original MS images can then be used as input-output pairs, as sketched below.
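The following sketch illustrates such reduced-resolution pair simulation; the Gaussian blur stands in for a sensor-matched (MTF) filter, and the function name and default ratio are hypothetical choices of ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def simulate_training_pair(ms, pan, r=4):
    """Simulate a reduced-resolution training pair under the
    scale-invariance assumption: blur and decimate both images by
    the ratio r, so the original MS serves as the HRMS reference."""
    ms_lr = zoom(gaussian_filter(ms, sigma=(r / 2, r / 2, 0)), (1 / r, 1 / r, 1))
    pan_lr = zoom(gaussian_filter(pan, sigma=r / 2), 1 / r)
    return pan_lr, ms_lr, ms  # network inputs (PAN, LRMS) and target HRMS
```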

Given input-output MS pairs consisting of the low resolution $\widetilde{MS}^i$ and the high resolution $Y^i$, and aided by the auxiliary PAN image $P^i$, $i = 1, 2, \dots, n$, the CNN-based methods optimize the parameters by minimizing the following cost function:

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \left\| f\left(P^i, \widetilde{MS}^i; \theta\right) - Y^i \right\|_F^2, \tag{4}$$

where $f(P^i, \widetilde{MS}^i; \theta)$ denotes a neural network with parameters $\theta$, and $\|\cdot\|_F$ is the Frobenius norm, defined as the square root of the sum of the absolute squares of the elements.
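As an illustration of minimizing Equation (4), here is a minimal PyTorch-style training step; the network `f`, the optimizer, and the variable names are generic placeholders, not the architecture proposed in this paper.

```python
import torch

def train_step(f, optimizer, pan, ms_up, hrms):
    """One gradient step on the Eq. (4) loss for a mini-batch.

    f:      any network mapping (PAN, upsampled MS) to an HRMS estimate
    pan:    batch of PAN images, shape (B, 1, rH, rW)
    ms_up:  batch of upsampled LRMS images, shape (B, N, rH, rW)
    hrms:   batch of reference HRMS images, shape (B, N, rH, rW)
    """
    optimizer.zero_grad()
    pred = f(pan, ms_up)               # f(P^i, MS^i; theta)
    loss = ((pred - hrms) ** 2).sum()  # squared Frobenius norm, summed
    loss.backward()
    optimizer.step()
    return loss.item()
```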

To further improve the performance of CNN-based methods, recent work mainly resorts to deep residual architectures [42] or to increasing the depth of the model to extract multi-level abstract features [43]. However, this requires a large number of network parameters and a heavy computational burden [36]. Unlike the above CNN-based methods, which aim at generating the HRMS images or their residual images, we reduce the number of parameters and the computational requirements by learning weight maps for the CS-based and MRA-based methods. Refer to the following section for more detailed discussions.

#### **3. The Proposed PWNet Method**

#### *3.1. Motivation and Main Idea*

According to the above analysis, the CS-based and MRA-based methods are simple and usually have complementary performances: the CS-based methods are good at spatial rendering but sometimes suffer from severe spectral distortions, while the MRA-based methods perform well in preserving the spectral information of the MS images but may offer limited spatial enhancement. Moreover, the performances of the CS-based and MRA-based methods show data uncertainty, i.e., their fusion quality varies across scenarios. The learning-based methods, especially the CNN-based methods, perform well in reducing spatial and spectral distortions due to their powerful feature extraction capabilities and data-driven training scheme. However, they usually need an extremely large data set to train the model parameters and are difficult to interpret.

Is there a way to make full use of the complementary performances of the CS-based and MRA-based methods while at the same time reducing their data uncertainty? A straightforward idea is to first generate multiple fusion results with multiple methods (i.e., the CS-based and MRA-based methods), and then automatically combine them with weights reflecting their performance in different scenarios to boost the fusion result. This may be realized with a trainable CNN, since it is data-driven and has strong capabilities in the field of image processing.

Motivated by the above, we propose a novel model averaging method, referred to as the *pansharpening weight network* (PWNet). Specifically, rather than generating only one estimated HRMS image at a time, we use multiple inference modules to generate distinct estimated HRMS images simultaneously. Each inference module produces a distinct, biased estimate of the HRMS image, and the deviations of the multiple estimates are both positive and negative. The biases can therefore be compensated by averaging the multiple results, so that the distortion of the average is smaller than that of any single estimate. To exploit the simplicity and complementary characteristics of the CS-based and MRA-based methods, we choose them as inference modules, i.e., we use each CS-based or MRA-based method as an inference module, and then design an end-to-end trainable network that takes the original MS and PAN images as input and simultaneously outputs weight maps for all fusion results produced by the CS and MRA inference modules. Thanks to the powerful capability and data-driven training scheme of the neural network, the output weight maps are context and method dependent. Finally, we obtain an estimated HRMS image by adaptively averaging all the fusion results obtained by the CS-based and MRA-based methods. Figure 3 depicts the main procedures of the proposed PWNet for pansharpening.
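To convey the averaging step, the following is an illustrative sketch of combining a stack of CS/MRA fusion results with per-pixel weight maps; the softmax normalization and the array shapes are our assumptions for the sketch, not necessarily the normalization used by PWNet.

```python
import numpy as np

def weighted_average_fusion(fusion_stack, weight_maps):
    """Adaptively average M fusion results with learned weight maps.

    fusion_stack: shape (M, rH, rW, N), HRMS estimates from M CS/MRA modules
    weight_maps:  shape (M, rH, rW), raw per-pixel scores from the network
    """
    # Normalize the scores across methods at every pixel (softmax assumption)
    w = np.exp(weight_maps - weight_maps.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    # Broadcast over the spectral dimension and average per pixel
    return (w[..., None] * fusion_stack).sum(axis=0)
```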
