**2. Related Work**

Over the past decades, many pansharpening methods have been put forward. These methods fall into four categories [23]: component substitution (CS)-based methods, multiresolution analysis (MRA)-based methods, variational optimization (VO)-based methods, and deep learning (DL)-based methods. In CS-based methods, the LR MS image is decomposed into spectral and spatial components, the spatial components are replaced by the histogram-matched PAN image, and the HR MS image is then obtained by inverse transformation. Widely known CS-based methods include intensity-hue-saturation (IHS) [24,25], principal component analysis (PCA) [26], Gram–Schmidt (GS) transformation [27], adaptive GS (GSA) [28], partial replacement adaptive CS (PRACS) [29], and band-dependent spatial details (BDSD) [30]. CS-based methods are simple and efficient and can directly extract spatial information from PAN images. However, spectral distortion is prone to occur during pansharpening when the correlation between the PAN and MS images is low [31].

MRA-based methods typically consist of three steps: (1) the upsampled MS image and the PAN image are decomposed into multiple scales; (2) the decompositions are fused at every scale; (3) an inverse transform is applied to obtain the reconstructed image. Typical multiscale decompositions include the discrete wavelet transform (DWT) [32], the Laplacian pyramid (LP) [33], the generalized Laplacian pyramid (GLP) [34], and the contourlet transform [35]. MRA-based methods are generally superior to CS-based methods in spectral fidelity; however, the multiscale transformation incurs a large amount of computation and may lead to spatial distortion [36].
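The CS pipeline described above (decompose, substitute, inverse-transform) can be sketched in a few lines. The following is an illustrative IHS-style toy, not any of the cited implementations: the intensity component is assumed to be the band mean, and histogram matching is approximated by mean/standard-deviation normalization.

```python
import numpy as np

def ihs_pansharpen(ms_up, pan):
    """Minimal IHS-style component substitution (illustrative sketch).

    ms_up : (H, W, B) upsampled LR MS image, float in [0, 1]
    pan   : (H, W)    PAN image, float in [0, 1]
    """
    # 1. Decompose: take the intensity component as the band mean.
    intensity = ms_up.mean(axis=2)
    # 2. Histogram-match PAN to the intensity component (mean/std matching).
    pan_matched = (pan - pan.mean()) / (pan.std() + 1e-8) * intensity.std() \
        + intensity.mean()
    # 3. Substitute: inject the intensity difference into every spectral band
    #    (equivalent to the inverse IHS transform with a replaced intensity).
    detail = pan_matched - intensity
    return np.clip(ms_up + detail[..., None], 0.0, 1.0)
```

When the PAN image equals the intensity component, the injected detail vanishes and the MS image passes through unchanged, which is the sanity check for any CS scheme.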
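The three MRA steps can likewise be sketched with an undecimated (à-trous-style) multiscale decomposition on a single band. A box blur stands in for the pyramid kernels of the cited transforms, so this is a simplified illustration rather than a faithful LP/GLP implementation.

```python
import numpy as np

def box_blur(img, k=5):
    """Separable box filter: a crude low-pass stand-in for a pyramid kernel."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    kern = np.ones(k) / k
    out = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kern, mode="valid"), 0, out)

def mra_fuse(ms_band, pan, levels=3):
    """Single-band MRA fusion sketch: decompose, fuse per scale, reconstruct."""
    # (1) Decompose each input into detail (band-pass) layers plus a residual.
    def decompose(img):
        details, cur = [], img
        for _ in range(levels):
            low = box_blur(cur)
            details.append(cur - low)   # detail at this scale
            cur = low
        return details, cur             # cur = coarsest approximation
    _, low_ms = decompose(ms_band)
    d_pan, _ = decompose(pan)
    # (2) Fuse at every scale: take the spatial detail from the PAN image.
    # (3) Inverse transform: sum the MS residual and the fused details.
    return low_ms + sum(d_pan)
```

Because the decomposition telescopes, reconstructing with an image's own details returns that image exactly, mirroring the perfect-reconstruction property of the real transforms.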

To balance spectral and spatial distortion, VO-based methods were proposed. They cast pansharpening as an optimization problem, whose solution hinges on the design of the energy function and the choice of optimization algorithm [37]. This category was developed in the 1990s [38]. Since Ballester et al. [39] proposed the pioneering variational method for pansharpening, VO-based methods have attracted increasing attention and developed rapidly. Model-based methods [40,41] and sparsity-based methods [42,43] are two representative classes. Although VO-based methods can produce high-quality fusion results, their optimization is time-consuming [44].
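As a concrete illustration of such an energy function (a toy example of my own construction, not any cited model), one can fit a fused band x to the MS observation m while pushing its gradients toward those of the PAN image p, and minimize by plain gradient descent:

```python
import numpy as np

def grad(img):
    """Forward-difference spatial gradient with zero boundary."""
    gx, gy = np.zeros_like(img), np.zeros_like(img)
    gx[:, :-1] = img[:, 1:] - img[:, :-1]
    gy[:-1, :] = img[1:, :] - img[:-1, :]
    return gx, gy

def grad_t(gx, gy):
    """Adjoint of the forward-difference gradient operator."""
    out = np.zeros_like(gx)
    out[:, :-1] -= gx[:, :-1]; out[:, 1:] += gx[:, :-1]
    out[:-1, :] -= gy[:-1, :]; out[1:, :] += gy[:-1, :]
    return out

def vo_pansharpen_band(m, p, lam=1.0, step=0.1, iters=300):
    """Minimize E(x) = 0.5||x - m||^2 + 0.5*lam*||grad x - grad p||^2."""
    px, py = grad(p)
    x = m.astype(float).copy()
    for _ in range(iters):
        gx, gy = grad(x)
        # Gradient of E: data term plus adjoint-transported gradient term.
        x -= step * ((x - m) + lam * grad_t(gx - px, gy - py))
    return x
```

Even on this toy energy, hundreds of iterations are needed per band, which hints at why VO-based optimization is reported to be time-consuming [44].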

In recent years, with the rapid development of DL-based image super-resolution, many DL-based pansharpening methods have been put forward, which greatly improve the performance and efficiency of pansharpening. Masi et al. [6] first introduced a simple three-layer CNN into pansharpening. Inspired by VDSR [20], Wei et al. [8] proposed a deep residual neural network with 11 convolutional layers for pansharpening. Yang et al. [7] proposed a deep network architecture for pansharpening called PanNet, which trains the network in the high-pass domain to preserve spatial structure. To capture multiscale detailed information, Yuan et al. [12] introduced multiscale feature extraction and residual learning into a CNN for pansharpening. These pansharpening methods operate at the pixel level. Differently, Liu et al. [13] presented a two-stream fusion network (TFNet) that fuses PAN and MS images at the feature level and reconstructs the pansharpened image from the fused features. To take full advantage of gradient characteristics in pansharpening, Lai et al. [45] utilized gradient information to guide the pansharpening process. The above methods simply concatenate the upsampled MS image and the PAN image, feed them into the network, and directly learn the mapping between the input and the HR MS image, which results in spatial distortion. To solve this problem, Wang et al. [15] integrated a multiscale U-shaped CNN into pansharpening to make full use of multispectral information. To explore the intra-image characteristics and the inter-image correlation concurrently, Guan et al. [46] proposed a three-stream network to fully extract the valuable information encoded in the HR PAN images and LR hyperspectral images.
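PanNet's idea of training in the high-pass domain amounts to feeding the network detail components rather than raw images. A minimal sketch of such a high-pass extraction follows; the box blur used as the low-pass filter here is an assumption for illustration, not PanNet's exact filter.

```python
import numpy as np

def highpass(img, k=5):
    """High-pass component: image minus a box-blurred low-pass version."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    kern = np.ones(k) / k
    # Separable blur: filter rows, then columns.
    low = np.apply_along_axis(lambda r: np.convolve(r, kern, mode="valid"), 1, padded)
    low = np.apply_along_axis(lambda c: np.convolve(c, kern, mode="valid"), 0, low)
    return img - low
```

Smooth spectral content is removed (a constant image maps to zero), so the network sees only structural detail, which is the property PanNet exploits to preserve spatial structure.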

As mentioned above, although most current DL-based methods have significantly improved fusion performance, they do not fully exploit the complementary advantages of MS and PAN images. Unlike the above methods, we consider the interaction and mutual influence of spectral and spatial information on the basis of a dual-branch structure, and employ both spatial and spectral attention mechanisms in the SSA module to take full advantage of the complementary information in the MS and PAN images.
