*3.1. Network Architecture*

Figure 1 shows the network architecture of SSIN, which consists of three main parts: spectral-spatial information extraction, spectral-spatial information interaction, and spectral-spatial information fusion. Specifically, we use a single convolutional layer to extract information and then stack information interaction groups (IIGs) to achieve the information interaction. Finally, we use the IF module to fuse the spectral and spatial information.

The input LRMS image and PAN image are denoted as *IMS* ∈ R*h*×*w*×*c* and *IPAN* ∈ R*H*×*W*×1, respectively. The spectral and spatial information are first extracted by separate single convolutional layers.

$$I\_{spe}^{0} = f\_{spe}(f\_{up}(I\_{MS})),\quad I\_{spa}^{0} = f\_{spa}(I\_{PAN}) \tag{1}$$

where *I*<sup>0</sup><sub>spe</sub> ∈ R*H*×*W*×*B* and *I*<sup>0</sup><sub>spa</sub> ∈ R*H*×*W*×*B* represent the extracted spectral and spatial information, respectively. *fspe*(·) and *fspa*(·) are 3 × 3 convolutional layers used to extract the spectral and spatial information from the input images *IMS* and *IPAN*, and *fup*(·) represents bicubic interpolation.
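The extraction step in Eq. (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the channel count *B* and the image sizes are assumed values, and the convolution hyperparameters beyond the stated 3 × 3 kernel are guesses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed sizes: c LRMS bands, B feature channels, (h, w) -> (H, W) upsampling.
c, B = 4, 32
h, w, H, W = 16, 16, 64, 64

f_spe = nn.Conv2d(c, B, kernel_size=3, padding=1)  # 3x3 conv, spectral branch
f_spa = nn.Conv2d(1, B, kernel_size=3, padding=1)  # 3x3 conv, spatial branch

I_MS = torch.randn(1, c, h, w)    # LRMS image
I_PAN = torch.randn(1, 1, H, W)   # PAN image

# f_up: bicubic interpolation of the LRMS image to the PAN resolution
I_up = F.interpolate(I_MS, size=(H, W), mode="bicubic", align_corners=False)
I_spe0 = f_spe(I_up)    # I^0_spe, shape (1, B, H, W)
I_spa0 = f_spa(I_PAN)   # I^0_spa, shape (1, B, H, W)
print(I_spe0.shape, I_spa0.shape)
```

Both branches produce *B*-channel feature maps at the PAN resolution, so they can be exchanged and fused element-wise in the later stages.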

**Figure 1.** The architecture of the proposed SSIN.

Then, the extracted spectral and spatial information are fed into a series of IIGs to realize the spectral-spatial information interaction, which can be formulated as follows:

$$\left(I\_{spe}^{n}, I\_{spa}^{n}\right) = f\_{IIG}^{n}\left(I\_{spe}^{n-1}, I\_{spa}^{n-1}\right),\quad n = 1, 2, \ldots, N \tag{2}$$

where *f*<sup>*n*</sup><sub>IIG</sub>(·) stands for the *n*th IIG, and *I*<sup>*n*</sup><sub>spe</sub>, *I*<sup>*n*</sup><sub>spa</sub> represent the output spectral and spatial information of the *n*th IIG, respectively. *N* denotes the total number of IIGs.
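The recursion in Eq. (2) can be sketched as below. The internal design of the IIG is not specified in this section, so the body here is a placeholder assumption (each branch refined from the concatenation of both branches); only the input/output signature matches Eq. (2).

```python
import torch
import torch.nn as nn

class IIG(nn.Module):
    """Placeholder information interaction group. The internals are an
    assumption, not the paper's design: each branch is updated from the
    concatenation of both branches' features."""
    def __init__(self, B):
        super().__init__()
        self.conv_spe = nn.Conv2d(2 * B, B, 3, padding=1)
        self.conv_spa = nn.Conv2d(2 * B, B, 3, padding=1)

    def forward(self, I_spe, I_spa):
        cat = torch.cat([I_spe, I_spa], dim=1)  # cross-branch interaction
        return self.conv_spe(cat), self.conv_spa(cat)

B, N, H, W = 32, 4, 64, 64                      # assumed sizes
iigs = nn.ModuleList(IIG(B) for _ in range(N))

I_spe, I_spa = torch.randn(1, B, H, W), torch.randn(1, B, H, W)
outs_spe, outs_spa = [], []
for iig in iigs:                                # Eq. (2): n = 1, ..., N
    I_spe, I_spa = iig(I_spe, I_spa)
    outs_spe.append(I_spe)                      # kept for the later fusion step
    outs_spa.append(I_spa)
print(len(outs_spe), I_spe.shape)
```

Keeping every intermediate output (rather than only the last one) is what allows the later fusion step to use the information interacted at all *N* stages.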

Inspired by [47], we cascade all of these IIGs to fully exploit the information interacted at different stages. The spectral and spatial information extracted by each IIG are then concatenated and fed into an information fusion (IF) module to fuse the spectral-spatial information. Simultaneously, we feed the initial spectral information *I*<sup>0</sup><sub>spe</sub> and spatial information *I*<sup>0</sup><sub>spa</sub> into the IF module to preserve the original information. Finally, to maintain the integrity of the spectral information, we add the upsampled LRMS image to the fused information, forming a global residual connection. This process can be expressed as:

$$I\_{fuse} = f\_{IF}\left(\left[I\_{spe}^{1}, \ldots, I\_{spe}^{N}, I\_{spa}^{1}, \ldots, I\_{spa}^{N}, I\_{spe}^{0}, I\_{spa}^{0}\right]\right) \tag{3}$$

$$I\_{SRMS} = I\_{fuse} + f\_{up}(I\_{MS}) \tag{4}$$

where *fIF*(·) and [·] denote the IF module and the concatenation operation, respectively. *Ifuse* represents the information fused by the IF module, and *ISRMS* indicates the pansharpened MS image.
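Eqs. (3) and (4) can be sketched as follows. The IF module body here is an assumption (a single 1 × 1 convolution mapping the concatenated features back to *c* spectral bands); the paper's actual IF module may differ, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed sizes: c output bands, B channels per branch, N IIGs.
c, B, N, H, W = 4, 32, 4, 64, 64

# Stand-ins for the per-IIG outputs I^n_spe, I^n_spa and the initial
# features I^0_spe, I^0_spa produced by the earlier stages.
outs_spe = [torch.randn(1, B, H, W) for _ in range(N)]
outs_spa = [torch.randn(1, B, H, W) for _ in range(N)]
I_spe0 = torch.randn(1, B, H, W)
I_spa0 = torch.randn(1, B, H, W)
I_MS = torch.randn(1, c, 16, 16)   # LRMS image

# Assumed IF module: 1x1 conv over the (2N + 2) * B concatenated channels.
f_IF = nn.Conv2d((2 * N + 2) * B, c, kernel_size=1)

# Eq. (3): concatenate all stage outputs plus the initial features, then fuse.
I_fuse = f_IF(torch.cat(outs_spe + outs_spa + [I_spe0, I_spa0], dim=1))

# Eq. (4): global residual with the bicubically upsampled LRMS image.
I_up = F.interpolate(I_MS, size=(H, W), mode="bicubic", align_corners=False)
I_SRMS = I_fuse + I_up
print(I_SRMS.shape)  # pansharpened MS image, (1, c, H, W)
```

Because the network only predicts the residual on top of the upsampled LRMS image, the spectral content of the input is carried to the output by the skip connection rather than having to be relearned.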
