*3.5. Information Fusion Module*

The main purpose of the information fusion (IF) module is to fuse the interacted spectral-spatial information to reconstruct an HR-MS image $I_{SRMS}$. The schematic of the IF module is shown in Figure 5.

**Figure 5.** Schematic diagram of the information fusion module. "⊕" denotes elementwise addition, and "⊗" denotes matrix multiplication.

To fully use the information interacted at different stages, we take as input the concatenation of the spectral and spatial information extracted by each IIG in each branch. First, the concatenated inputs $[I_{spe}^1, \dots, I_{spe}^N] \in \mathbb{R}^{H \times W \times NB}$ and $[I_{spa}^1, \dots, I_{spa}^N] \in \mathbb{R}^{H \times W \times NB}$ are each fed to a 1 × 1 convolution that squeezes the number of channels. The squeezed features are then added to the initially extracted spectral information $I_{spe}^0$ and spatial information $I_{spa}^0$ to generate the feature maps $F_{spe} \in \mathbb{R}^{H \times W \times B}$ and $F_{spa} \in \mathbb{R}^{H \times W \times B}$, respectively. Next, we concatenate the feature maps of the two branches and apply another 1 × 1 convolution to squeeze the channels again, generating $F_{fuse} \in \mathbb{R}^{H \times W \times B}$. This process can be described as:

$$F_{spe} = H_{1 \times 1}\left(\left[I_{spe}^1, \dots, I_{spe}^N\right]\right) + I_{spe}^0 \tag{14}$$

$$F_{spa} = H_{1 \times 1}\left(\left[I_{spa}^1, \dots, I_{spa}^N\right]\right) + I_{spa}^0 \tag{15}$$

$$F_{fuse} = H_{1 \times 1}\left(\left[F_{spe}, F_{spa}\right]\right) \tag{16}$$
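The channel bookkeeping in Equations (14) to (16) can be illustrated with a minimal NumPy sketch. It assumes a channels-last layout of shape (H, W, C), treats each 1 × 1 convolution as a per-pixel linear map without bias, and uses random weights and toy sizes; the names `conv1x1` and `squeeze_and_fuse` are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def conv1x1(x, weight):
    """1 x 1 convolution on an (H, W, C_in) array, i.e. a per-pixel linear map.

    weight has shape (C_in, C_out); bias is omitted (assumption).
    """
    return np.einsum("hwc,cd->hwd", x, weight)

def squeeze_and_fuse(spe_feats, spa_feats, spe0, spa0, w_spe, w_spa, w_fuse):
    # Eq. (14): concatenate the N spectral outputs (N*B channels),
    # squeeze to B channels, then add the initial spectral features.
    f_spe = conv1x1(np.concatenate(spe_feats, axis=-1), w_spe) + spe0
    # Eq. (15): same steps for the spatial branch.
    f_spa = conv1x1(np.concatenate(spa_feats, axis=-1), w_spa) + spa0
    # Eq. (16): concatenate both branches (2*B channels) and squeeze to B.
    f_fuse = conv1x1(np.concatenate([f_spe, f_spa], axis=-1), w_fuse)
    return f_spe, f_spa, f_fuse

# Toy sizes, chosen only for illustration.
H, W, B, N = 4, 5, 8, 3
rng = np.random.default_rng(0)
spe_feats = [rng.standard_normal((H, W, B)) for _ in range(N)]
spa_feats = [rng.standard_normal((H, W, B)) for _ in range(N)]
spe0 = rng.standard_normal((H, W, B))
spa0 = rng.standard_normal((H, W, B))
w_spe = 0.1 * rng.standard_normal((N * B, B))
w_spa = 0.1 * rng.standard_normal((N * B, B))
w_fuse = 0.1 * rng.standard_normal((2 * B, B))

f_spe, f_spa, f_fuse = squeeze_and_fuse(
    spe_feats, spa_feats, spe0, spa0, w_spe, w_spa, w_fuse
)
print(f_spe.shape, f_spa.shape, f_fuse.shape)  # (4, 5, 8) (4, 5, 8) (4, 5, 8)
```

The sketch only tracks shapes and the residual additions; in a trained network the three weight matrices would be learned parameters.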

Inspired by [51], we adopt a pixel attention (PA) block at the end of the IF module to incorporate the spectral and spatial information. The block consists of two convolution layers with a PA layer between them. The PA layer obtains its attention map using only a 1 × 1 convolution and a Sigmoid function, and the map is then used to weight the input features. This design can effectively improve the final performance at a low parameter cost [51], which is validated by the ablation study in Section 4.4. Denoting the PA block as $f_{PA}(\cdot)$, we further feed $F_{fuse}$ into it:

$$I_{fuse}^{\prime} = f_{PA}\left(F_{fuse}\right) \tag{17}$$
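The PA block described above can be sketched in NumPy as follows. This is a rough illustration under several assumptions that the text does not specify: the two surrounding convolutions are stood in for by bias-free 1 × 1 convolutions (their actual kernel sizes are not stated here), the attention map has the same number of channels as its input, and the weighting is an elementwise product. The function names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1 x 1 convolution on an (H, W, C_in) array; weight is (C_in, C_out)."""
    return np.einsum("hwc,cd->hwd", x, weight)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pixel_attention(x, w_att):
    # Attention map from a 1 x 1 convolution followed by a Sigmoid:
    # one weight in (0, 1) per pixel and per channel (assumed layout).
    att = sigmoid(conv1x1(x, w_att))
    # Weight the input features elementwise with the attention map.
    return x * att

def pa_block(x, w_in, w_att, w_out):
    # conv -> PA layer -> conv; both convs are 1 x 1 stand-ins (assumption).
    y = conv1x1(x, w_in)
    y = pixel_attention(y, w_att)
    return conv1x1(y, w_out)

# Toy input standing in for the fused feature map of shape (H, W, B).
H, W, B = 4, 5, 8
rng = np.random.default_rng(0)
f_fuse = rng.standard_normal((H, W, B))
w_in = 0.1 * rng.standard_normal((B, B))
w_att = 0.1 * rng.standard_normal((B, B))
w_out = 0.1 * rng.standard_normal((B, B))

i_fuse_prime = pa_block(f_fuse, w_in, w_att, w_out)
print(i_fuse_prime.shape)  # (4, 5, 8)
```

Because the Sigmoid output lies strictly between 0 and 1, the PA layer can only attenuate each feature value, never amplify it or flip its sign; the extra parameters are limited to the single 1 × 1 convolution that produces the map.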

Finally, to match the channel number of the input MS image, a 3 × 3 convolutional layer is used to generate the final fusion result:

$$I_{fuse} = H_{3 \times 3}\left(I_{fuse}^{\prime}\right). \tag{18}$$
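As an illustration of this last step, the following NumPy sketch applies a 3 × 3 convolution that maps B feature channels to the number of MS bands. It assumes zero padding so that the spatial size H × W is preserved, omits the bias, and uses a hypothetical band count of 4 with random weights; `conv3x3_same` is an illustrative helper, not the original implementation.

```python
import numpy as np

def conv3x3_same(x, weight):
    """3 x 3 convolution with zero padding on an (H, W, C_in) array.

    weight has shape (3, 3, C_in, C_out). As in common deep learning
    frameworks, this is a cross-correlation (the kernel is not flipped).
    """
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero-pad height and width only
    out = np.zeros((H, W, weight.shape[-1]))
    for di in range(3):
        for dj in range(3):
            # Shifted view of the padded input times the kernel tap (di, dj).
            patch = xp[di:di + H, dj:dj + W, :]
            out += np.einsum("hwc,cd->hwd", patch, weight[di, dj])
    return out

# Toy sizes: B feature channels mapped to C output bands (C = 4 is assumed).
H, W, B, C = 6, 7, 8, 4
rng = np.random.default_rng(0)
i_fuse_prime = rng.standard_normal((H, W, B))
w_out = 0.1 * rng.standard_normal((3, 3, B, C))

i_fuse = conv3x3_same(i_fuse_prime, w_out)
print(i_fuse.shape)  # (6, 7, 4)
```

Only the channel dimension changes in this step, from B to the band count of the MS input, while the spatial resolution stays the same under the assumed padding.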
