Article

Multiscale Spatial–Spectral Dense Residual Attention Fusion Network for Spectral Reconstruction from Multispectral Images

1 Aerospace Information Research Institute (AIR), Chinese Academy of Sciences (CAS), Beijing 100094, China
2 School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 College of Computer and Control Engineering, Qiqihar University, Qiqihar 161000, China
4 Heilongjiang Key Laboratory of Big Data Network Security Detection and Analysis, Qiqihar University, Qiqihar 161000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(3), 456; https://doi.org/10.3390/rs17030456
Submission received: 16 October 2024 / Revised: 25 January 2025 / Accepted: 26 January 2025 / Published: 29 January 2025

Abstract
Spectral reconstruction (SR) from multispectral images (MSIs) is a crucial task in remote sensing image processing, aiming to enhance the spectral resolution of MSIs to produce hyperspectral images (HSIs). However, most existing deep learning-based SR methods primarily focus on deeper network architectures, often overlooking the importance of extracting multiscale spatial and spectral features in the MSIs. To bridge this gap, this paper proposes a multiscale spatial–spectral dense residual attention fusion network (MS2Net) for SR. Specifically, considering the multiscale nature of the land-cover types in the MSIs, a three-dimensional multiscale hierarchical residual module is designed and embedded in the head of the proposed MS2Net to extract spatial and spectral multiscale features. Subsequently, we employ a two-pathway architecture to extract deep spatial and spectral features. Both pathways are constructed with a single-shot dense residual module for efficient feature learning and a residual composite soft attention module to enhance salient spatial and spectral features. Finally, the spatial and spectral features extracted from the different pathways are integrated using an adaptive weighted feature fusion module to reconstruct HSIs. Extensive experiments on both simulated and real-world datasets demonstrate that the proposed MS2Net achieves superior performance compared to state-of-the-art SR methods. Moreover, classification experiments on the reconstructed HSIs show that the proposed MS2Net-reconstructed HSIs achieve classification accuracy that is comparable to that of real HSIs.

1. Introduction

Hyperspectral remote sensing has attracted much attention due to its ability to capture continuous narrow-band spectral signatures, generally from 400 to 2500 nm, covering the visible to the short-wave infrared [1,2]. The captured hyperspectral image (HSI) products, which contain rich spectral information, play a crucial role in a variety of applications, such as agricultural analysis [3], environmental monitoring [4], and mineral exploration [5]. Despite their immense potential, acquiring HSIs remains costly and technically challenging due to hardware constraints, which limits their widespread application. A practical alternative is to use multispectral images (MSIs), which are more accessible and easier to obtain. However, MSIs suffer from lower spectral resolution, which limits their ability to capture fine spectral details and to distinguish subtle variations in land-cover characteristics effectively. To address this limitation, this study focuses on enhancing the spectral resolution of MSIs to generate HSIs using the spectral reconstruction (SR) technique.
Generally, SR is an ill-posed problem, since the reconstructed HSIs are not unique [6]. Numerous SR methods have been proposed over the past decade to mitigate this issue. In earlier studies, researchers preferred sparse representation-based methods to reconstruct HSIs from MSIs. These methods first learn an HSI dictionary and then calculate the sparse coefficients of the MSIs; the learned dictionary is then combined with the sparse coefficients to reconstruct the HSIs [7,8,9,10,11,12,13]. However, the HSI dictionary does not accurately represent the nonlinear spectral signatures of complex objects. Additionally, these sparse representation-based methods are sensitive to parameter settings and lack adaptive parameter selection capability, which affects their generalization ability and robustness. Recently, with the rapid growth of machine learning, learning-based SR methods have emerged that build mappings from MSIs to the corresponding HSIs using learned models. Among these, deep learning (DL)-based SR methods are the most widely adopted. Since HSCNN was proposed in 2017 [14], a growing number of convolutional neural network (CNN)-based SR methods have appeared [15,16,17,18,19,20,21,22,23,24,25] owing to their strong nonlinear fitting and learning capabilities. Compared with sparse representation-based schemes, these CNN-based methods significantly improve reconstruction performance through deeper or wider architectures. Nevertheless, as the network depth or width increases, a large number of redundant features tend to be extracted, which not only hinders the learning process but also reduces the efficiency of global feature modeling, thus limiting the reconstruction quality.
Fortunately, attention mechanisms, which guide the network to adaptively focus on key features and enhance global dependency modeling, have been widely studied and adopted in SR tasks [26]. Most attention-based SR methods leverage channel attention, spatial attention, or self-attention mechanisms to design effective attention modules. Specifically, channel attention adaptively re-weights feature maps to highlight the most informative ones, improving the model's capacity to prioritize key features during SR [27,28,29,30,31]. Spatial attention allows the model to focus on specific spatial regions, enhancing the spatial details of the reconstructed HSIs [32]. To capture channel-wise dependencies and spatial features simultaneously, several SR schemes based on dual attention have been proposed [33,34,35,36,37]. Moreover, leveraging the strength of the Vision Transformer (ViT) in capturing long-range dependencies and global context, self-attention has recently been widely adopted in SR tasks, significantly improving detail recovery and global perception [38,39,40,41].
Although the aforementioned studies have achieved promising reconstruction performance, most of them struggle to effectively capture the multiscale spatial and spectral features inherent in MSIs [42,43], primarily due to the significant variations in object sizes and the divergence of spectral signatures across different geographical regions, as illustrated in Figure 1. Increasing the model depth or incorporating attention mechanisms are widely used strategies to enhance feature extraction. However, these approaches inevitably increase computational complexity and may cause gradients to vanish or explode, thus limiting their practical applicability. Consequently, it remains challenging to effectively capture the rich multiscale spatial and spectral features of MSIs while maintaining a balance between computational efficiency and model stability.
To address the above challenges, this paper proposes a multiscale spatial–spectral dense residual attention fusion network (MS2Net) for SR. Technically, we first apply a three-dimensional (3-D) multiscale hierarchical residual (3-D MHR) module to extract the multiscale spatial and spectral features from the MSIs. Then, the extracted multiscale features are passed through two branches to further extract deep spatial and spectral features. In particular, we design the single-shot dense residual (SSDR) module, which replaces dense connections with single-shot connections in each branch to reduce the computational cost while maintaining the gradient flow. Furthermore, the residual composite soft attention (RCSA) module is designed to dynamically guide the model to focus on valuable features and to empower the model with global modeling capabilities. Finally, the high-level spectral and spatial features extracted from the two branches are fused by the adaptive weighted feature fusion (AWFF) module to reconstruct the HSIs. The main contributions and findings of this paper can be summarized as follows:
  • First, the 3-D MHR module is proposed to extract multiscale spatial and spectral features from MSIs. Specifically, the proposed 3-D MHR module not only expands the receptive field through the hierarchical residual structure but also enables fine-grained multiscale feature extraction by dividing the input feature into subsets, where each subset is processed sequentially to capture specific spatial and spectral details. In this way, the feature diversity and representation are enhanced.
  • Second, a novel two-branch network architecture is designed, comprising a spectral and a spatial feature extraction branch, each equipped with an SSDR and an RCSA module. The SSDR module in the spectral branch employs a 1 × 1 × 3 convolution to capture spectral dependencies, whereas, in the spatial branch, it uses a 3 × 3 × 1 convolution to extract local spatial patterns. Moreover, instead of using dense connections, the SSDR module uses single-shot connections to facilitate feature reuse, which reduces complexity while ensuring effective feature propagation. The RCSA module is connected behind the SSDR module, which can adaptively recalibrate feature responses by emphasizing important spectral and spatial features while enhancing the global representation capability.
  • Third, the AWFF module is introduced to integrate the spectral and spatial features extracted from the two branches to improve the reconstruction quality. Extensive experiments on four MSI datasets (two simulated and two real-world), compared against eight state-of-the-art methods, consistently demonstrate that the proposed MS2Net achieves superior reconstruction performance.
The remainder of this paper is organized as follows. Section 2 reviews several SR methods, such as sparse representation-based methods, CNN-based methods, and attention-based methods. Section 3 presents the details of the proposed method. After that, the experiment results and discussion are provided in Section 4 and Section 5, respectively. At last, Section 6 presents the conclusions and future research directions.

2. Related Works

2.1. Sparse Representation-Based Methods for SR

In the early stages of this research, sparse representation-based methods dominated SR tasks. Specifically, Yi et al. [8] combined sparse representation and unmixing models to reconstruct HSIs. Gao et al. [9] proposed a simple but effective method, named J-SLoL, to improve the spectral resolution of MSIs using an extracted low-rank HSI dictionary and sparse coefficients. Similarly, Fotiadou et al. [10] designed a post-acquisition method based on sparse representations and dictionary learning to improve the spectral resolution of MSIs. Han et al. [11] proposed a spectral library-based dictionary learning method to recover missing spectral signatures from MSIs. Furthermore, some sparse representation methods are dedicated to reconstructing HSIs from visible-light images (e.g., RGB images), which can be regarded as a specific case of MSI SR. For instance, Arad and Ben-Shahar [12] proposed a low-cost and fast approach to recover HSIs directly from RGB images. Their method first learns a sparse dictionary from an HSI prior via K-SVD and projects it to the corresponding RGB space. Then, the orthogonal matching pursuit algorithm is used to recover the HSIs. Moreover, Aeschbacher et al. [13] used shallow learning models to compute a sparse dictionary from a specific spectral prior to recover the lost spectral resolution, further improving the reconstruction performance of [12].

2.2. CNN-Based Methods for SR

In the past few years, CNN-based SR methods have received much attention due to their ability to learn deep feature representations automatically from MSIs. Specifically, inspired by residual networks [16] and dense networks [17], Shi et al. [15] proposed the HSCNN-R and HSCNN-D networks for SR, respectively. Similarly, Can and Timofte [18] proposed a nine-layer residual network for SR, in which an adaptive parametric rectified linear unit is used to increase the nonlinearity of the features. Furthermore, drawing on the semantic segmentation task, Stiebel et al. [19] proposed an SR network based on U-Net, in which the pooling and batch normalization operations are abandoned. To make the most of the contextual features, Zhao et al. [20] employed the pixel-unshuffle operation to generate inputs at four different resolutions and thus constructed a hierarchical regression network for SR. In addition, Koundinya et al. [21] proposed two-dimensional (2-D) and three-dimensional (3-D) CNNs for SR, in which the former network yielded low computational complexity while the latter achieved the best reconstruction results. Fu et al. [22] presented an HSI reconstruction network based on spectral nonlinear mapping units and spatial similarity units and applied camera spectral response selection to improve the network efficiency. Wu et al. [23] introduced a lightweight SR method in which a polymorphic residual context restructuring module is used instead of the traditional residual module to encode spatial and spectral context features. In addition, based on the intrinsic characteristics of HSIs, such as spectral correlation and projection properties, Hang et al. [24] proposed a three-stage SR model. In the first stage, the band correlation coefficients are calculated to obtain the correlation matrix, and the subsequent two stages use CNN-based residual blocks to extract deep features. Similarly, Dian et al. [25] proposed an HSI prior learning module based on CNNs and used it to construct two sub-networks for extracting spatial and spectral prior features from HSIs.

2.3. Attention-Based Methods for SR

Inspired by human visual processing, the attention mechanism helps networks focus on important features while ignoring irrelevant ones and is widely used in SR tasks. For example, Li et al. [27] proposed an adaptive weighted attention network for SR, in which the squeeze-and-excitation attention module [28] is introduced to recalibrate the contribution of each channel of the feature maps. Similarly, in [29], a hybrid 2-D and 3-D deep residual attention network is proposed, in which the 2-D and 3-D attention mechanisms adaptively recalibrate the channels and bands of the feature maps, respectively. Inspired by [30], Zheng et al. [31] proposed a spatial–spectral residual attention network for SR from MSIs to maintain the association between adjacent bands. In addition, many studies based on dual attention mechanisms have been presented for SR, enabling the network to simultaneously enhance and model global features in both the channel and spatial dimensions. Specifically, Wang et al. [33] proposed a dense residual network with dual attention. The channel attention module in their network follows the design of [28]. For the spatial attention module, a downsampling strategy is first employed to enhance spatial information, followed by a non-local operation [32] to compute spatial attention weights. Li et al. [34] constructed a double-branch hybrid attention network for SR in which the dual attention network (DANet) [35] is inserted separately into the two branches. Sun et al. [36] proposed a hybrid spectral and texture attention pyramid network, in which convolutional block attention modules [37] are used to form spatial and spectral constraint units. Recently, the self-attention module of the ViT, with its powerful global modeling capability, has been widely applied to SR tasks [38]. For instance, Cai et al. [39] proposed a lightweight multi-stage spectral-wise transformer network (MST++), where the self-attention module computes the relationship between different channels. Building upon MST++, Wu et al. [40] proposed a multistage spatial–spectral fusion network (MSFN) for SR from MSIs. This network leverages spatial correlation and spectral self-similarity, incorporating two fundamental building blocks: a spectral module that employs spectral-wise multi-head self-attention, as introduced in [39], and a spatial module based on window-based multi-head self-attention [41].

3. Proposed Method

3.1. Overall Pipeline

The pipeline of the proposed MS2Net is displayed in Figure 2. Generally, the proposed pipeline is divided into three parts: data preprocessing, spectral–spatial feature extraction, and feature fusion and reconstruction. Specifically, to make the training samples contain both spectral signatures and spatial information, we use a sliding window to obtain 3-D patches. These 3-D patches are sent to the initial convolutional layer to extract low-level features. The extracted shallow features are then fed to a 3-D MHR module to mine multiscale spectral and spatial features. Next, the multiscale spectral–spatial joint features are fed into the spectral and spatial branches to extract deep spectral and spatial features, respectively. In each branch, we use the SSDR module to enhance the feature flow. Additionally, the RCSA module is inserted in each branch to focus on the features of interest for SR in both the channel and spatial dimensions. The features from the two branches are then adaptively weighted and fused by the AWFF module. Finally, the fused feature maps are fed into a convolutional layer to obtain the final reconstructed HSI. The detailed configurations of the proposed MS2Net are listed in Table 1.

3.2. 3-D MHR Module

To effectively extract multiscale spectral–spatial features at a fine-grained level and simultaneously enlarge the receptive field, the 3-D MHR module, as shown in Figure 2, is introduced and incorporated into the head of the proposed MS2Net. The structure of the 3-D MHR module is shown in Figure 3.
As depicted in Figure 3, the 3-D MHR module comprises four scales, with the symbol “⊕” indicating the addition operation. Given an input tensor $X \in \mathbb{R}^{C \times D \times H \times W}$, where $X$ denotes the MSI patch and $C$, $D$, $H$, and $W$ denote the channel, depth, height, and width, respectively, we first split $X$ into four feature subsets along the channel dimension, each denoted as $X_i \in \mathbb{R}^{C/4 \times D \times H \times W}$ with $i \in \{1, 2, 3, 4\}$. Next, except for $X_1$, each $X_i$ is sent to a $3 \times 3 \times 3$ convolution layer, denoted by $\mathrm{Conv}_i(\cdot)$, whose output feature maps are denoted by $Y_i$. Subsequently, the feature subset $X_i$ is added to the output of $\mathrm{Conv}_{i-1}(\cdot)$ and then fed into $\mathrm{Conv}_i(\cdot)$. In this way, the multiscale hierarchical residual features $Y_i$ are defined as follows:
$$Y_i = \begin{cases} X_i, & i = 1 \\ \mathrm{Conv}_i(X_i), & i = 2 \\ \mathrm{Conv}_i(X_i + Y_{i-1}), & 2 < i \le 4 \end{cases}$$
According to this hierarchical residual structure, each convolution layer $\mathrm{Conv}_i(\cdot)$ receives the feature maps from the previous layer, so the receptive field associated with $X_i$ is larger than that of $X_j$ ($j < i$). Additionally, we use the PReLU activation function instead of ReLU to bring more nonlinear capability to the network. Finally, the concatenation operation at the end of this module combines the feature maps with different receptive fields.
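A minimal PyTorch sketch of the hierarchical residual split described above is given below; the class name, channel counts, and the placement of the PReLU activations are our assumptions rather than the authors' released implementation (the exact configuration is listed in Table 1):

```python
import torch
import torch.nn as nn


class MHR3D(nn.Module):
    """Sketch of the 3-D multiscale hierarchical residual (3-D MHR) idea: split the
    channels into four subsets, pass later subsets through 3x3x3 convolutions that
    also receive the previous subset's output, then concatenate all scales."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must be divisible by 4"
        self.split = channels // 4
        # Conv_2 .. Conv_4 (the first subset X_1 is passed through unchanged)
        self.convs = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Conv3d(self.split, self.split, kernel_size=3, padding=1),
                    nn.PReLU(),
                )
                for _ in range(3)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W); split along the channel dimension
        xs = torch.split(x, self.split, dim=1)
        ys = [xs[0]]                         # Y_1 = X_1
        y = self.convs[0](xs[1])             # Y_2 = Conv_2(X_2)
        ys.append(y)
        for i in (3, 4):                     # Y_i = Conv_i(X_i + Y_{i-1}), 2 < i <= 4
            y = self.convs[i - 2](xs[i - 1] + y)
            ys.append(y)
        return torch.cat(ys, dim=1)          # fuse features with different receptive fields


if __name__ == "__main__":
    out = MHR3D(channels=32)(torch.randn(2, 32, 8, 16, 16))
    print(out.shape)  # torch.Size([2, 32, 8, 16, 16])
```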

3.3. SSDR Module

Generally, deeper networks tend to achieve higher reconstruction quality. Nevertheless, simply stacking convolutional layers beyond a certain depth leads to performance bottlenecks. ResNet [16] first proposed the skip connection to alleviate this bottleneck, preserving network performance even when the number of layers is increased. DenseNet inherits and extends the skip connection by connecting all preceding layers to each other, further improving performance in the SR task. However, not all connections between layers contribute positively to the results, and the large memory footprint and slow inference time are drawbacks of DenseNet [17]. To address these drawbacks, we propose the SSDR module to reduce computational cost and extract deep spatial and spectral features. As shown in Figure 2, the SSDR module is instantiated as a spectral SSDR module and a spatial SSDR module. In the spectral SSDR module, we adopt $1 \times 1 \times 3$ spectral convolution kernels to extract discriminative spectral features. The procedure of the spectral SSDR module can be described by the following equations:
$$X_1 = \delta\left(\mathrm{Conv}_1^{1 \times 1 \times 3}(X_0)\right)$$
$$X_2 = \delta\left(\mathrm{Conv}_2^{1 \times 1 \times 3}(X_1)\right)$$
$$X_3 = \delta\left(\mathrm{Conv}_3^{1 \times 1 \times 3}(X_2)\right)$$
$$X_{\mathrm{spectral}} = X_0 + \delta\left(\mathrm{Conv}_4^{1 \times 1 \times 1}\left([X_0, X_1, X_2, X_3]\right)\right)$$
where $X_0$, $X_1$, and $X_2$ denote the inputs of the first, second, and third convolution layers, respectively; $X_3$ denotes the output of the third convolutional layer; $X_{\mathrm{spectral}}$ represents the extracted deep spectral features; and $\delta$ denotes the PReLU activation function. The procedure of the spatial SSDR module is the same as that of the spectral SSDR module, except that we apply $3 \times 3 \times 1$ spatial convolution kernels to mine the rich spatial contextual features $X_{\mathrm{spatial}}$:
$$X_1 = \delta\left(\mathrm{Conv}_1^{3 \times 3 \times 1}(X_0)\right)$$
$$X_2 = \delta\left(\mathrm{Conv}_2^{3 \times 3 \times 1}(X_1)\right)$$
$$X_3 = \delta\left(\mathrm{Conv}_3^{3 \times 3 \times 1}(X_2)\right)$$
$$X_{\mathrm{spatial}} = X_0 + \delta\left(\mathrm{Conv}_4^{1 \times 1 \times 1}\left([X_0, X_1, X_2, X_3]\right)\right)$$
Furthermore, the feature maps output by each convolutional layer in the module have the same size and can therefore be concatenated along the channel dimension. In each SSDR module, every convolutional layer outputs k feature maps, i.e., k new feature maps are obtained per layer. Referring to [32], we set k to 100, which achieves better reconstruction performance.
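The following is a hedged PyTorch sketch of a single-shot dense residual block covering both branch variants; the module name, the padding choices, the assumed axis layout (spectral dimension last), and the way the growth rate k enters the 1 × 1 × 1 fusion convolution are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn


class SSDR(nn.Module):
    """Sketch of a single-shot dense residual (SSDR) block: three convolutions with a
    direction-specific kernel (1x1x3 for the spectral branch, 3x3x1 for the spatial
    branch), whose outputs are concatenated once ("single shot") and fused by a
    1x1x1 convolution followed by a residual addition."""

    def __init__(self, channels: int, growth: int = 100, spectral: bool = True):
        super().__init__()
        k, p = ((1, 1, 3), (0, 0, 1)) if spectral else ((3, 3, 1), (1, 1, 0))
        self.conv1 = nn.Conv3d(channels, growth, k, padding=p)
        self.conv2 = nn.Conv3d(growth, growth, k, padding=p)
        self.conv3 = nn.Conv3d(growth, growth, k, padding=p)
        self.fuse = nn.Conv3d(channels + 3 * growth, channels, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        # x0: (B, C, H, W, bands); the last axis is assumed to be the spectral one
        x1 = self.act(self.conv1(x0))
        x2 = self.act(self.conv2(x1))
        x3 = self.act(self.conv3(x2))
        # single-shot connection: one concatenation instead of dense per-layer links
        fused = self.act(self.fuse(torch.cat([x0, x1, x2, x3], dim=1)))
        return x0 + fused


if __name__ == "__main__":
    spe = SSDR(channels=32, spectral=True)(torch.randn(1, 32, 16, 16, 8))
    spa = SSDR(channels=32, spectral=False)(torch.randn(1, 32, 16, 16, 8))
    print(spe.shape, spa.shape)  # both torch.Size([1, 32, 16, 16, 8])
```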

3.4. RCSA Module

To fully leverage critical features and capture long-range global dependencies, we introduce a self-attention mechanism [38] to design the RCSA module. As shown in Figure 4, given an input feature map $X_{\mathrm{in}} \in \mathbb{R}^{C \times H \times W}$, $C$ refers to the number of channels and $H$ and $W$ refer to the spatial dimensions of the input feature map. First, to introduce global contextual information into the attention map, $X_{\mathrm{in}}$ is fed into the construction module. This module contains two parallel branches, each containing a global average pooling layer and a linear layer. To obtain richer global contextual information, we use two unequal pooling kernel sizes of $H \times 1$ and $1 \times W$ in the two branches to generate two new feature maps, i.e., $\hat{A}_w \in \mathbb{R}^{C \times 1 \times W}$ and $\hat{A}_h \in \mathbb{R}^{C \times H \times 1}$, respectively. Next, $\hat{A}_w$ and $\hat{A}_h$ are repeated along the height and width to form global features $A_w \in \mathbb{R}^{C \times H \times W}$ and $A_h \in \mathbb{R}^{C \times H \times W}$ in the horizontal and vertical directions, respectively. Furthermore, $A_w$ and $A_h$ are cut along the $H$ and $W$ dimensions to generate groups of slices with sizes of $\mathbb{R}^{C \times W}$ and $\mathbb{R}^{C \times H}$, respectively. After that, these slices are merged to form the final global context feature map $\mathbf{A} \in \mathbb{R}^{(H+W) \times C \times S}$, where $S = H = W$. At the same time, we also cut $X_{\mathrm{in}}$ along the $H$ dimension, generating a set of $H$ slices with a size of $\mathbb{R}^{C \times W}$, and along the $W$ dimension, generating $W$ slices with a size of $\mathbb{R}^{C \times H}$. These two types of slices are merged to generate the feature $\mathbf{B} \in \mathbb{R}^{(H+W) \times S \times C}$. In the same way, the feature map $\mathbf{C} \in \mathbb{R}^{(H+W) \times C \times S}$ is generated. Then, we perform an affinity operation between $\mathbf{A}$ and $\mathbf{B}$ to generate the attention map $\mathbf{M} \in \mathbb{R}^{(H+W) \times C \times C}$. The affinity operation is represented as
$$\mathbf{M}_{i,j} = \frac{\exp\left(\mathbf{A}_i \cdot \mathbf{B}_j\right)}{\sum_{i=1}^{C} \exp\left(\mathbf{A}_i \cdot \mathbf{B}_j\right)}$$
where $\mathbf{M}_{i,j}$ measures the degree of correlation between the $i$th and $j$th channels at a specific spatial location. Then, we perform a matrix multiplication between $\mathbf{M}$ and $\mathbf{C}$ and reshape the results into two groups, each with a size of $\mathbb{R}^{C \times H \times W}$. Next, the feature maps of these two groups are added to form long-range contextual information. Finally, we multiply the result by a learnable scale parameter $\alpha$ and perform an element-wise sum with the input feature $X_{\mathrm{in}}$ to obtain the final output $X_{\mathrm{out}} \in \mathbb{R}^{C \times H \times W}$. The mathematical expression is as follows:
$$X_{\mathrm{out},j} = \alpha \sum_{i=1}^{C} \left(\mathbf{M}_{i,j} \cdot \mathbf{C}_j\right) + X_{\mathrm{in},j}$$
in which $\alpha$ is initialized to 0 and updated during the training process.
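Because the full RCSA construction is fairly involved, the sketch below is a simplified PyTorch rendition of its core idea (directional pooled context, per-slice channel affinity, and a residual connection scaled by a learnable α); the linear-layer placement, the softmax dimension, and the assumption of square feature maps follow our reading of the description above and are not the authors' code:

```python
import torch
import torch.nn as nn


class RCSA(nn.Module):
    """Simplified sketch of the residual composite soft attention (RCSA) idea."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc_w = nn.Linear(channels, channels)
        self.fc_h = nn.Linear(channels, channels)
        self.alpha = nn.Parameter(torch.zeros(1))  # initialized to 0, learned during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        assert h == w, "this sketch assumes square feature maps (S = H = W)"

        # directional global context: H x 1 and 1 x W average pooling plus linear layers
        a_w = self.fc_w(x.mean(dim=2).transpose(1, 2)).transpose(1, 2)   # (B, C, W)
        a_h = self.fc_h(x.mean(dim=3).transpose(1, 2)).transpose(1, 2)   # (B, C, H)
        a_w = a_w.unsqueeze(2).expand(b, c, h, w)                        # repeat along H
        a_h = a_h.unsqueeze(3).expand(b, c, h, w)                        # repeat along W

        # slice along H and W and merge into (B, H+W, C, S) groups
        A = torch.cat([a_w.permute(0, 2, 1, 3), a_h.permute(0, 3, 1, 2)], dim=1)
        V = torch.cat([x.permute(0, 2, 1, 3), x.permute(0, 3, 1, 2)], dim=1)
        B_ = V.transpose(2, 3)                                           # (B, H+W, S, C)

        # channel affinity per slice, normalized over the first channel index
        M = torch.softmax(torch.matmul(A, B_), dim=2)                    # (B, H+W, C, C)
        out = torch.matmul(M, V)                                         # (B, H+W, C, S)

        # split the H-slices and W-slices back into two (B, C, H, W) maps and add them
        out_h = out[:, :h].permute(0, 2, 1, 3)
        out_w = out[:, h:].permute(0, 2, 3, 1)
        return self.alpha * (out_h + out_w) + x


if __name__ == "__main__":
    y = RCSA(channels=16)(torch.randn(2, 16, 15, 15))
    print(y.shape)  # torch.Size([2, 16, 15, 15])
```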

3.5. AWFF Module

Because the two branches extract features differently, their features contribute differently to the SR task. To sufficiently fuse the spectral and spatial features from the two branches, we design an AWFF module that can adaptively fuse them. Specifically, we assign a weight coefficient $\theta$ to balance the branches so that the features are better integrated, expressed mathematically as follows:
$$F_{\mathrm{Joint}} = \theta \cdot F_{\mathrm{Spe}} + (1 - \theta) \cdot F_{\mathrm{Spa}}$$
where $F_{\mathrm{Joint}}$, $F_{\mathrm{Spe}}$, and $F_{\mathrm{Spa}}$ denote the fused, spectral-branch, and spatial-branch feature maps, respectively, and $\theta$ is an adaptive weight updated by back-propagation. Moreover, to obtain a stable and smooth reconstruction result, we choose the mean-square error as the loss function of the proposed network. The loss function is defined as follows:
$$\mathrm{Loss}(\tau) = \frac{1}{N} \sum_{n=1}^{N} \left\| Y_{\mathrm{GT}}^{n} - Y_{\mathrm{HSI}}^{n} \right\|^{2}$$
where $Y_{\mathrm{GT}}^{n}$ and $Y_{\mathrm{HSI}}^{n}$ denote the $n$-th pixel value of the ground-truth HSI patches and the reconstructed HSI patches, respectively, $N$ denotes the number of pixels in one training patch, and $\tau$ represents the parameters of the proposed network.
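A minimal sketch of the AWFF fusion and the MSE training loss in PyTorch; initializing θ to 0.5 and leaving it unconstrained are our assumptions:

```python
import torch
import torch.nn as nn


class AWFF(nn.Module):
    """Sketch of adaptive weighted feature fusion: a single learnable weight theta
    blends the spectral- and spatial-branch features."""

    def __init__(self):
        super().__init__()
        self.theta = nn.Parameter(torch.tensor(0.5))  # updated by back-propagation

    def forward(self, f_spe: torch.Tensor, f_spa: torch.Tensor) -> torch.Tensor:
        return self.theta * f_spe + (1.0 - self.theta) * f_spa


# mean-square error over the reconstructed patch, matching the loss above
criterion = nn.MSELoss()
# loss = criterion(reconstructed_hsi, ground_truth_hsi)
```

In practice one might pass θ through a sigmoid to keep the blend within [0, 1]; the paper does not state whether such a constraint is used.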

4. Experiments

4.1. Dataset Description

To thoroughly evaluate the effectiveness of the proposed MS2Net, we used two types of MSI datasets in the experiments: simulated MSI datasets (i.e., the Pavia University dataset and the Indian Pines dataset) and real-world MSI datasets (i.e., the Chongqing dataset and the Jiaxing dataset) [42,43]. It is worth emphasizing that the real-world datasets are recently collected, high-quality MSI datasets that are very challenging in experiments.
  • Pavia University (PU) Dataset: The first dataset was acquired by the Reflective Optics System Imaging Spectrometer airborne sensor over the University of Pavia, Italy. The spatial size of this dataset is 610 × 340 (207,400 pixels), and it contains 9 labeled land-cover types. The ground sampling distance (GSD) is 1.3 m. Every pixel has 115 bands with spectral wavelengths ranging from 430 to 860 nm. After dropping the noise-contaminated bands, 103 spectral bands were used for the experiments. To match the requirements of the SR experimental setup, the corresponding MSI was simulated using the blue-to-NIR (i.e., 450–520 nm, 520–600 nm, 630–690 nm, and 760–900 nm) spectral response functions (SRFs) of QuickBird (a general simulation sketch is given after this list). The size of the simulated MSI dataset is 610 × 340 × 4. As shown in Table 2, the training set size is 305 × 340 × 4, and the remaining regions were used as the test set.
  • Indian Pines (IP) Dataset: The second dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer sensor over the Indian Pines test site in northwestern Indiana, USA. The spatial size of this dataset is 145 × 145 (21,025 pixels), and it contains 16 labeled land-cover types. The GSD is 20 m. Every pixel has 220 bands with spectral wavelengths ranging from 400 to 2500 nm. To match the requirements of the SR experimental setup, the corresponding MSI was simulated using the SRFs of Sentinel-2. The size of the simulated MSI dataset is 145 × 145 × 13. As shown in Table 2, the training set size is 75 × 145 × 13, and the remaining regions were used as the test set.
  • Chongqing (CQ) Dataset: The third dataset includes the real data collected by the ZY-1 02D satellite in Chongqing, China, which contains paired HSI and MSI for the same region and time. The advanced hyperspectral imager has a GSD of 30 m, while the multispectral imager provides MSIs with a GSD of 10 m. To make the GSD of the MSIs uniform with those of the HSIs, we performed a spatial downsampling operation on the MSI. We selected MSI and HSI with a size of 400 × 1000, containing 400,000 pixels. Every pixel of the HSI has 94 bands with a spectral wavelength ranging from 395 to 1341 nm. The MSI includes eight bands with a spectral wavelength ranging from 452 to 1047 nm. As shown in Table 2, the training set and test set sizes are 400 × 400 × 8 and 400 × 600 × 8, respectively.
  • Jiaxing (JX) Dataset: The last dataset was collected by the ZY-1 02D satellite in Jiaxing, China. The GSDs for HSI and MSI are the same as those for the Chongqing dataset. We also performed a spatial downsampling operation on the MSI to ensure that the GSDs of the MSI and HSI were consistent. In this paper, we selected MSI and HSI with a size of 500 × 500, containing a total of 250,000 pixels. Each pixel has 76 bands for HSI and 8 bands for MSI. As shown in Table 2, the training set and test set sizes are 250 × 500 × 8 and 250 × 500 × 8, respectively.
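A hypothetical sketch of how an MSI can be simulated from an HSI with a set of SRFs, as done for the PU and IP datasets; the box-shaped SRFs and band indices below are placeholders, not the actual QuickBird or Sentinel-2 response curves:

```python
import numpy as np


def simulate_msi(hsi: np.ndarray, srfs: np.ndarray) -> np.ndarray:
    """Simulate an MSI from an HSI cube.

    hsi  : (H, W, B) hyperspectral cube
    srfs : (M, B) response of each of the M multispectral bands over the
           B hyperspectral bands (rows need not be normalized)
    """
    # normalize each SRF so every simulated band is a weighted average of HSI bands
    weights = srfs / srfs.sum(axis=1, keepdims=True)
    # (H, W, B) x (B, M) -> (H, W, M)
    return hsi @ weights.T


if __name__ == "__main__":
    hsi = np.random.rand(64, 64, 103).astype(np.float32)   # stand-in HSI
    srfs = np.zeros((4, 103), dtype=np.float32)
    for m, (lo, hi) in enumerate([(5, 25), (25, 45), (50, 65), (80, 103)]):
        srfs[m, lo:hi] = 1.0   # crude box approximation of a real SRF
    msi = simulate_msi(hsi, srfs)
    print(msi.shape)  # (64, 64, 4)
```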

4.2. Experimental Setup

All the experiments were performed on a deep learning workstation with 2× Intel Xeon E5-2680 v4 processors (2.4 GHz; Intel Corporation, Santa Clara, CA, USA), 128 GB of DDR4 RAM, and 8× NVIDIA GeForce RTX 2080 Ti graphics processing units (GPUs) with 11 GB of memory each. The software environment was CUDA 11.2, PyTorch 1.10, and Python 3.8 [44]. For the training setup of the proposed MS2Net, we applied the Adam optimizer to update the parameters for 200 epochs. The initial learning rate was set to 0.0001, and the batch size was set to 32. Due to limited computing power, we randomly selected patch pairs from the MSI and HSI to train the network. The patch size was set to 5 × 5 for the simulated datasets and 15 × 15 for the real-world datasets (a minimal training-loop sketch is given after the method descriptions below). Furthermore, to demonstrate the effectiveness of our proposed MS2Net, we selected eight representative SR methods for comparison: one sparse representation-based method and seven DL-based methods. All these comparison methods are briefly described in the following:
  • J-SLoL [9]: This method performs SR by learning the low-rank HS and MS dictionaries and their corresponding sparse representations.
  • AWAN [27]: The backbone of this network is composed of several dual residual attention modules, where the long and short skip connections ensure adequate feature flow. Adaptive weighted channel attention is proposed to make the network focus on representative channels. In addition, the PReLU replaces the ReLU activation function to increase the nonlinear expressiveness.
  • HSCNN [15]: This network consists of three stages to perform the SR task. It first applies 1 × 1 convolution for feature extraction; then, it stacks multiple densely connected blocks with a convolution kernel size of 3 × 3 for feature mapping. Finally, the 1 × 1 convolution is used for spectral reconstruction.
  • SSJSR [45]: This method contains two sub-networks, namely, spatial super-resolution sub-network and SR sub-network. In this experiment, only the SR sub-network is adopted for comparison. This spectral sub-network consists of five 3-D convolutional layers with a kernel size of 3 × 3 × 3 , and the network ends with a sub-pixel convolutional layer with a kernel size of 3 × 1 × 1 to reconstruct the HSI.
  • SSRAN [31]: This network utilizes three identical dual-branch residual blocks to extract spatial and spectral features of MSI, simultaneously. The bottom branch employs two 2-D convolutional layers with a kernel size of 1 × 1 to extract the spectral information. Two 2-D convolutional layers with a kernel size of 3 × 3 are used for the top branch to extract the spatial information. Similarly, the neighboring spectral attention is introduced at the end of each block.
  • MST++ [39]: This network mainly consists of several single-stage spectral-wise Transformer (SST) modules. Each SST adopts a U-shaped structure consisting of an encoder and a decoder to extract multi-resolution contextual information. The fundamental unit of both the encoder and decoder is the spectral-wise attention (SA) block. Unlike conventional Transformer architectures, the SA block introduces spectral-wise multi-head self-attention to replace standard self-attention, reducing computational complexity while preserving performance.
  • RepCPSI [23]: This network mainly consists of several polymorphic residual context restructuring (PRCR) modules. The PRCR module is similar to a basic residual structure, with the difference that it uses polymorphic convolution to extract features and that it equips a lightweight coordinate-preserving proximity spectral-aware attention block to enhance representational capabilities.
  • MSFN [40]: Similar to MST++, the main structure of this network mainly consists of multiple single-stage spatial–spectral fusion networks (SSFNs). SSFN also has the structure of the U-Net to extract multiscale features. In addition, it uses the codec of Swin Transformer as the basic unit for feature extraction.
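As referenced above, a minimal training-loop sketch matching the stated setup (Adam, learning rate 1e-4, 200 epochs, batch size 32, MSE loss); the model and data loader below are dummy placeholders, not the released MS2Net code:

```python
import torch
import torch.nn as nn

# placeholder for MS2Net and for a loader yielding (MSI patch, HSI patch) pairs
model = nn.Conv3d(1, 1, kernel_size=3, padding=1)
train_loader = [(torch.randn(32, 1, 4, 5, 5), torch.randn(32, 1, 4, 5, 5))]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

for epoch in range(200):
    for msi_patch, hsi_patch in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(msi_patch), hsi_patch)
        loss.backward()
        optimizer.step()
```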

4.3. Evaluation Metrics

To objectively evaluate the quality of the HSIs reconstructed by the proposed MS2Net, five evaluation metrics were utilized in our experiments. The first metric is the root mean square error (RMSE) [46], which measures the estimated error between the reconstructed HSI $\hat{X} \in \mathbb{R}^{H \times W \times B}$ and the ground-truth HSI $X \in \mathbb{R}^{H \times W \times B}$. The RMSE of a single band can be calculated as follows:
$$\mathrm{RMSE}\left(X_i, \hat{X}_i\right) = \sqrt{\frac{\left\| X_i - \hat{X}_i \right\|_F^2}{HW}}$$
where $X_i$ and $\hat{X}_i$ represent the $i$th band of the ground-truth and reconstructed images, respectively, both scaled to the range [0, 255]. The average RMSE over all bands is used as the final result. A smaller RMSE value indicates a better reconstruction result.
The second metric is the peak signal-to-noise ratio (PSNR) [47], which measures the pixel-level difference between $X$ and $\hat{X}$. The PSNR of a single band can be defined as follows:
$$\mathrm{PSNR}\left(X_i, \hat{X}_i\right) = 20 \cdot \log_{10}\left(\frac{255}{\mathrm{RMSE}\left(X_i, \hat{X}_i\right)}\right)$$
The average PSNR of all bands is considered the final result. A large PSNR means that the reconstructed HSI spatial information is less distorted.
The third metric is the spectral angle mapper (SAM) [48], which evaluates the similarity in shape between two spectral vectors in degrees. The spectral similarity between the reconstructed pixel $\hat{X}_j^i$ and the ground-truth pixel $X_j^i$ is given by
$$\mathrm{SAM}\left(X_j^i, \hat{X}_j^i\right) = \arccos\left(\frac{\hat{X}_j^i \cdot X_j^i}{\left\| \hat{X}_j^i \right\|_2 \left\| X_j^i \right\|_2}\right)$$
The final SAM result is obtained by averaging over all pixels in the spatial domain. If the angle is 0, there is no spectral difference.
The fourth metric is the relative dimensionless global error in synthesis (ERGAS) [49], which reflects the overall quality of the reconstructed HSI. The ERGAS of a single band is computed as follows:
$$\mathrm{ERGAS}\left(X_i, \hat{X}_i\right) = 100 \sqrt{\frac{\mathrm{RMSE}\left(X_i, \hat{X}_i\right)^2}{\rho_{\hat{X}_i}^2}}$$
where $\rho_{\hat{X}_i}$ indicates the mean value of $\hat{X}_i$, and $\mathrm{RMSE}(X_i, \hat{X}_i)$ is the RMSE of the $i$th band. The average ERGAS over all bands is considered the final result. A smaller ERGAS value means that a higher-quality HSI is reconstructed.
The last metric is the structural similarity index (SSIM) [50], which is an improvement of the universal image quality index. The SSIM of a single band is shown in the following:
$$\mathrm{SSIM}\left(X_i, \hat{X}_i\right) = \frac{\left(2 \rho_{x_i} \rho_{\hat{x}_i} + c_1\right)\left(2 \sigma_{x_i \hat{x}_i} + c_2\right)}{\left(\rho_{x_i}^2 + \rho_{\hat{x}_i}^2 + c_1\right)\left(\sigma_{x_i}^2 + \sigma_{\hat{x}_i}^2 + c_2\right)}$$
where $\rho_{x_i}$ and $\rho_{\hat{x}_i}$ are the mean values of the images $X_i$ and $\hat{X}_i$, respectively; $\sigma_{x_i}^2$ and $\sigma_{\hat{x}_i}^2$ are their variances; $\sigma_{x_i \hat{x}_i}$ is their covariance; and $c_1$ and $c_2$ are constants. An SSIM value close to 1 implies good reconstruction quality.
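A compact NumPy sketch of the band-averaged RMSE, PSNR, SAM, and ERGAS defined above (SSIM is omitted since library implementations such as scikit-image provide it); the (H, W, B) layout and the [0, 255] scaling are assumptions carried over from the definitions:

```python
import numpy as np


def rmse(gt: np.ndarray, rec: np.ndarray) -> float:
    """Per-band RMSE averaged over bands; gt/rec are (H, W, B) arrays in [0, 255]."""
    return float(np.sqrt(((gt - rec) ** 2).mean(axis=(0, 1))).mean())


def psnr(gt: np.ndarray, rec: np.ndarray) -> float:
    """Per-band PSNR averaged over bands."""
    band_rmse = np.sqrt(((gt - rec) ** 2).mean(axis=(0, 1)))
    return float((20.0 * np.log10(255.0 / band_rmse)).mean())


def sam(gt: np.ndarray, rec: np.ndarray) -> float:
    """Spectral angle mapper in degrees, averaged over all pixels."""
    dot = (gt * rec).sum(axis=2)
    norm = np.linalg.norm(gt, axis=2) * np.linalg.norm(rec, axis=2) + 1e-12
    return float(np.degrees(np.arccos(np.clip(dot / norm, -1.0, 1.0))).mean())


def ergas(gt: np.ndarray, rec: np.ndarray) -> float:
    """Per-band relative error averaged over bands, following the definition above."""
    band_rmse = np.sqrt(((gt - rec) ** 2).mean(axis=(0, 1)))
    return float((100.0 * band_rmse / rec.mean(axis=(0, 1))).mean())
```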

4.4. Experimental Results and Analysis

To verify the superiority of the proposed method at all levels, we quantitatively and qualitatively evaluated the reconstruction results of the proposed method against eight comparative methods on four datasets. First, as shown in Table 3, we quantitatively assessed the reconstruction results on the IP dataset. Generally, the proposed MS2Net achieves the best reconstruction results on all five evaluation metrics compared with the other methods. Concretely, the reconstruction accuracy obtained by the DL-based methods is much higher than that obtained by the sparse representation-based method (i.e., J-SLoL) in terms of PSNR, since the DL-based methods do not destroy the spatial information of the MSIs. Furthermore, compared with the second-best DL-based method (i.e., MSFN), the proposed MS2Net decreases RMSE by 23.85%, SAM by 23.74%, and ERGAS by 32.07%. As for PSNR and SSIM, the proposed MS2Net is 5.71% and 0.91% better, respectively, than the second-best method. Additionally, the proposed MS2Net shares similarities with the SSRAN network structure, as both are designed to simultaneously extract spectral and spatial information. However, due to the simple structure of SSRAN, its effectiveness needs further improvement. In contrast, given its well-designed network architecture, the proposed MS2Net achieves the desired results on all metrics.
To visually demonstrate the reconstruction performance, the error maps of the different methods for five representative sampling bands (i.e., 450 nm, 550 nm, 650 nm, 806 nm, and 2370 nm) on the IP dataset are shown in Figure 5. As shown in Figure 5, the proposed MS2Net achieves the minimum reconstruction error in all five sampling bands. Specifically, all methods reconstruct the 650 nm band poorly in the IP dataset because this band is at the edge of the MSI spectral response function and provides limited spectral information for the reconstruction task. Moreover, each method produces different reconstruction results for different land-cover types. For example, the land-cover type in the red rectangle is soybean–mintill, with a smooth and uniform spatial distribution. Because J-SLoL disrupts the spatial information, the edge texture information is lost in the reconstruction result of this region. On the contrary, the reconstruction error in this region is relatively small for the DL-based methods due to their consideration of spatial features. In addition, the green rectangle contains a mixture of land-cover types, such as woods and other vegetation. All comparative methods have poor reconstruction results in this region. Owing to the elaborate multiscale feature extraction module, the proposed MS2Net significantly outperforms the other comparison methods in this region on all five sampling bands.
Second, the reconstruction results of the different methods on the PU dataset are shown in Table 4. Due to the limited spectral information remaining after the spectral downsampling of the PU dataset, the reconstruction results of all methods are degraded on all five metrics compared to the IP dataset. However, the proposed MS2Net still leads on all metrics. Specifically, compared with the second-best method (i.e., HSCNN), the proposed MS2Net decreases RMSE, SAM, and ERGAS by 10.83%, 9.92%, and 10.20%, respectively. Furthermore, regarding PSNR and SSIM, the proposed MS2Net improves by 3.72% and 0.2%, respectively. More visually, the error maps of the different methods for four representative sampling bands (i.e., 450 nm, 550 nm, 650 nm, and 806 nm) on the PU dataset are shown in Figure 6. As illustrated in Figure 6, the proposed MS2Net reconstructs the HSI more accurately in all four representative bands and shows higher fidelity than the other methods. Specifically, the reconstruction error is minor for land-cover types with a continuous spatial distribution, such as roads and artificial buildings. In contrast, reconstructing sparsely distributed land-cover types, such as trees and roofs, is more challenging. Nevertheless, MS2Net achieves outstanding reconstruction accuracy even for these land-cover types.
Third, we validated all methods on the real-world datasets. The CQ dataset has a GSD of 30 m, meaning that individual pixels contain various land-cover types, which results in complex spectral signatures recorded by single pixels. These complex spectral signatures pose a significant challenge to the reconstruction task. The reconstruction results of the different methods on the CQ dataset are listed in Table 5. From Table 5, we can see that J-SLoL achieves the worst reconstruction accuracy, while the DL-based methods achieve acceptable reconstruction results. The RepCPSI method achieves the second-best results due to its deeper network layers. In particular, compared with RepCPSI, the proposed MS2Net decreases RMSE, SAM, and ERGAS by 14.03%, 17.29%, and 13.35%, respectively. Furthermore, regarding PSNR and SSIM, the proposed MS2Net performs 2.81% and 0.71% better, respectively. Visually, the error maps of the different methods for four representative sampling bands (i.e., 359 nm, 524 nm, 670 nm, and 962 nm) on the CQ dataset are displayed in Figure 7. As expected, the proposed MS2Net achieves good reconstruction results and a smaller reconstruction error than the other methods.
To further validate the robustness of the proposed method, we finally conducted experiments on the recently released JX dataset. Table 6 presents the reconstruction results of the various methods on the JX dataset. Generally, we can observe from Table 6 that the proposed MS2Net consistently outperforms the other methods across all evaluation metrics. Specifically, the proposed MS2Net achieves the lowest RMSE of 2.209, which is 24.37% lower than the second-best method (RepCPSI, 2.921). It also attains the highest PSNR of 43.932, representing a 4.11% improvement over MST++ (42.196). For SAM, the proposed MS2Net achieves a value of 0.737, which is 26.88% lower than RepCPSI (1.008), indicating significantly reduced spectral distortion. In terms of SSIM, the proposed MS2Net achieves 0.984, slightly surpassing HSCNN (0.972) by 1.23%. Finally, the lowest ERGAS is achieved by the proposed MS2Net with 4.443, a 17.92% improvement over HSCNN (5.413). Figure 8 presents the error maps of four representative sampling bands (395 nm, 550 nm, 1200 nm, and 2200 nm) for the different methods on the JX dataset. As shown in Figure 8, the proposed MS2Net achieves more accurate HSI reconstruction across all four bands, demonstrating superior fidelity compared to the other methods. Moreover, Figure 9, Figure 10, Figure 11 and Figure 12 show the spectral curves of three typical land-cover types randomly selected from each of the four datasets. Evidently, the reconstructed spectral curves of the proposed MS2Net are closer to the ground-truth spectral responses.

5. Discussion

5.1. Analysis of Ablation Experiments

In this subsection, we conduct extensive ablation experiments across four different datasets to evaluate the effectiveness of each module in the proposed MS2Net. Specifically, “W/O 3-D MHR” refers to MS2Net without the 3-D MHR module, “W/O RCSA” indicates the removal of the RCSA module, “W/O Spe-RCSA” denotes the exclusion of the spectral branch RCSA module, and “W/O Spa-RCSA” represents the absence of the spatial branch RCSA module. Additionally, “W/O Spe-Net” refers to MS2Net without the spectral branch, “W/O Spa-Net” excludes the spatial branch, and “W/O AWFF” indicates the absence of the AWFF module. Quantitative results of different ingredients of the proposed MS2Net on four datasets are presented in Table 7, Table 8, Table 9 and Table 10.
First, we focus on the ablation results of the 3-D MHR module, i.e., W/O 3-D MHR. The ablation results on all four datasets show that the 3-D MHR module substantially improves the reconstruction results owing to its ability to extract multiscale spatial–spectral features for the reconstruction task. In particular, the land-cover distribution is complex and variable in scale for the CQ and JX datasets, and the reconstruction quality decreases significantly after removing the 3-D MHR module. Second, we turn to the ablation experiments on the attention mechanisms. On all four datasets, the reconstruction results of the proposed MS2Net drop after removing the attention mechanism (i.e., the W/O RCSA model). The W/O Spe-RCSA and W/O Spa-RCSA models perform better than the W/O RCSA model on the IP and PU datasets. However, for the CQ and JX datasets, the W/O RCSA model performs better than the W/O Spe-RCSA and W/O Spa-RCSA models. This indicates that, for datasets with complex feature distributions, inserting the attention mechanism in only the spectral or spatial branch does not achieve good reconstruction results. Third, we focus on the impact of each branch on the reconstruction results. The ablation results show that the influence of spatial and spectral features on the reconstruction results differs across datasets. For example, for the IP and PU datasets, the W/O Spa-Net model performs better than the W/O Spe-Net model. In contrast, for the CQ and JX datasets, the W/O Spe-Net model performs better than the W/O Spa-Net model. This is reasonable because the CQ and JX datasets have a low spatial resolution (i.e., 30 m), and a single pixel in the image covers many mixed land-cover types, resulting in relatively complex single-pixel spectral signatures for these datasets. The above analysis shows that good reconstruction accuracy cannot be achieved using only a single spatial or spectral branch; both spatial and spectral features need to be extracted to obtain satisfactory results. Furthermore, we can also observe from the ablation results that spectral and spatial information has varying degrees of influence on the reconstruction results across datasets. For this reason, we propose the AWFF module to fuse these two types of features. To verify its effectiveness, we performed ablation experiments on the AWFF module. The results show that the module can effectively aggregate the two types of features and improve the reconstruction quality.

5.2. Analysis of Model Complexity

In this subsection, we compare the complexity of the models in terms of parameters, floating-point operations (FLOPs), and runtimes, all evaluated with a batch size of 1. The results are presented in Table 11. As shown in Table 11, the AWAN and HSCNN methods adopt deep network structures, resulting in a higher number of parameters and FLOPs across all datasets. Specifically, AWAN has nearly 2M parameters and relatively low FLOPs, while HSCNN uses slightly more parameters but achieves comparable FLOPs. In contrast, the SSJSR method employs a simplified network architecture with only five 3-D convolution layers, which significantly reduces its parameters. However, this simplicity comes at the cost of higher FLOPs due to the increased computational demands of 3-D operations. Meanwhile, the SSRAN method has the lowest overall complexity, as it leverages lightweight 1-D and 2-D convolution operations to achieve minimal parameters, FLOPs, and runtime. Although RepCPSI is a CNN-based lightweight SR scheme, it introduces a new convolutional operator that is slightly more complex and thus has more parameters and FLOPs than SSRAN. In contrast, the proposed MS2Net incorporates advanced operations that enhance feature extraction and representation while maintaining a manageable level of complexity. This allows the proposed MS2Net to achieve significantly better performance than both SSRAN and RepCPSI, with only a slight increase in computational demand. Furthermore, the recently introduced MST++ and MSFN methods, though designed as lightweight Transformer-based architectures, still exhibit relatively high complexity. Taking the IP dataset as an example, compared to MSFN, the proposed MS2Net achieves reductions of 99.30% in parameters, 89.61% in FLOPs, and 91.36% in runtime, demonstrating its significant computational efficiency while maintaining strong performance. Based on the above analysis, it is evident that the proposed MS2Net strikes an optimal balance between computational efficiency and performance. The convergence of the proposed MS2Net on the four datasets is shown in Figure 13. We can observe from Figure 13 that the proposed MS2Net achieves stable and efficient convergence across all datasets. In the initial epochs, the loss decreases rapidly, which indicates that the proposed MS2Net effectively learns the underlying features. As training progresses, the convergence becomes smooth and stable, which reflects the robustness of the network and its good generalization ability.
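For reference, parameter counts such as those reported in Table 11 can be reproduced with a one-line helper like the following (FLOPs and runtimes additionally require a profiling tool, e.g., a FLOP counter such as thop or fvcore, run on a batch-size-1 input, which we do not reproduce here):

```python
import torch


def count_parameters(model: torch.nn.Module) -> int:
    """Number of trainable parameters, as reported in the complexity comparison."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


if __name__ == "__main__":
    demo = torch.nn.Conv3d(4, 8, kernel_size=3)   # placeholder module
    print(count_parameters(demo))                  # 8 * 4 * 27 + 8 = 872
```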

5.3. Analysis of Classification Validation

In this subsection, we design a classification experiment on the two datasets with labeled categories to verify the quality of the HSIs reconstructed by the different models. An SVM is selected as the basic classifier. The quantitative classification results for the IP and PU datasets are presented in Table 12 and Table 13, respectively, where the per-class accuracy, overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) are reported. Generally, the proposed MS2Net achieves the classification accuracy closest to that of the real HSI (R-HSI) on both datasets. More visually, the full-pixel classification maps for the two datasets with the different methods are shown in Figure 14 and Figure 15. From Figure 14 and Figure 15, it can be seen that the classification map of MS2Net is more similar to the R-HSI classification map. Specifically, taking the IP dataset as an example, Figure 14 shows that the proposed MS2Net produces a smoother classification result for the region of class 9 compared to the other methods. Furthermore, for the green rectangular region in Figure 14, the classification results of the HSIs reconstructed by all methods are noisy because this region contains a mixture of land-cover types. In contrast, the proposed MS2Net outperforms the other comparative methods in this region.
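A hypothetical sketch of this classification check with scikit-learn; the SVM hyperparameters and the random stand-in data are ours, chosen only to show how OA, AA, and the kappa coefficient are computed from reconstructed spectra:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score, recall_score

# stand-in data: (N, B) spectral vectors from the reconstructed HSI and their labels
rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 103)), rng.integers(0, 9, 500)
X_test, y_test = rng.random((200, 103)), rng.integers(0, 9, 200)

clf = SVC(kernel="rbf", C=100, gamma="scale")   # hyperparameters are assumptions
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

oa = accuracy_score(y_test, y_pred)                  # overall accuracy (OA)
aa = recall_score(y_test, y_pred, average="macro")   # average accuracy (AA)
kappa = cohen_kappa_score(y_test, y_pred)            # kappa coefficient
print(f"OA={oa:.3f}  AA={aa:.3f}  kappa={kappa:.3f}")
```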

6. Conclusions

In this paper, we proposed a novel MS2Net framework to achieve the SR task from MSIs. Specifically, the 3-D MHR module was utilized to extract the multiscale spectral–spatial features of MSIs. To further enhance feature extraction, we designed a dual-branch structure, where each branch utilized SSDR modules with different convolutional kernel sizes to effectively capture spatial and spectral features. At the same time, we also introduced the attention mechanism, i.e., RCSA module, in each branch to direct the network to focus on the features of interest and to enhance its ability to capture global features. To further optimize the reconstruction quality, we proposed an AWFF module to ensure effective fusion of features extracted from the two branches. A series of ablation experiments verified the effectiveness of each module in the proposed MS2Net. Moreover, the classification validation experiments further confirmed the usability of the proposed MS2Net-reconstructed HSIs.
Although the proposed MS2Net has demonstrated excellent SR capabilities, there is still room for improvement in terms of model complexity. Furthermore, the reconstruction performance of the proposed MS2Net may be affected when applied to complex geographical features in different scenarios (e.g., diverse terrain, vegetation, or human-made objects). Consequently, in future work, we aim to explore the following aspects. On the one hand, a cross-modal self-supervised learning framework can be investigated, such as combining multiple remote sensing data types (e.g., MSI, LiDAR, SAR, etc.) for representation learning and leveraging the complementarity between different modalities to enhance SR performance. On the other hand, the exploration of lightweight models represents a promising direction for practical deployment, particularly in scenarios with limited computational resources.

Author Contributions

Conceptualization, M.L. and W.Z.; Methodology, M.L. and W.Z.; Software, M.L. and W.Z.; Validation, M.L., W.Z. and H.P.; Formal analysis, W.Z.; Investigation, W.Z.; Resources, W.Z.; Data curation, W.Z.; Writing—original draft preparation, M.L. and W.Z.; Writing—review and editing, M.L. and W.Z.; Visualization, M.L.; Supervision, W.Z.; Project administration, W.Z.; Funding acquisition, W.Z. and H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 42201503, in part by Heilongjiang Provincial Natural Science Foundation of China under Grant LH2023F050, and in part by the Fundamental Research Funds in Heilongjiang Provincial Universities under Grant 145309208.

Data Availability Statement

Data available in the publicly accessible repositories: Indian Pines, Pavia University, and Chongqing datasets (https://github.com/WenjuanZhang-aircas/Spectral-Mixing-Theory-Based-Double-Branch-Network-for-Spectral-Super-Resolution, accessed on 25 January 2025); Jiaxing dataset (https://github.com/rs-lsl/CSSNet, accessed on 25 January 2025).

Acknowledgments

The authors would like to thank Lingyu Sha, Ruiqi Sun, and Mengnan Jin, of the Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China, for valuable discussions on topics related to this study. At the same time, we also thank the reviewers for their suggestions on this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, B. Current status and future prospects of remote sensing. Bull. Chin. Acad. Sci. (Chin. Ver.) 2017, 32, 774–784. [Google Scholar]
  2. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral remote sensing data analysis and future challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36. [Google Scholar] [CrossRef]
  3. Nasrabadi, N.M. Hyperspectral target detection: An overview of current and future challenges. IEEE Signal Process. Mag. 2014, 31, 34–44. [Google Scholar] [CrossRef]
  4. Zhang, B.; Wu, D.; Zhang, L.; Li, S.; Chen, X.; Zhao, Y.; Wang, H. Application of hyperspectral remote sensing for environment monitoring in mining areas. Environ. Earth Sci. 2012, 65, 649–658. [Google Scholar] [CrossRef]
  5. He, L.; Li, J.; Liu, C.; Li, S. Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines. IEEE Trans. Geosci. Remote Sens. 2018, 56, 1579–1597. [Google Scholar] [CrossRef]
  6. Sun, X.; Zhang, L.; Yang, H.; Wu, T.; Cen, Y.; Guo, Y. Enhancement of spectral resolution for remotely sensed multispectral image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2198–2211. [Google Scholar] [CrossRef]
  7. He, J.; Yuan, Q.; Li, J.; Xiao, Y.; Liu, X.; Zou, Y. DsTer: A dense spectral transformer for remote sensing spectral super-resolution. Int. J. Appl. Earth Obs. Geoinf. 2022, 109, 102773. [Google Scholar] [CrossRef]
  8. Yi, C.; Zhao, Y.-Q.; Chan, J.C.-W. Spectral super-resolution for multispectral image based on spectral improvement strategy and spatial preservation strategy. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9010–9024. [Google Scholar] [CrossRef]
  9. Gao, L.; Hong, D.; Yao, J.; Zhang, B.; Gamba, P.; Chanussot, J. Spectral superresolution of multispectral imagery with joint sparse and low-rank learning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 2269–2280. [Google Scholar] [CrossRef]
  10. Fotiadou, K.; Tsagkatakis, G.; Tsakalides, P. Spectral super resolution of hyperspectral images via coupled dictionary learning. IEEE Trans. Geosci. Remote Sens. 2019, 57, 2777–2797. [Google Scholar] [CrossRef]
  11. Han, X.; Yu, J.; Luo, J.; Sun, W. Reconstruction from multispectral to hyperspectral image using spectral library-based dictionary learning. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1325–1335. [Google Scholar] [CrossRef]
  12. Arad, B.; Ben-Shahar, O. Sparse recovery of hyperspectral signal from natural RGB images. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9911, pp. 19–34. [Google Scholar]
  13. Aeschbacher, J.; Wu, J.; Timofte, R. In defense of shallow learned spectral reconstruction from RGB images. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017. [Google Scholar]
  14. Xiong, Z.; Shi, Z.; Li, H.; Wang, L.; Liu, D.; Wu, F. HSCNN: CNN-based hyperspectral image recovery from spectrally undersampled projections. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 518–525. [Google Scholar]
  15. Shi, Z.; Chen, C.W.; Xiong, Z.; Liu, D.; Wu, F. HSCNN+: Advanced CNN-based hyperspectral recovery from RGB images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1052–10528. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 630–645. [Google Scholar]
  17. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  18. Can, Y.B.; Timofte, R. An efficient CNN for spectral reconstruction from RGB images. arXiv 2018, arXiv:1804.04647. [Google Scholar]
  19. Stiebel, T.; Koppers, S.; Seltsam, P.; Merhof, D. Reconstructing spectral images from RGB-images using a convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  20. Zhao, Y.; Po, L.M.; Yan, Q.; Liu, W.; Lin, T. Hierarchical regression network for spectral reconstruction from RGB images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 422–423. [Google Scholar]
  21. Koundinya, S.; Sharma, H.; Sharma, M.; Upadhyay, A.; Manekar, R.; Mukhopadhyay, R.; Chaudhury, S. 2D-3D CNN-based architectures for spectral reconstruction from RGB images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 844–851. [Google Scholar]
  22. Fu, Y.; Zhang, T.; Zheng, Y.; Zhang, D.; Huang, H. Joint camera spectral response selection and hyperspectral image recovery. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 256–272. [Google Scholar] [CrossRef] [PubMed]
  23. Wu, C.; Li, J.; Song, R.; Li, Y.; Du, Q. RepCPSI: Coordinate-preserving proximity spectral interaction network with reparameterization for lightweight spectral super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5508313. [Google Scholar] [CrossRef]
  24. Hang, R.; Liu, Q.; Li, Z. Spectral super-resolution network guided by intrinsic properties of hyperspectral imagery. IEEE Trans. Image Process. 2021, 30, 7256–7265. [Google Scholar] [CrossRef]
  25. Dian, R.; Shan, T.; He, W.; Liu, H. Spectral super-resolution via model-guided cross-fusion network. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 10059–10070. [Google Scholar] [CrossRef]
  26. Ghaffarian, S.; Valente, J.; van der Voort, M.; Tekinerdogan, B. Effect of attention mechanism in deep learning-based remote sensing image processing: A systematic literature review. Remote Sens. 2021, 13, 2965. [Google Scholar] [CrossRef]
  27. Li, J.; Wu, C.; Song, R.; Li, Y.; Liu, F. Adaptive weighted attention network with camera spectral sensitivity prior for spectral reconstruction from RGB images. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1894–1903. [Google Scholar]
  28. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  29. Li, J.; Wu, C.; Song, R.; Xie, W.; Ge, C.; Li, B.; Li, Y. Hybrid 2-D–3-D deep residual attentional network with structure tensor constraints for spectral super-resolution of RGB images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 2321–2335. [Google Scholar] [CrossRef]
  30. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  31. Zheng, X.; Chen, W.; Lu, X. Spectral super-resolution of multispectral images using spatial–spectral residual attention network. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  32. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.-T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 11057–11066. [Google Scholar]
  33. Wang, L.; Sole, A.; Hardeberg, J.Y. Densely residual network with dual attention for hyperspectral reconstruction from RGB images. Remote Sens. 2022, 14, 3128. [Google Scholar] [CrossRef]
  34. Li, J.; Du, S.; Song, R.; Wu, C.; Li, Y.; Du, Q. HASIC-Net: Hybrid attentional convolutional neural network with structure information consistency for spectral super-resolution of RGB images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  35. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
  36. Sun, W.; Wang, Y.; Liu, W.; Shao, S.; Yang, S.; Yang, G.; Ren, K.; Chen, B. STANet: A hybrid spectral and texture attention pyramid network for spectral super-resolution of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  37. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  38. Song, Q.; Li, J.; Li, C.; Guo, H.; Huang, R. Fully attentional network for semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; pp. 2280–2288. [Google Scholar]
  39. Cai, Y.; Lin, J.; Lin, Z.; Wang, H.; Zhang, Y.; Pfister, H.; Van Gool, L. MST++: Multi-stage spectral-wise transformer for efficient spectral reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 745–755. [Google Scholar]
  40. Wu, Y.; Dian, R.; Li, S. Multistage spatial–spectral fusion network for spectral super-resolution. IEEE Trans. Neural Netw. Learn. Syst. 2024; early access. [Google Scholar]
  41. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  42. Li, S. Real HSI-MSI-PAN image dataset for the hyperspectral/multi-spectral/panchromatic image fusion and super-resolution fields. arXiv 2024, arXiv:2407.02387. [Google Scholar]
  43. Sha, L.; Zhang, W.; Zhang, B.; Liu, Z.; Li, Z. Spectral mixing theory-based double-branch network for spectral super-resolution. Remote Sens. 2023, 15, 1308. [Google Scholar] [CrossRef]
  44. Liu, M.; Pan, H.; Ge, H.; Wang, L. MS3Net: Multiscale stratified-split symmetric network with quadra-view attention for hyperspectral image classification. Signal Process. 2023, 212, 109153. [Google Scholar] [CrossRef]
  45. Mei, S.; Jiang, R.; Li, X.; Du, Q. Spatial and spectral joint super-resolution using convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4590–4603. [Google Scholar] [CrossRef]
  46. Ranchin, T.; Wald, L. Fusion of high spatial and spectral resolution images: The ARSIS concept and its implementation. Photogramm. Eng. Remote Sens. 2000, 66, 49–61. [Google Scholar]
  47. Dian, R.; Li, S. Hyperspectral image super-resolution via subspace-based low tensor multi-rank regularization. IEEE Trans. Image Process. 2019, 28, 5135–5146. [Google Scholar] [CrossRef]
  48. Yuhas, R.H.; Goetz, A.F.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Summaries of the Third Annual JPL Airborne Geoscience Workshop, Volume 1: AVIRIS Workshop; JPL: Pasadena, CA, USA, 1992. [Google Scholar]
  49. Wald, L.; Ranchin, T.; Mangolini, M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogramm. Eng. Remote Sens. 1997, 63, 691–699. [Google Scholar]
  50. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Visualization of multiscale spatial and spectral features captured by ZY1-02D over Jiaxing, China [42]: the left panel highlights the spatial differences among urban (small-scale complex structures), forest (medium-scale canopy coverage), and ocean (large-scale homogeneous regions) land covers; the right panel presents their distinct spectral signatures across multiple bands.
Figure 2. Overall flowchart of the proposed MS2Net. “©” denotes the concatenation operation, “⊗” denotes matrix multiplication, and “⊕” denotes element-wise addition.
Figure 3. Illustration of the 3-D MHR module.
Figure 4. Illustration of the RCSA module.
Figure 5. Reconstruction error maps for five selected bands of the IP dataset.
Figure 6. Reconstruction error maps for four selected bands of the PU dataset.
Figure 7. Reconstruction error maps for four selected bands of the CQ dataset.
Figure 8. Reconstruction error maps for four selected bands of the JX dataset.
Figure 9. Spectral response curves of three selected sample points in the IP dataset.
Figure 10. Spectral response curves of three selected sample points in the PU dataset.
Figure 11. Spectral response curves of three selected sample points in the CQ dataset.
Figure 12. Spectral response curves of three selected sample points in the JX dataset.
Figure 13. Convergence curves of the proposed MS2Net on the four datasets.
Figure 14. Full-pixel classification maps of the reconstructed HSIs obtained by different methods on the IP dataset.
Figure 15. Full-pixel classification maps of the reconstructed HSIs obtained by different methods on the PU dataset.
Table 1. Detailed configurations of the proposed MS2Net based on the PU dataset, where # denotes the number of feature maps.

| Module | Input Shape | Layer Operations | Kernel Size | Padding | Stride | Filters | Output Shape |
|---|---|---|---|---|---|---|---|
| Input | (5, 5, 4, 1) | Conv-3D and PReLU | (1, 1, 1) | (0, 0, 0) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| 3-D MHR | (5, 5, 4, 100) | Split | / | / | / | / | #4 (5, 5, 4, 25) |
| | (5, 5, 4, 25) | / | / | / | / | / | (5, 5, 4, 25) |
| | (5, 5, 4, 25) | Conv-3D and PReLU | (3, 3, 3) | (1, 1, 1) | (1, 1, 1) | 25 | (5, 5, 4, 25) |
| | #2 (5, 5, 4, 25) | Add and Conv-3D and PReLU | (3, 3, 3) | (1, 1, 1) | (1, 1, 1) | 25 | (5, 5, 4, 25) |
| | #2 (5, 5, 4, 25) | Add and Conv-3D and PReLU | (3, 3, 3) | (1, 1, 1) | (1, 1, 1) | 25 | (5, 5, 4, 25) |
| | #4 (5, 5, 4, 25) | Concatenate | / | / | / | / | (5, 5, 4, 100) |
| Spectral-SSDR | (5, 5, 4, 100) | Conv-3D and PReLU | (1, 1, 3) | (0, 0, 1) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| | (5, 5, 4, 100) | Conv-3D and PReLU | (1, 1, 3) | (0, 0, 1) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| | (5, 5, 4, 100) | Conv-3D and PReLU | (1, 1, 3) | (0, 0, 1) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| | #4 (5, 5, 4, 100) | Concatenate | / | / | / | / | (5, 5, 4, 400) |
| | (5, 5, 4, 400) | Conv-3D and PReLU | (1, 1, 1) | (0, 0, 0) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| | #2 (5, 5, 4, 100) | Add | / | / | / | / | (5, 5, 4, 100) |
| | (5, 5, 4, 100) | Conv-3D and PReLU | (1, 1, 4) | (0, 0, 0) | (1, 1, 1) | 100 | (5, 5, 100) |
| RCSA | (5, 5, 100) | Attention | / | / | / | / | (5, 5, 100) |
| | #2 (5, 5, 100) | Multiplication | / | / | / | / | (5, 5, 100) |
| Spatial-SSDR | (5, 5, 4, 100) | Conv-3D and PReLU | (3, 3, 1) | (1, 1, 0) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| | (5, 5, 4, 100) | Conv-3D and PReLU | (3, 3, 1) | (1, 1, 0) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| | (5, 5, 4, 100) | Conv-3D and PReLU | (3, 3, 1) | (1, 1, 0) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| | #4 (5, 5, 4, 100) | Concatenate | / | / | / | / | (5, 5, 4, 400) |
| | (5, 5, 4, 400) | Conv-3D and PReLU | (1, 1, 1) | (0, 0, 0) | (1, 1, 1) | 100 | (5, 5, 4, 100) |
| | #2 (5, 5, 4, 100) | Add | / | / | / | / | (5, 5, 4, 100) |
| | (5, 5, 4, 100) | Conv-3D and PReLU | (1, 1, 4) | (0, 0, 0) | (1, 1, 1) | 100 | (5, 5, 100) |
| RCSA | (5, 5, 100) | Attention | / | / | / | / | (5, 5, 100) |
| | #2 (5, 5, 100) | Multiplication | / | / | / | / | (5, 5, 100) |
| AWFF | #2 (5, 5, 100) | Feature Fusion | / | / | / | / | (5, 5, 100) |
| Output | (5, 5, 100) | Conv-2D and PReLU | (1, 1) | (0, 0) | (1, 1) | 103 | (5, 5, 103) |
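
For readers who prefer code to tables, the 3-D MHR block configured in Table 1 can be summarized as a short sketch. The following is a minimal PyTorch-style sketch, assuming the split-into-four-groups hierarchical residual design implied by the table (one identity branch, one plain 3 × 3 × 3 convolution, and two add-then-convolve branches); the class name, activation placement, and example shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MHR3D(nn.Module):
    """Sketch of the 3-D multiscale hierarchical residual (MHR) block in Table 1.

    The 100 input feature maps are split into four groups of 25. Group 1 is an
    identity branch; group 2 passes through a 3x3x3 convolution; groups 3 and 4
    first add the previous group's output (hierarchical residual) and are then
    convolved. The four branches are concatenated back to 100 feature maps.
    """

    def __init__(self, channels: int = 100, groups: int = 4):
        super().__init__()
        self.groups = groups
        width = channels // groups  # 25 feature maps per group when channels=100
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(width, width, kernel_size=3, padding=1),
                nn.PReLU(),
            )
            for _ in range(groups - 1)  # groups 2..4 carry convolutions
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, bands, height, width), e.g. (N, 100, 4, 5, 5)
        splits = torch.chunk(x, self.groups, dim=1)
        outputs = [splits[0]]                      # group 1: identity
        prev = self.convs[0](splits[1])            # group 2: plain convolution
        outputs.append(prev)
        for i in range(2, self.groups):            # groups 3, 4: add, then convolve
            prev = self.convs[i - 1](splits[i] + prev)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)           # back to (N, 100, 4, 5, 5)

# Example: a PU-style patch with 4 multispectral bands embedded into 100 maps.
if __name__ == "__main__":
    feats = torch.randn(2, 100, 4, 5, 5)
    print(MHR3D()(feats).shape)  # torch.Size([2, 100, 4, 5, 5])
```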
Table 2. Training and test sets of the four datasets.
[The table shows the MSI and HSI image chips of the training and test regions for the Pavia University, Indian Pines, Chongqing, and Jiaxing datasets; the images are not reproducible in text.]
Table 3. Reconstruction performance comparison of different methods on the IP dataset.

| Methods | RMSE (↓ 0) | PSNR (↑ ∞) | SAM (↓ 0) | SSIM (↑ 1) | ERGAS (↓ 0) |
|---|---|---|---|---|---|
| J-SLoL [9] | 2.927 | 38.374 | 2.071 | 0.961 | 3.404 |
| AWAN [27] | 2.864 | 44.936 | 1.929 | 0.955 | 3.452 |
| HSCNN [15] | 2.374 | 47.246 | 1.659 | 0.969 | 2.866 |
| SSJSR [45] | 2.261 | 48.785 | 1.573 | 0.971 | 2.673 |
| SSRAN [31] | 2.701 | 45.629 | 1.915 | 0.962 | 3.336 |
| MST++ [39] | 2.088 | 49.148 | 1.453 | 0.975 | 2.474 |
| RepCPSI [23] | 2.616 | 44.710 | 1.849 | 0.959 | 3.511 |
| MSFN [40] | 2.025 | 49.635 | 1.410 | 0.977 | 2.385 |
| MS2Net | 1.542 | 52.642 | 1.045 | 0.986 | 1.823 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
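
The quality indices reported in Tables 3–6 follow the conventional definitions cited in [48,49,50]. The snippet below is a minimal NumPy sketch of RMSE, PSNR, SAM, and ERGAS for reference; it assumes reflectance cubes of shape (height, width, bands), a caller-supplied peak value for PSNR, and a spatial-resolution ratio of 1 for ERGAS (spectral reconstruction leaves the spatial grid unchanged). SSIM is omitted because it follows the windowed formulation of [50]. The function names and scaling are assumptions for illustration only.

```python
import numpy as np

def rmse(ref: np.ndarray, rec: np.ndarray) -> float:
    """Root-mean-square error between reference and reconstructed cubes."""
    return float(np.sqrt(np.mean((ref - rec) ** 2)))

def psnr(ref: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given peak (dynamic-range) value."""
    return float(20.0 * np.log10(peak / rmse(ref, rec)))

def sam(ref: np.ndarray, rec: np.ndarray) -> float:
    """Mean spectral angle mapper (in degrees) over all pixels [48]."""
    ref_p = ref.reshape(-1, ref.shape[-1])
    rec_p = rec.reshape(-1, rec.shape[-1])
    cos = np.sum(ref_p * rec_p, axis=1) / (
        np.linalg.norm(ref_p, axis=1) * np.linalg.norm(rec_p, axis=1) + 1e-12
    )
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def ergas(ref: np.ndarray, rec: np.ndarray, ratio: float = 1.0) -> float:
    """Relative dimensionless global error in synthesis (ERGAS) [49]."""
    band_rmse = np.sqrt(np.mean((ref - rec) ** 2, axis=(0, 1)))
    band_mean = np.mean(ref, axis=(0, 1))
    return float(100.0 * ratio * np.sqrt(np.mean((band_rmse / band_mean) ** 2)))

# Example on random stand-in cubes of shape (rows, cols, bands).
if __name__ == "__main__":
    gt = np.random.rand(64, 64, 103) * 255.0
    pred = gt + np.random.randn(64, 64, 103)
    print(rmse(gt, pred), psnr(gt, pred), sam(gt, pred), ergas(gt, pred))
```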
Table 4. Reconstruction performance comparison of different methods on the PU dataset.

| Methods | RMSE (↓ 0) | PSNR (↑ ∞) | SAM (↓ 0) | SSIM (↑ 1) | ERGAS (↓ 0) |
|---|---|---|---|---|---|
| J-SLoL [9] | 3.247 | 42.357 | 3.647 | 0.975 | 10.456 |
| AWAN [27] | 2.381 | 41.819 | 2.690 | 0.979 | 7.392 |
| HSCNN [15] | 2.049 | 43.648 | 2.418 | 0.982 | 6.480 |
| SSJSR [45] | 2.269 | 42.472 | 2.593 | 0.980 | 7.149 |
| SSRAN [31] | 2.557 | 41.045 | 2.864 | 0.977 | 7.732 |
| MST++ [39] | 2.121 | 43.058 | 2.448 | 0.981 | 6.581 |
| RepCPSI [23] | 2.159 | 42.947 | 2.540 | 0.981 | 6.645 |
| MSFN [40] | 2.357 | 41.773 | 2.665 | 0.979 | 7.224 |
| MS2Net | 1.827 | 45.275 | 2.178 | 0.984 | 5.819 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
Table 5. Reconstruction performance comparison of different methods on the CQ dataset.

| Methods | RMSE (↓ 0) | PSNR (↑ ∞) | SAM (↓ 0) | SSIM (↑ 1) | ERGAS (↓ 0) |
|---|---|---|---|---|---|
| J-SLoL [9] | 11.217 | 27.973 | 6.862 | 0.582 | 29.763 |
| AWAN [27] | 2.673 | 40.734 | 2.452 | 0.974 | 7.361 |
| HSCNN [15] | 2.614 | 40.895 | 2.337 | 0.973 | 7.056 |
| SSJSR [45] | 2.748 | 40.619 | 2.495 | 0.972 | 7.449 |
| SSRAN [31] | 3.016 | 39.390 | 2.865 | 0.966 | 9.036 |
| MST++ [39] | 3.083 | 39.530 | 2.485 | 0.969 | 7.709 |
| RepCPSI [23] | 2.572 | 41.341 | 2.324 | 0.976 | 6.544 |
| MSFN [40] | 3.133 | 39.182 | 2.655 | 0.967 | 8.346 |
| MS2Net | 2.211 | 42.535 | 1.922 | 0.983 | 5.670 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
Table 6. Reconstruction performance comparison of different methods on the JX dataset.

| Methods | RMSE (↓ 0) | PSNR (↑ ∞) | SAM (↓ 0) | SSIM (↑ 1) | ERGAS (↓ 0) |
|---|---|---|---|---|---|
| J-SLoL [9] | 6.411 | 33.879 | 3.574 | 0.896 | 18.337 |
| AWAN [27] | 3.403 | 39.962 | 1.310 | 0.963 | 7.123 |
| HSCNN [15] | 3.033 | 41.645 | 1.136 | 0.972 | 5.413 |
| SSJSR [45] | 3.104 | 41.249 | 1.215 | 0.969 | 5.919 |
| SSRAN [31] | 3.154 | 40.875 | 1.283 | 0.969 | 6.360 |
| MST++ [39] | 3.090 | 42.196 | 1.028 | 0.968 | 6.221 |
| RepCPSI [23] | 2.921 | 41.275 | 1.008 | 0.971 | 6.585 |
| MSFN [40] | 3.172 | 40.223 | 1.151 | 0.964 | 7.922 |
| MS2Net | 2.209 | 43.932 | 0.737 | 0.984 | 4.443 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
Table 7. Ablation experimental results of different modules in the proposed MS2Net on the IP dataset.

| Models | RMSE (↓ 0) | PSNR (↑ ∞) | SAM (↓ 0) | SSIM (↑ 1) | ERGAS (↓ 0) |
|---|---|---|---|---|---|
| W/O 3-D MHR | 1.668 | 52.080 | 1.139 | 0.984 | 1.964 |
| W/O RCSA | 1.723 | 50.625 | 1.192 | 0.982 | 2.078 |
| W/O Spe-RCSA | 1.612 | 52.392 | 1.102 | 0.985 | 1.896 |
| W/O Spa-RCSA | 1.604 | 52.561 | 1.093 | 0.985 | 1.912 |
| W/O Spe-Net | 1.656 | 52.015 | 1.143 | 0.984 | 1.946 |
| W/O Spa-Net | 1.558 | 52.371 | 1.060 | 0.985 | 1.850 |
| W/O AWFF | 1.551 | 52.355 | 1.055 | 0.986 | 1.849 |
| MS2Net | 1.542 | 52.642 | 1.045 | 0.986 | 1.823 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
Table 8. Ablation experimental results of different modules in the proposed MS2Net on the PU dataset.

| Models | RMSE (↓ 0) | PSNR (↑ ∞) | SAM (↓ 0) | SSIM (↑ 1) | ERGAS (↓ 0) |
|---|---|---|---|---|---|
| W/O 3-D MHR | 1.841 | 45.216 | 2.180 | 0.984 | 5.849 |
| W/O RCSA | 1.866 | 44.976 | 2.236 | 0.984 | 5.898 |
| W/O Spe-RCSA | 1.835 | 45.231 | 2.192 | 0.984 | 5.866 |
| W/O Spa-RCSA | 1.833 | 45.238 | 2.193 | 0.984 | 5.832 |
| W/O Spe-Net | 1.868 | 44.915 | 2.214 | 0.984 | 5.941 |
| W/O Spa-Net | 1.864 | 45.017 | 2.227 | 0.984 | 5.913 |
| W/O AWFF | 1.859 | 45.040 | 2.211 | 0.984 | 5.914 |
| MS2Net | 1.827 | 45.275 | 2.178 | 0.984 | 5.819 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
Table 9. Ablation experimental results of different modules in the proposed MS2Net on the CQ dataset.

| Models | RMSE (↓ 0) | PSNR (↑ ∞) | SAM (↓ 0) | SSIM (↑ 1) | ERGAS (↓ 0) |
|---|---|---|---|---|---|
| W/O 3-D MHR | 2.729 | 40.624 | 2.534 | 0.971 | 7.049 |
| W/O RCSA | 2.383 | 41.786 | 2.189 | 0.978 | 6.293 |
| W/O Spe-RCSA | 2.725 | 40.806 | 2.519 | 0.972 | 6.966 |
| W/O Spa-RCSA | 2.665 | 40.679 | 2.259 | 0.976 | 6.956 |
| W/O Spe-Net | 2.844 | 40.454 | 2.611 | 0.969 | 7.151 |
| W/O Spa-Net | 3.024 | 39.802 | 2.816 | 0.964 | 7.950 |
| W/O AWFF | 2.611 | 41.082 | 2.338 | 0.975 | 6.616 |
| MS2Net | 2.211 | 42.535 | 1.922 | 0.983 | 5.670 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
Table 10. Ablation experimental results of different modules in the proposed MS2Net on the JX dataset.

| Models | RMSE (↓ 0) | PSNR (↑ ∞) | SAM (↓ 0) | SSIM (↑ 1) | ERGAS (↓ 0) |
|---|---|---|---|---|---|
| W/O 3-D MHR | 3.037 | 42.081 | 0.907 | 0.971 | 5.248 |
| W/O RCSA | 2.386 | 43.143 | 0.980 | 0.979 | 4.932 |
| W/O Spe-RCSA | 3.063 | 41.372 | 1.155 | 0.970 | 5.739 |
| W/O Spa-RCSA | 3.047 | 41.416 | 1.136 | 0.971 | 5.748 |
| W/O Spe-Net | 3.096 | 41.230 | 1.191 | 0.969 | 5.894 |
| W/O Spa-Net | 3.973 | 38.427 | 1.952 | 0.945 | 9.524 |
| W/O AWFF | 2.352 | 43.544 | 0.959 | 0.979 | 4.541 |
| MS2Net | 2.209 | 43.932 | 0.737 | 0.984 | 4.443 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
Table 11. Parameters (M), FLOPs (GMac), and runtime (ms) of deep learning methods on four datasets.

| Datasets | Metrics | AWAN [27] | HSCNN [15] | SSJSR [45] | SSRAN [31] | MST++ [39] | RepCPSI [23] | MSFN [40] | MS2Net |
|---|---|---|---|---|---|---|---|---|---|
| IP | Parameters (M) | 1.971 | 1.986 | 1.116 | 0.286 | 52.985 | 2.221 | 121.553 | 0.846 |
| | FLOPs (GMac) | 0.049 | 0.050 | 0.174 | 0.007 | 0.741 | 0.055 | 1.540 | 0.160 |
| | Runtime (ms) | 7.280 | 10.946 | 0.983 | 3.326 | 16.592 | 11.347 | 55.421 | 4.786 |
| PU | Parameters (M) | 1.844 | 1.910 | 0.507 | 0.278 | 11.675 | 2.130 | 26.761 | 0.655 |
| | FLOPs (GMac) | 0.046 | 0.048 | 0.047 | 0.007 | 0.164 | 0.053 | 0.340 | 0.053 |
| | Runtime (ms) | 6.953 | 10.513 | 0.955 | 3.290 | 16.024 | 10.559 | 53.082 | 4.634 |
| CQ | Parameters (M) | 1.839 | 1.907 | 0.533 | 0.278 | 9.737 | 2.127 | 22.310 | 0.734 |
| | FLOPs (GMac) | 0.046 | 0.048 | 0.094 | 0.007 | 0.137 | 0.053 | 0.284 | 0.100 |
| | Runtime (ms) | 6.867 | 10.107 | 0.912 | 3.266 | 16.133 | 10.057 | 51.390 | 4.594 |
| JX | Parameters (M) | 1.822 | 1.896 | 0.505 | 0.277 | 6.395 | 2.114 | 14.617 | 0.732 |
| | FLOPs (GMac) | 0.045 | 0.047 | 0.093 | 0.007 | 0.091 | 0.053 | 0.186 | 0.100 |
| | Runtime (ms) | 6.695 | 10.002 | 0.883 | 3.228 | 15.960 | 9.967 | 50.643 | 4.473 |

The highest and second-highest values in each row are highlighted in bold and underlined, respectively.
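
The parameter counts and runtimes in Table 11 can be measured with standard tooling. The sketch below is a minimal PyTorch measurement routine assuming a placeholder model and a PU-style input patch; it counts trainable parameters and times an averaged forward pass, while FLOPs counting (typically done with an external profiler) is left out because the original measurement protocol is not described here.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> float:
    """Trainable parameters in millions (the 'Parameters (M)' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def average_runtime_ms(model: torch.nn.Module, dummy: torch.Tensor,
                       warmup: int = 10, repeats: int = 100) -> float:
    """Average forward-pass time in milliseconds on the input's current device."""
    model.eval()
    for _ in range(warmup):                # warm-up passes to stabilize timing
        model(dummy)
    if dummy.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        model(dummy)
    if dummy.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats * 1e3

# Example with a placeholder 3-D convolution standing in for a full network.
if __name__ == "__main__":
    model = torch.nn.Conv3d(100, 100, kernel_size=3, padding=1)  # stand-in model
    dummy = torch.randn(1, 100, 4, 5, 5)                         # PU-style patch
    print(f"{count_parameters(model):.3f} M parameters")
    print(f"{average_runtime_ms(model, dummy):.3f} ms per forward pass")
```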
Table 12. Classification results of all competing SR methods on the IP dataset.

| Class | J-SLoL [9] | AWAN [27] | HSCNN [15] | SSJSR [45] | SSRAN [31] | MST++ [39] | RepCPSI [23] | MSFN [40] | MS2Net | R-HSI |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 70.87 | 82.72 | 88.44 | 94.22 | 87.11 | 93.64 | 83.61 | 91.83 | 90.66 | 87.55 |
| C2 | 0.83 | 28.35 | 54.09 | 57.78 | 43.44 | 45.26 | 48.37 | 73.60 | 56.96 | 61.06 |
| C3 | 86.69 | 95.36 | 92.63 | 87.25 | 90.08 | 93.31 | 82.58 | 81.50 | 92.63 | 92.63 |
| C4 | 99.01 | 91.47 | 90.11 | 95.52 | 90.58 | 95.15 | 96.22 | 96.31 | 98.35 | 97.41 |
| C5 | 0 | 92.86 | 100 | 66.66 | 86.66 | 66.66 | 42.85 | 40.00 | 93.33 | 86.66 |
| C6 | 0 | 62.11 | 47.91 | 52.08 | 42.70 | 76.84 | 59.57 | 67.67 | 55.20 | 56.25 |
| C7 | 96.77 | 92.04 | 94.82 | 93.93 | 95.22 | 93.25 | 95.02 | 92.15 | 95.46 | 96.11 |
| C8 | 94.74 | 96.77 | 92.10 | 97.89 | 90.52 | 98.41 | 98.38 | 98.97 | 98.42 | 98.42 |
| C9 | 96.23 | 88.00 | 90.69 | 91.19 | 92.67 | 95.30 | 94.30 | 94.00 | 95.16 | 98.63 |
| OA (%) | 84.73 | 85.94 | 88.57 | 89.38 | 87.88 | 90.44 | 88.98 | 90.26 | 91.61 | 92.59 |
| AA (%) | 60.57 | 81.07 | 83.42 | 81.83 | 79.89 | 84.20 | 77.88 | 81.78 | 86.24 | 86.08 |
| k × 100 | 80.03 | 82.16 | 85.52 | 86.55 | 84.60 | 87.90 | 85.91 | 87.71 | 89.36 | 90.59 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
Table 13. Classification results of all competing SR methods on the PU dataset.

| Class | J-SLoL [9] | AWAN [27] | HSCNN [15] | SSJSR [45] | SSRAN [31] | MST++ [39] | RepCPSI [23] | MSFN [40] | MS2Net | R-HSI |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 83.31 | 85.68 | 81.92 | 88.28 | 83.35 | 85.02 | 86.11 | 84.97 | 88.12 | 92.00 |
| C2 | 97.18 | 97.38 | 98.13 | 97.38 | 97.38 | 97.54 | 97.24 | 97.77 | 97.51 | 98.59 |
| C3 | 75.24 | 80.03 | 74.20 | 83.03 | 76.22 | 81.05 | 81.22 | 78.27 | 81.24 | 89.29 |
| C4 | 74.93 | 80.49 | 88.05 | 86.30 | 83.21 | 89.68 | 85.93 | 86.63 | 88.72 | 93.92 |
| C5 | 63.31 | 60.50 | 66.83 | 59.08 | 65.09 | 75.05 | 70.67 | 71.00 | 74.22 | 90.66 |
| C6 | 84.88 | 86.57 | 85.60 | 83.80 | 83.68 | 68.41 | 71.84 | 67.25 | 88.11 | 92.53 |
| C7 | 77.92 | 86.39 | 85.71 | 81.12 | 83.29 | 83.17 | 83.54 | 69.29 | 81.06 | 87.59 |
| C8 | 99.23 | 99.69 | 100 | 99.84 | 99.86 | 99.83 | 99.47 | 99.84 | 99.85 | 99.69 |
| OA (%) | 86.47 | 87.84 | 88.48 | 88.01 | 87.66 | 89.25 | 88.67 | 87.39 | 90.22 | 94.82 |
| AA (%) | 82.01 | 84.59 | 85.05 | 84.86 | 84.00 | 84.97 | 84.50 | 81.88 | 87.35 | 93.03 |
| k × 100 | 80.87 | 82.81 | 83.78 | 83.11 | 82.61 | 84.97 | 84.09 | 82.33 | 86.30 | 92.82 |

The best results in each column are highlighted in bold, and the second-best results are underlined.
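
The overall accuracy (OA), average accuracy (AA), and kappa coefficient reported in Tables 12 and 13 follow their standard confusion-matrix definitions. The following is a minimal NumPy sketch, assuming predicted and reference label vectors are already available from whichever classifier was applied to the reconstructed HSIs; the function name and example labels are illustrative.

```python
import numpy as np

def classification_scores(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int):
    """Return OA (%), AA (%), and kappa x 100 from integer label vectors.

    OA is the fraction of correctly labelled samples, AA is the mean of the
    per-class accuracies (recalls), and kappa corrects the observed agreement
    for chance agreement estimated from the confusion-matrix marginals.
    """
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)    # per-class recall
    aa = per_class.mean()
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return 100.0 * oa, 100.0 * aa, 100.0 * kappa

# Example on random stand-in labels for a 9-class problem (cf. the IP dataset).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 9, size=10000)
    noise = rng.integers(0, 9, size=10000)
    y_pred = np.where(rng.random(10000) < 0.9, y_true, noise)
    print(classification_scores(y_true, y_pred, n_classes=9))
```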
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
