Spectral-Spatial Center-Aware Bottleneck Transformer for Hyperspectral Image Classification

1 State Key Laboratory for Strength and Vibration of Mechanical Structures, Xi’an Jiaotong University, Xi’an 710049, China
2 School of Aerospace Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(12), 2152; https://doi.org/10.3390/rs16122152
Submission received: 24 March 2024 / Revised: 26 May 2024 / Accepted: 10 June 2024 / Published: 13 June 2024

Abstract

Hyperspectral images (HSIs) contain abundant spectral-spatial information and are widely used in many fields. HSI classification is a fundamental and important task, which aims to assign each pixel a specific class label. However, the high spectral variability and the limited labeled samples create challenges for HSI classification, which result in poor data separability and make it difficult to learn highly discriminative semantic features. In order to address the above problems, a novel spectral-spatial center-aware bottleneck Transformer (S2CABT) is proposed. First, the highly relevant spectral information and the complementary spatial information at different scales are integrated to reduce the impact caused by the high spectral variability and enhance the HSI's separability. Then, the feature correction layer is designed to model the cross-channel interactions, thereby promoting the effective cooperation between different channels to enhance the overall feature representation capability. Finally, the center-aware self-attention is constructed to model the spatial long-range interactions and focus more on the neighboring pixels that have relatively consistent spectral-spatial properties with the central pixel. Experimental results on the common datasets show that, compared with the state-of-the-art classification methods, S2CABT achieves better classification performance and robustness and strikes a good compromise between complexity and performance.


1. Introduction

A hyperspectral image (HSI) is obtained by continuously sampling spectral bands with a spectral sensor [1,2]. It has been successfully applied in many fields, such as geological examination, urban and rural planning, and agricultural production, due to its abundant spectral-spatial information [3,4,5]. HSI classification is a fundamental and important task, which aims to assign each pixel a specific class label [6,7,8]. It generally consists of feature extraction and label prediction: feature extraction integrates the spectral-spatial information, while label prediction finds the optimal decision boundary according to the extracted features [9].
Over the past decades, various HSI classification methods have been proposed [10]. They can be grouped into non-deep learning (non-DL)-based classification models and DL-based classification networks. Non-DL models were widely applied to HSI classification in earlier works, including the support vector machine (SVM) [11], random forest (RF) [12], and multinomial logistic regression (MLR) [13]. They usually take the original HSI data as the input features. However, the high spectral dimension makes it difficult for non-DL models to estimate the optimal model parameters [14]. In recent years, DL has shown enormous potential in computer vision tasks, improving the expressive ability for high-dimensional data. Therefore, DL networks have become the mainstream tool for HSI classification, including the autoencoder (AE) [15], deep belief network (DBN) [16,17], recurrent neural network (RNN) [18,19,20], and convolutional neural network (CNN) [21,22,23]. DL networks combine feature extraction and label prediction in a unified framework, thereby extracting the high-level semantic features that are most beneficial for classification performance during network optimization. Compared with other DL networks, the CNN can learn spectral-spatial features efficiently, so it has become the most popular DL network in HSI classification [24].
The CNN-based classification networks usually require a large number of labeled samples to estimate the optimal network parameters. However, it is difficult to obtain data labels in most cases [25]. The limited labeled samples make it difficult to learn highly discriminative semantic features from the complex spectral-spatial structure. Therefore, optimizing the feature representation learning framework is the key to solving this problem.
In the research on the CNN-based classification network framework, the channel attention plays a crucial role, which can optimize the feature learning and obtain the diverse semantic information from different channels [26]. The channel attentions widely used in HSI classification include the convolutional block attention module (CBAM) [27], selective kernel network (SK-Net) [28], and efficient channel attention network (ECA-Net) [29]. These methods can improve the network’s sensitivity to the important features, but they have certain commonalities and limitations in their implementation approaches. Specifically, the above channel attentions all use the global pooling layer to map the multi-dimensional channel features into a one-dimensional vector along the channel direction. Subsequently, the fully connected layer is used to assign an attention weight to each channel to reflect its relative importance. However, this process inevitably leads to compression and potential loss of some information that is crucial for classification decisions. In addition, the attention weight learned by the above channel attentions only represents the independent importance of a single channel, and fails to fully capture the interactions and synergies between channels. In order to promote the effective cooperation between channels to enhance the overall feature representation capability, this paper reveals the dependencies between channels by modeling the cross-channel interactions. When the given channel has significant interactions with multiple channels, it indicates that this channel contains more valuable information and should be given a higher attention weight to highlight its role. Compared with the traditional channel attention that only focuses on the independent importance of a single channel, the cross-channel interactions can enhance the discriminative ability of different channels more comprehensively, thereby providing powerful feature representation. Therefore, it is necessary to explore a novel channel attention to model the cross-channel interactions and optimize attention allocation on this basis, thereby improving the representation capabilities of the semantic features.
CNN is characterized by local connection, so it is difficult for it to model the spatial long-range interactions [30], especially for the HSI classification task, which requires cross-regional information integration. In order to address this problem, researchers have explored various more advanced network frameworks [31]. More recently, the vision Transformer based on multi-head self-attention (MHSA) has achieved great success in computer vision tasks [32]. MHSA divides the hidden state vector into multiple heads to form multiple sub-semantic spaces, which allows the network to attend to information in semantic spaces of different dimensions. The self-attention can model the spatial long-range interactions in the global receptive field and thus better capture the global contextual information, so the Transformer is widely introduced into CNN-based HSI classification networks to learn the global spectral-spatial properties [33]. As a key component of the Transformer, the self-attention is essentially a spatial attention that captures the dependencies between pixels. The HSI classification network takes the 3D patch cropped from the original HSI as the input and learns high-level spectral-spatial features to predict the class label of the central pixel, so better characterizing the central pixel is key to improving classification performance. However, it is difficult for the existing work to capture the contextual information that is beneficial to the classification decision for the central pixel. Meanwhile, the complex spectral-spatial structure means that the input 3D patch usually contains some neighboring pixels with spectral-spatial properties different from those of the central pixel, which affects the accurate prediction of the class label for the central pixel. Since the existing self-attention cannot measure the different contributions of the neighboring pixels to the central pixel, it is difficult to learn discriminative information for the central pixel from the neighboring pixels that have relatively consistent spectral-spatial properties with it. Therefore, it is necessary to introduce a spectral-spatial property similarity measure that allows the central pixel to learn discriminative features from the neighboring pixels, thereby improving classification performance.
Moreover, in addition to the impact caused by the limited labeled samples on classification performance, the high spectral variability [34] inherent in HSI is also a problem worth considering, which will lead to a poor data separability, thereby increasing uncertainty in identifying the land covers and making it difficult to obtain high-quality training samples. Therefore, increasing the HSI’s separability to provide a highly descriptive feature space for subsequent feature learning is the key to solving this problem.
The existing works usually enhance the HSI’s separability by extracting the highly representative spectral-spatial features [35], which can project the original HSI data into a feature space that integrates spectral-spatial information. Recently, various spectral-spatial feature extraction methods have been proposed, such as the extended morphological profile (EMP) [36], local binary pattern (LBP) [37], and extended morphological attribute profile (EMAP) [38]. The above features deal with HSI in a completely independent manner and do not consider the spectral-spatial correlation [39]. In contrast, the covariance matrix can combine multiple related features to better characterize land covers, and it is widely used in the spectral-spatial feature extraction, in which the element expresses the correlation between two spectral bands in the neighborhood window. However, since the high spectral variability leads to the weak spectral correlations between pixels, using all pixels in the neighborhood window to calculate the covariance matrix will easily introduce some erroneous information. Meanwhile, the covariance matrix is extracted within the neighborhood window with a fixed size, so it is difficult to fully adapt to the complex spectral-spatial structure. For the regions containing the features with multiple scales (such as the homogeneous regions in a small range coexisting with the heterogeneous regions in a large range), it is difficult to simultaneously explore the detailed information and the effective representation for the larger structures. Based on the above analysis, it can be seen that integrating spectral-spatial information from the neighboring pixels with the high spectral correlation at different scales can further increase the HSI’s separability.
In order to enhance the data separability while optimizing the design for the existing channel attentions and the self-attention, a novel Spectral-Spatial Center-Aware Bottleneck Transformer (S2CABT) is proposed. S2CABT constructs a parallel dual-branch structure, which first extracts the multi-scale spectral-spatial features. Then, the heterogeneous convolution kernels are used to separate the spectral and spatial features, and the high-level spectral and spatial features are learned in parallel. Finally, the learned spectral and spatial features are concatenated and a fully connected layer is used to predict the class label. The main contributions of this paper can be summarized as follows:
(1)
The Spectral-Spatial Feature Extraction Module (S2FEM) is designed to extract the Multi-Scale Covariance Matrix (MSCM). It constructs the spectral correlation selection block to remove the neighboring pixels with weak spectral correlation, thereby reducing the impact caused by the high spectral variability and obtaining highly relevant spectral information. Meanwhile, S2FEM introduces the multi-scale analysis to explore the complementary spatial information at different scales and improve the representation power for the land covers.
(2)
A novel channel attention, i.e., the Feature Correction Layer (FCL), is designed to model the cross-channel interactions using the inner products and considers all the information in each channel. Compared with the existing channel attentions, FCL can promote the effective cooperation between channels to enhance overall feature representation capability. Furthermore, FCL is combined with the convolutional layer to construct the Residual Convolution Module (RCM), which can exploit the local spectral-spatial properties.
(3)
The Center-Aware Bottleneck Transformer (CABT) is designed to capture the global spectral-spatial properties, which develops the Center-Aware Multi-Head Self-Attention (CAMHSA) to model the spatial long-range interactions. Meanwhile, CAMHSA introduces the spectral-spatial feature embedding to measure the property similarities between pixels, thereby learning discriminative features that are beneficial to characterizing the central pixel from the neighboring pixels that have relatively consistent spectral-spatial properties with the central pixel.
The experimental results on the common datasets show that, compared with the state-of-the-art classification methods, S2CABT achieves better performance and robustness and strikes a good compromise between complexity and performance.
The rest of this paper is organized as follows. Section 2 introduces the related works. Section 3 presents the S2CABT. Section 4 validates and analyzes the proposed network. Section 5 discusses the S2CABT’s advantages. Section 6 draws the conclusion.

2. Related Works

This section will elaborate on the existing HSI classification networks from two aspects: the CNN in HSI classification and the Transformer in HSI classification.

2.1. CNN in HSI Classification

With the in-depth research on deep learning and related technologies, CNN has gradually been widely used in HSI classification. For example, Hu et al. first proposed a five-layer 1D CNN to extract the spectral features [40]. Li et al. proposed a pixel-level 1D CNN that obtains the pixel-pair spectral features based on the correlation between pixels [41]. In order to integrate the spatial information, Zhao et al. proposed a 2D CNN [42], which takes the patch cropped from the original HSI as the input, thereby capturing the spatial correlation between pixels. Hamida et al. proposed a 3D CNN to extract the joint spectral-spatial features [43]. In order to learn more discriminative spectral and spatial features, the classification network with a dual-branch structure is proposed [44,45]. Guo et al. proposed a serial dual-branch structure [44]. Yu et al. proposed a parallel dual-branch structure [45]. In the serial dual-branch structure, the subsequent feature will be impacted by the previous feature. In contrast, the parallel dual-branch structure can learn spectral and spatial features independently without impacting each other, which can extract more robust features and is more suitable for HSI classification.
The network depth is crucial for computer vision tasks, especially for processing HSI with complex spectral-spatial information. However, over-increasing network depth will lead to degradation. The residual connection is designed to address this problem with a better performance [46], which has become a basic structure in the CNN-based HSI classification networks [47,48,49]. For example, Zhong et al. proposed the spectral and spatial residual blocks to sequentially learn the spectral and spatial features [47]. Paoletti et al. proposed a pyramid residual block, which gradually increases the feature dimension as the network depth increases to learn more semantic features [48]. Zhang et al. partitioned HSI into multiple sub-bands and used several residual blocks to construct the feature extraction stage to extract semantic features on each sub-band [49]. However, the above residual blocks only focus on the feature learning, and ignore the feature optimization.
Since the convolutional layer usually uses multiple convolutional kernels to learn multi-channel features, and different channels can provide different semantic information for HSI classification, researchers usually introduce the channel attention into the residual block to adaptively emphasize the channel that contains the valuable information [50]. For example, Zhu et al. combined the spectral-spatial attention design based on CBAM with the convolutional layers to construct a spectral-spatial residual block [51]. Roy et al. combined the efficient feature recalibration layer design based on ECA-Net with the convolutional layer to construct the different residual blocks in different network depths [52]. Shu et al. combined the spectral and spatial attention layer design based on SK-Net with the convolutional layer to construct the spectral and spatial residual blocks [53]. The existing channel attentions usually use the global pooling layer to compress the multi-channel features into a 1D vector along the channel, and learn an attention weight for each channel using the fully connected layer. Specifically, let the feature size be C × H × W . The existing channel attentions first use the global pooling layer to compress the feature into C × 1 × 1 . Then, the fully connected layer is used to learn the attention weight vector with size C × 1 × 1 . Finally, the attention weight vector is multiplied with the multi-channel feature to obtain the emphasized feature. Each element in the attention weight vector represents the importance of the corresponding channel. Therefore, the existing channel attention easily loses some significant information due to the use of the global pooling layer, and it is difficult for the learned importance to describe the cross-channel interactions.
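To make the compression step described above concrete, the following minimal PyTorch sketch (our illustration, not code from the cited papers) shows the squeeze-and-excitation-style pattern: global pooling collapses each C × H × W feature to C × 1 × 1, a fully connected bottleneck produces one weight per channel, and the weights rescale the input. The channel count and reduction ratio are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class GlobalPoolChannelAttention(nn.Module):
    """SE-style channel attention as described above (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # C x H x W -> C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # one weight per channel
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # rescale each channel independently

x = torch.randn(2, 16, 9, 9)
print(GlobalPoolChannelAttention(16)(x).shape)       # torch.Size([2, 16, 9, 9])
```

As the text notes, this design learns only one independent weight per channel and discards spatial detail in the pooling step, which is exactly the limitation the proposed FCL targets.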

2.2. Transformer in HSI Classification

CNN is characterized by local connection, so it restricts the receptive field to a local window, which makes it difficult to learn the global semantic feature. As a key component in the Transformer, the self-attention has the global receptive field to better capture global context information, which shows great potential in modeling the spatial long-range interactions. The framework of the self-attention is shown in Figure 1.
Let the input feature be $F$. The self-attention uses three linear projections to obtain the query ($Q$), key ($K$), and value ($V$):

$$Q = F W_Q, \quad K = F W_K, \quad V = F W_V \tag{1}$$

where $W_Q$, $W_K$, and $W_V$ are trainable projection matrices. The self-attention can be formulated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V \tag{2}$$

where $d$ is the dimension of the input feature. MHSA is designed based on the self-attention. It first uses $W_i^Q$, $W_i^K$, and $W_i^V$ to obtain $Q_i$, $K_i$, and $V_i$ in the $i$-th head. Then, $Q_i$, $K_i$, and $V_i$ are used to compute the self-attention according to the framework shown in Figure 1. Finally, the self-attention outputs of all heads are concatenated and a linear projection is applied to the concatenated result. MHSA with $h$ heads can be formulated as

$$\mathrm{MHSA} = \mathrm{Concat}(A_1, A_2, \ldots, A_h) W_o \tag{3}$$

where $W_o$ is a learnable projection matrix and $\mathrm{Concat}(\cdot)$ is the concatenation operation. $A_i$ is the output of the $i$-th head, which can be formulated as

$$A_i = \mathrm{Attention}(Q_i, K_i, V_i) \tag{4}$$
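For reference, the following PyTorch sketch implements Equations (1)–(4) directly; the batch, token, and dimension sizes are illustrative assumptions, not values from the paper.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal MHSA following Equations (1)-(4); sizes are illustrative."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dk = heads, dim // heads
        self.wq = nn.Linear(dim, dim, bias=False)    # W_Q
        self.wk = nn.Linear(dim, dim, bias=False)    # W_K
        self.wv = nn.Linear(dim, dim, bias=False)    # W_V
        self.wo = nn.Linear(dim, dim, bias=False)    # W_o

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, n, _ = f.shape                            # (batch, tokens, dim)
        def split(t):                                # -> (batch, heads, tokens, dk)
            return t.view(b, n, self.heads, self.dk).transpose(1, 2)
        q, k, v = split(self.wq(f)), split(self.wk(f)), split(self.wv(f))
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.dk), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)   # concat heads
        return self.wo(out)

tokens = torch.randn(2, 81, 64)                      # e.g. a 9x9 patch flattened
print(MultiHeadSelfAttention(64, 4)(tokens).shape)   # torch.Size([2, 81, 64])
```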
Introducing the Transformer into CNN-based HSI classification networks is currently a mainstream approach to overcoming the CNN's difficulty in modeling the spatial long-range interactions. For example, Sun et al. proposed the spectral-spatial feature tokenization Transformer (SSFTT) [54]. SSFTT uses a feature tokenizer with Gaussian weights to enhance the semantic information in the low-level spectral-spatial features, and uses the Transformer to model the spatial long-range interactions. Liao et al. proposed the spectral-spatial fusion Transformer network (S2FTNet) [55]. S2FTNet constructs the spectral and spatial Transformer modules and designs the multi-head double self-attention (MHDSA) to model the spatial long-range interactions of the spectral and spatial features, respectively. Ouyang et al. proposed the hybrid Transformer (HF) [56]. HF constructs a multi-granularity feature generator and designs the spectral-spatial self-attention to focus on the channels and positions with larger differences.
According to the above calculation process, it can be seen that the computational complexity of MHSA is affected by the feature channel dimension. In order to balance the computational complexity and performance, Srinivas et al. proposed the bottleneck Transformer [57], which combines MHSA’s modeling capabilities for the spatial long-range interactions with the lightweight design of bottleneck structures. The bottleneck structure was proposed by He et al. [46] in ResNet, as shown in Figure 2a. It contains three convolution layers. The first layer uses 1 × 1 convolution to reduce the channel dimension, and the second layer uses 3 × 3 convolution to extract semantic features, and the third layer uses 1 × 1 convolution to rescale the channel dimension. The bottleneck Transformer replaces the second layer with MHSA, which is shown in Figure 2b. The self-attention in the bottleneck Transformer can be formulated as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q P^{T} + Q K^{T}}{\sqrt{d}}\right) V \tag{5}$$

where $P$ is a relative position encoding, which consists of the relative height encoding $R_h$ and the relative width encoding $R_w$.
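A minimal sketch of the bottleneck Transformer block in Figure 2b follows, using PyTorch's built-in nn.MultiheadAttention for the middle stage. The relative position term of Equation (5) and the paper's exact channel sizes are omitted; all dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckTransformerBlock(nn.Module):
    """Sketch of Figure 2b: a ResNet bottleneck whose middle 3x3 convolution is
    replaced by multi-head self-attention over all spatial positions."""
    def __init__(self, in_ch: int, mid_ch: int, heads: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)    # 1x1: shrink channels
        self.mhsa = nn.MultiheadAttention(mid_ch, heads, batch_first=True)
        self.expand = nn.Conv2d(mid_ch, in_ch, kernel_size=1)    # 1x1: restore channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        y = self.reduce(x)                                        # (b, mid, h, w)
        tokens = y.flatten(2).transpose(1, 2)                     # (b, h*w, mid)
        attn_out, _ = self.mhsa(tokens, tokens, tokens)           # global self-attention
        y = attn_out.transpose(1, 2).reshape(b, -1, h, w)         # back to a feature map
        return x + self.expand(y)                                 # residual connection

x = torch.randn(2, 64, 9, 9)
print(BottleneckTransformerBlock(64, 32)(x).shape)                # torch.Size([2, 64, 9, 9])
```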
In recent years, the bottleneck Transformers have also been introduced into the HSI classification. For example, Song et al. introduced the bottleneck Transformer into HSI classification to extract global spectral and spatial features, thereby proposing the spectral-spatial bottleneck Transformer (S2BT) [58]. Zhang et al. designed a multilayer residual convolution module to extract the low-level spectral-spatial features, and input the extracted features into the dual-dimension spectral-spatial bottleneck Transformer to simultaneously model the global channel-spatial attention [59].
Unlike other computer vision tasks, HSI classification requires predicting the class label for the central pixel of the input 3D patch. However, the input 3D patch usually contains some neighboring pixels with spectral-spatial properties different from those of the central pixel, which affects the accurate prediction of the class label for the central pixel. It is difficult for MHSA to measure the different contributions of the neighboring pixels to the central pixel, which makes it unable to explore discriminative information from the neighboring pixels that have relatively consistent spectral-spatial properties with the central pixel.

3. Spectral-Spatial Center-Aware Bottleneck Transformer

In this section, we first show the overall framework. Then, the main modules in S2CABT are described in detail. Finally, an implementation example is provided.

3.1. Overall Framework

As shown in Figure 3, a novel Spectral-Spatial Center-Aware Bottleneck Transformer (S2CABT) is proposed, which can be divided into four modules: Spectral-Spatial Feature Extraction Module (S2FEM), Residual Convolution Module (RCM), Center-Aware Bottleneck Transformer (CABT), and classification module.
Specifically, since the high spectral variability results in poor HSI separability, S2FEM is constructed to extract the Multi-Scale Covariance Matrix (MSCM). S2FEM uses spectral correlation selection to remove the neighboring pixels with weak spectral correlation and introduces multi-scale analysis to obtain complementary spatial information at different scales. Compared with the original HSI data, MSCM integrates highly relevant spectral information and multi-scale spatial information, so it has stronger separability. In order to build a reasonable network framework to learn highly discriminative semantic features from the limited labeled samples, this paper designs RCM and CABT with a parallel dual-branch structure. Specifically, RCM separates the spectral and spatial features from MSCM using two heterogeneous convolutional kernels, and stacks spectral and spatial residual blocks on the spectral and spatial branches to learn the local semantic features, respectively. Among them, the spectral and spatial residual blocks are composed of the convolutional layer and the Feature Correction Layer (FCL). FCL can model the cross-channel interactions to promote the effective cooperation between channels, thereby enhancing the overall feature representation capability. Similarly, CABT stacks the center-aware bottleneck Transformer encoders on the spectral and spatial branches to learn the global semantic features, respectively. In the center-aware bottleneck Transformer encoders, the Center-Aware Multi-Head Self-Attention (CAMHSA) is designed to model the spatial long-range interactions. Moreover, CAMHSA introduces the spectral-spatial feature embedding to measure the property similarities between the neighboring pixels and the central pixel. The classification module concatenates the learned high-level spectral feature $X_{Spe}^{Out}$ and spatial feature $X_{Spa}^{Out}$ and predicts the class labels using the fully connected layer, whose calculation process can be formulated as

$$\hat{y}_i = \mathrm{Softmax}\!\left(\mathrm{FC}\!\left(\mathrm{Concat}\!\left(X_{Spe}^{Out}, X_{Spa}^{Out}\right)\right)\right) \tag{6}$$

where $\hat{y}_i$ denotes the predicted class probabilities, $\mathrm{FC}(\cdot)$ is the fully connected layer, and $\mathrm{Concat}(\cdot)$ is the concatenation operator. In the next sections, we will detail the principles associated with S2FEM (Section 3.2), RCM (Section 3.3), and CABT (Section 3.4).
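As a concrete illustration of Equation (6), the short sketch below uses assumed sizes (32-dimensional branch outputs, 16 classes); the real dimensions follow from the network configuration in Section 3.5.

```python
import torch
import torch.nn as nn

# Illustrative sketch of Equation (6): concatenate the pooled spectral and spatial
# branch features and map them to class probabilities. Sizes are assumptions.
num_classes = 16
x_spe_out = torch.randn(4, 32)                     # pooled spectral branch feature
x_spa_out = torch.randn(4, 32)                     # pooled spatial branch feature
fc = nn.Linear(64, num_classes)                    # fully connected layer FC(.)
y_hat = torch.softmax(fc(torch.cat([x_spe_out, x_spa_out], dim=1)), dim=1)
print(y_hat.shape)                                 # torch.Size([4, 16])
```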

3.2. Spectral-Spatial Feature Extraction Module

As shown in Figure 4, in order to reduce the impact caused by the high spectral variability and enhance the HSI's separability, S2FEM is designed to extract MSCM.
First, in order to remove the redundant spectral bands and decrease the memory requirements, the maximum noise fraction (MNF) [60] is used to generate the reduced HSI $X_R \in \mathbb{R}^{H \times W \times R}$, where $R$ is the reduced spectral dimension. Second, let $\Omega_x^s = \{x_1, x_2, \ldots, x_{D_s^2}\} \in \mathbb{R}^{D_s \times D_s \times R}$ be the neighborhood window centered at pixel $x_1$ with size $D_s \times D_s$, where $D_s$ is the window size at scale $s$. However, the high spectral variability results in weak spectral correlations between pixels, and calculating the covariance matrix of the central pixel using all neighboring pixels in $\Omega_x^s$ easily introduces some erroneous information. In order to address this problem, we design the spectral correlation selection block. Specifically, the cosine distance is used to measure the spectral correlation between the central pixel and each neighboring pixel, which can be formulated as

$$\mathrm{sim}(x_1, x_i) = \frac{\langle x_1, x_i \rangle}{\|x_1\|_2 \cdot \|x_i\|_2}, \quad i = 2, \ldots, D_s \times D_s \tag{7}$$

where $\langle \cdot, \cdot \rangle$ and $\|\cdot\|_2$ denote the inner product and the Frobenius norm, respectively. According to Equation (7), the spectral correlations between all neighboring pixels and the central pixel can be obtained, and the average correlation is calculated. Only those neighboring pixels whose correlation with the central pixel is greater than or equal to the average correlation are retained, and the retained pixels are recorded as $\{x_k\}_{k=2,3,\ldots,K}$. Therefore, a new neighborhood window is constructed, i.e., $\{x_k\}_{k=1,2,3,\ldots,K}$. The covariance matrix of the central pixel $x_1$ is computed using the new neighborhood window, which can be formulated as

$$CM(x_1) = \frac{1}{K-1} \sum_{i=1}^{K} (x_i - \mu)(x_i - \mu)^{T} \in \mathbb{R}^{R \times R} \tag{8}$$

where $\mu$ is the average spectral vector of the new neighborhood window and $(\cdot)^{T}$ is the transpose operation. Then, in order to obtain the complementary spatial information at different scales, this paper extracts the covariance matrix of the central pixel $x_1$ in neighborhood windows of different sizes according to the above computational process. The MSCM of the central pixel $x_1$ is obtained by concatenating the covariance matrices at all scales, and its size is $1 \times 1 \times B_c$, where $B_c = N_s \times R \times R$ and $N_s$ denotes the number of scales. Finally, the HSI's MSCM can be obtained by combining the MSCMs of all pixels according to their corresponding positions, and its size is $H \times W \times B_c$, where $H$ and $W$ are the height and width of the HSI. In summary, compared with the original HSI, MSCM integrates the highly relevant spectral information and the complementary spatial information at different scales, thereby enhancing the HSI's separability.
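The following NumPy sketch illustrates Equations (7) and (8) for a single pixel under the example settings of Section 3.5 (R = 25, scales 3/15/27); boundary padding and the MNF reduction itself are omitted, and the random cube stands in for real data.

```python
import numpy as np

def covariance_with_selection(window: np.ndarray) -> np.ndarray:
    """Sketch of Equations (7)-(8): keep only neighbours whose cosine similarity
    to the central pixel is >= the average similarity, then compute the R x R
    covariance matrix from the retained pixels."""
    d, _, r = window.shape
    pixels = window.reshape(-1, r)                       # (D_s * D_s, R) spectra
    centre = pixels[(d * d) // 2]                        # central pixel x_1
    sims = pixels @ centre / (np.linalg.norm(pixels, axis=1)
                              * np.linalg.norm(centre) + 1e-12)
    kept = pixels[sims >= sims.mean()]                   # spectral correlation selection
    mu = kept.mean(axis=0, keepdims=True)
    return (kept - mu).T @ (kept - mu) / max(len(kept) - 1, 1)

def mscm_for_pixel(x_r: np.ndarray, row: int, col: int, scales=(3, 15, 27)) -> np.ndarray:
    """Concatenate the covariance matrices of all scales into a length N_s * R * R
    vector (boundary handling omitted for brevity)."""
    feats = []
    for d in scales:
        half = d // 2
        window = x_r[row - half:row + half + 1, col - half:col + half + 1, :]
        feats.append(covariance_with_selection(window).ravel())
    return np.concatenate(feats)

x_r = np.random.rand(145, 145, 25)                       # assumed MNF-reduced cube
print(mscm_for_pixel(x_r, 72, 72).shape)                 # (1875,) = 3 * 25 * 25
```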

3.3. Residual Convolution Module

As shown in the orange dashed box in Figure 3, in order to model the cross-channel interactions to better describe the local spectral-spatial properties, RCM is constructed based on the convolution layer and the channel attention. Specifically, RCM first crops the 3D patch from MSCM. Then, it uses the 1 × 1 convolution to adjust the feature channel. Finally, the Spectral Residual Convolution Branch (ERCB) and the Spatial Residual Convolution Branch (ARCB) are built to learn local spectral and spatial features, respectively.
As shown in Figure 5, let the size of the input 3D patch be $S \times S$ and the dimension adjusted by the 1 × 1 convolutional layer be $D_1$. First, inspired by [61], ERCB and ARCB use convolutional kernels of sizes 1 × 1 × 7 and 3 × 3 × $D_1$ to separate the spectral and spatial features from the input 3D patch. Then, ERCB and ARCB learn local spectral and spatial features by stacking the spectral and spatial residual blocks, which are composed of the convolutional layer, batch normalization, the rectified linear unit (ReLU) activation function, and FCL. Among them, the sizes of the convolutional kernels in the spectral and spatial residual blocks are 1 × 1 × 7 and 3 × 3 × 1, respectively. Finally, ERCB and ARCB use convolutional kernels of sizes 1 × 1 × $D_2$ and 3 × 3 × 1 to output the local spectral and spatial features, respectively, where $D_2 = (D_1 - 6)/2$.
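The effect of the heterogeneous kernels can be pictured with the short PyTorch sketch below; the channel count, the small $D_1$, and the padding choices are our assumptions rather than the paper's configuration, and serve only to show how one kernel mixes bands while the other collapses them.

```python
import torch
import torch.nn as nn

# Illustrative separation of spectral and spatial features with heterogeneous
# 3D kernels. Input layout: (batch, 1, spectral bands D_1, S, S).
patch = torch.randn(2, 1, 24, 9, 9)

spectral_conv = nn.Conv3d(1, 8, kernel_size=(7, 1, 1))                      # 1 x 1 x 7
spatial_conv = nn.Conv3d(1, 8, kernel_size=(24, 3, 3), padding=(0, 1, 1))   # 3 x 3 x D_1

spe = spectral_conv(patch)      # mixes neighbouring bands, keeps the 9 x 9 spatial grid
spa = spatial_conv(patch)       # collapses the spectral dimension, mixes 3 x 3 neighbourhoods
print(spe.shape, spa.shape)     # torch.Size([2, 8, 18, 9, 9]) torch.Size([2, 8, 1, 9, 9])
```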
In order to promote the effective cooperation between channels and enhance the overall feature representation capability, FCL is designed to model the cross-channel interactions. Inspired by the self-attention, the inner product is used to compute the interaction between every two channels, which represents the projection of one channel onto another; a larger projection means a stronger interaction between the two channels. As shown in Figure 6, the different colors in the input multi-channel features represent different channels. First, FCL calculates the inner product between every two channels and saves the result in the attention matrix according to their channel indexes. Then, the average pooling layer and the Softmax function are used to eliminate outliers and normalize the attention matrix. Each row in the attention matrix represents the interactions between the corresponding channel and the other channels. Finally, matrix multiplication is performed on the input features and the attention matrix to emphasize the channels that contain valuable information, thereby outputting the enhanced multi-channel features. The FCL's calculation process can be formulated as

$$X_{FCL}^{OUT} = F_{FCL}\!\left(X_{FCL}^{IN}\right) \tag{9}$$

$$F_{FCL}\!\left(X_{FCL}^{IN}\right) = X_{CA}^{OUT} + X_{FCL}^{IN} \tag{10}$$

$$X_{CA}^{OUT} = \mathrm{Softmax}\!\left(\mathrm{AvgPool}\!\left(X_{FCL}^{IN} \cdot \left(X_{FCL}^{IN}\right)^{T}\right)\right) X_{FCL}^{IN} \tag{11}$$

where $X_{FCL}^{IN}$ and $X_{FCL}^{OUT}$ are the input and output features, respectively. FCL uses the inner product to calculate the cross-channel interactions and considers all the information in each channel, which overcomes the drawbacks that traditional channel attention easily loses some significant information and only learns the independent importance of a single channel.
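A compact PyTorch sketch of FCL as described by Equations (9)–(11) is given below; the average pooling kernel size is our assumption, since the exact setting follows Figure 6.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCorrectionLayer(nn.Module):
    """Sketch of Equations (9)-(11): cross-channel interactions via inner products,
    smoothed by average pooling, normalised with Softmax, and applied with a
    residual connection. The 3x3 pooling kernel is an assumption."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (b, c, h, w)
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)                               # each row = one channel
        interactions = flat @ flat.transpose(1, 2)               # (b, c, c) inner products
        smoothed = F.avg_pool2d(interactions.unsqueeze(1), 3,
                                stride=1, padding=1).squeeze(1)  # suppress outliers
        attn = torch.softmax(smoothed, dim=-1)                   # normalise each row
        corrected = (attn @ flat).view(b, c, h, w)               # emphasise useful channels
        return corrected + x                                     # residual connection

x = torch.randn(2, 16, 9, 9)
print(FeatureCorrectionLayer()(x).shape)                         # torch.Size([2, 16, 9, 9])
```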
The residual connection is widely used in HSI classification networks to prevent network degradation. Compared with the existing residual blocks, RCM has significant advantages. On the one hand, RCM adopts a dual-branch structure and uses the heterogeneous convolution kernels with different receptive fields to learn the spectral and spatial features respectively, which helps S2CABT more comprehensively understand the HSI’s intrinsic structure. On the other hand, unlike the channel attention used in the existing residual blocks, this paper introduces FCL into RCM, which enables RCM to reveal the interactions and synergies between channels and optimize attention allocation on this basis, thereby improving the ability to characterize the complex spectral-spatial structure.

3.4. Center-Aware Bottleneck Transformer

As shown in the blue dashed box in Figure 3, in order to better describe the global spectral-spatial properties, CABT is designed based on the bottleneck Transformer. Similar to RCM, CABT contains two branches: the Spectral Center-Aware Bottleneck Transformer (ECABT) and the Spatial Center-Aware Bottleneck Transformer (ACABT). ECABT and ACABT first learn global hierarchical features by stacking the Center-Aware Bottleneck Transformer Encoders (CABTE). Then, the global average pooling layer is used to compress features for the subsequent prediction of class labels. As shown in Figure 7, CABTE uses the 1 × 1 convolutional kernel to ensure the consistency of channel. Moreover, CAMHSA can learn discriminative information from the neighboring pixels for the central pixel while modeling the spatial long-range interactions to improve the classification performance, whose framework is shown in Figure 8.
The existing self-attention is usually used to model the spatial long-range interactions, while neglecting to capture the contextual information that is beneficial to the classification decision for the central pixel. In order to address this problem, this paper introduces the spectral-spatial feature embedding $E \in \mathbb{R}^{D \times S \times S}$ to measure the different contributions of the neighboring pixels to the central pixel, thereby enabling CAMHSA to focus more on the neighboring pixels that have relatively consistent spectral-spatial properties with the central pixel, where $D$ is the feature dimension in each head. Specifically, CAMHSA calculates the similarities between the central pixel and the neighboring pixels in the feature map using the Feature Similarity Measure (FSM) to measure the spectral-spatial property consistency. Let $e_1 \in \mathbb{R}^{D \times 1 \times 1}$ be the central pixel and $e_i \in \mathbb{R}^{D \times 1 \times 1}$ be a neighboring pixel; the similarity between them can be formulated as

$$\mathrm{FSM}(e_i) = \frac{\langle e_1, e_i \rangle}{\|e_1\|_2 \cdot \|e_i\|_2} \in \mathbb{R}^{1 \times S \times S}, \quad i = 1, 2, \ldots, S^2 \tag{12}$$
In the center-aware self-attention, we use four linear projection layers to generate $Q$, $K$, $V$, and the spectral-spatial feature embedding $E$ from the input feature. The calculation process of the center-aware self-attention can be formulated as

$$\mathrm{Attention}(Q, K, V, E) = \mathrm{softmax}\!\left(\frac{Q (R_h + R_w)^{T} + Q K^{T}}{\sqrt{D}}\right) V \cdot \mathrm{FSM}(E) \tag{13}$$

where $\mathrm{Attention}(\cdot)$ is the output of the center-aware self-attention, and $R_h$ and $R_w$ denote the relative height encoding and the relative width encoding, respectively. Let the input feature map be $F$. CAMHSA with $h$ heads can be formulated as

$$\mathrm{CAMHSA} = \mathrm{Concat}(A_1, A_2, \ldots, A_h) W_o \tag{14}$$

where $W_o$ is the learnable projection matrix. $A_i$ is the output of the $i$-th head, which can be formulated as

$$A_i = \mathrm{Attention}\!\left(F W_i^Q, F W_i^K, F W_i^V, F W_i^E\right) \tag{15}$$

where $W_i^Q$, $W_i^K$, $W_i^V$, and $W_i^E$ are the learnable projection matrices in the $i$-th head. CAMHSA concatenates the attention computed by each head and uses a linear projection to obtain the desired attention.
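The single-head PyTorch sketch below conveys the idea behind Equations (12)–(15): standard self-attention whose output is re-weighted by each position's cosine similarity to the central pixel's embedding. The relative position encodings, multi-head splitting, and all sizes are assumptions omitted or simplified for brevity.

```python
import math
import torch
import torch.nn as nn

class CenterAwareSelfAttention(nn.Module):
    """Single-head sketch of center-aware self-attention: the attention output is
    scaled by the FSM similarity of every position to the central pixel."""
    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.we = nn.Linear(dim, dim, bias=False)     # spectral-spatial embedding E

    def forward(self, f: torch.Tensor) -> torch.Tensor:    # f: (b, n, dim), n = S*S
        b, n, d = f.shape
        q, k, v, e = self.wq(f), self.wk(f), self.wv(f), self.we(f)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(d), dim=-1)
        out = attn @ v                                       # (b, n, dim)
        centre = e[:, n // 2:n // 2 + 1, :]                  # embedding of central pixel
        fsm = torch.cosine_similarity(e, centre, dim=-1)     # (b, n) similarity to centre
        return out * fsm.unsqueeze(-1)                       # focus on consistent pixels

tokens = torch.randn(2, 81, 32)                              # a 9x9 patch, D = 32
print(CenterAwareSelfAttention(32)(tokens).shape)            # torch.Size([2, 81, 32])
```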
Compared with the original self-attention, the center-aware self-attention designed in this paper can not only model the spatial long-range interactions, but also learn the discriminative features that are beneficial to characterizing the central pixel from the neighboring pixels that have relatively consistent spectral-spatial properties with it, thereby enhancing the S2CABT's ability to capture the contextual information and interpret HSI data with a complex spectral-spatial structure.

3.5. Implementation Example

In order to show S2CABT more clearly, this section takes the Indian Pines dataset with size 145 × 145 × 200 as an example to describe the computation process of S2CABT in detail. Let the Indian Pines dataset be $X \in \mathbb{R}^{145 \times 145 \times 200}$. It is worth noting that all parameters appearing below are assumed values chosen to facilitate the presentation of the implementation.
In S2FEM, MNF is first used to generate the reduced HSI $X_R \in \mathbb{R}^{145 \times 145 \times 25}$. Second, we extract the covariance matrix of pixel x at different scales, and the window sizes at the different scales are 3, 15, and 27, respectively. When the window size is 3, the neighborhood window with size 3 × 3 × 25 centered at x is selected. The correlations between all pixels and the central pixel are calculated based on the cosine distance, and the pixels with correlation greater than or equal to the average correlation are retained. The covariance matrix of x is calculated using the retained pixels according to Equation (8), and its size is reshaped to 1 × 1 × 625. Then, when the window sizes are 15 and 27, the covariance matrices are calculated according to the above steps, and they are concatenated to obtain the MSCM of x with size 1 × 1 × 1875. Finally, the MSCMs of all pixels are combined according to their corresponding positions to obtain the whole MSCM with size 145 × 145 × 1875.
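A quick arithmetic check of the dimensions quoted above:

```python
# MSCM dimension for this example: R = 25 MNF components, three scales (3, 15, 27).
R, scales = 25, (3, 15, 27)
B_c = len(scales) * R * R
print(B_c)          # 1875, so the whole MSCM has size 145 x 145 x 1875
```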
In RCM, we crop the input 3D patch $x_p \in \mathbb{R}^{9 \times 9 \times 1875}$ from the MSCM and reshape it to 9 × 9 × 200 using the 1 × 1 convolution. Assume that ERCB and ARCB have two spectral residual blocks and two spatial residual blocks, respectively, and that ECABT and ACABT have two CABTEs each. The implementations of ERCB and ARCB are shown in Table 1 and Table 2, respectively. The implementations of ECABT and ACABT are the same, so they are shown in Table 3. For ease of notation, the convolutional layer, batch normalization, and activation function are abbreviated as Conv-BN-ReLU.
The global spectral and spatial features are concatenated to obtain the fusion feature of size 64 × 1 × 1 and mapped into class probabilities using a fully connected layer. According to the above analysis, the implementation process of the proposed S2CABT is shown in Algorithm 1.
Algorithm 1: Implementation process of S2CABT.
Input: HSI dataset, ground truth, train sample indexes, validation sample indexes, test sample indexes, reduced dimension, multiscale window sizes, patch size, the number of convolutional kernels, batch size, epochs, learning rate.
Output: The predicted class labels on test samples.
1: Extract MSCM from the original HSI data.
2: Crop the training patch, validation patch, and test patch from MSCM. Generate the training label, validation label, and test label from the ground truth.
3: Create the training loader, validation loader, and test loader.
4: For e = 1 to epochs
5:  Use the 2D convolutional kernel to adjust the feature dimension.
6:  Use ERCB and ARCB to learn the local spectral and spatial features.
7:  Use ECABT and ACABT to learn the global spectral and spatial features.
8:  Concatenate the learned global spectral and spatial features.
9:  Input the concatenated features into the classification module to predict class labels.
10:  Compute the loss function and update the network parameters on the training loader.
11:  Apply the trained network to the validation loader and compute the loss function. If the loss is smaller than that of the previous epoch, save the network parameters of this epoch; otherwise, do not save them.
12: End for
13: Predict class labels on the test loader using the best network.
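A hedged sketch of the training loop in Algorithm 1 follows; the model, data loaders, and MSCM extraction are assumed to exist, and, as a small variant of step 11, this sketch keeps the parameters with the best validation loss seen so far rather than comparing against the previous epoch only.

```python
import copy
import torch
import torch.nn as nn

def train_s2cabt(model, train_loader, val_loader, epochs=200, lr=3e-4, device="cpu"):
    """Illustrative training loop for Algorithm 1 (steps 4-13)."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state = float("inf"), copy.deepcopy(model.state_dict())
    for epoch in range(epochs):
        model.train()
        for patches, labels in train_loader:                   # steps 5-10
            optimizer.zero_grad()
            loss = criterion(model(patches.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()
        model.eval()
        val_loss = 0.0
        with torch.no_grad():                                  # step 11
            for patches, labels in val_loader:
                val_loss += criterion(model(patches.to(device)),
                                      labels.to(device)).item()
        if val_loss < best_val:                                # keep the best parameters
            best_val, best_state = val_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)                          # step 13: best network
    return model
```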

4. Experiments and Analysis

This section first describes the common HSI datasets. Second, the relevant experimental setup is introduced. Then, we discuss some important parameters. Next, S2CABT is compared with 3 non-DL-based models and 13 CNN-based networks. Moreover, the ablation experiments are implemented. Finally, we analyze the complexity of S2CABT.

4.1. Dataset’s Description

In order to validate the S2CABT's classification performance, the experiments are carried out on the Indian Pines, Kennedy Space Center, Houston, and Pavia University datasets.
Indian Pines dataset: This dataset was taken by the AVIRIS on Indian Pines in northwestern Indiana, USA. It contains 145 × 145 pixels and 200 spectral bands. Its ground truth provides 16 land-cover classes for 10,249 labeled pixels.
Kennedy Space Center (KSC) dataset: This dataset was taken by the AVIRIS on Kennedy Space Center in Florida, USA. It contains 512 × 614 pixels and 176 spectral bands. Its ground truth provides 13 land-cover classes for 5211 labeled pixels.
Houston dataset: This dataset was taken by the ITRES CASI-1500 on the University of Houston in Texas, USA. It contains 349 × 1905 pixels and 144 spectral bands. Its ground truth provides 15 land-cover classes for 15,029 labeled pixels.
Pavia University (PaviaU) dataset: This dataset was taken by the ROSIS on Pavia University in Pavia, Italy. It contains 610 × 340 pixels and 103 spectral bands. Its ground truth provides 9 land-cover classes for 42,776 labeled pixels.

4.2. Experimental Setup

Evaluation Metrics: Three common evaluation metrics are used to measure the classification performance: Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient (κ).
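As a reference for these metrics, the NumPy sketch below computes OA, AA, and κ from a confusion matrix; the toy labels are illustrative only.

```python
import numpy as np

def classification_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """OA, AA, and Cohen's kappa from a confusion matrix (illustrative sketch)."""
    classes = np.unique(y_true)
    cm = np.zeros((len(classes), len(classes)), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[np.searchsorted(classes, t), np.searchsorted(classes, p)] += 1
    oa = np.trace(cm) / cm.sum()                               # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))                 # average per-class accuracy
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / cm.sum() ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(classification_metrics(y_true, y_pred))                  # (0.667, 0.667, 0.5)
```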
Configuration Detail: The training and test sets are selected in the form of percentages. In order to eliminate the effect of randomness, we carry out five repeated experiments. The average classification performance and the standard deviation are reported.
Parameters Setting: For the comparison methods, their parameters are set to the optimal values suggested in the corresponding papers, and the optimal classification performance is achieved on each HSI dataset to the maximum extent possible. Meanwhile, S2CABT is trained using the Adam optimizer, the window sizes at the different scales are 3, 15, and 27, and epochs = 200. Other parameters are discussed in the next section.

4.3. Parameters Discussion

Batch Size: The batch size is the number of training samples input to the classification network in each iteration. A smaller batch size means that the gradient in each iteration is calculated from fewer samples, which increases the randomness of the training process and may speed up convergence. However, this may also make the training process more volatile, thus affecting the classification performance. A larger batch size reduces this randomness, thereby making the gradient updates smoother and the network convergence process more stable. However, the excessive smoothness may cause the optimization to fall into a wide valley region and converge to a local minimum, affecting the classification performance. Therefore, in order to determine an appropriate batch size, this paper validates the impact of the batch size on the classification performance. We select different batch sizes from the set {16, 32, 48, 64, 80, 96} (step 16) to carry out the experiments with 2% training samples. The average OA, AA, κ and their standard deviations are shown in Figure 9a–c. As can be seen from Figure 9, the batch size has basically no impact on the classification performances of the Indian Pines, Houston, and PaviaU datasets. In contrast, the impact of the batch size on the classification performance of the KSC dataset is more obvious. When the batch size is 32 or 48, its classification performance is better than when the batch size is set to other values. Considering that the batch size is directly related to the memory occupied by the network parameters in each iteration, this paper sets the batch size of the PaviaU dataset to 32. In order to maintain consistency, the batch sizes of the other three datasets are also set to 32.
Learning rate: The learning rate is the extent to which the network parameters are updated along the gradient direction during the optimization process. A smaller learning rate will make the network parameters gradually approach the optimal solution, but convergence will be very slow and the optimization easily falls into a wide local minimum region, thereby making it difficult to obtain the optimal value and affecting the classification performance. A larger learning rate will cause the network parameters to change too much in each iteration, making it easy to skip over the optimal solution. Meanwhile, a larger learning rate may cause the network parameters to stop improving quickly, making the network unable to mine valuable information and affecting the classification performance. Therefore, in order to determine an appropriate learning rate, this paper validates the impact of the learning rate on the classification performance. We select different learning rates from the set {0.00003, 0.0001, 0.0003, 0.001, 0.003, 0.01} to carry out the experiments with 2% training samples. The average OA, AA, κ and their standard deviations are shown in Figure 10a–c. As the learning rate increases, the classification performances of the Indian Pines and Houston datasets gradually increase and then decrease, reaching the optimal classification performance when the learning rate is 0.0003. The classification performance of the KSC dataset fluctuates up and down as the learning rate increases, and it has an optimal value when the learning rate is 0.003. Meanwhile, it can be seen that when the learning rate is 0.0001 or 0.0003, the classification performance of the KSC dataset is very close to the optimal value. In addition, the learning rate has basically no impact on the classification performance of the PaviaU dataset. As the learning rate increases, its classification performance remains basically unchanged. In order to unify the learning rate of the proposed S2CABT across the different datasets as much as possible, this paper sets the learning rate of all four datasets to 0.0003.
Reduced spectral dimension: S2FEM uses the MNF to reduce the spectral dimension, thereby removing the redundant spectral bands and decreasing the memory requirements. Since the spectral bands contain abundant spectral information, a smaller reduced spectral dimension will not only remove the redundant spectral bands, but also easily remove some bands containing discriminative spectral information, thereby resulting in the loss of spectral features and affecting the identification of land covers. A larger reduced spectral dimension easily introduces some redundant spectral information, which makes it more difficult to extract the useful spectral features and increases the computational complexity. Therefore, in order to determine an appropriate reduced spectral dimension, this paper validates the impact of the reduced spectral dimension on the classification performance. We select different reduced spectral dimensions from the set {5, 10, …, 50} (step 5) to carry out the experiments with 2% training samples. The average OA, AA, κ and their standard deviations are shown in Figure 11a–c. It can be seen from Figure 11 that as the reduced spectral dimension increases, the classification performances of the four datasets gradually increase. Moreover, when the reduced spectral dimension is 15 or greater, their classification performances remain basically unchanged. In other words, reducing the spectral dimension does lose some spectral features and affects the classification performance, but only when the reduced spectral dimension is small. When the reduced spectral dimension is appropriately increased, the impact of the spectral reduction becomes smaller.
Patch size: The patch size relates to the spatial features that can be learned by the classification network. A smaller patch size will lose some important spatial information and the pixels with high correlation to the current pixel, which results in an unsmooth predicted class label map. A larger patch size will introduce some useless or erroneous spatial information and increase the computational cost. Therefore, in order to determine an appropriate patch size, this paper validates the impact of the patch size on the classification performance. We select different patch sizes from the set {3, 5, …, 21} (step 2) to carry out the experiments with 2% training samples. The average OA, AA, κ and their standard deviations are shown in Figure 12a–c. It can be seen from Figure 12 that as the patch size increases, the classification performances of the Indian Pines and Houston datasets gradually increase and then decrease. The Indian Pines dataset has the optimal classification performance when the patch size is 5. The Houston dataset has essentially the same classification performance when the patch sizes are 7 and 9. Considering the computational complexity, its patch size is set to 7. As the patch size increases, the classification performances of the KSC and PaviaU datasets fluctuate up and down. They reach the optimal classification performances when their patch sizes are 13 and 7, respectively.
Convolutional kernels: The number of convolutional kernels is related to the high-level semantic features that can be learned by the classification network. Too few convolutional kernels cannot learn enough high-level semantic features, which limits the representational ability of the classification network and leads to underfitting. Too many convolutional kernels tend to learn redundant or erroneous high-level semantic features, which leads to overfitting and increases the computational complexity. Therefore, in order to determine an appropriate number of convolutional kernels, this paper validates the impact of the number of convolutional kernels on the classification performance. We select different numbers of convolutional kernels from the set {4, 8, …, 40} (step 4) to carry out the experiments with 2% training samples. The average OA, AA, κ and their standard deviations are shown in Figure 13a–c. It can be seen from Figure 13 that the classification performances of the Indian Pines, KSC, and PaviaU datasets gradually increase as the number of convolutional kernels increases, and then fluctuate up and down. The Indian Pines and KSC datasets achieve the optimal classification performance when the number of convolutional kernels is 20. The PaviaU dataset reaches the optimal classification performance when the number of convolutional kernels is 28. As the number of convolutional kernels increases, the classification performance of the Houston dataset gradually increases and then decreases, with the optimum at 28 convolutional kernels.

4.4. Classification Results

In order to demonstrate the classification performance, our network is compared with 3 non-DL models: SVM [11], RF [12], MLR [13] and 13 CNN networks: ResNet-34 [40], SSRN [47], DPRN [48], A2S2K [52], RSSAN [51], FADCNN [45], SPRN [49], FGSSCA [44], SSFTT [54], BS2T [58], S2FTNet [55], S3ARN [53], HF [56].
Table 4, Table 5, Table 6 and Table 7 show the statistical classification results of different methods on the Indian Pines, KSC, Houston, and PaviaU datasets with 2% training samples, respectively. The best results are in bold. We report the average ± standard deviation on the classification accuracy in each class (CA), OA, AA and κ . It can be seen that our network outperforms the other methods in most CA, all OA, all AA, and all κ on the four datasets. The non-DL-based models perform better than CNN-based networks on classes with fewer training samples. This is because fewer training samples cannot accurately estimate the network parameters, thereby resulting in underfitting. In the classes with more training samples, CNN-based networks perform better than the non-DL-based models. This is because CNN-based networks have strong feature learning capabilities and can predict land-cover classes more accurately using sufficient training samples.
Specifically, for the Indian Pines dataset, the proposed network has the highest CA in 10 classes (16 classes in total). Its evaluation metrics on the Indian Pines dataset are the best, which are 94.80%, 94.20%, and 0.9407, respectively. SPRN has the highest CA in 5 classes, among which it reaches 100% on class 8. Apart from S2CABT, it is the best-performing network in terms of CA. SPRN partitions the spectral bands into sub-bands and learns spectral-spatial features on each sub-band. Each sub-band can be regarded as a new dataset, so SPRN increases the available training samples by spectral partitioning. However, since it only uses the convolutional layer to learn the local features and ignores the cross-channel interactions as well as the global features, it is difficult for the features learned by SPRN to accurately describe all classes. Therefore, its performance on the other classes is not outstanding and is even poor in some cases. Moreover, its OA, AA, and κ differ from those of our network by 4.56%, 4.29%, and 0.0519, respectively. The closest classification performance to our network is that of BS2T; its evaluation metrics are 94.01%, 91.46%, and 0.9316, respectively. The differences between S2CABT and BS2T in the evaluation metrics are 0.79%, 2.74%, and 0.0091, respectively. BS2T learns the local and global features using CNN and Transformer, respectively, but it ignores the enhancement of semantic features by cross-channel interactions and spectral-spatial properties. Moreover, it does not consider the impact of limited labeled samples. Therefore, its CA reaches 100% only on class 8, and it is worse than S2CABT on the other classes.
For the KSC dataset, the proposed network has the highest CA in 10 classes (13 classes in total). Its evaluation metrics are the best, which are 99.11%, 98.42%, and 0.9901, respectively. It can be seen that the KSC dataset is more separable than the Indian Pines dataset, especially on classes 10 and 13, where 7 and 10 classification methods reach 100% CA, respectively. The main reason is that the number of training samples of each class in the KSC dataset is more balanced, which avoids overfitting to a certain extent. On this basis, the proposed network extracts more representative multi-scale spectral-spatial features to further enhance the KSC’s separability. The closest classification performance to our network is FGSSCA; its evaluation metrics are 98.36%, 97.43%, and 0.9818, respectively. The differences between S2CABT and FGSSCA on evaluation metrics are 0.75%, 0.99%, and 0.0083, respectively. Similar to SPRN, FGSSCA groups the shallow features and constructs the serial dual-branch structure to learn spectral and spatial features sequentially, thereby increasing the availability of training samples. However, in the serial dual-branch structure, the spatial features will be affected by the spectral features, which makes it difficult to learn highly discriminative semantic features.
For the Houston dataset, S2CABT has the highest CA in 2 classes (15 classes in total). Moreover, the difference in CA between our network and the method with optimal CA is small. For example, the difference between the CA of our network and the optimal CA is only 0.45%, 0.61%, 0.08%, 0.03%, and 0.17% on the classes 2, 3, 5, 7, and 11. The evaluation metrics of the proposed network are the best, which are 94.24%, 94.35%, and 0.9377, respectively. The closest classification performance to our network is FGSSCA; its evaluation metrics are 94.04%, 93.96%, and 0.9355, respectively. The differences between S2CABT and FGSSCA on the evaluation metrics are 0.2%, 0.39%, and 0.0022, respectively. For the PaviaU dataset, S2CABT has the highest CA in 2 classes (9 classes in total). The difference in CA between our network and the method with optimal CA is small. For example, the difference between the CA of our network and the optimal CA is only 0.26%, 0.03%, 1.14%, 0.4%, 0.09%, and 0.54% on classes 1, 2, 4, 5, 6, and 8. The evaluation metrics of the proposed network are the best, which are 99.35%, 98.75%, and 0.9914, respectively. The closest classification performance to our network is SPRN; its evaluation metrics are 99.27%, 98.61%, and 0.9903, respectively. The differences between S2CABT and SPRN on evaluation metrics are 0.08%, 0.14%, and 0.0011, respectively.
The visual classification performance of different methods on the Indian Pines dataset with 2% training samples is shown in Figure 14. Overall, the CNN-based networks perform better than the non-DL-based models. The main reason is that the CNN-based networks usually take a 3D patch as the input feature and thus consider the spectral and spatial information jointly. They combine the feature extraction and the label prediction in a unified framework, and can dynamically learn more discriminative high-level semantic features under the guidance of the loss function. The non-DL-based models take 1D spectral vectors as input features, which consider only the spectral information and ignore the spatial information, resulting in many misclassified pixels in the class label map. They also take the input feature directly as the classification basis, which makes it difficult to estimate the optimal model parameters. In contrast, S2CABT obtains a smoother class label map and predicts the labels of most labeled samples more accurately.
Specifically, the land covers in the Indian Pines dataset are close to each other, so feature extraction and label prediction are easily affected by adjacent land covers. The dataset also has a coarse spatial resolution (20 m), which leads to higher spectral variability. Since the land covers differ in size, the numbers of training samples of the different classes are unbalanced. Both the spectral variability and the unbalanced training samples make it difficult to estimate the optimal network parameters. As shown in Figure 14, most methods are unable to accurately classify the land covers and recognize the image edges. The non-DL-based models and a few CNN-based networks (e.g., ResNet and DPRN) produce many discrete predicted labels, because the non-DL-based models ignore the spatial information, while ResNet and DPRN underfit when the training samples are too few. The remaining CNN-based networks are prone to over-smoothing, i.e., land covers of different classes are recognized as the same class. The main reason is that these networks use the original HSI data as the input feature, which makes it difficult to learn spectral-spatial features with high discriminative power from data with low separability. In contrast, S2CABT enhances the data separability by integrating the highly relevant spectral information and the complementary spatial information at different scales. Meanwhile, it optimizes the designs of the channel attention and the self-attention to learn more discriminative semantic features, which strengthens S2CABT's ability to interpret HSI data with complex spectral-spatial structure and to capture contextual information, thereby better describing the essential properties of land covers.
In order to compare the classification performance of the different methods on the four datasets more comprehensively, we repeat the classification task with training sizes of 2%, 4%, 6%, 8%, 10%, 15%, and 20%; the experimental results are shown in Figure 15. The non-DL-based models outperform some CNN-based networks when the training size is small. For example, they are better than DPRN, RSSAN, and FADCNN on the Indian Pines dataset, better than RSSAN, SSFTT, and S2FTNet on the KSC dataset, better than DPRN, RSSAN, and S3ARN on the Houston dataset, and better than FADCNN on the PaviaU dataset. As the training size increases, the CNN-based networks gradually overtake the non-DL-based models, mainly because more training samples allow them to learn highly discriminative semantic features. Different from the comparison methods, S2CABT extracts the MSCM with stronger separability and builds a more efficient feature learning module, so it still achieves strong classification performance with a small training size and outperforms the other methods. As the training size increases, the gap between S2CABT and the comparison methods gradually narrows, but our network remains the best.
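For readers reproducing these experiments, the per-class sampling implied by a given training size can be sketched as follows. This is a minimal sketch rather than the authors' own sampling code: the function and variable names are ours, and the rule of keeping at least one training sample per class is an assumption that is merely consistent with the very small classes discussed above.

```python
# Minimal sketch: per-class random sampling of labeled HSI pixels by training size.
# `gt` is assumed to be a 2-D ground-truth map in which 0 marks unlabeled pixels.
import numpy as np

def split_by_ratio(gt: np.ndarray, train_ratio: float = 0.02, seed: int = 0):
    """Return (train_coords, test_coords) as arrays of (row, col) pixel positions."""
    rng = np.random.default_rng(seed)
    train_xy, test_xy = [], []
    for c in np.unique(gt[gt > 0]):                      # iterate over labeled classes
        rows, cols = np.nonzero(gt == c)
        coords = np.stack([rows, cols], axis=1)
        idx = rng.permutation(len(coords))
        n_train = max(1, int(round(train_ratio * len(coords))))  # at least one sample per class
        train_xy.append(coords[idx[:n_train]])
        test_xy.append(coords[idx[n_train:]])
    return np.concatenate(train_xy), np.concatenate(test_xy)

# Example: a 2% training split, as used for Figure 14.
# train_xy, test_xy = split_by_ratio(gt, train_ratio=0.02)
```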
The robustness of a classifier is easily affected by many factors, such as the sample quality and the feature representation capacity. In order to improve S2CABT's robustness, this paper projects the HSI into a feature space with higher descriptive power by integrating the highly relevant spectral information and the complementary spatial information at different scales, thereby increasing the data separability and providing high-quality training samples for subsequent feature learning. In addition, this paper designs FCL and CAMHSA to improve the feature representation learning capacity. FCL models the cross-channel interactions and optimizes the attention allocation on this basis, while CAMHSA introduces a spectral-spatial property similarity measure that allows the central pixel to learn discriminative information from its neighboring pixels.
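CAMHSA is specified in this paper only at the level just described. The block below is therefore a hedged sketch of one way such a center-aware attention could look, in which the spectral-spatial similarity measure is simplified to a cosine similarity between each token and the central token and added as a bias to the attention logits; the class name, the number of heads, and the unscaled additive bias are all our assumptions, not the exact CAMHSA layer.

```python
# Hedged sketch of center-aware self-attention over the tokens of a patch:
# standard multi-head attention plus an additive bias that favors tokens whose
# features are similar to the central-pixel token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterAwareSelfAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens):                          # tokens: (N, L, dim), L = H * W
        n, l, d = tokens.shape
        center = tokens[:, l // 2]                      # central-pixel token
        bias = F.cosine_similarity(tokens, center.unsqueeze(1), dim=-1)   # (N, L)
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        def split(t):                                   # (N, L, d) -> (N, heads, L, d // heads)
            return t.view(n, l, self.heads, -1).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (N, heads, L, L)
        attn = attn + bias[:, None, None, :]            # favor neighbors consistent with the center
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(n, l, d)
        return self.out(out)
```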
It can be seen from Figure 15 that S2CABT maintains better classification performance even with fewer training samples. As the training size increases, its classification performance gradually improves, and compared with the other classification methods, S2CABT converges faster. These observations indicate that our network has better robustness and stability.

4.5. Ablation Study

In order to validate the impact of the important contributions in S2CABT on the classification performance, an ablation study is designed. It analyzes the contributions from three aspects: the input feature, the framework, and the attention.
Ablation study on the input feature: In order to reduce the impact caused by the high spectral variability and enhance the HSI's separability, the MSCM is extracted to reconstruct the input feature space. The MSCM integrates the highly relevant spectral information and the complementary spatial information at different scales using the spectral correlation selection (SCS) and the spatial multi-scale analysis, respectively. In this experiment, we compare the classification performance when the original HSI, the single-scale covariance matrix, the MSCM without SCS, and the MSCM are used as the input feature. Since three scales with window sizes 3, 15, and 27 are used to extract the MSCM, the corresponding window sizes D1 = 3, D2 = 15, and D3 = 27 need to be considered when using a single-scale covariance matrix as the input feature. In order to analyze the differences between the input features more intuitively, T-SNE is used to project them into a two-dimensional space for visualization. As shown in Figure 16, the different colors in each subgraph represent samples of different classes. The experimental results on the four HSI datasets are shown in Table 8, with the best results in bold. We report the average ± standard deviation on OA, AA, and κ.
As can be seen from Figure 16, affected by the high spectral variability, samples with different class labels overlap with each other in the original HSI data, and the distribution of samples with the same class label is relatively scattered, making it difficult for the classification network to learn the optimal parameters from training samples drawn from the original HSI data. Consistently, Table 8 shows that the classification performance is poor when the original HSI data is used as the input feature. In contrast, the MSCM exhibits clear decision boundaries between samples with different class labels, and the distribution of samples with the same class label is more concentrated; the classification performance with the MSCM as the input feature is accordingly much better than with the original HSI data. The separability of the input features therefore directly affects the classification performance. In order to validate whether spectral information is lost when S2FEM uses the spectral correlation selection to remove neighboring pixels with low spectral correlation, this section also visualizes the MSCM without SCS. The visualization shows that using all pixels in the neighborhood window to calculate the covariance matrix of the central pixel leads to larger intra-class distances and smaller inter-class distances. Therefore, the spectral correlation selection block does not lose spectral features; on the contrary, it yields highly discriminative spectral information and improves the classification performance. Meanwhile, the multi-scale analysis enriches the feature representation: the small-scale covariance matrix captures fine details, while the large-scale covariance matrix focuses more on the overall representation. The visualization of the three single-scale covariance matrices shows that it is difficult to obtain valuable information from the Indian Pines dataset in the small-scale region, whereas a feature space with high descriptive power can be constructed in the large-scale region. The KSC, Houston, and PaviaU datasets obtain effective feature representations at all scales, and in the small-scale region they extract spectral-spatial features with even stronger separability.
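To make the flow from neighborhood window to multi-scale covariance feature concrete, the following is a simplified single-pixel sketch of the idea, not the exact S2FEM implementation: the window sizes 3, 15, and 27 come from the experiment above, whereas the selection rule (keeping the fraction of neighbors most correlated with the center, controlled by a hypothetical keep_ratio) and the plain sample covariance are our assumptions.

```python
# Simplified sketch of a multi-scale covariance feature with spectral
# correlation selection (SCS) for one pixel of a hyperspectral cube.
import numpy as np

def local_covariance(patch: np.ndarray, keep: int) -> np.ndarray:
    """patch: (D, D, B) window around the central pixel; returns a (B, B) covariance."""
    D, _, B = patch.shape
    pixels = patch.reshape(-1, B)                         # all spectra in the window
    center = pixels[(D * D) // 2]
    # Pearson correlation between the central spectrum and every neighbor.
    pc = pixels - pixels.mean(1, keepdims=True)
    cc = center - center.mean()
    corr = pc @ cc / (np.linalg.norm(pc, axis=1) * np.linalg.norm(cc) + 1e-12)
    selected = pixels[np.argsort(corr)[-keep:]]           # SCS: keep the most correlated pixels
    selected = selected - selected.mean(0, keepdims=True)
    return selected.T @ selected / max(keep - 1, 1)       # covariance of the retained spectra

def mscm(cube: np.ndarray, r: int, c: int, windows=(3, 15, 27), keep_ratio=0.5):
    """Stack covariance matrices computed at several window sizes (cube assumed padded)."""
    feats = []
    for D in windows:
        h = D // 2
        patch = cube[r - h:r + h + 1, c - h:c + h + 1, :]
        feats.append(local_covariance(patch, keep=max(2, int(keep_ratio * D * D))))
    return np.stack(feats)                                # (num_scales, B, B)
```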
Ablation study on the framework: In order to comprehensively characterize the spectral-spatial properties, the residual convolution module and the center-aware bottleneck Transformer are designed, and a parallel dual-branch structure is built to learn robust spectral and spatial features. The feature learning module can be divided into four sub-modules: ERCB, ARCB, ECABT, and ACABT. ERCB and ARCB are constructed based on the CNN to learn the local spectral and spatial features, respectively, while ECABT and ACABT are designed based on the bottleneck Transformer to learn the global spectral and spatial features, respectively. In this experiment, we remove one or more sub-modules to validate their impact on the classification performance. The experimental results are shown in Table 9, with the best results in bold; ✘ means the module is removed, while ✓ means the module is retained. We report the average ± standard deviation on OA, AA, and κ. Each sub-module plays a vital role: after removing one or more modules, the classification performance degrades to varying degrees.
Specifically, this experiment can be divided into three groups: removing one sub-module, removing two sub-modules, and removing all sub-modules. On the KSC, Houston, and PaviaU datasets, the classification performance after removing one sub-module is better than in the other settings, while removing all sub-modules gives the worst performance. On the Indian Pines dataset, removing one sub-module is likewise the best setting overall; however, the AA after removing two sub-modules is worse than the AA after removing all modules. This occurs when an entire spectral or spatial branch is removed, i.e., when both ERCB and ECABT or both ARCB and ACABT are removed. The main reason is that the spectral or spatial features are lost after removing a branch, which makes it difficult to accurately describe the spectral-spatial properties. In particular, for land covers that provide only one training sample, removing a branch yields a much lower CA for these classes, which affects OA and κ only slightly but affects AA strongly.
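To make the ablation setup easier to follow, the sketch below shows how the four sub-modules could be composed into the parallel dual-branch skeleton in PyTorch. The class and argument names are ours, the sub-modules are left as placeholders (identity mappings) rather than the real ERCB/ARCB/ECABT/ACABT of Tables 1-3, and "removing" a sub-module in the ablation simply corresponds to leaving its slot as an identity.

```python
# Skeleton of the parallel dual-branch feature learning stage used in the ablation.
# Each branch module is assumed to map the input patch to a (N, feat_dim, 9, 9) map.
import torch
import torch.nn as nn

class DualBranch(nn.Module):
    def __init__(self, ercb=None, ecabt=None, arcb=None, acabt=None,
                 feat_dim=32, num_classes=16):
        super().__init__()
        ident = nn.Identity()
        # Spectral branch: local (ERCB) then global (ECABT) feature learning.
        self.ercb = ercb if ercb is not None else ident
        self.ecabt = ecabt if ecabt is not None else ident
        # Spatial branch: local (ARCB) then global (ACABT) feature learning.
        self.arcb = arcb if arcb is not None else ident
        self.acabt = acabt if acabt is not None else ident
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, x):
        spe = self.ecabt(self.ercb(x))          # spectral features
        spa = self.acabt(self.arcb(x))          # spatial features
        spe = self.pool(spe).flatten(1)
        spa = self.pool(spa).flatten(1)
        return self.head(torch.cat([spe, spa], dim=1))

# Example with all placeholders kept as identities:
# net = DualBranch()
# logits = net(torch.randn(8, 32, 9, 9))        # 8 patches of 32-channel 9x9 features
```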
Ablation study on the attention: In order to optimize the feature extraction and measure the different contributions of the neighboring pixels to the central pixel, FCL and CAMHSA are designed. FCL uses the inner product between channels to model the cross-channel interactions, which alleviates the drawbacks of existing channel attention, namely that it easily loses important information and learns only an independent importance for each single channel. CAMHSA introduces a feature similarity measure on top of MHSA to focus more on the neighboring pixels whose spectral-spatial properties are consistent with those of the central pixel. In this experiment, in order to validate the advantages of FCL and CAMHSA, we replace them with existing channel attention mechanisms (CBAM, SK-Net, and ECA-Net) and with MHSA, respectively. The experimental results are shown in Table 10, with the best results in bold. We report the average ± standard deviation on OA, AA, and κ. Table 10 shows that both FCL and CAMHSA improve the classification performance; compared with the existing channel attention and MHSA, they demonstrate stronger feature optimization and learning capabilities.
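FCL is characterized above only by its use of channel inner products. The following hedged sketch shows one possible realization of that idea, in which a channel Gram matrix replaces the independent per-channel squeeze of SE/ECA-style attention; the class name, the aggregation over interaction scores, and the sigmoid gating are assumptions for illustration, not the authors' exact layer.

```python
# Hedged sketch of an FCL-style channel attention: cross-channel interactions are
# modeled with an inner-product (Gram) matrix instead of independent channel weights.
import torch
import torch.nn as nn

class ChannelInteractionAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels, bias=False)   # mixes interaction scores

    def forward(self, x):                          # x: (N, C, H, W)
        n, c, h, w = x.shape
        flat = x.flatten(2)                        # (N, C, H*W)
        gram = torch.bmm(flat, flat.transpose(1, 2)) / (h * w)  # (N, C, C) channel inner products
        scores = gram.mean(dim=2)                  # aggregate each channel's interactions
        weights = torch.sigmoid(self.proj(scores)).view(n, c, 1, 1)
        return x * weights                         # re-weighted (corrected) features
```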

4.6. Network Complexity Analysis

In order to describe the proposed network more comprehensively, network complexity experiments are conducted in this section. Table 11 reports the complexity of the CNN-based classification networks involved in this paper on the four datasets from four aspects: the trainable parameters that are updated during backpropagation, the computational cost, the training time, and the testing time. The trainable parameters are given in M, the computational cost in GFLOPs, and the training and testing times in s, where 1 M = 10^6 parameters and 1 GFLOPs = 10^9 FLOPs.
As can be seen from Table 11, although this paper constructs a parallel dual-branch structure and designs four sub-modules to learn highly discriminative spectral-spatial features, the complexity of our network is not the highest. This is due to the small convolution kernels used in the residual convolution module. In addition, the learned semantic features compress the spectral dimension, which provides an ideal feature size for subsequent feature learning. In the center-aware bottleneck Transformer, the proposed network uses 1 × 1 convolutions to adjust the channels, further reducing the complexity. Some networks do have a lower complexity than ours; however, the difference is small, while the classification performance of our network is far better than theirs. This demonstrates that S2CABT does not rely on excessive increases in trainable parameters, computational cost, or training and testing time to improve the classification performance. In other words, S2CABT achieves a good compromise between network complexity and classification performance.
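As a small reproducibility aid, trainable parameters in the sense used here (those updated during backpropagation) can be counted directly in PyTorch; FLOPs and run times are usually obtained with external profilers and simple timers and are omitted from this sketch.

```python
# Counting trainable parameters, reported in M (10^6), for any PyTorch network.
import torch.nn as nn

def trainable_params_in_millions(model: nn.Module) -> float:
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Example with a stand-in module; any of the networks compared in Table 11
# could be passed instead.
# print(f"{trainable_params_in_millions(nn.Conv2d(32, 32, kernel_size=3)):.3f} M")
```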

5. Discussion

This section mainly discusses the significant advantages of S2CABT compared to the existing state-of-the-art classification networks from the following aspects:
(1)
Network framework: The significant advantages of the network framework lie in the reconstruction of the input feature space and the optimization of the feature representation learning. S2FEM integrates highly relevant spectral information and multi-scale complementary spatial information. Compared with the state-of-the-art classification networks that use the original HSI as the input feature, S2CABT increases the HSI's separability and provides a highly descriptive feature space for the subsequent feature learning stages.
Unlike the residual blocks used in existing classification networks [47,48,49,51,52,53], RCM processes the spectral and spatial information in parallel, thereby improving computational efficiency. Meanwhile, RCM introduces FCL to reveal the interactions and synergies between different channels and optimize the attention allocation, thereby enhancing the overall feature representation ability.
CABT is a novel bottleneck Transformer, which develops the CAMHSA to model the spatial long-range interactions. CAMHSA is different from the self-attentions used in the state-of-the-art classification networks [54,55,56,57,58,59]. It introduces spectral-spatial feature embedding to measure the property differences between pixels, thereby capturing the contextual information that is beneficial to the central pixel classification decision.
(2)
Classification performance: This paper carries out the comparison experiments between S2CABT and 13 existing state-of-the-art classification networks on four common HSI datasets, and reports the average evaluation metrics and the corresponding standard deviations. The experimental results show that S2CABT effectively addresses some key challenges in HSI classification through the innovative design. It shows better classification performance and robustness, and outperforms the existing state-of-the-art classification networks.
(3)
Network complexity: This paper compares S2CABT with 13 existing state-of-the-art classification networks in terms of network complexity and running time. The introduction of the novel modules inevitably increases the complexity and running time, but both remain lower than those of some existing classification networks. In other words, S2CABT achieves a good compromise between the network complexity and the classification performance.

6. Conclusions

This paper proposes a novel HSI classification network framework to reduce the impact caused by the high spectral variability and the limited labeled samples. The high spectral variability leads to poor data separability, so this paper designs the spectral correlation selection block to remove the neighboring pixels with low spectral correlation. Meanwhile, the complementary spatial information at different scales is integrated. The limited labeled samples make it difficult for the classification network to effectively learn feature representations, so this paper designs FCL and CAMHSA to optimize feature learning. FCL is used to model the interactions and synergies between channels to enhance the overall feature representation capability. CAMHSA is used to model the spatial long-range interactions while capturing the contextual information that is beneficial to the central pixel classification decisions.
This paper validates the proposed S2CABT on the common HSI datasets. The experimental results show that, compared with the state-of-the-art classification methods, S2CABT achieves better classification performance and robustness, together with a good compromise between complexity and performance.
Although S2CABT has achieved satisfactory performance, in order to consolidate and improve the current research results, we will continue to optimize the HSI classification network from the following aspects in the future:
(1)
We will design more efficient feature extraction methods to improve the ability to represent the complex spectral-spatial structure.
(2)
We will continue to explore the novel channel attention to enable the classification network to dynamically adjust the attention weights according to the input content and the feature distribution to achieve more refined feature learning.
(3)
Considering the prominent position of the central pixel in the HSI classification, we will continue to research the interactions between pixels at different scales, thereby enhancing the feature representation learning for the central pixel.
In addition, this work mainly focuses on HSI classification. In order to expand its application scope, applying S2CABT to other remote sensing image processing tasks, such as object detection and scene classification, is an important direction for our future work. For object detection, we will integrate the cross-channel interaction and the spatial long-range attention into object detection frameworks to enhance their ability to detect small or complexly distributed objects. For scene classification, since scenes usually involve wider contextual information and complex structures, the attention mechanism in S2CABT can be used to emphasize semantic features that are crucial to scene understanding. By adjusting the network structure, for example by adding self-attention that handles larger scene contexts, cross-regional semantic features can be better extracted, thereby providing more precise clues for distinguishing different scene classes.

Author Contributions

Conceptualization, M.Z., Y.Y. and D.H.; methodology, M.Z. and Y.Y.; software, M.Z., S.Z. and P.M.; validation, M.Z.; writing—original draft preparation, M.Z.; writing—review and editing, Y.Y. and D.H.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number U22A2045.

Data Availability Statement

The Indian Pines and KSC datasets can be obtained at https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes. The Houston dataset can be obtained at https://hyperspectral.ee.uh.edu/?page_id=459 (all accessed on 23 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, H.; Wang, L.; Liu, H.; Sun, Y. Hyperspectral image classification with the orthogonal self-attention ResNet and two-step support vector machine. Remote Sens. 2024, 16, 1010. [Google Scholar] [CrossRef]
  2. Yang, J.; Qin, J.; Qian, J.; Li, A.; Wang, L. AL-MRIS: An active learning-based multipath residual involution siamese network for few-shot hyperspectral image classification. Remote Sens. 2024, 16, 990. [Google Scholar] [CrossRef]
  3. Guo, H.; Liu, W. S3L: Spectrum Transformer for self-supervised learning in hyperspectral image classification. Remote Sens. 2024, 16, 970. [Google Scholar] [CrossRef]
  4. Cui, B.; Wen, J.; Song, X.; He, J. MADANet: A lightweight hyperspectral image classification network with multiscale feature aggregation and a dual attention mechanism. Remote Sens. 2023, 15, 5222. [Google Scholar] [CrossRef]
  5. Diao, Q.; Dai, Y.; Wang, J.; Feng, X.; Pan, F.; Zhang, C. Spatial-pooling-based graph attention U-Net for hyperspectral image classification. Remote Sens. 2024, 16, 937. [Google Scholar] [CrossRef]
  6. Zhao, F.; Zhang, J.; Meng, Z.; Liu, H.; Chang, Z.; Fan, J. Multiple vision architectures-based hybrid network for hyperspectral image classification. Expert Syst. Appl. 2023, 234, 121032. [Google Scholar] [CrossRef]
  7. Islam, T.; Islam, R.; Uddin, P.; Ulhaq, A. Spectrally segmented-enhanced neural network for precise land cover object classification in hyperspectral imagery. Remote Sens. 2024, 16, 807. [Google Scholar] [CrossRef]
  8. Shi, C.; Wu, H.; Wang, L. A positive feedback spatial-spectral correlation network based on spectral slice for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503417. [Google Scholar] [CrossRef]
  9. Shi, C.; Yue, S.; Wang, L. A dual branch multiscale Transformer network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5504520. [Google Scholar] [CrossRef]
  10. Liu, S.; Li, H.; Jiang, C.; Feng, J. Spectral–spatial graph convolutional network with dynamic-synchronized multiscale features for few-shot hyperspectral image classification. Remote Sens. 2024, 16, 895. [Google Scholar] [CrossRef]
  11. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [Google Scholar] [CrossRef]
  12. Ham, J.; Chen, Y.; Crawford, M.M.; Ghosh, J. Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 2005, 43, 492–501. [Google Scholar] [CrossRef]
  13. Haut, J.; Paoletti, M.; Paz-Gallardo, A.; Plaza, J.; Plaza, A.; Vigo-Aguiar, J. Cloud implementation of logistic regression for hyperspectral image classification. In Proceedings of the 17th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), Costa Ballena (Rota), Cádiz, Spain, 4–8 July 2017. [Google Scholar]
  14. Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63. [Google Scholar] [CrossRef]
  15. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [Google Scholar] [CrossRef]
  16. Li, T.; Zhang, J.; Zhang, Y. Classification of hyperspectral image based on deep belief networks. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014. [Google Scholar]
  17. Chen, Y.; Zhao, X.; Jia, X. Spectral–spatial classification of hyperspectral data based on deep belief network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392. [Google Scholar] [CrossRef]
  18. Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
  19. Ma, A.; Filippi, A.; Wang, Z.; Yin, Z. Hyperspectral image classification using similarity measurements-based deep recurrent neural networks. Remote Sens. 2019, 11, 194. [Google Scholar] [CrossRef]
  20. Seydgar, M.; Alizadeh Naeini, A.; Zhang, M.; Li, W.; Satari, M. 3-D convolution-recurrent networks for spectral-spatial classification of hyperspectral images. Remote Sens. 2019, 11, 883. [Google Scholar] [CrossRef]
  21. Paoletti, M.E.; Moreno-Álvarez, S.; Xue, Y.; Haut, J.M.; Plaza, A. AAtt-CNN: Automatical attention-based convolutional neural networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5511118. [Google Scholar] [CrossRef]
  22. Hu, Y.; Tian, S.; Ge, J. Hybrid convolutional network combining multiscale 3D depthwise separable convolution and CBAM residual dilated convolution for hyperspectral image classification. Remote Sens. 2023, 15, 4796. [Google Scholar] [CrossRef]
  23. Li, S.; Liang, L.; Zhang, S.; Zhang, Y.; Plaza, A.; Wang, X. End-to-end convolutional network and spectral-spatial Transformer architecture for hyperspectral image classification. Remote Sens. 2024, 16, 325. [Google Scholar] [CrossRef]
  24. Fang, S.; Li, X.; Tian, S.; Chen, W.; Zhang, E. Multi-level feature extraction networks for hyperspectral image classification. Remote Sens. 2024, 16, 590. [Google Scholar] [CrossRef]
  25. Zhang, Z.; Gao, D.; Liu, D.; Shi, G. Spectral-spatial domain attention network for hyperspectral image few-shot classification. Remote Sens. 2024, 16, 592. [Google Scholar] [CrossRef]
  26. Zhou, H.; Luo, F.; Zhuang, H.; Weng, Z.; Gong, X.; Lin, Z. Attention multi-hop graph and multi-scale convolutional fusion network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5508614. [Google Scholar]
  27. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  28. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  29. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  30. Ma, Y.; Lan, Y.; Xie, Y.; Yu, L.; Chen, C.; Wu, Y.; Dai, X. A Spatial–Spectral Transformer for Hyperspectral Image Classification Based on Global Dependencies of Multi-Scale Features. Remote Sens. 2024, 16, 404. [Google Scholar] [CrossRef]
  31. Arshad, T.; Zhang, J. Hierarchical attention transformer for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5504605. [Google Scholar] [CrossRef]
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  33. Zhang, Z.; Wang, S.; Zhang, W. Dilated spectral–spatial Gaussian Transformer net for hyperspectral image classification. Remote Sens. 2024, 16, 287. [Google Scholar] [CrossRef]
  34. Zhang, G.; Hu, X.; Wei, Y.; Cao, W.; Yao, H.; Zhang, X.; Song, K. Nonlocal correntropy matrix representation for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5502305. [Google Scholar] [CrossRef]
  35. Jia, S.; Zhu, Z.; Shen, L.; Li, Q. A two-stage feature selection framework for hyperspectral image classification using few labeled samples. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 1023–1035. [Google Scholar] [CrossRef]
  36. Beirami, B.A.; Mokhtarzade, M. Band grouping SuperPCA for feature extraction and extended morphological profile production from hyperspectral images. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1953–1957. [Google Scholar] [CrossRef]
  37. Fang, Y.; Ye, Q.; Sun, L.; Zheng, Y.; Wu, Z. Multi-attention joint convolution feature representation with lightweight Transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5513814. [Google Scholar] [CrossRef]
  38. Wei, Y.; Zheng, Y.; Wang, R.; Li, H. Quaternion convolutional neural network with EMAP representation for multisource remote sensing data classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5508805. [Google Scholar] [CrossRef]
  39. Torti, E.; Marenzi, E.; Danese, G.; Plaza, A.; Leporati, F. Spatial-spectral feature extraction with local covariance matrix from hyperspectral images through hybrid parallelization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 7412–7421. [Google Scholar] [CrossRef]
  40. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for hyperspectral image classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef]
  41. Li, W.; Wu, G.; Zhang, F.; Du, Q. Hyperspectral image classification using deep pixel-pair features. IEEE Trans. Geosci. Remote Sens. 2016, 55, 844–853. [Google Scholar] [CrossRef]
  42. Zhao, W.; Du, S. Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4544–4554. [Google Scholar] [CrossRef]
  43. Hamida, A.B.; Benoit, A.; Lambert, P.; Amar, C.B. 3-D deep learning approach for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434. [Google Scholar] [CrossRef]
  44. Guo, W.; Ye, H.; Cao, F. Feature-grouped network with spectral–spatial connected attention for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5500413. [Google Scholar] [CrossRef]
  45. Yu, C.; Han, R.; Song, M.; Liu, C.; Chang, C.-I. Feedback attention-based dense CNN for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5501916. [Google Scholar] [CrossRef]
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  47. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858. [Google Scholar] [CrossRef]
  48. Paoletti, M.E.; Haut, J.M.; Fernandez-Beltran, R.; Plaza, J.; Plaza, A.J.; Pla, F. Deep pyramidal residual networks for spectral–spatial hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 740–754. [Google Scholar] [CrossRef]
  49. Zhang, X.; Shang, S.; Tang, X.; Feng, J.; Jiao, L. Spectral partitioning residual network with spatial attention mechanism for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5507714. [Google Scholar] [CrossRef]
  50. Gao, H.; Zhang, Y.; Chen, Z.; Li, C. A multiscale dual-branch feature fusion and attention network for hyperspectral images classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8180–8192. [Google Scholar] [CrossRef]
  51. Zhu, M.; Jiao, L.; Liu, F.; Yang, S.; Wang, J. Residual spectral–spatial attention network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 449–462. [Google Scholar] [CrossRef]
  52. Roy, S.K.; Manna, S.; Song, T.; Bruzzone, L. Attention-based adaptive spectral–spatial kernel ResNet for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7831–7843. [Google Scholar] [CrossRef]
  53. Shu, Z.; Liu, Z.; Zhou, J.; Tang, S.; Yu, Z.; Wu, X.-J. Spatial–spectral split attention residual network for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 419–430. [Google Scholar] [CrossRef]
  54. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–spatial feature tokenization Transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
  55. Liao, D.; Shi, C.; Wang, L. A spectral–spatial fusion Transformer network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5515216. [Google Scholar] [CrossRef]
  56. Ouyang, E.; Li, B.; Hu, W.; Zhang, G.; Zhao, L.; Wu, J. When multigranularity meets spatial–spectral attention: A hybrid Transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4401118. [Google Scholar]
  57. Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  58. Song, R.; Feng, Y.; Cheng, W.; Mu, Z.; Wang, X. BS2T: Bottleneck spatial–spectral Transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5532117. [Google Scholar] [CrossRef]
  59. Zhang, L.; Wang, Y.; Yang, L.; Chen, J.; Liu, Z.; Bian, L.; Yang, C. D2S2BoT: Dual-dimension spectral-spatial bottleneck Transformer for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 2655–2669. [Google Scholar] [CrossRef]
  60. Green, A.A.; Berman, M.; Switzer, P.; Craig, M.D. A transformation for ordering multispectral data in terms of image quality with implications for noise removal. IEEE Trans. Geosci. Remote Sens. 1988, 26, 65–74. [Google Scholar] [CrossRef]
  61. Li, R.; Zheng, S.; Duan, C.; Yang, Y.; Wang, X. Classification of hyperspectral image based on double-branch dual-attention mechanism network. Remote Sens. 2020, 12, 582. [Google Scholar] [CrossRef]
Figure 1. Framework of the self-attention.
Figure 2. Different bottleneck structures. (a) Bottleneck structure in ResNet; (b) bottleneck Transformer encoder.
Figure 3. Framework of S2CABT.
Figure 4. Framework of S2FEM.
Figure 5. Framework of ERCB and ARCB. (a) ERCB; (b) ARCB.
Figure 6. Framework of FCL.
Figure 7. Framework of CABTE.
Figure 8. Framework of CAMHSA.
Figure 9. The impact of the batch size on the classification performance. (a) OA; (b) AA; (c) κ .
Figure 10. The impact of the learning rate on the classification performance. (a) OA; (b) AA; (c) κ .
Figure 11. The impact of the reduced spectral dimension on the classification performance. (a) OA; (b) AA; (c) κ .
Figure 12. The impact of the patch size on the classification performance. (a) OA; (b) AA; (c) κ .
Figure 13. The impact of the number of convolutional kernels on the classification performance. (a) OA; (b) AA; (c) κ .
Figure 14. Visual classification results for the Indian Pines dataset. (a) Ground truth; (b–r) the class label maps predicted by SVM, RF, MLR, ResNet, SSRN, DPRN, A2S2K, RSSAN, FADCNN, SPRN, FGSSCA, SSFTT, BS2T, S2FTNet, S3ARN, HF, and S2CABT.
Figure 15. Statistical classification results of different methods on the Indian Pines, KSC, Houston, and PaviaU datasets under different training sizes: 2%, 4%, 6%, 8%, 10%, 15%, and 20%: (ac) the OA, AA, and κ of the Indian Pines dataset; (df) the OA, AA, and κ of the KSC dataset; (gi) the OA, AA, and κ of the Houston dataset; (jl) the OA, AA, and κ of the PaviaU dataset.
Figure 16. T-SNE visualization of different input features. The first to sixth rows are the original HSI, the three single-scale covariance matrices, the MSCM without SCS, and the MSCM, respectively. (a) The Indian Pines dataset; (b) the KSC dataset; (c) the Houston dataset; (d) the PaviaU dataset.
Table 1. The implementation of ERCB.
Layer Name | Kernel Size | Output Size
Input | - | 9 × 9 × 200
3D Conv. | 1 × 1 × 7 | 32 × 9 × 9 × 97
3D Conv-BN-ReLU | 1 × 1 × 7 | 32 × 9 × 9 × 97
FCL | - | 32 × 9 × 9 × 97
Residual Connection | - | 32 × 9 × 9 × 97
3D Conv-BN-ReLU | 1 × 1 × 7 | 32 × 9 × 9 × 97
FCL | - | 32 × 9 × 9 × 97
Residual Connection | - | 32 × 9 × 9 × 97
3D Conv. | 1 × 1 × 97 | 32 × 9 × 9 × 1
Squeeze | - | 32 × 9 × 9
Table 2. The implementation of ARCB.
Layer Name | Kernel Size | Output Size
Input | - | 9 × 9 × 200
3D Conv. | 3 × 3 × 200 | 32 × 9 × 9 × 1
3D Conv-BN-ReLU | 3 × 3 × 1 | 32 × 9 × 9 × 1
FCL | - | 32 × 9 × 9 × 1
Residual Connection | - | 32 × 9 × 9 × 1
3D Conv-BN-ReLU | 3 × 3 × 1 | 32 × 9 × 9 × 1
FCL | - | 32 × 9 × 9 × 1
Residual Connection | - | 32 × 9 × 9 × 1
3D Conv. | 3 × 3 × 1 | 32 × 9 × 9 × 1
Squeeze | - | 32 × 9 × 9
Table 3. The implementation of ECABT and ACABT.
Layer Name | Kernel Size | Output Size
Input | - | 32 × 9 × 9
2D Conv-BN-ReLU | 1 × 1 | 12 × 9 × 9
CAMHSA | - | 12 × 9 × 9
2D Conv-BN-ReLU | 1 × 1 | 32 × 9 × 9
Residual Connection | - | 32 × 9 × 9
2D Conv-BN-ReLU | 1 × 1 | 12 × 9 × 9
CAMHSA | - | 12 × 9 × 9
2D Conv-BN-ReLU | 1 × 1 | 32 × 9 × 9
Residual Connection | - | 32 × 9 × 9
Global Average Pooling | - | 32 × 1 × 1
Table 4. Statistical classification results of different methods on Indian Pines dataset.
ClassTrainTestSVMRFMLRResNetSSRNDPRNA2S2KRSSANFADCNNSPRNFGSSCASSFTTBS2TS2FTNetS3ARNHFS2CABT
114513.78 ± 7.603.11 ± 4.872.67 ± 3.658.89 ± 8.1717.78 ± 12.478.89 ± 17.4376.00 ± 15.5112.89 ± 6.9221.78 ± 22.5878.67 ± 30.5064.44 ± 12.8684.89 ± 12.8086.67 ± 4.1624.00 ± 22.3652.89 ± 8.8044.89 ± 11.9195.56 ± 6.29
229139961.77 ± 2.4648.25 ± 6.1758.13 ± 3.4869.12 ± 8.6287.91 ± 2.2748.91 ± 17.5187.99 ± 3.2859.70 ± 10.5367.91 ± 20.3187.13 ± 17.9892.75 ± 2.0989.75 ± 1.1893.03 ± 3.0889.55 ± 1.8788.84 ± 4.4791.87 ± 2.3893.35 ± 3.02
31781347.18 ± 5.9929.37 ± 11.4234.46 ± 6.5556.80 ± 12.7078.23 ± 10.1630.36 ± 19.5382.27 ± 8.3340.17 ± 9.3458.13 ± 9.1169.69 ± 31.6088.14 ± 6.7186.79 ± 8.5391.00 ± 3.2490.65 ± 5.6290.95 ± 4.7187.50 ± 8.9691.51 ± 8.03
4523232.41 ± 10.9119.40 ± 6.9216.47 ± 10.497.84 ± 5.5468.45 ± 10.2213.54 ± 15.3080.26 ± 16.1815.26 ± 4.5717.84 ± 8.1493.11 ± 8.3586.73 ± 11.3586.47 ± 13.3892.07 ± 11.1581.47 ± 11.3653.53 ± 25.0371.81 ± 12.6688.97 ± 13.21
51047374.97 ± 4.2463.38 ± 6.9563.47 ± 7.2155.35 ± 14.4389.05 ± 3.4449.85 ± 29.0290.28 ± 1.9160.68 ± 7.1341.18 ± 23.2193.40 ± 2.8291.08 ± 3.8389.51 ± 1.7993.23 ± 1.2390.82 ± 2.2689.18 ± 4.8189.77 ± 4.6693.49 ± 2.75
61571589.20 ± 1.6386.24 ± 6.6690.91 ± 2.2391.02 ± 10.0497.26 ± 1.7176.73 ± 21.8498.07 ± 1.3588.36 ± 8.9469.40 ± 28.3597.98 ± 1.3995.36 ± 3.1797.42 ± 1.3795.97 ± 1.9298.35 ± 0.5593.12 ± 2.2794.46 ± 2.8897.90 ± 1.60
712746.67 ± 11.891.48 ± 3.315.93 ± 5.623.70 ± 5.2414.07 ± 8.450.00 ± 0.0092.59 ± 10.8016.29 ± 10.0114.07 ± 19.3195.55 ± 4.0689.63 ± 8.4585.93 ± 11.2388.15 ± 17.0541.48 ± 35.8573.33 ± 16.2371.11 ± 18.78100.00 ± 0.00
81046898.42 ± 0.7098.08 ± 2.4094.53 ± 4.3174.74 ± 11.6499.96 ± 0.0943.50 ± 30.1499.87 ± 0.1997.10 ± 2.9068.04 ± 22.47100.00 ± 0.0099.83 ± 0.2899.70 ± 0.41100.00 ± 0.0099.96 ± 0.0994.49 ± 5.5999.87 ± 0.29100.00 ± 0.00
911933.68 ± 19.197.37 ± 8.815.26 ± 6.454.21 ± 4.404.21 ± 9.414.21 ± 6.8660.00 ± 18.0816.84 ± 6.8616.84 ± 14.6168.42 ± 30.0150.52 ± 18.8378.95 ± 15.3564.21 ± 29.6341.05 ± 28.9270.53 ± 25.4170.53 ± 20.9290.53 ± 9.41
102095257.79 ± 6.4642.67 ± 6.8047.42 ± 4.3356.37 ± 8.8481.05 ± 4.9961.07 ± 11.8482.37 ± 6.0253.42 ± 4.7766.79 ± 7.6184.81 ± 3.1883.17 ± 2.8582.80 ± 4.3885.31 ± 4.3483.82 ± 3.5078.02 ± 2.2284.45 ± 2.8586.41 ± 6.07
1150240568.70 ± 4.3178.30 ± 7.0070.37 ± 2.6886.29 ± 4.0489.35 ± 4.4873.71 ± 12.9892.16 ± 2.4374.90 ± 2.8485.55 ± 4.4489.32 ± 9.3695.63 ± 2.2991.24 ± 5.1295.30 ± 3.8393.91 ± 2.9790.54 ± 3.4996.06 ± 2.9797.58 ± 2.38
121258138.18 ± 4.0820.86 ± 9.1528.57 ± 5.8432.01 ± 17.1982.75 ± 7.4322.65 ± 14.1686.20 ± 9.6934.08 ± 14.0730.05 ± 11.7093.60 ± 4.2888.78 ± 4.3576.70 ± 5.4792.67 ± 3.5382.89 ± 6.1066.16 ± 7.3987.37 ± 4.8689.12 ± 7.25
13520090.80 ± 3.2188.30 ± 5.1391.70 ± 8.2871.80 ± 8.1797.00 ± 3.8650.20 ± 23.5699.20 ± 1.7998.40 ± 2.3356.10 ± 26.4799.60 ± 0.6597.50 ± 0.7999.50 ± 0.7198.40 ± 2.1698.60 ± 1.7896.10 ± 2.5395.30 ± 3.6599.30 ± 1.10
1426123989.43 ± 3.7590.69 ± 7.7990.17 ± 4.2289.56 ± 6.3996.63 ± 4.0583.07 ± 7.2594.71 ± 4.4495.00 ± 1.9582.87 ± 9.6699.28 ± 0.6498.13 ± 1.7397.51 ± 3.0498.34 ± 1.8698.27 ± 2.1696.01 ± 2.8397.50 ± 1.2198.92 ± 1.24
15837835.08 ± 6.7323.60 ± 4.1038.10 ± 4.4936.40 ± 31.5883.60 ± 10.9030.10 ± 35.7790.16 ± 9.8230.63 ± 4.3636.08 ± 16.3994.60 ± 6.7089.95 ± 11.2089.95 ± 2.9297.78 ± 1.4093.23 ± 8.4279.52 ± 14.8190.00 ± 6.7298.20 ± 2.87
1629183.96 ± 2.5348.13 ± 16.7576.92 ± 11.0710.55 ± 3.5376.49 ± 22.024.18 ± 3.3398.02 ± 3.3318.90 ± 4.6213.85 ± 11.9593.40 ± 6.2186.59 ± 8.1386.59 ± 4.2191.21 ± 4.0388.13 ± 6.3335.16 ± 19.8157.36 ± 23.7486.37 ± 12.30
OA (%)66.99 ± 1.5461.58 ± 1.2663.29 ± 0.9468.88 ± 1.2387.67 ± 2.4756.67 ± 11.7590.22 ± 1.9165.74 ± 2.8866.22 ± 5.9990.24 ± 1.8492.61 ± 0.7890.46 ± 1.7694.01 ± 0.7991.61 ± 1.3086.80 ± 0.9491.62 ± 0.9894.80 ± 0.64
AA (%)60.13 ± 2.1846.83 ± 1.7550.94 ± 1.0447.17 ± 2.1472.74 ± 3.4137.56 ± 11.1088.13 ± 2.9650.79 ± 2.0746.65 ± 7.1889.91 ± 5.5087.39 ± 1.0788.98 ± 1.2091.46 ± 2.1681.01 ± 4.8678.02 ± 2.7483.12 ± 1.1194.20 ± 1.07
κ 0.6223 ± 0.01710.5529 ± 0.01410.5763 ± 0.01170.6375 ± 0.01710.8590 ± 0.02820.4976 ± 0.13480.8885 ± 0.02200.6066 ± 0.03230.6096 ± 0.07230.8888 ± 0.02140.9155 ± 0.00880.8911 ± 0.01980.9316 ± 0.00890.9042 ± 0.01490.8491 ± 0.010990.42 ± 1.10900.9407 ± 0.0069
Table 5. Statistical classification results of different methods on the KSC dataset.
ClassTrainTestSVMRFMLRResNetSSRNDPRNA2S2KRSSANFADCNNSPRNFGSSCASSFTTBS2TS2FTNetS3ARNHFS2CABT
11674590.26 ± 5.6390.47 ± 5.1390.87 ± 5.0899.46 ± 0.5699.36 ± 0.9099.17 ± 0.5299.14 ± 0.7378.87 ± 44.0999.44 ± 0.81100.00 ± 0.00100.00 ± 0.0098.20 ± 2.8799.87 ± 0.3097.61 ± 1.1899.33 ± 0.51100.00 ± 0.00100.00 ± 0.00
2523876.22 ± 5.2170.59 ± 6.0484.79 ± 6.8585.55 ± 12.4492.69 ± 5.2684.45 ± 16.8890.00 ± 10.0619.33 ± 16.3162.86 ± 21.6193.36 ± 7.7995.55 ± 8.3530.59 ± 6.8698.07 ± 2.7264.96 ± 9.5778.91 ± 24.9493.87 ± 3.9598.22 ± 2.79
3625073.28 ± 6.1773.68 ± 6.3382.72 ± 4.8590.88 ± 6.5698.08 ± 1.6377.60 ± 12.6792.88 ± 7.6247.04 ± 32.1867.36 ± 16.9599.28 ± 1.0099.04 ± 1.7148.48 ± 37.9196.32 ± 8.0190.72 ± 6.9767.92 ± 32.8492.56 ± 12.1396.50 ± 7.00
4624643.82 ± 7.3644.72 ± 6.2537.72 ± 10.0041.06 ± 21.0462.76 ± 10.6754.39 ± 11.7588.94 ± 3.8916.26 ± 9.0043.98 ± 12.6889.02 ± 13.4087.48 ± 13.3429.19 ± 9.6182.44 ± 7.5330.49 ± 22.5143.74 ± 19.7373.01 ± 4.8597.05 ± 5.90
5415750.06 ± 10.8848.54 ± 3.9446.88 ± 14.8729.04 ± 17.9179.24 ± 15.1719.36 ± 20.7481.78 ± 16.9421.53 ± 15.8651.21 ± 18.7289.81 ± 8.2791.34 ± 9.3536.30 ± 12.1090.57 ± 8.8027.64 ± 15.9547.90 ± 15.0784.59 ± 10.6989.81 ± 9.69
6522436.34 ± 10.6725.80 ± 14.5547.05 ± 5.6239.91 ± 31.9791.79 ± 8.0136.07 ± 24.7998.04 ± 3.4335.09 ± 25.5573.66 ± 11.4399.02 ± 1.4299.29 ± 0.7520.62 ± 26.0899.46 ± 1.2025.80 ± 19.9972.14 ± 26.6796.70 ± 0.60100.00 ± 0.00
7310275.69 ± 15.6665.29 ± 17.5277.65 ± 16.8062.16 ± 31.3993.14 ± 4.1070.59 ± 9.4894.90 ± 8.8157.06 ± 32.9183.53 ± 13.3996.08 ± 8.2399.22 ± 1.7547.65 ± 29.7798.43 ± 3.512.74 ± 3.8282.94 ± 15.6392.55 ± 7.36100.00 ± 0.00
8942270.76 ± 13.5144.93 ± 7.0863.03 ± 7.0380.71 ± 12.7695.59 ± 4.7682.56 ± 6.8797.63 ± 4.2737.25 ± 21.7179.90 ± 22.9298.77 ± 2.3898.01 ± 3.8261.66 ± 17.9498.58 ± 3.0568.34 ± 15.6797.35 ± 3.1498.67 ± 1.0899.11 ± 0.78
91150985.19 ± 6.4282.04 ± 10.4288.17 ± 6.2293.36 ± 4.9799.68 ± 0.4994.34 ± 7.1098.94 ± 2.3791.71 ± 6.4692.85 ± 7.4999.65 ± 0.7999.37 ± 1.4084.95 ± 6.69100.00 ± 0.0089.70 ± 5.4096.42 ± 7.3598.23 ± 3.95100.00 ± 0.00
10939583.29 ± 14.5067.95 ± 8.1888.35 ± 6.1799.19 ± 1.67100.00 ± 0.0096.96 ± 4.34100.00 ± 0.0026.73 ± 19.3997.06 ± 2.78100.00 ± 0.00100.00 ± 0.0071.75 ± 4.08100.00 ± 0.0081.17 ± 3.6999.54 ± 0.63100.00 ± 0.00100.00 ± 0.00
11941089.51 ± 3.5692.49 ± 3.0091.12 ± 5.6296.10 ± 3.3398.15 ± 3.6292.44 ± 9.3398.29 ± 3.8274.93 ± 41.9992.98 ± 6.4099.12 ± 1.9698.63 ± 3.0584.88 ± 10.0898.54 ± 2.4897.27 ± 4.0392.05 ± 11.3398.54 ± 3.2798.97 ± 1.29
121149288.94 ± 2.8179.23 ± 6.4792.93 ± 3.1295.12 ± 2.7798.41 ± 1.8193.54 ± 5.0998.86 ± 0.9827.88 ± 16.9271.95 ± 27.7298.78 ± 1.7098.70 ± 1.5362.48 ± 9.1798.94 ± 1.6284.11 ± 14.6176.66 ± 20.8499.67 ± 0.7399.80 ± 0.41
131990899.98 ± 0.0599.71 ± 0.1799.89 ± 0.25100.00 ± 0.00100.00 ± 0.0099.93 ± 0.15100.00 ± 0.0076.74 ± 41.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.0098.55 ± 1.71100.00 ± 0.0099.89 ± 0.19100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00
OA (%)81.51 ± 2.0476.24 ± 1.5783.13 ± 1.0987.27 ± 1.8195.84 ± 0.8286.48 ± 1.5797.23 ± 0.7455.56 ± 25.0685.01 ± 5.3598.36 ± 0.6298.36 ± 0.6172.09 ± 4.3598.18 ± 0.4279.89 ± 2.5987.96 ± 8.9596.84 ± 0.6799.11 ± 0.12
AA (%)74.10 ± 1.8668.11 ± 2.1076.24 ± 1.3577.89 ± 2.9392.99 ± 1.5477.03 ± 2.4095.34 ± 1.7046.96 ± 22.1278.21 ± 4.9097.15 ± 0.8697.43 ± 0.5859.64 ± 6.4397.02 ± 0.4166.19 ± 2.9881.15 ± 12.1094.49 ± 1.1298.42 ± 0.28
κ 0.7940 ± 0.02250.7349 ± 0.01730.8121 ± 0.01210.8574 ± 0.02050.9536 ± 0.00920.8488 ± 0.01760.9691 ± 0.00830.5014 ± 0.28050.8330 ± 0.05930.9818 ± 0.00690.9818 ± 0.00680.69 ± 0.05010.9797 ± 0.00460.7750 ± 0.02910.8653 ± 0.10080.9648 ± 0.00740.9901 ± 0.0014
Table 6. Statistical classification results of different methods on the Houston dataset.
ClassTrainTestSVMRFMLRResNetSSRNDPRNA2S2KRSSANFADCNNSPRNFGSSCASSFTTBS2TS2FTNetS3ARNHFS2CABT
126122593.03 ± 7.1493.06 ± 2.6694.81 ± 2.8693.96 ± 2.3095.79 ± 2.3193.60 ± 3.6397.60 ± 1.7077.70 ± 3.5394.55 ± 1.5797.49 ± 1.8598.66 ± 0.8797.49 ± 0.5397.06 ± 1.8596.47 ± 2.1773.58 ± 42.2497.55 ± 1.6793.63 ± 3.08
226122888.36 ± 5.5988.50 ± 4.0192.04 ± 5.4587.98 ± 2.3595.05 ± 5.0285.65 ± 5.0693.34 ± 4.3582.25 ± 5.0385.80 ± 5.3692.72 ± 3.6992.47 ± 7.5496.29 ± 5.2093.55 ± 5.6793.19 ± 3.4161.52 ± 35.7493.32 ± 4.1895.84 ± 2.13
31468361.58 ± 41.4291.83 ± 1.7899.50 ± 0.2297.92 ± 2.4999.88 ± 0.2699.38 ± 0.8399.97 ± 0.0782.46 ± 4.8798.36 ± 1.9299.82 ± 0.1999.93 ± 0.1198.86 ± 1.4499.68 ± 0.2299.59 ± 0.8479.47 ± 44.4398.83 ± 0.8799.36 ± 0.84
425121993.58 ± 2.6893.57 ± 2.8293.75 ± 2.7990.62 ± 4.7497.49 ± 2.7089.68 ± 2.9696.54 ± 1.9578.93 ± 8.8185.76 ± 11.0598.10 ± 2.5599.72 ± 0.2997.80 ± 2.4697.43 ± 3.0297.87 ± 2.7775.78 ± 41.5896.13 ± 3.5991.42 ± 3.62
525121797.52 ± 1.6690.45 ± 5.5197.50 ± 0.7898.47 ± 1.56100.00 ± 0.0099.57 ± 0.3899.29 ± 1.1392.31 ± 4.4599.06 ± 1.0499.98 ± 0.04100.00 ± 0.0098.96 ± 2.32100.00 ± 0.00100.00 ± 0.0067.16 ± 42.97100.00 ± 0.0099.92 ± 0.14
6731881.70 ± 2.0580.57 ± 4.5883.08 ± 0.9882.96 ± 4.2593.84 ± 4.3873.84 ± 28.2494.21 ± 3.2751.07 ± 13.7274.78 ± 18.9089.94 ± 4.5988.68 ± 0.0092.08 ± 5.0490.50 ± 5.1691.70 ± 6.0772.83 ± 18.4088.74 ± 2.1691.40 ± 5.84
726124278.45 ± 8.6679.34 ± 3.6886.22 ± 4.4679.52 ± 6.2290.52 ± 2.4971.71 ± 11.0189.66 ± 3.4877.25 ± 4.4870.45 ± 5.3494.86 ± 4.5694.45 ± 3.5393.01 ± 3.0295.65 ± 3.2095.17 ± 2.8643.59 ± 41.8686.78 ± 4.3795.62 ± 2.62
825121951.11 ± 10.3565.23 ± 7.0054.39 ± 3.5964.40 ± 7.5176.13 ± 7.3955.96 ± 2.6375.95 ± 5.3548.97 ± 6.7168.86 ± 7.7979.70 ± 6.2676.29 ± 0.4776.49 ± 9.5176.16 ± 6.9676.28 ± 10.1740.95 ± 36.3379.08 ± 3.9889.17 ± 4.81
926122678.48 ± 2.7569.58 ± 5.0766.05 ± 5.1551.97 ± 19.9075.34 ± 5.5950.91 ± 8.5884.68 ± 8.8358.42 ± 9.2269.04 ± 10.0985.04 ± 3.8789.52 ± 5.1381.03 ± 3.3684.54 ± 6.2585.87 ± 5.7746.05 ± 41.9480.41 ± 4.1986.71 ± 5.75
1025120228.35 ± 18.3771.26 ± 2.8577.40 ± 4.7169.62 ± 8.3288.29 ± 11.0072.76 ± 14.6787.59 ± 9.2164.36 ± 10.8878.20 ± 13.3097.77 ± 1.3497.09 ± 4.1294.64 ± 4.8198.59 ± 1.9296.96 ± 4.6449.07 ± 47.0792.28 ± 4.1295.37 ± 1.13
1125121067.97 ± 4.74570.25 ± 8.2068.38 ± 4.0270.99 ± 7.7390.55 ± 6.2067.87 ± 10.7592.51 ± 7.0363.36 ± 9.2583.16 ± 4.4497.72 ± 0.94100.00 ± 0.0097.12 ± 3.8496.11 ± 4.1099.06 ± 1.3674.59 ± 41.9195.34 ± 2.8099.83 ± 0.29
1225120823.76 ± 7.4959.24 ± 4.3953.96 ± 5.9670.58 ± 17.0086.89 ± 3.6567.67 ± 12.5183.58 ± 9.8049.97 ± 11.3477.10 ± 7.8887.24 ± 6.3589.12 ± 9.7892.86 ± 4.0389.98 ± 6.5490.37 ± 3.0656.74 ± 40.6493.26 ± 4.3891.17 ± 3.57
13104595.23 ± 5.1812.77 ± 5.0420.65 ± 4.9052.11 ± 8.4086.62 ± 5.7647.32 ± 14.0785.45 ± 7.0043.92 ± 13.2372.42 ± 10.6481.74 ± 9.2783.44 ± 5.5492.46 ± 2.4887.23 ± 3.1987.93 ± 1.5546.80 ± 42.4080.70 ± 2.8887.29 ± 3.40
14941989.50 ± 8.8290.84 ± 2.6994.18 ± 4.4599.62 ± 0.3299.95 ± 0.1198.57 ± 1.27100.00 ± 0.0080.00 ± 8.4694.08 ± 10.0099.95 ± 0.11100.00 ± 0.0097.52 ± 5.1599.95 ± 0.11100.00 ± 0.0057.57 ± 51.7399.38 ± 0.90100.00 ± 0.00
151464699.04 ± 0.2584.55 ± 2.6098.67 ± 0.1499.97 ± 0.07100.00 ± 0.0099.97 ± 0.07100.00 ± 0.0078.98 ± 7.5995.20 ± 5.88100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.00100.00 ± 0.0073.93 ± 42.27100.00 ± 0.0098.45 ± 1.21
OA (%)69.84 ± 2.8477.39 ± 0.8379.10 ± 1.1379.67 ± 2.3790.83 ± 0.6977.46 ± 2.6491.20 ± 1.4869.54 ± 1.8882.55 ± 2.4093.45 ± 0.6994.04 ± 1.0593.31 ± 1.3593.50 ± 0.9293.72 ± 0.7560.38 ± 34.2091.96 ± 0.8794.24 ± 0.71
AA (%)69.18 ± 3.0776.07 ± 0.7578.71 ± 0.9880.71 ± 2.3691.75 ± 0.6478.30 ± 3.5592.02 ± 1.3968.66 ± 2.3883.12 ± 1.5993.47 ± 0.4993.96 ± 0.6293.77 ± 1.1793.76 ± 1.0294.03 ± 0.8661.31 ± 34.1192.12 ± 0.7094.35 ± 0.55
κ 0.6732 ± 0.03090.7552 ± 0.00890.7738 ± 0.01220.7800 ± 0.02570.9009 ± 0.00740.7561 ± 0.02860.9049 ± 0.01600.6702 ± 0.02040.8113 ± 0.02590.9292 ± 0.00750.9355 ± 0.01130.9277 ± 0.01460.9297 ± 0.00990.9321 ± 0.00810.5698 ± 0.37240.9130 ± 0.00940.9377 ± 0.0078
Table 7. Statistical classification results of different methods on the PaviaU dataset.
ClassTrainTestSVMRFMLRResNetSSRNDPRNA2S2KRSSANFADCNNSPRNFGSSCASSFTTBS2TS2FTNetS3ARNHFS2CABT
1133649892.71 ± 2.0588.04 ± 1.8289.51 ± 2.0991.79 ± 2.9997.40 ± 0.9991.83 ± 2.1698.62 ± 0.3492.82 ± 2.4387.74 ± 4.7799.75 ± 0.3099.77 ± 0.1999.46 ± 0.3898.98 ± 0.4399.56 ± 0.2197.08 ± 1.2098.98 ± 0.4399.51 ± 0.55
23731827698.90 ± 0.3396.49 ± 0.4595.63 ± 0.2799.38 ± 0.3799.91 ± 0.1199.51 ± 0.3099.88 ± 0.0998.32 ± 0.8299.54 ± 0.3099.99 ± 0.02100.00 ± 0.0199.96 ± 0.0499.99 ± 0.0199.98 ± 0.0299.85 ± 0.1599.99 ± 0.0199.97 ± 0.03
342205725.65 ± 12.5148.66 ± 5.3168.49 ± 2.2476.84 ± 10.2588.63 ± 3.6776.71 ± 4.5587.18 ± 5.0765.12 ± 6.4675.19 ± 13.2294.16 ± 2.7591.94 ± 3.7096.84 ± 1.5294.58 ± 1.1595.09 ± 1.7890.97 ± 5.1894.58 ± 1.1598.27 ± 1.85
461300385.66 ± 1.5982.38 ± 3.3580.09 ± 3.3190.44 ± 3.6798.17 ± 0.7388.81 ± 3.3698.04 ± 1.4095.57 ± 1.6890.02 ± 6.5897.44 ± 0.5198.25 ± 1.0796.68 ± 1.2697.39 ± 1.1597.81 ± 1.3898.65 ± 0.6697.39 ± 1.1597.51 ± 0.87
527131899.17 ± 0.3597.75 ± 0.7798.74 ± 0.7199.98 ± 0.04100.00 ± 0.0099.98 ± 0.04100.00 ± 0.0099.12 ± 1.1899.19 ± 1.0599.77 ± 0.1499.92 ± 0.1399.77 ± 0.3999.71 ± 0.44100.00 ± 0.0099.89 ± 0.1599.71 ± 0.4499.60 ± 0.44
6101492845.17 ± 7.6449. 90 ± 4.0772.49 ± 1.3690.79 ± 2.2798.13 ± 1.2488.56 ± 5.6698.96 ± 1.2489.65 ± 8.4598.38 ± 1.94100.00 ± 0.00100.00 ± 0.0099.65 ± 0.49100.00 ± 0.0099.82 ± 0.2899.52 ± 0.46100.00 ± 0.0099.91 ± 0.14
72713030.05 ± 0.0766.31 ± 8.1352.52 ± 7.2688.20 ± 8.0393.14 ± 5.0474.75 ± 4.7597.37 ± 2.3767.80 ± 5.0173.10 ± 28.1399.94 ± 0.1099.65 ± 0.2399.25 ± 0.9699.23 ± 1.3699.98 ± 0.0492.39 ± 1.7599.23 ± 1.3699.98 ± 0.04
874360890.86 ± 1.4882.47 ± 1.7285.05 ± 0.9971.29 ± 12.9290.69 ± 2.7268.70 ± 6.3492.77 ± 1.5878.23 ± 9.8775.24 ± 3.3197.90 ± 0.6798.32 ± 0.9996.80 ± 0.9795.49 ± 2.3097.08 ± 1.7397.20 ± 1.0895.49 ± 2.3097.78 ± 1.02
91992899.76 ± 0.1299.27 ± 0.2699.72 ± 0.1082.24 ± 10.7299.53 ± 0.1577.80 ± 17.6099.63 ± 0.5297.61 ± 1.4377.05 ± 6.9898.58 ± 0.9198.36 ± 0.6296.72 ± 0.1696.96 ± 1.2997.16 ± 2.2699.20 ± 0.6796.96 ± 1.2996.21 ± 2.02
OA (%)83.35 ± 0.9784.30 ± 0.4987.45 ± 0.2692.32 ± 1.1697.62 ± 0.4591.26 ± 0.6898.13 ± 0.2491.95 ± 2.8692.27 ± 1.9299.27 ± 0.1799.25 ± 0.2099.09 ± 0.2198.90 ± 0.2599.19 ± 0.1098.39 ± 0.5198.90 ± 0.2599.35 ± 0.12
AA (%)70.88 ± 1.1279.03 ± 1.3182.47 ± 1.0887.88 ± 1.5896.18 ± 0.8885.18 ± 1.4796.94 ± 0.5787.14 ± 3.5486.16 ± 4.3098.61 ± 0.3098.47 ± 0.4798.35 ± 0.3598.04 ± 0.4398.50 ± 0.2097.20 ± 0.8598.04 ± 0.4398.75 ± 0.25
κ 0.7707 ± 0.01430.7865 ± 0.00750.8316 ± 0.00370.8976 ± 0.01530.9684 ± 0.00600.8829 ± 0.00940.9752 ± 0.00320.8928 ± 0.03830.8973 ± 0.02580.9903 ± 0.00220.9900 ± 0.00270.9879 ± 0.00280.9854 ± 0.00330.9893 ± 0.00130.9786 ± 0.00680.9854 ± 0.00330.9914 ± 0.0015
Table 8. Statistical classification results of ablation study on the input feature.
Dataset | Metrics | Original HSI | Single Scale D1 | Single Scale D2 | Single Scale D3 | Without SCS | MSCM
Indian Pines | OA (%) | 91.90 ± 1.41 | 81.94 ± 1.09 | 94.26 ± 1.15 | 93.93 ± 0.69 | 91.36 ± 0.82 | 94.80 ± 0.64
Indian Pines | AA (%) | 77.47 ± 1.63 | 71.16 ± 1.48 | 92.73 ± 1.99 | 92.88 ± 1.48 | 88.97 ± 1.91 | 94.20 ± 1.07
Indian Pines | κ | 0.9074 ± 0.0160 | 0.7924 ± 0.0121 | 0.9345 ± 0.0131 | 0.9307 ± 0.0078 | 0.9015 ± 0.0092 | 0.9407 ± 0.0069
KSC | OA (%) | 97.30 ± 1.32 | 98.08 ± 1.02 | 98.21 ± 0.65 | 98.83 ± 0.62 | 98.13 ± 0.93 | 99.11 ± 0.12
KSC | AA (%) | 96.07 ± 1.96 | 96.98 ± 1.58 | 97.07 ± 0.68 | 98.10 ± 0.91 | 97.15 ± 1.39 | 98.42 ± 0.28
KSC | κ | 0.9699 ± 0.0147 | 0.9786 ± 0.0114 | 0.9801 ± 0.0072 | 0.9870 ± 0.0069 | 0.9792 ± 0.0104 | 0.9901 ± 0.0014
Houston | OA (%) | 90.85 ± 0.81 | 91.42 ± 0.68 | 92.64 ± 0.77 | 92.21 ± 1.44 | 90.38 ± 0.84 | 94.24 ± 0.71
Houston | AA (%) | 91.42 ± 0.61 | 92.05 ± 0.63 | 92.90 ± 0.61 | 92.75 ± 1.37 | 91.10 ± 0.72 | 94.35 ± 0.55
Houston | κ | 0.9011 ± 0.0086 | 0.9072 ± 0.0073 | 0.9205 ± 0.0083 | 0.9158 ± 0.0155 | 0.8960 ± 0.0091 | 0.9377 ± 0.0078
PaviaU | OA (%) | 98.05 ± 0.27 | 99.16 ± 0.22 | 98.91 ± 0.16 | 99.20 ± 0.23 | 99.14 ± 0.20 | 99.35 ± 0.12
PaviaU | AA (%) | 96.69 ± 0.86 | 98.68 ± 0.37 | 98.17 ± 0.45 | 98.20 ± 0.72 | 97.96 ± 0.88 | 98.75 ± 0.25
PaviaU | κ | 0.9742 ± 0.0036 | 0.9888 ± 0.0029 | 0.9856 ± 0.0021 | 0.9894 ± 0.0030 | 0.9886 ± 0.0027 | 0.9914 ± 0.0015
Table 9. Statistical classification results of ablation study on the framework.
The ablated framework components are the ERCB, ARCB, ECABT, and ACABT. Columns (1)–(9) denote the nine ablation configurations built from these components; the rightmost column is the full S2CABT.

| Dataset | Metrics | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | S2CABT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Indian Pines | OA (%) | 94.41 ± 0.69 | 94.48 ± 0.67 | 94.44 ± 0.90 | 94.73 ± 0.87 | 93.34 ± 1.67 | 93.17 ± 1.80 | 93.25 ± 1.05 | 93.67 ± 1.87 | 92.92 ± 1.76 | 94.80 ± 0.64 |
| | AA (%) | 92.42 ± 2.30 | 93.21 ± 3.16 | 93.43 ± 1.43 | 93.31 ± 1.85 | 91.62 ± 3.25 | 91.68 ± 3.30 | 86.01 ± 5.06 | 81.80 ± 4.25 | 89.82 ± 4.92 | 94.20 ± 1.07 |
| | κ | 0.9362 ± 0.0078 | 0.9370 ± 0.0076 | 0.9365 ± 0.0102 | 0.9398 ± 0.0099 | 0.9244 ± 0.018 | 0.9220 ± 0.0191 | 0.9228 ± 0.021 | 0.9278 ± 0.0119 | 0.9190 ± 0.0186 | 0.9407 ± 0.0069 |
| KSC | OA (%) | 98.63 ± 0.43 | 98.63 ± 0.49 | 98.66 ± 0.35 | 98.67 ± 0.47 | 98.28 ± 1.43 | 97.57 ± 0.91 | 98.21 ± 2.31 | 98.10 ± 0.95 | 96.94 ± 2.70 | 99.11 ± 0.12 |
| | AA (%) | 97.67 ± 0.45 | 97.86 ± 0.66 | 97.87 ± 0.75 | 97.89 ± 0.74 | 97.47 ± 1.64 | 96.74 ± 1.61 | 97.54 ± 3.46 | 97.36 ± 1.31 | 95.26 ± 4.10 | 98.42 ± 0.28 |
| | κ | 0.9847 ± 0.0047 | 0.9848 ± 0.0055 | 0.9851 ± 0.0039 | 0.9851 ± 0.0053 | 0.9808 ± 0.0148 | 0.9729 ± 0.0126 | 0.9808 ± 0.0257 | 0.9788 ± 0.0106 | 0.9659 ± 0.0278 | 0.9901 ± 0.0014 |
| Houston | OA (%) | 92.89 ± 1.15 | 93.37 ± 1.19 | 92.75 ± 1.19 | 92.87 ± 1.23 | 92.09 ± 1.75 | 91.50 ± 1.87 | 91.86 ± 1.85 | 91.54 ± 1.53 | 90.74 ± 2.92 | 94.24 ± 0.71 |
| | AA (%) | 93.05 ± 0.95 | 93.48 ± 1.08 | 93.01 ± 0.84 | 93.09 ± 0.91 | 92.14 ± 1.73 | 91.97 ± 1.53 | 92.66 ± 1.54 | 92.00 ± 1.19 | 91.20 ± 2.77 | 94.35 ± 0.55 |
| | κ | 0.9231 ± 0.0125 | 0.9283 ± 0.0129 | 0.9216 ± 0.0129 | 0.9229 ± 0.0133 | 0.9144 ± 0.0181 | 0.9080 ± 0.0202 | 0.9119 ± 0.0200 | 0.9086 ± 0.0166 | 0.8999 ± 0.0230 | 0.9377 ± 0.0078 |
| PaviaU | OA (%) | 99.29 ± 0.18 | 99.28 ± 0.17 | 99.28 ± 0.18 | 99.24 ± 0.23 | 99.13 ± 0.20 | 99.11 ± 0.29 | 99.08 ± 0.32 | 99.04 ± 0.32 | 98.57 ± 1.21 | 99.35 ± 0.12 |
| | AA (%) | 98.57 ± 0.48 | 98.62 ± 0.36 | 99.58 ± 0.40 | 98.52 ± 0.38 | 98.43 ± 0.28 | 98.35 ± 0.45 | 98.27 ± 0.49 | 98.05 ± 0.66 | 97.55 ± 1.50 | 98.75 ± 0.25 |
| | κ | 0.9907 ± 0.0024 | 0.9905 ± 0.0022 | 0.9905 ± 0.0024 | 0.9900 ± 0.0030 | 0.9898 ± 0.0027 | 0.9888 ± 0.0036 | 0.9885 ± 0.0037 | 0.9887 ± 0.0035 | 0.9810 ± 0.0128 | 0.9914 ± 0.0015 |
Table 10. Statistical classification results of ablation study on the attention.
| Dataset | Metrics | CBAM + CAMHSA | SK-Net + CAMHSA | ECA-Net + CAMHSA | FCL + MHSA | FCL + CAMHSA |
|---|---|---|---|---|---|---|
| Indian Pines | OA (%) | 94.42 ± 0.93 | 94.40 ± 0.69 | 94.47 ± 0.72 | 93.55 ± 1.02 | 94.80 ± 0.64 |
| | AA (%) | 91.84 ± 3.11 | 92.80 ± 1.02 | 93.28 ± 2.78 | 91.90 ± 2.04 | 94.20 ± 1.07 |
| | κ | 0.9368 ± 0.0106 | 0.9361 ± 0.0079 | 0.9369 ± 0.0082 | 0.9266 ± 0.0116 | 0.9407 ± 0.0069 |
| KSC | OA (%) | 98.22 ± 0.38 | 98.43 ± 0.45 | 98.21 ± 0.79 | 97.84 ± 0.60 | 99.11 ± 0.12 |
| | AA (%) | 97.31 ± 0.72 | 97.61 ± 0.40 | 97.37 ± 1.09 | 97.00 ± 0.67 | 98.42 ± 0.28 |
| | κ | 0.9802 ± 0.0042 | 0.9826 ± 0.0050 | 0.9801 ± 0.0088 | 0.9760 ± 0.0067 | 0.9901 ± 0.0014 |
| Houston | OA (%) | 92.71 ± 1.11 | 92.52 ± 1.43 | 93.16 ± 1.32 | 92.26 ± 1.01 | 94.24 ± 0.71 |
| | AA (%) | 92.95 ± 1.00 | 92.84 ± 1.10 | 93.29 ± 1.01 | 92.45 ± 0.73 | 94.35 ± 0.55 |
| | κ | 0.9212 ± 0.0120 | 0.9192 ± 0.0155 | 0.9260 ± 0.0143 | 0.9163 ± 0.0109 | 0.9377 ± 0.0078 |
| PaviaU | OA (%) | 99.16 ± 0.21 | 99.08 ± 0.27 | 99.10 ± 0.30 | 99.08 ± 0.30 | 99.35 ± 0.12 |
| | AA (%) | 98.50 ± 0.40 | 98.51 ± 0.43 | 98.42 ± 0.27 | 98.35 ± 0.51 | 98.75 ± 0.25 |
| | κ | 0.9895 ± 0.0027 | 0.9884 ± 0.0035 | 0.9888 ± 0.0039 | 0.9875 ± 0.0047 | 0.9914 ± 0.0015 |
Table 11. Computational complexity analysis of different networks on four datasets.
| Dataset | Metrics | ResNet | SSRN | DPRN | A2S2K | RSSAN | FADCNN | SPRN | FGSSCA | SSFTT | BS2T | S2FTNet | S3ARN | HF | S2CABT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Indian Pines | Trainable Params (M) | 21.9107 | 0.3642 | 22.3889 | 0.3707 | 0.1484 | 35.6777 | 0.2054 | 0.5307 | 0.1485 | 0.3781 | 0.9176 | 8.1104 | 3.3918 | 0.4545 |
| | Computation Cost (GFLOPs) | 0.0843 | 0.3168 | 0.0852 | 0.3432 | 0.0158 | 0.6665 | 0.0335 | 0.4679 | 0.0228 | 0.2164 | 0.2968 | 1.0752 | 0.6258 | 0.0825 |
| | Training Time (s) | 23.77 | 44.12 | 28.54 | 50.81 | 13.35 | 58.29 | 13.05 | 53.98 | 8.66 | 87.65 | 36.84 | 107.11 | 49.19 | 29.89 |
| | Testing Time (s) | 2.12 | 4.20 | 4.89 | 5.16 | 1.15 | 5.47 | 1.13 | 5.40 | 0.70 | 9.37 | 2.36 | 11.98 | 4.67 | 4.27 |
| KSC | Trainable Params (M) | 21.8339 | 0.3272 | 22.3090 | 0.3332 | 0.1314 | 35.7145 | 0.1975 | 0.4769 | 0.1485 | 0.3339 | 0.9157 | 6.8329 | 3.1082 | 0.4543 |
| | Computation Cost (GFLOPs) | 0.0805 | 0.2782 | 0.0814 | 0.3013 | 0.0146 | 0.6784 | 0.0323 | 0.4105 | 0.0338 | 0.1901 | 0.2968 | 0.8346 | 0.5027 | 0.5580 |
| | Training Time (s) | 12.34 | 21.05 | 22.43 | 22.03 | 14.05 | 36.72 | 6.25 | 25.23 | 3.70 | 44.49 | 18.74 | 50.33 | 22.71 | 24.67 |
| | Testing Time (s) | 1.13 | 1.94 | 2.51 | 2.24 | 1.28 | 2.01 | 0.56 | 2.51 | 0.37 | 4.36 | 1.21 | 5.26 | 2.37 | 2.45 |
| Houston | Trainable Params (M) | 21.7345 | 0.2781 | 22.2118 | 0.2833 | 0.1248 | 35.7015 | 0.1933 | 0.4062 | 0.1485 | 0.2758 | 0.9139 | 5.2644 | 2.8042 | 0.5235 |
| | Computation Cost (GFLOPs) | 0.0755 | 0.2267 | 0.0764 | 0.2455 | 0.0131 | 0.6742 | 0.0315 | 0.3339 | 0.0073 | 0.1549 | 0.2968 | 0.5612 | 0.3628 | 0.2787 |
| | Training Time (s) | 33.85 | 47.05 | 49.10 | 49.35 | 29.50 | 70.43 | 14.76 | 59.43 | 9.73 | 99.65 | 41.21 | 111.17 | 60.78 | 45.07 |
| | Testing Time (s) | 3.13 | 4.88 | 3.46 | 5.25 | 1.63 | 5.58 | 1.59 | 6.20 | 1.01 | 11.03 | 2.24 | 12.82 | 6.41 | 5.52 |
| PaviaU | Trainable Params (M) | 21.6029 | 0.2165 | 22.0739 | 0.2207 | 0.0946 | 35.6259 | 0.1850 | 0.3189 | 0.1485 | 0.2021 | 0.9105 | 3.4804 | 2.4754 | 0.5178 |
| | Computation Cost (GFLOPs) | 0.0690 | 0.1624 | 0.0700 | 0.1758 | 0.0111 | 0.6498 | 0.0303 | 0.2382 | 0.0228 | 0.1110 | 0.2968 | 0.2912 | 0.2239 | 0.2120 |
| | Training Time (s) | 94.97 | 103.68 | 104.00 | 104.43 | 42.20 | 174.06 | 38.49 | 127.70 | 27.62 | 234.01 | 97.66 | 230.22 | 165.20 | 102.42 |
| | Testing Time (s) | 8.54 | 10.92 | 9.53 | 11.31 | 4.55 | 14.48 | 4.36 | 13.72 | 2.93 | 25.86 | 6.44 | 28.13 | 17.84 | 10.71 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
