1. Introduction
Hyperspectral remote sensing is an advanced technique that utilizes imaging spectrometers to acquire ground-object spectral information across continuous, narrow spectral bands [1,2]. The core principle involves capturing the reflectance or radiation characteristics of ground objects across the broad electromagnetic spectrum from ultraviolet to infrared at high spectral resolution (typically below 10 nm), recording the spectral reflectance data for each pixel. Each pixel therefore represents not a single color value but a complete spectral curve describing its reflectance at different wavelengths [3,4]. This technology can provide data from hundreds of spectral bands, far exceeding conventional multispectral remote sensing, thereby enabling precise identification and classification of ground-object features [5,6,7]. Its distinguishing characteristics include high spectral resolution, multi-band imaging capability, and sensitivity to subtle spectral differences. These features allow accurate capture and in-depth analysis of the spectral characteristics of different entities and materials on the Earth’s surface, enhancing our ability to discriminate various surface features. In fields such as agriculture [8,9], environmental monitoring [10,11], geology [12,13], urban planning [14,15], and national security [16,17], hyperspectral imaging technology has demonstrated outstanding application performance [18,19]. One of its core tasks is hyperspectral image (HSI) classification, which involves accurately assigning each pixel to a specific land-cover category. This task is not only a crucial aspect of hyperspectral imaging technology but also plays an irreplaceable role in its wide range of applications [20].
HSIs are characterized by high-dimensional spectral data, prompting the development of various statistical transformation methods for dimensionality reduction of spectral vectors. These approaches can be broadly categorized into orthogonal transformations (e.g., Principal Component Analysis (PCA) [21] and Singular Value Decomposition (SVD) [22]) and discriminant transformations (e.g., Linear Discriminant Analysis (LDA) [23] and Independent Component Analysis (ICA) [24]). However, these linear methods may fail to fully capture the nonlinear structures inherent in hyperspectral data, which has driven the advancement of nonlinear transformation techniques such as Kernel Discriminant Analysis (KDA) [25], Manifold Hypergraph Learning (MHL) [26], and Kernel Sparse Representation (KSR) [27]. Conventional classification methods based on manually designed spectral–spatial features, including k-Nearest Neighbors, Bayesian estimation, and Support Vector Machines (SVM) [28,29], often prove inadequate in complex environments due to their limited capacity to effectively utilize both spatial and spectral information.
In recent years, deep learning-based HSI classification methods have achieved significant progress through innovative applications of multimodal feature fusion and attention mechanisms, substantially improving classification performance. While traditional CNN methods [30,31,32,33,34] can effectively capture joint spectral–spatial features, their fixed convolutional kernel scales limit multi-scale information processing. To overcome this limitation, researchers have proposed a series of innovative approaches. MSDN-SA [35] first integrated 3D dilated convolution with spectral attention mechanisms, using dense connections to enhance multi-scale feature representation; Bi-LSTM [36] networks further introduced bidirectional spectral dependency modeling, optimizing inter-band correlations through spectral–spatial attention mechanisms. For feature fusion, MSCF-MAM [37] integrates local and global information through pyramid compression attention modules while employing Transformer encoders to model long-range dependencies; DFAN [38] achieves efficient HSI classification through a novel interactive fusion mechanism that dynamically integrates local and global spatial–spectral features. MATNet [39] combines spatial–channel attention with Transformer encoders and a polynomial-adaptive loss function, effectively addressing the boundary-pixel misclassification in hyperspectral imagery caused by redundant spectral information and sparse background distribution. CAN [40] proposes an SDPCA module that extracts features from central pixels and similar neighborhoods, achieving multi-level feature fusion through dense connections to significantly improve edge-region discrimination. More advanced architectures such as AMDPCN [41] adopt dual-path designs incorporating GCN and CNN, dynamically adjusting feature interactions through multi-scale attention mechanisms to effectively balance global spatial relationships with local discriminative power.
To fully exploit the spectral–spatial characteristics of HSI, several Transformer-based feature extraction methods have been developed, including the Hyperspectral image Transformer (HiT) [42], spatial–spectral 1DSwin (SS1DSwin) [43], Spectral–Spatial Feature Tokenization Transformer (SSFTT) [44], Cross-Attention Spatial–Spectral Transformer (CASST) [45], Lightweight Self-Gaussian Attention Transformer (LSGA) [46], and Spectral Query Spatial Transformer (SQSFormer) [47]. The HiT architecture incorporates convolutional operations within Transformer blocks to capture fine-grained spectral–spatial variations, while SS1DSwin employs group feature tokenization and 1D Swin Transformer modules to characterize hierarchical spatial–spectral relationships through dual processing pathways. SSFTT implements Gaussian-weighted tokenization of shallow features for deep semantic extraction, whereas CASST adopts a dual-branch structure with cross-attention mechanisms. LSGA preserves complete patch-wise relationships through hybrid tokens with Gaussian positional bias, and SQSFormer adaptively queries neighborhood information using rotation-invariant positional encoding. However, these approaches face a significant challenge: the quadratic computational complexity of Transformer architectures, coupled with substantial parameter requirements, results in suboptimal efficiency for resource-constrained applications, particularly when processing long spectral sequences. This demands innovative solutions through algorithmic optimization and hardware acceleration strategies.
To address these challenges, this paper proposes the TC-Former framework, which employs a hierarchical architecture constructed by stacking TimeMixFormer and HyperMixFormer modules for deep feature extraction. The TimeMixFormer integrates the WKV operator from the RWKV model [48], a linear-complexity Transformer alternative that combines the parallelizable training of Transformers with the efficient inference of Recurrent Neural Networks, and utilizes learnable exponential decay parameters to enable efficient cross-timestep information propagation. The HyperMixFormer enhances feature interactions along the channel dimension through gated convolutional structures. The framework incorporates a central-pixel focusing mechanism that processes spatial information by dynamically generating attention masks emphasizing critical regions. Its key innovation is the simultaneous improvement of HSI classification accuracy and significant reduction of computational complexity for long sequences through RWKV technology. Operationally, TimeMixFormer and HyperMixFormer transfer and blend features across current and previous timesteps; this feature-mixing process enhances robustness to spatial orientation, while the hyper attention module, incorporating layer normalization and a residual design, reduces computational complexity to enable real-time processing of long spectral sequences (e.g., 200+ bands) and improves translational adaptability for HSI. A dual-reuse mechanism achieves parameter sharing to substantially decrease the total model parameters, keeping central-pixel classification lightweight, while dynamic feature-weight adjustment enhances model adaptability and reduces computational overhead. Finally, TC-Former’s progressive filtering layers achieve adaptive spectral–spatial feature refinement, effectively suppressing the propagation of irrelevant information while maintaining computational efficiency and improving classification performance. The core contributions of this work are threefold:
This study proposes TC-Former, a novel HSI classification framework that innovatively combines the global modeling capability of Transformer with the linear attention advantages of RWKV. The hybrid architecture effectively addresses the computational efficiency bottleneck of conventional Transformer models while achieving significant performance improvements. Experimental evaluations on three public benchmark datasets demonstrate superior classification performance of the proposed framework.
By integrating RWKV’s linear attention with Transformer, we propose a channel-guided attention design that replaces dot-product computation, reducing complexity while maintaining performance. The model significantly reduces computational complexity through TimeMixFormer’s learnable decay factors, enabling efficient processing of long spectral sequences. A dual-path mechanism combines time-mixing and hyper-mixing to capture both temporal and cross-channel patterns. A lightweight micro-attention module further enhances global dependency modeling.
This work introduces a hybrid architecture combining TimeMixFormer and HyperMixFormer modules to enhance feature representation. TimeMixFormer employs learnable time-decay matrices to achieve efficient temporal modeling of long-range dependencies while maintaining computational tractability. HyperMixFormer improves spectral interactions via dynamic channel weighting and dual Mish activations, creating a higher-order feature space. This design excels at modeling nonlinear spectral reflectance in hyperspectral imagery, especially for complex features and blurred boundaries.
2. Materials and Methods
2.1. Overview
The overall architecture of our proposed model is illustrated in Figure 1a. Given an HSI denoted as $\mathcal{I} \in \mathbb{R}^{H \times W \times C}$, where $C$ represents the number of spectral bands, $H$ denotes the height, and $W$ indicates the image width, the processing pipeline consists of several key stages. During preprocessing, the input HSI first undergoes PCA dimensionality reduction to extract crucial spectral features, yielding an intermediate representation $\mathcal{I}' \in \mathbb{R}^{H \times W \times C'}$, where $C'$ is the reduced spectral dimension. This is followed by spatial feature extraction through a convolutional layer (Conv2D+BN+ReLU), after which the feature maps are reshaped into sequence format.
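The following PyTorch sketch illustrates this preprocessing stem. The module name PreprocessStem, the 64-dimensional embedding width, and the 3 × 3 kernel are illustrative assumptions rather than the tuned configuration (see Section 3.4 for the values used in the experiments).

```python
import torch
import torch.nn as nn

class PreprocessStem(nn.Module):
    """Illustrative stem: PCA-reduced cube -> Conv2D+BN+ReLU -> token sequence."""
    def __init__(self, pca_channels: int, embed_dim: int = 64, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(pca_channels, embed_dim, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C', H, W) patch after PCA dimensionality reduction
        f = self.conv(x)                     # (B, D, H, W) spatial features
        return f.flatten(2).transpose(1, 2)  # (B, H*W, D) sequence of pixel tokens

# PCA itself is typically fit once on the flattened image, e.g. with scikit-learn:
# from sklearn.decomposition import PCA
# reduced = PCA(n_components=Cp).fit_transform(image.reshape(-1, C))
```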
The core of our proposed model comprises multiple TimeMixFormer and HyperMixFormer modules, as illustrated in Figure 1b,c. The TimeMixFormer captures temporal dependencies across spectral bands through RWKV-based temporal mixing operations, while the HyperMixFormer processes spatial dependencies by employing a hybrid channel attention mechanism that utilizes central-pixel spatial information as queries to aggregate relevant features from neighboring pixels. The Center Attention mechanism dynamically queries adjacent pixels using central-pixel features and integrates multi-level embedded representations, thereby facilitating joint spectral–spatial feature learning for enhanced classification performance. These three modules operate synergistically: the TimeMixFormer captures spectral–temporal interactions, the HyperMixFormer models spatial relationships, and the Center Attention mechanism further refines spatial focus for classification tasks.
After processing through the TimeMixFormer, HyperMixFormer, and Center Attention modules, we perform feature fusion in which the final embeddings are aggregated. The multi-level central-pixel embeddings are integrated and fed into an MLP (Multi-Layer Perceptron) classifier to predict the land-cover category of the corresponding pixel. In this study, we employ the cross-entropy (CE) loss as our objective function, defined as
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{K} y_{i,c}\,\log \hat{y}_{i,c},$$
where $N$ is the number of training samples, $K$ is the number of classes, $y_{i,c}$ is the one-hot ground-truth label, and $\hat{y}_{i,c}$ is the predicted probability for class $c$.
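A minimal sketch of this fusion-and-classification head is given below. The concatenation-based fusion, the layer sizes, and the class name CenterPixelClassifier are illustrative assumptions, not the exact head used in TC-Former.

```python
import torch
import torch.nn as nn

class CenterPixelClassifier(nn.Module):
    """Illustrative head: fuse multi-level center-pixel embeddings, then MLP."""
    def __init__(self, embed_dim: int, num_levels: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(embed_dim * num_levels),
            nn.Linear(embed_dim * num_levels, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, center_embeddings: list[torch.Tensor]) -> torch.Tensor:
        # center_embeddings: per-level (B, D) features of the central pixel
        fused = torch.cat(center_embeddings, dim=-1)  # (B, D * num_levels)
        return self.mlp(fused)                        # (B, num_classes) logits

criterion = nn.CrossEntropyLoss()  # implements the CE objective defined above
```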
2.2. TimeMixFormer
This study proposes an efficient sequence modeling method based on TimeMixFormer (as illustrated in Figure 1b), designed to address the high computational complexity and weak explicit temporal modeling capability of traditional Transformers in long-sequence modeling. The proposed TimeMixFormer consists of a two-layer TimeMixing module connected in series with a lightweight attention mechanism (Tiny Attention), forming a synergistic structure of “local temporal modeling + global feature interaction”. This approach significantly enhances classification performance through temporal–spatial collaborative modeling and multi-scale feature fusion.
The time-mixed computation vector is a linear projection of the linear combination of the current and previous block inputs [48]:
$$r_t = W_r \cdot \left(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}\right),$$
$$k_t = W_k \cdot \left(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}\right),$$
$$v_t = W_v \cdot \left(\mu_v \odot x_t + (1-\mu_v) \odot x_{t-1}\right),$$
where $x_t$ is the block input at step $t$, $\mu_r$, $\mu_k$, and $\mu_v$ are learnable mixing coefficients, and $W_r$, $W_k$, and $W_v$ are projection matrices.
Token shifting is implemented as a simple temporal shift along the time dimension of each block using the PyTorch library, specifically via nn.ZeroPad2d((0, 0, 1, -1)).
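The snippet below demonstrates this token-shift trick and the subsequent mixing; the tensor shapes and the per-channel coefficient mu are illustrative.

```python
import torch
import torch.nn as nn

# Token shift as described: pad one zero step at the start of the time axis
# and trim the last step, so position t sees the features of position t-1.
token_shift = nn.ZeroPad2d((0, 0, 1, -1))  # (left, right, top, bottom) on (T, D)

x = torch.randn(2, 5, 8)      # (batch, T, D)
x_prev = token_shift(x)       # x_prev[:, t] == x[:, t-1]; x_prev[:, 0] == 0

# The time-mixed projections then interpolate current and shifted inputs with
# learnable mixing coefficients mu, as in the equations above:
mu = torch.rand(8)            # illustrative per-channel mixing weight
mixed = mu * x + (1 - mu) * x_prev
```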
In this model, the computation of the WKV operator shares similarities with the Attention Free Transformer (AFT) [48]. However, unlike AFT, which parameterizes $W$ as a pairwise interaction matrix, our approach reformulates $W$ as a channel-wise partitioned vector that is adaptively adjusted based on relative positional information. Furthermore, the model dynamically updates the WKV vectors in a time-recursive manner, introducing recurrent behavior through sequential state propagation. The detailed computation process is as follows:
$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}},$$
where $w$ is the learnable channel-wise decay vector and $u$ is a per-channel bonus applied to the current token.
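For concreteness, the following is a naive sequential transcription of this operator in PyTorch; wkv_naive is an illustrative name, and a production RWKV kernel additionally tracks a running maximum of the exponents for numerical stability, which this readable version omits.

```python
import torch

def wkv_naive(w, u, k, v):
    """Direct sequential transcription of the WKV formula above.

    w: (D,) channel-wise decay rates; u: (D,) current-token bonus
    k, v: (B, T, D) keys and values from the time-mixing projections
    """
    B, T, D = k.shape
    num = torch.zeros(B, D, dtype=k.dtype, device=k.device)  # decayed sum of e^{k_i} v_i
    den = torch.zeros(B, D, dtype=k.dtype, device=k.device)  # decayed sum of e^{k_i}
    decay = torch.exp(-w)                                    # learnable exponential decay
    out = torch.empty_like(v)
    for t in range(T):
        cur = torch.exp(u + k[:, t])                         # extra weight on token t itself
        out[:, t] = (num + cur * v[:, t]) / (den + cur)
        num = decay * num + torch.exp(k[:, t]) * v[:, t]     # propagate state to step t+1
        den = decay * den + torch.exp(k[:, t])
    return out
```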
The output gating in both the TimeMixing and HyperMixing modules is implemented via a sigmoid function $\sigma$ applied to the receptance terms. The output vector $o_t$ [48] after WKV operator processing is expressed as follows:
$$o_t = W_o \cdot \left(\sigma(r_t) \odot wkv_t\right).$$
In terms of specific design, TimeMixFormer integrates time-decay weights with gated attention to achieve efficient spatiotemporal feature modeling. The first TimeMixing layer captures local correlations between spectral bands along the spectral dimension, employing a time-decay weight mechanism ($w$) to adaptively regulate dependency strength among different bands, thereby effectively capturing short-term temporal characteristics. The second TimeMixing layer models spatial neighborhood relationships in higher-order feature representations, leveraging spatially adaptive weights to capture local spatial features while further modeling long-term temporal patterns at an advanced feature level. The model operates with lightweight computations, expressed as $o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)$, where $\sigma$ is the sigmoid function, ensuring efficient processing.
To address the limitations of TimeMixing in modeling long-range dependencies, this study proposes a lightweight attention module named Tiny Attention. This module employs 1–4 attention heads and computes attention with low-dimensional projections, effectively balancing computational complexity and modeling capability through a sparse computation strategy. Unlike standard Transformers, which generate $Q$, $K$, and $V$ using three separate linear layers, Tiny Attention combines a single linear projection with a chunk operation that splits the projected tensor into three equal parts along the feature dimension, minimizing memory access through shared weight matrices. By sharing projection matrices and partitioning features, this approach reduces both parameters and memory access operations while preserving the core functionality of attention mechanisms, thereby achieving superior computational efficiency. This design is particularly suitable for long-sequence modeling tasks in resource-constrained scenarios.
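A minimal sketch of such a module is shown below, assuming a 64-dimensional projection and PyTorch's built-in multi-head attention; the class name TinyAttention and the exact dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Lightweight attention: one shared projection plus chunk, instead of
    three separate Q/K/V linear layers (illustrative sketch)."""
    def __init__(self, d_model: int, d_attn: int = 64, n_heads: int = 1):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_attn)   # single shared projection
        self.attn = nn.MultiheadAttention(d_attn, n_heads, batch_first=True)
        self.out = nn.Linear(d_attn, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # split along the feature dim
        y, _ = self.attn(q, k, v, need_weights=False)
        return self.out(y)
```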
2.3. HyperMixFormer
In this study, we propose HyperMixFormer, a novel architecture composed of two HyperMixing modules and one Tiny Attention module connected in series (as illustrated in Figure 1c), designed to achieve efficient feature mixing and global information interaction. The HyperMixing module first transforms input features through projections. It then performs temporal modeling via time-shift operations and extracts localized features using the Mish activation. The Mish function [49], defined as $\mathrm{Mish}(x) = x \tanh(\ln(1 + e^{x}))$ (Equation (7)), preserves small negative values during training, enhancing gradient flow and feature expressiveness while maintaining computational efficiency. A gating mechanism, implemented through sigmoid-activated receptance, adaptively regulates feature updates to enable enhanced channel-wise feature representation. Each HyperMixing module incorporates a residual connection and further refines features through an output adjustment layer for dimension compression. Subsequently, HyperMixFormer introduces a computationally efficient Tiny Attention module that operates with a minimal number of attention heads (typically 1–4). This lightweight attention mechanism performs localized Softmax-based computations to effectively capture long-range dependencies while facilitating global temporal interactions across sequence steps.
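The following sketch models one HyperMixing block under this description. It is patterned on RWKV-style channel mixing with Mish substituted for the squared ReLU of [48]; the hidden width, parameter names, and the single placement of Mish are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperMixing(nn.Module):
    """Channel-mixing sketch: token shift, Mish-activated key path, and
    sigmoid receptance gating with a residual connection."""
    def __init__(self, d_model: int, hidden: int = None):
        super().__init__()
        hidden = hidden or 4 * d_model
        self.shift = nn.ZeroPad2d((0, 0, 1, -1))     # same token shift as TimeMixing
        self.mu_k = nn.Parameter(torch.rand(d_model))
        self.mu_r = nn.Parameter(torch.rand(d_model))
        self.key = nn.Linear(d_model, hidden)
        self.value = nn.Linear(hidden, d_model)      # output adjustment / compression
        self.receptance = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, D)
        xs = self.shift(x)
        k = self.key(self.mu_k * x + (1 - self.mu_k) * xs)
        r = self.receptance(self.mu_r * x + (1 - self.mu_r) * xs)
        v = self.value(F.mish(k))                    # Mish: x * tanh(softplus(x))
        return x + torch.sigmoid(r) * v              # gated update + residual
```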
In the HyperMixing module, the input features (as shown in Equations (8) and (9)) [48] are first processed by the WKV operator (Equation (5)) before being fed into the channel mixing module. This module generates enhanced features $o_t$ [48] (Equation (10)) through gated nonlinear transformations, combining the Mish activation function with value modulation for efficient feature interaction. The sigmoid gating $\sigma(r_t)$ controls information flow while maintaining gradient stability, benefiting from the smooth nonlinear properties of Mish.
This temporal processing pipeline then enters the scaled dot-product attention mechanism of the Tiny Attention module, which approximates the WKV operator in Equation (5) through multi-head attention computations. This approach achieves context-aware feature reweighting while preserving gradient stability. The system subsequently performs precise gated nonlinear transformations via channel mixing: the Mish-activated feature modulation and sigmoid-controlled information flow strictly adhere to the mathematical formulation of Equation (10), combining element-wise operations ($\odot$) with zero-initialized gating weights to balance feature interactions.
HyperMixFormer forms an efficient local feature modeling structure by invoking the HyperMixing module with shared parameters twice in succession. The first HyperMixing pass performs preliminary channel feature reconstruction and screening at the local scale, regulating feature expression through the gating mechanism and the WKV weighting mechanism. On this basis, the second pass further refines the feature relationships, strengthening key features and suppressing redundancy to form a more discriminative high-level feature representation. Internally, each HyperMixing pass employs Mish nonlinear activation, time-shift operations, LayerNorm, and residual connections, which enhance nonlinear interaction across bands, maintain feature fluidity, suppress noise interference, and stabilize gradient propagation. This dual channel-mixing design realizes multi-level feature fusion, improving the model’s ability to express complex spectral–spatial patterns, and is especially suitable for high-dimensional HSI data. Meanwhile, without introducing additional parameters, the parameter-reuse strategy (similar to the “bottleneck structure” in residual networks) increases feature abstraction and modeling depth.
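A compositional sketch of this reuse strategy, built from the illustrative HyperMixing and TinyAttention classes above, might look as follows; the block layout is an assumption rather than the exact wiring of HyperMixFormer.

```python
import torch

d_model = 64
mixer = HyperMixing(d_model)      # single parameter set, reused twice
tiny_attn = TinyAttention(d_model)

def hypermixformer_block(x: torch.Tensor) -> torch.Tensor:
    x = mixer(x)                  # pass 1: coarse channel reconstruction/screening
    x = mixer(x)                  # pass 2, same weights: refine, suppress redundancy
    return x + tiny_attn(x)       # lightweight global interaction on refined features
```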
To further enhance the model’s ability to capture long-range dependencies, a lightweight multi-head attention module (Tiny Attention) is introduced after the double-layer HyperMixing. Tiny Attention computes attention through low-dimensional projections (e.g., 64-dimensional) and cooperates with a sparse mask mechanism to screen out invalid regions. It enhances global context-awareness at modest computational cost.
The high-quality local features refined by the double-layer HyperMixing provide more discriminative inputs for Tiny Attention, thus effectively improving the classification consistency and long-range feature interaction ability. Overall, HyperMixFormer achieves efficient local feature modeling while maintaining long-range dependency capture through auxiliary mechanisms, striking an effective balance between computational efficiency and model performance. This design makes it particularly suitable for large-scale or long-sequence tasks where both localized and global representations are critical.
3. Results
3.1. Datasets Description
To validate the effectiveness of the proposed algorithm, we conducted experiments on three publicly available hyperspectral remote sensing datasets: the Indian Pines dataset, the Pavia University dataset, and the WHU-Hi-HongHu dataset. These datasets are widely used in the field of HSI classification and can effectively evaluate the algorithm’s performance across different scenarios.
The Indian Pines (IP) dataset represents a benchmark dataset in remote sensing research, acquired on 12 June 1992 using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The dataset covers an approximately 2 × 2 mile agricultural test site in northwestern Indiana, USA, encompassing Purdue University’s Agronomy Farm and adjacent watersheds. The dataset features a spatial resolution of 20 m per pixel, with an image dimension of 145 × 145 pixels. The original hyperspectral data comprises 224 contiguous spectral bands covering the visible to short-wave infrared spectrum (0.4–2.5 μm). Through rigorous quality control procedures, 24 bands significantly affected by water vapor absorption and noise artifacts (specifically bands 104–108, 150–163, and 220) were removed, resulting in 200 retained spectral channels for subsequent analysis. The ground truth data contains annotations for 16 distinct land cover classes, representing the predominant vegetation types and surface features in this agricultural test area.
Figure 2 presents a false-color composite of the dataset along with its corresponding ground truth classification map.
The Pavia University (PU) dataset was acquired over the geographical area of the University of Pavia in northern Italy in 2001 using the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor. The original data were provided by Prof. Paolo Gamba’s team at the University of Pavia. Initially comprising 115 spectral bands, the dataset was subsequently processed by removing 12 noisy bands, resulting in 103 retained bands for further analysis. This dataset covers an urban and campus mixed scene, with a spatial dimension of 610 × 340 pixels and a spatial resolution of 1.3 m per pixel. The spectral range spans from 430 to 860 nm, encompassing nine distinct land-cover classes.
Figure 3 presents a false-color composite image of the dataset along with its corresponding ground truth classification map.
The WHU-Hi-HongHu (WHHH) dataset, released by the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS) at Wuhan University, represents a benchmark hyperspectral remote sensing dataset covering typical wetland ecosystems in Honghu City, Hubei Province, China. The data acquisition was conducted using a Headwall Nano-Hyperspec imaging sensor (17 mm focal length) mounted on a DJI Matrice 600 Pro UAV platform during a low-altitude aerial survey from 16:23 to 17:37 on 20 November 2017. The dataset was acquired at a constant flight altitude of 100 m, yielding hyperspectral imagery with spatial dimensions of 940 × 475 pixels across 270 spectral channels (spectral range: 400–1000 nm). After rigorous geometric correction, the data achieves an exceptional spatial resolution of 0.043 m. This high-resolution dataset encompasses complex agricultural landscapes featuring 22 distinct land cover categories, including various crop types and their cultivars (e.g., lettuce varieties, oilseed rape, and cotton).
Figure 4 presents a false-color composite of the dataset along with its corresponding ground truth classification map.
Table 1 presents the number of training and testing samples for each class in the three datasets used in this experiment, including the sample distribution across each land cover category.
3.2. Experimental Evaluation Indicators
To quantitatively evaluate the effectiveness of our proposed method and compare it with other approaches, we rely on four key performance metrics: Overall Accuracy (OA), Average Accuracy (AA), Kappa coefficient (Kappa), and per-class accuracy. These metrics provide a comprehensive view of classification performance, with higher values indicating superior classification results.
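These metrics can all be derived from the confusion matrix. The helper below is a standard formulation given for reference, not the exact evaluation code used in the experiments.

```python
import numpy as np

def classification_metrics(conf: np.ndarray):
    """OA, AA, per-class accuracy, and Cohen's kappa from a confusion
    matrix whose rows are ground truth and columns are predictions."""
    total = conf.sum()
    oa = np.trace(conf) / total                       # Overall Accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)      # per-class accuracy (recall)
    aa = per_class.mean()                             # Average Accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                      # Kappa coefficient
    return oa, aa, per_class, kappa
```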
3.3. Brief Description and Settings of Compared Methods
All experiments were implemented and trained using the PyTorch 2.4.0 framework, with the hardware environment consisting of a GeForce RTX 4090 GPU and a Windows 11 operating system. We used the Adam optimizer with a learning rate of 1 × 10⁻³ to effectively guide the training process. The batch size, an important factor influencing memory consumption and training efficiency, was dynamically adjusted based on the available GPU memory. To ensure optimal processing, batch sizes of 200, 200, and 20 were allocated for the IP, PU, and WHHH datasets, respectively, in accordance with their characteristics. Furthermore, in our network architecture, the depth was set to 2 to more clearly identify potential relationships between variables. Additionally, Section 4 provides a comprehensive discussion of important parameters, including patch size, convolutional kernel size, and the number of PCA-retained channels.
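The essentials of this configuration can be summarized in a short sketch; the tiny Sequential network is a stand-in for TC-Former, used only to keep the snippet self-contained and runnable.

```python
import torch
import torch.nn as nn

# Illustrative training configuration matching the settings described above.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr = 1e-3
criterion = nn.CrossEntropyLoss()
batch_sizes = {"IP": 200, "PU": 200, "WHHH": 20}           # per-dataset batch sizes
```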
3.4. Parameter Analysis
We conducted a detailed analysis of key parameters, including the convolutional kernel size, the patch size of the input data, and the number of channels retained by PCA. Figure 5 presents the experimental results for optimizing these parameters. We found that different datasets may have different optimal parameter selections. To ensure fair experimental comparisons, this study adopts a controlled-variable approach for parameter optimization, modifying only one parameter at a time while keeping all others constant. The initial parameter settings were primarily determined from the optimal benchmark configurations of the comparative algorithms, thereby guaranteeing the scientific rigor and comparability of the experimental design. For the kernel size, a value of 19 yielded good results on the IP, PU, and WHHH datasets; we also observed that this parameter had less impact on the IP dataset than on the PU and WHHH datasets. Regarding patch size, our algorithm performed better with relatively large patches. Experiments showed that with smaller patch sizes, such as 5 × 5 and 10 × 10, the OA on all three datasets was relatively low. As the patch size increased, the OA on the IP dataset showed no significant change, the OA on the PU dataset first increased and then slowly decreased, and the OA on the WHHH dataset continuously increased. The optimal patch sizes for the IP, PU, and WHHH datasets were found to be 15, 20, and 30, respectively. The number of PCA-retained channels also influenced the results on all three datasets; the experiments revealed that the optimal numbers of PCA channels for the IP, PU, and WHHH datasets were 100, 30, and 150, respectively.
3.5. Experimental Results
To evaluate the efficacy of the proposed methodology, we conducted comprehensive comparative experiments with eight state-of-the-art hyperspectral image classification approaches as benchmarks: SSRN [50], SpectralFormer, HiT, SS1DSwin, SSFTT, CASST, LSGA, and SQSFormer. The experiments strictly adhered to the principle of controlled variables, with all comparative algorithms employing the standardized hyperparameter configuration scheme detailed in Table 2, which encompasses a consistent learning rate, batch size, and number of training epochs. This rigorous experimental protocol ensures the reliability and comparability of the obtained results.
SSRN (Spectral–Spatial Residual Network) directly processes raw 3D HSI cubes, using spectral–spatial residual blocks with identity mapping and batch normalization to enhance feature learning and classification accuracy, thereby achieving state-of-the-art results on multiple datasets.
HiT (Hyperspectral image Transformer) is designed to capture spectral sequence and local spectral–spatial features. It introduces two key modules: a spectral-adaptive 3D convolutional projection for extracting spectral–spatial information, and a convolutional transformer that encodes features across height, width, and spectral dimensions.
SpectralFormer is a transformer-based network for HSI classification that captures spectral–sequence attributes using group-wise spectral embeddings and cross-layer skip connections. It effectively models spectral continuity and supports both pixel- and patch-wise inputs.
SS1DSwin captures local and hierarchical spatial–spectral relationships through groupwise feature tokenization and a 1DSwin Transformer with cross-block normalized connections. It is one of the latest representative transformer-based algorithms for HSI classification.
SSFTT captures spectral–spatial and high-level semantic features by using a 3-D and 2-D convolutional layer for feature extraction, followed by a Gaussian weighted feature tokenizer and transformer encoder, with competitive effectiveness in the current HSI classification task.
CASST uses a dual-branch structure for spatial and spectral feature extraction, with spectral–spatial cross-attention and weighted sharing mechanisms to improve feature fusion and capture robust semantics, with competitive effectiveness in the current HSI classification task.
LSGA introduces the light self-Gaussian attention mechanism while extracting global deep features, reducing computation and parameters. The hybrid spectral–spatial tokenizer captures shallow features, and the Gaussian absolute position bias enhances the attention weights for the central query block.
SQSFormer adaptively queries relevant spatial information from neighboring pixels by leveraging the features of the central pixel, while reducing irrelevant spatial interference through rotation-invariant positional embedding and multi-scale spectral–spatial attention. This approach demonstrates superior performance in HSI classification tasks.
3.5.1. Quantitative Results
Table 3, Table 4 and Table 5 present the classification performance on the IP, PU, and WHHH datasets, including OA, AA, Kappa, and the accuracy for each class. The best performances are highlighted in bold. It can be observed that the proposed model achieves the best overall performance across all three datasets. When using 10 samples per class as the training set, the SpectralFormer, HiT, and SS1DSwin models perform worse than the other algorithms, indicating room for improvement when training samples are scarce. The SSRN, SSFTT, CASST, and LSGA algorithms exhibit higher classification accuracy, demonstrating strong classification ability in limited-sample training scenarios. Our method further improves classification performance, with OA increasing by 2.7%, 1.75%, and 0.34% over the second-best methods on the IP, PU, and WHHH datasets, respectively. Moreover, for the individual classes within each dataset, our model still outperforms the comparison models in most categories. We further analyzed the classes in which our model underperformed relative to the comparison models. For the IP and PU datasets, the gap between our algorithm and the highest classification metric in those classes was relatively small. However, on the WHHH dataset, our algorithm showed a significant gap in the Lactuca sativa category compared to the best-performing algorithm, LSGA (LSGA: 96.85%, ours: 87.52%). Through analysis of the confusion matrix, we found that our model mistakenly classified parts of Lactuca sativa as Small Brassica chinensis and Broad bean. We believe that the proposed model’s emphasis on spatial information may lead to misclassification in regions with similar spatial context, where discrimination relies strongly on pure spectral information.
We evaluated the model’s performance under different training sample sizes. As shown in Figure 6, we adjusted the number of samples used to train each land-cover class to assess the classification performance of different algorithms. The experiments indicate that, as the sample size increases, the classification accuracy of all models improves to varying degrees. Notably, the proposed model demonstrates competitive performance on the IP, PU, and WHHH datasets across different training sample sizes. These experiments further validate the effectiveness of the method.
3.5.2. Visual Evaluation
The final classification results for the IP, PU, and WHHH datasets are shown in Figure 7, Figure 8 and Figure 9. The proposed model achieves the best classification results with the fewest noise points. Specifically, the SpectralFormer, HiT, and SS1DSwin models exhibit more misclassified results, with instances of incorrect classification of similar land covers within the same cluster. In contrast, the SSRN, SSFTT, CASST, and LSGA methods show significant improvements, reducing the occurrence of noise points in the classification results. Our model further enhances classification performance over these methods, particularly in areas with higher land-cover density within a unit space (e.g., the lower-right part of IP, around Bare Soil in PU, and the lower-right part of WHHH). In these cases, our model demonstrates superior classification accuracy compared to the other methods.
3.5.3. Accuracy and Efficiency Analysis
Table 6 presents our systematic comparison of multiple Transformer architectures on the Indian Pines dataset, evaluating both parameter counts and computational complexity (FLOPs). The results reveal that our model exhibits dual advantages: it establishes a new state of the art in computational efficiency (minimum FLOPs) while maintaining exceptional parameter efficiency (second-best parameter count).
As shown in Figure 10, to verify the overall effectiveness of the model and the consistency of performance across classes, ROC curves, F1 scores, and a normalized confusion matrix were employed for comprehensive evaluation. The results demonstrate that the classification model exhibits excellent overall performance with high consistency: the F1 scores reach 0.91 for IP, 0.95 for PU, and 0.91 for WHHH, indicating strong classification capability. In the normalized confusion matrix, the probabilities of off-diagonal elements are generally below 0.15, indicating low inter-class misclassification risk; meanwhile, most class curves in the ROC plot are tightly clustered near the top-left corner (approaching the coordinate point (0,1)), reflecting near-theoretical optimal discriminative ability between positive and negative samples. These three sets of results (F1 scores, confusion matrix, and ROC curves) mutually corroborate, collectively indicating that the model achieves stable and reliable classification performance for the majority of classes, with only minor performance degradation observed in a few classes due to feature overlap or sample characteristics. This further validates the comprehensive effectiveness of the model in multi-class tasks and the consistency of evaluation outcomes.
As illustrated in Figure 11, to examine potential overfitting, we analyzed the training loss and test accuracy curves (recorded every 10 epochs) of the TC-Former model. The results demonstrate excellent convergence and generalization without evident signs of overfitting. Specifically, the training loss rapidly decreased from an initial value of 3.0, approaching zero after approximately 20 epochs and subsequently stabilizing, indicating effective learning on the training data. Concurrently, the overall accuracy (OA) and average accuracy (AA) on the test set progressively increased before stabilizing at 90% and 94.5%, respectively, with an approximately 4% performance gap between the two metrics, reflecting balanced classification performance across categories. Notably, the test accuracy curve exhibited a continuous upward trend followed by stabilization, without the characteristic overfitting pattern of initial improvement followed by decline. These observations collectively confirm that the features learned from the training data transfer well, enabling effective generalization to the test data.
4. Discussion
To further validate the effectiveness of the TimeMixFormer (TF), HyperMixFormer (HF), and Center Attention module (CA), we conducted ablation studies. In these experiments, the network architecture, hyperparameters, and dataset splits were kept consistent while modifying the respective modules. To provide a detailed evaluation of the structures discussed in this study, we designed eight combinations and used a controlled variable method (√ indicates inclusion, × indicates removal) to assess the contribution of each of the three modules, specifically including the following:
- (1)
Our proposed model.
- (2)
Remove only the Center Attention module.
- (3)
Remove only HyperMixFormer.
- (4)
Remove the HyperMixFormer and Center Attention module.
- (5)
Remove only TimeMixFormer.
- (6)
Remove the TimeMixFormer and Center Attention module.
- (7)
Remove both TimeMixFormer and HyperMixFormer.
- (8)
Remove the TimeMixFormer, HyperMixFormer and Center Attention module.
The ablation study results in Table 7 demonstrate the performance impact of removing different modules. A comparison between Experiment 1 and Experiment 5 reveals that removing the TimeMixFormer module leads to performance degradation of 1.02%, 0.98%, and 0.46% on the IP, PU, and WHHH datasets, respectively. When comparing Experiment 1 with Experiment 3, the removal of the HyperMixFormer module results in accuracy reductions of 0.95%, 0.41%, and 1.06% on the respective datasets. The most significant impact is observed when removing the Center Attention module (Experiment 1 vs. Experiment 2), with performance drops of 3.14%, 3.42%, and 7.80% across the three datasets.
Comparative analysis of Experiments 1, 3, and 5 indicates that the combined TF-HF dual-module configuration yields superior performance compared to individual modules, achieving minimum accuracy improvements of 0.95%, 0.98%, and 0.46% on the IP, PU, and WHHH datasets, respectively. For the PU dataset, the OA metric in Experiment 5 approaches that of the complete model, while the removal of TF increases the standard deviation by 1.8%, underscoring the stabilizing role of temporal features in spectro-spatial coordination. Notably, on the WHHH dataset, the standalone TF and CA modules achieve 78.58% and 76.88% accuracy respectively, while their synergistic combination reaches 90.50%, demonstrating TF’s capacity to enhance CA’s adaptability to temporal variations.
The experimental results suggest that the three modules achieve collaborative optimization through the following mechanisms: (1) the TF module establishes a stable temporal feature foundation; (2) the HF module enables cross-channel spectral feature integration; (3) the CA module focuses on central sequence regions.
As shown in Figure 12, we take the IP dataset as an example to examine the loss in accuracy caused by removing the TimeMixFormer, HyperMixFormer, and Center Attention modules under different training sample sizes. The figure shows that, on the IP dataset, the impact of removing the TimeMixFormer is more sensitive to the number of training samples than that of removing the HyperMixFormer. Furthermore, as the sample size continues to increase, the loss caused by removing the CA module remains at around 1%. Notably, as the sample size grows, the loss caused by removing the HyperMixFormer shows a continuously decreasing trend.