1. Introduction
Hyperspectral remote sensing is an advanced technique that utilizes imaging spectrometers to acquire ground-object spectral information across continuous, narrow spectral bands [1,2]. The core principle involves capturing the reflectance or radiation characteristics of ground objects across the broad electromagnetic spectrum from ultraviolet to infrared at high spectral resolution (typically below 10 nm), recording the spectral reflectance data for each pixel. Each pixel therefore represents not a single color value but a complete spectral curve describing its reflectance at different wavelengths [3,4]. This technology can provide data from hundreds of spectral bands, far exceeding conventional multispectral remote sensing, thereby enabling precise identification and classification of ground-object features [5,6,7]. Its distinguishing characteristics include high spectral resolution, multi-band imaging capability, and sensitivity to subtle spectral differences. These features allow accurate capture and in-depth analysis of the spectral characteristics of different entities and materials on the Earth’s surface, enhancing our ability to discriminate various surface features. In fields such as agriculture [8,9], environmental monitoring [10,11], geology [12,13], urban planning [14,15], and national security [16,17], hyperspectral imaging technology has demonstrated outstanding application performance [18,19]. One of its core tasks is hyperspectral image (HSI) classification, which involves accurately assigning each pixel to a specific land-cover category. This task is not only a crucial aspect of hyperspectral imaging technology but also plays an irreplaceable role in its wide range of applications [20].
HSIs are characterized by high-dimensional spectral data, prompting the development of various statistical transformation methods for dimensionality reduction of spectral vectors. These approaches can be broadly categorized into orthogonal transformations (e.g., Principal Component Analysis (PCA) [21] and Singular Value Decomposition (SVD) [22]) and discriminant transformations (e.g., Linear Discriminant Analysis (LDA) [23] and Independent Component Analysis (ICA) [24]). However, these linear methods may fail to fully capture the nonlinear structures inherent in hyperspectral data, which has driven the advancement of nonlinear transformation techniques such as Kernel Discriminant Analysis (KDA) [25], Manifold Hypergraph Learning (MHL) [26], and Kernel Sparse Representation (KSR) [27]. Conventional classification methods based on manually designed spectral–spatial features, including k-Nearest Neighbors, Bayesian estimation, and Support Vector Machines (SVM) [28,29], often prove inadequate in complex environments due to their limited capacity to effectively utilize both spatial and spectral information.
In recent years, deep learning-based HSI classification methods have achieved significant progress through innovative applications of multimodal feature fusion and attention mechanisms, substantially improving classification performance. While traditional CNN methods [30,31,32,33,34] can effectively capture joint spectral–spatial features, their fixed convolutional kernel scales limit multi-scale information processing. To overcome this limitation, researchers have proposed a series of innovative approaches. MSDN-SA [35] first integrated 3D dilated convolution with spectral attention mechanisms, using dense connections to enhance multi-scale feature representation; Bi-LSTM [36] networks further introduced bidirectional spectral dependency modeling, optimizing inter-band correlations through spectral–spatial attention mechanisms. For feature fusion, MSCF-MAM [37] integrates local and global information through pyramid compression attention modules while employing Transformer encoders to model long-range dependencies; DFAN [38] achieves efficient HSI classification through a novel interactive fusion mechanism that dynamically integrates local and global spatial–spectral features. MATNet [39] combines spatial–channel attention with Transformer encoders and a polynomial-adaptive loss function, effectively addressing the boundary-pixel misclassification in hyperspectral imagery caused by redundant spectral information and sparse background distribution. CAN [40] proposes an SDPCA module that extracts features from central pixels and similar neighborhoods, achieving multi-level feature fusion through dense connections to significantly improve edge-region discrimination. More advanced architectures such as AMDPCN [41] adopt dual-path designs incorporating GCN and CNN, dynamically adjusting feature interactions through multi-scale attention mechanisms to effectively balance global spatial relationships with local discriminative power.
To fully exploit the spectral–spatial characteristics of HSI, several Transformer-based feature extraction methods have been developed, including the Hyperspectral image Transformer (HiT) [42], spatial–spectral 1DSwin (SS1DSwin) [43], Spectral–Spatial Feature Tokenization Transformer (SSFTT) [44], Cross-Attention Spatial–Spectral Transformer (CASST) [45], Lightweight Self-Gaussian Attention Transformer (LSGA) [46], and Spectral Query Spatial Transformer (SQSFormer) [47]. The HiT architecture incorporates convolutional operations within Transformer blocks to capture fine-grained spectral–spatial variations, while SS1DSwin employs group feature tokenization and 1D Swin Transformer modules to characterize hierarchical spatial–spectral relationships through dual processing pathways. SSFTT implements Gaussian-weighted tokenization of shallow features for deep semantic extraction, whereas CASST adopts a dual-branch structure with cross-attention mechanisms. LSGA preserves complete patch-wise relationships through hybrid tokens with Gaussian positional bias, and SQSFormer adaptively queries neighborhood information using rotation-invariant positional encoding. However, these approaches face a significant challenge: the quadratic computational complexity of Transformer architectures, coupled with substantial parameter requirements, results in suboptimal efficiency for resource-constrained applications, particularly when processing long spectral sequences. This demands innovative solutions through algorithmic optimization and hardware acceleration strategies.
To address these challenges, this paper proposes the TC-Former framework, which employs a hierarchical architecture constructed by stacking TimeMixFormer and HyperMixFormer modules for deep feature extraction. The TimeMixFormer integrates the WKV operator from the RWKV model [48], a linear-complexity Transformer alternative that combines the parallelizable training of Transformers with the efficient inference of Recurrent Neural Networks, and utilizes learnable exponential decay parameters to enable efficient cross-timestep information propagation. The HyperMixFormer enhances feature interactions along the channel dimension through gated convolutional structures. The framework incorporates a central-pixel focusing mechanism that processes spatial information by dynamically generating attention masks emphasizing critical regions. Its key innovation is the simultaneous improvement of HSI classification accuracy and significant reduction of computational complexity for long sequences through RWKV technology. Operationally, TimeMixFormer and HyperMixFormer transfer and blend features across current and previous timesteps; this feature-mixing process enhances robustness to spatial orientation, while the hyper attention module, incorporating layer normalization and a residual design, reduces computational complexity to enable real-time processing of long spectral sequences (e.g., 200+ bands) and improves translational adaptability for HSI. A dual-reuse mechanism achieves parameter sharing to substantially decrease the total model parameters, keeping central-pixel classification lightweight, while dynamic feature-weight adjustment enhances model adaptability and reduces computational overhead. Finally, TC-Former’s progressive filtering layers achieve adaptive spectral–spatial feature refinement, effectively suppressing the propagation of irrelevant information while maintaining computational efficiency and improving classification performance. The core contributions of this work are threefold:
This study proposes TC-Former, a novel HSI classification framework that innovatively combines the global modeling capability of Transformer with the linear attention advantages of RWKV. The hybrid architecture effectively addresses the computational efficiency bottleneck of conventional Transformer models while achieving significant performance improvements. Experimental evaluations on three public benchmark datasets demonstrate superior classification performance of the proposed framework.
By integrating RWKV’s linear attention with Transformer, we propose a channel-guided attention design that replaces dot-product computation, reducing complexity while maintaining performance. The model significantly reduces computational complexity through TimeMixFormer’s learnable decay factors, enabling efficient processing of long spectral sequences. A dual-path mechanism combines time-mixing and hyper-mixing to capture both temporal and cross-channel patterns. A lightweight micro-attention module further enhances global dependency modeling.
This work introduces a hybrid architecture combining TimeMixFormer and HyperMixFormer modules to enhance feature representation. TimeMixFormer employs learnable time-decay matrices to achieve efficient temporal modeling of long-range dependencies while maintaining computational tractability. HyperMixFormer improves spectral interactions via dynamic channel weighting and dual Mish activations, creating a higher-order feature space. This design excels at modeling nonlinear spectral reflectance in hyperspectral imagery, especially for complex features and blurred boundaries.
2. Materials and Methods
2.1. Overview
The overall architecture of our proposed model is illustrated in Figure 1a. Given an HSI denoted as $\mathcal{I} \in \mathbb{R}^{H \times W \times C}$, where $C$ represents the number of spectral bands, $H$ denotes the height, and $W$ indicates the image width, the processing pipeline consists of several key stages. During preprocessing, the input HSI first undergoes PCA dimensionality reduction to extract crucial spectral features, yielding an intermediate representation $\mathcal{I}' \in \mathbb{R}^{H \times W \times C'}$, where $C'$ is the reduced spectral dimension. This is followed by spatial feature extraction through a convolutional layer (Conv2D+BN+ReLU), after which the feature maps are reshaped into sequence format.
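The following PyTorch sketch illustrates this preprocessing stem. The module name PreprocessStem, the 64-dimensional embedding width, and the 3 × 3 kernel are illustrative assumptions rather than the tuned configuration (see Section 3.4 for the values used in the experiments).

```python
import torch
import torch.nn as nn

class PreprocessStem(nn.Module):
    """Illustrative stem: PCA-reduced cube -> Conv2D+BN+ReLU -> token sequence."""
    def __init__(self, pca_channels: int, embed_dim: int = 64, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(pca_channels, embed_dim, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C', H, W) patch after PCA dimensionality reduction
        f = self.conv(x)                     # (B, D, H, W) spatial features
        return f.flatten(2).transpose(1, 2)  # (B, H*W, D) sequence of pixel tokens

# PCA itself is typically fit once on the flattened image, e.g. with scikit-learn:
# from sklearn.decomposition import PCA
# reduced = PCA(n_components=Cp).fit_transform(image.reshape(-1, C))
```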
The core of our proposed model comprises multiple TimeMixFormer and HyperMixFormer modules, as illustrated in Figure 1b,c. The TimeMixFormer captures temporal dependencies across spectral bands through RWKV-based temporal mixing operations, while the HyperMixFormer processes spatial dependencies by employing a hybrid channel attention mechanism that utilizes central-pixel spatial information as queries to aggregate relevant features from neighboring pixels. The Center Attention mechanism dynamically queries adjacent pixels using central-pixel features and integrates multi-level embedded representations, thereby facilitating joint spectral–spatial feature learning for enhanced classification performance. These three modules operate synergistically: the TimeMixFormer captures spectral–temporal interactions, the HyperMixFormer models spatial relationships, and the Center Attention mechanism further refines spatial focus for classification tasks.
After processing through the TimeMixFormer, HyperMixFormer, and Center Attention modules, we perform feature fusion in which the final embeddings are aggregated. The multi-level central-pixel embeddings are integrated and fed into an MLP (Multi-Layer Perceptron) classifier to predict the land-cover category of the corresponding pixel. In this study, we employ the cross-entropy (CE) loss as our objective function, defined as
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{K} y_{i,c}\,\log \hat{y}_{i,c},$$
where $N$ is the number of training samples, $K$ is the number of classes, $y_{i,c}$ is the one-hot ground-truth label, and $\hat{y}_{i,c}$ is the predicted probability for class $c$.
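A minimal sketch of this fusion-and-classification head is given below. The concatenation-based fusion, the layer sizes, and the class name CenterPixelClassifier are illustrative assumptions, not the exact head used in TC-Former.

```python
import torch
import torch.nn as nn

class CenterPixelClassifier(nn.Module):
    """Illustrative head: fuse multi-level center-pixel embeddings, then MLP."""
    def __init__(self, embed_dim: int, num_levels: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(embed_dim * num_levels),
            nn.Linear(embed_dim * num_levels, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, center_embeddings: list[torch.Tensor]) -> torch.Tensor:
        # center_embeddings: per-level (B, D) features of the central pixel
        fused = torch.cat(center_embeddings, dim=-1)  # (B, D * num_levels)
        return self.mlp(fused)                        # (B, num_classes) logits

criterion = nn.CrossEntropyLoss()  # implements the CE objective defined above
```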
2.2. TimeMixFormer
This study proposes an efficient sequence modeling method based on TimeMixFormer (as illustrated in Figure 1b), designed to address the high computational complexity and weak explicit temporal modeling capability of traditional Transformers in long-sequence modeling. The proposed TimeMixFormer consists of a two-layer TimeMixing module connected in series with a lightweight attention mechanism (Tiny Attention), forming a synergistic structure of “local temporal modeling + global feature interaction”. This approach significantly enhances classification performance through temporal–spatial collaborative modeling and multi-scale feature fusion.
The time-mixed computation vector is a linear projection of the linear combination of the current and previous block inputs [48]:
$$r_t = W_r \cdot \left(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}\right),$$
$$k_t = W_k \cdot \left(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}\right),$$
$$v_t = W_v \cdot \left(\mu_v \odot x_t + (1-\mu_v) \odot x_{t-1}\right),$$
where $x_t$ is the block input at step $t$, $\mu_r$, $\mu_k$, and $\mu_v$ are learnable mixing coefficients, and $W_r$, $W_k$, and $W_v$ are projection matrices.
Token shifting is implemented as a simple temporal shift along the time dimension of each block using the PyTorch library, specifically via nn.ZeroPad2d((0, 0, 1, -1)).
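The snippet below demonstrates this token-shift trick and the subsequent mixing; the tensor shapes and the per-channel coefficient mu are illustrative.

```python
import torch
import torch.nn as nn

# Token shift as described: pad one zero step at the start of the time axis
# and trim the last step, so position t sees the features of position t-1.
token_shift = nn.ZeroPad2d((0, 0, 1, -1))  # (left, right, top, bottom) on (T, D)

x = torch.randn(2, 5, 8)      # (batch, T, D)
x_prev = token_shift(x)       # x_prev[:, t] == x[:, t-1]; x_prev[:, 0] == 0

# The time-mixed projections then interpolate current and shifted inputs with
# learnable mixing coefficients mu, as in the equations above:
mu = torch.rand(8)            # illustrative per-channel mixing weight
mixed = mu * x + (1 - mu) * x_prev
```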
In this model, the computation of the WKV operator shares similarities with the Attention Free Transformer (AFT) [48]. However, unlike AFT, which parameterizes $W$ as a pairwise interaction matrix, our approach reformulates $W$ as a channel-wise partitioned vector that is adaptively adjusted based on relative positional information. Furthermore, the model dynamically updates the WKV vectors in a time-recursive manner, introducing recurrent behavior through sequential state propagation. The detailed computation process is as follows:
$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}},$$
where $w$ is the learnable channel-wise decay vector and $u$ is a per-channel bonus applied to the current token.
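For concreteness, the following is a naive sequential transcription of this operator in PyTorch; wkv_naive is an illustrative name, and a production RWKV kernel additionally tracks a running maximum of the exponents for numerical stability, which this readable version omits.

```python
import torch

def wkv_naive(w, u, k, v):
    """Direct sequential transcription of the WKV formula above.

    w: (D,) channel-wise decay rates; u: (D,) current-token bonus
    k, v: (B, T, D) keys and values from the time-mixing projections
    """
    B, T, D = k.shape
    num = torch.zeros(B, D, dtype=k.dtype, device=k.device)  # decayed sum of e^{k_i} v_i
    den = torch.zeros(B, D, dtype=k.dtype, device=k.device)  # decayed sum of e^{k_i}
    decay = torch.exp(-w)                                    # learnable exponential decay
    out = torch.empty_like(v)
    for t in range(T):
        cur = torch.exp(u + k[:, t])                         # extra weight on token t itself
        out[:, t] = (num + cur * v[:, t]) / (den + cur)
        num = decay * num + torch.exp(k[:, t]) * v[:, t]     # propagate state to step t+1
        den = decay * den + torch.exp(k[:, t])
    return out
```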
The output gating in both the TimeMixing and HyperMixing modules is implemented via a sigmoid function $\sigma$ applied to the receptance terms. The output vector $o_t$ [48] after WKV operator processing is expressed as follows:
$$o_t = W_o \cdot \left(\sigma(r_t) \odot wkv_t\right).$$
In terms of specific design, TimeMixFormer integrates time-decay weights with gated attention to achieve efficient spatiotemporal feature modeling. The first TimeMixing layer captures local correlations between spectral bands along the spectral dimension, employing a time-decay weight mechanism ($w$) to adaptively regulate dependency strength among different bands, thereby effectively capturing short-term temporal characteristics. The second TimeMixing layer models spatial neighborhood relationships in higher-order feature representations, leveraging spatially adaptive weights to capture local spatial features while further modeling long-term temporal patterns at an advanced feature level. The model operates with lightweight computations, expressed as $o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)$, where $\sigma$ is the sigmoid function, ensuring efficient processing.
To address the limitations of TimeMixing in modeling long-range dependencies, this study proposes a lightweight attention module named Tiny Attention. This module employs 1–4 attention heads and computes attention with low-dimensional projections, effectively balancing computational complexity and modeling capability through a sparse computation strategy. Unlike standard Transformers, which generate $Q$, $K$, and $V$ using three separate linear layers, Tiny Attention combines a single linear projection with a chunk operation that splits the projected tensor into three equal parts along the feature dimension, minimizing memory access through shared weight matrices. By sharing projection matrices and partitioning features, this approach reduces both parameters and memory access operations while preserving the core functionality of attention mechanisms, thereby achieving superior computational efficiency. This design is particularly suitable for long-sequence modeling tasks in resource-constrained scenarios.
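A minimal sketch of such a module is shown below, assuming a 64-dimensional projection and PyTorch's built-in multi-head attention; the class name TinyAttention and the exact dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Lightweight attention: one shared projection plus chunk, instead of
    three separate Q/K/V linear layers (illustrative sketch)."""
    def __init__(self, d_model: int, d_attn: int = 64, n_heads: int = 1):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_attn)   # single shared projection
        self.attn = nn.MultiheadAttention(d_attn, n_heads, batch_first=True)
        self.out = nn.Linear(d_attn, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)   # split along the feature dim
        y, _ = self.attn(q, k, v, need_weights=False)
        return self.out(y)
```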
2.3. HyperMixFormer
In this study, we propose HyperMixFormer, a novel architecture composed of two HyperMixing modules and one Tiny Attention module connected in series (as illustrated in Figure 1c), designed to achieve efficient feature mixing and global information interaction. The HyperMixing module first transforms input features through projections. It then performs temporal modeling via time-shift operations and extracts localized features using the Mish activation. The Mish function [49], defined as $\mathrm{Mish}(x) = x \tanh(\ln(1 + e^{x}))$ (Equation (7)), preserves small negative values during training, enhancing gradient flow and feature expressiveness while maintaining computational efficiency. A gating mechanism, implemented through sigmoid-activated receptance, adaptively regulates feature updates to enable enhanced channel-wise feature representation. Each HyperMixing module incorporates a residual connection and further refines features through an output adjustment layer for dimension compression. Subsequently, HyperMixFormer introduces a computationally efficient Tiny Attention module that operates with a minimal number of attention heads (typically 1–4). This lightweight attention mechanism performs localized Softmax-based computations to effectively capture long-range dependencies while facilitating global temporal interactions across sequence steps.
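The following sketch models one HyperMixing block under this description. It is patterned on RWKV-style channel mixing with Mish substituted for the squared ReLU of [48]; the hidden width, parameter names, and the single placement of Mish are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperMixing(nn.Module):
    """Channel-mixing sketch: token shift, Mish-activated key path, and
    sigmoid receptance gating with a residual connection."""
    def __init__(self, d_model: int, hidden: int = None):
        super().__init__()
        hidden = hidden or 4 * d_model
        self.shift = nn.ZeroPad2d((0, 0, 1, -1))     # same token shift as TimeMixing
        self.mu_k = nn.Parameter(torch.rand(d_model))
        self.mu_r = nn.Parameter(torch.rand(d_model))
        self.key = nn.Linear(d_model, hidden)
        self.value = nn.Linear(hidden, d_model)      # output adjustment / compression
        self.receptance = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, D)
        xs = self.shift(x)
        k = self.key(self.mu_k * x + (1 - self.mu_k) * xs)
        r = self.receptance(self.mu_r * x + (1 - self.mu_r) * xs)
        v = self.value(F.mish(k))                    # Mish: x * tanh(softplus(x))
        return x + torch.sigmoid(r) * v              # gated update + residual
```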
In the HyperMixing module, the input features (as shown in Equations (8) and (9)) [48] are first processed by the WKV operator (Equation (5)) before being fed into the channel mixing module. This module generates enhanced features $o_t$ [48] (Equation (10)) through gated nonlinear transformations, combining the Mish activation function with value modulation for efficient feature interaction. The sigmoid gating $\sigma(r_t)$ controls information flow while maintaining gradient stability, benefiting from the smooth nonlinear properties of Mish.
This temporal processing pipeline then enters the scaled dot-product attention mechanism of the Tiny Attention module, which approximates the WKV operator in Equation (5) through multi-head attention computations. This approach achieves context-aware feature reweighting while preserving gradient stability. The system subsequently performs precise gated nonlinear transformations via channel mixing: the Mish-activated feature modulation and sigmoid-controlled information flow strictly adhere to the mathematical formulation of Equation (10), combining element-wise operations ($\odot$) with zero-initialized gating weights to balance feature interactions.
HyperMixFormer forms an efficient local feature modeling structure by invoking the HyperMixing module with shared parameters twice in succession. The first HyperMixing pass performs preliminary channel feature reconstruction and screening at the local scale, regulating feature expression through the gating mechanism and the WKV weighting mechanism. On this basis, the second pass further refines the feature relationships, strengthening key features and suppressing redundancy to form a more discriminative high-level feature representation. Internally, each HyperMixing pass employs Mish nonlinear activation, time-shift operations, LayerNorm, and residual connections, which enhance nonlinear interaction across bands, maintain feature fluidity, suppress noise interference, and stabilize gradient propagation. This dual channel-mixing design realizes multi-level feature fusion, improving the model’s ability to express complex spectral–spatial patterns, and is especially suitable for high-dimensional HSI data. Meanwhile, without introducing additional parameters, the parameter-reuse strategy (similar to the “bottleneck structure” in residual networks) increases feature abstraction and modeling depth.
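A compositional sketch of this reuse strategy, built from the illustrative HyperMixing and TinyAttention classes above, might look as follows; the block layout is an assumption rather than the exact wiring of HyperMixFormer.

```python
import torch

d_model = 64
mixer = HyperMixing(d_model)      # single parameter set, reused twice
tiny_attn = TinyAttention(d_model)

def hypermixformer_block(x: torch.Tensor) -> torch.Tensor:
    x = mixer(x)                  # pass 1: coarse channel reconstruction/screening
    x = mixer(x)                  # pass 2, same weights: refine, suppress redundancy
    return x + tiny_attn(x)       # lightweight global interaction on refined features
```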
To further enhance the model’s ability to capture long-range dependencies, a lightweight multi-head attention module (Tiny Attention) is introduced after the double-layer HyperMixing. Tiny Attention computes attention through low-dimensional projections (e.g., 64-dimensional) and cooperates with a sparse mask mechanism to screen out invalid regions. It enhances global context-awareness at modest computational cost.
The high-quality local features refined by the double-layer HyperMixing provide more discriminative inputs for Tiny Attention, thus effectively improving the classification consistency and long-range feature interaction ability. Overall, HyperMixFormer achieves efficient local feature modeling while maintaining long-range dependency capture through auxiliary mechanisms, striking an effective balance between computational efficiency and model performance. This design makes it particularly suitable for large-scale or long-sequence tasks where both localized and global representations are critical.
3. Results
3.1. Datasets Description
To validate the effectiveness of the proposed algorithm, we conducted experiments on three publicly available hyperspectral remote sensing datasets: the Indian Pines dataset, the Pavia University dataset, and the WHU-Hi-HongHu dataset. These datasets are widely used in the field of HSI classification and can effectively evaluate the algorithm’s performance across different scenarios.
The Indian Pines (IP) dataset represents a benchmark dataset in remote sensing research, acquired on 12 June 1992 using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The dataset covers an approximately 2 × 2 mile agricultural test site in northwestern Indiana, USA, encompassing Purdue University’s Agronomy Farm and adjacent watersheds. The dataset features a spatial resolution of 20 m per pixel, with an image dimension of 145 × 145 pixels. The original hyperspectral data comprises 224 contiguous spectral bands covering the visible to short-wave infrared spectrum (0.4–2.5 μm). Through rigorous quality control procedures, 24 bands significantly affected by water vapor absorption and noise artifacts (specifically bands 104–108, 150–163, and 220) were removed, resulting in 200 retained spectral channels for subsequent analysis. The ground truth data contains annotations for 16 distinct land cover classes, representing the predominant vegetation types and surface features in this agricultural test area.
Figure 2 presents a false-color composite of the dataset along with its corresponding ground truth classification map.
The Pavia University (PU) dataset was acquired over the geographical area of the University of Pavia in northern Italy in 2001 using the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor. The original data were provided by Prof. Paolo Gamba’s team at the University of Pavia. Initially comprising 115 spectral bands, the dataset was subsequently processed by removing 12 noisy bands, resulting in 103 retained bands for further analysis. This dataset covers an urban and campus mixed scene, with a spatial dimension of 610 × 340 pixels and a spatial resolution of 1.3 m per pixel. The spectral range spans from 430 to 860 nm, encompassing nine distinct land-cover classes.
Figure 3 presents a false-color composite image of the dataset along with its corresponding ground truth classification map.
The WHU-Hi-HongHu (WHHH) dataset, released by the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS) at Wuhan University, represents a benchmark hyperspectral remote sensing dataset covering typical wetland ecosystems in Honghu City, Hubei Province, China. The data acquisition was conducted using a Headwall Nano-Hyperspec imaging sensor (17 mm focal length) mounted on a DJI Matrice 600 Pro UAV platform during a low-altitude aerial survey from 16:23 to 17:37 on 20 November 2017. The dataset was acquired at a constant flight altitude of 100 m, yielding hyperspectral imagery with spatial dimensions of 940 × 475 pixels across 270 spectral channels (spectral range: 400–1000 nm). After rigorous geometric correction, the data achieves an exceptional spatial resolution of 0.043 m. This high-resolution dataset encompasses complex agricultural landscapes featuring 22 distinct land cover categories, including various crop types and their cultivars (e.g., lettuce varieties, oilseed rape, and cotton).
Figure 4 presents a false-color composite of the dataset along with its corresponding ground truth classification map.
Table 1 presents the number of training and testing samples for each class in the three datasets used in this experiment, including the sample distribution across each land cover category.
3.2. Experimental Evaluation Indicators
To quantitatively evaluate the effectiveness of our proposed method and compare it with other approaches, we rely on four key performance metrics: Overall Accuracy (OA), Average Accuracy (AA), Kappa coefficient (Kappa), and per-class accuracy. These metrics provide a comprehensive view of classification performance, with higher values indicating superior classification results.
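These metrics can all be derived from the confusion matrix. The helper below is a standard formulation given for reference, not the exact evaluation code used in the experiments.

```python
import numpy as np

def classification_metrics(conf: np.ndarray):
    """OA, AA, per-class accuracy, and Cohen's kappa from a confusion
    matrix whose rows are ground truth and columns are predictions."""
    total = conf.sum()
    oa = np.trace(conf) / total                       # Overall Accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)      # per-class accuracy (recall)
    aa = per_class.mean()                             # Average Accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                      # Kappa coefficient
    return oa, aa, per_class, kappa
```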
3.3. Brief Description and Settings of Compared Methods
All experiments were implemented and trained using the PyTorch 2.4.0 framework, with the hardware environment consisting of a GeForce RTX 4090 GPU and a Windows 11 operating system. We used the Adam optimizer with a learning rate of 1 × 10⁻³ to effectively guide the training process. The batch size, an important factor influencing memory consumption and training efficiency, was dynamically adjusted based on the available GPU memory. To ensure optimal processing, batch sizes of 200, 200, and 20 were allocated for the IP, PU, and WHHH datasets, respectively, in accordance with their characteristics. Furthermore, in our network architecture, the depth was set to 2 to more clearly identify potential relationships between variables. Additionally, Section 4 provides a comprehensive discussion of important parameters, including patch size, convolutional kernel size, and the number of PCA-retained channels.
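The essentials of this configuration can be summarized in a short sketch; the tiny Sequential network is a stand-in for TC-Former, used only to keep the snippet self-contained and runnable.

```python
import torch
import torch.nn as nn

# Illustrative training configuration matching the settings described above.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr = 1e-3
criterion = nn.CrossEntropyLoss()
batch_sizes = {"IP": 200, "PU": 200, "WHHH": 20}           # per-dataset batch sizes
```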
3.4. Parameter Analysis
We conducted a detailed analysis of key parameters, including the convolutional kernel size, the patch size of the input data, and the number of channels retained by PCA. Figure 5 presents the experimental results for optimizing these parameters. We found that different datasets may have different optimal parameter selections. To ensure fair experimental comparisons, this study adopts a controlled-variable approach for parameter optimization, modifying only one parameter at a time while keeping all others constant. The initial parameter settings were primarily determined from the optimal benchmark configurations of the comparative algorithms, thereby guaranteeing the scientific rigor and comparability of the experimental design. For the kernel size, a value of 19 yielded good results on the IP, PU, and WHHH datasets; we also observed that this parameter had less impact on the IP dataset than on the PU and WHHH datasets. Regarding patch size, our algorithm performed better with relatively large patches. Experiments showed that with smaller patch sizes, such as 5 × 5 and 10 × 10, the OA on all three datasets was relatively low. As the patch size increased, the OA on the IP dataset showed no significant change, the OA on the PU dataset first increased and then slowly decreased, and the OA on the WHHH dataset continuously increased. The optimal patch sizes for the IP, PU, and WHHH datasets were found to be 15, 20, and 30, respectively. The number of PCA-retained channels also influenced the results on all three datasets; the experiments revealed that the optimal numbers of PCA channels for the IP, PU, and WHHH datasets were 100, 30, and 150, respectively.
3.5. Experimental Results
To evaluate the efficacy of the proposed methodology, we conducted comprehensive comparative experiments with eight state-of-the-art hyperspectral image classification approaches as benchmarks: SSRN [50], SpectralFormer, HiT, SS1DSwin, SSFTT, CASST, LSGA, and SQSFormer. The experiments strictly adhered to the principle of controlled variables, with all comparative algorithms employing the standardized hyperparameter configuration scheme detailed in Table 2, which encompasses a consistent learning rate, batch size, and number of training epochs. This rigorous experimental protocol ensures the reliability and comparability of the obtained results.
SSRN (Spectral–Spatial Residual Network) directly processes raw 3D HSI cubes, using spectral–spatial residual blocks with identity mapping and batch normalization to enhance feature learning and classification accuracy, thereby achieving state-of-the-art results on multiple datasets.
HiT (Hyperspectral image Transformer) is designed to capture spectral sequence and local spectral–spatial features. It introduces two key modules: a spectral-adaptive 3D convolutional projection for extracting spectral–spatial information, and a convolutional transformer that encodes features across height, width, and spectral dimensions.
SpectralFormer is a transformer-based network for HSI classification that captures spectral–sequence attributes using group-wise spectral embeddings and cross-layer skip connections. It effectively models spectral continuity and supports both pixel- and patch-wise inputs.
SS1DSwin captures local and hierarchical spatial–spectral relationships through groupwise feature tokenization and a 1DSwin Transformer with cross-block normalized connections. It is one of the latest representative transformer-based algorithms for HSI classification.
SSFTT captures spectral–spatial and high-level semantic features by using a 3-D and 2-D convolutional layer for feature extraction, followed by a Gaussian weighted feature tokenizer and transformer encoder, with competitive effectiveness in the current HSI classification task.
CASST uses a dual-branch structure for spatial and spectral feature extraction, with spectral–spatial cross-attention and weighted sharing mechanisms to improve feature fusion and capture robust semantics, with competitive effectiveness in the current HSI classification task.
LSGA introduces the light self-Gaussian attention mechanism while extracting global deep features, reducing computation and parameters. The hybrid spectral–spatial tokenizer captures shallow features, and the Gaussian absolute position bias enhances the attention weights for the central query block.
SQSFormer adaptively queries relevant spatial information from neighboring pixels by leveraging the features of the central pixel, while reducing irrelevant spatial interference through rotation-invariant positional embedding and multi-scale spectral–spatial attention. This approach demonstrates superior performance in HSI classification tasks.
3.5.1. Quantitative Results
Table 3, Table 4 and Table 5 present the classification performance on the IP, PU, and WHHH datasets, including OA, AA, Kappa, and the accuracy for each class. The best performances are highlighted in bold. It can be observed that the proposed model achieves the best overall performance across all three datasets. When using 10 samples per class as the training set, the SpectralFormer, HiT, and SS1DSwin models perform worse than the other algorithms, indicating room for improvement when training samples are scarce. The SSRN, SSFTT, CASST, and LSGA algorithms exhibit higher classification accuracy, demonstrating strong classification ability in limited-sample training scenarios. Our method further improves classification performance, with OA increasing by 2.7%, 1.75%, and 0.34% over the second-best methods on the IP, PU, and WHHH datasets, respectively. Moreover, for the individual classes within each dataset, our model still outperforms the comparison models in most categories. We further analyzed the classes in which our model underperformed relative to the comparison models. For the IP and PU datasets, the gap between our algorithm and the highest classification metric in those classes was relatively small. However, on the WHHH dataset, our algorithm showed a significant gap in the Lactuca sativa category compared to the best-performing algorithm, LSGA (LSGA: 96.85%, ours: 87.52%). Through analysis of the confusion matrix, we found that our model mistakenly classified parts of Lactuca sativa as Small Brassica chinensis and Broad bean. We believe that the proposed model’s emphasis on spatial information may lead to misclassification in regions with similar spatial context, where discrimination relies strongly on pure spectral information.
We evaluated the model’s performance under different training sample sizes. As shown in Figure 6, we adjusted the number of samples used to train each land-cover class to assess the classification performance of different algorithms. The experiments indicate that, as the sample size increases, the classification accuracy of all models improves to varying degrees. Notably, the proposed model demonstrates competitive performance on the IP, PU, and WHHH datasets across different training sample sizes. These experiments further validate the effectiveness of the method.
3.5.2. Visual Evaluation
The final classification results for the IP, PU, and WHHH datasets are shown in Figure 7, Figure 8 and Figure 9. The proposed model achieves the best classification results with the fewest noise points. Specifically, the SpectralFormer, HiT, and SS1DSwin models exhibit more misclassified results, with instances of incorrect classification of similar land covers within the same cluster. In contrast, the SSRN, SSFTT, CASST, and LSGA methods show significant improvements, reducing the occurrence of noise points in the classification results. Our model further enhances classification performance over these methods, particularly in areas with higher land-cover density within a unit space (e.g., the lower-right part of IP, around Bare Soil in PU, and the lower-right part of WHHH). In these cases, our model demonstrates superior classification accuracy compared to the other methods.
3.5.3. Accuracy and Efficiency Analysis
Table 6 presents our systematic comparison of multiple Transformer architectures on the Indian Pines dataset, evaluating both parameter counts and computational complexity (FLOPs). The results reveal that our model exhibits dual advantages: it establishes a new state of the art in computational efficiency (minimum FLOPs) while maintaining exceptional parameter efficiency (second-best parameter count).
As shown in Figure 10, to verify the overall effectiveness of the model and the consistency of performance across classes, ROC curves, F1 scores, and a normalized confusion matrix were employed for comprehensive evaluation. The results demonstrate that the classification model exhibits excellent overall performance with high consistency: the F1 scores reach 0.91 for IP, 0.95 for PU, and 0.91 for WHHH, indicating strong classification capability. In the normalized confusion matrix, the probabilities of off-diagonal elements are generally below 0.15, indicating low inter-class misclassification risk; meanwhile, most class curves in the ROC plot are tightly clustered near the top-left corner (approaching the coordinate point (0,1)), reflecting near-theoretical optimal discriminative ability between positive and negative samples. These three sets of results (F1 scores, confusion matrix, and ROC curves) mutually corroborate, collectively indicating that the model achieves stable and reliable classification performance for the majority of classes, with only minor performance degradation observed in a few classes due to feature overlap or sample characteristics. This further validates the comprehensive effectiveness of the model in multi-class tasks and the consistency of evaluation outcomes.
As illustrated in Figure 11, to examine potential overfitting, we analyzed the training loss and test accuracy curves (recorded every 10 epochs) of the TC-Former model. The results demonstrate excellent convergence and generalization without evident signs of overfitting. Specifically, the training loss rapidly decreased from an initial value of 3.0, approaching zero after approximately 20 epochs and subsequently stabilizing, indicating effective learning on the training data. Concurrently, the overall accuracy (OA) and average accuracy (AA) on the test set progressively increased before stabilizing at 90% and 94.5%, respectively, with an approximately 4% performance gap between the two metrics, reflecting balanced classification performance across categories. Notably, the test accuracy curve exhibited a continuous upward trend followed by stabilization, without the characteristic overfitting pattern of initial improvement followed by decline. These observations collectively confirm that the features learned from the training data transfer well, enabling effective generalization to the test data.
4. Discussion
To further validate the effectiveness of the TimeMixFormer (TF), HyperMixFormer (HF), and Center Attention module (CA), we conducted ablation studies. In these experiments, the network architecture, hyperparameters, and dataset splits were kept consistent while modifying the respective modules. To provide a detailed evaluation of the structures discussed in this study, we designed eight combinations and used a controlled variable method (√ indicates inclusion, × indicates removal) to assess the contribution of each of the three modules, specifically including the following:
- (1)
Our proposed model.
- (2)
Remove only the Center Attention module.
- (3)
Remove only HyperMixFormer.
- (4)
Remove the HyperMixFormer and Center Attention module.
- (5)
Remove only TimeMixFormer.
- (6)
Remove the TimeMixFormer and Center Attention module.
- (7)
Remove both TimeMixFormer and HyperMixFormer.
- (8)
Remove the TimeMixFormer, HyperMixFormer and Center Attention module.
The ablation study results in Table 7 demonstrate the performance impact of removing different modules. A comparison between Experiment 1 and Experiment 5 reveals that removing the TimeMixFormer module leads to performance degradation of 1.02%, 0.98%, and 0.46% on the IP, PU, and WHHH datasets, respectively. When comparing Experiment 1 with Experiment 3, the removal of the HyperMixFormer module results in accuracy reductions of 0.95%, 0.41%, and 1.06% on the respective datasets. The most significant impact is observed when removing the Center Attention module (Experiment 1 vs. Experiment 2), with performance drops of 3.14%, 3.42%, and 7.80% across the three datasets.
Comparative analysis of Experiments 1, 3, and 5 indicates that the combined TF-HF dual-module configuration yields superior performance compared to individual modules, achieving minimum accuracy improvements of 0.95%, 0.98%, and 0.46% on the IP, PU, and WHHH datasets, respectively. For the PU dataset, the OA metric in Experiment 5 approaches that of the complete model, while the removal of TF increases the standard deviation by 1.8%, underscoring the stabilizing role of temporal features in spectro-spatial coordination. Notably, on the WHHH dataset, the standalone TF and CA modules achieve 78.58% and 76.88% accuracy respectively, while their synergistic combination reaches 90.50%, demonstrating TF’s capacity to enhance CA’s adaptability to temporal variations.
The experimental results suggest that the three modules achieve collaborative optimization through the following mechanisms: (1) the TF module establishes a stable temporal feature foundation; (2) the HF module enables cross-channel spectral feature integration; (3) the CA module focuses on central sequence regions.
As shown in Figure 12, we take the IP dataset as an example to examine the loss in accuracy caused by removing the TimeMixFormer, HyperMixFormer, and Center Attention modules under different training sample sizes. The figure shows that, on the IP dataset, the impact of removing the TimeMixFormer is more sensitive to the number of training samples than that of removing the HyperMixFormer. Furthermore, as the sample size continues to increase, the loss caused by removing the CA module remains at around 1%. Notably, as the sample size grows, the loss caused by removing the HyperMixFormer shows a continuously decreasing trend.