1. Introduction
Hyperspectral images (HSIs) are widely used in remote sensing (RS) due to their abundance of spatial and spectral information [1]. Compared with natural images, HSIs consist of numerous dense and narrow spectral bands [2], allowing for precise identification of land categories [3,4,5,6]. Consequently, HSIs have distinct advantages in various fields based on these characteristics, including ground material identification [7], precision agriculture [8,9] and scene understanding [10]. Among these applications, HSI classification emerges as a critical task.
In recent years, various HSI classification methods have been proposed, including support vector machine (SVM) [11,12], k-nearest neighbors (KNN) [13,14] and random forest (RF) [15]. These algorithms have achieved remarkable results by utilizing spectral information effectively. While SVM performs well in high-dimensional problems, it requires careful selection of several key parameters [16]. In [17], linear discriminant analysis (LDA) was used for HSI classification. However, these classification methods still have much room for improvement because they do not exploit spatial characteristics. Therefore, the extended morphological attribute profile (EMAP) took texture and morphological features into consideration [18]. Subsequently, methods based on Gabor filters were introduced to capture texture and edge features in images [19,20]. Nevertheless, these methods fail to consider the correlation between spatial and spectral information. Thus, in [21,22,23], spectral and spatial information were jointly extracted for HSI classification. Even so, these traditional methods extract shallow texture features and fail to reflect deep connections between spectral–spatial features. In this paper, in addition to the spectral–spatial features of the HSI itself, we develop a lower branch that processes the local binary pattern (LBP) features of the first principal component band. LBP is an efficient approach for describing the local texture of an image: it generates a binary encoding by comparing the gray value of a pixel with those of its neighboring pixels. This operation can enhance the representation of spatial information by modeling spatial structures such as edges and corner points in the image. The upper and lower branches combine spectral and spatial information in order to describe the properties of HSI more comprehensively and accurately.
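The LBP encoding described above can be illustrated with a minimal NumPy sketch. It assumes the basic 3 × 3 variant with eight neighbors and a "greater than or equal" comparison; the function name `lbp_8neighbors`, the bit ordering and the handling of border pixels are illustrative choices and are not taken from the proposed network.

```python
import numpy as np

def lbp_8neighbors(gray):
    """Basic 3x3 LBP: one 8-bit code for every interior pixel.

    Assumption: border pixels are skipped, so the output is
    (H - 2, W - 2) for an (H, W) input.
    """
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    # Neighbor offsets, clockwise from the top-left corner (bit 0 .. bit 7).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image aligned with the interior pixels.
        neighbor = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbor is at least as bright as the center.
        codes |= (neighbor >= center).astype(np.uint8) << np.uint8(bit)
    return codes
```

In such a sketch, flat regions map to a single code, while edges and corners produce distinct bit patterns, which is why the codes act as a compact texture descriptor.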
Compared with traditional machine learning classification techniques, deep learning (DL) [24] offers automatic feature learning and strong classification ability and is widely used in HSI classification (HSIC). Notably, the CNN model [25,26] is the most prevalent choice because of its proficiency in extracting features efficiently, its adaptability to high-dimensional data, its retention of spatial information, and its capability to handle large-scale data. Yang et al. [27] introduced four novel deep learning models, encompassing both two-dimensional CNN (2D-CNN) and three-dimensional CNN (3D-CNN). Their research revealed that, while the 2D-CNN model excelled at exploiting spatial characteristics, it lacked consideration of spectral correlations. On the other hand, 3D-CNN models, despite having a higher number of network parameters compared to 2D models, effectively utilize spectral information alongside spatial features. To leverage both spatial and spectral information, Roy et al. [28] developed a hybrid spectral CNN (HybridSN) that integrates a spectral–spatial 3D-CNN with a spatial 2D-CNN. Unlike models solely reliant on 3D-CNNs, HybridSN incorporates 2D-CNN layers to extract a more abstract spatial representation, consequently reducing the model’s complexity. Zhu et al. [29] devised the deformable HSI classification network (DHCNet), a CNN-based method tailored for hyperspectral image classification. DHCNet integrates deformable convolutional sampling locations whose size and shape adapt dynamically to the intricate spatial characteristics found in HSIs. This adaptive feature enables enhanced extraction of spatial features, leveraging complex structural information more efficiently. Jia et al. [30] proposed a lightweight convolutional neural network (LWCNN) for HSIC and designed a dual-scale convolution (DSC) module to process joint spatial–spectral features [31]. By merging operations, the number of parameters is greatly reduced, which makes the network efficient and robust on small-sample problems. Gong et al. [32] developed a multiscale convolutional and diversified metric CNN (DPP-DML-MS-CNN). Its diversified deep metrics, built on multiscale features and determinantal point processes (DPPs) [33], enhance the representation and classification abilities for HSI. However, general CNNs for HSIC tend to focus overly on local information, and it is difficult for them to comprehensively capture the trends of spectral band curves. We thus use a hybrid 3D- and 2D-CNN as a feature extractor to refine representative local features and combine it deeply with a transformer, which is excellent at global modeling, to adequately capture the global spectral trend features.
Although deep neural networks have contributed greatly to HSI classification, HSIs have a high data dimension and a large number of spectral channels, which greatly increases the number of parameters of a CNN model, requires more computing resources and makes the model prone to overfitting. Recurrent neural networks (RNNs) [34,35] can utilize each band in the spectrum by applying the recurrent operator layer by layer, thus requiring fewer training parameters than CNNs and making the training and inference phases more efficient. In generative adversarial networks (GANs) [36,37,38], the discriminator continues to be trained effectively through the adversarial competition between the networks, which can alleviate overfitting during training. ResNet [39,40,41] mitigates the vanishing gradient problem and avoids the loss of accuracy as the network deepens. In addition to the above methods, there are many classic deep learning methods such as autoencoders (AEs) [42,43], deep belief networks (DBNs) [44,45], fully convolutional networks (FCNs) [46,47] and capsule networks (CapsNets) [48,49].
Transformers have significant advantages when processing sequential data and can establish global relationships, but they still encounter many challenges, such as limited spatial feature extraction capabilities and high computational costs [50]. A transformer is a neural network architecture based on a self-attention mechanism. It dispenses with the recurrence and convolution of traditional RNNs and CNNs, which allows the model to be trained in parallel and to access global information. The key feature of the transformer model is that it relies on multi-head self-attention (MHSA) to capture dependencies between different elements in the input sequence, regardless of their position or distance in the sequence. With their powerful parallel computing capabilities, good scalability and ability to handle long-distance dependencies in long sequences, transformers have broad application prospects in various fields and are still being continuously improved and extended. Hong et al. [51] used transformers in HSI classification tasks for the first time. Later, Sun et al. [52] developed a spectral–spatial feature tokenization transformer (SSFTT) model to capture spectral–spatial features and high-level semantic features. Tu et al. [53] introduced an architecture named the local semantic feature aggregation transformer (LSFAT), which employs local semantic feature aggregation. This design enhances the capability of transformers to effectively capture long-term dependencies within multiscale features. Qiao et al. [54] recently developed a hierarchical dual-frequency transformer network (DFTN) in which a frequency domain feature extraction (FDFE) block was proposed to capture high-frequency and low-frequency features separately, allowing the network to effectively utilize the input data. The multi-layer feature information in the network improves its ability to model complex relationships. Wang et al. [55] proposed a novel extended spectral spatial attention network (ESSAN) for HSI data classification when training samples are insufficient. In a transformer, all of the information used for global modeling comes from the tokens generated by transforming local patches. In addition, some transformer networks based on masking techniques can effectively improve the classification performance in scenarios with insufficient samples [56,57], which is also a notable improvement. Our proposed spectral–spatial enhancement attention (SSEA) module further enhances the spectral–spatial features by computing attention along three dimensions and also incorporates the LBP information. This operation refines the features from which the tokens are subsequently generated and also eliminates redundant information to a certain extent, providing accurate and representative global modeling information for the MHSA operation in the transformer.
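To make the MHSA operation mentioned above concrete, the following is a minimal NumPy sketch of standard scaled dot-product multi-head self-attention over a token sequence. It is a generic illustration rather than the encoder used in the proposed network; the function names, the bias-free projections and the weight shapes are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """tokens: (n, d); w_q, w_k, w_v, w_o: (d, d) projection matrices (assumed, no biases).

    Returns the (n, d) output and the (heads, n, n) attention weights.
    """
    n, d = tokens.shape
    assert d % num_heads == 0, "embedding size must be divisible by the head count"
    d_head = d // num_heads

    def split_heads(x):
        # (n, d) -> (heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(tokens @ w_q)
    k = split_heads(tokens @ w_k)
    v = split_heads(tokens @ w_v)

    # Every token attends to every other token, regardless of distance.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    weights = softmax(scores, axis=-1)
    context = weights @ v                                  # (heads, n, d_head)

    # Concatenate the heads and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o, weights
```

Because the attention weights form a full n × n matrix per head, the cost grows quadratically with the number of tokens, which is one reason refined and less redundant tokens are desirable before this step.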
In this paper, a novel multiscale efficient attention with enhanced feature transformer (MEA-EFFormer) is presented for HSI classification. It mainly includes a multiscale efficient attention feature extraction module, a spectral–spatial enhancement attention (SSEA) module and a transformer encoder. The feature extraction module exploits the abundant spatial and spectral information in HSI. The SSEA module enhances the interaction of spectral information with spatial features and LBP features from multiple perspectives. The transformer encoder fully integrates the key features through multi-head self-attention to optimize the feature representation. The main contributions of this paper are listed as follows:
- (1) MEA-EFFormer includes a multiscale efficient attention feature extraction module that combines an efficient channel attention mechanism with multiscale convolution. It facilitates the mining of details in spectral–spatial information and alleviates the loss of fine-grained features during single-scale sampling.
- (2) MEA-EFFormer uses an SSEA module. Along three directions, C-H, C-W and H-W, it captures the dependencies among spectral–spatial and LBP information, refines the scale of the features and improves the perception of the attention mechanisms.
- (3) MEA-EFFormer outperforms several classical and SOTA methods. Experiments on three well-known datasets show that the proposed method has excellent classification performance.
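As a rough illustration of the efficient channel attention idea referred to in contribution (1), the sketch below follows the commonly used ECA formulation: global average pooling per channel, a small 1D convolution across neighboring channels, and a sigmoid gate. The function name, the (C, H, W) layout, the zero padding and the externally supplied kernel are assumptions for this example; the exact configuration inside MEA-EFFormer is not specified here.

```python
import numpy as np

def efficient_channel_attention(features, kernel):
    """features: (C, H, W) feature map; kernel: 1D weights of odd length k.

    Assumption: the kernel would normally be learned; here it is passed in.
    """
    kernel = np.asarray(kernel, dtype=np.float64)
    k = kernel.size
    assert k % 2 == 1, "an odd kernel keeps the channel count unchanged"
    # Squeeze: one descriptor per channel via global average pooling.
    pooled = features.mean(axis=(1, 2))                    # (C,)
    # Local cross-channel interaction: 1D correlation with zero padding.
    padded = np.pad(pooled, k // 2)
    mixed = np.correlate(padded, kernel, mode="valid")     # (C,)
    # Gate in (0, 1) for each channel.
    gate = 1.0 / (1.0 + np.exp(-mixed))
    # Rescale every channel by its gate.
    return features * gate[:, None, None]
```

The appeal of this design is that it adds only k weights, so it can be applied after each scale of a multiscale convolution at negligible parameter cost.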
3. Experiment and Analysis
In this section, we employ three well-known HSI datasets—Indian Pines (IP), Salinas (SA) and Pavia University (PU)—to evaluate the effectiveness of the proposed method, and we use three metric indicators to give a quantitative assessment of the classification results.
3.1. Data Description
3.1.1. Indian Pines
This dataset was acquired by the AVIRIS sensor over a test site in northwestern Indiana, USA. It has a spatial resolution of 20 m per pixel and covers an area of 145 × 145 pixels. After removing the water absorption bands, the dataset contains 200 spectral bands for analysis. The ground truth for Indian Pines identifies 16 distinct classes, including various crops, forests and other natural vegetation types. This dataset is frequently used to benchmark HSIC algorithms.
Figure 5 illustrates the false-color image and labeling map of Indian Pines, and the specific division of the training and testing sets of the samples is shown in Table 1.
3.1.2. Salinas
Also captured by the AVIRIS sensor, the Salinas dataset covers the Salinas Valley in California. It offers a higher spatial resolution of 3.7 m per pixel with dimensions of 512 × 217 pixels. Similar to Indian Pines, the Salinas dataset typically uses 204 spectral bands after water absorption band removal. The ground truth comprises 16 classes representing agricultural fields, vineyards, and bare soil. Researchers often use this dataset to explore the challenges of classifying crops with finer spatial details.
Figure 6 illustrates the false-color image and labeling map of Salinas, and the specific division of the training and testing sets of the samples is shown in Table 1.
3.1.3. Pavia University
The ROSIS sensor collected this dataset over an urban area in Pavia, Italy. It boasts a high spatial resolution of 1.3 m per pixel. The dataset contains 103 spectral bands and covers an image size of 610 × 340 pixels. Pavia University offers 9 land-cover classes focused on urban features. This dataset is commonly used to study the classification of urban environments and to address the challenge of working with noisy spectral bands.
Figure 7 illustrates the false-color image and labeling map of Pavia University, and the specific division of the training and testing sets of the samples is shown in Table 1.
3.2. Experimental Setting
3.2.1. Evaluation Criteria
To quantitatively evaluate the experimental results, three metrics were employed: overall accuracy (OA), average accuracy (AA) and the kappa coefficient. First, OA is the ratio of the number of correctly classified samples to the total number of samples. It provides an overall assessment of classification performance. Second, AA is the mean of the per-class accuracies. It evaluates the model’s performance on different categories and helps to determine whether the model performs uniformly well across all categories. Finally, the kappa coefficient measures the agreement between the predictions and the true labels while taking into account the agreement expected by chance. A kappa value close to 1 indicates strong agreement between the predictions and the true labels beyond chance.
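The three metrics can be computed from a confusion matrix, as in the following minimal NumPy sketch. The function name `classification_metrics` and the label encoding (integers from 0 to `num_classes` − 1) are assumptions for illustration, and the sketch assumes that every class appears at least once among the true labels.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Return (OA, AA, kappa) for integer labels in [0, num_classes)."""
    y_true = np.asarray(y_true, dtype=np.int64)
    y_pred = np.asarray(y_pred, dtype=np.int64)
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    total = cm.sum()

    # OA: fraction of all samples that lie on the diagonal.
    oa = np.trace(cm) / total
    # AA: mean of the per-class recalls (assumes no empty true class).
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()
    # Kappa: observed agreement corrected by the agreement expected by chance.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

Because AA weights every class equally, it is more sensitive than OA to errors on classes with few samples, which is relevant for datasets with imbalanced class sizes.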
3.2.2. Environment Configuration
The proposed method was implemented using PyTorch 2.2.0, while the traditional classical methods used for comparison were executed in the MATLAB R2018b environment. The computational setup included an Intel Xeon Silver 4314 CPU (Intel Corporation, Santa Clara, CA, USA) with 256 GB of RAM, along with an NVIDIA GeForce RTX 4090 GPU server (ASUS, Taipei, Taiwan) equipped with 24 GB of memory. For the deep learning and transformer-based methods in the comparison, the number of epochs was set to 100 and the batch size to 128.
3.2.3. Parameter Setting Adjustment
In this subsection, we analyze the impact of several important parameters on the classification results of the proposed network. These parameters are patch size, reduced spectral dimension, learning rate of the network, and the number of attention heads.
Figure 8 illustrates the impact of patch size on the classification metrics OA, AA and kappa. On all three datasets, the accuracy generally increases as the patch size increases. This is likely because larger patches capture more spatial context around the center pixel, which can help the model better distinguish between different classes. However, there is a point of diminishing returns at which increasing the patch size further does not improve accuracy. This is because larger patches may also include irrelevant information that can confuse the model. The specific patch size that yields the best accuracy varies depending on the dataset. For Indian Pines and Pavia University, the highest accuracies are achieved with a patch size of
. For Salinas, the best accuracy is achieved with a patch size of
. Moreover, our proposed network achieves relatively good and stable performance over a wide range of patch sizes, e.g.,
. This demonstrates that our proposed network has certain robustness to the parameter of patch size.
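For reference, the role of the patch size can be illustrated with a short NumPy sketch that cuts a square neighborhood around a labeled pixel. The function name `extract_patch`, the odd patch size and the reflect padding at the image border are assumptions made for this example; they are not details confirmed for the proposed network.

```python
import numpy as np

def extract_patch(cube, row, col, patch_size):
    """cube: (H, W, B) HSI array; returns a (patch_size, patch_size, B) patch
    centered on pixel (row, col)."""
    assert patch_size % 2 == 1, "an odd size keeps the labeled pixel at the center"
    margin = patch_size // 2
    # Pad only the two spatial axes so that border pixels also get full patches.
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)),
                    mode="reflect")
    # After padding, pixel (row, col) sits at (row + margin, col + margin).
    return padded[row:row + patch_size, col:col + patch_size, :]
```

Each patch keeps all spectral bands, so enlarging `patch_size` adds neighboring pixels (spatial context) and increases the input volume quadratically.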
Figure 9 shows the classification results as a function of the reduced spectral dimensions. The reduced spectral dimensions indeed have a strong impact on the performance of the proposed network; however, the proposed network can achieve relatively stable results when the reduced dimensions lie in the range of [15, 35] for all three datasets. Specifically, it achieves the best performance when the reduced dimension is 20 for Indian Pines. For the Salinas and Pavia University datasets, the optimal values of the reduced dimensions are 25 and 30, respectively.
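The spectral reduction step can be sketched with PCA computed through a singular value decomposition. This is a generic NumPy illustration; the function name `pca_reduce` and the absence of whitening or per-band scaling are assumptions, not details of the implementation evaluated here.

```python
import numpy as np

def pca_reduce(cube, num_components):
    """Project an (H, W, B) cube onto its first principal components.

    Returns an (H, W, num_components) array.
    """
    h, w, b = cube.shape
    # Treat every pixel as one B-dimensional spectral sample.
    pixels = cube.reshape(-1, b).astype(np.float64)
    pixels -= pixels.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    reduced = pixels @ vt[:num_components].T
    return reduced.reshape(h, w, num_components)
```

For Indian Pines, for example, a call such as `pca_reduce(cube, 20)` would map the 200 bands to the reduced dimension reported as best for that dataset.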
Figure 10 plots the impact of the learning rate on the classification accuracy. For all three datasets, the learning rate has a strong impact on the performance of the proposed network. On each dataset, the OA, AA and kappa curves first increase and then decrease, indicating that each of the three indicators has an optimal value. In addition, when the learning rate is in the range [
], the quantitative indicator values obtained by the proposed network are relatively stable, indicating that the network has certain robustness to the learning rate parameter. In the subsequent classification, we set the learning rate to
.
Figure 11 plots the classification accuracy as a function of the number of attention heads. The performance of the proposed network is quite stable when the number of attention heads lies in the range of [2, 32]. When the number is eight, the network achieves optimal performance.
3.3. Ablation Study
We performed an ablation experiment with a 5% sample rate on the Indian Pines dataset. The framework was decomposed into six components: principal component analysis (PCA), multiple scales (MS), efficient channel attention (ECA), LBP feature branch (LBP), spectral–spatial enhancement attention module (SSEA) and transformer encoder (TE). Each variant was evaluated using OA, AA and the kappa coefficient. The results of these ablation experiments are reported in Table 2.
In Case 1, we eliminate the PCA component of the network and input all 200 bands of the Indian Pines dataset into the model. The amount of data computed is more than six times that of our proposed method. The large amount of redundant data being fed into the model computation also brings about a slight decrease in the accuracy metric.
In Case 2, we remove the MS component of the network and employ a single-scale CNN to extract the HSI features. In this case, all accuracy metrics decrease significantly, especially AA. This demonstrates the remarkable advantage of the MS component for exploiting the details of the imbalanced category samples.
In Case 3, we eliminate the ECA component of the network and do not compute the attention to the spectral dimension information. At this point, the degradation of AA accuracy is also obvious due to the direct and rough integration of the spectral information at multiple scales. This demonstrates that the ECA component can significantly mitigate the sensitivity of spectral information to scale transformations.
In Case 4, we remove the lower branch of the network and do not use the LBP features for spatial information enhancement. The AA decreases significantly in this case. This indicates that LBP features provide effective spatial feature enhancement for samples with unbalanced distributions and thereby improve the model’s ability to capture the data as a whole.
In Case 5, we remove the SSEA module so that spectral–spatial information and LBP features are merely stacked and fed into the network. Each accuracy metric at this time is slightly decreased. This suggests that the fusion and de-redundancy operations within the SSEA module on the features can extract finer representations, which is conducive to the downstream recognition of the ground surface categories.
In Case 6, the transformer encoder is replaced with a deep residual convolutional network, and all accuracy metrics drop severely. This shows that, with purely local features, it is difficult to model the data effectively because the model lacks global information. It also further indicates that the transformer is well suited to capturing trends within the global spectral curve and long-distance spatial information.
We also conducted ablation experiments on the three branches of the SSEA module to explore the impact of each branch on the final classification results. The results in Table 3 show that removing any branch from the attention computation decreases the accuracy metrics. This indicates that each branch of the proposed SSEA module effectively strengthens the coupling between the spectral–spatial and LBP-HSI features and refines the feature representation.
3.4. Classification Results
In this subsection, we compare the proposed MEA-EFFormer network with state-of-the-art classifiers using quantitative and qualitative measures. These classifiers include random forest (RF) [58], support vector machine (SVM) [11], 1D-CNN [59], 2D-CNN [60], 3D-CNN [61], HybridSN [28], GAHT [62], SpectralFormer [51], SSFTT [52] and GSC-ViT [63].
Table 4, Table 5 and Table 6 provide quantitative results of the compared algorithms on the Indian Pines, Salinas and Pavia University datasets, respectively. The parameters of the comparison methods were set according to the optimal settings reported in their source papers. To ensure the reliability of the experimental results, we ran each experiment ten times and report the means and variances. The tables show a significant gap between the traditional classification methods and the deep learning methods. Among deep learning classifiers, transformer-based methods generally achieved better results because they explore long-distance relationships between features in depth. Our MEA-EFFormer method achieves the highest OA, AA and kappa values across all three datasets.
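The aggregation over repeated runs can be summarized as in the short sketch below. The function name `summarize_runs` is hypothetical, and the use of the sample variance (`ddof=1`) is an assumption, since the text does not state which variance estimator was used.

```python
import numpy as np

def summarize_runs(scores):
    """scores: one metric value (e.g., OA in %) per independent run.

    Returns (mean, variance); the variance uses ddof=1 (an assumption).
    """
    scores = np.asarray(scores, dtype=np.float64)
    assert scores.size >= 2, "at least two runs are needed for a sample variance"
    return scores.mean(), scores.var(ddof=1)
```

Reporting both values makes it possible to judge whether a difference in mean accuracy between two methods is large relative to the run-to-run spread.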
As well-known methods for HSIC in recent years, SSFTT and GSC-ViT effectively integrate spatial–spectral features from HSI and enhance feature discrimination through the integration of transformer networks. Compared with SSFTT on the Indian Pines dataset, MEA-EFFormer demonstrated a 0.47% increase in OA, a 1.26% increase in AA and a 0.53% increase in kappa while reducing bias by 0.11, 1.17, and 0.13, respectively. This comparison highlights the superior performance of MEA-EFFormer for enhancing classification accuracy and stability through the incorporation of multiscale information and an SSEA strategy for exploring discriminative features of land-cover objects.
For the Salinas dataset, GAHT obtained the second-best classification accuracies in terms of OA, AA and kappa. This is a benefit of the group-aware hierarchical transformer strategy in the network, which is good at classifying scenarios with objects that are relatively concentrated and uniform. However, MEA-EFFormer still achieves classification results that are similar to or even better than GAHT. Specifically, it improves OA by more than 0.16%, AA by more than 0.53% and kappa by more than 0.18%. This result further demonstrates the effectiveness of the proposed network for HSIC.
Finally, for the Pavia University dataset, which is known for its high spatial complexity, our method also achieves the best classification results, especially in terms of AA. This shows that MEA-EFFormer can effectively utilize the existing samples for global information even when dealing with scenarios with uneven sample distributions.
In summary, the proposed network outperforms the state-of-the-art transformer classifiers on these three well-known datasets and achieves the best results in terms of OA, AA and kappa.
3.5. Visual Evaluation
To qualitatively compare the performance of different algorithms, we illustrate the classification maps of different methods on the Indian Pines, Salinas and Pavia University datasets in Figure 12, Figure 13 and Figure 14, respectively.
It can be clearly seen from the figures that the traditional methods exhibit numerous noisy points in the classification maps across the three datasets, indicating that their classification accuracies are relatively low. This is primarily attributed to the inherent limitations of traditional methods in terms of feature representation and exploration in high-dimensional data, which results in an inability to capture deep-level feature representations effectively. In contrast, the classification maps generated by deep neural network classifiers are generally smoother compared to those produced by traditional methods, aligning with the quantitative results. Particularly, transformer methods stand out for their superior performance: yielding classification maps that deliver satisfactory results both within categories and at their boundaries.
Notably, on the Indian Pines and Salinas datasets, the classification maps generated by the proposed MEA-EFFormer demonstrate enhanced classification performance, and the boundary pixels are classified relatively precisely. At the same time, the maps are smooth and consistent inside each category. For example, the “orange area” in Figure 13j is more accurate than in the other classification maps.
For the Pavia University dataset, due to its high spatial resolution, the distribution of objects is more dispersed and the boundaries are more complex. The classification accuracy of most classifiers is lower, and there are many noise points in their classification maps. However, the proposed MEA-EFFormer network can still obtain a relatively satisfactory result. For example, the “gray area” in Figure 14l is significantly more accurate than in the other classification maps.
3.6. Model Complexity and Efficiency Analysis
We analyzed the computational efficiency of several common deep-learning-based methods on the Pavia University dataset with a sampling rate of 1%. The results are shown in Table 7. Our proposed method achieves a moderate advantage in terms of training time and parameter size while achieving the leading classification accuracy.
In terms of training time, our method ranks second among the transformer-based methods that we compared. SSFTT is faster because it only performs hybrid convolutional feature extraction on the raw HSI data, whereas MEA-EFFormer additionally uses LBP data for spatial information augmentation, which results in a slight increase in time. The 3D-CNN and HybridSN methods run faster since they only use convolutional networks.
In terms of model parameter size, our method also compares favorably with most of the methods. The smaller parameter sizes of SpectralFormer and SSFTT stem from their simpler design, as they use only a single scale in the feature extraction stage, while our method adopts a multiscale strategy to fully exploit the spectral–spatial information in the HSI data. This is an important reason why MEA-EFFormer achieves higher accuracy. GSC-ViT is designed to be lightweight, but its accuracy is clearly lower than that of the proposed method.
In summary, our proposed MEA-EFFormer keeps the computational cost and parameter size low while maintaining leading accuracy, which again supports the superiority of our method.