Article

Hyperspectral Image Classification Based on Multi-Scale Convolutional Features and Multi-Attention Mechanisms

by Qian Sun 1, Guangrui Zhao 1, Xinyuan Xia 2,*, Yu Xie 2, Chenrong Fang 3, Le Sun 4,5, Zebin Wu 2 and Chengsheng Pan 1

1 School of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
3 College of Intelligence and Computing, Tianjin University, Tianjin 300000, China
4 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
5 Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(12), 2185; https://doi.org/10.3390/rs16122185
Submission received: 8 May 2024 / Revised: 31 May 2024 / Accepted: 11 June 2024 / Published: 16 June 2024
(This article belongs to the Special Issue Advances in Hyperspectral Remote Sensing Image Processing)

Abstract: Convolutional neural network (CNN)-based and Transformer-based methods for hyperspectral image (HSI) classification have rapidly advanced due to their unique characterization capabilities. However, the fixed kernel sizes in convolutional layers limit the comprehensive utilization of multi-scale features in HSI land cover analysis, while the Transformer’s multi-head self-attention (MHSA) mechanism faces challenges in effectively encoding feature information across various dimensions. To tackle these issues, this article introduces an HSI classification method based on multi-scale convolutional features and multi-attention mechanisms (i.e., MSCF-MAM). Firstly, the model employs a multi-scale convolutional module to capture features across different scales in HSIs. Secondly, to enhance the integration of local and global channel features and establish long-range dependencies, a feature enhancement module based on pyramid squeeze attention (PSA) is employed. Lastly, the model leverages a classical Transformer Encoder (TE) and linear layers to encode and classify the transformed spatial–spectral features. The proposed method is evaluated on three publicly available datasets—Salinas Valley (SV), WHU-Hi-HanChuan (HC), and WHU-Hi-HongHu (HH). Extensive experimental results demonstrate that the MSCF-MAM method outperforms several representative methods in terms of classification performance.

1. Introduction

With the rapid development of spectral imaging technology, we can now acquire numerous high-quality hyperspectral images (HSIs) [1,2]. These images offer abundant spectral details through dense narrow-band spectral imaging and find extensive utility in various domains like vegetation surveys [3], precision agriculture [4,5], atmospheric assessments [6], geological explorations [7], food safety [8], medical diagnostics [9], and military surveillance [10]. As the utilization of HSI applications proliferates, the swift evolution of HSI processing and analysis methods including unmixing [11,12], denoising [13,14], pan-sharpening [15], super-resolution [16], target detection [17,18], and classification [19,20] has been observed. Within the realm of these HSI data processing techniques, the significance of HSI land cover classification emerges distinctly. Nonetheless, the task of HSI classification poses considerable challenges owing to the intricate high-dimensional nature of HSI and the nuanced spectral distinctions among features [21].
Traditional HSI classification methods typically depend on manual feature extraction and conventional machine learning (ML) algorithms. Many traditional methods initially reduce the dimensionality of high-dimensional spectral data using methods like principal component analysis (PCA) [22] and linear discriminant analysis [23] to preserve key spectral features. Then, the obtained spectral feature is used to assign pixels to different classes using traditional ML classifiers, such as support vector machines (SVMs) [24], random forests [25], and k-nearest neighbors [26], etc. In addition to using only spectral features, traditional classification methods often incorporate some spatial features to more fully capture these features. Some methods based on mathematical morphology are used to characterize the shape features, such as morphological profile (MP) [27], extended MP (EMP) [28], and extended multi-attribute profile [29]. Some local spatial statistical features based on the neighborhood around a pixel help to capture the local spatial structure of features, such as local binary pattern (LBP) [30] and Gabor wavelet transform [31]. These traditional spatial features are usually used in combination with spectral features to form a fused feature set, which is then classified by applying traditional ML classifiers. Although these traditional methods have achieved some success in certain scenarios, they typically have a limited ability to model the nonlinear relationships and complex spatial structures of spectral information.
The emergence of deep learning (DL) methods has ushered in superior performance and enhanced feature learning flexibility for HSI classification tasks, surpassing traditional machine learning (ML) classification techniques [32]. The initial DL methods in HSI classification tried to use stacked autoencoders [33] and deep belief networks [34]. However, such methods still regard the pixels as 1D vectors as inputs to the network, which cannot effectively retain spatial information. Subsequently, a series of prominent DL backbone models have been introduced to the HSI classification task. Generative adversarial networks (GANs) [35,36], based on generative models, can be used to generate virtual samples with similar spectral features for HSI data enhancement. In the HSI classification task, the availability of labeled sample data is typically limited, and augmenting the dataset using GANs has proven beneficial in enhancing the model’s performance and generalization capabilities. Capsule networks [37], based on dynamic routing mechanisms, can learn feature representations of hierarchical structures through the capsule layer, which can effectively deal with the complex hierarchical structure of HSIs. Recurrent neural networks [38,39], which process the spectrum as a sequence, can effectively model the correlation information between long-range spectral bands. Graph convolutional networks (GCNs) [40,41] have demonstrated the capability to effectively capture spatial relationships among pixels in HSIs by treating pixels as nodes and dynamically constructing graph representations based on their interrelations, which is very helpful for processing spatial structure and texture information in HSIs. Convolutional neural networks (CNNs) are extensively employed in image processing for their capability to capture spatial structural information and have emerged as the predominant approach in HSI classification [42]. Hu et al. [43] first designed a 1D CNN structure to achieve HSI classification; however, this method requires the HSI to be converted into 1D feature sequences, which loses some spatial information. Zhao and Du [44] constructed a 2D CNN structure based on the image spatial dimension, consisting mainly of five 2D convolutional layers, for the spectral–spatial classification of several principal component bands of HSIs. Zhao et al. [45] proposed a 2D CNN with spectral attention based on a compact band weighting module for capturing spectral information in HSIs. To extract correlation information between the spatial and spectral dimensions in HSIs, Hamida et al. [46] proposed a 3D CNN to jointly process spatial and spectral features from multiple dimensions of the data. Apart from single CNN structures, Roy et al. [47] designed a hybrid 3D and 2D CNN (HybridSN) model, which effectively extracts complementary spatial–spectral and spectral features for HSI classification. Song et al. [48] introduced a deep feature fusion network (DFFN) based on residual learning, which effectively models the relationships between multilevel features in deep networks. While CNN-based methods have proved successful in HSI classification, they are not without limitations. Firstly, the feature receptive fields of CNN-based methods are limited to a square region determined by the convolutional kernel size, which makes it difficult for these methods to model global information and long-range dependencies in images. Secondly, when the HSI contains land cover objects with complex shapes and different scales, it is difficult for CNN-based methods to effectively extract multi-scale features.
Recently, a deep learning model named Transformer, utilizing a multi-head self-attention (MHSA) architecture, was originally introduced for natural language processing tasks [49]. Leveraging the self-attention mechanism, the Transformer model excels at capturing long-range dependencies within sequences, enabling focused feature extraction across various scales and positions in the input sequence. It is also due to this advantage that Transformers are widely used in image processing [50]. CNNs usually use a fixed-size convolutional kernel when processing images, while Transformers can establish global correlations over the entire image, which helps one better understand the overall structure of the image [51]. Similarly, many Transformer-based methods are becoming increasingly used in HSI classification tasks. He et al. [52] first proposed a spatial–spectral Transformer model for HSI classification. This model utilizes a CNN-based backbone to extract spatial features and a densely connected Transformer to capture relationships between neighboring spectra. Hong et al. [53] developed the spectralformer, a spectral-based model that extracts spectral representation information from grouped neighboring bands and incorporates a cross-layer Transformer encoder module. Subsequently, Sun et al. [54] introduced the spectral–spatial feature tokenization Transformer (SSFTT) model to capture spectral–spatial features and extract high-level semantic features. Mei et al. [55] proposed a local–global attention-based group-aware hierarchical Transformer (GAHT) for HSI classification. Combining graph convolutional networks and Transformer networks, Yang et al. [56] designed an end-to-end classification method for joint spatial–spectral feature extraction. With the help of the superpixel segmentation algorithm, Feng et al. [57] proposed a center attention Transformer with a stratified spatial–spectral token generated by superpixel sampling to further improve the classification performance. Zou et al. [58] proposed a local-enhanced spectral–spatial Transformer method. By converting the HSI into an adaptive spectral–spatial token representation, this method enhanced the discriminative ability of these tokens and improved the classification performance by capturing both local and global information. Peng et al. [59] designed a cross-attention spatial–spectral Transformer to achieve effective HSI classification by fusing features from both spatial and spectral branches. With the help of spatial and spectral morphological convolutions, Roy et al. [60] proposed a spectral–spatial morphological attention Transformer (MorghAT). Ouyang et al. [61] designed a hybrid Transformer encoder with spatial–spectral serial attention to effectively improve the classification accuracy. Fang et al. [62] fused traditional and deep features by constructing a multi-attention Transformer with multi-level features, which effectively improved the classification performance. These methods make full use of the long sequence modeling capability of the Transformer to tap the correlation of spatial- and spectral-based feature sequences, but seldom take into account the spatial shape scale variations of the image.
As a result, this paper proposes the multi-scale convolutional feature and multi-attention mechanism (MSCF-MAM) classification method, which utilizes CNNs and Transformers to address these issues and enhance existing HSI classification techniques. The network architecture includes a multi-scale convolutional feature extraction module, a feature enhancement module leveraging pyramid squeeze attention (PSA), and a Transformer encoder (TE) module. The multi-scale convolutional feature extraction module aims to capture diverse scale information present in HSI patches by employing three 3D convolutional blocks with varying kernel sizes to effectively represent different object shapes and scale variations. The feature enhancement module is built around the PSA, which comprises multi-scale group convolutional layers and multiple squeeze-and-excitation (SE) attention blocks. This component aims to bolster the multi-scale feature representation and facilitate the establishment of long-range feature dependencies through diverse attention mechanisms. The TE module, built on top of the multi-head self-attention (MHSA) mechanism, further refines the internal feature correlations and generates the cls token for classification. The classification is performed using a linear layer to obtain the ultimate category label. The primary contributions of this paper are detailed below.
  • This method employs multi-scale convolutional kernels to capture features at different scales in HSIs. By using the multi-scale convolutional kernels, the model can adapt and capture detail and structural information on these different scales, helping to retain more information and thus improving the ability to identify complex land cover environments.
  • Multiple attention mechanisms based on the pyramid squeeze attention and multi-head self-attention mechanism are introduced to enhance the modeling capabilities of the perception and utilization of critical information in HSIs. By using multiple attention mechanisms, the redundant or secondary information present in HSIs is filtered to improve the modeling efficiency and comprehensively capture the image features.
  • The systematic combination of the multi-scale CNNs and multiple attention mechanisms can fully and efficiently exploit the spectral and spatial features in HSIs, thereby significantly improving the classification performance. The experiments conducted on three public datasets demonstrate the superior performance of the proposed method.
The structure of this paper is as follows. Section 2 provides a comprehensive description of the MSCF-MAM method. Section 3 presents the three experimental datasets, the design of the experimental parameters, extensive experimental results, and further analyses of the proposed method. Some related conclusions and possible future improvements are summarized in Section 4.

2. Materials and Methods

The overall framework of the proposed MSCF-MAM method for HSI classification is illustrated in Figure 1. It comprises four main modules: HSI data pre-processing, multi-scale convolutional feature extraction, feature enhancement based on pyramid squeeze attention, and the Transformer encoder.

2.1. HSI Data Pre-Processing

We represent the obtained HSI as $\mathbf{X} \in \mathbb{R}^{m \times n \times d}$, where $m$ is the image spatial height, $n$ is the image spatial width, and $d$ is the number of spectral bands. While the HSI captures spectral data from multiple bands and offers detailed land cover information, the complexity of high-dimensional data presents processing challenges. Hence, preprocessing is necessary to retain valuable image information and alleviate the computational load. PCA, a conventional data preprocessing technique, is adept at reducing the dimensionality of high-dimensional HSI data. It extracts the first $b$ principal components of $\mathbf{X}$ while preserving the spatial dimensions. The dimension-reduced HSI is represented as $\mathbf{X}_{\mathrm{pca}} \in \mathbb{R}^{m \times n \times b}$, where $b$ corresponds to the number of spectral dimensions retained after the PCA transformation.
Then, for multi-scale convolutional feature extraction, 3D patches are extracted from $\mathbf{X}_{\mathrm{pca}}$. Each 3D neighbouring patch $\mathbf{P} \in \mathbb{R}^{s \times s \times b}$ is generated from $\mathbf{X}_{\mathrm{pca}}$, where $s \times s$ is the size of the spatial window. The center coordinates in each patch's spatial dimensions are set to $(x_i, x_j)$, with $0 \le i < m$ and $0 \le j < n$. The category of the central pixel determines the true category of each patch. Moreover, when the 3D patches around the edge pixels are extracted, a padding operation of width $(s-1)/2$ is performed on these pixels. Hence, the total number of 3D patches generated from $\mathbf{X}_{\mathrm{pca}}$ is $m \times n$. Following the removal of patches labeled as zero, the remaining samples are partitioned into training and test sets.
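For illustration, a minimal Python sketch of this pre-processing step is given below. It assumes scikit-learn's PCA and simple zero-padding; the function and variable names (preprocess_hsi, hsi, n_components, patch_size) are illustrative rather than the authors' reference implementation, although the defaults follow the settings listed later in Algorithm 1 ($b = 30$, $s = 13$).

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess_hsi(hsi, n_components=30, patch_size=13):
    """Reduce spectral dimensionality with PCA and extract s x s x b patches."""
    m, n, d = hsi.shape
    # PCA on the flattened spectral vectors, keeping the first b principal components
    x_pca = PCA(n_components=n_components).fit_transform(hsi.reshape(-1, d))
    x_pca = x_pca.reshape(m, n, n_components)

    # Zero-pad the spatial borders by (s - 1) / 2 so edge pixels also get full patches
    pad = (patch_size - 1) // 2
    padded = np.pad(x_pca, ((pad, pad), (pad, pad), (0, 0)), mode="constant")

    patches = np.empty((m * n, patch_size, patch_size, n_components), dtype=x_pca.dtype)
    for i in range(m):
        for j in range(n):
            # The patch centred at (i, j); its label is the label of the centre pixel
            patches[i * n + j] = padded[i:i + patch_size, j:j + patch_size, :]
    return patches
```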

2.2. Multi-Scale Convolutional Feature Extraction

After data pre-processing, the multi-scale convolutional features of the 3D patches are extracted using a set of multi-scale 3D convolution blocks. As shown in Figure 1, three 3D convolution blocks, namely, $3Dconv_1$, $3Dconv_2$, and $3Dconv_3$, with kernel sizes $k_1 \times k_1 \times k_0$, $k_2 \times k_2 \times k_0$, and $k_3 \times k_3 \times k_0$, operate on the 3D image patch simultaneously. The primary objective of this design is to undertake multi-scale feature extraction of the image characteristics in the spatial domain. Given a 3D patch $\mathbf{P}$ and a 3D convolution kernel $\mathbf{K}$, the 3D convolution operation can be expressed as
$$ G(x, y, z) = \sum_{\alpha=1}^{k_i} \sum_{\beta=1}^{k_i} \sum_{\gamma=1}^{k_0} P(x+\alpha,\, y+\beta,\, z+\gamma) \cdot K(\alpha, \beta, \gamma) \tag{1} $$
where $G(x, y, z)$ is an element of the output feature map, $P(x+\alpha, y+\beta, z+\gamma)$ denotes the value of the input 3D patch at position $(x+\alpha, y+\beta, z+\gamma)$, and $K(\alpha, \beta, \gamma)$ is the weight of the convolution kernel at position $(\alpha, \beta, \gamma)$. $k_i$ represents the width and height of the convolution kernel, $i \in \{1, 2, 3\}$.
Next, based on the size of each 3D convolution kernel, a feature padding operation is performed after each 3D convolution to ensure that the feature maps obtained from the three convolution operations retain the same size. We then fuse the multi-scale features using a simple concatenation operation. The multi-scale 3D convolutional feature output $G_{\mathrm{concat}}$ can be expressed as
$$ G_{\mathrm{concat}} = \mathrm{Concat}\left( G_1, G_2, G_3;\ \mathrm{axis} = \alpha \right) \tag{2} $$
where $\mathrm{Concat}$ denotes the concatenation operation, and $\mathrm{axis} = \alpha$ specifies the dimension along which the concatenation takes place. We perform the concatenation along the channel dimension of the feature maps and concatenate the feature maps obtained from the different convolutional blocks directly. This operation efficiently integrates features of varying scales while retaining all output information from the convolution blocks, thereby enhancing the model's capability to represent features of diverse dimensions.
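A PyTorch-style sketch of this multi-scale feature extraction is shown below. The spatial kernel sizes (3, 5, 7), the spectral kernel size $k_0$, the output channel count, and the class name MultiScaleConv3D are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class MultiScaleConv3D(nn.Module):
    """Three parallel 3D convolution blocks with different spatial kernel sizes."""

    def __init__(self, in_channels=1, out_channels=8, k0=3):
        super().__init__()

        def block(k):
            # 'same' padding keeps the three output feature maps the same size
            return nn.Sequential(
                nn.Conv3d(in_channels, out_channels,
                          kernel_size=(k, k, k0),
                          padding=(k // 2, k // 2, k0 // 2)),
                nn.BatchNorm3d(out_channels),
                nn.ReLU(inplace=True),
            )

        self.conv1, self.conv2, self.conv3 = block(3), block(5), block(7)

    def forward(self, p):
        # p: (batch, 1, s, s, b) 3D patch
        g1, g2, g3 = self.conv1(p), self.conv2(p), self.conv3(p)
        # Concatenate along the channel axis, as in Equation (2)
        return torch.cat([g1, g2, g3], dim=1)
```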

2.3. Feature Enhancement Based on Pyramid Squeeze Attention

The shallow multi-scale convolutional features obtained from multi-scale 3D convolutional operations may not sufficiently represent rich feature information. Utilizing a feature enhancement module based on PSA enables the effective acquisition of a more comprehensive multi-scale feature representation, dynamically adjusting the attention weights of each channel.
As illustrated in Figure 1, the feature enhancement module mainly comprises the PSA module, which involves three steps: Firstly, leveraging a multi-scale pyramid structure to acquire spatial information of various scales on each channel feature map. Secondly, feeding the multi-scale feature map into the SE module to establish the attention mechanism of the channel within the multi-scale feature map. Lastly, recalibrating the attention weights of the channel within the multi-scale feature map using the softmax algorithm and multiplying them with the feature maps from the initial step to generate multi-scale spatial–spectral feature representations.
Initially, a 2D point-by-point convolution is used to reduce the number of channels of the acquired multi-scale features through channel mapping. This operation effectively reduces parameters, enhances nonlinear expressive capabilities, and preserves spatial information. Subsequently, the multi-scale pyramid structure facilitates the extraction of features at diverse scales. Features are extracted concurrently through multiple branches, each utilizing convolutional kernels of different sizes to capture features with different receptive fields. The input channel dimension for each branch is denoted as $D$. The channel dimension of the output feature map $F_i$ of each branch is $D' = D / S$, $i = 0, 1, \dots, S-1$, where $D$ should be divisible by $S$. However, as the size of the convolution kernel increases, the corresponding number of parameters also increases. Therefore, the group convolution method is used to control the computational cost and ensure the timeliness of feature processing. The group size $g$ adjusts dynamically in response to the multi-scale convolution kernel size. The relationship between them can be expressed as follows:
$$ g = 2^{\frac{k-1}{2}} \tag{3} $$
where $k$ is the kernel size, and $g$ is the group number. Moreover, padding operations are used to guarantee that each branch produces output feature maps with uniform dimensions. The output feature map $F_i$ of each branch is given by the following equation:
$$ F_i = \mathrm{Conv}_{k_i \times k_i,\ g_i}\left( G_{\mathrm{concat}} \right), \quad i = 0, 1, \dots, S-1 \tag{4} $$
where the convolution kernel size is denoted as $k_i \times k_i$, with $k_i \times k_i = (2i+3) \times (2i+3)$.
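For instance, assuming $S = 4$ branches (a common PSA configuration, not stated explicitly here), Equation (4) gives kernel sizes $k_i = 2i + 3 = 3, 5, 7, 9$, and Equation (3) then yields the corresponding group numbers

$$ g_i = 2^{\frac{k_i - 1}{2}} = 2,\ 4,\ 8,\ 16, \quad i = 0, 1, 2, 3. $$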
The SE block is used to extract channel attention weights for multi-scale feature maps. The recalibration of attention weights for feature maps at various scales using the softmax operation facilitates the integration of local and global information. Following this, the feature vectors and attention weights are combined with the feature maps at the corresponding scales to construct a comprehensive feature representation. The specific formula is as follows:
$$ T_{\mathrm{concat}} = \mathrm{Concat}\left( F_i \odot \mathrm{Softmax}\left( \mathrm{SE}(F_i) \right) \right), \quad i = 0, 1, \dots, S-1 \tag{5} $$
where $T_{\mathrm{concat}}$ is the output feature map with multi-scale channel attention weights, and $\odot$ represents the channel-wise multiplication operation.
As shown in Figure 2, an SE block comprises two steps: squeeze and excitation. It is calculated as follows. The input feature map $\mathbf{G} \in \mathbb{R}^{h \times w \times c}$ is defined, where $h$, $w$, and $c$ represent the height, width, and number of input channels, respectively. Subsequently, a global average pooling operation is conducted on the input feature map to compute the average feature value of each channel, resulting in a scalar value per channel. The global average pooling operator is computed as follows:
$$ z = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} x(i, j) \tag{6} $$
The resulting z feature tensor is input into a fully connected layer (usually containing a ReLU activation function) to learn the importance weights for each channel. The output of the fully connected layer is then passed through a Sigmoid function that limits the value of the weights learned on each channel to between 0 and 1, which is used to adjust the feature response of each channel. The output y in the SE block can be written as
$$ y = \sigma\left( w_2\, \delta\left( w_1 z \right) \right) \tag{7} $$
where $\delta$ represents the ReLU activation function, $w_1$ and $w_2$ represent the weight matrices of the fully connected layers, and $\sigma$ represents the Sigmoid activation function. Employing these activation functions enables weights to be assigned to the channel interactions, thereby enhancing the efficiency of information extraction.
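The following PyTorch-style sketch illustrates the SE block described by Equations (6) and (7); the reduction ratio and the class name SEBlock are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block (Figure 2): global pooling, then channel re-weighting."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global average pooling -> z
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # w1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # w2
            nn.Sigmoid(),                                # sigma
        )

    def forward(self, g):
        # g: (batch, c, h, w)
        b, c, _, _ = g.shape
        z = self.squeeze(g).view(b, c)                   # Equation (6)
        y = self.excite(z).view(b, c, 1, 1)              # Equation (7)
        return g * y                                     # channel-wise re-weighting
```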
Finally, the obtained multi-scale features are channel mapped by 2D point-by-point convolution to increase the number of channels and the input feature map is summed with the enhanced features to obtain the final output features.

2.4. Transformer Encoder

The Transformer encoder (TE) module is primarily utilized for capturing profound long-range correlations among image features. It comprises a multi-head self-attention (MHSA) block (Figure 3), a Multilayer Perceptron (MLP) block, and two normalization layers. Residual connections are incorporated before both the MHSA block and the MLP block.
The TE is recognized for its central MHSA block. This block employs the self-attention (SA) mechanism to effectively capture feature correlations. To enable the learning of diverse semantics, three trainable weight matrices, $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$, are pre-defined. These matrices linearly map the feature tokens into three representations of unchanged dimension: the query $\mathbf{Q}$, key $\mathbf{K}$, and value $\mathbf{V}$. The attention scores are calculated from all $\mathbf{Q}$ and $\mathbf{K}$, and their weights are determined through the softmax function. The self-attention mechanism can thus be expressed as follows:
$$ \mathrm{SA} = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left( \frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_K}} \right)\mathbf{V} \tag{8} $$
where softmax is the activation function used to calculate the weights, and $d_K$ is the dimension of the key.
The MHSA block utilizes various sets of weight matrices to map the query Q , key K , and value V , calculating the multi-head attention values using the same operational procedure. Following this, the outputs from each attention head are concatenated. This procedure is represented by the equation below:
$$ \mathrm{MHSA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Concat}\left( \mathrm{SA}_1, \mathrm{SA}_2, \dots, \mathrm{SA}_h \right)\mathbf{W} \tag{9} $$
where $h$ is the number of heads and $\mathbf{W} \in \mathbb{R}^{h \times d_K \times d_w}$ is a linear transformation matrix, with $d_w$ equal to the number of feature channels.
Subsequently, the weight matrix generated by the MHSA is passed through the MLP block, which comprises two fully connected layers. An interposed Gaussian Error Linear Unit (GELU) nonlinear activation function exists between these layers. Following the MLP block, a normalization layer is applied, aiding in mitigating the gradient explosion, addressing gradient vanishing, and accelerating the training process.
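A PyTorch-style sketch of one such Transformer encoder layer is given below; the embedding dimension, head count, and MLP expansion ratio are illustrative assumptions, and nn.MultiheadAttention is used as a stand-in for the MHSA block of Equations (8) and (9).

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One TE layer: MHSA and MLP blocks, each preceded by normalization and a residual path."""

    def __init__(self, dim=64, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),                       # GELU between the two fully connected layers
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, tokens):
        # tokens: (batch, num_tokens, dim), with the cls token prepended
        x = self.norm1(tokens)
        attn_out, _ = self.mhsa(x, x, x)     # multi-head self-attention, Equations (8) and (9)
        tokens = tokens + attn_out           # residual connection around MHSA
        tokens = tokens + self.mlp(self.norm2(tokens))  # residual connection around MLP
        return tokens
```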
After undergoing the aforementioned steps, the final feature tokens are derived, with the initially embedded learnable classification token (cls) serving as the classification vector. This vector is then input into a linear classifier based on the softmax function to acquire the ultimate category information. The comprehensive procedure of the proposed method is outlined in Algorithm 1.
Algorithm 1 The proposed MSCF-MAM method.
Input: HSI data $\mathbf{X} \in \mathbb{R}^{m \times n \times d}$, ground truth $\mathbf{Y} \in \mathbb{R}^{m \times n}$, batch size = 64, number of PCA bands $b = 30$, patch size $s = 13$, epoch number $e = 100$, learning rate $= 1 \times 10^{-3}$, and training sample rate $\mu = 0.1\%$.
Output: Predicted labels of the test dataset.
  1: Perform the PCA transformation to obtain $\mathbf{X}_{\mathrm{pca}}$.
  2: Create sample patches from $\mathbf{X}_{\mathrm{pca}}$, then partition them into training and testing sets.
  3: for  i = 1 to e do
  4:     Perform multi-scale 3D convolution block to obtain multi-scale convolutional features.
  5:     Flatten multi-scale 3D convolutional features into 2D feature maps.
  6:     Perform feature enhancement using PSA through Equations (4) and (5).
  7:     Concatenate the learnable tokens to create feature tokens and embed position information into the tokens.
  8:     Perform Transformer encoder module using Equations (8) and (9).
  9:     Input the first cls token into the linear layer.
10:     Use the softmax function to identify the labels.
11: end for
12: Utilize the test dataset along with the trained model to obtain predicted labels.

3. Experiments and Results

Three publicly available HSI datasets are employed in the experiments to evaluate the efficacy of the proposed MSCF-MAM method. Firstly, the specifics of the three datasets employed in the experiments are outlined. Subsequently, a detailed description of the experimental configuration parameters is provided. The quantitative and visual results of the classification experiments are then exhibited to demonstrate the effectiveness and advantages of the proposed method. Finally, parameter and ablation analysis experiments are carried out to assess the influence of various parameters and modules on the model performance, accompanied by an additional comparative time analysis to showcase the efficiency of the proposed method.

3.1. Dataset Description

In our experiments, three publicly available HSI datasets from different imaging platforms are used to validate the designed method’s effectiveness. Detailed descriptions of the three datasets are given below.
(1) The Salinas Valley (SV) dataset was collected over Salinas Valley, California, using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The dataset comprises 224 spectral bands spanning from 0.4 to 2.5 μm, with a spatial resolution of 3.7 m. For the experiments, we discarded 20 water-absorption bands and used the remaining 204 bands. The area covered consists of 512 × 217 pixels and comprises 16 categories of land cover. Figure 4 shows the pseudo-color image, the ground truth map, and the land cover categories’ names and corresponding colors.
(2) The WHU-Hi-HanChuan (HC) dataset was acquired in Hanchuan, Hubei province, China, with a 17 mm focal length Headwall Nano-Hyperspec imaging sensor equipped on a Leica Aibot X6 UAV V1 platform [63]. The dataset comprises 274 spectral bands spanning from 0.4 to 1 μm. The image spatial dimensions are 1217 × 303 pixels, having a spatial resolution of 0.109 m and encompassing a total of 16 land cover categories. Figure 5 provides the pseudo-color image and ground truth map alongside the names of land cover categories and their corresponding colors.
(3) The WHU-Hi-HongHu (HH) dataset was obtained in Honghu City, Hubei Province, China. The dataset was captured using a DJI Matrice 600 Pro UAV platform equipped with a 17 mm focal length Headwall Nano-Hyperspec imaging sensor [64]. It contains 270 bands ranging from 0.4 to 1 μm, and the spatial resolution of the UAV-borne HSI dataset is 0.043 m. The size of this dataset is 940 × 475 pixels, and 22 land cover categories are covered. Figure 6 displays the pseudo-color image, ground truth map, and the names of land cover categories along with their respective colors.
Table 1 lists the land cover categories’ names along with the corresponding counts of training and test samples for the three datasets. In each dataset, 0.1% of the total samples were randomly allocated to the training set, while the remainder formed the test set.

3.2. Experimental Setting

(1) Evaluation Indicators: Four representative quantitative evaluation indicators, including overall accuracy (OA), average accuracy (AA), kappa coefficient ( κ ), and accuracy per class, were chosen to evaluate and compare the classification performance of the proposed method with other comparative methods. Higher values for each indicator denote a more precise classification.
(2) Configuration: The evaluation experiments were performed on the PyTorch platform using a server configured with an Intel Xeon Silver 4314 2.40-GHz CPU, 192 GB RAM, and an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM). The Adam optimizer was used to optimize the network. For the training phase, we set the batch size to 64 and the number of training epochs to 100.
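The optimizer and training configuration reported above can be sketched as follows; the model here is a stand-in linear classifier rather than the MSCF-MAM network, and the feature dimension, class count, and dummy batch are placeholders.

```python
import torch
import torch.nn as nn

# Adam optimizer, learning rate 1e-3, batch size 64, 100 epochs (as in Algorithm 1).
num_classes, feat_dim = 16, 64             # placeholder sizes
model = nn.Linear(feat_dim, num_classes)   # stand-in for the MSCF-MAM network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

features = torch.randn(64, feat_dim)               # dummy batch of 64 samples
labels = torch.randint(0, num_classes, (64,))

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```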

3.3. Quantitative and Visual Classification Results

To evaluate the classification performance of the MSCF-MAM method, we compared it with several representative methods. These included the traditional SVM [24] method, CNN-based methods, namely, 1D CNN [43], 3D CNN [46], DFFN [48], and HybridSN [47], and the Transformer-based methods MorghAT [60], GAHT [55], and SSFTT [54]. To be fair, we conducted experiments on all methods using the same settings on all three datasets. We calculated the average values for the 10 sets of experiments and highlighted the optimal results in bold.
(1) Quantitative Analysis: The detailed classification results for the three HSI datasets are presented in Table 2, Table 3 and Table 4. Overall, the performance of the traditional SVM classifiers is considered average. In addition, the classification accuracy achieved by the earlier CNN-based method is notably higher than that of the traditional method, underscoring the superiority of deep learning approaches. Furthermore, the recently introduced Transformer-based method has shown even more promising results, surpassing the performance of the CNN-based method. For instance, in the SV dataset, despite using only 0.1% of the sample size with some categories having only one sample, our method achieved over 99% classification accuracy in certain categories. This success underscores the effectiveness and robustness of our model, particularly in scenarios with limited samples. In the HC and HH datasets, our method also demonstrates commendable classification accuracy, despite challenges such as sample imbalance and lower accuracy for sparsely distributed categories. In comparison to other existing methods, our results stand out, showcasing the efficacy of our method even in challenging classification scenarios.
(2) Visual Evaluation: The classification result maps of the method presented in this paper and several comparison methods on the three datasets are illustrated in Figure 7, Figure 8 and Figure 9. The comparison of the spatial and edge clarity in these classification result maps reveals that the method’s classifications align closely with the ground truth maps. Nonetheless, the utilization of only 0.1% of the training samples results in some level of noise in the classification result maps obtained. Traditional classifiers typically exhibit a lower effectiveness than deep learning classifiers, evident from their higher misclassification rates and prevalent noise in classification maps. For conventional SVM classifiers and several CNN-based methods, the classification outcomes are comparatively inferior, displaying extensive noise distribution areas and an inadequate representation of feature class boundaries. In contrast, Transformer-based techniques like GAHT and SSFTT produce cleaner classification result maps that distinctly delineate the boundaries of various feature classes by considering the global correlation information of features. The proposed method in this paper demonstrates minimal misclassification among diverse classification results, maintaining superior edge information and affirming its optimal classification performance at a 0.1% sampling rate.

3.4. Parameter Analysis

We have analyzed several crucial hyper-parameters that significantly influence the classification accuracy, including the patch size (s) and the number of PCA bands (b).
(1) Patch Size: We select the patch size (s) from the candidate set { 9 , 11 , 13 , 15 , 17 } for evaluation. The impact of various patch sizes (s) on OA, AA, and κ is depicted in Figure 10. The optimal classification accuracy is observed when the patch size is set to either 11 or 13 in the three datasets.
(2) PCA Band: To assess the impact of PCA band numbers (b), we choose the parameter b from the set of candidates { 5 , 10 , 15 , 20 , 25 , 30 }. Figure 11 illustrates how the OA, AA, and κ of the proposed method change with varying PCA band numbers across the three datasets. The three indicators show an initial increase followed by stabilization as b increases. Taking into account the computational cost and classification performance, we set b to 30.

3.5. Ablation Analysis

To thoroughly validate the proposed method, we performed ablation experiments on the SV dataset to analyze the influence of the various modules on the network's overall performance. Four module combinations were assessed using the OA, AA, and κ values, as presented in Table 5. The model comprises three modules: the multi-scale feature extraction module with three 3D convolutional blocks of varying scales, the feature enhancement module utilizing the PSA structure, and the TE module. The results in Table 5 show that employing a single 3D convolutional block for spatial–spectral feature extraction reduced the OA and AA by 2.70% and 4.57%, respectively, compared with the full model. In contrast, utilizing two 3D convolutional blocks of different scales in parallel led to a 1.30% improvement in the OA and a 2.30% increase in the AA. This underscores the benefit of extracting multi-scale convolutional features for enhancing the model's classification performance. The third combined experiment evaluated the impact of the PSA-based feature enhancement module on classification performance. Removing the PSA structure caused a 3.66% decrease in the OA and a substantial 5.49% decline in the AA, highlighting the crucial role of the feature enhancement module in the model. Furthermore, eliminating the TE module in the final experiment resulted in a 1.21% decrease in the OA, confirming the positive impact of the TE structure on performance. Overall, a detailed analysis of these ablation results further confirms the efficacy of the proposed method.

3.6. Comparison of Computational Efficiency

Table 6 presents a comparative analysis of the computational efficiency of the 1D CNN [43], 3D CNN [46], DFFN [48], HybridSN [47], MorghAT [60], GAHT [55], SSFTT [54], and the proposed MSCF-MAM method. It is evident that, with the exceptions of the 1D CNN and SSFTT, the proposed method is faster than the other methods. Notably, our method incurs significantly lower time costs than DFFN, MorghAT, and GAHT. These findings underscore the effectiveness of our method in reducing the computational time and enhancing the classification efficiency. The FLOP values and parameter counts of the compared methods can also be analyzed from Table 6. Because the feature enhancement module introduces multi-group convolution operations, the parameter count of the proposed method is roughly doubled relative to the SSFTT, while its computational complexity compares favorably with that of the SSFTT; nevertheless, it still maintains better classification performance.

4. Conclusions

In this paper, an HSI classification method using multi-scale convolutional features and a multi-attention mechanism (MSCF-MAM) is proposed. The method aims to comprehensively capture multi-scale spatial–spectral features by fusing multi-scale information and introducing multiple attention mechanisms, allowing HSI data to be processed more efficiently and the classification accuracy to be significantly improved when data samples are limited. Multiple CNNs operating at diverse scales are leveraged for parallel feature extraction from the HSI data. Additionally, a feature enhancement module based on the PSA structure is incorporated to significantly enrich the feature representation. Feeding the enhanced image features into the Transformer encoder contributes to a deeper understanding of the HSI data, thereby improving the overall performance of the model and, in turn, the accuracy and generalization of the classification. The remarkable classification performance of the proposed method is confirmed by extensive experiments on three HSI datasets. Prospective research directions include refining lightweight Transformer network architectures and incorporating graph convolutional networks to jointly extract spatial and spectral features, aiming to further exploit the abundant feature details in HSIs and improve classification efficiency and precision.

Author Contributions

Conceptualization, Q.S., G.Z. and L.S.; methodology, L.S. and G.Z.; software, G.Z., Y.X. and X.X.; validation, X.X. and G.Z.; investigation, X.X., C.F. and G.Z.; visualization, L.S. and G.Z.; writing—original draft, Q.S. and G.Z.; writing—review and editing, Q.S. and L.S.; supervision, Z.W. and C.P.; funding acquisition, Q.S., Z.W. and C.P. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China under Grant 62076137, and Grant 61931004; in part by the Startup Foundation for Introducing Talent of NUIST, Grant number 2022r075; in part by the Natural Science Foundation of Jiangsu Province under Grant BK 20211539; and in part by the Jiangsu Innovation & Entrepreneurship Group Talents Plan.

Data Availability Statement

The datasets presented in this paper can be obtained through https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes (accessed on 4 March 2024) and http://rsidea.whu.edu.cn/resource_WHUHi_sharing.htm (accessed on 4 March 2024).

Acknowledgments

The authors would like to thank the National Natural Science Foundation of China and the Natural Science Foundation of Jiangsu Province for supporting our work and thank the anonymous reviewers and the editors for their insightful comments and helpful suggestions for improving the quality of our manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HSI    Hyperspectral image
CNN    Convolutional neural network
MHSA   Multi-head self-attention
PSA    Pyramid squeeze attention
ML     Machine learning
PCA    Principal component analysis
SVM    Support vector machine
MP     Morphological profile
DL     Deep learning
TE     Transformer encoder
GCN    Graph convolutional network
GAN    Generative adversarial network
DFFN   Deep feature fusion network
SE     Squeeze and excitation
κ      Kappa coefficient
OA     Overall accuracy
AA     Average accuracy

References

  1. Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral Remote Sensing Data Analysis and Future Challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36. [Google Scholar] [CrossRef]
  2. Wu, Z.; Sun, J.; Zhang, Y.; Wei, Z.; Chanussot, J. Recent Developments in Parallel and Distributed Computing for Remotely Sensed Big Data Processing. Proc. IEEE 2021, 109, 1282–1305. [Google Scholar] [CrossRef]
  3. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep Learning in Environmental Remote Sensing: Achievements and Challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]
  4. Gevaert, C.M.; Suomalainen, J.; Tang, J.; Kooistra, L. Generation of Spectral–Temporal Response Surfaces by Combining Multispectral Satellite and Hyperspectral UAV Imagery for Precision Agriculture Applications. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 3140–3146. [Google Scholar] [CrossRef]
  5. Khanal, S.; Kc, K.; Fulton, J.P.; Shearer, S.; Ozkan, E. Remote Sensing in Agriculture—Accomplishments, Limitations, and Opportunities. Remote Sens. 2020, 12, 3783. [Google Scholar] [CrossRef]
  6. Soppa, M.A.; Silva, B.; Steinmetz, F.; Keith, D.; Scheffler, D.; Bohn, N.; Bracher, A. Assessment of Polymer Atmospheric Correction Algorithm for Hyperspectral Remote Sensing Imagery over Coastal Waters. Sensors 2021, 21, 4125. [Google Scholar] [CrossRef]
  7. Shirmard, H.; Farahbakhsh, E.; Müller, R.D.; Chandra, R. A Review of Machine Learning in Processing Remote Sensing Data for Mineral Exploration. Remote Sens. Environ. 2022, 268, 112750. [Google Scholar] [CrossRef]
  8. Virtriana, R.; Riqqi, A.; Anggraini, T.S.; Fauzan, K.N.; Ihsan, K.T.N.; Mustika, F.C.; Suwardhi, D.; Harto, A.B.; Sakti, A.D.; Deliar, A.; et al. Development of Spatial Model for Food Security Prediction Using Remote Sensing Data in West Java, Indonesia. ISPRS Int. J. Geo-Inf. 2022, 11, 284. [Google Scholar] [CrossRef]
  9. Dremin, V.; Marcinkevics, Z.; Zherebtsov, E.; Popov, A.; Grabovskis, A.; Kronberga, H.; Geldnere, K.; Doronin, A.; Meglinski, I.; Bykov, A. Skin Complications of Diabetes Mellitus Revealed by Polarized Hyperspectral Imaging and Machine Learning. IEEE Trans. Med. Imaging 2021, 40, 1207–1216. [Google Scholar] [CrossRef]
  10. Shimoni, M.; Haelterman, R.; Perneel, C. Hypersectral Imaging for Military and Security Applications: Combining Myriad Processing and Sensing Techniques. IEEE Geosci. Remote Sens. Mag. 2019, 7, 101–117. [Google Scholar] [CrossRef]
  11. Sun, L.; Chen, Y.; Li, B. SISLU-Net: Spatial Information-Assisted Spectral Information Learning Unmixing Network for Hyperspectral Images. Remote Sens. 2023, 15, 817. [Google Scholar] [CrossRef]
  12. Sun, L.; Wu, F.; He, C.; Zhan, T.; Liu, W.; Zhang, D. Weighted Collaborative Sparse and L1/2 Low-Rank Regularizations with Superpixel Segmentation for Hyperspectral Unmixing. IEEE Geosci. Remote Sens. Lett. 2020, 19, 5500405. [Google Scholar] [CrossRef]
  13. Sun, L.; Cao, Q.; Chen, Y.; Zheng, Y.; Wu, Z. Mixed Noise Removal for Hyperspectral Images Based on Global Tensor Low-Rankness and Nonlocal SVD-Aided Group Sparsity. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5506617. [Google Scholar] [CrossRef]
  14. Sun, L.; He, C.; Zheng, Y.; Wu, Z.; Jeon, B. Tensor Cascaded-Rank Minimization in Subspace: A Unified Regime for Hyperspectral Image Low-Level Vision. IEEE Trans. Image Process. 2022, 32, 100–115. [Google Scholar] [CrossRef] [PubMed]
  15. Diao, W.; Zhang, F.; Sun, J.; Xing, Y.; Zhang, K.; Bruzzone, L. ZeRGAN: Zero-Reference GAN for Fusion of Multispectral and Panchromatic Images. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8195–8209. [Google Scholar] [CrossRef] [PubMed]
  16. Sun, L.; Cheng, Q.; Chen, Z. Hyperspectral Image Super-Resolution Method Based on Spectral Smoothing Prior and Tensor Tubal Row-Sparse Representation. Remote Sens. 2022, 14, 2142. [Google Scholar] [CrossRef]
  17. Nasrabadi, N.M. Hyperspectral Target Detection: An Overview of Current and Future Challenges. IEEE Signal Process. Mag. 2013, 31, 34–44. [Google Scholar] [CrossRef]
  18. Sun, L.; Wang, Q.; Chen, Y.; Zheng, Y.; Wu, Z.; Fu, L.; Jeon, B. CRNet: Channel-enhanced Remodeling-based Network for Salient Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5618314. [Google Scholar] [CrossRef]
  19. Zhao, G.; Ye, Q.; Sun, L.; Wu, Z.; Pan, C.; Jeon, B. Joint Classification of Hyperspectral and Lidar Data Using a Hierarchical CNN and Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 61, 5500716. [Google Scholar] [CrossRef]
  20. Sun, L.; Fang, Y.; Chen, Y.; Huang, W.; Wu, Z.; Jeon, B. Multi-Structure KELM with Attention Fusion Strategy for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5539217. [Google Scholar] [CrossRef]
  21. Liu, S.; Marinelli, D.; Bruzzone, L.; Bovolo, F. A Review of Change detection in Multitemporal Hyperspectral Images: Current Techniques, Applications, and Challenges. IEEE Geosci. Remote Sens. Mag. 2019, 7, 140–158. [Google Scholar] [CrossRef]
  22. Licciardi, G.; Marpu, P.R.; Chanussot, J.; Benediktsson, J.A. Linear Versus Nonlinear PCA for the Classification of Hyperspectral Data based on the Extended Morphological Profiles. IEEE Geosci. Remote Sens. Lett. 2011, 9, 447–451. [Google Scholar] [CrossRef]
  23. Bandos, T.V.; Bruzzone, L.; Camps-Valls, G. Classification of Hyperspectral Images with Regularized Linear Discriminant Analysis. IEEE Trans. Geosci. Remote Sens. 2009, 47, 862–873. [Google Scholar] [CrossRef]
  24. Melgani, F.; Bruzzone, L. Classification of Hyperspectral Remote Sensing Images with Support Vector Machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [Google Scholar] [CrossRef]
  25. Ham, J.; Chen, Y.; Crawford, M.M.; Ghosh, J. Investigation of the Random Forest Framework for Classification of Hyperspectral Data. IEEE Trans. Geosci. Remote Sens. 2005, 43, 492–501. [Google Scholar] [CrossRef]
  26. Ma, L.; Crawford, M.M.; Tian, J. Local Manifold Learning-based K-Nearest-Neighbor for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2010, 48, 4099–4109. [Google Scholar] [CrossRef]
  27. Fauvel, M.; Benediktsson, J.A.; Chanussot, J.; Sveinsson, J.R. Spectral and Spatial Classification of Hyperspectral Data Using SVMs and Morphological Profiles. IEEE Trans. Geosci. Remote Sens. 2008, 46, 3804–3814. [Google Scholar] [CrossRef]
  28. Benediktsson, J.A.; Palmason, J.A.; Sveinsson, J.R. Classification of Hyperspectral Data from Urban Areas Based on Extended Morphological Profiles. IEEE Trans. Geosci. Remote Sens. 2005, 43, 480–491. [Google Scholar] [CrossRef]
  29. Dalla Mura, M.; Villa, A.; Benediktsson, J.A.; Chanussot, J.; Bruzzone, L. Classification of Hyperspectral Images by Using Extended Morphological Attribute Profiles and Independent Component Analysis. IEEE Geosci. Remote Sens. Lett. 2010, 8, 542–546. [Google Scholar] [CrossRef]
  30. Li, W.; Chen, C.; Su, H.; Du, Q. Local Binary Patterns and Extreme Learning Machine for Hyperspectral Imagery Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3681–3693. [Google Scholar] [CrossRef]
  31. Jia, S.; Shen, L.; Li, Q. Gabor Feature-based Collaborative Representation for Hyperspectral Imagery Classification. IEEE Trans. Geosci. Remote Sens. 2014, 53, 1118–1129. [Google Scholar] [CrossRef]
  32. Fang, S.; Li, X.; Tian, S.; Chen, W.; Zhang, E. Multi-Level Feature Extraction Networks for Hyperspectral Image Classification. Remote Sens. 2024, 16, 590. [Google Scholar] [CrossRef]
  33. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep Learning-based Classification of Hyperspectral Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [Google Scholar] [CrossRef]
  34. Chen, Y.; Zhao, X.; Jia, X. Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392. [Google Scholar] [CrossRef]
  35. Zhu, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Generative Adversarial Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5046–5063. [Google Scholar] [CrossRef]
  36. Zhan, Y.; Hu, D.; Wang, Y.; Yu, X. Semisupervised Hyperspectral Image Classification Based on Generative Adversarial Networks. IEEE Geosci. Remote Sens. Lett. 2017, 15, 212–216. [Google Scholar] [CrossRef]
  37. Paoletti, M.E.; Haut, J.M.; Fernandez-Beltran, R.; Plaza, J.; Plaza, A.; Li, J.; Pla, F. Capsule Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 57, 2145–2160. [Google Scholar] [CrossRef]
  38. Hang, R.; Liu, Q.; Hong, D.; Ghamisi, P. Cascaded Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5384–5394. [Google Scholar] [CrossRef]
  39. Hao, S.; Wang, W.; Salzmann, M. Geometry-Aware Deep Recurrent Neural Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2448–2460. [Google Scholar] [CrossRef]
  40. Wan, S.; Gong, C.; Zhong, P.; Du, B.; Zhang, L.; Yang, J. Multiscale Dynamic Graph Convolutional Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3162–3177. [Google Scholar] [CrossRef]
  41. Hong, D.; Gao, L.; Yao, J.; Zhang, B.; Plaza, A.; Chanussot, J. Graph Convolutional Networks for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5966–5978. [Google Scholar] [CrossRef]
  42. Wang, Z.; Cao, B.; Liu, J. Hyperspectral Image Classification via Spatial Shuffle-Based Convolutional Neural Network. Remote Sens. 2023, 15, 3960. [Google Scholar] [CrossRef]
  43. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef]
  44. Zhao, W.; Du, S. Spectral–spatial Feature Extraction for Hyperspectral Image Classification: A Dimension Reduction and Deep Learning Approach. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4544–4554. [Google Scholar] [CrossRef]
  45. Zhao, L.; Yi, J.; Li, X.; Hu, W.; Wu, J.; Zhang, G. Compact Band Weighting Module Based on Attention-Driven for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 9540–9552. [Google Scholar] [CrossRef]
  46. Hamida, A.B.; Benoit, A.; Lambert, P.; Amar, C.B. 3-D Deep Learning Approach for Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4420–4434. [Google Scholar] [CrossRef]
  47. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN Feature Hierarchy for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 277–281. [Google Scholar] [CrossRef]
  48. Song, W.; Li, S.; Fang, L.; Lu, T. Hyperspectral Image Classification with Deep Feature Fusion Network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 3173–3184. [Google Scholar] [CrossRef]
  49. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 7 May 2024).
  50. Sun, L.; Zhang, H.; Zheng, Y.; Wu, Z.; Ye, Z.; Zhao, H. MASSFormer: Memory-Augmented Spectral-Spatial Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5516415. [Google Scholar] [CrossRef]
  51. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  52. He, X.; Chen, Y.; Lin, Z. Spatial-Spectral Transformer for Hyperspectral Image Classification. Remote Sens. 2021, 13, 498. [Google Scholar] [CrossRef]
  53. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5518615. [Google Scholar] [CrossRef]
  54. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5522214. [Google Scholar] [CrossRef]
  55. Mei, S.; Song, C.; Ma, M.; Xu, F. Hyperspectral Image Classification Using Group-Aware Hierarchical Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5539014. [Google Scholar] [CrossRef]
  56. Yang, A.; Li, M.; Ding, Y.; Hong, D.; Lv, Y.; He, Y. GTFN: GCN and transformer fusion with spatial-spectral features for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 6600115. [Google Scholar] [CrossRef]
  57. Feng, J.; Wang, Q.; Zhang, G.; Jia, X.; Yin, J. CAT: Center Attention Transformer with Stratified Spatial-Spectral Token for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5615415. [Google Scholar] [CrossRef]
  58. Zou, J.; He, W.; Zhang, H. Lessformer: Local-enhanced spectral-spatial transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5535416. [Google Scholar] [CrossRef]
  59. Peng, Y.; Zhang, Y.; Tu, B.; Li, Q.; Li, W. Spatial–spectral transformer with cross-attention for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5537415. [Google Scholar] [CrossRef]
  60. Roy, S.K.; Deria, A.; Shah, C.; Haut, J.M.; Du, Q.; Plaza, A. Spectral–Spatial Morphological Attention Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5503615. [Google Scholar] [CrossRef]
  61. Ouyang, E.; Li, B.; Hu, W.; Zhang, G.; Zhao, L.; Wu, J. When Multigranularity Meets Spatial–Spectral Attention: A Hybrid Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5524916. [Google Scholar] [CrossRef]
  62. Fang, Y.; Ye, Q.; Sun, L.; Zheng, Y.; Wu, Z. Multi-Attention Joint Convolution Feature Representation with Lightweight Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5513814. [Google Scholar] [CrossRef]
  63. Zhong, Y.; Hu, X.; Luo, C.; Wang, X.; Zhao, J.; Zhang, L. WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets and classifier for precise crop identification based on deep convolutional neural network with CRF. Remote Sens. Environ. 2020, 250, 112012. [Google Scholar] [CrossRef]
  64. Zhong, Y.; Wang, X.; Xu, Y.; Wang, S.; Jia, T.; Hu, X.; Zhao, J.; Wei, L.; Zhang, L. Mini-UAV-borne hyperspectral remote sensing: From observation and processing to applications. IEEE Geosci. Remote Sens. Mag. 2018, 6, 46–62. [Google Scholar] [CrossRef]
Figure 1. The overall framework of the proposed MSCF-MAM method.
Figure 2. Illustration of the SE block.
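Figure 2 depicts the squeeze-and-excitation (SE) operation used for channel attention. For orientation only, the following is a minimal PyTorch sketch of a generic SE block; it is not the paper's exact implementation, and the channel count and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic squeeze-and-excitation: global pooling -> bottleneck MLP -> sigmoid channel gating."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial average per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),                    # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # re-weight channels of the input feature map
```

The block learns one scalar weight per channel from globally pooled statistics and rescales the input feature map accordingly.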
Figure 3. Illustration of the MHSA block.
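Figure 3 illustrates the multi-head self-attention (MHSA) operation used in the Transformer Encoder [49]. The sketch below is a generic scaled dot-product MHSA in PyTorch, offered only as a reference; the embedding dimension and number of heads are illustrative, and PyTorch's built-in nn.MultiheadAttention provides an equivalent module.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Generic multi-head self-attention over a token sequence of shape (batch, tokens, dim)."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)   # joint query/key/value projection
        self.proj = nn.Linear(dim, dim)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, tokens, dk)
        q, k, v = (t.view(b, n, self.heads, self.dk).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```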
Figure 4. SV dataset. (a) Pseudo-color image. (b) Ground truth map.
Figure 5. HC dataset. (a) Pseudo-color image. (b) Ground truth map.
Figure 6. HH dataset. (a) Pseudo-color image. (b) Ground truth map.
Figure 7. Classification maps of different methods for the SV dataset with 0.1% training samples per class. (a) Ground truth map. (b) SVM [24]. (c) 1-D CNN [43]. (d) 3-D CNN [46]. (e) DFFN [48]. (f) HybridSN [47]. (g) MorghAT [60]. (h) GAHT [55]. (i) SSFTT [54]. (j) MSCF-MAM.
Figure 8. Classification maps obtained by different methods with 0.1% training samples per class for the HC dataset. (a) Ground truth map. (b) SVM [24]. (c) 1-D CNN [43]. (d) 3-D CNN [46]. (e) DFFN [48]. (f) HybridSN [47]. (g) MorghAT [60]. (h) GAHT [55]. (i) SSFTT [54]. (j) MSCF-MAM.
Figure 9. Classification maps of different methods for the HH dataset with 0.1% training samples per class. (a) Ground truth map. (b) SVM [24]. (c) 1-D CNN [43]. (d) 3-D CNN [46]. (e) DFFN [48]. (f) HybridSN [47]. (g) MorghAT [60]. (h) GAHT [55]. (i) SSFTT [54]. (j) MSCF-MAM.
Figure 10. The impact of different patch sizes on OA, AA, and κ. (a) SV dataset. (b) HC dataset. (c) HH dataset.
Figure 11. The impact of different PCA bands on OA, AA, and κ. (a) SV dataset. (b) HC dataset. (c) HH dataset.
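Figure 11 varies how many principal components are retained along the spectral dimension before patch extraction. As an illustration of this preprocessing step only (assuming scikit-learn; the function name, band count, and array shapes are our own, not the authors' code), a minimal sketch is:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube: np.ndarray, n_components: int = 30) -> np.ndarray:
    """Apply PCA along the spectral axis of an HSI cube: (H, W, B) -> (H, W, n_components)."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float32)  # one spectrum per row
    reduced = PCA(n_components=n_components, whiten=True).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

# Hypothetical usage with a random cube of 224 bands:
# x = reduce_bands(np.random.rand(100, 100, 224), n_components=30)
```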
Table 1. Training and test sample numbers of each category for the SV, HC, and HH datasets.

| NO. | SV Class Name | Training | Test | HC Class Name | Training | Test | HH Class Name | Training | Test |
|-----|---------------|----------|------|---------------|----------|------|---------------|----------|------|
| C01 | Brocoli green weeds 1 | 2 | 2007 | Strawberry | 45 | 44,690 | Red roof | 14 | 14,027 |
| C02 | Brocoli green weeds 2 | 4 | 3722 | Cowpea | 23 | 22,730 | Road | 4 | 3508 |
| C03 | Fallow | 2 | 1974 | Soybean | 10 | 10,277 | Bare soil | 22 | 21,799 |
| C04 | Fallow rough plow | 1 | 1393 | Sorghum | 5 | 5348 | Cotton | 163 | 163,122 |
| C05 | Fallow smooth | 3 | 2675 | Water spinach | 1 | 1199 | Cotton firewood | 6 | 6212 |
| C06 | Stubble | 4 | 3955 | Watermelon | 5 | 4528 | Rape | 45 | 44,512 |
| C07 | Celery | 4 | 3575 | Greens | 6 | 5897 | Chinese cabbage | 24 | 24,079 |
| C08 | Grapes untrained | 11 | 11,260 | Trees | 18 | 17,960 | Pakchoi | 4 | 4050 |
| C09 | Soil vinyard develop | 6 | 6197 | Grass | 9 | 9460 | Cabbage | 11 | 10,808 |
| C10 | Corn senesced green weeds | 3 | 3275 | Red roof | 10 | 10,506 | Tuber mustard | 12 | 12,382 |
| C11 | Lettuce romaine 4 wk | 1 | 1067 | Gray roof | 17 | 16,894 | Brassica parachinensis | 11 | 11,004 |
| C12 | Lettuce romaine 5 wk | 2 | 1925 | Plastic | 4 | 3675 | Brassica chinensis | 9 | 8945 |
| C13 | Lettuce romaine 6 wk | 1 | 915 | Bare soil | 9 | 9107 | Small brassica chinensis | 22 | 22,485 |
| C14 | Lettuce romaine 7 wk | 1 | 1069 | Road | 19 | 18,541 | Lactuca sativa | 7 | 7349 |
| C15 | Vinyard untrained | 7 | 7261 | Bright object | 1 | 1135 | Celtuce | 1 | 1001 |
| C16 | Vinyard vertical trellis | 2 | 1805 | Water | 75 | 75,326 | Film covered lettuce | 7 | 7255 |
| C17 | | | | | | | Romaine lettuce | 3 | 3007 |
| C18 | | | | | | | Carrot | 3 | 3214 |
| C19 | | | | | | | White radish | 9 | 8703 |
| C20 | | | | | | | Garlic sprout | 4 | 3482 |
| C21 | | | | | | | Broad bean | 1 | 1327 |
| C22 | | | | | | | Tree | 4 | 4036 |
| | Total | 54 | 54,075 | Total | 257 | 257,273 | Total | 386 | 386,307 |
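Table 1 reflects a stratified split in which roughly 0.1% of the labeled pixels of each class are drawn for training (with at least one pixel per class) and the remainder are used for testing. A minimal sketch of such per-class sampling from a ground-truth label map (the function and variable names are illustrative, not the authors' code) is:

```python
import numpy as np

def split_per_class(gt: np.ndarray, ratio: float = 0.001, seed: int = 0):
    """Randomly pick ~ratio of each labeled class (>= 1 pixel) for training; the rest is test."""
    rng = np.random.default_rng(seed)
    train_mask = np.zeros_like(gt, dtype=bool)
    for cls in np.unique(gt[gt > 0]):               # label 0 is unlabeled background
        idx = np.argwhere(gt == cls)                # (num_pixels, 2) row/col indices
        n_train = max(1, round(ratio * len(idx)))
        picked = idx[rng.choice(len(idx), n_train, replace=False)]
        train_mask[picked[:, 0], picked[:, 1]] = True
    test_mask = (gt > 0) & ~train_mask
    return train_mask, test_mask
```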
Table 2. Classification results obtained by different methods with 0.1% training samples per class for the SV dataset (optimal results are highlighted in bold).

| NO. | SVM [24] | 1-D CNN [43] | 3-D CNN [46] | DFFN [48] | HybridSN [47] | MorghAT [60] | GAHT [55] | SSFTT [54] | MSCF-MAM |
|-----|----------|--------------|--------------|-----------|---------------|--------------|-----------|------------|----------|
| C01 | 65.25 ± 5.10 | 62.77 ± 24.44 | 97.70 ± 1.96 | 97.12 ± 1.06 | 98.23 ± 1.65 | 80.72 ± 3.70 | 99.05 ± 1.26 | 96.47 ± 1.81 | 97.82 ± 1.60 |
| C02 | 86.24 ± 4.27 | 98.52 ± 3.26 | 93.18 ± 8.58 | 90.78 ± 3.00 | 99.41 ± 0.51 | 93.23 ± 6.35 | 99.80 ± 0.27 | 100.00 ± 0.00 | 100.00 ± 0.00 |
| C03 | 53.70 ± 8.23 | 54.26 ± 5.98 | 32.69 ± 5.48 | 51.00 ± 6.26 | 83.32 ± 8.59 | 83.42 ± 6.97 | 53.45 ± 0.50 | 99.52 ± 0.46 | 99.87 ± 0.14 |
| C04 | 84.11 ± 4.07 | 86.86 ± 9.25 | 97.96 ± 1.29 | 96.72 ± 2.79 | 51.54 ± 4.60 | 93.36 ± 6.18 | 97.60 ± 1.86 | 75.78 ± 12.93 | 85.83 ± 9.70 |
| C05 | 80.09 ± 6.14 | 87.48 ± 7.85 | 96.95 ± 0.34 | 85.44 ± 2.79 | 94.12 ± 8.15 | 86.25 ± 9.87 | 96.10 ± 1.58 | 93.41 ± 3.85 | 93.44 ± 2.71 |
| C06 | 97.09 ± 0.86 | 99.40 ± 0.22 | 98.93 ± 0.75 | 84.32 ± 6.57 | 94.55 ± 3.14 | 97.88 ± 1.97 | 99.90 ± 0.17 | 99.99 ± 0.02 | 98.68 ± 1.39 |
| C07 | 93.68 ± 1.47 | 98.89 ± 0.27 | 99.69 ± 0.23 | 81.40 ± 4.67 | 100.00 ± 0.00 | 96.26 ± 0.66 | 98.40 ± 0.46 | 99.87 ± 0.24 | 99.99 ± 0.02 |
| C08 | 50.80 ± 7.64 | 55.63 ± 4.29 | 77.16 ± 12.58 | 70.75 ± 9.13 | 86.61 ± 3.67 | 77.16 ± 7.95 | 83.31 ± 3.28 | 85.04 ± 0.48 | 88.25 ± 2.46 |
| C09 | 97.32 ± 1.23 | 97.04 ± 2.45 | 99.95 ± 0.05 | 95.39 ± 2.66 | 98.94 ± 1.51 | 98.08 ± 1.70 | 99.87 ± 0.23 | 100.00 ± 0.00 | 99.92 ± 0.13 |
| C10 | 77.59 ± 6.59 | 81.57 ± 5.86 | 29.79 ± 10.26 | 81.15 ± 1.08 | 94.12 ± 3.67 | 85.10 ± 3.48 | 88.87 ± 2.37 | 96.42 ± 0.69 | 95.70 ± 1.23 |
| C11 | 56.10 ± 7.28 | 57.16 ± 28.62 | 60.63 ± 4.33 | 38.43 ± 12.02 | 46.86 ± 5.77 | 37.52 ± 7.22 | 92.02 ± 3.99 | 98.57 ± 3.03 | 99.19 ± 1.75 |
| C12 | 92.36 ± 2.96 | 97.24 ± 2.54 | 94.91 ± 5.60 | 95.42 ± 7.93 | 60.92 ± 6.32 | 61.05 ± 10.69 | 99.98 ± 0.03 | 96.32 ± 3.08 | 99.03 ± 1.12 |
| C13 | 46.78 ± 4.15 | 41.33 ± 19.65 | 50.10 ± 1.86 | 94.90 ± 4.10 | 83.86 ± 14.04 | 90.04 ± 1.53 | 94.27 ± 5.28 | 87.28 ± 11.22 | 86.95 ± 8.88 |
| C14 | 75.97 ± 2.71 | 81.09 ± 7.66 | 82.73 ± 6.32 | 85.68 ± 2.11 | 80.51 ± 6.17 | 82.30 ± 8.48 | 66.56 ± 3.61 | 67.72 ± 16.67 | 78.99 ± 9.00 |
| C15 | 49.86 ± 10.52 | 59.19 ± 5.15 | 43.84 ± 10.63 | 71.83 ± 2.15 | 72.32 ± 16.36 | 84.43 ± 1.55 | 66.56 ± 3.61 | 50.14 ± 7.52 | 61.02 ± 8.02 |
| C16 | 75.35 ± 3.64 | 72.82 ± 3.64 | 47.22 ± 7.33 | 57.40 ± 2.45 | 90.66 ± 0.23 | 47.73 ± 7.41 | 80.72 ± 2.01 | 98.91 ± 0.73 | 98.55 ± 0.40 |
| OA (%) | 69.52 ± 3.45 | 75.91 ± 2.00 | 74.61 ± 1.81 | 77.70 ± 2.77 | 85.40 ± 2.53 | 82.16 ± 0.96 | 87.94 ± 0.39 | 87.81 ± 1.05 | 90.44 ± 1.86 |
| AA (%) | 72.89 ± 4.04 | 75.94 ± 4.34 | 71.87 ± 3.00 | 77.98 ± 4.58 | 79.83 ± 8.19 | 79.24 ± 1.65 | 89.37 ± 1.74 | 90.33 ± 1.21 | 92.70 ± 1.99 |
| κ × 100 | 67.15 ± 3.02 | 73.21 ± 2.22 | 73.28 ± 2.60 | 75.15 ± 3.20 | 83.72 ± 2.85 | 80.11 ± 1.02 | 86.54 ± 0.45 | 86.39 ± 1.18 | 89.33 ± 2.08 |
Table 3. Classification results obtained by different methods with 0.1% training samples per class for the HC dataset (optimal results are highlighted in bold).

| NO. | SVM [24] | 1-D CNN [43] | 3-D CNN [46] | DFFN [48] | HybridSN [47] | MorghAT [60] | GAHT [55] | SSFTT [54] | MSCF-MAM |
|-----|----------|--------------|--------------|-----------|---------------|--------------|-----------|------------|----------|
| C01 | 92.73 ± 0.57 | 77.58 ± 8.49 | 82.08 ± 7.40 | 78.05 ± 10.67 | 97.69 ± 1.77 | 94.05 ± 2.23 | 90.97 ± 3.01 | 94.29 ± 2.28 | 95.79 ± 3.32 |
| C02 | 53.70 ± 9.23 | 58.08 ± 4.92 | 65.86 ± 4.83 | 50.08 ± 2.08 | 82.56 ± 3.46 | 80.37 ± 1.19 | 85.41 ± 4.28 | 86.77 ± 6.65 | 92.69 ± 3.47 |
| C03 | 60.10 ± 5.16 | 64.28 ± 3.21 | 58.76 ± 2.39 | 55.78 ± 7.58 | 67.54 ± 2.00 | 67.67 ± 8.22 | 69.79 ± 4.26 | 79.79 ± 5.03 | 72.83 ± 4.49 |
| C04 | 46.54 ± 14.01 | 40.75 ± 11.11 | 41.12 ± 14.21 | 49.89 ± 6.54 | 99.08 ± 0.75 | 72.63 ± 7.86 | 91.30 ± 6.91 | 92.53 ± 6.76 | 94.49 ± 2.74 |
| C05 | 15.67 ± 9.06 | 14.65 ± 8.69 | 15.48 ± 3.38 | 22.89 ± 5.01 | 21.18 ± 8.69 | 21.07 ± 4.42 | 17.29 ± 5.31 | 26.56 ± 11.93 | 23.86 ± 2.33 |
| C06 | 13.21 ± 4.08 | 12.74 ± 3.96 | 16.71 ± 2.71 | 10.72 ± 3.63 | 33.83 ± 2.67 | 31.99 ± 3.29 | 31.56 ± 4.50 | 47.91 ± 8.81 | 50.68 ± 2.10 |
| C07 | 65.77 ± 5.27 | 61.49 ± 4.68 | 53.29 ± 3.44 | 61.87 ± 7.68 | 64.53 ± 0.24 | 80.43 ± 4.35 | 67.46 ± 3.89 | 82.99 ± 3.76 | 73.93 ± 4.59 |
| C08 | 45.72 ± 5.25 | 46.82 ± 3.02 | 55.69 ± 2.29 | 72.24 ± 4.98 | 74.08 ± 2.28 | 74.35 ± 3.06 | 72.42 ± 1.28 | 67.40 ± 9.73 | 75.96 ± 7.06 |
| C09 | 16.86 ± 12.37 | 16.97 ± 15.19 | 17.71 ± 4.29 | 17.36 ± 4.93 | 43.63 ± 1.85 | 25.42 ± 1.96 | 33.04 ± 7.92 | 60.17 ± 5.40 | 63.79 ± 2.66 |
| C10 | 26.21 ± 6.29 | 25.98 ± 4.13 | 29.71 ± 4.61 | 33.31 ± 5.73 | 82.08 ± 4.16 | 53.95 ± 9.60 | 87.78 ± 10.65 | 91.21 ± 5.11 | 95.80 ± 1.68 |
| C11 | 25.17 ± 6.32 | 41.70 ± 11.05 | 48.94 ± 5.44 | 44.38 ± 12.19 | 85.30 ± 4.86 | 90.55 ± 10.41 | 95.59 ± 0.75 | 94.78 ± 3.49 | 89.07 ± 3.86 |
| C12 | 12.10 ± 4.47 | 10.09 ± 5.27 | 12.64 ± 3.27 | 15.26 ± 3.32 | 31.84 ± 0.97 | 38.37 ± 9.05 | 42.62 ± 7.76 | 45.91 ± 1.92 | 52.22 ± 6.36 |
| C13 | 13.51 ± 6.78 | 14.62 ± 5.84 | 14.82 ± 3.27 | 20.00 ± 5.85 | 38.54 ± 4.03 | 51.03 ± 11.41 | 51.60 ± 7.08 | 42.51 ± 6.42 | 46.51 ± 3.14 |
| C14 | 74.04 ± 4.21 | 69.38 ± 3.94 | 71.58 ± 2.98 | 71.60 ± 2.84 | 77.04 ± 5.11 | 72.19 ± 11.77 | 88.72 ± 2.11 | 86.51 ± 5.71 | 88.71 ± 4.82 |
| C15 | 11.19 ± 15.11 | 12.62 ± 5.24 | 38.01 ± 6.75 | 57.17 ± 8.79 | 44.11 ± 1.98 | 24.94 ± 9.03 | 10.29 ± 4.54 | 29.91 ± 7.96 | 37.94 ± 3.67 |
| C16 | 98.06 ± 0.64 | 92.53 ± 7.22 | 99.54 ± 0.70 | 98.26 ± 1.00 | 99.58 ± 0.26 | 98.64 ± 0.06 | 99.04 ± 0.42 | 98.51 ± 1.10 | 99.40 ± 0.49 |
| OA (%) | 60.27 ± 3.28 | 57.41 ± 3.01 | 65.06 ± 1.19 | 62.86 ± 3.95 | 83.06 ± 1.04 | 77.09 ± 1.15 | 83.19 ± 1.71 | 85.80 ± 1.40 | 86.94 ± 0.47 |
| AA (%) | 40.37 ± 9.54 | 39.42 ± 3.53 | 53.51 ± 5.54 | 48.04 ± 4.21 | 62.04 ± 1.13 | 52.91 ± 2.41 | 60.95 ± 1.92 | 67.67 ± 2.12 | 68.04 ± 1.40 |
| κ × 100 | 53.93 ± 6.11 | 49.49 ± 3.44 | 58.05 ± 3.09 | 55.48 ± 5.33 | 80.06 ± 1.23 | 72.98 ± 1.50 | 80.29 ± 2.00 | 83.35 ± 1.61 | 84.65 ± 0.57 |
Table 4. Classification results obtained by different methods with 0.1% training samples per class for the HH dataset (optimal results are highlighted in bold).

| NO. | SVM [24] | 1-D CNN [43] | 3-D CNN [46] | DFFN [48] | HybridSN [47] | MorghAT [60] | GAHT [55] | SSFTT [54] | MSCF-MAM |
|-----|----------|--------------|--------------|-----------|---------------|--------------|-----------|------------|----------|
| C01 | 74.93 ± 2.42 | 55.11 ± 10.49 | 79.64 ± 2.65 | 82.58 ± 11.98 | 91.35 ± 3.18 | 92.35 ± 3.53 | 95.81 ± 1.09 | 86.95 ± 5.70 | 96.93 ± 0.59 |
| C02 | 33.81 ± 11.23 | 18.65 ± 16.10 | 56.64 ± 6.45 | 33.67 ± 5.66 | 36.95 ± 5.34 | 45.44 ± 9.67 | 54.84 ± 9.60 | 59.96 ± 10.47 | 61.92 ± 8.05 |
| C03 | 93.24 ± 2.16 | 91.11 ± 2.49 | 92.71 ± 1.56 | 96.32 ± 0.77 | 87.95 ± 2.31 | 92.99 ± 0.74 | 89.16 ± 3.94 | 91.50 ± 3.56 | 93.43 ± 3.95 |
| C04 | 98.61 ± 1.01 | 96.44 ± 0.66 | 99.28 ± 0.14 | 99.72 ± 0.11 | 98.87 ± 0.89 | 98.32 ± 0.67 | 99.33 ± 0.32 | 99.02 ± 0.32 | 98.42 ± 1.77 |
| C05 | 25.67 ± 10.04 | 19.43 ± 6.83 | 25.48 ± 4.38 | 22.76 ± 4.76 | 42.22 ± 5.85 | 51.07 ± 14.42 | 46.16 ± 12.27 | 62.80 ± 4.48 | 56.84 ± 6.85 |
| C06 | 87.75 ± 0.92 | 82.44 ± 6.19 | 88.11 ± 1.35 | 88.31 ± 0.95 | 92.65 ± 0.58 | 96.11 ± 1.32 | 91.57 ± 2.91 | 93.13 ± 4.02 | 93.98 ± 1.02 |
| C07 | 75.46 ± 3.27 | 53.65 ± 13.73 | 80.44 ± 3.05 | 66.96 ± 12.94 | 70.37 ± 2.92 | 76.99 ± 4.68 | 78.14 ± 2.29 | 85.52 ± 4.82 | 92.98 ± 2.77 |
| C08 | 35.16 ± 9.25 | 29.91 ± 4.96 | 30.80 ± 10.21 | 30.06 ± 9.05 | 36.29 ± 5.15 | 37.32 ± 4.67 | 45.25 ± 3.97 | 34.02 ± 3.24 | 34.32 ± 4.99 |
| C09 | 83.21 ± 4.16 | 62.99 ± 17.45 | 86.64 ± 1.88 | 84.63 ± 4.37 | 90.43 ± 2.64 | 94.45 ± 0.75 | 98.99 ± 0.50 | 91.90 ± 3.58 | 96.12 ± 1.45 |
| C10 | 10.31 ± 8.29 | 9.79 ± 4.18 | 49.51 ± 9.50 | 44.77 ± 4.46 | 50.07 ± 7.87 | 70.02 ± 7.49 | 79.07 ± 7.18 | 84.89 ± 3.18 | 87.84 ± 1.72 |
| C11 | 14.47 ± 14.32 | 16.53 ± 15.19 | 17.38 ± 3.94 | 16.00 ± 12.52 | 50.62 ± 10.83 | 51.03 ± 8.04 | 46.48 ± 10.84 | 66.00 ± 3.88 | 68.03 ± 7.70 |
| C12 | 50.48 ± 8.47 | 26.08 ± 10.44 | 52.64 ± 4.64 | 47.23 ± 3.45 | 58.78 ± 10.41 | 55.59 ± 7.25 | 65.06 ± 2.24 | 58.39 ± 7.43 | 58.95 ± 4.28 |
| C13 | 56.19 ± 6.28 | 46.49 ± 5.88 | 68.71 ± 0.44 | 65.89 ± 6.13 | 78.81 ± 2.93 | 72.39 ± 6.31 | 75.36 ± 4.71 | 70.43 ± 10.58 | 72.40 ± 5.30 |
| C14 | 36.89 ± 6.15 | 31.37 ± 11.33 | 30.91 ± 13.94 | 44.36 ± 12.92 | 60.83 ± 5.28 | 66.45 ± 7.12 | 78.94 ± 4.31 | 72.61 ± 5.27 | 73.45 ± 9.92 |
| C15 | 11.23 ± 10.04 | 15.11 ± 9.13 | 12.17 ± 8.29 | 10.54 ± 11.05 | 45.50 ± 3.68 | 41.93 ± 11.61 | 41.17 ± 3.61 | 50.07 ± 4.09 | 56.85 ± 3.00 |
| C16 | 83.89 ± 3.64 | 85.19 ± 11.56 | 90.33 ± 1.50 | 89.04 ± 3.08 | 80.50 ± 9.70 | 88.30 ± 7.18 | 95.96 ± 1.65 | 88.74 ± 4.59 | 90.59 ± 2.03 |
| C17 | 23.58 ± 11.03 | 12.95 ± 8.63 | 16.60 ± 2.94 | 16.40 ± 4.03 | 34.94 ± 6.54 | 40.44 ± 3.90 | 40.16 ± 2.27 | 40.15 ± 4.22 | 48.09 ± 10.79 |
| C18 | 16.54 ± 10.24 | 10.34 ± 5.59 | 10.27 ± 10.47 | 17.40 ± 6.45 | 34.60 ± 10.29 | 46.56 ± 7.33 | 36.91 ± 1.62 | 52.33 ± 7.34 | 65.50 ± 4.93 |
| C19 | 45.18 ± 6.35 | 23.84 ± 7.03 | 49.69 ± 10.32 | 15.11 ± 9.84 | 70.33 ± 3.54 | 74.41 ± 11.44 | 73.09 ± 10.91 | 81.00 ± 8.41 | 79.61 ± 5.31 |
| C20 | 14.71 ± 8.66 | 12.40 ± 13.70 | 28.04 ± 14.29 | 22.89 ± 15.53 | 28.44 ± 4.75 | 75.33 ± 4.46 | 81.01 ± 4.29 | 43.28 ± 3.29 | 50.27 ± 2.27 |
| C21 | 13.90 ± 13.45 | 10.10 ± 8.17 | 12.64 ± 4.57 | 17.29 ± 12.14 | 19.82 ± 5.89 | 21.71 ± 2.55 | 19.86 ± 10.87 | 24.24 ± 3.56 | 30.30 ± 2.65 |
| C22 | 21.54 ± 12.36 | 15.19 ± 13.78 | 24.64 ± 8.36 | 25.21 ± 12.47 | 56.55 ± 5.18 | 51.84 ± 8.15 | 69.87 ± 10.53 | 73.47 ± 3.48 | 72.88 ± 3.00 |
| OA (%) | 73.52 ± 5.28 | 69.52 ± 2.92 | 79.09 ± 1.19 | 77.38 ± 1.73 | 82.44 ± 0.76 | 85.09 ± 0.78 | 86.34 ± 0.62 | 86.74 ± 1.00 | 87.87 ± 0.60 |
| AA (%) | 45.52 ± 9.04 | 42.77 ± 6.82 | 54.91 ± 3.42 | 51.90 ± 3.54 | 55.77 ± 2.47 | 59.14 ± 1.60 | 62.16 ± 2.16 | 64.29 ± 1.54 | 65.99 ± 0.48 |
| κ × 100 | 67.90 ± 5.02 | 60.42 ± 4.80 | 73.05 ± 2.28 | 75.15 ± 3.20 | 77.68 ± 0.95 | 81.07 ± 0.96 | 82.66 ± 0.79 | 83.05 ± 1.44 | 84.55 ± 0.80 |
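Tables 2, 3, and 4 report overall accuracy (OA), average accuracy (AA), and the kappa coefficient scaled by 100 (κ × 100), each as mean ± standard deviation over repeated runs. For reference, a minimal sketch of how these three indicators can be computed for a single run (assuming scikit-learn; the function name is illustrative) is:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

def classification_indicators(y_true: np.ndarray, y_pred: np.ndarray):
    """Return OA (%), AA (%), and kappa x 100 for one run."""
    cm = confusion_matrix(y_true, y_pred)
    oa = 100.0 * np.trace(cm) / cm.sum()                 # overall accuracy
    aa = 100.0 * np.mean(np.diag(cm) / cm.sum(axis=1))   # mean of per-class accuracies
    kappa = 100.0 * cohen_kappa_score(y_true, y_pred)    # agreement corrected for chance
    return oa, aa, kappa
```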
Table 5. Ablation analysis of the proposed method with 0.1% training samples per class for the SV dataset (optimal results are highlighted in bold).

| Cases | 3D Conv1 | 3D Conv2 | 3D Conv3 | PSA | TE | OA (%) | AA (%) | κ × 100 |
|-------|----------|----------|----------|-----|----|--------|--------|---------|
| 1 | | | | | | 87.74 | 88.13 | 87.58 |
| 2 | | | | | | 89.12 | 90.43 | 88.71 |
| 3 | | | | | | 86.78 | 87.21 | 86.58 |
| 4 | | | | | | 89.23 | 90.33 | 88.24 |
| 5 | | | | | | 90.44 | 92.70 | 89.33 |
Table 6. Comparison of computational efficiency between the proposed method and the comparison methods on the three datasets (optimal results are highlighted in bold, our method's results are underlined).

| Methods | SV Training (s) | SV Test (s) | HC Training (s) | HC Test (s) | HH Training (s) | HH Test (s) | FLOPs (M) | Params (M) |
|---------|-----------------|-------------|-----------------|-------------|-----------------|-------------|-----------|------------|
| 1-D CNN [43] | 4.17 | 5.72 | 3.45 | 7.68 | 4.84 | 11.69 | 277.46 | 0.12 |
| 3-D CNN [46] | 8.53 | 4.97 | 40.49 | 21.50 | 58.37 | 25.76 | 1728.97 | 0.16 |
| DFFN [48] | 15.90 | 10.01 | 74.63 | 44.05 | 124.21 | 54.86 | 1328.13 | 0.42 |
| HybridSN [47] | 6.65 | 9.18 | 10.08 | 19.41 | 13.57 | 38.82 | 3252.56 | 4.84 |
| MorghAT [60] | 26.33 | 23.38 | 95.46 | 37.63 | 136.51 | 50.34 | 2725.21 | 0.20 |
| GAHT [55] | 17.02 | 10.05 | 77.98 | 39.25 | 110.22 | 47.58 | 3046.75 | 0.97 |
| SSFTT [54] | 1.93 | 3.36 | 8.65 | 19.32 | 5.64 | 15.75 | 781.38 | 0.16 |
| MSCF-MAM | 3.26 | 4.98 | 11.64 | 21.94 | 12.06 | 26.20 | 2551.71 | 0.59 |
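Table 6 lists training/test times, FLOPs, and parameter counts. As a rough illustration of how such figures are typically gathered, and not the paper's exact profiling setup, a PyTorch sketch for parameter counting and average inference timing is shown below; FLOPs are usually estimated with an external profiler (e.g., the thop or fvcore packages).

```python
import time
import torch
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def time_inference(model: nn.Module, x: torch.Tensor, runs: int = 10) -> float:
    """Average forward-pass time in seconds over several runs."""
    model.eval()
    start = time.time()
    for _ in range(runs):
        model(x)
    return (time.time() - start) / runs
```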
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
