1. Introduction
Classification is one of the most important tasks [1,2] in HSI processing, providing the basis for many subsequent applications, such as urban planning, military target recognition, and geological prospecting. HSI classification is also a prerequisite for many downstream processing tasks, such as semantic segmentation [3,4], content understanding [5,6], target recognition [7,8], and anomaly detection [9,10].
One of the key points of classification is feature extraction. For decades, many conventional feature extraction methods have been proposed for HSI classification. For instance, the Extended Multiple Attribute Profile (EMAP), a popular method for spectral–spatial feature extraction, is widely used in HSI processing; it selects the most informative features by connecting multiple morphological attribute filters [11,12,13]. Later, Kwan et al. [14,15] used EMAP to enhance image bands. Zhang et al. [16] proposed a new classification framework based on a gravitational optimized multilayer perceptron classifier and EMAP, combined with Sentinel-2 multispectral images (MSI), to map complex coastal wetlands. Huang et al. [17] proposed to use EMAP to explore spatial features and remove suspicious abnormal pixels, thereby purifying the image background.
In addition to EMAP, there is a series of other notable techniques, such as the Support Vector Machine [18] and Discriminant Analysis [19,20]. For example, Baassou et al. [21] proposed a novel method that combines the Support Vector Machine (SVM) with Spatial Pixel Association (SPA) features, enhancing SVM's classification performance by extracting regional texture information from hyperspectral data. Guo et al. [22] addressed the particular demands of HSI classification for SVM by introducing a spectral-weighted kernel, selecting a specific set of weights by optimizing an estimate of the generalization error or appraising the practicality of each band. Melgani et al. [23] evaluated the potential of SVM in HSI classification through a combination of theoretical exploration and experimental analysis, providing a comprehensive perspective for a deeper understanding of its performance. Kang et al. [24] suggested a novel PCA-based Edge-Preserving Features (PCA–EPFs) method, which constructs standard Edge-Preserving Features (EPFs), reduces their dimension using Principal Component Analysis (PCA), and finally employs SVM for classification. Wang et al. [25] introduced a supervised approach combining a PCA Network (PCANet) and a Gaussian-Weighted Support Vector Machine (Gaussian-SVM), obtaining HSI classification results through threshold decisions. Villa et al. [26] presented a method using Independent Component (IC) Discriminant Analysis (ICDA), choosing the transformation matrix that maximizes the independence of components and applying the Bayesian rule for the final classification. Bandos et al. [27] introduced an efficient version of Regularized Linear Discriminant Analysis (RLDA) for HSI classification, addressing the challenges that arise when the ratio between the number of spectral features and the number of training samples is large. These traditional methods perform well in small-sample classification problems. However, as training datasets grow in complexity and scale, they may encounter performance bottlenecks due to limitations such as linear assumptions, computational complexity, dimensionality constraints, and specific assumptions about the data distribution. Recently, deep learning methods have been widely adopted in HSI classification, addressing the limitations posed by traditional approaches.
The rapid advancement of deep learning technology has significantly influenced various domains, notably making substantial contributions to the field of image processing [28,29]. In the domain of remote sensing data classification, deep learning methods have garnered considerable attention for analyzing fragmented data with improved efficiency and precision, and multiple approaches have been proposed for the classification of HSIs by leveraging deep models [30,31,32]. Hu et al. [33] devised a method that employs a 1D CNN comprising five convolutional layers. This method takes spectral information as input and accurately extracts spectral features; however, the network inadequately considers spatial information. To overcome this limitation, Zhao and Du [34] designed a 2D CNN model that, after dimensionality reduction of the spectral information, extracts valuable spatial features from the data. Nevertheless, both of these methods analyze the data along a single feature dimension. Yang et al. [35] employed a dual-branch structure, utilizing one-dimensional and two-dimensional CNNs to extract spectral and spatial features simultaneously. Subsequently, Chen et al. [36] introduced three-dimensional CNNs from the natural image domain to address HSI classification problems. To extract deep spectral–spatial features, Roy et al. [37] used a concatenation of three-dimensional and two-dimensional CNNs; this method not only comprehensively extracts spatial–spectral feature information but also improves classification accuracy while reducing computational complexity.
Given the widespread acceptance of residual networks, He et al. [38] introduced residual networks (ResNets) for HSI classification. This approach ensures more comprehensive feature extraction, allowing the model to minimize information loss at each convolutional layer and address the challenge of vanishing gradients. Zhong et al. [39] introduced a spatial–spectral residual network (SSRN) that supplements the previous layer's features with the next layer's features to achieve enhanced classification performance. In [40], a residual network was added to the model to increase the network's depth and feature map dimensions, successfully extracting feature information that traditional convolutional filters may overlook. Dense convolutional network structures, such as Cubic-CNN [41] and lightweight heterogeneous kernel convolutions [42], are also capable of effective feature extraction for HSI and yield satisfactory classification results.
All the aforementioned methods employ strategies based on CNN backbones and their variants, effectively enhancing HSI classification performance. However, classification performance still degrades when training samples are limited and network depth increases, and these methods face the significant challenge of feature redundancy.
In recent years, the Vision Transformer (ViT) [43] has found widespread application in various computer vision tasks [44,45,46], serving as an extension of the Transformer [47] architecture into the visual domain. While traditional CNNs excel in various visual tasks such as classification, detection, and segmentation, HSIs typically comprise hundreds of contiguous spectral bands, posing challenges for CNNs in effectively capturing the global dependencies within spectral information. In contrast, ViT leverages the same self-attention mechanism as the original Transformer, enabling it to establish relationships between different positions within an image and effectively capture global information. This capability has propelled ViT to excel in HSI classification tasks, even surpassing traditional CNNs in certain scenarios [48].
Hong et al. [49] reexamined the HSI classification problem from a sequential perspective and introduced a novel Transformer-based backbone network known as SpectralFormer. SpectralFormer incorporates two simple yet effective modules, grouped spectral embedding (GSE) and cross-layer adaptive fusion (CAF), designed to facilitate the learning of local detailed spectral representations and to transmit memory-like components from shallow layers to deeper layers. Sun et al. [50] presented a model called SSFTT, designed to convert shallow-level features into deep semantic tokens; it effectively captures spectral–spatial joint features through a combination of convolutional layers and Transformers. Xue et al. [51] introduced a local Transformer for HSI classification, the Spatial Partitioned Recurrent Local Transformer Network (SPRLT-Net), which not only acquires global contextual information but also uses dynamic attention weights that adaptively accommodate spatial variations among different pixels in HSI. Huang et al. [52] presented the 3D SwinT (3DSwinT) model, tailored to the 3D characteristics of HSI and capable of capturing its abundant spatial–spectral information; they additionally introduced a novel hierarchical contrastive learning method based on 3DSwinT (3DSwinT-HCL), which effectively harnesses multi-scale semantic representations of images. Fang et al. [53] introduced MAR-LWFormer, which utilizes a multi-attention joint mechanism and a lightweight Transformer to achieve multi-channel feature representation; it is designed to effectively leverage the multispectral and multi-scale spectral–spatial information within HSI data, significantly enhancing classification accuracy, particularly under extremely low sampling rates. Xu et al. [54] introduced the spatial–spectral 1DSwin (SS1DSwin) Transformer, which comprises two critical components, the grouped Feature Tokenization module (GFTM) and the 1DSwin Transformer with a cross-block normalization connection module (TCNCM), and investigates local and hierarchical spatial–spectral relationships from two distinct perspectives. Zhang et al. [55] proposed a novel and efficient lightweight spectral–spatial Transformer (ELS2T), which employs a global multi-scale attention module (GMAM) to emphasize feature distinctiveness and an adaptive feature fusion module (AFFM) for the effective integration of spectral and spatial features.
Currently, HSI classification is one of the most active research areas in HSI processing [56,57]. Researchers have made significant progress in unsupervised learning [58], autoencoders [59], latent representation learning [60], adversarial representation learning [61], and other fields, opening up new directions for handling HSI classification tasks. However, unsupervised feature learning and adversarial learning have not yet achieved satisfactory results in extracting spectral–spatial features from HSI. In our proposed ETLKA model, we design a novel architecture that consists of dual-branch shallow feature extraction, an innovative attention mechanism, and an efficient Transformer framework.
This paper introduces an innovative network that enhances the Transformer model's understanding of image data and improves its robustness. We enhance the module for extracting spectral–spatial shallow features and integrate it with a Transformer architecture featuring a Large-Kernel Attention module to thoroughly extract spatial and spectral information from HSIs. During the shallow feature extraction phase, we adopt a dual-branch structure, with the first branch extracting spectral–spatial features from the HSI and the second branch extracting spatial features. To further enhance the quality of the extracted features, strengthen the Transformer's comprehension of the image data, alleviate the computational complexity of the attention mechanism, and effectively mitigate the impact of redundant information, we place a Large-Kernel Attention module before the Transformer encoder. This design reduces redundancy and elevates the overall performance of the model.
The main contributions of the ETLKA method can be condensed into the following three points:
In order to more comprehensively extract spatial and spectral feature information from HSIs, a high-performance network has been designed that combines a dual-branch CNN with a Transformer framework equipped with a Large-Kernel Attention mechanism. This further enhances the classification performance of the CNN–Transformer combined network;
In the shallow feature extraction module, we designed a dual-branch network that uses 3D convolutional layers to extract spectral features and 2D convolutional layers to extract spatial features. These two discriminative features are then processed by a Gaussian-weighted Tokenizer module to effectively fuse them and generate higher-level semantic tokens;
By combining CNN-based shallow feature extraction with the Transformer framework's capacity to capture global contextual information within the image, our proposed ETLKA comprehensively learns spatial–spectral features within HSI, significantly enhancing classification accuracy. Experimental validation on three classic public datasets demonstrates the effectiveness of the proposed network framework.
2. Materials and Methods
Figure 1 illustrates the overall framework of the ETLKA model proposed for HSI classification. It is primarily divided into four main components: Feature Extraction via Dual-Branch CNNs, HSI Feature Tokenization, Large-Kernel Attention (LKA), and Transformer Encoder (TE) modules.
2.1. Feature Extraction via Dual-Branch CNNs
We represent the obtained HSI data using a 3D tensor $\mathcal{X} \in \mathbb{R}^{h \times w \times e}$, where $h \times w$ represents the spatial dimensions of the HSI data and $e$ represents the number of spectral dimensions in the HSI. Each pixel in the HSI contains $e$ spectral dimensions, and we typically use one-hot encoded class vectors to represent the labels for this feature, denoted as $Y = (y_1, y_2, \ldots, y_D)$, where $D$ is the number of land cover categories present in the current region. As HSI data often have a high number of spectral dimensions, we preprocess the HSI using PCA to significantly reduce the computational burden of the model. PCA reduces the number of spectral bands from $e$ to $b$, while keeping the spatial dimensions as $h \times w$. The HSI data after dimension reduction are represented as $\mathcal{X}_{\mathrm{PCA}} \in \mathbb{R}^{h \times w \times b}$, where $b$ represents the number of spectral dimensions obtained through the PCA operation.
After obtaining the 3D blocks from the preprocessed HSI data $\mathcal{X}_{\mathrm{PCA}}$, these extracted 3D blocks $\mathcal{P} \in \mathbb{R}^{s \times s \times b}$ serve as inputs to the entire model. Here, $s \times s$ represents the spatial dimensions when extracting 3D blocks, and $b$ represents the spectral dimension of each 3D block. The center coordinates in the spatial dimensions of each block obtained from the HSI are denoted as $(m, n)$, where $m$ ranges over $[0, h)$ and $n$ over $[0, w)$. The true labels for each 3D block are determined by the class of the element located at the central coordinates.
When extracting 3D blocks around edge pixels, some neighboring pixels are missing, so padding of width $(s-1)/2$ is applied to the original HSI data. The number of generated 3D blocks equals the number of spatial pixels contained in the HSI ($h \times w$). After removing all 3D blocks with the background class (label 0), the remaining 3D blocks are divided into training and testing sets for model training and evaluation.
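For concreteness, the following is a minimal sketch of this preprocessing step, assuming an HSI cube of shape (h, w, e) and a label map of shape (h, w); the padding mode (reflect) is an assumption, as the text does not specify it:

```python
# Minimal sketch of PCA reduction and 3D block extraction; names illustrative.
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(cube: np.ndarray, b: int) -> np.ndarray:
    """Reduce the spectral dimension from e to b with PCA."""
    h, w, e = cube.shape
    flat = cube.reshape(-1, e)                      # (h*w, e)
    reduced = PCA(n_components=b).fit_transform(flat)
    return reduced.reshape(h, w, b)

def extract_blocks(cube: np.ndarray, labels: np.ndarray, s: int):
    """Extract s x s x b blocks centered on every labeled (non-background) pixel."""
    pad = (s - 1) // 2                              # padding width from the text
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    blocks, block_labels = [], []
    for m in range(cube.shape[0]):
        for n in range(cube.shape[1]):
            if labels[m, n] == 0:                   # skip the background class
                continue
            blocks.append(padded[m:m + s, n:n + s, :])
            block_labels.append(labels[m, n] - 1)   # shift labels to start at 0
    return np.stack(blocks), np.array(block_labels)
```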
Following data preprocessing, we proceed to extract shallow spectral–spatial features from each acquired 3D sample block using a dual-branch convolutional module. In this module, the 3D convolutional layer consists of 8 3D convolution kernels. The training samples pass through this first branch, generating 8 3D feature cubes that contain rich spectral–spatial features obtained from the HSI.
At the same time, we convert the 3D HSI cube data into 2D form in order to feed it into a 2D convolutional layer. In this 2D convolutional layer, the kernel size is set to $(3 \times 3)$ and the padding size to $(1 \times 1)$. Next, we convert the 3D feature information obtained from the 3D convolution branch into 2D form in order to fuse it, through concatenation, with the 2D feature information obtained from the 2D convolution branch. Finally, we use a 2D convolutional layer with $(3 \times 3)$ kernels to further extract spatial features from the fused 2D feature information. The entire module can be written as

$$F_1 = \mathrm{Conv3D}_{k_1, p_1}(X), \qquad F_2 = \mathrm{Conv2D}_{k_2, p_2}(\mathrm{Reshape}(X)),$$
$$F_{\mathrm{cat}} = \mathrm{Concat}(\mathrm{Reshape}(F_1), F_2), \qquad F_{\mathrm{out}} = \mathrm{Conv2D}_{k_3, p_3}(F_{\mathrm{cat}}), \quad (1)$$

where $X$ represents the HSI cube data, $F_{\mathrm{cat}}$ represents the new feature information obtained by concatenating the two-dimensional feature information from the two branches, $k$ represents the size of the convolution kernel, with $k_1$ being $(3 \times 3 \times 3)$ and $k_2$ and $k_3$ being $(3 \times 3)$, and $p$ represents the padding size, with $p_1$ being $(0 \times 1 \times 1)$ and $p_2$ and $p_3$ being $(1 \times 1)$.
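The following PyTorch sketch gives one possible reading of this dual-branch module, using the kernel and padding sizes stated above (eight 3D kernels of size 3 × 3 × 3 with padding 0 × 1 × 1; 3 × 3 2D convolutions with padding 1 × 1); the 2D-branch channel counts are illustrative assumptions:

```python
# Dual-branch shallow feature extractor: a 3D spectral-spatial branch and a
# 2D spatial branch, fused by concatenation and a final 2D convolution.
import torch
import torch.nn as nn

class DualBranchExtractor(nn.Module):
    def __init__(self, b: int, out_channels: int = 64):
        super().__init__()
        # Branch 1: 3D convolution over (spectral, height, width).
        self.conv3d = nn.Conv3d(1, 8, kernel_size=(3, 3, 3), padding=(0, 1, 1))
        # Branch 2: 2D convolution treating the b spectral bands as channels.
        self.conv2d = nn.Conv2d(b, 8, kernel_size=3, padding=1)
        # Fusion conv: the 3D branch yields 8 * (b - 2) channels once flattened.
        self.fuse = nn.Conv2d(8 * (b - 2) + 8, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, b, s, s) -- one 3D block per sample
        f1 = self.conv3d(x)                 # (batch, 8, b-2, s, s)
        f1 = f1.flatten(1, 2)               # fold the spectral axis into channels
        f2 = self.conv2d(x.squeeze(1))      # (batch, 8, s, s)
        fcat = torch.cat([f1, f2], dim=1)   # concatenate both branches
        return self.fuse(fcat)              # (batch, out_channels, s, s)

# blocks = torch.randn(4, 1, 30, 13, 13)   # e.g., b = 30 bands, s = 13
# feats = DualBranchExtractor(b=30)(blocks)
```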
2.2. HSI Feature Tokenization
The training samples yield rich shallow spectral–spatial features through the dual-branch CNNs that we designed. However, there is still deeper feature information to be explored. To address this issue, we redefine the obtained shallow spectral–spatial features as semantic tokens. We represent the flattened feature map from the input (obtained by flattening the 2D feature map) as $F \in \mathbb{R}^{hw \times z}$, where $h$ and $w$ here represent the height and width of the 2D feature map and $z$ represents the number of channels. The final output of this module is represented as $T \in \mathbb{R}^{t \times z}$, where $t$ is the token number. To obtain the feature tokens $T$ from the feature map $F$, one can use the following equation:

$$T = \mathrm{softmax}\big((F \otimes W_a)^{\top}\big)\, F. \quad (2)$$

In the formula, $\otimes$ represents the multiplication of $F$, of dimensions $hw \times z$, with the weight matrix $W_a \in \mathbb{R}^{z \times t}$. We initialize this weight matrix using a Gaussian distribution. At this stage, the newly generated semantic groups are represented by $A = F \otimes W_a$. Next, $A$ is transposed, and we apply the softmax function ($\mathrm{softmax}(\cdot)$) to the transposed result to emphasize important semantic components. Finally, the multiplication of $\mathrm{softmax}(A^{\top})$ and $F$ generates the module's final output semantic tokens $T$. The specific process is visualized in Figure 2.
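A minimal sketch of this Gaussian-weighted tokenizer, under the shapes defined above, might look as follows (the Gaussian initialization scale is an assumption):

```python
# Gaussian-weighted tokenizer: project the flattened feature map with a
# Gaussian-initialized weight matrix, then softmax-pool over spatial positions.
import torch
import torch.nn as nn

class GaussianTokenizer(nn.Module):
    def __init__(self, z: int, t: int):
        super().__init__()
        # Weight matrix W_a (z x t), initialized from a Gaussian distribution.
        self.wa = nn.Parameter(torch.randn(z, t) * 0.02)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, z, h, w) -> F: (batch, h*w, z)
        f = feat.flatten(2).transpose(1, 2)
        a = f @ self.wa                                  # semantic groups A: (batch, hw, t)
        attn = torch.softmax(a.transpose(1, 2), dim=-1)  # softmax(A^T): (batch, t, hw)
        return attn @ f                                  # tokens T: (batch, t, z)
```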
2.3. Large-Kernel Attention
The tokens output from the previous module can be represented as $T = [t_1, t_2, \ldots, t_t]$. To make our model more suitable for our classification task, we include a learnable classification token $t_{\mathrm{cls}}$ in the first position of these tokens. In order not to lose the positional information inherent in the image, we embed positional information $P_{\mathrm{pos}}$ into each semantic token. With $t_{\mathrm{cls}}$ and $P_{\mathrm{pos}}$, the input $T_{\mathrm{in}}$ for the Large-Kernel Attention can be represented by the following equation:

$$T_{\mathrm{in}} = [t_{\mathrm{cls}}; t_1; t_2; \ldots; t_t] + P_{\mathrm{pos}}. \quad (3)$$
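This token-preparation step can be sketched as follows, with learnable parameters standing in for $t_{\mathrm{cls}}$ and $P_{\mathrm{pos}}$ (the initialization choices are assumptions):

```python
# Prepend a learnable class token and add positional embeddings, per Eq. (3).
import torch
import torch.nn as nn

class TokenPrep(nn.Module):
    def __init__(self, t: int, z: int):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, z))             # t_cls
        self.pos_embed = nn.Parameter(torch.randn(1, t + 1, z) * 0.02)  # P_pos

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, t, z) -> T_in: (batch, t+1, z)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed
```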
Large-Kernel Attention offers several benefits to the subsequent Transformer modules. First, it elevates the quality of feature extraction, enhancing the Transformer's comprehension of the image data and thereby improving overall model performance. Second, it alleviates the computational complexity of the attention mechanism, which is particularly evident when dealing with extensive image datasets, as it reduces the number of positional elements that must be considered. Third, it mitigates the impact of redundant information, strengthening the robustness of the Transformer model; this is especially pertinent for image data, which typically contain a wealth of information. Finally, Large-Kernel Attention improves computational efficiency, accelerating training convergence and reducing memory and computational requirements. These advantages make Large-Kernel Attention a valuable component within the Transformer architecture for HSI processing.
As depicted in Figure 1, this module primarily consists of a 2D convolution layer with $(3 \times 3)$ kernels, a dilated 2D convolution layer, and a 2D convolution layer with $(1 \times 1)$ kernels. The process can be expressed by the following equations:

$$A = \mathrm{Conv2D}_{k_6}\big(\mathrm{DConv2D}_{k_5, p_5, d}\big(\mathrm{Conv2D}_{k_4, p_4}(T_{\mathrm{in}})\big)\big), \qquad T_{\mathrm{LKA}} = A \otimes T_{\mathrm{in}}, \quad (4)$$

where $k$ represents the size of the convolution kernel, with $k_4$ and $k_5$ being $(3 \times 3)$ and $k_6$ being $(1 \times 1)$, $p$ represents the padding size, with $p_4$ being $(1 \times 1)$ and $p_5$ being $(3 \times 3)$, and $d$ represents the dilation size, which is 3.
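A compact sketch of this block, using the kernel, padding, and dilation sizes above, is given below; whether the token sequence is reshaped into a 2D map for the convolutions, and whether the first two convolutions are depthwise (as in the original Large-Kernel Attention design), are assumptions here:

```python
# Large-Kernel Attention: 3x3 conv (pad 1) -> 3x3 dilated conv (dilation 3,
# pad 3) -> 1x1 conv, with the result gating the input element-wise.
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.dconv = nn.Conv2d(channels, channels, 3, padding=3, dilation=3)
        self.conv2 = nn.Conv2d(channels, channels, 1)   # pointwise mixing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, h, w); output keeps the same shape.
        attn = self.conv2(self.dconv(self.conv1(x)))
        return attn * x          # gate the input with the attention map
```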
2.4. Transformer Encoder Module
After the Large-Kernel Attention, we obtain enhanced-quality features, which are then fed into the TE module. This module uses self-attention mechanisms to handle relationships between tokens, capturing both global and local features. As can be observed from Figure 1, this module primarily consists of four components: two Layer Normalization (LN) layers, a Multilayer Perceptron (MLP) layer, and a Multi-Head Self-Attention (MHSA) block. To facilitate deep neural network training and optimization, alleviate gradient vanishing problems, and enhance performance, residual connections are added after the MHSA block and the MLP layer.
The TE module includes two normalization layers placed before the MHSA and the MLP, which help alleviate gradient explosion, reduce vanishing-gradient problems, and accelerate training. The core of the TE is the MHSA block. MHSA integrates the Self-Attention (SA) mechanism, with its essential components typically named $Q$ (Queries), $K$ (Keys), and $V$ (Values). These three matrices are learned during the model's training process to adapt to the classification task and the HSI data. The SA mechanism computes attention scores using $Q$ and $K^{\top}$, and the weights of these scores are determined using the softmax function. Subsequently, the scores are multiplied by $V$ to obtain the output of SA, as shown in Figure 3b. These descriptions can be expressed by the following equation:

$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_K}}\right) V, \quad (5)$$

where $K^{\top}$ is the transpose of $K$ and $d_K$ is the dimension of $K$.
Compared to SA, MHSA uses multiple sets of weight matrices, allowing it to map multiple sets of $Q$, $K$, and $V$. Following the same operations as described earlier, it computes attention values for each set to calculate the Multi-Head Attention values. The attention results from each head are then concatenated together. The final step involves multiplying the concatenated attention values with a weight matrix $W$, where $n$ represents the number of attention heads and $t + 1$ represents the number of tokens. The computation equation for MHSA can be expressed as follows:

$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}(\mathrm{SA}_1, \mathrm{SA}_2, \ldots, \mathrm{SA}_n)\, W. \quad (6)$$
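The MHSA computation in Equation (6) can be sketched explicitly as follows, so that the $Q$/$K$/$V$ projections, scaled dot-product, softmax, head concatenation, and the weight matrix $W$ are all visible (splitting the channel dimension evenly across heads is an assumption):

```python
# Multi-Head Self-Attention written out step by step, per Eqs. (5)-(6).
import torch
import torch.nn as nn

class MHSA(nn.Module):
    def __init__(self, z: int, n_heads: int):
        super().__init__()
        assert z % n_heads == 0
        self.n, self.dk = n_heads, z // n_heads
        self.qkv = nn.Linear(z, 3 * z)   # learned Q, K, V projections
        self.w = nn.Linear(z, z)         # output weight matrix W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, z)
        b, t, z = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (batch, heads, tokens, dk)
        q, k, v = (m.view(b, t, self.n, self.dk).transpose(1, 2) for m in (q, k, v))
        scores = torch.softmax(q @ k.transpose(-2, -1) / self.dk ** 0.5, dim=-1)
        out = (scores @ v).transpose(1, 2).reshape(b, t, z)   # concat the heads
        return self.w(out)               # multiply by W
```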
The MLP component consists of two fully connected layers. After passing through the TE module, $T_{\mathrm{in}}$ is transformed into $T_{\mathrm{out}}$. We extract the classification token vector $t_{\mathrm{cls}}$ embedded in Equation (3) for the classification task. Next, we use a linear layer with an output dimension equal to the number of land cover classes in the HSI data. Finally, we use the softmax function to assign the label of the input sample to the category represented by the class with the highest probability in the final vector.
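Putting these pieces together, one TE block with its pre-norm LN layers and residual connections might be sketched as follows (the MLP hidden width and GELU activation are assumptions; MHSA refers to the sketch above):

```python
# One Transformer encoder block: LN -> MHSA -> residual, LN -> MLP -> residual.
import torch
import torch.nn as nn

class TEBlock(nn.Module):
    def __init__(self, z: int, n_heads: int):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(z), nn.LayerNorm(z)
        self.mhsa = MHSA(z, n_heads)
        self.mlp = nn.Sequential(nn.Linear(z, 4 * z), nn.GELU(), nn.Linear(4 * z, z))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mhsa(self.ln1(x))   # LN -> MHSA -> residual
        x = x + self.mlp(self.ln2(x))    # LN -> MLP -> residual
        return x

# Classification reads the class token (position 0), e.g.:
# logits = nn.Linear(z, num_classes)(TEBlock(z, 4)(t_in)[:, 0])
```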
The complete procedure of the ETLKA method, as proposed, is outlined in Algorithm 1.
Algorithm 1 Enhanced Transformer with Large-Kernel Attention Model

Input: HSI data $\mathcal{X}$ and ground-truth labels $Y$; the spectral dimension $b$ after PCA preprocessing; the patch size $s$; and the training sample rate $r$.
Output: Predicted labels for the test dataset.
1: Configure the batch size to 64, use the Adam optimizer with learning rate $\eta$, and set the number of iterations to $E$.
2: Obtain the PCA-transformed HSI, denoted as $\mathcal{X}_{\mathrm{PCA}}$, from which patches are generated for all samples and split into training and test sets according to the training sample rate.
3: Create training and test data loaders.
4: for $i = 1$ to $E$ do
5:    Generate shallow features using the spectral–spatial shallow feature extraction module.
6:    Flatten the extracted 2D shallow spectral–spatial feature maps to obtain 1D feature vectors.
7:    Execute the tokenization transformation using the feature vectors and initialized weights to produce semantic tokens.
8:    Combine a learnable class token with the first position of the semantic token sequence and apply positional embeddings to these tokens.
9:    Perform the Large-Kernel Attention and TE modules.
10:   Input the learnable class tokens into a linear layer and use the softmax function to obtain classification probabilities.
11: end for
12: Use the trained model on the test dataset to obtain the predicted labels.
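A skeletal training loop mirroring Algorithm 1 might look as follows; `model` is assumed to chain the modules sketched above, and the learning rate and epoch count stand in for the unstated values $\eta$ and $E$:

```python
# Skeletal training/prediction loop following the steps of Algorithm 1.
import torch
import torch.nn as nn

def train(model, train_loader, test_loader, epochs=100, lr=1e-3, device="cpu"):
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # step 1 (eta assumed)
    loss_fn = nn.CrossEntropyLoss()
    model.to(device)
    for epoch in range(epochs):                         # steps 4-11
        model.train()
        for blocks, labels in train_loader:             # batch size 64 per the text
            blocks, labels = blocks.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(blocks), labels)       # softmax folded into the loss
            loss.backward()
            opt.step()
    model.eval()                                        # step 12: predict test labels
    with torch.no_grad():
        preds = [model(b.to(device)).argmax(1).cpu() for b, _ in test_loader]
    return torch.cat(preds)
```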