1. Introduction
Remote sensing images contain abundant spectral and spatial information [
1]; thus, numerous studies have been conducted on remote sensing images, such as land cover mapping [
2], water detection [
3], and anomaly detection [
4]. Hyperspectral imagery (HSI) plays an indispensable role within the remote sensing community [
5] and is widely used in change area detection [
6], atmospheric environment research, vegetation cover detection [
7], and mineral mapping [
8]. However, the correlation between spectral bands is complex, which causes information redundancy and gives rise to the curse of dimensionality. In addition, the analysis and processing of HSI require a large amount of computation; therefore, it is essential to reduce the computational load while maintaining processing accuracy.
Principal component analysis (PCA) [
9] and linear discriminant analysis (LDA) [
10] are classical dimensionality reduction methods. However, these linear methods cannot handle the nonlinear distribution of spectral vectors well. Following the successful application of deep learning in various fields, this technology has also attracted much attention for use in dimensionality reduction. Deep learning has a strong nonlinear processing ability, in which the use of autoencoders is a typical unsupervised learning method. Zhang et al. [
11] introduced a basic framework for the application of deep learning to remote sensing data processing and proposed a stacked autoencoder for data dimensionality reduction. To fully extract the rich spatial–spectral information, Ma et al. [
12] proposed a spatial update deep autoencoder, which is based on a deep encoder with added regularization terms. Ji et al. [
13] proposed a three-dimensional (3D) convolutional autoencoder for the construction of a 3D input using spatial neighborhood information. However, these models are all followed by a simple classification model after the use of the autoencoder for feature extraction, which leads to the problem of insufficient feature extraction. Therefore, we hoped to further explore deep learning methods for HSI classification to fully extract feature information and finally achieve higher performances.
In recent years, convolutional neural networks (CNNs) have proven to be outstanding for image recognition, speech recognition, and pattern analysis. However, CNNs are vulnerable to backdoor attacks. Some outstanding works have endeavored to solve this problem, such as MedicalGuard [
14], BlindNet backdoor [
15], the multi-model selective backdoor attack method [
16], and the use of a de-trigger autoencoder against backdoor attacks [
17]. CNN-based methods have been widely used for image processing and also for HSI classification tasks. These methods have achieved significant breakthroughs due to their local processing and shared weight properties. According to the extracted features, these models can be divided into three categories: spectral-based methods, spatial-based methods, and spatial–spectral cooperative methods. The spectral-based methods classify each pixel by making use of the rich spectral information. Mu et al. [
18] proposed a dual-branch CNN-based method for multispectral entropy super-pixel segmentation for HSI classification. Yang et al. [
19] proposed a deep similarity network to solve imbalances between the slight intra-category and large inter-category differences. Moreover, a new pixel similarity measurement method has been developed using a double-branch neural network to deal with the task of classification. In an attempt to ameliorate the problem of mixed pixels destroying the credibility of original spectral information and the computational efficiency of overly complex models, Gao et al. [
20] proposed a 3D data preprocessing method and designed a new sandwich CNN that is based on the proposed method. To improve the performance of HSI classification that is based on spectral feature learning, a dual-channel attention spectral feature fusion method was proposed, based on a CNN, which extracts local and inter-block spectral features simultaneously in a parallel manner after grouping the adjacent spectral bands [
21]. The spatial-based methods only use spatial information, which means that the rich spectral information is not used. A consolidated CNN [
22] was proposed to overcome the problem of insufficient spatial resolution. Fang et al. [
23] proposed a 3D asymmetric inception network to overcome the overfitting problem. The third group of methods extracts spatial and spectral information at the same time and then fuses the extracted information for HSI classification. Sun et al. [
24] developed a method for extracting local features and then concatenating the spatial and spectral features for classification. Zhao et al. [
25] constructed an architecture that is based on a spatial–spectral residual network for deep feature extraction.
Although CNNs have achieved efficient performances in HSI classification, two main problems still exist. On the one hand, HSI classification comprises point-wise prediction, so the convolutional kernels cannot extract all of the useful information due to different regional topographies. On the other hand, the size of the convolutional kernels limits the receptive field of a CNN, which makes it impossible to carry out long-range modeling. The use of transformers [
26] makes up for this deficiency.
Along with the rapid development of deep learning, CNNs have always been mainstream in the computer vision (CV) field and have demonstrated some extraordinary achievements. Correspondingly, transformers have dominated the natural language processing field. Since 2020, transformers have started to be used in the CV field, such as for image classification (ViT, DeiT, etc.) [
27,
28], target detection (DETR, deformable DETR, etc.) [
29,
30], semantic segmentation (SETR, MedT, etc.) [
31,
32], and image generation (GANsformer) [
33]. For CV problems, convolution has a number of natural advantages, such as translation equivariance and locality. Although transformers do not have the above-mentioned advantages, they can obtain long-range information and extract global information owing to their unique structure. By contrast, CNNs need to continuously stack convolutional layers to obtain larger receptive fields. Based on the ViT, Li et al. [
34] proposed a simple yet effective vision transformer (ViT) called SimViT, which uses multi-head central self-attention and a simple sliding window to incorporate spatial structure and local information into the ViT. Simultaneously, its multi-scale hierarchical features can be applied to various dense visual prediction tasks. Given the wide application of transformers within the CV field, some studies have introduced ViTs into HSI classification. Hong et al. [
35] examined the problem of HSI classification from a sequential perspective and proposed SpectralFormer, which applies a transformer to HSI classification without convolution or recurrent units. He et al. [
36] proposed a spatial–spectral transformer for HSI classification, which uses a well-designed CNN to extract features and adopts a densely connected transformer to deal with the long-range dependencies. Qing et al. [
37] improved transformers to enable them to extract the spectral–spatial features of HSIs by utilizing the spectral attention and self-attention mechanisms. However, these models are still heavyweight, which leads to low efficiency.
Because CNNs rely on their natural inductive biases to learn visual representations, they can only establish local dependencies in the spatial domain. A ViT that is based on the self-attention mechanism can capture the global receptive field of the input feature map and can establish global dependencies in the spatial dimension to learn global visual representations. However, due to the structure of the self-attention mechanism, such network architectures usually have a large number of parameters and a high computational cost. In view of this, we aimed to combine the advantages of CNNs and ViTs in the design of an efficient network architecture. Moreover, the feature destruction that is caused by linear dimensionality reduction methods was also a concern. In this study, we adjusted the structure of the MobileViT [
38] and constructed a lightweight, robust, and high-performance framework, which can adapt to HSI processing. The proposed method combines the advantages of CNNs and ViTs and improves previous classification performances. Finally, we conducted experiments using four benchmark hyperspectral datasets to confirm the feasibility and excellence of our method for HSI classification.
The three significant contributions of this paper are as follows:
(a) According to our review of the literature, this study is the first to attempt to extend a lightweight ViT (MobileViT) for HSI classification. The MobileViT network can extract local and global information simultaneously and promote accurate classification;
(b) To preserve the more original information of HSI while reducing computational costs, we chose an end-to-end 3D convolutional autoencoder (3D-CAE) network for nonlinear feature dimensionality reduction. Moreover, we proposed an efficient end-to-end CAEVT network, which is based on the MobileViT and the 3D-CAE network;
(c) We evaluated the proposed method using four public datasets and achieved excellent classification results compared to other classification algorithms. In addition, sufficient ablation experiments demonstrated that the proposed method is efficient and effective in terms of time consumption, the number of parameters, and floating point operations (FLOPs). It is worth noting that our CAEVT network also achieves a competitive performance when labeled samples are scarce.
The rest of this article is organized as follows. Section 2 introduces the experimental datasets and the proposed framework. The experimental results and an analysis of different methods are presented in Section 3 and Section 4, respectively. Finally, Section 5 presents the conclusions.
2. Datasets and Methods
In this section, we introduce the four public HSI datasets that were used in this study and the proposed CAEVT network in detail.
2.1. Datasets
This study used four common HSI datasets to compare and verify the proposed method: the Indian Pines (IP) dataset (Table 1), Salinas (SA) dataset (Table 1), Pavia University (PU) dataset (Table 2), and Houston (HS) dataset (Table 2).
The PU dataset comprises continuous imaging in 115 bands within the wavelength range of 0.43–0.86 μm, of which 12 bands were eliminated due to noise; the spatial resolution of the images is 1.3 m. The image size is 610 × 340 pixels, of which 42,776 pixels are labeled. These pixels belong to nine ground-truth classes, including trees, asphalt roads, bricks, pastures, etc.
The IP dataset contains images with a spatial dimension of 145 × 145 pixels and 224 spectral bands within the wavelength range of 0.4–2.5 μm, of which 24 spectral bands covering water absorption regions were removed. There are 10,249 labeled pixels, which are divided into 16 vegetation classes.
The SA dataset comprises continuous imaging in 224 bands, 20 of which were eliminated because they fall within water absorption regions. The spatial resolution of the images is 3.7 m. The image size is 512 × 217 pixels, of which 54,129 pixels can be used for classification. These pixels are divided into 16 categories, including fallow, celery, etc.
The HS dataset was released for the 2013 IEEE GRSS Data Fusion Contest. The image size is 349 × 1905 pixels, with 144 bands covering a spectral range of 364–1046 nm. The ground truth is labeled into 15 categories.
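The band counts quoted above can be collected in a small lookup for later shape calculations. The following is only a bookkeeping sketch: the dictionary layout and function names are hypothetical, and the value of zero removed bands for the HS dataset is an assumption made because the text does not mention any removal.

```python
# Dataset facts as stated in the text. "removed" for HS is an assumption
# (the text reports 144 bands and mentions no discarded bands).
DATASETS = {
    "PU": {"bands": 115, "removed": 12, "size": (610, 340), "classes": 9},
    "IP": {"bands": 224, "removed": 24, "size": (145, 145), "classes": 16},
    "SA": {"bands": 224, "removed": 20, "size": (512, 217), "classes": 16},
    "HS": {"bands": 144, "removed": 0, "size": (349, 1905), "classes": 15},
}


def retained_bands(name):
    """Number of spectral bands left after discarding noisy or absorbed bands."""
    entry = DATASETS[name]
    return entry["bands"] - entry["removed"]


def labeled_fraction(name, labeled_pixels):
    """Share of the scene that carries a ground-truth label."""
    height, width = DATASETS[name]["size"]
    return labeled_pixels / (height * width)
```

For example, `retained_bands("PU")` gives the 103 bands that remain for the PU scene, and `labeled_fraction("PU", 42776)` shows that only about a fifth of its pixels are labeled.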
2.2. Three-Dimensional Convolutional Autoencoder
The use of an autoencoder is an effective way to extract deep-seated features due to its hierarchical structure. For a given autoencoder, our goal was to obtain the same output as the input, as far as possible, by optimizing the parameters. Naturally, we obtained several different representations of input X (the feature maps of each layer represent the different representations).
An autoencoder has two parts: an encoder and a decoder. Furthermore, a loss function is required to measure the reconstruction error. The smaller the loss, the closer the obtained features are to the features of the original input data. The parameters of the encoder and decoder can be adjusted by optimizing the loss function. In this study, to extract spatial–spectral features simultaneously, we used a 3D-CAE (Equation (1)) to construct the encoder and decoder:
$$v = \sigma(W * X + b) \quad (1)$$
where $W$ represents the convolutional kernel, $X$ is the input, $b$ is the bias, $\sigma$ is the activation function, $*$ denotes the 3D convolution, and $v$ is the extracted features.
The structure of the 3D-CAE is shown in Figure 1. The encoder comprises two convolutional layers and an average pooling layer. Similarly, the decoder consists of two deconvolutional layers. The convolutional layers are used for local processing and the pooling layer is used for downsampling. The deconvolutional layers are used to reconstruct information. The reconstruction is measured by the following equation:
$$L = \frac{1}{n} \lVert X - \hat{X} \rVert_2^2 \quad (2)$$
where $\hat{X}$ represents the reconstructed image, $X$ represents the input image, $n$ is the number of elements in $X$, and $L$ stands for the loss. The smaller the $L$ value, the closer the reconstructed features are to the features of the input image.
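To make the reconstruction criterion concrete, the sketch below computes a mean squared error between an input and its reconstruction, which is the loss named for training the 3D-CAE. It assumes both are given as flat sequences of equal length; the function name is hypothetical.

```python
def mse_loss(x, x_hat):
    """Mean squared error between an input and its reconstruction.

    Both arguments are flat sequences of numbers with the same length.
    """
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
```

A perfect reconstruction yields a loss of zero, and the loss grows quadratically with the per-element deviation.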
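In addition, a normalization operation [39] (Equation (3)) and an activation function (Equation (4): PReLU [40]) were added to speed up propagation and alleviate overfitting.
$$\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} \quad (3)$$
$$f(x) = \max(0, x) + a \min(0, x) \quad (4)$$
where $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are the mean and variance of $x$, $a$ is a manually set coefficient, and $x$ stands for the input. The activation function can increase nonlinearity in the lower dimensions, but it may destroy spatial characteristics in the higher dimensions [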
41]. We verified this through the experiments that are detailed in
Section 4.1. Therefore, we did not apply an activation function in the last deconvolutional layer.
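The two operations can be sketched in a few lines. This is an illustration under stated assumptions rather than the network's exact layers: the normalization is taken to be a zero-mean, unit-variance rescaling, the stabilizing constant `eps` and the negative slope `a = 0.25` are hypothetical defaults, and the function names are ours.

```python
import math


def normalize(values, eps=1e-5):
    """Rescale a sequence to zero mean and (approximately) unit variance.

    eps is an assumed small constant that avoids division by zero.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def prelu(x, a=0.25):
    """PReLU: identity for positive inputs, slope a for negative inputs."""
    return max(0.0, x) + a * min(0.0, x)
```

Unlike ReLU, PReLU keeps a scaled copy of negative inputs, so `prelu(-2.0)` returns `-0.5` instead of zero.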
Taking the PU dataset as an example, the parameters of the 3D-CAE that was developed in this study are listed in
Table 3. We used larger kernels along the spectral dimension to rapidly reduce the number of bands. The mean squared error (MSE) loss function was used to measure the deviation between the reconstructed data and the original data. The adaptive moment estimation (Adam) method was adopted to optimize the network parameters. In addition, we set the learning rate to 0.001. Finally, the obtained features were passed to the next structure.
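The effect of the spectral kernel size on the number of bands follows from the standard output-length formula of a convolution. The sketch below applies it along the spectral axis; the kernel, stride, and padding values in the usage note are hypothetical and are not the entries of Table 3.

```python
def conv_output_length(n, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis.

    Uses the standard formula floor((n + 2 * padding - kernel) / stride) + 1.
    """
    return (n + 2 * padding - kernel) // stride + 1
```

With the 103 retained bands of the PU dataset, a hypothetical spectral kernel of 7 with stride 2 gives `conv_output_length(103, 7, 2)`, i.e., 49 bands, and a second such layer gives 22, whereas a kernel of 3 with stride 1 would still leave 101 bands.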
2.3. Vision Transformer
The transformer encoder consists of alternating multi-head self-attention layers and multi-layer perceptron (MLP) blocks. First, the input feature is mapped into Query ($Q$), Key ($K$), and Value ($V$) using the MLP. Next, the attention output is obtained according to the following expression:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of $K$. The expression calculates the self-attention weights and then multiplies them by $V$ to obtain the aggregate feature representation.
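As an illustration of this step, the following sketch implements single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, on plain nested lists. It omits the learned projections and the multi-head split, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of query, key, and value vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output
```

Each output row is a convex combination of the value vectors, so every token can draw on information from every other token regardless of spatial distance.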
Inspired by the successful scaling of the transformer in NLP, the ViT applies the standard transformer directly to images with as few modifications as possible. To this end, the image is split into patches and the sequence of linear embeddings of these patches is then used as the input for the transformer.
The standard transformer accepts a one-dimensional sequence of token embeddings as its input. In order to process 2D images, the ViT reshapes the image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of flattened 2D patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where ($H$, $W$) is the resolution of the original image, $C$ is the number of channels (for an RGB image, $C = 3$), ($P$, $P$) is the resolution of each image block, and $N = HW/P^2$ is the number of generated image blocks, which is also the effective input sequence length of the transformer. Later, we demonstrate how we adapted this transformer for HSI processing (Figure 2).
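The patch-splitting step can be sketched as follows, assuming an image stored as a nested H × W × C list and a patch size P that divides both H and W. The function name is hypothetical, and the linear embedding that follows in a ViT is omitted.

```python
def image_to_patches(image, P):
    """Split an H x W x C nested list into N = H*W/P^2 flattened patches.

    Each patch is flattened to a vector of length P * P * C.
    """
    H, W = len(image), len(image[0])
    if H % P or W % P:
        raise ValueError("P must divide both H and W")
    patches = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            patch = [
                value
                for row in range(top, top + P)
                for col in range(left, left + P)
                for value in image[row][col]
            ]
            patches.append(patch)
    return patches
```

A 4 × 4 RGB image with P = 2 therefore yields N = 4 tokens of length 12.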
2.4. MobileViT Block
In CNNs, locality, 2D neighborhood structures, and translation equivariance are built into each layer of the model; however, ViTs have much less image-specific inductive bias than CNNs. In ViTs, only the MLP layers are local and translationally equivariant, whereas the self-attention layers are global. As an alternative to the original image blocks, the input sequences can be composed of CNN feature maps. Based on the above considerations, the MobileViT was proposed in [
38].
The MobileViT block is shown in Figure 3. It is assumed that the input feature is $X \in \mathbb{R}^{H \times W \times C}$. Then, the local representation can be obtained using convolution. At this stage, a separable convolutional structure with convolutional kernels of 3 × 3 and 1 × 1 is used to replace the normal convolution. The separable structure can easily change the number of channels and speed up the operation. The resulting feature is recorded as $X_L$ ($X_L \in \mathbb{R}^{H \times W \times d}$). Due to the heavyweight nature of the ViT, we reduced the input features to a lower dimension $d$. As in the ViT, the input feature map is divided into a series of non-overlapping blocks, which are recorded as $X_U \in \mathbb{R}^{P \times N \times d}$. Under these conditions, $h$ and $w$ were the input parameters, which were both set to 2, $P = wh$, and $N = HW/P$.
For each $p \in \{1, \ldots, P\}$, the transformer is used to achieve global processing and the relationship between each patch is also obtained. The expression is as follows:
$$X_G(p) = \mathrm{Transformer}(X_U(p)), \quad 1 \le p \le P$$
Then, the size of the feature, which is recorded as $X_F \in \mathbb{R}^{H \times W \times d}$, is reconstructed to be the same as that of the initial image. Low-level features $X$ and high-level features $X_F$ are combined in the third dimension. Next, the dimension is reduced to $C$ using a convolution with a kernel of 3 × 3. The parameters of the MobileViT block are listed in Table 4.
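The division into non-overlapping blocks and the later reconstruction can be sketched as an unfold/fold pair. The sketch assumes the layout of the MobileViT paper [38], in which the unfolded tensor has P = h·w pixel positions, each holding one d-dimensional vector per patch (N = HW/P patches); the function names and the nested-list representation are hypothetical.

```python
def unfold(feature, h, w):
    """Rearrange an H x W x d nested list into P x N x d (P = h*w, N = H*W/P).

    Entry [p][n] is the pixel at position p inside patch n.
    """
    H, W = len(feature), len(feature[0])
    if H % h or W % w:
        raise ValueError("patch size must divide the feature map size")
    unfolded = [[] for _ in range(h * w)]
    for top in range(0, H, h):
        for left in range(0, W, w):
            for i in range(h):
                for j in range(w):
                    unfolded[i * w + j].append(feature[top + i][left + j])
    return unfolded


def fold(unfolded, H, W, h, w):
    """Inverse of unfold: restore the H x W x d layout."""
    feature = [[None] * W for _ in range(H)]
    n = 0
    for top in range(0, H, h):
        for left in range(0, W, w):
            for i in range(h):
                for j in range(w):
                    feature[top + i][left + j] = unfolded[i * w + j][n]
            n += 1
    return feature
```

In this layout, the transformer would be applied to each of the P sequences of length N separately, so pixels at the same position in different patches exchange information, and `fold` then restores the spatial arrangement.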
2.5. The Framework of the Proposed CAEVT
The framework contains three steps: dataset generation, training and validation, and prediction, as shown in Figure 4. First, the dataset is randomly divided into a training set, validation set, and testing set. For the training set, the four dimensions ($C$, $B$, $H$, and $W$) are reshaped into three dimensions ($C \times B$, $H$, and $W$) ($C$ stands for the channel and $B$ stands for the band) after using the 3D-CAE model to reduce the dimensions. Next, a convolutional layer is adopted and the features are input into the MobileViT block for the extraction of local and global features. Before the features are input into the classification network, another convolutional layer, an average pooling layer, and a dropout layer with a rate of 0.2 are adopted. Afterward, the features are flattened into one dimension for classification. The classification network consists of a fully connected layer. Finally, a cross-entropy loss function is adopted to calculate the error.
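Two of these steps lend themselves to a short illustration: merging the channel and band axes of the 3D-CAE output so that 2D convolutions can follow, and the cross-entropy error of a single prediction. The sketch assumes nested-list tensors and raw (unnormalized) class scores; both function names are hypothetical.

```python
import math


def merge_channel_band(x):
    """Reshape a C x B x H x W nested list into (C*B) x H x W."""
    return [band for channel in x for band in channel]


def cross_entropy(logits, label):
    """Cross-entropy of one sample given raw class scores and the true class index."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[label]
```

For instance, two channels with three bands each become six 2D feature maps, and a prediction with equal scores for two classes has a cross-entropy of ln 2.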
Taking the PU dataset as an example, the CAEVT network is shown in
Figure 4 and the parameters are listed in
Table 5. In addition, all strides and paddings in the convolutions were set to 1.
In the previous literature, spatial information is captured by learning the linear relationships between patches. Considering that CNNs can extract local properties and transformers can obtain global properties, the CAEVT network adopts both convolutions and a transformer to capture spatial information. The steps of the proposed CAEVT network are summarized in Algorithm 1. Within this framework, the MobileViT block can be stacked repeatedly to improve accuracy at the cost of computation time; however, the block was only adopted once in this study for the sake of efficiency. In addition, we illustrate the lightweight nature of the CAEVT network by comparing the FLOPs and the number of parameters in
Section 4.2.
Algorithm 1: The proposed method.
Input: HSI original data $X$ and label $Y$.
Output: The evaluation index.
(1) Randomly divide the input data $X$ and annotated label $Y$ into a training set ($X_{train}$, $Y_{train}$), validation set ($X_{val}$, $Y_{val}$), and test set ($X_{test}$, $Y_{test}$).
(2) Optimize the CAEVT network using the training set ($X_{train}$, $Y_{train}$).
(3) Evaluate the model using the validation set ($X_{val}$, $Y_{val}$).
(4) Judge whether the training is over. If yes, output the optimal model; if not, continue the training.
(5) Save the optimal model after training for 50 epochs.
(6) Input $X_{test}$ to obtain the predicted result and calculate the evaluation index.
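The data split in step (1) and the model selection in steps (3) to (5) can be sketched as below. This is a simplified illustration: it assumes a plain random split over all labeled samples (no per-class sampling), selects the epoch with the highest validation accuracy, and uses hypothetical function names and a hypothetical fixed seed.

```python
import random


def split_indices(n_samples, train_frac, val_frac, seed=0):
    """Randomly split sample indices into training, validation, and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def select_best_epoch(val_accuracies):
    """Return (epoch, accuracy) of the best validation score; epochs are 1-based."""
    best = max(range(len(val_accuracies)), key=lambda i: val_accuracies[i])
    return best + 1, val_accuracies[best]
```

With the 10,249 labeled IP pixels and 5% each for training and validation, this split leaves 512 training, 512 validation, and 9225 test samples.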
2.6. Experimental Settings
The following four methods were compared to the proposed method.
SSRN [
42]: Building on earlier 3D convolutional classification models, SSRN introduces the skip connections of ResNet [
43]. This network uses spectral residual blocks and spatial residual blocks to extract rich spectral and spatial features.
FDSSC [
44]: Using different convolutional kernel sizes to extract spectral and spatial features and using an effective convolutional method to reduce the high dimensions, an end-to-end fast dense spectral–spatial convolutional network for HSI classification was proposed.
DBMA [
45]: A double-branch multi-attention mechanism network for HSI classification was proposed. The network uses two branches, which adopt attention mechanisms, to extract spectral and spatial features and reduce the interference between the two types of features. Finally, the extracted features are fused for classification.
DBDA [
46]: Based on DBMA, a double-branch dual-attention mechanism network was designed for HSI classification. This method further enhances the ability of the network to extract spectral and spatial features and has a better performance when there are limited training samples.
We executed the public code of these algorithms to obtain our results. The accuracy was measured using three metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient. OA represents the proportion of correctly predicted samples out of the total number of samples. AA denotes the mean of the per-class accuracies. The kappa coefficient measures the consistency between the ground truth and the classification result. The better the classification results, the higher the three metric values. Additionally, all experiments were carried out within the framework of PyTorch 1.10.2 using an RTX Titan GPU.
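The three metrics can be computed from a confusion matrix as sketched below. The sketch assumes rows index the true class and columns the predicted class, uses the usual Cohen's kappa definition, and skips classes without samples when averaging; the function name is hypothetical.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    diagonal = [confusion[i][i] for i in range(k)]
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    oa = sum(diagonal) / n
    per_class = [diagonal[i] / row_totals[i] for i in range(k) if row_totals[i] > 0]
    aa = sum(per_class) / len(per_class)
    # Agreement expected by chance from the marginal distributions.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa
```

On an imbalanced example such as `[[90, 0], [10, 0]]`, OA is 0.9 while AA is 0.5 and kappa is 0, which shows why AA and kappa are reported alongside OA when some classes have very few samples.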
3. Results
In this section, experiments on four popular datasets were executed to compare the accuracy and efficiency of the proposed algorithm to those of the other methods. We divided each dataset into three parts: the training set, validation set, and testing set. Due to the limited number of annotated samples in the IP and HS datasets, 5% of the samples were randomly selected for each of the training and validation sets. For the PU and SA datasets, the proportion of samples for each of the training and validation sets was set to 1%. Furthermore, in the proposed algorithm, the learning rate was set to 0.001 and the weight decay was set to 0.0005. The parameters of the algorithms for comparison were based on their best settings, which were provided by the relevant authors. Finally, the number of training epochs for all algorithms was set to 50.
3.1. Results for the IP Dataset
The classification results of all methods when using 5% of the data for training samples are shown in
Table 6 and the best results are shown in bold. The ground truth and prediction maps of the methods are shown in
Figure 5.
The main characteristics of the IP dataset are that the number of labeled samples is small and the data distribution is imbalanced. In particular, the number of samples in class 1, class 7, class 9, and class 16 is fewer than 100, which is far less than that in the other classes. The SSRN algorithm absorbed the characteristics of the ResNet algorithm and performed the best out of the four algorithms that were adopted for comparison. This algorithm achieved optimal results for class 2, class 4, class 6, class 8, class 13, class 14, and class 16. Notably, the accuracy of class 4 and class 16 was 100%. The DBMA algorithm achieved the worst results, with 53.49% OA, 40.92% AA, and 44.91% Kappa. For the DBMA and DBDA algorithms, which use attention mechanisms, the results were not satisfactory. The DBDA algorithm used more attention mechanisms than the DBMA algorithm, so the former performed better than the latter: relative to the DBMA algorithm, its results increased by 18.17% for OA, 16.22% for AA, and 21.93% for Kappa. The FDSSC and DBMA algorithms showed the best performance for class 16 and class 10, respectively. Additionally, the classification results from the other methods for class 1, class 7, and class 9 were 0, which we speculate was caused by the insufficient number of labeled samples. Similar to the SSRN algorithm, the proposed method obtained the best results for seven categories and surpassed the SSRN algorithm by a slim margin. Moreover, the network that we designed showed the best overall performance, with 90.71% OA, 78.61% AA, and 89.37% Kappa. It can also be observed from the prediction maps that the category boundaries that were obtained using the proposed method were more obvious and that the edges were clearer.
3.2. Results for the SA Dataset
The classification results of all methods when using 1% of the data for training samples are listed in
Table 7 and the best results are shown in bold. The ground truth and prediction maps of the methods are shown in
Figure 6.
The main characteristics of the SA dataset are a large number of labeled samples and the balanced distribution of classes. For the SA dataset, the SSRN algorithm was error-free for class 6, class 13, and class 16. Similarly, the FDSSC algorithm was error-free for class 1, class 13, and class 16. In addition, a zero error was achieved by the DBMA algorithm for class 1 and by the DBDA algorithm for class 2, class 6, class 14, and class 16. Moreover, the proposed method achieved the best performance for class 3, class 4, class 5, class 7, class 9, class 10, class 11, class 12, and class 15. Compared to the FDSSC algorithm, which achieved the worst results, our proposed method improved by 27.45% for OA, 39.46% for AA, and 31.18% for Kappa. As shown in
Table 7, the results from the CAEVT network were optimal, according to the three selected indexes, and the accuracy of each category that was classified using our method exceeded 89%. It can be observed from the prediction maps that the four methods that were adopted for comparison had some obvious misclassifications. The results that were obtained by the CAEVT network were consistent with the ground truth.
3.3. Results for the PU Dataset
The classification results of all methods when using 1% of the data for training samples are listed in
Table 8 and the best results are in bold. The ground truth and prediction maps of the methods are shown in
Figure 7.
In the PU dataset, the SSRN algorithm demonstrated certain advantages and performed the best for class 1, class 2, and class 5. The performances of the FDSSC, DBMA, and DBDA algorithms were similar and were inferior to that of the SSRN algorithm. The proposed algorithm performed the best for class 4, class 5, class 6, and class 8. In addition, the proposed algorithm exceeded the SSRN algorithm by 0.24% for OA, 0.13% for AA, and 0.29% for Kappa. The other methods showed satisfactory accuracies for every category due to the sufficient number of samples. Moreover, we had difficulty observing any obvious differences between the prediction maps, which was a phenomenon that we speculate occurred due to the similar OAs.
The overall sample size of the PU dataset is large and basically balanced. Among them, class 1 and class 8 are the two classes with the largest number of samples, which far exceed the other classes.
3.4. Results for the HS Dataset
The classification results of all methods when using 5% of the data for training samples are listed in
Table 9 and the best results are shown in bold. The ground truth and prediction maps of the methods are shown in
Figure 8.
The overall sample size of the HS dataset is small and slightly imbalanced. Similar to the results from the SA dataset, the CAEVT network performed the best for nine classes. There was no problem of sample size imbalance and all methods performed well using this dataset. Among the comparison algorithms, the OA, AA, and Kappa of the SSRN algorithm were higher than those of the others, but our proposed algorithm obtained the best results with 92.67% for OA, 90.78% for AA, and 92.06% for Kappa, as seen in
Table 9. As seen in
Figure 8, the proposed algorithm performed the best.