1. Introduction
Hyperspectral techniques are key instruments for gathering and analysing continuous spectral data on objects across various wavelength ranges. The diverse and rich methods of acquiring hyperspectral data offer broad scope for advancing hyperspectral image (HSI) classification research, while also presenting new problems and opportunities for our understanding and use of such data. HSIs are currently used in several industries, including food safety [1,2], medical diagnostics [3], industrial product quality inspection [4], mineral detection [5], and pest and disease monitoring [6]. HSI processing approaches have developed richly across a wide range of topics, including noise removal [7], spectral unmixing [8], data classification and clustering [9], and target detection and recognition [10,11].
Land cover classification algorithms are challenged by the complex spectral properties and high-dimensional features of hyperspectral data. Early methods such as the Spectral Angle Mapper [12], Support Vector Machines [13,14], Decision Trees [15], and K-Nearest Neighbours [16] classified HSIs based solely on the similarity or difference between pixel spectra. In situations with considerable spectral overlap or similarity, techniques that consider only spectral features have not performed well. Efforts have therefore been made to address this issue using techniques such as the extended multi-attribute profile (EMAP) [17], the grey-level co-occurrence matrix (GLCM) [18] based on texture features, and the morphological profile (MP) [19] based on shape features. By logically fusing and representing spectral and spatial characteristics, these techniques can obtain more thorough and precise pixel features. To increase classification effectiveness and efficiency, HSIs typically require dimensionality reduction techniques, such as principal component analysis (PCA) [20,21] and linear discriminant analysis (LDA) [22,23], to retain the important data dimensions and features.
Traditional classification methods often require manual feature selection; the quality and effectiveness of feature extraction rely heavily on human experience and domain knowledge, which may not fully capture the complex and abstract features in images. In comparison, deep learning-based classification methods have stronger feature learning capabilities and automated feature extraction, and have accordingly demonstrated strong performance in image processing tasks. Various HSI classification methods have been proposed based on convolutional neural networks (CNNs) [9,24–33]. Although CNN-based networks have achieved positive results in visual classification, some areas require optimisation. CNNs typically require a large amount of training data; smaller datasets may be insufficient to fully train a CNN model, leading to overfitting. Given the complexity and large number of parameters in CNNs, training times can be long. The receptive field obtained by a CNN is strongly influenced by the size of, and variations in, the input data. For high-dimensional data such as hyperspectral images, traditional CNN models may require dimensionality reduction or the decomposition of spectral data to meet the input requirements of the model.
To address these issues with CNNs, such as parameter redundancy and a limited ability to handle spatial and scale variations, researchers have introduced various CNN variants to enhance model performance and adaptability. Wang et al. [34] proposed an expansion spectral–spatial attention network, which introduces dilated convolution as a dual-branch feature extraction unit; this expands the receptive field of the input patches and enhances the perception of large-scale features. In Ref. [35], kernel-guided deformable convolution was used to dynamically adjust the sampling positions of the convolution kernel, extracting pure spatial–spectral information from the local neighbourhood. Zhao et al. [36] presented a grouped separable convolution model, which combines grouped convolution and separable convolution to reduce the number of parameters and improve computational efficiency. By introducing different convolution operations and sampling mechanisms, these CNN variants effectively increase the model's receptive field and more effectively capture spectral, structural, textural, and spatial shape information in images without introducing extra parameters. They also reduce the number of parameters and the computational complexity of the network, allowing the model to maintain efficient performance even in resource-constrained environments.
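To make the receptive-field argument concrete, the growth from stacking dilated convolutions can be computed directly: a stride-1 convolution with kernel size k and dilation d has an effective kernel of d(k − 1) + 1, and each stacked layer adds (effective kernel − 1) to the receptive field. A minimal sketch (the specific dilation rates shown are illustrative):

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolution layers.

    layers: sequence of (kernel_size, dilation) pairs. A dilated conv
    has an effective kernel of d * (k - 1) + 1, so with stride 1 the
    receptive field grows by d * (k - 1) per layer.
    """
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)
    return rf

# Three 3x3 convolutions with dilation rates 1, 2, 5 cover a wide
# neighbourhood with the same parameter count as three plain 3x3 convs.
print(receptive_field([(3, 1), (3, 2), (3, 5)]))  # -> 17
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # -> 7
```

This is why a short cascade of dilated layers can see an entire input patch while adding no extra parameters over standard convolutions.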
HSI data typically contain many spectral bands, of which some are more important for classification tasks while others may contain noise or redundant information. Attention mechanisms [37–41] can help models automatically learn and focus on crucial features. The introduction of the Vision Transformer [42–46] has opened new avenues for attention-based model design. Vision Transformers use multi-head self-attention (MHSA) and feedforward neural networks, rather than CNNs, for feature extraction. Through MHSA, transformers can globally model the relationships within the entire input sequence, capturing global dependencies in images more effectively. Each MHSA head can focus on different correlations, allowing transformers to learn richer and more diverse feature representations. In contrast, relying solely on CNNs for feature representation is limited by fixed convolution kernels and pooling operations, which may fail to capture the diversity of features within the input. Many researchers have successfully combined transformers with CNN components, achieving state-of-the-art results in image classification tasks [47–52].
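As a concrete illustration of how every head attends over the whole token sequence, here is a minimal NumPy sketch of multi-head self-attention; shapes and projections are deliberately simplified relative to a full transformer block (no output projection, bias, or masking):

```python
import numpy as np

def multi_head_self_attention(x, wq, wk, wv, num_heads):
    """Minimal multi-head self-attention over a token sequence.

    x: (n_tokens, d_model); wq/wk/wv: (d_model, d_model) projections.
    Each head computes softmax(Q K^T / sqrt(d_head)) V over the full
    sequence, so every output token aggregates information from all
    input tokens (the global modelling property discussed above).
    """
    n, d = x.shape
    dh = d // num_heads
    q, k, v = x @ wq, x @ wk, x @ wv
    out = np.empty_like(x)
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
        out[:, s] = attn @ v[:, s]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                  # 6 tokens, d_model = 8
w = [rng.normal(size=(8, 8)) for _ in range(3)]
y = multi_head_self_attention(x, *w, num_heads=2)
print(y.shape)  # (6, 8)
```

Because each head uses its own slice of the projections, the two heads here can learn to attend to different correlations, which is the source of the richer representations noted above.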
However, the Vision Transformer was originally designed for RGB natural images, and applying the transformer to HSI classification still poses multiple challenges. Although the transformer has certain advantages in processing images with few channels, HSIs have many spectral bands, and feature extraction may become scattered. Because HSI data contain a large number of spectral bands per pixel, the multi-head self-attention (MHSA) in the transformer may establish long-range dependencies on a global scale while overlooking the local spatial relationships between pixels. This research proposes the Dilated Spectral–Spatial Gaussian Transformer Net (DSSGT), a new model framework that enhances transformer performance on HSI classification tasks. The low-level spatial–spectral joint features of the input data are extracted using a two-branch cascade of dilated convolutional layers; this tactic expands the receptive field for global information while decreasing the computational burden. The low-level joint features are then transformed into pixel-level vectors using a Gaussian weight matrix and fed into the transformer encoder block to create high-level long-range fusion features. Finally, global average pooling is used to predict the label of each pixel, producing a classification map.
A summary of the key contributions of this study includes the following:
(1) We proposed a low-level feature extraction module based on dilated convolution. This module stacks multiple dilated convolution layers with different dilation rates to perceive dependencies between features at different scales, thereby enlarging the receptive field and aggregating the obtained features. This allowed us to capture broader spatial–spectral contextual information and global structural correlations while also reducing computational complexity.
(2) We designed a Gaussian Weighted Pixel Embedding block that transforms low-level fusion features into pixel-level vector representations by introducing Gaussian weighted encoding. This transformation method enables the generation of more expressive and context-aware vector matrices, thereby enhancing the representation capability of the features.
(3) By combining a Dilated Convolutional Feature Extraction Unit and a Transformer Architecture Feature Fusion Unit, low-, mid-, and high-level semantic HSI features can be extracted effectively and efficiently, creating more expressive and discriminative feature representations. Extensive experiments on five datasets verify the efficacy of the proposed model.
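One plausible reading of the Gaussian-weighted pixel embedding in contribution (2) is sketched below. The function name, the centre-focused weighting, and the `sigma` parameter are illustrative assumptions for exposition, not the paper's exact formulation:

```python
import numpy as np

def gaussian_weighted_embedding(patch, sigma=2.0):
    """Hypothetical sketch of Gaussian-weighted pixel embedding.

    patch: (H, W, C) low-level feature patch. Each pixel's feature
    vector is scaled by a 2-D Gaussian centred on the patch centre
    (emphasising the pixel being classified), then flattened into an
    (H*W, C) sequence of pixel-level tokens for a transformer encoder.
    """
    h, w, c = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    weighted = patch * g[..., None]     # broadcast weights over channels
    return weighted.reshape(h * w, c)   # token sequence for the encoder

tokens = gaussian_weighted_embedding(np.ones((11, 11, 16)))
print(tokens.shape)  # (121, 16)
```

Under this reading, the centre pixel keeps full weight while distant neighbours are attenuated, giving the encoder context-aware tokens rather than uniformly weighted ones.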
3. Experiment and Analysis
3.1. Data Description
This section introduces the five hyperspectral datasets used; Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 display the raw HSIs and ground-truth labelled images of these datasets.
(1) Matiwan Village (MV) dataset: An aerial HSI acquired over Matiwan Village, Xiong'an New Area, Baoding City, in the Chinese province of Hebei. The dataset consists of 256 bands with a spectral resolution of 2.1 nm over the range 400–1000 nm. The image has a spatial resolution of 0.5 m and 3750 × 1580 pixels. Following fieldwork, 18 varieties of cash crops were selected and annotated as classes.
Figure 4. Original HSI and ground truth labels of the MV dataset. (a) Original HSI, (b) ground truth map, and (c) object names and their corresponding colours.
(2) WHU-LongKou dataset: This dataset consists of a hyperspectral unmanned aerial vehicle (UAV) image taken in Longkou Town, Hubei Province, China, on 17 July 2018, under clear, cloud-free conditions. The dataset has a spatial resolution of approximately 0.463 m and a spatial dimension of 400 × 550 pixels, with 270 bands over the spectral range 400–1000 nm. Agriculture constitutes most of the research region, and six of the nine labelled classes are agricultural.
Figure 5. Original HSI and ground truth labels of the WHU-LongKou dataset. (a) Original HSI, (b) ground truth map, and (c) object names and their corresponding colours.
(3) WHU-HongHu dataset: This dataset was collected on 20 November 2017 in Honghu City, Hubei Province, China. During the collection process, the UAV data were obscured by a small amount of cloud cover. The dataset has a spatial extent of 940 × 475 pixels, a spatial resolution of 0.043 m, and contains a total of 270 bands. The dataset mainly covers a wide range of complex crop types, and has a high degree of similarity between specific feature classes, such as Chinese cabbage/cabbage and Brassica chinensis/small Brassica chinensis. In total, 22 feature types were outlined.
Figure 6. Original HSI and ground truth labels of the WHU-HongHu dataset. (a) Original HSI, (b) ground truth map, and (c) object names and their corresponding colours.
(4) Houston dataset: This dataset was provided by the GRSS Data Fusion Contest 2013, and its scope was the University of Houston and its neighbouring cities. The dataset has a spatial extent of 1905 × 349 with a spatial resolution of 2.5 m and a total of 144 bands. The dataset contains 15 different feature classes such as roofs, streets, gravelled roads, grass, trees, water, and shadows.
Figure 7. Original HSI and ground truth labels of the Houston dataset. (a) Original HSI, (b) ground truth map, and (c) object names and their corresponding colours.
(5) Pavia Centre (PC) dataset: The PC dataset was collected over the northern Italian city of Pavia. It has 102 bands spanning the spectral range 380–1050 nm, with an image size of 1096 × 715 pixels. The dataset contains nine distinct feature classes, including roads, trees, grass, water, and buildings.
Figure 8. Original HSI and ground truth labels of the PC dataset. (a) Original HSI, (b) ground truth map, and (c) object names and their corresponding colours.
3.2. Experimental Setup
(1) Workstation configuration: To ensure fairness, we conducted all trials on workstations equipped with Intel(R) Xeon(R) Gold 5218R CPUs, NVIDIA GeForce RTX 3080 GPUs, and 128 GB of RAM, running Windows 10. All models were trained using the PyTorch deep learning framework. We limited the number of training epochs to 300 and applied an early stopping strategy: if the loss function stopped decreasing (i.e., approached saturation) for 30 consecutive epochs, training was terminated and the best model was saved. Early stopping speeds up training and helps prevent overfitting. To make the results more reliable, each experiment was repeated ten times and the mean is reported.
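The early-stopping rule described above (stop after 30 epochs without improvement, cap at 300) can be sketched as a plain training loop; the `step` callback standing in for one training epoch is a placeholder:

```python
def train_with_early_stopping(step, max_epochs=300, patience=30):
    """Early-stopping loop matching the setup described above.

    step(epoch) -> loss for that epoch. Training stops once the loss
    has not improved for `patience` consecutive epochs, and the epoch
    with the best loss is the one whose model would be saved.
    """
    best_loss, best_epoch, stale = float("inf"), -1, 0
    for epoch in range(max_epochs):
        loss = step(epoch)
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # loss saturated: stop and keep the best model
    return best_epoch, best_loss

# Toy loss curve that stops improving once it reaches 0.05:
best_epoch, best_loss = train_with_early_stopping(
    lambda e: max(1.0 - 0.02 * e, 0.05))
print(best_epoch)  # -> 48 (training halts 30 stale epochs later)
```

The patience counter resets on every improvement, so training only halts once the loss has genuinely plateaued.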
(2) Input settings: To establish the ideal patch size, we experimented on the five datasets, progressively increasing the patch size from 5 × 5 to 15 × 15. We considered the computational cost of the model, the accuracy of the experiments, and the ability of the patch to capture rich contextual information and fine details. Ultimately, we set the patch input size for this experiment to 11 × 11.
(3) Sample proportion settings: Different sample proportions were used for the training, validation, and test sets of the five HSI datasets. For the MV dataset, 1% of the labelled data were chosen at random for training and another 1% for validation, with the remaining 98% used for testing. For the WHU-LongKou dataset, 0.1% of the labelled data were used for training, 0.1% for validation, and 99.8% for testing. For the WHU-HongHu dataset, 0.5%, 0.5%, and 99% of the samples were used for training, validation, and testing, respectively. For the Houston dataset, the proportions were 5% for training, 5% for validation, and 90% for testing; for the PC dataset, they were 1%, 1%, and 98%.
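A hedged sketch of such a random split of labelled pixels is shown below; the text does not state whether sampling was stratified by class, so the per-class loop here is an assumption:

```python
import numpy as np

def split_labelled_pixels(labels, train_frac, val_frac, seed=0):
    """Randomly split labelled pixels into train/val/test index sets.

    labels: 1-D array of class ids, one per labelled pixel. Sampling
    is done per class here (an assumption; the paper only gives the
    overall fractions). Remaining pixels form the test set.
    """
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_tr = max(1, round(train_frac * len(idx)))  # keep >= 1 sample
        n_va = max(1, round(val_frac * len(idx)))
        train.extend(idx[:n_tr])
        val.extend(idx[n_tr:n_tr + n_va])
        test.extend(idx[n_tr + n_va:])
    return np.asarray(train), np.asarray(val), np.asarray(test)

labels = np.repeat([0, 1, 2], 1000)          # 3 classes, 1000 pixels each
tr, va, te = split_labelled_pixels(labels, 0.01, 0.01)
print(len(tr), len(va), len(te))  # 30 30 2940
```

The `max(1, ...)` guard matters for extreme ratios such as 0.1%, where a rare class could otherwise contribute no training pixels at all.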
(4) Learning rate setting: We experimented with five learning rates (1 × 10⁻³, 5 × 10⁻⁴, 1 × 10⁻⁴, 5 × 10⁻⁵, and 1 × 10⁻⁵) across the five datasets. After numerous rounds of trials, and careful consideration of model performance and the experimental outcomes, we determined 1 × 10⁻⁴ to be the ideal learning rate.
(5) Evaluation criteria: The overall accuracy (OA), average accuracy (AA), and kappa coefficient are the main metrics for evaluating the quality of the classification results. OA provides an intuitive measure of the overall classification accuracy, AA accounts for category imbalance, and the kappa coefficient measures the consistency of the model's classification results against random classification. Their calculation formulas are as follows:

$$\mathrm{OA} = \frac{1}{M}\sum_{i=1}^{K} C_{ii}, \qquad \mathrm{AA} = \frac{1}{K}\sum_{i=1}^{K} \mathrm{OA}_i, \qquad \kappa = \frac{\mathrm{OA} - P_e}{1 - P_e},$$

where $\mathrm{OA}_i = C_{ii}/C_{+i}$ represents the accuracy of each class, $C$ represents the confusion matrix, and $K$ represents the number of land cover classes in the dataset. Thus, $C \in \mathbb{R}^{K \times K}$. For any given pixel, $C_{ij}$ represents the number of samples where the $j$th class is classified as the $i$th class, $C_{+i} = \sum_{j=1}^{K} C_{ji}$ represents the number of true labels for the $i$th class, and $C_{i+} = \sum_{j=1}^{K} C_{ij}$ represents the number of predicted labels for the $i$th class. $M$ represents the total number of samples, and $P_e = \sum_{i=1}^{K} C_{i+} C_{+i} / M^2$ is the expected agreement by chance.
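The three metrics can be computed directly from the confusion matrix; here is a NumPy sketch using the convention defined above (entry $(i, j)$ counts samples of true class $j$ predicted as class $i$):

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA and kappa from a K x K confusion matrix.

    conf[i, j] = number of samples of true class j predicted as
    class i (rows = predicted, columns = true).
    """
    conf = np.asarray(conf, dtype=float)
    m = conf.sum()                                  # total samples
    diag = np.diag(conf)                            # correctly classified
    oa = diag.sum() / m
    aa = (diag / conf.sum(axis=0)).mean()           # mean per-class accuracy
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / m ** 2
    kappa = (oa - pe) / (1.0 - pe)                  # chance-corrected agreement
    return oa, aa, kappa

conf = np.array([[50,  2,  0],
                 [ 3, 45,  5],
                 [ 0,  1, 44]])
oa, aa, kappa = classification_metrics(conf)
print(round(oa, 4))  # 0.9267
```

Note that AA averages per-class accuracies, so a class with few samples influences AA as much as a majority class; this is why AA is the metric most sensitive to category imbalance in the tables that follow.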
3.3. Experimental Results
We compared the proposed method with seven classical and recent deep learning HSI classification models: 3D-CNN [55], DBDA [40], AB-LSTM [56], HRWN [57], SpectralFormer [58], DFFN [31], and SSFTT [47]. The 3D-CNN comprises three convolutional layers and two pooling layers. DBDA extracts features using convolutional layers together with spatial and channel attention modules. AB-LSTM is a bidirectional long short-term memory HSI classification network based on a recurrent neural network. HRWN is an efficient hierarchical random walk joint classification network. SpectralFormer is a transformer-based HSI classification model that excels at processing high-dimensional, multi-channel data; it extracts features with a self-attention mechanism and classifies them with a multilayer perceptron. DFFN is a deep feature fusion network that uses recursive learning to streamline the training procedure. SSFTT is a transformer-based HSI classification model that captures spatial–spectral and high-level semantic features.
(1) Experimental Results for the MV Dataset: Table 1 and Figure 9 show the classification effectiveness of the eight methods on this dataset. In Table 1, 'Train' denotes the number of pixels used for training and 'Test' the number used for testing. The proposed method achieved the highest OA, AA, and kappa coefficient on the MV dataset. While DSSGT achieved the highest average precision in 13 of the classes, it struggled to accurately classify the 'Maize (label 10)' and 'Soybean (label 16)' land cover types, as observed in Figure 9. Insufficient training samples for these two classes limited the ability of all methods to learn their specific features, resulting in subpar classification performance. However, the GWPE-based transformer module excelled at learning long-range global features from small sample sizes, enabling it to achieve reasonable results even with limited training data. DSSGT had an average training time of 7.72 s per epoch, a 2.18 s advantage over SSFTT. Although HRWN had the fastest training speed, its classification accuracy fell short. DSSGT tackled this trade-off by using dilated convolutions as feature extraction units, effectively expanding the receptive field while improving training efficiency, demonstrating an effective balance between performance and efficiency. The MV dataset primarily consists of land cover types associated with economic crops, including maize, soybean, and trees, which share similar spectral characteristics. Nevertheless, Figure 9 shows that the DSSGT model performed effectively on the MV dataset, yielding the closest match between the generated land cover classification map and the ground truth map. This highlights the advantages of DSSGT in capturing subtle spectral differences and spatial contextual information, enabling superior differentiation among these visually similar categories.
(2) Experimental Results for the WHU-LongKou Dataset: Table 2 and Figure 10 present the accuracy evaluation and classification maps for this dataset. We selected only 0.1% of the labelled samples as the training set, leaving five categories with fewer than 10 training pixels. The DSSGT method achieved the highest OA of 91.96%, while the DFFN method achieved the highest kappa coefficient (96.51%). All methods exhibited relatively low AA, suggesting an impact of data imbalance; the significant disparity between the numbers of training and testing samples may have adversely affected classifier performance. To mitigate this issue, the GWPE approach introduced a Gaussian weighted matrix, assigning higher weights to minority classes to give them more attention. Consequently, the DSSGT method achieved the highest AA (66.55%). Figure 10 shows that all methods incorrectly classified 'Sesame (Label 3)' as 'Broad-leaf soybean (Label 4)'. This misclassification can be attributed to the high spectral similarity between these two classes, which makes them challenging to differentiate. However, our proposed method excelled in distinguishing between these two categories by simultaneously extracting spectral, spatial, and texture features from the input data.
(3) Experimental Results for the WHU-HongHu Dataset: Table 3 and Figure 11 illustrate the quantitative evaluation results for the WHU-HongHu dataset. DBDA and SSFTT achieved accuracy on par with DSSGT, indicating their effective use of spatial–spectral attention mechanisms. By leveraging these mechanisms, they could more effectively capture the interdependencies and contextual information among land cover categories, allowing them to accurately discern the spatial distribution patterns of land features and improving overall classification accuracy. However, while DSSGT achieved the highest classification accuracy in 11 of the 22 land cover classes, this accounts for only half of the total. This may be attributed to the greater similarity among the remaining 11 land cover classes, which poses challenges for the classifier. Consequently, future model optimisation efforts should focus on more refined feature extraction techniques and classifier designs to effectively handle classes with similar features.
(4) Experimental Results for the Houston Dataset: Figure 12 shows the corresponding feature classification maps, while Table 4 displays the classification outcomes for DSSGT and the other seven classifiers. DSSGT produced the best classification results, and its three evaluation metrics were relatively close to one another, indicating high accuracy both overall and across categories, together with strong consistency relative to random classification. Given the limited training samples and the variety of data sources in this dataset, the representation learned by DBDA is relatively weak, making it difficult for its conventional dual-channel attention mechanism to capture useful characteristics. Meanwhile, SSFTT and DSSGT, which are based on MHSA, better capture the long-distance connections between features, which enhances the classification outcomes. The classification maps generated by 3D-CNN exhibit numerous isolated noise points and discontinuous pixels, indicating a considerable salt-and-pepper noise problem; this could be attributed to the limited robustness of 3D-CNN to noise and fine-detail variations during land cover classification. DBDA, in turn, struggles to differentiate subtle variations between classes, incorrectly categorising large patches as a single class and disregarding inter-class differences. In contrast, the classification maps produced by DFFN, SSFTT, and DSSGT show significant improvements: these models identify subtle inter-class variations more accurately, effectively eliminating the salt-and-pepper noise and improving map quality.
(5) Experimental Results for the PC Dataset: The quantitative evaluation metrics and classification maps for the PC dataset are shown in Table 5 and Figure 13. On this dataset, DSSGT outperformed the other methods in OA, AA, and the kappa coefficient, highlighting its effectiveness in capturing complex spatial relationships between land cover categories. This indicates that DSSGT has stronger capabilities in modelling dependencies and contextual information among different land features, resulting in more accurate and consistent classification. The discontinuities and patchy patterns observed in the classification maps generated by 3D-CNN and AB-LSTM raise concerns about their ability to capture fine-grained details and handle spatial variations, which can lead to misclassifications and inconsistencies in land cover recognition. In contrast, the classification maps generated by DSSGT are smooth and accurate, demonstrating its ability to eliminate patchy structures and produce visually coherent results. This is attributed to its attention mechanisms and its ability to capture long-range dependencies, enabling it to model spatial context and generate more refined classification boundaries.
3.4. Discussion
3.4.1. Ablation Experiments
We performed ablation tests on the five datasets to validate the efficacy of the novel components of our proposed technique. The first part of the ablation experiments assessed the effectiveness of dilated convolution in comparison with standard convolution; the second part assessed the effectiveness of the GWPE block. The experimental results are presented in Table 6, where for each dataset the first row represents standard convolution with GWPE, the second row represents dilated convolution without GWPE, and the third row represents dilated convolution with GWPE.
(1) Impact of dilated convolutions: The first and third rows of each dataset in Table 6 hold GWPE fixed while using standard and dilated convolution, respectively. On all five datasets, dilated convolution yielded higher OA, AA, and kappa coefficients than standard convolution. This is because our model's input patch size was 11 × 11: with three consecutive dilated convolutional layers using dilation rates of one, two, and five, the receptive field covers the entire input patch. This architecture enhanced the model's performance by enabling it to capture the contextual and global aspects of the input data more effectively. Figure 14 illustrates the per-category effects of standard versus dilated convolution in DSSGT on the five datasets. The results show that, for the WHU-LongKou and PC datasets, the accuracy of feature categorisation is not significantly affected by the type of convolution used. On the other three datasets, dilated convolution performs better on several of the poorly categorised feature classes. Specific categories that benefited from dilated convolution include 'vegetable field (Label 6)' in the MV dataset, 'Pakchoi (Label 8)' in the WHU-HongHu dataset, and 'commercial (Label 8)' in the Houston dataset. This shows that dilated convolution has an advantage over standard convolution in capturing feature categories whose characteristics are concealed or inconspicuous.
(2) Impact of GWPE: GWPE is the most critical component of the DSSGT module. It enhances classification performance by introducing Gaussian weights to emphasise important parts of the input features. In Table 6, the second and third rows for each dataset represent the classification results without and with the GWPE module, respectively. Figure 15 shows the influence of GWPE on each category. On the MV dataset, the framework with GWPE achieved an OA improvement of 3.7% over the plain framework. Similar results were observed on the WHU-LongKou and WHU-HongHu datasets, with an OA increase of approximately 1% in both cases. Given the already high initial OA values on the Houston and PC datasets, the improvement brought by GWPE there is not significant; nevertheless, the annotations in Figure 15 indicate that GWPE has a positive impact on all land cover classification tasks in those two datasets. The introduction of GWPE enables the model to more effectively capture the distinctions between land cover categories and the correlations among features, enhancing classification accuracy and performance.
3.4.2. Robustness Evaluation
To verify the stability of the proposed strategy, we conducted tests on the five datasets, examining the impact of different training ratios by scaling the number of training samples down to one-half or one-quarter, and up to two or three times, the original proportion. For instance, for the MV dataset we selected 0.25%, 0.5%, 2%, and 3% of the labelled samples as training sets, versus the original 1%. Figure 16 shows the quantitative results of the different classifiers and demonstrates that DSSGT outperformed the other methods. However, the improvement of DSSGT over the other methods is not significant for configurations with a large proportion of training samples, such as WHU-LongKou, Houston, and PC. Conversely, when faced with extremely limited sample sizes, the advantage of DSSGT becomes more pronounced. This indicates that the proposed method not only performs well with ample sample data but also achieves superior results in scenarios with limited samples. In contrast, AB-LSTM performs poorly in small-sample scenarios, emphasising the superiority and stability of DSSGT in addressing data scarcity.