1. Introduction
The rapid advancement of aerospace vehicle and optics technology has substantially improved hyperspectral imaging techniques [1], enabling higher-quality and more cost-effective acquisition of hyperspectral images (HSIs) [2]. Each pixel in these images carries abundant spectral information spanning hundreds of distinct bands [3]. This rich spectral information makes HSIs suitable for numerous practical applications, including environmental inspection [4], agricultural assessment [5], and military investigation [6]. It also makes it possible to assign a specific land-cover label to each hyperspectral pixel, which has spurred interest in HSI classification.
To address the challenge of HSI classification, various methods have been introduced [7]. In essence, HSI classification categorizes pixels by extracting image features, which places a premium on discriminative feature extraction [8]. In the early stages, feature extraction methods were often hand-crafted using machine learning techniques, relying primarily on the spectral and spatial information contained in HSIs. For example, methods such as the support vector machine (SVM) [9], logistic regression [10], linear discriminant analysis (LDA) [11], and principal component analysis (PCA) [12], which rely on spectral information alone, were proposed. However, these methods focus solely on spectral features while disregarding spatial information, resulting in less-than-ideal classification performance [13]. To enrich the feature space, spatial information must be harnessed in conjunction with spectral data in HSIs [14]. As a result, several spatial–spectral techniques that consider both types of information have been developed. Among these, certain joint spatial–spectral classifiers leverage the characteristics of HSI data cubes, for instance through the 3-D wavelet transform (WT) [15,16] and morphological information methods [17,18]. Other methods place greater emphasis on the spatial relationships between pixels, exploiting concepts such as manifold learning [19], superpixel learning [20], and graph representation learning [21], which allow structural information to be built from HSI data. While these methods have achieved considerable success in HSI classification, traditional machine learning approaches are constrained by their reliance on hand-crafted features and empirically chosen parameters, which hinders in-depth feature extraction [22].
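To make the contrast concrete, the following minimal sketch illustrates a purely spectral, per-pixel classifier of the kind discussed above, using an SVM on individual band vectors; the array names, shapes, and RBF hyperparameters are illustrative assumptions rather than settings from the cited works.

```python
# Minimal sketch (assumed names and hyperparameters): each pixel's band vector is
# classified independently, so no spatial context is used.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def spectral_svm_classify(hsi_cube: np.ndarray, label_map: np.ndarray) -> np.ndarray:
    """hsi_cube: (H, W, B) reflectance cube; label_map: (H, W) with 0 = unlabeled."""
    h, w, b = hsi_cube.shape
    pixels = hsi_cube.reshape(-1, b)          # every pixel becomes an independent B-dimensional sample
    labels = label_map.reshape(-1)
    train = labels > 0                        # only annotated pixels participate in training
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
    clf.fit(pixels[train], labels[train])
    return clf.predict(pixels).reshape(h, w)  # dense per-pixel prediction map
```

Because every pixel is classified from its spectrum alone, neighboring pixels provide no mutual context, which is precisely the limitation that the spatial–spectral methods above seek to address.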
In comparison, deep learning (DL) can autonomously extract deep features through multi-layer neural networks [23,24]. Following their success on RGB images, DL models have been widely adopted for HSI classification owing to their strong feature representation capabilities [25,26]. In the early stages, Chen et al. pioneered the use of DL models for HSI classification by devising a stacked autoencoder for high-level feature extraction [27]. Subsequently, numerous DL classifiers for HSI classification have emerged and achieved excellent results; notable examples include convolutional neural networks (CNNs) [28], recurrent neural networks (RNNs) [29], and deep belief networks (DBNs) [30]. Among these approaches, CNNs have become the most widely used framework for HSI classification. The learning paradigms of CNN-based methods cover a variety of settings, including transfer learning [31], meta-learning [32], and active learning [33]. Several CNN models with distinct architectures have been introduced to make full use of the spectral and spatial information in HSIs, enhancing the network's expressiveness through choices of input dimensionality, network depth, and complexity. For example, Hu et al. first introduced the CNN model into HSI classification, developing a 1-D CNN classifier for spectral information extraction [34]. To exploit the spatial information in HSIs, Li et al. increased the dimensionality of the convolution, presenting a 3-D CNN model for extracting spectral–spatial features [35]. The extraction of spectral–spatial features can also be realized through multi-channel fusion, as exemplified by the two-channel CNN (TCCNN) [36] and the multi-channel CNN (MCCNN) [37]. Furthermore, inspired by the visual attention mechanism, attention techniques have been integrated into CNNs in several recent methods [38,39]. While existing CNN-based methods have made significant performance strides, the convolution operations in CNN models act on regular square regions and neglect pixel-to-pixel relationships. Although attention-based CNNs can assign different weights to distinct feature dimensions, they cannot directly evaluate pixel correlations, because CNN models are unable to handle non-Euclidean data such as the adjacency relationships between pixels.
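As a concrete illustration of the spectral–spatial CNNs discussed above, the sketch below shows a small 3-D convolutional classifier operating on patches centered at each labeled pixel; it is written in the spirit of the cited 3-D CNN approaches, and the layer widths, kernel shapes, and patch size are assumptions rather than the architecture of any specific referenced model.

```python
# Hedged sketch of a small 3-D CNN for spectral-spatial HSI classification
# (illustrative layer sizes, not a referenced architecture).
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            # joint spectral-spatial filters: each kernel spans 7 bands and a 3x3 neighborhood
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.BatchNorm3d(8), nn.ReLU(inplace=True),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3), padding=(2, 1, 1)),
            nn.BatchNorm3d(16), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global pooling over bands and spatial dimensions
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, num_bands, patch_h, patch_w), a patch centered on the pixel to classify
        return self.classifier(self.features(x).flatten(1))

# Example: a 200-band cube, 16 classes, 9x9 spatial patches.
# logits = Simple3DCNN(16)(torch.randn(4, 1, 200, 9, 9))
```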
Non-Euclidean spatial data pervade the real world. Graphs, which connect vertices with edges, offer a means of representing non-Euclidean data, and graph neural networks (GNNs) can directly extract information from such graph structures [40]. Consequently, GNNs enable comprehensive extraction of spectral–spatial features from HSIs and yield impressive classification results. At present, GNN approaches for HSI classification rely on three main ways of constructing graphs: pixel-based [41], selection-region-based [42], and superpixel-based [43]. The pixel-based approach designates each pixel as a vertex and builds the graph from all HSI pixels; given the typically high resolution of HSIs, this entails significant computational demands [41]. In contrast, the selection-region-based approach constructs the graph from randomly chosen pixel patches rather than the entire HSI. Hong et al. [42] effectively exploited structural relations and image features through a dual-channel scheme combining graph convolutional networks (GCNs) and CNNs. Superpixels, on the other hand, group adjacent pixels with similar features, making them a useful preprocessing tool for HSIs: they reduce memory consumption while preserving the edge information of ground objects. Ren et al. [43] devised superpixel adjacency graph structures at various scales and adapted the structure at each scale using feature similarity and kernel diffusion, which facilitates high-level feature extraction while conserving computational resources. While that approach extracts multi-scale features by constructing graph structures of different scales simultaneously, our method derives multi-scale graph structures progressively through a spatial pooling module and then fuses features of different scales along a U-shaped structure.
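The superpixel-based graph construction described above can be sketched as follows, assuming SLIC as a stand-in segmenter, mean spectra as node features, and spatial adjacency between superpixels as edges; the segmentation algorithm and its parameters are illustrative and not necessarily those used in the cited works.

```python
# Hedged sketch of superpixel-based graph construction for an HSI cube:
# node = superpixel, feature = mean spectrum, edge = spatial adjacency.
import numpy as np
from skimage.segmentation import slic  # requires scikit-image >= 0.19 for channel_axis

def build_superpixel_graph(hsi_cube: np.ndarray, n_segments: int = 500):
    """hsi_cube: (H, W, B). Returns (segment map, node features, adjacency matrix)."""
    segments = slic(hsi_cube.astype(float), n_segments=n_segments,
                    compactness=0.1, channel_axis=-1, start_label=0)  # (H, W) superpixel labels
    n = segments.max() + 1
    feats = np.stack([hsi_cube[segments == s].mean(axis=0) for s in range(n)])  # mean spectrum per node
    adj = np.zeros((n, n), dtype=np.float32)
    # 4-neighbour pixel pairs; pairs whose labels differ mark adjacent superpixels
    right = np.stack([segments[:, :-1].ravel(), segments[:, 1:].ravel()], axis=1)
    down = np.stack([segments[:-1, :].ravel(), segments[1:, :].ravel()], axis=1)
    for a, b in np.vstack([right, down]):
        if a != b:
            adj[a, b] = adj[b, a] = 1.0
    return segments, feats, adj
```

Working on a few hundred superpixel nodes instead of hundreds of thousands of pixels is what makes graph convolution tractable for full HSI scenes.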
However, as discussed in [44], pixel-level graph nodes result in substantial computation. In response, a spatial-pooling-based graph attention neural network (SPGAU) is proposed for HSI classification. In this network, we use a superpixel-based method to preprocess the original HSI, greatly reducing the number of graph nodes. Moreover, to extract multi-scale spatial features from the HSI, we design a spatial pooling module that produces multi-scale graph structures. As the number of graph nodes decreases layer by layer, the computational burden gradually lightens. The primary contributions of this article are as follows:
1. We propose a novel spatial pooling module that emulates the process of superpixel merging and performs random merging of neighboring graph nodes that exhibit high similarity. This operation not only reduces computational costs but also yields a multi-scale graph structure (a minimal illustrative sketch of this idea follows this list).
2. We incorporate an attention mechanism into the graph convolution network to enhance the efficacy of feature extraction.
3. We employ a 1-D CNN to extract per-pixel image features, thereby mitigating the potential loss of pixel-level information caused by superpixel segmentation.
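The sketch below illustrates one plausible reading of the node-merging idea in contribution 1: spatially adjacent graph nodes whose features are highly similar are merged, in a random visiting order, into a coarser graph whose assignment matrix can later support unpooling in the U-shaped structure. This is an illustrative interpretation under stated assumptions (cosine-similarity threshold, hard assignment, mean feature pooling), not the exact SPGAU implementation described in Section 2.

```python
# Hedged sketch of similarity-based node merging (spatial pooling); thresholds,
# hard assignment, and mean pooling are assumptions made for illustration.
import numpy as np

def spatial_pool(feats: np.ndarray, adj: np.ndarray, sim_thresh: float = 0.95, seed: int = 0):
    """feats: (N, D) node features; adj: (N, N) binary adjacency. Returns pooled graph + assignment."""
    rng = np.random.default_rng(seed)
    n = feats.shape[0]
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    group = np.full(n, -1)                             # coarse-node id of each original node
    next_id = 0
    for i in rng.permutation(n):                       # random merging order
        if group[i] != -1:
            continue
        group[i] = next_id
        for j in np.flatnonzero(adj[i]):               # only spatially adjacent nodes may merge
            if group[j] == -1 and normed[i] @ normed[j] >= sim_thresh:
                group[j] = next_id
        next_id += 1
    assign = np.zeros((n, next_id), dtype=np.float32)  # (N, M) hard assignment matrix
    assign[np.arange(n), group] = 1.0
    pooled_feats = (assign.T @ feats) / assign.sum(axis=0, keepdims=True).T  # mean feature per merged node
    pooled_adj = np.clip(assign.T @ adj @ assign, 0.0, 1.0)
    np.fill_diagonal(pooled_adj, 0.0)
    return pooled_feats, pooled_adj, assign            # assign can drive unpooling in the U-shaped decoder
```

Stacking several such pooling steps yields progressively coarser graphs, which is what allows the network depth to be increased without the over-smoothing discussed in Section 5.1.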
The remainder of the paper is structured as follows: In Section 2, we introduce the proposed SPGAU model. Section 3 describes the three HSI classification datasets and the experimental settings. In Section 4, we present a systematic discussion of the experimental results. Finally, Section 5 presents the conclusions and provides insight into future research directions.
5. Discussion
In this section, we first discuss the influence of network depth on the proposed SPGAU, then analyze whether the classification performance of SPGAU remains competitive under different numbers of labeled samples, and finally briefly compare and analyze the computational efficiency of SPGAU.
5.1. Influence of Network Depth
Graph convolutional networks typically employ no more than three layers to prevent over-smoothing. Our SPGAU overcomes this limitation by incorporating a spatial pooling module, which allows the network depth to be increased while the spatial topology is simplified; this forms a strong basis for extracting deep feature information. To evaluate the impact of network depth on SPGAU's classification performance, we test network depths ranging from 1 to 6. As depicted in Figure 7, increasing the network depth leads to improved classification accuracy, particularly once the depth reaches 2, which suggests that combining graph convolution with image convolution effectively enhances overall performance. On the IP and WHU datasets, a deeper SPGAU yields better classification performance. For the Pavia University dataset, however, minimal improvement is observed beyond a depth of 2, mainly because of its discontinuous and fragmented land covers, which limit the efficacy of higher-level graph structures.
Beyond a depth of five layers, the gains in classification performance are no longer significant, leading to the conclusion that the optimal number of network layers is five.
5.2. Impact of the Number of Labeled Examples
Given the large number of pixels in an HSI, manual annotation imposes substantial costs. In this experiment, we examine whether SPGAU maintains robust classification capability with limited samples. Specifically, we set the training sample proportion per class to between 1% and 6% for the IP dataset and between 0.1% and 0.6% for the PU and WHU datasets. Each experiment is repeated 10 times, and we use the average overall accuracy (OA) to assess the performance of both our proposed SPGAU and the comparative algorithms. The experimental results for the three datasets are presented in Figure 8.
Our SPGAU algorithm attains the highest OAs across all datasets. OA improves as the number of labeled instances increases, suggesting that additional labeled data allow the model to acquire more useful information and enhance its training effectiveness. While the GCN-based method outperforms the other models in the early training stage, its OA does not rise significantly as more samples are added. Conversely, although the CNN-based methods exhibit lower OAs with fewer training samples, they show a consistent upward trend as the number of samples grows. This contrast underscores the distinct learning characteristics of the two model families; by combining their strengths, our SPGAU model achieves superior classification performance, further validating its effectiveness. HybridSN, an attention mechanism-based method, shows a significant increase in OA as the label proportion increases. However, it does not reach the level of the other methods, indicating that it may not be suitable for scenarios with few labeled samples. The MDGCN model secures the second-highest classification accuracy on both the IP and PU datasets, which can be attributed to its multi-scale mechanism: it learns diverse and informative features and remains adaptable even with limited labeled samples. Similarly, our SPGAU model, designed as a multi-scale feature extraction and fusion approach, demonstrates that incorporating diverse models and employing a multi-scale structure can enhance the effectiveness and robustness of HSI classification.
5.3. Running Time
We evaluate the model's time efficiency, a critical indicator of an algorithm's practicality. The experimental settings align with those outlined in Section 4.2. Table 9 reports the time cost (training and testing) of SPGAU and the compared methods on the three datasets. The machine learning-based SVM trains quickly but tests slowly, possibly because it lacks a network structure amenable to GPU acceleration. CNNs are limited by their extensive parameter counts, leading to significant time consumption; notably, for CNN-based deep learning methods, network depth is the primary factor influencing processing speed. For instance, SSRN, an extension of the 3D-CNN model that incorporates residual modules, has more layers and therefore longer training and testing times than 3D-CNN. Conversely, GCN-based methods do not exhibit an increase in time consumption as the dataset scale grows. Interestingly, our proposed SPGAU incurs higher time consumption than GCN and SSGCN on the Indian Pines dataset. Nevertheless, SPGAU delivers significantly better classification performance than these two methods. In addition, on large-scale datasets such as the PU dataset, SPGAU requires less time than GCN and SSGCN while achieving satisfactory results. Thus, our proposed spatial pooling module and graph attention convolution module effectively reduce redundant computation. Because superpixel preprocessing substantially reduces the graph's size, SPGAU remains efficient in terms of time consumption on large-scale datasets. In sum, our proposed SPGAU employs superpixel preprocessing to control the graph's scale, effectively reducing computational complexity.