1. Introduction
Hyperspectral images (HSIs) record complete spectral information for each pixel: a hyperspectral sensor captures the reflectance of an object in many consecutive spectral bands through hyperspectral imaging technology. Like RGB images, HSIs contain information about the shape, texture, and structure of the object [1], but they also contain a large amount of waveband information, which allows substances with similar colors but different spectral characteristics to be identified and differentiated. Thus, HSIs are widely used in scientific and industrial applications that require precise substance identification and analysis, such as medical imaging and diagnosis [2], geological and mineral exploration [3], environmental protection [4], agricultural crop monitoring [5], food safety monitoring [6], and military reconnaissance and security [7]. To fully exploit the value of HSIs, many subtasks have been derived, such as classification [8,9], target detection [10,11,12], and unmixing [13,14,15]. Among these tasks, land cover classification has received extensive attention.
When classifying objects in HSIs, the phenomenon that “the spectra of the same object may be different and the spectra of different objects may be the same” [16] means that it is not feasible to simply apply the methods used for RGB image classification to HSI classification. To address this challenge, researchers around the world have proposed various approaches, such as principal component analysis (PCA) [17], the Bayesian estimation method [18], the support vector machine (SVM) [19,20], and k-means clustering [21,22].
However, with the breakthrough of deep learning, convolutional neural networks (CNNs) are gradually replacing traditional HSI classification methods due to their stronger generalization ability and deeper feature representation, and CNN-based HSI classification has developed rapidly in recent years. For example, Hu et al. used a 1-D CNN [23] to extract spectral information. However, for the HSI classification task, spectral information alone is not enough to obtain accurate classification results, so Zhao et al. proposed a 2-D CNN [24] to extract spatial features. Since neither 1-D nor 2-D CNNs fully utilize the 3-D nature of HSIs, Chen et al. applied a 3-D CNN [25] to HSI classification in order to fuse the spatial–spectral features of HSIs, and their experiments showed improved model performance. Building on these results, many researchers have proposed hybrid convolutional methods [26,27,28,29,30,31,32]. Among them, Roy et al. proposed HybridSN [29], a linear-structured network containing three sequentially connected 3D convolutional layers that fuse spatial and spectral information, followed by one 2D convolutional layer that extracts local spatial features. In addition, Zhong et al. proposed SSRN [32], which introduces residual connections between convolutional blocks to promote backpropagation of the gradient. However, these convolution-based methods are limited by the convolutional kernel, which can only learn information within its coverage, thus restricting the representation of global features.
During this period, Dosovitskiy et al. proposed the vision transformer (ViT) [33], whose multi-head attention mechanism greatly alleviates the limited receptive field of convolution-based methods. Since then, a large number of methods combining ViT and CNN have appeared [33,34,35,36,37,38,39,40,41,42,43,44,45], such as SSTN [45], SSFTT [37], CTMixer [43], and 3DCT [39]. These methods combine the global feature perception of ViT-based methods with the local feature fusion of CNN-based methods. Thus, compared to methods [46,47,48,49] that use only CNNs to obtain spectral–spatial features, these mixed CNN–ViT methods extract more comprehensive spectral–spatial features thanks to the global attention mechanism.
However, these mixed CNN–ViT methods do not fully consider the characteristics of HSIs, especially the importance of edge features between different classes and spectral bands in the classification process. To enhance edge features, edge data augmentation methods are often employed. Traditional image edge data augmentation methods usually apply edge detection operators (e.g., the Laplacian, Canny, and Sobel operators) [50] directly to the original image to obtain edge information, which is then superimposed on the original image for subsequent model training. In HSI classification, however, the boundaries of the same object may differ across spectral bands, so processing the raw data in this way introduces a large amount of noise, which degrades subsequent classification performance. To minimize the effect of superimposed noise, Tu et al. applied edge-preserving filtering to the edge portion in their pyramid-structured MSFE [51], but MSFE does not take into account the fact that different spectral bands play different roles in the classification process.
Therefore, inspired by the above work, and in order to enhance the image features while weakening the noise interference of the initial HSI, we adopt a dynamic learning approach to obtain the edge information and the decision weights of different spectral bands. On this basis, we then use a mixture of attention mechanisms and CNNs to obtain global spectral–spatial features.
Figure 1 shows the proposed edge-aware and spectral–spatial feature extraction network. The network contains two parts: an edge feature augment block and a spectral–spatial feature extraction block. Unlike traditional data augmentation, which is not dynamically learnable, our edge feature augment block adaptively learns the degree of edge feature enhancement in different spectral bands, which reduces high-frequency noise. In addition, in the spectral attention block, we adaptively adjust the weights of different spectral bands for classification and then perform feature extraction on this basis. To sum up, the three main contributions are as follows:
We propose a novel feature extraction network (ESSN) with a richer and more efficient representation of edge features and spectral–spatial features than existing networks;
We design a novel edge feature augment block consisting of an edge-aware part and a dynamic adjustment part. Compared with edge data augmentation methods that are not dynamically learnable, this block greatly reduces edge distortion and noise amplification;
We propose a spectral–spatial feature extraction block. It contains a spectral attention block, a spatial attention block, and a 3D–2D hybrid convolution block. The spectral and spatial attention blocks obtain effective features by enhancing information favorable for classification and suppressing noise and other interference, and the convolution block fuses these features.
The subsequent sections are organized as follows. Our proposed method is described in Section 2. In Section 3, we describe our experimental environment and make a detailed comparison with other SOTA methods in the same environment. We perform sensitivity analysis and ablation experiments to verify the importance of each part of the model in Section 4. In Section 5, we summarize the paper and suggest directions for model improvement.
2. Methodology
Figure 1 shows the whole process of HSI classification. It consists of a data preprocessing block, the backbone of the proposed network, and a linear classifier.
Real objects are captured by a hyperspectral sensor to produce a hyperspectral image (HSI). Assume that the raw HSI is $\mathbf{X} \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and number of spectral bands of the raw HSI, respectively. Each pixel of the HSI can be represented by the vector $\mathbf{x} = (x_1, x_2, \ldots, x_C)$, where $x_C$ represents the pixel value on the $C$th spectral band. Obviously, the greater the number of spectral bands, the richer the information, but this greatly slows down computation. Therefore, we adopt the PCA technique to preprocess the HSI data for efficiency: the height and width are kept unchanged, and the number of spectral bands is reduced from $C$ to $P$. We denote the HSI after PCA dimensionality reduction as $\mathbf{X}_{\mathrm{PCA}} \in \mathbb{R}^{H \times W \times P}$, where $P$ denotes the number of spectral bands after the reduction. To obtain a suitable input format for the network, we crop the image into pixel-centered patches $\mathbf{Z} \in \mathbb{R}^{h \times w \times P}$, where $h$, $w$, and $P$ represent the height, width, and spectral number of the patch, respectively. The data preprocessing block is shown in Figure 1. Note that the same symbols appearing throughout this section carry the same meaning.
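To make the preprocessing concrete, the sketch below implements the two steps described above, PCA band reduction and pixel-centered patch extraction, in NumPy. It is a minimal illustration, not the implementation used in the experiments; the function names and the zero-padding at the image border are our own assumptions.

```python
import numpy as np

def pca_reduce(hsi, n_components):
    """Reduce the spectral dimension of an (H, W, C) cube to n_components via PCA."""
    H, W, C = hsi.shape
    flat = hsi.reshape(-1, C).astype(np.float64)   # one row per pixel
    flat -= flat.mean(axis=0)                      # center each band
    # Eigendecomposition of the C x C band covariance matrix
    cov = flat.T @ flat / (flat.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return (flat @ top).reshape(H, W, n_components)

def extract_patch(cube, row, col, size):
    """Crop a size x size patch centered on (row, col), zero-padding the border."""
    pad = size // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)))
    return padded[row:row + size, col:col + size, :]
```

In practice one would call `pca_reduce` once per image and then `extract_patch` once per labeled pixel to build the network's training set.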
The backbone of ESSN contains both an edge feature augment block and a global spectral–spatial feature extraction block, and we will describe the content of ESSN in as much detail as possible. Finally, we use a linear classifier to determine the class of each pixel.
2.1. Edge Feature Augment Block
As shown in Figure 2, the pixels that the model fails to predict are mostly at the intersections of different categories. This is due, on the one hand, to the high feature similarity between certain categories and, on the other, to the possible boundary blurring between certain categories [52].
Previously, edge data augmentation was usually used to strengthen edges and address the above problems. However, the direct superposition of edge information may produce strong edge noise, leading to confusion between similar categories. Therefore, we propose a novel edge feature augment block, as shown in Figure 1, which adaptively adjusts the model's emphasis on the edges of a region by learning the importance of the edge information in the input data, thereby personalizing the edge information.
Laplacian of Gaussian Operator
The Laplacian of Gaussian (LoG) operator is generated by convolving the Laplace operator with a Gaussian filter. The Laplace operator is particularly sensitive to regions of the image that change abruptly and therefore performs well in edge-awareness tasks. However, Gaussian noise is prevalent in images captured by electronic devices and seriously affects the accuracy of edge perception, so hyperspectral images need to be smoothed by Gaussian filtering before the edges are perceived. The Gaussian filter and the Laplace operator can be expressed by Equations (1) and (2), respectively:
$$G_{\sigma}(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right), \quad (1)$$
$$\nabla^{2} f(x, y) = \frac{\partial^{2} f(x, y)}{\partial x^{2}} + \frac{\partial^{2} f(x, y)}{\partial y^{2}}, \quad (2)$$
where $(x, y)$ denotes the spatial coordinate position in the HSI, $\sigma$ is the Gaussian standard deviation, and $f(x, y)$ represents the value of the pixel on the image.
Since convolution is associative, we use the result of convolving the Gaussian filter with the Laplace operator as a new edge-aware operator ($\mathrm{LoG}$) and then convolve the image with this operator to obtain the image edges. The $\mathrm{LoG}$ expression is shown in Equation (3):
$$\mathrm{LoG}(x, y) = \nabla^{2} G_{\sigma}(x, y) = -\frac{1}{\pi\sigma^{4}}\left(1 - \frac{x^{2}+y^{2}}{2\sigma^{2}}\right) \exp\!\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right). \quad (3)$$
Since hyperspectral images are discrete, we discretize Equation (3) to obtain an approximate $\mathrm{LoG}$ operator for practical use. As shown in Figure 3, we list the discrete $\mathrm{LoG}$ operators for two cases. Let the result after applying the edge-aware operator be $\mathbf{X}_{\mathrm{edge}}$, with the following expression:
$$\mathbf{X}_{\mathrm{edge}} = \mathrm{DWConv}_{\mathrm{LoG}}(\mathbf{X}),$$
where $\mathrm{DWConv}_{\mathrm{LoG}}$ indicates depthwise separable convolution with the $\mathrm{LoG}$ kernel.
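As an illustration, the following NumPy sketch samples the continuous LoG of Equation (3) on a discrete grid and applies it band by band, i.e., as a depthwise convolution. The kernel size, the zero-sum normalization, and the "same" zero padding are assumptions made for this example, not details fixed by the paper.

```python
import numpy as np

def log_kernel(size=5, sigma=1.0):
    """Sample the continuous Laplacian-of-Gaussian (Eq. 3) at integer offsets."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    r2 = x**2 + y**2
    k = -(1.0 / (np.pi * sigma**4)) * (1 - r2 / (2 * sigma**2)) \
        * np.exp(-r2 / (2 * sigma**2))
    return k - k.mean()   # zero-sum, so flat regions give zero edge response

def depthwise_conv2d(cube, kernel):
    """Convolve every spectral band of an (H, W, C) cube with the same 2D kernel
    ('same' zero padding; the LoG kernel is symmetric, so correlation == convolution)."""
    H, W, C = cube.shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(cube, ((ph, ph), (pw, pw), (0, 0)))
    out = np.zeros_like(cube, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + H, j:j + W, :]
    return out
```

Because the kernel sums to zero, the response vanishes on uniform regions and is large only where pixel values change abruptly, which is exactly the edge-awareness property discussed above.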
In the edge feature augment block, because of the characteristic that “the spectra of the same object may be different and the spectra of different objects may be the same”, strengthening the edge features at the same rate in all spectral bands would generate interference noise. We therefore design a learnable parameter $\boldsymbol{\beta}$ that adjusts the degree of feature augmentation in each spectral band; the importance of $\boldsymbol{\beta}$ is explored in Section 4.2. In addition, to make the network more flexible and the optimization process smoother and more efficient, we use a residual connection. The output $\mathbf{E}$ of the module is shown below:
$$\mathbf{E} = \mathbf{X} + \sigma(\boldsymbol{\beta}) \odot \mathbf{X}_{\mathrm{edge}},$$
where $\sigma$ indicates the sigmoid activation function and $\odot$ denotes the element-wise product at corresponding positions.
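A minimal sketch of the dynamic adjustment step, assuming the residual form "input plus sigmoid-gated edge response" described above. In the real network the per-band parameter would be learned by backpropagation; here it is just a plain array.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def edge_augment(x, x_edge, beta):
    """Residual edge enhancement: band c is strengthened by sigmoid(beta[c]).

    x, x_edge: (H, W, C) input patch and its edge response.
    beta:      (C,) per-band parameter (learnable in the real network).
    """
    gate = sigmoid(beta)                  # in (0, 1): per-band enhancement degree
    return x + gate[None, None, :] * x_edge
```

When `beta[c]` is driven strongly negative, band `c` passes through almost unchanged; when it is strongly positive, the full edge response is added, which is how the block can suppress edge noise in unreliable bands while keeping it in informative ones.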
2.2. Spectral–Spatial Feature Extraction Block
2.2.1. Spectral Attention Block
HSIs are rich in spectral information; to make this easier to see, we show each spectral band as a grayscale image in Figure 4. Obviously, different spectral bands have different importance in the decision-making process [53]. Spectral attention helps the model adaptively adjust the weights of the different bands and enhance their representation during learning, which suppresses the influence of task-irrelevant bands.
In the spectral attention block, to strengthen the correlation between encoded and decoded data, we use a residual connection. Let the input feature of the block be $\mathbf{F} \in \mathbb{R}^{h \times w \times P}$; then the results after global max pooling and global average pooling are $\mathbf{F}_{\max}$ and $\mathbf{F}_{\mathrm{avg}}$, respectively, where global max pooling is complementary to global average pooling:
$$\mathbf{F}_{\max} = \mathrm{MaxPool}(\mathbf{F}), \qquad \mathbf{F}_{\mathrm{avg}} = \mathrm{AvgPool}(\mathbf{F}).$$
To reduce the number of parameters, the pooled features are fed into a shared multi-layer perceptron (MLP), and the results $\mathbf{M}_{\max}$ and $\mathbf{M}_{\mathrm{avg}}$ are obtained. Let the dimensionality reduction ratio be $r$ and the weights of the two MLP layers be, in order, $\mathbf{W}_0$ and $\mathbf{W}_1$. $\mathbf{M}_{\max}$ and $\mathbf{M}_{\mathrm{avg}}$ are as follows:
$$\mathbf{M}_{\max} = \mathbf{W}_1(\mathbf{W}_0(\mathbf{F}_{\max})), \qquad \mathbf{M}_{\mathrm{avg}} = \mathbf{W}_1(\mathbf{W}_0(\mathbf{F}_{\mathrm{avg}})).$$
The adaptive spectral weights $\mathbf{W}_{\mathrm{spec}}$ of the input feature map are obtained by adding $\mathbf{M}_{\max}$ and $\mathbf{M}_{\mathrm{avg}}$ and passing the sum through the sigmoid activation function:
$$\mathbf{W}_{\mathrm{spec}} = \sigma(\mathbf{M}_{\max} + \mathbf{M}_{\mathrm{avg}}),$$
where $\sigma$ is the sigmoid activation function. Finally, let the output of the spectral attention block be $\mathbf{F}'$; the expression is as follows:
$$\mathbf{F}' = \mathbf{W}_{\mathrm{spec}} \odot \mathbf{F}.$$
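The pooling-and-shared-MLP pipeline above can be sketched in NumPy as follows. The ReLU between the two MLP layers and the exact weight shapes are assumptions typical of channel-attention designs; the weights are plain arrays standing in for learned parameters, and the block-level residual connection is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spectral_attention(f, w0, w1):
    """Channel (spectral) attention over an (H, W, C) feature map.

    w0: (C//r, C) and w1: (C, C//r) shared-MLP weights, reduction ratio r.
    """
    f_max = f.max(axis=(0, 1))            # global max pooling    -> (C,)
    f_avg = f.mean(axis=(0, 1))           # global average pooling -> (C,)
    mlp = lambda v: w1 @ np.maximum(w0 @ v, 0.0)   # shared MLP with ReLU
    weights = sigmoid(mlp(f_max) + mlp(f_avg))     # per-band weights in (0, 1)
    return f * weights[None, None, :]              # reweight each band
```

Because the two pooled vectors pass through the *same* MLP, the parameter count stays at roughly `2 * C * C // r` regardless of patch size, which is the parameter saving the text refers to.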
2.2.2. Spatial Attention Block
In contrast to traditional convolutional operations, which focus on only a portion of the input data, the spatial attention mechanism [54] can adaptively adjust the region of attention over the global spatial range of the input data and give these locations more weight during processing, thus improving the recognition accuracy and efficiency of the model.
The structure of the spatial attention part of the spatial attention block is illustrated in Figure 5. Considering that the spatial information of the same location may behave differently in different spectral bands, we first fuse the local spatial features by 2D convolution and then project the convolved feature maps to obtain $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, respectively:
$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}_K, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}_V.$$
Then, the attention map $\mathbf{A}$ can be calculated as follows:
$$\mathbf{A} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V},$$
where $d_k$ is the dimension of $\mathbf{K}$. Let $\mathbf{Y}$ denote the output of the network in Figure 5. Finally, we reshape $\mathbf{Y}$ to the size of $h \times w \times P$ for subsequent processing.
As shown in Figure 1, the spatial attention block contains two spatial attention parts; we use convolutional kernels of two different sizes in the two parts to enlarge the perceptual region. In addition, to strengthen the spatial representation, we use a residual structure.
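A minimal NumPy sketch of the scaled dot-product attention used in each spatial attention part, treating every spatial position of the feature map as a token. The projection matrices are random stand-ins for the learned ones, and the preceding 2D convolution and the residual structure are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(f, wq, wk, wv):
    """Scaled dot-product attention over spatial positions of an (H, W, C) map.

    wq, wk, wv: (C, d) projection matrices (learned in the real network).
    """
    H, W, C = f.shape
    tokens = f.reshape(H * W, C)              # one token per spatial position
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    d_k = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_k))    # (HW, HW): each row sums to 1
    out = attn @ v                            # every position attends to all others
    return out.reshape(H, W, -1)
```

Unlike a convolution, the `(HW, HW)` attention matrix lets any pixel draw information from any other pixel in the patch, which is exactly the global spatial range described above.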
2.2.3. 2D–3D Convolution
A 2D convolution layer can extract spatial features, and a 3D convolution layer can extract spectral features. Therefore, as shown in Figure 1, both 2D and 3D convolutions are used in the spectral–spatial feature extraction block. The 2D–3D convolution block includes three consecutive 3D convolutional layers with different kernels and one 2D convolutional layer. A detailed description follows.
In the 3D convolution layer, a single 3D convolution can be regarded as a 3D kernel sliding along the three dimensions (H, W, C). During the convolution process, the spatial and spectral information of neighboring bands are fused. The value of the $n$th feature map of the $m$th layer at the spatial location $(x, y, z)$ is as follows:
$$v_{m,n}^{x,y,z} = \varphi\!\left(b_{m,n} + \sum_{\tau=1}^{d_{m-1}} \sum_{h=0}^{H_m-1} \sum_{w=0}^{W_m-1} \sum_{r=0}^{R_m-1} w_{m,n,\tau}^{h,w,r}\, v_{m-1,\tau}^{x+h,\,y+w,\,z+r}\right), \quad (17)$$
where $\varphi$ is the activation function, and $b_{m,n}$ and $w_{m,n,\tau}^{h,w,r}$ are the bias and kernel weights corresponding to the $n$th feature map of the $m$th layer, respectively. $d_{m-1}$ indicates the number of feature maps in the $(m-1)$th layer, i.e., the depth of the kernel. The height, width, and spectral dimension of the kernel are $H_m$, $W_m$, and $R_m$, respectively.
In the 2D convolution layer, the kernel slides over the entire spatial extent, and the output of the convolution is the sum of the dot products between the kernel and the input data. During the convolution process, the information of different bands at the same spatial location is fully integrated. In 2D convolution, the value of the $n$th feature map of the $m$th layer at the spatial location $(x, y)$ is as follows:
$$v_{m,n}^{x,y} = \varphi\!\left(b_{m,n} + \sum_{\tau=1}^{d_{m-1}} \sum_{h=0}^{H_m-1} \sum_{w=0}^{W_m-1} w_{m,n,\tau}^{h,w}\, v_{m-1,\tau}^{x+h,\,y+w}\right), \quad (18)$$
where the parameters appearing in Equation (18) have the same meaning as in Equation (17).
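Equation (17) can be checked with a naive NumPy implementation that computes one output value by direct summation over the previous layer's feature maps and the kernel extent. The ReLU activation and the index layout are assumptions for this sketch; a real layer would of course vectorize the loops and slide over all locations.

```python
import numpy as np

def conv3d_single(prev_maps, weights, bias, x, y, z):
    """Value of one output feature map at location (x, y, z), as in Eq. (17).

    prev_maps: (D, H, W, C) feature maps of the previous layer (D of them).
    weights:   (D, kh, kw, kc) kernel for this output feature map.
    bias:      scalar bias term.
    """
    D, kh, kw, kc = weights.shape
    acc = bias
    for t in range(D):                 # sum over previous-layer feature maps
        for h in range(kh):            # ... over kernel height,
            for w in range(kw):        # ... width,
                for r in range(kc):    # ... and spectral depth
                    acc += weights[t, h, w, r] * prev_maps[t, x + h, y + w, z + r]
    return np.maximum(acc, 0.0)        # ReLU activation (assumed for phi)
```

The 2D case of Equation (18) is the same computation with the spectral-depth loop removed.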
4. Discussion and Analysis
4.1. Parametric Analysis
In this section, we perform a sensitivity analysis on three parameters: the patch size, the training ratio, and the choice of LoG operator, and explore their impact on model performance.
In Table 5, ESSN performs better on IP and PU when the patch size is 15 × 15 but does not perform as well on KSC as when the patch size is 19 × 19. Considering the computational cost, a patch size of 15 × 15 is selected as the optimal size. In addition, as the patch size increases, the OA on KSC keeps growing; combined with the full ground truth map of KSC, there are two reasons for this result. One is that larger patches contain more spatial information, from which the model can learn more key elements. The other is that, as the patches grow, longer-range edges are gradually incorporated into the model's observation range. By contrast, the ground truth maps of IP and PU show more intermingling of categories in these two datasets. Enlarging the patch yields a larger receptive field, which benefits the model, but at the same time introduces more noise and confusing information, which harms it. Therefore, on IP and PU the performance of ESSN first increases and then decreases as the patch grows, and from Table 5, a patch size of 15 × 15 is the turning point where performance goes from rising to falling.
As seen in Figure 12, and as expected, the performance of all models improves as the number of training samples increases, with OA gradually approaching 100%. ESSN has a large advantage when training samples are scarce; this advantage gradually shrinks as more samples are used for training, and eventually the performance of all models converges. All in all, ESSN outperforms the other models at all training ratios.
Figure 13 shows the performance in the different cases. Comparing ‘a’ with ‘c’, it is clear that applying a traditional data enhancement method, without processing the raw data through learnable parameters, does not perform as well as the edge feature augment block. In addition, comparing ‘a’ with ‘e’, traditional data augmentation does not have a positive effect; on the KSC dataset in particular, it greatly reduces classification performance. When the edge feature augment block is used, the performance of the different LoG operators is very close, and comparing ‘e’ with ‘b’, ‘c’, and ‘d’, the classification capability of the model improves when the edge feature augment block is added. Comparing ‘b’ with ‘c’, the performance gap between the two on all datasets used in the experiment is very small. The reason is that the LoG operators used for ‘b’ and ‘c’ are both discrete approximations, and the difference between them lies in the angle at which rotational invariance is satisfied: the operator corresponding to ‘b’ is invariant to rotations of 90°, while the operator corresponding to ‘c’ is invariant to rotations of 45°. In this study, the raw data are not rotationally transformed, so the difference between the two is not significant, and both are better than the case corresponding to ‘e’. ‘b’ performs best on KSC, and ‘c’ performs best on IP and PU. After comprehensive consideration, the LoG operator corresponding to ‘c’ is chosen.
4.2. Ablation Experiment
In this section, to exclude the influence of external factors, the training samples in the ablation experiments are kept the same as those in the experiments in Section 3, so that effects arising from hyperparameters and randomized training samples are excluded.
The PCA operation is used in the data preprocessing part, but it also causes a loss of spectral information when extracting the principal components of hyperspectral images. We therefore explore its effect on the overall performance of the model through ablation experiments on three datasets: IP, KSC, and PU. The experimental results are shown in Figure 14. When PCA is used to reduce the data to 50 dimensions, subplot (a) of Figure 14 shows that the classification accuracy with PCA differs little from that without PCA, while subplot (b) shows that PCA greatly reduces the time cost of the model. Considering both classification accuracy and time cost, we adopt PCA to preprocess the original data.
In addition, to analyze the effect of each component on model performance, a total of seven combinations were considered in this experiment. All results are shown in Table 6, in which the edge feature augment block, the spectral attention block, and the spatial attention block are abbreviated as “Edge Block”, “Spectral Block”, and “Spatial Block”, respectively.
Comparing ‘7’ with ‘1’ to ‘6’, the model performs best when all blocks are employed.
Comparing ‘7’ to ‘6’, ‘5’ to ‘3’, and ‘4’ to ‘2’, respectively, it can be seen that adding the edge feature augment block improves the model's performance on IP, KSC, and PU. Combined with the experimental conclusions about edges drawn in Section 4.1, the edge feature augment block differs from traditional edge data augmentation and plays a positive role in improving network performance. Comparing ‘7’ to ‘5’, ‘6’ to ‘3’, and ‘4’ to ‘1’, respectively, the model is strengthened on IP, KSC, and PU when the spectral attention block is added. Comparing ‘7’ to ‘4’, ‘6’ to ‘2’, and ‘5’ to ‘1’, adding the spatial attention block is likewise beneficial to the model's performance on IP, KSC, and PU.
Comparing ‘7’ with ‘1’ to ‘6’ shows that ‘7’ performs best on IP, KSC, and PU, which indicates that the combination of the edge feature augment block, the spectral attention block, and the spatial attention block is effective. Then, comparing ‘4’ with ‘1’ and ‘2’, ‘5’ with ‘1’ and ‘3’, and ‘6’ with ‘2’ and ‘3’, any combination of two blocks performs better than a single block.
Comparing ‘1’, ‘2’, and ‘3’, the model using only the edge feature augment block and the model using only the spectral attention block perform similarly on IP, KSC, and PU, while the model using only the spatial attention block does not perform as well. This is because the edge feature augment block and the spectral attention block both have a noise-removing effect, whereas using the spatial attention block alone feeds this invalid noise directly into the network, which adversely affects the model's capability.
Comparing ‘4’, ‘5’, and ‘6’, the models perform similarly on all datasets and better than those in ‘1’, ‘2’, and ‘3’. Comparing ‘3’ with ‘5’ and ‘6’, the performance of the model improves significantly, which clearly illustrates the need for edge feature augmentation and noise reduction of the HSI using the edge feature augment block and the spectral attention block.
Overall, the edge feature augment block and the spectral attention block play a great role in suppressing the noise in HSIs, and combining them with the spatial attention block will result in better performance than other combinations.
5. Conclusions
In this paper, a novel feature extraction network (ESSN) is proposed for efficiently extracting local edge features and global spectral–spatial features from HSIs. In ESSN, the edge feature augment block first performs edge awareness and selective feature enhancement more efficiently than traditional edge data augmentation, which uses operators with no learnable parameters. Secondly, because some spectral bands of an HSI contain a large amount of noise, different bands are not equally important for the classification decision, so we introduce the spectral attention block to enhance the effective bands and suppress the noise. Also, due to the geometric constraints of the convolution operation, we introduce spatial attention to model pixel–pixel interactions at all locations. Finally, we fuse the feature maps produced by the above methods through the 2D–3D convolution block to obtain the final feature representation. The experimental results show that ESSN performs competitively on the IP, KSC, and PU datasets.
Although ESSN performs well in HSI classification, further improvements are needed. In the future, we will pursue the following studies: