1. Introduction
Early remote sensing imaging typically exploited the electromagnetic wave reflection of objects to obtain their feature information, which yielded good imaging results for distant targets. However, the information carried by these reflections was mainly two-dimensional (2D) spatial information, which limited the detection and classification of observed objects. In the 1960s, the concept of hyperspectral remote sensing emerged. Compared with conventional remote sensing imaging, hyperspectral remote sensing obtains spectral information in addition to spatial information. The spectral information is one-dimensional and the spatial information is two-dimensional, so the original hyperspectral image (HSI) [
1] containing both is a three-dimensional data cube with multiple bands. Spectral information provides detailed descriptions of ground objects, including subtle differences between categories, and is therefore effective for feature recognition, which has made HSIs widely used in the collaborative classification of multi-source remote sensing data. In addition, the large amount of information and high spectral resolution of HSIs give them an increasingly important role in scenarios such as forest fire prevention, ocean exploration, environment detection [
2], and disaster assessment, making HSI processing a research hotspot.
Hyperspectral image classification is a common research direction in the field of HSI processing, which uses a large number of continuous narrow spectral band features and spatial features acquired during hyperspectral imaging to determine the category of each pixel. The earliest HSI classification method only extracted the spectral information of images as the basis for feature classification, such as the support vector machine (SVM) proposed by Melgani and Bruzzone [
3]. Using the label information in the training set, SVM maps input samples to points in a feature space and separates different categories effectively by searching for the optimal hyperplane, thus achieving efficient classification. However, SVM is limited by the low spatial resolution of HSIs and by its sensitivity to parameter selection. Principal component analysis (PCA), proposed in 1933 [
4]. Camps et al. [
5] initially used a hybrid kernel function to combine spectral and spatial features, which required complex calculations, so Tarabalka et al. [
6] proposed a clustering and watershed segmentation method in 2010 to improve spatial feature extraction and simplify the calculation. However, PCA may cause information loss and cannot retain the nonlinear correlations between samples, which often leaves insufficient feature information for classification.
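As a concrete illustration of the spectral dimensionality reduction discussed above, PCA projects the bands of an HSI cube onto the leading eigenvectors of the band covariance matrix. The sketch below is illustrative only (the function name and the toy cube are our own, not from the cited works):

```python
import numpy as np

def pca_reduce(hsi, n_components):
    """Reduce the spectral dimension of an HSI cube of shape (H, W, B) with PCA."""
    h, w, b = hsi.shape
    x = hsi.reshape(-1, b).astype(np.float64)
    x -= x.mean(axis=0)                        # center each band
    cov = x.T @ x / (x.shape[0] - 1)           # (B, B) band covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]   # leading principal axes
    return (x @ top).reshape(h, w, n_components)

# toy cube: 8x8 pixels, 20 spectral bands, reduced to 3 components
cube = np.random.default_rng(0).random((8, 8, 20))
reduced = pca_reduce(cube, 3)
print(reduced.shape)  # (8, 8, 3)
```

Note that such a linear projection cannot capture nonlinear correlations between samples, which is exactly the limitation pointed out above.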
Among the methods that focus on spatial feature processing, guided filtering has attracted increasing attention in recent years because it better optimizes the results of morphological filtering. Guided filtering builds a local linear model between a guidance image and the output image and minimizes the difference between the input and the output for implicit filtering, so that the edge information of the guidance image is extracted and emphasized. He et al. [
7] first proposed a guided filtering method and applied it to spatial feature extraction, whereas Li et al. [
8] added an edge-preserving filter and got better classification results. In 2019, Guo et al. [
9] proposed multi-scale spatial features and cross-domain convolutional neural networks (MSCNN), combining guided filtering with deep learning for spatial feature extraction. The multi-scale spatial features obtained by guided filtering were reordered for classification, which greatly improved the classification accuracy. However, traditional guided filtering often loses detailed information and extracts edge information insufficiently, which degrades the accuracy of classification results.
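The local linear model behind guided filtering can be sketched as follows. This is a minimal NumPy version of the classic formulation of He et al. [7] (the brute-force `box` mean filter and the parameter values are illustrative choices, not the adaptive variant proposed later in this paper):

```python
import numpy as np

def box(img, r):
    """Brute-force local mean over a (2r+1)x(2r+1) window, clipped at borders."""
    h, w = img.shape
    out = np.empty((h, w), dtype=np.float64)
    for i in range(h):
        i0, i1 = max(0, i - r), min(h, i + r + 1)
        for j in range(w):
            j0, j1 = max(0, j - r), min(w, j + r + 1)
            out[i, j] = img[i0:i1, j0:j1].mean()
    return out

def guided_filter(I, p, r=2, eps=1e-2):
    """Guided filter: assumes q = a*I + b in each local window."""
    mean_I, mean_p = box(I, r), box(p, r)
    var_I = box(I * I, r) - mean_I * mean_I
    cov_Ip = box(I * p, r) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)   # eps is the fixed regularization coefficient
    b = mean_p - a * mean_I
    return box(a, r) * I + box(b, r)

img = np.random.default_rng(0).random((16, 16))
smoothed = guided_filter(img, img, r=2, eps=0.04)  # self-guided smoothing
```

The fixed `eps` here is the regularization coefficient whose inflexibility motivates the adaptive guided filtering introduced later in this paper.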
With the help of deep learning (DL), feature extraction is greatly improved in HSI classification, where a convolutional neural network (CNN) is widely used. Zhao et al. [
10] added dimensionality reduction to the two-dimensional convolutional neural network (2DCNN) to obtain deeper spatial and spectral features for better classification results. Subsequently, the three-dimensional convolutional neural network (3DCNN) [
11], which outperformed 2DCNN, was proposed. 3DCNN divided the HSI into many neighborhood blocks as network inputs and fully extracted spatial-spectral features to improve classification, but it required a huge amount of computation and many training samples, which was problematic when labeled samples were limited. In 2017, Mou et al. [
12] proposed using time series networks, which were relatively simple and computationally cheap, for HSI classification, including the long short-term memory (LSTM) network. To make better use of spectral features for classification, spectral attention mechanisms were proposed. The squeeze-and-excitation network (SENet) [
13] squeezes each channel through a pooling operation and learns inter-channel attention weights with an activation function. Efficient channel attention (ECA) [
14] learns the attention weights with a one-dimensional convolution, while the convolutional block attention module (CBAM) [
15] combines traditional channel attention with spatial attention.
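The squeeze-and-excitation idea can be sketched in a few lines of NumPy. This is illustrative only: `w1` and `w2` stand in for SENet's learned bottleneck weights, and the reduction ratio of 4 is an arbitrary choice:

```python
import numpy as np

def se_attention(feat, w1, w2):
    """Squeeze-and-excitation channel attention on a (H, W, C) feature map."""
    z = feat.mean(axis=(0, 1))              # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)             # excitation: bottleneck + ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ s)))     # sigmoid -> channel weights in (0, 1)
    return feat * a                         # reweight each spectral channel

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))      # toy feature map, 16 channels
w1 = rng.standard_normal((4, 16))           # reduction ratio 4
w2 = rng.standard_normal((16, 4))
out = se_attention(feat, w1, w2)
print(out.shape)  # (8, 8, 16)
```

Because the weights pass through a sigmoid, each channel is scaled by a factor in (0, 1), emphasizing informative spectral channels and suppressing the rest.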
The end-to-end spectral-spatial residual network (SSRN) proposed by Zhong et al. [
16] used residual modules to extract continuous spatial and spectral features of HSIs separately and performed classification with the fused spatial-spectral features. The flaw of SSRN was that it considered only single-scale features and did not extract global features sufficiently. Song et al. [
17] proposed the deep feature fusion network (DFFN), which used residual learning to synthesize the outputs of different layers of the network but lost some spatial features and lacked contextual information. Based on the residual block, Mu et al. [
18] constructed 3D-2D alternating residual blocks to further extract multi-scale spatial and spectral features. Both the recently proposed hybrid spectral convolutional neural network (HybridSN) [
19] and the spectral-spatial feature tokenization transformer (SSFTT) [
20] adopted a residual block structure combining 2DCNN and 3DCNN for joint feature extraction, while SSFTT added a transformer module to represent high-level semantic features. However, SSFTT did not fully consider the details and edge information of HSIs. In 2021, an HSI spectral-spatial classification method based on deep adaptive feature fusion (SSDF) [
21] was proposed with a U-shaped network module for feature fusion. Although it combined multi-scale features well for better classification results, the generalization of its feature fusion operations still needed strengthening.
In recent years, many HSI classifiers have introduced Transformers [
22] and self-attention modules since Vision Transformer (ViT) [
23] was proposed in 2021. He et al. [
24] first applied the Transformer model to HSI classification and achieved good results. Hong et al. [
25] focused on improving spectral feature extraction in the Transformer, proposing Spectralformer to learn spectral sequence information between neighboring channels. Combining the Transformer with other deep learning models was also a good way to make full use of the two different architectures [
26]. Yang et al. [
27] modified the Transformer structure and used the convolution process as the embedding. The interactive spectral-spatial transformer (ISSFormer) [
28] combined the multi-head self-attention mechanism (MHSA) for spectral features with the local spatial perception mechanism (LSP) for spatial features and included a feature interaction mechanism between them. To fully explore the rich semantic information in HSIs, SemanticFormer [
29] was proposed to learn discriminative visual representations using semantic tokens.
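The multi-head self-attention mechanism mentioned above can be sketched as follows: a simplified NumPy version without the output projection or masking, where the weight matrices are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def mhsa(x, wq, wk, wv, num_heads):
    """Minimal multi-head self-attention over a token sequence x of shape (T, D)."""
    t, d = x.shape
    dh = d // num_heads
    q, k, v = x @ wq, x @ wk, x @ wv            # project to queries/keys/values
    out = np.empty_like(q)
    for h in range(num_heads):                   # attend independently per head
        s = slice(h * dh, (h + 1) * dh)
        scores = q[:, s] @ k[:, s].T / np.sqrt(dh)
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
        out[:, s] = attn @ v[:, s]
    return out

rng = np.random.default_rng(0)
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
tokens = rng.standard_normal((5, 8))             # e.g. 5 spectral tokens of dim 8
y = mhsa(tokens, wq, wk, wv, num_heads=2)
print(y.shape)  # (5, 8)
```

In spectral Transformers such as Spectralformer, the tokens would be derived from neighboring spectral channels rather than image patches.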
So far, some intractable problems remain in existing DL-based methods. The low spatial resolution of HSIs often means few details of ground objects, so it is generally difficult to obtain rich and detailed spatial features, which limits the final classification performance. How to improve the spatial resolution of HSIs and enhance their spatial features for better classification remains an open problem. Moreover, HSIs have dozens to hundreds of continuous bands containing abundant spectral information. By extracting spectral features and capturing the subtle spectral differences between bands, classification accuracy can be further improved. However, the strong correlation among spectral bands and their large number make spectral feature extraction prone to noise and information redundancy, and dimensionality reduction of HSIs may discard boundary information, which is not conducive to good classification results. In addition, only a small number of hyperspectral samples are usually available for training, since labeling requires time and effort, whereas good DL performance demands massive training data. Because of this contradiction, a model that achieves good classification performance with fewer training samples is needed.
To address the problems above, this paper proposes an improved feature enhancement and extraction model (IFEE) using spatial feature enhancement and attention-guided bidirectional sequential spectral feature extraction for hyperspectral image classification. Through adaptive guided filtering, the subtle features and edge information of the input image are preserved and enhanced in the filtered image, which helps make the categories more distinguishable. With the introduction of a spatial feature enhancement module composed of 2DCNNs, IFEE improves the resolution of the image after adaptive guided filtering and provides a high-resolution image with key features emphasized for the subsequent feature extraction module, which significantly enhances spatial feature extraction and effectively improves classification performance. In addition, we adopt the multi-scale and multi-level feature extraction (MMFE) module proposed in SSDF for spatial feature extraction to combine deep and shallow convolution features, adding 2D convolutions with different kernels after each pooling layer to obtain multi-level spatial features. Considering the effect of noise and information redundancy on spectral feature extraction, a spectral attention mechanism combined with bidirectional sequential spectral feature extraction is designed to emphasize significant and representative spectral features and suppress noise interference, which increases the classification accuracy. We also discuss the performance of the proposed IFEE with fewer training samples. The main contributions of this paper are as follows:
- (1)
As the effect of traditional guided filtering is limited by its fixed regularization coefficient, we introduce an adaptive guided filtering module to preserve and enhance the subtle features and edge information of the input HSIs, which may otherwise be lost in dimensionality reduction operations.
- (2)
The low spatial resolution of HSIs makes it hard to obtain sufficient spatial features and details of ground objects, thus limiting classification performance. In response, we propose a lightweight image enhancement module composed of 2DCNNs to improve the spatial resolution of the feature map after adaptive guided filtering and to enhance spatial features, preparing more detailed spatial information for feature extraction.
- (3)
To reduce noise and mitigate the impact of data redundancy in the spectral dimension, we design a spectral attention mechanism that creatively takes spatial patches and sequences randomly selected from the original HSI as two different inputs, which emphasizes representative spectral channels while suppressing noise. The spectral attention mechanism is combined with bidirectional sequential spectral feature extraction, which further improves the classification accuracy.
- (4)
Considering that labeled samples are limited in reality, we discuss the classification performance of the proposed IFEE with a small number of training samples. Our method achieves the best classification results on three datasets with only five labeled samples per class for training.