1. Introduction
A Hyperspectral Image (HSI) is a three-dimensional data cube composed of hundreds of continuous spectral bands, which contains rich spectral–spatial information and is very helpful for ground object recognition. Therefore, HSI classification has been widely applied in environmental monitoring [1,2], mineral exploration [3], precision agriculture [4,5] and other fields.
In the early stages of HSI classification research, most methods focused mainly on the utilization of spectral features, such as kernel-based support vector machines [6], multinomial logistic regression [7,8] and random subspaces [9,10]. However, these methods only consider spectral information and ignore spatial features, making it difficult to obtain good classification performance.
As deep-learning-based methods became widely applied and achieved excellent results in image classification [11,12], semantic segmentation [13] and natural language processing [14], researchers began to introduce them into HSI classification [15,16,17] and proposed many classification methods based on Convolutional Neural Networks (CNNs) [18,19,20]. Hu et al. [21] proposed Deep Convolutional Neural Networks (DCNNs), which used multiple 1D-CNNs to extract spectral features and improve the classification performance. Li et al. [22] adopted 3D-CNNs to effectively extract spectral–spatial features, thereby improving the classification performance. Since then, more deep learning methods based on spectral–spatial feature extraction have been used for HSI classification. Zhong et al. [23] designed an end-to-end Spectral–Spatial Residual Network (SSRN), which used consecutive residual blocks to learn spectral and spatial features separately, so as to extract more discriminative features. Roy et al. [24] proposed the Hybrid Spectral CNN (HybridSN) by combining the characteristics of 3D-CNN and 2D-CNN, which reduced the model's complexity and obtained satisfactory performance. Mu et al. [25] designed a U-shaped deep network model with principal component features as the model input and spatial edge features as the model label, which realized the adaptive fusion of these two features; the fused features were combined with the spectral features extracted by a Long Short-Term Memory (LSTM) model for spectral–spatial feature classification. To fully exploit the spectral–spatial features of HSIs, Huang et al. [26] proposed a Dual-Branch Attention-Assisted CNN (DBAA-CNN), which could extract sufficiently diverse information and achieve higher classification accuracy. Lu et al. [27] proposed a new dual-branch network structure, where the two branches learned pixel-level spectral features and patch-level spectral–spatial features, respectively, and the features from the two branches were then combined to further enhance classification performance.
In order to obtain more abundant local spatial information, various classification methods based on multiscale feature extraction have been proposed. Yu et al. [28] proposed a Dual-Channel Convolution Network (DCCN) to maximize the use of global and multiscale information from HSIs. Zhang et al. [29] proposed a Multiscale Dense Network (MSDN), which made full use of information at different scales in the network to realize deep feature extraction and multiscale feature fusion. To utilize the correlation information between different levels, Song et al. [30] proposed a Deep Feature Fusion Network (DFFN), which introduced residual learning to alleviate the overfitting problem and fused the features of different levels to improve the classification accuracy.
Recently, a large number of studies [31,32,33] have shown that different spectral bands and spatial pixels contribute differently to HSI classification tasks, and that highlighting bands and pixels rich in effective information through the attention mechanism can significantly improve HSI classification performance. Sun et al. [34] proposed a Spectral–Spatial Attention Network (SSAN). Firstly, a simple Spectral–Spatial Network (SSN) was constructed to extract spectral–spatial features; then, attention modules were embedded into the SSN to suppress interfering pixels. SSAN achieved good results on three classical datasets, but the low computational efficiency of the attention module made it time consuming to train. Lei et al. [35] proposed a Local Attention Network (LANet) to improve the semantic segmentation of HSIs by enhancing the scene-related representation in the encoding and decoding stages, which greatly improved the semantic representation of low-level features and further improved the segmentation performance. In addition, Transformers have also begun to be used in HSI classification due to their ability to model global features of images. Hong et al. [36] used Transformers to rethink the HSI classification process from a sequence perspective and proposed a new backbone network, SpectralFormer, which achieved high performance in the HSI classification task. Sun et al. [37] proposed a Spectral–Spatial Feature Tokenization Transformer (SSFTT) to capture spectral–spatial features and high-level semantic features; the encoder module of the Transformer was introduced into the network for feature representation and learning, which achieved good classification results and greatly improved the computational efficiency.
HSI classification is a pixel-level classification task, and detail information such as edges and shapes is crucial to improving the classification accuracy. However, general deep-learning-based HSI classification models usually focus only on deep semantic features for classification and ignore shallow features, which is not conducive to further improvement of classification performance. The Feature Pyramid Network (FPN) [38] embedded high-level features rich in semantic information into shallow features rich in detail information through a top-down path, so that features at all levels had rich semantic information, and it achieved good results in object detection [39,40], instance segmentation [41] and other computer vision fields. Based on the FPN, Wang et al. [42] proposed an FPN with dual-filter feature fusion for HSI classification: enhanced multiscale features were obtained by embedding dual-filter feature fusion modules in each lateral branch of an FPN, and the final feature representation obtained by fusing the features of each level from top to bottom was used for classification, which achieved good performance. Fang et al. [43] used a convolutional attention module in bottom-up feature extraction to extract effective information, and then used a bidirectional pyramid for instance segmentation of HSIs. Chen et al. [44] introduced coordinate attention in each lateral branch to obtain more HSI features, and then added and fused the features of each level of the FPN to achieve effective HSI classification with small samples.
Inspired by the idea of the FPN, this article proposes a Feature Embedding Network with Multiscale Attention (MAFEN) to make full use of both deep and shallow features through bottom-up feature extraction and top-down feature embedding. Firstly, a Multiscale Attention Module (MAM) is designed to express rich information for features at different levels. The MAM first uses convolutional kernels with different receptive field sizes to extract multiscale information, and then uses spectral–spatial attention to suppress redundant information at each scale, so as to highlight the bands and pixels rich in effective information. Secondly, deep semantic information is embedded into the shallow features through a top-down channel to enhance the representation ability of the features at different levels. Finally, an Adaptive Spatial Feature Fusion (ASFF) [45] strategy is introduced to automatically learn the fusion weight of each feature map through the network, so as to realize the adaptive fusion of features at different levels.
The main contributions of this article are as follows:
The MAM is designed to enhance the representation ability of features at different levels. Firstly, multiscale convolution is used to obtain rich information representation, and then the attention mechanism is used to highlight important information.
The ASFF strategy is introduced for feature fusion in HSIs to adaptively fuse features of different levels and improve classification performance.
The MAFEN is proposed, where the deep features are embedded into the shallow features through the top-down channel to enrich their semantic information, and the shallow features are adaptively fused with features at other levels.
The rest of this article is organized as follows: the MAFEN method is described in detail in Section 2, Section 3 presents the experiments and analysis, and Section 4 concludes the article.
2. The Proposed Method
In this section, our proposed MAFEN for HSI classification is described in detail; its overall framework is shown in Figure 1. Firstly, the MAFEN backbone network used 3D-CNN and 2D-CNN to extract features of different depths from the dimensionality-reduced hyperspectral images. Secondly, the MAM was designed to enhance the representation ability of features at different levels through multiscale convolution, and a spectral–spatial attention mechanism was used to highlight important information and suppress redundant information. Then, high-level semantic information was embedded into low-level local spatial information through a top-down channel so that features at different levels had rich semantics. Finally, ASFF was introduced to adaptively fuse the features of different levels to obtain the final feature representation for classification.
2.1. Multiscale Attention Module
CNNs are limited by fixed-size receptive fields, which may result in insufficient local spatial features. To obtain richer local information from features at different levels, a multiscale approach can be used to vary the sizes of the convolutional kernels, thus obtaining different receptive fields. Moreover, the feature maps may contain redundant information that degrades the representation and thereby affects the final classification results. Therefore, we utilized spectral–spatial attention to extract crucial information from the features obtained by the multiscale convolutions. Based on these considerations, we designed the MAM, which combines multiscale convolutions and spectral–spatial attention to obtain richer and more effective feature representations.
Figure 2a,b illustrates the overall framework of the MAM and the structure of the spectral–spatial attention module, respectively, as described below.
As shown in Figure 2, the MAM first convolved the input feature $\mathbf{F}_i$ ($i = 1, 2, 3$) with three convolutional kernels of different sizes to obtain multiscale information, where $\mathbf{F}_1$, $\mathbf{F}_2$ and $\mathbf{F}_3$ represent the extracted low-level, mid-level and high-level features, respectively. Then, spectral–spatial attention modules were employed to extract effective information from the features produced by each convolutional kernel, where spectral attention and spatial attention were cascaded. Finally, the three features were fused by element-wise summation.
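To make the branch structure concrete, the following PyTorch sketch shows one possible implementation of the MAM described above. The kernel sizes (1, 3 and 5) are illustrative assumptions, and the `SpectralAttention` and `SpatialAttention` modules are the ones sketched in Section 2.1.1 and Section 2.1.2 below; this is a sketch, not the exact configuration of the original model.

```python
import torch.nn as nn

class MAM(nn.Module):
    """Multiscale Attention Module sketch: parallel convolutions with different
    receptive fields, each followed by cascaded spectral and spatial attention;
    the branch outputs are fused by element-wise summation.
    Kernel sizes (1, 3, 5) are illustrative assumptions."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2),
                SpectralAttention(channels),   # cascaded attention, see Section 2.1.1
                SpatialAttention(),            # see Section 2.1.2
            )
            for k in kernel_sizes
        ])

    def forward(self, x):                      # x: (B, C, H, W) feature of one level
        out = self.branches[0](x)
        for branch in self.branches[1:]:
            out = out + branch(x)              # element-wise summation fusion
        return out
```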
2.1.1. Spectral Attention
The main purpose of spectral attention is to generate band weights to recalibrate the importance of each spectral band. Considering that the patch block may contain pixels from other classes, using global average pooling may introduce interference with respect to the current class. Therefore, we only used the center vector to generate the band weight $\mathbf{w}$.
The specific structure of the spectral attention module is shown in Figure 3. Firstly, the center vector $\mathbf{v} \in \mathbb{R}^{1 \times 1 \times b}$ was taken from the input cube $\mathbf{F} \in \mathbb{R}^{s \times s \times b}$, where $s \times s$ was the spatial size of $\mathbf{F}$ and $b$ was the number of bands. Then, the band weight $\mathbf{w}$ was obtained through the calculation of two convolutional layers, as shown in Equation (1):

$$\mathbf{w} = \sigma\left(\mathbf{W}_2 * \delta\left(\mathbf{W}_1 * \mathbf{v}\right)\right) \quad (1)$$

where $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively, $\mathbf{W}_1$ and $\mathbf{W}_2$ are the weight parameters of the two convolutional layers, and $*$ represents the convolution operation. Finally, as shown in Figure 2b, the band weight $\mathbf{w}$ was used to recalibrate the bands in the feature $\mathbf{F}$ to highlight the useful spectral information, using Equation (2):

$$\mathbf{F}' = \mathbf{w} \odot \mathbf{F} \quad (2)$$

where $\odot$ represents element-wise multiplication.
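As a sketch, the spectral attention described by Equations (1) and (2) can be written in PyTorch as below; the 1 × 1 convolutions and the reduction ratio `r` are illustrative assumptions, since the exact layer configuration is not fixed by the text.

```python
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Band-weight generation from the centre vector only (Equations (1)-(2)).
    The 1x1 kernels and the reduction ratio r are illustrative assumptions."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),                  # delta in Equation (1)
            nn.Conv2d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),                           # sigma in Equation (1)
        )

    def forward(self, x):                           # x: (B, C, H, W), bands as channels
        h, w = x.shape[2], x.shape[3]
        center = x[:, :, h // 2, w // 2]            # centre vector v, not global pooling
        weight = self.fc(center[:, :, None, None])  # band weight w, shape (B, C, 1, 1)
        return x * weight                           # recalibration, Equation (2)
```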
2.1.2. Spatial Attention
Spatial attention aims to enhance the spatial information of pixels belonging to the same class as the central pixel, while suppressing pixels of other classes. Therefore, the spatial weight $\mathbf{M}$ should have the same width and height as the input feature $\mathbf{F}'$, with a specific structure as shown in Figure 4. Firstly, global max pooling was applied to the input feature $\mathbf{F}'$ along the channel direction, as shown in Equation (3):

$$\mathbf{G}(x, y) = \max_{1 \le k \le b} \mathbf{F}'(x, y, k) \quad (3)$$

where $\mathbf{F}'(x, y, k)$ represents the value at position $(x, y, k)$ in the feature $\mathbf{F}'$, $\max$ represents taking the maximum value along the channel direction and $\mathbf{G}$ is the feature map after global max pooling. Then, $\mathbf{G}$ was passed through two 2D convolutional layers to generate the spatial weight $\mathbf{M}$, as shown in Equation (4):

$$\mathbf{M} = \sigma\left(\mathbf{W}_4 * \delta\left(\mathbf{W}_3 * \mathbf{G}\right)\right) \quad (4)$$

where $\mathbf{W}_3$ and $\mathbf{W}_4$ are the weight parameters of the two convolutional layers, $\sigma$ and $\delta$ represent the sigmoid and ReLU activation functions, respectively, and $*$ denotes the convolution operation. Finally, as shown in Figure 2b, the spatial weight $\mathbf{M}$ was used to recalibrate the spatial information in the feature $\mathbf{F}'$ and highlight the useful spatial information, using Equation (5):

$$\mathbf{F}'' = \mathbf{M} \odot \mathbf{F}' \quad (5)$$

where $\odot$ represents element-wise multiplication.
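A corresponding sketch of the spatial attention in Equations (3)–(5) follows; the hidden width and the 3 × 3 kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial-weight generation via channel-wise global max pooling and two
    2D convolutions (Equations (3)-(5)). Hidden width and kernel size are
    illustrative assumptions."""
    def __init__(self, hidden=8, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size, padding=pad),
            nn.ReLU(inplace=True),                     # delta in Equation (4)
            nn.Conv2d(hidden, 1, kernel_size, padding=pad),
            nn.Sigmoid(),                              # sigma in Equation (4)
        )

    def forward(self, x):                              # x: (B, C, H, W)
        pooled, _ = torch.max(x, dim=1, keepdim=True)  # Equation (3): max over channels
        weight = self.conv(pooled)                     # spatial weight M, same H x W
        return x * weight                              # recalibration, Equation (5)
```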
2.2. Feature Embedding Network
Deep neural networks learn the fine-grained features of local objects in HSIs in shallow layers, and high-level semantic features in deep layers. However, during the deep learning process, shallow features are often lost or even disappear, so they are generally not involved in the final HSI classification. In addition, different-depth features have different levels of information representation, and fully utilizing information at different levels is beneficial to improving the effectiveness of HSI classification. In this article, we propose a new Multiscale Attention Feature Embedding Network. The backbone of MAFEN consists of a spectral–spatial feature extraction channel and a deep feature embedding channel. The detailed description of the MAFEN is as follows.
Let $\mathbf{X} \in \mathbb{R}^{w \times h \times b}$ represent the original HSI data, where $w$ and $h$ represent the width and height of the spatial dimension, respectively, and $b$ is the number of spectral bands. Each pixel in $\mathbf{X}$ corresponds to a one-hot label vector $\mathbf{y} \in \mathbb{R}^{C}$, where $C$ is the number of land cover classes. HSIs have rich spectral information, which leads to a large number of spectral dimensions and increased computational complexity, and they may also contain noise that interferes with classification. Dimensionality reduction with Principal Component Analysis (PCA) can improve classification accuracy by removing noise and redundant information, and can also reduce computation time and resource consumption, making deep learning models more efficient. Therefore, PCA is commonly used to process HSI data. PCA reduces the number of spectral bands from $b$ to $l$, while maintaining the spatial size of the HSI. The resulting reduced-dimensional HSI data are represented as $\mathbf{X}' \in \mathbb{R}^{w \times h \times l}$, where $l$ is the number of reduced spectral bands. To fully leverage the spectral and spatial information provided by the HSI, a set of cubes $\mathbf{P} \in \mathbb{R}^{s \times s \times l}$ is extracted from $\mathbf{X}'$, where $s \times s$ represents the spatial size of the patch blocks in the HSI cube. The center pixel of each patch is denoted as $\mathbf{p}$, and the true label of each patch is determined by the label of the center pixel.
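The preprocessing just described can be sketched as follows, assuming scikit-learn's `PCA` and zero padding at the image borders; the number of retained components and the patch size are illustrative values, not those used in the experiments.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_and_extract_patches(X, num_components=30, patch_size=11):
    """X: HSI cube of shape (w, h, b). Returns PCA-reduced patches centred on
    every pixel, with shape (w * h, patch_size, patch_size, num_components)."""
    w, h, b = X.shape
    # PCA over the spectral dimension: (w*h, b) -> (w*h, l)
    X_reduced = PCA(n_components=num_components).fit_transform(X.reshape(-1, b))
    X_reduced = X_reduced.reshape(w, h, num_components)
    # Zero-pad the borders so every pixel has a full patch around it
    m = patch_size // 2
    X_padded = np.pad(X_reduced, ((m, m), (m, m), (0, 0)), mode="constant")
    patches = np.empty((w * h, patch_size, patch_size, num_components),
                       dtype=X_reduced.dtype)
    idx = 0
    for i in range(w):
        for j in range(h):
            patches[idx] = X_padded[i:i + patch_size, j:j + patch_size, :]
            idx += 1
    return patches
```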
(1) Feature Extraction Channel: Given the $i$th feature $\mathbf{F}_i$, $i = 0, 1, 2, 3$, where $\mathbf{F}_0$ represents the cube corresponding to the HSI input data and $\mathbf{F}_1$, $\mathbf{F}_2$ and $\mathbf{F}_3$ represent low-level, mid-level and high-level features, respectively, the feature map $\mathbf{F}_{i+1}$ is obtained by applying two layers of convolutions (3D-CNN and 2D-CNN) and residual connections to each feature map $\mathbf{F}_i$ in a bottom-up manner, as shown in Equations (6) and (7):

$$\hat{\mathbf{F}}_i = \delta\left(\mathrm{BN}\left(f_{2d}\left(\delta\left(\mathrm{BN}\left(f_{3d}\left(\mathbf{F}_i; \mathbf{W}_i^{3}\right)\right)\right); \mathbf{W}_i^{2}\right)\right)\right) \quad (6)$$

$$\mathbf{F}_{i+1} = \mathrm{MaxPool}\left(\hat{\mathbf{F}}_i + \mathbf{F}_i\right) \quad (7)$$

where $f_{3d}(\cdot)$ represents a 3D convolution with weight parameter $\mathbf{W}_i^{3}$, and $f_{2d}(\cdot)$ represents a 2D convolution with weight parameter $\mathbf{W}_i^{2}$. $\mathrm{BN}(\cdot)$ stands for batch normalization, $\delta$ represents the activation function, which is ReLU here, and $\mathrm{MaxPool}(\cdot)$ denotes the max pooling function.
The 3D-2D convolution is used to extract spectral–spatial features from the HSI data, resulting in three features with different levels of information. High-level features contain rich semantic information, while low-level features capture fine-grained local spatial information.
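One bottom-up level of this channel could be sketched as follows; the kernel sizes, channel counts and the way the 3D convolution output is folded back into 2D feature maps are not specified in the text above and are assumptions for illustration only.

```python
import torch.nn as nn

class ExtractionLevel(nn.Module):
    """One bottom-up level: a 3D convolution over the spectral-spatial cube, a
    2D convolution, a residual connection and max pooling, in the spirit of
    Equations (6) and (7). All hyperparameters are illustrative assumptions."""
    def __init__(self, in_channels, out_channels, depth_kernel=7):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(depth_kernel, 3, 3),
                      padding=(depth_kernel // 2, 1, 1)),
            nn.BatchNorm3d(8),
            nn.ReLU(inplace=True),
        )
        self.conv2d = nn.Sequential(
            nn.Conv2d(8 * in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # 1x1 projection so the residual connection matches the channel count
        self.shortcut = (nn.Conv2d(in_channels, out_channels, kernel_size=1)
                         if in_channels != out_channels else nn.Identity())
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x):                        # x: (B, C, H, W)
        y = self.conv3d(x.unsqueeze(1))          # treat channels/bands as 3D depth
        y = self.conv2d(y.flatten(1, 2))         # fold 3D output into 2D maps
        y = y + self.shortcut(x)                 # residual connection
        return self.pool(y)                      # downsample for the next level
```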
(2) Deep Feature Embedding Channel: Multiscale attention was applied to the features $\mathbf{F}_1$, $\mathbf{F}_2$ and $\mathbf{F}_3$ of different depths in three branches to extract effective spectral–spatial information, thereby enhancing the classification performance; the attention-refined features are denoted as $\mathbf{A}_1$, $\mathbf{A}_2$ and $\mathbf{A}_3$. Then, starting from the top level (where $\mathbf{P}_3 = \mathbf{A}_3$), transpose convolution was applied to the deeper embedded feature $\mathbf{P}_{i+1}$ to complete upsampling and obtain $\mathbf{U}_{i+1}$, as shown in Equation (8):

$$\mathbf{U}_{i+1} = f_t\left(\mathbf{P}_{i+1}; \mathbf{W}_{i+1}^{t}\right) \quad (8)$$

where $f_t(\cdot)$ represents the transpose convolution with weight parameter $\mathbf{W}_{i+1}^{t}$. As a result, $\mathbf{U}_{i+1}$ has the same spatial resolution as $\mathbf{A}_i$. Next, $\mathbf{U}_{i+1}$ and $\mathbf{A}_i$ were added together for fusion, and the fused features were convolved as shown in Equation (9):

$$\mathbf{P}_i = f\left(\mathbf{U}_{i+1} \oplus \mathbf{A}_i; \mathbf{W}_i\right) \quad (9)$$

where $f(\cdot)$ represents the convolution operation with weight parameter $\mathbf{W}_i$, and $\oplus$ represents element-wise addition for fusion. Through the above process, high-level features can be embedded into low-level features, enhancing the feature representation capability of the model.
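A minimal sketch of one top-down embedding step is given below, assuming a stride-2 transposed convolution for the upsampling in Equation (8) and a 3 × 3 convolution after the fusion in Equation (9); both choices are assumptions.

```python
import torch.nn as nn

class EmbeddingStep(nn.Module):
    """One top-down step: upsample the deeper embedded feature with a transposed
    convolution, add it to the attention-refined shallower feature, then apply a
    convolution (Equations (8)-(9)). Kernel sizes and stride are assumptions."""
    def __init__(self, deep_channels, shallow_channels):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(deep_channels, shallow_channels,
                                           kernel_size=2, stride=2)
        self.smooth = nn.Conv2d(shallow_channels, shallow_channels,
                                kernel_size=3, padding=1)

    def forward(self, deep, shallow):            # shallow: MAM-refined feature A_i
        up = self.upsample(deep)                 # Equation (8): match spatial size
        return self.smooth(up + shallow)         # Equation (9): add, then convolve
```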
2.3. Adaptive Spatial Feature Fusion
In contrast to conventional feature fusion strategies, ASFF can learn the fusion weights for each feature map automatically through the network, achieving adaptive fusion. The specific structure is shown in Figure 5.
Firstly, the three different-level features $\mathbf{P}_1$, $\mathbf{P}_2$ and $\mathbf{P}_3$ were concatenated along the channel dimension to obtain the feature $\mathbf{C}$. Then, a convolution operation was applied to change the channel length, as shown in Equation (10):

$$\boldsymbol{\alpha} = \delta\left(f\left(\mathbf{C}; \mathbf{W}_c\right)\right) \quad (10)$$

where $f(\cdot)$ represents a 2D convolution with weight parameter $\mathbf{W}_c$ and $\delta$ is the ReLU activation function. The resulting $\boldsymbol{\alpha}$ from the convolution operation has three channels, one for each level, and the same spatial size as the features to be fused. To obtain the feature fusion weights $\boldsymbol{\beta}$ of the same size, the Softmax function was applied to normalize the exponential function of the data along the channel direction of $\boldsymbol{\alpha}$ at the same position, as shown in Equation (11):

$$\boldsymbol{\beta}_k(x, y) = \frac{e^{\boldsymbol{\alpha}_k(x, y)}}{\sum_{j=1}^{3} e^{\boldsymbol{\alpha}_j(x, y)}} \quad (11)$$

where $\boldsymbol{\alpha}_k(x, y)$ represents the value of the $k$th channel of the feature $\boldsymbol{\alpha}$ at position $(x, y)$. Therefore, the network can learn the weights for each feature automatically, enhancing the fusion capability. Next, the features $\mathbf{P}_1$, $\mathbf{P}_2$ and $\mathbf{P}_3$ were multiplied element-wise by the weights $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$ and $\boldsymbol{\beta}_3$ in each band, respectively, to obtain $\tilde{\mathbf{P}}_1$, $\tilde{\mathbf{P}}_2$ and $\tilde{\mathbf{P}}_3$, which were then summed to obtain the final feature representation $\mathbf{Z}$. Finally, the feature $\mathbf{Z}$ was fed into a linear layer for classification.
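The adaptive fusion step can be sketched as follows, assuming that the three level features have already been brought to a common spatial size and channel count; the 1 × 1 convolution producing the per-level score maps is an illustrative choice.

```python
import torch
import torch.nn as nn

class ASFF(nn.Module):
    """Adaptive Spatial Feature Fusion sketch (Equations (10)-(11)): learn a
    softmax weight per level at every spatial position and take the weighted
    sum. Assumes the three inputs share the same shape (B, C, H, W)."""
    def __init__(self, channels, num_levels=3):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(num_levels * channels, num_levels, kernel_size=1),
            nn.ReLU(inplace=True),                       # Equation (10)
        )

    def forward(self, feats):                            # feats: list of three (B, C, H, W)
        cat = torch.cat(feats, dim=1)                    # concatenate along channels
        weights = torch.softmax(self.score(cat), dim=1)  # Equation (11), sums to 1 per pixel
        fused = sum(w.unsqueeze(1) * f                   # weight each level at each position
                    for w, f in zip(weights.unbind(dim=1), feats))
        return fused                                     # final representation Z
```

In use, the fused output would be pooled or flattened and passed to the final linear classifier, as described above.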