1. Introduction
Hyperspectral images (HSIs) contain a wealth of spectral–spatial information within hundreds of continuous spectral bands, holding promise for distinguishing different land covers at a fine-grained level, particularly those with extremely similar spectral signatures in RGB space [1]. Therefore, HSI classification has great potential to support a series of high-level Earth observation tasks such as land cover mapping, precision agriculture, mineral exploration and military monitoring.
Early studies on HSI classification utilized various machine learning approaches as feature extractors. Classical methods include the K-nearest neighbor [2], Markov random field [3], random forest [4], support vector machine [5] and Gaussian process [6]. These methods obtain classification results quickly but cannot ensure accuracy, since hand-crafted feature extractors limit the representation and fitting ability. Thanks to the great success of deep learning, many neural networks have been proposed to achieve HSI classification in an end-to-end way. Neural networks are capable of mining the potentially valuable information hidden in the pluralistic HSI data, and have thus become a ‘hot topic’ in HSI classification [7]. For neural networks, the key problem is how to design the architecture and feature extraction so that high-level nonlinear features are cultivated automatically. The Convolutional Neural Network (CNN) is a particularly valuable paradigm, which extracts features by stacking multiple convolutions with local receptive fields. In an early work, a 1-D CNN [8] was proposed to classify HSI using only the spectral signature while ignoring the spatial relations. Subsequently, the 2-D CNN was proposed to effectively extract the abundant neighborhood information of HSI. Moreover, the 3-D CNN [9] extracts features by regarding HSI as a collection of cubes, which takes the spatial information and spectral signature into account simultaneously. Since HSI itself is 3-D data, the 3-D CNN outperforms the 1-D and 2-D CNNs in most cases. Although CNNs have a powerful ability to extract spatially structural and contextual information from HSI, they only build short-range spatial relations and otherwise tend to cause pepper noise [10]. Some recent works [11,12,13] have sought to solve this problem through network architecture and attention mechanism design. Although some progress has been made, encoding local information with scalar features limits the representation ability and causes information loss during propagation through the layers. Some methods, such as the residual network [14], dense network [15] and frequency combined network [16], can alleviate the problem, but they cannot push CNN-based models to extremely accurate results and usually incur a heavy computational burden [17].
Some recent works have improved HSI classification performance by designing particular network architectures, including autoencoders (AEs) [18,19], recurrent neural networks (RNNs) [20], graph convolutional networks (GCNs) [21,22,23], Transformers (TFs) [24,25,26], and Mamba [27]. Chen et al. [19] employed stacked AEs to semantically extract spatial–spectral features by considering the local contextual information of HSI. Since it follows an unsupervised paradigm, the AE-based method cannot ensure high classification accuracy. Hang et al. [20] designed a cascaded RNN for HSI classification, taking advantage of RNNs to effectively model the sequential relations of neighboring spectral bands. However, RNNs can only capture short- and middle-range dependencies in sequences. More recently, increasing attention has been paid to GCN and TF for HSI classification. GCN regards the pixels as vertices and models their relations as a graph structure. Therefore, in contrast to CNN, GCN has a natural merit in capturing long-range spatial relations [21]. By embedding superpixel segmentation [28], this advantage is further enhanced, avoiding local pepper noise [29]. As a strong substitute for RNN, TF [30] has a powerful ability to build long-term dependencies using the self-attention technique, which has encouraged the development of a large number of TF-based HSI classification models.
However, GCN, TF and Mamba have inevitable limitations due to their intrinsic mechanisms. Although GCN extracts long-range spatial information that ensures accuracy for non-adjacent patches, the propagation of graph data remains a challenging problem, which restricts the number of GCN layers used and limits deep semantic feature building [31,32]. TF has the merit of building long-range dependencies among tokens, and has thus shown huge success in dealing with large-scale datasets in natural language processing [33]. Although TF can be directly applied to HSI to build a spectral sequence, its performance is limited by the length of the token sequence due to its relatively few “inductive biases” [34]. In other words, in contrast to CNN, TF only builds 2-D information in the MLP layer of the encoder, and its effectiveness is mainly achieved by the self-attention mechanism [35]. If the token sequence is short, TF achieves relatively poor performance [36]. Therefore, the performance of TF-based HSI classification methods is limited, since the HSI used is generally of small or medium scale, in which the spectral signature only provides hundreds of tokens, far fewer than in TF-based NLP models. On the other hand, although Mamba-based work [27] alleviated the memory burden by using a State-Space Model (SSM) to replace the self-attention mechanism of TF, the classification performance is still limited by the length of the spectral signature.
Many recent works have used composite models to overcome the limitations of the current backbones. By “composite” is meant that at least two backbones are combined to mutually compensate for each other’s limitations, thus providing a more exact representation for high-quality classification. For example, CEGCN [37] incorporates the advantages of both CNN and GCN to jointly cultivate pixel-level and superpixel-level spatial–spectral features and enhance feature representation. WFCG [38] involves a weighted feature fusion model with an attention mechanism, which combines a CNN with a graph attention network for HSI classification. AMGCFN [39] utilizes a novel attention-based fusion network, which aggregates rich information by merging a multihop GCN and a multiscale CNN. GTFN [40] explores the combination of GCN and TF: by using GCN to provide long-range spatial information, GTFN achieves SOTA performance with TF for classification. A further approach, DSNet [41], enhances classification performance by introducing a deep AE unmixing architecture into the CNN. These composite backbones have demonstrated performance improvements resulting from the synergistic effect of two backbones. However, simply combining two kinds of backbones with a designed fusion strategy only solves the problem of insufficient representation and propagation to a limited extent, while resulting in a huge computational burden.
In this paper, we resort to a more abstract feature (the activity vector) and rethink HSI classification with a capsule network (CapsNet) [42]. As shown in Figure 1, the pipeline of CapsNet can be divided into three steps termed ConvUnit, PrimaryCap and ClassCap. For the standard CapsNet, the ConvUnit contains one convolution, which extracts preliminary features from the input data and sends them to PrimaryCap. PrimaryCap consists of multiple convolutional capsule layers, which are used to yield primary activity vectors. These primary activity vectors are adaptively activated in the ClassCap using a dynamic routing mechanism, which yields the class activity vectors for classification. Some recent works have explored HSI classification based on CapsNet. CNet [43] was the first CapsNet-based model for HSI classification. Subsequently, 3DCNet [44] was proposed, which splits HSI into diverse 3-D cubes as the input to CapsNet to solve the classification problem with limited training samples. More recently, attention mechanisms have been used to improve the capsule network. For example, SA-CNet [45] addressed a limitation of the ConvUnit step, i.e., that a single convolution cannot effectively exploit the spatial information; a correlation matrix with trainable cosine distance was therefore used to assign varying weights to the different patches of HSI. Moreover, ATT-CNet [46] processes the HSI data using a light-weight channel-wise attention mechanism. It incorporates self-weighted dynamic routing by introducing the self-attention mechanism of the Transformer into the activation of the activity vectors. By contrast, our proposed CAN seeks to achieve SOTA classification performance at a low computational cost, and has three obvious differences from recent attention-based capsule networks. First, we drop spectral bands via PCA rather than weighting them, thus effectively reducing the feature dimension of HSI and improving efficiency. Second, to ensure effectiveness, we design a light-weight module (termed AFE) without distance estimation to adaptively weight each pixel. Third, we observed that treating different capsule convolutions equally limits the representation of primary activity vectors; we therefore propose to produce more representative primary activity vectors by adaptively weighting the capsule convolutions during the PrimaryCap step.
The main contributions of this research are as follows:
We propose CAN for HSI classification with two attention components, termed AFE and SWM, which improve the representation of primary activity vectors by adaptively weighting the pixel-wise features and the capsule convolutions, respectively. In contrast to the standard CapsNet, our CAN only adds three more convolutions for the AFE and SWM, causing a negligible parameter increase. Benefiting from these attention mechanisms, however, it can run on an extremely low spectral dimension of HSI data (and hence with low computation) while achieving much higher classification accuracy.
We propose the AFE block to simultaneously mine spectral–spatial features with an efficient adaptive pixel weighting mechanism, which ensures the ability to gather useful information from HSI data.
We propose SWM to distinguish the importance of different capsule convolutions, which effectively enhances the representation of primary activity vectors and thus benefits classification performance.
Experiments on four well-known HSI datasets show that our CAN surpasses SOTA methods. Moreover, our CAN runs much more efficiently than other methods, with a far lower computational burden.
The rest of this paper is organized as follows:
Section 2 analyzes the limitations of CNN and introduces recent CapsNet-based HSI classification works.
Section 3 introduces CAN.
Section 4 presents the experiments that verify the effectiveness of CAN.
Section 5 concludes the paper and presents some outlooks for the future.
3. CAN
The pipeline of CAN can be divided into four steps termed Preprocessing, ConvUnit, PrimaryCap and ClassCap. As can be seen in Figure 3, the Preprocessing step extracts spectral–spatial information from HSI. Assume the dimension of the HSI is $\mathbf{X} \in \mathbb{R}^{H \times W \times B}$, where $H$, $W$ and $B$ denote the height, width and number of bands, respectively. First, we randomly extract a spectral–spatial cube with a patch size of 7. Afterwards, inspired by [19,24], we reduce the spectral dimension from $B$ to $C$ using a classic dimensionality reduction method, PCA. By projecting the data into an orthogonal space, PCA adaptively finds the valuable spectral features (i.e., the principal components). In our experiments, we set $C$ to a small value, which reduces redundant information to improve efficiency while maintaining effectiveness. The remaining three steps follow the standard pipeline of CapsNet. However, we propose AFE in the ConvUnit step to weight the different pixels in an adaptive way, and we design a novel attention mechanism in the PrimaryCap step to automatically distinguish the importance of the capsule convolutions. We next detail the three steps.
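Before detailing them, the following is a minimal sketch of the Preprocessing step, assuming the HSI is stored as an (H, W, B) NumPy array and using scikit-learn's PCA; the function name and the choice of `n_components` are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(hsi: np.ndarray, n_components: int, patch_size: int = 7,
               seed: int = 0) -> np.ndarray:
    """Reduce the spectral dimension of an (H, W, B) HSI cube to C = n_components
    bands with PCA, then extract a random spectral-spatial patch."""
    H, W, B = hsi.shape
    # PCA projects the B-dimensional spectra into an orthogonal space and keeps
    # only the most informative principal components.
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(hsi.reshape(-1, B)).reshape(H, W, n_components)
    # Randomly choose the top-left corner of a patch_size x patch_size window.
    rng = np.random.default_rng(seed)
    r = int(rng.integers(0, H - patch_size + 1))
    c = int(rng.integers(0, W - patch_size + 1))
    return reduced[r:r + patch_size, c:c + patch_size, :]
```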
3.1. AFE
Standard CapsNet utilizes a single convolution to process the input data, which treats all information equally and fails to gather the useful features. Therefore, we propose AFE, a simple but effective module to quickly obtain high-quality features. Attention mechanisms are a widely studied topic in recent works [12,24,45,46]. The famous Squeeze-and-Excitation Network (SENet) [49] is in essence a channel-wise attention, which has been verified to be effective in improving classification. Moreover, CBAM (Convolutional Block Attention Module) [50] and DBDA (Double-Branch Dual-Attention) [12] benefit from both channel- and spatial-wise attention, fusing the weighted features in a cascaded and a parallel way, respectively. The proposed AFE differs from the above works since it only uses spatial-wise attention, and no channel-wise attention is used. This design is because PCA discards most bands, which means the remaining spectral bands are all important, so adding channel-wise attention would achieve only a quite limited gain. Concretely, the output of preprocessing is first sent to a convolution (kernel size 1, stride 1) with the ReLU function to obtain nonlinear features. After that, another convolution (kernel size 1, stride 1) with the sigmoid function is used to yield a spatial-wise attention map by dropping the channel dimension to 1. This process can be mathematically defined as

$\mathbf{A} = \mathrm{Sigmoid}\left(f_{2}\left(\mathrm{ReLU}\left(f_{1}(\mathbf{F})\right)\right)\right),$ (1)

where $\mathbf{F}$ denotes the output of preprocessing, $f_{1}$ and $f_{2}$ denote the first and second convolutions, respectively, and $\mathbf{A}$ denotes the adaptively learned pixel-wise attention map. From the angle of spatial attention, the proposed AFE is also different from CBAM and DBDA. First, in contrast to the spatial attention in CBAM, the proposed AFE obtains the spatial attention map by two convolutions with a ReLU function rather than by one convolution. The additional convolution with the ReLU function enhances the ability to build nonlinear features, making the attention mechanism more effective with a negligible computational increment. Moreover, in contrast to the spatial attention in DBDA, the proposed AFE learns the attention map in one branch rather than in three branches, which ensures efficiency. In particular, the convolution kernel of AFE is pixel-wise (1 × 1) rather than of large size (7 × 7 and 9 × 9, as used in CBAM and DBDA, respectively), which fits pixel-level classification. After learning the attention map, we obtain the weighted features as

$\mathbf{F}_{w} = \mathbf{A} \otimes \mathbf{F},$ (2)

where ⊗ denotes element-wise multiplication and $\mathbf{F}_{w}$ denotes the features weighted by pixel-wise attention. So far, we have obtained more representative features by distinguishing the importance of each pixel in an automatic way.

After spatial-wise attention, it is necessary to conduct nonlinear mapping again to extract the high-level features and adjust the scale. To be specific, a convolution (kernel size 5, stride 1) with the ReLU function is used to change the patch size from 7 × 7 to 3 × 3. To suppress the loss of spatial information, we correspondingly increase the channel dimension to 128. Finally, these features are sent to another convolution (kernel size 1, stride 1) to accomplish the feature preparation for the PrimaryCap step.
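The AFE block described above can be sketched in PyTorch as follows; the hidden width of the first 1 × 1 convolution (`mid_channels`) is an assumption, since the text only fixes the kernel sizes and the 128 channels of the mapping stage.

```python
import torch
import torch.nn as nn

class AFE(nn.Module):
    """Adaptive feature extraction: pixel-wise spatial attention (Equations (1)
    and (2)) followed by the nonlinearity-mapping convolutions."""
    def __init__(self, in_channels: int, mid_channels: int = 16):
        super().__init__()
        # Two 1x1 convolutions yield the pixel-wise attention map A.
        self.att = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, kernel_size=1),  # drop channels to 1
            nn.Sigmoid(),
        )
        # 5x5 convolution shrinks the 7x7 patch to 3x3 and lifts the channel
        # dimension to 128; a final 1x1 convolution prepares the PrimaryCap input.
        self.mapping = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=5, stride=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.att(x)             # (N, 1, 7, 7) attention map
        return self.mapping(x * a)  # broadcast pixel-wise weighting, then map
```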
3.2. PrimaryCap
The PrimaryCap step yields the primary activity vectors, which are the lowest level of multi-dimensional entities. Here, the concept of an entity can be understood as a land cover (i.e., a classification object) with multi-dimensional parts of interest that are expressed as instantiation parameters [42]. These instantiation parameters are meaningful; for HSI, they represent the associated properties of the corresponding land-cover type. In this sense, the generation of a capsule can be interpreted as the opposite of the rendering process in computer graphics, where the pattern is obtained by applying rendering to an object with corresponding instantiation parameters.

The PrimaryCap is a convolutional capsule layer with $M$ channels of $D$-dimensional capsules, i.e., it contains $M$ capsule convolutions with a kernel size of 3 and a stride of 2. Therefore, in total, PrimaryCap obtains $M$ capsules, where each output is a $D$-dimensional vector. This process can be mathematically defined as

$\mathbf{F}_{m} = f^{\mathrm{cap}}_{m}(\mathbf{Z}), \quad m = 1, \ldots, M,$ (4)

where $f^{\mathrm{cap}}_{1}, \ldots, f^{\mathrm{cap}}_{M}$ denote the $M$ capsule convolutions, $\mathbf{Z}$ denotes the output of the ConvUnit step, and $\mathbf{F}_{1}, \ldots, \mathbf{F}_{M}$ are the extracted features. Considering the outputs of all the $M$ capsules, we stack them in dimension 0 to obtain

$\tilde{\mathbf{F}} = \mathrm{Stack}(\mathbf{F}_{1}, \ldots, \mathbf{F}_{M}).$ (5)

Afterwards, $\tilde{\mathbf{F}}$ is reshaped as

$\mathbf{S} = \mathrm{Reshape}(\tilde{\mathbf{F}}) \in \mathbb{R}^{M \times D},$ (6)

where $M$ denotes the number of activity vectors and $D$ denotes the vector dimension. From another angle, $M$ is also the number of parallel convolutions and $D$ denotes the corresponding number of output channels. In our experiments, $M$ and $D$ are fixed empirically.
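A possible PyTorch realization of the capsule convolutions, stacking and reshaping of Equations (4)–(6) is sketched below; the values of `M` and `D` are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class PrimaryCap(nn.Module):
    """M parallel capsule convolutions (kernel 3, stride 2), stacked and
    reshaped into M activity vectors of dimension D (Equations (4)-(6))."""
    def __init__(self, in_channels: int = 128, M: int = 32, D: int = 8):
        super().__init__()
        self.caps = nn.ModuleList(
            [nn.Conv2d(in_channels, D, kernel_size=3, stride=2) for _ in range(M)]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Each capsule convolution maps the (N, 128, 3, 3) input to (N, D, 1, 1).
        feats = [conv(z) for conv in self.caps]
        stacked = torch.stack(feats, dim=0)  # stack in dimension 0: (M, N, D, 1, 1)
        # Reshape to one D-dimensional vector per capsule and sample: (N, M, D).
        return stacked.permute(1, 0, 2, 3, 4).reshape(z.size(0), len(self.caps), -1)
```

In practice, the intermediate list `feats` would also be kept for the SWM weighting described below.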
After obtaining the vectors, a squash operation is conducted, which serves as the vector version of a nonlinear mapping. The mathematical definition is

$\mathbf{u}_{m} = \dfrac{\|\mathbf{s}_{m}\|^{2}}{1 + \|\mathbf{s}_{m}\|^{2}} \dfrac{\mathbf{s}_{m}}{\|\mathbf{s}_{m}\|},$ (7)

where $\mathbf{s}_{m}$ denotes the $m$-th row of $\mathbf{S}$ and $\mathbf{u}_{m}$ denotes the corresponding primary activity vector after the squash operation. From another angle, Equation (7) can also be regarded as a weighting operation on the vector. First, it is easy to see that $\mathbf{s}_{m}/\|\mathbf{s}_{m}\|$ denotes the normalized vector of $\mathbf{s}_{m}$. Based on this, the larger the norm of $\mathbf{s}_{m}$, the closer the norm of $\mathbf{u}_{m}$ gets to 1; the smaller the norm of $\mathbf{s}_{m}$, the closer the norm of $\mathbf{u}_{m}$ gets to 0. This shows that Equation (7) is meaningful, as it makes these vectors well recognized by their lengths.
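A direct translation of the squash operation in Equation (7); the small `eps` term is added only for numerical stability and is not part of the equation.

```python
import torch

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Vector nonlinearity of Equation (7): keeps the direction of s while
    pushing short vectors toward length 0 and long vectors toward length 1."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / torch.sqrt(sq_norm + eps)
```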
Equations (4)–(7) show the pipeline of the standard CapsNet. It is easy to see that the learned primary activity vectors are closely related to the $M$ capsules. A natural idea is to automatically learn the importance of these capsules and thus improve the representation of the primary activity vectors. Therefore, a simple but effective module, termed SWM, is designed to adaptively yield the weights. Concretely, we concatenate all the $\mathbf{F}_{m}$ in the channel dimension and then send them into SWM to generate the weights for all the capsules by a gated convolution. This process can be mathematically denoted as

$\mathbf{W} = \mathrm{Sigmoid}\left(f_{g}\left(\mathrm{Cat}(\mathbf{F}_{1}, \ldots, \mathbf{F}_{M})\right)\right),$ (8)

where $\mathrm{Cat}$ denotes the concatenation operation in the channel dimension and $f_{g}$ is a convolution with a kernel size of 3 and a stride of 1, which thus only changes the number of channels. The input channel is $M \times D$ and the output channel is $M$, so $\mathbf{W}$ can be regarded as the adaptive weight maps of the $M$ capsules. Formally, the weighted features can be defined as

$\hat{\mathbf{F}}_{m} = \mathbf{W}_{m} \otimes \mathbf{F}_{m}, \quad m = 1, \ldots, M,$ (9)

where $\hat{\mathbf{F}}_{m}$ are the weighted features and $\mathbf{W}_{m}$ denotes the $m$-th weight map. Afterwards, Equations (5) and (6) are applied to the weighted features sequentially to obtain high-quality activity vectors. These activity vectors are further squashed by Equation (7) to obtain the primary activity vectors for the ClassCap step.

It is easy to see that the proposed SWM is meaningful. First, since the input of $f_{g}$ is the concatenated features of all the capsules, it is theoretically explainable that the output $M$ maps can give reasonable weights to the results of the different capsules. Moreover, the SWM has two obvious advantages. On the one hand, the weights are data-driven, free from hand-crafted priors. On the other hand, in contrast to the standard capsule network, it introduces only one more convolution, and thus the added computation is negligible.
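A sketch of SWM under the same placeholder `M` and `D`; the sigmoid realizes the gating of the gated convolution in Equation (8), and the padding is chosen so the convolution only changes the channel count, as stated above.

```python
import torch
import torch.nn as nn

class SWM(nn.Module):
    """Gated convolution that adaptively weights the M capsule convolutions
    (Equations (8) and (9))."""
    def __init__(self, M: int = 32, D: int = 8):
        super().__init__()
        # Input: concatenated capsule features (M*D channels); output: M weight
        # maps, one per capsule. padding=1 keeps the spatial size so the layer
        # only changes the number of channels.
        self.gate = nn.Sequential(
            nn.Conv2d(M * D, M, kernel_size=3, stride=1, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feats):
        # feats: list of M tensors of shape (N, D, h, w) from the capsule convs.
        w = self.gate(torch.cat(feats, dim=1))  # (N, M, h, w) weight maps
        # Broadcast each weight map over the D channels of its capsule.
        return [f * w[:, m:m + 1] for m, f in enumerate(feats)]
```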
3.3. ClassCap
The ClassCap step describes an information propagation rule between vectors. Following [42], we use dynamic routing to build such relations. As shown in Figure 4, the primary activity vectors are first transformed by projection matrices $\mathbf{P}_{mn}$ as

$\hat{\mathbf{u}}_{n|m} = \mathbf{P}_{mn} \mathbf{u}_{m},$ (10)

where $\hat{\mathbf{u}}_{n|m}$ denotes the projected vectors. These vectors are weighted by different weights and fused by element-wise addition. The process can be mathematically defined as

$\mathbf{s}_{n} = \sum_{m=1}^{M} c_{mn} \hat{\mathbf{u}}_{n|m},$ (11)

where $\mathbf{s}_{n}$ denotes the $n$-th weighted vector and $n = 1, \ldots, N$, with $N$ being the number of class activity vectors (also the number of land-cover types). Afterwards, the squashing operation (Equation (7)) is conducted on $\mathbf{s}_{n}$ to obtain the class activity vector

$\mathbf{v}_{n} = \mathrm{Squash}(\mathbf{s}_{n}).$ (12)

Assuming the total number of iterations of dynamic routing is $T$, the weight is adaptively updated by

$c_{mn}^{\tau+1} = c_{mn}^{\tau} + \hat{\mathbf{u}}_{n|m} \cdot \mathbf{v}_{n}.$ (13)

In Equation (13), $c_{mn}^{\tau}$ denotes the weight at the $\tau$-th iteration, where $\tau \in \{0, 1, \ldots, T-1\}$ denotes the index of the iteration. According to [42], we set $T = 3$ in the experiments. Here, $\cdot$ denotes the inner product, which is used to estimate the similarity of the two vectors. Since $\mathbf{v}_{n}$ is derived from the fused vector, the weighting mechanism is meaningful: a projected vector with a more positive contribution is incrementally assigned a larger weight. The whole process is summarized in Algorithm 1.
Algorithm 1: Dynamic routing.
Input: primary activity vectors $\mathbf{u}_{m}$; number of iterations $T$.
Calculate the projected vectors $\hat{\mathbf{u}}_{n|m}$ by Equation (10); initialize the weights $c_{mn}^{0}$.
For $\tau = 0, \ldots, T-1$: fuse the projected vectors by Equation (11); squash the fused vectors by Equation (12); update the weights by Equation (13).
Output: class activity vectors $\mathbf{v}_{n}$.
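The routing loop of Algorithm 1 can be sketched as follows. Note that this sketch follows the softmax-normalized variant of [42], in which zero-initialized logits are raised by the inner-product agreement and turned into fusion weights by a softmax; the tensor layout (N, M, n, d) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Equation (7), repeated here to keep the sketch self-contained.
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat: torch.Tensor, T: int = 3) -> torch.Tensor:
    """Dynamic routing over projected vectors u_hat of shape (N, M, n, d)."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # (N, M, n) logits
    for _ in range(T):
        c = F.softmax(b, dim=2)                            # routing weights
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)           # fuse, Equation (11)
        v = squash(s)                                      # Equation (12)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)       # agreement, Equation (13)
    return v                                               # class activity vectors
```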
3.4. Loss Function
The margin loss and reconstruction loss are used to train the proposed CAN:

$L = L_{\mathrm{mar}} + \eta L_{\mathrm{rec}},$ (14)

where $L$ denotes the overall loss, and $L_{\mathrm{mar}}$ and $L_{\mathrm{rec}}$ denote the margin loss and reconstruction loss, respectively. $\eta$ is a trade-off coefficient for the loss, which is suggested to be 0.0005 [42].

(1) Margin Loss: For a CapsNet, it is expected that the class capsule for entity $c$ will have a long instantiation vector if, and only if, that entity is present in the current input. Therefore, the margin loss is used as

$L_{\mathrm{mar}} = \sum_{c} \left[ T_{c} \max(0, m^{+} - \|\mathbf{v}_{c}\|)^{2} + \lambda (1 - T_{c}) \max(0, \|\mathbf{v}_{c}\| - m^{-})^{2} \right],$ (15)

where $T_{c} = 1$ if the current input is assigned to class $c$ (and $T_{c} = 0$ otherwise), with margin coefficients $m^{+}$ and $m^{-}$ (set to 0.9 and 0.1, following [42]). $\lambda$ stops the initial learning from shrinking the lengths of the activity vectors of all the class capsules, and is suggested to be 0.5 [42].
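A sketch of the margin loss in Equation (15); the margins and the down-weighting factor follow [42], and the batched tensor layout is an assumption.

```python
import torch

def margin_loss(v: torch.Tensor, targets: torch.Tensor, m_pos: float = 0.9,
                m_neg: float = 0.1, lam: float = 0.5) -> torch.Tensor:
    """Margin loss of Equation (15). v: class activity vectors (N, n, d);
    targets: integer class labels of shape (N,)."""
    lengths = v.norm(dim=-1)                               # (N, n) capsule lengths
    # One-hot T_c: 1 for the assigned class, 0 otherwise.
    t = torch.zeros_like(lengths).scatter_(1, targets.unsqueeze(1), 1.0)
    pos = t * torch.clamp(m_pos - lengths, min=0.0) ** 2   # present-class term
    neg = lam * (1.0 - t) * torch.clamp(lengths - m_neg, min=0.0) ** 2
    return (pos + neg).sum(dim=1).mean()
```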
(2) Reconstruction Loss: The reconstruction loss is used as a regularization to ensure an exact pattern representation for each entity. If the current entity belongs to class capsule $c$, we first mask the outputs of all the other class capsules and then use only the result of capsule $c$ to reconstruct the input. The reconstruction network is composed of three fully connected layers as follows:

$\mathbf{r} = \mathrm{FC}_{3}(\mathrm{FC}_{2}(\mathrm{FC}_{1}(\mathbf{v}_{c}))),$ (16)

where $\mathrm{FC}_{1}$, $\mathrm{FC}_{2}$ and $\mathrm{FC}_{3}$ denote the fully connected layers. $\mathrm{FC}_{1}$ and $\mathrm{FC}_{2}$ are each followed by a ReLU function, while $\mathrm{FC}_{3}$ is followed by a sigmoid function. The reconstruction loss is defined as

$L_{\mathrm{rec}} = \|\mathbf{r} - \mathbf{x}\|_{2}^{2},$ (17)

where $\|\cdot\|_{2}^{2}$ denotes the $\ell_{2}$ loss and $\mathbf{x}$ denotes the input.
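A sketch of the masked reconstruction decoder (Equation (16)) and the $\ell_{2}$ reconstruction loss (Equation (17)); the hidden widths of the fully connected layers are illustrative guesses, since the text does not specify them.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Three fully connected layers reconstruct the input from the masked
    class capsule (Equation (16))."""
    def __init__(self, d: int, out_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(inplace=True),       # FC1 + ReLU
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),  # FC2 + ReLU
            nn.Linear(hidden, out_dim), nn.Sigmoid(),          # FC3 + sigmoid
        )

    def forward(self, v: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Zero out every class capsule except capsule c, then decode.
        mask = torch.zeros(v.shape[:2], device=v.device)
        mask.scatter_(1, targets.unsqueeze(1), 1.0)
        return self.net((v * mask.unsqueeze(-1)).sum(dim=1))

def reconstruction_loss(recon: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """L2 reconstruction loss of Equation (17)."""
    return ((recon - x.flatten(1)) ** 2).sum(dim=1).mean()
```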