This section first describes the common HSI datasets. Second, the relevant experimental setup is introduced. Then, we discuss some important parameters. Next, S2CABT is compared with 3 non-DL-based models and 13 CNN-based networks. Moreover, the ablation experiments are conducted. Finally, we analyze the complexity of S2CABT.
4.3. Parameters Discussion
Batch Size: The batch size is the number of training samples fed to the classification network in each iteration. A smaller batch size means that the gradient in each iteration is estimated from fewer samples, which increases the randomness of the training process and may speed up convergence. However, it may also make training more volatile, which can hurt the classification performance. A larger batch size reduces this randomness, making the gradient updates smoother and the convergence more stable; however, the excessive smoothness may trap the network in a wide, flat region so that it converges to a poor local minimum, again affecting the classification performance. Therefore, to determine an appropriate batch size, this paper validates the impact of the batch size on the classification performance. We test different batch sizes in steps of 16 in experiments with 2% training samples. The average OA, AA, and their standard deviations are shown in Figure 9a–c. As can be seen from Figure 9, the batch size has essentially no impact on the classification performance on the Indian Pines, Houston, and PaviaU datasets. In contrast, its impact on the KSC dataset is more obvious: when the batch size is 32 or 48, the classification performance is better than with other values. Since the batch size directly determines the memory occupied by the network in each iteration, this paper sets the batch size for the PaviaU dataset to 32; for consistency, the batch sizes of the other three datasets are also set to 32.
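The randomness argument above can be illustrated numerically: the variance of a mini-batch gradient estimate shrinks as the batch grows. The sketch below is a minimal, self-contained demonstration on a toy quadratic loss with synthetic data; none of the names or data come from S2CABT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: loss L(w) = mean((X @ w - y)^2).
X = rng.normal(size=(4096, 8))
w_true = rng.normal(size=8)
y = X @ w_true + 0.5 * rng.normal(size=4096)
w = np.zeros(8)  # evaluate gradients at an arbitrary parameter point

def minibatch_grad(batch_size):
    """Gradient of the squared loss estimated from one random mini-batch."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

def grad_std(batch_size, trials=200):
    """Average per-component standard deviation of the gradient estimate."""
    grads = np.stack([minibatch_grad(batch_size) for _ in range(trials)])
    return grads.std(axis=0).mean()

small, large = grad_std(16), grad_std(256)
print(small, large)  # the larger batch gives a much less noisy estimate
```

The noisier small-batch estimate is what makes training volatile; the smooth large-batch estimate is what risks settling in a wide valley.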
Learning rate: The learning rate controls the extent to which the network parameters are updated along the gradient direction during optimization. A smaller learning rate makes the parameters approach the optimal solution gradually, but convergence is very slow and the optimization easily stagnates in a wide local-minimum region, making it difficult to reach the optimum and affecting the classification performance. A larger learning rate causes the parameters to change too much in each iteration, making it easy to overshoot the optimal solution; it can also cause the updates to stall early, preventing the network from mining valuable information and affecting the classification performance. Therefore, to determine an appropriate learning rate, this paper validates its impact on the classification performance. We test different learning rates in experiments with 2% training samples. The average OA, AA, and their standard deviations are shown in Figure 10a–c. As the learning rate increases, the classification performances on the Indian Pines and Houston datasets first increase and then decrease, reaching the optimum at a learning rate of 0.0003. The classification performance on the KSC dataset fluctuates as the learning rate increases and is optimal at 0.003; however, at 0.0001 or 0.0003 its performance is very close to that optimum. In addition, the learning rate has essentially no impact on the PaviaU dataset: its classification performance remains basically unchanged. To unify the learning rate across datasets as much as possible, this paper sets the learning rate of all four datasets to 0.0003.
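The two failure modes described above (crawling convergence vs. overshoot) can be reproduced on the simplest possible objective. This toy sketch minimizes f(x) = x² by gradient descent; the specific rates are illustrative and unrelated to the 0.0003 chosen for S2CABT.

```python
import numpy as np

def gradient_descent(lr, steps=100, x0=1.0):
    """Minimize f(x) = x^2 (gradient 2x) starting from x0; return final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x
    return abs(x)

tiny = gradient_descent(lr=0.001)  # converges, but barely moves in 100 steps
good = gradient_descent(lr=0.1)    # converges quickly toward 0
huge = gradient_descent(lr=1.5)    # per-step factor |1 - 2*lr| > 1: diverges
print(tiny, good, huge)
```

Each step multiplies x by (1 − 2·lr), so the iterate shrinks only when that factor has magnitude below 1, which is exactly why an over-large rate "ignores" the optimum.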
Reduced spectral dimension: S2FEM uses the MNF to reduce the spectral dimension, removing redundant spectral bands and decreasing the memory requirements. Since the spectral bands contain abundant spectral information, too small a reduced dimension removes not only the redundant bands but also, easily, some bands containing discriminative spectral information, resulting in a loss of spectral features and hurting the identification of land covers. Too large a reduced dimension easily retains redundant spectral information, which makes it harder to extract useful spectral features and increases the computational complexity. Therefore, to determine an appropriate reduced spectral dimension, this paper validates its impact on the classification performance. We test different reduced spectral dimensions in steps of 5 in experiments with 2% training samples. The average OA, AA, and their standard deviations are shown in Figure 11a–c. It can be seen from Figure 11 that as the reduced spectral dimension increases, the classification performances on the four datasets gradually increase; once the reduced dimension reaches 15 or more, they remain basically unchanged. In other words, reducing the spectral dimension does lose spectral features to a certain extent, thereby affecting the classification performance, but only at small reduced dimensions; once the reduced dimension is increased appropriately, the impact of the spectral reduction becomes small.
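The saturation effect above has a simple linear-algebra reading: once the retained dimension covers the data's effective rank, extra components add almost nothing. MNF itself needs a noise estimate, so as a simplified stand-in this sketch uses PCA via SVD on synthetic "pixels" with a known latent dimension of 15; all data and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "HSI" pixels: 2000 pixels x 200 bands, generated from 15 latent
# components plus a little noise (standing in for real hyperspectral data).
latent = rng.normal(size=(2000, 15))
mixing = rng.normal(size=(15, 200))
pixels = latent @ mixing + 0.1 * rng.normal(size=(2000, 200))

centered = pixels - pixels.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
var = s**2 / np.sum(s**2)

def retained(k):
    """Fraction of total variance kept by the first k components."""
    return var[:k].sum()

print(retained(5), retained(15), retained(30))
# Beyond the true latent dimension (15 here), extra components add little.
```

This mirrors the experiment: below the effective dimension, discriminative information is lost; above it, the curve flattens.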
Patch size: The patch size determines the spatial context that the classification network can learn from. A smaller patch size loses some important spatial information and excludes pixels highly correlated with the current pixel, resulting in an unsmooth predicted class-label map. A larger patch size introduces useless or erroneous spatial information and increases the computational cost. Therefore, to determine an appropriate patch size, this paper validates its impact on the classification performance. We test different patch sizes in steps of 2 in experiments with 2% training samples. The average OA, AA, and their standard deviations are shown in Figure 12a–c. It can be seen from Figure 12 that as the patch size increases, the classification performances on the Indian Pines and Houston datasets first increase and then decrease. The Indian Pines dataset performs best with a patch size of 5. The Houston dataset performs essentially the same with patch sizes of 7 and 9; considering the computational complexity, its patch size is set to 7. As the patch size increases, the classification performances on the KSC and PaviaU datasets fluctuate, reaching their optima at patch sizes of 13 and 7, respectively.
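Patch-based HSI classification cuts a small spatial window around each labeled pixel and feeds all bands of that window to the network. A minimal numpy sketch of this common preprocessing step follows; the reflect padding and the toy cube sizes are our illustrative choices, not details taken from S2CABT.

```python
import numpy as np

def extract_patch(cube, row, col, patch_size):
    """Extract a patch_size x patch_size neighborhood (all bands) centered on
    pixel (row, col), reflect-padding at the image borders."""
    assert patch_size % 2 == 1, "patch size must be odd so the pixel is centered"
    pad = patch_size // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    return padded[row:row + patch_size, col:col + patch_size, :]

# Toy cube: 10 x 10 pixels, 15 bands (e.g., after dimension reduction).
cube = np.arange(10 * 10 * 15, dtype=float).reshape(10, 10, 15)
patch = extract_patch(cube, row=0, col=0, patch_size=7)
print(patch.shape)  # (7, 7, 15)
```

A larger `patch_size` brings in more (possibly irrelevant) neighbors and grows the input quadratically, which is exactly the accuracy/cost trade-off the experiment measures.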
Convolutional kernels: The number of convolutional kernels determines the high-level semantic features that the classification network can learn. Too few kernels cannot learn enough high-level semantic features, which limits the representational ability of the network and leads to underfitting. Too many kernels tend to learn redundant or erroneous high-level semantic features, which leads to overfitting and increases the computational complexity. Therefore, to determine an appropriate number of convolutional kernels, this paper validates its impact on the classification performance. We test different numbers of convolutional kernels in steps of 4 in experiments with 2% training samples. The average OA, AA, and their standard deviations are shown in Figure 13a–c. It can be seen from Figure 13 that the classification performances on the Indian Pines, KSC, and PaviaU datasets gradually increase as the number of kernels increases and then fluctuate. The Indian Pines and KSC datasets perform best with 20 kernels, and the PaviaU dataset with 28. On the Houston dataset, the classification performance first increases and then decreases as the number of kernels grows, reaching its optimum at 28 kernels.
4.4. Classification Results
In order to demonstrate the classification performance, our network is compared with 3 non-DL models: SVM [11], RF [12], and MLR [13], and 13 CNN networks: ResNet-34 [40], SSRN [47], DPRN [48], A2S2K [52], RSSAN [51], FADCNN [45], SPRN [49], FGSSCA [44], SSFTT [54], BS2T [58], S2FTNet [55], S3ARN [53], and HF [56].
Table 4, Table 5, Table 6 and Table 7 show the statistical classification results of the different methods on the Indian Pines, KSC, Houston, and PaviaU datasets with 2% training samples, respectively. The best results are in bold. We report the average ± standard deviation of the classification accuracy in each class (CA), OA, AA, and the Kappa coefficient (κ). It can be seen that our network outperforms the other methods in most CA and in all OA, AA, and κ on the four datasets. The non-DL-based models perform better than the CNN-based networks on classes with fewer training samples, because fewer training samples cannot accurately estimate the network parameters, resulting in underfitting. On classes with more training samples, the CNN-based networks perform better than the non-DL-based models, because they have strong feature-learning capabilities and, given sufficient training samples, can predict land-cover classes more accurately.
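The three summary metrics used throughout this section are standard. For reference, a minimal numpy implementation (ours, not the authors' code) computes all of them from a confusion matrix:

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA, and Cohen's kappa from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                # overall accuracy
    ca = np.diag(conf) / conf.sum(axis=1)      # per-class accuracy (CA)
    aa = ca.mean()                             # average accuracy
    # Chance agreement from the row/column marginals.
    pe = (conf.sum(axis=0) @ conf.sum(axis=1)) / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

conf = np.array([[48,  2,  0],
                 [ 3, 45,  2],
                 [ 0,  5, 45]])
oa, aa, kappa = classification_metrics(conf)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.92 0.92 0.88
```

OA weights classes by their sample counts while AA weights them equally, which is why the branch-removal ablation later hurts AA more than OA on rare classes; κ discounts chance agreement.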
Specifically, for the Indian Pines dataset, the proposed network has the highest CA in 10 classes (16 classes in total). Its evaluation metrics on the Indian Pines dataset are the best: 94.80%, 94.20%, and 0.9407, respectively. SPRN has the highest CA in 5 classes, reaching 100% on class 8; apart from S2CABT, it is the best-performing network on CA. SPRN partitions the spectral bands into sub-bands and learns spectral-spatial features on each sub-band. Each sub-band can be regarded as a new dataset, so SPRN increases the available training samples through spectral partitioning. However, since it only uses convolutional layers to learn local features and ignores the cross-channel interactions as well as the global features, the features learned by SPRN have difficulty accurately describing all classes. Therefore, its performance on the other classes is not outstanding or even poor; moreover, its OA, AA, and κ are 4.56%, 4.29%, and 0.0519 below those of our network. The classification performance closest to our network is that of BS2T, whose evaluation metrics are 94.01%, 91.46%, and 0.9316, respectively; the differences from S2CABT are 0.79%, 2.74%, and 0.0091. BS2T learns the local and global features using a CNN and a Transformer, respectively, but it ignores the enhancement of semantic features by cross-channel interactions and spectral-spatial properties, and it does not consider the impact of limited labeled samples. Therefore, its CA reaches 100% only on class 8 and is worse than that of S2CABT on the other classes.
For the KSC dataset, the proposed network has the highest CA in 10 classes (13 classes in total). Its evaluation metrics are the best, which are 99.11%, 98.42%, and 0.9901, respectively. It can be seen that the KSC dataset is more separable than the Indian Pines dataset, especially on classes 10 and 13, where 7 and 10 classification methods reach 100% CA, respectively. The main reason is that the number of training samples of each class in the KSC dataset is more balanced, which avoids overfitting to a certain extent. On this basis, the proposed network extracts more representative multi-scale spectral-spatial features to further enhance the KSC’s separability. The closest classification performance to our network is FGSSCA; its evaluation metrics are 98.36%, 97.43%, and 0.9818, respectively. The differences between S2CABT and FGSSCA on evaluation metrics are 0.75%, 0.99%, and 0.0083, respectively. Similar to SPRN, FGSSCA groups the shallow features and constructs the serial dual-branch structure to learn spectral and spatial features sequentially, thereby increasing the availability of training samples. However, in the serial dual-branch structure, the spatial features will be affected by the spectral features, which makes it difficult to learn highly discriminative semantic features.
For the Houston dataset, S2CABT has the highest CA in 2 classes (15 classes in total). Moreover, the difference in CA between our network and the method with the optimal CA is small: for example, it is only 0.45%, 0.61%, 0.08%, 0.03%, and 0.17% on classes 2, 3, 5, 7, and 11, respectively. The evaluation metrics of the proposed network are the best: 94.24%, 94.35%, and 0.9377, respectively. The classification performance closest to our network is that of FGSSCA, whose evaluation metrics are 94.04%, 93.96%, and 0.9355; the differences from S2CABT are 0.2%, 0.39%, and 0.0022. For the PaviaU dataset, S2CABT has the highest CA in 2 classes (9 classes in total), and the difference in CA from the method with the optimal CA is small: for example, it is only 0.26%, 0.03%, 1.14%, 0.4%, 0.09%, and 0.54% on classes 1, 2, 4, 5, 6, and 8, respectively. The evaluation metrics of the proposed network are the best: 99.35%, 98.75%, and 0.9914, respectively. The classification performance closest to our network is that of SPRN, whose evaluation metrics are 99.27%, 98.61%, and 0.9903; the differences from S2CABT are 0.08%, 0.14%, and 0.0011.
The visual classification performance of the different methods on the Indian Pines dataset with 2% training samples is shown in Figure 14. Overall, the CNN-based networks perform better than the non-DL-based models. The main reason is that the CNN-based networks usually take a 3D patch as the input feature and thus consider the spectral and spatial information jointly. They combine feature extraction and label prediction in a unified framework and can dynamically learn more discriminative high-level semantic features driven by the loss function. The non-DL-based models take 1D spectral vectors as input features, which consider only the spectral information and ignore the spatial information, resulting in many misclassified pixels in the class-label map. They generally take the input feature directly as the classification basis, which makes it difficult to estimate the optimal model parameters. In contrast, S2CABT obtains a smoother class-label map and predicts the labels of most labeled samples more accurately.
Specifically, the land covers are close to each other in the Indian Pines dataset, so the feature extraction and the label prediction are easily affected by adjacent land covers. The Indian Pines dataset also has a coarse spatial resolution (20 m), which leads to higher spectral variability, and, since its land covers differ in size, the numbers of training samples of the different classes are unbalanced. Both the spectral variability and the unbalanced training samples make it difficult to estimate the optimal network parameters. As shown in Figure 14, most methods are unable to accurately classify the land covers and recognize the image edges. The non-DL-based models and a few CNN-based networks (e.g., ResNet and DPRN) produce many discrete predicted labels: the non-DL-based models ignore the spatial information, while ResNet and DPRN underfit because the training samples are too few. The remaining CNN-based networks are prone to over-smoothing, i.e., land covers of different classes are recognized as the same class. The main reason is that these networks use the original HSI data as the input feature, which makes it difficult for them to learn highly discriminative spectral-spatial features from data with low separability. In contrast, S2CABT enhances the data separability by integrating the highly relevant spectral information and the complementary spatial information at different scales. Meanwhile, it optimizes the designs of the channel attention and the self-attention to learn more discriminative semantic features, which enhances S2CABT's ability to interpret HSI data with complex spectral-spatial structure and to capture the contextual information, thereby better describing the essential properties of land covers.
In order to compare the classification performances of the different methods on the four datasets more comprehensively, we use training sizes of 2%, 4%, 6%, 8%, 10%, 15%, and 20% to complete the classification task; the experimental results are shown in Figure 15. It can be seen that the non-DL-based models outperform some CNN-based networks when the training size is small. For example, the non-DL-based models are better than DPRN, RSSAN, and FADCNN on the Indian Pines dataset, better than RSSAN, SSFTT, and S2FTNet on the KSC dataset, better than DPRN, RSSAN, and S3ARN on the Houston dataset, and better than FADCNN on the PaviaU dataset. As the training size increases, the CNN-based networks gradually overtake the non-DL-based models, mainly because more training samples allow them to learn highly discriminative semantic features. Different from the comparison methods, S2CABT extracts the MSCM with stronger separability and builds a more efficient feature learning module, so it still achieves excellent classification performance with a smaller training size and outperforms the other methods. As the training size increases, the gap between S2CABT and the comparison methods gradually narrows, but our network remains optimal.
The classification’s robustness is easily affected by many factors such as the sample quality and the feature representation capacity. In order to improve the S2CABT’s robustness, this paper projects HSI into a feature space with higher descriptive power by integrating the highly relevant spectral information and the complementary spatial information at different scales, thereby increasing the data separability and providing high-quality training samples for subsequent feature learning. In addition, this paper designs FCL and CAMHSA to improve the feature representation learning capacity. Among them, FCL can model the cross-channel interactions and optimize attention allocation on this basis. CAMHSA introduces the spectral-spatial property similarity measure to allow the central pixel to learn discriminative information from the neighboring pixels.
It can be seen from Figure 15 that S2CABT maintains better classification performance even with fewer training samples. Meanwhile, as the training size increases, S2CABT's classification performance gradually improves. Moreover, compared with the other classification methods, S2CABT converges faster. These phenomena all show that our network has better robustness and stability.
4.5. Ablation Study
In order to validate the impact of the important contributions in S2CABT on the classification performance, we design an ablation study that analyzes them from three aspects: the input feature, the framework, and the attention.
Ablation study on the input feature: In order to reduce the impact caused by the high spectral variability and enhance the HSI's separability, MSCM is extracted to reconstruct the input feature space. MSCM integrates the highly relevant spectral information and the complementary spatial information at different scales using the spectral correlation selection (SCS) and the spatial multi-scale analysis, respectively. In this experiment, we compare the classification performance when the original HSI, the single-scale covariance matrix, MSCM without SCS, and MSCM are used as the input feature. Since we use three scales to extract MSCM, with window sizes of 3, 15, and 27, the three window sizes, i.e., 3 × 3, 15 × 15, and 27 × 27, need to be considered when using the single-scale covariance matrix as the input feature. In order to analyze the differences between input features more intuitively, t-SNE is used to project the input features into a two-dimensional space and visualize them. As shown in Figure 16, in each subgraph the different colors represent samples of different classes. Moreover, the experimental results on the four HSI datasets are shown in Table 8, and the best results are in bold. We report the average ± standard deviation of OA, AA, and κ.
As can be seen from Figure 16, affected by the high spectral variability, the samples with different class labels overlap with each other in the original HSI data, and the distribution of samples with the same class label is relatively scattered, making it difficult for the classification network to learn the optimal parameters from training samples drawn from the original HSI data. Meanwhile, it can be seen from Table 8 that when the original HSI data are used as the input feature, the classification performance is poor. In contrast, in the MSCM there are clear decision boundaries between the samples with different class labels, and the distribution of samples with the same class label is more concentrated. Moreover, the classification performance when using the MSCM as the input feature is significantly better than when using the original HSI data. From the above analysis, the separability of the input features directly affects the classification performance. In order to validate whether spectral features are lost when S2FEM uses the spectral correlation selection to remove the neighboring pixels with low spectral correlation, this section also visualizes the MSCM without SCS. The visualization shows that using all pixels in the neighborhood window to calculate the covariance matrix for the central pixel leads to larger intra-class distances and smaller inter-class distances. Therefore, introducing the spectral correlation selection block not only loses no spectral features, but also obtains highly discriminative spectral information, thereby improving the classification performance. Meanwhile, the multi-scale analysis enriches the feature representation to a certain extent: the small-scale covariance matrix captures fine features, while the large-scale covariance matrix focuses more on the overall representation. The visualization results of the three single-scale covariance matrices show that it is difficult to obtain valuable information from the Indian Pines dataset in the small-scale region, whereas in the large-scale region a feature space with high descriptive power can be constructed for it. The KSC, Houston, and PaviaU datasets obtain effective feature representations at all scales, especially in the small-scale region, where spectral-spatial features with stronger separability can be extracted.
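The core of the input-feature construction, a covariance descriptor computed only over spectrally correlated neighbors, can be sketched compactly. The following numpy version is a simplified illustration of the idea for one pixel at one scale (the `keep_ratio` threshold and all names are our illustrative choices, not the exact S2FEM formulation); stacking such descriptors over the three window sizes would give a multi-scale feature.

```python
import numpy as np

def correlation_selected_cov(window, keep_ratio=0.5):
    """Covariance descriptor for the window's central pixel, computed only
    from the neighbors whose spectral correlation with the center is highest
    (a simplified take on spectral correlation selection)."""
    h, w, bands = window.shape
    pixels = window.reshape(-1, bands)
    center = pixels[(h * w) // 2]
    # Pearson correlation of each neighbor's spectrum with the center's.
    c = (center - center.mean()) / (center.std() + 1e-12)
    p = (pixels - pixels.mean(axis=1, keepdims=True)) / \
        (pixels.std(axis=1, keepdims=True) + 1e-12)
    corr = p @ c / bands
    keep = np.argsort(corr)[-max(1, int(len(pixels) * keep_ratio)):]
    selected = pixels[keep]
    centered = selected - selected.mean(axis=0)
    return centered.T @ centered / len(selected)  # bands x bands covariance

rng = np.random.default_rng(0)
window = rng.normal(size=(15, 15, 20))  # one mid-scale window, 20 bands
cov = correlation_selected_cov(window)
print(cov.shape)  # (20, 20)
```

Dropping low-correlation neighbors before the covariance is what keeps dissimilar land covers from inflating intra-class variance, matching the visualization result above.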
Ablation study on the framework: In order to comprehensively characterize the spectral-spatial properties, the residual convolution module and the center-aware bottleneck Transformer are designed, and the parallel dual-branch structure is built to learn robust spectral and spatial features. The feature learning module can be divided into four sub-modules: ERCB, ARCB, ECABT, and ACABT. ERCB and ARCB are constructed based on CNNs to learn the local spectral and spatial features, respectively; ECABT and ACABT are designed based on the bottleneck Transformer to learn the global spectral and spatial features, respectively. In this experiment, we remove one or more sub-modules to validate their impact on the classification performance. The experimental results are shown in Table 9, and the best results are in bold. Note that ✘ means the module is removed, while ✓ means it is retained. We report the average ± standard deviation of OA, AA, and κ. It can be seen that each sub-module plays a vital role in improving the classification performance: after removing one or more modules, the classification performance degrades to varying degrees.
Specifically, this experiment can be divided into three groups: removing one sub-module, removing two sub-modules, and removing all sub-modules. On the KSC, Houston, and PaviaU datasets, the classification performance after removing one sub-module is better than in the other experiments, and the performance after removing all sub-modules is the worst. On the Indian Pines dataset, the classification performance after removing one sub-module is also generally the best; however, the AA after removing two sub-modules is worse than the AA after removing all modules. This phenomenon appears when a whole spectral or spatial branch is removed, i.e., when both ERCB and ECABT or both ARCB and ACABT are removed. The main reason is that the spectral or spatial features are lost after removing a branch, making it difficult to accurately describe the spectral-spatial properties. Especially for the land covers that provide only one training sample, removing one branch yields smaller CA for these classes, which has less effect on OA and κ but more effect on AA.
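The removal patterns in this experiment can be enumerated programmatically. The helper below is hypothetical (only the four sub-module names come from the paper) and simply generates the three ablation groups by how many sub-modules are dropped:

```python
from itertools import combinations

MODULES = ("ERCB", "ARCB", "ECABT", "ACABT")

def ablation_configs():
    """Group ablation settings by the number of removed sub-modules:
    remove one, remove two, and remove all four."""
    configs = {1: [], 2: [], 4: []}
    for k in configs:
        for removed in combinations(MODULES, k):
            kept = tuple(m for m in MODULES if m not in removed)
            configs[k].append({"removed": removed, "kept": kept})
    return configs

configs = ablation_configs()
print(len(configs[1]), len(configs[2]), len(configs[4]))  # 4 6 1
# Among the two-module removals, ("ERCB", "ECABT") drops the whole spectral
# branch and ("ARCB", "ACABT") drops the whole spatial branch.
```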
Ablation study on the attention: In order to optimize the feature extraction and measure the different contributions of the neighboring pixels to the central pixel, FCL and CAMHSA are designed. FCL uses the inner product between channels to model the cross-channel interactions, which overcomes the drawbacks that the existing channel attention easily loses some important information and only learns an independent importance for each single channel. CAMHSA introduces a feature similarity measure into MHSA to focus more on the neighboring pixels whose spectral-spatial properties are relatively consistent with those of the central pixel. In this experiment, in order to validate the advantages of FCL and CAMHSA, we replace FCL and CAMHSA with the existing channel attention (CBAM, SK-Net, ECA-Net, etc.) and MHSA, respectively. The experimental results are shown in Table 10; the best results are in bold. We report the average ± standard deviation of OA, AA, and κ. It can be seen from Table 10 that both FCL and CAMHSA improve the classification performance; compared with the existing channel attention and MHSA, they demonstrate stronger feature optimization and learning capabilities.
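The core idea behind CAMHSA, biasing the attention scores toward neighbors that resemble the central pixel, can be sketched in a few lines. This single-head numpy version with a cosine-similarity bias is an illustrative approximation under our own assumptions (`alpha`, the weight layout, and the similarity measure are not the paper's exact formulation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def center_aware_attention(tokens, center_index, wq, wk, wv, alpha=1.0):
    """Single-head self-attention whose scores are additionally biased by the
    cosine similarity between each token and the central token."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    # Similarity of every token to the central pixel's token.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit[center_index]
    attn = softmax(scores + alpha * sim[None, :])  # bias every query's scores
    return attn @ v, attn

rng = np.random.default_rng(0)
tokens = rng.normal(size=(9, 16))  # a 3 x 3 patch flattened into 9 tokens
wq, wk, wv = (rng.normal(size=(16, 16)) for _ in range(3))
out, attn = center_aware_attention(tokens, center_index=4, wq=wq, wk=wk, wv=wv)
print(out.shape, attn.shape)  # (9, 16) (9, 9)
```

With `alpha = 0` this reduces to plain scaled dot-product attention, which is essentially the MHSA baseline that CAMHSA is compared against in Table 10.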