1. Introduction
Hyperspectral imaging (HSI), as an image–spectrum merging technology, combines fine spectroscopy with imaging, so that an HSI contains both abundant spatial distribution information of surface targets and hundreds or even thousands of contiguous narrow spectral bands [1,2]. Benefiting from these rich spatial and spectral features, HSI not only has an inherent advantage over natural images for recognition and classification but can also effectively distinguish different land-cover categories and objects. Therefore, HSI plays a crucial role in many fields, such as military defense [3], atmospheric science [4], urban planning [5], vegetation ecology [6,7] and environmental monitoring [8,9]. Within the hyperspectral community, one of the most active research applications is HSI classification. However, HSI classification also faces formidable challenges, such as interference from extensive redundant spectral information, a scarcity of labeled samples and high intra-class variability.
Initially, traditional HSI classification methods, such as kernel-based and machine-learning approaches [10,11], consisted of two primary parts: feature extraction and classifier optimization. Representative algorithms include band selection (BS) [12], the sparse representation classifier (SRC) [13], multinomial logistic regression (MLR) [14], principal component analysis (PCA) [15] and the support vector machine (SVM) [16], all of which exploit rich spectral features to perform HSI classification. However, these approaches use only spectral information and do not take full advantage of spatial information. To improve classification performance, many spectral–spatial methods have been developed that incorporate spatial context into the classifier. For example, 3D morphological profiles [17] and 3D Gabor filters [18] were designed to obtain spectral–spatial features. Li et al. presented multiple kernel learning (MKL) to mine the spectral and spatial information of HSI [19]. Fang et al. constructed a novel local covariance matrix (CM) representation, which captures the relation between spatial information and the different spectral bands of HSI [20]. These conventional approaches depend on handcrafted features, which leads to insufficient extraction of discriminative features and poor robustness.
In recent years, with the breakthrough of deep learning, HSI classification methods based on deep learning have demonstrated superior performance [
21,
22,
23,
24,
25]. Chen et al. first applied a convolutional neural network (CNN) to HSI classification [
26]. Hu et al. proposed a CNN-based framework comprising five convolutional layers to perform the classification task [
27]. Zhao et al. utilized a 2D CNN to obtain spatial features and then integrated them with spectral information [
28]. In order to capture joint spectral–spatial information, Zou et al. constructed a 3D fully convolutional network, which can further obtain high-level semantic features [
29]. In order to obtain promising information, Zhang et al. devised a diverse region-based CNN to extract semantic context-aware information [
30]. Ge et al. presented a lower triangular network to fuse spectral and spatial features and thus achieved high-dimension semantic information [
31]. Nie et al. proposed a multiscale spectral–spatial deformable network, which employed a spectral–spatial joint network to obtain low-level features composed of spatial and spectral information [
32]. An effective and efficient CNN-based spectral partitioning residual network was built, which utilized cascaded parallel improved residual blocks to extract spatial and spectral information [
33]. A dual-path Siamese CNN was designed by Huang et al., which integrated extended morphological profiles and a Siamese network with spectral–spatial feature fusion [
34]. In order to obtain more prominent spectral–spatial information, Gao et al. designed a multiscale feature extraction module [
35]. Shi et al. devised densely connected 3D convolutional layers to capture preliminary spectral–spatial features [
36]. Chan et al. utilized spatial and spectral information to train a novel framework for classification [
37].
The attention mechanism plays a crucial part in HSI classification by focusing on the information most relevant to the classification task [
38,
39,
40,
41,
42]. Zhu et al. proposed a spatial attention block to adaptively choose a useful spatial context and a spectral attention block to emphasize necessary spectral bands [
43]. In order to optimize and refine the obtained feature maps, Li et al. built a spatial attention block and a channel attention block [
44]. Gao et al. proposed a channel–spectral–spatial attention block to enhance important information and suppress unnecessary information [
45]. Xiong et al. utilized the dynamic routing between attention initiation modules to learn the proposed architecture adaptively [
46]. Xi et al. constructed a hybrid residual attention module to enhance vital spatial–spectral information and suppress unimportant information [
47].
Inspired by the above successful classification methods, this article proposes a multibranch crossover feature attention network (MCFANet) for HSI classification. MCFANet is composed of two primary submodules: a crossover feature extraction module (CFEM) and a rearranged attention module (RAM). CFEM is designed to capture spectral–spatial features at different convolutional layers, scales and branches, which boosts the discriminative representations of HSI. Specifically, CFEM consists of three parallel branches with multiple receptive fields, which increases the diversity of spectral–spatial features. Each branch utilizes three additive link units (ALUs) to extract spectral and spatial information, and cross transmission is introduced into the ALU to take full advantage of the spectral–spatial feature flows between branches. Moreover, each branch also employs dense connections to combine shallow and deep information and thus obtain strongly related and complementary features for classification. RAM, which includes a spatial attention branch and a spectral attention branch, is constructed not only to adaptively recalibrate spatial-wise and spectral-wise feature responses but also to exploit a shifted cascade operation that rearranges the attention-enhanced features, dispelling redundant information and noise and thus enhancing the classification accuracy. The main contributions of this article can be summarized as follows:
- (1)
In order to decrease the number of training parameters and accelerate model convergence, we designed an additive link unit (ALU) to replace the conventional 3D convolutional layer. On the one hand, the ALU utilizes a spectral feature extraction factor and a spatial feature extraction factor to capture joint spectral–spatial features; on the other hand, it introduces cross transmission to take full advantage of the spectral–spatial feature flows between different branches;
- (2)
In order to overcome the difficulty that fixed-scale convolutional kernels cannot sufficiently extract spectral–spatial features, a crossover feature extraction module (CFEM) was constructed, which obtains spectral–spatial features at different convolutional scales and branches. CFEM not only utilizes three parallel branches with multiple receptive fields to increase the diversity of spectral–spatial features but also applies a dense connection to each branch to incorporate shallow and deep features and thus realize robust complementary features for classification;
- (3)
In order to dispel the interference of redundant information and noise, we devised a rearranged attention module (RAM) to adaptively recalibrate spatial-wise and spectral-wise feature responses while exploiting a shifted cascade operation to realign the attention-enhanced features, which is beneficial for boosting the classification performance.
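As a parameter-free illustration of the RAM idea in contribution (3), the sketch below recalibrates spectral-wise and spatial-wise responses of a feature cube with a sigmoid gate over pooled statistics. The learned attention weights and the shifted cascade rearrangement of the actual module are replaced here by simple pooling and a plain concatenation, so this is only a hedged toy sketch, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spectral_attention(x):
    # Squeeze: global average pool each band over the spatial dims -> (B,)
    # Excite: a sigmoid gate stands in for the module's learned weighting.
    w = sigmoid(x.mean(axis=(0, 1)))
    return x * w  # recalibrate band-wise responses

def spatial_attention(x):
    # Pool over the spectral axis to score each pixel -> (H, W, 1)
    w = sigmoid(x.mean(axis=2, keepdims=True))
    return x * w  # recalibrate pixel-wise responses

def ram(x):
    # Toy RAM: run both branches and stack them along the band axis.
    # (A plain concatenation replaces the paper's shifted cascade here.)
    return np.concatenate([spectral_attention(x), spatial_attention(x)], axis=2)
```

Because both gates lie in (0, 1), each branch can only damp responses relative to the input; the learned module would instead reweight bands and pixels according to their relevance to the classification task.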
The remainder of this article is organized as follows:
Section 2 describes in detail our developed MCFANet,
Section 3 provides experimental results and discussion and
Section 4 gives the conclusion part of this article.
3. Experimental Results and Discussion
This section introduces in detail the benchmark datasets used, the experimental setup, a series of parameter analyses, and the discussion of experimental results.
3.1. Datasets Description
The Pavia University (UP) dataset was gathered by the Reflective Optics System Imaging Spectrometer (ROSIS-03) sensor in 2003 during a flight campaign over Pavia, northern Italy. The wavelength range is 0.43–0.86 μm. This dataset contains nine ground-truth categories and has 610 × 340 pixels with a spatial resolution of 1.3 m. After excluding 12 noisy bands, the remaining 103 bands are generally utilized in experiments.
The Indian Pines (IP) dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992 in northwestern Indiana, USA. The wavelength range is 0.4–2.5 μm. This dataset contains 16 ground-truth categories and has 145 × 145 pixels with a spatial resolution of 20 m. After removing 20 water-absorption bands, the remaining 200 bands are generally utilized in experiments.
The Salinas Valley (SA) dataset was captured by the AVIRIS sensor in 2011 over the Salinas Valley in California, USA. The wavelength range is 0.36–2.5 μm. This dataset contains 16 ground-truth categories and has 512 × 217 pixels with a spatial resolution of 3.7 m. After eliminating the water-absorption bands, the remaining 204 bands are generally utilized in experiments.
For the UP and SA datasets, we randomly selected 10% of the labeled samples of each category for training and the remaining 90% for testing. For the IP dataset, we randomly selected 20% of the labeled samples of each category as the training set and the remaining 80% as the testing set.
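The per-class random split described above can be sketched in a few lines of NumPy. Treating label 0 as unlabeled background is an assumption common to these benchmark datasets, not something specified here.

```python
import numpy as np

def per_class_split(labels, train_ratio, seed=0):
    """Randomly pick `train_ratio` of the labeled pixels of every class.

    labels: 1-D array of class ids; 0 is treated as unlabeled and skipped.
    Returns (train_idx, test_idx) index arrays into `labels`.
    """
    rng = np.random.default_rng(seed)
    train, test = [], []
    for c in np.unique(labels):
        if c == 0:                       # background / unlabeled pixels
            continue
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_train = max(1, int(round(train_ratio * idx.size)))
        train.append(idx[:n_train])
        test.append(idx[n_train:])
    return np.concatenate(train), np.concatenate(test)
```

Splitting per class rather than globally keeps the class proportions of the training set close to those of the full ground truth, which matters for the minority classes of IP.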
Table 1,
Table 2 and
Table 3 list the land-cover categories, the sample numbers of the three datasets and the corresponding color of each category.
3.2. Experimental Setup
All experiments were conducted on a system with an NVIDIA GeForce RTX 2060 SUPER GPU and 6 GB of RAM. The software environment was TensorFlow 2.3.0, Keras 2.4.3 and Python 3.6.
The batch size was set to 16, and the number of training epochs was set to 200. Moreover, we adopted RMSprop as the optimizer to update the parameters during training, with the learning rate set to 0.0005. Three evaluation indices were used to evaluate the classification performance, i.e., the Kappa coefficient (Kappa), average accuracy (AA) and overall accuracy (OA). Kappa measures the consistency between the ground truth and the classification results. AA is the mean of the per-class accuracies, where each per-class accuracy is the ratio of correctly classified samples to the total samples of that category. OA is the proportion of correctly classified samples among all test samples. The closer these evaluation indices are to 1, the better the classification performance.
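The three evaluation indices can be computed from a confusion matrix as follows; this is the standard formulation, shown here only to make the definitions concrete.

```python
import numpy as np

def oa_aa_kappa(y_true, y_pred, n_classes):
    """OA, AA and Kappa from predicted vs. ground-truth labels (0..n_classes-1)."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)            # confusion matrix
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)       # per-class accuracy
    aa = per_class.mean()                          # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

Note that OA weights every sample equally, whereas AA weights every class equally, so the two can diverge noticeably on imbalanced datasets such as IP.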
3.3. Parameter Analysis
In this section, we discuss five vital parameters that impact the classification results of our proposed MCFANet, i.e., the spatial size, the training sample ratio, the number of principal components, the number of convolutional kernels in the additive link unit and the number of additive link units. All experiments vary one parameter at a time while controlling the others to analyze its influence.
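The control-variable protocol can be expressed as a small helper that varies one hyperparameter while holding the rest at their defaults. Here `evaluate`, `defaults` and the parameter names are illustrative placeholders, not the actual training pipeline.

```python
import numpy as np

def control_variable_sweep(evaluate, defaults, name, candidates):
    """Vary one hyperparameter over `candidates` while all other
    hyperparameters stay at their default values; return {candidate: score}."""
    scores = {}
    for value in candidates:
        params = dict(defaults, **{name: value})  # override only `name`
        scores[value] = evaluate(params)
    return scores
```

Running one sweep per parameter (as Sections 3.3.1–3.3.5 do) costs the sum of the grid sizes rather than their product, at the price of ignoring interactions between parameters.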
3.3.1. Effect of the Spatial Sizes
Different HSI datasets have different feature distributions, and different spatial sizes may generate different classification results. A small patch yields an insufficient receptive field, whereas a large patch introduces more noise, both of which are detrimental to HSI classification. Therefore, we fixed the other parameters and evaluated five candidate spatial sizes to analyze their effect on the classification results of the proposed MCFANet on the three datasets. The experimental results are provided in
Figure 6. According to
Figure 6, the same spatial size is optimal for the UP and IP datasets. For the SA dataset, two of the candidate sizes yield identical values for the three evaluation indices; considering the number of training parameters and the training time, the smaller of the two was selected as the optimal spatial size for the SA dataset.
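A patch extractor of the kind implied by the spatial-size experiment might look as follows. Zero padding at the image borders is one common choice, assumed here rather than taken from the paper.

```python
import numpy as np

def extract_patch(cube, row, col, size):
    """Cut a size x size spatial neighborhood (all bands) centered on a pixel.

    cube: (H, W, B) hyperspectral cube. The cube is zero-padded so border
    pixels also get a full patch; `size` is assumed odd.
    """
    m = size // 2
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)), mode="constant")
    # After padding, (row, col) of the original image sits at (row + m, col + m),
    # so this slice is exactly the size x size window centered on the pixel.
    return padded[row:row + size, col:col + size, :]
```

Growing `size` enlarges the receptive field but also pulls in more pixels from neighboring objects, which is exactly the trade-off the experiment in this subsection measures.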
3.3.2. Effect of the Training Sample Ratios
The training sample ratio has a great effect on HSI classification performance. In order to evaluate the robustness and generalization of the proposed MCFANet, we randomly chose 1%, 3%, 5%, 7%, 10%, 20% and 30% of the labeled samples for training and the remaining labeled samples for testing. The experimental results are provided in
Figure 7. According to
Figure 7, for the three experimental datasets, the three evaluation indices are lowest when the training sample ratio is 1%. As the training sample ratio increases, the three evaluation indices gradually improve. For the UP and SA datasets, the indices reach a relatively stable level at a training sample ratio of 10%; for the IP dataset, they stabilize at 20%. This is because the UP and SA datasets have abundant labeled samples, so even with a small training ratio our proposed method can still obtain high classification accuracy. In contrast, the IP dataset contains relatively few labeled samples, so a larger training ratio is needed for the proposed method to achieve good classification results. Therefore, the best training sample ratio is 10% for the UP and SA datasets and 20% for the IP dataset.
3.3.3. Effect of the Principal Component Numbers
HSI contains hundreds of highly correlated spectral bands, which introduces redundant information that is not conducive to the classification task. We applied PCA to the raw HSI to reduce the number of spectral bands and, thus, the training parameters of the proposed method. We set the number of principal components to 5, 10, 20, 30 and 40 to evaluate its effect on the classification results of our proposed MCFANet for the three datasets. The experimental results are provided in
Figure 8. According to
Figure 8, for the IP and SA datasets, the three evaluation indices with 20 principal components are clearly superior to those of the other settings. For the UP dataset, the three evaluation indices are 99.82%, 99.62% and 99.80% with 20 principal components and 99.87%, 99.63% and 99.83% with 40. Although the former are 0.05%, 0.01% and 0.03% lower than the latter, the 20-component setting requires fewer training parameters and less training time. Therefore, the optimal number of principal components for all three datasets is 20.
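The PCA band-reduction step can be sketched with a plain SVD of the centered pixel matrix; this is a generic implementation, not necessarily the exact preprocessing used in the experiments.

```python
import numpy as np

def pca_bands(cube, n_components):
    """Project the spectral dimension of an (H, W, B) cube onto its first
    `n_components` principal components via SVD of the centered pixel matrix."""
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)   # one row per pixel
    x -= x.mean(axis=0)                          # center each band
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    scores = x @ vt[:n_components].T             # project onto top components
    return scores.reshape(h, w, n_components)
```

Because the singular values are sorted in descending order, the first output "band" carries the most spectral variance, the second the next most, and so on.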
3.3.4. Effect of the Number of Convolutional Kernels in Additive Link Units
The ALU is composed of a spatial feature extraction factor and a spectral feature extraction factor, which are similar to traditional 3D convolutional operations. Therefore, the number of output feature maps of ALU directly impacts the complexity and classification performance of the proposed method. We set the number of convolutional kernels in ALU to 2, 4, 8, 16, 32 and 64 to evaluate their effects on the classification results for three datasets. The experimental results are provided in
Figure 9. According to
Figure 9, for the three datasets, it is clear that the three evaluation indices are highest when the number of convolutional kernels is 16, where the proposed method achieves its best classification performance. Hence, the optimal number of convolutional kernels for the three datasets is 16.
3.3.5. Effect of the Number of Additive Link Units
Our proposed MCFANet is composed of three parallel branches, and each branch includes multiple ALUs. Too few ALUs lead to insufficient feature extraction, whereas too many cause problems such as overfitting, a more complex model structure and exploding gradients, which are not conducive to HSI classification. Therefore, we set the number of ALUs to 2, 3, 4 and 5 to evaluate its effect on the classification results of our proposed MCFANet for the three datasets. The experimental results are provided in
Figure 10. According to
Figure 10, the optimal number of ALUs for the UP and SA datasets is 3. For the IP dataset, the three evaluation indices are 99.61%, 99.72% and 99.55% with 3 ALUs and 99.68%, 99.69% and 99.64% with 5. Although the OA and Kappa of the former are 0.07% and 0.09% lower than those of the latter, the former requires fewer training parameters and less training time. All things considered, the optimal number of ALUs for the IP dataset is 3.
3.4. Ablation Study
Our presented MCFANet involves two submodules: CFEM and RAM. In order to prove the validity of each module more comprehensively, ablation studies were implemented on the three benchmark datasets, i.e., without CFEM (Network 1), without RAM (Network 2) and with the combination of CFEM and RAM (Network 3). The experimental results are provided in
Figure 11. From
Figure 11, it is clear that Network 1 has the worst classification accuracies. For example, for the IP dataset, its three evaluation indices are 2.37%, 3.13% and 3.05% lower than those of Network 3. In contrast, the classification performance of Network 2 improves remarkably. For example, for the UP dataset, the three evaluation indices of Network 2 are 2.02%, 3.94% and 2.69% higher than those of Network 1. By comparison, Network 3 achieves the best classification results, which indicates that our devised CFEM has the greater effect on the classification performance of the presented model, while RAM further enhances the classification results.
3.5. Comparison Methods Discussion
In order to verify the performance of our presented MCFANet, we utilized ten classical methods for comparison experiments, which can be divided into two categories. One is based on traditional machine learning: SVM, RF, KNN and GaussianNB, which take all spectral bands as input. The other is based on deep learning: the spectral–spatial residual network (SSRN) [
53] is a 3D CNN that uses spatial and spectral residual blocks to capture spectral–spatial features; the fast dense spectral–spatial convolution network (FDSSC) [
54] utilizes different 3D convolutional kernel sizes based on dense connection to extract spatial and spectral features separately; HybridSN [
55] combines one 2D convolutional layer and three 3D convolutional layers; Hybrid 3D/2D CNN (3D_2D_CNN) [
56] is similar to HybridSN, which splices together 2D CNN components with 3D CNN components; multibranch 3D-dense attention network (MBDA) [
57] exploits 3D CNNs to obtain spectral–spatial information and designs spatial attention mechanisms to enhance the spatial feature representations; multiscale residual network (MSRN) [
50] constructs a multiscale residual block with mixed depthwise convolution to achieve multiscale feature learning.
Table 4,
Table 5 and
Table 6 provide the classification results of different methods on three experimental datasets.
First, as shown in
Table 4,
Table 5 and
Table 6, it can be seen that our developed MCFANet achieves excellent classification performance and has the highest classification accuracies in most categories. For the UP dataset, compared with the other methods, the OA, AA and Kappa of the proposed MCFANet increase by approximately 0.34–33.98%, 0.47–33.15% and 0.32–43.17%, respectively. For the IP dataset, they increase by approximately 1.4–51.77%, 3.15–49.19% and 1.5–58.33%, respectively. For the SA dataset, they increase by approximately 0.14–23.20%, 0.09–13.7% and 0.15–25.51%, respectively. Overall, compared with the seven deep learning-based classification approaches, SVM, RF, KNN and GaussianNB have lower classification accuracies, of which GaussianNB performs the worst. This is because they only capture features in the spectral domain and neglect ample spatial features. In addition, they rely on the prior knowledge of experienced experts, leading to inferior robustness and generalization ability. Owing to their hierarchical structure, the seven deep learning-based methods can automatically extract low-, middle- and high-level spectral–spatial features and obtain decent classification accuracies.
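To make the spectral-only limitation of the traditional baselines concrete, here is a minimal nearest-neighbor classifier that, like them, sees each pixel's spectrum in isolation; it is an illustrative stand-in, not one of the four baselines actually used in the experiments.

```python
import numpy as np

def nn_classify(train_spectra, train_labels, test_spectra):
    """1-nearest-neighbor on raw per-pixel spectra (Euclidean distance).

    Like the traditional baselines, it uses no spatial context: two pixels
    with similar spectra get the same label regardless of where they lie.
    """
    # Pairwise distances between every test and training spectrum.
    d = np.linalg.norm(test_spectra[:, None, :] - train_spectra[None, :, :], axis=2)
    return train_labels[d.argmin(axis=1)]
```

Such a classifier cannot exploit the fact that neighboring pixels usually share a label, which is precisely the spatial information that the spectral–spatial deep models above incorporate.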
Second, MBDA, MSRN and our proposed MCFANet all adopt a multiscale feature extraction strategy. MBDA uses three parallel branches with convolutional kernels of different sizes to capture multiscale spectral–spatial features. MSRN utilizes a multiscale residual block, composed of depthwise separable convolution with mixed depthwise convolution, to perform multiscale feature extraction. It is clear from
Table 4,
Table 5 and
Table 6 that, among the three methods, our proposed MCFANet obtains the best classification performance. For the UP dataset, the three evaluation indices of the proposed MCFANet are 0.34%, 0.47% and 0.32% higher than those of MBDA and 12.63%, 10.76% and 16.42% higher than those of MSRN. For the IP dataset, they are 1.4%, 3.15% and 1.59% higher than those of MBDA and 2.34%, 6.90% and 2.67% higher than those of MSRN. For the SA dataset, they are 0.34%, 0.33% and 0.39% higher than those of MBDA and 1.29%, 0.92% and 1.44% higher than those of MSRN. This is because, compared with MBDA and MSRN, our proposed MCFANet not only introduces multiple receptive fields to capture multiscale spectral–spatial features but also utilizes dense connection and cross transmission to aggregate spectral–spatial features from different layers and branches.
Third, according to
Table 4,
Table 5 and
Table 6, we can also clearly see that the evaluation indices of MBDA rank second. This is because MBDA builds a spatial attention module to focus on the spatial features relevant to HSI classification. Our proposed MCFANet likewise constructs a RAM to enhance spectral–spatial features and achieves the highest classification accuracies. These results indicate that the attention mechanism can boost classification performance to a certain degree. For the UP dataset, the three evaluation indices of the proposed MCFANet are 0.34%, 0.47% and 0.32% higher than those of MBDA; for the IP dataset, 1.4%, 3.15% and 1.59% higher; and for the SA dataset, 0.34%, 0.33% and 0.39% higher. This is because our designed RAM not only adaptively reweights the significance of spectral-wise and spatial-wise features but also introduces the shifted cascade operation to rearrange the attention-enhanced features, yielding more discriminative spectral–spatial features while dispelling redundant information and noise and, thus, improving the classification performance.
Furthermore, as shown in
Table 4,
Table 5 and
Table 6, it is clear that, compared with the six DL-based comparison approaches, the FLOPs and test time of our proposed MCFANet are not the lowest, which indicates that our presented method still has some shortcomings. This could be because our designed CFEM includes several parallel branches that share spectral–spatial information with each other; although the spectral–spatial features at different scales and branches are effectively integrated, the structure of our developed model is relatively large and its test time is not superior. Therefore, how to shorten the test time and reduce the complexity of the proposed MCFANet remains a problem worth studying.
Moreover,
Figure 12,
Figure 13 and
Figure 14 provide the classification visual result maps of the eleven comparison methods on the three experimental datasets. By contrast, SVM, RF, KNN and GaussianNB produce coarse classification maps with considerable noise and high misclassification rates. The classification maps of SSRN, FDSSC, 3D_2D_CNN, HybridSN, MBDA, MSRN and MCFANet are significantly improved and clearer. By comparison, the classification maps of our developed MCFANet are smoother, more accurate and highly consistent with the ground-truth maps on the three public datasets.