1. Introduction
Hyperspectral images (HSI) consist of hundreds of narrow spectral bands captured by hyperspectral remote sensors and contain rich spectral–spatial information. Compared with RGB and multispectral images, HSI offers clear advantages for classifying land cover types. Therefore, hyperspectral image classification (HSIC) provides crucial technical support for a wide range of applications in domains such as urban planning [1], agriculture [2], mineral exploration [3], atmospheric sciences [4], environmental monitoring [5], and object tracking [6].
A multitude of HSI classification methods primarily focus on traditional machine learning (ML) models [
7] and deep learning (DL) models [
8,
9,
10]. Compared with traditional ML methods that depend on handcrafted feature engineering [
11], DL approaches have shown significantly greater potential across various fields, including HSI classification, owing to their ability to automatically learn features in an end-to-end manner. Typical DL approaches include stacked autoencoders (SAEs) [
12], recurrent neural networks (RNNs) [
13], convolutional neural networks (CNNs) [
14,
15], capsule networks (CapsNets) [
16], graph convolutional networks (GCNs) [
17,
18], Transformers [
10,
19], and Mamba [
20]. Among these models, CNN-, GCN-, Transformer-, and Mamba-based models have gained more interest. CNN-based models [
14,
21] utilize shape-fixed small kernel convolutions to extract local contextual information from fixed-size image patches. Subsequently, researchers explore multi-scale CNN architectures [
22,
23] and attention-based CNN models [
8,
24,
25] to enhance the ability to capture local spatial–spectral features, thereby improving HSI classification performance. However, owing to the limited receptive field of their small kernels, these models struggle to identify relationships between land covers over medium and long distances.
Compared to CNNs with shape-fixed kernels, graph convolutional networks (GCNs) [
26] and their variant methods can perform flexible convolutions across irregular land cover regions. Consequently, many works introduce superpixel-based GCNs to classify HSI data [
9,
27,
28,
29]. These superpixel GCNs are capable of establishing long-range spatial dependencies and capturing global information by leveraging superpixels as graph nodes. While the aforementioned superpixel-based GCN models enhance HSI classification capabilities, they suffer from two limitations: (1) the construction of their adjacency matrices demands significant computational resources, thereby diminishing classification efficiency; and (2) these adjacency matrices solely model spatial relationships between pixels, overlooking the crucial spectral correlations.
Recently, driven by the outstanding achievements of vision Transformers (ViTs) [
30] in natural image processing, Transformer-based models [
10,
19,
31,
32,
33] have been proposed for identifying land cover types. These models have demonstrated remarkable classification outcomes, attributed to their robust capability in capturing and modeling remote dependencies among pixels. Nevertheless, they suffer from computational inefficiency due to the quadratic computational complexity driven by the self-attention mechanism in the Transformer. This complexity poses challenges when dealing with large HSI datasets containing numerous labeled pixels. To address these limitations, several studies [
20,
34] are devoted to developing Mamba [
35] frameworks for HSI classification. Although these Mamba-based models show strong long-range modeling ability and achieve linear computational complexity, their local feature extraction capabilities need to be enhanced.
In recent years, large kernel CNNs (LKCNNs) [
36,
37,
38] have garnered considerable attention. Unlike traditional CNNs, which stack a series of small-kernel layers to enlarge the receptive field, LKCNNs employ a few large spatial convolutions to increase the size of the receptive field, demonstrating a promising capability in natural visual tasks. This capability has inspired a limited number of studies [39,40,41] that leverage LKCNNs for HSI classification. These studies, such as SSLKA [41], typically employ the classical large kernel attention (LKA) [37], which decomposes a k × k large kernel convolution into a depthwise convolution (DWC) [42], a depthwise dilation convolution (DDC) with a dilation factor of d, and a pointwise (1 × 1) convolution, to capture global features. However, they face three issues: (1) The LKA primarily focuses on modeling long-range dependencies while overlooking the extraction of local features. (2) Their number of parameters and computational complexity increase significantly when k is large, raising the risk of overfitting. (3) Their capability to learn global features needs to be enhanced when k is not large.
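Issue (2) can be illustrated with a rough parameter count. The decomposition sizes below follow the common LKA formulation from the general large-kernel literature (a (2d−1) × (2d−1) DWC, a ⌈k/d⌉ × ⌈k/d⌉ DDC with dilation d, and a 1 × 1 pointwise convolution); the exact configurations in [37,41] may differ.

```python
import math

def dense_conv_params(c: int, k: int) -> int:
    # Standard dense k×k convolution with c input and c output channels.
    return c * c * k * k

def lka_params(c: int, k: int, d: int) -> int:
    # Depthwise conv + dilated depthwise conv + 1×1 pointwise conv.
    dwc = c * (2 * d - 1) ** 2
    ddc = c * math.ceil(k / d) ** 2
    pwc = c * c
    return dwc + ddc + pwc

c, k, d = 64, 21, 3
print(dense_conv_params(c, k), lka_params(c, k, d))
# LKA is far cheaper than a dense 21×21 conv, but its DWC/DDC
# terms still grow quadratically as the kernel size k increases.
```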
To tackle these limitations of CNN-, GCN-, Transformer-, Mamba-, and LKCNN-based models, we propose a multi-scale large kernel asymmetric CNN (MSLKACNN) for HSI classification. This architecture scales up the large kernel sizes to 1 × m and m × 1, as illustrated in Figure 1. Specifically, we first develop a spectral feature extraction module (SFEM) to eliminate noise, reduce spectral bands, and extract spectral features. Subsequently, to capture spatial features at different scales, we construct a novel multi-scale large kernel asymmetric convolution (MSLKAC) comprising two parallel multi-scale asymmetric convolution components: a multi-scale large kernel asymmetric depthwise convolution (MLKADC) and a multi-scale asymmetric dilated depthwise convolution (MADDC). MLKADC consists of parallel DWCs whose asymmetric kernels range from small sizes up to 1 × m and m × 1, and is designed to learn short-range (small local), medium-range (larger local), and long-range (global) spatial features. Since these depthwise kernels are non-square and m is set to a large value of 17, we refer to our MLKADC as a large kernel asymmetric depthwise convolution (ADC). MADDC captures spatial relationships among pixels at varying distances through an integration of multi-scale learning, dilated convolutions [43], DWCs, and asymmetric convolutions. Lastly, an average fusion pooling (AFP) is introduced to fuse the spatial features extracted by the various components. The main contributions of this article are summarized as follows.
(1) We introduce a novel MLKADC to extract local-to-global spatial features. The MLKADC utilizes a series of asymmetric DWCs with small to large kernels, addressing the limitations of existing DL models. Notably, it extends the non-square kernel sizes to 1 × 17 and 17 × 1, thus enhancing its global feature extraction capability while reducing the number of parameters compared to SSLKA, which relies on standard square kernels.
(2) We propose a new MADDC to model the spatial relationships between land covers at different distances by combining ADC with dilated convolution.
(3) By combining the proposed MLKADC and MADDC in parallel, we develop a novel MSLKAC for improving the ability to extract spatial features across small to large ranges. Based on our MSLKAC, we introduce an architecture termed MSLKACNN to jointly learn both spectral and spatial features through the SFEM and MSLKAC.
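As a back-of-the-envelope check on the parameter savings claimed in contribution (1): a square 17 × 17 depthwise kernel stores 289 weights per channel, while a sequence of 1 × 17 and 17 × 1 depthwise kernels spans the same 17 × 17 receptive field with only 34 weights per channel.

```python
def square_dw_weights(m: int) -> int:
    # One m×m depthwise kernel per channel.
    return m * m

def asymmetric_dw_weights(m: int) -> int:
    # A 1×m kernel followed by an m×1 kernel per channel;
    # together they span an m×m receptive field.
    return m + m

m = 17
print(square_dw_weights(m), asymmetric_dw_weights(m))  # 289 34
```

The saving grows with m, which is why asymmetric factorization becomes attractive precisely in the large-kernel regime.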
The rest of the paper is organized as follows. In
Section 2, we present the related works of HSI classification models. The proposed MSLKACNN is introduced in
Section 3. In
Section 4, we evaluate and discuss the performance of MSLKACNN. We summarize the paper in
Section 5.
2. Related Work
HSI classification methods are typically categorized into traditional ML-based approaches and DL-based approaches. DL-based approaches mainly comprise five types of models: CNN-based models, GCN-based models, Transformer-based models, Mamba-based models, and LKCNN-based models.
(1) ML-based models: In early studies, traditional ML methods, such as Markov random fields [
7] and morphological profiles [
44], tend to be applied in HSI classification. These models heavily rely on manual design and exhibit limited learning ability in extracting deep semantic features from HSI [
31].
(2) CNN-based models: Many CNN-based models, ranging from 1D-CNN [
45] to 2D-CNN [
46] and 3D-CNN [
14], have been applied to capture features from HSI in an end-to-end manner. Additionally, channel-based CNN frameworks, including single-channel CNN [
46], dual-channel CNN [
47], and multi-channel CNN [
48], have been employed to learn spatial–spectral features. Zhong et al. [
49] and Wang et al. [
21] respectively introduce residual connections [
50] and dense connections [
51] into their CNNs to significantly deepen their models, addressing the degradation [
50] of deep CNNs. Nevertheless, the majority of these models are limited to extracting features at a single scale from fixed-size image patches, resulting in suboptimal and unstable results under complex HSI with limited training samples [
52,
53]. To tackle these issues, numerous works establish multi-scale CNN architectures to capture multi-scale features. For instance, Gong et al. [
54] design a novel multi-scale CNN to more effectively learn features compared with single-scale CNN methods. MMFN [
22] combines the complementary and related features at different scales to achieve optimal results. To boost the learning capability for extracting spatial–spectral features, attention-based CNN models have been explored. Li et al. [
8] introduce a new double-branch dual-attention approach termed DBDA, which designs a channel attention block and a spatial attention block to enhance classification performance. Roy et al. [
24] introduce efficient feature recalibration (EFR) into improved 3D residual blocks to adjust the size of the receptive field and enhance cross-channel relationships. Furthermore, Wang et al. [
23] develop a novel attention-based multi-scale CNN architecture to better capture pixel-level discriminative features. These CNN-based models typically contain multiple small convolutional kernels, thus excelling at extracting local spatial–spectral features. However, they have difficulty in modeling the long-range dependencies between land covers because of the inherent locality of their small kernel convolutions.
(3) GCN-based models: Leveraging the capability of GCNs to capture spatial relationships ranging from short range to long range, they have been extensively utilized in HSI classification tasks. Qin et al. [
17] present a spectral–spatial GCN by establishing each pixel of HSI as a graph node. Subsequently, Bai et al. [
18] propose attention GCN frameworks to capture the non-Euclidean features of HSI. The aforementioned GCN models treat each pixel in HSI as a graph node, leading to tremendous computations. To overcome the limitation, many works explore superpixel-based GCNs that use superpixels instead of pixels as graph nodes [
55,
56]. For example, Wang et al. [
56] significantly reduce the computational complexity by constructing a superpixel graph, which facilitates GCNs in handling large HSI data. Nevertheless, in these superpixel-based networks, the features of pixels are shared within each superpixel, thereby inevitably overlooking the individual characteristics of those pixels. This neglect results in limited classification accuracy. To address these limitations, CNN and GCN fusion-based frameworks have been introduced [
28,
29,
57]. These fusion models fully leverage the advantages of their CNNs and GCNs, jointly mining features at both the pixel level and superpixel level to achieve complementary feature extraction. However, these models struggle with classifying large-scale HSI data, due to the complex computations and memory costs incurred by graph construction.
(4) Transformer-based models: Researchers have successfully applied Transformer to HSI classification. For instance, He et al. [
58] and Hong et al. [
19] propose Transformer-based approaches, employing the multi-head self-attention (MHSA) [
59] mechanism to model long-range relationships across different land cover types. In addition to these single Transformer structures, several fusion architectures have been developed. SSFTT [
10] captures low-level and high-level features through two convolutional layers and a Gaussian weighted feature tokenizer, respectively, before establishing global information with a Transformer. Subsequently, Zhao et al. [
32] present a lightweight groupwise separable convolutional vision Transformer network (GSCViT), utilizing groupwise separable convolutions and groupwise separable MHSA blocks to extract both local and global features. Furthermore, several works, such as MVAHN [
60] and GTFN [
31], explore the fusion of Transformers with GCNs to leverage the strengths of both models, thereby enhancing classification performance. Although these models effectively model long-term dependencies, they suffer from challenges in terms of speed and memory usage in large-scale HSI, owing to the quadratic computational complexity of Transformer.
(5) Mamba-based models: Recently, several Mamba models [
20,
34,
61] have been explored for HSI classification. SpectralMamba [
34] adopts Mamba to capture global information and model long-range relationships, achieving linear computational complexity. Zhou et al. [
61] present Mamba-in-Mamba architecture to extract global features, which is more efficient in computation than Transformer-based models. Furthermore, Li et al. [
20] design spatial and spectral Mamba blocks to extract spatial and spectral features. Although these models excel in long-range modeling and maintain linear computational complexity, they necessitate enhanced local feature extraction capabilities.
(6) LKCNN-based models: Motivated by the success of LKCNN models in natural visual tasks, several works utilize LKCNNs to extract the long-range features of HSI [
39,
40,
41]. LiteCCLKNet [
39] employs the criss-cross large kernel module to learn global information. LKSSAN [
40] and SSLKA [
41] apply large kernel attention (LKA) [
37], which consists of depthwise convolution (DWC), depthwise dilation convolution (DDC), and pointwise convolution (PWC), to extract long-range features. To enhance both the local and global feature extraction capabilities of these large kernel networks while significantly reducing their parameter size and computational complexity, we design a new multi-scale large kernel asymmetric CNN (MSLKACNN). Unlike LKSSAN and SSLKA, which use a sequence of DWC, DDC, and PWC to capture global features, our MSLKACNN captures both local and global spatial features while improving computational efficiency and reducing the number of parameters by employing two parallel multi-scale asymmetric convolution components: (1) the proposed MLKADC, which constructs asymmetric DWCs ranging from a sequence of two DWCs with small kernels to a sequence of two DWCs with 1 × 17 and 17 × 1 kernels, for extracting small local, larger local, and global spatial features; and (2) the proposed MADDC, which extracts spatial information among pixels at varying distances via multi-scale learning and asymmetric dilated DWCs.
4. Experiment
In this section, we first describe four publicly available benchmark HSI datasets. Then, we introduce the evaluation metrics, compared methods, and implementation details. Next, we qualitatively and quantitatively assess the performance of the proposed MSLKACNN and state-of-the-art methods. Subsequently, we compare different training samples and fusion schemes, as well as training and testing times across various methods. Finally, we conduct several ablation studies to analyze the impacts of key components and hyperparameters.
4.1. Dataset Description
In our experiments, the four HSI datasets are Indian Pines, Botswana, Houston 2013, and WHU-Hi-LongKou (LongKou), respectively. We summarize the details of these datasets in
Table 1 and
Table 2.
(1) Indian Pines: The Indian Pines dataset was acquired by the Airborne Visible Infrared Imaging Spectrometer sensor in 1992. It contains 10,249 labeled pixels with 16 ground-truth classes and consists of 145 × 145 pixels in the wavelength range from 0.4 to 2.5 μm. After removing the noisy and water absorption bands 104–108, 150–163, and 220, 200 spectral bands are retained.
(2) Botswana: The Botswana dataset was captured by the NASA EO-1 satellite over the Okavango Delta region in Botswana. The whole image comprises 1476 × 256 pixels with 242 spectral bands, 14 land cover categories, and wavelengths ranging from 0.4 to 2.5 μm. We retain 145 spectral bands after removing 97 noisy bands.
(3) Houston 2013: The Houston 2013 dataset was provided by the National Center for Airborne Laser Mapping (NCALM) over the University of Houston in 2013 [
64]. The dataset contains 15,029 labeled pixels with 16 land cover categories, comprising 349 × 1905 pixels with 144 spectral bands ranging from 0.38 to 1.05 μm.
(4) WHU-Hi-LongKou (LongKou): The LongKou dataset was gathered by using an 8 mm focal length Headwall Nano-Hyperspec imaging sensor over the town of LongKou, Hubei Province, China in 2018 [
65]. The HSI consists of 550 × 400 pixels with 9 land cover classes and 240 spectral bands in the wavelength range from 0.4 to 1.0 μm.
4.2. Experimental Setup
(1) Evaluation Metrics: To quantitatively analyze the effectiveness of the proposed MSLKACNN, four evaluation metrics are introduced: per-class accuracy, overall accuracy (OA), average accuracy (AA), and Kappa coefficient (KAPPA). Furthermore, the classification maps produced by various models are visualized to enable a qualitative assessment.
(2) Comparison Methods: To demonstrate the strengths of the proposed MSLKACNN, ten comparison methods are selected and evaluated. These comparison methods are divided into five categories: (a) CNN-based methods: the double-branch dual-attention network (DBDA) [8], and the attention-based adaptive spectral–spatial kernel ResNet (A2S2K-ResNet) [
24]; (b) GCN-based methods: the CNN-enhanced GCN (CEGCN) [
9], the fast dynamic graph convolutional network and CNN parallel network (FDGC) [
27], and the GCN and transformer fusion network (GTFN) [
31]; (c) Transformer-based methods: the spectral–spatial feature tokenization transformer (SSFTT) [
10], the groupwise separable convolutional vision Transformer (GSC-ViT) [
32], and the double branch convolution-transformer network (DBCTNet) [
33]; (d) Mamba-based method: the spatial–spectral Mamba (MambaHSI) [
20]; and (e) LKCNN-based method: the spectral–spatial large kernel attention network (SSLKA) [
41].
(3) Implementation Details: All experiments are implemented with Python 3.10 and the PyTorch platform on a machine equipped with an Intel Xeon Silver 4210 CPU and an RTX 3090 GPU. We adopt the Adam optimizer with a learning rate of 0.001. In the proposed MSLKACNN, the number of filters in all convolutional layers is set to 64. For our MSLKAC, we set the large kernel size m in MLKADC to 17 and the kernel size k in MADDC to 5. We train our model for 200 epochs on the Botswana dataset, 120 epochs on the Houston 2013 dataset, and 150 epochs on the other datasets. All experiments of our MSLKACNN and the comparison methods are repeated twenty times with different random initializations, and the average results are reported for each evaluation metric.
4.3. Comparison with State-of-the-Art Methods
In this section, we conduct a quantitative and qualitative evaluation between the proposed MSLKACNN and existing state-of-the-art baselines on the Indian Pines, Botswana, Houston 2013, and LongKou datasets. These baselines are implemented using the optimal parameters as described in their respective references.
(1) Results on Indian Pines: Table 3 shows the quantitative comparison of all methods on the Indian Pines dataset. From the table, we observe that our MSLKACNN outperforms almost all baselines (except for MambaHSI in KAPPA) in terms of OA, AA, and KAPPA, as well as in seven out of sixteen land cover categories. Specifically, in terms of OA, MSLKACNN improves over the CNN approaches by at least 7.92%, over the GCN approaches by at least 24.06%, over the Transformer approaches by at least 21.15%, over the Mamba approach by 24.56%, and over the LKCNN approach by 15.84%. These improvements highlight the superiority of the proposed MSLKACNN.
Figure 4 illustrates a qualitative evaluation through the visualization of classification maps obtained by various methods on the Indian Pines dataset. These maps clearly show that the proposed MSLKACNN exhibits fewer misclassifications in many classes, such as “Corn-notill” and “Soybean-notill”, in comparison to other methods.
(2) Results on Botswana: The comparative results of various approaches on the Botswana dataset are summarized in
Table 4. The results reveal two key findings: (a) Among all methods, the GSC-ViT, MambaHSI, SSLKA, and CEGCN models achieve the third-best, fourth-best, fifth-best, and sixth-best performance in terms of OA and AA, respectively. This is mainly due to the fact that these models can effectively establish long-range dependencies within the HSI data by utilizing Transformer, Mamba, LKCNN, and GCN, respectively. (b) Our MSLKACNN, which employs multi-scale asymmetric convolutions with kernels ranging from small to large, excels in capturing global features that are neglected by traditional CNNs, performing better than baseline methods in evaluation metrics, including OA, AA, and KAPPA. Specifically, in terms of OA, AA, and KAPPA, MSLKACNN outperforms GTFN by 15.79%, 15.13%, and 17.11%, respectively; outperforms DBCTNet by 6.99%, 6.39%, and 7.57%, respectively; outperforms MambaHSI by 4.41%, 4.15%, and 6.76%, respectively; and outperforms SSLKA by 4.75%, 5.66%, and 5.16%, respectively. These findings further validate the effectiveness of MSLKACNN.
The classification maps of various methods on the Botswana dataset are displayed in
Figure 5. Given the significant uneven distribution of various land covers within the highly sparse dataset, we zoom in on the two red boxed areas in the classification maps to facilitate a more accurate qualitative assessment. According to these enlarged maps, we observe that the proposed MSLKACNN achieves a superior classification map compared to the comparison methods.
(3) Results on Houston 2013: Table 5 presents the quantitative results achieved by different methods on the Houston 2013 dataset. From these results, it is evident that DBCTNet and GSC-ViT, which integrate convolution and Transformer, rank third and fourth, respectively, among the eleven methods. This indicates their strengths in capturing local features through the convolution and establishing long-range dependencies among pixels via the Transformer. Additionally, MSLKACNN outperforms other methods by a substantial margin in terms of OA, AA, and KAPPA, which demonstrates the superiority of our model in learning local-to-global information through asymmetric convolutions with small-to-large kernels.
The qualitative classification maps of diverse methods are depicted in
Figure 6. To aid a visual evaluation, we zoom in on the two red boxed areas in the classification maps. From these enlarged maps, we see that MSLKACNN exhibits a superior classification map in the classes of “Residential” and “Road” compared to comparison baselines.
(4) Results on LongKou: Table 6 displays the numerical results obtained by the various algorithms on the LongKou dataset. Consistent with the findings on the other datasets, our proposed MSLKACNN demonstrates a notable enhancement over all benchmark methods, exceeding the second-place method (CEGCN) by 2.20%, 6.20%, and 2.77% in terms of OA, AA, and KAPPA, respectively. This enhancement again shows the strength of our MSLKACNN.
As illustrated in
Figure 7, a visual examination indicates that the classification map of MSLKACNN is closer to the ground truth compared to other methods, especially in distinguishing the category of “Broad-Leaf Soybean”.
4.4. Analysis of All Methods Under Various Numbers of Training Samples
In this section, we conduct a comparative analysis of the OA achieved by diverse methods using different numbers of training samples per class. Specifically, we utilize 2, 4, 6, 8, and 10 training samples for each dataset. A uniform number of five validation samples is maintained for all methods across all datasets. As shown in
Figure 8, the OA results of most methods demonstrate an upward trend as the number of training samples increases. However, in a minority of cases, we observe that the OA results of a few competitive methods, such as GSC-ViT, decrease unexpectedly with more training samples. These anomalous results may potentially stem from the additional noise introduced by the increased training data. Conversely, the OA results of CEGCN and our proposed MSLKACNN exhibit a notable improvement with the increase in training samples. This enhancement can be credited to the noise suppression modules in their architectures. Furthermore, in most cases, our MSLKACNN consistently surpasses the comparison methods across various datasets, especially under small training sample sizes, thereby further reinforcing its robustness and superiority for HSI classification tasks.
4.5. Analysis of Diverse Fusion Schemes
As described in
Section 3.2, we introduce two widely used fusion schemes: column concatenation fusion (concatenate) and sum fusion (sum). In Equation (
4), the number of feature maps produced by the proposed MLKADC and MADDC is substantial. Applying concatenate or sum to combine these feature maps increases the number of parameters or generates large feature values, respectively, which may lead to overfitting and gradient explosion issues. To address these challenges, we investigate the AFP fusion scheme. To evaluate our AFP, we compare the OA results achieved by AFP against those of the two fusion schemes.
Figure 9 displays the results. From the figure, it is evident that our AFP significantly outperforms other fusion schemes. This validates the superiority of our AFP in fusing multiple feature maps.
4.6. Analysis of Computational Complexity
Table 7 provides an extensive evaluation of the training time, testing time, parameters, and FLOPs of all methods. The analysis yields the following insights: (1) SSFTT trains faster than the other baseline methods, which can be attributed to its limited number of convolutional layers. (2) CEGCN and MambaHSI operate on the whole HSI as input instead of small HSI cubes, leading to quicker prediction than most other methods. (3) Like CEGCN and MambaHSI, the proposed MSLKACNN also processes the entire HSI as input, achieving the fastest prediction time on all datasets. (4) MSLKACNN outperforms most methods in terms of parameters, owing to its replacement of square kernels with vertical and horizontal kernels. (5) Since CEGCN, MambaHSI, and MSLKACNN take the entire HSI as input, they require significantly more FLOPs than the approaches that use small HSI cubes. Additionally, MSLKACNN significantly outperforms the other methods in terms of classification results. These findings highlight the benefits of incorporating small-to-large kernel asymmetric convolutions in MSLKACNN for practical applications.
4.7. Ablation Study
The proposed MSLKACNN comprises three primary components, the SFEM, the MLKADC, and the MADDC, as well as two critical hyperparameters, the large kernel size in MLKADC and the large kernel size in MADDC. In this section, we perform ablation studies to assess the individual contributions and impact of the three components and the two hyperparameters.
(1) Contributions of Each Component: To assess the individual contributions of these components, we perform a quantitative analysis by selectively removing one of the three components. The results are summarized in
Table 8. To ensure consistency between the number of bands in the original HSI and the number of filters in the convolutional layers, we retain one of the convolution blocks from the SFEM component after its removal. From the table, we observe that the MSLKACNN model without the MLKADC component exhibits suboptimal performance compared to the other variants across most datasets. This indicates that this component plays a more significant role than the others. Moreover, our MSLKACNN consistently surpasses the performance of its modified versions on all datasets. These findings reinforce the effectiveness of the integrated components.
(2) Analysis of Various Large Kernel Sizes in MLKADC: To verify the effect of different large kernel sizes in MLKADC, we conduct a comparative analysis of OA using varying large kernel sizes across four benchmark datasets: Indian Pines, Botswana, Houston 2013, and LongKou. The results are visually depicted in
Figure 10. The figure illustrates a clear trend: the OA tends to increase as the kernel sizes are enlarged in most cases, reaching its peak at kernel sizes of 1 × 17 and 17 × 1. Nevertheless, a further increase in kernel sizes leads to a decline in OA. This finding is vital for determining the optimal large kernel sizes for MLKADC.
(3) Analysis of Different Kernel Sizes in MADDC: To evaluate the influence of various kernel sizes in MADDC, we compare the OA results achieved by diverse kernel sizes on the Indian Pines, Botswana, Houston 2013, and LongKou datasets. These results are illustrated in
Figure 11. We observe that the MSLKACNN model equipped with kernel sizes of 1 × 5 and 5 × 1 outperforms its variant models utilizing alternative kernel sizes, thereby determining the optimal kernel sizes for MADDC.
6. Further Discussion
As shown in
Table 3,
Table 4,
Table 5 and
Table 6, the proposed MSLKACNN demonstrates performance superior to that of five major categories of deep learning approaches: (1) CNNs, (2) GCNs, (3) Transformers, (4) Mamba, and (5) LKCNNs. From these results, we observe that the LKCNN method SSLKA exhibits significantly lower classification performance than most benchmark methods on the high-density dataset (LongKou), while outperforming most comparative methods on the remaining datasets (Indian Pines, Botswana, and Houston 2013). This implies that SSLKA may be unsuitable for processing dense HSI data. Notably, compared to the most closely related method, SSLKA, the proposed MSLKACNN shows significant performance improvements across all datasets, with OA gains of 9.18%, 4.75%, 0.67%, and 9.14% on the Indian Pines, Botswana, Houston 2013, and LongKou datasets, respectively. These performance gains can be attributed to the enhanced ability of MSLKACNN to extract and integrate both local and global features through asymmetric convolutions with small-to-large kernels. In addition, as shown in
Table 7, our MSLKACNN outperforms SSLKA by a large margin in terms of parameters and testing time, demonstrating the advantages of replacing square kernels with vertical and horizontal kernels in MSLKACNN.
Although the proposed MSLKACNN demonstrates significant advantages in classification performance, inference speed, and parameters, the use of entire HSI rather than its small cubes as input leads to higher computational complexity, posing challenges when dealing with extremely large-scale datasets. Furthermore, while our parallel asymmetric convolutions with small-to-large kernels effectively capture local-to-global features, the absence of an attention mechanism may limit the model’s ability to focus on critical spatial features, which could affect discriminative feature learning in complex scenarios.