1. Introduction
Hyperspectral imaging captures both spatial information and rich narrowband spectral information, playing a critical role in surface analysis, object detection, and natural resource monitoring across various fields [1,2,3]. Hyperspectral images (HSIs) gather spectral data for every pixel within a given spatial region, with each pixel corresponding to a distinct land cover type [4]. Hyperspectral imagery possesses a higher spectral resolution than panchromatic and multispectral images, enabling it to offer more detailed characterizations of terrestrial objects [5]. HSI classification is a critical component of hyperspectral image analysis. However, it remains challenging due to the redundancy of spectral bands and the uneven distribution of samples.
Numerous methods have been proposed to enhance the accuracy of hyperspectral image classification. In the early phase, traditional techniques, such as support vector machines (SVMs) [6], sparse representation [7], k-nearest neighbors [8], and random forests (RFs) [9], were predominantly used. Early efforts in hyperspectral classification were mainly directed towards extracting spectral information. However, neighboring pixels in hyperspectral imagery exhibit a high degree of correlation and often belong to the same category. Traditional approaches concentrated exclusively on spectral data, neglecting the spatial correlations present within the imagery, and thus exploited the available hyperspectral features only incompletely.
As deep learning technologies have swiftly advanced in domains like computer vision (CV) and natural language processing (NLP), their application has equally proliferated in hyperspectral classification [10]. Unlike traditional classification methods that rely on manually designed features, deep learning excels at identifying high-level semantic features, providing superior end-to-end feature representation. Owing to their localized receptive fields and robustness against distortion, convolutional neural networks (CNNs) have proven to be of considerable practical importance, especially for local feature extraction. Because hyperspectral images have a three-dimensional data structure, they derive significant advantages from the widespread use of 1D [11], 2D [12], and 3D [13] convolutional neural networks. 1D CNNs specialize in extracting spectral features, analyzing variations along a single dimension. 2D CNNs extract spatial features, interpreting patterns and textures in two-dimensional data. 3D CNNs process spatial and spectral information simultaneously, at the cost of increased computational demand. Ahmad et al. [14] designed a compact 3D convolutional network that applies three-dimensional kernels across multiple adjacent spectral bands to reduce computational demand. Utilizing 3D convolutions with variable kernel sizes and residual connections, Zhong et al. [15] efficiently extracted spectral and spatial features from HSI volumes. Combining 3D and 2D CNNs in a sequential arrangement adeptly leverages the complementary strengths of both architectures [16,17]. Additionally, integrating attention mechanisms with multi-branch convolutional architectures efficiently extracts discriminative spatial and spectral features [18,19]. Nevertheless, the localized receptive fields and fixed convolution kernel dimensions inherent to CNN-based architectures hinder their ability to capture global features spanning long distances across both spatial dimensions and broad spectral bands.
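To make the 3D-then-2D pattern concrete, the following minimal PyTorch sketch (our illustration, not the exact architecture of [14,15,16,17]) shows a 3D convolution jointly mixing spectral and spatial information before the spectral axis is folded into the channel dimension for a 2D stage; all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class Hybrid3D2DExtractor(nn.Module):
    """Illustrative 3D-then-2D CNN feature extractor for HSI patches.

    Input: (batch, 1, bands, height, width) patch cubes. The 3D stage mixes
    spectral and spatial information jointly; the 2D stage then refines
    spatial features after the spectral axis is folded into the channels.
    """
    def __init__(self, bands: int, num_classes: int):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.BatchNorm3d(8),
            nn.ReLU(inplace=True),
        )
        self.conv2d = nn.Sequential(
            nn.Conv2d(8 * bands, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):                      # x: (B, 1, bands, H, W)
        x = self.conv3d(x)                     # (B, 8, bands, H, W)
        b, c, d, h, w = x.shape
        x = x.reshape(b, c * d, h, w)          # fold spectra into channels
        x = self.conv2d(x)                     # (B, 64, H, W)
        x = x.mean(dim=(2, 3))                 # global average pooling
        return self.head(x)

# e.g. a 16x16 patch with 200 bands (Indian Pines after band removal)
logits = Hybrid3D2DExtractor(bands=200, num_classes=16)(torch.randn(2, 1, 200, 16, 16))
```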
Recently, the Vision Transformer (ViT) [20], along with its derivative models [21,22,23], has achieved remarkable advances in image processing. By employing multi-head self-attention, transformer-based models such as ViT have demonstrated proficiency in capturing long-range feature correlations. ViT has also swiftly expanded its application scope, notably to hyperspectral classification [24,25,26], indicating its utility and adaptability in diverse imaging contexts. Hong et al. [27] successfully applied a transformer encoder to hyperspectral classification without any convolutional operations. However, pure transformer-based methods struggle to capture local detail features in both the spectral and spatial dimensions. A natural idea is therefore to combine the advantages of CNN structures with those of transformer structures. Sun et al. [28] observed a significant enhancement in classification performance by placing 3D and 2D CNN structures ahead of the transformer encoder. Qi et al. [4] combined 3D CNN and transformer structures to extract global and local features of hyperspectral images simultaneously. By incorporating convolution operations into the multi-head attention mechanism, Zhang et al. [29] achieved a deep integration of CNN and transformer architectures, resulting in impressive classification performance. Yang et al. [30] developed a novel parallel multi-level feature fusion structure for integrating global and local features. Merging the benefits of CNN and transformer architectures improves classification performance, yet their static receptive fields, ranging from the local domain of convolution kernels to the global scope of self-attention, remain a limitation. Such fixed fields may constrain further gains, particularly for irregular and fragmented object contours, indicating opportunities for further advancement.
Building on the analysis of the aforementioned work, we designed a novel dual-branch network that combines convolution and transformer architectures with adaptive receptive fields, namely, the dual-branch adaptive convolutional transformer (DBACT). The DBACT network comprises three primary components: the three-branch parallel hybrid stem module (TBPH), the local residual convolutional module (LRC), and the global adaptive transformer encoder module (GATE). Employing convolution operations with various kernel sizes and spectral pooling, the TBPH module extracts shallow features and reduces spectral dimensionality. The LRC module, with its skip connection, bolsters the expression of local features. The GATE module, equipped with adaptive receptive fields, performs adaptive global context encoding. A compact cross-attention interaction module (CAI) enhances the interaction between the global and local features within the GATE and LRC modules. The main contributions are as follows (a schematic code sketch of this dataflow appears after the list):
Our study introduces a novel dual-branch adaptive convolutional transformer network that merges the localized feature extraction capabilities of CNNs with the global modeling advantages of transformers. This parallel dual-branch design adeptly captures and fuses discriminative features across spatial and spectral dimensions while preserving the inherent structural properties of HSIs.
The GATE module, distinctively equipped with an adaptive multi-head self-attention mechanism (AMSA), extracts global context features. It also incorporates depthwise convolution to enhance its local feature extraction and to provide implicit positional information. The AMSA mechanism, with its adaptive receptive fields, adjusts to irregular object geometries and accurately captures nuanced edge details.
The LRC module, tasked with capturing local information, works in conjunction with the CAI module to augment the global information assimilated by the GATE module. The CAI module uses a cross-attention mechanism to effectively integrate global and local features.
Assessments on the Salinas Valley, Pavia University, and Indian Pines datasets establish that the DBACT model outperforms existing state-of-the-art models, whether CNN-based or transformer-based.
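The sketch below summarizes the dataflow just described. It is schematic only: the TBPH, LRC, and GATE bodies are stand-in layers (the real modules use multi-branch convolutions, spectral pooling, and adaptive attention), and only the branch topology and the cross-attention fusion follow the text. It assumes PyTorch >= 1.9 for the batch_first arguments.

```python
import torch
import torch.nn as nn

class DBACTSketch(nn.Module):
    """Schematic dataflow of the described dual-branch design; module
    bodies are placeholders, only the topology follows the text."""

    def __init__(self, bands: int, dim: int, num_classes: int):
        super().__init__()
        self.tbph = nn.Conv2d(bands, dim, 3, padding=1)   # stem placeholder
        self.lrc = nn.Conv2d(dim, dim, 3, padding=1)      # local conv placeholder
        self.gate = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.cai = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                        # x: (B, bands, H, W)
        x = self.tbph(x)                         # shared shallow features
        local = x + self.lrc(x)                  # LRC: local branch with skip
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, H*W, C)
        global_feats = self.gate(tokens)         # GATE: global branch
        local_toks = local.flatten(2).transpose(1, 2)
        fused, _ = self.cai(global_feats, local_toks, local_toks)  # CAI fusion
        return self.head(fused.mean(dim=1))      # pool tokens, classify

logits = DBACTSketch(bands=30, dim=64, num_classes=16)(torch.randn(2, 30, 16, 16))
```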
The subsequent sections of this paper are structured as follows: Section 2 describes the related work concerning the DBACT model. Section 3 presents the details of our proposed model along with its essential components. In Section 4, both qualitative and quantitative analyses are performed. Section 5 examines the influence of hyperparameters on model performance and presents ablation experiments to dissect the role of pivotal modules within the model. Finally, Section 6 provides a thorough summary of the entire paper.
4. Results
In this section, to validate the performance of the proposed DBACT model, three well-known hyperspectral datasets are used in the experiments. Comprehensive details of the experimental setup and the datasets are provided, setting the stage for both qualitative and quantitative comparisons with leading-edge models.
4.1. Description of Datasets
Indian Pines (IP): The scene was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in northwestern Indiana in 1992, consisting of 145 × 145 pixels and 224 spectral reflectance bands. The Indian Pines landscape includes two-thirds agriculture and one-third forest or other natural perennial vegetation, featuring two major dual-lane highways, a railway line, some low-density housing, other buildings, and smaller roads. The existing ground truth is divided into 16 categories, which are not all mutually exclusive. After excluding spectral bands affected by noise and water absorption, the study utilized the remaining 200 spectral bands at 10 nm intervals, spanning from 400 to 2500 nm.
Pavia University (PU): The University of Pavia dataset, widely recognized and utilized within the hyperspectral image classification domain, was captured by the airborne ROSIS-03 sensor over the urban area surrounding the University of Pavia, Italy, in 2003. This dataset is notable for its high spatial resolution and broad spectral range, providing essential data for diverse land cover and land use analyses. After discarding 12 spectral bands compromised by noise and water absorption, the dataset comprises 103 spectral channels with spatial dimensions of 610 × 340 pixels. These channels cover a wavelength range from 430 to 860 nm and offer a spatial resolution (SR) of up to 1.3 m. The dataset is organized into 9 distinct classes, each representing a different category of land cover.
Salinas Valley (SV): The Salinas dataset, derived from the Salinas Valley in California, features high-resolution hyperspectral images obtained via the AVIRIS sensor in 1998. This dataset offers comprehensive spectral and spatial data pertinent to the region’s agriculture. It comprises images with dimensions of 512 × 217 pixels and includes 224 spectral bands that span wavelengths from 360 to 2500 nm, with a spatial resolution (SR) of 3.7 m. The dataset categorizes the land cover into 16 distinct classes for classification purposes.
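For readers reproducing these experiments, all three benchmarks are commonly distributed as MATLAB .mat files; a minimal loading sketch follows. The file and variable names below are the commonly circulated ones and may differ by source; adjust paths as needed.

```python
import numpy as np
from scipy.io import loadmat

def load_hsi(name: str):
    """Load one of the three benchmark scenes and its ground truth map."""
    files = {
        "IP": ("Indian_pines_corrected.mat", "indian_pines_corrected",
               "Indian_pines_gt.mat", "indian_pines_gt"),
        "PU": ("PaviaU.mat", "paviaU", "PaviaU_gt.mat", "paviaU_gt"),
        "SV": ("Salinas_corrected.mat", "salinas_corrected",
               "Salinas_gt.mat", "salinas_gt"),
    }
    cube_file, cube_key, gt_file, gt_key = files[name]
    cube = loadmat(cube_file)[cube_key].astype(np.float32)  # (H, W, bands)
    gt = loadmat(gt_file)[gt_key].astype(np.int64)          # (H, W) labels
    return cube, gt

cube, gt = load_hsi("IP")   # e.g. (145, 145, 200) and (145, 145)
```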
The false color maps, ground truth maps, and category distributions for the three datasets we used, IP, PU, and SV, are illustrated in Figure 5, Figure 6 and Figure 7, respectively. Ground truth maps are images annotated with accurate classification labels, representing the actual categorization of the land cover. In the IP and SV datasets, the distribution of features is comparatively regular, whereas in the PU dataset the features tend to be more scattered.
These three datasets are randomly divided into training and testing sets, with the distribution of categories detailed in Table 1. We allocate about 8%, 5%, and 3% of the IP, PU, and SV dataset samples for training, respectively, with the remainder designated as the test set. Classification performance across all models is evaluated using three prevalent quantitative metrics: Overall Accuracy (OA), Average Accuracy (AA), and the Kappa Coefficient (K).
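These three metrics are standard: OA is the fraction of correctly classified test pixels, AA is the mean of the per-class accuracies, and K measures agreement corrected for chance. A minimal sketch of their computation from a confusion matrix:

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """OA, AA, and Kappa from integer label arrays (standard definitions).
    Assumes every class index 0..num_classes-1 occurs in y_true."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                    # confusion matrix
    n = cm.sum()
    oa = np.trace(cm) / n                                # Overall Accuracy
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()           # Average Accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                         # Kappa Coefficient
    return oa, aa, kappa
```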
4.2. Experimental Setup
Experiments were executed on a hardware configuration consisting of an Intel(R) Core(TM) i7-9700 CPU at 3.00 GHz, 48 GB RAM, and a GeForce RTX 3080 GPU (10 GB VRAM). The software framework included PyTorch 1.8, Python 3.8, and CUDA 11.1, running on a Windows 10 operating system. Min-max scaling was applied to normalize the original HSI datasets, adjusting values to fall within the [0, 1] range. Models were trained with a batch size of 64 and a learning rate of 1 × 10⁻³. The DBACT model was optimized with the AdamW optimizer using a weight decay factor of 0.03, and a cosine learning rate scheduler (CosineLRScheduler) was applied to adjust the learning rate dynamically. For a fair comparison, the compared models were configured according to the hyperparameters recommended in their respective publications. The number of epochs was set to 100 in all experiments. All experiments were repeated 10 times under the same conditions except for random seeds, and the average was taken as the final value.
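A minimal sketch of this training configuration is shown below. It substitutes torch's built-in CosineAnnealingLR for the CosineLRScheduler named above (which we take to be the timm implementation), reuses the DBACTSketch placeholder from Section 1, and assumes a train_loader yielding (patch, label) batches of size 64.

```python
import torch

# Hyperparameters as stated: AdamW, lr 1e-3, weight decay 0.03, 100 epochs.
model = DBACTSketch(bands=200, dim=64, num_classes=16)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.03)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):                       # 100 epochs in all experiments
    for patches, labels in train_loader:       # assumed DataLoader, batch 64
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                           # cosine learning rate decay
```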
4.3. Classification Results of Different Models
Through extensive experiments against other state-of-the-art networks, we validate the effectiveness of the proposed model. To thoroughly assess the DBACT model against leading-edge counterparts, we employ both quantitative analysis and visual evaluation. Our comparative analysis involves two types of architectures to highlight the superior performance of the proposed method: CNN-based models, including SSRN [15], DBDA [19], SPRN [18], and DCRN [44], and transformer-based networks, including SSTN [32], SpectralFormer [27], SSFTT [28], and CTMixer [29].
4.3.1. Quantitative Analysis
The classification performance of these models on the three public HSI datasets is shown in Table 2, Table 3 and Table 4. These tables include a comprehensive summary of the OA, AA, and Kappa coefficients, along with their standard deviations and per-class accuracy metrics. The highest classification accuracy achieved in each category is highlighted in bold in the corresponding rows. A comparison of the Overall Accuracy (OA) of the different models across the three datasets is shown in Figure 8. CNN-based techniques exhibit a modest advantage over transformer-based methods.
The results demonstrate that the DBACT model surpasses competing methods across all three datasets. For the IP dataset, characterized by highly imbalanced and scarce samples, the DBACT model achieves the highest accuracy in 7 out of 16 categories. For crops with dispersed locations, such as CornN, GrassP, GrassT, and SoybeanM, the model achieves superior classification accuracy. The DBACT model outperforms the alternatives by at least 0.24% on AA, 0.21% on OA, and 0.27% on Kappa, highlighting its capability to extract representative features under conditions of sample imbalance and scarcity. CNN-based models, with their powerful local feature-capturing abilities, marginally outperform those based on transformers. The DBDA model achieves highly competitive classification accuracy, which highlights the effectiveness of fusion via a dual-branch structure and attention mechanisms. Pure transformer architectures, such as SpectralFormer, exhibit inadequate performance due to constrained local feature extraction capabilities. However, even merging convolutional and transformer structures in a simple sequential combination significantly enhances classification performance, as SSFTT shows. Embedding convolutional operations deeply within the self-attention calculations, exemplified by the CTMixer model, contributes significantly to improved ground object classification. The DBACT model, which introduces a spatial-spectral dual-branch framework integrating convolution with self-attention, excels at capturing discriminative features on both global and local scales.
Integrating adaptive multi-head self-attention with convolution in our proposed model aids in capturing diagnostic features and subtle discrepancies. As demonstrated in Table 3, fragmented distributions of ground objects, such as Trees and Bricks, can be effectively captured by the DBACT model. The DBACT model secures the highest classification accuracy in 3 out of 9 ground object classes and surpasses competing models by margins of at least 0.08% on AA, 0.1% on OA, and 0.1% on Kappa. As with the IP dataset, models relying solely on attention architectures, such as SpectralFormer and SSTN, exhibit relatively inferior performance. Integrating the attention mechanism with a CNN's local feature extraction capabilities significantly enhances the model's ability to learn discriminative features. The DBACT model integrates an adaptive attention mechanism that facilitates minor shape adjustments based on local information, thereby enhancing feature extraction from specific areas and more effectively capturing subtle features.
The DBACT model features an interactive architecture that combines branches for global and local feature extraction, allowing both types of features to be extracted concurrently. This design proves effective even for datasets with extensive and comparatively regular spatial distributions. As shown in Table 4, the SV dataset is characterized by a relatively concentrated distribution of geographical features with regular shapes, and the DBACT model excels at capturing the boundary information between these regularly distributed features. It demonstrates superior classification accuracy, leading in 5 out of 16 ground object categories and exceeding the other models by margins of at least 0.09% on AA, 0.04% on OA, and 0.11% on Kappa.
4.3.2. Qualitative Analysis
Alongside the quantitative analysis, qualitative assessments depicted in Figure 9, Figure 10 and Figure 11 were conducted to ensure a comprehensive evaluation. Visual inspection reveals that the DBACT model precisely delineates the boundaries between different types of ground objects, generating significantly less intra-category noise than competing models. For instance, in the SV dataset, distinguishing the Grapes_untrained and Vinyard_untrained categories is rendered difficult by their similar attributes. The DBACT model addresses this challenge by integrating adaptive receptive fields with the global self-attention mechanism; this combination adeptly captures subtle features and facilitates effective differentiation between the two categories.
Overall, the classification maps for the three datasets are consistent with the results of the quantitative analysis. The DBACT model excels in a variety of challenging scenarios: datasets with unevenly distributed ground objects and small, imbalanced sample sizes (e.g., the IP dataset), datasets characterized by a fragmented distribution of geographical features (e.g., the PU dataset), and datasets where crops are distributed in an orderly fashion with distinct boundary lines (e.g., the SV dataset). The DBACT model achieves the highest classification accuracy in all cases, proving its effectiveness in identifying diverse object types.
5. Discussion
In this section, we analyze the model's structural parameters and present ablation experiments. The parameter analysis primarily covers batch size, patch size, and the number of heads in the self-attention mechanism. The ablation experiments analyze the roles played by key components of the model.
5.1. Parameter Analysis
In this section, we investigate the effects of hyperparameters on Overall Accuracy (OA), including batch size, patch size, and the number of heads in the adaptive self-attention mechanism, with the objective of determining the optimal network structure parameters. Well-chosen batch and patch sizes facilitate neural network convergence and performance. As illustrated in Figure 12, for the three datasets, patch size varies from 8 × 8 to 20 × 20 in increments of 2, and batch size ranges from 32 to 256, doubling at each step. The relationship between patch size and classification accuracy is non-linear: with batch size held constant, OA improves as patch size increases (from 8 × 8 to 16 × 16) but begins to decrease once patch size crosses a certain threshold (from 16 × 16 to 20 × 20). Relative to patch size, batch size has a comparatively minor effect on OA; nonetheless, a batch size of 64 yields the best classification performance. Consequently, the patch size and batch size are set to 16 and 64, respectively.
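For context, the patch size governs how much spatial neighborhood each labeled pixel contributes to the network. A minimal sketch of the typical patch extraction step (our illustration; padding and centering conventions vary between papers) is shown below, including the min-max normalization described in Section 4.2.

```python
import numpy as np

def extract_patches(cube, gt, patch_size=16):
    """Extract patch_size x patch_size cubes centered on each labeled pixel.
    Illustrative preprocessing; conventions for even patch sizes vary."""
    cube = (cube - cube.min()) / (cube.max() - cube.min())  # min-max to [0, 1]
    pad = patch_size // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    patches, labels = [], []
    for r, c in zip(*np.nonzero(gt)):            # labeled pixels only
        patches.append(padded[r:r + patch_size, c:c + patch_size, :])
        labels.append(gt[r, c] - 1)              # classes 1..K -> 0..K-1
    return np.stack(patches), np.array(labels)
```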
In the global adaptive transformer encoder branch, two self-attention blocks are employed, with the number of heads in the first and second blocks designated as head1 and head2, respectively.
Figure 13 depicts the impact of the head1 and head2 counts on OA as the number of heads varies from 2 to 8. Compared to patch size, the number of heads has a relatively minor effect on classification. Performance improves as the number of heads rises, achieving optimal results when head1 and head2 are both set to 8 on the three datasets. Multiple heads enable the extraction of representative features from various subspaces of the HSI. Therefore, the optimal settings for both head1 and head2 are 8.
5.2. Ablation Study
To elucidate the contribution of each component within the model to its performance, we conducted ablation experiments on these three datasets (IP, PU, and SV). Key components of our model for ablation studies include LRC (Local Residual Convolution module), GATE (Global Adaptive Transformer Encoder module), AMSA (Adaptive Multi-head Self-Attention Mechanism), and CAI (Cross-Attention Interaction module).
Overall, each of the four critical components of the DBACT model enhances classification performance to some extent, with the GATE module making the most substantial contribution, as shown in Table 5, Table 6 and Table 7. Removing the GATE module results in the most significant decrease in Overall Accuracy (OA), by 0.59%, demonstrating the critical role that GATE's global feature extraction capability plays in classification. The slight degradation in performance caused by removing the AMSA module indicates that our dual-branch structure inherently possesses a high capability for feature learning, while the AMSA module raises the upper limit of performance. Omitting the LRC and CAI modules leads to a noticeable reduction in performance, highlighting the essential supportive role that local features play in augmenting global features. These ablation results confirm the effectiveness and robustness of the DBACT model.
In summary, the Global Adaptive Transformer Encoder (GATE) branch, incorporating an adaptive multi-head self-attention mechanism, efficiently extracts features from arbitrarily shaped objects, while the Local Residual Convolution (LRC) module, leveraging residual connections, adeptly captures local representations. Combining the GATE and LRC branches via cross-attention, the DBACT model possesses robust feature extraction capabilities, effectively merging spatial and spectral data to extract representative high-level semantic information from hyperspectral images.
6. Conclusions
This study presents a dual-branch adaptive convolutional transformer (DBACT) network that integrates both global and local features for HSI classification. In the global feature extraction branch, we incorporated an adaptive multi-head self-attention mechanism capable of obtaining a dynamic global receptive field, thus accommodating a variety of irregular objects. The local feature extraction branch, equipped with residual connections, complements the global branch. The cross-attention mechanism acts as a fusion bridge between global and local features, enhancing the model's overall classification performance. By integrating these structures, the DBACT model adeptly captures global and local spatial and spectral features of HSIs across multiple scales. This comprehensive approach enables a nuanced understanding and representation of HSI data, facilitating superior classification performance even in complex scenarios involving diverse spatial and spectral variations. Across the three commonly used hyperspectral datasets, IP, PU, and SV, the DBACT model outperformed other state-of-the-art models on the Overall Accuracy, Average Accuracy, and Kappa Coefficient metrics, confirming its effectiveness.
In the future, we aim to investigate the implementation of dynamic receptive fields within fully three-dimensional data structures. Moreover, to tackle the challenge of limited labeled data, we plan to develop generative models for synthesizing hyperspectral data, thereby enriching dataset diversity.