1. Introduction
Wetlands, as an integral part of global ecosystems, perform irreplaceable ecological functions such as flood regulation, water purification, carbon sequestration, and the maintenance of biodiversity [1]. Numerous studies have shown that wetland resources worldwide have experienced significant degradation since the Industrial Revolution, with over 50% of natural wetlands having been replaced by urban expansion and agricultural development [2,3,4]. Under the dual pressures of climate change and human impact, wetlands are now facing serious challenges including degradation, fragmentation, and declining ecological functionality. As a result, developing high-resolution, sustainable, and long-term monitoring and classification systems has become a key requirement for effective ecological conservation and management [5,6].
Hyperspectral imagery (HSI), characterized by its dense and narrow spectral bands, enables the precise discrimination of surface materials and is particularly suitable for the classification of heterogeneous and complex wetland environments [7,8,9]. In recent years, the widespread adoption of unmanned aerial vehicles (UAVs) has provided an efficient solution for acquiring hyperspectral data with high spatial and temporal resolution, demonstrating outstanding performance in tasks such as vegetation monitoring and water boundary delineation in wetlands [10,11,12]. However, UAV-based HSI still faces considerable challenges in fine-grained classification due to the complex mixture of wetland land covers, high spectral similarity between classes, blurred object boundaries, and severe class imbalance [13,14,15].
Traditional supervised classification approaches, such as Support Vector Machine (SVM) [16], Random Forest (RF) [17], and Multinomial Logistic Regression (MLR) [18], rely primarily on pixel-level spectral modeling and lack effective spatial context awareness, often leading to misclassification around class boundaries. To compensate for this limitation, hand-crafted spatial enhancement techniques such as Gabor filters [19] and morphological profiles [20] have been introduced. However, these methods are highly dependent on parameter tuning and exhibit limited generalization capability, making them inadequate for the complex and variable terrain of wetlands [21].
With the development of deep learning, convolutional neural networks (CNNs) have become widely used in hyperspectral classification tasks. For instance, Hu et al. proposed a 1D CNN model that efficiently extracts spectral features [22], but it fails to capture spatial structure. HybridSN leverages 3D convolution to jointly model spatial and spectral features, enhancing local texture sensitivity [23], but its limited receptive field and heavy computational burden restrict its ability to model long-range dependencies. SSRN introduces a residual dual-branch architecture that separately extracts spatial and spectral features and fuses them to enhance nonlinear representation capabilities [24], but its static fusion mechanism limits flexibility in adapting to class variability. To address the underperformance on minority classes, Yuan et al. proposed the DBMA network with a multi-branch attention mechanism, which improves feature expressiveness, yet its inter-branch coordination remains constrained [25]. Furthermore, deeper spatial–spectral joint networks have been explored by Makantasis et al. [26] and Wang et al. [27], while Paoletti et al. [7] provided a comprehensive review summarizing the performance variations of different deep models across diverse land cover scenarios.
In recent years, Transformer-based architectures have made remarkable advances in hyperspectral image classification, primarily due to their strong capability in modeling global dependencies. Among them, the Swin Transformer introduces a hierarchical design and shifted window attention, enabling efficient capture of long-range spatial relationships [28]. Models such as SpectralFormer and CTIN integrate token-level spectral–spatial interaction mechanisms, substantially enhancing the fusion of heterogeneous features [29,30]. Furthermore, enhanced designs such as the dual-branch Transformer [31], MS2I2Former [32], and HSD2Former [33] significantly improve boundary delineation and maintain spatial consistency.
Despite these achievements, most existing methods still depend on static weighted fusion or straightforward concatenation strategies, limiting their ability to dynamically model heterogeneous multi-branch features. They also often struggle with challenges like class imbalance and ambiguous boundaries, particularly in complex wetland environments characterized by fragmented patches and underrepresented classes [34,35].
To overcome these limitations, multi-scale feature fusion has emerged as a central focus in recent research. Techniques such as atrous spatial pyramid pooling (ASPP) and attention-based pyramid fusion have proven effective in capturing semantic information across varying spatial scales, leading to more accurate classification of land cover types with large scale variation, including water bodies, vegetation, and bare soil [36,37]. Several Transformer-based architectures further extend multi-scale fusion to the token level. For instance, models like MultiFormer and MSCF-MAM demonstrate that token-level multi-scale modeling greatly enhances spatial–semantic consistency [38,39]. Likewise, the GSSFT model highlights the potential of multi-scale attention modules for accurate recognition of fragmented wetland patches [40].
Collectively, these developments reflect a paradigm shift in hyperspectral classification—from static fusion techniques toward dynamic, structure-adaptive integration mechanisms tailored to complex spatial–spectral characteristics.
Nevertheless, significant limitations remain in applying such architectures to wetland classification. Specifically, (1) most multi-source fusion strategies rely on static weighting or concatenation, lacking adaptive mechanisms tailored to inter-class differences; (2) scale modeling structures are often single-path, failing to balance fine boundary detail and global contour recognition; and (3) insufficient robustness to class imbalance and boundary ambiguity leads to poor recognition of minority classes and produces salt-and-pepper noise in classification maps.
To overcome these challenges, this paper proposes a novel Multi-Branch Channel-Gated Swin Transformer network (MBCG-SwinNet), tailored to UAV hyperspectral imagery from the Yellow River Delta (YRD) wetlands in Shandong, China. The model employs a Swin Transformer spectral branch to extract global spectral context and a CNN spatial branch to capture local texture details; these dual streams learn complementary spectral and spatial representations. A lightweight Global Spectral–Spatial Attention (GSSA) module provides initial cross-branch interaction. More critically, a redesigned Residual Spatial Fusion (RSF) module serves as the core of the fusion mechanism, enabling enhanced inter-branch interaction and spatial residual compensation. Furthermore, a channel-wise gate adaptively re-weights contributions across channels, and the multi-scale feature fusion (MFF) module consolidates multi-level features, improving robustness to the spatial heterogeneity typical of YRD wetland scenes. Finally, a DenseCRF-based post-processing step refines boundary predictions and suppresses salt-and-pepper noise, leading to improved spatial coherence and classification accuracy.
The main contributions of this paper are summarized as follows:
We proposed a dual-branch hyperspectral classification framework combining a Swin Transformer-based spectral branch and a CNN-based spatial branch to jointly exploit global spectral dependencies and local spatial context.
We designed a hierarchical feature fusion strategy integrating Residual Spatial Fusion (RSF), channel-wise gating, and multi-scale feature fusion (MFF), enabling deep inter-branch interaction, dynamic channel re-weighting, and scale-adaptive integration for complex, heterogeneous wetland features.
We incorporated a DenseCRF-based post-processing module into the end-to-end pipeline to refine object boundaries and enforce spatial consistency, improving classification accuracy and boundary delineation on the benchmark datasets NC12, NC13, and NC16.
2. Materials and Methods
2.1. Study Area Overview
This study focuses on the Yellow River Delta (YRD) in northeastern Shandong, China (approx. 37.5–38.4°N, 118.1–119.3°E), a low-relief alluvial–coastal plain bordering the Bohai Sea. The YRD, with elevations mostly below 10 m a.s.l., exhibits strong river–tide interactions that drive pronounced seasonal hydrological variability. The wetland mosaic includes extensive reed marshes (Phragmites australis), salt-marsh communities (e.g., Suaeda salsa), tidal mudflats, shallow ponds/creeks, and saline–alkali bare lands, interspersed with aquaculture ponds and levee infrastructure. High soil salinity, mixed vegetation–substrate patches, and frequent land–water transitions lead to strong spectral similarity, class imbalance, and blurred boundaries, making the YRD a representative yet challenging testbed for UAV-HSI classification under real-world wetland conditions [41].
2.2. Dataset Description
The hyperspectral data used in this study were obtained from the Yellow River Delta Hyperspectral Remote Sensing Database (YRD-HRS), a publicly available dataset first established and released by Xie et al. [42]. This database provides a standardized platform for hyperspectral classification research in wetland ecosystems, featuring high spatial resolution, multi-temporal observations, and pixel-level land cover annotations across diverse wetland types.
Data acquisition was conducted using a DJI M600 (SZ DJI Technology Co., Ltd., Shenzhen, China) UAV platform equipped with a Headwall Nano-Hyperspec sensor (12 mm focal length). The sensor captured 270 spectral bands spanning the 400–1000 nm range, offering both high spatial and spectral fidelity. The YRD-HRS database includes three representative UAV-collected datasets—NC12, NC13, and NC16—each covering distinct ecological and environmental conditions. All three datasets are used in this study to comprehensively evaluate the effectiveness and generalizability of the proposed classification model. Illustrative examples of the datasets are shown in Figure 1.
The NC12 data were collected on 23 September 2020, from 12:27 to 13:10 under clear weather conditions. The UAV flew at an altitude of 300 m, resulting in a spatial resolution of approximately 0.182 m. This area contains 12 land cover types, primarily including water bodies, bare land, and Suaeda glauca, which are typical classes in coastal wetlands. The NC13 data were acquired on 24 September 2020, from 14:47 to 15:20 under overcast conditions, with the UAV flying at the same altitude of 300 m and the same spatial resolution of approximately 0.182 m. The surface composition in this area is more complex, with some wetland types exhibiting low reflectance and dramatic variation, making this dataset particularly suitable for evaluating the spectral modeling performance of classification models under low-illumination conditions. The NC16 data were collected on 23 September 2020, from 13:50 to 14:20 at a flight altitude of 400 m, corresponding to a spatial resolution of approximately 0.266 m; they comprise 16 land cover categories. This dataset is characterized by high spectral redundancy, blurred class boundaries, and predominantly mixed land cover distributions, making it a valuable benchmark for developing and evaluating boundary-sensitive classification models. All three hyperspectral datasets exhibit significant class imbalance. In Figure 2, which depicts the category distribution of the hyperspectral datasets, artificial reference targets (standard reflectance cloth in NC13/NC16; both white cloth and standard reflectance cloth in NC12) were excluded from the visualization as they do not represent natural land cover classes. The distribution of the remaining classes reveals a long-tail pattern in which dominant classes (e.g., water bodies and soil types) comprise most samples, while ecologically important but spatially limited classes (e.g., specific vegetation types) have limited representation.
We adopt a consistent land cover taxonomy across all experiments. Vegetation communities are specified at the community/habitat level and are introduced by the common English name followed by the Latin binomial in italics at first mention (e.g., Suaeda (Suaeda glauca)); thereafter, the English name is used consistently throughout the text, tables, and figures. Non-vegetation classes use standardized land cover terms such as water, mudflat, bare soil, moist soil, and sand/gravel (stone). The complete class lists for NC12, NC13, and NC16, together with train/test counts, are provided in this section. Items labeled "White cloth" (standard reflectance panels for radiometric calibration) and "Iron" (man-made metallic markers/surfaces) are not natural land cover classes. For figure rendering, the same semantic class uses the same color as consistently as possible across datasets, legends map class names to color swatches, and background/no-data pixels are shown as transparent.
2.3. MBCG-SwinNet Architecture
This section provides a brief overview of the proposed Multi-Branch Channel-Gated Swin Transformer network (MBCG-SwinNet) architecture. The model is composed of several key components, including dual-branch spectral–spatial feature extraction, Global Spectral–Spatial Attention (GSSA) fusion, Residual Spatial Fusion (RSF), channel-wise gating, multi-scale feature fusion (MFF), and DenseCRF-based post-processing for refinement. The complete processing pipeline is illustrated in Figure 3.
Specifically, for each input hyperspectral image patch, the model extracts spatial and spectral features via two parallel branches: a spatial branch based on 3D convolutional neural networks (3D-CNNs) and a spectral branch utilizing the Swin Transformer. The extracted features are then fed into the GSSA module, which enables high-order interaction and mutual enhancement between spectral and spatial domains, yielding two sets of enhanced features.
These GSSA-enhanced features are subsequently input into the Residual Spatial Fusion (RSF) module, which facilitates deeper cross-branch interaction and residual compensation. The output from RSF is further processed by a channel-wise gating mechanism, which adaptively modulates feature contributions at a fine-grained channel level to achieve dynamic feature weighting and fusion.
The fused features from RSF and the gating module are then passed through the multi-scale feature fusion (MFF) module, which aggregates multi-scale representations and significantly improves the model’s ability to recognize complex land cover structures. During inference, a Dense Conditional Random Field (DenseCRF) is applied to refine classification boundaries, suppress salt-and-pepper noise, and enhance spatial consistency, resulting in high-precision classification maps.
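For orientation, the following minimal PyTorch sketch mirrors this data flow with placeholder stages (1 × 1 convolutions and identities standing in for the actual modules detailed in Sections 2.4–2.7); the class name, channel width, and stage internals are illustrative assumptions, not our released implementation.

```python
import torch
import torch.nn as nn

class MBCGSwinNetSketch(nn.Module):
    """Structural sketch only: each stage is a placeholder for the real module."""
    def __init__(self, bands=270, channels=64, n_classes=12):
        super().__init__()
        self.spatial_branch = nn.Conv2d(bands, channels, 1)   # stands in for the 3D-CNN branch
        self.spectral_branch = nn.Conv2d(bands, channels, 1)  # stands in for the Swin branch
        self.gssa = nn.Identity()                             # cross-branch enhancement (GSSA)
        self.rsf_gate = nn.Conv2d(2 * channels, channels, 1)  # RSF + channel-wise gating
        self.mff = nn.Conv2d(3 * channels, channels, 1)       # multi-scale feature fusion
        self.head = nn.Conv2d(channels, n_classes, 1)         # pixel-wise classification head

    def forward(self, x):                                     # x: (B, bands, S, S) patch
        f_spa = self.spatial_branch(x)
        f_spe = self.spectral_branch(x)
        f_spa, f_spe = self.gssa(f_spa), self.gssa(f_spe)
        f_fuse = self.rsf_gate(torch.cat([f_spa, f_spe], dim=1))
        f = self.mff(torch.cat([f_fuse, f_spa, f_spe], dim=1))
        return self.head(f)   # DenseCRF refines the stitched map at inference only
```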
2.4. Spatial and Spectral Branch Design
The proposed model adopts a dual-branch architecture, consisting of a spatial branch and a spectral branch, designed to separately exploit the spatial context and spectral sequence features inherent in hyperspectral imagery. In contrast to the MGSSAN method proposed by Xie et al. [42], our model employs a Swin Transformer as the spectral backbone. While hybrid CNN–Transformer architectures have been explored in HSI classification, here, we tailor a Swin-based spectral branch to UAV wetland scenes and integrate it with the proposed RSF, channel-wise gating, and MFF modules for deep spatial–spectral fusion.
2.4.1. Spatial Branch
The spatial branch utilizes a 3D convolutional module to jointly convolve across the spatial neighborhood and spectral dimension, effectively capturing local spatial textures and spectral–spatial joint representations. The input to this branch is a hyperspectral image patch represented as a 3D tensor $X \in \mathbb{R}^{S \times S \times B}$, where $S \times S$ is the spatial neighborhood size and $B$ is the number of spectral bands. The final output is the spatial feature map, denoted $F_{\mathrm{spa}}$.
The spatial branch is effective in capturing local neighborhood structures around each pixel, which helps suppress hyperspectral noise and enhances the modeling of spatial continuity and boundary features of land cover classes. These spatial representations complement the spectral branch and are subsequently fused with spectral features via the Global Spectral–Spatial Attention (GSSA) module for deep interaction and joint enhancement.
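A minimal PyTorch sketch of such a 3D convolutional spatial branch is given below; the layer counts, kernel sizes, and channel widths are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch.nn as nn

class SpatialBranch3D(nn.Module):
    """Sketch: joint spatial-spectral 3D convolutions over a (1, B, S, S) patch volume,
    followed by collapsing the spectral axis into 2D feature channels."""
    def __init__(self, bands, out_channels=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1)),
            nn.BatchNorm3d(8), nn.ReLU(inplace=True),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3), padding=(2, 1, 1)),
            nn.BatchNorm3d(16), nn.ReLU(inplace=True))
        self.proj = nn.Conv2d(16 * bands, out_channels, kernel_size=1)

    def forward(self, x):                 # x: (N, 1, bands, S, S)
        f = self.block(x)                 # (N, 16, bands, S, S)
        n, c, d, h, w = f.shape
        return self.proj(f.reshape(n, c * d, h, w))  # (N, out_channels, S, S)
```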
2.4.2. Spectral Branch
Unlike conventional spectral branches that rely solely on convolutional structures, our approach introduces the Swin Transformer as the backbone of the spectral branch, significantly enhancing the model’s capacity to capture global dependencies and long-range contextual relationships in hyperspectral data. This branch adopts a hierarchical Swin Transformer architecture designed to fully exploit spectral correlations and long-distance dependencies along the band dimension.
As illustrated in Figure 3, the input is a hyperspectral patch of size $S \times S \times B$. The input is first mapped into a low-dimensional feature space via a Linear Embedding layer, forming an initial token sequence that is fed into multiple Swin Transformer Blocks.
Unlike standard Transformers, the Swin Transformer utilizes window-based multi-head self-attention (W-MSA) and Shifted Window Multi-head Self-Attention (SW-MSA) mechanisms. These localized attention operations maintain computational efficiency while effectively modeling spectral relationships across bands and spatial locality. Each hierarchical stage comprises several Swin Transformer Blocks, interleaved with Patch Merging operations that perform progressive downsampling and feature dimension expansion to aggregate increasingly abstract spectral representations.
The internal structure of a single Swin Transformer Block is shown in Figure 4. Each block includes Layer Normalization, a window-based self-attention module, a two-layer multi-layer perceptron (MLP), and residual connections to improve training stability and representational capacity. Both attention variants (W-MSA and SW-MSA) are wrapped with residual connections and normalization to ensure efficient feature flow and training stability. Ultimately, the spectral branch outputs rich multi-scale spectral features, which serve as key inputs for subsequent fusion and classification stages.
The forward propagation of the Swin Transformer block can be described as follows:

$\hat{z}^{l} = \text{W-MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}$

$z^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l}$

$\hat{z}^{l+1} = \text{SW-MSA}(\mathrm{LN}(z^{l})) + z^{l}$

$z^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$

where W-MSA denotes window-based multi-head self-attention, SW-MSA refers to Shifted Window Multi-head Self-Attention, MLP stands for multi-layer perceptron, and LN indicates Layer Normalization. The hierarchical features are progressively downsampled and expanded in channel dimension via Patch Merging layers, enabling multi-scale global representation learning. The final output is a deep spectral representation, denoted $F_{\mathrm{spe}}$, which serves as a crucial input for subsequent fusion modules.
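These update rules can be expressed compactly in code. The sketch below is a simplified, token-level rendition (treating the patch as a 1-D token sequence and omitting 2-D window partitioning, attention masking, and relative position bias); the window size, head count, and MLP ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention applied independently within fixed-size token windows;
    a cyclic roll of the sequence emulates the shifted-window (SW-MSA) variant."""
    def __init__(self, dim, window, heads):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, shift=0):        # x: (B, L, C); assumes L % window == 0
        if shift:
            x = torch.roll(x, -shift, dims=1)
        b, l, c = x.shape
        w = x.reshape(b * (l // self.window), self.window, c)  # partition into windows
        w, _ = self.attn(w, w, w)
        x = w.reshape(b, l, c)
        if shift:
            x = torch.roll(x, shift, dims=1)
        return x

class SwinBlock(nn.Module):
    """One block: (S)W-MSA and MLP, each with pre-LayerNorm and a residual connection,
    matching the four update equations above."""
    def __init__(self, dim, window=8, heads=4, shift=0, mlp_ratio=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa, self.shift = WindowAttention(dim, window, heads), shift
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        z = z + self.msa(self.ln1(z), shift=self.shift)  # z_hat = (S)W-MSA(LN(z)) + z
        return z + self.mlp(self.ln2(z))                 # z = MLP(LN(z_hat)) + z_hat
```

Blocks are stacked in pairs with shift = 0 and shift = window // 2, reproducing the W-MSA/SW-MSA alternation described above.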
2.4.3. Global Spectral–Spatial Attention (GSSA)
To further enhance the flexibility and discriminability of feature fusion between the spatial and spectral branches, the model incorporates the Global Spectral–Spatial Attention (GSSA) mechanism, originally inspired by the design proposed by Xie et al. [42]. GSSA employs lightweight channel and spatial attention modules to guide the mutual complementation of global information between spectral and spatial features. The GSSA process can be abstracted as

$(F'_{\mathrm{spe}}, F'_{\mathrm{spa}}) = \mathrm{GSSA}(F_{\mathrm{spe}}, F_{\mathrm{spa}})$

where $F_{\mathrm{spe}}$ and $F_{\mathrm{spa}}$ denote the spectral and spatial features extracted from the respective branches, and $F'_{\mathrm{spe}}$ and $F'_{\mathrm{spa}}$ denote the corresponding enhanced features after GSSA processing.
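Since the exact GSSA internals follow Xie et al. [42], we give only a hedged sketch of one plausible reading, in which each branch is re-weighted by a channel descriptor derived from the other branch; the layer shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GSSASketch(nn.Module):
    """Cross-branch guidance sketch: each branch modulates the other's channels."""
    def __init__(self, channels):
        super().__init__()
        self.gate_spa = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.gate_spe = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, f_spe, f_spa):      # both: (B, C, H, W), equal channel counts assumed
        w_spa = self.gate_spa(f_spe.mean(dim=(2, 3)))[..., None, None]  # guidance from spectral
        w_spe = self.gate_spe(f_spa.mean(dim=(2, 3)))[..., None, None]  # guidance from spatial
        return f_spe * w_spe, f_spa * w_spa   # enhanced F'_spe, F'_spa
```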
2.5. Improved Residual Spatial Feature Enhancement Module (RSF)
Inspired by recent high-resolution remote sensing models such as Swin-CFNet [43], this study proposes an improved Residual Spatial Feature (RSF) enhancement module tailored for hyperspectral imagery. The module is designed to fully integrate multi-source information from the spectral and spatial branches, thereby enhancing feature representation across both spatial and channel dimensions. To effectively leverage the complementary characteristics of the CNN-based spatial branch and the Swin Transformer-based spectral branch in complex wetland environments, RSF adopts a specialized dual-path structure that couples spectral–recursive gated spatial enhancement with spatial–channel attention modeling. At the output, a channel-wise gating mechanism is introduced to adaptively fuse features from the two branches. The overall structure of the RSF module, along with the MFF (multi-scale feature fusion) mechanism, is illustrated in Figure 5.
2.5.1. Module Input and Branch Pre-Processing
After passing through the GSSA module, the spatial and spectral branches output the enhanced features $F'_{\mathrm{spa}}$ and $F'_{\mathrm{spe}}$, respectively. Since the number of channels in the two branches is not necessarily the same, channel adaptation (e.g., a 1 × 1 convolutional projection to a common channel dimension $C$) is required for subsequent complementary modeling and feature alignment.
2.5.2. Channel Attention Modeling in the Spatial Branch
Based on the convolutional features $F'_{\mathrm{spa}}$, the spatial branch incorporates a channel attention mechanism to enhance the model's ability to differentiate and represent the significance of each spatial feature channel. This mechanism uses a combination of global pooling and a lightweight MLP to model the importance of different channels, formulated as follows:

$A_c = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F'_{\mathrm{spa}})) + \mathrm{MLP}(\mathrm{MaxPool}(F'_{\mathrm{spa}}))\big)$

where MLP denotes a lightweight fully connected network, $\sigma$ is the sigmoid function, and AvgPool and MaxPool represent global average and max pooling, respectively, used to extract global spatial contextual information. The channel attention mechanism aggregates the pooled features via the MLP and applies channel-wise re-weighting ($\odot$), followed by convolution to further refine spatial dependencies. This design allows the model to adaptively modulate the contribution of each spatial channel to the final fusion, highlighting structured spatial patterns such as wetland boundaries and linear features (e.g., roads), while suppressing redundant or noisy components. The fusion of global and local information through pooling balances the model's sensitivity across varying spatial scales, which is crucial for handling large-scale landscape variation in wetland environments.
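This pooling-plus-MLP attention admits a compact CBAM-style implementation; the sketch below assumes a shared two-layer MLP and a reduction ratio of 16 (an illustrative choice).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: shared MLP over average- and max-pooled descriptors,
    sigmoid-normalized and applied as channel-wise re-weighting."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                     # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))    # AvgPool branch
        mx = self.mlp(x.amax(dim=(2, 3)))     # MaxPool branch
        a = torch.sigmoid(avg + mx)[..., None, None]
        return x * a                          # channel-wise re-weighting
```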
2.5.3. Spectral Branch Recursive Gated Spatial Enhancement
The spectral branch features $F'_{\mathrm{spe}}$ are processed by a recursive gated enhancement mechanism that dynamically reinforces multi-level spatial contextual information. Inspired by separable convolution and dual-path architecture design, the spectral features are divided into a primary path $p$ and an auxiliary path $q$. The auxiliary path is processed using depth-wise separable convolutions to extract hierarchical spatial features, which are recursively refined and modulated to enrich the spatial representation of the spectral branch. This design enables the spectral branch not only to retain long-range spectral dependencies but also to integrate spatial awareness, thereby enhancing the network's ability to distinguish spatially heterogeneous wetland land cover under complex scenarios. Wetland classes often exhibit high spectral similarity yet distinct spatial structures (e.g., shallow water vs. dark wet soil along thin watercourses). RSF therefore decouples the spectral stream into the primary path $p$, which preserves high-confidence spectral evidence, and the auxiliary path $q$, which captures ambiguous, easily confused components and multi-level spatial cues. Through layer-wise recursive gating, the auxiliary path selectively modulates the primary cues: when local structure supports the primary hypothesis, the gate reinforces $p$; otherwise, $q$ compensates for missing or distorted spatial context. This predict–residual–correction cycle reduces over-smoothing near boundaries and improves discrimination for spectrally similar but structurally different land covers. Formally, at recursion step $t$, a gate $g_t$ modulates the auxiliary evidence, and the updated primary feature $p_t$ implements the predict–residual–correction cycle summarized in the equations below.
First-order feature fusion is as follows:

$p_1 = p_0 + g_1 \odot \mathrm{DWConv}(q_0), \qquad g_1 = \sigma\big(\phi([p_0, q_0])\big)$

High-order recursive fusion (layer $N$) is as follows:

$p_t = p_{t-1} + g_t \odot \mathrm{DWConv}(q_{t-1}), \qquad g_t = \sigma\big(\phi([p_{t-1}, q_{t-1}])\big), \qquad t = 2, \dots, N$

The final aggregated output is as follows:

$F_{\mathrm{RSF}} = \phi(p_N)$

where $\phi(\cdot)$ denotes a 1 × 1 convolution followed by batch normalization (BN) and ReLU activation, DWConv refers to depth-wise separable convolution, $\sigma$ is the sigmoid function, $[\cdot, \cdot]$ denotes channel concatenation, and $q_t = \mathrm{DWConv}(q_{t-1})$ carries the recursively refined auxiliary features. By decoupling the main spectral stream from the multi-level auxiliary spatial features and applying layer-wise gated recursion, this mechanism significantly enhances the spectral branch's capacity to perceive spatial context adaptively. This design effectively addresses the "spectrally similar but structurally different" confusion problem prevalent in wetland hyperspectral imagery, thereby promoting deep spatial–spectral feature fusion.
For example, when shallow water and wet soil share similar spectra, local edge/texture cues extracted in the auxiliary path $q$ drive a large gate response $g_t$ near boundaries, injecting spatial evidence into $p$ and reducing confusion; in homogeneous interiors, $g_t$ remains small and $p$ preserves the spectral consensus.
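The recursion is straightforward to express in code. The following is a minimal sketch under stated assumptions (an even channel count, three recursion steps, and a concatenation-based gate); it illustrates the predict–residual–correction loop rather than our exact layer configuration.

```python
import torch
import torch.nn as nn

class RecursiveGatedEnhance(nn.Module):
    """Sketch: split features into primary path p and auxiliary path q; q is refined by
    depth-wise separable convs and gates residual corrections of p at every step."""
    def __init__(self, channels, steps=3):
        super().__init__()
        half = channels // 2                 # assumes an even channel count
        self.dw = nn.ModuleList([nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half),  # depth-wise
            nn.Conv2d(half, half, 1)) for _ in range(steps)])  # point-wise
        self.gate = nn.ModuleList([nn.Conv2d(channels, half, 1) for _ in range(steps)])
        self.proj = nn.Sequential(nn.Conv2d(half, channels, 1),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, f):                    # f: (B, C, H, W) spectral-branch features
        p, q = f.chunk(2, dim=1)             # primary / auxiliary paths
        for dw, gate in zip(self.dw, self.gate):
            q = dw(q)                        # multi-level spatial cues
            g = torch.sigmoid(gate(torch.cat([p, q], dim=1)))  # gate g_t
            p = p + g * q                    # gated residual correction of p
        return self.proj(p)                  # 1x1 conv + BN + ReLU aggregation (phi)
```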
2.5.4. Channel-Wise Gated Residual Fusion
In previous methods, feature fusion between spatial and spectral branches was commonly performed by simple addition or concatenation, which tends to assign equal importance to all channels regardless of their individual discriminative power. Such uniform fusion strategies are often inadequate for hyperspectral wetland scenes, where certain channels or branches may carry more critical information than others, while redundant or noisy features can negatively impact classification performance.
To address this limitation, we introduce a channel-wise gated fusion mechanism following the RSF module. This module adaptively learns the relative importance of each channel from different branches by generating channel-specific gating weights via a lightweight attention network. By dynamically weighting and combining the spatial and spectral features at the channel level, the proposed fusion scheme selectively emphasizes the most informative components while suppressing less relevant or redundant signals. Furthermore, the inclusion of residual connections enhances feature reusability and stabilizes gradient flow, contributing to more robust and expressive feature representations.
$g = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}([\tilde{F}_{\mathrm{spa}}, \tilde{F}_{\mathrm{spe}}]))\big), \qquad F_{\mathrm{fuse}} = g \odot \tilde{F}_{\mathrm{spa}} + (1 - g) \odot \tilde{F}_{\mathrm{spe}}$

where $g \in (0, 1)^{C}$ is the adaptive weight of each channel, $\tilde{F}_{\mathrm{spa}}$ and $\tilde{F}_{\mathrm{spe}}$ denote the spatial- and spectral-path outputs of the RSF module, MLP is a lightweight two-layer fully connected network, $\sigma$ is the sigmoid function, and $\odot$ denotes channel-wise multiplication; a residual connection from the branch features is further added to $F_{\mathrm{fuse}}$ to stabilize gradient flow.
Overall, the channel-wise gated residual fusion module empowers our network with flexible, fine-grained control over multi-branch feature integration, resulting in improved classification accuracy and generalization, particularly for complex and heterogeneous wetland hyperspectral images.
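A minimal sketch of this gated residual fusion follows; the descriptor construction (global average pooling over a concatenation) and the reduction ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChannelGatedFusion(nn.Module):
    """Channel-wise gate g in (0, 1)^C produced by a two-layer MLP; the two branch
    features are combined convexly per channel, plus a residual skip."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, f_a, f_b):             # both: (B, C, H, W)
        desc = torch.cat([f_a.mean(dim=(2, 3)), f_b.mean(dim=(2, 3))], dim=1)
        g = self.mlp(desc)[..., None, None]  # per-channel weights
        return g * f_a + (1.0 - g) * f_b + f_a + f_b  # gated fusion + residual
```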
2.6. Multi-Scale Feature Fusion Module (MFF)
Following the RSF module, we further introduce a multi-scale feature fusion (MFF) module to deeply integrate the multi-source features derived from the spatial branch, spectral branch, and RSF output. The primary objective of MFF is to harness the complementary strengths of each branch, thereby improving the model’s capability to distinguish fine-grained structures, complex heterogeneous land covers, and ambiguous object boundaries in wetland environments. The MFF module not only strengthens the joint representation of global spectral and local spatial features, but also effectively suppresses redundant and noisy components, enhancing classification robustness in complex scenarios.
Inspired by feature redistribution and soft selection mechanisms, the MFF module performs adaptive feature integration through a combination of 1 × 1 convolution, normalization, and nonlinear activation. Specifically, the fused representation $F_{\mathrm{fuse}}$, the enhanced spatial branch output $F'_{\mathrm{spa}}$, and the enhanced spectral branch output $F'_{\mathrm{spe}}$ are concatenated along the channel dimension to form a comprehensive feature tensor, enabling unified representation for subsequent classification.
The 1 × 1 convolution is used to reconstruct and compress the concatenated features, mapping the $3C$ channels back to $C$ channels to achieve a linear redistribution of features:

$F_{1} = \mathrm{Conv}_{1\times1}\big([F_{\mathrm{fuse}}, F'_{\mathrm{spa}}, F'_{\mathrm{spe}}]\big)$

Batch normalization and ReLU activation are introduced to improve training stability and increase the nonlinearity of the feature expression:

$F_{2} = \mathrm{ReLU}(\mathrm{BN}(F_{1}))$

A dropout layer is added at the output to suppress strong correlations between features and improve the generalization ability of the model:

$F_{\mathrm{MFF}} = \mathrm{Dropout}(F_{2})$

The final fused feature $F_{\mathrm{MFF}}$ is used as the input to the classification head (a fully connected MLP) to generate the final pixel-level prediction probabilities.
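The three steps above reduce to a few lines in PyTorch; this sketch assumes the dropout rate of 0.5 stated in Section 2.8.

```python
import torch
import torch.nn as nn

class MFF(nn.Module):
    """Multi-scale feature fusion head: concat (3C) -> 1x1 conv (C) -> BN -> ReLU -> dropout."""
    def __init__(self, channels, p_drop=0.5):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * channels, channels, kernel_size=1),  # linear redistribution
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Dropout2d(p_drop))

    def forward(self, f_fuse, f_spa, f_spe):  # each: (B, C, H, W)
        return self.fuse(torch.cat([f_fuse, f_spa, f_spe], dim=1))
```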
2.7. Post-Processing: DenseCRF Refinement
Although deep neural networks can achieve high overall accuracy in pixel-wise classification of hyperspectral wetland imagery, the resulting classification maps often suffer from boundary ambiguity and salt-and-pepper noise, particularly in regions with complex class distributions or those near object boundaries. To enhance the spatial coherence and boundary precision of the classification outputs, we introduce Dense Conditional Random Fields (DenseCRF) as a post-processing refinement step; its effect is shown in Figure 6.
DenseCRF takes the initial segmentation probability map as input and globally optimizes pixel labels by minimizing an energy function. The core idea is to incorporate both spatial proximity and spectral similarity between pixels to perform pixel-level refinement:

$E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j)$

Unary term $\psi_u(x_i)$: this indicates the confidence that each pixel belongs to a certain category, derived from the softmax output or classification probability map of the neural network.

Pairwise term $\psi_p(x_i, x_j)$: this models the relationship between pixels $i$ and $j$ and encourages pixels that are close in space and/or similar in spectrum to receive the same label. The typical form combines a Gaussian (smoothness) kernel and a bilateral (appearance) kernel:

$\psi_p(x_i, x_j) = \mu(x_i, x_j)\left[w^{(1)} \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2}\right) + w^{(2)} \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2}\right)\right]$

where $\mu(x_i, x_j)$ is the label-compatibility indicator function (equal to 1 if $x_i \neq x_j$ and 0 otherwise), $p_i$ is the spatial position of pixel $i$, $I_i$ is its color or principal component value, $\sigma_\alpha$, $\sigma_\beta$, and $\sigma_\gamma$ control the spatial and color influence ranges, and $w^{(1)}$ and $w^{(2)}$ are the kernel weights.
The parameters of DenseCRF—including spatial kernel scale, color kernel scale, label compatibility, and confidence weight—are highly tunable, making the framework adaptable to hyperspectral imagery with varying resolutions and textural complexities. In this study, the parameter settings were carefully designed and analyzed for sensitivity (as detailed in the experimental section), considering the following aspects (a usage sketch follows the list):
ITER: Number of iterations, balancing convergence speed and accuracy.
SXY_G, COMPAT_G: Parameters of the Gaussian kernel term representing spatial smoothness constraints.
SXY_BI, SRGB, COMPAT_BI: Parameters of the bilateral kernel that integrates both spatial proximity and spectral (or color) similarity.
gt_prob: Confidence weight for the original network prediction, determining the degree of trust placed in the initial output during CRF optimization.
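These hyperparameters map directly onto a typical DenseCRF implementation. The sketch below uses the pydensecrf package; the default values shown are placeholders, not our tuned settings (reported in Section 3.3).

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(probs, guide_rgb, iters=8, sxy_g=3, compat_g=3,
               sxy_bi=60, srgb=10, compat_bi=10):
    """probs: (n_classes, H, W) softmax map; guide_rgb: (H, W, 3) uint8 guide image
    (e.g., a false-color composite or the first three principal components)."""
    n_classes, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, n_classes)
    d.setUnaryEnergy(unary_from_softmax(probs))        # unary term from network output
    d.addPairwiseGaussian(sxy=sxy_g, compat=compat_g)  # SXY_G, COMPAT_G: spatial smoothness
    d.addPairwiseBilateral(sxy=sxy_bi, srgb=srgb,      # SXY_BI, SRGB, COMPAT_BI:
                           rgbim=np.ascontiguousarray(guide_rgb),
                           compat=compat_bi)           # joint spatial/spectral kernel
    q = d.inference(iters)                             # ITER mean-field iterations
    return np.argmax(q, axis=0).reshape(h, w)          # refined label map

# gt_prob enters when unaries are built from hard labels instead of soft probabilities,
# e.g., pydensecrf.utils.unary_from_labels(labels, n_classes, gt_prob=0.95).
```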
DenseCRF not only enhances global optimization capability but also allows for independent control over spatial smoothing and spectral guidance. Our experiments demonstrate that this method consistently improves both boundary delineation and overall classification accuracy in most wetland scenarios.
2.8. Experimental Setup
(1) All experiments were conducted on a workstation with an NVIDIA GeForce RTX 3090 GPU running Windows 10. The implementation was based on Python 3.8 and PyTorch; development used PyCharm 2022, and ENVI 5.6/ArcGIS 10.8 were employed for data preprocessing/visualization and geospatial checks. Model complexity statistics are reported in Section 3.4.
(2) For fair, like-for-like comparability with prior work on the YRD UAV-HSI datasets, we adopt the fixed train/test split provided by Xie et al. [42]. The training proportions are 3.19% for NC12, 2.81% for NC16, and 1.63% for NC13; all remaining labeled pixels constitute the test set. When a validation set is needed for model selection, it is carved out only from the training portion, and the test split remains unchanged. For transparency, per-class training/test counts are reported in Tables 1–3; class distributions and imbalance are summarized in Section 2.2 and Figure 2 (histograms).
(3) The model training settings were configured as follows: The AdamW optimizer was used with an initial learning rate of , dynamically adjusted using a Cosine Annealing LR schedule. The weight decay was set to . The batch size was 64, with a maximum of 100 training epochs. Input patch size was set to 88. The loss function employed was label-smoothed cross-entropy loss with a smoothing factor of 0.05. Dropout probability was set to 0.5. Both training and inference were performed on the original full-band hyperspectral patches, and the final classification maps were reconstructed using a sliding window strategy for pixel-wise prediction.
(4) To comprehensively evaluate model performance, we adopted several metrics including overall accuracy (OA), Average Accuracy (AA), and Cohen’s Kappa coefficient. These metrics were used to assess the generalization ability and robustness of the proposed model across the three benchmark hyperspectral wetland datasets: NC12, NC13, and NC16.
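For reproducibility, the three metrics can be computed from flattened prediction and reference arrays as follows (standard definitions; labeled pixels only, and every class is assumed to appear in the reference data).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def evaluate(y_true, y_pred):
    """Return OA, AA, and Cohen's kappa for 1-D label arrays of test pixels."""
    cm = confusion_matrix(y_true, y_pred)
    oa = np.trace(cm) / cm.sum()              # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)  # per-class accuracy (recall)
    aa = per_class.mean()                     # average accuracy
    kappa = cohen_kappa_score(y_true, y_pred)
    return oa, aa, kappa
```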
3. Results
3.1. Comparative Experiments
To rigorously validate the effectiveness of the proposed method for hyperspectral remote sensing image classification, we conducted extensive comparisons using the NC12, NC13, and NC16 datasets. The method was benchmarked against several state-of-the-art classification networks, including 1DCNN, 3DCNN, HybridSN, SSRN, DBMA, DBDA, and the Swin Transformer-based model.
To ensure fairness, all models were trained and tested using the same data splits (see Table 1, Table 2 and Table 3) and evaluated using standard metrics such as OA, AA, and the Kappa coefficient. Classification performance was assessed on the test sets, with the highest accuracy values highlighted in bold. Additionally, per-class accuracy was reported to analyze class-wise performance differences.
To aid intuitive understanding, Figure 7, Figure 8 and Figure 9 provide visual comparisons of the classification maps produced by all methods, allowing for direct assessment of each model's performance in spatial delineation and noise suppression.
From the comparative results shown in Table 1, Table 2 and Table 3, the proposed Multi-Branch Channel-Gated Swin Transformer network (MBCG-SwinNet) consistently outperformed all baseline methods across the three benchmark hyperspectral datasets—NC12, NC16, and NC13. It demonstrated especially strong generalization in tackling challenging scenarios such as class imbalance and fine-grained object discrimination.
For the NC12 dataset, MBCG-SwinNet achieved outstanding performance, reaching an OA of 97.62%, AA of 88.85%, and Kappa coefficient of 97.04%. Notably, the model significantly improved the classification accuracy for difficult and spectrally similar categories such as reed and mudflat, substantially reducing misclassification. While the Swin Transformer achieved solid performance on dominant classes, it showed weaknesses in handling small-sample classes and boundary refinement. Residual-structure-based networks such as SSRN and HybridSN, in contrast, often suffered from boundary blurring and excessive salt-and-pepper noise under high spectral similarity and imbalanced class distributions. MBCG-SwinNet effectively alleviated these issues via robust spatial–spectral dual-branch feature fusion, leading to better spatial consistency and land cover coherence in the classification maps, as illustrated in Figure 7; detailed class-wise and summary metrics are reported in Table 1.
For the NC16 dataset, which poses an even greater challenge due to severe spectral overlap among classes, MBCG-SwinNet still achieved an OA of 97.32%, AA of 87.11%, and Kappa of 96.34%. It particularly excelled in classifying Spartina alterniflora and mudflats, which exhibit high intra-class heterogeneity and limited training samples. Compared with other methods, HybridSN and SSRN showed limitations in deep feature integration and inter-class discrimination. The Swin Transformer performed reasonably on major categories but lagged behind in fine-grained segmentation and small-class preservation. Visual analysis of the classification maps further confirmed that MBCG-SwinNet produced cleaner, more coherent results with smoother boundaries and less noise, indicating superior spatial generalization, with the classification effect shown in Figure 8; detailed class-wise and summary metrics are reported in Table 2.
The NC13 dataset presented the most complex scenario, characterized by highly interleaved land covers and severe spectral overlap. This made it notoriously difficult to exceed an 80% OA using conventional methods. MBCG-SwinNet was the first to push the OA to 82.37%, AA to 82.53%, and Kappa to 78.91%, significantly outperforming all previously reported methods in the literature, to the best of our knowledge. Class-wise, the model achieved 99.91% for "Mixed asphalt cement road", 94.78% for "Water", and 80.41% for "Dry soil". It also reached 94.18% for "Mixed suaeda glauca reed". Compared with the best performing baseline methods, accuracy improvements for key classes ranged from 2.7% to 10%. Particularly for complex and minority classes such as moist soil and dry soil, MBCG-SwinNet showed pronounced advantages. The multi-scale fusion and spatial–spectral synergy mechanisms enabled superior recognition of small-sample categories such as reed and car, achieving 73.51% and 85.13% accuracy, respectively—outperforming all other models and mitigating the typical collapse in classification accuracy for underrepresented classes, with the classification effect shown in Figure 9; detailed class-wise and summary metrics are reported in Table 3.
Subjective assessment of classification maps further illustrates that MBCG-SwinNet generated results that more closely align with the true spatial patterns of land cover, with sharper boundaries and fewer noise artifacts. This was especially evident in transitional zones and mixed-class regions, where the model produced more stable and detailed segmentation results. These improvements are reflected not only in quantitative metrics but also in the practical value for remote sensing land cover interpretation.
In summary, MBCG-SwinNet demonstrated consistently superior performance across all three hyperspectral datasets, confirming the advantages of its spatial–spectral complementary representation, high-order feature interaction, and multi-scale fusion architecture. These attributes offer a promising direction for hyperspectral remote sensing classification and contribute a powerful tool for complex wetland and mixed-object environment analysis.
3.2. Ablation Study
To quantitatively evaluate the contribution of each key module—Residual Spatial Fusion (RSF), channel-wise gating, multi-scale feature fusion (MFF), and DenseCRF post-processing—to the overall performance of the proposed model, we conducted a series of ablation experiments, as summarized in Table 4 and Figure 10. Given the architectural dependencies within the model, both the channel-wise gating mechanism and the MFF module rely on the output of the RSF module. Therefore, experimental configurations without RSF but including either gating or MFF were not considered.
The ablation study includes the following configurations:
A0—dual-branch baseline: pure two-branch backbone (Swin spectral + 3D-CNN spatial), without GSSA, RSF, channel-wise gating, MFF, or DenseCRF.
A1—A0 + GSSA.
A2—A1 + RSF.
A3—A2 + channel-wise gating.
A4—A3 + MFF.
A5—complete backbone (the A4 configuration, evaluated without CRF).
A6—A5 + DenseCRF (full model).
This progressive design allows for a step-by-step analysis of the performance improvement brought by each innovation, highlighting their individual and combined contributions to final classification accuracy.
From the experimental results, the baseline dual-branch model (A0) achieves overall accuracy (OA) scores of 95.01%, 75.30%, and 95.02% on the NC12, NC13, and NC16 datasets, respectively. This configuration demonstrates fundamental spatial–spectral modeling capabilities but exhibits limitations in complex land cover segmentation and mixed-pixel discrimination. The subsequent introduction of Global Spectral–Spatial Attention (A1) yields only marginal improvements (+0.20% NC12, +0.22% NC13, +0.16% NC16), confirming that early cross-branch guidance alone provides limited performance enhancement.
Significant gains emerge with the integration of Residual Spatial Fusion (RSF) in A2, which boosts the OA by +0.82% (NC12), +2.34% (NC13), and +0.69% (NC16) over A1. This substantial improvement validates RSF’s effectiveness in enhancing spatial representation and multi-source feature fusion, particularly for datasets with high spatial heterogeneity like NC13. The subsequent addition of channel-wise gating (A3) further elevates performance (+0.71% NC12, +2.08% NC13, +0.35% NC16), demonstrating its critical role in adaptive feature recalibration and noise suppression.
Multi-scale feature fusion (A4) continues this positive trend, though its impact varies across datasets—improving performance on NC12 (+0.48%) and NC16 (+0.49%) while slightly decreasing it on NC13 (−0.43%). This module enhances characterization of complex semantic structures, particularly benefiting minority classes through multi-context aggregation. The complete backbone configuration (A5) achieves robust performance (97.00% NC12, 80.54% NC13, 96.71% NC16), showcasing the cumulative benefits of our core innovations.
Finally, DenseCRF post-processing (A6) delivers the most substantial improvements—particularly for challenging NC13 (+1.83%)—culminating in a peak performance of 97.62% (NC12), 82.37% (NC13), and 97.32% (NC16). This boundary refinement step demonstrates exceptional value in ambiguous regions, with maximum cumulative gains of 2.61% (NC12), 7.07% (NC13), and 2.30% (NC16) over the baseline. Notably, the most responsive dataset (NC13) shows the greatest sensitivity to RSF and channel gating—our most impactful innovations—which collectively address complex boundary delineation and spectral mixing challenges. All evaluation metrics exhibit monotonic enhancement throughout the ablation sequence, confirming the complementary nature and systematic design of each proposed component.
3.3. Effectiveness and Parameter Sensitivity Analysis of DenseCRF
3.3.1. Effectiveness (With vs. Without CRF)
As summarized in Table 5, applying DenseCRF at inference consistently improves OA and Kappa across all datasets—by +0.62 pp (NC12), +1.83 pp (NC13), and +0.61 pp (NC16) for OA, with corresponding Kappa gains of +0.79, +1.79, and +0.89. The largest improvements occur on NC13, whose low illumination and heavy mixing exacerbate boundary ambiguity; CRF refinement reduces salt-and-pepper artifacts and strengthens transitions between adjacent classes. The AA increases on NC13 (+2.45 pp) and NC16 (+1.66 pp), while showing a small decrease on NC12 (−0.39 pp), likely due to mild over-smoothing on very small or fragmented classes. These results are consistent with the step-wise ablation (A5→A6) and support the use of CRF as an effective, lightweight boundary regularizer.
DenseCRF introduces only a lightweight inference time overhead—about 20–45 s per full scene, depending on image size—yet consistently delivers a tangible benefit. Even on datasets dominated by very small or rare patches, the default configuration works well; if finer detail must be preserved, the spatial–kernel radius can simply be reduced to avoid over-smoothing tiny structures.
Across all three UAV-HSI datasets, this low-overhead refinement translates into cleaner boundaries, fewer speckles, and higher agreement with ground truth—benefits that are most pronounced precisely where the classification task is most challenging—making DenseCRF a well-justified and practically useful addition to the pipeline.
3.3.2. Parameter Sensitivity Analysis
To refine the classification results and enhance edge detail preservation, this study integrates DenseCRF post-processing and conducts a single-parameter sensitivity analysis. Using the NC12 dataset as a case study, seven key hyperparameters—ITER, SXY_G, COMPAT_G, SRGB, COMPAT_BI, SXY_BI, and gt_prob—were independently adjusted while keeping the other parameters fixed. The effects on the overall accuracy (OA), average accuracy (AA), and Kappa coefficient were analyzed to assess each parameter's influence on model performance, as shown in Figure 11.
The results indicate that the OA is relatively insensitive to variations in ITER, SXY_G, and COMPAT_G, suggesting that the Gaussian pair-wise potential contributes minimally to global performance shifts. In contrast, the parameters associated with the bilateral potential—particularly SRGB, SXY_BI, and COMPAT_BI—have a more pronounced impact. Excessively high values for these parameters lead to over-smoothing at object boundaries, thereby reducing classification accuracy.
Moreover, moderately increasing the pseudo-label confidence threshold (gt_prob) proves beneficial for enhancing model consistency. The optimal setting was found near gt_prob = 0.95. Under the final configuration (ITER = 8, gt_prob = 0.95), the model achieved its highest performance on the NC12 dataset, with the OA reaching 97.62%, validating the efficacy of DenseCRF in improving both boundary delineation and overall classification accuracy.
3.4. Analysis of Model Complexity
Our model has a higher parameter count (~19.75 M, measured consistently across all three datasets with variations under 0.001 M) because it couples a Swin Transformer spectral backbone with a CNN spatial branch and adds the RSF, channel-wise gating, and MFF fusion modules. These components increase capacity for long-range spectral context and multi-scale spatial cues, which is crucial in heterogeneous wetlands. Importantly, the extra parameters do not create a runtime bottleneck: across all three datasets, our method delivers the shortest training and inference times among Transformer-based/hybrid models. CNN-only baselines have fewer parameters but markedly slower inference in many cases, reflecting the heavier 3D computations. Overall, the dual-branch design raises representational power while keeping a competitive runtime, indicating good practicality for UAV-HSI deployments. The model complexity comparison is reported in Table 6.
3.5. Analysis of Model Noise Robustness
To further examine the model’s stability under typical acquisition interferences, we evaluated robustness to additive Gaussian noise and salt-and-pepper (impulse) noise on NC12, NC13, and NC16. Inputs were band-wise normalized to [0,1]; unless otherwise noted, the network was trained on clean data and evaluated on corrupted test sets.
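One plausible implementation of this corruption protocol is sketched below (the impulse scheme, applied per pixel location across all bands, is an assumption; implementations differ in such details).

```python
import numpy as np

def add_gaussian(x, sigma):
    """Additive Gaussian noise on a band-wise [0, 1]-normalized HSI cube x of shape (H, W, B)."""
    return np.clip(x + np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)

def add_impulse(x, ratio):
    """Salt-and-pepper noise: a fraction `ratio` of pixel locations forced to 0 or 1."""
    out = x.copy()
    mask = np.random.rand(*x.shape[:2])
    out[mask < ratio / 2] = 0.0                           # pepper
    out[(mask >= ratio / 2) & (mask < ratio)] = 1.0       # salt
    return out
```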
Figure 12 reports the OA (%) as the noise level increases.
In (a), the OA decreases smoothly with the standard deviation σ of Gaussian noise. Starting from 97.62%/82.37%/97.32% on clean data (NC12/NC13/NC16), performance at σ=0.40 remains 95.40% (NC12), 79.10% (NC13), and 95.00% (NC16), indicating a gradual loss of spectral contrast but a controlled impact on accuracy. In (b), the OA drops more rapidly with the impulse noise ratio r, as random 0/1 impulses break local spatial continuity. At r=0.09, the OA is 94.20% (NC12), 77.90% (NC13), and 93.70% (NC16). Overall, NC16 shows the smallest decline, whereas NC13—with low illumination and heavy mixing—exhibits the largest, which is consistent with our scene analysis.
These trends align with the model design: the Swin-based spectral branch preserves long-range, band-wise dependencies under reduced contrast; RSF (recursive gated spatial enhancement) selectively injects spatial cues to separate spectrally similar yet structurally different classes; channel-wise gating suppresses unstable bands; and DenseCRF sharpens boundaries and mitigates salt-and-pepper artifacts. The monotonic, modest degradation across noise levels supports the robustness of MBCG-SwinNet under common UAV-HSI interferences.