Multi-Branch Channel-Gated Swin Network for Wetland Hyperspectral Image Classification

Liu, Ruopu; Zhao, Jie; Tian, Shufang; Li, Guohao; Chen, Jingshu

doi:10.3390/rs17162862

Open AccessArticle

Multi-Branch Channel-Gated Swin Network for Wetland Hyperspectral Image Classification

by

Ruopu Liu

¹

,

Jie Zhao

^1,*

,

Shufang Tian

¹,

Guohao Li

¹ and

Jingshu Chen

²

¹

School of Earth Sciences and Resources, China University of Geosciences (Beijing), Beijing 100083, China

²

Department of Geography, The University of Hong Kong, Hong Kong 999077, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(16), 2862; https://doi.org/10.3390/rs17162862

Submission received: 10 July 2025 / Revised: 8 August 2025 / Accepted: 14 August 2025 / Published: 17 August 2025

Download

Browse Figures

Versions Notes

Abstract

Hyperspectral classification of wetland environments remains challenging due to high spectral similarity, class imbalance, and blurred boundaries. To address these issues, we propose a novel Multi-Branch Channel-Gated Swin Transformer network (MBCG-SwinNet). In contrast to previous CNN-based designs, our model introduces a Swin Transformer spectral branch to enhance global contextual modeling, enabling improved spectral discrimination. To effectively fuse spatial and spectral features, we design a residual feature interaction chain comprising a Residual Spatial Fusion (RSF) module, a channel-wise gating mechanism, and a multi-scale feature fusion (MFF) module, which together enhance spatial adaptivity and feature integration. Additionally, a DenseCRF-based post-processing step is employed to refine classification boundaries and suppress salt-and-pepper noise. Experimental results on three UAV-based hyperspectral wetland datasets from the Yellow River Delta (Shandong, China)—NC12, NC13, and NC16—demonstrate that MBCG-SwinNet achieves superior classification performance, with overall accuracies of 97.62%, 82.37%, and 97.32%, respectively—surpassing state-of-the-art methods. The proposed architecture offers a robust and scalable solution for hyperspectral image classification in complex ecological settings.

Keywords:

deep learning; hyperspectral remote sensing; wetland classification; multi-branch network; multi-scale feature fusion

1. Introduction

Wetlands, as an integral part of global ecosystems, perform irreplaceable ecological functions such as flood regulation, water purification, carbon sequestration, and the maintenance of biodiversity [1]. Numerous studies have shown that wetland resources worldwide have experienced significant degradation since the Industrial Revolution, with over 50% of natural wetlands being replaced by urban expansion and agricultural development [2,3,4]. Under the dual pressures of climate change and human impact, wetlands are now facing serious challenges including degradation, fragmentation, and declining ecological functionality. As a result, developing high-resolution, sustainable, and long-term monitoring and classification systems has become a key requirement for effective ecological conservation and management [5,6].

Hyperspectral imagery (HSI), characterized by its dense and narrow spectral bands, enables the precise discrimination of surface materials and is particularly suitable for the classification of heterogeneous and complex wetland environments [7,8,9]. In recent years, the widespread adoption of unmanned aerial vehicles (UAVs) has provided an efficient solution for acquiring hyperspectral data with high spatial and temporal resolution, demonstrating outstanding performance in tasks such as vegetation monitoring and water boundary delineation in wetlands [10,11,12]. However, UAV-based HSI still faces considerable challenges in fine-grained classification due to the complex mixture of wetland land covers, high spectral similarity between classes, blurred object boundaries, and severe class imbalance [13,14,15].

Traditional supervised classification approaches, such as Support Vector Machine (SVM) [16], Random Forest (RF) [17], and Multinomial Logistic Regression (MLR) [18], rely primarily on pixel-level spectral modeling and lack effective spatial context awareness, often leading to misclassification around class boundaries. To compensate for this limitation, hand-crafted spatial enhancement techniques such as Gabor filters [19] and morphological profiles [20] have been introduced. However, these methods are highly dependent on parameter tuning and exhibit limited generalization capability, making them inadequate for the complex and variable terrain of wetlands [21].

With the development of deep learning, convolutional neural networks (CNNs) have become widely used in hyperspectral classification tasks. For instance, Hu et al. proposed a 1D CNN model that efficiently extracts spectral features [22], but it fails to capture spatial structure. HybridSN leverages 3D convolution to jointly model spatial and spectral features, enhancing local texture sensitivity [23], but its limited receptive field and heavy computational burden restrict its ability to model long-range dependencies. SSRN introduces a residual dual-branch architecture that separately extracts spatial and spectral features and fuses them to enhance nonlinear representation capabilities [24], but its static fusion mechanism limits flexibility in adapting to class variability. To address the underperformance on minority classes, Yuan et al. proposed the DBMA network with a multi-branch attention mechanism, which improves feature expressiveness, yet its inter-branch coordination remains constrained [25]. Furthermore, deeper spatial–spectral joint networks have been explored by Makantasis et al. [26] and Wang et al. [27], while Paoletti et al. [7] provided a comprehensive review summarizing the performance variations of different deep models across diverse land cover scenarios.

In recent years, Transformer-based architectures have made remarkable advances in hyperspectral image classification, primarily due to their strong capability in modeling global dependencies. Among them, the Swin Transformer introduces a hierarchical design and shifted window attention, enabling efficient capture of long-range spatial relationships [28]. Models such as SpectralFormer and CTIN integrate token-level spectral–spatial interaction mechanisms, substantially enhancing the fusion of heterogeneous features [29,30]. Furthermore, enhanced designs such as the dual-branch Transformer [31], MS2I2Former [32], and HSD2Former [33] significantly improve boundary delineation and maintain spatial consistency.

Despite these achievements, most existing methods still depend on static weighted fusion or straightforward concatenation strategies, limiting their ability to dynamically model heterogeneous multi-branch features. They also often struggle with challenges like class imbalance and ambiguous boundaries—particularly in complex wetland environments characterized by fragmented patches and underrepresented classes [34,35].

To overcome these limitations, multi-scale feature fusion has emerged as a central focus in recent research. Techniques such as atrous spatial pyramid pooling (ASPP) and attention-based pyramid fusion have proven effective in capturing semantic information across varying spatial scales, leading to more accurate classification of land cover types with large scale variation, including water bodies, vegetation, and bare soil [36,37]. Several Transformer-based architectures further extend multi-scale fusion to the token level. For instance, models like MultiFormer and MSCF-MAM demonstrate that token-level multi-scale modeling greatly enhances spatial–semantic consistency [38,39]. Likewise, the GSSFT model highlights the potential of multi-scale attention modules for accurate recognition of fragmented wetland patches [40].

Collectively, these developments reflect a paradigm shift in hyperspectral classification—from static fusion techniques toward dynamic, structure-adaptive integration mechanisms tailored to complex spatial–spectral characteristics.

Nevertheless, significant limitations remain in applying such architectures to wetland classification. Specifically, (1) most multi-source fusion strategies rely on static weighted or concatenation approaches, lacking adaptive mechanisms tailored to inter-class differences; (2) the scale modeling structures are often single-path, failing to balance fine boundary detail and global contour recognition; (3) insufficient robustness to class imbalance and boundary ambiguity leads to poor recognition of minor classes and produces salt-and-pepper noise in classification maps.

To overcome these challenges, this paper proposes a novel Multi-Branch Channel-Gated Swin Transformer network (MBCG-SwinNet), tailored to UAV hyperspectral imagery from the Yellow River Delta (YRD) wetlands in Shandong, China. The model employs a Swin Transformer spectral branch to extract global spectral context and a CNN spatial branch to capture local texture details; these dual streams learn complementary spectral and spatial representations. A lightweight Global-guided Spatial Self-Attention (GSSA) module provides initial cross-branch interaction. More critically, a redesigned Residual Spatial Fusion (RSF) module serves as the core of the fusion mechanism, enabling enhanced inter-branch interaction and spatial residual compensation. Furthermore, the channel-wise gate adaptively re-weights contributions across channels, and the multi-scale feature fusion (MFF) module consolidates multi-level features, improving robustness to spatial heterogeneity typical of YRD wetland scenes. Finally, a DenseCRF-based post-processing step refines boundary predictions and suppresses salt-and-pepper noise, leading to improved spatial coherence and classification accuracy.

The main contributions of this paper are summarized as follows:

We proposed a dual-branch hyperspectral classification framework combining a Swin Transformer-based spectral branch and a CNN-based spatial branch to jointly exploit global spectral dependencies and local spatial context.
We designed a hierarchical feature fusion strategy integrating Residual Spatial Fusion (RSF), channel-wise gating, and multi-scale feature fusion (MFF), enabling deep inter-branch interaction, dynamic channel re-weighting, and scale-adaptive integration for complex, heterogeneous wetland features.
We incorporated a DenseCRF-based post-processing module into the end-to-end pipeline to refine object boundaries and enforce spatial consistency, improving classification accuracy and boundary delineation on the benchmark datasets NC12, NC13, and NC16.

2. Materials and Methods

2.1. Study Area Overview

This study focuses on the Yellow River Delta (YRD) in northeastern Shandong, China (approx. [37.5–38.4]°N, [118.1–119.3]°E), a low-relief alluvial–coastal plain bordering the Bohai Sea. The YRD with elevations mostly <10 m a.s.l. exhibits strong river–tide interactions that drive pronounced seasonal hydrological variability. The wetland mosaic includes extensive reed marshes (Phragmites australis), salt-marsh communities (e.g., Suaeda salsa), tidal mudflats, shallow ponds/creeks, and saline–alkali bare lands, interspersed with aquaculture ponds and levee infrastructure. High soil salinity, mixed vegetation–substrate patches, and frequent land–water transitions lead to strong spectral similarity, class imbalance, and blurred boundaries—making the YRD a representative yet challenging testbed for UAV-HSI classification under real-world wetland conditions [41].

2.2. Dataset Description

The hyperspectral data used in this study were obtained from the Yellow River Delta Hyperspectral Remote Sensing Database (YRD-HRS), a publicly available dataset first established and released by Xie et al. [42]. This database provides a standardized platform for hyperspectral classification research in wetland ecosystems, featuring high spatial resolution, multi-temporal observations, and pixel-level land cover annotations across diverse wetland types.

Data acquisition was conducted using a DJI M600 (SZ DJI Technology Co., Ltd., Shenzhen, China) UAV platform equipped with a Headwall Nano-Hyperspec sensor (12 mm focal length). The sensor captured 270 spectral bands spanning the 400–1000 nm range, offering both high spatial and spectral fidelity. The YRD-HRS database includes three representative UAV-collected datasets—NC12, NC13, and NC16—each covering distinct ecological and environmental conditions. All three datasets are used in this study to comprehensively evaluate the effectiveness and generalizability of the proposed classification model. Illustrative examples of the datasets are shown in Figure 1.

The NC12 data were collected on 23 September 2020, from 12:27 to 13:10 under clear weather conditions. The UAV flew at an altitude of 300 meters, resulting in a spatial resolution of approximately 0.182 meters. This area contains 12 land cover types, primarily including water bodies, bare land, and Suaeda glauca, which are typical classes in coastal wetlands. The NC13 data were acquired on 24 September 2020, from 14:47 to 15:20 under overcast conditions, with the UAV flying at the same altitude of 300 m and the same spatial resolution of approximately 0.182 m. The surface composition in this area is more complex, with some wetland types exhibiting low reflectance and dramatic variation, making this dataset particularly suitable for evaluating the spectral modeling performance of classification models under low-illumination conditions. The NC16 data were collected on 23 September 2020, from 13:50 to 14:20 at a flight altitude of 400 meters, corresponding to a spatial resolution of approximately 0.266 m. They comprise 16 land cover categories. This dataset is characterized by high spectral redundancy, blurred class boundaries, and predominantly mixed land cover distributions, making it a valuable benchmark for developing and evaluating boundary-sensitive classification models. All three hyperspectral datasets exhibit significant class imbalance. In Figure 2, which depicts the category distribution of the hyperspectral dataset, artificial reference targets (standard reflectance cloth in NC13/NC16; both white cloth and standard reflectance cloth in NC12) were excluded from the visualization as they do not represent natural land cover classes. The distribution of the remaining classes reveals a long-tail pattern where dominant classes (e.g., water bodies and soil types) comprise most samples, while ecologically important but spatially limited classes (e.g., specific vegetation types) have limited representation.

We adopt a land cover taxonomy across all experiments. Vegetation communities are specified at the community/habitat level and are introduced by the common English name followed by the Latin binomial in italics at first mention (e.g., Suaeda (Suaeda glauca)); thereafter, the English name is used consistently throughout the text, tables, and figures. Non-vegetation classes use standardized land cover terms such as water, mudflat, bare soil, moist soil, and sand/gravel (stone). The complete class lists for NC12, NC13, and NC16, together with train/test counts, are provided in this section. Items labeled “White cloth” (standard reflectance panels for radiometric calibration) and “Iron” (man-made metallic markers/surfaces) are not natural land cover classes. For figure rendering, the same semantic class uses the same color as consistently as possible across datasets, legends are refined to map class names to color swatches, and background/no data is shown as transparent.

2.3. MBCG-SwinNet Architecture

This section provides a brief overview of the proposed Multi-Branch Channel-Gated Swin network (MBCG-SwinNet) architecture. The model is composed of several key components, including dual-branch spectral–spatial feature extraction, Global Spectral–Spatial Attention (GSSA) fusion, Residual Spatial Fusion (RSF), channel-wise gating, multi-scale feature fusion (MFF), and DenseCRF-based post-processing for refinement. The complete processing pipeline is illustrated in Figure 3.

Specifically, for each input hyperspectral image patch, the model extracts spatial and spectral features via two parallel branches: a spatial branch based on 3D convolutional neural networks (3D-CNNs) and a spectral branch utilizing the Swin Transformer. The extracted features are then fed into the GSSA module, which enables high-order interaction and mutual enhancement between spectral and spatial domains, yielding two sets of enhanced features.

These GSSA-enhanced features are subsequently input into the Residual Spatial Fusion (RSF) module, which facilitates deeper cross-branch interaction and residual compensation. The output from RSF is further processed by a channel-wise gating mechanism, which adaptively modulates feature contributions at a fine-grained channel level to achieve dynamic feature weighting and fusion.

The fused features from RSF and the gating module are then passed through the multi-scale feature fusion (MFF) module, which aggregates multi-scale representations and significantly improves the model’s ability to recognize complex land cover structures. During inference, a Dense Conditional Random Field (DenseCRF) is applied to refine classification boundaries, suppress salt-and-pepper noise, and enhance spatial consistency, resulting in high-precision classification maps.

2.4. Spatial and Spectral Branch Design

The proposed model adopts a dual-branch architecture, consisting of a spatial branch and a spectral branch, designed to separately exploit the spatial context and spectral sequence features inherent in hyperspectral imagery. In contrast to the MGSSAN method proposed by Xie et al. [42], our model employs a Swin Transformer as the spectral backbone. While hybrid CNN–Transformer architectures have been explored in HSI classification, here, we tailor a Swin-based spectral branch to UAV wetland scenes and integrate it with the proposed RSF, channel-wise gating, and MFF modules for deep spatial–spectral fusion.

2.4.1. Spatial Branch

The spatial branch utilizes a 3D convolutional module to jointly convolve across the spatial neighborhood and spectral dimension, effectively capturing local spatial textures and spectral–spatial joint representations. The input to this branch is a hyperspectral image patch represented as a 3D tensor

X \in R^{C \times H \times W}

:

F^{(1)} = R e L U (B N (C o n v 3 D_{1} (X)))

(1)

F^{(2)} = R e L U (B N (C o n v 3 D_{2} (M a x P o o l 3 D (F^{(1)}))))

(2)

F^{(3)} = R e L U (B N (C o n v 3 D_{3} (M a x P o o l 3 D (F^{(2)}))))

(3)

The final output is the spatial features

F_{spa} :

F_{spa} = Reshape (F^{(3)}) \in R^{C^{'} \times H^{'} \times W^{'}}

(4)

The spatial branch is effective in capturing local neighborhood structures around each pixel, which helps suppress hyperspectral noise and enhances the modeling of spatial continuity and boundary features of land cover classes. These spatial representations complement the spectral branch and are subsequently fused with spectral features via the Global Spectral–Spatial Attention (GSSA) module for deep interaction and joint enhancement.

2.4.2. Spectral Branch

Unlike conventional spectral branches that rely solely on convolutional structures, our approach introduces the Swin Transformer as the backbone of the spectral branch, significantly enhancing the model’s capacity to capture global dependencies and long-range contextual relationships in hyperspectral data. This branch adopts a hierarchical Swin Transformer architecture designed to fully exploit spectral correlations and long-distance dependencies along the band dimension.

As illustrated in Figure 3, the input is a hyperspectral patch of size

H \times W \times C

. The input is first mapped into a low-dimensional feature space via a Linear Embedding layer, forming an initial token sequence fed into multiple Swin Transformer Blocks.

Unlike standard Transformers, the Swin Transformer utilizes window-based multi-head self-attention (W-MSA) and Shifted Window Multi-head Self-Attention (SW-MSA) mechanisms. These localized attention operations maintain computational efficiency while effectively modeling spectral relationships across bands and spatial locality. Each hierarchical stage comprises several Swin Transformer Blocks, interleaved with Patch Merging operations that perform progressive downsampling and feature dimension expansion to aggregate increasingly abstract spectral representations.

The internal structure of a single Swin Transformer Block is shown in Figure 4. Each block includes Layer Normalization, a window-based self-attention module, a two-layer multi-layer perceptron (MLP), and residual connections to improve training stability and representational capacity. Ultimately, the spectral branch outputs rich multi-scale spectral features, which serve as key inputs for subsequent fusion and classification stages.

Both of them have residual connection and normalization operation to ensure efficient flow of features and training stability.

The forward propagation of the swing transformer block can be described as follows:

{\hat{z}}^{l} = W - M S A (L N (z^{l - 1})) + z^{l - 1}

(5)

z^{l} = M L P (L N ({\hat{z}}^{l})) + {\hat{z}}^{l}

(6)

{\hat{z}}^{l + 1} = S W - M S A (L N (z^{l})) + z^{l}

(7)

z^{l + 1} = M L P (L N ({\hat{z}}^{l + 1})) + {\hat{z}}^{l + 1}

(8)

where W-MSA denotes window-based multi-head self-attention, SW-MSA refers to Shifted Window Multi-head Self-Attention, MLP stands for multi-layer perceptron, and LN indicates Layer Normalization. The hierarchical features are progressively downsampled and expanded in channel dimension via Patch Merging layers, enabling multi-scale global representation learning. The final output is a deep spectral representation denoted as

F s p e

which serves as a crucial input for subsequent fusion modules.

2.4.3. Global Spectral–Spatial Attention (GSSA)

To further enhance the flexibility and discriminability of feature fusion between the spatial and spectral branches, the model incorporates the Global Spectral–Spatial Attention (GSSA) mechanism, originally inspired by the design proposed by Xie et al. [42]. GSSA employs lightweight channel and spatial attention modules to guide the mutual complementation of global information between spectral and spatial features. The GSSA process can be abstracted as follows:

\tilde{F} s p e, \tilde{F} s p a = GSSA (F s p e, F s p a)

(9)

Let

F s p e, F s p a

denote the spectral and spatial features extracted from the respective branches. After being processed by the GSSA module, the enhanced features are denoted as

\tilde{F} s p e

and

\tilde{F} s p a

, respectively.

2.5. Improved Residual Spatial Feature Enhancement Module (RSF)

Inspired by recent high-resolution remote sensing models such as Swin-CFNet [43], this study proposes an improved Residual Spatial Feature (RSF) enhancement module tailored for hyperspectral imagery. The module is designed to fully integrate multi-source information from the spectral and spatial branches, thereby enhancing feature representation across both spatial and channel dimensions. To effectively leverage the complementary characteristics of the CNN-based spatial branch and the Swin Transformer-based spectral branch in complex wetland environments, RSF adopts a specialized dual-path structure that couples spectral–recursive gated spatial enhancement with spatial–channel attention modeling. At the output, a channel-wise gating mechanism is introduced to adaptively fuse features from the two branches. The overall structure of the RSF module, along with the MFF (multi-scale feature fusion) mechanism, is illustrated in Figure 5.

2.5.1. Module Input and Branch Pre-Processing

After passing through the GSSA module, the spatial branch and spectral branch output

F_{s p a}

and

F_{s p e}

, respectively. Since the number of channels in the two branches is not necessarily the same, channel adaptation is required for subsequent complementary modeling and feature alignment:

F_{s p a}^{'} = {Conv}_{1 \times 1} (F_{s p a})

(10)

F_{s p e}^{'} = {Conv}_{1 \times 1} (F_{s p e})

(11)

2.5.2. Channel Attention Modeling in the Spatial Branch

Based on the convolutional features

F_{s p a}^{'}

, the spatial branch incorporates a channel attention mechanism to enhance the model’s ability to differentiate and represent the significance of each spatial feature channel. This mechanism uses a combination of global pooling and a lightweight MLP to model the importance of different channels, formulated as follows:

F_{s p a, a} = f_{c} (ReLU (f_{c} (AvgPool (F_{s p a}^{'}))))

(12)

F_{s p a, m} = f_{c} (ReLU (f_{c} (MaxPool (F_{s p a}^{'}))))

(13)

F_{s p a}^{″} = Conv (F_{s p a}^{'} ⊙ ReLU (F_{s p a, a} + F_{s p a, m}))

(14)

where

f_{c}

denotes a fully connected layer, while AvgPool and MaxPool represent global average and max pooling, respectively, used to extract global spatial contextual information. The channel attention mechanism aggregates pooled features via the MLP and applies channel-wise re-weighting (

⊙

), followed by convolution to further refine spatial dependencies. This design allows the model to adaptively modulate the contribution of each spatial channel to the final fusion, highlighting structured spatial patterns such as wetland boundaries and linear features (e.g., roads), while suppressing redundant or noisy components. The fusion of global and local information through pooling balances the model’s sensitivity across varying spatial scales, which is crucial for handling large-scale landscape variation in wetland environments.

2.5.3. Spectral Branch Recursive Gated Spatial Enhancement

The spectral branch

F_{s p e}^{'}

employs a recursive gated enhancement mechanism to dynamically reinforce multi-level spatial contextual information. Inspired by separable convolution and dual-path architecture design, the spectral features are divided into a primary path

p_{w}

and an auxiliary path

p_{a}

. The auxiliary path is processed using depth-wise separable convolutions to extract hierarchical spatial features, which are recursively refined and modulated to enrich the spatial representation of the spectral branch. This design enables the spectral branch to not only retain long-range spectral dependencies but also integrate spatial awareness, thereby enhancing the network’s ability to distinguish spatially heterogeneous wetland land cover under complex scenarios. Wetland classes often exhibit high spectral similarity yet distinct spatial structures (e.g., shallow water vs. dark wet soil along thin watercourses). RSF therefore decouples the spectral stream into a primary path (

p_{w}

) that preserves high-confidence spectral evidence and an auxiliary path (

p_{a}

) that captures ambiguous, easily confused components and multi-level spatial cues. Through layer-wise recursive gating, the auxiliary path selectively modulates the primary cues: when local structure supports the primary hypothesis, the gate reinforces

p_{w}

; otherwise,

p_{a}

compensates for missing or distorted spatial context. This predict–residual–correction cycle reduces over-smoothing near boundaries and improves discrimination for spectrally similar but structurally different land covers. Formally, at recursion step t, gate

G_{t} = σ (f ([p_{w}^{(t)}, p_{a}^{(t)}]))

modulates the auxiliary evidence, and the updated

p_{w}^{(t + 1)} = p_{w}^{(t)} + G_{t} ⊙ h (p_{a}^{(t)})

implements the predict–residual–correction cycle summarized in Equations (15)–(20).

p_{w}, p_{a} = split (F_{s p e}^{'})

(15)

d a = DWConv (p_{a})

(16)

d w_{i} = split (d a), i = 0, 1, \dots, N

(17)

p_{w}

retains high-confidence spectral evidence, whereas

p_{a}

extracts multi-level spatial cues and ambiguous components that are recursively used to modulate the spectral stream.

First-order feature fusion is as follows:

x_{0} = p_{w} ⊙ d w_{0}

(18)

High-order recursive fusion (layer N) is as follows:

x_{i + 1} = φ (x_{i}) ⊙ d w_{i + 1}

(19)

The final aggregate output is as follows:

F_{s p e}^{″} = Conv (x_{N})

(20)

where

φ

denotes a 1 × 11\times 11 × 1 convolution followed by batch normalization (BN) and ReLU activation, while DWConv refers to depth-wise separable convolution. By decoupling the main spectral stream from the multi-level auxiliary spatial features and applying layer-wise gated recursion, this mechanism significantly enhances the spectral branch’s capacity to perceive spatial context adaptively. This design effectively addresses the “spectrally similar but structurally different” confusion problem prevalent in wetland hyperspectral imagery, thereby promoting deep spatial–spectral feature fusion.

For example, when shallow water and wet soil share similar spectra, local edge/texture cues extracted in

p_{a}

drive a large

G_{t}

near boundaries, injecting spatial evidence into

p_{w}

and reducing confusion; in homogeneous interiors,

G_{t}

remains small and preserves the spectral consensus.

2.5.4. Channel-Wise Gated Residual Fusion

In previous methods, feature fusion between spatial and spectral branches was commonly performed by simple addition or concatenation, which tends to assign equal importance to all channels regardless of their individual discriminative power. Such uniform fusion strategies are often inadequate for hyperspectral wetland scenes, where certain channels or branches may carry more critical information than others, while redundant or noisy features can negatively impact classification performance.

To address this limitation, we introduce a channel-wise gated fusion mechanism following the RSF module. This module adaptively learns the relative importance of each channel from different branches by generating channel-specific gating weights via a lightweight attention network. By dynamically weighting and combining the spatial and spectral features at the channel level, the proposed fusion scheme selectively emphasizes the most informative components while suppressing less relevant or redundant signals. Furthermore, the inclusion of residual connections enhances feature reusability and stabilizes gradient flow, contributing to more robust and expressive feature representations.

F_{c g f} = Concat (F_{s p a}^{″}, F_{s p e}^{″})

(21)

G = Softmax (MLP (GAP (F_{c g f})))

(22)

F_{R S F} = G_{1} ⊙ F_{1} + G_{2} ⊙ F_{2}

(23)

where

G \in R^{2 \times C}

is the adaptive weight of each channel, MLP is generally a two-layer full connection,

F_{1} = F_{s p a}^{″}

,

and F_{2} = F_{s p e}^{″}

.

Overall, the channel-wise gated residual fusion module empowers our network with flexible, fine-grained control over multi-branch feature integration, resulting in improved classification accuracy and generalization, particularly for complex and heterogeneous wetland hyperspectral images.

2.6. Multi-Scale Feature Fusion Module (MFF)

Following the RSF module, we further introduce a multi-scale feature fusion (MFF) module to deeply integrate the multi-source features derived from the spatial branch, spectral branch, and RSF output. The primary objective of MFF is to harness the complementary strengths of each branch, thereby improving the model’s capability to distinguish fine-grained structures, complex heterogeneous land covers, and ambiguous object boundaries in wetland environments. The MFF module not only strengthens the joint representation of global spectral and local spatial features, but also effectively suppresses redundant and noisy components, enhancing classification robustness in complex scenarios.

Inspired by feature redistribution and soft selection mechanisms, the MFF module performs adaptive feature integration through a combination of 1 × 1 convolution, normalization, and nonlinear activation. Specifically, the fused representation

F_{RSF}

, the enhanced spatial branch output

F_{s p a}^{″}

, and the enhanced spectral branch output

F_{s p e}^{″}

are concatenated along the channel dimension to form a comprehensive feature tensor, enabling unified representation for subsequent classification.

F_{cat} = Concat (F_{s p a}^{″}, | F_{s p e}^{″}, | F_{RSF})

(24)

The 1 × 1 convolution is used to reconstruct and compress the stitched features, and the 3C channel is mapped back to the C channel to achieve the linear redistribution of features:

F_{mff}^{(1)} = {Conv}_{1 \times 1} (F_{cat})

(25)

Batch norm and ReLU activation are introduced to improve training stability and increase feature expression nonlinearity:

F_{mff}^{(2)} = ReLU (BN (F_{mff}^{(1)}))

(26)

We add a dropout layer at the output to suppress the strong correlation between features and improve the generalization ability of the model:

F_{mff} = Dropout (F_{mff}^{(2)})

(27)

The final fusion feature

F_{mff}

is used as the input of the classification header (full connection layer—MLP) to generate the final pixel-level prediction probability.

2.7. Post-Processing: DenseCRF Refinement

Although deep neural networks can achieve high overall accuracy in pixel-wise classification of hyperspectral wetland imagery, the resulting classification maps often suffer from boundary ambiguity and salt-and-pepper noise, particularly in regions with complex class distributions or those near object boundaries. To enhance the spatial coherence and boundary precision of the classification outputs, we introduce Dense Conditional Random Fields (DenseCRF) as a post-processing refinement step, and the effect is shown in Figure 6.

DenseCRF takes the initial segmentation probability map as input and globally optimizes pixel labels by minimizing an energy function. The core idea is to incorporate both spatial proximity and spectral similarity between pixels to perform pixel-level refinement:

E (x) = \sum_{i} ψ_{u} (x_{i}) + \sum_{i < j} ψ_{p} (x_{i}, x_{j})

(28)

Unary term

ψ_{u} (x_{i})

: this indicates the confidence that each pixel belongs to a certain category, which is derived from the softmax output or classification probability map of the neural network.

P a i r w i s e t e r m ψ_{p} (x_{i}, x_{j})

: This models the relationship between pixels i and j and encourages pixels with similar space and/or spectrum to be given the same label. The typical forms are Gaussian kernel and bilateral kernel (see the following formula):

ψ_{p} (x_{i}, x_{j}) = μ (x_{i}, x_{j}) [w_{1} \exp (- \frac{| | p_{i} - p_{j} {| |}^{2}}{2 σ_{α}^{2}} - \frac{| | I_{i} - I_{j} {| |}^{2}}{2 σ_{β}^{2}}) + w_{2} \exp (- \frac{| | p_{i} - p_{j} {| |}^{2}}{2 σ_{γ}^{2}})]

(29)

where

μ (x_{i}, x_{j})

is the indicator function (if

x_{i}

≠

x_{j}

, it is 1),

p_{i}

is the spatial position,

I_{i}

is the pixel color or principal component value, σ_α, σ_β, σ_γ control the spatial and color influence range, and

w_{1}, w_{2}

are the weights

The parameters of DenseCRF—including spatial kernel scale, color kernel scale, label compatibility, and confidence weight—are highly tunable, making the framework adaptable to hyperspectral imagery with varying resolutions and textural complexities. In this study, the parameter settings were carefully designed and analyzed for sensitivity (as detailed in the experimental section), considering the following aspects:

ITER: Number of iterations, balancing convergence speed and accuracy.

SXY_G, COMPAT_G: Parameters of the Gaussian kernel term representing spatial smoothness constraints.

SXY_BI, SRGB, COMPAT_BI: Parameters of the bilateral kernel that integrates both spatial proximity and spectral (or color) similarity.

gt_prob: Confidence weight for the original network prediction, determining the degree of trust placed in the initial output during CRF optimization.

DenseCRF not only enhances global optimization capability but also allows for independent control over spatial smoothing and spectral guidance. Our experiments demonstrate that this method consistently improves both boundary delineation and overall classification accuracy in most wetland scenarios.

2.8. Experimental Setup

(1) All experiments were conducted on a workstation with an NVIDIA GeForce RTX 3090 GPU running Windows 10. The implementation was based on Python 3.8 and PyTorch; development used PyCharm2022, and ENVI5.6/ArcGIS10.8 were employed for data preprocessing/visualization and geospatial checks; model complexity statistics are reported in Section 3.5.

(2) For fair, like-for-like comparability with prior work on the YRD UAV-HSI datasets, we adopt the fixed train/test split provided by Xie et al. [42]. The training proportions are 3.19% for NC12, 2.81% for NC16, and 1.63% for NC13; all remaining labeled pixels constitute the test set. When a validation set is needed for model selection, it is carved out only from the training portion, and the test split remains unchanged. For transparency, per-class training/test counts are reported in Tables 1–3; class distributions and imbalance are summarized in Section 2.2 in Figure 2 (histograms).

(3) The model training settings were configured as follows: The AdamW optimizer was used with an initial learning rate of

5 \times 10^{- 4}

, dynamically adjusted using a Cosine Annealing LR schedule. The weight decay was set to

1 \times 10^{- 4}

. The batch size was 64, with a maximum of 100 training epochs. Input patch size was set to 8

\times

8. The loss function employed was label-smoothed cross-entropy loss with a smoothing factor of 0.05. Dropout probability was set to 0.5. Both training and inference were performed on the original full-band hyperspectral patches, and the final classification maps were reconstructed using a sliding window strategy for pixel-wise prediction.

(4) To comprehensively evaluate model performance, we adopted several metrics including overall accuracy (OA), Average Accuracy (AA), and Cohen’s Kappa coefficient. These metrics were used to assess the generalization ability and robustness of the proposed model across the three benchmark hyperspectral wetland datasets: NC12, NC13, and NC16.

3. Results

3.1. Comparative Experiments

To rigorously validate the effectiveness of the proposed method for hyperspectral remote sensing image classification, we conducted extensive comparisons using the NC12, NC13, and NC16 datasets. The method was benchmarked against several state-of-the-art classification networks, including 1DCNN, 3DCNN, HybridSN, SSRN, DBMA, DBDA, and the Swin Transformer-based model.

To ensure fairness, all models were trained and tested using the same data splits (see Table 1, Table 2 and Table 3) and evaluated using standard metrics such as OA, AA, and the Kappa coefficient. Classification performance was assessed on the test sets, with the highest accuracy values highlighted in bold. Additionally, per-class accuracy was reported to analyze class-wise performance differences.

To aid intuitive understanding, Figure 7, Figure 8 and Figure 9 provide visual comparisons of classification maps produced by all methods, allowing for direct assessment of each model’s performance in spatial delineation and noise suppression.

From the comparative results shown in Table 1, Table 2 and Table 3, the proposed Multi-Branch Channel-Gated Swin network (MBCG-SwinNet) consistently outperformed all baseline methods across the three benchmark hyperspectral datasets—NC12, NC16, and NC13. It demonstrated especially strong generalization in tackling challenging scenarios such as class imbalance and fine-grained object discrimination.

For the NC12 dataset, MBCG-SwinNet achieved outstanding performance, reaching an OA of 97.62%, AA of 88.85%, and Kappa coefficient of 97.04%. Notably, the model significantly improved the classification accuracy for difficult and spectrally similar categories such as reed and mudflat, substantially reducing misclassification. While the Swin Transformer achieved solid performance on dominant classes, it showed weaknesses in handling small-sample classes and boundary refinement. Residual-structure-based networks such as SSRN and HybridSN, in contrast, often suffered from boundary blurring and excessive salt-and-pepper noise under high spectral similarity and imbalanced class distributions. MBCG-SwinNet effectively alleviated these issues via robust spatial–spectral dual-branch feature fusion, leading to better spatial consistency and land cover coherence in the classification maps, as illustrated in Figure 7; detailed class-wise and summary metrics are reported in Table 1.

For the NC16 dataset, which poses an even greater challenge due to severe spectral overlap among classes, MBCG-SwinNet still achieved an OA of 97.32%, AA of 87.11%, and Kappa of 96.34%. It particularly excelled in classifying Spartina alterniflora and mudflats, which exhibit high intra-class heterogeneity and limited training samples. Compared with other methods, HybridSN and SSRN showed limitations in deep feature integration and inter-class discrimination. The Swin Transformer performed reasonably on major categories but lagged behind in fine-grained segmentation and small-class preservation. Visual analysis of classification maps further confirmed that MBCG-SwinNet produced cleaner, more coherent results with smoother boundaries and less noise, indicating superior spatial generalization, with the classification effect shown in Figure 8; detailed class-wise and summary metrics are reported in Table 2.

The NC13 dataset presented the most complex scenario, characterized by highly interleaved land covers and severe spectral overlap. This made it notoriously difficult to exceed an 80% OA using conventional methods. MBCG-SwinNet was the first to push the OA to 82.37%, AA to 82.53%, and Kappa to 78.91%, significantly outperforming all previously reported methods in the literature, to the best of our knowledge. Class-wise, the model achieved 99.91% for “Mixed asphalt cement road”, 94.78% for “Water”, and 80.41% for “Dry soil”. It also reached 94.18% for “Mixed suaeda glauca reed”. Compared with the best performing baseline methods, accuracy improvements for key classes ranged from 2.7% to 10%. Particularly for complex and minority classes such as moist soil and dry soil, MBCG-SwinNet showed pronounced advantages. The multi-scale fusion and spatial–spectral synergy mechanisms enabled superior recognition of small-sample categories such as reed and car, achieving 73.51% and 85.13% accuracy, respectively—outperforming all other models and mitigating the typical collapse in classification accuracy for underrepresented classes, with the classification effect shown in Figure 9; detailed class-wise and summary metrics are reported in Table 3.

Subjective assessment of classification maps further illustrates that MBCG-SwinNet generated results that more closely align with the true spatial patterns of land cover, with sharper boundaries and fewer noise artifacts. This was especially evident in transitional zones and mixed-class regions, where the model produced more stable and detailed segmentation results. These improvements are reflected not only in quantitative metrics but also in the practical value for remote sensing land cover interpretation.

In summary, MBCG-SwinNet demonstrated consistently superior performance across all three hyperspectral datasets, confirming the advantages of its spatial–spectral complementary representation, high-order feature interaction, and multi-scale fusion architecture. These attributes offer a promising direction for hyperspectral remote sensing classification and contribute a powerful tool for complex wetland and mixed-object environment analysis.

3.2. Ablation Study

To quantitatively evaluate the contribution of each key module—Residual Spatial Fusion (RSF), channel-wise gating, multi-scale feature fusion (MFF), and DenseCRF post-processing—to the overall performance of the proposed model, we conducted a series of ablation experiments as summarized in Table 4 and Figure 10. Given the architectural dependencies within the model, both the channel-wise gating mechanism and the MFF module rely on the output of the RSF module. Therefore, experimental configurations without RSF but including either gating or MFF were not considered.

The ablation study includes the following configurations:

A0—dual-branch (no GSSA/RSF/Gate/MFF/CRF):
- Pure two-branch backbone (Swin spectral + 3D-CNN spatial), without GSSA, RSF, channel-wise gating, MFF, or DenseCRF.
A1—A0 + GSSA:
- Adds Global Spectral–Spatial Attention (GSSA) for early cross-branch guidance.
A2—A1 + RSF:
- Adds Residual Spatial Fusion (RSF) to enhance recursive gated spectral-to-spatial interaction.
A3—A2 + Channel-wise gating:
- Adds channel-wise gating to down-weight unstable/noisy bands and emphasize informative channels.
A4—A3 + MFF:
- Adds multi-scale feature fusion (MFF) to aggregate multi-scale context.
A5—A4 (complete backbone, no CRF):
- The complete in-network backbone (GSSA+RSF+Gate+MFF), without post-processing.
A6—A5 + DenseCRF (full model):
- Adds DenseCRF post-processing for boundary refinement (full pipeline).

This progressive design allows for a step-by-step analysis of the performance improvement brought by each innovation, highlighting their individual and combined contributions to final classification accuracy.

From the experimental results, the baseline dual-branch model (A0) achieves overall accuracy (OA) scores of 95.01%, 75.30%, and 95.02% on the NC12, NC13, and NC16 datasets, respectively. This configuration demonstrates fundamental spatial–spectral modeling capabilities but exhibits limitations in complex land cover segmentation and mixed-pixel discrimination. The subsequent introduction of Global Spectral–Spatial Attention (A1) yields only marginal improvements (+0.20% NC12, +0.22% NC13, +0.16% NC16), confirming that early cross-branch guidance alone provides limited performance enhancement.

Significant gains emerge with the integration of Residual Spatial Fusion (RSF) in A2, which boosts the OA by +0.82% (NC12), +2.34% (NC13), and +0.69% (NC16) over A1. This substantial improvement validates RSF’s effectiveness in enhancing spatial representation and multi-source feature fusion, particularly for datasets with high spatial heterogeneity like NC13. The subsequent addition of channel-wise gating (A3) further elevates performance (+0.71% NC12, +2.08% NC13, +0.35% NC16), demonstrating its critical role in adaptive feature recalibration and noise suppression.

Multi-scale feature fusion (A4) continues this positive trend, though its impact varies across datasets—improving performance on NC12 (+0.48%) and NC16 (+0.49%) while slightly decreasing it on NC13 (−0.43%). This module enhances characterization of complex semantic structures, particularly benefiting minority classes through multi-context aggregation. The complete backbone configuration (A5) achieves robust performance (97.00% NC12, 80.54% NC13, 96.71% NC16), showcasing the cumulative benefits of our core innovations.

Finally, DenseCRF post-processing (A6) delivers the most substantial improvements—particularly for challenging NC13 (+1.83%)—culminating in a peak performance of 97.62% (NC12), 82.37% (NC13), and 97.32% (NC16). This boundary refinement step demonstrates exceptional value in ambiguous regions, with maximum cumulative gains of 2.61% (NC12), 7.07% (NC13), and 2.30% (NC16) over the baseline. Notably, the most responsive dataset (NC13) shows the greatest sensitivity to RSF and channel gating—our most impactful innovations—which collectively address complex boundary delineation and spectral mixing challenges. All evaluation metrics exhibit monotonic enhancement throughout the ablation sequence, confirming the complementary nature and systematic design of each proposed component.

3.3. Effectiveness and Parameter Sensitivity Analysis of DenseCRF

3.3.1. Effectiveness (With vs. Without CRF)

As summarized in Table 5, applying DenseCRF at inference consistently improves OA and Kappa across all datasets—by +0.62 pp (NC12), +1.83 pp (NC13), and +0.61 pp (NC16) for OA, with corresponding Kappa gains of +0.79, +1.79, and +0.89. The largest improvements occur on NC13, whose low illumination and heavy mixing exacerbate boundary ambiguity; CRF refinement reduces salt-and-pepper artifacts and strengthens transitions between adjacent classes. The AA increases on NC13 (+2.45 pp) and NC16 (+1.66 pp), while showing a small decrease on NC12 (−0.39 pp), likely due to mild over-smoothing on very small or fragmented classes. These results are consistent with the step-wise ablation (A5→A6) and support the use of CRF as an effective, lightweight boundary regularizer.

DenseCRF introduces only a lightweight inference time overhead—about 20–45 s per full scene, depending on image size—yet consistently delivers a tangible benefit. Even on datasets dominated by very small or rare patches, the default configuration works well; if finer detail must be preserved, the spatial–kernel radius can simply be reduced to avoid over-smoothing tiny structures.

Across all three UAV-HSI datasets, this low-overhead refinement translates into cleaner boundaries, fewer speckles, and higher agreement with ground truth—benefits that are most pronounced precisely where the classification task is most challenging—making DenseCRF a well-justified and practically useful addition to the pipeline.

3.3.2. Parameter Sensitivity Analysis

To refine the classification results and enhance edge detail preservation, this study integrates DenseCRF post-processing and conducts a single-parameter sensitivity analysis. Using the NC12 dataset as a case study, seven key hyperparameters—ITER, SXY_G, COMPAT_G, SRGB, COMPAT_BI, SXY_BI, and gt_prob—were independently adjusted while keeping other parameters fixed. The effects on the overall accuracy (OA), average accuracy (AA), and Kappa coefficient were analyzed to assess each parameter’s influence on model performance, as shown in Figure 11.

The results indicate that the OA is relatively insensitive to variations in ITER, SXY_G, and COMPAT_G, suggesting that the Gaussian pair-wise potential contributes minimally to global performance shifts. In contrast, the parameters associated with the bilateral potential—particularly SRGB, SXY_BI, and COMPAT_BI—have a more pronounced impact. Excessively high values for these parameters lead to over-smoothing at object boundaries, thereby reducing classification accuracy.

Moreover, moderately increasing the pseudo-label confidence threshold (gt_prob) proves beneficial for enhancing model consistency. The optimal setting was found near gt_prob = 0.95. Under the final configuration (ITER = 8, gt_prob = 0.95), the model achieved its highest performance on the NC12 dataset, with the OA reaching 97.62%, validating the efficacy of DenseCRF in improving both boundary delineation and overall classification accuracy.

3.4. Analysis of Model Complexity

Our model has a higher parameter count (~19.75 M, measured consistently across all three datasets with variations under 0.001 M) because it couples a Swin Transformer spectral backbone with a CNN spatial branch and adds the RSF, channel-wise gating, and MFF fusion modules. These components increase capacity for long-range spectral context and multi-scale spatial cues, which is crucial in heterogeneous wetlands. Importantly, the extra parameters do not create a runtime bottleneck: across all three datasets, our method delivers the shortest training and inference times among Transformer-based/hybrid models. CNN-only baselines have fewer parameters but markedly slower inference in many cases, reflecting the heavier 3D computations. Overall, the dual-branch design raises representational power while keeping a competitive runtime, indicating good practicality for UAV-HSI deployments. The model complexity comparison is reported in Table 6.

3.5. Analysis of Model Noise Robustness

To further examine the model’s stability under typical acquisition interferences, we evaluated robustness to additive Gaussian noise and salt-and-pepper (impulse) noise on NC12, NC13, and NC16. Inputs were band-wise normalized to [0,1]; unless otherwise noted, the network was trained on clean data and evaluated on corrupted test sets. Figure 12 reports the OA (%) as the noise level increases.

In (a), the OA decreases smoothly with the standard deviation σ of Gaussian noise. Starting from 97.62%/82.37%/97.32% on clean data (NC12/NC13/NC16), performance at σ=0.40 remains 95.40% (NC12), 79.10% (NC13), and 95.00% (NC16), indicating a gradual loss of spectral contrast but a controlled impact on accuracy. In (b), the OA drops more rapidly with the impulse noise ratio r, as random 0/1 impulses break local spatial continuity. At r=0.09, the OA is 94.20% (NC12), 77.90% (NC13), and 93.70% (NC16). Overall, NC16 shows the smallest decline, whereas NC13—with low illumination and heavy mixing—exhibits the largest, which is consistent with our scene analysis.

These trends align with the model design: the Swin-based spectral branch preserves long-range, band-wise dependencies under reduced contrast; RSF (recursive gated spatial enhancement) selectively injects spatial cues to separate spectrally similar yet structurally different classes; channel-wise gating suppresses unstable bands; and DenseCRF sharpens boundaries and mitigates salt-and-pepper artifacts. The monotonic, modest degradation across noise levels supports the robustness of MBCG-SwinNet under common UAV-HSI interferences.

4. Discussion

4.1. Key Findings and Mechanism Interpretation

Across the three Yellow River Delta (YRD) UAV-HSI scenes, MBCG-SwinNet delivers consistent improvements in boundary delineation and minority class recognition (Section 3; Table 1, Table 2 and Table 3; Figure 6, Figure 7 and Figure 8). The gains are most pronounced on NC13. The NC13 scene is characterized by highly mixed land cover patterns and more challenging illumination, which compresses inter-class spectral contrast and aggravates boundary ambiguity. On this dataset, the largest relative gains stem from three components working in concert: (i) the Swin-based spectral branch captures long-range, band-wise dependencies, stabilizing spectral discrimination under low-contrast conditions; (ii) the RSF module’s spectral branch recursive gated spatial enhancement injects spatial cues through the auxiliary path

p_{a}

and aggregates them back into the main spectral stream

p_{w}

, enabling separation of spectrally similar yet structurally different classes; and (iii) DenseCRF post-processing sharpens boundaries and suppresses label noise in transitional zones. This mechanism-level explanation is consistent with the ablation trends observed on NC13, where progressively adding RSF, gating, MFF, and CRF (A1–A5) yields monotonic improvements. We highlight NC13 as the most challenging case in our experiments and observe larger incremental benefits when RSF and CRF are enabled. The step-wise ablation (A0→A6) shows monotonic improvements, with the largest increments on NC13 when RSF and CRF are enabled, which is consistent with this mechanism-level interpretation.

4.2. Comparison with Previously Published Results

Under the same split protocol on NC12/NC13/NC16 (Section 3.2), strong published baselines spanning CNN-only, hybrid 3D-CNN, and Transformer-based designs are summarized in Table 1, Table 2 and Table 3. Relative to the best prior results therein, MBCG-SwinNet improves the OA by +1.9 pp (NC12), +5.0 pp (NC13), and +2.3 pp (NC16), with consistent gains in Kappa and competitive AA. These like-for-like comparisons indicate that structure–adaptive fusion (RSF + channel-wise gating + MFF) together with edge-aware post-refinement (CRF) is decisive for boundary-sensitive wetland classification.

4.3. Relation to UAV-HSI Elsewhere and to Satellite/Multispectral Data

Beyond the YRD benchmark, UAV hyperspectral imagery is widely used for wetland and vegetation mapping because its centimeter-level GSD preserves fine boundaries and small patches critical for habitat delineation. Spaceborne hyperspectral and multispectral sensors offer wide area coverage but typically at coarser spatial resolution and with different band settings, so direct numerical comparisons are not strictly like-for-like. In this context, the observed improvements over competitive baselines highlight the value of deep spectral–spatial interaction in boundary-sensitive wetlands. Transferring the same design to other regions or platforms is plausible, but should be evaluated under sensor-specific resolution, spectral configuration, and label granularity.

4.4. Practical Aspects and Robustness

DenseCRF acts as a low-overhead, edge-aware regularizer upon inference, improving the OA/Kappa most in the hardest scene (NC13) while remaining stable under practical parameter settings (Section 3.3). Noise stress tests with Gaussian and impulse corruptions (Section 3.5) exhibit monotonic yet modest degradation, consistent with the model’s design that combines global spectral context with gated spatial cues. These results support the use of MBCG-SwinNet in operational mapping where illumination and acquisition noise vary across sorties.

4.5. Limitations and Future Work

First, although Section 3.4 reports model complexity statistics, further latency/energy profiling on embedded devices would clarify deployment trade-offs. Second, evaluation is centered on YRD wetland UAV-HSI; follow-up work will (i) conduct cross-region/season evaluations within wetlands under a standardized pre-processing/metrics protocol, (ii) perform stress tests on non-wetland HSI (e.g., urban/forest) with explicit notes on domain shift, and (iii) explore domain adaptation, illumination-invariant augmentation, and self-supervised pretraining to enhance portability under distribution changes.

5. Conclusions

This work presented MBCG-SwinNet, a dual-branch hyperspectral classifier that couples a Swin-based spectral branch with a 3D-CNN spatial branch through a structure-adaptive fusion chain (GSSA → RSF → channel-wise gating → MFF) and an edge-aware DenseCRF refinement. On three Yellow River Delta UAV-HSI datasets, the model achieved state-of-the-art accuracy and produced cleaner, more coherent maps, with the largest gains on the most challenging dataset, NC13. Ablations and CRF analyses verified the individual and combined contributions of the components. Noise robustness tests indicated stable behavior under typical acquisition interferences and model complexity profiling further confirmed a favorable accuracy–efficiency trade-off, with competitive training and inference times across all datasets, supporting practical deploy ability. The approach is computationally tractable and reproducible. Future work will extend validation to cross-region/season wetlands and other HSI domains, and explore domain adaptation and self-supervised pretraining to enhance portability under distribution shifts.

Author Contributions

Conceptualization, R.L. and J.Z.; methodology, R.L.; software, R.L.; validation, R.L.; formal analysis, R.L.; investigation, R.L.; resources, R.L.; data curation, R.L.; writing—original draft preparation, R.L.; writing—review and editing, R.L., G.L., and J.C.; visualization, R.L., G.L., and J.C.; supervision, J.Z. and S.T.; project administration, J.Z. and S.T.; funding acquisition, J.Z. and S.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was jointly supported by the Open Fund of Beijing Key Laboratory of Land and Resources Information Research and Development, Project No.: BJNRR2025-01.

Data Availability Statement

The hyperspectral wetland datasets used in this study—namely NC12, NC16, and NC13—were collected in the Yellow River Delta National Nature Reserve, China, and are publicly available courtesy of Xie and colleagues. These datasets were originally introduced in Xie et al. (2022), “Multilayer Global Spectral–Spatial Attention Network for Wetland Hyperspectral Image Classification” (IEEE Transactions on Geoscience and Remote Sensing, vol. 60, 2022, 5518913). The datasets can be accessed and downloaded at https://drive.google.com/drive/u/0/folders/1HMTIodCUcXgMIaijOco_wC8uKjw6cSnl (accessed on 15 June 2025). The authors gratefully acknowledge Xie and his team for generously sharing the NC12, NC16, and NC13 datasets and for their contributions to the open research community.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fluet-Chouinard, E.; Stocker, B.D.; Zhang, Z. Others Extensive Global Wetland Loss over the Past Three Centuries. Nature 2023, 614, 281–289. [Google Scholar] [CrossRef]
Davidson, N.C. How Much Wetland Has the World Lost? Long-Term and Recent Trends in Global Wetland Area. Mar. Freshwater Res. 2014, 65, 934–941. [Google Scholar] [CrossRef]
Amani, M.; Salehi, B.; Mahdavi, S.; Brisco, B. Spectral Analysis of Wetlands Using Multi-Source Optical Satellite Imagery. ISPRS J. Photogramm. Remote Sens. 2018, 144, 119–136. [Google Scholar] [CrossRef]
Barbier, E.B.; Hacker, S.D.; Kennedy, C.; Koch, E.W.; Stier, A.C.; Silliman, B.R. The Value of Estuarine and Coastal Ecosystem Services. Ecol. Monogr. 2011, 81, 169–193. [Google Scholar] [CrossRef]
Mitsch, W.J.; Bernal, B.; Nahlik, A.M. Wetlands, Carbon and Climate Change. Nat. Rev. Earth Environ. 2022, 3, 119–131. [Google Scholar] [CrossRef]
Günther, A.; Barthelmes, A.; Huth, V.; Joosten, H.; Jurasinski, G.; Koebsch, F.; Couwenberg, J. Prompt Rewetting of Drained Peatlands Reduces Climate Warming despite Methane Emissions. Nat. Commun. 2020, 11, 1644. [Google Scholar] [CrossRef] [PubMed]
Paoletti, M.E.; Haut, J.M.; Plaza, J.; Plaza, A. Deep Learning Classifiers for Hyperspectral Imaging: A Review. ISPRS J. Photogramm. Remote Sens. 2019, 158, 279–317. [Google Scholar] [CrossRef]
Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Deep Learning for Hyperspectral Image Classification: An Overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef]
Ma, L.; Li, M.; Ma, X.; Cheng, L.; Du, P.; Liu, Y. A Review of Supervised Object-Based Land cover Image Classification. ISPRS J. Photogramm. Remote Sens. 2017, 130, 277–293. [Google Scholar] [CrossRef]
Smigaj, M.; Gaulton, R.; Barr, S.L.; Suárez, J.C. Uav-Borne Thermal Imaging for Forest Health Monitoring: Detection of Disease-Induced Canopy Temperature Increase. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2015, 40, 349–354. [Google Scholar] [CrossRef]
Yu, R.; Luo, Y.; Zhou, Q.; Zhang, X.; Wu, D.; Ren, L. A Machine Learning Algorithm to Detect Pine Wilt Disease Using UAV-Based Hyperspectral Imagery and LiDAR Data at the Tree Level. Int. J. Appl. Earth Obs. Geoinf. 2021, 101, 102363. [Google Scholar] [CrossRef]
Aasen, H.; Honkavaara, E.; Lucieer, A.; Zarco-Tejada, P.J. Quantitative Remote Sensing at Ultra-High Resolution with UAV Spectroscopy: A Review of Sensor Technology, Measurement Procedures, and Data Correction Workflows. Remote Sens. 2018, 10, 1091. [Google Scholar] [CrossRef]
Luo, J.; He, Z.; Lin, H.; Wu, H. Biscale Convolutional Self-Attention Network for Hyperspectral Coastal Wetlands Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
Zhan, Y.; Wu, K.; Dong, Y. Enhanced Spectral–Spatial Residual Attention Network for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7171–7186. [Google Scholar] [CrossRef]
Cui, Y.; Li, W.; Chen, L.; Gao, S.; Wang, L. Double-Branch Local Context Feature Extraction Network for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
Melgani, F.; Bruzzone, L. Classification of Hyperspectral Remote Sensing Images with Support Vector Machines. IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [Google Scholar] [CrossRef]
Belgiu, M.; Drăguţ, L. Random Forest in Remote Sensing: A Review of Applications and Future Directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31. [Google Scholar] [CrossRef]
Krishnapuram, B.; Carin, L.; Figueiredo, M.A.T.; Hartemink, A.J. Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 957–968. [Google Scholar] [CrossRef] [PubMed]
Li, W.; Du, Q. Gabor-Filtering-Based Nearest Regularized Subspace for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 1012–1022. [Google Scholar] [CrossRef]
Huang, X.; Guan, X.; Benediktsson, J.A.; Zhang, L.; Li, J.; Plaza, A.; Dalla Mura, M. Multiple Morphological Profiles from Multicomponent-Base Images for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4653–4669. [Google Scholar] [CrossRef]
Jia, S.; Jiang, S.; Lin, Z.; Li, N.; Xu, M.; Yu, S. A Survey: Deep Learning for Hyperspectral Image Classification with Few Labeled Samples. Neurocomputing 2021, 448, 179–204. [Google Scholar] [CrossRef]
Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep Convolutional Neural Networks for Hyperspectral Image Classification. J. Sens. 2015, 2015, 258619. [Google Scholar] [CrossRef]
Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3-D–2-D CNN Feature Hierarchy for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2020, 17, 277–281. [Google Scholar] [CrossRef]
Zhong, Z.; Li, J.; Luo, Z.; Chapman, M.A. Spectral–Spatial Residual Network for Hyperspectral Image Classification: A 3-D Deep Learning Framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858. [Google Scholar] [CrossRef]
Ma, W.; Yang, Q.; Wu, Y.; Zhao, W.; Zhang, X. Double-branch Multi-attention Mechanism Network for Hyperspectral Image Classification. Remote Sens. 2019, 13, 867. [Google Scholar] [CrossRef]
Makantasis, K.; Karantzalos, K.; Doulamis, A.; Doulamis, N. Deep Supervised Learning for Hyperspectral Data Classification through Convolutional Neural Networks. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; IEEE: New York, NY, USA, 2015; pp. 4959–4962. [Google Scholar]
Wang, S.; Liu, Z.; Chen, Y.; Hou, C.; Liu, A.; Zhang, Z. Expansion Spectral–Spatial Attention Network for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 6411–6427. [Google Scholar] [CrossRef]
He, Y.; Tu, B.; Liu, B.; Chen, Y.; Li, J.; Plaza, A. Hybrid Multi-Scale Spatial-Spectral Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5527918. [Google Scholar] [CrossRef]
Li, S.; Liang, L.; Zhang, S.; Zhang, Y.; Plaza, A.; Wang, X. End-to-End Convolutional Network and Spectral-Spatial Transformer Architecture for Hyperspectral Image Classification. Remote Sens. 2024, 16, 325. [Google Scholar] [CrossRef]
Li, Y.; Yang, X.; Tang, D.; Zhou, Z. RDTN: Residual Densely Transformer Network for Hyperspectral Image Classification. Expert Syst. Appl. 2024, 250, 123939. [Google Scholar] [CrossRef]
Wang, C.; Huang, J.; Lv, M.; Wu, Y.; Qin, R. Dual-Branch Adaptive Convolutional Transformer for Hyperspectral Image Classification. Remote Sens. 2024, 16, 1615. [Google Scholar] [CrossRef]
Cheng, S.; Chan, R.; Du, A. MS2I2Former: Multiscale Spatial-Spectral Information Interactive Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5532919. [Google Scholar] [CrossRef]
Luo, B.; Li, M.; Wei, Y.; Zuo, H.; Zhang, J.; Liu, D. HSD2Former: Hybrid-Scale Dual-Domain Transformer with Crisscrossed Interaction for Hyperspectral Image Classification. Remote Sens. 2024, 16, 4411. [Google Scholar] [CrossRef]
Liu, W.; Prasad, S.; Crawford, M.M. Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3462374. [Google Scholar] [CrossRef]
Xie, E.; Chen, N.; Zhang, G.; Peng, J.; Sun, W. Two-Branch Global Spatial–Spectral Fusion Transformer Network for Hyperspectral Image Classification. Photogramm. Rec. 2024, 39, 392–411. [Google Scholar] [CrossRef]
Zhu, T.; Liu, Q.; Zhang, L. 3D Atrous Spatial Pyramid Pooling Based Multi-Scale Feature Fusion Network for Hyperspectral Image Classification. SPIE Proc. 2023, 12815, 225–231. [Google Scholar] [CrossRef]
Yang, J.; Li, A.; Qian, J.; Jia, Q.; Wang, L. A Cross-Attention-Based Multi-Information Fusion Transformer for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 13358–13375. [Google Scholar] [CrossRef]
Gong, Z.; Zhou, X.; Yao, W. MultiScale Spectral–Spatial Convolutional Transformer for Hyperspectral Image Classification. IET Image Proc. 2024, 18, 4328–4340. [Google Scholar] [CrossRef]
Sun, Q.; Zhao, G.; Xia, X.; Xie, Y.; Fang, C.; Sun, L.; Wu, Z.; Pan, C. Hyperspectral Image Classification Based on Multi-Scale Convolutional Features and Multi-Attention Mechanisms. Remote Sens. 2024, 16, 2185. [Google Scholar] [CrossRef]
Zhang, M.; Gao, F.; Dong, J.; Qi, L. Multi-Scale Feature Fusion for Hyperspectral and Lidar Data Joint Classification. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022. [Google Scholar] [CrossRef]
Li, H.; Liu, Q.; Huang, C.; Zhang, X.; Wang, S.; Wu, W.; Shi, L. Variation in Vegetation Composition and Structure across Mudflat Areas in the Yellow River Delta, China. Remote Sens. 2024, 16, 3495. [Google Scholar] [CrossRef]
Xie, Z.; Hu, J.; Kang, X.; Duan, P.; Li, S. Multilayer Global Spectral–Spatial Attention Network for Wetland Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
Wu, Y.; Zhang, M. Swin-CFNet: An Attempt at Fine-Grained Urban Green Space Classification Using Swin Transformer and Convolutional Neural Network. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]

Figure 1. The study area and hyperspectral datasets used in this study. Left: Locations of the NC12, NC13, and NC16 scenes (indicated by green, red, and blue arrows, respectively, pointing to the corresponding images in the middle column). Middle: False-color composites of each scene. Right: Ground truth maps, where each color represents a land-cover class corresponding to the legend in Figures 7–9.

Figure 2. Class distributions in hyperspectral datasets. (a) NC12, (b) NC13, and (c) NC16 datasets showing natural class distributions after excluding artificial reference targets. Blue: training samples; orange: test samples. Rare classes highlighted with red borders and enlarged in insets.

Figure 3. Overall architecture of MBCG-SwinNet model.

Figure 4. Architecture of spectral branch based on Swin Transformer.

Figure 5. Architecture of Residual Spatial Feature (RSF) module and multi-scale feature fusion (MFF) module.

Figure 6. A visual comparison of classification maps before and after DenseCRF post-processing on the NC12 dataset (random color). (a) Predicted map by MBCG-SwinNet (without CRF refinement); (b) classification map after DenseCRF optimization; (c) difference map highlighting pixels changed by CRF post-processing (number of changed pixels: 43,335).

Figure 7. Classification maps obtained using different methods on NC12 dataset.

Figure 8. Classification maps obtained using different methods on NC16 dataset.

Figure 9. Classification maps obtained using different methods on NC13 dataset.

Figure 10. The OAs of various combinations for the three datasets.

Figure 11. Sensitivity analysis of DenseCRF parameters on NC12 dataset. The red mark represents the optimal OA here.

Figure 12. Noise robustness of MBCG-SwinNet on NC12, NC13, and NC16. (a) OA vs. Gaussian noise. (b) OA vs. salt-and-pepper (impulse) noise.

Table 1. Classification results of different methods for NC12 dataset.

Model	Training	Test	Method
Model	Training	Test	1DCNN	3DCNN	HybridSN	SSRN	DBMA	DBDA	Swin Transformer	Our method
Suaeda glauca	4467	9650	99.12	94.57	99.24	99.78	93.02	96.22	99.36	99.26
Reed	531	15072	45.28	71.21	68.79	72.88	65.44	55.71	27.84	87.41
Water	575	35716	97.03	93.9	98.51	97.04	98.72	98.76	99.47	99.87
White cloth	25	403	81.14	33	84.12	35.24	61.79	66	93.05	100
Withered reed	1543	59741	84.35	96.34	90.53	97.04	97.43	97.25	97.94	97.51
Tamarix	2381	6876	90.47	88.93	93.09	93.98	94.69	93.09	61.24	73.51
Stone	121	1046	88.91	85.37	89.1	85.28	57.46	72.08	95.41	90.19
Moist soil	1222	67275	96.97	97.88	98.42	98.31	96.98	97.29	96.87	96.86
Mudflat	123	511	84.93	75.15	78.86	82	62.43	69.28	74.36	98.53
Spartina alterniflora	1373	122947	80.13	82.11	82.36	81.31	75.82	87.36	99.58	99.6
Dry soil	2032	118041	98.06	95.95	97.77	97.75	98.16	97.3	99.37	98.98
Standard reflectance cloth	9	66	90.91	57.58	100	80.3	51.52	51.52	100	32.89
OA			88.93	91.13	91.52	92.08	90.17	92.96	95.75	97.62
AA			86.44	81	90.07	85.08	79.45	81.82	87.04	88.85
Kappa			86.38	89.06	89.55	90.24	87.93	91.27	94.62	97.04

Table 2. Classification results of different methods for NC16 dataset.

Model	Training	Test	Method
Model	Training	Test	1DCNN	3DCNN	HybridSN	SSRN	DBMA	DBDA	Swin Transformer	Our Method
Suaeda glauca	1325	9209	98.31	94.36	99.27	98.88	97.78	93.42	100	97.58
Cement road	74	1461	78.58	86.11	85.69	68.65	71.8	85.9	89.73	95.49
Asphalt road	141	4300	72.7	87.77	72.53	78.33	87.19	87.44	85.33	99.64
Water	434	69,638	95.21	98.28	95.83	98.31	99.09	99.44	98.72	99.48
Stone	273	5244	86.58	86.42	88.52	93.63	73.97	76.7	92.56	84.06
Hay	58	2374	61.79	57.33	66.3	64.57	60.24	44.61	67.48	100
Iron	2	32	0	0	9.38	0	0	0	15.63	94.44
Tamarix	61	559	5.72	63.33	43.47	61.72	47.94	22.18	84.62	72.91
Withered reed	96	1810	29.56	61.55	54.48	72.71	58.12	39.06	57.68	91.98
Reed	57	1187	66.47	47.51	57.96	35.3	63.69	26.71	64.7	85.67
Spartina alterniflora	459	4839	92.21	88.74	95.52	92.37	89.05	95.99	96.34	100
Moss	29	704	16.34	35.94	38.92	38.35	3.13	0	48.72	100
Dry soil	705	18,775	89.01	65.01	91.4	75.27	83.97	83.63	90.44	99.95
Moist soil	554	28,822	96.98	94.04	95.3	90.68	91.19	94.45	96.47	96.77
Mudflat	71	1051	79.35	4.66	94.96	69.17	63.27	41.77	98.86	54.21
Standard reflectance cloth	4	32	90.63	18.75	93.75	34.38	53.13	40.63	96.88	100
OA			91.37	89.31	92.6	90.78	91.48	91.39	95.01	97.32
AA			66.21	61.86	73.95	67.02	65.22	58.25	80.26	87.11
Kappa			88.19	85.27	89.86	87.32	88.25	88.09	93.13	96.34

Table 3. Classification results of different methods for NC13 dataset.

Model	Training	Test	Method
Model	Training	Test	1DCNN	3DCNN	HybridSN	SSRN	DBMA	DBDA	Swin Transformer	Our Method
Suaeda glauca	400	14,225	60.39	78.92	93.18	95.31	62.95	65.56	87.24	87.74
Mixed asphalt cement road	108	13,184	86.1	77.55	99.83	86.1	89.75	88.9	97.44	99.91
Moist soil	98	4910	61.83	61.71	89.02	81.2	74.46	73.16	82.75	94.78
Water	230	63,007	89.51	94.18	99.98	99.65	87.13	78.76	99.68	99.71
Oil	309	1963	89.81	67.04	93.43	70.81	82.58	88.44	98.37	79.21
Reed	665	56,922	41.96	50.22	62.43	70.75	47.35	58.81	57.24	73.51
Tamarix	134	1954	9.83	45.34	55.22	55.83	32.86	35.31	42.99	19.91
Car	31	391	41.69	36.06	76.21	37.6	38.36	35.55	68.8	85.13
Dry soil	54	4427	61.87	76.21	86.02	83.71	87.55	81.95	82.65	80.41
Standard reflectance cloth	8	49	40.82	61.22	100	53.06	93.88	100	97.96	30.32
Mixed tamarix reed	1465	39,837	60.87	52.21	45.13	36.66	47.83	40.67	55.53	66.26
Mixed suaeda glauca reed	638	21,514	69.28	46.27	70.08	55.32	69.18	68.06	72.16	94.18
Mixed water reed	435	53,549	53.4	48.02	67.52	64.2	59.47	61.22	82.79	77.3
OA			63.74	63.28	74.51	72.19	64.66	64.37	77.35	82.37
AA			59.03	61.15	79.85	68.48	67.18	67.41	78.89	82.53
Kappa			57.32	56.76	69.7	66.89	58.47	58.17	72.95	78.91

Table 4. The combinations of different modules in the ablation study.

Name	Dual-Branch	GSSA	RSF	Channel-Wise Gate	MFF	DenseCRF	OA (%)
							NC12	NC13	NC16
A0	✓						95.01	75.30	95.04
A1	✓	✓					95.21	75.52	95.18
A2	✓	✓	✓				96.03	77.86	95.87
A3	✓	✓	✓	✓			96.74	79.94	96.22
A4	✓	✓	✓		✓		96.52	79.51	96.04
A5	✓	✓	✓	✓	✓		97.00	80.54	96.71
A6	✓	✓	✓	✓	✓	✓	97.62	82.37	97.32

Table 5. With/without CRF comparison on NC12/NC13/NC16.

Dataset	OA (w/o)	OA (w/)	ΔOA	AA (w/o)	AA (w/)	ΔAA	Kappa (w/o)	Kappa (w/)	ΔK
NC12	97	97.62	0.62	89.24	88.85	−0.39	96.25	97.04	0.79
NC13	80.54	82.37	1.83	80.08	82.53	2.45	77.12	78.91	1.79
NC16	96.71	97.32	0.61	85.45	87.11	1.66	95.45	96.34	0.89

Table 6. Model complexity on NC12/NC16/NC13.

Data	Time	1DCNN	3DCNN	SSRN	DBMA	DBDA	HybridSN	Swin Transformer	Our Method
NC12	Training(s)	101.24	1098.67	1970.56	2414.6	2216.36	1455.29	972.35	891.05
NC12	Test(s)	21.35	690.51	1059.63	1141.69	1237.01	985.41	712.2	693.5
NC16	Training(s)	32.59	312.54	613.87	728.71	718.93	402.85	376.28	300.43
NC16	Test(s)	13.86	602.1	696.82	807.23	886.42	713.2	567.58	479.21
NC13	Training(s)	29.81	319.67	682.9	769.35	733.81	424.5	388.02	305.09
NC13	Test(s)	13.57	579.37	673.54	712.08	745.01	635.08	594.37	572.98
Params (M)		0.133	0.28	0.093	0.159	0.165	2.407	18.262	19.749

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, R.; Zhao, J.; Tian, S.; Li, G.; Chen, J. Multi-Branch Channel-Gated Swin Network for Wetland Hyperspectral Image Classification. Remote Sens. 2025, 17, 2862. https://doi.org/10.3390/rs17162862

AMA Style

Liu R, Zhao J, Tian S, Li G, Chen J. Multi-Branch Channel-Gated Swin Network for Wetland Hyperspectral Image Classification. Remote Sensing. 2025; 17(16):2862. https://doi.org/10.3390/rs17162862

Chicago/Turabian Style

Liu, Ruopu, Jie Zhao, Shufang Tian, Guohao Li, and Jingshu Chen. 2025. "Multi-Branch Channel-Gated Swin Network for Wetland Hyperspectral Image Classification" Remote Sensing 17, no. 16: 2862. https://doi.org/10.3390/rs17162862

APA Style

Liu, R., Zhao, J., Tian, S., Li, G., & Chen, J. (2025). Multi-Branch Channel-Gated Swin Network for Wetland Hyperspectral Image Classification. Remote Sensing, 17(16), 2862. https://doi.org/10.3390/rs17162862

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-Branch Channel-Gated Swin Network for Wetland Hyperspectral Image Classification

Abstract

1. Introduction

2. Materials and Methods

2.1. Study Area Overview

2.2. Dataset Description

2.3. MBCG-SwinNet Architecture

2.4. Spatial and Spectral Branch Design

2.4.1. Spatial Branch

2.4.2. Spectral Branch

2.4.3. Global Spectral–Spatial Attention (GSSA)

2.5. Improved Residual Spatial Feature Enhancement Module (RSF)

2.5.1. Module Input and Branch Pre-Processing

2.5.2. Channel Attention Modeling in the Spatial Branch

2.5.3. Spectral Branch Recursive Gated Spatial Enhancement

2.5.4. Channel-Wise Gated Residual Fusion

2.6. Multi-Scale Feature Fusion Module (MFF)

2.7. Post-Processing: DenseCRF Refinement

2.8. Experimental Setup

3. Results

3.1. Comparative Experiments

3.2. Ablation Study

3.3. Effectiveness and Parameter Sensitivity Analysis of DenseCRF

3.3.1. Effectiveness (With vs. Without CRF)

3.3.2. Parameter Sensitivity Analysis

3.4. Analysis of Model Complexity

3.5. Analysis of Model Noise Robustness

4. Discussion

4.1. Key Findings and Mechanism Interpretation

4.2. Comparison with Previously Published Results

4.3. Relation to UAV-HSI Elsewhere and to Satellite/Multispectral Data

4.4. Practical Aspects and Robustness

4.5. Limitations and Future Work

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI