1. Introduction
Wetlands, characterized by their unique intersection of freshwater and saltwater environments and identified by perpetually wet soils due to frequent flooding [1], play a vital role in the environment through essential biogeochemical and hydrological processes [2]. These ecosystems, dominated by emergent plants, shrubs, and woodland vegetation, offer a myriad of benefits, including flood control, erosion prevention, nutrient provision, recreational opportunities, and aesthetic value [3]. Moreover, wetlands contribute to crucial environmental services such as protection from flood and storm damage, enhancement of water quality, reduction of greenhouse gases, shoreline stabilization, and support for diverse fish and wildlife species [4]. Despite their significance, wetlands suffered substantial damage throughout the 20th century from industrialization, climate change, and pollution [5]. Given these challenges and the critical importance of wetlands, creating precise maps detailing their location and composition becomes paramount [6].
Accurate wetland maps serve as vital resources for understanding the spatial distribution, ecosystem functions, and temporal changes within these invaluable environments [7]. Traditional mapping methods, such as surveying, prove laborious and expensive, particularly given the remote, expansive, and seasonally dynamic characteristics of many wetland ecosystems [4]. Fortunately, remote sensing has emerged as a cost-effective and efficient alternative, providing ecological data crucial for describing and monitoring wetland ecosystems. Satellite-based Earth monitoring, utilizing optical and radar systems, has evolved into a valuable tool for wetland mapping. The European Space Agency's (ESA) Copernicus program, specifically the Sentinel-1 (S1) and Sentinel-2 (S2) missions, presents a unique opportunity to improve wetland mapping accuracy. These missions provide free access to high-resolution S1 and S2 data with exceptional spatial and temporal resolutions. S1 satellites, equipped with C-band Synthetic Aperture Radar, operate at a 10 m resolution and prove effective in detecting water bodies hidden beneath vegetation canopies [8,9]. In parallel, S2 satellites, with a multispectral sensor offering optical data at various resolutions, complement the radar data [10]. Both missions, each involving two polar-orbiting satellites, provide relatively frequent revisit times of five to six days at the equator [11].
Despite significant progress in remote sensing technology, classifying complex and heterogeneous land cover types like wetlands remains a challenging task [12]. The challenge arises because highly fragmented landscapes like wetlands contain numerous small classes without distinct boundaries between them, resulting in heightened variability within the same class and reduced distinguishability between different classes [5]. Moreover, some of these classes exhibit highly similar spectral properties, thus complicating their separation and requiring advanced methods for accurate classification.
A crucial factor in achieving accurate wetland categorization through remote sensing data is the selection of an appropriate classification algorithm [13]. Over the past few years, numerous advanced machine learning (ML) algorithms have been applied to wetland classification [4,14,15,16,17,18]. Presently, deep learning (DL) algorithms have demonstrated a high level of accuracy in wetland mapping [3]. DL algorithms, such as convolutional neural networks (CNNs), have shown significant advancements in handling complex datasets due to their ability to extract and learn hierarchical features from raw data automatically [19,20]. Unlike shallow models like Random Forest (RF), which rely on manual feature engineering and are limited by their relatively simplistic structure, DL models can discover complex patterns and relationships within large and heterogeneous datasets [21]. For instance, CNNs excel at identifying spatial hierarchies and local features through convolutional layers. These capabilities are particularly beneficial for tasks such as wetland mapping, where diverse land cover patterns and subtle class differences necessitate sophisticated feature extraction and representation [22]. Empirical studies have demonstrated that DL models often achieve higher accuracy and robustness than shallow models in wetland mapping [4,5,12,23].
CNNs, a class of DL models inspired by biological processes, are widely employed for the classification of remote sensing images, consistently achieving high accuracy in complex, high-dimensional scenarios [24,25,26]. Following the emergence of cutting-edge CNN models like AlexNet [27], VGG [28], and ResNet [29], which achieved remarkable performance in the ImageNet image classification challenge, CNN architectures solidified their position as the dominant and most influential designs for computer vision tasks. Nonetheless, the advent of transformers in natural language processing (NLP) drew attention to the possible application of these architectures to vision tasks [30]. The key factor behind the transformers' high accuracy is the attention mechanism (AM), which enables the model to assign greater importance to particular regions or patches within the image [31]. Recently, attention-based models, such as vision transformers (ViT) and swin transformers, have proven effective in several tasks, including remote sensing image classification [32,33,34]. Previous studies demonstrated the benefits of combining CNNs and transformers by replacing multi-head attention blocks with convolutional layers [35], adding parallel convolutional layers [36], or using sequential convolutional layers [37]; these approaches help capture local dependencies. Despite these advances in other remote sensing applications, the potential of combining state-of-the-art CNN architectures with transformers for wetland mapping remains underexplored.
This study introduces a novel deep learning model for wetland mapping using S1 and S2 data. The primary contributions of this study are: (1) using the potential of both CNNs and ViT to develop a novel DL model called CVTNet, which improves feature extraction through channel attention (CA) and spatial attention (SA); (2) evaluating the effects of attention mechanisms (AMs) in the CVTNet model using the gradient-weighted class activation mapping (Grad-CAM) algorithm and occlusion sensitivity maps; and (3) assessing the model's capability to distinguish wetland class boundaries.
2. Materials and Methods
2.1. Study Area
The study area is in Eastern Newfoundland and Labrador, bordered by the Atlantic Ocean (see Figure 1). It covers approximately 1996.631 square kilometers and has a humid continental climate. St. John's, the provincial capital and largest city, has around 226,000 residents. The land cover includes diverse wetland and non-wetland types such as bogs, fens, forests, grasslands, marshes, pastures, shrubbery, urban areas, swamps, and water bodies. According to the Canadian Wetland Classification System (CWCS), the primary wetlands here are bogs, marshes, fens, and swamps, with bogs and fens (peatlands) predominant.
2.2. Data
This study used S1 and S2 satellite data from Google Earth Engine (GEE) at a 10 m spatial resolution. The maskS2clouds algorithm [1] filtered S2 data for less than 10% cloud coverage from 15 May to 27 May 2022. S2 data comes at three spatial resolutions: 10 m for the visible and near-infrared bands, 20 m for the red-edge and shortwave infrared bands, and 60 m for the atmospheric correction bands. We selected the 10 and 20 m bands (B2, B3, B4, B5, B6, B7, B8, B8A, B11, B12), resampling the 20 m bands to 10 m for consistency. S1 ground range detected (GRD) data provided four backscattering bands. The S1 data in Interferometric Wide (IW) mode, as obtained from GEE, has a pixel spacing of approximately 10 m, though its spatial resolution is not exactly 10 m due to the nature of radar imaging. Preprocessing involved noise removal, terrain adjustment, and radiometric calibration using the ESA Sentinel toolbox. Both S1 and S2 data were acquired from 15 May to 27 May 2022. Table 1 shows the details of the S1 and S2 data used in this study.
Training and validation datasets, labeled by specialists using GPS and high-resolution imagery, were collected from 2017 to 2019. The training set includes 638 polygons, and the validation set has 249 polygons, each with diverse sizes and classes. These datasets were obtained from different geographic regions to avoid data leakage.
Figure 2 shows the distribution of the training and validation sets. Moreover, non-wetland classes such as water, urban, and grass were included to prevent over-classification of wetlands.
Table 2 shows the number of polygons within each class and their respective area measurements in both the training and validation datasets.
2.3. Data Preparation and Label Assigning
The satellite data employed in this study were represented in a three-dimensional format with dimensions H × W × C, where H and W are the image's height and width, respectively, and C is the number of channels. Given the large size of the satellite image, processing it in its entirety proved time-intensive. Consequently, the image was subdivided into multiple patches along both the H and W dimensions, resulting in a new shape of h × w × C (where h < H and w < W) [4]. To address the challenge of labeling these patches, we adopted a patch-based method, assigning labels to the center of each patch [15], as illustrated in Figure 3.
Initially, S1 data (comprising 4 backscattering bands) and S2 data (comprising 10 spectral bands) for the study area were collected using GEE. These data were then combined along their respective channels, resulting in an image with 14 bands. The 14-band image was subsequently partitioned into multiple patches with a 50% overlap. To determine the optimal patch size, a trial-and-error methodology was employed, experimenting with patch sizes of 8, 12, 15, and 20, which were proposed by previous studies for S1 and S2 data [3]. The investigation showed that a patch size of 8 produced optimal results.
Next, the central coordinate of the patch (P) was extracted, and its intersection with a set of training and validation polygons was evaluated. If there was no intersection, the patch remained unlabeled. Conversely, if an intersection occurred between P and the polygon set, the intersection area (A) between the patch and the polygon was calculated.
Subsequently, A was compared with a predefined area value (A′). If A > A′, the label of the polygon was assigned to the patch; otherwise, no label was assigned. Notably, A′ varied with patch size, taking values of 400, 900, 1400, and 2500 m² for patch sizes of 8, 12, 15, and 20, respectively. A′ signified the area covered by a polygon exclusively belonging to a single wetland class. Table 3 lists the number of extracted patches for each class at a patch size of 8.
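The centre-intersection and area-threshold rule can be sketched as follows. This is a simplified illustration in which a rasterized label mask stands in for the study's vector polygons; the function and constant names are illustrative assumptions:

```python
import numpy as np

PIXEL_AREA_M2 = 100  # each Sentinel pixel covers 10 m x 10 m
AREA_THRESHOLDS_M2 = {8: 400, 12: 900, 15: 1400, 20: 2500}  # A' per patch size

def assign_label(label_mask_patch, patch_size=8, background=0):
    """Label a patch from a rasterised polygon mask (simplified sketch).

    Mirrors the rule described above: the patch centre must intersect a
    training polygon, and the polygon's coverage A inside the patch must
    exceed the threshold A'; otherwise the patch stays unlabelled (None)."""
    c = patch_size // 2
    center_label = label_mask_patch[c, c]
    if center_label == background:
        return None  # centre does not fall inside any polygon
    covered_m2 = np.count_nonzero(label_mask_patch == center_label) * PIXEL_AREA_M2
    return int(center_label) if covered_m2 > AREA_THRESHOLDS_M2[patch_size] else None

mask = np.zeros((8, 8), dtype=int)
mask[2:7, 2:7] = 3            # a 25-pixel polygon of class 3 -> 2500 m^2
label = assign_label(mask)    # class 3, since 2500 m^2 > A' = 400 m^2
```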
2.4. Convolutional Neural Network (CNN)
The CNN algorithm is a powerful DL approach that eliminates the necessity for prior feature extraction [38]. A standard CNN comprises three primary layer types: the convolution layer, the pooling layer, and the fully connected layer. The convolution layer, the foundational element in extracting information from preceding layers, applies filters to the input data to identify diverse features such as edges, textures, and shapes through learnable parameters [39]. CNNs maintain connections between pixels, preserving spatial information [23]. Equation (1) defines a simple convolution operation involving the input image I and the kernel K, where m and n denote the dimensions of K [5]:

S(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)    (1)
The second layer, known as the pooling layer, reduces the spatial dimensions of the feature maps produced by preceding convolutional layers. This reduction in dimensionality serves to decrease computational complexity and mitigate overfitting by reducing the number of parameters [15]. Various pooling functions, including maximum and average, are widely used in CNN architectures. In our proposed network, we employed the max-pooling function, recognized for its stability and effectiveness in the field of DL [23].
The final layer, the fully connected layer, acts as the decision-making component of the network, determining the final label for the input data. Neurons in this layer aggregate information from all neurons in preceding layers to reach a final decision. Unlike the preceding layers, its function does not involve preserving spatial structure but rather involves converting the input layer into a vector format [40].
2.4.1. Dilated CNN (DCNN)
DCNNs, also referred to as atrous CNNs, expand the convolutional kernel by introducing holes between its elements [19]. These holes, or dilations, enable the convolutional layer to access a larger area (field of view) of the input without increasing the number of parameters [41]. This mechanism allows the capture of more contextual and multi-scale information. In DCNNs, a crucial additional parameter, the dilation rate (DR), determines the spacing between the kernel's sampled points. Equation (2) illustrates the operation of dilated convolution with input image I and kernel K, where the parameters m and n represent the dimensions of K, and r is the DR:

S(i, j) = Σ_m Σ_n I(i + r·m, j + r·n) K(m, n)    (2)
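The dilated operation of Equation (2) can be sketched directly in NumPy; this toy implementation is an illustration of the sampling pattern, not a performant implementation:

```python
import numpy as np

def dilated_conv2d(image, kernel, rate):
    """Dilated (atrous) convolution of Equation (2):
    S(i, j) = sum_m sum_n I(i + r*m, j + r*n) K(m, n).
    The dilation rate r inserts gaps between kernel taps, enlarging the
    field of view without adding parameters."""
    m, n = kernel.shape
    eff_m = rate * (m - 1) + 1   # effective kernel height
    eff_n = rate * (n - 1) + 1   # effective kernel width
    H, W = image.shape
    out = np.zeros((H - eff_m + 1, W - eff_n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + eff_m:rate, j:j + eff_n:rate]  # r-spaced taps
            out[i, j] = np.sum(window * kernel)
    return out

image = np.ones((8, 8))
out = dilated_conv2d(image, np.ones((3, 3)), rate=2)  # 3x3 taps, 5x5 field of view
```

With rate = 1 this reduces to the standard convolution of Equation (1), which is why the CVTNet DCNN module can mix rates (1, 3, 4, 6) to gather multi-scale context.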
2.4.2. VGG16
The VGG network is a CNN architecture developed by the Visual Geometry Group at the University of Oxford [42]. In a competition where the group introduced six deep CNNs, VGG16 and VGG19 emerged as the most successful. VGG16 consists of 13 convolutional layers and three fully connected layers, while VGG19 includes 16 convolutional layers and three fully connected layers. Both networks employ a sequence of small 3 × 3 convolutional filters with a stride of 1, followed by multiple layers introducing non-linearity. For feature extraction in the encoder section of our proposed model, we used the initial five layers of VGG16, as depicted in Figure 4.
2.5. Attention Mechanism (AM)
In recent years, AMs have gained significant attention in the field of DL, particularly for computer vision tasks such as object detection [28,30,43], image captioning [36,44,45], and action recognition [46,47]. The fundamental concept behind these mechanisms is identifying the most crucial elements within the input data while minimizing redundancy [30]. This process can be formalized as follows [48]:

Attention = f(g(x), x)    (3)

where g(x) refers to generating attention, which involves focusing on the distinctive areas, and f(g(x), x) denotes processing the input x while considering the attention defined by g(x). This formulation aligns with the idea of addressing crucial areas and extracting valuable information [48], and almost all existing AMs can be written in this form [48]. In this study, we employed CA, SA, and a ViT model that incorporates a multi-head self-attention mechanism (MHSeA).
2.5.1. Channel Attention (CA)
In a typical CNN model, input data is processed by various convolution kernels, generating new channels with varying levels of information. CA assigns a weight to each channel to indicate the relevance between the channel and the critical information: higher weights signify greater relevance, directing the model's attention to the corresponding channel [49]. CA enables the model to determine autonomously which channels to focus on.
As shown in
Figure 5, the input feature map is subjected to a global pooling operation, which reduces the spatial dimensions of the feature map, transforming it into a vector output. This output vector is then processed through a sigmoid function, a key component that introduces non-linearity into the model. The resulting vector is channel-wise multiplied by the input feature map, producing a feature map where the importance of channels is explicitly defined.
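The pooling-gating-rescaling pipeline above can be sketched in a few lines of NumPy. The per-channel weight vector `w` here is an illustrative stand-in for the learned 1D CNN used in CVTNet's CA module, so this is a sketch under that assumption rather than the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w):
    """Channel-attention sketch: global average pooling over the spatial
    dimensions yields one descriptor per channel; a per-channel weight
    vector `w` (an assumed stand-in for the learned layer) and a sigmoid
    produce gates in (0, 1) that rescale each channel of the input."""
    desc = features.mean(axis=(0, 1))   # (H, W, C) -> (C,) channel descriptor
    gate = sigmoid(w * desc)            # per-channel importance in (0, 1)
    return features * gate              # broadcast rescaling over H and W

feats = np.random.rand(8, 8, 64)
reweighted = channel_attention(feats, w=np.ones(64))
```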
2.5.2. Spatial Attention (SA)
SA plays a pivotal role in enhancing the discernment capabilities of neural networks by enabling them to autonomously identify and prioritize relevant regions within the input data (refer to Figure 6A). This AM operates by transforming the spatial information inherent in the input data into a distinct space, thereby preserving essential information and significantly augmenting the network's proficiency in object and pattern recognition [50].
In the SA mechanism, the input feature map is passed through a 1 × 1 convolutional layer with a single filter and a sigmoid activation function. This convolutional operation discerns the significance of different spatial locations within the feature map. The resulting output is then used to scale the input features spatially, dynamically emphasizing regions deemed vital for the task at hand (refer to Figure 6B).
The intricate interplay of the 1 × 1 convolutional layer and sigmoid activation function allows the network to assign importance to spatial elements, promoting a refined understanding of the input’s spatial structure. This mechanism proves invaluable in capturing nuanced spatial relationships and is particularly effective in tasks where precise localization and discrimination of features are paramount.
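Because a 1 × 1 convolution with one filter is just a per-pixel dot product over the channels, the SA mechanism can be sketched as follows; the weight vector `w` is an illustrative stand-in for the learned filter:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(features, w):
    """Spatial-attention sketch: a 1x1 convolution with a single filter is
    a per-pixel dot product over the channels (weights `w`, an assumed
    stand-in for the learned filter); a sigmoid turns the result into an
    (H, W) heatmap that rescales every channel of the input."""
    heat = sigmoid(np.tensordot(features, w, axes=([2], [0])))  # (H, W) heatmap
    return features * heat[..., None], heat

feats = np.random.rand(8, 8, 64)
scaled, heat = spatial_attention(feats, w=np.full(64, 0.1))
```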
2.5.3. Self-Attention (SeA)
SeA is a core component of transformers that enables modeling relationships between all tokens in an input sequence. In ViT, SeA captures long-range dependencies between image patches and encodes interactions across the entire image. The key concept is to update a token's representation by aggregating relevant global context from all other tokens [51]. For an input image X of dimensions C × H × W (C is the number of channels, H and W are the height and width of the image, respectively), SeA first splits the image into N flattened patches x_p of shape P² · C, where P is the patch size. These patches are projected into an embedding space of dimension D using a learned linear layer, giving embeddings Z of shape N × D. To apply SeA, Z is transformed into queries (Q), keys (K), and values (V) using separate projection matrices W_Q, W_K, and W_V, which allows modeling interactions between the N patch embeddings. Equation (4) indicates how the attention matrix A of size N × N is computed [52]:

A = softmax(Q Kᵀ / √D)    (4)
2.5.4. Multi-Head Self-Attention (MHSeA)
MHSeA combines multiple SeA blocks whose outputs are concatenated channel-wise [52]. This approach aims to capture diverse and complex relationships between various sequences of embeddings. Each head within the MHSeA possesses its own set of learnable weight matrices, denoted as W_Q^i, W_K^i, and W_V^i, where i ranges from 1 to h, and h represents the number of heads in the MHSeA. This configuration enables the MHSeA to attend simultaneously to multiple aspects of the input data, using distinct sets of learned parameters for each head. Consequently, the model gains the capacity to discern complex patterns and relationships, enhancing its ability to extract meaningful information distributed across various pixels.
2.5.5. Vision Transformer (ViT)
Recently, transformer-based models have shown remarkable performance in a wide range of applications, including computer vision and NLP. These transformers employ SeA mechanisms to effectively capture long-range dependencies [53]. In response to these advancements, the authors of [54] introduced ViT for image classification tasks. The ViT can capture extensive connections within input images [54]. It interprets images as a series of patches and utilizes a traditional transformer encoder. The ViT model comprises three main elements: (1) patch embedding, (2) the transformer encoder (TE), and (3) the classification head, as illustrated in Figure 7.
In the ViT architecture, the input image is processed by first splitting it into a set number of non-overlapping patches of equal dimensions. These patches are then flattened into 1D vectors that encode raw pixel values. Next, a linear projection is applied to the patch vectors to derive lower-dimensional patch embeddings. This projection reduces the vector dimensionality while distilling the information contained in the original patches into more compact feature vectors. The sequence of patch embeddings serves as the input to the TE.
The TE module is made up of several layers, each of which contains a MHSeA mechanism and a feedforward neural network. When encoding each patch, the SeA mechanism allows the model to pay attention to different parts of the image. The output of the SeA mechanism is transformed nonlinearly by the feedforward neural network. After passing through the TE, the sequence of embeddings is passed to an MLP, which predicts the class label of the input image.
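The patch-embedding front end described above can be sketched as follows; the spatial and embedding sizes here are toy assumptions chosen for illustration, not the paper's configuration:

```python
import numpy as np

def patch_embed(image, patch, W_proj, pos):
    """ViT front-end sketch: cut an (H, W, C) image into non-overlapping
    patch x patch tiles, flatten each tile to a P*P*C vector, project it
    to D dimensions with a learned matrix W_proj, and add positional
    embeddings `pos`."""
    H, W, C = image.shape
    tokens = [image[i:i + patch, j:j + patch, :].ravel()
              for i in range(0, H, patch)
              for j in range(0, W, patch)]
    return np.stack(tokens) @ W_proj + pos   # (N, D) token embeddings

img = np.random.rand(16, 16, 14)      # toy 14-band input
D = 128                               # embedding dimension, as in CVTNet's ViT branch
W_proj = 0.01 * np.random.rand(4 * 4 * 14, D)
pos = np.zeros((16, D))               # 16 patches of 4 x 4 for this toy input
Z = patch_embed(img, 4, W_proj, pos)
```

The resulting token matrix Z is what the TE's alternating MHSeA and MLP layers operate on.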
2.6. The Proposed Model (CVTNet)
In this study, we introduce a novel model named CVTNet, a convolutional ViT hybrid designed specifically for mapping wetlands using data from S1 and S2 satellites. The CVTNet model comprises two main branches: a convolutional branch and a ViT branch, as depicted in
Figure 8.
The input image dimensions for the CVTNet model are set at 8 × 8 × 14. The initial stage employs a convolutional layer with three filters and a 1 × 1 kernel to align the number of channels with the three-channel input requirement of the VGG16 architecture. Subsequently, a resizing layer transforms the output dimensions to 64 × 64 × 3. The first five pre-trained layers of the VGG16 architecture encode the input image. This encoded output feeds into a DCNN module, featuring four convolutional layers with 64 filters and varying dilation rates (1, 3, 4, and 6). This module extracts multi-scale features from the encoded tensor, enhancing the field of view, and concatenates these features along the channels.
Given the varied weights and importance of multi-scale features, SA and CA are used to specify the importance of regions and channels, respectively. In both mechanisms, global average and global maximum pooling operations are applied: across the spatial dimensions in CA and across the channel dimension in SA. The CA mechanism employs 1D CNNs instead of traditional dense layers to extract more meaningful features for assessing feature importance, and the features from the global average and maximum operations are combined through an add layer.
In the SA mechanism, the global average and global maximum results are concatenated along their respective channels. This concatenated tensor passes through a 2D convolutional layer with 64 filters. Subsequently, a convolutional layer with one filter and a 1 × 1 kernel, together with a sigmoid function, generates a heatmap that scales the initial input feature map. The outputs of the CA and SA mechanisms are then integrated using an add layer.
Next, three convolutional layers with filter sizes of 32, 32, and 16 decode the extracted features. A batch normalization layer normalizes the features, keeping the mean output close to 0 and the output standard deviation close to 1, which enhances the stability of the optimization process. Finally, the extracted features are flattened using a flatten layer, and two dense layers with 2048 and 1024 neurons form the final layers of the first branch. All convolutional and dense layers in this branch use the Rectified Linear Unit (ReLU) activation function.
The second branch of CVTNet starts by dividing input images into nine non-overlapping patches, each with a size of 4 × 4 × 14. These patches are then flattened and encoded individually. The encoding process involves a positional embedding step, in which an embedding layer generates embeddings for each patch, mapping them to a lower-dimensional space (128).
CVTNet's second branch consists of eight transformer blocks, each comprising several stages (see
Figure 7 and
Figure 8). First, a normalization layer is applied to the encoded patches. Then, MHSeA is used to capture relationships between the patches. The output of the MHSeA mechanism is combined with the encoded patches from the previous step, followed by another layer normalization (see
Figure 7). An MLP processes the output of the MHSeA, which includes dense layers with ReLU activation functions and dropout with a rate of 0.1 (see
Figure 7). Finally, the output of this process passes through dense layers with 2048 and 1024 neurons, utilizing ReLU activation functions and dropout with a 0.1 dropout rate (see
Figure 8).
The outputs of the convolutional branch and the ViT branch are integrated using an add layer. Finally, there is a dense layer with 11 neurons and a softmax activation function used to classify the label of the input patch.
2.7. Validation Process
The validation of the CVTNet model employed a six-fold cross-validation (CV) strategy. The dataset D, comprising the labeled satellite image patches, was partitioned into six mutually exclusive folds. The model was trained six times, each time holding out one fold for validation and training on the remaining five. This resulted in roughly 83% of the data (12,718 patches) being used for training and 17% (2604 patches) for validation in each iteration. The number of folds was determined through trial and error to balance robust validation and computational efficiency.
Figure 9 illustrates the six-fold CV process.
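The fold construction can be sketched as follows; this is a generic k-fold split in NumPy, not the authors' exact partitioning code:

```python
import numpy as np

def k_fold_splits(n_samples, k=6, seed=0):
    """Sketch of the six-fold CV used to validate CVTNet: shuffle the
    patch indices once, cut them into k nearly equal folds, and yield a
    (train, validation) index pair with one fold held out per iteration."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, val

# With the study's 15,322 labelled patches, each held-out fold contains
# roughly 17% of the data, matching the split described above.
splits = list(k_fold_splits(15322, k=6))
```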
To evaluate the performance of the CVTNet model, we used several metrics: F1-Score (F1), Kappa Coefficient (KC), Overall Accuracy (OA), Precision, and Recall. OA measures the proportion of correctly classified regions within the entire image, calculated by dividing the number of correctly classified pixels by the total number of pixels. The KC indicates the level of agreement between the reference data and the classified map, providing a normalized assessment of classification accuracy. The F1 score, crucial for imbalanced data, balances Precision and Recall. Precision (positive predictive value) reflects the accuracy of detected pixels for each class, avoiding false positives. Recall (sensitivity) shows how many actual pixels in each class are correctly classified, emphasizing the model’s ability to capture all relevant instances.
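All five metrics can be derived from a single confusion matrix; the following NumPy sketch shows the relationships (macro averaging across classes is an assumption here, as the paper does not state its averaging scheme):

```python
import numpy as np

def classification_metrics(cm):
    """Derive OA, Kappa, and macro-averaged Precision/Recall/F1 from a
    confusion matrix `cm` (rows = reference classes, columns = predictions).
    A sketch of the evaluation metrics described above."""
    cm = cm.astype(float)
    total = cm.sum()
    oa = np.trace(cm) / total                        # overall accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                     # chance-corrected agreement
    recall = np.diag(cm) / cm.sum(axis=1)            # per-class sensitivity
    precision = np.diag(cm) / cm.sum(axis=0)         # per-class positive predictive value
    f1 = 2 * precision * recall / (precision + recall)
    return oa, kappa, precision.mean(), recall.mean(), f1.mean()

cm = np.array([[50, 2], [3, 45]])   # toy two-class confusion matrix
oa, kappa, p, r, f1 = classification_metrics(cm)
```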
2.8. Occlusion Sensitivity
Occlusion sensitivity maps [55] were employed to identify the regions of the input image that most influence CVTNet's classification decisions. Occlusion sensitivity obscures sections of the input image with an occluding mask, such as a gray square, and measures the change in the probability score for a particular class as the mask's position varies. These maps highlight areas that substantially impact the classification score, as well as those that contribute little or nothing.
Figure 10 illustrates the procedure for generating heatmaps using the CVTNet model.
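The occlusion procedure can be sketched model-agnostically as follows; `toy_predict` is a dummy classifier introduced purely for illustration and is not the paper's CVTNet:

```python
import numpy as np

def occlusion_map(image, predict, target_class, mask_size=2, fill=0.5):
    """Occlusion-sensitivity sketch: slide a grey square across the image,
    re-run the classifier, and record how much the target-class score
    drops at each mask position. `predict` is any function mapping an
    (H, W, C) patch to a class-probability vector."""
    H, W, _ = image.shape
    base = predict(image)[target_class]
    heat = np.zeros((H - mask_size + 1, W - mask_size + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.copy()
            occluded[i:i + mask_size, j:j + mask_size, :] = fill
            heat[i, j] = base - predict(occluded)[target_class]  # score drop
    return heat

def toy_predict(x):
    """Dummy two-class model (an assumption): class-1 score is the mean of band 0."""
    s = x[..., 0].mean()
    return np.array([1.0 - s, s])

img = np.zeros((8, 8, 3))
img[2:4, 2:4, 0] = 1.0                          # bright band-0 blob
heat = occlusion_map(img, toy_predict, target_class=1)
```

The heatmap peaks where occluding the image hurts the target-class score most, i.e., over the evidence the classifier relies on.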
2.9. Experimental Settings
The CVTNet model was trained and tested using the TensorFlow [56] and Keras [57] packages on a machine with an Intel i7-10750H 2.6 GHz processor, 16 GB of RAM, and an NVIDIA GTX 1650 Ti graphics card. For the training phase, a batch size of 4 was determined through a trial-and-error approach. Furthermore, we used the Adam optimizer [58] with a learning rate of 0.0003 and binary cross-entropy for training the CVTNet model.
To prevent overfitting of the CVTNet model within the 200 training iterations, we implemented the early stopping technique [59]. After each iteration over the training dataset, the model was evaluated on the validation dataset; whenever the validation accuracy exceeded the previous best, the model's weights were saved. At the conclusion of training, the best-performing model was retained.
3. Results
3.1. Quantitative Results
The performance of the proposed CVTNet model in mapping wetland areas using S1 and S2 satellite data was evaluated on both the training and validation datasets (see
Table 4). The model achieved a remarkable OA of 0.961 on the training set and 0.925 on the validation set. These high OA values highlight the model’s ability to accurately classify regions within the study area. Moreover, the model attained a KC of 0.936 on the training set and 0.899 on the validation set, demonstrating consistent and reliable performance in classifying wetland regions.
In dealing with the inherent challenges of imbalanced training data, CVTNet exhibits a balanced trade-off between Precision and Recall. The F1-Score, which harmonizes these two metrics, was 0.942 on the training dataset and 0.911 on the validation dataset. This balance is crucial for achieving accurate wetland mapping, where both the detection of actual wetland pixels (Recall) and the avoidance of false positives (Precision) are vital. Additionally, a detailed breakdown of Recall and Precision values provides insight into the model's performance at the class level. The model's high Recall (0.961) on the training set and substantial Recall (0.923) on the validation set underscore its high performance in correctly identifying true positive instances of wetland regions. Similarly, Precision values of 0.924 on the training set and 0.898 on the validation set indicate the model's ability to minimize false positives.
The results in
Table 5 showcase the classification performance of the CVTNet model across different wetland classes, with a focus on key metrics including Recall, Precision, and F1-Score. In the training dataset, notable achievements are observed, particularly in classes like pasture, shrubland, urban, and water, where the model achieves perfect Recall (1.0), Precision (1.0), and F1-Score (1.0) values. The model also demonstrates strong performance in correctly identifying classes such as bog (Recall: 0.97, Precision: 0.95, F1: 0.96), fen (Recall: 0.99, Precision: 0.95, F1: 0.97), and forest (Recall: 0.94, Precision: 0.98, F1: 0.96). However, it exhibits some challenges in classes like exposed (Recall: 0.87, Precision: 0.83, F1: 0.85) and grassland (Recall: 0.95, Precision: 0.83, F1: 0.89), where Precision is comparatively lower.
For the validation dataset, the CVTNet continues to exhibit commendable performance, maintaining high Recall, Precision, and F1-Score values across several wetland classes. Classes like pasture, shrubland, urban, and water once again stand out with excellent performance metrics (Recall, Precision, and F1-Score all equal to 1.0). The model shows robustness in correctly classifying diverse wetland types, including bog (Recall: 0.96, Precision: 0.94, F1: 0.95), fen (Recall: 0.93, Precision: 0.93, F1: 0.93), forest (Recall: 0.94, Precision: 0.94, F1: 0.94), marsh (Recall: 0.89, Precision: 0.89, F1: 0.89), and swamp (Recall: 0.91, Precision: 0.97, F1: 0.94). Nevertheless, there are slight variations in Precision for certain classes, such as exposed (Recall: 0.78, Precision: 0.78, F1: 0.78) and grassland (Recall: 0.81, Precision: 0.91, F1: 0.86), indicating potential areas for improvement.
3.2. Model Results Comparison
To evaluate the efficacy of our proposed CVTNet model, we conducted a comparative analysis on the validation dataset against contemporary algorithms from remote sensing and computer vision, including the ViT [54], MLP-mixer [60], hybrid spectral net (HybridSN) [61], and Random Forest (RF) [7] classifiers.
Table 6 presents key performance metrics, including OA, KC, F1, Recall, and Precision, for these models. Notably, the proposed CVTNet model outperformed state-of-the-art algorithms, achieving the highest OA of 0.921, along with a KC of 0.899, F1 of 0.911, Recall of 0.923, and Precision of 0.898. These detailed metrics highlight the superior performance of CVTNet in accurately classifying wetland areas compared to other evaluated models.
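For reference, OA is the fraction of correctly classified pixels, and KC (Cohen’s kappa) corrects that agreement for chance. A minimal sketch using an illustrative confusion matrix (not the paper’s data):

```python
import numpy as np

# Illustrative confusion matrix: rows = true class, columns = predicted.
cm = np.array([
    [96,  2,  2],
    [ 3, 93,  4],
    [ 2,  4, 94],
])

n = cm.sum()
oa = np.trace(cm) / n                                # overall accuracy
pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # expected chance agreement
kappa = (oa - pe) / (1 - pe)                         # Cohen's kappa (KC)
print(f"OA={oa:.3f}, KC={kappa:.3f}")
```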
These comparative results demonstrate that the CVTNet model outperforms the other models in accuracy, a result that can be attributed to several key factors. CVTNet combines CNNs with the ViT architecture: CNNs are highly effective at extracting local spatial features, while ViT excels at modeling long-range dependencies. This hybrid design allows CVTNet to exploit both local and global features, leading to more accurate wetland mapping. The model’s ability to extract multi-scale features from S1 and S2 data further improves its robustness; by fusing S1 with S2, CVTNet benefits from complementary radar and optical information, resulting in improved classification performance. Additionally, the AM in ViT enables CVTNet to focus on the most relevant parts of the input data, sharpening its ability to distinguish wetlands from other land cover types. This targeted focus reduces misclassifications and improves overall mapping accuracy.
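The AM referred to here is, at its core, scaled dot-product self-attention over patch embeddings: each patch is re-weighted by its relevance to every other patch. A minimal NumPy sketch, with illustrative shapes rather than the model’s actual dimensions, and omitting the learned query/key/value projections a full ViT applies first:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """softmax(x x^T / sqrt(d)) x over a sequence of patch embeddings."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                  # patch-to-patch affinities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # row-wise softmax
    return w @ x                                   # relevance-weighted embeddings

rng = np.random.default_rng(0)
patches = rng.standard_normal((16, 32))  # 16 image patches, 32-dim embeddings
out = self_attention(patches)
print(out.shape)  # (16, 32): same shape, each patch now a weighted mix of all patches
```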
Furthermore, Figure 11 presents the classification maps obtained with these models. Visual analysis of these wetland maps shows that the CVTNet model successfully differentiated between non-wetlands and wetlands, demonstrating superior accuracy, particularly in classifying smaller objects.
5. Conclusions
In this study, we introduced and evaluated the CVTNet model for wetland mapping using S1 and S2 satellite data. The proposed model, integrating CNN and ViT algorithms, demonstrated exceptional performance in accurately classifying diverse wetland areas. The incorporation of AMs, including CA and SA, proved instrumental in enhancing the model’s ability to discern intricate features within image patches, leading to superior classification accuracy in the complex wetland environment.
The quantitative results showcased remarkable OA values of 0.961 on the training set and 0.925 on the validation set, highlighting the model’s efficacy in classifying wetland regions. Moreover, the model exhibited a balanced trade-off between Precision and Recall, crucial for accurate wetland mapping. The comparison with contemporary algorithms, including ViT, MLP-mixer, HybridSN, and Random Forest, reaffirmed the superior performance of CVTNet across various metrics.
Detailed analyses, such as sensitivity maps and examination of model performance at class boundaries, provided valuable insights. AMs were found to significantly impact the model’s focus, especially in discerning the unique characteristics of wetland classes. Challenges at class boundaries were identified, emphasizing the need for further improvement in handling transitional areas between different wetland classes. Future research could focus on refining the model’s performance at class boundaries and exploring additional AMs for further improvement in accuracy.