Article

SwinDefNet: A Novel Surface Water Mapping Model in Mountain and Cloudy Regions Based on Sentinel-2 Imagery

1
College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
2
National Earthquake Response Support Service, Beijing 100085, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(14), 2870; https://doi.org/10.3390/electronics13142870
Submission received: 1 June 2024 / Revised: 16 July 2024 / Accepted: 19 July 2024 / Published: 21 July 2024
(This article belongs to the Special Issue Applications of Deep Neural Network for Smart City)

Abstract

Surface water plays a pivotal role in the context of climate change, human activities, and ecosystems, underscoring the significance of precise monitoring and observation of surface water bodies. However, the intricate and diverse nature of surface water distribution poses substantial challenges to accurate mapping. The extraction of water bodies from medium-resolution satellite remote sensing images using CNN methods is constrained by limitations in receptive fields and inadequate context modeling capabilities, resulting in the loss of boundary details of water bodies and suboptimal fusion of multi-scale features. The existing research on this issue is limited, necessitating the exploration of novel deep-learning network combinations to overcome these challenges. This study introduces a novel deep learning network combination, SwinDefNet, which integrates deformable convolution and Swin Transformer for the first time. By enhancing the effective receptive field and integrating global semantic information, the model can effectively capture the diverse features of water bodies at various scales, thereby enhancing the accuracy and completeness of water extraction. The model was evaluated on Sentinel-2 satellite images, achieving an overall accuracy of 97.89%, an F1 score of 92.33%, and, notably, an accuracy of 98.03% in mountainous regions. These findings highlight the promising potential of this combined approach for precise water extraction tasks.

1. Introduction

Surface water is an essential component of the Earth’s ecosystem and a key influencing factor in climate change, ecological protection, and human activities [1]. It serves critical functions in environmental monitoring and management [2], ecological conservation and restoration, disaster prevention and management, emergency response, agricultural practices, land utilization, and ensuring water safety for communities [3]. By acquiring and examining the spatial distribution and area of water bodies, it is possible to provide guidance and assistance for improved human lifestyles [4]. Therefore, accurate mapping of surface water is of great significance for both environmental monitoring and societal progress.
Due to its extensive coverage, relatively high spatial and temporal resolutions, and the significant advantage of continuous monitoring of the Earth’s surface, Remote Sensing (RS) technology has become a widely used tool in the extraction of surface water [5]. The techniques employed for water extraction from satellite remote sensing images can be classified into three main categories: (1) threshold-based methods, (2) machine learning methods, and (3) hybrid methods. Threshold-based methods primarily rely on the spectral reflectance characteristics of water bodies in specific bands and include both single-band and multi-band approaches. The former uses data from a single band [6] to identify water bodies, while the latter employs a set of multi-band data to detect water bodies through mathematical and logical operations, such as the Normalized Difference Water Index (NDWI) [7], Modified Normalized Difference Water Index (MNDWI) [8], Multi-band Water Index (MBWI) [9], Background Difference Water Index (BDWI) [10], Normalized Difference Water Fraction Index (NDWFI) [11], and Composite Normalized Difference Water Index (CNDWI) [12], among others. Machine learning methods such as Fuzzy C-means, K-means clustering, support vector machines, decision trees, and random forests are also utilized for water body identification and extraction. Hybrid methods combine water feature analysis with machine learning classifiers or multi-classifier ensembles to achieve precise water body extraction. The image processing involved in hybrid methods is intricate, involving multiple influencing factors and high levels of uncertainty. Traditional methods typically rely heavily on the expertise of domain specialists and may have limited capabilities in expressing features, making it challenging to fully comprehend intricate semantic details and spatial relationships between pixels.
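As a concrete illustration of the multi-band indices listed above, NDWI and MNDWI reduce to simple per-pixel band arithmetic; the reflectance values below are made-up toy numbers, not measurements:

```python
# Two of the most common spectral water indices, computed per pixel.

def ndwi(green, nir):
    """Normalized Difference Water Index (McFeeters): (G - NIR) / (G + NIR)."""
    return (green - nir) / (green + nir)

def mndwi(green, swir):
    """Modified NDWI (Xu): (G - SWIR) / (G + SWIR)."""
    return (green - swir) / (green + swir)

# Water reflects relatively strongly in green and weakly in NIR/SWIR, so
# both indices come out positive over water and negative over most land.
water = ndwi(green=0.30, nir=0.05)
soil = ndwi(green=0.20, nir=0.35)
assert water > 0 > soil
assert mndwi(green=0.30, swir=0.03) > 0
```

A threshold-based method then labels a pixel as water when its index value exceeds a chosen (often scene-dependent) threshold, which is exactly the manual tuning step that the deep learning methods discussed next avoid.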
Compared to conventional water extraction techniques, deep learning methods can learn and investigate intricate features, facilitating the identification of more complex and nonlinear water characteristics [13]. These methods eliminate the need for manual adjustment of optimal thresholds, can adapt to large-scale learning, and demonstrate greater flexibility and generality, leading to their extensive utilization in water extraction studies. Isikdogan [14] introduced a distinctive CNN architecture named DeepWaterMap, which utilizes a fully convolutional network design to reduce the number of parameters requiring training and enable comprehensive, large-scale analysis. This network integrates the shape, texture, and spectral characteristics of water bodies to filter out interfering elements such as snow, ice, clouds, and terrain shadows. Chen [15] introduced an innovative approach for detecting open water in urbanized regions, utilizing unequal and physical size constraints to recognize water bodies in urban settings. This approach tackles the significant confusion errors of traditional water indices in high spatial resolution images. The potential utility of this approach in extensive water detection tasks is validated through experimental confirmation using spectral libraries and authentic high spatial resolution RS imagery. Kang et al. [16] introduced a multi-scale context extraction network, MSCENet, designed for accurate and efficient extraction of water bodies from high-resolution optical RS images. This network incorporates multi-scale feature encoders, feature decoders, and context feature extraction modules. Specifically, the feature encoder employs Res2Net to capture detailed multi-scale information about water bodies, effectively managing variations in their shape and size.
The context extraction module, consisting of an expanded convolutional unit and a sophisticated multi-kernel pooling unit, further refines multi-scale contextual information to generate enhanced high-level feature maps. Luo et al. [17] proposed an automated technique for surface water mapping and developed a novel surface water mapping model named WatNet. This model addresses the challenge of reduced mapping accuracy resulting from the similarity between non-water and water features, employing a customized design for mapping surface water to achieve precise identification of smaller water bodies. The study also established the Earth Surface Water Knowledge Base (ESWKB), a publicly accessible dataset based on Sentinel-2 images. Li et al. [18] introduced a water index-driven deep fully convolutional neural network (WIDFCN) approach that achieves accurate water delineation without the need for manually collected samples. WIDFCN effectively manages scale and spectral variations of surface water and demonstrates robustness in experiments involving various types of shadows, such as those from buildings, mountains, and clouds. The key aspect of this approach is the extraction of high-precision but incomplete water boundaries from water spectral indices, which are subsequently expanded to enhance completeness. This approach thus offers an efficient strategy for automatically generating training samples without manual labeling, leading to a significant reduction in economic costs. Zhang et al. [19] introduced MRSE-Net, an end-to-end CNN water segmentation network that incorporates multi-scale residual and squeeze-excitation (SE) attention mechanisms. The network leverages the SE-attention module to improve prediction accuracy by mitigating water boundary ambiguity and employs multi-scale residual modules to precisely extract water pixels, thereby addressing the issue of unclear boundaries in small river water bodies.
Yu et al. [20] proposed WaterHRNet, a network comprising a multi-branch high-resolution feature module (HRNet), a feature attention module, and a segmentation head module. This hierarchical, focus-driven high-resolution network provides high-quality semantic feature representations for accurate water body segmentation across diverse scenarios. Xin Lyu et al. [21] introduced MSNANet, a multi-scale normalized attention network designed for precise water body extraction in complex environments. The network integrates the Multi-Scale Normalized Attention (MSNA) module to fuse multi-scale water body features, emphasizing feature representations. Additionally, it incorporates an optimized atrous spatial pyramid pooling (OASPP) module to refine feature representations through contextual information, thereby improving segmentation performance. Kang et al. [22] proposed WaterFormer, a hybrid model combining transformer and convolutional neural network architectures for precise water detection tasks. The network features dual-stream CNNs, Cross-Level Visual Transformers (CL-ViT), lightweight attention modules (LWA), and sub-pixel upsampling modules (SUS). The dual-stream network abstracts water features from various perspectives and levels, integrates cross-level visual transformers to capture long-range dependencies between spatial information and semantic features, and uses the lightweight attention and sub-pixel upsampling modules to enhance feature abstraction and generate high-quality class-specific representations.
The above analysis of current mainstream water extraction techniques reveals a predominant reliance on Convolutional Neural Networks (CNN). While CNNs exhibit remarkable feature extraction capabilities, their limited receptive field hinders the comprehensive capture of global context information in images. Moreover, the convolution and pooling processes often lead to the loss of image details. When CNNs are employed on medium-resolution satellite remote sensing images, the varied spatial distribution of water and the complexity of the environmental background can result in the loss of boundary details, thereby affecting the accuracy of water extraction. Consequently, CNNs present certain constraints in water body extraction. In recent years, Transformers have garnered attention for their exceptional semantic representation capabilities and proficiency in modeling global information relationships. Notably, the Swin Transformer [23] has showcased robust feature extraction, contextual modeling, and multi-scale feature fusion capabilities, offering a promising approach for precise water body extraction from remote sensing images. Nevertheless, research in this domain remains limited. Therefore, this study aims to explore a novel integration of deep learning networks that maximizes multi-scale information utilization to enhance the detection of water body features. For the first time, we combine deformable convolutions [24] with the Swin Transformer to expand effective receptive fields and better incorporate global semantic information. We utilized Sentinel-2 image datasets to validate the effectiveness of our approach in identifying water bodies of various sizes and across diverse environmental conditions, such as mountainous regions and cloudy areas. The outcomes demonstrate the model’s high accuracy.
This network adeptly integrates the robust local feature extraction abilities of Convolutional Neural Networks (CNNs) with the comprehensive global feature extraction capabilities of Swin Transformers, facilitating precise extraction of water bodies. The primary contributions of our research are delineated as follows:
(1) A novel combination of deep learning networks was developed, integrating CNNs and Swin Transformers for the first time. The enhanced model focuses on extracting water body features, with a specific emphasis on accurately delineating water body boundaries. To accomplish this objective, the model leverages CNNs to capture image details and edge information while utilizing Swin Transformers to model global contextual information for a more comprehensive understanding of image semantic information. This hybrid model effectively incorporates both detailed image information and global contextual cues, leading to enhanced accuracy and performance in semantic segmentation.
(2) Given the intricate morphology and size discrepancies of water body boundaries, deformable convolutions are employed for the accurate extraction of water body boundary features. Deformable convolutions, through the incorporation of offsets, enable the adaptive modification of receptive fields. This adjustment allows convolutional kernels to flexibly deform on input feature maps based on target shapes, thereby enhancing the precision in capturing water body features.
The distribution of content in the subsequent sections of this paper is outlined as follows: Section 2 delineates the specifics of the data and methodologies employed in this study. Section 3 scrutinizes the experimental findings and furnishes the experimental setup. Section 4 deliberates on the ablation experiments. Ultimately, our conclusions are expounded upon in Section 5.

2. Materials and Methods

2.1. Materials

We employed the Earth Surface Water Knowledge Base (ESWKB) dataset, which is publicly available [25]. This dataset extensively utilizes Sentinel-2 satellite imagery resources and carefully selects 95 different scenes to encompass various types of water bodies under diverse environmental conditions. The labeling of surface water was carried out using six medium-resolution spectral bands: blue, green, red, near-infrared (NIR), mid-infrared 1, and mid-infrared 2. The utilization of this combination significantly aids in the extraction of water bodies by leveraging spectral characteristics [26].
The Sentinel-2 mission has been developed, built, and overseen by the European Space Agency’s Copernicus Programme since 2015 [27]. The primary goal of this mission is to monitor the Earth’s surface, providing essential services such as forest monitoring, identification of changes in land use, and efficient handling of natural calamities. The Sentinel-2 satellite is furnished with a Multi-Spectral Instrument (MSI) that can acquire images in 13 bands, with resolutions varying from 10 m to 60 m.
To evaluate the efficacy of SwinDefNet, a total of 38 images from the ESWKB dataset were chosen for the test dataset, while the remaining 57 images were allocated to the training set. Six images from the test set were specifically selected to demonstrate the model’s performance, encompassing mountainous regions and cloud-covered areas (refer to Figure 1 and Table 1).

2.2. Methods

2.2.1. Architecture of Encoder-Decoder

The encoder-decoder architecture is a pivotal component in various applications such as feature extraction, data compression, sequence mapping, and multi-level feature fusion. It facilitates the handling of intricate data transformation and generation tasks and finds extensive applications in natural language processing (NLP), digital image processing, time series prediction, and speech recognition, among others. Contemporary deep learning models built on CNNs, RNNs, and Transformers frequently adopt this design. Numerous semantic segmentation models, such as U-Net [28], DeepLabV3 [29], and PSPNet [30], are constructed on the encoder-decoder architecture. This architecture comprises two main components:
(1) Encoder: Maps high-dimensional input images to low-dimensional representations. It extracts features from input images and condenses the spatial dimensions of the feature maps;
(2) Decoder: Reconstructs the low-dimensional representations mapped by the encoder to the original input. It restores the spatial scale and target details of the original image.
The objective of semantic segmentation is to assign a semantic category to every pixel in an image. This research aims to accurately delineate surface water bodies and enhance their extraction from remote sensing imagery. Both objectives require a detailed comprehension of individual pixels in the image and the assignment of a semantic classification to each pixel. To achieve this, we employ the encoder-decoder architecture, a prevalent approach in semantic segmentation, to develop the network for delineating surface water bodies. The schematic representation of the overall framework is depicted in Figure 2.
In consideration of the scale of feature extraction and model computational efficiency, the Swin Transformer and DeepLabV3+ [31] have been selected as the encoder and decoder components of the model, respectively. The Swin Transformer utilizes a hierarchical structure that incorporates varying numbers of transformer layers at different stages to capture relationships among different regions using the self-attention mechanism. This approach effectively manages image information at diverse scales. Additionally, the model reduces computational costs by employing the windowed self-attention mechanism. On the other hand, DeepLabV3+ is specifically designed for semantic segmentation and relies on dilated convolutions as its fundamental element. These convolutions enhance segmentation accuracy by integrating features at multiple scales and efficiently preserving detailed object boundaries.
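A rough sketch of why the dilated convolutions in DeepLabV3+ enlarge the receptive field without adding parameters: a k × k kernel with dilation rate d spans d·(k − 1) + 1 pixels per side. The rates below are typical ASPP-style values for illustration, not necessarily the exact configuration used in this model.

```python
# Span (in pixels, per axis) covered by a dilated convolution kernel.

def dilated_span(kernel, dilation):
    """Pixels spanned along one axis by a k x k kernel with dilation d."""
    return dilation * (kernel - 1) + 1

# An ordinary 3x3 kernel spans 3 pixels per side ...
assert dilated_span(3, 1) == 3
# ... while the same 9-parameter kernel with illustrative ASPP-style
# rates 6, 12, 18 spans much larger neighborhoods.
assert [dilated_span(3, d) for d in (6, 12, 18)] == [13, 25, 37]
```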

2.2.2. Water Extraction Network Based on Swin Transformer

The Transformer, initially proposed for natural language processing (NLP), distinguishes itself from CNNs in feature extraction through the utilization of self-attention [32]. Additionally, the Transformer exhibits robust capabilities in modeling global information relationships. Nonetheless, it faces challenges related to high computational complexity when processing long sequence inputs. In 2021, the Swin Transformer was introduced to mitigate the issues associated with large-scale visual entities and computational complexity. This technology has demonstrated significant potential in addressing various visual tasks, including image classification, object detection, and semantic segmentation. In this investigation, the Swin Transformer is utilized to enhance the representation of water features at different scales and to establish comprehensive global contextual information for precise delineation of water boundaries. The architecture of the network for surface water mapping, which employs the Swin Transformer for feature extraction, is depicted in Figure 3, utilizing an encoder–decoder framework.
The encoder is established based on the Swin Transformer, which directly takes the original image as input. The images are segmented into patches and reshaped into feature vectors through patch embedding. The Swin Transformer functions as a hierarchical feature extraction network, creating hierarchical feature maps with linear computational complexity relative to the image size. It is structured into four stages, each consisting of patch merging and Swin Transformer blocks. Patch merging performs downsampling at the onset of each stage to decrease resolution and adjust channel numbers, thereby reducing computational costs. Each Swin Transformer block comprises windowed multi-head self-attention (W-MSA) and shifted window multi-head self-attention (SW-MSA) layers for local feature extraction. Notably, the Swin Transformer conducts both intra-window and inter-window attention computations to extract global information. In the shifted window operation, the feature map is cyclically shifted and attention is computed with a predefined mask, which avoids increasing the number of windows after shifting while yielding an equivalent computation. This approach enables multi-scale feature extraction while improving computational efficiency through cross-window connections.
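The shifted-window step described above can be illustrated with a toy example (a pure-Python sketch, not the authors' implementation): cyclically rolling the token grid by half the window size lets the second attention pass see windows that straddle the old window borders, while the total number of windows stays the same.

```python
# Toy demonstration of W-MSA / SW-MSA window partitioning.

def cyclic_shift(grid, shift):
    """Roll a 2-D grid (list of lists) up and left by `shift` positions."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i + shift) % h][(j + shift) % w] for j in range(w)]
            for i in range(h)]

def partition_windows(grid, win):
    """Split an h x w grid into non-overlapping win x win windows."""
    h, w = len(grid), len(grid[0])
    assert h % win == 0 and w % win == 0
    windows = []
    for bi in range(0, h, win):
        for bj in range(0, w, win):
            windows.append([row[bj:bj + win] for row in grid[bi:bi + win]])
    return windows

# An 8x8 token grid, window size 4, shift = window size // 2 = 2.
grid = [[(i, j) for j in range(8)] for i in range(8)]
plain = partition_windows(grid, 4)
shifted = partition_windows(cyclic_shift(grid, 2), 4)

# Both passes attend over the same number of windows ...
assert len(plain) == len(shifted) == 4
# ... but a shifted window mixes tokens from several original windows,
# creating the cross-window connections.
assert shifted[0][0][0] == (2, 2) and shifted[0][-1][-1] == (5, 5)
```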
In the encoder section of our model, each stage produces a feature layer, yielding four feature maps of varying sizes, corresponding to 1/4, 1/8, 1/16, and 1/32 of the input size. The feature layers generated from the first, second, and fourth stages are labeled as basic, intermediate, and advanced feature layers, respectively, constituting a set of feature maps that capture different scales of water features. These feature maps are subsequently fed into the decoder section.
In this research, Deeplabv3+ is utilized to interpret the features extracted by the encoder. Deeplabv3+ is a semantic segmentation model that relies on dilated convolutions. By adjusting dilation factors, it can create receptive fields of various scales to effectively capture multi-scale information. This model demonstrates strong feature extraction capabilities, which assist the network in discerning distinctions between water bodies and their surroundings. Notably, it is proficient in recovering intricate boundary details of targets. Furthermore, deformable convolutions are integrated into the ASPP module to enhance water feature extraction in the advanced feature layer. The internal configuration of the ASPP is depicted in Figure 4.
The patch size was set to 4 and the embedding dimension to 96, considering the hardware limitations of the computer. This configuration effectively reduces computational complexity while capturing water body features comprehensively. The four stages employ 2, 2, 6, and 2 Swin Transformer blocks with 3, 6, 12, and 24 self-attention heads, respectively. A window size of 7 was chosen, so each window covers 49 patches. Because large remote sensing images substantially increase computational complexity, images were cropped to 256 × 256 pixels during model calculation. These parameters were carefully selected to ensure compatibility with the processor and to achieve high accuracy within a relatively short timeframe.
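A back-of-the-envelope check of this configuration (patch size 4, embedding dimension 96, window size 7, 256 × 256 input); the per-stage grid sizes and channel counts below follow from standard Swin Transformer patch merging and are our arithmetic, not values measured from the authors' implementation:

```python
# Token-grid and channel arithmetic for the stated configuration.
patch_size = 4
window_size = 7
img_size = 256
embed_dim = 96

# Patch embedding turns the image into a (256/4) x (256/4) = 64 x 64 token grid.
tokens_per_side = img_size // patch_size
assert tokens_per_side == 64

# Each patch-merging step halves the grid and doubles the channel count.
stage_sides = [tokens_per_side // (2 ** s) for s in range(4)]
stage_dims = [embed_dim * (2 ** s) for s in range(4)]
assert stage_sides == [64, 32, 16, 8]       # stages 1-4
assert stage_dims == [96, 192, 384, 768]

# A 7 x 7 attention window always covers 49 patches, regardless of stage.
assert window_size ** 2 == 49
```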

2.2.3. Deformable Convolution

Remote sensing images encompass a wealth of surface information. Nevertheless, the diverse resolutions and shapes of water bodies can lead to substantial disparities in the information conveyed by remote sensing images. For example, a small pond of roughly 100 m² may be condensed into a single 10 m pixel, posing challenges for conventional convolution operations in accurately discerning its characteristics. Conventional convolutional kernels are constrained by fixed receptive fields and dimensions, rendering them inadequate in accommodating geometric variations. Although they handle regularly shaped water bodies influenced by human activities well, they struggle to extract pertinent features from targets that are excessively large or small. Some studies have employed dilated convolutions to surmount this challenge, with the objective of enlarging the effective receptive field and capturing information across multiple scales. Nonetheless, this strategy may omit intricate image details. To enhance the extraction of features from the input remote sensing images, we propose the utilization of deformable convolutions.
In comparison to standard and dilated convolution, deformable convolution predicts offsets for the feature sampling points. This allows sampling positions to change adaptively, so that the sampling points align more closely with the target. Deformable convolution is designed to accommodate irregular situations and effectively manage geometric deformations during feature extraction. By adding a direction vector to each position of the convolution kernel, it enables adaptive shape modifications, automatic scale adjustments, and variations in the receptive field, aiming to align the sampling pattern with the shapes and sizes of objects through offset parameters learned on top of a standard convolution. Deformable convolution first generates offset layers in the X and Y directions (2N channels) from the input feature map; these offsets are then applied during convolution to produce the output feature map. The implementation process is depicted in Figure 5.
First, we define the sampling grid R of a standard 3 × 3 convolution kernel, given in Equation (1). The output of a standard convolution at location p0 is then the weighted sum of the sampled values, as shown in Equation (2). Finally, an offset Δpn is introduced for each sampling point of R, as defined in Equation (3), yielding the output of the deformable convolution in Equation (4).
R = {(−1, −1), (−1, 0), …, (0, 1), (1, 1)}  (1)
y1(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + pn)  (2)
{Δpn | n = 1, …, N},  N = |R|  (3)
y2(p0) = Σ_{pn ∈ R} w(pn) · x(p0 + pn + Δpn)  (4)
where R denotes the sampling grid of a 3 × 3 convolution kernel with stride 1, w(pn) is the kernel weight at position pn, y1(p0) is the standard convolution output at location p0, Δpn denotes the learned offset of the n-th sampling point, and y2(p0) is the corresponding output of the deformable convolution.
When the feature map from the preceding layer is subjected to a 3 × 3 deformable convolution, a separate 3 × 3 convolutional layer (referred to as the offset field) is first applied, producing an output with the same spatial dimensions as the input feature map and 2N channels, denoting the offsets in the X and Y directions. Standard convolution is then performed at the offset sampling positions. Because the offsets are generally non-integer, bilinear interpolation is used: the pixel value at each shifted location is approximated by the weighted mean of the four adjacent pixel values surrounding it. Each sampling point must therefore account for the four neighboring pixels its post-offset location may fall between, so within a 3 × 3 kernel the nine sampling points may draw on up to 36 distinct pixel values. Because these positions vary with the learned offsets, features of multi-scale and irregularly shaped targets can be extracted. This effectively expands the receptive field of the convolution operation, enabling the network to capture intricate structures and subtle variations in remote sensing images with greater precision and accuracy.
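A minimal pure-Python sketch of Equations (1)–(4) and the bilinear interpolation step, computing one output value of a 3 × 3 deformable convolution; the feature map, weights, and offsets are made-up toy values, not the model's learned parameters:

```python
import math

# 3x3 kernel sampling grid R from Equation (1).
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def bilinear(x, py, px):
    """Sample feature map x at fractional (py, px): weighted mean of the
    four surrounding integer pixels (out-of-bounds neighbors contribute 0)."""
    y0, x0 = math.floor(py), math.floor(px)
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < len(x) and 0 <= xx < len(x[0]):
                val += (1 - abs(py - yy)) * (1 - abs(px - xx)) * x[yy][xx]
    return val

def deform_conv_at(x, w, p0, offsets):
    """Equation (4): y2(p0) = sum over pn in R of w(pn) * x(p0 + pn + dpn)."""
    y0, x0 = p0
    out = 0.0
    for (dy, dx), wn, (oy, ox) in zip(R, w, offsets):
        out += wn * bilinear(x, y0 + dy + oy, x0 + dx + ox)
    return out

feat = [[float(i + j) for j in range(5)] for i in range(5)]  # toy feature map
weights = [1.0 / 9] * 9                                      # averaging kernel

# With zero offsets, Equation (4) reduces to standard convolution (Eq. (2)).
standard = deform_conv_at(feat, weights, (2, 2), [(0.0, 0.0)] * 9)
assert abs(standard - 4.0) < 1e-9

# Fractional offsets shift every sampling point half a pixel down-right.
shifted = deform_conv_at(feat, weights, (2, 2), [(0.5, 0.5)] * 9)
assert abs(shifted - 5.0) < 1e-9
```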

3. Results

3.1. Experimental Environment and Parameter Settings

The experiments were carried out using an Intel(R) Core(TM) i7-12700H 2.30 GHz processor with 16.0 GB of RAM, an NVIDIA GeForce RTX 3060 Laptop GPU, and CUDA 11.2. The input image size was configured to 256 × 256, batch size to 2, epochs to 200, and the learning rate was set to 0.002.
The model was trained and validated using the ESWKB dataset. The dataset was partitioned into training, testing, and validation sets in a ratio of 6:2:2. Images were randomly cropped to a specified pixel size of 256 × 256. Furthermore, basic data augmentation methods were implemented on the training set, such as image flipping and random rotations in 90° increments.
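The augmentations mentioned above (flips and rotations in 90° increments) can be sketched in a few lines on a toy 2-D "image" stored as a list of lists; in the real pipeline the same transform would be applied jointly to the image and its label mask:

```python
import random

def hflip(img):
    return [row[::-1] for row in img]

def vflip(img):
    return img[::-1]

def rot90(img):
    """Rotate 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]

def augment(img, rng):
    """Randomly flip, then rotate by 0/90/180/270 degrees."""
    if rng.random() < 0.5:
        img = hflip(img)
    if rng.random() < 0.5:
        img = vflip(img)
    for _ in range(rng.randrange(4)):
        img = rot90(img)
    return img

img = [[1, 2],
       [3, 4]]
assert rot90(img) == [[2, 4], [1, 3]]
assert hflip(img) == [[2, 1], [4, 3]]
# Four successive 90-degree rotations return the original image.
assert rot90(rot90(rot90(rot90(img)))) == img
# Augmentation permutes pixels but never changes their values.
aug = augment(img, random.Random(0))
assert sorted(v for row in aug for v in row) == [1, 2, 3, 4]
```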

3.2. Evaluation Metrics

To evaluate the performance of SwinDefNet, we employed four commonly acknowledged metrics in the domain of semantic segmentation: accuracy, precision, recall, and the F1 score.
Accuracy represents the fraction of correctly predicted pixels out of all pixels. Precision measures the fraction of accurately predicted water body pixels among all pixels classified as water bodies. Recall, defined as the ratio of accurately predicted water body pixels to all actual water body pixels, evaluates the model’s ability to capture all relevant instances. The F1 score, the harmonic mean of precision and recall, assesses the overall performance of the model.
The calculation methods for accuracy, precision, recall, and F1 score are as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 score = (2 × Precision × Recall) / (Precision + Recall)
In the formulas:
TP (True Positives): Correctly predicted water body pixels;
TN (True Negatives): Correctly predicted non-water body pixels;
FP (False Positives): Incorrectly predicted water body pixels;
FN (False Negatives): Incorrectly predicted non-water body pixels.
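The four metrics can be computed in plain Python from a pair of toy prediction and ground-truth masks (1 = water, 0 = non-water); the masks below are illustrative, not drawn from the test set:

```python
def confusion(pred, truth):
    """Count TP, TN, FP, FN over paired prediction/ground-truth pixels."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    return tp, tn, fp, fn

def metrics(pred, truth):
    tp, tn, fp, fn = confusion(pred, truth)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred  = [1, 1, 1, 0, 1, 0, 0, 0]   # one missed water pixel, one false alarm
acc, prec, rec, f1 = metrics(pred, truth)
assert acc == 0.75          # 6 of 8 pixels correct
assert prec == 0.75         # 3 of 4 predicted-water pixels are water
assert rec == 0.75          # 3 of 4 true water pixels detected
assert abs(f1 - 0.75) < 1e-9
```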

3.3. Method Comparison

To assess the efficacy of the suggested model, a comparative analysis was conducted against four established methodologies: U-Net, ResNet [33], DeepLabv3+, and DeepWaterMapv2 [34]. The training loss curves of these methods are illustrated in Figure 6. The following provides comprehensive explanations of these methodologies:
U-Net: The U-Net model is a fully convolutional neural network architecture recognized for its encoder-decoder structure, commonly applied in tasks involving image segmentation. It is particularly adept at capturing contextual information and achieving precise localization. The encoder component is responsible for extracting features through convolutional layers, whereas the decoder component works to gradually restore spatial resolution. The integration of skip connections serves to combine features from both the encoder and decoder, thereby improving the accuracy of segmentation. Due to its effectiveness and dependability, U-Net is a preferred choice in remote sensing applications.
ResNet: ResNet (Residual Network) addresses the issues of gradient vanishing and representation bottlenecks in deep networks by introducing Residual Blocks. This innovation enables the network to learn residual representations, facilitating the training of deeper models with lower error rates. In this study, we employ ResNet-50 for comparative analysis.
DeepLabv3+: DeepLabv3+ is a sophisticated semantic segmentation model that incorporates multi-scale features through the utilization of dilated convolutions and ASPP (Atrous Spatial Pyramid Pooling) modules. By employing these techniques, the model enhances the receptive field and adeptly captures contextual information. The decoder module efficiently combines high and low-resolution features to enhance segmentation outcomes, with a specific emphasis on delineating object boundaries.
DeepWaterMapv2: DeepWaterMapv2 emphasizes surface water mapping and utilizes the U-Net architecture. Through the iterative application of convolutional and pooling operations, it efficiently extracts essential features from images to accurately identify and extract water body regions.

3.4. Analysis of Experimental Results

Following the completion of model training, 20% of the dataset was allocated for validation purposes. Within the validation set, six images were specifically chosen to showcase the prediction outcomes, as illustrated in Figure 7. These images encompassed diverse terrains, including mountainous regions with significant topographical variations and areas characterized by high cloud coverage. To ensure a precise assessment of SwinDefNet’s performance, a comparative analysis was conducted against other state-of-the-art water body detection techniques. Table 2 displays the average metrics across all images in the test set. Notably, our model outperformed existing methods, achieving accuracy, precision, recall, and F1 score values exceeding 90%, with specific metrics of 97.89%, 94.98%, 90.05%, and 92.33%, respectively.
Through a comparative analysis of the four metrics, it was determined that our model exhibited the highest level of accuracy when compared to all other methods. This suggests that our model possesses superior predictive capabilities for delineating surface water and is more effective in extracting water bodies from remote sensing imagery. Furthermore, in comparison to alternative methods, our model also demonstrated the highest recall rate, signifying its ability to identify a greater number of water pixels during surface water mapping while minimizing the omission of water pixels.
Upon analyzing Table 2, ResNet attained an F1 score of 92.68%, surpassing our proposed model by 0.35 percentage points and suggesting strong overall performance. However, its recall of 88.96% implies that a significant number of water pixels may be overlooked during extraction, revealing a notable deficiency in completely delineating water bodies. In contrast, DeepWaterMapv2 demonstrated a precision of 99.07% but a markedly lower recall of 81.89%. This trade-off indicates that DeepWaterMapv2 prioritizes precision over recall in its predictions, leaving room for improving the completeness of water body extraction.
The comparison of predicted images generated by various methods is illustrated in Figure 7, while the true labels of the predicted images are depicted in Figure 8. Figure 9 and Figure 10 exhibit partial outcomes of surface water mapping in mountainous and cloudy regions, with additional result comparisons available in the Appendix A and Appendix B at the conclusion of this document. Upon examination of Figure 7 and Figure 9 and the annotated data in Figure 8, it is evident that the SwinDefNet model excels in noise suppression in the background (refer to Figure 9), surpassing the performance of ResNet and DeepLabv3+ methodologies. A detailed analysis of Table 2 indicates that, in contrast to SwinDefNet, ResNet achieves a precision of 97.26% and a recall of 88.96%, suggesting that ResNet tends to be conservative in predicting positive samples, potentially leading to the omission of significant features by classifying numerous samples as negative. Furthermore, SwinDefNet exhibits superior extraction capabilities in complex scenes, including small, curved rivers (as shown in Figure 10). In comparison to alternative methods, the delineations are more distinct, and in regions with high cloud cover, our model demonstrates the lowest likelihood of misclassifying the background as water, in contrast to U-Net and ResNet approaches.
In an effort to enhance the reliability and precision of our methodology, we carried out experiments across different regions to assess the performance of U-Net, ResNet, DeepLabv3+, and DeepWaterMapv2 in regions characterized by cloud cover and mountainous terrain. The outcomes pertaining to accuracy and F1 score are depicted in Figure 11 and Figure 12, respectively. Detailed numerical data on the four evaluation metrics for the five methods across the three regions and the complete validation dataset are presented in Table 3. Figure 13 shows the confusion matrix for SwinDefNet.
Upon examination of Table 3, our proposed model demonstrates superior performance in mountainous terrain, with accuracy, recall, and F1 score of 98.03%, 91.61%, and 93.52%, respectively. The analysis of water body extraction in mountainous regions reveals that, apart from U-Net, all methods yield commendable outcomes, with accuracy rates exceeding 97%. Despite U-Net's suboptimal performance in mountainous settings, it effectively identifies elongated and small water bodies within cloudy and urban environments.
In cloudy regions, the model attained an accuracy of 98.30% and an F1 score of 93.46%; compared with the best-performing ResNet model, the gaps are merely 0.19% and 1%, respectively. Our recall of 92.22% is the highest among all methods, suggesting that in cloudy regions we traded some precision to minimize the omission of water pixels, enabling more complete delineation of water bodies. This is further supported by the predicted images. Moreover, comparing the annotated images with the predictions, our model delineates the boundaries of small water bodies more consistently, indicating that our approach offers specific benefits in processing small water bodies.
Finally, the average per-epoch training times of the models were compared. As shown in Table 4, ResNet and DeepWaterMapv2 were the fastest at 6.83 s and 6.40 s, respectively. ResNet's efficiency can be attributed to its residual blocks, which build a deep architecture while maintaining high computational efficiency, whereas DeepWaterMapv2 reduces computational overhead through dilated convolution. DeepLabv3+ required 11.10 s per epoch, while SwinDefNet and U-Net took longer, at 12.03 s and 13.99 s, respectively. U-Net's longer time can be linked to its additional processing when handling remote-sensing image data. SwinDefNet's attention mechanism enhances performance but introduces higher computational complexity and longer training times, a trade-off deemed acceptable in this context.
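The per-epoch times in Table 4 can be measured with a simple wall-clock average. In the sketch below, `train_one_epoch` is a hypothetical stand-in for a model's actual training step, not an API from the paper.

```python
import time

def average_epoch_time(train_one_epoch, epochs):
    """Average wall-clock seconds per epoch over a fixed number of epochs."""
    total = 0.0
    for _ in range(epochs):
        start = time.perf_counter()
        train_one_epoch()
        total += time.perf_counter() - start
    return total / epochs
```

Using a monotonic clock such as `time.perf_counter` avoids distortions from system clock adjustments during long training runs.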

4. Discussion

In this section, ablation tests were conducted to assess the importance of different components and to examine the influence of the Swin Transformer and Deformable Convolution on the network’s performance. The ablation experiments utilized identical training, testing, and validation sets for the model evaluation.
The study first assessed the efficacy of integrating the Swin Transformer in land-water mapping tasks by comparing the combined Swin Transformer and DeepLabV3+ model (Model 1 in Table 5) against the DeepLabV3+ model with the original Xception backbone (Model 3 in Table 5), using Accuracy, Precision, Recall, and F1_score as evaluation metrics. The findings revealed that the Swin Transformer improved Accuracy and F1_score by 1.19% and 3.93%, respectively, while Precision and Recall improved by 6.81% and 0.28%. This demonstrates that integrating the Swin Transformer not only increased the accuracy of model predictions but also enhanced their reliability. In particular, the 6.81% increase in Precision suggests that the Swin Transformer improved the model's ability to correctly classify water pixels while reducing classification errors stemming from misjudged water pixels.
To assess the efficacy of Deformable Convolutional Networks, a DeepLabV3+ model with a Swin Transformer backbone using standard 3 × 3 convolutions (Model 2 in Table 5) was compared against the model incorporating deformable convolution (Model 1). Integrating deformable convolution improved Accuracy (+0.22%), Precision (+0.02%), Recall (+1.01%), and F1_score (+0.60%). These outcomes signify an improvement in both accuracy and reliability, with the +1.01% recall gain in particular enhancing the model's capability to identify water pixels.
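The mechanism being ablated here can be sketched compactly: each kernel tap of a deformable convolution samples the feature map at a learned fractional offset, resolved by bilinear interpolation. The single-channel NumPy illustration below shows only this core sampling idea under assumed toy inputs; it is not the implementation used in SwinDefNet.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a (H, W) feature map at fractional (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])

# Regular 3 x 3 grid positions relative to the kernel center.
KERNEL_GRID = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1), (0, 0), (0, 1),
               (1, -1), (1, 0), (1, 1)]

def deformable_tap_sum(feat, center, offsets, weights):
    """One output value: weighted sum over kernel taps displaced by learned offsets."""
    cy, cx = center
    total = 0.0
    for (ky, kx), (dy, dx), w in zip(KERNEL_GRID, offsets, weights):
        total += w * bilinear_sample(feat, cy + ky + dy, cx + kx + dx)
    return total

feat = np.arange(25, dtype=float).reshape(5, 5)
val = bilinear_sample(feat, 1.5, 1.5)  # midpoint of four neighbouring pixels
```

With all offsets zero this reduces to an ordinary 3 × 3 convolution tap; non-zero learned offsets let the sampling grid deform toward irregular shapes such as curved river boundaries, which is the property the ablation credits for the recall gain.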
Experimental analysis thus verified that both the Swin Transformer and deformable convolution are advantageous for water extraction tasks. Moreover, their combined use not only improved the accuracy of land-water mapping but also enabled more meticulous extraction of water bodies.

5. Conclusions

In this study, a novel approach was investigated for the precise extraction of water bodies from remote sensing images using a deep learning network with an encoder-decoder architecture. The methodology involved two steps: (1) integrating a state-of-the-art image classification and segmentation model, the Swin Transformer, as the encoder to capture multi-scale information, and (2) incorporating deformable convolution to dynamically adjust the receptive field for accurate extraction of water body features.
Furthermore, in order to evaluate the effectiveness of our proposed approach, we conducted training on the innovative integrated model using the ESWKB dataset. Subsequently, we applied this model to map surface water across different geographical areas. Comparative analysis with several established methods revealed that our novel network fusion method surpassed them, particularly in the precise delineation of water bodies. This outcome not only enhances the precision of water extraction but also presents a viable technique for detailed water body segmentation in remote sensing imagery.
A limitation of this research is the relatively long training time of the proposed model; future investigations will prioritize model optimization to reduce training duration. The suggested approach holds potential for diverse real-world water-related applications such as water mapping, flood monitoring, and land management, which could enhance environmental monitoring and regional planning efforts.

Author Contributions

Conceptualization, H.P. and X.C.; methodology, X.C. and H.P.; validation, X.C., H.P. and J.L.; formal analysis, X.C.; investigation, H.P.; data curation, J.L.; writing—original draft preparation, X.C.; writing—review and editing, H.P.; visualization, X.C.; supervision, H.P. and J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key Research and Development Programs, grant number 2022YFC3004405; National Natural Science Foundation of China, grant number 42061073; Natural Science Foundation of Guizhou Province, grant number [2020]1Z056.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We would like to thank the reviewers for their valuable comments and suggestions. In addition, the authors would like to thank X Luo for providing Earth’s surface water knowledge base.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Cloud Area


Appendix B. Mountainous Area


References

  1. Parajuli, J.; Fernandez-Beltran, R.; Kang, J.; Pla, F. Attentional Dense Convolutional Neural Network for Water Body Extraction From Sentinel-2 Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6804–6816. [Google Scholar] [CrossRef]
  2. Luo, X.; Hu, Z.; Liu, L. Investigating the seasonal dynamics of surface water over the Qinghai–Tibet Plateau using Sentinel-1 imagery and a novel gated multiscale ConvNet. Int. J. Digit. Earth 2023, 16, 1372–1394. [Google Scholar] [CrossRef]
  3. Li, H.; Zech, J.; Ludwig, C.; Fendrich, S.; Shapiro, A.; Schultz, M.; Zipf, A. Automatic mapping of national surface water with OpenStreetMap and Sentinel-2 MSI data using deep learning. Int. J. Appl. Earth Obs. Geoinf. 2021, 104, 102571. [Google Scholar] [CrossRef]
  4. Jiang, C.; Zhang, H.; Wang, C.; Ge, J.; Wu, F. Water Surface Mapping from Sentinel-1 Imagery Based on Attention-UNet3+: A Case Study of Poyang Lake Region. Remote Sens. 2022, 14, 4708. [Google Scholar] [CrossRef]
  5. Zhao, B.; Sui, H.; Liu, J. Siam-DWENet: Flood inundation detection for SAR imagery using a cross-task transfer siamese network. Int. J. Appl. Earth Obs. Geoinf. 2023, 116, 103132. [Google Scholar] [CrossRef]
  6. Tong, X.; Luo, X.; Liu, S. An approach for flood monitoring by the combined use of Landsat 8 optical imagery and COSMO-SkyMed radar imagery. ISPRS J. Photogramm. Remote Sens. 2018, 136, 144–153. [Google Scholar] [CrossRef]
  7. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  8. Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 2006, 27, 3025–3033. [Google Scholar] [CrossRef]
  9. Wang, X.; Xie, S.; Zhang, X.; Chen, C.; Guo, H.; Du, J.; Duan, Z. A robust Multi-Band Water Index (MBWI) for automated extraction of surface water from Landsat 8 OLI imagery. Int. J. Appl. Earth Obs. Geoinf. 2018, 68, 73–91. [Google Scholar] [CrossRef]
  10. Li, L.; Su, H.; Du, Q.; Wu, T. A novel surface water index using local background information for long term and large-scale Landsat images. ISPRS J. Photogramm. Remote Sens. 2021, 172, 59–78. [Google Scholar] [CrossRef]
  11. Cai, Y.; Shi, Q.; Liu, X. Spatiotemporal Mapping of Surface Water Using Landsat Images and Spectral Mixture Analysis on Google Earth Engine. J. Remote Sens. 2024, 4, 117. [Google Scholar] [CrossRef]
  12. Sun, Q.; Li, J. A method for extracting small water bodies based on DEM and remote sensing images. Sci. Rep. 2024, 14, 760. [Google Scholar] [CrossRef] [PubMed]
  13. Yan, X.; Song, J.; Liu, Y.; Lu, S.; Xu, Y.; Ma, C.; Zhu, Y. A Transformer-based method to reduce cloud shadow interference in automatic lake water surface extraction from Sentinel-2 imagery. J. Hydrol. 2023, 620, 129561. [Google Scholar] [CrossRef]
  14. Isikdogan, F.; Bovik, A.C.; Passalacqua, P. Surface Water Mapping by Deep Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 4909–4918. [Google Scholar] [CrossRef]
  15. Chen, F.; Chen, X.; Van de Voorde, T.; Roberts, D.; Jiang, H.; Xu, W. Open water detection in urban environments using high spatial resolution remote sensing imagery. Remote Sens. Environ. 2020, 242, 111706. [Google Scholar] [CrossRef]
  16. Kang, J.; Guan, H.; Peng, D.; Chen, Z. Multi-scale context extractor network for water-body extraction from high-resolution optical remotely sensed images. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102499. [Google Scholar] [CrossRef]
  17. Luo, X.; Tong, X.; Hu, Z. An applicable and automatic method for earth surface water mapping based on multispectral images. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102472. [Google Scholar] [CrossRef]
  18. Li, Z.; Zhang, X.; Xiao, P. Spectral index-driven FCN model training for water extraction from multispectral imagery. ISPRS J. Photogramm. Remote Sens. 2022, 192, 344–360. [Google Scholar] [CrossRef]
  19. Zhang, X.; Li, J.; Hua, Z. MRSE-Net: Multiscale Residuals and SE-Attention Network for Water Body Segmentation From Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5049–5064. [Google Scholar] [CrossRef]
  20. Yu, Y.; Huang, L.; Lu, W.; Guan, H.; Ma, L.; Jin, S.; Yu, C.; Zhang, Y.; Tang, P.; Liu, Z.; et al. WaterHRNet: A multibranch hierarchical attentive network for water body extraction with remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2022, 115, 103103. [Google Scholar] [CrossRef]
  21. Lyu, X.; Fang, Y.; Tong, B.; Li, X.; Zeng, T. Multiscale Normalization Attention Network for Water Body Extraction from Remote Sensing Imagery. Remote Sens. 2020, 14, 4983. [Google Scholar] [CrossRef]
  22. Kang, J.; Guan, H.; Ma, L.; Wang, L.; Xu, Z.; Li, J. WaterFormer: A coupled transformer and CNN network for waterbody detection in optical remotely-sensed imagery. ISPRS J. Photogramm. Remote Sens. 2023, 206, 222–241. [Google Scholar] [CrossRef]
  23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
  24. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  25. Luo, X. Earth Surface Water Dataset. Zenodo. 2021. Available online: https://zenodo.org/records/5205674 (accessed on 20 October 2023).
  26. Drusch, M.; Del Bello, U.; Carlier, S.; Colin, O.; Fernandez, V.; Gascon, F.; Hoersch, B.; Isola, C.; Laberinti, P.; Martimort, P.; et al. Sentinel-2: ESA’s Optical High-Resolution Mission for GMES Operational Services. Remote Sens. Environ. 2012, 120, 25–36. [Google Scholar] [CrossRef]
  27. Seale, C.; Redfern, T.; Chatfield, P.; Luo, C.; Dempsey, K. Coastline detection in satellite imagery: A deep learning approach on new benchmark data. Remote Sens. Environ. 2022, 278, 113044. [Google Scholar] [CrossRef]
  28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015. [Google Scholar]
  29. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  30. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  31. Pan, H.; Chen, H.; Hong, Z.; Liu, X.; Wang, R.; Zhou, R.; Zhang, Y.; Han, Y.; Wang, J.; Yang, S.; et al. A Novel Boundary Enhancement Network for Surface Water Mapping Based on Sentinel-2 MSI Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 9207–9222. [Google Scholar] [CrossRef]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 770–778. [Google Scholar]
  34. Isikdogan, L.F.; Bovik, A.; Passalacqua, P. Seeing Through the Clouds with DeepWaterMap. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1662–1666. [Google Scholar] [CrossRef]
Figure 1. Representation of regions in images (the first line represents mountains, and the second line represents cloud regions.).
Figure 2. Use of encoder–decoder architecture in semantic segmentation. (encoder: the spatial dimensions of features gradually decrease while the depth (number of channels) increases. Decoder: the spatial dimensions of feature maps gradually increase while the depth decreases).
Figure 3. SwinDefNet’s network structure diagram. Normalization and ReLU activation layers follow each convolution operation in the network.
Figure 4. Internal details of the ASPP module.
Figure 5. Using 3 × 3 convolution as an example, this demonstrates the offset process of deformable convolution and shows the corresponding effective receptive field.
Figure 6. Visualization of the loss for U-Net, ResNet, DeepLabv3+, DeepWaterMapv2, and our proposed model after 200 epochs of training.
Figure 7. Comparison of predicted results using different methods on test images, with red circles indicating areas of comparison between the predicted images generated by each method.
Figure 8. Label data to illustrate the prediction results.
Figure 9. Comparison of water extraction results between existing approaches and our proposed model in mountainous regions.
Figure 10. Comparison of our suggested model’s and other approaches’ water extraction outcomes in cloudy regions.
Figure 11. Comparison of the accuracy of several methods in various geographical areas.
Figure 12. Comparison of the F1 score of several methods in various geographical areas.
Figure 13. Confusion matrix results for SwinDefNet.
Table 1. Characteristics and Challenges of Different Regions.
Region: Mountainous Area
Characteristics: Large undulating terrain, complex landforms, scattered water bodies, and high-altitude areas that may be covered by ice and snow.
Challenges: Water bodies are often obstructed by mountains, resulting in incomplete extraction information; their scattered distribution leads to a small extraction scale range; there is also potential interference from ice and snow.

Region: Cloudy Area
Characteristics: Frequent clouds, rain, and fog; cloud-covered areas are large and persist for long durations. Clouds exhibit spectral characteristics similar to water bodies in some bands.
Challenges: Cloud cover affects the transmission and reflection characteristics of remote sensing images, increasing the difficulty of water body extraction. Due to their spectral similarities, clouds and water bodies are prone to confusion.
Table 2. Evaluation metrics of different methods. Average values for validation set images.
Method            Accuracy (%)   Precision (%)   Recall (%)   F1_Score (%)
Ours              97.89          94.98           90.05        92.33
U-Net             90.79          95.24           72.17        77.03
ResNet            97.65          97.26           88.96        92.68
DeepLabv3+        97.27          93.80           86.53        89.22
DeepWaterMapv2    97.41          99.07           81.89        88.69
Table 3. Accuracy, Precision, Recall, and F1 score results of water mapping in different regions for the five methods.
Mountainous Area
Metric       Ours      U-Net     ResNet    DeepLabv3+   DeepWaterMapv2
Accuracy     98.03%    86.77%    97.14%    97.88%       97.47%
Precision    95.99%    89.24%    97.97%    95.42%       99.34%
Recall       91.61%    49.18%    87.91%    89.18%       85.65%
F1_score     93.52%    55.56%    92.32%    91.08%       91.46%

Cloud Area
Metric       Ours      U-Net     ResNet    DeepLabv3+   DeepWaterMapv2
Accuracy     98.30%    90.93%    98.49%    97.64%       97.14%
Precision    94.97%    93.66%    97.12%    93.81%       98.61%
Recall       92.22%    63.05%    92.12%    89.27%       84.85%
F1_score     93.46%    67.29%    94.46%    91.33%       88.19%
Table 4. Perform 200 epochs using the same training set, averaging time per epoch.
Method   Ours      U-Net     ResNet   DeepLabv3+   DeepWaterMapv2
Time     12.03 s   13.99 s   6.83 s   11.10 s      6.40 s
Table 5. The results of the ablation experiments.
Model   Swin Transformer   Deformable Convolution   Accuracy (%)   Precision (%)   Recall (%)   F1_Score (%)
1       ✓                  ✓                        97.89          94.98           90.05        92.33
2       ✓                                           97.67          94.96           89.04        91.73
3                                                   96.70          88.17           89.77        88.40
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
