Article

CRAUnet++: A New Convolutional Neural Network for Land Surface Water Extraction from Sentinel-2 Imagery by Combining RWI with Improved Unet++

1 State Key Laboratory of Simulation and Regulation of Water Cycle in River Basin, China Institute of Water Resources and Hydropower Research, Beijing 100038, China
2 Research Center on Flood & Drought Disaster Prevention and Reduction of the Ministry of Water Resources, Beijing 100038, China
3 Jiangxi Provincial Institute of Water Science, Nanchang 330000, China
4 Jiangxi Provincial Technology Innovation Center for Ecological Water Engineering in Poyang Lake Basin, Nanchang 330000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(18), 3391; https://doi.org/10.3390/rs16183391
Submission received: 26 May 2024 / Revised: 12 August 2024 / Accepted: 29 August 2024 / Published: 12 September 2024
(This article belongs to the Section AI Remote Sensing)

Abstract

Accurately mapping surface water bodies through remote sensing technology is of great significance for water resources management, flood monitoring, and drought monitoring. Numerous studies have investigated deep learning image recognition algorithms based on convolutional neural networks (CNNs), and a variety of CNN variants have been proposed for extracting water bodies from remote sensing images. However, because the convolutional layers employed are shallow and the spectral features of water are underutilized, most CNN-based water body extraction methods for remote sensing images are limited in accuracy. In this study, we propose a novel automatic surface water extraction method based on a convolutional neural network (CRAUnet++) for Sentinel-2 images. The proposed method comprises three parts: (1) substituting the feature extractor of the original Unet++ with ResNet34 to enhance the network's complexity by increasing its depth; (2) embedding the Spatial and Channel 'Squeeze and Excitation' (SCSE) module into the up-sampling stage of the network to suppress background features and amplify water body features; (3) adding the vegetation red edge-based water index (RWI) to the input data to maximize the utilization of the water body spectral information in Sentinel-2 images without increasing the data processing time. To verify the performance and accuracy of the proposed algorithm, an ablation experiment under four different strategies and a comparison experiment against RWI, FCN, SegNet, Unet, and DeepLab v3+ were conducted on Sentinel-2 images of Poyang Lake. The experimental results show that the precision, recall, F1, and IoU of CRAUnet++ are 95.99%, 96.41%, 96.19%, and 92.67%, respectively, exceeding those of the other five algorithms. CRAUnet++ performs well in extracting various types of water bodies and suppressing noise because it introduces the SCSE attention mechanism and incorporates surface water spectral features from RWI. The results demonstrate that CRAUnet++ has high validity and reliability in extracting surface water bodies from Sentinel-2 images.

1. Introduction

Surface water plays an essential role in the water cycle, land cover change, environmental change, and climate change in many parts of the world [1]. Accurate extraction of surface water bodies is of great significance for water resource management, drought and flood monitoring, and ecological environmental protection [2,3,4,5]. Satellite remote sensing technology has the advantages of wide coverage, low cost, and a short data acquisition period [6,7]. Consequently, using satellite remote sensing images to extract land surface water information, such as water area, position, shape, and river width, has recently become an effective way to obtain such information rapidly [8].
The key to accurately extracting water bodies from remote sensing images and creating water area maps lies in effectively highlighting water bodies while suppressing other ground objects [9]. Various algorithms have been proposed to extract surface water bodies from remote sensing imagery. In early studies, limited by remote sensing technology and image quality, threshold algorithms were widely used, such as the normalized difference water index (NDWI) [10], the modified normalized difference water index (MNDWI) [11], the automated water extraction index (AWEI) [12], and the linear discriminant analysis water index (LDAWI) [13]. These algorithms exploit the intrinsic property that water bodies have different reflectance at different wavelengths [14]. Specifically, multiple bands are selected from the spectral imaging data and combined to compute thresholds that effectively distinguish water bodies from other background objects. However, water bodies under different conditions have different spectral characteristics, and it is difficult to obtain an ideal single threshold. Therefore, various dynamic or variational threshold methods, such as the Otsu algorithm [15], have been proposed to calculate the optimal values of these thresholds [16]. However, dynamically adjusting thresholds to produce optimal segmentation results is complex and time-consuming [17]. Moreover, shadows have spectral characteristics similar to water bodies, and threshold methods struggle to distinguish water from building, mountain, and cloud shadows. Overall, threshold algorithms that utilize only spectral information, although simple and easy to implement, are better suited to identifying water bodies over flat terrain and relatively wide water areas.
With the rapid development of aviation and aerospace technology, the spatial resolution of acquired remote sensing images keeps increasing, more spatial details become visible, and traditional machine learning algorithms show strong robustness in water body extraction [18,19]. Many researchers have explored the application of machine learning to automated water body detection [20,21,22]. Li et al. [23] conducted a comparative study on the effectiveness of six machine learning algorithms, including Decision Tree [24], Logistic Regression [25], Random Forest [26], Neural Networks [27], Support Vector Machines (SVM) [28], and XGBoost [29], for water body extraction using Landsat-8 images as the data source. The results show that the neural network performs well and is a stable model, followed by the support vector machine and logistic regression algorithms. In addition, ensemble algorithms such as Random Forest and XGBoost are affected by the sample distribution, and the decision tree model returns the worst performance [30]. Overall, machine learning algorithms rely on manually designed features that have limited ability to characterize water bodies, are less adaptable to different datasets, and require some prior knowledge.
In recent decades, deep learning with convolutional neural networks (CNNs) has made breakthrough progress in the field of image semantic segmentation owing to its strong feature extraction ability [31]. Initially, CNNs were used for natural image semantic segmentation tasks, employing many stacked convolutional layers to capture features and assign a classification label to each pixel based on the feature information [32]. Owing to their outstanding performance, various CNNs have been proposed, such as FCN [33], UNet [34], SegNet [35], the DeepLab series [36,37,38], and PSPNet [39]. Among these, improved CNNs based on UNet have received particular attention from researchers for their extensibility and accuracy in semantic segmentation tasks [40]. UNet consists of two parts: an encoder and a decoder. The encoder down-samples the feature map and extracts the deeper semantic features of the image through convolutional and pooling layers [41]; the decoder performs up-sampling to recover the spatial information of the image. The skip connections between the encoder and decoder enable the network to combine shallow spatial information with deep semantic information to obtain accurate segmentation results [42]. However, UNet was initially designed for semantic segmentation of medical images. Because remote sensing images carry inherently complex spectral and spatial information, semantic segmentation methods designed for natural images contribute only to a limited degree to their segmentation accuracy [43]. To overcome this shortcoming, many new algorithms have been proposed to achieve better segmentation results on remote sensing images. Zhang et al. [44] replaced the feature extractor of UNet with the MobileNetV3 (large) network to increase model complexity and introduced a convolutional attention mechanism in the skip connection part; it achieved a segmentation accuracy of 93.78% on GF-6 PMS imagery and was able to accurately differentiate water bodies from other features. Wu et al. [45] addressed the problem that UNet could not effectively extract river contours from remote sensing images by improving the attention mechanism and introducing a spatial pyramid pooling module, which optimized the network's learning ability and retained the detailed information of the feature maps. Liu et al. [46] proposed a DUP network based on UNet to extract water bodies rapidly and accurately from remote sensing images; the network uses dense blocks to extract more spatial features and adds a multi-scale spatial pyramid pooling module (MSPP) to the skip connection part to combine shallow and deep features for small water bodies. Guo et al. [47] addressed the problem of low river-extraction accuracy by utilizing VGG16 and ResNet50 models for feature extraction; they introduced scale and semantic feature output layers to construct a two-branch fusion model and proposed a new method based on edge feature fusion with an accuracy of 97.7%. Zhang et al. [48] proposed MU-Net, which embeds MixFormer into UNet, focusing on capturing the local and global contexts of images and mining deeper semantic features of water bodies. CNNs are data-driven methods with the advantages of high automation and good generalization performance, but they are poorly interpretable and must be driven by a large amount of training data.
The water body index method, in contrast, uses the reflectance spectral features of water to differentiate water bodies from background features; it is mechanistically interpretable but generalizes poorly. The two approaches thus have complementary advantages and disadvantages, so we consider combining them to explore their effectiveness and reliability.
Although the above methods have generally shown good performance in large-scale water body extraction in specific scenarios, high-precision automatic extraction of surface water bodies in complex surrounding environments remains a challenge. Most deep learning surface water extraction methods proposed in current research target synthetic images composed of the red, green, and blue bands of remote sensing data, failing to make full use of the multi-spectral feature information in the data. Moreover, the neural network encoders used in these methods have few layers and are poorly adapted to extracting small and micro surface water bodies in complex environments. Therefore, this paper proposes an automatic surface water extraction method for Sentinel-2 images based on a convolutional neural network (CRAUnet++), combining the vegetation red edge based water index (RWI), an integrated Unet++ and ResNet34 model, and the Spatial and Channel 'Squeeze and Excitation' (SCSE) attention mechanism to automatically extract water bodies from Sentinel-2 images. To verify its accuracy, the proposed method is applied to surface water extraction over Poyang Lake.
To sum up, the objectives of this paper are as follows:
(1) In the encoder section of Unet++, substitute the feature extractor of the original Unet++ with ResNet34 to enhance the network's complexity by increasing its depth.
(2) Embed the Spatial and Channel 'Squeeze and Excitation' (SCSE) module into the up-sampling stage of the network to suppress background features and amplify water body features.
(3) Add the vegetation red edge based water index (RWI) to the input data to maximize the utilization of the water body spectral information in Sentinel-2 images without increasing the data processing time.

2. Materials and Methods

2.1. Materials

2.1.1. Study Area and Data

To validate our approach, we conducted experiments on a Sentinel-2 MSI water body dataset created in this study from 4 Sentinel-2 MSI images covering the area of Poyang Lake, Jiangxi Province, China. We chose Poyang Lake as the study area because it contains numerous water bodies of different shapes and sizes, including natural lakes, small streams, paddy fields, and artificial ponds [49]. The geographic location of the study area and the 4 selected Sentinel-2 MSI images are shown in Figure 1.
The Sentinel-2 mission comprises two polar-orbiting satellites, Sentinel-2A and Sentinel-2B, capable of capturing high-resolution optical imagery. The revisit period of each satellite is ten days, and with the two satellites operating simultaneously in orbit, the same location is revisited approximately every 2–5 days, ensuring frequent coverage of the same area [50]. Both satellites are equipped with a Multi-Spectral Instrument (MSI) sensor that captures 13 spectral bands with spatial resolutions ranging from 10 m to 60 m; the radiometric resolution of each band is 16 bits. Sentinel-2 data at the L2A level were used in this study and were downloaded from the Copernicus Open Access Hub (https://scihub.copernicus.eu/dhus, accessed on 16 December 2022). The Level-2A product has undergone radiometric, geometric, orthorectification, and atmospheric corrections and provides per-pixel radiometric measurements of surface reflectance [51]. Seven bands were employed in this study, four of which (Bands 2, 3, 4, 8) have a spatial resolution of 10 m and three of which (Bands 5, 8A, and 12) have a spatial resolution of 20 m. Bands 2, 3, and 4 are the B, G, and R bands, respectively, and Bands 3, 5, 8, 8A, and 12 are used to calculate the RWI water body index. When extracting a river or other water body from remote sensing imagery, uncertainties such as optical conditions can significantly affect the results. These uncertainties include, but are not limited to, cloud cover, atmospheric conditions, surface reflection characteristics, imaging time (e.g., diurnal and seasonal variations), and the performance of the sensor itself. To mitigate these influencing factors, only images with good imaging quality were selected for this study. Information about the study images is given in Table 1.

2.1.2. Data Generation Pipeline

The training dataset generation process includes the production of remote sensing images and corresponding samples. Figure 2 shows the sample generation for one remote sensing image. Remote sensing data preprocessing was conducted in SNAP 9.0.0 software. The Band Select module was applied to extract Bands 2, 3, 4, 5, 8, 8A, and 12 from the 13 bands of the original Sentinel-2 images. The Resample module was used to upsample Bands 5, 8A, and 12 by a factor of two in each dimension via nearest interpolation. After preprocessing, images containing Bands 2, 3, 4, 5, 8, 8A, and 12 at a spatial resolution of 10 m were generated. We then combined bands 4, 3, and 2 to form a true-color image capturing the color, shape, and texture features of the water area, and used Python NumPy and the GDAL module to calculate the RWI water body index to enhance the differentiation between water bodies and background. Finally, the resulting remote sensing image, synthesized from bands 4, 3, and 2 plus RWI, was generated for water extraction with the deep learning model.
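As a minimal illustration of the resampling step, nearest-neighbour upsampling of a 20 m band onto the 10 m grid amounts to repeating each pixel twice along both axes; the sketch below assumes the band has already been loaded as a NumPy array (the function name is ours, not from the paper's pipeline).

```python
import numpy as np

def upsample_nearest_2x(band: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling of a 20 m Sentinel-2 band
    (e.g., B5, B8A, B12) onto the 10 m grid, mirroring what SNAP's
    Resample module does with nearest interpolation."""
    return np.repeat(np.repeat(band, 2, axis=0), 2, axis=1)

# Example: a 20 m band of shape (5490, 5490) becomes (10980, 10980),
# matching the 10 m bands of the same Sentinel-2 tile.
```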
For sample production, we first selected typical surface water scenes with different spectral features, texture features, and geographical environments to test the generalization ability of the CNNs. The OTSU threshold segmentation algorithm was used to obtain coarse water extraction results (see the sketch below); the water area was assigned a label of 1, while the non-water background was labeled 0. We then manually corrected the RWI binary water segmentation map in Photoshop 2020 software with reference to the true-color image. Finally, the selected images and corresponding labels were randomly clipped into patches of 256 × 256 pixels. Moreover, to enhance the robustness of the model, we applied the common augmentation methods of flipping (up–down, left–right), rotation (180°), and color enhancement to all training samples to increase the number of samples available for training. After the above steps, the sample set contained 2000 samples. Figure 3 shows a sample image and mask from this dataset.
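The coarse-labelling and clipping steps can be sketched as follows, assuming the RWI image is a NumPy array; scikit-image's threshold_otsu stands in for the OTSU segmentation used here, and the random-crop helper is illustrative rather than the authors' exact tooling.

```python
import numpy as np
from skimage.filters import threshold_otsu

def coarse_water_labels(rwi: np.ndarray) -> np.ndarray:
    """Coarse binary labels via Otsu thresholding of the RWI image:
    water = 1, non-water background = 0 (refined manually afterwards)."""
    return (rwi > threshold_otsu(rwi)).astype(np.uint8)

def random_patch(image, label, size=256, rng=np.random.default_rng()):
    """Randomly clip a co-registered image/label pair to size x size."""
    h, w = label.shape
    i = rng.integers(0, h - size)
    j = rng.integers(0, w - size)
    return image[..., i:i + size, j:j + size], label[i:i + size, j:j + size]
```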

2.2. Methods

The principle of extracting surface water bodies from remote sensing images is mainly based on the differences in spectral characteristics between water bodies and other features such as vegetation and soil. As shown in Figure 4, water bodies are characterized by low reflectance in the visible and near-infrared bands, with relatively low absorption in the blue-green band and high absorption in the other bands, especially the infrared bands. This is due to the interaction of chlorophyll, sediment, water depth, and the water column's thermal characteristics with the light transmitted into the water. Vegetation, on the other hand, reflects strongly in the green and especially the near-infrared bands, owing to the absorption and reflection properties of chlorophyll in vegetation leaves. Based on these differences in spectral characteristics, researchers have proposed methods such as the threshold method, the water body index method, and classifier-based methods.
With the continuous progress of remote sensing technology and computer algorithms, the introduction of advanced techniques such as deep learning has further improved the intelligence and automation of water body extraction. Deep learning models can automatically learn the complex features of water bodies in remote sensing images, such as shape and texture features, and adapt to different environments and lighting conditions, thereby achieving more accurate water body extraction. As noted above, CNNs are data-driven, highly automated, and generalize well but are poorly interpretable and require large amounts of training data, whereas the water body index method is mechanistically explainable but generalizes poorly. We therefore combine the two methods to explore their effectiveness and reliability.

2.2.1. Main Network Structure

In this paper, we explore an improved surface water extraction method for Sentinel-2 imagery by combining a water index with the Unet++ network. The overall network structure is shown in Figure 5. Unet++ is an improved structure based on Unet that can be regarded as a nesting of Unets of three different depths, enabling the network to automatically learn and use features at different depths [53]. It extends Unet by introducing additional convolutional layers between the skip connections to bridge the semantic gap between the encoder and decoder. The original Unet up-samples only from the end of the encoder; to remedy this insufficient up-sampling, Unet++ up-samples from each convolutional layer of the encoder and introduces dense connections for each convolutional layer to learn semantic features at different depths, integrating them by feature overlapping and thereby accomplishing deep interaction between different feature levels. In this paper, we made three improvements to the Unet++ network structure (a code sketch of the resulting configuration follows the list below):
(1) Using ResNet34 to replace the feature extractor of the original Unet++, increasing the depth of the network to improve its complexity, and using ImageNet pre-trained parameters to initialize the model parameters to speed up training.
(2) Adding the Spatial and Channel 'Squeeze and Excitation' (SCSE) module to the up-sampling stage of the network to suppress background features and enhance water body features.
(3) Fusing the RWI water index into the training data to fully utilize the spectral information in Sentinel-2 images without increasing the data processing time.
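These three modifications can be assembled compactly. For instance, the open-source segmentation_models_pytorch library exposes a Unet++ decoder with a ResNet34 encoder and SCSE decoder attention, which yields a network close in spirit to CRAUnet++; the sketch below is written under that assumption and is not the authors' exact implementation.

```python
import segmentation_models_pytorch as smp

# A CRAUnet++-like network: Unet++ decoder, ResNet34 encoder initialized
# with ImageNet weights, SCSE attention in the decoder, and 4 input
# channels (R, G, B + RWI). The configuration mirrors the paper's
# description; the authors' own code may differ in details.
model = smp.UnetPlusPlus(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    decoder_attention_type="scse",
    in_channels=4,          # RGB + RWI
    classes=1,              # binary water mask (logits)
)
```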
Figure 5. The structure of CRAUnet++. RGB+RWI indicates a combination of red, green, blue, and vegetation red edge based water index (RWI) images.
Research has shown that network performance does not keep improving as depth increases, most likely because gradients vanish during backpropagation between layers [54]: the gradients become so small that they cannot update the layer parameters. This problem can be solved by the residual learning module of ResNet, which bypasses the inputs and outputs of the convolutional layer so that small gradients can be passed around it to compensate for the vanishing gradients. ResNet networks are now widely used in semantic segmentation, target detection, and other computer vision tasks [55]. Several networks with different numbers of layers were proposed in the original paper, including ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152. A comparison experiment between ResNet18, ResNet34, and Plain18 in the ResNet paper found that ResNet34 resolves the degradation problem better than the other two, and the simplicity and robustness of the ResNet34 structure have made it one of the classic models in deep learning [56]. Therefore, the network proposed in this paper uses ResNet34, with its last fully connected layer removed, as the backbone network for water feature extraction. The ResNet34 feature extractor is shown in Figure 6. It consists of a 7 × 7 convolutional layer, a max pooling layer, and 16 Basic Block residual blocks. The 7 × 7 convolution and max pooling layers correspond to the first layer of the Unet++ feature extractor. Each Basic Block is composed of two 3 × 3 convolutional layers, and the 16 Basic Blocks are stacked in a 3-4-6-3 arrangement corresponding to the 1st, 2nd, 3rd, and 4th layers of the Unet++ feature extractor. The first 3 × 3 convolutional layer in the first Basic Block of layers 2, 3, and 4 has a stride of 2, enabling down-sampling that reduces the feature maps and expands the receptive field. Compared with the feature extractor of the original Unet++ network, ResNet34 has more layers; the complex feature background of surface water bodies requires such a deeper network to extract complex features.
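For reference, the stage layout described above can be inspected directly in torchvision's ResNet34; the snippet below strips the classification head and traces the feature-map shapes through the 3-4-6-3 BasicBlock stages (illustrative only; the weights argument assumes a recent torchvision version).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

net = resnet34(weights="IMAGENET1K_V1")             # ImageNet pre-training
stem = nn.Sequential(net.conv1, net.bn1, net.relu)  # 7x7 conv, stride 2

x = torch.randn(1, 3, 256, 256)
f = net.maxpool(stem(x))                            # 64 channels, 64 x 64
for stage in (net.layer1, net.layer2, net.layer3, net.layer4):
    f = stage(f)          # 3, 4, 6, 3 BasicBlocks; layers 2-4 start
    print(f.shape)        # with a stride-2 conv that halves resolution
# Final feature map: 512 x 8 x 8; avgpool and fc are simply not used.
```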
The structure of the Spatial and Channel 'Squeeze and Excitation' (SCSE) module [57] is shown in Figure 7. It consists of two parts: the Spatial Squeeze and Channel Excitation (cSE) block and the Channel Squeeze and Spatial Excitation (sSE) block. The function of the SCSE block is to encourage the CNN to learn more significant features that are relevant both spatially and channel-wise. For an input feature map of size W × H with C channels, the sSE block (upper half of Figure 7) squeezes the input feature map along the channel domain using a 1 × 1 convolution with one output channel, obtaining a W × H feature map with a single channel that is weighted along the spatial domain via a Sigmoid activation function. The cSE block (lower half of Figure 7) employs an average pooling layer to squeeze the input feature map along the spatial domain, obtaining a 1 × 1 feature map with C channels, and weights it along the channel domain using two 1 × 1 convolutions with C/2 and C channels, respectively. The two weighting mechanisms of SCSE compute weights for the input feature map in both the spatial and channel directions, suppressing weak features while boosting meaningful ones and improving accuracy with almost no increase in model complexity. The SCSE block helps the CNN pay more attention to the region of interest to acquire more subtle land surface water information. In addition, the SCSE block is adaptable and can be seamlessly integrated into most CNNs.
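Following the description above, an SCSE block can be written in a few lines of PyTorch. This is a sketch based on the cited design: the reduction to C/2 channels follows the text, while the additive fusion of the two branches is one common choice (fusion by element-wise maximum also appears in the literature).

```python
import torch
import torch.nn as nn

class SCSE(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation block."""
    def __init__(self, channels: int):
        super().__init__()
        # cSE: average-pool to 1x1 (spatial squeeze), then two 1x1 convs
        # with C/2 and C channels; sigmoid gives per-channel weights.
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # sSE: 1x1 conv to a single channel (channel squeeze); sigmoid
        # gives a per-pixel spatial weight map.
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                                 nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recalibrate along both directions and fuse the two branches.
        return x * self.cse(x) + x * self.sse(x)
```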
At present, most extraction algorithms target large and medium-sized water bodies such as lakes, reservoirs, and rivers, and algorithms for extracting small water bodies are lacking. To overcome this shortcoming, Wu et al. [58] proposed a new water index, the vegetation red edge based water index (RWI), calculated from the green band (B3), red edge band (B5), near-infrared bands (B8, B8A), and short-wave infrared band (B12) of Sentinel-2 images. The comparative experimental results of Wu et al. [58] indicate that RWI outperforms other indices in extracting water body boundaries, can eliminate the influence of mixed pixels to a certain extent, and efficiently extracts small water bodies in the image. The formula of RWI is as follows [58]:
$$RWI = \frac{(B3 + B5) - (B8 + B8A + B12)}{(B3 + B5) + (B8 + B8A + B12)} \quad (1)$$
Figure 8 shows Sentinel-2 images under different band combinations. In the true-color composite image, shown in Figure 8a, water bodies are easily confused with paddy fields and vegetation. The false-color composite image, shown in Figure 8b, can distinguish vegetation from water bodies, but small water bodies with shallow depths and small areas are easily recognized as background. In the RWI image, the water area is positive while the background is negative, making water easy to distinguish from the background, as shown in Figure 8c. To fully utilize the rich spectral information in Sentinel-2 data while reducing the computation of the deep learning model, this paper introduces RWI into the training data of the deep learning model.
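Given the resampled 10 m bands as floating-point reflectance arrays, RWI reduces to a few NumPy operations; a small epsilon guards against division by zero over no-data pixels (the helper name and epsilon are our additions).

```python
import numpy as np

def rwi(b3, b5, b8, b8a, b12, eps=1e-6):
    """Vegetation red edge based water index (RWI) per Equation (1):
    positive over water, negative over background."""
    upper = (b3 + b5) - (b8 + b8a + b12)
    lower = (b3 + b5) + (b8 + b8a + b12)
    return upper / (lower + eps)
```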

2.2.2. Main Network Loss Function

A weighted loss function comprising cross-entropy (CE) loss [59] and dice loss [60] is employed. Cross-entropy loss is commonly used for classification tasks; it examines each pixel and compares the model prediction with the true label. The cross-entropy loss function for binary classification can be defined as Equation (2) [59]:
$$L_{ce} = -\frac{1}{n}\sum_{i=1}^{n}\left[ p_i \log q_i + (1 - p_i)\log(1 - q_i) \right] \quad (2)$$
where $n$ is the total number of samples and $p_i$ denotes the sample label, 1 for the positive class and 0 for the negative class; $q_i$ denotes the probability that the sample is predicted to be positive after training.
Dice loss is additionally employed to give more attention to hard-to-classify samples during training. In computer vision, the Dice coefficient is a frequently used statistic for measuring image similarity. Dice loss is suitable for scenes with an imbalance between positive and negative samples: it mitigates the problem that rivers occupy a small proportion of the image, so training focuses more on mining the river region. In contrast, cross-entropy loss treats positive and negative samples equally, and when the proportion of positive samples is small, they are swamped by the more numerous negative samples. The dice loss function can be defined as Equation (3) [60]:
$$L_{dice} = 1 - \frac{2\sum_{i=1}^{n} p_i q_i}{\sum_{i=1}^{n} p_i + \sum_{i=1}^{n} q_i} \quad (3)$$
where $n$ represents the total number of samples; $p_i$ is the label of the sample, 1 for the positive class and 0 for the negative class; and $q_i$ is the predicted value.
We use a weighted combination of CE loss and dice loss, which can effectively segment the boundary of the water at the pixel level.
The final weighted loss is defined as follows [61]:
$$L = A \, L_{ce} + B \, L_{dice} \quad (4)$$
where A and B are two hyperparameters that give different weighting factors to the two losses. In our experiments, we set A = 1.0 and B = 0.6.
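The combined loss is straightforward to implement; the sketch below works on raw logits for the binary water mask and uses the paper's weights A = 1.0 and B = 0.6 (the smoothing constant is our addition to avoid division by zero).

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  a: float = 1.0, b: float = 0.6,
                  smooth: float = 1e-6) -> torch.Tensor:
    """Weighted cross-entropy + dice loss, L = A * L_ce + B * L_dice."""
    # Binary cross-entropy (Equation (2)), computed stably from logits.
    ce = F.binary_cross_entropy_with_logits(logits, target)
    # Dice loss (Equation (3)) on predicted probabilities.
    prob = torch.sigmoid(logits)
    intersection = (prob * target).sum()
    dice = 1.0 - 2.0 * intersection / (prob.sum() + target.sum() + smooth)
    return a * ce + b * dice
```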

2.2.3. Implementation Details and Evaluation Indexes

The experiments were conducted in a 64-bit Windows 11 environment. The GPU was an NVIDIA RTX 3060 graphics card with 12 GB of video memory; the programming language was Python 3.10; the network was built on the PyTorch 2.0.0 framework; and CUDA 11.8 was used for GPU acceleration. The batch size of the proposed model is 4, and the number of training epochs is 200. The model optimizer is Adam, and the initial learning rate is fixed at 0.0001. As described above, a weighted combination of the cross-entropy loss function and dice loss function is used.
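Put together, the stated settings correspond to a conventional training loop like the one below, where model, train_set, and combined_loss are the components sketched earlier (illustrative only; the authors' training script is not published).

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(train_set, batch_size=4, shuffle=True)  # batch size 4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # fixed LR

for epoch in range(200):                                    # 200 epochs
    for images, masks in loader:                            # 256x256 patches
        optimizer.zero_grad()
        loss = combined_loss(model(images), masks)          # CE + dice
        loss.backward()
        optimizer.step()
```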
The evaluation metrics used in this paper include recall, precision, intersection over union (IoU), and F1, all ranging between 0 and 1 [61]. Precision is the ratio of the number of pixels correctly classified as water to the number of pixels labeled as water in the predicted image. Recall is the ratio of the number of pixels correctly classified as water to the total number of water pixels in the labeled image. The F1 value is the harmonic mean of recall and precision. The IoU is the average of the intersection-over-union ratios of the water and background classes, reflecting the degree of overlap between predicted and true values, and most intuitively and succinctly represents the quality of binary water segmentation in this paper. The equations for these evaluation metrics are shown in Table 2.
The confusion matrix between the water body masks and the ground truths consists of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs), where positive and negative represent water and background, respectively. TP denotes the number of correctly predicted water pixels; TN denotes the number of correctly predicted background pixels; FP denotes the number of background pixels incorrectly predicted as water; and FN denotes the number of water pixels incorrectly predicted as background.
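From the confusion-matrix counts, the four metrics follow directly; a minimal sketch for binary masks (1 = water, 0 = background), shown here for the water class (the averaged water/background IoU described above would repeat the same computation for the background class).

```python
import numpy as np

def water_metrics(pred: np.ndarray, truth: np.ndarray):
    """Precision, recall, F1, and IoU for the water class."""
    tp = np.sum((pred == 1) & (truth == 1))  # correctly predicted water
    fp = np.sum((pred == 1) & (truth == 0))  # background called water
    fn = np.sum((pred == 0) & (truth == 1))  # water called background
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou
```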

3. Results

3.1. Training Process

The accuracy and loss curves for training and validation are displayed in Figure 9. The horizontal axis (epoch) represents the number of training iterations, where one epoch is one forward and one backward propagation over all training samples with a parameter update; the vertical axes represent the accuracy and loss values, respectively. The upper plot shows that the accuracy of the model on the training set increases steadily, while the accuracy on the validation set also increases with epoch but remains below the training accuracy, with some fluctuations. The accuracy values of both sets gradually stabilize as training proceeds, indicating that the model is converging. The training accuracy remaining above the validation accuracy is normal in deep learning model training, because the model updates its parameters on the training set; for the validation set, which did not participate in training, the fit may deviate somewhat from the training set but still achieves a good result. The deep learning model used in this article fits well during training, with neither overfitting nor underfitting.

3.2. Comparison Experiment of Adding Modules

To evaluate the impact of the RWI, ResNet34, and SCSE attention mechanism modules on the Unet++ water extraction performance, we conducted ablation experiments by testing the water extraction results before and after adding these components to our model, using F1 and IoU as the main evaluation metrics. The ablation results are shown in Table 3.
Baseline: The baseline is the original Unet++ model using three-channel RGB Sentinel-2 images as input data.
RWI: We add RWI as auxiliary data to the baseline's model input (denoted baseline + RWI). The RWI is calculated from five bands of the Sentinel-2 images and can efficiently distinguish the small water bodies in the image.
ResNet34: We use ResNet34 as the feature extractor of baseline + RWI (denoted baseline + RWI + ResNet34). The deep semantic features of the image are extracted through the ResNet34 feature extractor.
SCSE: The SCSE attention mechanism module is added after the up-sampling operators of baseline + RWI + ResNet34 (denoted baseline + RWI + ResNet34 + SCSE). The features produced by the encoder are refined through SCSE.
The ablation results in Table 3 show that Unet++ with input data containing RWI improves F1 and IoU by 2.49% and 4.47%, respectively, over the original Unet++, indicating that integrating RWI into the training data effectively improves water extraction accuracy. F1 and IoU improve by a further 0.38% and 0.7%, respectively, after replacing the backbone network with ResNet34, and by 0.12% and 0.23%, respectively, after adding the SCSE module, demonstrating that using ResNet34 as the Unet++ feature extractor and adding the SCSE attention mechanism module in the up-sampling part improve water extraction accuracy to a certain extent.
The visualization of the experimental results is shown in Figure 10. After RWI was introduced as auxiliary data, the water body extraction results were more complete, indicating that RWI improves the network's ability to identify small water bodies by modeling spectral information. Before the original feature extractor was replaced with ResNet34, the segmentation results appeared hollow and lost water body information at the boundary; only simple convolution operations were used for local feature extraction, which cannot learn global contextual information in the image, resulting in missed and inaccurate boundary segmentation. After the SCSE block was added, the water body extraction results became more accurate, suppressing noise such as small shadows and bridges. This indicates that SCSE can suppress useless information and highlight useful information across the spatial and channel dimensions of the image; using SCSE blocks makes the network pay more attention to water areas and extract detailed information that is otherwise difficult to mine. The results of the ablation experiments verify the effectiveness of each component of CRAUnet++.

3.3. Quantitative Comparison Experiment with Classic Network

To quantitatively compare the results of CRAUnet++, RWI [58], and four other CNNs (FCN [33], SegNet [35], Unet [34], and DeepLab v3+ [38]) in land surface water extraction, the four accuracy evaluation indicators listed in Section 2.2.3 were used to evaluate the water extraction results of the above algorithms. Comparative experiments were performed in the same data environment for all methods except the RWI method. The quantitative evaluation results are given in Table 4. CRAUnet++ achieves 95.99%, 96.41%, 96.19%, and 92.67% for precision, recall, F1, and IoU, respectively, higher than the other five algorithms, indicating that it obtains water extraction results that balance recall and precision and agree most closely with the ground truth. RWI performs worst in precision at 90.88%, indicating that it frequently recognizes background as water; this may be because water bodies and easily confused objects such as shadows and buildings have similar spectral characteristics that cannot be accurately distinguished by spectral information alone. The recall, F1, and IoU of FCN are the lowest at 88.47%, 90.61%, and 82.84%, respectively, indicating that this algorithm cannot accurately extract land surface water information. This may be related to the structure of FCN, which lacks skip connections between shallow and deep feature maps: it can extract the main water information but tends to blur water edges. The comparison of accuracy evaluation indices shows that CRAUnet++ gives the best extraction results among the six algorithms; RWI performs the worst on the water dataset of this paper, followed by SegNet, DeepLab v3+, and Unet. Among the comparison methods, the RWI water index method distinguishes water bodies from other features by spectral features alone, whereas the deep learning algorithms acquire complex image features, including spatial and spectral features, by convolution, so the deep learning methods are superior to the RWI water index method. For our method and the other four deep learning methods, the possible reasons for the differences in accuracy are analyzed in Section 3.4 in conjunction with the visual quality of the water extraction.

3.4. Qualitative Comparison Experiment of Adding Modules

To compare the water extraction results of our method with current classic deep learning models and verify its effectiveness, we selected several easily confused scenes in the water extraction process, including urban areas, mountainous areas, and water bodies of different shapes and colors, and compared the extraction details of our method with those of RWI, FCN, SegNet, Unet, and DeepLab v3+. Figure 11 shows the water extraction results of the different methods. In Figure 11, column a is the true-color composite remote sensing image; column b is the water label data, i.e., the ground truth; and columns c, d, e, f, g, and h are the water extraction results of RWI, FCN, SegNet, Unet, DeepLab v3+, and CRAUnet++, respectively. Water bodies are displayed in white and non-water areas in black. The figure shows that all methods can extract water information, but with some differences. For the area in the first row, which contains building shadows and small water bodies, the five CNN-based methods can distinguish building shadows from water bodies, while the RWI method mistakenly identifies buildings and their shadows as water. CRAUnet++ and DeepLab v3+ can recognize bridges over the river channel, while FCN and SegNet cannot, and the river boundaries they extract are ambiguous, with misclassifications and omissions. The second row contains artificial channels with fine shapes. The recognition results of our method are the most complete, followed by SegNet and DeepLab v3+; SegNet produces some noise around the artificial channels, and the small artificial channels extracted by FCN contain many missed, discontinuous areas. In the mountainous area in the third row, our method, like the other five, can distinguish water bodies from mountain shadows; the water boundaries extracted by our method are closer to the labeled data, while FCN shows omissions and blurred boundaries. For the large water area and ponds in the fourth row, our method accurately extracts the water–land boundary of the large water area and, to some extent, distinguishes the pond boundaries; SegNet and DeepLab v3+ can recognize the outer boundary of the ponds but cannot distinguish their interiors, and FCN cannot accurately extract the extent of the ponds. For the irregularly shaped natural small rivers in the fifth row, the water boundaries extracted by our method and SegNet are closer to the labeled data, while FCN and DeepLab v3+ show varying degrees of omission. In the farmland and small water body distribution area in the sixth row, our method extracts the most complete information, followed by SegNet and DeepLab v3+; FCN cannot fully extract the small water body information.
Overall, the extraction results for the various water body types in Figure 11 show that our method performs better than the other five methods across different scenarios. Its surface water extraction results are closest to the real ground labels, with fewer misclassifications and omissions and clearer, more accurate boundaries. The extent of FCN's water extraction results is smaller than the ground truth labels, possibly because FCN fuses too much low-level feature information from shallow convolutional layers, yielding unsatisfactory extraction results. The extraction results of SegNet are complete but contain some noise, possibly because SegNet's decoder uses the max pooling indices from the encoder to perform up-sampling and restore image spatial resolution; this cannot effectively fuse encoding and decoding features or enhance the target semantics. DeepLab v3+ has poor recognition ability for small targets: owing to the model's large receptive field, it may not effectively capture and recognize small targets or details, resulting in inaccurate segmentation of small objects. The method proposed in this article recognizes different types of water bodies well and can clearly delineate the water–land boundary, depicting small rivers and ponds. The SCSE attention mechanism module enables the model to weight the original feature maps in both the spatial and channel dimensions, adaptively allocating parameter weights and extracting more accurate local and global information, making it better suited to extracting complete surface water information. The dice loss pays more attention to small targets and under-represented classes, avoiding excessive bias towards the majority class and reducing missed detections. RWI enables the model to effectively utilize water spectral information, improving the accuracy and robustness of water extraction. The comparative experiments show that the method proposed in this paper extracts complete surface water information with high overall accuracy, outperforming the other five algorithms.

4. Discussion

The effectiveness of surface water extraction relies mainly on high-quality remote sensing image data and a matching algorithm. Regarding the image data, it should first have sufficient spatial resolution to recognize fine water features such as narrow rivers, ponds, or wetlands; meanwhile, multispectral or hyperspectral data can provide richer surface information, which helps distinguish water bodies from other feature types. We therefore chose Sentinel-2 MSI images, with their high spatial and temporal resolution, as the data source, and to avoid the influence of uncertainties such as optical conditions, we selected images with good imaging quality. Regarding the extraction algorithm, CNNs are data-driven methods with high automation and good generalization performance but poor interpretability and a need for large amounts of training data, while the water body index method uses the reflectance spectral features of water to distinguish it from the background and is mechanistically interpretable but generalizes poorly. Given their complementary advantages and disadvantages, we combined the two methods to explore their effectiveness and reliability, proposing a surface water extraction method for Sentinel-2 images that integrates deep learning with a water body index. In terms of accuracy evaluation indicators, the proposed method achieves better extraction accuracy than the five comparison methods, obtaining water extraction results with balanced precision and recall: 95.99%, 96.41%, 96.19%, and 92.67% for precision, recall, F1, and IoU, respectively. The extraction results on the test images show that the proposed method performs better than the comparison methods across different scenarios. Its water extraction maps are closest to the real ground labels, with fewer omissions and misclassifications; the extracted water boundaries are clearer and more accurate, accurately identifying water bodies of different shapes and regions; and it achieves good consistency with the ground truth images.
Despite these achievements, our research has some limitations. On the one hand, the remote sensing images used in this article cover a short time span and a small area, which limits the assessment of the model's generalization ability. In the future, the study area will be expanded to fully exploit the high degree of automation of our method, drawing thematic maps of surface water bodies nationwide or even globally and analyzing long-term changes in water area and their driving factors. On the other hand, in optical remote sensing, cloud cover prevents sensors from effectively acquiring ground information, resulting in distorted or even lost ground information in images. In this paper, cloud mask products were used to screen the available data and avoid interference from cloud pixels. In the future, we will consider synthesizing cloud-free images from temporally adjacent images to address cloud coverage in remote sensing images and achieve efficient data utilization.

5. Conclusions

To address the limitations of current semantic segmentation algorithms for surface water in remote sensing images, a new surface water extraction algorithm combining deep learning and a water index is proposed for Sentinel-2 images. The algorithm uses the Sentinel-2 true-color image and RWI as model input and fuses semantic features with RWI spectral information through the improved Unet++ network, making full use of the spectral features of Sentinel-2 images and improving the separability between surface water and the background. We improved the surface water extraction accuracy of Unet++ in two respects: the feature extractor and the decoder component. Concretely, because a shallow stack of convolutional layers cannot handle the variety of sizes and shapes of water bodies, ResNet34 was used to replace the original feature extractor of Unet++, enhancing the feature extraction capability of the network while avoiding the risk of reduced generalization from excessive network depth. To make the CNN pay more attention to the region of interest and acquire more subtle land surface water information, the SCSE attention module was embedded in the decoder component, enhancing the semantic and spatial information and boosting the decoder's image recovery capacity. In addition, we constructed a workflow to create the training dataset semi-automatically based on the water index threshold segmentation algorithm, generating datasets quickly and accurately for the model training stage: the water index threshold segmentation algorithm was first used for dataset annotation, followed by manual correction in Photoshop 2020 software, improving the accuracy and efficiency of dataset production and avoiding the labor and time costs of manually annotating the entire dataset. We constructed a surface water dataset by fusing RWI and true-color images from Sentinel-2 images of Poyang Lake, and carried out ablation experiments and comparative analysis against other algorithms (RWI, FCN, SegNet, Unet, and DeepLab v3+). The experimental results show that CRAUnet++ achieves the highest precision, recall, F1, and IoU, with values of 95.99%, 96.41%, 96.19%, and 92.67%, respectively, indicating that introducing a water index into deep learning helps improve the accuracy of water body extraction from Sentinel-2 images. Moreover, visual comparison in confusing scenes such as urban, paddy field, and mountainous areas shows that surface water extraction based on CRAUnet++ is better than RWI, FCN, SegNet, Unet, and DeepLab v3+. These results demonstrate that CRAUnet++ has high validity and reliability in extracting surface water bodies from Sentinel-2 images. This study effectively alleviates the inefficiency of water body sample generation, the difficulty of extracting small water bodies against complex backgrounds, the poor flexibility of extraction methods, and the lack of precision in Sentinel-2 surface water extraction. In summary, CRAUnet++ has good application potential for large-scale, automatic surface water extraction from Sentinel-2 images, contributing to surface water resource investigation, flood and drought monitoring, climate change studies, and environmental protection.
In the future, we will produce more samples, expand the research area and time period, and analyze the changes in surface water area and their driving mechanisms.

Author Contributions

N.L., X.X. and Y.S. conceived of and designed the experiments. N.L. and X.X. performed the experiments. Y.S. and M.H. made the dataset. N.L., S.H. and Y.S. analyzed the results. N.L. wrote the whole paper, and all authors edited the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Research and Development Program of Jiangxi Province under grant 20212BBG71008, and the Youth Innovation Talents Promotion Plan of the Research Center of Flood and Drought Disaster Reduction of the Ministry of Water Resources, IWHR.

Data Availability Statement

Data are available on request from the authors. The data are not publicly available due to the requirements of laboratory policies or restrictions such as confidentiality agreements.

Acknowledgments

The authors thank the anonymous reviewers and the editors for their valuable comments to improve our manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kuhn, C.; de Matos Valerio, A.; Ward, N.; Loken, L.; Sawakuchi, H.O.; Kampel, M.; Richey, J.; Stadler, P.; Crawford, J.; Striegl, R.; et al. Performance of Landsat-8 and Sentinel-2 Surface Reflectance Products for River Remote Sensing Retrievals of Chlorophyll-a and Turbidity. Remote Sens. Environ. 2019, 224, 104–118. [Google Scholar] [CrossRef]
  2. Wang, R.; Zhang, C.; Chen, C.; Hao, H.; Li, W.; Jiao, L. A Multi-Modality Fusion and Gated Multi-Filter U-Net for Water Area Segmentation in Remote Sensing. Remote Sens. 2024, 16, 419. [Google Scholar] [CrossRef]
  3. Liu, J.; Wang, Y. Water Body Extraction in Remote Sensing Imagery Using Domain Adaptation-Based Network Embedding Selective Self-Attention and Multi-Scale Feature Fusion. Remote Sens. 2022, 14, 3538. [Google Scholar] [CrossRef]
  4. Li, J.; Ma, R.; Cao, Z.; Xue, K.; Xiong, J.; Hu, M.; Feng, X. Satellite Detection of Surface Water Extent: A Review of Methodology. Water 2022, 14, 1148. [Google Scholar] [CrossRef]
  5. Quang, D.N.; Linh, N.K.; Tam, H.S.; Viet, N.T. Remote sensing applications for reservoir water level monitoring, sustainable water surface management, and environmental risks in Quang Nam province, Vietnam. J. Water Clim. Chang. 2021, 12, 3045–3063. [Google Scholar] [CrossRef]
  6. Laonamsai, J.; Ichiyanagi, K.; Patsinghasanee, S.; Kamdee, K. Controls on Stable Isotopic Characteristics of Water Vapor over Thailand. Hydrol. Process. 2021, 35, e14202. [Google Scholar] [CrossRef]
  7. Huang, C.; Chen, Y.; Zhang, S.; Wu, J. Detecting, extracting, and monitoring surface water from space using optical sensors: A review. Rev. Geophys. 2018, 56, 333–360. [Google Scholar] [CrossRef]
  8. Wang, C.; Zhang, J.; Li, Y.; Phoumilay. The construction and verification of a water index in the complex environment based on GF-2 images. Remote Sens. Nat. Resour. 2022, 34, 50–58. [Google Scholar]
  9. Mondejar, J.P.; Tongco, A.F. Near infrared band of Landsat 8 as water index: A case study around Cordova and Lapu-Lapu City, Cebu, Philippines. Sustain. Environ. Res. 2019, 29, 16. [Google Scholar] [CrossRef]
  10. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  11. Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 2006, 27, 3025–3033. [Google Scholar] [CrossRef]
  12. Feyisa, G.L.; Meilby, H.; Fensholt, R.; Proud, S.R. Automated water extraction index: A new technique for surface water mapping using Landsat imagery. Remote Sens. Environ. 2014, 140, 23–35. [Google Scholar] [CrossRef]
  13. Fisher, A.; Flood, N.; Danaher, T. Comparing Landsat water index methods for automated water classification in eastern Australia. Remote Sens. Environ. 2016, 175, 167–182. [Google Scholar] [CrossRef]
  14. Cao, M.; Mao, K.; Shen, X.; Xu, T.; Yan, Y.; Yuan, Z. Monitoring the spatial and temporal variations in the water surface and floating algal bloom areas in Dongting Lake using a long-term MODIS image time series. Remote Sens. 2020, 12, 3622. [Google Scholar] [CrossRef]
  15. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
  16. Zhang, F.; Li, J.; Zhang, B.; Shen, Q.; Ye, H.; Wang, S.; Lu, Z. A simple automated dynamic threshold extraction method for the classification of large water bodies from landsat-8 OLI water index images. Int. J. Remote Sens. 2018, 39, 3429–3451. [Google Scholar] [CrossRef]
  17. Dan, L.I.; Wu, B.; Chen, B.W.; Xue, Y.; Zhang, Y. Review of water body information extraction based on satellite remote sensing. J. Tsinghua Univ. (Sci. Technol.) 2020, 60, 147–161. [Google Scholar]
  18. Zhang, Y.; Liu, X.; Zhang, Y.; Ling, X.; Huang, X. Automatic and Unsupervised Water Body Extraction Based on Spectral-Spatial Features Using GF-1 Satellite Imagery. IEEE Geosci. Remote Sens. Lett. 2018, 16, 927–931. [Google Scholar] [CrossRef]
  19. Chen, Y.; Fan, R.; Yang, X.; Wang, J.; Latif, A. Extraction of urban water bodies from high-resolution remote-sensing imagery using deep learning. Water 2018, 10, 585. [Google Scholar] [CrossRef]
  20. Miao, Z.; Fu, K.; Sun, H.; Sun, X.; Yan, M. Automatic water-body segmentation from high-resolution satellite images via deep networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 602–606. [Google Scholar] [CrossRef]
  21. Guo, H.; He, G.; Jiang, W.; Yin, R.; Yan, L.; Leng, W. A Multi-Scale Water Extraction Convolutional Neural Network (MWEN) Method for GaoFen-1 Remote Sensing Images. ISPRS Int. J. Geo-Inf. 2020, 9, 189. [Google Scholar] [CrossRef]
  22. Luo, X.; Tong, X.; Hu, Z. An applicable and automatic method for earth surface water mapping based on multispectral images. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102472. [Google Scholar] [CrossRef]
  23. Li, A.; Fan, M.; Qin, G.; Xu, Y.; Wang, H. Comparative Analysis of Machine Learning Algorithms in Automatic Identification and Extraction of Water Boundaries. Appl. Sci. 2021, 11, 10062. [Google Scholar] [CrossRef]
  24. Friedl, M.; Brodley, C. Decision tree classification of land cover from remotely sensed data. Remote Sens. Environ. 1997, 61, 399–409. [Google Scholar] [CrossRef]
  25. Peng, J.; Lee, K.; Ingersoll, G. An Introduction to Logistic Regression Analysis and Reporting. J. Educ. Res. 2002, 96, 3–14. [Google Scholar] [CrossRef]
  26. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  27. Feng, W.; Sui, H.; Huang, W.; Xu, C.; An, K. Water body extraction from very high-resolution remote sensing imagery using deep U-Net and a superpixel-based conditional random field model. IEEE Geosci. Remote Sens. Lett. 2019, 16, 618–622. [Google Scholar] [CrossRef]
  28. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
  29. Samat, A.; Li, E.Z.; Wang, W.; Liu, S.C.; Lin, C.; Abuduwaili, J. Meta-XGBoost for Hyperspectral Image Classification Using Extended MSER-Guided Morphological Profiles. Remote Sens. 2020, 12, 1973. [Google Scholar] [CrossRef]
  30. Du, S.H.; Du, S.H.; Liu, B.; Zhang, X.Y. Mapping large-scale and fine-grained urban functional zones from VHR images using a multi-scale semantic segmentation network and object based approach. Remote Sens. Environ. 2021, 261, 112480. [Google Scholar] [CrossRef]
  31. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [Google Scholar] [CrossRef]
  32. Wang, Z.; Gao, X.; Zhang, Y.; Zhao, G. MSLWENet: A Novel Deep Learning Network for Lake Water Body Extraction of Google Remote Sensing Images. Remote Sens. 2020, 12, 4140. [Google Scholar] [CrossRef]
  33. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  34. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Volume 9351, pp. 234–241. [Google Scholar]
  35. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  36. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  37. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv 2016, arXiv:1606.00915. [Google Scholar] [CrossRef]
  38. Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  39. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  40. Chen, Q.; Zheng, L.; Li, X.; Xu, C.; Wu, Y.; Xie, D.; Liu, L. Water Body Extraction from High-Resolution Satellite Remote Sensing Images Based on Deep Learning. Geogr. Geo-Inf. Sci. 2019, 35, 43–49. [Google Scholar]
  41. Tang, Y.; Zhang, J.; Jiang, Z.; Lin, Y.; Hou, P. RAU-Net++: River Channel Extraction Methods for Remote Sensing Images of Cold and Arid Regions. Appl. Sci. 2024, 14, 251. [Google Scholar] [CrossRef]
  42. Fan, Z.; Hou, J.; Zang, Q.; Chen, Y.; Yan, F. River Segmentation of Remote Sensing Images Based on Composite Attention Network. Complexity 2022, 2022, 7750281. [Google Scholar] [CrossRef]
  43. Zhong, H.-F.; Sun, H.-M.; Han, D.-N.; Li, Z.-H.; Jia, R.-S. Lake Water Body Extraction of Optical Remote Sensing Images Based on Semantic Segmentation. Appl. Intell. 2022, 52, 17974–17989. [Google Scholar] [CrossRef]
  44. Zhang, Q.; Zhang, X.; Yu, H.; Lu, X.; Li, G. A water extraction method for remote sensing with lightweight network model. Sci. Surv. Mapp. 2022, 47, 64–72. [Google Scholar]
  45. Wu, J.; Sun, D.; Wang, J.; Qiu, H.; Wang, R.; Liang, F. Surface River Extraction from Remote Sensing Images based on Improved U-Net. In Proceedings of the 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Hangzhou, China, 4–6 May 2022; pp. 1004–1009. [Google Scholar]
  46. Liu, Z.; Chen, X.; Zhou, S.; Yu, H.; Guo, J.; Liu, Y. DUPnet: Water Body Segmentation with Dense Block and Multi-Scale Spatial Pyramid Pooling for Remote Sensing Images. Remote Sens. 2022, 14, 5567. [Google Scholar] [CrossRef]
  47. Guo, B.; Zhang, J.; Li, X. River Extraction Method of Remote Sensing Image Based on Edge Feature Fusion. IEEE Access 2023, 11, 73340–73351. [Google Scholar] [CrossRef]
  48. Zhang, Y.; Lu, H.; Ma, G.; Zhao, H.; Xie, D.; Geng, S.; Tian, W.; Sian, K.T. MU-Net: Embedding MixFormer into Unet to Extract Water Bodies from Remote Sensing Images. Remote Sens. 2023, 15, 3559. [Google Scholar] [CrossRef]
  49. Li, H.; Xu, Z.; Zhou, Y.; He, X.; He, M. Flood Monitoring Using Sentinel-1 SAR for Agricultural Disaster Assessment in Poyang Lake Region. Remote Sens. 2023, 15, 5247. [Google Scholar] [CrossRef]
  50. Drusch, M.; Del Bello, U.; Carlier, S.; Colin, O.; Fernandez, V.; Gascon, F.; Hoersch, B.; Isola, C.; Laberinti, P.; Martimort, P.; et al. Sentinel-2: ESA’s Optical High-Resolution Mission for GMES Operational Services. Remote Sens. Environ. 2012, 120, 25–36. [Google Scholar] [CrossRef]
  51. ESA. Sentinel-2 User Handbook; ESA: Paris, France, 2015. [Google Scholar]
  52. Classification Algorithms and Methods. Available online: https://seos-project.eu/classification/classification-c01-p05.html (accessed on 8 August 2024).
  53. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. arXiv 2018, arXiv:1807.10165. [Google Scholar]
  54. Yin, H.; Zhang, J.; Zhang, C.; Qian, Y.; Han, Y.; Ge, Y.; Shuai, L.; Liu, M. Water Extraction from Remote Sensing Images: Method Based on Convolutional Neural Networks. Trop. Geogr. 2022, 42, 854–866. [Google Scholar]
  55. Jia, M.; Wang, Z.; Mao, D.; Ren, C.; Wang, C.; Wang, Y. Rapid, Robust, and Automated Mapping of Tidal Flats in China Using Time Series Sentinel-2 Images and Google Earth Engine. Remote Sens. Environ. 2021, 255, 112285. [Google Scholar] [CrossRef]
  56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  57. Roy, A.G.; Navab, N.; Wachinger, C. Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks. arXiv 2018, arXiv:1803.02579. [Google Scholar]
  58. Wu, Q.; Wang, M.; Shen, Q.; Yao, Y.; Li, J.; Zhang, F.; Zhou, Y. Small water body extraction method based on Sentinel-2 satellite multi-spectral remote sensing image. Natl. Remote Sens. Bull. 2022, 26, 781–794. [Google Scholar] [CrossRef]
  59. Wei, H.; Xu, X.; Ou, N.; Zhang, X.; Dai, Y. DEANet: Dual Encoder with Attention Network for Semantic Segmentation of Remote Sensing Imagery. Remote Sens. 2021, 13, 3900. [Google Scholar] [CrossRef]
  60. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. Deep Learn. Med. Image Anal. Multimodal Learn. Clin. Decis. Support 2017, 240–248. [Google Scholar]
  61. Peng, Y.; Zhang, Z.M.; He, G.J.; Wei, M.Y. An improved grabcut method based on a visual attention model for rare-earth ore mining area recognition with high-resolution remote sensing images. Remote Sens. 2019, 11, 987. [Google Scholar] [CrossRef]
Figure 1. The geographic location of the study area and the selected 4 Sentinel-2 Multi-Spectral Instrument (MSI) images.
Figure 2. Flowchart of Sentinel-2 MSI images preprocessing. The first row shows the process of generating training sample data used as input to the model, and the second row shows the process of generating water body labels used to calculate model losses.
Figure 3. Examples of data augmentation: (a) the original image and its corresponding label; (b) the original image flipped vertically (up–down); (c) the original image flipped horizontally (left–right); (d) the original image rotated by 180°.
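For concreteness, the sketch below shows one way to apply these three augmentations with NumPy. It is an illustration only; the function name and array shapes are assumptions, not the paper's code. The key constraint it encodes is that the image and its label must be transformed identically so that pixels stay aligned.

```python
import numpy as np

def augment(image: np.ndarray, label: np.ndarray):
    """Yield the three augmented copies shown in Figure 3 (b)-(d).

    Assumes `image` has shape (H, W, C) and `label` has shape (H, W);
    both are transformed identically so pixel correspondence is kept.
    """
    # (b) flip up-down (vertical flip along the first axis)
    yield np.flipud(image), np.flipud(label)
    # (c) flip left-right (horizontal flip along the second axis)
    yield np.fliplr(image), np.fliplr(label)
    # (d) rotate by 180 degrees (two successive 90-degree rotations)
    yield np.rot90(image, 2), np.rot90(label, 2)
```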
Figure 4. Reflectance of water, soil, and vegetation at different wavelengths [52].
Figure 6. The structure of the ResNet34 feature extractor used in CRAUnet++. It is built from two types of BasicBlocks, shown in different colors in the figure. The first convolutional layer of the green BasicBlock has a stride of 1, so its input and output feature maps have the same size; the first convolutional layer of the yellow BasicBlock has a stride of 2, downsampling the input feature map by a factor of two.
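For readers unfamiliar with ResNet34, the following is a minimal PyTorch sketch of the standard BasicBlock from He et al. [56]. It is a generic textbook implementation rather than the paper's code; the channel counts are left as parameters, and the stride argument distinguishes the two block variants in Figure 6.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Standard ResNet BasicBlock [56]: two 3x3 convolutions with a
    shortcut connection. stride=1 keeps the spatial size (the green
    block in Figure 6); stride=2 halves it (the yellow block)."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # A 1x1 projection aligns the shortcut when shape or stride changes.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))
```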
Figure 7. The structure of the Spatial and Channel ‘Squeeze and Excitation’ (SCSE) module. The first row weights the input data along the spatial dimension, and the second row weights the input data along the channel dimension.
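A minimal PyTorch sketch of an SCSE module in the form described by Roy et al. [57] is given below. It is a generic illustration, not the paper's implementation; details such as the reduction ratio and the additive fusion of the two branches follow the cited paper's common formulation and are assumptions here.

```python
import torch
import torch.nn as nn

class SCSE(nn.Module):
    """Concurrent spatial and channel squeeze-and-excitation [57].

    cSE branch: global average pooling plus two 1x1 convolutions yield
    per-channel weights. sSE branch: a single 1x1 convolution yields a
    per-pixel weight map. The two recalibrated maps are summed.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: B x C x 1 x 1
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # per-channel weights
        )
        self.sse = nn.Sequential(
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),                                   # per-pixel weight map
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel-recalibrated map plus spatially recalibrated map.
        return x * self.cse(x) + x * self.sse(x)
```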
Figure 8. Sentinel-2 images under different band combinations: (a) true-color composite image; (b) false-color composite image; (c) vegetation red edge-based water index (RWI) image.
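RWI's exact band combination is defined in the paper's Methods section and in [58]; the sketch below only illustrates the generic normalized-difference form that this family of water indices shares. The band names in the usage comment are placeholders for illustration, not the RWI definition.

```python
import numpy as np

def normalized_difference(band_a: np.ndarray, band_b: np.ndarray,
                          eps: float = 1e-6) -> np.ndarray:
    """Generic normalized-difference index: (a - b) / (a + b).

    RWI belongs to this family of indices; the specific Sentinel-2
    bands it combines are given in the paper and in [58]. The epsilon
    term guards against division by zero over dark pixels.
    """
    return (band_a - band_b) / (band_a + band_b + eps)

# Hypothetical usage with surface-reflectance arrays already read from
# a Sentinel-2 product (the band choice here is illustrative only):
# rwi = normalized_difference(red_edge_band, swir_band)
# water_mask = rwi > threshold  # e.g., an Otsu threshold [15]
```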
Figure 9. Trends of accuracy and loss values on the training and testing sets (the upper graph shows accuracy, the lower graph shows loss; the red line represents the testing set and the blue line represents the training set).
Figure 10. Visualization of water extraction results for ablation studies: (a) images; (b) labels; (c) Baseline; (d) Baseline + RWI; (e) Baseline + RWI + ResNet34; (f) Baseline + RWI + ResNet34 + SCSE. Black denotes the background, and white denotes water bodies.
Figure 11. Visualization results of CRAUnet++, RWI, and CNN-based semantic segmentation networks on Sentinel-2 dataset: (a) images; (b) labels; (c) RWI; (d) FCN; (e) SegNet; (f) Unet; (g) DeepLab v3+; (h) CRAUnet++. Black denotes the background, and white denotes water bodies.
Table 1. Detailed information of the selected 4 Sentinel-2 MSI images.

Image MGRS | Imaging Time | Central Longitude (°E) | Central Latitude (°N)
50RMS | 10 July 2022 | 116.5 | 28.4
50RMT | 10 July 2022 | 116.5 | 29.4
50RLS | 13 July 2022 | 115.5 | 28.4
50RLT | 13 July 2022 | 115.5 | 29.4
Table 2. Equations of evaluation metrics.

Evaluation Index | Calculation Formula
precision | $\mathrm{precision} = \dfrac{TP}{TP + FP}$
recall | $\mathrm{recall} = \dfrac{TP}{TP + FN}$
$F_1$ | $F_1 = \dfrac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
IoU | $\mathrm{IoU} = \dfrac{TP}{TP + FP + FN}$
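As an illustration, the metrics in Table 2 can be computed from binary prediction and label masks as in the following sketch; it is a generic implementation for clarity, not the paper's evaluation code.

```python
import numpy as np

def evaluate(pred: np.ndarray, label: np.ndarray, eps: float = 1e-9):
    """Compute the Table 2 metrics from binary masks (1 = water, 0 = background)."""
    tp = np.sum((pred == 1) & (label == 1))  # true positives
    fp = np.sum((pred == 1) & (label == 0))  # false positives
    fn = np.sum((pred == 0) & (label == 1))  # false negatives
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou
```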
Table 3. Results of ablation experiments after successively adding RWI, ResNet34, and the SCSE attention module to Unet++ (%). Bold indicates the best result.

Method | F1 | IoU
Baseline | 93.20 | 87.27
Baseline + RWI | 95.69 | 91.74
Baseline + RWI + ResNet34 | 96.07 | 92.44
Baseline + RWI + ResNet34 + SCSE | 96.19 | 92.67
Table 4. Comparison results of CRAUnet++, RWI, and four CNN-based semantic segmentation networks on the Sentinel-2 dataset. Bold indicates the best result.

Method | Precision (%) | Recall (%) | F1 Score (%) | IoU (%)
RWI | 90.88 | 95.77 | 93.26 | 87.37
Unet | 95.33 | 96.11 | 95.72 | 91.79
FCN | 92.86 | 88.47 | 90.61 | 82.84
SegNet | 91.54 | 95.93 | 95.08 | 90.62
DeepLab v3+ | 93.36 | 95.06 | 94.21 | 89.05
CRAUnet++ | 95.99 | 96.41 | 96.19 | 92.67