*Article* **Fine-Grained Tidal Flat Waterbody Extraction Method (FYOLOv3) for High-Resolution Remote Sensing Images**

**Lili Zhang 1, Yu Fan 1, Ruijie Yan 1, Yehong Shao 2, Gaoxu Wang 3,\* and Jisen Wu 1**


**Abstract:** The tidal flat is a long and narrow area along rivers and coasts with high sediment content, so there is little feature difference between the waterbody and the background, and the boundary of the waterbody is blurry. Existing waterbody extraction methods are mostly designed for large water bodies such as rivers and lakes, whereas little attention has been paid to tidal flat waterbody extraction. Extracting the tidal flat waterbody accurately from high-resolution remote sensing imagery is therefore a great challenge. To solve the low accuracy problem of tidal flat waterbody extraction, we propose a fine-grained tidal flat waterbody extraction method, named FYOLOv3, which extracts tidal flat water with high accuracy. FYOLOv3 mainly includes three parts: an improved object detection network based on YOLOv3 (Seattle, WA, USA), a fully convolutional network (FCN) without pooling layers, and a similarity algorithm for water extraction. The improved object detection network uses 13 convolutional layers instead of Darknet-53 as the backbone, which guarantees the water detection accuracy while reducing the time cost and alleviating the overfitting phenomenon; the FCN without pooling layers obtains accurate pixel values of the tidal flat waterbody by learning semantic information; finally, the similarity algorithm for water extraction distinguishes waterbody pixels from non-water pixels one by one to improve the extraction accuracy of tidal flat water bodies. Compared with other convolutional neural network (CNN) models, the experiments show that our method achieves higher accuracy in waterbody extraction of tidal flats from remote sensing images, and the *IoU* of our method is 2.43% higher than that of YOLOv3 and 3.7% higher than that of U-Net (Freiburg, Germany).

**Keywords:** tidal flat water; YOLOv3; similarity algorithm for water extraction

#### **1. Introduction**

Water resources are closely related to human survival and development, and many researchers focus on how to obtain water resource information quickly and accurately. The extraction and detection of the water bodies from remote sensing images is one of the main ways to obtain water resource information. It can be widely applied in ecosystem protection and restoration, river supervision, pollution control, and infrastructure construction [1,2]. In recent years, with the rapid development of remote sensing satellite technology, obtaining water resource information from remote sensing images [3] has gradually replaced manual measurement, and the images are widely applied in water resource surveys and flood predictions.

At present, scholars have proposed a variety of water extraction methods for different satellite imagery, which can be summarized into three categories: visual interpretation methods [4], extraction methods based on spectral bands [5–9], and machine learning methods [10–12]. However, these methods are mainly applied to extract large water bodies
like rivers and lakes, and there are few waterbody extraction methods for tidal flats. The tidal flat area [13] refers to the tidal invasion area between the high tide level and the low tide level along rivers and coasts. The water bodies in this kind of area are relatively long and narrow, with high sediment content. Due to the influence of tides, there is little feature difference between the waterbody and the background, and the boundary of the waterbody is blurry. Meanwhile, the mixture of water and sand makes the spectral band characteristics of the water in the tidal flat area different from those of water in other areas. Therefore, the methods based on spectral bands are not suitable for tidal flat waterbody extraction. The machine learning methods used for water extraction are usually based on supervised learning, so a training dataset is necessary. However, there is no public training dataset for tidal flat waterbody extraction. Hence, the machine learning methods usually fail to learn effectively from the limited training data and hit an accuracy bottleneck in water extraction as a result.

In the tidal flat area, the boundary of the waterbody is blurry. To solve the low accuracy problem of its waterbody extraction caused by the small feature difference between the waterbody and the background, this paper proposes a fine-grained tidal flat waterbody extraction method named FYOLOv3. FYOLOv3 mainly includes three parts: an improved object detection network based on YOLOv3 (Seattle, WA, USA), a fully convolutional network (FCN) without pooling layers, and a similarity algorithm for water extraction.

In this paper, our contributions are as follows:

- An improved object detection network based on YOLOv3, which uses 13 convolutional layers instead of Darknet-53 as the backbone and guarantees the water detection accuracy while reducing the time cost and alleviating overfitting under limited training data.
- An FCN without pooling layers, which follows the detection network and learns the semantic information of the tidal flat waterbody to obtain the initial pixel-level water extraction.
- A similarity algorithm for water extraction, which judges each pixel in the detected water region to reduce the false and missing extraction caused by the blurry boundary between the waterbody and the background.

The rest of this paper is organized as follows. Section 2 introduces some classical methods and analyzes the YOLO models. Section 3 describes the fine-grained tidal flat waterbody extraction method FYOLOv3 in detail. The experiments and analysis are presented in Section 4. Finally, Section 5 concludes the paper and discusses future work.

#### **2. Related Work**

#### *2.1. Water Extraction Methods*

Spectral band analysis methods are the earliest methods for waterbody extraction from remote sensing images [5–9]; they analyze the differences in absorption and reflection of different ground objects in each spectral band and then obtain the water region in the remote sensing images. There are three methods based on spectral analysis: the single-band threshold method [14], the multi-band spectral relationship method [15], and the water index method [8]. Xu et al. [8] proposed a modified normalized difference water index (MNDWI) based on the band combination of the normalized water index. The experiments show that the method is efficient for the extraction of urban water bodies and effectively suppresses the influence of urban building shadows. Guo et al. [7] proposed a weighted normalized difference water index (WNDWI) to reduce the influence of turbid water, small water bodies, and shadow areas on water extraction. The method was tested on Landsat images and achieved good results. Methods based on spectral analysis usually only use the spectral information of remote sensing images and do not effectively use texture, spatial context, surrounding background, and other information, so their extraction ability has certain limitations. These methods also have specific requirements for the bands of remote sensing images and therefore have low applicability.

Some machine learning methods, such as the support vector machine (SVM) and maximum likelihood classification [10–12], try to balance learning effectiveness against model interpretability and provide a solution framework for classification problems with limited samples. This kind of method improves the accuracy of target extraction to a certain extent by learning the distribution characteristics of the training data. However, such methods fail to learn effectively from limited training data and hit an accuracy bottleneck in water extraction.

With the concept of deep learning proposed by Hinton et al. [16] in 2006 and the outstanding achievements of the deep convolutional neural network proposed by Krizhevsky et al. [17] in natural image recognition in 2012, deep learning ushered in a new research phase. Many experts and scholars began to apply deep learning technology to object extraction from remote sensing images. Zhong et al. [18] used a convolutional neural network model to extract waterbodies from remote sensing images, and the experiments showed that the convolutional neural network extracts waterbodies from remote sensing images more effectively than the normalized water index. Liang et al. [19] introduced a dense connection structure into the fully convolutional network to reduce shallow feature loss, obtain more detailed information from the remote sensing images, and achieve better water extraction. Song et al. [20] used the self-learning ability of deep learning to construct a modified Mask R-CNN method that integrates bottom-up and top-down processes for water recognition. Yu et al. [21] presented a novel deep learning framework for waterbody extraction from Landsat images that considers both spectral and spatial information and is a hybrid of a CNN and a logistic regression classifier. Li et al. [22] adopted a fully convolutional network (FCN) to extract water bodies in the case of limited training data, which consists of an encoder for extracting multiscale features and a decoder for recovering spatial contexts. Wang et al. [23] proposed an end-to-end trainable model named the multi-scale lake water extraction network (MSLWENet) to extract lake water from Google remote sensing images. Yu et al. [24] developed a novel self-attention capsule feature pyramid network (SA-CapsFPN) to extract water bodies from remote sensing images. Li et al. [25] built a deep learning model for water extraction based on EfficientNet-B5 (Perdriel, Argentina).

#### *2.2. YOLO Models*

The excellent performance of deep convolutional neural networks [17] has been demonstrated in computer vision. Recently, YOLO models such as YOLOv1 (Seattle, WA, USA) [26], YOLOv2 [27], and YOLOv3 [28] were proposed one after another. The YOLOv1 model is based on GoogLeNet (Mountain View, CA, USA) [29] and is mainly composed of convolutional layers and fully connected layers to achieve fast object detection. The model transforms the object detection problem into a coordinate regression problem and carries out the classification and regression of target objects. Because the two prediction boxes generated by YOLOv1 (Seattle, WA, USA) for each grid cell in the image can only predict one target object, the detection accuracy of adjacent objects whose center points fall in the same grid cell is reduced. In view of the above shortcomings, YOLOv2 (Seattle, WA, USA) proposes a variety of strategies to improve the network framework, which significantly improves the speed and accuracy of object detection. In order to further optimize the YOLO models, the Darknet-53 network is used as the object feature extractor in the YOLOv3 (Seattle, WA, USA) model, and the output module uses the feature pyramid structure to produce three-way outputs to complete the accurate detection of targets of different sizes [28].

#### **3. Methodology**

To solve the low accuracy problem of water extraction for tidal flats, this paper proposes a fine-grained tidal flat waterbody extraction method for high-resolution remote sensing images, named FYOLOv3. The key parts of our method are as follows: first, the improved object detection network based on YOLOv3 (Seattle, WA, USA) is proposed and used to locate the tidal flat waterbody, and the frame coordinates of the corresponding waterbody are obtained; second, four images with a size of 32 × 32 are clipped from the obtained border region and used as the input of the FCN without pooling layers to get the initial waterbody extraction; finally, the similarity algorithm for water extraction is used to judge all pixels in the obtained initial waterbody region to optimize and improve the initial waterbody extraction. The steps of our method are as follows:

1. Locate the tidal flat waterbody with the improved object detection network based on YOLOv3 and obtain the frame coordinates of the corresponding waterbody.
2. Clip four 32 × 32 images from the obtained border region and feed them to the FCN without pooling layers to obtain the initial waterbody extraction.
3. Apply the similarity algorithm for water extraction to every pixel in the initial waterbody region to optimize and improve the extraction.


The architecture of our method is shown in Figure 1, where the 13-layer CNN is constructed for water feature extraction and is mainly composed of convolutional layers, pooling layers, and batch normalization layers. The multi-scale feature pyramid network uses different feature maps to detect the narrow and long waterbodies and the small waterbodies in the tidal flat area, respectively. Hence, our object detection network can guarantee the water detection accuracy while reducing the time cost and alleviating the overfitting phenomenon.

**Figure 1.** Fine-grained tidal flat waterbody extraction method.

#### *3.1. Construction of Training Dataset*

#### 3.1.1. Preprocessing

The GF-2 remote sensing satellite, successfully launched in 2014, is the first civilian satellite in China with a spatial resolution better than 1 m. GF-2 has high spatial resolution, accurate positioning, and strong maneuverability. The remote sensing images used in this paper are Level-1 product data, so it is necessary to preprocess the remote sensing images first. The preprocessing of the GF-2 remote sensing images used in this paper mainly includes radiometric calibration [30], atmospheric correction [31], orthorectification [32], and image fusion [33].

#### 1. Radiometric correction and orthorectification of multispectral images

Radiometric correction includes two parts: radiometric calibration and atmospheric correction. Radiometric calibration converts the brightness values of pixels into absolute radiance values, which helps researchers compare remote sensing images acquired by different types of sensors at different times. Atmospheric correction is the process of eliminating the radiation error caused by atmospheric influence and obtaining the true reflectance of surface objects. Orthorectification corrects the geometric distortion of remote sensing images and finally produces plane orthophotos. The preprocessing example of the multispectral images is shown in Figure 2.

**Figure 2.** Comparison of multispectral images before and after preprocessing. (**a**) Multispectral image before preprocessing; (**b**) multispectral image after preprocessing.
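
As a concrete illustration of the calibration step, radiometric calibration is a per-band linear transform of the raw digital numbers. The minimal NumPy sketch below uses placeholder gain and offset values, not the official per-band GF-2 calibration coefficients, which should be substituted in practice.

```python
import numpy as np

def radiometric_calibration(dn, gain, offset):
    """Convert raw digital numbers (DN) to at-sensor radiance: L = gain * DN + offset."""
    return gain * dn.astype(np.float32) + offset

# Illustrative example: one 3 x 3 patch of a single band. The gain/offset values
# are placeholders; the official GF-2 coefficients for the given sensor and year
# should be used in a real workflow.
dn = np.array([[512, 530, 498],
               [601, 588, 560],
               [455, 470, 490]], dtype=np.uint16)
radiance = radiometric_calibration(dn, gain=0.163, offset=0.0)
print(radiance)
```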

#### 2. Orthorectification of panchromatic images

Different from multispectral images, the band range of panchromatic images in GF-2 is 0.45–0.90 μm, which covers multiple wavelength ranges. Atmospheric attenuation is selective for light of different wavelengths, and each wavelength is affected by the atmosphere differently. Therefore, it is usually impossible to carry out an atmospheric correction for panchromatic images. The number and distribution of control points in a remote sensing image influence the error of the orthorectification, and the mean square error is used to evaluate the accuracy of the orthorectification. Fan et al. analyzed the accuracy of GF-2 satellite images according to the above evaluation indexes [34], and RPC orthorectification proved better at correcting the geometric distortion in panchromatic images. Hence, we use RPC orthorectification to deal with the panchromatic images in this paper, and the result is shown in Figure 3.

**Figure 3.** Comparison of panchromatic images before and after orthorectification. (**a**) Panchromatic image before orthorectification; (**b**) panchromatic image after orthorectification.

#### 3. Image fusion

Image fusion is often used to enrich the image information. It fuses the images of the same area from different channels and finally obtains the fused images with more information and higher quality.

In this paper, the NNDiffuse Pan Sharpening [35] method is used to fuse the multispectral images and the panchromatic images. The multispectral images and the panchromatic images are obtained synchronously by different sensors installed on GF-2; the former have more spectral information but lower spatial resolution, whereas the latter have higher spatial resolution but less spectral information. By fusing them, we obtain fused images with both high resolution and rich spectral information, as shown in Figure 4.

**Figure 4.** Comparison of multispectral images before and after fusion. (**a**) Original multispectral image; (**b**) image after fusion.

#### 4. Band combination selection

GF-2 multispectral images contain redundant data because of the close correlation between different bands. In order to make full use of the features of GF-2 multispectral images, reduce data redundancy, and maintain the original characteristics of the images, we need to select the optimal band combination for GF-2 multispectral images.

There are three principles for choosing the optimal band combination: the information in a single band should be as rich as possible; the information overlap between two bands should be small; and the spectral differences between different types of ground objects after the band combination should be as clear as possible [18]. Because the spectral bands of GF-2 multispectral images are the same as those of GF-1 multispectral images, according to the above three principles, it is appropriate to use the standard deviation and the Optimum Index Factor (OIF) [36] to study the best band combination of GF-2 images. Finally, we select band 2, band 3, and band 4 as the combined bands to generate the original image in our study. The remote sensing image after band combination is shown in Figure 5.

**Figure 5.** Image after band combination.
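
To make the band selection criterion concrete, the sketch below scores every three-band combination with the Optimum Index Factor, i.e., the sum of the band standard deviations divided by the sum of the absolute inter-band correlation coefficients [36]. The synthetic image array and the NumPy implementation are illustrative assumptions, not code from the paper.

```python
import numpy as np
from itertools import combinations

def oif(bands):
    """Optimum Index Factor for three 2-D band arrays:
    OIF = (s1 + s2 + s3) / (|r12| + |r13| + |r23|)."""
    flat = [b.astype(np.float64).ravel() for b in bands]
    std_sum = sum(f.std() for f in flat)
    corr_sum = sum(abs(np.corrcoef(flat[i], flat[j])[0, 1])
                   for i, j in combinations(range(3), 2))
    return std_sum / corr_sum

# image: H x W x 4 array holding the four GF-2 multispectral bands (synthetic here).
image = np.random.randint(0, 1024, size=(256, 256, 4)).astype(np.float64)
scores = {combo: oif([image[:, :, k] for k in combo])
          for combo in combinations(range(4), 3)}
best = max(scores, key=scores.get)
print("best band combination (0-based band indices):", best)
```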

#### 3.1.2. Data Labeling and Augmentation

#### 1. Data labeling

The data labeling mainly includes two parts: one is to label a region in which the waterbody is located to get the training dataset for our proposed object detection network model and the other is to label the waterbody to get the training dataset for our FCN without pooling layers.

(a) Labeling a region: We use LabelImg (Barcelona, Spain) to label the region in which the waterbody is located. The water area is labeled with a rectangular frame, and an XML file is generated. As shown in Figure 6, the label in the file records the name, path, water area category, and coordinates of the frame.

**Figure 6.** Example of labeling a region.

(b) Waterbody labeling: Labelme is used to label the waterbody. The labeled image is shown in Figure 7. In the labeled image, the labeled information of the waterbody is saved in the index dataset. Because waterbody extraction is essentially a binary classification, the black area in the labeled image is the background and is represented by 0, and the red area is the waterbody and is represented by 1.

**Figure 7.** Remote sensing image and regional waterbody labeled image. (**a**) Remote sensing image; (**b**) regional waterbody labeled image.

#### 2. Data augmentation

Compared with public image datasets such as ImageNet, there is no public training dataset of remote sensing images for this task, and it is difficult to collect much more data by ourselves, so we enlarge the dataset by data augmentation [37,38] to expand the training data and avoid overfitting. The remote sensing images are clipped to a size of 256 × 256, which makes it feasible to complete the construction of the training dataset. The geometric transformation operations used in this paper include rotations of the original images by 90°, 180°, and 270°, a horizontal flip, and a vertical flip. We use OpenCV (Intel, Santa Clara, CA, USA) with Python for data augmentation. The operation examples are shown in Figure 8.
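
A minimal OpenCV/NumPy sketch of these geometric transformations is given below; the file paths are placeholders, and in practice the same transform is applied to the corresponding label image so that image and label stay aligned.

```python
import cv2
import numpy as np

def augment(image):
    """Return the five geometric variants used for augmentation:
    90/180/270 degree rotations plus horizontal and vertical flips."""
    return {
        "rot90":  np.rot90(image, 1),
        "rot180": np.rot90(image, 2),
        "rot270": np.rot90(image, 3),
        "hflip":  cv2.flip(image, 1),   # flip around the vertical axis
        "vflip":  cv2.flip(image, 0),   # flip around the horizontal axis
    }

# Placeholder path to one 256 x 256 training tile; the corresponding label image
# would be transformed with the same operations.
tile = cv2.imread("sample_tile_256.png")
for name, aug in augment(tile).items():
    cv2.imwrite(f"sample_tile_256_{name}.png", aug)
```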

We labeled the data first and then performed the data augmentation operations. To meet the training requirements of the FCN without pooling layers, we clip the waterbody-labeled images into a size of 32 × 32. In total, we have 6000 waterbody-labeled images with a size of 32 × 32 and 6000 waterbody-region-labeled images with a size of 256 × 256. We choose 70% of them as the training set to train the improved water detection network and the FCN without pooling layers, respectively, and the remaining 30% are used as test data.

#### *3.2. Improved Water Detection Network Based on YOLOv3*

As shown in Figure 9, the improved water detection network based on YOLOv3 mainly includes two parts. The first part is the feature extraction module, in which we use 13 convolutional layers to obtain the water features. The second part is the feature pyramid network structure, which uses feature fusion to detect waterbodies at multiple scales.

**Figure 8.** Example of remote sensing images by data augmentation. (**a**) 90° rotation; (**b**) 180° rotation; (**c**) 270° rotation; (**d**) horizontal flip; (**e**) vertical flip.

**Figure 9.** The network structure of the improved water detection network.

#### 3.2.1. Improved Feature Extraction Module

The Darknet-53 network structure used in the feature extraction module of YOLOv3 (Seattle, WA, USA) easily leads to overfitting in the case of limited training data. In order to solve this problem, a 13-layer CNN is constructed for water feature extraction in the feature extraction module. The module is mainly composed of convolutional layers, pooling layers, and batch normalization layers. The parameters are shown in Table 1.


**Table 1.** Network structure and parameters of improved feature extraction module.

In the improved feature extraction module, the convolutional layers with a 3 × 3 convolution kernel are used to extract the water features of the tidal flat area, and the convolutional layers with a 1 × 1 convolution kernel are used to realize cross-channel information fusion. In order to ensure the generalization ability of our waterbody detection model in a tidal flat area, this paper uses pooling layers to keep the main characteristic data of water. To solve the slow convergence and gradient explosion problems, the improved feature extraction module adds batch normalization layers. This operation normalizes the data before it passes through the activation function, which reduces the variation amplitude of the data, makes it approximately follow a Gaussian distribution, and speeds up the convergence of the network model.
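
For illustration, the sketch below builds a 13-convolution backbone in Keras following the convolution + batch normalization + pooling pattern described above. The filter widths, the LeakyReLU activation, and the 416 × 416 input size (chosen so that five 2× poolings yield the 13 × 13 map quoted in Section 3.2.2) are illustrative assumptions; the exact configuration is the one listed in Table 1.

```python
from keras.layers import Input, Conv2D, BatchNormalization, LeakyReLU, MaxPooling2D
from keras.models import Model

def conv_bn(x, filters, kernel_size):
    """Convolution + batch normalization + LeakyReLU (activation choice is an assumption)."""
    x = Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    return LeakyReLU(alpha=0.1)(x)

inputs = Input(shape=(416, 416, 3))
x = conv_bn(inputs, 32, 3)                      # conv 1
x = MaxPooling2D(2)(x)
x = conv_bn(x, 64, 3)                           # conv 2
x = MaxPooling2D(2)(x)
for f in (128, 256):                            # convs 3-8: 3x3 / 1x1 / 3x3 blocks
    x = conv_bn(x, f, 3)
    x = conv_bn(x, f // 2, 1)
    x = conv_bn(x, f, 3)
    x = MaxPooling2D(2)(x)
x = conv_bn(x, 512, 3)                          # conv 9
x = conv_bn(x, 256, 1)                          # conv 10
x = conv_bn(x, 512, 3)                          # conv 11
x = conv_bn(x, 256, 1)                          # conv 12
feat_26 = x = conv_bn(x, 512, 3)                # conv 13, a 26 x 26 feature map
x = MaxPooling2D(2)(x)                          # 13 x 13 backbone output

backbone = Model(inputs, [x, feat_26], name="improved_13_layer_backbone")
backbone.summary()
```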

#### 3.2.2. Feature Pyramid Network Structure for Multi-Scale Water Detection

Inspired by the design of the feature pyramid, three branches are used in YOLOv3 (Seattle, WA, USA) to obtain feature maps with sizes of 13 × 13, 26 × 26, and 52 × 52, respectively. The feature maps of different sizes correspond to different receptive fields: the larger the feature map is, the smaller the corresponding receptive field is. The correspondence between feature map size and prior box is shown in Table 2. Based on the size and characteristics of the tidal flat water, we design two branches in our model. One branch, used for the detection of narrow and long waterbodies in the tidal flat area, obtains a 13 × 13 feature map through three convolutional layers after the improved feature extraction module; the other branch, used for the detection of small waterbodies in the tidal flat area, up-samples the output of the 14th convolutional layer in the network, fuses it with the features obtained by the 13th convolutional layer, and finally obtains a 26 × 26 feature map through two convolutional layers.

**Table 2.** Correspondence between feature map size and prior box.
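
The two-branch structure can be sketched in Keras as follows. The placeholder feature maps, the channel widths, and the 18 output channels (assuming 3 prior boxes per cell for a single water class, i.e., 3 × (4 box offsets + 1 objectness + 1 class)) are assumptions for illustration; the actual prior box assignment is the one given in Table 2.

```python
from keras.layers import (Input, Conv2D, BatchNormalization, LeakyReLU,
                          UpSampling2D, Concatenate)
from keras.models import Model

def conv_bn(x, filters, kernel_size):
    """Convolution + batch normalization + LeakyReLU."""
    x = Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    return LeakyReLU(alpha=0.1)(x)

# Placeholders standing in for the backbone feature maps of Section 3.2.1.
feat_13 = Input(shape=(13, 13, 512), name="backbone_13x13")
feat_26 = Input(shape=(26, 26, 512), name="backbone_26x26")

# Branch 1: three convolutions on the 13 x 13 map for long, narrow waterbodies.
c14 = conv_bn(feat_13, 512, 3)        # "14th" convolutional layer, reused by branch 2
c15 = conv_bn(c14, 256, 1)
c16 = conv_bn(c15, 512, 3)
out_13 = Conv2D(18, 1, padding="same", name="detect_13x13")(c16)

# Branch 2: upsample the 14th-layer output, fuse it with the 26 x 26 features from
# the 13th backbone layer, then two convolutions for small waterbodies.
up = UpSampling2D(2)(c14)
fused = Concatenate()([up, feat_26])
c17 = conv_bn(fused, 256, 3)
out_26 = Conv2D(18, 1, padding="same", name="detect_26x26")(c17)

head = Model([feat_13, feat_26], [out_13, out_26], name="two_branch_water_head")
head.summary()
```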


#### *3.3. FCN without Pooling Layers*

The improved object detection network based on YOLOv3 (Seattle, WA, USA) is a water object detection model, so it cannot extract the water edge. In order to solve this problem, we design the FCN without pooling layers to complete the initial waterbody extraction and obtain the feature information of waterbody in a tidal flat area. The network structure of the FCN without pooling layers is shown in Figure 10.

**Figure 10.** Fully convolutional network without pooling layers.

In general, the pooling layers of a CNN have two main functions: one is to compress the extracted features to reduce the computational time of the model; the other is to enlarge the receptive field of the model. The receptive field represents the range of the input image that each neuron in the network responds to, so enlarging the receptive field means that each point in the feature map corresponds to a larger area in the original image. The FCN without pooling layers proposed in this paper aims at the initial extraction of waterbodies from 32 × 32 remote sensing images, so a large receptive field is not required, and the water extraction based on this network still works in our method.

The FCN without pooling layers uses six convolutional layers to extract waterbodies from 32 × 32 remote sensing images, and the convolution kernel sizes of convolutional layers are 3 × 3 and 1 × 1, respectively. All the parameters in the network can be seen in Table 3. Compared with the convolution kernels with sizes 7 × 7 and 5 × 5, we use the convolution kernel with size 3 × 3 in the model to improve the network depth and the nonlinear expression ability of the model with the same receptive field. The convolution kernel with size 1 × 1 realizes cross-channel information fusion.
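
A minimal Keras sketch of such a pooling-free network is shown below. The filter widths and activation are illustrative assumptions (the exact parameters are those in Table 3); the final one-channel sigmoid output gives the per-pixel water probability on the 32 × 32 tile, and the spatial resolution is preserved end to end because no pooling layer is used.

```python
from keras.layers import Input, Conv2D, BatchNormalization, LeakyReLU
from keras.models import Model

def conv_bn(x, filters, kernel_size):
    """Convolution + batch normalization + LeakyReLU."""
    x = Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    return LeakyReLU(alpha=0.1)(x)

# Six convolutional layers, no pooling; filter widths are illustrative assumptions.
inputs = Input(shape=(32, 32, 3))
x = conv_bn(inputs, 32, 3)
x = conv_bn(x, 64, 3)
x = conv_bn(x, 32, 1)        # 1 x 1 convolution for cross-channel fusion
x = conv_bn(x, 64, 3)
x = conv_bn(x, 32, 1)
outputs = Conv2D(1, 3, padding="same", activation="sigmoid")(x)   # per-pixel water probability

fcn = Model(inputs, outputs, name="fcn_without_pooling")
fcn.compile(optimizer="adam", loss="binary_crossentropy")
fcn.summary()
```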



#### *3.4. Similarity Algorithm for Water Extraction*

To reduce the false extraction caused by the high similarity between the waterbody and the background in a tidal flat area, a similarity algorithm for water extraction is proposed. The steps of the algorithm are as follows:

1. Firstly, we take the water region located by the improved water detection network based on YOLOv3 and the initial waterbody extraction obtained by the FCN without pooling layers as the input of the algorithm.
2. Secondly, we calculate the average pixel value of the initially extracted water pixels in each channel as the standard water pixel value. The formula is:

$$
r, g, b = \sum_{i=1}^{n} L_{r,g,b} / n \tag{1}
$$

where *r*, *g*, and *b* represent the average water pixel values, *Lr,g,b* represents the pixel values of the water extraction results, and *n* is the number of waterbody pixels.

3. Thirdly, we traverse every pixel in the detection information, and calculate the similarity between the water pixels and the standard water pixels. The formula is:

$$Y = \sqrt{(L_r - r)^2 + (L_g - g)^2 + (L_b - b)^2} \tag{2}$$

where *Lr*, *Lg* and *Lb* represent the pixel values of the detection results in the red, green and blue channels, respectively.

4. Finally, we set a similarity threshold and finish the water extraction. We set the threshold to 34 based on the experiments in Section 4.4.1. If the similarity between a water pixel in the water detection results and the standard water pixel is greater than the threshold, the pixel is considered as water; otherwise, it is not water.

The similarity algorithm for water extraction proposed in this paper effectively solves the accuracy problem of waterbody extraction caused by the blurry boundary between the waterbody and background.

The similarity algorithm for water extraction is as Algorithm 1:

**Algorithm 1.** Similarity Algorithm for Water Extraction.

```
Input: Output results of improved water detection network based on YOLOv3 and FCN
without pooling layers
Output: Pixel is waterbody or non-waterbody
1.  Procedure Similarity-Water-Extraction (n: integer);
2.  begin
3.    for i := 1 to n do
4.    begin
5.      sumLr := sumLr + Lri;
6.      sumLg := sumLg + Lgi;
7.      sumLb := sumLb + Lbi;
8.    end;
9.    r := sumLr/n;
10.   g := sumLg/n;
11.   b := sumLb/n;
12.   while (pixel is the result of waterbody target detection) do
13.   begin
14.     Y := sqrt((Lr - r)^2 + (Lg - g)^2 + (Lb - b)^2);
15.     if (Y > 34) then
16.       pixel is waterbody
17.     else
18.       pixel is non-waterbody;
19.   end;
20. end;
```
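
For reference, a vectorized NumPy sketch of Algorithm 1 is given below. The array names and mask shapes are assumptions; the averaging of Equation (1), the distance of Equation (2), and the threshold rule (Y > 34 implies waterbody) follow the algorithm exactly as stated above.

```python
import numpy as np

def similarity_water_extraction(image, initial_water_mask, detection_mask, threshold=34.0):
    """Vectorized sketch of Algorithm 1.

    image:              H x W x 3 RGB tile
    initial_water_mask: boolean H x W mask from the FCN without pooling layers,
                        used to compute the standard (average) water pixel value
    detection_mask:     boolean H x W mask of the region returned by the improved
                        water detection network, whose pixels are classified here
    """
    img = image.astype(np.float64)
    # Equation (1): average r, g, b over the initially extracted water pixels.
    r, g, b = img[initial_water_mask].mean(axis=0)
    # Equation (2): per-pixel distance Y to the standard water pixel value.
    y = np.sqrt((img[..., 0] - r) ** 2 + (img[..., 1] - g) ** 2 + (img[..., 2] - b) ** 2)
    # Threshold rule exactly as written in Algorithm 1 (Y > 34 => waterbody).
    return detection_mask & (y > threshold)
```
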
#### **4. Experiment and Analysis**

*4.1. Experimental Configuration*

All experiments are implemented on a system with NVIDIA GeForce GTX1070 (Santa Clara, CA, USA) and Intel(R) Core (TM) i7 (Santa Clara, CA, USA), and the operating system is Windows 10 (Redmond, WA, USA). The software environment of the system is ENVI 5.3 (Boulder, CO, USA), Python 3.6 (Wilmington, DE, USA), TensorFlow 1.12.0 (Mountain View, CA, USA) and Keras 2.2.4 (Cobham, UK).

#### *4.2. Evaluation Criterion*

To accurately analyze the experiments, this paper selects three indicators to quantitatively evaluate the model: Intersection over Union (*IoU*), pixel accuracy, and the *Kappa* coefficient. The *IoU* describes the overlap degree between the extracted object and the ground truth; the pixel accuracy measures the proportion of the detection result that is correct; the *Kappa* coefficient measures the pixel classification accuracy. The calculation formulas of the three indicators are as follows:

$$IoU = \frac{Area(P) \cap Area(T)}{Area(P) \cup Area(T)} \tag{3}$$

where *Area*(*P*) represents the prediction result and *Area*(*T*) represents the ground truth.

$$P = \frac{TP}{TP + FP} \tag{4}$$

where *P* represents the pixel accuracy, *TP* represents the number of samples that are positive and identified as positive by the network model, and *FP* represents the number of samples that are incorrectly classified as positive.

$$k = \frac{p\_0 - p\_c}{1 - p\_c} \tag{5}$$

where *k* represents the value of the *Kappa* coefficient, *p*0 represents the proportion of correctly classified cells, and *pc* represents the proportion of agreement expected by chance.

$$p_c = \frac{TP \times FN}{n \times n} \tag{6}$$

where *TP* represents the number of samples that are positive and identified as positive by the network model, *n* represents the number of ground feature types, and *FN* represents the number of samples that are incorrectly classified as negative.
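
A NumPy sketch of the three indicators for binary water masks is given below. The *IoU* and pixel accuracy follow Equations (3) and (4) directly; for the *Kappa* coefficient, the chance agreement is computed here from the confusion-matrix marginals (the standard Cohen's kappa form), which is a more general expression than the simplified Equation (6).

```python
import numpy as np

def evaluate(pred, truth):
    """pred, truth: boolean H x W water masks. Returns (IoU, pixel accuracy, Kappa)."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    total = tp + fp + fn + tn

    iou = tp / float(tp + fp + fn)                       # Equation (3)
    pixel_accuracy = tp / float(tp + fp)                 # Equation (4)

    p0 = (tp + tn) / float(total)                        # observed agreement
    # Chance agreement from the confusion-matrix marginals (standard Cohen's kappa).
    pc = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / float(total) ** 2
    kappa = (p0 - pc) / (1.0 - pc)                       # Equation (5)
    return iou, pixel_accuracy, kappa
```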

#### *4.3. Parameter Setting*

In this paper, the waterbody detection network plays an important role in the final water extraction. To study the influence of the learning rate on the accuracy of water detection, we compare and analyze the decline curve of the loss function under different learning rates and finally take the optimal learning rate as the model parameter. The values of the learning rate are set to 0.0001, 0.005, 0.001, and 0.01, respectively. The convergence curve of the loss function is shown in Figure 11.

**Figure 11.** Loss function curve of the water detection model with different learning rates.

As shown in Figure 11, when the learning rate is 0.0001 or 0.005, the network model converges slowly, and the loss value at convergence is higher. When the learning rate is 0.001 or 0.01, the network performs better, and its convergence speed and final convergence result are significantly improved compared with the other learning rates. Based on the above analysis of the learning rate, as well as many experiments and model debugging, the training parameters of the water detection model are obtained. In this paper, we set the learning rate to 0.001, the batch size to 64, the momentum to 0.9, the weight decay to 0.0005, and the number of epochs to 500 for the improved water detection network based on YOLOv3 (Seattle, WA, USA). The network also uses two *IoU* thresholds during training: if a prediction overlaps a ground truth box with an *IoU* of at least 0.7, it is taken as a positive example; if the *IoU* is between 0.5 and 0.7, it is ignored; and if the *IoU* is less than 0.5 for all ground truth objects, it is a negative example. We set the learning rate to 0.01, the batch size to 32, the momentum to 0.9, the weight decay to 0.0001, and the number of epochs to 150 for the FCN without pooling layers in our experiments.
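
Collected as plain configuration dictionaries purely for readability (a convenience sketch, not code from the paper), the training settings above are:

```python
# Training settings quoted in Section 4.3, gathered here for reference.
detector_config = {            # improved water detection network (YOLOv3-based)
    "learning_rate": 0.001,
    "batch_size": 64,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "epochs": 500,
    "iou_positive": 0.7,       # IoU >= 0.7 with ground truth -> positive example
    "iou_ignore": (0.5, 0.7),  # IoU between 0.5 and 0.7 -> ignored
    "iou_negative": 0.5,       # IoU < 0.5 for all ground truth -> negative example
}
fcn_config = {                 # FCN without pooling layers
    "learning_rate": 0.01,
    "batch_size": 32,
    "momentum": 0.9,
    "weight_decay": 0.0001,
    "epochs": 150,
}
```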

#### *4.4. Performance Analysis*

#### 4.4.1. Influence of Threshold of Similarity Algorithm for Water Extraction

We set 31, 32, 33, 34, 35, 36, 37, 38 and 39 as thresholds, respectively, and use the extraction accuracy to study the influence of thresholds. The experiments are shown in Figure 12.

**Figure 12.** Comparison of the pixel accuracy with different thresholds.

We calculated the pixel accuracy of water extraction with different thresholds, and the experiments show that the pixel accuracy increases at first and decreases after 34, as shown in Figure 12. When the threshold is 34, the pixel accuracy is the highest in our experiments. When the threshold is lower than 34, missing extraction begins to appear in the water extraction, which makes the accuracy of the water extraction decrease. When the threshold is higher than 34, false extraction begins to appear, and the accuracy of water extraction decreases as the threshold increases as well. This is likely caused by the definition of the standard water pixel value. To sum up, this paper selects 34 as the threshold of the similarity algorithm for water extraction in the tidal flat area.

#### 4.4.2. Qualitative Analysis

To verify the effectiveness of this method, we compare the following methods: NDWI, support vector machine (SVM), maximum likelihood classification, U-Net (Freiburg, Germany) [39], YOLOv3 (Seattle, WA, USA) and FYOLOv3. The tidal flat remote sensing images from the GF-2 satellite are selected as the sample, and the experiments are shown in Figure 13.


**Figure 13.** Comparison of different methods for water extraction in a tidal flat area. (**a**) Small water bodies; (**b**) water bodies with blurry boundaries; (**c**,**d**) NDWI; (**e**,**f**) SVM; (**g**,**h**) maximum likelihood classification; (**i**,**j**) U-Net; (**k**,**l**) YOLOv3; (**m**,**n**) FYOLOv3.

The experiments of NDWI, SVM, maximum likelihood classification, U-Net (Freiburg, Germany), YOLOv3 (Seattle, WA, USA) and FYOLOv3 for small waterbodies and waterbodies with blurry boundaries are shown in Figure 13, respectively.

As shown in Figure 13, the NDWI method effectively extracts the waterbody in the remote sensing images, but there is a lot of noise in the extraction results. The water extractions by SVM and the maximum likelihood classification method are relatively good, but they cannot effectively solve the problem of the high similarity between water and background, and there are many false extractions in the experiments. From the experiments of U-Net, we can see that the water extraction is not good for small waterbodies, and there are many false and missing extractions. Compared with NDWI, SVM, maximum likelihood classification, and U-Net (Freiburg, Germany), the experiments of YOLOv3 (Seattle, WA, USA) and FYOLOv3 show better extraction. However, in the experiments of YOLOv3 (Seattle, WA, USA), there are some missing extractions in densely distributed small water areas due to the prior frames, whereas FYOLOv3 is able to check each pixel in the detection area based on the similarity algorithm for water extraction, which reduces the false and missing extraction, so it is superior to YOLOv3 (Seattle, WA, USA).

#### 4.4.3. Accuracy Analysis

We take *IoU*, pixel accuracy (*P*), and the *Kappa* (*k*) coefficient as the evaluation indexes to compare six methods: NDWI, SVM, maximum likelihood classification, U-Net (Freiburg, Germany), YOLOv3 (Seattle, WA, USA), and FYOLOv3. We set the image size to 256 × 256, the learning rate to 0.001, the decay to 0.0005, and the momentum to 0.9, and we use the Adam optimizer for YOLOv3. We set the learning rate to 0.001, the decay to 0.0001, and the momentum to 0.9, and use the Adam optimizer for U-Net (Freiburg, Germany). The threshold of NDWI is 0.19 and the parameter of the maximum likelihood method is 2.1. The experiment results of the six methods are shown in Table 4.

**Table 4.** Accuracy comparison of six methods for water extraction in tidal flat area.


As shown in Table 4, the *IoU*, *P*, and *k* of the FYOLOv3 method for water extraction in a tidal flat area on remote sensing images are the highest, followed by the YOLOv3 network, NDWI, U-Net, maximum likelihood classification, and SVM. The method proposed in this paper has higher extraction accuracy than the other methods and performs better for water extraction in tidal flats with fuzzy boundaries and small waterbodies. This indicates that our method has clear advantages for small waterbody extraction in a tidal flat area.

Table 5 shows the model training time and water extraction time of the three convolutional neural network methods. Although the FYOLOv3 method consists of three parts, its water extraction speed is the highest. The method proposed in this paper not only improves the accuracy of water extraction but also reduces the model training time and water extraction time due to the improvement of YOLOv3 (Seattle, WA, USA).

**Table 5.** Comparison of the model training time and water extraction time.


#### **5. Conclusions**

The tidal flat is long and narrow with high sediment content, so there is little feature difference between the waterbody and the background, and the boundary of the waterbody is blurry. Extracting the tidal flat waterbody accurately from high-resolution remote sensing imagery is a great challenge. In order to solve the low accuracy problem of tidal flat waterbody extraction, FYOLOv3 is proposed in this paper to solve the above problems and extract the waterbody in tidal flats with high accuracy. FYOLOv3 mainly includes three parts: firstly, according to the characteristics of tidal flat water extraction, an improved object detection network based on YOLOv3 (Seattle, WA, USA) is proposed to ensure the accuracy of water detection, reduce the computational time of the model, and alleviate the overfitting phenomenon; secondly, an FCN without pooling layers follows the improved object detection network to obtain the initial water extraction; finally, a similarity algorithm for water extraction is proposed, which distinguishes waterbody pixels from non-water pixels one by one in order to improve the extraction accuracy of the tidal flat waterbody. Compared with the other models, the experiments show that our method has higher accuracy in the waterbody extraction of tidal flats or small areas, and the *IoU* of our method is 2.43% higher than that of YOLOv3 (Seattle, WA, USA) and 3.7% higher than that of U-Net (Freiburg, Germany). However, this method also has some limitations: the similarity threshold needs to be selected manually, and different thresholds need to be set for different data, which affects the robustness of the method. Therefore, our future research will consider how to determine the threshold intelligently in order to improve the robustness of the method.

**Author Contributions:** Conceptualization, L.Z. and Y.F.; Methodology, L.Z. and Y.F.; Validation, Y.S. and R.Y.; Resources, R.Y. and G.W.; Data Curation, J.W. and G.W.; Writing—Original Draft Preparation, L.Z. and Y.F.; Writing—Review and Editing, L.Z. and Y.F.; Supervision, L.Z.; Funding Acquisition, L.Z. and G.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Key Research and Development Program of China (No.2016YFA0601703, 2016YFC0401005) and National Natural Science Foundation of China (91847301, 42075191, 52009080).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The study did not report any data.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

