#### **1. Introduction**

Water is an indispensable resource for a sustainable ecosystem on Earth. It contributes significantly to the balance of ecosystems, the regulation of climate and the carbon cycle [1]. The formation, expansion, shrinkage and disappearance of surface water are important factors influencing the environment and regional climate change. Water is also an important factor in socioeconomic development, because it affects many agricultural, environmental and ecological issues over time [2,3]. Hence, the rapid and accurate extraction of water resource information can provide necessary data, which is of great significance for water resource investigation [4–6], flood monitoring [7,8], wetland protection [9,10] and disaster prevention and reduction [11,12].

In recent years, much research has been done on image foreground extraction and segmentation [13]. That study proposed an Alternating Direction Method of Multipliers (ADMM) approach to separate foreground information from the background, which works well for separating text, moving objects and so on. There are also many algorithms for extracting water from remote sensing images, including spectral classification [14], threshold segmentation [7,15] and machine learning [16–18].

However, the accurate identification of water remains a difficult problem because of complicated terrain, the limitations of classification methods and the remote sensing data themselves. Because of its simplicity and convenience, the water index is the most commonly used water identification method. Among the water indices, the Normalized Difference Water Index (NDWI) [19], the Modified NDWI (MNDWI) [20] and the Automated Water Extraction Index (AWEI) [21] are the most representative. The NDWI normalizes the green and near-infrared bands to enhance the water information and separate water more reliably, but it has a large error in urban areas [20]. The MNDWI ameliorates this problem by using mid-infrared bands [20]. What these water indices have in common is that they all use differences in the reflectivity of water at different wavebands to enhance water information. The water is then classified by setting a threshold.
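The index-then-threshold procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: it assumes the usual normalized-difference form (green minus infrared, divided by their sum), and the reflectance values, the function names and the threshold of 0 are hypothetical choices made for the example.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b), or 0.0 when both reflectances are zero."""
    total = band_a + band_b
    return (band_a - band_b) / total if total != 0 else 0.0


def ndwi(green, nir):
    """NDWI: contrast between green and near-infrared reflectance."""
    return normalized_difference(green, nir)


def mndwi(green, mir):
    """MNDWI: contrast between green and mid-infrared reflectance."""
    return normalized_difference(green, mir)


def classify_water(index_values, threshold=0.0):
    """Label a pixel as water (True) when its index exceeds the threshold.

    The default threshold of 0.0 is an assumption for illustration only;
    in practice the best value varies with region and time.
    """
    return [value > threshold for value in index_values]


# Hypothetical (green, near-infrared) reflectances for three pixels.
pixels = [(0.10, 0.02), (0.08, 0.30), (0.12, 0.10)]
indices = [ndwi(green, nir) for green, nir in pixels]
water_mask = classify_water(indices, threshold=0.0)
```

Water reflects more in the green band than in the infrared, so its index is positive, while vegetation and soil tend to produce negative values. The third pixel lies close to the threshold, which is the kind of case where the choice of threshold decides the result.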

There are two problems with the water index approaches. The first is that every water index has its drawbacks. For example, the NDWI is poor at distinguishing between water and buildings, and the MNDWI is poor at distinguishing water from snow and mountain shadows. More sophisticated methods for high-precision water maps require auxiliary data, such as digital elevation models, and complex rule sets to overcome these problems [22–24]. The second problem is that the optimal threshold for extracting water is not only highly subjective, but also varies with region and time. The Automated Water Extraction Index [21] improves the extraction result, but its threshold still changes with time and area.

Statistical models are also used for identifying water bodies, and can be divided into unsupervised and supervised classification. They are generally more accurate than other methods, because they do not require an empirical threshold. No prior knowledge is applied in unsupervised classification, while supervised classification learns from given samples. There are many popular supervised methods, such as maximum likelihood [14] and the decision tree [25,26]. Most methods require additional inputs beyond the original bands for more accurate results, such as slope and mountain shadow [25,26]. All of these increase the data volume and the difficulty of calculation.

In recent years, recognition algorithms based on artificial intelligence have developed rapidly. Unlike traditional methods, deep learning can learn adaptively from a large number of samples, with flexibility and universality [27]. The convolutional neural network is one of the most commonly used deep learning models; through local connections and weight sharing, it greatly reduces the number of parameters, enhances generalization ability and has brought a qualitative leap in image recognition [17]. The recent popularity of deep neural networks has revitalized this research field. As the number of network layers increases, the differences between structures also grow, which has stimulated the exploration of different network structures [28–32]. Many network structures have been proposed to realize the semantic segmentation of images. One is the encoder–decoder structure, such as Unet [33], SegNet [34] and RefineNet [35]. The encoder is used to extract image features and reduce image dimensions. The decoder is used to restore the size and the detail of the image. The other uses dilated convolutions, such as DeepLab v1 [36], v2 [37], v3 [38], v3+ [39] and PSPNet [40]. They can enlarge the receptive field without pooling, so that each convolution output covers a larger range of information. In addition, networks that have proven effective in object detection have also been applied to instance segmentation and have shown good efficiency. Examples include the region-based convolutional neural network (R-CNN) [41], Fast R-CNN [42], Faster R-CNN [43] and Mask R-CNN [44]. A new framework called the Hybrid Task Cascade (HTC) has also been proposed, which combines a cascade architecture with R-CNN for better results [45]. Attention mechanisms have also been applied to segmentation networks by many researchers. Chen et al. [46] showed that the attention mechanism outperforms average and max pooling. More recently, a Dual Attention Network (DANet) [47] has been proposed, which appends two types of attention modules on top of a dilated FCN and achieves new state-of-the-art results on multiple popular benchmarks.
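How dilation widens the region of the input that each output value sees, without any pooling, can be illustrated with a short calculation. The sketch below assumes stride-1 convolutions stacked one after another, for which the receptive field along one axis is 1 plus the sum of (kernel size − 1) × dilation over the layers. The function name and the layer configurations are hypothetical and are not taken from the cited networks.

```python
def receptive_field(layers):
    """Receptive field, in pixels along one axis, of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs, ordered from the
    first layer to the last.
    """
    field = 1
    for kernel_size, dilation in layers:
        # Each layer extends the field by (kernel_size - 1) * dilation pixels.
        field += (kernel_size - 1) * dilation
    return field


# Three ordinary 3 x 3 convolutions (dilation 1 everywhere).
plain = receptive_field([(3, 1), (3, 1), (3, 1)])

# Three 3 x 3 convolutions with dilations 1, 2 and 4.
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])

print(plain, dilated)  # prints: 7 15
```

Both stacks have the same number of weights, but the dilated stack covers a 15-pixel span instead of 7 and keeps the feature map at full resolution.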

Besides the networks mentioned above, many other types of deep models have been applied to image segmentation, such as those that combine active contour models with convolutional neural networks (CNNs) [48]. Shervin et al. [49] have made a thorough summary of networks for image segmentation.

Deep convolutional neural networks can extract the features needed for image classification and target detection. They are reported to perform well in both tasks, and several models have been developed, such as LeNet [50] in 1998, AlexNet [28] in 2012, GoogLeNet [29] and VGG [30] in 2014 and ResNet [31] in 2015. With technical development, the complexity of these models is increasing. The VGG network uses only 3 × 3 convolution kernels and 2 × 2 pooling kernels [30]. The use of smaller convolution kernels increases the number of non-linear transformations and improves classification accuracy. It also shows that increasing the network depth has a great effect on the final classification results. However, simply increasing the network depth leads to vanishing or exploding gradients. ResNet solves this problem by introducing a residual block [31], which passes the input directly to the output to protect the integrity of the information. The network then only needs to learn the difference between the input and output, which simplifies the learning process. Recent research on ResNet shows that many of its middle layers contribute little to the actual training process and can be randomly deleted, which makes ResNet similar to recurrent neural networks [32]; but, since every layer of ResNet has its own weights, it has a larger number of parameters. The densely connected convolutional network (DenseNet) [51], proposed in 2016, does not have the above problems. It extends the idea of the residual block in ResNet: each layer is directly connected to all of its preceding layers, which enables feature reuse. This makes the network easy to train by improving the flow of information and gradients throughout the network. At the same time, it has a regularization effect that can reduce overfitting on small data sets. In addition, each layer of the network is very narrow, which reduces redundancy. Crucially, unlike ResNet, DenseNet combines features not by summing them before passing them to the next layer, but by concatenating them. Compared with ResNet, the number of parameters is greatly reduced. Experimental results have shown that DenseNet has fewer parameters, faster convergence and shorter training time while maintaining training accuracy [51].
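The difference between combining features by summation (ResNet) and by concatenation (DenseNet) can be made concrete with a small sketch. It is an illustration only: feature maps are reduced to plain lists with one number per channel, and it assumes the dense connectivity of the DenseNet paper, in which each layer receives the concatenation of all earlier feature maps and adds a fixed number of new channels (the growth rate). The function names and numbers are hypothetical.

```python
def residual_combine(x, fx):
    """ResNet-style shortcut: add the layer output to its input element-wise."""
    if len(x) != len(fx):
        raise ValueError("summation requires matching channel counts")
    return [a + b for a, b in zip(x, fx)]


def dense_combine(previous_features, new_features):
    """DenseNet-style shortcut: concatenate along the channel axis."""
    return previous_features + new_features


def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Number of input channels seen by each layer inside one dense block."""
    return [initial_channels + i * growth_rate for i in range(num_layers)]


x = [1.0, 2.0]    # two input channels (one value per channel, for brevity)
fx = [0.5, -1.0]  # hypothetical output of a layer applied to x

summed = residual_combine(x, fx)    # channel count unchanged
concatenated = dense_combine(x, fx) # earlier features kept as they are

# A block with 16 input channels, growth rate 12 and 4 layers.
channels = dense_block_input_channels(16, 12, 4)
print(summed, concatenated, channels)
# prints: [1.5, 1.0] [1.0, 2.0, 0.5, -1.0] [16, 28, 40, 52]
```

Summation merges old and new features into one tensor of the same width, whereas concatenation keeps earlier features unchanged for reuse, so each layer only has to contribute a small number of new channels.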

So far, Landsat has been one of the most commonly used satellite data sources in water extraction research; its spatial resolution is 30 m and its temporal resolution is 16 days [52]. The GF-1 satellite, launched by China in April 2013, is equipped with two panchromatic cameras with a resolution of 2 m and a multi-spectral camera with a resolution of 16 m. Since the revisit period of the GF-1 satellite is about four days, it has clear advantages in both spatial and temporal resolution. However, few studies have used GF-1 satellite images for water body extraction, especially with deep learning algorithms.

In this paper, we use a convolutional neural network (CNN) to extract water bodies from GF-1 images. We borrow the idea of DenseNet and add an up-sampling process to form a fully convolutional neural network. In addition, skip connections are added between the down-sampling and up-sampling processes to improve the efficiency of feature utilization. We compare this model with two segmentation networks (SegNet and DeepLab v3+), two feature extraction networks (ResNet and VGG) and the traditional water index method to assess their effectiveness in water body identification.

#### **2. Materials and Methods**

#### *2.1. Study Area*

Poyang Lake (28°22′–29°45′N, 115°47′–116°45′E) is located in the north of Jiangxi Province and is the largest freshwater lake in China. In the rainy summer season, the lake area can exceed 4000 km²; in the relatively dry autumn and winter, the lake area typically shrinks by more than 1000 km². The lake is mainly fed by precipitation, and sometimes by inflow from the Yangtze River. The rainy season in Jiangxi Province usually begins in April and lasts for about three months.

The increase in precipitation causes the water level of Poyang Lake to rise. Precipitation decreases after July. However, the water level of the Yangtze River rises because of precipitation and snowmelt in its upper reaches; this water feeds Poyang Lake, so the lake's water level continues to rise [53]. Human activities, Yangtze River water diversion and large amounts of sediment deposition also have an important influence on the area of Poyang Lake.

Figure 1 shows the river networks in the Poyang Lake basin. Since most of the water bodies in the Poyang Lake basin are distributed in the northern region, we selected an area of interest there to compare the water identification performance of different methods. Due to the influence of monsoon precipitation, the spatial coverage of Poyang Lake changes significantly between the wet and dry seasons. Therefore, we selected images from both summer and winter to evaluate the water body recognition performance of the models.

**Figure 1.** The river networks in Poyang Lake basin.
