*2.3. Methods*

To produce water maps from high-resolution satellite images, we propose a DenseNet-based water mapping method. To verify its effectiveness, we compared it with both a water-index method and classical convolutional neural networks.

We selected the water-index method because it is the most widely used and representative approach to water extraction from remote sensing images. By comparing against it, we aim to show that the proposed method outperforms the traditionally used water index. To avoid the influence of subjective factors in threshold selection on the results, we used Otsu's threshold segmentation method [54,55] to find the optimal threshold. Due to the limitation of the GF-1 spectral bands, we chose the NDWI to extract water.

#### 2.3.1. The Normalized Difference Water Index (NDWI)

The NDWI is a widely used method for water identification based on the green and near-infrared bands. Because GF-1 images contain only four bands, the NDWI is the index applicable for identifying water areas, and its optimal threshold is determined using Otsu's method. Using the GF-1 spectral bands, the NDWI is computed as follows:

$$\text{NDWI} = \frac{b_{\text{green}} - b_{\text{near-infrared}}}{b_{\text{green}} + b_{\text{near-infrared}}} \tag{1}$$

where $b_{\text{green}}$ and $b_{\text{near-infrared}}$ represent the reflectance of the green and near-infrared bands, respectively. Ideally, a positive NDWI value indicates that the ground is covered with water, rain or snow; a negative value indicates vegetation coverage; and a value of 0 indicates rocks or bare soil. In practice, however, the optimal threshold is rarely exactly 0, owing to influences such as vegetation on the water surface. Threshold selection is therefore a key and difficult problem for accurate water body identification, and we use Otsu's method to determine it.
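As a minimal sketch of Equation (1), the index can be computed per pixel from the two reflectance bands; the `eps` guard against division by zero is our own addition, not part of the original formula:

```python
import numpy as np

def ndwi(green, nir, eps=1e-12):
    """Compute NDWI per Equation (1): (green - NIR) / (green + NIR).

    `green` and `nir` are reflectance arrays (or scalars) of equal
    shape; `eps` avoids division by zero over dark pixels (an added
    safeguard, not part of the original formula).
    """
    green = np.asarray(green, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    return (green - nir) / (green + nir + eps)

# Toy pixels: water reflects more green than near-infrared,
# vegetation the opposite.
water_pixel = ndwi(0.30, 0.05)   # positive -> likely water
veg_pixel = ndwi(0.10, 0.40)     # negative -> likely vegetation
```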

Otsu's method, proposed by Nobuyuki Otsu in 1979 [54,55], is a classical adaptive threshold determination algorithm in the image segmentation field. For a color image, it first converts the image to grayscale and then separates the target from the background according to the grayscale distribution. The larger the variance of gray values between target and background, the greater the difference between the two parts; the optimal threshold is therefore found by maximizing the between-class variance, defined as follows:

$$e^2(T) = P_o(\mu - \mu_o)^2 + P_b(\mu - \mu_b)^2 \tag{2}$$

where $\mu$ is the grayscale mean of the image, $\mu_o$ and $\mu_b$ are the means of the target and background, $P_o$ and $P_b$ are the proportions of target and background pixels, and T is the candidate threshold. The optimal threshold is the value of T that maximizes $e^2(T)$.

In this study, once the pixel-wise NDWI values are derived, they are linearly stretched to gray values from 0 to 255, from which Otsu's threshold is then calculated to segment the water body from the background.
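The stretch-then-threshold procedure above can be sketched from scratch in numpy; this is an illustrative implementation of Equation (2), not the authors' code (libraries such as scikit-image provide an equivalent `threshold_otsu`):

```python
import numpy as np

def stretch_to_gray(ndwi_map):
    """Linearly stretch NDWI values to 8-bit gray values in [0, 255]."""
    lo, hi = ndwi_map.min(), ndwi_map.max()
    return np.round((ndwi_map - lo) / (hi - lo) * 255).astype(np.uint8)

def otsu_threshold(gray):
    """Exhaustively search the T that maximizes the between-class
    variance of Equation (2) over an 8-bit grayscale array."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()                     # gray-level probabilities
    levels = np.arange(256)
    mu = np.sum(levels * p)                   # global grayscale mean
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        p_b = p[:t].sum()                     # background proportion
        p_o = 1.0 - p_b                       # target proportion
        if p_b == 0 or p_o == 0:
            continue
        mu_b = np.sum(levels[:t] * p[:t]) / p_b
        mu_o = np.sum(levels[t:] * p[t:]) / p_o
        var = p_o * (mu - mu_o) ** 2 + p_b * (mu - mu_b) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Synthetic bimodal NDWI map: background near -0.4, water near 0.5.
rng = np.random.default_rng(42)
ndwi_map = np.concatenate([rng.normal(-0.4, 0.05, 5000),
                           rng.normal(0.5, 0.05, 5000)])
gray = stretch_to_gray(ndwi_map)
t = otsu_threshold(gray)      # lands between the two gray-level modes
```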

#### 2.3.2. Evolution of Convolutional Neural Network

With the development of technology and the optimization of hardware, many classical networks have emerged through successive refinements of the convolutional neural network. In 2014, researchers developed a new deep convolutional neural network, VGG [30], and discussed the relationship between network depth and performance. VGG [30] successfully constructed networks of 16–19 convolutional layers, demonstrating that increased depth improves network performance to some extent. It was widely used as a backbone feature-extraction network for various detection frameworks [42,56] until ResNet was proposed.

As a neural network with more than 100 layers, ResNet's key innovation is solving the problem of network degradation through the introduction of residual blocks. Traditional convolutional networks lose information during transmission, which leads to vanishing or exploding gradients and makes deep networks untrainable. ResNet passes input information directly to the output, alleviating this problem to some extent: each block learns only the difference between input and output rather than the full mapping, which simplifies learning. DenseNet was proposed on the basis of ResNet, with considerable improvements.

As shown in Figure 2, the input of each layer of DenseNet comprises the outputs of all previous layers, so information transmission between different layers of the network is maximized. Instead of combining layers by summation as in ResNet, DenseNet concatenates features to achieve feature reuse. Meanwhile, a small growth rate is adopted and the feature map of each layer is relatively small; thus, to achieve the same accuracy, DenseNet requires only about half the computation of ResNet. This study therefore chooses DenseNet as the backbone for feature extraction.

**Figure 2.** Multi-dimensional Dense Connection Module. (BN refers to Batch Normalization, ReLU refers to Rectified Linear Unit, Conv refers to Convolution.)

For a standard CNN, the output of one layer is the input of the next. ResNet simplifies the training of deep networks by introducing the residual block, whose output is the sum of the previous layer's output and its nonlinear transformation. In DenseNet, the input of the *l*-th layer is the concatenation of the output feature maps of layers 1 to *l* − 1, on which a nonlinear transformation is then applied, that is:

$$x_l = K_l([x_{l-1}, x_{l-2}, \dots, x_1]), \tag{3}$$

where $K_l$ is composed of batch normalization, an activation function, convolution and dropout. DenseNet's dense connections increase the utilization of features, make the network easier to train, and have a regularizing effect.
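The concatenation of Equation (3) and the role of the growth rate can be sketched with numpy; for brevity the BN/ReLU/dropout parts of $K_l$ are omitted and the 1 × 1 convolution is written as a per-pixel matrix multiply (a conceptual illustration, not the paper's implementation):

```python
import numpy as np

GROWTH_RATE = 32  # k: channels each dense layer adds, as in the paper

def dense_layer(features, weight):
    """One dense layer K_l: concatenate all previous feature maps along
    the channel axis, then apply a stand-in 1x1 linear transform that
    emits GROWTH_RATE new channels (BN/ReLU/dropout omitted)."""
    x = np.concatenate(features, axis=0)       # [C_total, H, W]
    # A 1x1 convolution is a matrix multiply over the channel axis.
    return np.einsum('kc,chw->khw', weight, x)

rng = np.random.default_rng(0)
h = w = 8
features = [rng.standard_normal((64, h, w))]   # input stem: 64 channels
for _ in range(3):
    c_in = sum(f.shape[0] for f in features)   # grows by k per layer
    weight = rng.standard_normal((GROWTH_RATE, c_in)) * 0.01
    features.append(dense_layer(features, weight))

# Each layer adds k=32 channels, so the concatenated input to the next
# layer grows linearly: 64, 96, 128, 160, ...
channel_counts = [f.shape[0] for f in features]
```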

Fully convolutional networks (FCNs) [57,58] are convolutional neural networks that can segment images at the pixel scale, thereby addressing the semantic segmentation problem. A classic CNN appends fully connected layers after the convolution layers to obtain a feature vector for classification (fully connected layer + SoftMax output) [59–62]. Unlike the classic CNN, an FCN uses deconvolution to restore the reduced feature map to its original size after feature extraction. In this way, the spatial information of the input is preserved and an output of the same size as the input is gradually obtained, achieving pixel-level classification; the network can also accept input images of any size. Many segmentation networks have been proposed since FCN. SegNet [34] is an encoder–decoder network that uses the first 13 layers of VGG16 as the encoder and the max-pooling indices in the decoder to improve segmentation resolution. DeepLab v3+ [39], proposed in 2018, is the latest version of the DeepLab series. It uses a deep convolutional neural network with atrous convolution in the encoder part, and Atrous Spatial Pyramid Pooling (ASPP) to introduce multiscale information. Compared with DeepLab v3, v3+ introduces a decoder module that further integrates low-level and high-level features to improve the accuracy of segmentation boundaries.

#### 2.3.3. Model Based on DenseNet

Figure 3 shows the architecture of the network we propose for water body identification. Our model is a fully convolutional neural network with multiscale feature fusion, and it chooses DenseNet as the backbone for feature extraction. The DenseNet we use contains four dense blocks, connected by transition blocks. Each transition block consists of a 1 × 1 convolution and a 2 × 2 pooling operation, which reduces the spatial dimensionality of the feature maps.
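The transition block's dimensionality reduction can be sketched in numpy, again writing the 1 × 1 convolution as a per-pixel matrix multiply; the channel and spatial sizes below are hypothetical, chosen only to show the halving:

```python
import numpy as np

def transition_block(x, weight):
    """Transition between dense blocks: a 1x1 convolution (channel
    mixing) followed by 2x2 average pooling with stride 2, which halves
    the spatial dimensions. A minimal sketch assuming input `x` of
    shape [C, H, W] with even H and W."""
    x = np.einsum('kc,chw->khw', weight, x)                   # 1x1 conv
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))  # pool

rng = np.random.default_rng(1)
x = rng.standard_normal((128, 56, 56))
w1x1 = rng.standard_normal((64, 128)) * 0.01   # here also halves channels
y = transition_block(x, w1x1)                  # shape (64, 28, 28)
```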

**Figure 3.** Proposed network architecture for semantic identification based on the DenseNet model.

In our network, in order to recover the input spatial resolution, upsampling layers are implemented by transpose convolution. Each upsampled feature map is then concatenated with the feature map from the corresponding dense block in the downsampling path. Batch normalization (BN) and the Rectified Linear Unit (ReLU) are applied before each convolution.
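The size bookkeeping behind this upsampling can be checked with the standard one-axis formulas for convolution and transpose convolution; the kernel/stride values below are illustrative assumptions, not taken from the paper:

```python
def conv_out(size, kernel, stride, padding):
    """Output size of an ordinary convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def transpose_conv_out(size, kernel, stride, padding):
    """Output size of a transpose (de)convolution along one axis:
    out = (in - 1) * stride - 2 * padding + kernel."""
    return (size - 1) * stride - 2 * padding + kernel

# A stride-2 downsampling of a 224-pixel axis is exactly undone by a
# matching stride-2 transpose convolution (hypothetical 2x2 kernels).
down = conv_out(224, kernel=2, stride=2, padding=0)            # 112
up = transpose_conv_out(down, kernel=2, stride=2, padding=0)   # 224
```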

Our model can accept images of arbitrary size during inference. For convenience of training, and to ensure sufficient memory, we unified all input images to a size of 224 × 224 pixels. We cut patches of uniform size from GF-1 images and screened out those containing both water and non-water as effective training data. To ensure that the model can extract useful features directly from the original data, we did not apply any preprocessing to the input images. We used the Adam optimization algorithm to optimize the weights, with hyperparameters β1 = 0.9 and β2 = 0.999 selected as recommended by the algorithm. We trained our model in stages with an initial learning rate λ = $10^{-4}$, which was divided by 10 after 30 epochs; this initial learning rate was the best result from multiple trials. The growth rate of the network is set to 32, the weight decay to $10^{-4}$ and the Nesterov momentum to 0.9, the same as in the classic DenseNet.
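The staged learning-rate schedule can be written as a small step function; the paper states only the initial rate and the first drop, so extending the drop to every 30 epochs is our generalization:

```python
def learning_rate(epoch, base_lr=1e-4, drop_every=30, factor=10):
    """Step schedule: start at base_lr (1e-4 in the paper) and divide
    by `factor` every `drop_every` epochs. Repeating the drop beyond
    epoch 30 is an assumption; the paper specifies only the first one."""
    return base_lr / (factor ** (epoch // drop_every))

# Epochs 0-29 train at 1e-4, epochs 30-59 at 1e-5, and so on.
lrs = [learning_rate(e) for e in (0, 29, 30, 60)]
```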

To determine the number of network layers, we experimented with the number of convolutions in each dense block to find the optimal configuration. The DenseNet proposed by Huang et al. [51] provides three depths for different tasks, i.e., DenseNet121, DenseNet169 and DenseNet201. In addition to testing these three networks, we also adjusted the number of layers to find the most suitable depth for this task. We first halved the convolution layers of the first three dense blocks of DenseNet121, leaving the fourth block unchanged, which yields DenseNet79. We then halved the convolution layers of all four blocks, which yields DenseNet63. We trained these five DenseNets with different depths to compare which is best.

To make an effective comparison of the results, we use training time as one indicator to determine which network is faster and more convenient. We use precision (P), recall (R), F1 score (F1) and mean Intersection over Union (mIoU) to quantitatively measure network performance; all are based on the confusion matrix. The same indicators were used to evaluate NDWI, VGG, ResNet, SegNet, DeepLab v3+ and DenseNet. The confusion matrix evaluates the performance of a classifier and is more informative for unbalanced categories. It divides the identification results into four parts: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). The evaluation indices are calculated as follows [63]:

$$P = \frac{TP}{TP + FP} \tag{4}$$

$$R = \frac{TP}{TP + FN} \tag{5}$$

$$F_{\alpha} = \frac{(1 + \alpha^2) \times P \times R}{\alpha^2 \times P + R} \tag{6}$$

$$\text{mIoU} = \frac{TP}{TP + FP + FN} \tag{7}$$

where P is the precision and R is the recall. mIoU measures the intersection over union of the ground truth and the predicted results. Precision is the fraction of correctly identified water pixels (TP) among all pixels the model predicts as water (TP + FP); recall is the fraction of correctly identified water pixels (TP) among all actual water pixels (TP + FN). Since precision and recall are sometimes contradictory, we further use the F1 score, which takes both into consideration, to measure the accuracy of the binary model [64]:

$$\text{F1} = \frac{2 \times P \times R}{P + R} \tag{8}$$
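Equations (4)–(8) follow directly from the confusion-matrix counts; a short sketch with hypothetical counts for the water class:

```python
def segmentation_metrics(tp, fp, fn):
    """Precision, recall, F1 and per-class IoU from confusion-matrix
    counts, per Equations (4)-(8)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    iou = tp / (tp + fp + fn)
    return p, r, f1, iou

# Hypothetical counts for the water class, for illustration only.
p, r, f1, iou = segmentation_metrics(tp=90, fp=10, fn=30)
# p = 0.9, r = 0.75, f1 ~ 0.818, iou ~ 0.692
```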

The comparison results of the five networks are shown in Table 1, with the best result for each indicator in bold. Training time increases with the number of layers, but performance does not: this may be because the training samples are limited and the characteristics of water are comparatively easy to identify, so additional layers do not contribute to the results. Among the five networks, DenseNet79 performs best in recall, F1 score and mIoU. Its precision is lower than that of DenseNet169, but its training time is almost two hours shorter. DenseNet79 is therefore the most suitable for the water identification task in this study.


**Table 1.** Comparison of DenseNets with different layers. The optimal value for each metric is shown in bold. (P refers to Precision; R refers to Recall; F1 refers to F1 score and mIoU refers to mean Intersection over Union).

To verify the performance of our implementation, VGG, ResNet, SegNet and DeepLab v3+ were selected for comparison. VGG and ResNet serve as representatives of neural networks with fewer than and more than 100 layers, respectively. SegNet and DeepLab v3+ represent two segmentation structures: the encoder–decoder structure and atrous convolution. In addition, given the limitation of computational resources and the size of the training dataset, an overly powerful and complicated network is unnecessary; as the backbone of DeepLab v3+ we therefore chose MobileNet [65], which has far fewer parameters and achieves good results on our task in a shorter time.
