*2.2. Methods*

Using the digital elevation model (DEM) and pre-possessed OLI bands (LS8), three secondary bands were created: Slope, NDVI and NDWI. Figure 2 shows the secondary bands. After which these three secondary bands were stacked one by one with LS8 such that this study has three new datasets as LS8 + Slope, LS8 + NDVI, and LS + NDWI. Finally, all three were stacked with LS8 to form another new data LS8 + Slope + NDVI + NDWI.

**Figure 2.** Secondary bands (**a**) slope; (**b**) NDVI and (**c**) NDWI.

A total of 800 reference datasets were used in the whole scene. To minimize the overfitting or selection bias in the predictive ability of machine learning algorithms, the authors used ten folds cross-validation to train the model. In the process, the data is divided into ten sets, nine sets are first used to train the model and the remaining one is used for validation purposes. The process is repeated ten times with each of the ten subsamples being used for both training and validation. Thus, each observation is used exactly once for training and validation.

After forming the dataset, six machine learning methods, NB, RPART, NNET, SVM, RF, and GBM, were implemented in the CARET package of R, and were used for training and preparing the models. The advantage of the CARET package is that it provides complete machine learning processes from data preparation, training, modelling and prediction to validation. Moreover, the train function in the caret sets up a grid of tuning parameters as required for the algorithm which fits each model and calculates a resampling based performance measure. As details on these models can be widely found in the literature, the authors will be briefly describing these algorithms.

NB is a probabilistic classifier based on Bayes' theorem with the independent assumptions between predictors. It is easy to build, with no complicated iterative parameter estimation and suited particularly when the dimensionality of the inputs is high. With few tuneable parameters and fast, they end up being very useful as a quick-and-dirty baseline for a classification problem.

RPART is a type of binary tree used for classification or regression tasks. It performs a search over all possible splits by maximizing an information measure of node impurity, selecting the covariate showing the best split.

NNET in R, is a feed-forward neural network with a single hidden layer flowing left to right. Feed-forward neural networks were the first type of artificial neural network invented, and are simpler than their counterparts, recurrent neural networks. They are called feed-forward networks because information only travels forward in the network (no loops), first through the input nodes, then through the hidden nodes (if present), and finally through the output nodes. These are primarily used for supervised learning in cases where the data to be learned are neither sequential nor time dependent.

SVM is a data classification method that separates data using a hyperplane. In other words, for the given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which separates only one type of data. The SVM technique is generally useful for data that is non-regularity, which means data whose distribution is unknown.

RF is a meta estimator that fits several decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

GBM is a class of ensemble learning techniques to create a collection of shallow and weak successive trees with each tree learning and improving on the previous based on a cost function (for example, squared error).

All these six models for five different datasets were applied to the full scene to classify the image into binary water and non-water maps. After the classification, the full reference dataset was evaluated for overall accuracy (OA) and kappa coefficient (Kappa) [22].

### **3. Results and Discussion**

In this section, first, the results of the cross-validated models were assessed, and then they were applied to the whole scene for the surface water extraction. Next, the complete reference dataset was tested against the predicted results. In the end, a detailed comparison and discussion of the different machine learning algorithms were done to evaluate the performance case by case.

Figure 3 shows the boxplot of the OA and Kappa for different models based on the cross-validated data for five different multiband data. At first glance, it can be seen that, other than RPART and ND, four algorithms showed quite high OA and Kappa in the cross-validation. In the case with only LS8 multiband data, NNET and RF were able to achieve maximum OA and Kappa. After adding the slope, NNET and RF produced the model with maximum OA and Kappa. However, adding NDVI and NDWI with the LS8 bands, only NNET achieved improved OA and Kappa. Moreover, it can be clearly seen that the OAs achieved, with all the machine learning algorithms, narrow ranges with the addition of NDWI, followed by NDVI and the slope. The mean of OA and Kappa shifted towards higher values after adding only the slope and all three secondary bands for all algorithms, except RPART.

**Figure 3.** *Cont*.

**Figure 3.** The boxplot of overall accuracy (OA) and the Kappa coefficient (Kappa) using different machine learning algorithms for provided multiband data: (**a**) LS8; (**b**) LS8 + Slope; (**c**) LS8 + NDVI; (**d**) LS8 + NDWI; and (**e**) LS8 + Slope + NDVI + NDWI.

In the CARET package, the train function produces the model with tuning parameters. Therefore, the produced models can be easier to apply in the prediction. Figure 4 shows the predicted result of all the algorithm models for all the multiband data. Based on the visual inspection, all the algorithms performed well in the lower flat and hilly regions compared to the Himalayas. Adding the slope made the result visually better in most algorithms, except the RPART. Similarly, adding only NDWI and adding all three secondary bands produced visually better results. In contrast, NDVI only improved with RF and GBM algorithms and others produced many misclassified surface water bodies. For further evaluation on how well all these methods have performed in the scene, eight different types of cases were carefully analyzed and compared for each algorithm in the section below.

**Figure 4.** *Cont*.

**Figure 4.** Results of surface water extraction using different machine learning methods (**a**–**f**) for provided multiband data (**1**–**5**).

For the quantitative evaluation, the confusion matrix-based OA and Kappa were produced for the full reference class and predicted class. In Figure 5, both OA and Kappa shows similar patterns for all the algorithms in response to the addition of the secondary band. RF shows both OA and Kappa 1 for all multiband data. Following this, GBM, NNET and SVM performed well. However, the NB and RPART performance were among the worst. In NB, LS8 and LS + NDWI data did well compared to others. The addition of NDVI or the slope or all three secondary bands decreased the performance. While in the case of RPART, the addition of one or all three bands improved the OA and Kappa. In NNET, the addition of all bands increased the performance compared to each secondary band, while it was opposite for the SVM i.e., a decrease in performance. In GBM, adding NDWI boosted the performance the highest, and the slope and all three did well, but the addition of NDVI decreased the performance.

**Figure 5.** (**a**) Overall accuracy (OA) and (**b**) Kappa coefficient (Kappa) of all the algorithms for provided multiband data.

Except RF and NNET, similar patterns of performance were seen. NDWI followed by the slope, increases the performance of the machine learning algorithms. In contrast, NDVI or adding all bands reduced the performance. It shows that adding specific secondary bands with the original LS8 band is useful for enhancement of surface water. Similarly, the slope that ensures flat surface of water bodies is also useful if added. In NDVI, water are negative values at the same time that many other non-vegetative bodes can also have the same value, which has led to a decrease in performance. NNET and SVM are very well-known state-of-art algorithms in classification. They perform well in many cases and their performance in the original Landsat 8 scene of Nepal is satisfactory. However, the interesting result is that while NNET increased the performance with that addition of data gradually, SVM went the other side. This also shows that the addition of secondary data does not necessarily improve performance and are rather dependent on algorithms. Thus, comparative studies are necessary to check these performances and select the best one for the area under study.

Figure 6 shows the cases for NB in the entire multiband, along with the slope and results from the previous study, i.e., conditioning NDVI and NDWI with elevation (Elev\_NDWnVI) [13]. Cases a, g and h were the ones representing snow and melting ice in that Himalayas, in which NB performed very bad and most of the shadowed regions were misclassified. It performed well in cases from b to f, and even did well than in the previous study in case c, i.e., narrow river channels with shadows. It was able to classify non-water shadow features well. In the case of additional band performance, all the results in the cases are similar with no distinct difference. As per the RPART performance in cases (Figure 7), it was able to remove the shadow issues in the narrow river with shadows in case b. It performed well in cases of wide shadows, ponds and wide rivers. However, it was not able to separate shadows of the Himalayas in cases a, g and h. Further, adding the slope somewhat improved these cases but failed in the case of the shadow in case b. However, adding NDVI and NDWI were unable to improve in the shadows.

**Figure 6.** The comparison of NB results of special cases of surface water in the test scene (Figure 1) for different multiband data used.

**Figure 7.** The comparison of RPART results of special cases of surface water in the test scene (Figure 1) for different multiband data used.

In Figure 8, NNET showed gradual improvement in identifying water bodies with the addition of secondary bands. NDWI enhanced the shadows as water compared to NDWI and all three together. For case b, the addition of NDVI misclassified few cloud shadows while others did not. For case c, all the dataset were able to solve the shadow issue from the previous shadow in the hills, except the

NDVI. For the plain area in case d with the narrow river, NNET results were good in all the dataset with even enhanced small water bodies. In case e for urban small water bodies, LS8 predicted water and some shadows but with the addition of secondary bands, the pure pixels were enhanced. With the addition of all three bands, it refined the water bodies very well. For a large river in case f, the results are similar to each other and with the previous study [13]. For cases g and h, the results are different for every dataset. The slope misclassified the melting ice, NDVI misclassified the shadows and combined, showed both misclassified. LS8 only and the addition of NDWI were somewhat better compared to others. With the SVM in Figure 9, the SVM showed similar results to NNET for most cases. However, SVM misclassified the darkest shadows as water in the whole scene. It is clearly seen in case a, b, g and h, especially with the addition of the slope. In case a, with LS8 and addition of NDVI and NDWI, even light shadows were misclassified.

**Figure 8.** The comparison of NNET results of special cases of surface water in the test scene (Figure 1) for different multiband data used.

**Figure 9.** The comparison of SVM results of special cases of surface water in the test scene (Figure 2) for different multiband data used.

As both RF and GBM are ensemble methods, their performance on the cases are quite similar in Figures 10 and 11 respectively. Both methods were successful in separating shadows in case c, but failed in cases a and g. In addition, with the addition of the slope only, the results were better than the original LS8. However, the addition of NDVI or NDWI did not perform well in the shadow areas. Only the addition of the slope as a secondary band seems to be a good choice rather than adding all for both RF and GBM algorithms in surface water mapping.

**Figure 10.** The comparison of RF results of special cases of surface water in the test scene (Figure 1) for different multiband data used.

**Figure 11.** The comparison of GBM results of special cases of surface water in the test scene (Figure 1) for different multiband data used.

In comparison to the previous study [13], except NB and RPART, all other results are above 90% and even up to 100% as seen in the literature [16,23–25]. In addition, for a hilly test scene in the study done by Jiang et al., [15] using a multilayer perceptron neural network in Landsat 8 OLI satellite imagery, the neural network performance was similar to this work, i.e., OA 98.50 and Kappa 0.970. Compared to our previous study [13], the results were good for hilly and lower flatlands where there is no snow. The performance of surface water detection for narrow rivers in hilly regions with shadows improved, however, they were only able to detect pure water pixels. The main issues were the ice and snow with shadows in hilly areas. Adding the slope somewhat improved cases in the Himalayas, but not the NDVI or NDWI. In a case by case evaluation also, the performance seem well but were mostly misclassified in the Himalayas. Quantitatively, machine learning algorithms were much better compared to the index methods, however, the results are not as reliable as it should be for the whole scene. In the full scene, very few areas are covered by the snow compared to the large hilly and flat lands. Nevertheless, Nepal is a mountainous country and has a quantitatively large cover, which could be a challenge in applying machine learning algorithms. Thus, the wrong predictions in those areas are less significant in the validation. A further investigation with cutting-edge machine learning technology, i.e., convolutional neural networks or deep learning could be undertaken for improvement.
