#### 3.1.3. K-Nearest Neighbors

The KNN algorithm classifies a data point using the properties of its neighboring data points [20]. Efficient implementations often build on the ball tree concept [24], which scales to higher dimensions. The algorithm is widely used in LSM applications [25], and the probability of a data point belonging to a class is determined by the classes of its nearest neighbors [26]: the data point is assigned the class held by the majority of its neighbors. The value of K should be determined through a tuning process for better results.

KNN is classified as a non-parametric model, as the computation process does not depend on the distributions of the dataset. This is an advantage for LSM applications, where the number of parameters is large and the data seldom fit standard distributions. For each unclassified point, the algorithm calculates the distance to every classified point to find the K closest neighbors. The classes of these neighbors are then used for voting, and the class with the maximum votes is assigned to the unclassified data point.
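As a minimal sketch (not the implementation used in this study), the distance computation and majority vote described above can be written as:

```python
import math
from collections import Counter

def knn_classify(train, labels, query, k=3):
    """Assign `query` the majority class among its k nearest neighbors."""
    # Euclidean distance from the query to every classified point
    dists = sorted((math.dist(p, query), c) for p, c in zip(train, labels))
    # Vote among the k closest neighbors
    votes = Counter(c for _, c in dists[:k])
    return votes.most_common(1)[0][0]

# Toy example: two clusters in a two-dimensional feature space,
# 1 = landslide, 0 = no landslide (values are illustrative)
train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_classify(train, labels, (4.5, 5.0), k=3))  # → 1
```

An odd K avoids ties in two-class voting; the tuning process mentioned above would vary K and compare validation accuracy.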

#### 3.1.4. Random Forest

As the name indicates, RF is a combination of many Decision Trees (DT), a concept developed in 1995 [27]. Each DT has nodes and branches: decisions are made at the nodes, and the classification continues along a particular branch based on each decision. The decisions proceed by considering all LCFs, and each DT assigns a class to the object. RF then considers the classes predicted by all DTs and assigns a class to the object by voting. Each DT is trained on a subset of the whole dataset, sampled independently by bootstrapping. The randomness of selection at each node is the major advantage of the RF model, which often results in highly accurate predictions, making it suitable for LSM [21,28–30].

The use of splitting at nodes, bootstrapping, and the combination of several trees reduces overfitting in RF by increasing randomness. The model can be fine-tuned by varying the depth of the trees, the number of trees combined, and the number of features considered at each node.
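The three tuning knobs mentioned above map directly onto the parameters of a typical RF implementation. The following sketch uses scikit-learn with synthetic data standing in for the LCF table; the values and parameter settings are illustrative, not the configuration used in this study:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the LCF table: 200 cells x 4 conditioning factors,
# labelled 1 (landslide) whenever the first factor exceeds a threshold
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees combined by voting
    max_depth=5,          # depth of each tree
    max_features="sqrt",  # features considered at each node split
    bootstrap=True,       # each tree sees an independent bootstrap sample
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))
```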

#### 3.1.5. Support Vector Machines

The SVM algorithm, first proposed by Vapnik and Lerner [31–33], classifies a data point using a hyperplane in a multidimensional space. The hyperplanes are boundaries that decide the classification of an object, and the number of LCFs used for the analysis determines the dimensionality of the hyperplane. For each dataset, multiple hyperplanes that classify the points into different classes are possible. Hence, the SVM algorithm chooses the hyperplane which maximizes the distance between the data points of both classes, using statistical learning theory [31,34]. The distance is maximized in order to accommodate future data points. The data points located closest to the hyperplane determine the orientation and position of the hyperplane; these data points are called support vectors.

The SVM algorithm classifies objects using different kernel functions, and the choice of kernel function is critical to the results produced by the algorithm. The algorithm is widely used for LSM applications [29,35] and has been in practice since the 2000s [34].
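The influence of the kernel choice can be illustrated with a small scikit-learn sketch on synthetic data that no linear hyperplane can separate (illustrative only; not the study's data or settings):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Ring-shaped classes: no single linear hyperplane separates them
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # maps the data to a higher-dimensional space
print(linear.score(X, y), rbf.score(X, y))
```

The RBF kernel recovers the circular boundary that the linear kernel cannot, which is why kernel selection is treated as a critical modelling decision.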

#### *3.2. Data Collection and Sampling Strategies*

The landslide inventory map for the study was prepared manually by interpreting satellite images taken before and after the event. A total of 388 landslides which occurred in 2018 were identified within the boundary of Wayanad. The 2018 disaster was chosen for the study as the district was widely affected by this particular event: locations with historical landslide reports were affected again, and many new landslides were also reported. Two datasets were prepared from the collected landslide data (Figure 2). In the first approach, each landslide was represented by a point in the crown area; in the second, the shape of each landslide was demarcated using pre- and post-event satellite images, and the polygon was marked as inventory data. The district was highly affected by long runout debris flows, as 309 of the 388 events were classified as debris flows. Of the remaining events, 68 were shallow landslides and 11 were rock falls or rockslides. The 388 landslides were represented by 388 cells under the first sampling strategy (Figure 2a) and by 9431 cells under the second (Figure 2b). The developed landslide susceptibility map thus provides the probability of occurrence of any of these landslide typologies in the region; it is not specific to any single typology. Debris flows have very long runout distances [16], and even locations a few kilometers away from the crown points, with entirely different LCFs, were affected. Hence, using point data for training and testing the model might ignore the probability of hazards occurring in the runout zones. To avoid this issue, polygon inventory data was also used in the analysis. The polygon data represents all the cells affected by landslides, unlike the single point used in the first approach. However, the polygon does not differentiate between the crown area and the runout zone.

The objective is to train the ML model to predict the probability of landslide occurrence in each cell, and the focus of this manuscript is to compare the probabilities predicted by the different approaches. The methodology does not differentiate between the crown and the landslide body; it checks only whether a cell is affected by a landslide.

**Figure 2.** Different sampling strategies adopted in this study: (**a**) point data, and (**b**) polygon data.
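The difference between the two sampling strategies can be sketched on a toy raster grid; the grid size, crown-cell indices, and rectangular landslide extents below are hypothetical, and real polygon outlines would be rasterized from the mapped shapes:

```python
import numpy as np

# Hypothetical 10 x 10 grid of cells covering part of the study area
grid = np.zeros((10, 10), dtype=int)

# Point strategy: one cell per landslide (the crown cell)
crown_cells = [(2, 3), (7, 7)]
point_labels = grid.copy()
for r, c in crown_cells:
    point_labels[r, c] = 1

# Polygon strategy: every cell inside the mapped outline, crown and runout
# alike (outlines approximated here by rectangular extents for simplicity)
polygon_extents = [((2, 3), (5, 4)), ((7, 7), (9, 9))]  # (top-left, bottom-right)
polygon_labels = grid.copy()
for (r0, c0), (r1, c1) in polygon_extents:
    polygon_labels[r0:r1 + 1, c0:c1 + 1] = 1

print(point_labels.sum(), polygon_labels.sum())  # → 2 17
```

The polygon strategy marks many more cells per event, which is how 388 landslides yield 9431 inventory cells in the second approach.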

The DEM for the study was collected from the Advanced Land Observing Satellite Phased Array type L-band Synthetic Aperture Radar (ALOS PALSAR) [36], with a resolution of 12.5 m. All the other layers were prepared at the same resolution as the DEM, and all GIS operations were carried out using QGIS version 3.10. LCFs such as slope, aspect, Stream Power Index (SPI), and Topographic Wetness Index (TWI) were derived from the DEM. The first LCF used in this study was elevation, which was obtained directly from the DEM. Slope angle is another significant factor which is critical in triggering landslides. Slope is defined as the ratio of the vertical to the horizontal distance between two points, expressed as the tangent angle in degrees; its value may vary between 0 and 90 degrees. The orientation of the sloping face, expressed as a direction, is termed aspect. Previous studies have found that the value of aspect is critical when landslides occur after the formation of tension cracks in clay [37], and hence it is considered as an LCF. The value of aspect ranges from 0 to 360 degrees and is classified into 9 categories based on orientation.
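As an illustration of how slope and aspect follow from the DEM, a plain finite-difference sketch is given below (GIS packages such as QGIS use comparable neighbourhood operators, e.g. Horn's method; the aspect convention here assumes row 0 of the array faces north):

```python
import numpy as np

def slope_aspect(dem, cell_size=12.5):
    """Slope (degrees) and aspect (degrees clockwise from north) from a DEM.

    A simple finite-difference sketch; assumes row 0 of the array is north.
    """
    dz_dy, dz_dx = np.gradient(dem, cell_size)  # rates of change per metre
    slope = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
    aspect = np.degrees(np.arctan2(-dz_dx, dz_dy)) % 360.0
    return slope, aspect

# A planar surface dropping 1 m per cell (12.5 m) toward the east
dem = np.tile(np.arange(5, 0, -1, dtype=float), (5, 1))
slope, aspect = slope_aspect(dem)
print(round(slope[2, 2], 2), aspect[2, 2])  # ≈ 4.57 degree slope, aspect 90 (east)
```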

The drainage map for the district was also prepared using the DEM. The locations of the streams were verified using satellite images and were used for calculating the distance-from-streams layer, which is considered an LCF. The observation from the inventory data was that many of the long runout debris flows occurred near the local streams. The DEM was also used to create the flow accumulation map, and the SPI and TWI layers were developed from the flow accumulation values. Both SPI and TWI are significant in the initiation of landslides: SPI represents the power of flowing water to erode material. As SPI values range over multiple orders of magnitude, the natural logarithm of SPI was used for the calculation. TWI indicates the wetness of a location, quantifying the topographic control on different hydrological processes.
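Under the common definitions SPI = A_s·tan β and TWI = ln(A_s/tan β), where A_s is the specific catchment area obtained from flow accumulation and β is the local slope, the two indices can be sketched as follows (the cell size matches the 12.5 m DEM; the guard values for flat or zero-accumulation cells are illustrative choices, not the study's):

```python
import numpy as np

def spi_twi(flow_acc, slope_deg, cell_size=12.5):
    """ln(SPI) and TWI from flow accumulation (in cells) and slope (degrees)."""
    a_s = flow_acc * cell_size       # specific catchment area A_s
    tan_b = np.tan(np.radians(slope_deg))
    tan_b = np.maximum(tan_b, 1e-6)  # guard: flat cells
    a_s = np.maximum(a_s, 1e-6)      # guard: zero accumulation
    ln_spi = np.log(a_s * tan_b)     # natural log of SPI, as used in the text
    twi = np.log(a_s / tan_b)
    return ln_spi, twi

# e.g. a cell draining 100 upstream cells on a 30-degree slope
ln_spi, twi = spi_twi(np.array([100.0]), np.array([30.0]))
print(ln_spi[0], twi[0])
```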

The Normalized Difference Vegetation Index (NDVI) is considered to be an important LCF, as it indicates the amount of greenness of a location [38]. Higher NDVI values represent the presence of vegetation [39,40] and can be correlated with canopy cover [23]. Thus, NDVI values are maximum for forest regions and minimum for water bodies and non-vegetated surfaces. Most of the landslides occurred within the forest region itself, and the long runout debris flows originated in the forest area. The net cropped area is 1129.76 km², and a major share of the cropped area is used for perennial crops such as coffee, arecanut, and coconut [41]. Cash crops such as coffee and tea, and spices such as cardamom, are widely cultivated along the hill slopes, while the other crops are cultivated in flatter areas. The NDVI value is calculated from two bands of the electromagnetic spectrum, the Near Infra-Red (NIR) and Red (R) bands [42]. For Landsat 8 images, Band 5 represents NIR and Band 4 represents R. For this study, the NDVI values were calculated from Landsat 8 images captured in December 2017 and January 2018. As a major share of the cultivated area is dedicated to perennial crops, the collected images satisfactorily represent the conditions at the time of the landslides. From the collected images, NDVI is derived using the following formula:

$$NDVI = \frac{Band\ 5 - Band\ 4}{Band\ 5 + Band\ 4} \tag{4}$$
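Applied to Landsat 8 reflectance arrays, Equation (4) can be sketched as below; the reflectance values are made up for illustration, and the small guard on the denominator is an assumption to avoid division by zero, not part of the original formula:

```python
import numpy as np

def ndvi(band5_nir, band4_red):
    """NDVI from Landsat 8 Band 5 (NIR) and Band 4 (Red), per Equation (4)."""
    nir = np.asarray(band5_nir, dtype=float)
    red = np.asarray(band4_red, dtype=float)
    return (nir - red) / np.maximum(nir + red, 1e-9)  # guard against 0/0

# Illustrative reflectances: dense vegetation, mixed cover, water
nir = np.array([0.5, 0.4, 0.1])
red = np.array([0.1, 0.2, 0.3])
print(ndvi(nir, red))  # high over vegetation, negative over water
```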

The rainfall data for the Wayanad district was collected from the Indian Meteorological Department (IMD) [43]. The data from four different rain gauge stations from 2010 to 2018 were interpolated using the inverse distance weighted (IDW) method to get the average annual rainfall values across the district.
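A minimal sketch of an IDW estimate at a single location is given below; the gauge coordinates, rainfall totals, and power parameter are hypothetical, not the IMD data or the study's settings:

```python
import numpy as np

def idw(station_xy, values, query_xy, power=2):
    """Inverse distance weighted estimate at `query_xy` from gauge readings."""
    d = np.hypot(*(np.asarray(station_xy, float) - np.asarray(query_xy, float)).T)
    if np.any(d == 0):                 # the query coincides with a gauge
        return float(values[np.argmin(d)])
    w = 1.0 / d ** power               # nearer gauges weigh more
    return float(np.sum(w * values) / np.sum(w))

# Hypothetical gauge locations (km) and average annual rainfall (mm)
stations = [(0, 0), (10, 0), (0, 10), (10, 10)]
rain = np.array([3000.0, 3400.0, 2800.0, 3200.0])
print(idw(stations, rain, (5, 5)))  # equidistant gauges give the plain average
```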

The geology, geomorphology, road network, and lineaments of the district were collected from maps published by the Geological Survey of India (GSI). The lineaments and roads were first rasterized and then used to develop the distance rasters, which were used as LCFs. The geology and geomorphology layers were classified and rasterized. The geology was classified into seven groups: migmatite complex, charnockite, younger intrusive, basic intrusive, Wayanad group, acid intrusive, and peninsular gneissic complex (Figure 3). Geomorphologically, the region was classified into four categories: highly dissected hills and valleys, moderately dissected hills and valleys, low dissected hills and valleys, and pediment complex. The collected layers are shown in Figure 3. The layers were then further processed to prepare the database for LSM.

**Figure 3.** Different LCFs used for LSM: (**a**) elevation, (**b**) slope, (**c**) aspect, (**d**) distance from lineaments, (**e**) distance from streams, (**f**) distance from roads, (**g**) geomorphology, (**h**) geology, (**i**) rainfall, (**j**) NDVI, (**k**) ln SPI, and (**l**) TWI.

The processing of the different LCFs is depicted in detail in Figure 4. The processing differs for raster and vector layers: the vector layers are first rasterized and then converted to XYZ format. For roads, streams, and lineaments, the distance from each feature is first calculated, and the resulting distance rasters are used as LCFs (Figure 3).

**Figure 4.** Schematic representation of dataset preparation from different spatial layers collected.

After preparing the landslide inventory data, an equal number of no-landslide points were also prepared for training and testing under both sampling methods (point and polygon data). Landslide cells are represented by 1 and no-landslide cells by 0 in the dataset. The data from all LCFs were then extracted for the landslide and no-landslide points to develop the training and testing datasets. The derived model was later applied to the whole dataset to develop the landslide susceptibility map for the study area.
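The balanced sampling described above can be sketched as follows; the cell count, factor count, inventory indices, and LCF values are synthetic placeholders standing in for the real rasters:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the stacked LCF rasters: 1000 cells x 12 factors (synthetic)
n_cells, n_factors = 1000, 12
lcf_table = rng.random((n_cells, n_factors))

# Cells marked 1 in the inventory (hypothetical indices)
landslide_cells = rng.choice(n_cells, size=50, replace=False)

# Sample an equal number of no-landslide cells from the remaining cells
remaining = np.setdiff1d(np.arange(n_cells), landslide_cells)
no_landslide_cells = rng.choice(remaining, size=len(landslide_cells), replace=False)

cells = np.concatenate([landslide_cells, no_landslide_cells])
X = lcf_table[cells]                            # LCF values for each sampled cell
y = np.r_[np.ones(50, int), np.zeros(50, int)]  # 1 = landslide, 0 = no landslide
print(X.shape, int(y.sum()))  # → (100, 12) 50
```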

#### *3.3. K-Fold Cross Validation and Data Splitting*

Validation techniques are used to evaluate the performance of ML models. When the dataset is limited, cross validation is often adopted to overcome the limitations associated with dataset size. For k-fold cross validation, the value of k is the only input required, and the dataset is divided into k different subsets or folds (Figure 5). Of these, k−1 folds are used for training the model and the remaining fold for testing. The process is repeated k times so that each subset is used once for testing.

**Figure 5.** k-fold cross validation represented graphically.

The value of k decides the train-to-test ratio of the validation and, in most studies, k is chosen as 5 (train-to-test ratio 80:20) or 10 (train-to-test ratio 90:10) [44]. However, detailed studies on the performance of cross validation suggest that repeated cross validation should be carried out to determine the optimum value of k [12].
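As one common realization of this scheme (scikit-learn's KFold; the specific tooling used for splitting in this study is not stated here), k = 5 yields five rounds, each with an 80:20 train-test split:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (toy data)

# k = 5 gives an 80:20 train-test split in each of the five rounds
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(fold, len(train_idx), len(test_idx))  # 8 training and 2 test samples
```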
