## **1. Introduction**

Catastrophic landslides in mountainous terrain interact with the human environment and adversely affect lives and property [1]. Tools for managing landslide risk have been the subject of several decades of research [2,3]. Mapping the spatial distribution of landslide hazard is one of the most widely adopted strategies for risk management, as landslide susceptibility maps can be used by the government for strategic planning and development [4]. With recent advancements in Machine Learning (ML) techniques and computational facilities, Landslide Susceptibility Mapping (LSM) has become much easier.

Data-driven methods are extensively used for LSM, and the earlier statistical methods using Geographical Information System (GIS)-based approaches are now being replaced by advanced ML algorithms. Different ML algorithms are widely used for this purpose [5], and the literature shows that no single ML algorithm can be said to be the best for LSM. The choice of an ML algorithm for a particular region depends on the scientific goals and objectives of the LSM [5]. Five different algorithms are considered in this study, viz., Naïve Bayes (NB), Logistic Regression (LR), K Nearest Neighbors (KNN), Random Forest (RF), and Support Vector Machines (SVM). All of these algorithms are popular in LSM, but the best-suited model for each scenario has to be decided by a quantitative comparison of model performance. The data used for training and testing an ML algorithm should be prepared with the utmost care, as data quality is the key factor that determines the performance of any ML model. The data include the landslide inventory and the Landslide Conditioning Factors (LCFs). The LCFs are selected considering the topographical and meteo-geological conditions of the study area, and most conditioning factors are derived from Digital Elevation Models (DEMs), satellite data, and existing regional maps. In most cases, landslide inventories are obtained from satellite images and field investigations [6].

**Citation:** Abraham, M.T.; Satyam, N.; Lokesh, R.; Pradhan, B.; Alamri, A. Factors Affecting Landslide Susceptibility Mapping: Assessing the Influence of Different Machine Learning Approaches, Sampling Strategies and Data Splitting. *Land* **2021**, *10*, 989. https://doi.org/10.3390/land10090989

Academic Editors: Enrico Miccadei, Cristiano Carabella and Giorgio Paglia

Received: 26 August 2021 Accepted: 17 September 2021 Published: 19 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Even though the quality of LCFs is generally satisfactory, with good-resolution DEMs available from satellite-based missions such as TanDEM-X and ALOS, the landslide inventories are often incomplete [7]. The quality of a landslide inventory depends on its positional accuracy and the sampling strategy. In many studies, the inventories are prepared using points representing landslide crowns. The training and testing data for LSM are prepared by extracting the values of all LCFs at the landslide points. Hence, the positional accuracy of the inventory significantly affects the dataset used for training and testing. When the region is affected by shallow landslides only, the crown point provides a satisfactory representation of the landslide-affected area. However, when a region is affected by long-runout landslide events, such as debris flows and avalanches, the runout zones cannot be represented using single-point information [7], and the events can cause adverse effects in downslope areas [1,8]. The LCFs of the initiation zones and runout zones are entirely different, and a model trained using only the initiation zones will ignore the runout zones that may be affected by landslides [9,10]; however, in most studies, landslides are represented using point data because of limitations in data availability [11]. Hence, in this study, both point data (a single point at the crown of the landslide) and polygon data (a cluster of points covering the area affected by the landslide) are used for LSM. Each point in the cluster represents a cell in the landslide body. The difference between the two approaches is that the point data consider only the crown area, while the polygon data consider the whole area affected by landslides, including the crown and the runout zone.
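The contrast between point and polygon sampling can be illustrated with a minimal sketch. The rasters, their values, the landslide geometry, and the helper `sample` are all hypothetical and are not taken from the study; the sketch only shows how a single crown cell and the full set of landslide-body cells yield different training rows from the same conditioning-factor grids.

```python
# Hypothetical 4x4 rasters of two conditioning factors (all values are made up).
slope = [
    [35, 30, 22, 10],
    [33, 28, 18,  8],
    [25, 20, 12,  6],
    [15, 12,  8,  4],
]
elevation = [
    [900, 880, 850, 820],
    [890, 860, 830, 800],
    [870, 840, 810, 780],
    [850, 820, 790, 760],
]
factors = {"slope": slope, "elevation": elevation}

# One hypothetical debris flow: its crown cell and every cell of its body,
# given as (row, col) raster indices.
landslides = [
    {"crown": (0, 0), "cells": [(0, 0), (1, 1), (2, 2), (3, 3)]},
]


def sample(cells):
    """Return one feature row (a dict of factor values) per raster cell."""
    return [{name: grid[r][c] for name, grid in factors.items()} for r, c in cells]


# Point sampling: one row per landslide, taken at the crown only.
point_samples = sample([ls["crown"] for ls in landslides])

# Polygon sampling: one row per cell of the landslide body (crown and runout).
polygon_samples = sample([cell for ls in landslides for cell in ls["cells"]])

print(len(point_samples), len(polygon_samples))
print(point_samples[0]["slope"], min(row["slope"] for row in polygon_samples))
```

In this toy case the point sample sees only the steep crown cell, whereas the polygon sample also includes the gentler cells along the runout path, which is the kind of information a crown-only inventory leaves out.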

The resampling technique of cross validation is a recent advancement in ML, applied to test cases with limited data samples [5]. k-fold cross validation is widely used for LSM applications: the data are split into k parts and internally resampled such that k − 1 parts are used for training and 1 part for testing at each stage of sampling. Even though the method is widely used for validation, there are no guidelines for choosing the value of k, and, in most studies, the value is chosen arbitrarily as 5 or 10 [12]. The value of k determines the ratio of training to testing data, which can affect the performance of the ML model. Hence, in this study, the value of k is varied from 2 to 10 in order to find the optimum value of k for each algorithm.
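The dependence of the train-to-test ratio on k can be made concrete with a small sketch in plain Python. The function `kfold_indices` and the sample count of 1000 are illustrative assumptions, not the authors' implementation or dataset; in practice the samples would usually be shuffled first, and a library routine such as scikit-learn's `KFold` would typically be used.

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, non-overlapping folds."""
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


n_samples = 1000  # hypothetical number of labelled cells

# Fraction of the data used for training in the first fold, for k = 2 ... 10.
train_fraction = {}
for k in range(2, 11):
    train, test = next(kfold_indices(n_samples, k))
    train_fraction[k] = round(len(train) / n_samples, 3)

for k, frac in train_fraction.items():
    print(f"k={k:2d}  train fraction={frac:.3f}  test fraction={1 - frac:.3f}")
```

With k folds, roughly (k − 1)/k of the data are used for training at each stage, so k = 2 corresponds to a 50:50 split, k = 5 to 80:20, and k = 10 to 90:10.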

To test the objectives, the Wayanad district in Kerala, India, was selected as the test site. The district suffered a number of landslides after the incessant rains of the 2018 monsoon, and the landslide inventory data of 2018 were used for LSM.

## **2. Study Area**

The Wayanad district is in the southern part of India (Figure 1) and belongs to the Western Ghats, the most prominent orographic feature of peninsular India. The district is highly prone to landslides [13,14] and has a total area of 2130 km², of which 40% is covered by forests. The topography is mostly a plateau sloping towards the east, as this hilly district is located at the southern tip of the Deccan plateau. A major share of the district drains into the east-flowing river Kabani and its tributaries (Figure 1). The natural drainage system is constituted by a number of streams, rivulets, and small springs, and the district landscape, with its flood plains and ridges, is formed by this drainage system. Many debris flows that have occurred in the district have runout distances of a few hundred meters, and the longest extends up to 3 km. All these slides have contributed to the process of landscape evolution in the district, and minor-order streams have originated along the debris flow paths. Thus, the development of drainage paths and watersheds is closely related to the occurrence of landslides, especially debris flows, in the region. The flood plains are formed by alluvial deposits with a thickness of more than 10 m. The northwest, southwest, and western parts of the region are formed by higher-elevation hill ranges, with steep slopes and a rugged topography. Most of the forest areas are also along these hilly regions. The continuous erosion, transportation, and deposition of the rocks have resulted in the formation of valleys between the hill ranges. The long-runout debris flows that are common in the region also contribute to this process of landscape evolution.

Geologically, the district is composed of a peninsular gneissic complex, the charnockite group, the Wayanad group, and the migmatite complex [15]. Bands of the Wayanad group are found in the northern part of the district, while the rocks of the south and southeast are formed by the charnockite group [15]. The north-central part is composed of the peninsular gneissic complex, and the south-central part of the migmatite complex.

**Figure 1.** Location map of Wayanad.

A major share of the district is covered by reddish-brown lateritic soil with a high fines content. The forest zones are covered by forest soil with rich organic content, and the riverbanks are formed by thick alluvial deposits. The large regolith thickness often leads to bed erosion and bulking of landslides, which increases the landslide volume and destructive potential [16].

The district is highly affected by geohazards such as landslides and floods due to its topographic and geomorphological conditions. The highly dissected hills and valleys along the west, northwest, and southwest parts of the district are highly prone to landslides. During August 2018, the district was affected by a number of landslides due to torrential rains [17]. A total of 388 landslides (Figure 1) were mapped within the district using government reports and pre- and post-event satellite images from Google Earth, and were verified using a recently published dataset [18]. The inventory data were prepared separately for LSM and differ from the dataset used in a previous study by the authors [13], in which rainfall thresholds were derived for the region. For deriving rainfall thresholds, multiple landslides occurring on the same day were considered to be a single landslide event, and approximate locations were used, as the focus was on the day of occurrence of the landslide event. However, the inventory data for LSM need to be spatially accurate, as the spatial distribution of landslides is more important than their time of occurrence. Hence, the high-resolution satellite images available from Google Earth were utilized to prepare a separate landslide inventory database of 388 landslides that occurred in 2018 alone. The district faced major setbacks during the disaster, and catastrophic landslides recurred in 2019 and 2020. The increasing frequency of landslides in the district calls for an updated landslide susceptibility map using data-driven approaches.
