Landslide Susceptibility Mapping Based on Interpretable Machine Learning from the Perspective of Geomorphological Differentiation

Sun, Deliang; Chen, Danlu; Zhang, Jialan; Mi, Changlin; Gu, Qingyu; Wen, Haijia

doi:10.3390/land12051018


by Deliang Sun ¹, Danlu Chen ¹, Jialan Zhang ², Changlin Mi ³, Qingyu Gu ¹ and Haijia Wen ²,*

¹ Key Laboratory of GIS Application Research, School of Geography and Tourism, Chongqing Normal University, Chongqing 401331, China

² National Joint Engineering Research Center of Geohazards Prevention in the Reservoir Areas, Key Laboratory of New Technology for Construction of Cities in Mountain Area, School of Civil Engineering, Chongqing University, Chongqing 400045, China

³ Linyi Natural Resources Development Service Center, Linyi 276000, China

* Author to whom correspondence should be addressed.

Land 2023, 12(5), 1018; https://doi.org/10.3390/land12051018

Submission received: 12 March 2023 / Revised: 20 April 2023 / Accepted: 2 May 2023 / Published: 5 May 2023

(This article belongs to the Topic Landslides Analysis and Management: From Data Acquisition to Modelling and Monitoring II)


Abstract:

(1) Background: The aim of this paper was to study landslide susceptibility mapping based on interpretable machine learning from the perspective of geomorphological differentiation. (2) Methods: Three counties (Chengkou, Wushan and Wuxi) in northeastern Chongqing, delineated as the corrosion layered high and middle mountain region (Zone I), and three counties (Wulong, Pengshui and Shizhu) in southeastern Chongqing, delineated as the middle mountainous region of strong karst gorges (Zone II), were selected as the study area. A Bayesian optimization algorithm was used to optimize the parameters of the LightGBM and XGBoost models and to construct evaluation models for each of the two regions. The model with the higher accuracy according to the evaluation indicators was selected to produce the landslide susceptibility maps. The SHAP algorithm was then used to explore the landslide formation mechanisms of the different landforms from both a global and a local perspective. (3) Results: The AUC values for the test set in the LightGBM model for Zones I and II are 0.8525 and 0.8859, respectively, and those for the test set in the XGBoost model are 0.8214 and 0.8375, respectively. This shows that LightGBM has a higher prediction accuracy than XGBoost for both landforms. Under the two landform types, elevation, land use, incision depth, distance from roads and average annual rainfall were the common dominant factors contributing most to decision making; the distance from faults and the distance from rivers have different degrees of influence under different landform types. (4) Conclusions: The optimized LightGBM-SHAP model is suitable for the analysis of landslide susceptibility in two types of landscapes, namely the corrosion layered high and middle mountain region, and the middle mountainous region of strong karst gorges, and can be used to explore the internal decision-making mechanism of the model at both the global and local levels, which makes the landslide susceptibility prediction results more realistic and transparent. This is beneficial to the selection of a landslide susceptibility index system and the early prevention and control of landslide hazards, and can provide a reference for the prediction of potential landslide hazard-prone areas and for interpretable machine learning research.

Keywords:

landslide susceptibility mapping; different landform types; LightGBM; XGBoost; SHAP

1. Introduction

Landslides are a highly destructive natural disaster that seriously affects the safety of human life and property, as well as social and economic development [1,2]. Landslides account for nearly 9% of the total number of natural disasters worldwide [3], and China is one of the countries most extensively and severely affected by landslide disasters in the world, with landslides occurring in mountainous and hilly regions across the country. According to the State Statistical Bureau (http://www.stats.gov.cn/ (accessed on 1 January 2023)), there were 332,715 geological disasters in China from 2004 to 2021, including 237,487 landslides, resulting in more than 24,000 casualties and direct economic losses of approximately USD 1.5 billion. Therefore, it is indispensable to identify areas that are potentially susceptible to landslides [4].

Landslide susceptibility mapping (LSM) is a method of quantitatively predicting the spatial distribution of landslide susceptibility in a region by combining regional topography, geological structures, hydro-meteorology and other characteristics, which is of great significance to landslide prevention and management, and urban planning [5,6]. Earlier studies on LSM have mainly adopted statistical models (e.g., entropy [7], frequency ratio [8], linear regression [9], etc.). Statistical models that are based on historical landslides and geographical conditions have the advantage of being quantitative and objective in their analysis. However, traditional statistical models are weak in explaining the complex and non-linear relationship between landslides and their conditioning factors, and are subjective in selecting factor weights, making it difficult to handle high-dimensional and large data sets; in addition, their accuracy still needs to be further improved.

With the development of data mining technology, machine learning methods such as support vector machine (SVM) [10,11], decision tree (DT) [12,13] and random forest (RF) [14,15] have been increasingly used in landslide susceptibility mapping. The choice of evaluation method is crucial in the process of landslide susceptibility mapping, and directly affects the generalization ability and transparency of the model. Machine learning methods explore the complex relationship between landslides and their conditioning factors based on historical landslide data, and have the advantages of a high evaluation accuracy, strong generalization performance and less over-fitting. Pradhan [16] demonstrated the significant advantage of machine learning models in reflecting the relationship between environmental factors and landslide susceptibility by comparing three relatively new approaches, namely SVM, DT and the adaptive neuro-fuzzy inference system (ANFIS), for predicting landslides at Penang Mountain, Malaysia. Most scholars have studied the basic conditioning factors and decision mechanism of landslides based on different machine learning models or hybrid machine learning models [17,18]. Liao et al. [19] discussed the effect of hybrid machine learning model identification on landslide susceptibility evaluation at different grid resolutions, aiming to identify the underlying conditioning factors of landslides and improve the predictive capability of landslide susceptibility evaluation models. In recent years, gradient boosting has attracted wide attention from scholars for its excellent prediction capability and stability. In particular, XGBoost, LightGBM and CatBoost are increasingly being used in research on landslides. Xu et al. [20] proposed a superposition concept and an ensemble learning technology for eight types of machine learning. By comparing the prediction results of the optimized models, their capabilities were found to be superior to those of ordinary regression models, and the ensemble learning models were combined and applied to landslide prediction. The study improved the robustness and generalization ability of machine learning models. Combining the machine learning methods XGBoost and LightGBM, Zhang et al. [21] developed an incident reliability analysis method and used it for the analysis of the Bazimen Landslide in the Three Gorges Reservoir Area. The model enables the efficient and accurate evaluation of time-variant failure probabilities in practical applications and reduces the computational cost of performing extensive deterministic analysis.

While machine learning models can improve evaluation accuracy and outperform traditional models, the “black-box” nature of these algorithms makes them less transparent and credible, and their decisions cannot be explained to users [22]; this has a huge impact on their application in high-risk areas, such as geological disaster prediction, autonomous driving and medical diagnosis, and has severely hindered the development of machine learning in a number of areas. In recent years, post hoc interpretation algorithms [23] have provided new directions for the interpretation of black-box models. These algorithms build explanatory models that describe the working mechanism and decision-making behavior of the learning model, which supports fairer and more robust decisions while ensuring the causality of model inference [24]. Commonly used post hoc machine learning interpretation algorithms include Shapley additive explanation (SHAP) [25], partial dependence plots (PDP) [26,27], local interpretable model-agnostic explanations (LIME) [28], global surrogate [29], Dalex [30], etc. These kinds of analysis tools are developing rapidly, and the SHAP algorithm in particular is gaining popularity due to its simple operation and comprehensive content. It has been used with good results in other research areas, but post hoc interpretation algorithms are rarely used in the LSM field. Shaker [31] developed an explainable Alzheimer's disease (AD) diagnosis and progression detection model using random forest to forecast the early diagnosis and incidence of AD within three years, and applied the SHAP algorithm to provide global and instance-level explanations that enhance the credibility of the random forest model. Zhou [22] presented an interpretable combined model based on SHAP and XGBoost that can provide a scientific basis for landslide hazards and can be used as a comprehensive evaluation framework for landslide susceptibility. Omer [32] proposed a new hierarchical binary prediction framework for the susceptibility assessment of geological and hydrological disasters, such as floods and landslides, using a combined ERT and PSO classification scheme based on the Shapley additive explanation algorithm.

Geomorphic features are an important component of the earth surface system. The geomorphology classification system is used to fundamentally classify the basic elements of geomorphology (relief, slope and elevation, etc.), constituent materials (bedrock, unconsolidated sediments), the forces of genesis, and the landform formation environment based on geomorphic genesis [33,34]. The characteristics of various internal and external forces that shape landforms and their magmatic differentiation lead to the multiple superposition of modern geomorphic entities, resulting in geomorphic diversity and variability. Therefore, the study of the genesis of different landform types is one of the fundamental and central elements of geographic research. Previous studies of landslide susceptibility have generally treated the entire research area as a single unit, an approach that is effective and accurate mainly for areas with simple and less varied landform environments. In most research areas, however, subareas with complex landform environments and topographies have different geological structures and environmental conditions even when the same evaluation factors are applied. In addition, the degree of influence of the evaluation factors on landslides in these subareas may differ depending on the type of landform, which reduces the reliability and authenticity of the evaluation results. Hence, it is necessary to evaluate the landslide susceptibility of different areas in a larger study area by zoning blocks according to their landform type, to assess and interpret the contribution of conditioning factors, and to investigate the variation in the intrinsic factors that affect landslide occurrence under the conditions of different landform types.

Chongqing, located in southwestern China, is one of the most landslide-prone areas in China due to its complex geological and geomorphological environment. For this reason, this paper takes two areas of Chongqing with widely varying landform and geological conditions, namely the corrosion layered high and middle mountain region (Zone I), and the middle mountainous region of strong karst gorges (Zone II), as the research area in order to explore the internal cause mechanisms and spatial distribution of the different landform types that affect landslide occurrence. The objectives of this paper were as follows: (1) To use the Bayesian optimization algorithm to optimize the parameters of LightGBM and XGBoost, select the optimal hyperparameters for training in order to construct a landslide susceptibility evaluation model for Zones I and II, and to evaluate the accuracy of the model in both zones. Then, to select the model with the higher accuracy to construct landslide susceptibility mappings. (2) To test the accuracy of the prediction model and the difference in performance between the two algorithms using collinearity analysis and McNemar’s Test. (3) To interpret the prediction results based on the SHAP algorithm’s factor importance ranking and single-factor dependency plots in order to explore the factors inherent to the different landform conditions that influence landslides. (4) To sample individual landslides from Zones I and II separately, and apply local and individual interpretation to them based on a summary plot of SHAP values.

2. Materials and Methods

2.1. Research Area and Data Sources

2.1.1. Research Area

With reference to previous studies [35], Chongqing was divided into platform, hill, small undulating mountains and medium undulating mountains according to the degree of fluctuation, and into low, medium and mid-to-high altitudes according to elevation (Table 1). The landform types were then reclassified based on the ArcGIS Platform according to this criterion. However, due to the relatively high number of twelve landform types after reclassification, Chongqing was then divided into four major regions based on the main landform types (Figure 1), namely the northeast Dabashan Mountains structural corrosion layered high-middle mountain region, the southeast middle mountainous region of strong karst gorges, the midportion structure paralleled ridge valley (low hilly area) region, and the western mid-low mountain hilly region. In this paper, the Dabashan Mountains structural corrosion layered high-middle mountain region and the middle mountainous region of strong karst gorges, which are typical counties with two landform types, were selected as the research area for analysis. Zone I includes Chengkou, Wuxi and Wushan counties, and Zone II includes Wulong, Pengshui and Shizhu counties. Figure 2a,b shows the percentage of each landform type in Zones I and II, respectively.

Spanning 108°15′ E–110°11′ E, 23°28′ N–32°12′ N, Zone I is located in the hilly to mountainous terrain of the eastern edge of the Sichuan Basin, with an elevation range of 63–2813 m. In karst areas, tectonic landform genesis is the main force controlling the morphology of the mountains. The landscape is predominantly mountainous, with deep valleys and mid-high mountains interspersed to form an alternating landform with large undulations and steep slopes. The geological structure is dominated by folding, and is located at the junction of the Dabashan Mountain Fold Belt and the East Sichuan Fold Belt, with complex tectonic stress fields. The lithology of Zone I comprises mostly sedimentary rocks ranging in age from the Cambrian to the Jurassic, with scattered Quaternary deposits. The rock is interbedded soft and hard rock with fully developed second-order folds and fragile structures. The zone lies in the north-south climate transition zone, with an average annual temperature of about 10 to 24 °C and an annual rainfall of about 1049 mm. The vertical distribution of vegetation varies remarkably, with 33 groups of forest vegetation; the main vegetation types are evergreen broad-leaved forest, warm temperate coniferous forest and deciduous broad-leaved forest belt, etc.

Spanning 107°14′ E–108°34′ E, 28°57′ N–29°51′ N, Zone II is located in the Wuling Mountains with an elevation range of 78–2021 m, and is related to the Wujiang River and belongs to the Wushan Dalou Mountain Area. It is situated in the southeastern part of the Sichuan Basin where the two mountain systems of Daloushan and Wulingshan converge at the edge of the basin. The geological structure is the Chuan-E-Xiang-Qian fold belt of the neo-cathaysian structural system in southeast Chongqing. The terrain is high in the northwest and low in the southeast, which is the middle and low mountain terrain of structural denudation. The topography is controlled by the northeast structure, and the main mountains extend in the northeast direction, with obvious stratification. Valleys, foothills, karst low-lying lands and small intermountain basins are interleaved with the sequential-reverse landform. The karst landform is widely distributed, and the groundwater and surface karst morphology are well developed. The area is distributed with typical landscapes, such as stone forests, peak forests, depressions, sinkholes, karst caves, underground rivers and canyons. The climate is subtropical and monsoonal humid, with an average annual precipitation of about 1353 mm and an average annual temperature of about 13 to 23 °C. The main vegetation types include evergreen broad-leaved forest, evergreen coniferous forest, shrub forest and shrub grassland.

The topography and geological conditions of the two zones (Zone I and Zone II) with significantly different topographic and structural features were analyzed and compared in order to explore the formation mechanisms of landslides at different divisions. The present study can provide a reference for the effective prediction and management of landslide hazards, as well as the research of machine learning interpretability. Figure 3 shows a map of the geographical location of the research area.

2.1.2. Data Sources and Conditioning Factors

According to the China Natural Hazards Database (Resource Discipline Innovation Platform, data.ac.cn (accessed on 12 December 2022)) and the Chongqing Municipal Emergency Management Bureau (https://yjj.cq.gov.cn/wap.html (accessed on 12 December 2022)), landslide hazards are frequent in the research area as a result of complex topography, human engineering activities (e.g., urbanization, reservoir migration, migrant towns) and climatic and hydrological influences. These sources show that a total of 91 earthquakes and landslide disasters occurred in Zone I between 2019 and 2022, affecting 1,033,000 people; 49,000 people were evacuated and relocated in an emergency, 17,000 people were relocated in an emergency, and 16 people died as a result of the disaster. In addition, 42,157 hectares of crops were affected and 22,620 houses were damaged (among them, 4136 houses collapsed, 6155 houses were severely damaged and 12,329 houses were generally damaged). The direct economic loss was CNY 2.34 billion. A total of 83 geological disasters occurred in Zone II, including 35 landslide disasters. Compared with the average value of the same period in the past five years, the number of deaths caused by geological disasters rose by 40%, while the number of people affected, the number of emergency relocations, the number of damaged houses and the direct economic losses fell by 47.8%, 44.8%, 81.5% and 67.6%, respectively.

The sources of each conditioning factor are shown in Table 2, and the classification criteria for the land use types and lithology are shown in Table 3. The elevation data were derived from DEM raster data at a spatial resolution of 30 m from the ASTER satellite. The Landsat 8 OLI satellite imagery (2016) was downloaded from the Geospatial Data Cloud with a spatial resolution of 30 m. Lithology data were obtained via the vectorization of 1:200,000 geological maps from the National Geological Data Centre. The rainfall data were obtained by interpolating the rainfall data tables from the Chongqing Meteorological Bureau using the ArcGIS spatial interpolation method. The road data were obtained from the Chongqing Municipal Transport Commission. River network data were obtained from the Chongqing Municipal Water Resources Bureau. Land use type data and administrative division data were obtained from the Geographical State Monitoring Cloud Platform. The scale of all four data types was 1:100,000. The POI data came from the Baidu API crawler and the data type was vector data. In summary, the raw raster data were all at a resolution of 30 m and the vector data were converted to 30 m resolution rasters using the Euclidean distance tool. Furthermore, studies by Faming H et al., 2020 [36], P. A. Buah et al., 2019 [37] and several others have shown that the best results are obtained using a 30 m resolution for landslide susceptibility modelling. Thus, this study was based on 30 m resolution raster data for landslide susceptibility modelling in Zones I and II.

The data used for the study were the Chongqing landslide samples from 2000 to 2016. The survey revealed 1873 historical landslide events in Zone I and 1255 historical landslide events in Zone II. In order to obtain a clear picture of the landslide sample data, statistical analyses of the causes of landslide formation, scale grade and danger level were also conducted (Figure 4).

2.1.3. Geospatial Database

Choosing suitable landslide conditioning factors as input variables is a key step in conducting landslide susceptibility assessments. According to Ayalew and Yamagishi, a GIS-based landslide conditioning factor should be measurable, operational, complete and non-redundant [38]. Typically, landslides occur when the underlying conditions of the slope itself in the area combine with external triggers. Combining development conditions, spatial patterns and the features of landslides in the study area, 16 conditioning factors under the influence of topography and landform (elevation, slope, aspect, curvature, incision depth, fluctuation, topographic wetness index (TWI), terrain ruggedness index (TRI)), geological conditions (lithology, distance from faults), environmental conditions (normalized difference vegetation index (NDVI), distance from rivers, land use type, average annual rainfall) and human activities (distance from roads, POI kernel density) were selected to construct an index system for evaluating landslide susceptibility in Zones I and II in this paper.

In particular, incision depth (surface cutting depth), which refers to the difference between the average elevation of the field range of a point on the ground and the minimum elevation within the field range, was selected as one of the conditioning factors as it can intuitively reflect surface erosion and cutting. Its effect on landslide susceptibility has rarely been considered in previous studies. In addition, it can be used to measure the influence of the water flow cutting intensity and the valley cutting depth on landslide occurrence, and hence its influence on landslide occurrence in different landform types can also be explored. The incision depth is obtained by processing the flow direction and raster data of the river.
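The definition above (mean elevation of a neighborhood minus the minimum elevation of the same neighborhood) can be illustrated with a minimal sketch. The 3 × 3 moving window, the edge padding and the function name `incision_depth` are assumptions made for illustration only; the window size and GIS workflow actually used in the study are not stated here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def incision_depth(dem, window=3):
    """Mean elevation minus minimum elevation inside a square moving window.

    `window` (odd, in cells) is a hypothetical choice, not a value from the paper.
    """
    pad = window // 2
    # Repeat border cells so every pixel has a full neighborhood.
    padded = np.pad(np.asarray(dem, dtype=float), pad, mode="edge")
    # Shape (rows, cols, window, window): one neighborhood per DEM cell.
    neighborhoods = sliding_window_view(padded, (window, window))
    return neighborhoods.mean(axis=(-2, -1)) - neighborhoods.min(axis=(-2, -1))


# Toy DEM: a plateau at 10 m with a single 1 m pit in the center.
toy_dem = np.array([[10.0, 10.0, 10.0],
                    [10.0, 1.0, 10.0],
                    [10.0, 10.0, 10.0]])
toy_depth = incision_depth(toy_dem)
```

On a perfectly flat surface the result is zero everywhere, and it grows as the local minimum drops further below the neighborhood mean, which is why the factor reflects how deeply valleys cut into the terrain.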

A grid unit with a resolution of 30 m × 30 m was taken as the elementary unit for the landslide susceptibility assessment. A visual thematic layer of the geospatial database of the landslide conditioning factors in Zone I (Figure 5) and Zone II (Figure 6) was established. In this study, positive samples (landslide sites) and negative samples (non-landslide sites) constituted the entire sample, with each landslide point considered as a grid unit. Non-landslide areas were used to expand the amount of data used for machine learning. The selection of reasonable negative samples has a significant impact on the prediction results. In previous studies, negative samples have often been selected using environmental similarity-based sampling (ESBS), buffer-controlled sampling (BCS) [39], target space exteriorization sampling (TSES) [40], etc. A-Xing Zhu [41] proposed a negative sample sampling similarity theory that quantifies negative samples based on the environmental similarity between candidate negative samples and positive samples, and compared it with two existing negative sample generation methods, BCS and TSES. The results showed that negative samples selected using environmental similarity yield a better predictive accuracy and confidence. Therefore, in this study, the positive samples were selected from historical landslide data and the negative samples were sampled using ESBS. Heckmann et al. [42] believe that a positive-to-negative sample ratio of 1:1 to 1:10 usually works best. Therefore, combined with the landslide areas in Zone I and Zone II, the non-landslide areas in both zones were extracted at a ratio of 1:1 in order to prepare the entire sample set.

2.2. Method

The research process mainly involved the following steps (Figure 7). First, 16 factors extracted from multiple sources of data, such as DEM, satellite images and geological data, were employed as landslide susceptibility conditioning factors in order to construct a geospatial database. Secondly, the parameters of the LightGBM and XGBoost models were optimized using the Bayesian optimization algorithm in order to obtain the optimal parameters for each of the two regions. The area under the ROC curve (AUC) and the accuracy, precision, recall and F1-score derived from the confusion matrix were used to evaluate the accuracy. The accuracy of the two models in the two zones was then compared in order to select the model with the higher accuracy and to construct the landslide susceptibility maps. Next, the accuracy and significance of the predictive models were tested by applying collinearity analysis alongside McNemar’s test. Finally, the SHAP values were used to explain the occurrence mechanisms of landslides in the research area from a global perspective and to explain individual landslides in the research area from a local perspective.

2.2.1. Bayesian Optimization Algorithm

Bayesian optimization is a black-box optimization algorithm that is widely used in automated machine learning (AutoML). It automatically determines the hyperparameters of a machine learning algorithm, i.e., it automatically searches for the hyperparameter values that maximize the desired objective [43,44]. The concept of Bayesian optimization is as follows: (1) To generate an initial set of candidate solutions. (2) To find the next most likely extreme point based on these points, to add that point to the set, and to repeat the procedure until the iteration terminates. (3) To identify the point with the largest function value from these points to be used as the solution to the problem [45].
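The three steps above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the implementation used in the study: the surrogate is a small Gaussian process with an RBF kernel, the next point is chosen with an upper-confidence-bound rule, the search space is a single hyperparameter scaled to [0, 1], and all names (`bayes_opt`, `gp_posterior`, `kappa`, `length_scale`) are hypothetical. In the study the objective would be the AUC obtained with a given hyperparameter setting; here a simple function stands in for it.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a GP surrogate."""
    k_obs = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_obs, x_query)
    y_mean = y_obs.mean()
    mu = y_mean + k_cross.T @ np.linalg.solve(k_obs, y_obs - y_mean)
    var = 1.0 - np.sum(k_cross * np.linalg.solve(k_obs, k_cross), axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))


def bayes_opt(objective, n_init=3, n_iter=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)          # candidate hyperparameter values
    # Step 1: initial set of candidate solutions.
    x_obs = rng.uniform(0.0, 1.0, n_init)
    y_obs = np.array([objective(x) for x in x_obs])
    # Step 2: repeatedly add the most promising point according to the surrogate.
    for _ in range(n_iter):
        mu, sigma = gp_posterior(x_obs, y_obs, grid)
        x_next = grid[np.argmax(mu + kappa * sigma)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    # Step 3: return the evaluated point with the largest objective value.
    best = int(np.argmax(y_obs))
    return x_obs[best], y_obs[best]


# Stand-in objective with its maximum at 0.3 (a real run would return an AUC).
best_x, best_y = bayes_opt(lambda x: -(x - 0.3) ** 2)
```

The design point worth noting is that each new evaluation is chosen using everything learned so far, so far fewer model fits are needed than with an exhaustive grid search over the same range.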

The main hyperparameters of LightGBM and XGBoost were optimized by using the test set AUC values as the objective function for the optimization; the seven main hyperparameters involved in the two models are shown in Table 4.

2.2.2. eXtreme Gradient Boosting (XGBoost)

Chen and Guestrin [46] proposed an ensemble learning method called XGBoost in 2016, which has been widely used in the field of landslide susceptibility. The core idea of XGBoost is to construct a large number of base classifiers to form a single model, and it uses a pre-sorted algorithm to find the data segmentation points accurately. The model first sorts the values of every feature, then determines the optimal segmentation point for each feature by traversing every candidate split at a cost of O(#data). Finally, it identifies the best feature and splitting point, and splits the data into left and right sub-nodes. This pre-sorting algorithm finds the splitting point accurately, but has a significant overhead in space and time: it requires twice as much memory as the training data because features need to be pre-sorted and the sorted index values need to be saved, and the splitting gain must be calculated at every candidate segmentation point, which results in a high computational cost.

The predicted value of the XGBoost model for a given evaluation unit is as follows:

\hat{y}_{i} = ϕ(X_{i}) = \sum_{k=1}^{K} f_{k}(X_{i}), \quad f_{k} \in F    (1)

where f_k is the k-th base classifier and K is the number of base classifiers contained in the final model.

Its objective function is as follows:

L^{(t)} = \sum_{i=1}^{n} l(y_{i}, \hat{y}_{i}^{(t-1)} + f_{t}(X_{i})) + Ω(f_{t})    (2)

where \hat{y}_{i}^{(t-1)} is the predicted value of the first t − 1 ensemble classifiers for evaluation unit i, f_t(X_i) is the predicted value of the current classifier for that evaluation unit, and Ω(f_t) is the regularization term of the t-th classifier.

2.2.3. Light Gradient Boosting Machine (LightGBM)

The light gradient-boosting machine (LightGBM) is a gradient-boosting decision tree (GBDT) model that was open sourced by Microsoft (Ke et al.) [47] on GitHub in 2017. LightGBM is a fast, distributed, high-performance gradient-boosting algorithm that is based on the decision tree algorithm and can be used for ranking, classification, regression and many other machine learning tasks. For example, Fan et al. [48] and Massaoudi et al. [49] showed that LightGBM is efficient and has a good generalization ability in drought assessment, crop growth simulation, short-term load forecasting and other applications. The core idea of the algorithm is to discretize continuous floating-point features into k discrete values and to construct a histogram with k bins, which takes up less memory and lowers the complexity of data segmentation. It traverses the training data and accumulates the statistics of each discrete value in the histogram. When selecting a split, only the k discrete values of the histogram need to be traversed to find the optimal segmentation point.
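The histogram idea can be sketched as follows. This is a simplified illustration rather than LightGBM's own code: it uses equal-width bins, accumulates only gradient sums and sample counts per bin, and scores splits with a squared-error style gain; the function name `best_histogram_split` and the default `k` are hypothetical.

```python
import numpy as np


def best_histogram_split(feature, gradient, k=16):
    """Find the best split of one feature by scanning k histogram bins."""
    feature = np.asarray(feature, dtype=float)
    gradient = np.asarray(gradient, dtype=float)
    # Discretize the continuous feature into k equal-width bins (indices 0..k-1).
    edges = np.linspace(feature.min(), feature.max(), k + 1)
    bins = np.digitize(feature, edges[1:-1])
    # One pass over the data accumulates the per-bin statistics.
    grad_hist = np.bincount(bins, weights=gradient, minlength=k)
    count_hist = np.bincount(bins, minlength=k)
    total_grad, total_count = grad_hist.sum(), count_hist.sum()

    best_gain, best_threshold = 0.0, None
    left_grad, left_count = 0.0, 0
    # Only the k bins are traversed, not every distinct feature value.
    for b in range(k - 1):
        left_grad += grad_hist[b]
        left_count += count_hist[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        right_grad = total_grad - left_grad
        gain = (left_grad ** 2 / left_count
                + right_grad ** 2 / right_count
                - total_grad ** 2 / total_count)
        if gain > best_gain:
            best_gain, best_threshold = gain, edges[b + 1]
    return best_threshold, best_gain


# Toy data: gradients change sign between feature values 4 and 5.
toy_feature = np.arange(10, dtype=float)
toy_gradient = np.where(toy_feature < 5, -1.0, 1.0)
toy_threshold, toy_gain = best_histogram_split(toy_feature, toy_gradient, k=10)
```

After the single pass that fills the histogram, the cost of finding a split depends on k rather than on the number of samples, which is the source of the memory and speed savings described above.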

The basic learner of LightGBM is a decision tree, which can be expressed as follows:

H_{T}(x) = \sum_{t=1}^{T} H_{t}(x), \quad H_{t} \in ϑ    (3)

where H_t is the t-th learner, T is the number of learners and ϑ is a collection of learners.

2.2.4. SHapley Additive exPlanation

In the field of landslide susceptibility, most scholars currently favor algorithms with a high prediction accuracy and a good generalization performance, but ignore the intrinsic mechanisms of the models. These algorithms have low interpretability, which has seriously hindered the practical application of machine learning algorithms. Exploring the reasons why a model predicts landslides or non-landslides in the evaluation units helps to optimize the model and to improve the credibility and scientific validity of its application. Chelgani et al. [50] found that no scholar had explained the relationship between the CF loop operation variables of the running factory database and metallurgical correspondence. The SHAP-XGBoost machine learning model that they proposed was used to explain the CL of industrial CF circuits, providing an accurate multivariate correlation evaluation of the CF datasets. Combining interpretable techniques with anomaly detection algorithms, Kim et al. [51] overcame the inability of the model to explain why specific data instances were classified as anomalies, and found that the model could provide more useful explanations of anomalies when SHAP values were used.

SHAP (SHapley Additive exPlanation) was proposed by Lundberg and Lee [52]. Its core idea is derived from cooperative game theory, in which Shapley values were originally used to quantify the contribution of each player to a collaborative game [53]. The SHAP framework combines several existing approaches to create an intuitive and theoretically sound way of explaining the predictions of any model and has proven to be an important advance in the field of machine learning model interpretation. The SHAP value quantifies the magnitude and direction (positive or negative) of the influence of features on prediction.

In this study, the SHAP value was used to quantify the contribution of each factor to the landslide susceptibility prediction results. SHAP interprets the Shapley value as an additive feature attribution method and the predicted value of the model as the sum of the attribution values of each input feature:

g (x^{'}) = ϕ_{0} + \sum_{j = 1}^{M} ϕ_{j}

(4)

where g(x’) is the output of the explanation model, M is the number of input features, and ϕ₀ is the constant of the explanation model, that is, the predicted mean of all training samples. ϕ_j is the attribution value (Shapley value) of feature j.
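The additive property can be checked on a small example. The sketch below computes exact Shapley values by enumerating all feature orderings, which is only feasible for a handful of features; it is not the tree-based algorithm that the SHAP library uses for LightGBM. The model `toy_model`, the instance `x` and the background samples are hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, background):
    """Exact Shapley values of `model` at `x` by enumerating feature orderings.

    Features outside the coalition are filled in from each background sample,
    and the model output is averaged over the background set.
    """
    m = len(x)

    def value(coalition):
        # Expected model output when only the coalition features take x's values.
        total = 0.0
        for row in background:
            z = [x[j] if j in coalition else row[j] for j in range(m)]
            total += model(z)
        return total / len(background)

    phi = [0.0] * m
    orderings = list(permutations(range(m)))
    for order in orderings:
        seen = set()
        for j in order:
            before = value(seen)
            seen.add(j)
            phi[j] += value(seen) - before  # marginal contribution of feature j
    phi = [p / len(orderings) for p in phi]
    phi0 = value(set())  # base value: mean prediction over the background
    return phi0, phi


def toy_model(z):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
background = [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
phi0, phi = shapley_values(toy_model, x, background)
```

Here `phi0 + sum(phi)` reproduces `toy_model(x)`, which is the additivity expressed by Equation (4).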

2.2.5. Validation Metrics

A model that has not been validated cannot be considered scientifically sound, so it is necessary to assess the validity of the models used. In order to verify the accuracy of the models constructed by the XGBoost and LightGBM algorithms, the evaluation models were validated via quantitative and qualitative approaches. The accuracy of the landslide susceptibility model predictions can be analyzed via the receiver operating characteristic (ROC) curve and the area under the curve (AUC) [54]. The closer the ROC curve is to the top-left corner, the higher the accuracy of the model. The AUC is the area under the ROC curve and can be used to quantify the accuracy of the model; the closer its value is to 1, the higher the accuracy.

The confusion matrix is the basis of the ROC curve and is the most basic, intuitive and simple way in which to measure the accuracy of a classification model. The accuracy, precision, recall and F1-score derived from the confusion matrix serve as the evaluation criteria of the model and quantify its accuracy. Each of these metrics takes a value between 0 and 1, and the larger the value, the higher the accuracy.
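As an illustration of these metrics, the following sketch computes them for a binary landslide/non-landslide problem. The labels, scores and the 0.5 decision threshold are hypothetical, and the AUC is computed with the rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one), which is equivalent to the area under the ROC curve.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score from binary labels (1 = landslide)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc_score(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for six evaluation units.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
metrics = confusion_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

Unlike the four confusion-matrix metrics, the AUC does not depend on the decision threshold, which is one reason both kinds of measure are reported together.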

3. Results

3.1. Model Accuracy and Verification

3.1.1. Results of Hyperparameter Optimization Based on Bayesian Algorithm

Hyperparameters are parameters that must be set by the user before training. Hyperparameter optimization is the process of finding the hyperparameter values that maximize the performance of the model, and it directly affects the performance and accuracy of the model. Bayesian optimization helps the user to minimize the model loss function by varying the hyperparameters, i.e., it searches for the minimum of the loss in as few evaluations as possible.
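The following is a minimal one-dimensional sketch of the idea, not the procedure used in this study, whose surrogate model, acquisition function and software are not specified here. It assumes a Gaussian-process surrogate with an RBF kernel and the expected-improvement criterion, and the function `toy_loss` is a hypothetical stand-in for a validation loss as a function of a single hyperparameter.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def gp_posterior(xs, ys, x_new, length_scale=1.0, jitter=1e-6):
    """Posterior mean and standard deviation of a GP surrogate at x_new."""
    kern = lambda a, b: math.exp(-0.5 * ((a - b) / length_scale) ** 2)
    K = [[kern(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    ks = [kern(a, x_new) for a in xs]
    mean_y = sum(ys) / len(ys)
    alpha = solve(K, [y - mean_y for y in ys])
    v = solve(K, ks)
    mu = mean_y + sum(k * a for k, a in zip(ks, alpha))
    var = max(1.0 - sum(k * w for k, w in zip(ks, v)), 1e-12)
    return mu, math.sqrt(var)


def expected_improvement(mu, sigma, best):
    """Expected improvement over the best loss seen so far (minimization)."""
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf


def bayes_opt(loss, lo, hi, n_iter=10, grid=101):
    """Evaluate `loss` where the surrogate expects the largest improvement."""
    xs = [lo, hi]
    ys = [loss(x) for x in xs]
    candidates = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)]
    for _ in range(n_iter):
        best = min(ys)
        ei = [expected_improvement(*gp_posterior(xs, ys, c), best) for c in candidates]
        x_next = candidates[ei.index(max(ei))]
        xs.append(x_next)
        ys.append(loss(x_next))
    i = ys.index(min(ys))
    return xs[i], ys[i]


def toy_loss(x):
    # Hypothetical validation loss with its minimum at x = 2.
    return (x - 2.0) ** 2


best_x, best_y = bayes_opt(toy_loss, 0.0, 5.0)
```

Each iteration spends one loss evaluation at the candidate with the highest expected improvement, so the search concentrates on promising regions instead of sweeping a full grid.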

The study used a Bayesian optimization algorithm to optimize the hyperparameters of the LightGBM and XGBoost models. The seven parameters of the models were optimized, and the optimized parameters of the models based on Zones I and II are shown in Table 5 and Table 6.

3.1.2. Model Accuracy in Zone I

The optimized hyperparameters were substituted into the XGBoost and LightGBM algorithms to train and evaluate the accuracy of the landslide susceptibility models for Zone I based on historical landslides and conditioning factors, as shown in Figure 8 and Table 7. The AUC values of the XGBoost model test set and training set in Zone I were 0.8175 and 0.8929, respectively, and those of the LightGBM model test set and training set were 0.8858 and 0.9649, respectively. The accuracy, precision, recall and F1-score values of both models were much higher than 0.5, and the LightGBM model achieved the higher accuracy in Zone I.

3.1.3. Model Accuracy in Zone II

The optimized hyperparameters were substituted into the XGBoost and LightGBM algorithms to train and evaluate the accuracy of the landslide susceptibility models for Zone II based on historical landslides and conditioning factors, as shown in Figure 9 and Table 8. The AUC values of the XGBoost model test set and training set in Zone II were 0.8176 and 0.8930, respectively, and those of the LightGBM model test set and training set were 0.8725 and 0.9285, respectively. The accuracy, precision, recall and F1-score values of both models were significantly higher than 0.5, and the LightGBM model achieved the higher accuracy in Zone II.

The accuracy results above show that both the XGBoost and LightGBM evaluation models based on the Zone I and Zone II data have good accuracy and predictive ability. The difference between the test-set and training-set accuracy is small, indicating that the Bayesian optimization algorithm is accurate and effective in tuning the model parameters, and that both models can be applied to the study of landslide susceptibility zoning. At the same time, the evaluation accuracy of LightGBM is superior to that of XGBoost, so LightGBM can more accurately and efficiently evaluate the spatial and temporal distribution of landslides in different landscapes. Therefore, the landslide susceptibility zoning maps for Zones I and II and the subsequent evaluation in this paper are based on the LightGBM algorithm.

3.1.4. Collinearity Analysis

Collinearity is a high correlation between two or more variables in a model, which, if left untreated, can distort a linear model or make accurate prediction difficult, reducing the reliability and accuracy of the model. Tolerance and the variance inflation factor (VIF) are the most commonly used indicators for collinearity analysis, and they are reciprocals of each other. In general, a VIF value greater than 10 indicates that the model has a strong collinearity problem [55,56]. Therefore, this paper used the VIF to evaluate factor collinearity; the 16 conditioning factors were tested for collinearity using SPSS statistical software (Table 9). The results show that the VIF values of all conditioning factors in both regions were less than 10, so the factors passed the collinearity test. Therefore, all the above factors can be used to construct a landslide susceptibility index system for machine learning algorithms.
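For reference, and assuming the usual regression-based definition (the symbols below are not taken from the original text), the two indicators for the j-th conditioning factor can be written as

\mathrm{Tol}_{j} = 1 - R_{j}^{2}, \quad \mathrm{VIF}_{j} = 1 / \mathrm{Tol}_{j} = 1 / (1 - R_{j}^{2})

where R_j² is the coefficient of determination obtained by regressing factor j on all the other factors. Under this definition, the threshold VIF > 10 corresponds to R_j² > 0.9, i.e., more than 90% of the variance of factor j is explained by the remaining factors.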

3.1.5. McNemar’s Test

McNemar’s test is a common method for comparing the performance of two classifiers on the same samples. The predictions of each classifier are compared with the actual results, and the numbers of samples that the two classifiers classify correctly and incorrectly are arranged in a 2 × 2 contingency table. The McNemar statistic is then calculated from the numbers of samples that only one of the two classifiers misclassifies, and finally the p-value is calculated from this statistic. The usual level of significance is 0.05. If the p-value is less than 0.05, the difference between the two models is statistically significant, i.e., their classification performance differs significantly. Conversely, if the p-value is greater than the significance level, no significant difference can be concluded.
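A minimal sketch of the test is given below. It assumes the chi-square form of the statistic with a continuity correction, which is one common variant and may differ from the variant used for Table 10; the labels and predictions are hypothetical.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test on the samples where classifiers A and B disagree.

    Uses the continuity-corrected chi-square statistic with 1 degree of freedom
    (assumed variant).
    """
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical predictions: A alone is right on 12 samples, B alone on 2.
y_true = [1] * 20
pred_a = [1] * 12 + [0] * 2 + [1] * 6
pred_b = [0] * 12 + [1] * 2 + [1] * 6
statistic, p_value = mcnemar_test(y_true, pred_a, pred_b)
```

Samples on which both classifiers are right or both are wrong do not enter the statistic, so the test measures only whether the disagreements are split unevenly between the two models.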

This study tested the Zone I and II datasets, obtained predictions and calculated the number of misclassified samples and p-values, the results of which are shown in Table 10. The p-value for Zone I is 0.005603194 and that for Zone II is 0.004912444, both of which are less than 0.05. It can therefore be concluded that the difference between the two models is statistically significant in both Zones I and II, i.e., the models differ significantly in their classification performance.

3.2. Landslide Susceptibility Mapping Results

In this section, based on the LightGBM model, a 30 × 30 m grid was used as the evaluation unit, and the sample set was randomly divided into training and test sets at a ratio of 8:2. The trained model was applied to Zones I and II to predict the spatial distribution of the landslide occurrence probabilities in the research area (Figure 10). The landslide susceptibility maps were divided into very low, low, medium, high, and very high susceptibility zones using the same classification intervals of [0–0.2], [0.2–0.4], [0.4–0.6], [0.6–0.8], and [0.8–1] (Table 11). As can be seen from the susceptibility zoning maps for Zones I and II, the high and very high susceptibility zones, based on the raster units, accounted for only 22.79% and 21.2%, respectively; meanwhile, the landslide density gradually increased from the very low to very high susceptibility zones. The resulting susceptibility zoning maps for the two zones are therefore plausible.
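The equal-interval classification can be sketched as follows. The predicted probabilities are hypothetical, and because the intervals above share their end points, the sketch assumes that a value lying exactly on a boundary is assigned to the higher class.

```python
from bisect import bisect_right
from collections import Counter

LEVELS = ["very low", "low", "medium", "high", "very high"]
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8]  # interior class boundaries


def susceptibility_class(p):
    """Map a predicted landslide probability to one of five susceptibility levels."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Assumption: a boundary value such as 0.2 falls into the higher class.
    return LEVELS[bisect_right(UPPER_BOUNDS, p)]


def class_shares(probabilities):
    """Share of evaluation units falling into each susceptibility level."""
    counts = Counter(susceptibility_class(p) for p in probabilities)
    return {level: counts[level] / len(probabilities) for level in LEVELS}


# Hypothetical predicted probabilities for five raster units.
probs = [0.05, 0.35, 0.5, 0.75, 0.95]
classes = [susceptibility_class(p) for p in probs]
shares = class_shares(probs)
```

Applied to every raster unit of a zone, `class_shares` gives the kind of area proportions per susceptibility level that are summarized in Table 11.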

In conclusion, the machine learning landslide susceptibility assessment model could effectively produce the susceptibility zoning map based on raster units. However, the prediction process of the LightGBM model is less intuitive and less easy to understand and accept than that of a simply structured decision tree; therefore, this paper subsequently analyzes the interpretability of the model’s prediction results based on the SHAP algorithm.

3.3. SHAP-Based Model Interpretation

3.3.1. Factor Importance Ranking

Shapley values were utilized to quantify the contribution of each factor in each unit to the predicted results of the susceptibility model as a means of screening the dominant factors in the landslide formation mechanism. As global importance is required, the absolute Shapley values of each feature j were averaged over the n samples, where ϕ_j^{(i)} denotes the Shapley value of feature j for sample i:

I_{j} = \frac{1}{n} \sum_{i = 1}^{n} |ϕ_{j}^{(i)}|

(5)
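As a small illustration of this ranking step, the sketch below averages absolute Shapley values per feature and sorts the features by the result. The SHAP matrix and the three feature names are hypothetical and are not values from this study.

```python
def global_importance(shap_matrix, names):
    """Rank features by their mean absolute Shapley value over all samples."""
    n = len(shap_matrix)
    importance = {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical Shapley values: one row per sample, one column per feature.
names = ["elevation", "rainfall", "slope"]
shap_matrix = [
    [0.4, -0.1, 0.05],
    [-0.6, 0.2, -0.05],
    [0.5, -0.3, 0.0],
]
ranking = global_importance(shap_matrix, names)
```

Taking absolute values before averaging prevents positive and negative contributions of the same feature from cancelling, so a factor that strongly promotes landslides in some units and strongly inhibits them in others still ranks as important.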

A global interpretation of the landslide susceptibility model for Zones I and II was applied in order to determine the importance of the factors in both areas, resulting in a factor importance ranking plot based on the LightGBM model (Figure 11). As shown in Figure 11a, the elevation, distance from river, distance from roads, surface cutting depth, land use, and average annual rainfall had the greatest influence on the model prediction results in Zone I. As can be seen in Figure 11b, the elevation, average annual rainfall, distance from roads, land use, distance from faults and surface cutting depth had the greatest influence on the model in Zone II.

3.3.2. Summary Plot of SHAP Values

The summary plot (also known as a ‘beeswarm plot’) presents the SHAP values of all samples in a single figure, combining the magnitude of each factor’s feature values with the degree of its influence on the evaluation results [57]. Each point in Figure 12 represents the Shapley value of one feature for one sample, with the labels on the left being the features; the features are ordered by importance, as in the mean SHAP value plot. The difference is that each point in the beeswarm plot represents a real sample. The color of each point is determined by the feature value; the larger the feature value, the redder the point. Furthermore, the more points that share the same SHAP value, the wider the swarm at that position. The horizontal position of a point indicates the extent to which it contributes to the outcome, with points further to the right promoting the occurrence of landslides and points further to the left inhibiting it [58].

As can be seen from Figure 12a, the elevation, distance from the river, distance from roads, surface cutting depth, land use and average annual rainfall had the greatest influence on the occurrence of landslides in Zone I. Among them, elevation was negatively correlated with model prediction: the lower the elevation, the more likely landslides were to occur. The distance from the river, distance from roads, surface cutting depth, land use and average annual rainfall were positively correlated with model prediction. As shown in Figure 12b, the factors that had a strong influence on the occurrence of landslides in Zone II were as follows: elevation, average annual rainfall, distance from roads, land use, distance from faults, and surface cutting depth. Of these, the elevation and surface cutting depth were negatively correlated with model prediction, that is, the higher the elevation, the shorter the surface cutting depth and the less likely landslides were to occur. However, the average annual rainfall, distance from roads, land use and distance from faults were positively correlated with model prediction.

3.3.3. Single-Factor Dependence Plots

To analyze the relationship between the magnitude of a feature value and its impact on prediction, it is more appropriate to use the single-factor dependence plot, which clearly shows how each factor affects model prediction [59]. This paper applied single-factor dependence plots for global interpretation, with the feature values on the x-axis and the corresponding Shapley values on the y-axis for each data instance. The six most important factors were selected for analysis in this study. Figure 13 and Figure 14 show the single-factor dependence plots for the landslide conditioning factors in Zones I and II, respectively.

Zone I: As can be seen from Figure 13a, most of the sample points had SHAP values greater than 0 when the elevation was less than 1000 m, indicating that landslides were more likely to occur in areas with an elevation of less than 1000 m. As elevation increased, the impact value fell below zero and continued to decrease, indicating that the contribution of elevation to landslides weakened. It can be seen from Figure 13b that a distance of less than approximately 2500 m from the river promoted landslides, and the impact value gradually decreased with increasing distance from the river; this means that the greater the distance from the river, the less likely landslides are to occur. Figure 13c indicates that landslides were promoted when the distance from roads was less than about 250 m, and the greater the distance from roads, the less likely landslides were to occur. Figure 13d shows that most of the Shapley values were greater than 0 at surface cutting depths of less than about 1600 m, which was positively correlated with the occurrence of landslides, and that the contribution of this factor weakened as the depth increased. As can be seen from Figure 13e, woodland had an inhibitory effect on landslides; grassland, cultivated land and garden land had a weak effect; and water conservancy, industrial, transportation and residential land had a promoting effect. Figure 13f shows that an average annual rainfall of less than 1400 mm had a relatively weak effect on the occurrence of landslides and that its contribution was enhanced as rainfall increased.

Zone II: As can be seen in Figure 14a, the impact values of the sample points were greater than 0 when the elevation was less than 450 m, indicating that elevation contributed to landslides in this range. As elevation increased, the impact value fell below 0, indicating that the contribution of elevation to landslides diminished. As revealed in Figure 14b, the contribution of the average annual rainfall to landslide occurrence became stronger when it was less than about 1300 mm but diminished when it was above 1300 mm. Figure 14c shows that a distance from roads of less than about 350 m promoted the occurrence of landslides and that this influence weakened as the distance increased. Figure 14d shows that landslides were significantly inhibited when the land use type was woodland, grassland, arable land or garden land, but were promoted in the case of water use land, industrial land, transportation land and residential areas. Figure 14e shows that a distance of less than 5000 m from a fault contributed little to the occurrence of landslides, and that the contribution was enhanced as the distance increased. As can be seen from Figure 14f, at surface cutting depths of less than 1750 m, most SHAP values were less than 0 and the surface cutting depth had only a mild effect on the occurrence of landslides. As the surface cutting depth increased, its promoting effect became increasingly strong.

3.3.4. SHAP Waterfall Plot

The SHAP waterfall plot is a local explanation of the prediction for a single instance. In a waterfall plot, the horizontal axis is the SHAP value and the vertical axis lists the value taken by each feature of the selected sample. E[f(x)] is the baseline value of SHAP, i.e., the average predicted value, and f(x) is the final value, which is also the predicted value for this sample. The blue part indicates that the feature has a negative effect on the predicted outcome (arrow to the left, SHAP value decreases), while the red part indicates that the feature has a positive effect on the predicted outcome (arrow to the right, SHAP value increases). Each row in the diagram represents a feature, and the SHAP value in each row represents the contribution of that feature to the total SHAP value of that sample.

For local interpretation, the SHAP algorithm expresses the contribution of each factor on the log-odds scale of the model output, where P is the probability of potential landslide occurrence, as predicted by the model.

f (x) = \ln (P / (1 - P))

(6)
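The link between a waterfall plot and the predicted probability can be sketched as follows: the baseline value and the per-feature SHAP values add up to f(x), and inverting Equation (6) with the logistic function gives P. The baseline and contributions below are hypothetical and are not the values shown in Figure 15.

```python
import math


def waterfall_probability(base_value, contributions):
    """Sum the baseline and SHAP contributions, then invert the log-odds to get P."""
    f_x = base_value + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-f_x))
    return f_x, probability


# Hypothetical baseline E[f(x)] and SHAP values for one evaluation unit.
base_value = -0.5
contributions = {
    "distance_from_roads": 0.9,
    "aspect": 0.3,
    "land_use": -0.2,
}
f_x, probability = waterfall_probability(base_value, contributions)
```

In this toy case the positive contributions outweigh the negative one and the baseline, so f(x) is positive and the predicted probability exceeds 0.5.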

In this study, the SHAP algorithm was utilized to generate waterfall maps for the local interpretation of the Jinjiling landslide in Wushan County (Zone I) and the Jiweishan landslide in Wulong County (Zone II), respectively.

The Jinjiling landslide is located in Jiangdong District, Wushan County, Chongqing City, on the left bank of the Daning River, a tributary of the Yangtze River. The Jinjiling landslide initially deformed after heavy rainfall on 18 June 2018 and then intensified around 1 August into a critical slip state, which threatened the safety of a large number of engineering projects, endangered 55 people in 16 households, and posed potential economic losses of more than CNY 100 million. The Jinjiling landslide is irregular in overall shape and complex in its geological structure, with a gully as the boundary on both sides. The rock formation varies greatly under the influence of the Lijiapo syncline in the area, and the sliding bodies are the Quaternary artificial filling layer Q and the Holocene collapse slope accumulation layer Qcol + d. In addition, human engineering activities are intensive in the landslide area. A highway slope, building slope and landfill project are located in the front, middle and back of the landslide area [60].

As shown in Figure 15a, the average annual rainfall and distance from roads had a great influence on the occurrence of the Jinjiling landslide, with a positive contribution of +0.63; these were followed by the topographic wetness index (TWI, +0.49), elevation (+0.4) and surface cutting depth (+0.32). In addition, the land use, NDVI, distance from a fault, POI and aspect all contributed positively, while the lithology, TRI, slope, relief, distance from the river and curvature had a dampening effect. The above results suggest that the average annual rainfall, distance from roads, TWI, elevation, and surface cutting depth were the dominant factors affecting the occurrence of the Jinjiling landslide.

The Jiweishan landslide, a giant sliding rock collapse, occurred on 5 June 2009 in Wulong County. A large number of sliding bodies were cut out from the sliding source area and continued to disintegrate during the movement, eventually forming a body with an average thickness of more than 30 m, a length of 2150 m, and a volume of 700 × 10⁴ m³, resulting in 74 deaths and 8 injuries [61,62]. Jiwei Mountain in Wulong, composed of hard and soft rock strata, is an inclined thick limestone mountain with a dual structure: steep in the upper part and wide and gentle in the lower part, with hard rock above and soft rock below. The east side of the Jiweishan slope borders a free-standing cliff that strikes nearly north-south. The strata are monoclinal. The exposed strata are the Maokou Formation (P1m), Qixia Formation (P1q) and Liangshan Formation (P1l) from top to bottom. In response to the complex instability mode of the mountain, the Wulong County government and relevant departments had carried out many geological disaster explorations since 1994 and considered its failure mode to be rock collapse [63]. However, the actual failure differed considerably from what the earlier surveys had anticipated. Therefore, it is necessary to study the formation mechanism of this kind of inclined thick landslide.

As shown in Figure 15b, the distance from roads had the most significant positive influence on the Jiweishan landslide, with a SHAP value of +0.93; this was followed by the aspect and surface cutting depth, with SHAP values of +0.34 and +0.14, respectively. However, the land use type, elevation, average annual rainfall, distance from a fault and slope had an inhibitory effect on the Jiweishan landslide. The above results indicate that the distance from roads, aspect, surface cutting depth and relief were the dominant factors affecting the Jiweishan landslide.

4. Discussion

4.1. LightGBM-SHAP Hybrid Model

This paper first used a Bayesian optimization algorithm to optimize the model hyperparameters and applied the optimized hyperparameters to two different machine learning algorithms to investigate landslide susceptibility in areas with different landform types. Figure 8 and Figure 9 and Table 7 and Table 8 demonstrate that both the LightGBM and XGBoost algorithms had strong predictive power. Meanwhile, the test-set AUC values of LightGBM were 0.068 and 0.055 higher than those of XGBoost in Zones I and II, respectively, which shows that the prediction performance of LightGBM was superior to that of XGBoost. For this reason, the LightGBM algorithm was employed for susceptibility mapping in this paper. Its landslide susceptibility results for both zones were found to be reasonable and consistent with the distribution of historical landslide cases, indicating the stronger generalization capability of the LightGBM algorithm. This verification with practical examples suggests that LightGBM, which was designed as a more efficient alternative to XGBoost, is more accurate, is capable of handling large-scale data and requires less memory, which agrees well with previous research results.

The application of machine learning algorithms in the field of landslides is now relatively mature, but most research has been limited to comparing the prediction performance of different models. Sahin [64] constructed GBM, XGBoost and RF models to map landslide susceptibility and compared the models, finding that XGBoost had the best predictive power. However, such comparisons alone are far from sufficient, as researchers should aim for a high predictive power along with a comprehensive explanatory performance in order to make research more rational and transparent. At the same time, few scholars have assessed landslide susceptibility based on different landform types or have investigated their intrinsic causal mechanisms, which could lead to inaccurate evaluation results for research areas with complex landform conditions.

While the LightGBM model achieved favorable prediction results in this study, the intrinsic prediction mechanism of the machine learning model was difficult to explain, which reduces the credibility and transparency of the model and seriously hinders the use of machine learning in the field of landslide susceptibility. Therefore, by combining the LightGBM and SHAP algorithms, this paper proposed an interpretable machine learning system. The model was constructed on the basis of different geomorphological types, and the model output was interpreted both globally and locally in order to explore the relevance of each factor to the occurrence of landslides, the contribution of each conditioning factor to the evaluation results of each evaluation unit in the research area, and the changes in the dominant factors leading to landslides under different geomorphological conditions. The SHAP waterfall plot was also used to analyze the causes and predict the risks of a randomly selected landslide sample and to comprehensively investigate the decision-making mechanism of the model.

4.2. Global Explanation

The summary plot is a comprehensive interpretation of the sample prediction results that gives an intuitive sense of how features affect the overall predicted value, while visualizing the importance of the features and clearly capturing the distribution of SHAP values for each feature [65,66].

In this study, the importance of each factor was calculated and ranked from a global perspective based on the SHAP algorithm (Figure 11, Figure 12, Figure 13 and Figure 14). The Zone I factors were ranked by importance in descending order as follows: elevation, distance to the river, distance to roads, surface cutting depth, land use, average annual rainfall, relief, curvature, distance to a fault, aspect, TWI, POI, lithology, slope, NDVI and TRI. The Zone II factors were ranked by importance as follows: elevation, average annual rainfall, distance from roads, land use, distance from a fault, surface cutting depth, distance from river, TWI, NDVI, relief, POI, curvature, lithology, aspect, slope and TRI. Combining the SHAP results above, we concluded that the elevation, surface cutting depth, land use, average annual rainfall and distance from roads had a significant influence on the occurrence of landslides in Zones I and II, and that these five factors remained important even in areas with different geomorphological types. The difference was that the distance from the river ranked second in importance in Zone I, whereas the distance from faults ranked fifth in importance in Zone II. This section analyzes, in turn, the dominant factors that are common to both zones and those that are specific to each zone.

4.2.1. Common Dominant Factors

Elevation controls the local vegetation type and cover, and reflects the intensity of human activity [67]. The elevation ranked first in importance in both zones. The effect of elevation on landslide occurrence is negatively correlated in Zones I and II, with landslides being inhibited at high elevations and promoted at low elevations. The reason is that Zone I belongs to the layered mountainous and hilly landform, which is shaped by endogenic uplift and has steep slopes, crisscrossing valleys and loose soil. Zone II is characterized by the coexistence of the middle and low canyon landform and reverse landform. Loose sediments are likely to accumulate in the low-altitude areas of the two zones, close to the water system. There are also extensive human engineering activities in these areas (such as excavation of the slope toe, deforestation, sewage infiltration and excessive exploitation of groundwater), which reduce slope stability and greatly increase the probability of landslide occurrence. In high-altitude areas with high vegetation cover, low human activity and a high soil and water conservation capacity, landslides are less likely to occur.

The surface cutting depth, as opposed to relief, targets small localized areas, using the proximity of the watershed to the nearest valley channel as an indicator. It is an important reference for studying the development of soil erosion and surface erosion because it reflects the valley depth, the relative elevation difference in the zone and the fragmentation of the surface in the vertical direction. At the same time, the degree of fragmentation of the landscape varies with geological formations and lithology, which affects regional erosion and ecology, and thus the occurrence of landslides. As shown in Figure 13d and Figure 14f, this factor played an important role in the landslides in Zones I and II. Zone I presents an alternating high and low landform formed by the crisscrossing of deep valleys and middle mountains and by the structural form of Daba Mountain. This landform is mainly present in layered high-medium mountainous areas. For the mountainous landform, the external geomorphological process is mostly erosion and denudation. Therefore, the surface cutting depth plays a promoting role in the low-value area. Zone II is high in the northwest and low in the southeast. The river flows from west to east, and the terrain is deeply cut. There are many valleys and mountains, mainly of middle and low elevation, with obvious moderate cutting. It is a typical middle mountain terrain with medium to deep cutting and little flat land. The greater the surface cutting depth, the stronger its promotion of landslide occurrence.

The average annual rainfall is obtained from long-term observation, and its influence on slope stability is related to the permeability, hydrophilicity and initial water content of the slope material before precipitation. It has a great effect on the local surface runoff and groundwater flow. At the same time, rain infiltrates and erodes the slope and scours the rock and soil on its surface, which increases the pore water pressure, softens the rock and soil and increases the load on the slope; it also affects the development of vegetation, thus promoting or inducing the occurrence of landslides. However, as can be seen from Figure 13f and Figure 14b, the same amount of precipitation had very different effects on landslides in the two zones. There was a positive correlation between precipitation and landslide occurrence in Zone I, i.e., a promoting effect on landslides in areas with high precipitation and an inhibitory effect in areas with low precipitation. In contrast, precipitation in Zone II was negatively correlated with landslide occurrence, acting as an inhibitor in areas with high precipitation and promoting landslides in areas with low precipitation. This is because Zone I is predominantly a stratified mountainous landscape in a humid zone, with an undulating topography, high precipitation and strong flowing water. The surface soil type is mainly lime (rock) soil, and the area is mostly slightly eroded. An increase in rainfall aggravates slope erosion and reduces slope stability. As a result, when the average annual rainfall in Zone I increases, the potential for landslide hazards increases. Zone II is a mountainous and hilly landform with high mountains, steep slopes, undulating terrain and abundant rainfall. Its surface soil type is mainly yellow brown soil with severe soil erosion, so landslides can occur at lower rainfall levels. Although high rainfall aggravates erosion, it also removes the loose material on the surface of the slope, making the slope more compact and stable and thus reducing the possibility of landslide occurrence.

In addition, the multi-year average rainfall used in this study mainly reflects an indirect influence on landslide occurrence through its long-term effect on vegetation development, soil moisture content, surface erosion and other hazard-forming environmental conditions. There may be a potential correlation between historical precipitation and landslide development or landslide incubation processes. Compared with the direct landslide triggering factors, such as ‘24 h cumulative rainfall (daily rainfall)’, ‘effective rainfall in the early stage (such as 10 days before the landslide)’ and ‘rainfall duration’, the average annual rainfall does not directly trigger landslides, but can be one of the underlying factors conditioning their formation. Due to the large difference in the magnitude and spatial distribution of the average annual rainfall between Zone I and Zone II, combined with the very different geomorphology of the two areas, the average annual rainfall has a significantly different effect on the occurrence of landslides in the two zones.

Human activities that violate natural laws and destabilize slopes can trigger landslides. Engineering construction, slope excavation and filling change the sliding force of a slope: surface loading increases the weight acting on the slope, while excavation at the slope foot creates a free surface. This can lead to the revival of old landslides, slope instability or the intensification of natural landslides, and thus to large-scale landslides. Typical examples are excavation at the foot of a slope, the construction of railways and roads, and building houses on mountainsides. Vigorous blasting and forced excavation during construction can cause slumps in the lower part of the slope due to the loss of support, followed by landslides on the side slopes, which endanger road construction operations. Accordingly, as can be seen from Figure 13c and Figure 14c, this factor promoted landslides when the distance from roads was within 500 m, and its contribution diminished only as the distance increased.

Different land use practices affect the inherent mechanisms of landslides differently, and with population growth and economic development in both urban and rural areas, human engineering activities have altered the original balance of ancient landslides. The reactivation of ancient landslides and slope instability as a result of irrational land use occur frequently, as in the revival of the Maliuzui landslide in the Banan District of Chongqing and the Daheba ancient landslide in the Wanzhou District of Chongqing. It can be seen from Figure 13e and Figure 14d that forest, shrub, grassland and cultivated land had an inhibitory effect on landslides, because forest–shrub–grassland conserves soil and water. Such land cover mainly affects the critical strength at which a landslide is induced: it enhances the surface strength of the soil, and root systems help to fix the soil. Cultivated paddy fields, by contrast, can reduce the anti-sliding force of a slope. Urban land plays a significant role in promoting landslides. Cities not only gather large numbers of people; their construction works are also the most numerous and have the greatest impact on landslides. Landslides are more densely distributed in urban areas than in other types of areas with the same geological and topographic setting. Urban land use practices, such as landfill, surface loading, the unreasonable discharge of sewage, and unreasonable industrial mining that forms goafs, have an important impact on landslides. In addition, steel, concrete and other hard materials are used in urban construction, and engineering excavation exposes slopes and enhances landslide susceptibility where the rock or soil is weak. In summary, the impact of the land use type on landslide development is fundamentally different from that of geomorphological factors.

4.2.2. Comparison of Individual Dominant Factors

As can be seen from Figure 13b and Figure 14e, the top individual factors in Zone I and Zone II are the distance from the river and the distance from faults, respectively.

Distance from river: River erosion is one of the common factors affecting the formation of landslides. Erosion by flowing water steepens the valley bank slope, destabilizing or even destroying it, and undercuts the slope toe and sliding face, leading to soil sliding and the collapse of the bank slope. The pattern and degree of river erosion change as the riverbank evolves. In the karst landform of Zone I, flowing water acts on limestone. Under long-term erosion by river water (river incision and lateral erosion), the rock and soil accumulated at the foot of the bank slope are removed, which weakens their supporting effect on the sliding body and reduces stability. For example, seven tributaries, including the Daning River and Baolong River, are strongly incised in the north-south direction.

Distance from faults: Faults are among the most important structures in the crust. Generally speaking, geological faults cause a large number of landslides, and structural faults usually reduce the strength of the surrounding rock [68,69]. The more developed the faults that cut and separate a slope are, the denser and larger the landslides tend to be. Structurally, Zone II lies in the Sichuan–Hunan–Guizhou uplift fold belt of the Neocathaysian tectonic system in southeastern Chongqing. The Shapley values indicate that landslides become more likely as the distance from faults increases, which is attributed to the more developed gullies and deeper cutting in the north and south of the landslide-prone mountains. Geological structures, including faults and folds, have a great impact on the formation of landslides. In general, the rock mass near fold cores and fault zones is broken, landslides are well developed, and the crustal stability of active tectonic sections is poor. In particular, landslides are densely developed along recently and strongly active fault zones.

4.3. Local Explanation

Taking the Jinjiling landslide in Wushan County (Zone I) and the Jiweishan landslide in Wulong County (Zone II) as case studies, this paper analyzed the decision-making process of the model using the waterfall plots generated by SHAP (Figure 15), which provide a local explanation of the causes of individual landslides.

The average annual rainfall, distance from roads, TWI, elevation, and surface cutting depth were the dominant factors for the Jinjiling landslide in Zone I. Combined with the field analysis results, we found that the landslide was surrounded by round-chair-shaped terrain, with multiple gullies converging into the Longdong Gully, so that surface water and groundwater converged in the landslide area, creating favorable conditions for landsliding. Due to tectonic compression, the rock in the landslide area was extremely fragmented, which made it easy for rainwater to infiltrate and for groundwater to accumulate in the landslide area. At the same time, the shallow layer of the landslide consisted mainly of broken stone soil, which had a loose structure and good permeability and allowed the rapid infiltration of rainfall and surface water. In addition, the landslide deformed strongly after severe rainfall: continuous and concentrated rainfall increased the self-weight of the sliding mass, and the soil was pushed towards the central front and flowed under its own weight. Furthermore, excavation for the road formed an exposed (free) face at the front of the landslide, providing the conditions for the landslide to shear out. In summary, the topography, geological structure and stratigraphic lithology collectively provided the physical foundation and prerequisite for the formation of the landslide, while the interaction between human engineering activities (landfill projects) and heavy rainfall was the key factor contributing to the deformation of the Jinjiling landslide.

The Jiweishan landslide in Zone II illustrates a distinctive mechanism of slope failure [63], with the dominant factors for its occurrence being the distance from roads, slope orientation (aspect), surface cutting depth and relief. Because of the complex local terrain and dense vegetation, the geological conditions for the formation of this landslide were concealed and complex, and failure usually manifested as lateral sliding. The Jiweishan landslide resulted from the joint action of weak strata, the controlling geological structure, underground mining and karstification. Jiweishan is an inclined thick limestone hill, a structure that is widely distributed in the main limestone hill areas (Chongqing, Sichuan, Hubei, etc.). Since the 1920s, there have been continuous mining activities on the hill, and a mined-out area of more than 5 × 10⁴ m² had formed in the lower part of the slope, which influenced the deformation and stress adjustment of the hill. In areas with steep slopes, the shear resistance and stability of the slopes were even weaker, and a large number of sloping hills were unstable. Therefore, the formation of a large-scale dangerous rock mass at Jiweishan was mainly attributed to the high and steep terrain and to large-area iron ore mining under the hill. With the intensification of human activities, especially at lower elevations and closer to roads, the eco-environment on the ground surface was increasingly damaged, all of which contributed to the occurrence of landslides.

The waterfall plot helps us to explore the internal occurrence mechanism of individual landslides in a comprehensive and clear manner, and is highly practical in terms of providing a basis for disaster management authorities to make decisions.
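
A waterfall plot relies on the additivity of Shapley values: the model output for one landslide equals a base value plus the sum of the per-factor contributions. The following sketch computes exact Shapley values for a toy scoring function to make that property concrete. It is an illustration under stated assumptions, not the SHAP-LightGBM pipeline of this paper: the toy model, its coefficients and the single reference point are hypothetical, whereas SHAP's tree explainer uses an efficient tree-specific algorithm and averages over background data.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, background):
    """Exact Shapley values for one instance.

    Features outside a coalition are set to the background (reference)
    value. Cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else background[i] for i in range(n)]
        return model(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(x):
    """Hypothetical susceptibility score with one interaction term."""
    return 0.5 * x[0] + 0.3 * x[1] + 0.2 * x[0] * x[2]


instance = [1.0, 1.0, 1.0]    # hypothetical evaluation unit
background = [0.0, 0.0, 0.0]  # hypothetical reference point

phi = shapley_values(toy_model, instance, background)
base_value = toy_model(background)

# Waterfall reading: start at the base value and add each contribution.
running = base_value
for name, contribution in zip(["factor_0", "factor_1", "factor_2"], phi):
    running += contribution
    print(f"{name}: {contribution:+.3f} -> {running:.3f}")
```

The interaction term is split equally between the two factors involved, which is why the first factor receives 0.6 rather than 0.5 in this toy case.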

4.4. Post-Programming

Notably, the SHAP summary plots showed that the surface cutting depth (incision depth) selected in this paper has high contribution values in the landslide susceptibility zoning, further suggesting that it plays an important role in the occurrence of landslides. However, this factor has rarely been selected as a conditioning factor in previous studies. The incision density, incision depth and surface fluctuation degree (relief) are three single-factor indicators that can be combined to quantify the degree of surface fragmentation. In this paper, only the incision depth and the surface fluctuation degree were selected, in order to examine the factors influencing the landslide mechanism from the macroscopic perspective of the vertical level and regional topographic features. It is therefore advised that the surface incision density be included as a conditioning factor in future studies on landslide susceptibility.
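
Surface cutting depth and relief are typically derived from a DEM with a moving window. The sketch below assumes the commonly used definitions (cutting depth = mean elevation minus minimum elevation in the window; relief = maximum minus minimum elevation in the window). The paper does not state its exact formulas or window size in this section, so the definitions, the 3 × 3 window, the function names and the tiny DEM are all assumptions for illustration.

```python
def window(dem, row, col, radius=1):
    """Elevations in the square window around (row, col), clipped at the edges."""
    rows, cols = len(dem), len(dem[0])
    return [
        dem[r][c]
        for r in range(max(0, row - radius), min(rows, row + radius + 1))
        for c in range(max(0, col - radius), min(cols, col + radius + 1))
    ]


def cutting_depth(dem, row, col, radius=1):
    """Mean elevation minus minimum elevation in the window (assumed definition)."""
    cells = window(dem, row, col, radius)
    return sum(cells) / len(cells) - min(cells)


def relief(dem, row, col, radius=1):
    """Maximum minus minimum elevation in the window (assumed definition)."""
    cells = window(dem, row, col, radius)
    return max(cells) - min(cells)


# Hypothetical 3 x 3 DEM in metres.
dem = [
    [500, 520, 540],
    [480, 510, 530],
    [450, 470, 500],
]

print(cutting_depth(dem, 1, 1))  # centre cell, full 3 x 3 window
print(relief(dem, 1, 1))
```

An incision density layer, as recommended above for future work, would need a drainage or gully network in addition to the DEM and is not covered by this sketch.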

5. Conclusions

In this study, 16 landslide susceptibility conditioning factors were extracted from multiple data sources, such as satellite imagery, geological data and hydro-meteorological data, in areas with distinct topographic and geological features. Non-landslide (negative) samples were randomly selected at a 1:1 ratio to the historical landslide samples. A comprehensive and interpretable landslide susceptibility evaluation framework based on the SHAP-LightGBM algorithm was constructed for two mountain landforms in order to compare and explore the variation and spatial distribution of the dominant factors that induce landslides under different geomorphological conditions, and to explore the internal decision-making mechanism of the susceptibility results produced by machine learning algorithms. The aim of this study was to improve the scientific accuracy and transparency of zoning results and to minimize the influence of different geomorphological conditions on the results of landslide evaluation. This paper provides a reference for interpretable research on landslide hazard management and machine learning in two distinct areas: the corrosion layered high-middle mountain region and the middle mountainous region of strong karst gorges. By selecting two typical landform areas in Chongqing as the research area, this paper assessed the feasibility, interpretability and prediction accuracy of the proposed model. The conclusions are as follows.

1. The AUC values of the LightGBM and XGBoost models based on the Bayesian optimization algorithm for Zone I are 0.9649 and 0.9292, while those for Zone II are 0.9920 and 0.9773, respectively. Most of the area on the landslide susceptibility map falls in the very low and low susceptibility zones, while the high and very high susceptibility zones account for the majority of the historical landslides in the research area, and the landslide density increases gradually from the very low to the very high susceptibility zones. Both algorithms are accurate in predicting landslide occurrence in both zones, and the LightGBM model built after hyperparameter optimization has the higher evaluation accuracy, which further validates, with real cases, the prediction performance of the algorithm. The LSM results based on the LightGBM algorithm have great application prospects because they are realistic, reliable and scientific.

2. The elevation, surface cutting depth, land use, distance from roads and average annual rainfall are the dominant factors common to the two landform types. The distance from rivers is more relevant in the corrosion layered high-middle mountain region (Zone I), while the distance from faults has a stronger influence on the distribution of landslides in the middle mountainous region of strong karst gorges (Zone II).

3. The single-factor dependence plots generated by the SHAP algorithm quantify the contribution of individual factors to the model's evaluation results for individual evaluation units. In addition, analyzing how strongly each factor influences an individual landslide, together with the genesis analysis, takes into account the uniqueness and diversity of individual landslide causes. The integrated SHAP-LightGBM interpretation model proposed here for evaluating the causes of a single landslide and for risk prediction is of great value in the field of landslide susceptibility prediction.

4. SHAP is an effective algorithm for interpreting landslide susceptibility assessment results. The integrated interpretation framework based on the SHAP-LightGBM model can measure the importance and interaction of factors at both global and local levels. This enables scholars to comprehensively and explicitly understand and analyze the distribution characteristics of each factor during modelling and the occurrence pattern of landslide hazards, improves the credibility of machine learning algorithms, and provides a reference for research on the interpretability of machine learning. It is believed that SHAP and other XAI analysis tools will become an integral part of later research on machine learning systems.
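
The 1:1 sampling of non-landslide units and the AUC values reported in the conclusions above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the unit identifiers and the scores are hypothetical, and the AUC is computed with the rank-based (pairwise) definition rather than with any particular library.

```python
import random


def balanced_sample(landslide_ids, candidate_ids, seed=0):
    """Pair every landslide unit with one randomly drawn non-landslide unit (1:1)."""
    rng = random.Random(seed)
    negatives = rng.sample(candidate_ids, k=len(landslide_ids))
    return [(i, 1) for i in landslide_ids] + [(i, 0) for i in negatives]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random landslide sample receives a higher
    score than a random non-landslide sample; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical identifiers: 3 landslide units, 10 candidate non-landslide units.
samples = balanced_sample([101, 102, 103], list(range(10)))
print(len(samples), sum(label for _, label in samples))

# Hypothetical labels and predicted susceptibility scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; for a full susceptibility data set a sort-based implementation would be the more practical choice.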

Author Contributions

Conceptualization, J.Z.; Data curation, D.C. and C.M.; Formal analysis, D.C.; Funding acquisition, D.S.; Investigation, C.M.; Methodology, J.Z. and H.W.; Resources, Q.G.; Software, D.S.; Supervision, H.W.; Validation, Q.G.; Visualization, D.S.; Writing—original draft, D.S. and D.C.; Writing—review & editing, J.Z. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Chongqing (Grant number: CSTB2022NSCQ-MSX0594) and National Social Science Funds of China (Grant No. 22BJY140).

Data Availability Statement

Data are contained within the article. The corresponding author can provide the necessary model upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, Y.; Feng, L.; Li, S.; Ren, F.; Du, Q. A hybrid model considering spatial heterogeneity for landslide susceptibility mapping in Zhejiang Province, China. Catena 2020, 188, 104425.
2. Hungr, O.; Leroueil, S.; Picarelli, L. The Varnes classification of landslide types, an update. Landslides 2014, 11, 167–194.
3. Gokceoglu, C.; Sonmez, H.; Nefeslioglu, H.A.; Duman, T.Y.; Can, T. The 17 March 2005 Kuzulu landslide (Sivas, Turkey) and landslide-susceptibility map of its near vicinity. Eng. Geol. 2005, 81, 65–83.
4. Fang, K.; Tang, H.M.; Li, C.D.; Su, X.X.; An, P.J.; Sun, S.X. Centrifuge modelling of landslides and landslide hazard mitigation: A review. Geosci. Front. 2023, 14, 101493.
5. Hong, H.; Pourghasemi, H.R.; Pourtaghi, Z.S. Landslide susceptibility assessment in Lianhua County (China): A comparison between a random forest data mining technique and bivariate and multivariate statistical models. Geomorphology 2016, 259, 105–118.
6. Lacroix, P.; Handwerger, A.L.; Bievre, G. Life and death of slow-moving landslides. Nat. Rev. Earth Environ. 2020, 1, 404–419.
7. Guo, Z.; Yin, K.; Huang, F.; Fu, S.; Zhang, W. Landslide susceptibility evaluation based on landslide classification and weighted frequency ratio model. Chin. J. Rock Mech. Eng. 2019, 38, 14.
8. Sun, D.; Shi, S.; Wen, H.; Xu, J.; Zhou, X.; Wu, J. A hybrid optimization method of factor screening predicated on GeoDetector and Random Forest for Landslide Susceptibility Mapping. Geomorphology 2021, 379, 107623.
9. Lee, S.; Won, J.-S.; Jeon, S.; Park, I.; Lee, M.J. Spatial Landslide Hazard Prediction Using Rainfall Probability and a Logistic Regression Model. Math. Geosci. 2014, 47, 565–589.
10. Pourghasemi, H.R.; Moradi, H.R.; Aghda, S.M.F. Landslide susceptibility mapping by binary logistic regression, analytical hierarchy process, and statistical index models and assessment of their performances. Nat. Hazards 2013, 69, 749–779.
11. Bui, D.T.; Pradhan, B.; Lofman, O.; Revhaug, I. Landslide Susceptibility Assessment in Vietnam Using Support Vector Machines, Decision Tree, and Naïve Bayes Models. Math. Probl. Eng. 2012, 2012, 1–26.
12. Kavzoglu, T.; Sahin, E.K.; Colkesen, I. Landslide susceptibility mapping using GIS-based multi-criteria decision analysis, support vector machines, and logistic regression. Landslides 2013, 11, 425–439.
13. Tien, B.D.; Anh, T.T.; Klempe, H.; Pradhan, B.; Revhaug, I. Spatial prediction models for shallow landslide hazards: A comparative assessment of the efficacy of support vector machines, artificial neural networks, kernel logistic regression, and logistic model tree. Landslides 2016, 13, 361–378.
14. Chen, W.; Xie, X.; Wang, J.; Pradhan, B.; Hong, H.; Bui, D.T.; Duan, Z.; Ma, J. A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility. Catena 2017, 151, 147–160.
15. Were, K.; Bui, D.T.; Dick, B.; Ram, B. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol. Indic. 2015, 52, 394–403.
16. Pradhan, B. A comparative study on the predictive ability of the decision tree, support vector machine and neuro-fuzzy models in landslide susceptibility mapping using GIS. Comput. Geosci. 2013, 51, 350–365.
17. Sun, D.; Ding, Y.; Zhang, J.; Wen, H.; Wang, Y.; Xu, J.; Zhou, X.; Liu, R. Essential insights into decision mechanism of landslide susceptibility mapping based on different machine learning models. Geocarto Int. 2022.
18. Sun, D.; Gu, Q.; Wen, H.; Xu, J.; Zhang, Y.; Shi, S.; Xue, M.; Zhou, X. Assessment of landslide susceptibility along mountain highways based on different machine learning algorithms and mapping units by hybrid factors screening and sample optimization. Gondwana Res. 2022.
19. Liao, M.; Wen, H.; Yang, L. Identifying the essential conditioning factors of landslide susceptibility models under different grid resolutions using hybrid machine learning: A case of Wushan and Wuxi counties, China. Catena 2022, 217, 106428.
20. Xu, W.; Kang, Y.; Chen, L.; Wang, L.; Qin, C.; Zhang, L.; Liang, D.; Wu, C.; Zhang, W. Dynamic assessment of slope stability based on multi-source monitoring data and ensemble learning approaches: A case study of Jiuxianping landslide. Geol. J. 2022.
21. Zhang, G.-R.; Cheng, W. Stability prediction for Bazimen landslide of Zigui County under the associative action of reservoir water lever fluctuations and rainfall infiltration. Rock Soil Mech. 2011, 32, 476–482.
22. Zhou, X.Z.; Wen, H.J.; Li, Z.W.; Zhang, H.; Zhang, W.G. An interpretable model for the susceptibility of rainfall-induced shallow landslides based on SHAP and XGBoost. Geocarto Int. 2022, 37, 13419–13450.
23. Peng, J.; Cai, Z.; Chen, Z.; Liu, X.; Zheng, M.; Song, C.; Zhu, X.; Teng, Y.; Zhang, R.; Zhou, Y.; et al. An trustworthy intrusion detection framework enabled by ex-post-interpretation-enabled approach. J. Inf. Secur. Appl. 2022, 71, 103364.
24. Fleming, S.W.; Watson, J.R.; Ellenson, A.; Cannon, A.J.; Vesselinov, V.C. Machine learning in Earth and environmental science requires education and research policy reforms. Nat. Geosci. 2021, 14, 878–880.
25. Sun, D.L.; Gu, Q.Y.; Wen, H.J.; Shi, S.X.; Mi, C.L.; Zhang, F.T. A Hybrid Landslide Warning Model Coupling Susceptibility Zoning and Precipitation. Forests 2022, 13, 827.
26. Alnahit, A.O.; Mishra, A.K.; Khan, A.A. Stream water quality prediction using boosted regression tree and random forest models. Stoch. Environ. Res. Risk Assess. 2022, 36, 2661–2680.
27. Zhou, J.; Li, E.; Yang, S.; Wang, M.; Shi, X.; Yao, S.; Mitri, H.S. Slope stability prediction for circular mode failure using gradient boosting machine approach based on an updated database of case histories. Saf. Sci. 2019, 118, 505–518.
28. Guo, X.; Li, Y.; Ling, H. LIME: Low-Light Image Enhancement via Illumination Map Estimation. IEEE Trans. Image Process. 2016, 26, 982–993.
29. Crombecq, K.; Gorissen, D.; Deschrijver, D.; Dhaene, T. A Novel Hybrid Sequential Design Strategy for Global Surrogate Modeling of Computer Experiments. SIAM J. Sci. Comput. 2011, 33, 1948–1974.
30. Biecek, P.L. DALEX: Explainers for Complex Predictive Models in R. J. Mach. Learn. Res. 2018, 19, 3245–3249.
31. El-Sappagh, S.; Alonso, J.M.; Islam, S.M.R.; Sultan, A.M.; Kwak, K.S. A multilayer multimodal detection and prediction model based on explainable artificial intelligence for Alzheimer’s disease. Sci. Rep. 2021, 11, 1–26.
32. Ekmekcioğlu, Ö.; Koc, K. Explainable step-wise binary classification for the susceptibility assessment of geo-hydrological hazards. Catena 2022, 216, 106379.
33. Beven, K. What we see now: Event-persistence and the predictability of hydro-eco-geomorphological systems. Ecol. Model. 2015, 298, 4–15.
34. Oguchi, T. Geomorphological debates in Japan related to surface processes, tectonics, climate, research principles, and international geomorphology. Geomorphology 2019, 366, 106805.
35. Li, B.; Pan, B.; Han, J. Basic terrestrial geomorphological types in china and their circumscriptions. Quaternary Sci. 2008, 28, 535–543.
36. Huang, F.; Chen, J.; Du, Z.; Yao, C.; Huang, J.; Jiang, Q.; Chang, Z.; Li, S. Landslide Susceptibility Prediction Considering Regional Soil Erosion Based on Machine-Learning Models. ISPRS Int. J. Geo-Inf. 2020, 9, 377.
37. Buah, P.A.; Zhang, Y.; Bakah, D.A.Y.; Ahiabu, M.K.; Lei, Z. Earthquake-Induced Landslide Susceptibility Analysis: The Effect of DEM Resolution. In Proceedings of the 2019 International Conference on Mechatronics, Remote Sensing, Information Systems and Industrial Information Technologies (ICMRSISIIT), Ghana, 20–22 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–5.
38. Ayalew, L.; Yamagishi, H. The application of GIS-based logistic regression for landslide susceptibility mapping in the Kakuda-Yahiko Mountains, Central Japan. Geomorphology 2005, 65, 15–31.
39. Xi, C.; Han, M.; Hu, X.; Liu, B.; He, K.; Luo, G.; Cao, X. Effectiveness of Newmark-based sampling strategy for coseismic landslide susceptibility mapping using deep learning, support vector machine, and logistic regression. Bull. Eng. Geol. Environ. 2022, 81, 174.
40. Chen, S.; Miao, Z.; Wu, L.; Zhang, A.; Li, Q.; He, Y. A One-Class-Classifier-Based Negative Data Generation Method for Rapid Earthquake-Induced Landslide Susceptibility Mapping. Front. Earth Sci. 2021, 9, 609896.
41. Zhu, A.-X.; Miao, Y.; Liu, J.; Bai, S.; Zeng, C.; Ma, T.; Hong, H. A similarity-based approach to sampling absence data for landslide susceptibility mapping using data-driven methods. Catena 2019, 183, 104188.
42. Heckmann, T.; Gegg, K.; Gegg, A.; Becht, M. Sample size matters: Investigating the effect of sample size on a logistic regression susceptibility model for debris flows. Nat. Hazards Earth Syst. Sci. 2014, 14, 259–278.
43. Ronquist, F.; Teslenko, M.; van der Mark, P.; Ayres, D.L.; Darling, A.; Höhna, S.; Larget, B.; Liu, L.; Suchard, M.A.; Huelsenbeck, J.P. MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice across a Large Model Space. Syst. Biol. 2012, 61, 539–542.
44. Scheres, S.H. RELION: Implementation of a Bayesian approach to cryo-EM structure determination. J. Struct. Biol. 2012, 180, 519–530.
45. Wood, S.N. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2011, 73, 3–36.
46. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference, San Francisco, CA, USA, 13–17 August 2016.
47. Qi, M. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
48. Fan, J.; Wu, X.; Zhang, L.; Yu, F.; Zeng, X.; Wen, Z. Light Gradient Boosting Machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data. Agric. Water Manag. 2019, 225, 105758.
49. Massaoudi, M.; Refaat, S.S.; Chihi, I.; Trabelsi, M.; Oueslati, F.S.; Abu-Rub, H. A novel stacked generalization ensemble-based hybrid LGBM-XGB-MLP model for Short-Term Load Forecasting. Energy 2021, 214, 118874.
50. Chelgani, S.C.; Nasiri, H.; Alidokht, M. Interpretable modeling of metallurgical responses for an industrial coal column flotation circuit by XGBoost and SHAP-A “conscious-lab” development. Int. J. Min. Sci. Technol. 2021, 31, 1135–1144.
51. Kim, D.; Antariksa, G.; Handayani, M.P.; Lee, S.; Lee, J. Explainable Anomaly Detection Framework for Maritime Main Engine Sensor Data. Sensors 2021, 21, 5200.
52. Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
53. Shapley, L.S. A Value for n-Person Games; RAND Corporation: Santa Monica, CA, USA, 1952.
54. Kalantar, B.; Pradhan, B.; Naghibi, S.A.; Motevalli, A.; Mansor, S. Assessment of the effects of training data selection on the landslide susceptibility mapping: A comparison between support vector machine (SVM), logistic regression (LR) and artificial neural networks (ANN). Geomat. Nat. Hazards Risk 2017, 9, 49–69.
55. Mirzaei, G.; Soltani, A.; Soltani, M.; Darabi, M. An integrated data-mining and multi-criteria decision-making approach for hazard-based object ranking with a focus on landslides and floods. Environ. Earth Sci. 2018, 77, 1–23.
56. Salmeron, R.; Garcia, J.; Garcia, C.; Lopez, M.D. Transformation of variables and the condition number in ridge estimation. Comput. Stat. 2018, 33, 1497–1524.
57. Lubo-Robles, D.; Devegowda, D.; Jayaram, V.; Bedle, H.; Marfurt, K.J.; Pranter, M.J. Machine learning model interpretability using SHAP values: Application to a seismic facies classification task. In Proceedings of the SEG International Exposition and Annual Meeting, Virtual, 11–16 October 2020; OnePetro: Richardson, TX, USA, 2020.
58. Wang, D.; Thunéll, S.; Lindberg, U.; Jiang, L.; Trygg, J.; Tysklind, M. Towards better process management in wastewater treatment plants: Process analytics based on SHAP values for tree-based machine learning methods. J. Environ. Manag. 2021, 301, 113941.
59. Li, X.; Geng, T.; Shen, W.; Zhang, J.; Zhou, Y. Quantifying the influencing factors and multi-factor interactions affecting cadmium accumulation in limestone-derived agricultural soil using random forest (RF) approach. Ecotoxicol. Environ. Saf. 2020, 209, 111773.
60. Yan, G.; Yin, Y.; Huang, B.; Zhang, Z.; Zhu, S. Formation mechanism and characteristics of the Jinjiling landslide in Wushan in the Three Gorges Reservoir region, China. Landslides 2019, 16, 2087–2101.
61. Ge, Y.; Tang, H.; Eldin, M.A.M.E.; Chen, H.; Zhong, P.; Zhang, L.; Fang, K. Deposit characteristics of the Jiweishan rapid long-runout landslide based on field investigation and numerical modeling. Bull. Eng. Geol. Environ. 2018, 78, 4383–4396.
62. Luo, H.; Hu, W.; Zhang, X.H.; McSaveney, M.; Li, Y. The study on rock thermal fractures at sliding surface of Jiweishan landslide. Eng. Geol. 2022, 300, 106588.
63. Zhao, Z.; Deng, L. Initiation mechanism of Jiweishan high-speed rockslide in Chongqing, China. Nat. Hazards 2020, 103, 3765–3781.
64. Sahin, E.K. Assessing the predictive capability of ensemble tree methods for landslide susceptibility mapping using XGBoost, gradient boosting machine, and random forest. SN Appl. Sci. 2020, 2, 1308.
65. Li, Z. Extracting spatial effects from machine learning model using local interpretation method: An example of SHAP and XGBoost. Comput. Environ. Urban Syst. 2022, 96, 101845.
66. Wang, Y.X.; Lang, J.W.; Zuo, J.Z.; Dong, Y.Q.; Hu, Z.T.; Xu, X.L.; Zhang, Y.K.; Wang, Q.J.; Yang, L.Z.; Wong, S.T.C.; et al. The radiomic-clinical model using the SHAP method for assessing the treatment response of whole-brain radiotherapy: A multicentric study. Eur. Radiol. 2022, 32, 8737–8747.
67. Sun, D.; Xu, J.; Wen, H.; Wang, Y. An Optimized Random Forest Model and Its Generalization Ability in Landslide Susceptibility Mapping: Application in Two Areas of Three Gorges Reservoir, China. J. Earth Sci. 2020, 31, 1068–1086.
68. Wu, Y.; Li, W.; Wang, Q.; Liu, Q.; Yang, D.; Xing, M.; Pei, Y.; Yan, S. Landslide susceptibility assessment using frequency ratio, statistical index and certainty factor models for the Gangu County, China. Arab. J. Geosci. 2016, 9, 1–16.
69. Zhang, J.; Ma, X.; Zhang, J.; Sun, D.; Zhou, X.; Mi, C.; Wen, H. Insights into geospatial heterogeneity of landslide susceptibility based on the SHAP-XGBoost model. J. Environ. Manag. 2023, 332, 117357.

Figure 1. Landform zoning map.

Figure 2. (a) The percentage plot of landform types of Zone I; (b) The percentage plot of landform types of Zone II.

Figure 3. Location of the research area (the left is Zone II and the right is Zone I).

Figure 4. Historical landslide event data feature partition statistics.

Figure 5. Thematic layers of selected landslide conditioning factors in Zone I. (a) curvature; (b) elevation; (c) distance from faults; (d) slope; (e) POI; (f) perennial mean rainfall; (g) distance from river; (h) distance from roads; (i) surface cutting depth; (j) lithology; (k) aspect; (l) land use; (m) NDVI; (n) relief; (o) TRI; and (p) TWI.

Figure 6. Thematic layers of selected landslide conditioning factors in Zone II. (a) curvature; (b) elevation; (c) distance from faults; (d) slope; (e) POI; (f) perennial mean rainfall; (g) distance from river; (h) distance from roads; (i) surface cutting depth; (j) lithology; (k) aspect; (l) land use; (m) NDVI; (n) relief; (o) TRI; and (p) TWI.

Figure 7. Research flow chart.

Figure 8. Model ROC curves of XGBoost (a) and LightGBM (b).

Figure 9. Model ROC curves of XGBoost (a) and LightGBM (b).

Figure 10. Landslide susceptibility zoning maps: (a) Zone I; (b) Zone II.

Figure 11. Factor importance ranking diagram based on LightGBM model ((a) for Zone I, (b) for Zone II).

Figure 12. SHAP summary plot ((a) for Zone I, (b) for Zone II).

Figure 13. Single-factor dependence plot for Zone I. (a) Elevation; (b) distance from river; (c) distance from roads; (d) surface cutting depth; (e) land use; (f) perennial mean rainfall.

Figure 14. Single-factor dependence plot for Zone II. (a) Elevation; (b) perennial mean rainfall; (c) distance from roads; (d) land use; (e) distance from faults; (f) surface cutting depth.

Figure 15. Waterfall plots for two cases: (a) Jinjiling landslide; (b) Jiweishan landslide.

Table 1. Criteria for the reclassification of landform types.

Fluctuation (m)	Elevation (m)
	Low elevation (<1000)	Middle elevation (1000–2000)	Mid-high elevation (2000–4000)
Terrace (<75)	Low elevation terrace	Middle elevation terrace	Mid-high elevation terrace
Hill (75–200)	Low elevation hill	Middle elevation hill	Mid-high elevation hill
Microrelief mountain (200–500)	Microrelief low mountain	Microrelief middle mountain	Microrelief mid-high mountain
Mesorelief mountain (500–1000)	Mesorelief low mountain	Mesorelief middle mountain	Mesorelief mid-high mountain
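
The reclassification in Table 1 amounts to a two-way lookup on fluctuation (relief) and elevation. The following is a minimal Python sketch of that lookup. The function name `classify_landform`, the handling of interval boundaries (lower bound inclusive, upper bound exclusive), and the `None` result outside the tabulated ranges are assumptions made for illustration; the table itself does not specify them.

```python
from bisect import bisect_right

# Class breaks in metres, taken from Table 1.
RELIEF_BREAKS = [75, 200, 500, 1000]
ELEVATION_BREAKS = [1000, 2000, 4000]
ELEVATION_NAMES = ["low", "middle", "mid-high"]


def classify_landform(relief_m, elevation_m):
    """Return the Table 1 landform label, or None outside the tabulated ranges.

    Assumption: each interval includes its lower bound and excludes its upper bound.
    """
    if relief_m < 0:
        return None
    r = bisect_right(RELIEF_BREAKS, relief_m)
    e = bisect_right(ELEVATION_BREAKS, elevation_m)
    if r >= len(RELIEF_BREAKS) or e >= len(ELEVATION_BREAKS):
        return None
    elev = ELEVATION_NAMES[e]
    if r == 0:
        return f"{elev.capitalize()} elevation terrace"
    if r == 1:
        return f"{elev.capitalize()} elevation hill"
    prefix = "Microrelief" if r == 2 else "Mesorelief"
    return f"{prefix} {elev} mountain"


# Example: a cell with 300 m of relief at 1500 m elevation.
print(classify_landform(300, 1500))
```

Using `bisect_right` keeps the class breaks in one place, so a different boundary convention only requires changing the lookup call.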

Table 2. Data and data sources.

Data Name	Data Source	Type	Scale/Resolution
Historical Landslides	Chongqing Geological Monitoring Station	Data Sheet
DEM	Global digital elevation model (GDEM)	Raster	30 m
Geological Information	National Data Center for Geological Information	Raster	1:200,000
Administrative Zoning	Chongqing Municipal Bureau of Land	Vector	1:200,000
Land use	Chongqing Municipal Bureau of Land	Raster	1:100,000
River Network	Chongqing Water Resources Bureau	Vector	1:100,000
Satellite images	Geospatial Data Cloud Platform	Raster	30 m
Rainfall Data	Chongqing Meteorological Bureau	Data Sheet	30 m
Roads	Chongqing Municipal Transportation Commission	Vector	1:100,000
NDVI	Landsat 8 OLI	Geospatial Data Cloud	30 m
2016 Chongqing Point of Interest	Web crawler	Vector

Table 3. Partial factor classification criteria.

Influence Factors	Number of Classes	Classification Standards
Land use type	9	1. forested land (11, 12); 2. grassland (22, 23, 24); 3. farmland (31); 4. garden land (33); 5. residential land (41, 42); 6. industrial and mining storage land (44); 7. transportation land (45); 8. water and water facilities land (61, 62, 63, 64); and 9. other land (72, 73).
Lithology	10	1. Qb₂l; 2. TJx; 3. T₁j; 4. D; 5. T₁d-j; 6. J₃sn; 7. T₃xj; 8. Є₃; 9. S₂; 10. Z
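
The land-use grouping in Table 3 can be expressed as a simple code lookup. The sketch below assumes that the numbers in parentheses are the raw land-use codes of the source layer; the names `LAND_USE_CLASSES` and `reclassify_land_use` are hypothetical and are used here only for illustration.

```python
# Class number -> (class name, raw land-use codes), copied from Table 3.
LAND_USE_CLASSES = {
    1: ("forested land", (11, 12)),
    2: ("grassland", (22, 23, 24)),
    3: ("farmland", (31,)),
    4: ("garden land", (33,)),
    5: ("residential land", (41, 42)),
    6: ("industrial and mining storage land", (44,)),
    7: ("transportation land", (45,)),
    8: ("water and water facilities land", (61, 62, 63, 64)),
    9: ("other land", (72, 73)),
}

# Invert the table so that each raw code points to its class number.
CODE_TO_CLASS = {
    code: class_id
    for class_id, (_, codes) in LAND_USE_CLASSES.items()
    for code in codes
}


def reclassify_land_use(code):
    """Map a raw land-use code to its Table 3 class number (None if unlisted)."""
    return CODE_TO_CLASS.get(code)


# Example: reclassify a short list of raw codes.
print([reclassify_land_use(c) for c in (12, 31, 45, 63)])
```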

Table 4. Main hyperparameters involved in LightGBM and XGBoost.

Hyperparameter	Explanation
colsample_bytree	Fraction of features (columns) randomly sampled when building each tree; the XGBoost default is 1.
gamma	Minimum loss reduction required to split a leaf node further; the XGBoost default is 0.
learning_rate	Shrinkage factor applied to the output of each tree before it is added to the ensemble; the XGBoost default is 0.3.
max_depth	Maximum depth of a tree; the XGBoost default is 6.
min_child_weight	Minimum sum of instance weights required in a child node.
reg_alpha	Weight of the L1 regularization term; the XGBoost default is 0.
reg_lambda	Weight of the L2 regularization term; the XGBoost default is 1.
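
As a reading aid for Table 4, the regularized objective commonly written for XGBoost-style gradient boosting shows where several of these hyperparameters enter. The notation below is an assumption and may differ from the symbols used in the methods section: T is the number of leaves of a tree, w_j are its leaf weights, γ corresponds to gamma, λ to reg_lambda, α to reg_alpha, and η to learning_rate.

```latex
\begin{aligned}
\mathcal{L} &= \sum_{i} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k} \Omega(f_k), \\
\Omega(f) &= \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert, \\
\hat{y}_i^{(t)} &= \hat{y}_i^{(t-1)} + \eta\, f_t(x_i).
\end{aligned}
```

Larger values of γ, λ or α penalize complex trees more strongly, while a smaller η reduces the contribution of each tree added to the ensemble.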

Table 5. Results of hyperparameter optimization of the model based on Zone I.

Model	Hyperparameter	Result	Model	Hyperparameter	Result
LightGBM	colsample_bytree	0.5489720889492241	XGBoost	colsample_bytree	0.689583488663039
	gamma	0.13316670248904283		gamma	0.19699648421981572
	learning_rate	0.31192436689467884		learning_rate	0.0631408298056055
	max_depth	470		max_depth	110
	min_child_weight	0.5296189137535521		min_child_weight	0.09409252963115114
	reg_alpha	0.3		reg_alpha	0.6
	reg_lambda	0.4270520293750171		reg_lambda	0.4025587616151741

Table 6. Results of hyperparameter optimization of the model based on Zone II.

Model	Hyperparameter	Result	Model	Hyperparameter	Result
LightGBM	colsample_bytree	0.8765083254007644	XGBoost	colsample_bytree	0.8728754599170524
	gamma	0.3652433194728314		gamma	0.5242207496165615
	learning_rate	0.4106947090304667		learning_rate	0.618432241258086
	max_depth	495		max_depth	390
	min_child_weight	0.2253032913943282		min_child_weight	0.9909597061558622
	reg_alpha	1.5		reg_alpha	0.6
	reg_lambda	0.1870025431675585		reg_lambda	0.884017219634587

Table 7. Comparison of accuracy of landslide susceptibility models in Zone I.

Model	Accuracy	Precision	Recall	F1-Score	AUC_Test	AUC_Train
XGBoost	0.7506	0.7789	0.7581	0.7645	0.8175	0.8929
LightGBM	0.8182	0.8095	0.8132	0.8138	0.8858	0.9649
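
The accuracy, precision, recall and F1-score columns are all derived from a binary confusion matrix. The sketch below shows the standard definitions; the function name and the counts in the example are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0, so that no denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=90, fp=30, fn=10, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is a harmonic mean, it always lies between the precision and the recall from which it is computed.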

Table 8. Comparison of accuracy of landslide susceptibility models in Zone II.

Model	Accuracy	Precision	Recall	F1-Score	AUC_Test	AUC_Train
XGBoost	0.7442	0.7218	0.7211	0.7328	0.8176	0.8930
LightGBM	0.7649	0.7662	0.7562	0.7762	0.8725	0.9285
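
The AUC columns can be read as the probability that a randomly chosen landslide sample receives a higher predicted score than a randomly chosen non-landslide sample. The sketch below computes AUC with this pairwise definition and then derives the train-test AUC gap from the Table 8 values. The function name, the toy labels and the toy scores are assumptions for illustration; the study's own implementation is not shown in this section.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, for illustration only.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.65]
toy_auc = roc_auc(toy_labels, toy_scores)

# (AUC_Test, AUC_Train) for Zone II, copied from Table 8.
TABLE_8 = {"XGBoost": (0.8176, 0.8930), "LightGBM": (0.8725, 0.9285)}
auc_gap = {name: train - test for name, (test, train) in TABLE_8.items()}
print(toy_auc, auc_gap)
```

The pairwise form is quadratic in the number of samples; it is used here for clarity rather than speed.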

Table 9. Results of the collinearity analysis (tolerance and VIF) for Zones I and II.

Zone I			Zone II
Conditioning Factors	Tolerance	VIF	Conditioning Factors	Tolerance	VIF
Aspect	0.992	1.008	Aspect	0.98	1.021
Curvature	0.878	1.138	Curvature	0.883	1.133
Distance from fault	0.63	1.588	Distance from fault	0.851	1.175
Distance from river	0.573	1.744	Distance from river	0.565	1.771
Distance from road	0.68	1.471	Distance from road	0.88	1.136
Elevation	0.241	4.143	Elevation	0.215	4.643
Land use	0.768	1.303	Land use	0.794	1.259
Lithology	0.726	1.378	Lithology	0.904	1.106
NDVI	0.437	2.288	NDVI	0.627	1.595
Perennial mean rainfall	0.202	4.949	Perennial mean rainfall	0.251	3.983
POI	0.651	1.535	POI	0.786	1.272
Relief	0.354	2.822	Relief	0.353	2.83
Slope	0.152	6.592	Slope	0.191	5.243
Surface cutting depth	0.606	1.65	Surface cutting depth	0.546	1.831
TRI	0.171	5.859	TRI	0.223	4.481
TWI	0.715	1.399	TWI	0.76	1.315
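
Tolerance and VIF in Table 9 are reciprocals of each other (VIF = 1/tolerance), so the two columns can be cross-checked. The sketch below does this for the Zone I values and flags factors against a VIF limit of 10. That limit is a commonly used rule of thumb and is an assumption here, as are the function names.

```python
# (factor, tolerance, reported VIF) for Zone I, copied from Table 9.
ZONE_I = [
    ("Aspect", 0.992, 1.008),
    ("Curvature", 0.878, 1.138),
    ("Distance from fault", 0.63, 1.588),
    ("Distance from river", 0.573, 1.744),
    ("Distance from road", 0.68, 1.471),
    ("Elevation", 0.241, 4.143),
    ("Land use", 0.768, 1.303),
    ("Lithology", 0.726, 1.378),
    ("NDVI", 0.437, 2.288),
    ("Perennial mean rainfall", 0.202, 4.949),
    ("POI", 0.651, 1.535),
    ("Relief", 0.354, 2.822),
    ("Slope", 0.152, 6.592),
    ("Surface cutting depth", 0.606, 1.65),
    ("TRI", 0.171, 5.859),
    ("TWI", 0.715, 1.399),
]

VIF_LIMIT = 10.0  # assumed rule-of-thumb threshold, not stated in the table


def vif_from_tolerance(tolerance):
    """VIF is the reciprocal of tolerance."""
    return 1.0 / tolerance


def check_collinearity(rows, limit=VIF_LIMIT):
    """Return (largest relative gap between 1/tolerance and reported VIF, flagged factors)."""
    max_gap = max(abs(vif_from_tolerance(t) - v) / v for _, t, v in rows)
    flagged = [name for name, _, v in rows if v >= limit]
    return max_gap, flagged


max_gap, flagged = check_collinearity(ZONE_I)
print(f"largest relative gap: {max_gap:.4f}; factors at or above limit: {flagged}")
```

Small gaps between 1/tolerance and the reported VIF are expected because the tolerance values are rounded to three decimals.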

Table 10. Results of McNemar’s Test.

Data Set	McNemar Statistic	p-Value
Zone I	7.673684211	0.005603194
Zone II	7.911392405	0.004912444
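
The p-values in Table 10 can be recovered from the statistics by referring them to a chi-square distribution with one degree of freedom. The sketch below does this with the standard library only. The uncorrected form of the statistic and the discordant counts 61 and 34 are assumptions: the counts are not reported in the source and were chosen only because they reproduce the Zone I statistic.

```python
import math


def mcnemar_statistic(b, c):
    """Uncorrected McNemar statistic from the two discordant counts b and c."""
    if b + c == 0:
        raise ValueError("no discordant pairs")
    return (b - c) ** 2 / (b + c)


def chi2_sf_1df(x):
    """P(X > x) for a chi-square variable with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Statistics as reported in Table 10.
reported = {"Zone I": 7.673684211, "Zone II": 7.911392405}
p_values = {zone: chi2_sf_1df(stat) for zone, stat in reported.items()}

# Hypothetical discordant counts (not reported in the source).
example_stat = mcnemar_statistic(61, 34)

for zone, p in p_values.items():
    print(f"{zone}: p = {p:.6f}")
```

Both p-values fall below 0.05, which is the usual basis for treating the difference between the two classifiers as statistically significant.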

Table 11. Classification statistics of landslide susceptibility in Zones I and II.

	Susceptibility Class	Area (km²)	Number of Landslides	Landslide Density (landslides/km²)
Zone I	Very low	6044.495	45	0.007
	Low	2736.526	114	0.042
	Moderate	2093.62	254	0.121
	High	2095.491	602	0.287
	Very high	1113.746	858	0.770
Zone II	Very low	8823.138	102	0.012
	Low	6527.038	149	0.023
	Moderate	5665.423	271	0.048
	High	4067.3	299	0.074
	Very high	1587.695	434	0.273
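
The landslide density column in Table 11 is the number of landslides divided by the class area. The sketch below recomputes it from the tabulated areas and counts, and also derives the share of landslides that fall in the high and very high classes. The function names are hypothetical.

```python
# (susceptibility class, area in km², number of landslides), copied from Table 11.
ZONE_I = [
    ("Very low", 6044.495, 45),
    ("Low", 2736.526, 114),
    ("Moderate", 2093.62, 254),
    ("High", 2095.491, 602),
    ("Very high", 1113.746, 858),
]
ZONE_II = [
    ("Very low", 8823.138, 102),
    ("Low", 6527.038, 149),
    ("Moderate", 5665.423, 271),
    ("High", 4067.3, 299),
    ("Very high", 1587.695, 434),
]


def landslide_density(rows):
    """Landslides per km² for each susceptibility class."""
    return {name: count / area for name, area, count in rows}


def share_in_classes(rows, classes=("High", "Very high")):
    """Fraction of all landslides located in the given classes."""
    total = sum(count for _, _, count in rows)
    selected = sum(count for name, _, count in rows if name in classes)
    return selected / total


for label, rows in (("Zone I", ZONE_I), ("Zone II", ZONE_II)):
    density = landslide_density(rows)
    share = share_in_classes(rows)
    print(label, {k: round(v, 3) for k, v in density.items()}, f"{share:.1%}")
```

In both zones the density increases monotonically from the very low to the very high class, which is the pattern expected of a well-ordered susceptibility map.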

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
