**Landslide Susceptibility Research Combining Qualitative Analysis and Quantitative Evaluation: A Case Study of Yunyang County in Chongqing, China**

**Wengang Zhang <sup>1</sup> , Songlin Liu <sup>1</sup> , Luqi Wang 1,\*, Pijush Samui <sup>2</sup> , Marcin Chwała <sup>3</sup> and Yuwei He <sup>1</sup>**


**Abstract:** Machine learning-based methods are commonly used for landslide susceptibility mapping. Most of the recent publications focused on quantitative analysis, i.e., improving data processing methods, comparing and perfecting the data-driven model itself, but rarely taking the qualitative aspects of the local landslide occurrences into consideration and the further analysis of the key features was always lacking. This study aims to combine qualitative and quantitative analysis and examine its effect on mapping accuracy; based on the feature importance ranks and the related literature, the key features for identifying landslide/non-landslide points of different sub-zones were further analyzed. Before modeling, the study area Yunyang County, Chongqing City, China, was manually divided into four sub-zones based on the information from geological hazards exploration in Chongqing, including the mechanism of landslide formation and sliding failure and geomorphic unit characteristics. Upon the qualitative analysis basis, five grid searches tuned random forest models (one for the whole region and four for the sub-zones independently) were established by 1654 data points and 20 conditioning features. Compared with the conventional data-driven method, the integrated quantitative evaluation based on the qualitative analysis results showed higher reliability, which not only improved the mapping accuracy but also increased the AUC values of all four submodels, which were 8.8%, 2.3%, 1.9% and 9.1% higher than that of the parent model. Moreover, the quantitative evaluation based on the qualitative analysis revealed the key factors affecting local landslide formation. Therefore, qualitative analysis is recommended in future landslide susceptibility modeling with the additional combination of data-driven methods.

**Keywords:** landslide susceptibility mapping; random forest model; qualitative analysis; quantitative evaluation; Yunyang County

### **1. Introduction**

Landslides are one of the most destructive geological hazards, which not only cause enormous damage to houses and infrastructure, such as bridges and roads, but also lead to loss of life [1]. According to the World Health Organization, approximately 4.8 million people were affected, and more than 18,000 deaths were caused by landslides between 1998 and 2017. Specifically, as one of the countries with a high incidence of landslides, China suffered severe loss of life [2,3]. The China Statistical Yearbook indicates that during 2000 to 2015, 373,630 landslides occurred in this country, killing 10,996 people, which is approximately 690 landslide-related deaths per year [4]. To mitigate the serious social impact caused by landslides, constructive and productive activities should be avoided in areas with high susceptibility to landslides. Therefore, developing an efficient method

**Citation:** Zhang, W.; Liu, S.; Wang, L.; Samui, P.; Chwała, M.; He, Y. Landslide Susceptibility Research Combining Qualitative Analysis and Quantitative Evaluation: A Case Study of Yunyang County in Chongqing, China. *Forests* **2022**, *13*, 1055. https://doi.org/10.3390/ f13071055

Academic Editors: Chong Xu, Haijia Wen, Weile Li and Hiromu Daimaru

Received: 11 May 2022 Accepted: 28 June 2022 Published: 4 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

to distinguish landslide-prone zones is an essential need for both local governments and research institutes [5]. Landslide susceptibility describes the likelihood that a landslide will occur in a certain area based on local terrain conditions [6]. Landslide susceptibility mapping (LSM) is one of the most widely used assessment methods, it visualizes the spatial distribution of zones with different probabilities of occurrence of landslides in a certain area.

Various methods such as probabilistic analysis, statistical analysis, analytic process, and weighted overlay were widely applied to LSM by researchers in the early stages. With the development of Artificial Intelligence (AI) and Geographic Information System (GIS), machine learning-based methods, with the capability of solving complex nonlinear problems, are becoming increasingly popular compared to opinion-driven models and statistical learning, making the accuracy and precision of susceptibility models evolve rapidly [7,8]. Huang et al. [9] adopted logic regression (LR), support vector machine (SVM), and random forest (RF) on LSM for model comparison. He et al. [10] used RF in the global assessment of earthquake-induced landslide susceptibility. Sun et al. [11] applied the Bayes algorithm to optimize the hyper-parameters of the RF model for LSM. Smith et al. [12] compared the effect of landslide inventories assembled by different methods on the performance of RF and LF for LSM. Lim et al. [13] applied the RF model to estimate the probability of a landslide. Nhu et al. [14] investigated and compared the Logistic Model Tree, LR, NBTress, Artificial Neural Network (ANN), and SVM in the shallow landslide susceptibility mapping for Bijar City in Kurdistan City. Zhang et al. [15] used the predictive performance of RF, XGBoost, SVM, and LR on landslide susceptibility mapping in Yunyang County. Hu et al. [16] compared the effect of different non-landslide sampling methods on the performance of SVM and NB for LSM. Zhou et al. [17] applied GeoDetector and RFE for factor optimization and then used the selected factors as inputs to train an RF model to obtain the LSM of Wuxi County. Sun et al. [18] proposed a hybrid landslide warning model based on RF susceptibility zoning and precipitation. Zhou et al. [19] constructed an interpretable model for the susceptibility to rainfall-induced shallow landslides based on SHAP and XGBoost.

Among those methods, RF is the most commonly used method in large-scale mapping and classification [20–22] due to its characteristics of low computational cost, low data requirement, convenience of hyper-parameters tuning, and robustness in solving complex nonlinear problems [23]. Previous work usually focused on quantitative analysis, such as the selection and improvement of models and input features, but rarely took into account the qualitative analysis of landslide areas. Actually, as one of the major geological hazards, landslides are highly area dependent, the mechanism of landslide formation and its corresponding triggering factors are undoubtedly different in distinct areas. The frequent fluctuation of reservoir water seriously reduces the stability of the slopes in the reservoir area, making them prone to landslides [24]. For mountainous areas, however, rainfall is the major triggering factor for the occurrence of landslides [25]. With increased population, human activities have become the major issue that accelerates landslide formation in areas with high population density. Therefore, manually dividing a relatively large region into different sub-zones according to the qualitative analysis of the landslide formation and geomorphic unit characteristics will theoretically improve mapping accuracy. This paper aims to use the 827 historical landslide data points in Yunyang County and the 20 conditioning factors to build 5 RF models, including an RF model (referred to as the parent model below) for the whole region and four RF models for the divided four sub-zones (referred to as sub-model one to sub-model four below). Then, the feature importance and the performances of the parent model and the four sub-models are analyzed and compared to verify the effectiveness of applying experience-based zonation before modeling.

### **2. Study Area**

Chongqing City is located in the mountainous area around the eastern Sichuan basin and the slope area of the basin margin. It spans two tectonic units, namely the Yangtze quasi-platform and the Qinling fold system. The landscape of Chongqing City is mainly mountains and hills, which make up 92% of its total area. There are many adverse geological conditions accelerating the formation of landslides, dangerous rock collapse, ground collapse, debris flow, and other geological disasters, including developed surface water networks, strongly cut terrain, complex rock and soil structure, and geological structure, making it one of the cities with the highest geological disaster frequency in the country.

The spatial distribution of geological hazards in Chongqing City shows a certain degree of concentration and can be concluded as a striped distribution and vertical zonal distribution; moreover, its temporal distribution presents a seasonal cluster pattern. According to the statistics, there are currently 14,926 geological hazard-prone points in Chongqing City, of which 5776 (38.7%) are located in the 7 districts and counties of northeast Chongqing City (Wanzhou, Kaizhou, Chengkou, Wuxi, Wushan, Fengjie, Yunyang), 1864 (12.49%) are in the 5 districts and counties of southeastern Chongqing City (Wulong, Youyang, Qianjiang, Pengshui, Xiushan), and 1320 (8.84%) are in the 11 districts of the main city. Therefore, northeast Chongqing City is the key area with a high probability of potential geological disasters.

As one of the seven districts and counties in the northeast of Chongqing City (Figure 1), Yunyang County (spans 108◦2403700–109◦1404700 E and 30◦3405900–31◦2602800 N) is located in the middle of the Three Gorges Reservoir Project area, being the important hub of the ecological and economic zone along the Yangtze River. According to the announcement of the Chongqing Forest Bureau, while the forest area of Chongqing city reaches 54.5%, that of Yunyang County exceeds 58.5%, making it one of the greenest counties in China. Based on the Seventh National Census of China, there were 929,034 long-term residents (48% of them are urban residents) in this area in the year 2020. Yunyang County is crossed by twelve major folds, namely Changdianfang Syncline (1), Macaoba Anticline (2), Qvmahe Syncline (3), Tiefengshan Anticline (4), Yangliuwan Syncline (5), Dongcun Anticline (6), Xinchang Anticline (7), Huangpoxi Syncline (8), Guling Syncline (9), Fangdoushan Anticline (10), Ganchang Syncline (11), and Longjukan Syncline (12). Under the subtropical monsoon climate, Yunyang County has an average annual rainfall of 1123.7 to 1264.8 mm and an average annual temperature from 10.2 to 18.5 ◦C.

Mountainous areas are generally susceptible to mass movements due to preparatory and triggering causal factors [26]; not only the weathering effects but anthropogenic activities in the region also commonly accelerate the formation of unstable areas on both the earth material and on hill slopes [27]. As a part of Chongqing City, Yunyang County has always been a significant hotspot for landslide occurrences. There are a total of 836 historical landslides recorded in the dataset; 827 data points are left after data cleaning. A total of 28.2% of them are small landslides, 51.8% are medium landslides, and 20% are large landslides. Among them, trust-load-caused landslides accounted for 53.7%, and loosen-caused landslides and multi-caused landslides accounted for 14.5% and 31.8%, respectively. To build sub-models, we manually divided the study area into four different sub-zones (Figure 2) based on the information from the exploration of geological hazards in Chongqing City, such as the mechanism of landslide formation and sliding failure and the geomorphic unit characteristics. Among the four sub-zones, sub-zone II contains all the strip-distributing landslides along the mainstream of the Yangtze River, so it can also be called the Yangtze River mainstream zone. From a larger scope, a part of Yunyang County belongs to the low-hills section that crosses Yunyang, Fengjie, and Kaizhou; this area is classified as sub-zone IV. Sub-zone I (south of sub-zone II) is crossed by the main highway called S305, and the main area of sub-zone III (between sub-zone II and IV) is crossed by the S103 and S305. Similarly, the density of the road network is also at a high stage in the other two parts of sub-zone III. The landslides that occurred in these two sub-zones are found to be mainly along the roads (Figure 3). After zonation, 89 landslides are located in sub-zone I, 285 of them are in sub-zone II, sub-zone III contains 44 landslides, and with the largest area, 408 of the historical landslides occurred in sub-zone IV.

**Figure 1.** Location, landslide distribution and tectonic map of the study area. (1) Changdianfang Syncline, (2) Macaoba Anticline, (3) Qvmahe Syncline, (4) Tiefengshan Anticline, (5) Yangliuwan Syncline, (6) Dongcun Anticline, (7) Xinchang Anticline, (8) Huangpoxi Syncline, (9) Guling Syncline, (10) Fangdoushan Anticline, (11) Ganchang Syncline, and (12) Longjukan Syncline. **Figure 1.** Location, landslide distribution and tectonic map of the study area. (1) Changdianfang Syncline, (2) Macaoba Anticline, (3) Qvmahe Syncline, (4) Tiefengshan Anticline, (5) Yangliuwan Syncline, (6) Dongcun Anticline, (7) Xinchang Anticline, (8) Huangpoxi Syncline, (9) Guling Syncline, (10) Fangdoushan Anticline, (11) Ganchang Syncline, and (12) Longjukan Syncline.

Mountainous areas are generally susceptible to mass movements due to preparatory and triggering causal factors [26]; not only the weathering effects but anthropogenic activities in the region also commonly accelerate the formation of unstable areas on both the earth material and on hill slopes [27]. As a part of Chongqing City, Yunyang County has always been a significant hotspot for landslide occurrences. There are a total of 836 historical landslides recorded in the dataset; 827 data points are left after data cleaning. A total of 28.2% of them are small landslides, 51.8% are medium landslides, and 20% are large landslides. Among them, trust-load-caused landslides accounted for 53.7%, and loosencaused landslides and multi-caused landslides accounted for 14.5% and 31.8%, respectively. To build sub-models, we manually divided the study area into four different subzones (Figure 2) based on the information from the exploration of geological hazards in Chongqing City, such as the mechanism of landslide formation and sliding failure and the geomorphic unit characteristics. Among the four sub-zones, sub-zone II contains all the As one of the typical landslides in the Three Gorges Reservoir area, the Jiuxianping landslide (in sub-zone II) is located on the left bank of the Yangtze River (Figure 2b). After the Three Gorges Reservoir project, the fluctuation of the Three Gorges Reservoir water level restarted the displacement and deformation of the ancient landslide, making this area more prone to geological hazards. A subsidence of about one meter occurred on a roadway in the middle of the landslide body after heavy rain in 2003 and 2004, causing the roadway to be abandoned. With the impact of continuous heavy rain, landslides occurred in the back accumulation of Jiuxianping on 19 and 22 June 2007, causing the houses of the villagers to collapse, and the mountain body cracked. On 9 June 2009, the back-accumulation of Jiuxianping deformed again under the impact of heavy rain, causing cracks on both the accumulation body and the houses of the villagers. Recently, under the continuous effect of the Three Gorges Reservoir, this area has been in the overall creep deformation stage for years, especially the cliffs near the river, which often suffer from local collapse and damage.

*Forests* **2022**, *13*, x FOR PEER REVIEW 5 of 22

the largest area, 408 of the historical landslides occurred in sub-zone IV.

strip-distributing landslides along the mainstream of the Yangtze River, so it can also be called the Yangtze River mainstream zone. From a larger scope, a part of Yunyang County belongs to the low-hills section that crosses Yunyang, Fengjie, and Kaizhou; this area is classified as sub-zone IV. Sub-zone I (south of sub-zone II) is crossed by the main highway called S305, and the main area of sub-zone III (between sub-zone II and IV) is crossed by the S103 and S305. Similarly, the density of the road network is also at a high stage in the other two parts of sub-zone III. The landslides that occurred in these two sub-zones are found to be mainly along the roads (Figure 3). After zonation, 89 landslides are located in sub-zone I, 285 of them are in sub-zone II, sub-zone III contains 44 landslides, and with

**Figure 2.** Zonation and typical landslides: (**a**) Longjiao Landslide, (**b**) Jiuxianping Landslide, (**c**) Mawangmiao Landslide, (**d**) Tuantan Landslide **Figure 2.** Zonation and typical landslides: (**a**) Longjiao Landslide, (**b**) Jiuxianping Landslide, (**c**) Mawangmiao Landslide, (**d**) Tuantan Landslide.

The continuous heavy rains from 30 August to 1 September 2014 made the accumulated rainfall in Jiangkou Town more than 300 mm. The day after that, the Tuantan landslide (Figure 2d) occurred on the back mountain and on the left side of the Yongfa Coal Mine staff dormitory in Tuantan village, Jiangkou Town, Yunyang County (in sub-zone IV). Although the employees were notified to evacuate from the area subjected to the massive landslide, twelve of them were buried on the spot. Unfortunately, only one of the twelve was saved.

Typically, in sub-zone I, under the impact of rainfall, a landslide occurred on the S202 Highway in the direction from Longjiao to Rucao (Figure 2a) on 13 July 2021. Similarly, there was a 10,000 cubic-meter landslide triggered by heavy rainfall in the area of Mawang Temple (Figure 2c), which trapped two four-wheel cars and a motorcycle, and blocked the highway section for five days.

*Forests* **2022**, *13*, x FOR PEER REVIEW 6 of 22

**Figure 3.** Landslides distribution versus distance from road in different sub-zones. **Figure 3.** Landslides distribution versus distance from road in different sub-zones.

#### As one of the typical landslides in the Three Gorges Reservoir area, the Jiuxianping **3. Method Explanation**

#### landslide (in sub-zone II) is located on the left bank of the Yangtze River (Figure 2b). After *3.1. Random Forest*

the Three Gorges Reservoir project, the fluctuation of the Three Gorges Reservoir water level restarted the displacement and deformation of the ancient landslide, making this area more prone to geological hazards. A subsidence of about one meter occurred on a roadway in the middle of the landslide body after heavy rain in 2003 and 2004, causing the roadway to be abandoned. With the impact of continuous heavy rain, landslides occurred in the back accumulation of Jiuxianping on 19 and 22 June 2007, causing the houses of the villagers to collapse, and the mountain body cracked. On 9 June 2009, the backaccumulation of Jiuxianping deformed again under the impact of heavy rain, causing cracks on both the accumulation body and the houses of the villagers. Recently, under the continuous effect of the Three Gorges Reservoir, this area has been in the overall creep As one of the most popular classification methods, RF was first proposed by Breiman [28] and Cutler [29]. During its training process, different data subsets obtained by random sampling are used to train multiple decision trees as independent estimators; each of them is only allowed to fit the data based on the part of the input features, and the final output of the RF model is based on the voting results of the constructed estimators. With the double randomness, those estimators will be trained as distinct ones, and such a special structure makes RF less sensitive to noise and outliers, less likely to overfit and has a lower dependency on feature selection, but more robust and can provide more accurate predictions compared with the other machine learning models.

The RF training process can be concluded as the flowing three simple steps (Figure 4):

**Figure 4.** Random forest flowchart. **Figure 4.** Random forest flowchart.

*3.2. Grid Search* As pre-set parameters, hyper-parameters play a crucial role in model performance, Bootstrapping: To build M decision trees (estimators), M subsets will be generated as training sets from the original dataset by sampling with replacement.

and there are rare algorithms that are hyper-parameter-free [30]. Therefore, proper hyperparameter tuning is essential for improving model performance. Grid search, as one of the conventional automatic hyper-parameter tuning methods, is widely used because of Modeling: Each subset obtained by bagging is used to train the corresponding estimator. Similarly, for the purpose of getting estimators as distinct as possible, the available features for each estimator are also randomly chosen from the entire feature list.

its simple operation. Its basic idea is to choose the best hyper-parameter combination by enumerating and iterating over all possible combinations. Although it is computationally Voting: Each estimator will output a result independently, then the final result of the model will be generated based on the voting results (Equation (1)).

$$\mathbf{f}(\mathbf{x}) = \frac{1}{M} \sum\_{m=1}^{M} f\_m(\mathbf{x}) \tag{1}$$

*3.3. Performance Measure* where f(x) represents the final output, and *fm*(*x*) means the output of the *m*th tree.

#### For classification problems, accurate-oriented modeling sometimes ends up with *3.2. Grid Search*

"rabid" models that tend to classify all samples as a certain type, especially for unbalanced datasets. Therefore, the confusion matrix is introduced in this case (Table 1). **Table 1.** Confusion matrix. **Predicted Values Actual Values Positive Negative** *Positive* True Positive (TP) False Positive (FP) *Negative* False Negative (FN) True Negative (TN) As pre-set parameters, hyper-parameters play a crucial role in model performance, and there are rare algorithms that are hyper-parameter-free [30]. Therefore, proper hyperparameter tuning is essential for improving model performance. Grid search, as one of the conventional automatic hyper-parameter tuning methods, is widely used because of its simple operation. Its basic idea is to choose the best hyper-parameter combination by enumerating and iterating over all possible combinations. Although it is computationally expensive because of the exhaustive search process, grid search suits random forest very well, as random forest is a hyper-parameter tuning friendly model; only two hyper-parameters are to be tuned in our case.

#### TP/(TP + FN), and the false positive rate is defined as FPR = FP/(FP + TN). After model *3.3. Performance Measure*

establishment, its corresponding TPR versus FPR at different cutoff values can be plotted, which is called as receiver operation characteristic (ROC) curve and can be used to represent the predictive ability of the model. The area under the ROC curve (AUC) is commonly used as an index that represents the true classification accuracy of the model. For classification problems, accurate-oriented modeling sometimes ends up with "rabid" models that tend to classify all samples as a certain type, especially for unbalanced datasets. Therefore, the confusion matrix is introduced in this case (Table 1).

According to Table 1, the true positive rate, also called the recall, is defined as TPR =


**Table 1.** Confusion matrix.

According to Table 1, the true positive rate, also called the recall, is defined as TPR = TP/(TP + FN), and the false positive rate is defined as FPR = FP/(FP + TN). After model establishment, its corresponding TPR versus FPR at different cutoff values can be plotted, which is called as receiver operation characteristic (ROC) curve and can be used to represent the predictive ability of the model. The area under the ROC curve (AUC) is commonly used as an index that represents the true classification accuracy of the model. *Forests* **2022**, *13*, x FOR PEER REVIEW 9 of 22

### **4. Methodology**

#### *4.1. Data Collection and Preparation* **4. Methodology**

Data collection and data cleaning are of the top importance in machine learning applications; high-quality data is the foundation of accurate prediction models. Model perfection also plays a vital role; choosing appropriate hyper-parameters will help models to extract useful information from input data more effectively and precisely. In this section, we will introduce the procedure for data preparation and model establishment. The analysis of the performance and the factor importance of different models will be presented in the following sections. The flowchart for this study is shown in Figure 5. *4.1. Data Collection and Preparation* Data collection and data cleaning are of the top importance in machine learning applications; high-quality data is the foundation of accurate prediction models. Model perfection also plays a vital role; choosing appropriate hyper-parameters will help models to extract useful information from input data more effectively and precisely. In this section, we will introduce the procedure for data preparation and model establishment. The analysis of the performance and the factor importance of different models will be presented in the following sections. The flowchart for this study is shown in Figure 5.

**Figure 5.** Flowchart of this study. Therefore, in this study, topographical factors (aspect, elevation, plane curvature, profile **Figure 5.** Flowchart of this study.

Triggered by the joint effects of natural factors and human activities, the mechanism of the occurrence of landslides is very complex [5]. Based on the assumption that future

factors and analyzing their relationships with historically recorded landslides can contribute to forming the basis for the prediction of future landslides in an area [32,33]. However, after reviewing the studies related to the evaluation of landslide sensitivity from 1983 to 2016, Reichenbach et al. [34] found that there was a total of 596 different factors that affected landslide formation, considering all of them will be laborious and time-consuming. According to previous articles, topography, hydrology, geology, land cover, and natural and human-related factors are generally used for landslide susceptibility analysis [14,17].

Triggered by the joint effects of natural factors and human activities, the mechanism of the occurrence of landslides is very complex [5]. Based on the assumption that future landslides will occur under the same conditions as past landslides [31], evaluating these factors and analyzing their relationships with historically recorded landslides can contribute to forming the basis for the prediction of future landslides in an area [32,33]. However, after reviewing the studies related to the evaluation of landslide sensitivity from 1983 to 2016, Reichenbach et al. [34] found that there was a total of 596 different factors that affected landslide formation, considering all of them will be laborious and time-consuming. According to previous articles, topography, hydrology, geology, land cover, and natural and human-related factors are generally used for landslide susceptibility analysis [14,17]. Therefore, in this study, topographical factors (aspect, elevation, plane curvature, profile curvature, relief amplitude, slope), hydrological factors (aridity, distance from rivers, index of moisture (IM), and topographic wetness index (TWI)), geological factors (lithology, distance from anticline axis, distance from syncline axis), factors related to land cover (namely land use and normalized difference vegetation index (NDVI)), human factors (such as human activity intensity of land surface (HAILS), distance from road, and population), and natural factors (average annual temperature, average annual rainfall) are all considered (Figure 6). *Forests* **2022**, *13*, x FOR PEER REVIEW 10 of 22 curvature, relief amplitude, slope), hydrological factors (aridity, distance from rivers, index of moisture (IM), and topographic wetness index (TWI)), geological factors (lithology, distance from anticline axis, distance from syncline axis), factors related to land cover (namely land use and normalized difference vegetation index (NDVI)), human factors (such as human activity intensity of land surface (HAILS), distance from road, and population), and natural factors (average annual temperature, average annual rainfall) are all considered (Figure 6).

**Figure 6.** Conditioning factor types. **Figure 6.** Conditioning factor types.

As the grid data layers cannot be directly used for model training, the Yunyang County fishnet with a cell size of 25 × 25 m was created for the purpose of data extraction for model training and measuring the distances from specific structures/natural sources, the total number of cells is 5,831,382. The secondary data includes aspect, plane curvature, As the grid data layers cannot be directly used for model training, the Yunyang County fishnet with a cell size of 25 × 25 m was created for the purpose of data extraction for model training and measuring the distances from specific structures/natural sources, the

profile curvature, relief amplitude, slope, distance from rivers, TWI, distance from anticline axis, distance from syncline axis, and distance from road. Among them, TWI, aspect,

processing of DEM. The characteristics associated with distances (i.e., distance from rivers, distance from anticline axis, and distance from syncline axis) were obtained by applying the near function of ArcGIS to the corresponding primary data (i.e., the river network of Chongqing City, the road network of Chongqing City, and the geological structure of total number of cells is 5,831,382. The secondary data includes aspect, plane curvature, profile curvature, relief amplitude, slope, distance from rivers, TWI, distance from anticline axis, distance from syncline axis, and distance from road. Among them, TWI, aspect, plane curvature, profile curvature, relief amplitude, and slope were obtained by ArcGIS processing of DEM. The characteristics associated with distances (i.e., distance from rivers, distance from anticline axis, and distance from syncline axis) were obtained by applying the near function of ArcGIS to the corresponding primary data (i.e., the river network of Chongqing City, the road network of Chongqing City, and the geological structure of Yunyang County) and the created fishnet of Yunyang County. The ranges and categories of the 20 characteristics are summarized in Table 2.


**Table 2.** (**a**) Categories of conditioning factors. (**b**) Ranges of conditioning factors.

All 20 factors were visualized by ArcGIS with a resolution of 25 m (Figure 7). Then the corresponding classes and rating values of the 20 factors were assigned to each cell of the prepared fishnet.

*Forests* **2022**, *13*, x FOR PEER REVIEW 13 of 22

**Figure 7.** Conditioning factors of landslides susceptibility: (**a**) TWI; (**b**) Slope; (**c**) Relief amplitude; (**d**) Profile curvature; (**e**) Population; (**f**) Plane curvature; (**g**) Aridity; (**h**) Distance from rivers; (**i**) Land use; (**j**) NDVI; (**k**) Aspect; (**l**) Distance from anticline axis; (**m**) Distance from road; (**n**) Distance from syncline axis; (**o**) HAILS; (**p**) IM; (**q**) Elevation; (**r**) Lithology; (**s**) Average annual temperature; (**t**) Average annual rainfall.

After obtaining the 827 historical landslide cells in the study area as positive samples (1), the same number of non-landslide cells were randomly extracted from the landslide-free areas as negative samples (0) [35,36]. To make sure that the sub-models can be constructed with the same data as the parent model, the number of negative samples selected from each sub-zone was determined by the number of the positive samples in the sub-zone (i.e., 89 for sub-zone I, 286 for sub-zone II, 44 for sub-zone III, and 408 for sub-zone IV). Thus the five sample datasets were formed (Table 3). Based on previous research [14,37], 80% of each dataset was randomly selected to train the corresponding statistical model, and the remaining 20% was used for validation purposes.


**Table 3.** The number of data points in different zonation and datasets.

### *4.2. Model Development and Application*

Five independent random forest-based models were established and validated based on the sample datasets constructed before. The hyper-parameters of each model were tuned by the grid search method. Being considered as the most straightforward optimization method [38], the grid search method needs to iterate over the entire interval of each hyper-parameter. Therefore, the number and the iterating interval of the tuned hyperparameters have an obvious impact on its search efficiency. The number of estimators determines the stability of a random forest model, and adding more estimators will lower its mean squared prediction delta (MSPD) and hence improve model stability [39], but will increase computational cost [40]. The maximum depth of a tree controls the stability of a random forest model in a different way. The stability will decrease as depth increases since increasing the depth will make the model tend to just memorize the training data, but if the forest is too shallow, the model will underfit, resulting in low AUC [39]. In this case, the number of estimators and the maximum depth of a tree are the two hyper-parameters to be tuned; the results are shown in Table 4.



After obtaining well-trained models, we used all 5,831,382 cells prepared in Section 4.1 as input to generate the landslide susceptibility maps for the study area.

### **5. Results**

The landslide susceptibility maps generated by the parent model and sub-models are displayed in Figure 8, where Figure 8a represents the map outputted by the parent model, and Figure 8b is the map produced by sub-models, Figure 8c,d are the detailed scopes for the part of both maps. The entire region is divided and classified into five zones of susceptibility to different levels of landslides (very low, low, moderate, high, and very

Sub-Model Three

high) by the method of natural breaks method. From the result of the parent model in Table 5, 14.6% of the whole area is classified as a very low landslide-prone zone, 22.8% as a low landslide-prone zone, 25.23% as a moderate landslide-prone zone, 21.47% as a high landslide-prone zone, and 15.90% as a very high landslide-prone zone. From the result of the sub-models, their ratios are 16.42%, 20.81%, 22.56%, 21.54%, and 18.67%, respectively. *Forests* **2022**, *13*, x FOR PEER REVIEW 15 of 22

**Figure 8.** Comparison between the landslide susceptibility mapping of the parent model and submodels: (a) Landslide susceptibility map generated by the parent model, (b) Landslide susceptibility map generated by the sub-models, (c) Detailed view of landslide susceptibility map generated by the sub-models, (d) Detailed view of landslide susceptibility map generated by the parent model. **Figure 8.** Comparison between the landslide susceptibility mapping of the parent model and submodels: (**a**) Landslide susceptibility map generated by the parent model, (**b**) Landslide susceptibility map generated by the sub-models, (**c**) Detailed view of landslide susceptibility map generated by the sub-models, (**d**) Detailed view of landslide susceptibility map generated by the parent model.


**Table 5.** Statistic results of landslide susceptibility in different levels of different models. **Table 5.** Statistic results of landslide susceptibility in different levels of different models.

Very Low 0.000–0.303 0.00% 13.68% 0.000 Low 0.303–0.421 0.00% 24.81% 0.000 Moderate 0.421–0.536 9.10% 25.78% 0.353 High 0.536–0.662 31.81% 22.06% 1.442 Sub-Model Four


*Forests* **2022**, *13*, x FOR PEER REVIEW 16 of 22

**Table 5.** *Cont.*

Logically, the landslides/area ratio should increase from a very low landslide-prone zone to a very high landslide-prone zone, which is exactly what our models indicate. According to the results of the parent model, the landslides/area ratio increases from 0.016 to 4.548, from a very low landslide-prone zone to a very high landslide-prone zone, and that also increases from 0.007 to 4.051 from the results of the sub-models. Figure 9 displays such a tendency, and it can be seen that the outputs of the sub-models have more obvious gaps between the very low landslide-prone zone and the very high landslide prone-zone, which reflects another merit of the sub-models compared with the parent model. Very High 0.662–1.000 59.09% 13.67% 4.323 Very Low 0.000–0.227 0.00% 16.59% 0.000 Low 0.227–0.427 0.74% 16.20% 0.0457 Moderate 0.427–0.612 2.94% 19.38% 0.152 High 0.612–0.792 19.12% 22.26% 0.859 Very High 0.792–1.000 77.20% 25.57% 3.020

**Figure 9.** Landslides/area ratio. **Figure 9.** Landslides/area ratio.

predicting process of sub-model four.

The validation AUC values of the five models are shown in Figure 10; the AUC value of the parent model is 0.872, while that of sub-model one to sub-model four are 0.949, 0.892, 0.889, and 0.951, respectively. All of the sub-models outperformed the parent model, which proved the aforementioned hypothesis. However, the increases are not at the same level; sub-model four achieved the highest improvement (9.1%), while the lowest improvement is 1.9%, which was obtained by sub-model two. The importance of features generally represents how much a specific feature contrib-The validation AUC values of the five models are shown in Figure 10; the AUC value of the parent model is 0.872, while that of sub-model one to sub-model four are 0.949, 0.892, 0.889, and 0.951, respectively. All of the sub-models outperformed the parent model, which proved the aforementioned hypothesis. However, the increases are not at the same level; sub-model four achieved the highest improvement (9.1%), while the lowest improvement is 1.9%, which was obtained by sub-model two.

utes to the decision-making process of a model. In this case, the most important features can be the key factors in identifying landslide/non-landslide points. As shown in Table 6 and Figure 11, the distance from syncline axis, aridity, elevation, distance from rivers, and average annual temperature are of the highest importance for the parent model. NDVI,

the most important feature, which is followed by average annual rainfall, distance from rivers, distance from anticline axis, and average annual temperature. Distance from road, HAILS, elevation, plane curvature, and average annual temperature are the top features for sub-model three. Last but not least, distance from syncline axis, average annual temperature, elevation, aridity, and average annual rainfall play important roles during the

**Figure 10.** Validation AUC values of different models. **Figure 10.** Validation AUC values of different models.

**Table 6.** Top five important factors in different sub-zones. **Zone Feature Importance Rank 1st 2nd 3rd 4th 5th** Whole Region Distance from syncline axis (m) Aridity Elevation (m) Distance from rivers (m) Average Annual Temperature (°C) Sub-Zone I Distance from syncline axis (m) NDVI Average Annual Rainfall (mm) Distance from Road (m) Elevation (m) Sub-Zone II Elevation (m) Average Annual Rainfall (mm) Distance from rivers (m) Distance from anticline axis (m) Average Annual Temperature (°C) Sub-Zone III Distance From Road (m) HAILS Elevation (m) Plane Curvature Average Annual Temperature (°C) Sub-Zone IV Distance from syn-Average Annual Temperature (°C) Elevation (m) Aridity Average Annual Rain-The importance of features generally represents how much a specific feature contributes to the decision-making process of a model. In this case, the most important features can be the key factors in identifying landslide/non-landslide points. As shown in Table 6 and Figure 11, the distance from syncline axis, aridity, elevation, distance from rivers, and average annual temperature are of the highest importance for the parent model. NDVI, average annual rainfall, distance from road, and elevation are the main features that are associated with the formation of landslides in sub-zone I. For sub-model two, elevation is the most important feature, which is followed by average annual rainfall, distance from rivers, distance from anticline axis, and average annual temperature. Distance from road, HAILS, elevation, plane curvature, and average annual temperature are the top features for sub-model three. Last but not least, distance from syncline axis, average annual temperature, elevation, aridity, and average annual rainfall play important roles during the predicting process of sub-model four.

fall (mm)



*Forests* **2022**, *13*, x FOR PEER REVIEW 18 of 22

**Figure 11.** Top five important factors of different models. AAT: Average annual temperature; AAR: Average annual rainfall; DFSA: Distance from syncline axis; DFAA: Distance from anticline axis; DFRD: Distance from road; DFRS: Distance from rivers. **Figure 11.** Top five important factors of different models. AAT: Average annual temperature; AAR: Average annual rainfall; DFSA: Distance from syncline axis; DFAA: Distance from anticline axis; DFRD: Distance from road; DFRS: Distance from rivers.

### **6. Discussion**

#### **6. Discussion** *6.1. Feature Importance Analysis*

*6.1. Feature Importance Analysis* As indicated in Table 6, the distance from syncline axis is at the dominant place among the twenty factors for the parent model, sub-model one, and sub-model four. Factors associated with water, such as average annual rainfall and the distance from rivers, are also important places. The triggering factors of shallow landslides are highly dependent on the rainfall water infiltration and its further redistribution [41]. Due to its softening effect and long-term erosion effect, the distance from rivers has a significant influence on the development of landslides [42]. Temperature has a remarkable effect on landslide formation; experimental results indicated that the shear strength of slip surface soils reduces with decreasing temperature, which will negatively affect slope instability [43]. Therefore, obtained by the accumulated temperature and rainfall, aridity is also of great importance. Although different plants generally have a different contribution of rainfall to soil water [44], which will affect slope stability differently, on average, the effect of water uptake from the plant cover makes the vegetated slopes averagely 12.84% drier, and matric suctions three times higher than the fallow slope, which contributes to slope stability [45]; so NDVI is taken as one of the main considerations. Unlike vegetation cover, human-made land covers usually have negative influences on slope stability. The development of new building areas can potentially increase susceptibility to landslides [46]. Road networks not only directly destabilize existing slopes by disturbing their original structures during construction processes, but transport activities also have negative effects on slope stability. As indicated in Table 6, the distance from syncline axis is at the dominant place among the twenty factors for the parent model, sub-model one, and sub-model four. Factors associated with water, such as average annual rainfall and the distance from rivers, are also important places. The triggering factors of shallow landslides are highly dependent on the rainfall water infiltration and its further redistribution [41]. Due to its softening effect and long-term erosion effect, the distance from rivers has a significant influence on the development of landslides [42]. Temperature has a remarkable effect on landslide formation; experimental results indicated that the shear strength of slip surface soils reduces with decreasing temperature, which will negatively affect slope instability [43]. Therefore, obtained by the accumulated temperature and rainfall, aridity is also of great importance. Although different plants generally have a different contribution of rainfall to soil water [44], which will affect slope stability differently, on average, the effect of water uptake from the plant cover makes the vegetated slopes averagely 12.84% drier, and matric suctions three times higher than the fallow slope, which contributes to slope stability [45]; so NDVI is taken as one of the main considerations. Unlike vegetation cover, human-made land covers usually have negative influences on slope stability. The development of new building areas can potentially increase susceptibility to landslides [46]. Road networks not only directly destabilize existing slopes by disturbing their original structures during construction processes, but transport activities also have negative effects on slope stability. Similarly, HAILS is another influential factor. For topological factors, the two most important factors, in this case, are elevation and plane curvature.

Similarly, HAILS is another influential factor. For topological factors, the two most im-

portant factors, in this case, are elevation and plane curvature.

### *6.2. Model Comparison*

### 6.2.1. Parent Model

When dealing with the whole region, the model tends to capture information that is more general, and thus its feature importance has a higher universality. In our case, six major synclines are lying in the whole domain, and their generally dominant effects on identifying landslide/non-landslide points are successfully captured by the parent model. Aridity, distance from rivers, and average annual temperature affect the formation of landslides in meteorological and hydrological aspects. Although ranked fourth as the most important one, elevation is the most important topological factor in the research area.

### 6.2.2. Sub-Models

After dividing the whole region into different sub-zones, the models will have the chance to learn the knowledge that is specifically right for each sub-zone. Thus their performances will be improved.

*Sub-model One.* Located in the south of Yunyang County, sub-zone I occupies 741.7 square kilometers, which is only 20% of the entire research area. Nevertheless, it is crossed by two of the six major synclines, which makes the syncline effect more obvious, and, therefore, the distance from the syncline axis is still the most important factor for sub-zone I. Similarly, the landslide susceptibility in sub-zone I is also affected by both hydrological and topological factors, namely average annual rainfall and elevation. However, specifically, as it is crossed by the major highway S305, land cover (NDVI) and human activities (distance from road) are the other two main influential factors.

*Sub-model Two.* As we discussed before, sub-zone II is the most special sub-zone since it is crossed by the Three Gorges Reservoir area, most of the landslides in Yunyang County were impacted by water occurred here. Therefore, theoretically, the landslide formation in sub-zone II is sensitive to hydrological factors. In this case, the average annual rainfall and the distance from rivers are ranked as the second and the third-most important factors. The reason why elevation is of the greatest importance is that lower places are usually prone to the influence of the periodic variation of reservoir water level. Statistically, 76% of the landslides impacted by water in sub-zone II occurred below the elevation of 249 m. The tectonic action and natural factors also contribute a lot to the landslide formation in sub-zone II; the distance from anticline axis and the average annual temperature are ranked in fourth and fifth place, respectively.

*Sub-model Three.* For sub-zone III, specifically, its data points are relatively far away from the major synclines, which mitigates the importance of the feature. Instead, as an area crossed by major highways and dense road networks, human activities, including distance from road and HAILS, are the two most important factors that have the main contribution to the landslide formation of the sub-zone. Topographical factors, namely elevation and plane curvature, are ranked in third and fourth place, respectively, and the average annual rainfall is evaluated as the fifth-most important factor.

*Sub-model Four.* Crossed by three major synclines, the landslide identification in subzone IV is also significantly dependent on the distance from the syncline axis. Furthermore, since it is part of the low hills section crossing Yunyang, Fengjie, and Kaizhou, the formation of landslides in this area is primarily influenced by the factors associated with the stability of mountain bodies, such as the average annual temperature, elevation, aridity, and the average annual rainfall.

### **7. Conclusions**

In this study, Yunyang County is manually zoned into four parts based on the qualitative analysis of geological hazards exploration in Chongqing City, including the mechanism of landslide formation and sliding failure and geomorphic unit characteristics. Based on the qualitative analysis result, five random forest landslide susceptibility models are constructed using historical landslides data points and twenty relating factors for the following quantity analysis. These models, including a parent model and four sub-models, are optimized by the grid search method individually. A comparison between the parent model and the combination of the sub-models is conducted. The following conclusions are drawn:

The AUC value of the parent model achieves 0.872, which shows that the traditional RF with the hyper-parameters tuned by the grid search method has a reliable performance on landslide susceptibility mapping. In this study, synclines have the most important effects on the formation of landslides in Yunyang County, followed by aridity, elevation, distance from rivers, and average annual temperature.

However, more general information extracted from "mainstream" landslides would usually cover that of the "minority" landslides when treating a large region equally, resulting in low information utility and the inability to identify potential landslides under special geological conditions. With enough data points, experience-based zoning before modeling is proved to be an effective solution to the issue; the qualitative analysis serves the purpose of pre-classification based on the information from geological hazards exploration, which groups the landslides that occurred under similar geological conditions, and thus enables the models to obtain the specific knowledge under each condition. Therefore, in our case, while the traditional RF obtained the general prediction skill for the entire region of Yunyang County, all the sub-models have become "experts" in their respective sub-areas. The test AUC values of sub-model one to four are 8.8%, 2.3%, 1.9%, and 9.1% higher than those of the parent model. Furthermore, the proposed method also contributes to further revealing the key factors that include local landslide instability under specific geological conditions, which can be used by planners and policymakers for a more specific and accurate landslide control in certain areas, thus further improving the safety of life and public property.

For sub-zone I, the top five conditioning factors are distance from syncline axis, NDVI, average annual rainfall, distance from road, and elevation. For sub-zone III, without the influence of major synclines, its top factors are distance from road, HAILS, elevation, plane curvature, and average annual temperature. For sub-zone IV, the distance from syncline axis becomes the most important factor again, and it is followed by average annual temperature, elevation, aridity, and average annual rainfall.

Sub-zone II is crossed by the Three Gorges Reservoir area. Suffered by periodic variation in reservoir water level and the impacts of other factors related to the reservoir band, the modified method based on general conditioning factors has relatively less effect on improving the accuracy of the mapping. The effect of more specific factors on the formation of landslides on the banks of the reservoir will be analyzed in further research. In the case of this paper, the results of sub-model two point out that elevation, average annual rainfall, distance from rivers, distance from anticline axis, and average annual temperature are the top five conditioning factors among the existing twenty factors for sub-zone II.

**Author Contributions:** Methodology, S.L.; validation, L.W.; software, Y.H.; resources, P.S.; formal analysis, M.C.; supervision, W.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Key R&D Program of China (2019YFC1509605), the China Postdoctoral Science Foundation funded project (2021M700608), the Natural Science Foundation of Chongqing, China (cstc2021jcyj-bsh0047), High-end Foreign Expert Introduction program (G20200022005 and DL2021165001L), and Science and Technology Research Program of Chongqing Municipal Education Commission (HZ2021001).

**Data Availability Statement:** Restrictions apply to the availability of these data. Data was obtained from National Earth System Science Data Center, Resource and Environment Science and Data Center and Chongqing Geological monitoring station and are available at http://www.geodata.cn/ ;jsessionid=4FDC4730785878003B1ED71B62212634 (accessed on 1 May 2022), and https://www.resdc. cn/Default.aspx (accessed on 1 May 2022) with the permission of National Earth System Science Data Center and Resource and Environment Science and Data Center.

**Conflicts of Interest:** The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

### **References**

