Construction and Optimization of Landslide Susceptibility Assessment Model Based on Machine Learning

Wang, Xiaodong; Ma, Xiaoyi; Guo, Dianheng; Yuan, Guangxiang; Huang, Zhiquan

doi:10.3390/app14146040

by Xiaodong Wang ¹, Xiaoyi Ma ¹,*, Dianheng Guo ¹, Guangxiang Yuan ¹ and Zhiquan Huang ²

¹ College of Geosciences and Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450046, China

² Luoyang Institute of Science and Technology, Luoyang 471023, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(14), 6040; https://doi.org/10.3390/app14146040

Submission received: 6 June 2024 / Revised: 6 July 2024 / Accepted: 9 July 2024 / Published: 10 July 2024

(This article belongs to the Special Issue Recent Advances in Modeling, Assessment, and Mitigation of Landslide Hazards)

Abstract:

The appropriate selection of training samples is the foundation for using machine learning models. In landslide susceptibility evaluation, however, non-landslide samples that fall within landslide-prone areas or are spatially biased lead to discrepancies in model predictions. To address the impact of non-landslide sample selection on landslide susceptibility predictions, this study uses the western region of Henan Province as a case study. Utilizing historical data, remote sensing interpretation, and field surveys, a sample dataset comprising 834 landslide points is obtained. Ten environmental factors (elevation, slope, aspect, profile curvature, land cover, lithology, topographic wetness index, distance from river, distance from faults, and distance from road) are chosen to establish an evaluation index system. Negative sample sampling areas are delineated based on the susceptibility assessment outcomes derived from the information value model. Two sampling strategies, whole-region random sampling (I) and partition-based random sampling (II), are employed. Random Forest (RF) and Back Propagation Neural Network (BPNN) models are used to predict and map landslide susceptibility in the western region of Henan Province, and their prediction accuracy is evaluated. The model prediction accuracy is ranked as follows: II-BPNN (AUC = 0.9522) > II-RF (AUC = 0.9464) > I-RF (AUC = 0.8247) > I-BPNN (AUC = 0.8068). In terms of the area under the Receiver Operating Characteristic curve (AUC) and accuracy, the II-RF model improves on the I-RF model by 12.17% and 15.61%, respectively, and the II-BPNN model improves on the I-BPNN model by 14.54% and 16.52%, respectively. This indicates enhancements in model performance and predictive capability. In terms of recall and specificity, the II-RF and II-BPNN models show increases in recall of 15.09% and 17.47%, respectively, and in specificity of 15.80% and 14.99%, respectively. These findings suggest that the optimized models have better predictive capabilities for identifying landslide and non-landslide areas, effectively reducing the uncertainty introduced by point data in landslide risk prediction.

Keywords:

landslide susceptibility; Random Forest; Back Propagation Neural Network; information value model; frequency ratio; negative sample sampling

1. Introduction

Landslide hazard risk assessment, a fundamental task in landslide risk management, generally consists of three levels: susceptibility assessment, hazard assessment, and risk assessment. Susceptibility assessment evaluates the spatial probability that a slope at a specific location will fail as a landslide under the nonlinear combination of various environmental factors [1]. Its main components include landslide data cataloging, indicator system construction, assessment unit delineation, the selection of assessment methods, and the accuracy evaluation of assessment results. It serves as the basis for conducting hazard and risk evaluations. The concept of landslide susceptibility originated in the 1960s, and by the early 21st century, internationally recognized technical guidelines for landslide susceptibility had been developed. As a result, landslide susceptibility assessment has received widespread attention as one of the internationally advocated approaches to disaster prevention and mitigation [2].

Existing assessment methods are classified into two main categories: deterministic methods and non-deterministic methods, with non-deterministic methods further divided into knowledge-driven and data-driven types [3,4,5]. Currently, widely used assessment methods mainly belong to the data-driven category, including information value models [6], frequency ratio models [7], and machine learning models [8]. Machine learning models have received a lot of attention for their excellent performance in overcoming overfitting problems. Additionally, machine learning models can simulate the nonlinear correlations among influential factors and susceptibility, and have advantages such as generating optimal features to achieve high-precision predictions. These advantages make them an ideal choice for the rapid analysis of landslide susceptibility on a large scale [9]. However, the performance and applicability of different machine learning models vary significantly, making it crucial to distinguish between them to optimize prediction accuracy. Previous studies have employed various machine learning techniques for landslide assessment. Support Vector Machine (SVM) [10,11], Random Forest (RF) [12], XGBoost [13], artificial neural network (ANN) [14,15], and others are among the commonly utilized machine learning models. These models have shown different performances and advantages in various studies, demonstrating their broad application potential in landslide susceptibility evaluation. Additionally, numerous scholars have developed novel risk assessment methods based on machine learning [16,17]. For example, Kazemi et al. [18] utilized innovative hyperparameter optimization methods (such as halving search, grid search, random search, fine-tuning method, and k-fold cross-validation) in Python to develop a risk assessment support tool for deriving seismic fragility curves. 
Compared to traditional seismic fragility assessment methods, these optimized machine learning methods significantly reduce computational efforts and exhibit superior performance. They achieve satisfactory accuracy and closely match the actual curves. The establishment of these risk assessment methods provides a scientific foundation for disaster prevention and mitigation. Meanwhile, the application of machine learning models has also enhanced the accuracy and efficiency of the assessments.

Constructing a machine learning sample set is a significant process in using machine learning models. Currently, the landslide sample sets used in research are typically manually interpreted and compiled based on historical data. In the study of specific landslides, determining their boundaries is not a difficult task. However, in regional landslide susceptibility studies, the lack of historical data and the complexity of the natural environment can lead to errors in determining landslide boundaries. Furthermore, the process of remote sensing interpretation for numerous landslides consumes a considerable amount of time and resources. Therefore, in practical research, points are often used to replace irregular landslide boundaries [19]. The high-precision prediction of machine learning models relies on effective training and testing of the model. To support this, it is necessary to select a non-landslide sample set that matches the landslide sample dataset. However, the sampling methods for negative samples are currently mostly based on subjective speculation or random selection. This can result in negative samples being biased towards potential landslide points and locations, introducing uncertainty into the model and causing differences in landslide risk prediction results.

The core focus of this study is to investigate the impact of negative sample selection on model predictions and strive to minimize biases introduced by sample selection on model outcomes. The study proposes a modeling process for landslide susceptibility based on machine learning algorithms, integrating the information value method and frequency ratio method to optimize sample selection and model input variables. Using trained models, the study predicts landslide susceptibility indices and generates maps, followed by a comparative analysis of evaluation results.

2. Methodology

2.1. Frequency Ratio

The frequency ratio method calculates the likelihood of landslides occurring within the graded intervals of each factor influencing landslides. Through ArcGIS 10.8 spatial analysis, the landslide area in each interval, expressed in grid cells, is obtained, and the corresponding FR value is then calculated [20,21]. The calculation formula is as follows:

FR = \frac{l / L}{s / S}    (1)

In this equation, l represents the count of landslide grid cells within a specific interval of an environmental factor, L is the total count of grid cells occupied by landslides in the study area, s is the number of grid cells within a certain attribute interval, and S represents the total number of grid cells in the study area.

If FR is greater than 1, the environmental factor has a relatively significant influence on landslides within that attribute interval; if FR is less than 1, its influence on landslides within that interval is relatively minor. This index helps evaluate the relative importance of environmental factors in landslide occurrence within the study area.
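As a minimal sketch of Equation (1), the function below computes FR for every interval of a single factor from per-interval cell counts. The function name, the dictionary layout, and the slope counts are illustrative assumptions and are not values from Table 1.

```python
def frequency_ratio(landslide_cells, class_cells):
    """Return FR = (l / L) / (s / S) for every interval of one factor.

    landslide_cells: dict interval -> l, landslide grid cells in that interval
    class_cells:     dict interval -> s, all grid cells in that interval
    """
    total_landslide = sum(landslide_cells.values())  # L
    total_cells = sum(class_cells.values())          # S
    return {
        name: (landslide_cells.get(name, 0) / total_landslide) / (s / total_cells)
        for name, s in class_cells.items()
    }


# Hypothetical slope intervals (degrees); counts are made up for illustration.
slope_landslides = {"0-10": 20, "10-30": 70, ">30": 10}
slope_cells = {"0-10": 500, "10-30": 400, ">30": 100}
slope_fr = frequency_ratio(slope_landslides, slope_cells)
```

With these made-up counts, the 10-30 interval holds 70% of the landslide cells but only 40% of all cells, so its FR is above 1, while the 0-10 interval falls below 1.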

2.2. Information Value Model

The information value model is a classic method for predicting regional geological disasters and is based on information theory. The model reflects the likelihood of geological disasters occurring in a certain area. Its principle is to deduce, from known geological disaster information and environmental factors, the information that a given combination of environmental factors contributes to geological disasters. The larger the total information value of an area, the higher the likelihood of geological disasters occurring in that area [22].

The information value for a single environmental factor in different states is calculated as follows:

I_{Aj} = \ln \frac{N_{j} / N}{S_{j} / S} \quad (j = 1, 2, 3, \dots, n)    (2)

where N_{j} is the number of units with factor A in state j where geological disasters occur, N is the aggregate count of units with documented geological disaster distribution within the study area, S_{j} is the count of units containing factor A in state j, and S represents the total count of units in the study area [23].

The total information value for each assessment unit across various states of factors is calculated as

I = \sum_{i = 1}^{n} \ln \frac{N_{i} / N}{S_{i} / S} \quad (i = 1, 2, 3, \dots, n)    (3)

where I is the total information value of geological disaster occurrence for a particular unit, N_{i} is the geological disaster area or number of geological disaster points under state i of a specific factor, N is the total geological disaster area or total count of geological disaster points within the study area, S_{i} is the distribution area under state i of the specific factor, and S represents the total area of the study region.
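The ratio inside the logarithm of Equation (3) has the same form as the frequency ratio of Equation (1), so the total information value of a unit can be sketched as a sum of ln(FR) terms, one per factor. The names and FR values below are hypothetical, and the handling of intervals with no recorded landslides (where the logarithm is undefined) is an assumption, since the text does not state how that case is treated.

```python
import math


def total_information_value(unit_classes, fr_tables):
    """Sum ln(ratio) over all factors for one assessment unit.

    unit_classes: dict factor -> interval observed in this unit
    fr_tables:    dict factor -> {interval -> (N_i / N) / (S_i / S)}
    """
    total = 0.0
    for factor, interval in unit_classes.items():
        ratio = fr_tables[factor][interval]
        if ratio <= 0:
            # Assumption: fail loudly, because ln is undefined for an
            # interval that contains no landslides.
            raise ValueError(f"no landslides recorded for {factor}={interval}")
        total += math.log(ratio)
    return total


# Hypothetical ratios for two factors; not taken from Table 1.
fr_tables = {
    "slope": {"0-10": 0.4, "10-30": 1.75},
    "road": {"0-200": 2.0, ">200": 0.8},
}
prone_unit = {"slope": "10-30", "road": "0-200"}
stable_unit = {"slope": "0-10", "road": ">200"}
prone_value = total_information_value(prone_unit, fr_tables)
stable_value = total_information_value(stable_unit, fr_tables)
```

A unit whose intervals all have ratios above 1 receives a positive total, and a unit whose intervals all have ratios below 1 receives a negative total, which is what allows the index to be ranked into susceptibility levels.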

2.3. Machine Learning

2.3.1. Random Forest (RF)

Random Forest is an ensemble learning method that improves the performance and robustness of the overall model by combining the predictions of multiple weak learners, typically decision trees. Decision trees are tree-like structures used for classification or regression by partitioning data in the feature space. The Random Forest model typically employs a bootstrap sampling method, randomly selecting a subset of samples with replacement from the original dataset to construct multiple different training sets for training different decision trees. When constructing each decision tree, a random subset of features is chosen at each split, helping to ensure the diversity of each tree and increase model variability. This ensemble method helps to reduce the risk of overfitting and improve the model’s generalization ability [24].
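The three ingredients described above (bootstrap sampling with replacement, a random feature subset at each split, and combining the trees' predictions) can be sketched without building actual trees. This is an illustration of the mechanics only and not the authors' implementation; the helper names are hypothetical, the subset size of 3 is an assumed value (roughly the square root of ten features), and combining by majority vote is one common choice for classification.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    # Draw n_samples row indices with replacement: one training set per tree.
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, k, rng):
    # Choose k distinct candidate features for one split.
    return sorted(rng.sample(range(n_features), k))


def majority_vote(tree_predictions):
    # Combine the class labels predicted by the individual trees.
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = bootstrap_indices(1834, rng)           # 834 landslide + 1000 non-landslide samples
features = random_feature_subset(10, 3, rng)  # ten environmental factors
label = majority_vote([1, 0, 1, 1, 0])        # votes from five hypothetical trees
```

Because the rows are drawn with replacement, each tree sees some samples more than once and misses others, which is what makes the trees differ from one another.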

2.3.2. Back Propagation Neural Network (BPNN)

Back Propagation Neural Network is an artificial neural network model based on the back propagation algorithm, used for solving classification and regression problems [25]. It typically comprises an input layer, one or more hidden layers, and an output layer, with varying numbers of neurons in each. Neurons in adjacent layers are fully connected, and each connection is assigned a weight. In this study, a neural network model with a four-layer structure is chosen, consisting of an input layer, two hidden layers, and an output layer. The number of neurons in the input layer corresponds to the number of environmental factors chosen. The two hidden layers have 10 and 6 neurons, respectively, and employ the ReLU activation function. The output layer uses the Softmax function and has two output nodes. The model structure is illustrated in Figure 1.
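To make the layer structure concrete, the following sketch runs a forward pass through a 10-10-6-2 network with ReLU hidden layers and a Softmax output. It is not the trained model from the study: the weights are random, training by back propagation is omitted, and the ordering of the two output nodes is an assumption.

```python
import math
import random

LAYER_SIZES = [10, 10, 6, 2]  # input, hidden 1, hidden 2, output


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    # Fully connected layer: every output neuron sees every input.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_network(rng):
    # Small random weights and zero biases; no training is performed here.
    layers = []
    for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(inputs, layers):
    hidden = inputs
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))


network = init_network(random.Random(0))
# Assumed output order: [P(non-landslide), P(landslide)].
probabilities = forward([1.0] * 10, network)
```

The Softmax output always sums to 1, so the second node can be read directly as a susceptibility index between 0 and 1 once the network has been trained.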

3. Study Area and Data

3.1. Study Area

Henan Province, situated in the central–eastern region of China, spans the middle and lower sections of the Yellow River. Its geographical coordinates range from 31°23′ to 36°22′ north latitude and 110°21′ to 116°39′ east longitude, covering an area of approximately 166,000 square kilometers. The province comprises 55.7% plains and 44.3% mountainous and hilly terrains. It experiences an average annual temperature between 10.5 °C and 16.7 °C, with annual rainfall varying from 407.7 to 1295.8 mm. Primarily within the warm temperate zone, the climate transitions to subtropical in the southern areas. Characterized by a continental monsoon climate, the region shifts from the northern subtropical zone to the warm temperate zone. Additionally, from east to west, there is a transition from plain to hilly and mountainous climates. This study focuses on the extensive mountainous and hilly regions in western Henan Province, including the Luoyang Basin, stretching from the province’s western border to the eastern border of Zhengzhou City. The terrain of the study area is diverse, primarily consisting of mountains and hills, as well as plains and basins. The elevations of the mountains range mostly between 1000 and 2000 m. The northern side is characterized by loess hills, with elevations typically ranging from 500 to 700 m. The constant erosion of rainwater has formed various landforms such as valleys, ravines, loess hills, terraces, and loess plateaus. The western and southern sides are mostly mountains and hills, with numerous high mountains forming dense water networks. The plains and basins are formed at the junctions of these high mountains and water networks, concentrated along the floodplains of the Yi, Luo, Ru, and Jian rivers, as well as in alluvial plains such as Lushi and Zhangtou.

Figure 2 below presents detailed views of two specific landslide sites within the study area.

3.2. Database Preparation and Analysis

Based on the geological disaster investigation results of various counties and cities in the western region of Henan Province, landslide cataloging has been conducted. By the end of 2017, a total of 834 landslides had been recorded, categorized into two types: loess landslides and rockslides (Figure 3). These landslides are mainly distributed in mountainous and hilly areas and along roadsides.

Landslide susceptibility is determined based on the basic environmental conditions of an area to assess the spatial probability of landslides occurring. In accordance with basic environmental data and geological conditions, and with reference to the relevant literature analyzing the characteristics of landslide formation in the western region of Henan Province, a total of 10 environmental factors were selected: elevation, slope, aspect, profile curvature, land cover, lithology, topographic wetness index, distance from river, distance from faults, and distance from road [26]. GIS was used to extract raster layers of each environmental factor at a resolution of 30 m (Figure 4), and the FR values of each environmental factor were calculated (Table 1).

(1) Terrain and landform factor: From Figure 4a and Table 1, it is evident that when the elevation is between 400 and 1000 m, the FR value is greater than 1. This reflects differences in human activities and engineering construction intensity across elevation zones, which lead to significant differences in the disturbance experienced by slopes [27,28]. Landslides in western Henan mainly occur on slopes of 0°–30°, and slopes of 10°–30° are the most prone to landslide formation. Slope is considered a critical condition for landslide initiation. Slopes that are too gentle cannot provide enough driving force, while slopes that are too steep are not conducive to the accumulation of slope materials and thus fail to provide a sufficient material basis for landslides [29,30]. In addition, the aspect influences the distribution of solar radiation intensity and rainfall on slopes [31]. The study results indicate that southeast, south, and southwest aspects significantly influence the formation of landslides in the region. Profile curvature, derived from slope calculation, represents the rate of change of ground elevation along the direction of maximum slope descent. It affects the speed of movement of landslide material and rainfall runoff, and thus has an important influence on the spatial distribution and amount of landslide source material. In general, when the profile curvature is greater than 0, landslides are more likely to occur.

(2) Basic geological factors and land cover factor: Lithology, as the material basis for the development of various geological disasters, plays a fundamental controlling role in the development of landslides. The lithological unit with the largest area and the most concentrated landslide distribution is mainly composed of intermediate-acidic volcanic rocks such as andesite and andesitic basalt, with small amounts of rhyolite and volcanic clastic rocks. The unit with the second largest distribution includes shale, sandstone, limestone, and dolomite. The unit with the third most abundant distribution of landslides mainly consists of clay, silty clay, and clayey loam, with sub-clayey loamy paleosols at the bottom, accounting for 34.87% of the area and 41.25% of landslides. The fault structures in the geological body play a decisive role in the development of joint fissures. These fissures cause the rock mass to be segmented or fragmented, thereby damaging the stability of slopes. The distance from faults is divided into nine buffer zones, and landslides are most concentrated within 1000 m of faults, where the FR value is highest. Proximity to faults therefore promotes landslide formation. In addition, different land use types have varying degrees of impact on landslide occurrence. Figure 4e shows that land use in the study area is classified into seven categories. When land use types are cropland and grassland, FR values are greater than 1, indicating a higher susceptibility to landslides.

(3) Hydrological environmental factors: The topographic wetness index (TWI) is considered as a composite topographic index used to assess the spatial distribution of soil moisture and to quantitatively simulate the wetness conditions of terrain and soil moisture within a watershed [32]. The TWI in the study area is divided into six levels: <5, 5~7.5, 7.5~10, 10~12.5, 12.5~15, and >15. In the western Henan region, the TWI in most areas is less than 10. It is worth noting that in areas where TWI ranges from 5 to 10, the corresponding FR values are less than 1, indicating that these areas are relatively unfavorable for landslide development. Rivers in the western Henan region were extracted, and buffers were created at 500 m intervals. It was observed that areas within 2000 m of rivers generally exhibited FR values greater than 1. Among them, areas within 500 m of rivers showed the highest FR values, reaching 1.411.

(4) Human activity factor: Human engineering activities have always been a significant trigger for landslide formation, and the distance from road reflects the degree of influence of human engineering activities on landslide disasters [33]. Different buffer zones were set for roads in the study area, including 0~200 m, 200~400 m, 400~600 m, 600~800 m, 800~1000 m, and >1000 m. In the 0~200 m buffer zone from roads, the frequency ratio value is greater than 1 and reaches its maximum. However, in other distance ranges, the frequency ratio values are all less than 1, indicating that in these areas that are relatively far from roads, the impact of human activities on landslides gradually diminishes.

4. Results and Discussions

The study utilized historical landslide data from the western region of Henan Province as sample data. The information value model and frequency ratio method were employed to guide the random selection of negative sample points. With the assistance of Random Forest (RF) and Back Propagation Neural Network (BPNN) models, landslide susceptibility assessment in the western region of Henan Province was conducted. The main process and modeling steps are as follows:

1. Compile landslide data in the study area and obtain relevant environmental factors.

2. According to the frequency ratio method, determine the FR values of the factors for landslides that have occurred.

3. Use the information value model to assess landslide susceptibility and then classify it into five levels: very high, high, medium, low, and very low.

4. Adopt two sampling methods: (a) Positive samples are landslide points, and negative samples are chosen randomly from the entire area. (b) Landslide points constitute the positive samples, while negative samples are randomly selected from the medium-, low-, and very-low-susceptibility zones delineated by the information value model. Use FR values as input variables for the model, with landslide and non-landslide points as output variables.

5. Apply the trained model to predict landslide susceptibility indices and generate maps.

6. Assess the model's accuracy using the Receiver Operating Characteristic (ROC) curve, compare the evaluation outcomes, and analyze the influence of negative sample selection on landslide susceptibility.

4.1. Sampling Partitioning and Optimization Based on the Information Value Model

In this research, grid cells are selected as assessment units, and the study area is divided into grids with a spatial resolution of 30 m. Landslides are represented as point data in the GIS platform; thus, the quantity of landslides corresponds to the total grid cells they encompass. Utilizing the frequency ratio values of the influencing factors (Table 1) and the calculation formula of the information value method, the information value for each classification is computed. Subsequently, raster layers corresponding to each classification are obtained, and a susceptibility index map based on the information values is generated. Employing the natural breaks method, susceptibility is divided into five levels (very high, high, medium, low, and very low), which yields the landslide susceptibility zoning map (Figure 5).

To prevent negative sample points from being located within potential landslide areas, an optimization of negative sample sampling methods was conducted. Based on the landslide susceptibility zoning generated using the information value model, the sampling area was further divided. The areas of very high and high susceptibility were combined into Zone 1, while areas with medium, low, and very low susceptibility were grouped into Zone 2. Through random selection, 1000 non-landslide grid cells were chosen from Zone 1 and Zone 2 combined, as well as 1000 from Zone 2 alone, resulting in Sampling I and Sampling II (Figure 6). Various machine learning models were utilized to assess the effects of different sampling methods on landslide susceptibility modeling predictions.
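The two sampling schemes differ only in which susceptibility levels are allowed to supply negative samples. The sketch below illustrates this on a hypothetical 50 × 50 grid with made-up levels and landslide cells; the function name and data layout are assumptions, and real use would read the levels from the zoning raster of Figure 5.

```python
import random

ZONE_1 = {"very high", "high"}
ZONE_2 = {"medium", "low", "very low"}


def sample_negatives(cell_levels, landslide_cells, n, allowed_levels, rng):
    """Randomly pick n non-landslide cells whose level is in allowed_levels."""
    candidates = [cell for cell, level in cell_levels.items()
                  if level in allowed_levels and cell not in landslide_cells]
    return rng.sample(candidates, n)  # sampling without replacement


# Hypothetical grid: susceptibility levels assigned in a repeating pattern.
levels = ["very low", "low", "medium", "high", "very high"]
cell_levels = {(r, c): levels[(r + c) % 5]
               for r in range(50) for c in range(50)}
landslide_cells = {(r, r) for r in range(0, 50, 5)}

rng = random.Random(0)
# Sampling I: whole region (Zone 1 and Zone 2 combined).
sampling_1 = sample_negatives(cell_levels, landslide_cells, 1000,
                              ZONE_1 | ZONE_2, rng)
# Sampling II: Zone 2 only.
sampling_2 = sample_negatives(cell_levels, landslide_cells, 1000, ZONE_2, rng)
```

Sampling I can still place negatives in high-susceptibility cells, which is the source of label noise that Sampling II is designed to avoid.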

4.2. Landslide Susceptibility Mapping

The study utilizes two models, BPNN and RF, for training, testing, and landslide susceptibility prediction. Both sets of negative samples consist of 1000 non-landslide points selected using random sampling. These non-landslide points are evenly distributed within the sampling area. Based on the environmental factor analysis and feature selection in Section 3.2, the frequency ratio values of ten environmental factors, including elevation, slope, aspect, land cover, distance from road, etc., are used as model input variables, with landslide points (labeled as 1) and non-landslide points (labeled as 0) serving as output variables.

Based on the Python programming environment and the Scikit-learn library, I-RF, I-BPNN, II-RF, and II-BPNN models are constructed. The 834 assigned landslide grid cells and 1000 non-landslide grid cells are divided into training and testing datasets. The training dataset is used to train the landslide models, while the testing dataset is used to validate the performance of the landslide models [34], with 70% allocated to the training dataset and 30% to the testing dataset. The influence factor layers with 30 m resolution are used as input data and inputted into the trained and tested models, resulting in the probability distribution map of landslide susceptibility index with a resolution of 30 × 30 m (Figure 7). Compared to the I-RF and I-BPNN models, the II-RF and II-BPNN models exhibit a more distinct polarization in the distribution of landslide susceptibility index, where blue and red areas are more prominent in the figure. Due to the classification characteristics of the RF model, the key features play a crucial role in influencing the final probability distribution of susceptibility index, which is evident in the figures of I-RF and II-RF. It is worth noting that, compared to the two RF models, the analysis results of the two BPNN models show fewer fragmented effects. This reflects the advantage of BPNN models in handling complex, nonlinear classification problems.
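A minimal sketch of the data preparation described above: each sample's factor intervals are replaced by their FR values, and the 1834 labelled samples are shuffled and split 70/30. The helper names, the fixed factor ordering, and the seed are assumptions; the study reports using Scikit-learn, whose own splitting utility may round the split sizes differently.

```python
import random


def encode_sample(factor_classes, fr_tables):
    """Replace each factor's interval label by its FR value (fixed factor order)."""
    return [fr_tables[factor][factor_classes[factor]]
            for factor in sorted(fr_tables)]


def split_train_test(features, labels, train_fraction=0.7, seed=0):
    """Shuffle the samples and split them into training and testing sets."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    cut = int(train_fraction * len(order))
    train, test = order[:cut], order[cut:]
    return ([features[i] for i in train], [labels[i] for i in train],
            [features[i] for i in test], [labels[i] for i in test])


# Hypothetical FR lookup tables for two of the ten factors.
fr_tables = {
    "slope": {"0-10": 0.4, "10-30": 1.75},
    "road": {"0-200": 2.0, ">200": 0.8},
}
example = encode_sample({"slope": "10-30", "road": "0-200"}, fr_tables)

# Placeholder feature rows: 834 landslide (1) and 1000 non-landslide (0) samples.
features = [example] * 834 + [[0.8, 0.4]] * 1000
labels = [1] * 834 + [0] * 1000
x_train, y_train, x_test, y_test = split_train_test(features, labels)
```

With this rounding, 1283 samples go to training and 551 to testing.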

Using the natural breaks method, the landslide susceptibility index is divided into five susceptibility levels: very low, low, medium, high, and very high. A landslide susceptibility map (Figure 8) is generated, allowing for a clear observation of the distribution of different levels and reflecting the differences in geological risk distribution within the study area. During the statistical and computational processes, detailed data on the area proportion and landslide proportion of each level are obtained, with special attention to the impact of the sampling partition and optimization of negative samples. Furthermore, the FR values of each level are calculated to measure the influence of negative sample selection on model performance. Comparing the area distribution of susceptibility levels among the four models, the II-BPNN and II-RF models exhibit a polarization in area proportion compared to the I-RF and I-BPNN models. Specifically, the combined proportions of the very low and very high levels predicted by the I-RF, I-BPNN, II-RF, and II-BPNN models are 37.71%, 40.29%, 60.6%, and 76.42%, respectively, with the proportions of the low, medium, and high susceptibility levels gradually decreasing. Regarding the prediction results, the proportions of very high and high susceptibility areas predicted by the I-RF, I-BPNN, II-RF, and II-BPNN models are 26.4%, 31.98%, 38.98%, and 40.43%, respectively. As the prediction of very high and high susceptibility regions reflects the accuracy of susceptibility assessment, this further highlights the crucial role of negative sample selection in landslide susceptibility evaluation. Precise selection of negative samples helps the model better capture the characteristics of high-risk areas, thereby improving the overall performance and prediction accuracy of the model.

The frequency ratio analysis in Section 3.2 reveals several key characteristics of landslide-prone areas in western Henan. Firstly, landslides are predominantly distributed in the mountainous and hilly regions with elevations between 400 and 1000 m, especially on convex slopes within the 10°–30° range. The lithology is primarily composed of intermediate-acidic volcanic rocks, particularly andesite and andesitic basalt, which play a crucial role in the formation of landslides. Additionally, the fault structures within the geological body significantly influence the development of joint fissures, with higher frequency ratios observed closer to faults. Regarding hydrological environmental factors, areas with a topographic wetness index (TWI) between 5 and 10 are less favorable for landslide development, while regions near rivers show significantly higher frequency ratios than areas farther from rivers. Lastly, human engineering activities have a significant impact on landslide occurrence, with areas closer to roads exhibiting higher frequency ratios. The distribution of high susceptibility areas in Figure 8 and the statistical analysis results in Table 2 indicate that the proportions of landslides in the very-high- and high-susceptibility zones predicted by the I-RF, I-BPNN, II-RF, and II-BPNN models are 90.05%, 77.22%, 92.57%, and 84.41%, respectively. These prediction results are consistent with the geological, topographical, and anthropogenic factors revealed by the frequency ratio analysis, thereby validating the effectiveness and reliability of the models in landslide susceptibility assessment.

4.3. Accuracy Analysis

The Receiver Operating Characteristic (ROC) curve, a commonly used tool in machine learning, is an effective method for assessing classifier performance. This curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different thresholds, depicting the model's performance at various operating points. An ROC curve closer to the upper-left corner indicates better classifier performance. The Area Under the Curve (AUC) is a crucial metric in ROC analysis, representing the classifier's ability to distinguish between positive and negative samples. In this study, AUC (ranging from 0 to 1), Accuracy, Specificity, and Recall are used to evaluate the overall performance of the constructed models. Accuracy is the proportion of correctly classified samples out of all samples; Recall is the proportion of landslide samples that are correctly predicted; and Specificity is the proportion of non-landslide samples that are correctly predicted.
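The four metrics can be written out directly from their definitions. In the sketch below, the 0.5 decision threshold and the toy scores are assumptions, and AUC is computed with the pairwise ranking formulation (the share of landslide/non-landslide pairs in which the landslide sample receives the higher score), which equals the area under the ROC curve.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, Recall, and Specificity at a fixed decision threshold."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if truth == 1 and predicted == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),        # landslides correctly predicted
        "specificity": tn / (tn + fp),   # non-landslides correctly predicted
    }


def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example: three landslide (1) and three non-landslide (0) samples.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
metrics = classification_metrics(y_true, y_score)
area = auc(y_true, y_score)
```

In this toy case one landslide is missed and one non-landslide is flagged, so Recall and Specificity are both 2/3, while AUC is 8/9 because eight of the nine landslide/non-landslide pairs are ranked correctly.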

The precision evaluation results, as shown in Figure 9 and Table 3, display the ROC curves for the I-RF and I-BPNN models on the left and the curves for the II-RF and II-BPNN models on the right. From the figures, the AUC values for the I-RF, I-BPNN, II-RF, and II-BPNN models are 0.8247, 0.8068, 0.9464, and 0.9522, respectively. The model accuracy ranking is as follows: II-BPNN (AUC = 0.9522) > II-RF (AUC = 0.9464) > I-RF (AUC = 0.8247) > I-BPNN (AUC = 0.8068). This indicates a significant improvement in model accuracy by adjusting the selection of negative samples without changing the positive samples or model parameters. The original unoptimized negative samples have a noticeable impact on the predictive performance of the models. Compared to the I-RF model, the II-RF model shows improvements of 12.17% in AUC, 15.61% in Accuracy, 15.80% in Specificity, and 15.09% in Recall. Similarly, the II-BPNN model outperforms the I-BPNN model by 14.54% in AUC, 16.52% in Accuracy, 14.99% in Specificity, and 17.47% in Recall. The Recall and Specificity of the II-RF model are 94.82% and 86.78%, respectively, while those of the II-BPNN model are 95.50% and 86.36%, respectively, indicating better predictive performance in identifying landslide and non-landslide areas.
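The AUC improvements quoted above can be checked against the reported AUC values. They match absolute differences (percentage points of AUC) rather than relative changes:

```latex
\begin{aligned}
\Delta \mathrm{AUC}_{\mathrm{RF}}   &= 0.9464 - 0.8247 = 0.1217 \quad (12.17\ \text{percentage points}) \\
\Delta \mathrm{AUC}_{\mathrm{BPNN}} &= 0.9522 - 0.8068 = 0.1454 \quad (14.54\ \text{percentage points})
\end{aligned}
```

The same reading presumably applies to the Accuracy, Specificity, and Recall gains, although the underlying I-model values are only given in Table 3 and cannot be verified from the text alone.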

Regarding model selection for different application scenarios, RF is commonly applied to structured data processing and feature importance analysis, while BPNN is well suited to tasks involving complex nonlinear relationships. In this study, the II-BPNN model, which achieves the highest AUC, is better suited to tasks that demand high sensitivity to landslide susceptibility, whereas the II-RF model is better suited to analyzing the role of specific environmental factors. Models should therefore be selected according to task requirements: taking their characteristics and applicable scenarios into account can improve performance on complex and diverse geological tasks.

5. Conclusions

This study focuses on the western region of Henan Province and develops an index system for evaluating landslide susceptibility based on ten environmental factors: elevation, slope, aspect, profile curvature, land cover, lithology, topographic wetness index, distance from river, distance from faults, and distance from road. The susceptibility results of the information value model are used to delineate the sampling area for negative samples, and two sampling strategies are employed: whole-region random sampling (I) and partition-based random sampling (II). RF and BPNN models are then used to delineate landslide susceptibility zones, and their accuracy is compared. The main conclusions are as follows:

(1) Analysis using the frequency ratio method shows that landslide-prone areas in the study region share the following characteristics: elevations of 400–1000 m; slopes of 10°–30°; aspects predominantly southeast, south, and southwest; a positive profile curvature; a lithology mainly composed of andesite, with minor basaltic andesite, rhyolite, and volcaniclastic rocks; a topographic wetness index ranging from 12.5 to 15.0; distances to faults within 1000 m; distances to rivers within 500 m; land cover consisting mainly of cropland and grassland; and distances to roads within 200 m. The proportions of landslides falling in the high and very high zones predicted by models I-RF, I-BPNN, II-RF, and II-BPNN are 90.05%, 77.22%, 92.57%, and 84.41%, respectively. These zones exhibit the characteristics listed above, indicating that all four models effectively predict landslide-prone areas.

(2) The area of high landslide susceptibility in the western region of Henan Province increases across models I-RF, I-BPNN, II-RF, and II-BPNN, with the combined proportions of high- and very-high-susceptibility areas being 26.40%, 31.98%, 38.98%, and 40.43%, respectively. Compared to models I-RF and I-BPNN, models II-RF and II-BPNN more accurately identify and predict high- and very-high-susceptibility areas. The AUC values of models II-RF and II-BPNN are 0.9464 and 0.9522, respectively, an improvement of approximately 12 percentage points over the I-RF model and nearly 15 percentage points over the I-BPNN model. This underscores the importance of sample set selection for model performance.

(3) The purpose of optimizing the sampling area for negative samples is to reduce the likelihood of non-landslide samples being located near actual landslide boundaries. This enhances the prediction accuracy and classification ability of the machine learning model, enabling better identification of non-landslide grid cells. In situations where it is difficult to obtain accurate landslide boundaries, this method serves as an alternative approach for achieving relatively accurate landslide susceptibility zoning and prediction results.
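The contrast between the two negative-sampling strategies can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cell records, the `zone` labels, the function name `sample_negatives`, and in particular the choice of the "very low" and "low" information-value zones as the restricted sampling area are assumptions made for the example.

```python
import random


def sample_negatives(cells, n, strategy, allowed_zones=("very low", "low"), seed=0):
    """Draw n non-landslide cells.

    strategy "I":  random sampling over the whole region.
    strategy "II": random sampling restricted to allowed_zones
                   (assumed here to be the low-susceptibility zones of the
                   information value model).
    """
    candidates = [c for c in cells if not c["is_landslide"]]
    if strategy == "II":
        candidates = [c for c in candidates if c["zone"] in allowed_zones]
    elif strategy != "I":
        raise ValueError("strategy must be 'I' or 'II'")
    if n > len(candidates):
        raise ValueError("not enough candidate cells")
    return random.Random(seed).sample(candidates, n)


# Hypothetical toy grid: 100 cells spread evenly over five susceptibility zones,
# with landslides placed in some of the "very high" cells.
zones = ["very low", "low", "medium", "high", "very high"]
cells = [
    {"id": i, "zone": zones[i % 5], "is_landslide": i % 5 == 4 and i % 2 == 0}
    for i in range(100)
]

neg_whole_region = sample_negatives(cells, 10, "I")
neg_partitioned = sample_negatives(cells, 10, "II")
print(sorted({c["zone"] for c in neg_whole_region}))
print(sorted({c["zone"] for c in neg_partitioned}))
```

Under strategy I, negative samples may fall in zones that the information value model already rates as susceptible; under strategy II they cannot, which is the property the optimization relies on.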

Author Contributions

Conceptualization, X.W. and X.M.; Data curation, X.M. and D.G.; Methodology, X.W. and X.M.; Project administration, Z.H.; Software, X.M. and D.G.; Formal analysis, X.M., X.W. and D.G.; Resources, G.Y. and Z.H.; Visualization, X.M. and D.G.; Writing – original draft, X.M. and X.W.; Writing – review and editing, X.M., X.W. and G.Y.; Supervision, G.Y. and Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

The research leading to these results received funding from the Key Research and Development Project of Henan Province under Grant Agreement No. 221111321500.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request. The data are not publicly available due to ongoing support from the funding project, which restricts direct public disclosure at this time.

Conflicts of Interest

The authors certify that there are no conflicts of interest with any individual/organization for the present work.


Figure 1. Artificial neural network model structure.

Figure 2. Photo of landslide site in the study area.

Figure 3. Topographic map and landslide distribution in western Henan Province.

Figure 4. Distribution of environmental factors.

Figure 5. Landslide susceptibility zoning based on information value model.

Figure 6. Sampling distribution.

Figure 7. Landslide susceptibility index map.

Figure 8. Landslide susceptibility zoning map of machine learning models.

Figure 9. ROC curves and AUC values. I-RF, I-BPNN (Left); II-RF, II-BPNN (Right).

Table 1. Frequency ratio of environmental factors.

Environmental Factors	Classes	% Class Pixels	% Landslide Pixels	FR
Elevation	0–200 m	8.59%	0.84%	0.098
	200–400 m	22.21%	9.59%	0.432
	400–600 m	19.24%	25.42%	1.321
	600–800 m	14.91%	29.86%	2.003
	800–1000 m	13.29%	22.06%	1.660
	1000–1200 m	11.00%	7.31%	0.665
	1200–1400 m	6.86%	3.96%	0.577
	>1400 m	3.90%	0.96%	0.246
Slope	0°–10°	43.49%	34.05%	0.783
	10°–20°	27.85%	36.81%	1.322
	20°–30°	18.27%	22.78%	1.247
	30°–40°	8.48%	5.76%	0.679
	40°–50°	1.76%	0.60%	0.340
	>50°	0.14%	0.00%	0.000
Aspect	Flat (−1)	0.28%	0.00%	0.000
	North (0–22.5, 337.5–360)	12.77%	10.79%	0.845
	Northeast (22.5–67.5)	13.88%	13.55%	0.976
	East (67.5–112.5)	12.89%	12.47%	0.967
	Southeast (112.5–157.5)	11.70%	16.07%	1.374
	South (157.5–202.5)	12.69%	14.15%	1.115
	Southwest (202.5–247.5)	12.61%	13.67%	1.084
	West (247.5–292.5)	11.69%	9.59%	0.820
	Northwest (292.5–337.5)	11.50%	9.71%	0.845
Profile curvature	<−1	5.79%	4.08%	0.704
	−1~−0.5	11.26%	7.31%	0.649
	−0.5~0	32.99%	29.74%	0.901
	0~0.5	32.04%	33.33%	1.040
	0.5~1	12.00%	17.75%	1.479
	>1	5.91%	7.79%	1.319
Land cover	Plowland	42.47%	46.64%	1.098
	Forest	43.60%	38.01%	0.872
	Grass land	6.76%	10.31%	1.526
	Wetland	0.23%	0.12%	0.527
	Water body	0.90%	0.12%	0.133
	Artificial surface	6.02%	4.80%	0.797
	Bare land	0.03%	0.00%	0.000
Lithology	The lithology of the strata is predominantly composed of andesite, with minor occurrences of basaltic andesite, rhyolite, and volcaniclastic rocks.	17.61%	26.62%	1.511
	The lithology of the strata mainly consists of shale, sandstone, limestone, and dolomite.	5.39%	8.63%	1.601
	The primary lithologies consist of clay, silty clay, clayey silt, and loess, with the presence of alluvial sand and gravel layers at the base.	11.87%	6.00%	0.505
	The lithology of the strata comprises shallow metamorphic clastic rocks interbedded with carbonate rocks, characterized by carbonaceous material, intercalated coal seams, and interbedded coarse-grained (basaltic) rocks.	1.07%	4.44%	4.158
	The lithology of the strata primarily consists of gneiss, schist, amphibolite, and phyllite, intercalated with marble and magnetite quartzite.	4.35%	4.20%	0.965
	The lithology of the strata is predominantly composed of limestone. It includes chert-banded limestone and limestone containing interbedded chert.	5.00%	3.48%	0.695
	The lithology comprises silty clay, clayey silt, and sandy gravel.	7.88%	2.76%	0.350
	…	…	…	…
TWI	<5.0	34.17%	36.45%	1.067
	5.0~7.5	46.91%	43.29%	0.923
	7.5~10.0	11.93%	10.55%	0.885
	10.0~12.5	4.45%	5.88%	1.321
	12.5~15.0	1.62%	2.64%	1.626
	>15.0	0.91%	1.20%	1.311
Distance from river	0~500 m	9.60%	13.55%	1.411
	500~1000 m	8.39%	8.99%	1.071
	1000~1500 m	7.81%	7.91%	1.014
	1500~2000 m	7.33%	8.15%	1.112
	2000~2500 m	6.84%	7.31%	1.070
	>2500 m	60.03%	54.08%	0.901
Distance from faults	0~1000 m	10.74%	12.71%	1.184
	1000~2000 m	10.48%	9.83%	0.938
	2000~3000 m	9.75%	9.23%	0.947
	3000~4000 m	8.99%	10.19%	1.134
	4000~5000 m	8.11%	7.79%	0.961
	5000~6000 m	6.84%	6.47%	0.947
	6000~7000 m	5.78%	6.59%	1.141
	7000~8000 m	5.00%	4.44%	0.888
	>8000 m	34.33%	32.73%	0.954
Distance from road	0~200 m	32.29%	54.08%	1.675
	200~400 m	23.40%	18.94%	0.810
	400~600 m	16.33%	10.07%	0.617
	600~800 m	10.93%	7.79%	0.713
	800~1000 m	7.01%	4.68%	0.667
	>1000 m	10.03%	4.44%	0.442

Table 2. Statistical results of landslide susceptibility evaluation grades for machine learning models.

Model	Susceptibility Level	Class Pixels	% Class Pixels	Landslide Pixels	% Landslide Pixels	FR
I-RF	Very Low	8,757,965	26.97%	6	0.72%	0.027
	Low	8,696,789	26.78%	24	2.88%	0.107
	Medium	6,442,641	19.84%	53	6.35%	0.320
	High	5,086,075	15.66%	127	15.23%	0.972
	Very High	3,487,518	10.74%	624	74.82%	6.966
I-BPNN	Very Low	7,905,889	24.35%	25	3.00%	0.123
	Low	8,143,737	25.08%	71	8.51%	0.339
	Medium	6,037,252	18.59%	94	11.27%	0.606
	High	5,208,466	16.04%	194	23.26%	1.450
	Very High	5,175,644	15.94%	450	53.96%	3.385
II-RF	Very Low	11,034,546	33.98%	13	1.56%	0.046
	Low	5,575,632	17.17%	16	1.92%	0.112
	Medium	3,204,537	9.87%	33	3.96%	0.401
	High	4,012,344	12.36%	105	12.59%	1.019
	Very High	8,643,929	26.62%	667	79.98%	3.004
II-BPNN	Very Low	14,106,961	43.44%	56	6.71%	0.155
	Low	3,051,532	9.40%	39	4.68%	0.498
	Medium	2,185,191	6.73%	35	4.20%	0.624
	High	2,418,843	7.45%	58	6.95%	0.934
	Very High	10,708,461	32.98%	646	77.46%	2.349
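The FR column of Table 2 can be reproduced from the pixel counts. The sketch below assumes that FR is the share of landslide pixels in a class divided by the share of all pixels in that class, which is consistent with the tabulated values; it uses the I-RF rows only, and the variable names are illustrative.

```python
# I-RF rows of Table 2: susceptibility level -> (class pixels, landslide pixels).
i_rf = {
    "Very Low": (8_757_965, 6),
    "Low": (8_696_789, 24),
    "Medium": (6_442_641, 53),
    "High": (5_086_075, 127),
    "Very High": (3_487_518, 624),
}

total_pixels = sum(pixels for pixels, _ in i_rf.values())
total_landslides = sum(slides for _, slides in i_rf.values())

# Frequency ratio: landslide share of the class divided by its area share.
fr = {
    level: (slides / total_landslides) / (pixels / total_pixels)
    for level, (pixels, slides) in i_rf.items()
}

# Share of all landslides captured by the High and Very High zones, in percent.
high_share = 100 * (i_rf["High"][1] + i_rf["Very High"][1]) / total_landslides

for level, value in fr.items():
    print(f"{level}: FR = {value:.3f}")
print(f"High + Very High landslide share: {high_share:.2f}%")
```

An FR above 1 means a zone contains more landslides than its area share alone would suggest, so a well-performing model concentrates FR in the Very High zone.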

Table 3. Performance comparison of the machine learning models.

Assessment Indicators	I-RF	I-BPNN	II-RF	II-BPNN
AUC	82.47%	80.68%	94.64%	95.22%
Accuracy	75.68%	74.41%	91.29%	90.93%
Specificity	70.98%	71.37%	86.78%	86.36%
Recall	79.73%	77.03%	94.82%	94.50%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
