Risk Mapping of Geological Hazards in Plateau Mountainous Areas Based on Multisource Remote Sensing Data Extraction and Machine Learning (Fuyuan, China)

Zhang, Shaohan; Tan, Shucheng; Sun, Yongqi; Ding, Duanyu; Yang, Wei

doi:10.3390/land13091361


by Shaohan Zhang ¹,², Shucheng Tan ²,³,*, Yongqi Sun ², Duanyu Ding ⁴ and Wei Yang ¹,²

¹ Institute of International Rivers and Eco-Security, Yunnan University, Kunming 650500, China

² Yunnan International Joint Laboratory of Critical Mineral Resource, Kunming 650500, China

³ School of Earth Science, Yunnan University, Kunming 650500, China

⁴ Faculty of Architecture and City Planning, Kunming University of Science and Technology, Kunming 650500, China

* Author to whom correspondence should be addressed.

Land 2024, 13(9), 1361; https://doi.org/10.3390/land13091361

Submission received: 14 August 2024 / Revised: 20 August 2024 / Accepted: 20 August 2024 / Published: 26 August 2024

(This article belongs to the Topic Landslides and Natural Resources)


Abstract

Selecting the most effective prediction model and correctly identifying the main disaster-driving factors in a specific region are the keys to addressing the challenges of geological hazards. Fuyuan County is a typical plateau mountainous town where slope geological hazards occur frequently. Therefore, it is highly important to study the spatial distribution characteristics of hazards in this area, explore machine learning models that are well matched to the geological environment of the study area, and improve the accuracy and reliability of the slope geological hazard risk zoning map (SGHRZM). This paper proposes a hazard mapping research method based on multisource remote sensing data extraction and machine learning. In this study, we visualize the risk level of geological hazards in the study area according to 10 disaster-causing factors. Moreover, the accuracy of the disaster point list was verified in the field. The results show that the coupling model can maximize the respective advantages of the models used and has the highest mapping accuracy, with an area under the curve (AUC) of 0.923. The random forest (RF) model performed best among the single models, with an AUC of 0.909. The grid search algorithm (GSA) is an efficient parameter optimization technique that can be used as a preferred method to improve the accuracy of a model. The list of disaster points extracted from remote sensing images is highly reliable. The high-precision coupling model and the single model have good adaptability in the study area. The research results can provide not only scientific references for local government departments to carry out disaster management work but also technical support for relevant research in surrounding mountainous towns.

Keywords: machine learning model; remote sensing datasets; disaster point extraction; geological hazard risk mapping

1. Introduction

Slope geological hazards generally include collapses, landslides and debris flows, and the damage they cause is second only to that of earthquakes. Because they are unpredictable and widespread, geological hazards have become a research hotspot that has attracted attention from all walks of life, and advanced machine learning models have become a key technology for dealing with this complex problem. Many geological hazard risk mapping models have been developed in recent decades. The various mapping techniques can generally be grouped into four categories: heuristic [1], physical [2], mathematical [3], and machine learning [4]. The heuristic model relies on expert experience to predict the potential areas for future geological hazards, making it an unreliable prediction method for changing geological environments [5]. Physical models, in contrast, evaluate slope stability using techniques such as the limit equilibrium approach, which requires complicated mechanical and hydrological conditions to be taken into account [6,7]; they are mainly applicable to geological hazards with clear damage and resistance factors. Model simplification, high data requirements, strong subjectivity and the high cost of field tests limit the spread of heuristic models and physical models. With the continuous progress of human society, the complexity of the original surface is also increasing, which urgently requires more intelligent machine learning models to assume the role of disaster prediction and reduce the losses caused by disasters. However, not all machine learning models have the same effect on disaster prediction in the same region. Many researchers compare and couple single machine learning models to derive the most effective local prediction model.

In recent years, machine learning models have almost replaced the traditional methods and have become crucial in the field of geological hazard mapping [8]. It has also been observed that combining machine learning models with a geographic information system (GIS), which is usually used for geological hazard risk (GHR) assessments, makes mapping considerably simpler [9]. In a study conducted in the Uttarkashi region of Uttarakhand in India, Kainthura and Sharma [10] compared the effectiveness of the random forest (RF) model, back propagation neural network, and Bayesian network (BN) in evaluating landslide susceptibility. They found that the RF model outperformed the others. Similarly, Youssef et al. [11] compared the performances of RF, boosting regression tree, classification regression tree, and general linear modeling techniques in a landslide susceptibility assessment in the Wadi Taya Basin in the Asir region of Saudi Arabia. The results showed high accuracy for all four models, with the boosting regression tree model demonstrating the highest accuracy. Additionally, Goetz et al. [12] evaluated the sensitivity of landslide geological hazards in three Austrian locations by comparing the predictive power of machine learning algorithms with quantitative statistics. The results highlighted the suitability of the RF model for landslide sensitivity modeling in the region. Huang et al. [13] conducted a comparative analysis of the prediction accuracy of data models in landslide susceptibility mapping, including heuristic, general statistical, and machine learning models. The results showed that machine learning algorithms reasonably depict the characteristics of geological hazard distribution. To address the shortcomings of the artificial neural network (ANN) model, Moayedi et al. [14] optimized the ANN with two algorithms. The results show that the accuracy of the optimized ANN was significantly improved and that the resulting landslide susceptibility map is reliable. 
Zhang et al. [15] optimized the support vector machine (SVM) and proposed three new optimized models. The results show that the particle swarm optimization vector machine has better performance in landslide susceptibility assessments and has greater implementation potential in the Qinghai–Tibet Plateau. Kaur et al. [16] tested the performance of four machine learning models—Naive Bayes, K-nearest neighbor, RF and extreme gradient boosting (XGBoost)—in the landslide susceptibility mapping of a valley section in the Himalayas. Among these four models, the ensemble-based advanced machine learning algorithm XGBoost shows excellent performance. The above research work shows that many machine learning models can perform differently in different geographical regions. As Ali mentioned, there is no general model that can adapt to all environments [17]. Therefore, we must find a model with the highest accuracy among many models as a risk prediction tool. At the same time, we found that previous studies have subjectively combined the models in pairs, and there is no reliable basis. The high correlation between the two models indicates that they have a common mechanism for predicting the same results. In this case, the coupling model can make the overall prediction more accurate and reliable. The two models with high correlation perform well under certain specific factors. By coupling them, the advantages of these models can be utilized in a unified framework to enhance the performance of the overall model. In the last part of this study, we use the correlation of the output results of each model as a link to establish the relationship between single models and further build a set of coupling models to serve the management of geological hazards in plateau mountainous areas.

When evaluating the risk of geological hazards, choosing evaluation units is crucial because it determines how disaster point data are used to build a model for a risk assessment [18]. The regional unit can fully reflect the complex relationship between slope failure, geomorphic environment, and fault structure explored by investigators from field or remote sensing images. However, there are limitations in determining unstable factors [19]. Fewer evaluation units within the study region due to a single watershed’s extensive coverage may compromise the precision of the assessment findings [20]. Therefore, the slope unit (SU), which is based on the division of young valleys, is used as the evaluation unit because it can establish a good relationship with the topography and reflect the control effect of various disaster-causing factors, increasing the validity of the assessment’s findings [21,22].

In the face of such scientific problems, we selected typical mountainous towns as the research object, extracted a list of disaster points from remote sensing images, obtained evaluation factor resources through multiple channels, and established RF, adaptive boosting (ADBoost), XGBoost, support vector machine (SVM), artificial neural network (ANN), Bayesian network (BN), decision tree (DT) and logistic regression (LR) models, totaling eight different models. Finally, the mutual information model (MI) was used to analyze the correlation of the results of a single model. The two models with the highest correlation were coupled, and the risk-zoning results of all models were visually output. The ROC curve and confusion matrix (CM) were used to evaluate the effectiveness of the model, and an extremely high-risk area was extracted to verify the applicability of the model. The purpose of this study is to find the most effective GHR zoning model for mountainous towns and provide a scientific basis for establishing a coupling model. The research results are expected to provide scientific reference for local disaster prevention and mitigation and sustainable development work.

2. Materials and Methods

2.1. Study Area Environment

Fuyuan County is located in the eastern region of Yunnan Province (Figure 1), with geographical coordinates ranging from 103°58′ to 104°49′ E longitude and 25° to 25°58′ N latitude. It is dominated by the canyon landform; the terrain is high in the north and low in the south. The Wumeng Mountain branch spans through the whole territory from north to south, with the highest altitude of 2748.9 m. The lowest altitude is 1110 m. The mountain is high and steep, the karst is developed, the surface water system is developed, the tributaries are densely distributed, the rivers are longitudinally cut, and the topographic fluctuation is prominent, which provides conditions for the formation of landslides, collapses, and debris flows in the whole territory.

Geological structure is the main internal factor leading to disasters. The Neo-cathaysian fault is the controlling fault in this area and is generally characterized by compression–torsion, a large fault distance and long extension. The main structural systems in the area are mountain-shaped structures, northeast (NE)-oriented structures and arcuate structures. The NE-trending structure is composed of a series of NE-trending parallel open short-axis folds and high-angle compressive thrust faults. In addition, secondary extensional and torsional faults perpendicular to or oblique to the tectonic belt have developed. The neotectonic activity in the area is relatively calm, with intermittent uplift, and there is no obvious block movement or fault activity.

The main exposed strata in the working area are Mesozoic–Triassic, upper Paleozoic–Permian, Carboniferous–Devonian, and Cenozoic–Quaternary. The Silurian stratum is the earliest exposed stratum in the study area and is dominated by shale with fine sandstone and mudstone with marl. The distribution of the Devonian strata is limited to the northwest; the strata are mainly composed of sandstone, shale, dolomite and limestone, and the whole area features layered or inclined bedding. The Carboniferous strata are thick and are mainly composed of mudstone, limestone, bioclastic rock, sandstone, and dolomite. Carbonate rock is one of the more developed rock types, with clear bedding and a hard texture. The thickness of the Permian strata is approximately 220 m, and they are composed mainly of light gray and dark gray siltstone and mudstone, with sandstone, basalt, coal seams and siderite. The Triassic stratigraphic structure is dominated by layered sediments, which are mainly composed of sandstone, dolomite, argillaceous limestone, shale and fine sandstone interlayers. The Quaternary sediments are mainly loose gravel, clay and peat, which are widely distributed in modern rivers and lake basins. A geological map of the study area is shown in Figure 2.

In recent years, with the continuous growth of coal production capacity and highway construction capacity, human engineering activities related to housing construction have been increasing, and the trend of geological hazards has intensified. The last major geological hazard occurred in 2013, resulting in a total of 10 deaths and injuries; three houses were destroyed instantaneously, resulting in a crisis affecting thousands of people’s lives and property safety. Therefore, the risk assessment of geological hazards in Fuyuan County has an important impact on early warning, management, sustainability and environmental issues.

2.2. Dataset

The Sentinel-1 satellite provides C-band synthetic aperture radar (SAR) data with all-weather, all-day observation capabilities and high resolution for surface coverage, topography and geomorphology. We obtained the DEM data for the study area from the Alaska Satellite Facility (ASF) via ASF Data Search (https://search.asf.alaska.edu/#/, accessed on 13 March 2024). The DEM data were processed on the ArcGIS platform to obtain the elevation, slope, aspect and topographic relief. We downloaded the red band and near-infrared band data of the Landsat 8 satellite (https://landsat.gsfc.nasa.gov/, accessed on 13 March 2024), used ENVI to extract the reflectance values of the two bands, and substituted them into the NDVI formula. ArcGIS was used to process geological maps and fault lines to obtain the lithology and fault zone distribution. The vector data of the road network and water system were downloaded from the geographic information resource directory service system (https://www.webmap.cn/main.do?method=index, accessed on 14 March 2024), and the main road and water system distribution maps of the study area were extracted on the ArcGIS platform. The rainfall data were provided by the government meteorological department and visualized with ArcGIS 10.6.
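The NDVI calculation mentioned above can be sketched as follows. This is a minimal illustration using the standard definition NDVI = (NIR − Red) / (NIR + Red); the function name, the zero-denominator handling and the sample reflectance values are assumptions for illustration and are not taken from the ENVI workflow used in the study.

```python
def ndvi(nir, red):
    """Return the NDVI of one pixel from near-infrared and red reflectance."""
    total = nir + red
    if total == 0:
        # Assumption: pixels with no signal are assigned 0 instead of failing.
        return 0.0
    return (nir - red) / total


# Hypothetical (NIR, red) reflectance pairs, for illustration only.
pixels = [(0.45, 0.05), (0.30, 0.20), (0.10, 0.10)]
ndvi_values = [ndvi(nir, red) for nir, red in pixels]
```

Dense vegetation reflects strongly in the near-infrared band and weakly in the red band, so higher values indicate denser vegetation cover.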

2.3. Disaster Point Inventory

The disaster point inventory is one of the basic data sources for GHR evaluation [23,24]. It records in detail the previous disaster events in a study area, including key information such as the location, scale and type of disaster points, which helps to evaluate landslide susceptibility more accurately. The disaster point data are used for the training and verification of the model to ensure that the model can accurately predict the spatial distribution of disasters.

The remote sensing disaster dataset used here is the Bijie disaster dataset (http://gpcv.whu.edu.cn/data/, accessed on 16 March 2024). The dataset consists of satellite optical images, disaster boundary shapefiles and digital elevation models. From the TripleSat satellite images taken from May to August 2018, 770 geological hazard samples were cropped, including rockfalls, rockslides and some mudslides, as well as 2003 negative samples covering a variety of backgrounds. The machine learning model learns different features to identify disaster points in the image [17]. The DEM provides terrain height information, which can help the Swin transformer better identify and predict potential landslide areas. Through the DEM data, the Swin transformer learns not only the shape features from optical images but also the three-dimensional structural features of the terrain, which helps the model generalize better across different terrains and complex environments. Because the DEM data provide additional spatial resolution, the model can analyze surface features in more detail. When the Bijie disaster point dataset is used to train the Swin transformer algorithm, two-thirds of the images and the DEM data in the dataset are used as the training set, and the rest are used as the test set. To fully adapt to the characteristics of hyperspectral data, the parameters and architecture of the model are adjusted, especially through the alternating window and hierarchical structure, to effectively capture the morphological features from local to global. In addition, a classification layer of the segmentation network is used to evaluate whether each pixel or block belongs to the disaster area and to obtain the optimal training weight file. 
These optimal training weights are used as a pretrained model to identify landslides in the study area, and the center points of the extracted blocks are used as the list of disaster points, realizing the automatic extraction of disaster points. Field verification of the 171 disaster points extracted from the original data confirmed 144 of them after adjustment. The remaining 27 blocks were discarded because they correspond to open-pit mines, quarries, naturally formed cliffs and wastelands that do not present disaster conditions. A total of 104 landslides, 32 collapses and 8 debris flows constitute the disaster list (Table 1, Figure 3). The attributes used as identification and classification criteria include the following: the motion type, material type, sliding state, geometry of the failure area and the resulting displacement [25,26,27,28,29]. Landslides constitute the main disaster type that restricts the development of Fuyuan County, accounting for approximately 72% (104 of 144) of the list of disaster points.

2.4. Disaster-Causing Factors (DCFs)

The choice of evaluation parameters is a critical step in the risk estimation for geological hazards. Multiple factors are considered and adjusted flexibly according to the specific situation [30,31]. Elevation affects the stability of rocks and soils and indirectly controls the occurrence of natural disasters in a particular area [32,33] (Figure 4a). The greater the slope is, the more prone the area is to geological hazards [34,35] (Figure 4b). Aspect impacts surface runoff and rainfall distribution, which in turn influences gully formation and surface erosion [36] (Figure 4c). Topographic relief refers to the degree of change in surface elevation and corresponds to the complexity of the terrain (Figure 4d). Geological hazard development and occurrence are directly influenced by the stratum lithology. The mechanical properties and stability of different rock types vary considerably [37,38] (Figure 4e). The description of the lithology category is shown in Table 2. In areas near fault zones, the geological conditions, such as stratigraphic lithology and structural characteristics, are more complicated, often triggering geological hazards [39,40] (Figure 4f). The tensile action of vegetation roots on slope rock and soil improves the structural stability of rock and soil in regions with significant vegetation covering [41,42] (Figure 4g). As one of the primary causes of geological hazards, rainfall has a significant impact on the frequency and severity of these events [43,44] (Figure 4h). Areas close to rivers face an increased possibility of geological catastrophes like landslides and collapses [45] (Figure 4i). The distance from the road is a key factor in the frequency and severity of geological catastrophes and represents the linear influence of human engineering operations on geological hazards (Figure 4j). 
In summary, this paper carefully analyzes the disaster mechanism of typical geological hazards in the study area and compares various information sources, such as historical disaster information, geological surveys, meteorological records, and remote sensing images. Based on these data, ten disaster-causing factors, namely, elevation, slope, aspect, topographic relief, stratigraphic lithology, distance from the fault, the normalized difference vegetation index (NDVI), average annual rainfall, distance from the river, and distance from the road, were selected as the evaluation indices.

In this study, the evaluation indicators are divided into two data types: discrete and continuous. Discrete data are classified directly according to their attributes. For each continuous factor, the point density and point ratio of disaster points are counted at a fixed step size, and the grades are divided according to the proportion of disaster points and the abrupt change points of the point density curve.

2.5. Methods

2.5.1. Multiple Collinearity Diagnosis

As the inducing factors of geological hazards vary considerably, collinearity between the selected disaster-causing factors may induce errors in the model, which can affect the model’s accuracy [46]. Therefore, the normalized information values of disaster points and non-disaster points were imported into SPSS 25.0 software for multicollinearity diagnosis, and the tolerance (T) and variance inflation factor (VIF) of each factor were estimated. T and VIF are reciprocals of each other (VIF = 1/T). When T < 0.1, or equivalently VIF > 10, there is a high degree of collinearity [47]. The VIF values of all the evaluation factors are less than 10, with a significance (Sig) less than 0.05, indicating that the factors are statistically meaningful and not strongly collinear [48].
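To make the reciprocal relationship between T and VIF concrete, the sketch below computes both from the coefficient of determination R² obtained when one factor is regressed on the others (T = 1 − R², VIF = 1/T). It is a simplified illustration, not the SPSS procedure used in the study: the two-factor case is assumed so that R² equals the squared Pearson correlation, and the function names and sample values are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def tolerance_and_vif(r_squared):
    """Return (T, VIF) for a factor whose regression on the others gives R^2."""
    tolerance = 1.0 - r_squared
    if tolerance <= 0.0:
        return 0.0, math.inf  # perfectly collinear factor
    return tolerance, 1.0 / tolerance


def is_collinear(tolerance, vif):
    """Apply the thresholds quoted in the text: T < 0.1 or VIF > 10."""
    return tolerance < 0.1 or vif > 10.0


# Hypothetical normalized values of two factors (illustration only).
slope = [0.10, 0.25, 0.40, 0.55, 0.90]
relief = [0.20, 0.15, 0.50, 0.45, 0.80]
r2 = pearson_r(slope, relief) ** 2
t, vif = tolerance_and_vif(r2)
flag = is_collinear(t, vif)
```

With more than two factors, R² would come from a multiple regression of each factor on all remaining ones, which is what statistical packages report.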

2.5.2. Support Vector Machine

SVM is a classification method that finds the optimal separating boundary between two classes. To find this boundary, the algorithm identifies the peripheral data points of each class that lie closest to the other class (the support vectors) and draws the boundary between these two sets of peripheral points. These points directly determine the position and direction of the boundary. We introduce a kernel function to map data that are not linearly separable in the field of geological hazards into a higher-dimensional feature space, where they become linearly separable [49,50]. The output of the radial basis function (RBF) kernel depends mainly on the distance between samples, and its local response characteristics make it perform well on complex data patterns. The model is optimized by configuring the kernel parameter γ and the penalty parameter C. Parameter C controls the degree to which the model penalizes errors: a smaller C value increases the possibility of misclassification, and a larger C value leads to the overfitting of the model. The parameter γ controls the influence range of the RBF kernel function: a smaller γ means a larger similarity range, and a larger γ will also lead to the overfitting of the model. The RBF function formula is as follows:

K(x_{i}, x_{j}) = \exp(-γ ∥x_{i} - x_{j}∥^{2})

(1)

where x_{i} and x_{j} are two sample points in the dataset, and γ is a parameter that controls the decay rate of the RBF function.
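The effect of γ on the RBF kernel can be illustrated with a short sketch. It evaluates the standard RBF kernel exp(−γ‖x_i − x_j‖²) for two samples; the function name and the sample vectors are assumptions chosen for illustration.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


# Two hypothetical normalized factor vectors (illustration only).
sample_a = [0.2, 0.7, 0.1]
sample_b = [0.5, 0.3, 0.4]

# A small gamma keeps distant samples similar; a large gamma makes the
# similarity decay quickly, so each sample influences only its close neighbors.
similarity_small_gamma = rbf_kernel(sample_a, sample_b, gamma=0.1)
similarity_large_gamma = rbf_kernel(sample_a, sample_b, gamma=10.0)
```

Identical samples always give a kernel value of 1, and the value decreases toward 0 as the distance or γ grows.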

2.5.3. Random Forest

RF is an ensemble learning approach that builds a model from several decision trees. The final prediction is obtained by majority voting over the trees for classification or by averaging their outputs for regression [51]. The following properties allow the RF algorithm to perform well in this study. The RF model is based on bagging, also known as bootstrap aggregating, in which several subdatasets are created from the initial dataset by random sampling and a decision tree is built on each subdataset [52]. Each decision tree’s training samples and feature sets are randomly selected. Under the combined influence of these two sources of randomness, the overfitting of the model can be prevented to a certain extent, which results in a robust model outcome [53]. The RF model tolerates outliers and noise in the dataset and can rank features by importance to assess their influence on the result. It is among the top models for machine learning [54].

The Gini impurity can be used to determine the relative importance of every evaluation factor in the risk assessment of geological hazards using the random forest algorithm. The Gini index reduction R_{k x y} of the evaluation factor k at each node split is calculated for every node of all trees in the random forest; these reductions are averaged over all trees and summed to give the relative importance of evaluation factor k. The calculation formula is as follows:

∆_{k} = \frac{\sum_{x = 1}^{m} \sum_{y = 1}^{c} R_{k x y}}{\sum_{k = 1}^{n} \sum_{x = 1}^{m} \sum_{y = 1}^{c} R_{k x y}}

(2)

where n is the total number of evaluation indicators, m is the number of parent nodes in the classification decision tree, c is the number of subnodes, R_{k x y} is the Gini reduction in the k-th evaluation factor at the y-th child node beneath the parent node x, and ∆_{k} is the relative importance of the k-th evaluation factor.
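The normalization in Equation (2) can be sketched as follows: each factor's Gini reductions are summed over parent nodes and child nodes, then divided by the total over all factors so that the importances sum to 1. The nested-list layout `gini_reduction[k][x][y]` and the numbers are assumptions for illustration, not values from the study.

```python
def relative_importance(gini_reduction):
    """Normalize summed Gini reductions per factor so they sum to 1.

    gini_reduction[k][x][y] is the reduction attributed to factor k at
    child node y beneath parent node x (hypothetical layout).
    """
    per_factor = [
        sum(sum(children) for children in parent_nodes)
        for parent_nodes in gini_reduction
    ]
    total = sum(per_factor)
    if total == 0:
        raise ValueError("total Gini reduction is zero")
    return [value / total for value in per_factor]


# Two hypothetical factors, two parent nodes each, two child nodes per parent.
reductions = [
    [[0.2, 0.1], [0.1, 0.0]],  # factor 1: summed reduction 0.4
    [[0.3, 0.1], [0.1, 0.1]],  # factor 2: summed reduction 0.6
]
importance = relative_importance(reductions)
```

In this toy case the second factor receives the larger share because its splits remove more impurity in total.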

2.5.4. Artificial Neural Networks

An ANN typically consists of three layers: input, hidden, and output. The essential data processing takes place in the hidden and output layers. The hidden layer performs feature extraction, and the output layer generates the classification results based on the extracted features and the weights acquired through training. In geological hazard applications, supervised learning algorithms are mostly used; that is, the training data consist of inputs and their corresponding target outputs [55].

2.5.5. Bayesian Network

BN is a probabilistic graphical model based on Bayes’ theorem. In this study, nodes represent the assessment indicators, and edges represent the conditional probability relationships between them. To make judgments and draw conclusions, BN updates the probability distribution of the occurrence of disasters using the data from disaster-causing indicators. The basis of the entire model is the chain rule of probability theory [56]. The probability distribution of the other indicators is computed step by step, starting from the input data. It can assist us in determining the likelihood of geological hazards under specific circumstances or the weight value of each index. The following is the probability reasoning formula for the chain rule:

P(X_{1}, X_{2}, X_{3}, \dots, X_{n}) = \prod_{i = 1}^{n} P(X_{i} | Pa(X_{i}))

(3)

where P(X_{1}, X_{2}, X_{3}, \dots, X_{n}) is the joint probability distribution of the entire network, Pa(X_{i}) denotes the parent nodes of X_{i}, and P(X_{i} | Pa(X_{i})) is the conditional probability of each index X_{i} given its parents.
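A small worked example may help to show how the chain rule in Equation (3) factorizes a joint probability. The three-node network below (rainfall and steep slope as parents of landslide occurrence) and all probability values are hypothetical and chosen only for illustration; they are not the network or parameters of this study.

```python
# Prior probabilities of the two parent nodes (hypothetical values).
p_rain = {True: 0.3, False: 0.7}
p_steep = {True: 0.4, False: 0.6}

# P(slide = True | rain, steep), indexed by the parent states (hypothetical).
p_slide_given = {
    (True, True): 0.80,
    (True, False): 0.30,
    (False, True): 0.20,
    (False, False): 0.05,
}


def joint(rain, steep, slide):
    """Chain rule: P(rain) * P(steep) * P(slide | rain, steep)."""
    p_true = p_slide_given[(rain, steep)]
    p_slide = p_true if slide else 1.0 - p_true
    return p_rain[rain] * p_steep[steep] * p_slide


def prob_slide():
    """Marginal probability of a landslide, summing out both parents."""
    states = (True, False)
    return sum(joint(r, s, True) for r in states for s in states)


p_all_true = joint(True, True, True)  # 0.3 * 0.4 * 0.8
p_landslide = prob_slide()
```

Because each node only needs a table conditioned on its parents, the network stores far fewer numbers than a full joint table over all indicators.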

2.5.6. Extreme Gradient Boosting Tree

Both the XGBoost algorithm and the RF algorithm are ensemble algorithms built on decision tree models. The core idea of the XGBoost algorithm is to gradually construct multiple weak learners through gradient boosting and to combine a regularization technique with parallel processing to increase the model’s capacity for generalization and training efficiency [57]. Studies indicate that the XGBoost algorithm performs well in tackling the issues of dependent variable categorization and independent variable dispersion. For instance, pillar stability, rockburst categorization, and landslide susceptibility have all been predicted using the XGBoost algorithm [37,58,59]. High-dimensional features and big datasets can be handled with XGBoost, which also offers excellent performance and scalability. The XGBoost model is used in geological hazard risk mapping for the feature importance analysis, categorization, and grade or probability prediction of disasters in a given area.

2.5.7. Adaptive Boosting

The core idea of ADBoost to improve the performance of model prediction is to build a strong learner by combining multiple weak learners. It does not limit the type of weak learner, and the goal of a weak learner is to minimize the weighted error rate. All the trained weak learners are combined based on their weights to form a strong learner, and the final prediction result is obtained by weighted voting according to their weights [60].

2.5.8. Logistic Regression

The basic principle of logistic regression is to transform the prediction results of the linear regression model through the Sigmoid function and map its output to between 0 and 1 to realize the classification of samples. In the GHR evaluation, ‘1’ indicates that a disaster occurs, and ‘0’ indicates that it does not occur. Taking various influencing factors as independent variables, the probability of geological hazards under certain conditions can be calculated [61].
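The mapping from a linear combination of factors to a probability between 0 and 1 can be sketched as follows. The weights, bias, factor values and the 0.5 decision threshold are hypothetical and serve only to illustrate the Sigmoid transformation described above.

```python
import math


def sigmoid(z):
    """Numerically stable Sigmoid function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def hazard_probability(factors, weights, bias):
    """Probability of a disaster given factor values and fitted coefficients."""
    linear_output = bias + sum(w * x for w, x in zip(weights, factors))
    return sigmoid(linear_output)


def classify(probability, threshold=0.5):
    """Return 1 (disaster occurs) or 0 (does not occur)."""
    return 1 if probability >= threshold else 0


# Hypothetical normalized factor values and coefficients (illustration only).
factors = [0.8, 0.3, 0.6]
weights = [1.5, -0.7, 2.0]
bias = -1.0
probability = hazard_probability(factors, weights, bias)
label = classify(probability)
```

In practice the weights and bias would be estimated from the disaster and non-disaster samples rather than set by hand.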

2.5.9. Decision Tree

A decision tree classifies and predicts the data by constructing a tree model. Different DCFs are sorted according to their importance and reflected at different levels of the tree [62]. In the GHR evaluation, a decision tree can divide the study area into different risk levels according to the DCF. The prediction results are highly interpretable, which is particularly important for the disaster management decision-making process.

2.5.10. Mutual Information

The MI model is a feature selection and feature association evaluation method based on information theory. In this case, our primary goal is to measure the extent of information sharing between the prediction results of two algorithms, that is, how much the output of one algorithm tells us about the output of another [63]. The formula is as follows:

I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}

(4)

In the above formula, I(X; Y) represents the mutual information between algorithms X and Y, p(x, y) denotes the joint probability distribution of X and Y evaluated at (x, y), and p(x) and p(y) denote the corresponding marginal probability distributions of X and Y.

The correlation coefficient is used to evaluate the correlation strength between each model. The models are sorted according to the size of the coefficient of correlation, and then the corresponding two models are selected as the final ensemble learning model.
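As an illustration of how the agreement between two models' outputs could be scored with mutual information, the sketch below estimates I(X; Y) from two lists of discrete predictions using empirical frequencies. The natural logarithm is an assumption (the formula does not fix the base), and the model outputs shown are hypothetical.

```python
import math
from collections import Counter


def mutual_information(labels_x, labels_y):
    """Empirical mutual information (in nats) between two label sequences."""
    if len(labels_x) != len(labels_y) or not labels_x:
        raise ValueError("sequences must be non-empty and equally long")
    n = len(labels_x)
    joint_counts = Counter(zip(labels_x, labels_y))
    x_counts = Counter(labels_x)
    y_counts = Counter(labels_y)
    mi = 0.0
    for (x, y), count in joint_counts.items():
        p_xy = count / n
        p_x = x_counts[x] / n
        p_y = y_counts[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


# Hypothetical risk classes predicted by two models for the same slope units.
model_a = [0, 1, 1, 0, 1, 0, 1, 1]
model_b = [0, 1, 1, 0, 0, 0, 1, 1]
shared_information = mutual_information(model_a, model_b)
```

Two models whose outputs are statistically independent score 0, while identical outputs score the entropy of the predictions, so larger values indicate more strongly related models.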

2.6. Parameter Optimization Algorithm

The GSA is widely used to optimize hyperparameters in machine learning and deep learning. This hyperparameter optimization technique uses an exhaustive search to traverse all possible combinations of hyperparameters and find the optimal one. After determining the hyperparameters that need to be optimized and the value range of each hyperparameter, a grid is created. For each set of hyperparameters in the grid, the training set is used to train the model, and the performance of the model is evaluated on the validation set. The set of hyperparameters with the best performance on the validation set is taken as the optimal parameter set of the algorithm.
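The exhaustive traversal described above can be sketched in a few lines. The `evaluate` callback stands in for "train on the training set and score on the validation set"; here it is replaced by a hypothetical toy scoring function, and the grid values for C and gamma are likewise assumptions for illustration.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the best one.

    param_grid maps each hyperparameter name to its candidate values;
    evaluate(params) must return a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Hypothetical stand-in for a real train-and-validate step."""
    return 1.0 - (params["C"] - 10) ** 2 / 1000 - (params["gamma"] - 0.1) ** 2


grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_validation_score)
```

The number of evaluations is the product of the candidate counts (nine here), which is why the grid is usually kept coarse for models that are expensive to train.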

2.7. Validation Techniques

A frequently used metric for measuring the accuracy of a classification model is the area under the ROC curve (AUC) [64]. The nearer the ROC curve is to the upper left corner, the more accurately the model performs. An AUC approaching 1 indicates better performance, whereas a value close to 0.5 indicates that the model performs little better than random guessing [65].
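As a small illustration of what the AUC measures, the sketch below computes it as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. This is one standard way to compute the AUC and is not necessarily how the ROC curves in this study were produced; the labels and scores are toy values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = disaster point, 0 = non-disaster point.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(toy_labels, toy_scores)
```

In the toy data, eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9.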

A popular technique for assessing algorithm performance is the confusion matrix, which provides an easy-to-read representation of the predictions made by an algorithm for each category. The accuracy, recall, precision, and specificity derived from the confusion matrix are typically used to analyze the algorithm’s performance. Precision is the proportion of the samples predicted as positive by the algorithm that are truly positive, and it is calculated with Formula (5). It is an intuitive and commonly used evaluation index. In this study, we use precision to measure the effectiveness of the algorithm [66,67].

Precision = \frac{TP}{TP + FP}

(5)

In Formula (5), TP denotes the number of instances that the model predicts as disaster points and that are actually disaster points, while FP denotes the number of instances that the model predicts as disaster points but that are actually non-disaster points.
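The four confusion-matrix metrics named above can be computed as in the following sketch, which uses the standard definitions: a false positive (FP) is a predicted disaster point that is actually a non-disaster point, and a false negative (FN) is a missed disaster point. The labels are toy values, not the study's results.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 = disaster point."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "precision": tp / (tp + fp),    # predicted positives that are correct
        "recall": tp / (tp + fn),       # actual positives that are found
        "specificity": tn / (tn + fp),  # actual negatives that are found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Toy labels: 4 disaster points and 4 non-disaster points.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

The sketch does not guard against empty denominators (for example, a model that never predicts a disaster point), which a production version would need to handle.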

2.8. Experimental Process

The SU, created from DEM data of the study area, is the basic mapping unit in this study. The Swin Transformer is used to extract the list of disaster points from remote sensing images, and the list is verified through field investigation using UAV surveys, GPS positioning and manual identification. The geological environment, natural geographical conditions, spatial distribution of disaster points and typical geological hazard mechanisms in the study area were analyzed, and 10 DCFs were selected to draw the SGHRZM. The training and test datasets of each model are built from the disaster point list and the DCF values attached to each point. After each model is run, the weight of each index under that model is obtained. MI is used to correlate the outputs of the eight classifiers, and the coupling model is selected according to the correlation coefficient. Finally, the hazard mapping results of all single models and the coupled model are visualized on the ArcGIS platform. To verify the reliability of each model, the ROC curve and confusion matrix were used to compare model accuracy (Figure 5).

3. Results

3.1. Results of Collinearity Diagnosis

The results in Table 3 show that collinearity among the evaluation factors is weak and that the factors are largely independent, so they can be used for the GHR evaluation.

3.2. Results of Correlation Analysis

The joint probability distribution of the prediction results of each pair of machine learning models is calculated, and the mutual information is then computed from this joint distribution and the marginal probability distributions of the two models. Pairing the eight models produces 28 results. Among them, RF and XGBoost have the highest correlation, 0.869. This result guides the construction of the RF + XGBoost coupling model.

3.3. Model Parameter Optimization Results

Different models require different hyperparameters to be specified when defining a parameter grid. For example, the RF algorithm requires the number of decision trees and the maximum number of features, while the ANN requires the number of layers and the number of nodes in each layer. Next, cross-validation is configured for model training, and the GridSearchCV class is used to combine the model, the parameter grid and the validation method. During the grid search, each parameter combination is used to create a model instance, and cross-validation evaluates the performance of each instance. After the grid search is completed, the best_params_ and best_score_ attributes give the optimal parameter combination and its score. Once the parameter grid is defined, the search for the optimal parameters by the GSA is the same repeated process for every model. Therefore, this paper shows the GSA results only for the RF model (Figure 6).

3.4. Feature Importance Evaluation Based on RF Model

The 144 disaster points extracted by the Swin Transformer were assigned to the first class and labeled ‘1’. To avoid labeling potential disaster points as non-disaster points, and thus to protect the quality of the dataset, we generated the same number of non-disaster points in low-risk areas. These 144 non-disaster points, randomly generated at low-risk locations, were assigned to the second class and labeled ‘0’. The values of the evaluation factors at each point are used to create the model’s training and test samples: each sample point comprises the attribute value of every evaluation factor and a class label. The training dataset consists of 202 sample points (70% of all disaster and non-disaster points), while the remaining 86 sample points (30%) form the test dataset.
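The sample counts follow from a 70/30 split of the 288 labeled points. The sketch below reproduces the arithmetic with a simple random shuffle; the paper does not say whether the split was stratified by class or which random seed was used, so the function name, the seed and the use of plain shuffling are assumptions.

```python
import random


def split_samples(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples and split them into training and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder samples: (point id, label), with 144 points per class.
samples = [(i, 1) for i in range(144)] + [(i, 0) for i in range(144, 288)]
train, test = split_samples(samples)
```

Here 0.7 × 288 = 201.6, which rounds to 202 training points and leaves 86 test points.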

A significant advantage of the RF model is that it can provide information on which features have the greatest impact on the prediction results. In Figure 7, lithology, river, roads and rainfall rank top, indicating that they have an important contribution to the occurrence of disasters and should be focused on in disaster management. Compared with other factors, aspect has a weaker impact on disaster risk. However, rainfall is the main external cause of disasters, accounting for a large proportion of all disaster-causing factors. Decision makers should pay attention to the investigation of hidden dangers of disasters in flood season.

3.5. SGHRZM

The ArcGIS grid calculator tool is used to superimpose the weights of each indicator that each algorithm produces in order to create the comprehensive weight. To determine the risk of disasters, we used a reclassification tool to divide the comprehensive weights into four categories: low-risk area (Low), medium-risk area (Medium), high-risk area (High), and extremely high-risk area (Extremely high). Figure 8 shows the SGHRZM of a single machine learning model. Figure 9 shows the SGHRZM of the coupled machine learning model.

All the SGHRZMs reveal that the ‘High’ and ‘Extremely high’ areas of Fuyuan County are distributed in the central, eastern, southern, and southeastern parts and are widespread throughout the study area. ‘Low’ areas are concentrated in the north and northwest, but this does not mean that the conditions for disasters are absent there. The results of LR, BN, SVM, and ADBoost are similar, and we refer to these models as type I; the results of RF, ANN, DT and XGBoost are similar, and we refer to these models as type II. The difference is that the ‘Low’ area identified by the type II models is larger than that identified by the type I models. According to the SGHRZM results, there are relatively few samples in the ‘Low’ region. The type II models, namely the tree-based RF, DT and XGBoost and the network-based ANN, can identify these minority classes through their majority voting mechanisms and multiple network nodes, thereby capturing more feature information.

3.6. Validation of Research Results

This study offers a clear and impartial depiction of the correctness of the evaluation results by randomly selecting and counting 30% of the disaster and non-disaster sites as test sample points in each risk category zone. Using SPSS, the model’s ROC curve was created.

As seen in Figure 10, RF had the highest accuracy among the single models, with AUC = 0.909. XGBoost ranked second, with AUC = 0.892; ADBoost and DT ranked third and fourth, with AUC values of 0.871 and 0.863, respectively. The last two were BN and LR (AUC = 0.809 and AUC = 0.799). The coupled model had the highest accuracy overall, with AUC = 0.923, which means that its prediction results are the most reliable of the nine models. The AUC value of every model is well above 0.5 and close to 1, which supports the reliability of the results. The evaluation results in Figure 10 and Figure 11 can serve as a reference for researchers selecting high-precision models.

The results in Table 4 show that the classification accuracy of the coupled model is the highest. Among the single models, the RF model has the highest classification accuracy, only 3% lower than that of the coupled model. This shows that the coupling model and the RF model classify disaster points correctly with high confidence. Sorted from high to low by classification accuracy, the models are RF + XGBoost, RF, XGBoost, SVM, ADBoost, ANN, BN, DT and LR.

3.7. Model Adaptability Verification

To verify the adaptability of the XGBoost + RF model and the RF model, we extract the disaster points in the ‘Extremely High’ and ‘High’ regions and count their proportion in the disaster point list.

In Figure 12 and Figure 13, 66% and 55% of the disaster points fall in the ‘Extremely High’ and ‘High’ regions, respectively. This means that the model can reasonably predict the spatial distribution of disaster risk based on input data and algorithm logic and accurately identify most of the areas with higher risk. This is one of the key factors for the successful promotion and application of the model in practical applications.
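The adaptability check reduces to counting how many inventory points fall in the two highest risk classes. A minimal sketch follows, assuming each disaster point has already been assigned the risk class of the zone it falls in (for example, by a spatial join in GIS software); the function name and the toy labels are hypothetical.

```python
from collections import Counter

HIGH_ZONES = ("High", "Extremely high")


def share_in_high_zones(zone_of_each_point, high_zones=HIGH_ZONES):
    """Fraction of disaster points located in the given high-risk classes."""
    if not zone_of_each_point:
        raise ValueError("the disaster point list is empty")
    counts = Counter(zone_of_each_point)
    return sum(counts[zone] for zone in high_zones) / len(zone_of_each_point)


# Toy labels for ten points (not the study's inventory).
toy_zones = ["Extremely high"] * 5 + ["High"] * 3 + ["Medium"] + ["Low"]
share = share_in_high_zones(toy_zones)
```

In the toy list, eight of the ten points lie in the two highest classes, giving a share of 0.8.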

4. Discussion

High-resolution optical remote-sensing images of the study area were obtained. These images usually have clear surface textures and details, including detailed information on different features, such as spectral features, shape features, and texture features, which help identify surface changes caused by disasters, such as landslides, ground collapses, and debris flows. The Swin transformer is an advanced deep learning model that is particularly good at capturing global and local features in images. The pretrained Swin transformer algorithm is used to extract shape features from remote sensing images in the study area. Finally, the extracted disaster points are verified by combining UAV photos, GPS positioning data, and field exploration data. Many studies have focused on the application of machine learning in the field of geological hazards, including a susceptibility assessment and a risk assessment [68,69,70,71,72]. However, GHR evaluations combining remote sensing data and machine learning models are rare. Notably, this study verified the adaptability of SGHRZM and revealed that it can better reflect the spatial distribution of historical disaster points in a study area. We extracted important hidden disaster dangers in the ‘Extremely High’ and ‘High’ regions, fully demonstrating the effectiveness of this technology in practical applications. The typical disaster-related danger points in Fuyuan County are shown in Figure 14.

The tree-based algorithms (RF and XGBoost) strongly support the idea that lithology plays a role in geological hazards [73,74,75,76,77,78]. RF evaluates feature importance on out-of-bag data, whereas XGBoost uses information gain as its feature importance index. Although the two algorithms calculate feature weights differently, both identify lithology as the index with the greatest contribution, which means that lithology strongly influences node splitting in both algorithms. The weights of the road factor differ considerably between XGBoost and RF. This is because RF draws bootstrap samples (random sampling with replacement) and evaluates only a random subset of the features, rather than all of them, when selecting the optimal split feature and threshold. The road factor is highly correlated with the risk of geological hazards in the dataset, but its weight is underestimated in the RF model because of this random sampling and feature selection. In contrast, XGBoost considers all features when selecting split features and thresholds and uses gradient boosting and regularization to optimize the model, so it assesses each feature’s contribution to the prediction using all available data. The road factor is selected as the splitting feature in multiple iterations, and its weight increases accordingly. To make these inferences easier to use, the differences and similarities between the RF and XGBoost models are summarized in Table 5.

Research on improving model accuracy covers many aspects, including network structure optimization, training parameter optimization, and ensemble learning methods, all of which can effectively improve accuracy [79,80,81,82]. Hyperparameter optimization can effectively enhance a model’s capacity for generalization. Hyperparameters such as the number of decision trees, the number of layers, and the number of neural network nodes directly govern the model’s structure and training procedure. The combination of hyperparameters found by the GSA can improve the model’s performance on unseen data.

The powerful data processing and pattern recognition capabilities of machine learning models provide strong support for the prediction and assessment of geological hazards, and these models are increasingly used in this field. However, owing to the complexity and diversity of geological hazards, as well as the performance differences between models in specific environments, the best machine learning model remains controversial. In this study, eight widely used machine learning models were applied: RF, XGBoost, ADBoost, SVM, ANN, LR, BN, and DT. RF stands out among them because it has the highest prediction accuracy and classification accuracy. XGBoost and ADBoost are also strongly competitive. The ANN and BN perform almost identically, and neither shows a clear advantage over the other. LR does not perform well in this study because it assumes a linear relationship between the independent variables and the dependent variable, whereas the occurrence of geological hazards often involves multiple complex factors whose relationships may be nonlinear. Given the debate on the best machine learning model, the current research trend is to combine multiple models and methods through ensemble learning or hybrid modeling. By integrating the prediction results of RF and XGBoost, this study makes full use of the advantages of different models to improve the accuracy and reliability of geological hazard prediction.

Landslides are the main type of disaster in Fuyuan County, especially in the central, eastern, and southwestern regions. In addition, Permian mudstone, siltstone, silty mudstone, breccia, argillaceous limestone, argillaceous shale, and basalt strata are widely distributed in high-risk areas, which provides favorable internal conditions for the occurrence of disasters. Specifically, mudstone has low shear strength and softens easily under natural precipitation, forming a weak surface; limestone is relatively hard, but its structural surfaces can evolve into landslide sliding surfaces, especially under the action of water, when its shear strength decreases greatly. Breccia is composed of broken rock fragments, has poor stability, and is easily damaged by external forces; fine sandstone is fine-grained and loosely structured, so shear failure occurs easily under rainwater erosion and gravity. Argillaceous shale is weak, has a low bearing capacity, and easily expands and softens in water, which significantly reduces its strength and in turn increases the risk of a landslide. Sandstone is usually relatively hard, whereas mudstone is relatively weak. Because their weathering rates differ, interbedded soft and hard strata weather differentially, leaving the upper sandstone overhanging until it collapses under gravity. In the eastern part of the study area, the mountains are high, the valleys are deep, the terrain is undulating, and precipitation is abundant. This terrain favors the collection of rainwater and its rapid infiltration into rock and soil, which increases the weight and sliding force of the rock and soil and provides sufficient potential energy for the formation of debris flows.

5. Conclusions

There is a close relationship between the spatial distributions of different disasters and the spatial distribution characteristics of geological landforms. Landslide disasters often occur in the central and southern parts of the study area, which coincides with the distribution of Permian and Quaternary strata. Areas where siltstone, fine sandstone, shale, coal seams, mudstone, and basalt are widely distributed should be given special attention. In addition, due to the uneven weathering thickness of soluble limestone and dolomite, karst pipelines are more developed. Coupled with a large number of coal-bearing strata, they are easy to soften in water, and collapse disasters are very prominent in the central and northern regions of Fuyuan County. On the other hand, debris flows have developed mainly in steep terrains, V-shaped valleys, and mudstone and silty mudstone areas with poor water quality. The faults in the whole study area are relatively developed, which results in the destruction of the rock mass integrity of these strata under the influence of the fault zone, and dissolution and erosion are more significant. Moreover, the joint surface or layer formed by the development of synclines and anticlines often continues to develop into the final sliding surface, which reduces the stability of the slope.

This study explores the adaptability of eight models in a typical mountainous town in Yunnan Province. RF performs best among the single models, followed by XGBoost and ADBoost. In addition, we couple the two models with the highest correlation into an integrated model, and the results are striking: the integrated model has the highest accuracy and produces an SGHRZM highly similar to that of the RF model.

Combining remote sensing data with machine learning models can make full use of the real-time characteristics of remote sensing data and the ability of machine learning models to quickly process and analyze data, improve the accuracy of GHR evaluations, and provide strong support for disaster prevention and mitigation. In addition, it helps promote interdisciplinary and technological innovation and the scientific and intelligent prevention and control of geological disasters. A limitation of this study is that the performance of different satellites and sensors varies and is affected by factors such as atmospheric conditions and surface coverage, which may lead to noise, insufficient resolution or geometric correction errors in the remote sensing data.

For different regions and towns, especially towns with regional characteristics, pre-disaster prediction and post-disaster reconstruction measures should be unique. In addition, future research should focus on exploring the optimal model for a specific region and continue to strengthen the research and application of remote sensing data processing technology to address the challenges posed by geological hazards.

Author Contributions

S.Z., conceptualization, formal analysis, methodology, software, visualization, writing—original draft; S.T., funding acquisition, investigation, project administration, supervision, writing—review and editing; S.T., resources, supervision, validation, writing—review and editing; Y.S., data curation, software, writing—review and editing; D.D., formal analysis, software, writing—review and editing; W.Y., data curation, methodology, visualization, writing—review and editing; S.Z. and Y.S., data curation, methodology, formal analysis, writing—review and editing, grammar checking; D.D. and W.Y., data curation, funding acquisition, review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Yunnan Province Education Department’s Science and Technology Innovation Team Program (Grant No. CY22624109) and the Yunnan Key Research and Development Plan Program (Grant No. 202303AP140020).

Data Availability Statement

The data that support the findings of this study are available from the first author upon reasonable request.

Acknowledgments

Thanks to Southwest Nonferrous Kunming Exploration Surveying and Designing (Institute) Co., Ltd., for providing technical assistance.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Carrara, A.; Crosta, G.; Frattini, P. Comparing models of debris-flow susceptibility in the alpine environment. Geomorphology 2008, 94, 353–378.
Bregoli, F.; Medina, V.; Chevalier, G.; Hürlimann, M.; Bateman, A. Debris-flow susceptibility assessment at regional scale: Validation on an alpine environment. Landslides 2015, 12, 437–454.
Torizin, J. Elimination of informational redundancy in the weight of evidence method: An application to landslide susceptibility assessment. Stoch. Environ. Res. Risk Assess. 2016, 30, 635–651.
Huang, F.M.; Yin, K.L.; Huang, J.S.; Gui, L.; Wang, P. Landslide susceptibility mapping based on self-organizing-map network and extreme learning machine. Eng. Geol. 2017, 223, 11–22.
Shu, H.; Hürlimann, M.; Molowny-Horas, R.; González, M.; Pinyol, J.; Abancó, C.; Ma, J. Relation between land cover and landslide susceptibility in Val d’Aran, Pyrenees (Spain): Historical aspects, present situation and forward prediction. Sci. Total Environ. 2019, 693, 133557.
Reichenbach, P.; Rossi, M.; Malamud, B.D.; Mihir, M.; Guzzetti, F. A review of statistically-based landslide susceptibility models. Earth Sci. Rev. 2018, 180, 60–91.
Palacio Cordoba, J.; Mergili, M.; Aristizábal, E. Probabilistic landslide susceptibility analysis in tropical mountainous terrain using the physically based r.slope.stability model. Nat. Hazards Earth Syst. Sci. 2020, 20, 815–829.
Shang, H.; Ni, W.K.; Cheng, H. Application of slope unit division to risk zoning of geological hazards of Pengyang County. Soil Water Conserv. China 2011, 3, 48–50.
Ozdemir, A.; Altural, T. A comparative study of frequency ratio, weights of evidence and logistic regression methods for landslide susceptibility mapping: Sultan Mountains, SW Turkey. J. Asian Earth Sci. 2013, 64, 180–197.
Kainthura, P.; Sharma, N. Machine learning driven landslide susceptibility prediction for the Uttarkashi region of Uttarakhand in India. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2022, 16, 570–583.
Youssef, A.M.; Pourghasemi, H.R.; Pourtaghi, Z.S.; Al-Katheeri, M.M. Landslide susceptibility mapping using random forest, boosted regression tree, classification and regression tree, and general linear models and comparison of their performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia. Landslides 2016, 13, 839–856.
Goetz, J.; Brenning, A.; Petschko, H.; Leopold, P. Evaluating machine learning and statistical prediction techniques for landslide susceptibility modelling. Comput. Geosci. 2015, 81, 1–11.
Huang, F.M.; Cao, Z.S.; Guo, J.F.; Jiang, S.H.; Li, S.; Guo, Z.Z. Comparisons of heuristic, general statistical and machine learning models for landslide susceptibility prediction and mapping. Catena 2020, 191, 104580.
Moayedi, H.; Dehrashid, A.A. A new combined approach of neural-metaheuristic algorithms for predicting and appraisal of landslide susceptibility mapping. Environ. Sci. Pollut. Res. 2023, 30, 82964–82989.
Zhang, Y.B.; Xu, P.Y.; Liu, J.; He, J.X.; Yang, H.T.; Zeng, Y.; He, Y.Y.; Yang, C.F. Comparison of LR, 5-CV SVM, GA SVM, and PSO SVM for landslide susceptibility assessment in Tibetan Plateau area, China. J. Mt. Sci. 2023, 20, 979–995.
Kaur, R.; Gupta, V.; Chaudhary, B.S. Landslide susceptibility mapping and sensitivity analysis using various machine learning models: A case study of Beas valley, Indian Himalaya. Bull. Eng. Geol. Environ. 2024, 83, 228.
Ali, N.; Chen, J.; Fu, X.; Ali, R.; Hussain, M.A.; Daud, H.; Hussain, J.; Altalbe, A. Integrating Machine Learning Ensembles for Landslide Susceptibility Mapping in Northern Pakistan. Remote Sens. 2024, 16, 988.
Tang, C.; Ma, G.C. Small regional geohazards susceptibility mapping based on geomorphic unit. Sci. Geogr. Sin. 2015, 35, 92–94.
Zhang, Y.S.; Guo, C.B.; Yao, X.; Yang, Z.H.; Wu, R.A.; Du, G.L. Research on the geohazard effect of active fault on the eastern margin of the Tibetan Plateau. Acta Geosci. Sin. 2016, 37, 277–286.
Hu, R.L.; Fan, L.F.; Wang, S.S.; Wang, L.C.; Wang, X.L. Theory and method for landslide risk assessment-current status and future development. J. Eng. Geol. 2013, 21, 76–84.
Zhao, X.Y.; Tan, S.C.; Li, Y.P. Risk assessment of geological hazards in Dongchuan District based on the methods of slope unit and combination weighting. J. Yunnan Univ. Nat. Sci. Ed. 2021, 43, 299–300.
Zou, F.C.; Leng, Y.Y.; Tao, X.L.; He, S.B. Landslide hazard identification based on slope unit: A case study of shallow soil slope in Wanshan, Guizhou Province. Chin. J. Geol. Hazard Control. 2022, 33, 114–122.
Chen, X.; Zhao, C.; Xi, J.; Lu, Z.; Ji, S.; Chen, L. Deep Learning Method of Landslide Inventory Map with Imbalanced Samples in Optical Remote Sensing. Remote Sens. 2022, 14, 5517.
Hussain, M.A.; Chen, Z.; Zheng, Y.; Zhou, Y.; Daud, H. Deep Learning and Machine Learning Models for Landslide Susceptibility Mapping with Remote Sensing Data. Remote Sens. 2023, 15, 4703.
Gao, Y. An analysis of disaster types and dynamics of landslides in the southwest karst mountain areas. Hydrogeol. Eng. Geol. 2020, 47, 14–23.
Varnes, D.J. Landslide Types and Processes. Highw. Res. Board Spec. Rep. 1958, 24, 20–47.
Varnes, D.J. Slope Movement Types and Processes. Transp. Res. Board Spec. Rep. 1978, 176, 11–33.
Zhang, H.; Yin, C.; Wang, S.; Guo, B. Landslide susceptibility mapping based on landslide classification and improved convolutional neural networks. Nat. Hazards 2023, 116, 1931–1971.
Cruden, D.M.; Varnes, D.J. Landslide Types and Processes, Special Report, Transportation Research Board, National Academy of Sciences. Spec. Rep. - Natl. Res. Counc. Transp. Res. Board 1996, 247, 76.
Dai, C.; Li, W.L.; Lu, H.Y.; Zhang, S. Landslide hazard assessment method considering the deformation factor: A case study of Zhouqu, Gansu province, northwest China. Remote Sens. 2023, 15, 596.
Luino, F.; Barriendos, M.; Gizzi, F.T.; Glaser, R.; Gruetzner, C.; Palmieri, W.; Porfido, S.; Sangster, H.; Turconi, L. Historical Data for Natural Hazard Risk Mitigation and Land Use Planning. Land 2023, 12, 1777.
Lin, J.H.; Chen, W.H.; Qi, X.H.; Hou, H.R. Risk assessment and its influencing factors analysis of geological hazards in typical mountain environment. J. Clean. Prod. 2021, 309, 127077.
Ji, J.; Zhou, Y.; Cheng, Q.; Jiang, S.; Liu, S. Landslide Susceptibility Mapping Based on Deep Learning Algorithms Using Information Value Analysis Optimization. Land 2023, 12, 1125.
Zhou, X.; Wen, H.; Zhang, Y.; Xu, J.; Zhang, W. Landslide susceptibility mapping using hybrid random forest with GeoDetector and RFE for factor optimization. Geosci. Front. 2021, 12, 101211.
Ullah, I.; Aslam, B.; Shah, S.H.I.A.; Tariq, A.; Qin, S.; Majeed, M.; Havenith, H.-B. An Integrated Approach of Machine Learning, Remote Sensing, and GIS Data for the Landslide Susceptibility Mapping. Land 2022, 11, 1265.
Kahal, A.Y.; Abdelrahman, K.; Alfaifi, H.J.; Yahya, M.M.A. Landslide hazard assessment of the Neom promising city, northwestern Saudi Arabia: An integrated approach. J. King Saud Univ. Sci. 2021, 33, 101279.
Ullah, K.; Wang, Y.; Fang, Z.; Wang, L.Z.; Rahman, M. Multi-hazard susceptibility mapping based on Convolutional Neural Networks. Geosci. Front. 2022, 13, 101425.
Qin, Z.; Zhou, X.; Li, M.; Tong, Y.; Luo, H. Landslide Susceptibility Mapping Based on Resampling Method and FR-CNN: A Case Study of Changdu. Land 2023, 12, 1213.
Kanwal, S.; Atif, S.; Shafiq, M. GIS based landslide susceptibility mapping of northern areas of Pakistan, a case study of Shigar and Shyok Basins. Geomat. Nat. Hazards Risk 2017, 8, 348–366.
Yang, F.; Men, X.; Liu, Y.; Mao, H.; Wang, Y.; Wang, L.; Zhou, X.; Niu, C.; Xie, X. Estimation of Landslide and Mudslide Susceptibility with Multi-Modal Remote Sensing Data and Semantics: The Case of Yunnan Mountain Area. Land 2023, 12, 1949.
Makonyo, M.; Zahor, Z. GIS-based analysis of landslides susceptibility mapping: A case study of Lushoto district, north-eastern Tanzania. Nat. Hazards 2023, 118, 1085–1115.
Guo, H.; Martínez-Graña, A.M. Susceptibility of Landslide Debris Flow in Yanghe Township Based on Multi-Source Remote Sensing Information Extraction Technology (Sichuan, China). Land 2024, 13, 206.
Lee, M.J.; Park, I.; Won, J.S.; Lee, S. Landslide hazard mapping considering rainfall probability in Inje, Korea. Geomat. Nat. Hazards Risk 2016, 7, 424–446.
Bentivenga, M.; Gizzi, F.T.; Palladino, G.; Piccarreta, M.; Potenza, M.R.; Perrone, A.; Bellanova, J.; Calamita, G.; Piscitelli, S. Multisource and Multilevel Investigations on a Historical Landslide: The 1907 Servigliano Earth Flow in Montemurro (Basilicata, Southern Italy). Land 2022, 11, 408.
Bi, J.a.; Xu, P.H.; Song, S.Y.; Ding, R. Assessment of the susceptibility to geological hazards in the manas river basin based on the coupled information value-logistic regression model. J. Eng. Geol. 2022, 30, 1549–1560.
Zhou, P.; Deng, H.; Zhang, W.J.; Xue, D.; Wu, X.; Zhuo, W. Landslide susceptibility evaluation based on information value model and machine learning method: A case study of lixian county, sichuan province. Sci. Geogr. Sin. 2022, 42, 1665–1675.
Gao, H.X. Some method on treating the collinearity of independent variables in multiple linear regression. Appl. Stat. Manag. 2000, 20, 49–55.
Fu, S.L.; Liang, L.P.; Liu, Y.G. Assessment on geohazard susceptibility in Xinlong section of Yalong River based on CF-Logistic model. Res. Soil Water Conserv. 2021, 28, 404–410.
Dou, H.Q.; Huang, S.Y.; Jian, W.B.; Wang, H. Landslide susceptibility mapping of mountain roads based on machine learning combined model. J. Mt. Sci. 2023, 20, 1232–1248.
Dong, J.H.; Niu, R.Q.; Chen, T.; Dong, L.Y. Assessing landslide susceptibility using improved machine learning methods and considering spatial heterogeneity for the Three Gorges Reservoir Area, China. Nat. Hazards 2024, 120, 1113–1140.
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
Sun, D.L.; Chen, D.L.; Mi, C.L.; Chen, X.Y.; Mi, S.W.; Li, X.Q. Evaluation of landslide susceptibility in the gentle hill-valley areas based on the interpretable random forest-recursive feature elimination model. J. Geomech. 2023, 29, 202–219.
Lin, R.F.; Liu, J.P.; Xu, S.H.; Liu, M.M.; Zhang, M.; Liang, E.J. Evaluation method of landslide susceptibility based on random forest weighted information. Sci. Surv. Mapp. 2020, 45, 131–138.
Zhang, S.H.; Wu, G. Debris flow susceptibility and its reliability based on random forest and GIS. Earth Sci. 2019, 44, 3115–3134.
Kumar, C.; Walton, G.; Santi, P.; Luza, C. An Ensemble Approach of Feature Selection and Machine Learning Models for Regional Landslide Susceptibility Mapping in the Arid Mountainous Terrain of Southern Peru. Remote Sens. 2023, 15, 1376.
Jiang, Z.; Wang, M.; Liu, K. Comparisons of Convolutional Neural Network and Other Machine Learning Methods in Landslide Susceptibility Assessment: A Case Study in Pingwu. Remote Sens. 2023, 15, 798.
Jennifer, J.J. Feature elimination and comparison of machine learning algorithms in landslide susceptibility mapping. Environ. Earth Sci. 2022, 81, 489.
Liang, W.Z.; Luo, S.Z.; Zhao, G.Y.; Wu, H. Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms. Mathematics 2020, 8, 765.
Hussain, M.A.; Chen, Z.; Kalsoom, I.; Asghar, A.; Shoaib, M. Landslide susceptibility mapping using machine learning algorithm: A case study along Karakoram Highway (KKH), Pakistan. J. Indian Soc. Remote Sens. 2022, 50, 849–866.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13 August 2016. [Google Scholar]
Zhang, R.; Zhang, L.; Fang, Z.; Oguchi, T.; Merghadi, A.; Fu, Z.; Dong, A.; Dou, J. Interferometric Synthetic Aperture Radar (InSAR)-Based Absence Sampling for Machine-Learning-Based Landslide Susceptibility Mapping: The Three Gorges Reservoir Area, China. Remote Sens. 2024, 16, 2394. [Google Scholar] [CrossRef]
Shahabi, H.; Ahmadi, R.; Alizadeh, M.; Hashim, M.; Al-Ansari, N.; Shirzadi, A.; Wolf, I.D.; Ariffin, E.H. Landslide Susceptibility Mapping in a Mountainous Area Using Machine Learning Algorithms. Remote Sens. 2023, 15, 3112. [Google Scholar] [CrossRef]
Ma, J.W.; Wang, Y.K.; Niu, X.X.; Jiang, S.; Liu, Z.Y. A comparative study of mutual information-based input variable selection strategies for the displacement prediction of seepage-driven landslides using optimized support vector regression. Stoch. Environ. Res. Risk Assess. 2022, 36, 3109–3129. [Google Scholar] [CrossRef]
Ha, H.; Bui, Q.D.; Tran, D.T.; Nguyen, D.Q.; Bui, H.X.; Luu, C. Improving the forecast performance of landslide susceptibility mapping by using ensemble gradient boosting algorithms. Environ. Dev. Sustain. 2024, 26, 1–35. [Google Scholar] [CrossRef]
Akgun, A. A comparison of landslide susceptibility maps produced by logistic regression, multi-criteria decision, and likelihood ratio methods: A case study at İzmir, Turkey. Landslides 2012, 9, 93–106. [Google Scholar] [CrossRef]
Guo, Z.Z.; Yin, K.L.; Huang, F.M.; Fu, S.; Zhang, W. Evaluation of landslide susceptibility based on landslide classification and weighted frequency ratio model. Chin. J. Rock Mech. Eng. 2019, 38, 287–300. [Google Scholar] [CrossRef]
Roy, J.; Saha, S. Landslide susceptibility mapping using knowledge driven statistical models in Darjeeling District, West Bengal, India. Geoenviron. Disasters 2019, 6, 11. [Google Scholar] [CrossRef]
Xing, Y.; Chen, Y.; Huang, S.; Xie, W.; Wang, P.; Xiang, Y. Research on the Uncertainty of Landslide Susceptibility Prediction Using Various Data-Driven Models and Attribute Interval Division. Remote Sens. 2023, 15, 2149. [Google Scholar] [CrossRef]
Shang, H.; Su, L.; Chen, W.; Tsangaratos, P.; Ilia, I.; Liu, S.; Cui, S.; Duan, Z. Spatial Prediction of Landslide Susceptibility Using Logistic Regression (LR), Functional Trees (FTs), and Random Subspace Functional Trees (RSFTs) for Pengyang County, China. Remote Sens. 2023, 15, 4952. [Google Scholar] [CrossRef]
Chen, C.; Shen, Z.; Weng, Y.; You, S.; Lin, J.; Li, S.; Wang, K. Modeling Landslide Susceptibility in Forest-Covered Areas in Lin’an, China, Using Logistical Regression, a Decision Tree, and Random Forests. Remote Sens. 2023, 15, 4378. [Google Scholar] [CrossRef]
Mosaffaie, J.; Salehpour Jam, A.; Sarfaraz, F. Landslide risk assessment based on susceptibility and vulnerability. Environ. Dev. Sustain. 2024, 26, 9285–9303. [Google Scholar] [CrossRef]
Ge, Y.; Liu, G.; Tang, H.; Zhao, B.; Xiong, C. Comparative analysis of five convolutional neural networks for landslide susceptibility assessment. Bull. Eng. Geol. Environ. 2023, 82, 377. [Google Scholar] [CrossRef]
Bravo-López, E.; Fernández Del Castillo, T.; Sellers, C.; Delgado-García, J. Analysis of Conditioning Factors in Cuenca, Ecuador, for Landslide Susceptibility Maps Generation Employing Machine Learning Methods. Land 2023, 12, 1135. [Google Scholar] [CrossRef]
Dornik, A.; Drăguţ, L.; Oguchi, T.; Hayakawa, Y.; Micu, M. Influence of sampling design on landslide susceptibility modeling in lithologically heterogeneous areas. Sci. Rep. 2022, 12, 2106. [Google Scholar] [CrossRef]
Park, S.-J.; Lee, D.-K. Predicting susceptibility to landslides under climate change impacts in metropolitan areas of South Korea using machine learning. Geomat. Nat. Hazards Risk 2021, 12, 2462–2476. [Google Scholar] [CrossRef]
Shi, N.; Li, Y.; Wen, L.; Zhang, Y. Rapid prediction of landslide dam stability considering the missing data using XGBoost algorithm. Landslides 2022, 19, 2951–2963. [Google Scholar] [CrossRef]
Zhang, S.; Wang, Y.; Wu, G. Earthquake-Induced Landslide Susceptibility Assessment Using a Novel Model Based on Gradient Boosting Machine Learning and Class Balancing Methods. Remote Sens. 2022, 14, 5945. [Google Scholar] [CrossRef]
Zhang, T.; Fu, Q.; Li, C.; Liu, F.; Wang, H.; Han, L.; Quevedo, R.P.; Chen, T.; Lei, N. Modeling landslide susceptibility using data mining techniques of kernel logistic regression, fuzzy unordered rule induction algorithm, SysFor and random forest. Nat. Hazards 2022, 114, 3327–3358. [Google Scholar] [CrossRef]
Li, L.M.; Cheng, S.K.; Wen, Z.Z. Landslide prediction based on improved principal component analysis and mixed kernel function least squares support vector regression model. J. Mt. Sci. 2021, 18, 2130–2142. [Google Scholar] [CrossRef]
Adnan Ikram, R.M.; Khan, I.; Moayedi, H.; Ahmadi Dehrashid, A.; Elkhrachy, I.; Nguyen Le, B. Novel evolutionary-optimized neural network for predicting landslide susceptibility. Environ. Dev. Sustain. 2023, 26, 17687–17719. [Google Scholar] [CrossRef]
Chen, Z.L.; Quan, H.C.; Jin, R.; Jin, A.F.; Lin, Z.H.; Jin, G.R.; Jin, G.Z. Assessment of landslide susceptibility using the PCA and ANFIS with various metaheuristic algorithms. KSCE J. Civ. Eng. 2024, 28, 1461–1474. [Google Scholar] [CrossRef]
Mallick, J.; Alkahtani, M.; Hang, H.T.; Singh, C.K. Game-theoretic optimization of landslide susceptibility mapping: A comparative study between Bayesian-optimized basic neural network and new generation neural network models. Environ. Sci. Pollut. Res. 2024, 31, 29811–29835. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Location of study area.

Figure 2. Geological map of study area.

Figure 3. Swin transformer extraction results.

Figure 4. The DCFs selected in this study: (a) elevation, (b) slope, (c) aspect, (d) topographic relief, (e) lithology, (f) fault, (g) NDVI, (h) rainfall, (i) river, (j) road.

Figure 5. Flow chart of experiment.

Figure 6. RF parameter optimization results.

Figure 7. Heat map of feature importance from the RF and XGBoost models.

Figure 8. SGHRZM of the single model.

Figure 9. SGHRZM of the coupled model.

Figure 10. ROC curve of single model.

Figure 11. ROC curve of coupled model.

Figure 12. Distribution of disaster points in the high-risk area of the coupled model.

Figure 13. Distribution of disaster points in the high-risk area of the single model.

Figure 14. Distribution of typical hidden hazard sites in Fuyuan County.

Table 1. List of disaster points.

Disaster Type	Category	Classification Criteria and Characteristics	Quantity
Landslides	Soil landslide	The sliding body is mainly composed of Quaternary soil and residual slope deposits.	94
	Rock landslide	The sliding body is composed of rock mass and is controlled by weak surfaces and joint fissures.	10
	Thrust-type landslide	The upper rock and soil mass is unstable and pushes the lower rock mass to slide.	37
	Retrogressive landslide	Weakened support at the front edge of the landslide causes the lower block to slide first, which then drags the upper rock and soil mass with it.	67
	Consequent landslide	The strike of the sliding surface is consistent with the dip of the rock stratum.	88
	Reverse landslide	The strike of the sliding surface is opposite to the dip of the rock stratum.	16
Collapses	Soil collapse	The collapse body is mainly a Quaternary loose accumulation layer.	24
	Rock collapse	The collapse body is mainly Triassic siltstone.	8
	Toppling-type collapse	The rock mass topples toward the free face, as a whole or in part, under the action of gravity.	6
	Sliding collapse	The rock mass shears and slides along the slip surface under the action of gravity.	26
Debris flows	Gully debris flow	The gully is long and the confluence area is large, with distinct formation, circulation and accumulation areas.	6
	Slope debris flow	The valley essentially follows the slope, and there is no obvious circulation or accumulation area.	2
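
The quantities in Table 1 add up to more than the number of mapped points because each hazard type appears to be classified in more than one way. The following sketch, written under that assumption, groups the rows into classification schemes and checks that every scheme for a given hazard type gives the same total. The scheme labels ("by material", "by mechanism" and so on) and the function name are hypothetical and are not taken from the article.

```python
# Quantities transcribed from Table 1. The grouping into "schemes" is an
# assumption made for this check; the table does not state it explicitly.
TABLE_1 = {
    "Landslides": {
        "by material": {"Soil landslide": 94, "Rock landslide": 10},
        "by mechanism": {"Thrust-type landslide": 37, "Retrogressive landslide": 67},
        "by attitude": {"Consequent landslide": 88, "Reverse landslide": 16},
    },
    "Collapses": {
        "by material": {"Soil collapse": 24, "Rock collapse": 8},
        "by mechanism": {"Toppling-type collapse": 6, "Sliding collapse": 26},
    },
    "Debris flows": {
        "by setting": {"Gully debris flow": 6, "Slope debris flow": 2},
    },
}


def hazard_totals(table):
    """Return {hazard: total}, requiring every scheme to give the same total."""
    totals = {}
    for hazard, schemes in table.items():
        sums = {name: sum(counts.values()) for name, counts in schemes.items()}
        distinct = set(sums.values())
        if len(distinct) != 1:
            raise ValueError(f"{hazard}: schemes disagree: {sums}")
        totals[hazard] = distinct.pop()
    return totals


totals = hazard_totals(TABLE_1)
print(totals)                                   # per-hazard totals
print("all disaster points:", sum(totals.values()))
```

Under this reading the table contains 104 landslides, 32 collapses and 8 debris flows, or 144 disaster points in total.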

Table 2. Description of lithology category.

Category	Rock Group	Main Lithology Description
A	Cohesive soil, breccia, sand and silt multi-layer soil group	Gravel, breccia, sand, clay and silt from alluvial deposits, avalanche deposits, residual deposits, slope deposits and cave deposits
B	Thin to medium-thick layered soft mudstone, siltstone, conglomerate, coal seam rock group	Mudstone, siltstone, glutenite, coal seam
C	Thin to medium-thick layered hard shale, mudstone, siltstone, slate rock group	Siltstone, shale or silty mudstone
D	Harder limestone, argillaceous limestone, argillaceous dolomite rock group	Limestone, argillaceous limestone, argillaceous dolomite
E	Medium to thick layered hard limestone, dolomite, dolomitic limestone rock group	Limestone, dolomite, dolomitic limestone
F	Amygdaloidal hard basalt rock group	Dark green basalt, basal iron-aluminum rock and tuff

Table 3. Evaluation factor collinearity diagnostic table.

Factor	Tolerance (T)	Variance Inflation Factor (VIF)	Significance (Sig.)
Elevation	0.831	1.204	0.002
Slope	0.768	1.961	0.000
Slope aspect	0.928	1.077	0.003
Terrain relief	0.665	1.504	0.005
Annual precipitation	0.752	1.329	0.027
Lithology	0.628	1.593	0.032
Distance from fault	0.871	1.148	0.000
Distance from road	0.928	1.077	0.019
Distance from river	0.728	1.374	0.022
NDVI	0.768	1.302	0.000
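
Tolerance and VIF are reciprocals of each other (VIF = 1/T), so the two columns of Table 3 can be cross-checked. The sketch below recomputes VIF from the printed tolerance and lists the rows that differ by more than a rounding margin. It also applies the commonly used rule of thumb that VIF below 10 and T above 0.1 indicate no serious collinearity. The margin of 0.005 and the function names are assumptions made for illustration, and the thresholds are not necessarily the ones used by the authors.

```python
# (factor, tolerance, VIF) transcribed from Table 3.
TABLE_3 = [
    ("Elevation", 0.831, 1.204),
    ("Slope", 0.768, 1.961),
    ("Slope aspect", 0.928, 1.077),
    ("Terrain relief", 0.665, 1.504),
    ("Annual precipitation", 0.752, 1.329),
    ("Lithology", 0.628, 1.593),
    ("Distance from fault", 0.871, 1.148),
    ("Distance from road", 0.928, 1.077),
    ("Distance from river", 0.728, 1.374),
    ("NDVI", 0.768, 1.302),
]


def reciprocal_mismatches(rows, margin=0.005):
    """Rows whose printed VIF differs from 1/T by more than `margin`."""
    out = []
    for factor, tolerance, vif in rows:
        expected = 1.0 / tolerance
        if abs(expected - vif) > margin:
            out.append((factor, round(expected, 3), vif))
    return out


def collinear_factors(rows, vif_limit=10.0, t_limit=0.1):
    """Factors that fail the rule-of-thumb collinearity thresholds."""
    return [f for f, t, vif in rows if vif >= vif_limit or t <= t_limit]


mismatches = reciprocal_mismatches(TABLE_3)
flagged = collinear_factors(TABLE_3)
print("VIF != 1/T:", mismatches)
print("collinear factors:", flagged)
```

With the values as printed, no factor exceeds either threshold. Only the Slope row departs from the reciprocal relation (1/0.768 ≈ 1.302, against a printed VIF of 1.961), which suggests that one of its two entries may be a typographical error. The table alone does not show which one.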

Table 4. Confusion matrix.

Model	Predicted value	True value: Positive (1) (pcs)	True value: Negative (0) (pcs)
RF	Positive (1)	128	6
RF	Negative (0)	16	133
XGBoost	Positive (1)	122	9
XGBoost	Negative (0)	22	131
SVM	Positive (1)	119	12
SVM	Negative (0)	25	134
ANN	Positive (1)	115	7
ANN	Negative (0)	29	134
ADBoost	Positive (1)	117	14
ADBoost	Negative (0)	27	128
DT	Positive (1)	113	12
DT	Negative (0)	31	134
BN	Positive (1)	115	18
BN	Negative (0)	29	125
LR	Positive (1)	111	16
LR	Negative (0)	33	135
RF + XGBoost	Positive (1)	133	5
RF + XGBoost	Negative (0)	11	133
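
The confusion matrices in Table 4 can be turned into the usual summary metrics. The sketch below reads each matrix as (TP, FP, FN, TN), with rows as predictions and columns as true values, and computes accuracy, precision, recall and F1. The variable and function names are hypothetical. These metrics are derived here for illustration and are not values reported in the article.

```python
# (TP, FP, FN, TN) per model, read from Table 4:
# row "Predicted positive" -> (TP, FP); row "Predicted negative" -> (FN, TN).
TABLE_4 = {
    "RF": (128, 6, 16, 133),
    "XGBoost": (122, 9, 22, 131),
    "SVM": (119, 12, 25, 134),
    "ANN": (115, 7, 29, 134),
    "ADBoost": (117, 14, 27, 128),
    "DT": (113, 12, 31, 134),
    "BN": (115, 18, 29, 125),
    "LR": (111, 16, 33, 135),
    "RF + XGBoost": (133, 5, 11, 133),
}


def metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from one confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


results = {name: metrics(*cells) for name, cells in TABLE_4.items()}

# Print the models from highest to lowest recall.
ranked = sorted(results.items(), key=lambda kv: kv[1]["recall"], reverse=True)
for name, m in ranked:
    print(f"{name:<13}" + "  ".join(f"{k}={v:.3f}" for k, v in m.items()))
```

Every model's true-positive column sums to 144 points, whereas the true-negative column totals differ between models as printed. Accuracy is therefore computed from each model's own four cells. On these figures the RF + XGBoost model has the highest accuracy and recall, followed by RF.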

Table 5. The similarities and differences between RF and XGBoost models.

Similarities

Both RF and XGBoost are suitable for the probabilistic prediction (classification) of geological disasters.
Both RF and XGBoost can effectively process nonlinear data.
Both RF and XGBoost can handle high-dimensional feature spaces and different types of data.
Both RF and XGBoost are ensemble methods based on decision trees.
Both RF and XGBoost can evaluate the importance of each feature during model training.

Differences

Training strategy: RF builds multiple decision trees independently, whereas XGBoost builds trees sequentially.
Error handling: each RF tree makes its prediction independently, whereas each XGBoost tree is fitted to the errors left by the previous trees.
Feature selection: RF randomly selects a subset of features, whereas XGBoost considers all features.
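
The first two differences in Table 5 can be illustrated with a toy sketch. The code below is neither RF nor XGBoost. It uses one-split regression stumps on a single feature, with bootstrap averaging (bagging) standing in for RF's independently built trees and plain residual-fitting gradient boosting standing in for XGBoost's sequential trees. The data, function names and learning rate are assumptions made for illustration. Random feature selection is not shown because there is only one feature.

```python
import math
import random


def fit_stump(xs, ys):
    """Best single-threshold regression stump under squared error."""
    best = None
    for thr in sorted(set(xs))[:-1]:            # keep the right side non-empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:                            # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm


def bagging(xs, ys, n_trees, rng):
    """RF-like: every stump is fitted independently on its own bootstrap sample."""
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)


def boosting(xs, ys, n_trees, lr=0.5):
    """Boosting-like: every stump is fitted to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 20 for i in range(40)]
ys = [math.sin(3 * x) for x in xs]

bagged = bagging(xs, ys, n_trees=20, rng=random.Random(0))
boost_1 = boosting(xs, ys, n_trees=1)
boost_20 = boosting(xs, ys, n_trees=20)

print("bagged stumps, training MSE:  ", round(mse(bagged, xs, ys), 4))
print("boosting, 1 stump, MSE:       ", round(mse(boost_1, xs, ys), 4))
print("boosting, 20 stumps, MSE:     ", round(mse(boost_20, xs, ys), 4))
```

In the bagging function the order of the trees does not matter, since no tree sees the output of another. In the boosting function each new stump depends on all earlier ones, so the training error falls as stumps are added.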

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

