1. Introduction
China's complex geological setting, with diverse geomorphological features and frequent tectonic activity, makes it prone to many kinds of disasters. Their unpredictability and sudden onset contribute significantly to losses of human life and property. According to official disaster statistics (https://www.stats.gov.cn/, accessed on 1 May 2024), a total of 21,939 geological disasters occurred in China from 2020 to 2023, of which landslides accounted for 54.67%. A landslide is a cascading geological disaster in which rock and soil on a slope slide downward under the action of gravity, and it is an internal driving force in the evolution of landscape patterns [1,2,3]. It is therefore of great significance to apply efficient and accurate techniques to landslide susceptibility evaluation and to the identification of high-risk, high-incidence areas, in order to improve geological disaster forecasting and regional disaster prevention and mitigation.
Landslide susceptibility assessment usually combines historical disaster inventories with manually selected conditioning factor data, and assessment models are mainly categorized into physical, expert, statistical, and machine learning models [4,5,6,7]. Each has its own advantages and disadvantages in different scenarios. Physically based models provide the highest assessment accuracy at small spatial scales where large amounts of exploration data are available, but for the same reasons they perform poorly at wide-area scales and with sparse data [8]. Expert-based models rely on expert experience and domain knowledge to set the weights of landslide conditioning factors and classify susceptibility levels. Because these weights are set subjectively, it is difficult to obtain objective and quantitative assessment results [9]. Statistical models process a large number of landslide conditioning factors and use mathematical methods to mine the relationships between them and partition landslide susceptibility, but they are limited when the relationships between factors are nonlinear and complex [10]. Machine learning models depend less on predefined rules and a priori knowledge: they select relevant conditioning factors and adjust their parameters by learning from training data, minimizing a function of the error between the predicted output and the true label [11,12,13]. Recently, machine learning models have been widely used for landslide susceptibility assessment, including logistic regression [14], artificial neural networks [15], support vector machines [16], decision trees [17], random forests [18], naive Bayes [19], and gradient boosting trees [20]. However, the feature factors selected by these methods have the following problems. Firstly, the dates of historical landslides vary, but most research ignores the temporal matching of dynamic factors such as rainfall, meteorology, and human activities. Secondly, many studies have identified optimal assessment factors for different regions, yet making use of this landslide domain knowledge and reducing the inefficiency caused by repetitive research in overlapping regions remains an urgent problem. Thus, there is an urgent need to establish a generalized, computable, and reasoned feature-factor recommendation model to improve assessment accuracy and efficiency.
A knowledge graph is a kind of semantic network that uses graph models to describe complex knowledge and model associations in the world. It has significant advantages in organizing and integrating domain knowledge and in mining implicit relationships. To a certain extent, it compensates for subjective feature selection, low utilization of knowledge, and insufficient mining of implicit relationships [21,22,23]. Current research on knowledge graphs in the disaster domain mainly focuses on constructing single-disaster scenario models, decomposing and identifying disaster scenarios, and analyzing the temporal evolution and association relationships of disasters [24,25]. In addition, studies centered on the temporal and spatial evolution of disaster chains have analyzed and mined the relationships among disasters, the environment, and affected objects [26,27,28]; others have modeled disaster events and emergency responses and, through embedding models, assessed disaster losses and reasoned about the emergency response process [29,30,31]. Nevertheless, the growing body of research on landslide susceptibility assessment, including case studies and related investigations, has produced a wealth of valuable information that remains largely untapped. This information is often accompanied by a large amount of redundant, spatially weakly correlated, or even invalid data, which greatly increases the difficulty of selecting valid landslide susceptibility evaluation factors and widens the gap between the ability to acquire knowledge and the ability to apply it intelligently. There is an urgent need to develop knowledge-graph-based methods for selecting landslide susceptibility assessment indicators at the semantic level, to push the most relevant and effective selections to the assessment task, and to improve the efficiency and accuracy of identification.
This paper addresses the low utilization of landslide susceptibility knowledge and the subjective nature of indicator selection. Firstly, we extracted knowledge from unstructured Internet texts, including scientific literature, landslide disaster reports, and online encyclopedias, and constructed a landslide disaster knowledge graph using a fine-tuned UIE model. Secondly, we built a feature factor recommendation model for knowledge association and discovery based on the knowledge graph. By calculating association and attribute similarity, we selected areas highly associated with TGA. By eliminating highly similar characterization factors, we determined the base landslide susceptibility assessment factors for TGA, and used TF-IDF to mine associated disaster factors for inclusion among the evaluation factors. Thirdly, we applied six ML models to landslide susceptibility assessment, namely logistic regression (LR), support vector machine (SVM), random forest (RF), gradient boosted decision tree (GBDT), extreme gradient boosted tree (XGBoost), and categorical gradient boosted tree (CatBoost), and used the SHAP algorithm to explain the decision basis of the prediction results, thereby realizing automatic, interpretable landslide susceptibility assessment that integrates a priori knowledge and ML.
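To make the TF-IDF step concrete, the following is a minimal sketch of how terms in disaster texts could be weighted. It assumes the plain formulation tf = count / document length and idf = log(N / document frequency); the exact variant, tokenization, and corpus used in this work are not specified here, so the function name `tf_idf` and the toy token lists are purely illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, hand-made token lists standing in for landslide reports.
reports = [
    ["landslide", "rainfall", "collapse", "slope"],
    ["landslide", "collapse", "fault", "collapse"],
    ["landslide", "river", "mudslide", "slope"],
]

scores = tf_idf(reports)
# Highest-weighted term of the second report.
top_term = max(scores[1], key=scores[1].get)
```

In this toy corpus, a term that appears in every report (here "landslide") receives a weight of zero, while terms concentrated in few reports are weighted more heavily, which is the property that makes TF-IDF a plausible tool for surfacing disaster terms that are distinctive to particular texts.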
The main contributions of this research are as follows.
(1) Knowledge for landslide susceptibility assessment is extracted using a fine-tuned UIE model to construct a knowledge graph. Feature factors for landslide susceptibility assessment are selected through threshold-constrained spatial association and attribute semantic correlation, and associated disaster factors are mined through knowledge discovery and added to the assessment features.
(2) The factor recommendation model improves the accuracy, precision, recall, and F1-score of RF, GBDT, XGBoost, and CatBoost to varying degrees, and the associated disaster factors significantly improve the landslide susceptibility assessment results.
(3) Through hyperparameter optimization and optimal feature combination selection, the probability of landslide occurrence in TGA is classified using RF, and the contribution of each factor is interpreted using the SHAP algorithm.
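For reference, the four evaluation metrics named above can be computed from binary labels as in the sketch below. It uses the standard confusion-matrix definitions, with landslide samples labeled 1 and non-landslide samples labeled 0; the function name `binary_metrics` and the example labels are hypothetical and are not results from this study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = landslide, 0 = non-landslide.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

Reporting all four values side by side for each model, with and without the associated disaster factors, is one straightforward way to make the comparison between feature combinations explicit.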
5. Discussion
With the growth of landslide-related research, integrating a priori knowledge in the landslide domain and mining implicit correlations has become an important task for supporting high-precision, high-efficiency landslide susceptibility assessment. Many studies have examined landslide susceptibility modeling with multiple models and multiple features. In selecting feature indicators, however, the disasters in a landslide inventory occurred at different times, so the temporal matching of dynamic factors such as NDVI and rainfall is not fully considered. It is therefore necessary to design suitable feature factor recommendation models that achieve equivalent or more efficient assessment results. In addition, much landslide-related information contained in landslide disaster reports and landslide event encyclopedias has not been effectively utilized, such as the associated disaster density and Euclidean distance factors identified in this paper through the landslide knowledge graph. Integrating existing methods and technologies to mine the implicit relationships in unstructured data is therefore of scientific significance for improving assessment accuracy.
To verify the effectiveness of the spatial correlation and attribute similarity used in this paper, the selected study area is not the Three Gorges Reservoir Area in the traditional sense but that area extended to include adjacent regions. Among existing studies of the traditional Three Gorges Reservoir Area, Fang et al. (2021) [60], Song et al. (2024) [61], and Yu et al. (2024) [62] used historical landslide data with 20, 12, and 15 feature factors, respectively, such as NDVI, climate and meteorology, road data, and land cover under the current timestamp, to compare different machine learning models for landslide susceptibility in TGA; the average random forest accuracies in these studies were 80%, 84.05%, and 79.6%, respectively. Using the landslide susceptibility knowledge graph constructed in this paper together with the feature factor recommendation model, 12 evaluation indexes were selected from five categories (Topographic Features, Geologic Environment, Soil Type, Rivers, and Associated Disaster). Combined with a random forest model with hyperparameter optimization, the susceptibility assessment of TGA reached an optimal accuracy of 87.20%, which demonstrates the advantage of selecting feature indexes based on the knowledge graph established in this paper.
In this paper, non-landslide sample points are selected only in accordance with existing studies and the first law of geography; future research should consider more constraints when generating non-landslide samples in order to improve prediction accuracy. In addition, this study uses only single-modal unstructured text for knowledge mining. Future work should extract and mine multimodal landslide information, for example, by combining optical remote sensing imagery with deep learning to comprehensively recognize the surface geological environment [63,64,65], or by using image segmentation to segment and mine planar raster map data. This would expand the landslide knowledge graph, improve spatial correlation and attribute similarity, and reveal more implicit relationships, providing a more complete system of feature factors for landslide identification [66,67].
6. Conclusions
Landslides are a major geological disaster, and the timely and accurate delineation of landslide-prone areas is of great significance for disaster defense and policy formulation. This research constructs a feature factor selection model guided by a priori knowledge, combined with a knowledge discovery algorithm that mines highly correlated but underutilized factors in historical landslide disasters, to build a feature factor recommendation model. With TGA as a case study, we evaluated six machine learning algorithms and selected the better-performing random forest model for hyperparameter optimization, constructing an interpretable landslide susceptibility assessment model.
In this study, the UIE model is fine-tuned with a small amount of labeled data to construct mapping relationships and build a landslide disaster knowledge graph from text data. The resulting knowledge graph contains 2167 entities and 2352 relationships. Based on similarity calculation and knowledge discovery over the knowledge graph, this paper constructs a feature combination that includes associated disaster factors. Experiments with six ML models show that including landslide-associated disaster factors improves classification results. The RF accuracy after combining the optimal feature combination with hyperparameter optimization reaches 87.20%, indicating that the model constructed in this paper is highly credible. The spatial distribution of susceptibility shows that the high-risk areas of TGA are located along its western and eastern edges. Quantitative analysis of the relationship between the feature factors and the probability of landslide occurrence shows that the probability decreases as Mudslide distance and Collapse distance increase; as DEM, SLOPE, Fault Density, and Collapse density increase, the probability first increases and then decreases; and as the Elevation Coefficient of Variation and River Density increase, the probability increases sharply.