Comparison of Different Negative-Sample Acquisition Strategies Considering Sample Representation Forms for Debris Flow Susceptibility Mapping

Gao, Ruiyuan; Wu, Di; Liu, Hailiang; Liu, Xiaoyang

doi:10.3390/app14209240

¹ Civil Engineering and Construction Center, Huanghe Science and Technology University, Zhengzhou 450063, China

² College of Construction Engineering, Jilin University, Changchun 130012, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(20), 9240; https://doi.org/10.3390/app14209240

Submission received: 10 August 2024 / Revised: 21 September 2024 / Accepted: 9 October 2024 / Published: 11 October 2024

Abstract

The lack of reliable negative samples is an important factor limiting the quality of machine learning-based debris flow susceptibility mapping (DFSM). This paper proposes multiple negative-sample acquisition strategies for DFSM and compares them across different sample representation forms. The sample representation forms are the single grid, multi-grid, and watershed unit, and the negative-sample acquisition strategies are based on the support vector machine (SVM), spy technique, and isolation forest (IF) methods. Each strategy assigns a value to every sample under its own assumptions, and reliable negative samples are generated from samples with values below a predefined threshold. Combining the three sample representation forms with the three negative-sample acquisition strategies yielded nine datasets, which were then used in random forest (RF) modeling. The receiver operating characteristic (ROC) curves and related statistical results were used to evaluate the models. The results show that the strategy based on the spy technique is suitable for multiple datasets, while the IF-based strategy is well adapted to the watershed unit datasets. This study provides more options for improving the quality of datasets in DFSM, which can further improve the performance of machine learning models.

Keywords: debris flow susceptibility mapping; machine learning; negative-sample acquisition; sample representation form

1. Introduction

Debris flows are widespread in mountainous areas and are one of the major hazards threatening life and property [1]. Because of their suddenness, ferocity, and wide distribution, debris flows deserve more attention [2]. Debris flow susceptibility mapping (DFSM) can efficiently identify areas prone to debris flows and is an effective approach to debris flow risk management [3].

DFSM has shifted from qualitative to quantitative approaches [4]. Their reliance on expert experience makes qualitative approaches subjective and difficult to generalize on a large scale [5]. Quantitative methods, on the other hand, focus on extracting relevant information from large amounts of data to obtain objective results and are becoming increasingly popular as high-quality data have become easier to access [6]. Machine learning is one of the most promising quantitative methods developed for DFSM. Widely used algorithms include the artificial neural network, support vector machine, and random forest (RF) [7,8,9,10]. In addition, combining several weak classifiers within an ensemble framework has proved to be an effective way to improve the prediction accuracy of the models [11].

The reliability of the samples is a key indicator of dataset quality and is directly related to the performance of machine learning models [12]. Positive (debris flow) samples can usually be obtained from multiple sources, and samples from different sources can be mutually verified to ensure their reliability. Conversely, the reliability of negative (non-debris flow) samples is usually not guaranteed. In previous studies, negative samples were often obtained randomly from unlabeled samples [13], an approach that can introduce noise into the datasets because the unlabeled samples may include undetected positive samples. Zhu et al. [14] quantified the reliability of the negative samples obtained based on the difference between positive and negative samples. Negative samples with reliability values higher than a specific threshold were used for modeling, although the determination of optimal thresholds needs to be explored further. Fu et al. [15] quantitatively evaluated the correlation between geological hazards and the conditioning factors; reliable negative samples were obtained from areas poorly correlated with the occurrence of geological hazards. This method places high demands on both the algorithm design and the number of positive samples. Xiao et al. [16] proposed that negative samples could be obtained by replacing some features of positive samples. However, negative samples obtained in this way may not exist in the real world. Liang et al. [9] treated some of the unlabeled samples as negative samples and combined them with positive samples to build a model that predicts all the samples; samples with the lowest predicted values can be regarded as reliable negative samples. This paper proposes three negative-sample acquisition strategies based on the support vector machine (SVM), spy technique, and isolation forest (IF) methods, which provide more options for reliable negative-sample acquisition.

The key to obtaining negative samples is to establish an evaluation metric that distinguishes positive from unlabeled samples. The sample representation form also deserves more attention. For debris flows, grids and watershed units are currently the most commonly used sample representation forms. Grids are generally suitable for storing and processing information at various geographical scales, but they lack the ability to express the overall characteristics of a particular object [11]. Watershed units, as the basic units in which debris flow events occur, carry overall characteristics that are closely related to debris flow formation, but their acquisition is usually more complicated [17]. In the present study, both grids and watershed units were considered for comparison.

This study was designed to explore the optimal negative-sample acquisition strategy for debris flow susceptibility mapping. Three negative-sample acquisition strategies, based on the SVM, spy technique, and IF methods, were studied. Three sample representation forms, namely the single grid, multi-grid, and watershed unit, were also considered. The resulting datasets were then used to build multiple RF models, and receiver operating characteristic (ROC) curves were used for model evaluation. In general, this study provides more options for improving the quality of datasets in DFSM, which can further improve the performance of machine learning models.

2. Materials and Methods

2.1. Study Area and Debris Flow Inventory

Based on field surveys and historical records, Fangshan District in Beijing was selected as the study area. The district lies in the warm temperate, semi-humid continental monsoon climate zone, and rainfall is abundant from July to August every year. Mountainous areas make up two-thirds of Fangshan District. Strong incision has produced steep slopes that are widely distributed in the study area. Dolomite and limestone are the dominant lithologies in the southern and northern portions of the study area, respectively. Fault structures are mostly developed in the northern part of the study area, and many debris flow samples lie in their vicinity. Moreover, human activities, such as road construction, are also intensive in the study area.

A complete and reliable debris flow inventory is one of the most important factors in a DFSM study. As shown in Figure 1, a total of 155 debris flow locations were collected from multiple sources, such as field surveys, historical information, and remote sensing interpretation.

2.2. Sample Representation Forms and Conditioning Factors

2.2.1. Sample Representation Forms

The determination of the sample representation forms was the basis for collecting the data needed for the DFSM study. The sample representation forms considered in this paper included a single grid, multi-grid, and watershed unit (Figure 2). For the single-grid form, the choice of its location was the main problem to be solved. Based on previous studies, the centroid of the formation area, which is closely related to the formation of debris flows, was chosen to represent the debris flow event [18]. In general, the formation, flow, and accumulation areas of debris flows are quite different, and it is difficult to describe the general characteristics of debris flows with a single grid, so the multi-grid form was proposed to represent debris flow events. Considering the size of the grids and the computational cost, this paper generated multiple grids in each watershed to represent the corresponding debris flow event with a spatial accuracy of 50 m. As for the watershed unit, this form is naturally able to determine the overall characteristics of the debris flow event well [19]. So, each vector surface representing a watershed can be considered as an independent sample. In this paper, the hydrological analysis tool based on ArcGIS was used to obtain the watershed units, and remote sensing interpretation was performed to make some adjustments to the acquired watershed units.

2.2.2. Conditioning Factors

There has never been a consensus on the selection of debris flow conditioning factors [20], but the following conditions should be met: (1) the conditioning factors should be accessible, so factors whose data are scarce or extremely difficult to obtain should not be considered; (2) the selected conditioning factors should be convertible into data suitable for computation; and (3) the combination of the selected conditioning factors should reflect the possible influences controlling the triggering mechanism of debris flows. In total, nine conditioning factors were collected. Figure 3 shows the conditioning factors based on grids, and Figure 4 shows the conditioning factors based on watershed units.

Rainfall plays an important role in the formation of most debris flows. A large number of unstable bodies are destabilized and washed out under rainfall conditions [21]. The average annual rainfall was therefore applied as a conditioning factor. The related data were obtained from a climate website (Worldclim.Org, accessed on 1 August 2024).

Topographic factors are never absent from DFSM studies. Altitude, slope, plane curvature, profile curvature, and topographic wetness index (TWI) were considered in the present study. Based on a 30 m digital elevation model, all these factors could be obtained based on some calculations and transformations. Altitude can be regarded as an indirect influence on debris flow formation since characteristics such as temperature, rainfall, and vegetation usually exhibit distributional differences for different altitude ranges [22]. Slope is closely related to debris flow formation, circulation, and accumulation processes, and is therefore one of the most commonly used conditioning factors in DFSM studies [9]. Plane curvature and profile curvature are two factors that can determine the flow process of surface runoff movement [23]. The TWI is normally associated with soil moisture. As the soil moisture increases, so does the potential for unstable slopes to provide a source of material for debris flows [24].

Surface vegetation cover is recognized to play an important role in debris flow formation [25]. Areas with less vegetation cover have more potential to provide material sources for debris flows. The normalized difference vegetation index (NDVI) was therefore chosen to describe the vegetation conditions in the present study.

The distance to a particular target is often used as a measure of the extent to which that target influences debris flow formation. The development of faults and the construction of roads can provide material sources for debris flows [26]. Accordingly, distance to faults (DTF) and distance to roads (DTR) were considered as two important conditioning factors.

3. Methodology

3.1. Sampling and Partitioning Strategies

3.1.1. SVM

The negative samples used in previous studies were usually obtained randomly, which might have incorrectly introduced some positive samples [13]. The introduction of a small number of erroneous samples would affect the performance of the model to some extent, but, in general, the model can still assign a value proportional to the actual probability to each sample. In this paper, an SVM model was built based on the positive samples and randomly obtained negative samples. Reliable negative samples could be obtained from samples with lower predicted values by the model.

An SVM is a typical supervised learning model that can map complex, nonlinear combinations of sample features into a higher-dimensional feature space. The goal of an SVM is to find a hyperplane in this space that separates the positive and negative samples as correctly as possible while maximizing the margin between the samples and the hyperplane. In addition, SVM training considers not only classification accuracy but also restricts the model structure to avoid an overly complex decision boundary, so the algorithm integrates empirical risk and structural risk [27]. In practice, the SVM avoids explicit computation in the high-dimensional space by introducing kernel functions. In this paper, the radial basis function (RBF) kernel was selected, and the parameter involved in this function was set to 0.1.
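The SVM-based scoring step can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, the feature standardization is an added assumption, and reading the reported value 0.1 as the RBF `gamma` parameter of scikit-learn's `SVC` is also an assumption, since the text does not name the parameter.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data with 9 conditioning factors per sample.
X_pos = rng.normal(loc=1.0, size=(60, 9))                 # known debris flow samples
X_unl = np.vstack([rng.normal(loc=-1.0, size=(570, 9)),   # mostly true negatives
                   rng.normal(loc=1.0, size=(30, 9))])    # a few hidden positives

# Step 1: draw provisional negatives at random from the unlabeled pool.
neg_idx = rng.choice(len(X_unl), size=len(X_pos), replace=False)
X_train = np.vstack([X_pos, X_unl[neg_idx]])
y_train = np.concatenate([np.ones(len(X_pos), dtype=int),
                          np.zeros(len(neg_idx), dtype=int)])

# Step 2: fit an RBF-kernel SVM (gamma=0.1 is an assumed reading of the
# reported kernel parameter; the scaler is an added assumption).
scaler = StandardScaler().fit(X_train)
svm = SVC(kernel="rbf", gamma=0.1, probability=True, random_state=0)
svm.fit(scaler.transform(X_train), y_train)

# Step 3: score every unlabeled sample; column 1 is the positive class.
p_unl = svm.predict_proba(scaler.transform(X_unl))[:, 1]
```

Unlabeled samples with low values of `p_unl` are the candidates from which reliable negative samples would then be drawn.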

3.1.2. Spy Technique

The main idea of the spy technique is to mix a portion of the debris flow samples into the unlabeled samples. With reference to previous studies, this portion was set to 15% in this paper [28]. These samples served as spy samples and, together with the unlabeled samples, were treated as negative samples. They were then combined with the remaining positive samples to train a classifier that made predictions for all the samples. The lowest predicted value among the spy samples was used as the threshold, and samples with predicted values below this threshold were regarded as reliable negative samples.
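The spy procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the data are synthetic, and logistic regression stands in for the classifier, which the text does not specify. Only the 15% spy fraction is taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-in data with 9 conditioning factors per sample.
X_pos = rng.normal(loc=1.0, size=(100, 9))                 # known debris flow samples
X_unl = np.vstack([rng.normal(loc=-1.0, size=(450, 9)),    # true negatives
                   rng.normal(loc=1.0, size=(50, 9))])     # hidden positives

# Step 1: move 15% of the positive samples into the unlabeled set as spies.
n_spy = int(round(0.15 * len(X_pos)))
is_spy = np.zeros(len(X_pos), dtype=bool)
is_spy[rng.choice(len(X_pos), size=n_spy, replace=False)] = True

# Step 2: spies and unlabeled samples are labeled 0, remaining positives 1.
X_train = np.vstack([X_pos[~is_spy], X_pos[is_spy], X_unl])
y_train = np.concatenate([np.ones(int((~is_spy).sum()), dtype=int),
                          np.zeros(n_spy + len(X_unl), dtype=int)])

# Step 3: train a classifier (logistic regression is an assumed stand-in).
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Step 4: the lowest spy prediction becomes the threshold.
p_spy = clf.predict_proba(X_pos[is_spy])[:, 1]
p_unl = clf.predict_proba(X_unl)[:, 1]
threshold = p_spy.min()

# Step 5: unlabeled samples scoring below every spy are reliable negatives.
reliable_neg = np.flatnonzero(p_unl < threshold)
```

Because the threshold is set by the spies themselves, no fixed cut-off has to be chosen in advance, which matches the dataset-specific thresholds reported in Section 4.1.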

3.1.3. IF

The key idea of the IF algorithm is anomaly detection. As an ensemble learning algorithm, IF usually consists of multiple isolation trees. Each isolation tree was built based on a subset of samples obtained by sampling from the original datasets [29]. Since abnormal samples usually exhibit large differences from normal samples, abnormal samples would traverse a relatively short path in the isolation tree. Samples with lower anomaly scores can be considered as reliable negative samples. The key formula of the IF algorithm is as follows:

s(x, n) = 2^{-\frac{E(h(x))}{C(n)}}    (1)

C(n) = 2H(n - 1) - \frac{2(n - 1)}{n}    (2)

where E(h(x)) is the average path length of sample x over all isolation trees (ITs); n is the number of samples used to build each IT; C(n) normalizes the path length; and H(n − 1) is the harmonic number, approximated by ln(n − 1) + 0.5772156649 (Euler's constant).
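Equations (1) and (2) can be evaluated directly. The short sketch below simply transcribes the two formulas; the function names are hypothetical, and n is taken to be the subsampling size.

```python
import math

EULER_GAMMA = 0.5772156649  # constant used in the approximation of H(n - 1)


def c(n: int) -> float:
    """Normalizing term C(n) from Equation (2), evaluated as written."""
    if n < 2:
        raise ValueError("C(n) requires n >= 2")
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Anomaly score s(x, n) from Equation (1)."""
    return 2.0 ** (-mean_path_length / c(n))


# A sample whose mean path length equals C(n) scores exactly 0.5;
# shorter paths push the score toward 1, longer paths toward 0.
n = 256
short_path = anomaly_score(4.0, n)
long_path = anomaly_score(2.0 * c(n), n)
```

Under these formulas, a sample isolated after only a few splits receives a score above 0.5, while a sample that needs many splits receives a score below 0.5.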

The IF algorithm was implemented with the scikit-learn (sklearn) package in Python 3.7. The number of isolation trees and the subsampling size are the two parameters that determine the performance of the model; they were set to 200 and 256, respectively, in this paper. The remaining parameters were left at their defaults.
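A configuration consistent with these settings might look like the sketch below. The data are synthetic and the final ranking step is only illustrative; the two parameter values are the ones stated in the text. Note that scikit-learn's `score_samples` returns the negated anomaly score, so the sign is flipped to recover s(x, n).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in data: a large bulk of ordinary samples plus a small,
# clearly different group playing the role of potential debris flow samples.
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 9)),
               rng.normal(4.0, 1.0, size=(50, 9))])

# 200 isolation trees, each built on a subsample of 256 samples.
forest = IsolationForest(n_estimators=200, max_samples=256, random_state=0)
forest.fit(X)

# score_samples returns the opposite of the anomaly score, so negate it.
scores = -forest.score_samples(X)

# Illustrative only: rank samples from least to most anomalous.
ranking = np.argsort(scores)
```

Samples at the start of `ranking` have the lowest anomaly scores and would be the first candidates for reliable negative samples under this strategy.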

3.1.4. Cross-Validation

Many previous studies divided the original dataset into two parts for training and prediction [22]. However, cross-validation is a more effective partitioning strategy when the sample size is not large enough [30]. In this paper, the dataset was divided into k subsets; each subset was used in turn as the test set while the remaining k − 1 subsets formed the training set. In this way, every sample was predicted as part of a test set, and the average of the k evaluation results was used to evaluate the model's performance. This method makes full use of all the data in the dataset, which helps guard against model overfitting and improves the reliability of the model evaluation.
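The partitioning scheme can be sketched as below. The value k = 5 and the synthetic data are assumptions, since the text does not report k; the RF settings (200 trees, 3 factors per split) are the ones given in Section 3.3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in dataset: 150 positive and 150 negative samples.
X = np.vstack([rng.normal(0.5, 1.0, size=(150, 9)),
               rng.normal(-0.5, 1.0, size=(150, 9))])
y = np.concatenate([np.ones(150, dtype=int), np.zeros(150, dtype=int)])

k = 5  # assumed value; the paper does not state k
cv = KFold(n_splits=k, shuffle=True, random_state=0)

fold_auc = []
for train_idx, test_idx in cv.split(X):
    # Train on k - 1 subsets, evaluate on the held-out subset.
    model = RandomForestClassifier(n_estimators=200, max_features=3,
                                   random_state=0)
    model.fit(X[train_idx], y[train_idx])
    p_test = model.predict_proba(X[test_idx])[:, 1]
    fold_auc.append(roc_auc_score(y[test_idx], p_test))

# The reported performance is the average over the k held-out folds.
mean_auc = float(np.mean(fold_auc))
```

Each sample appears in exactly one test fold, so the averaged score reflects predictions on data the corresponding model did not see during training.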

3.2. Information Gain Ratio

The information gain ratio (IGR) is used to measure the predictive ability of a conditioning factor [26]. Conditioning factors with a poor predictive ability can be eliminated to ensure a good model performance while saving computational costs. The formulas are as follows:

R(D, X) = \frac{G(D, X)}{I(X)}    (3)

G(D, X) = H(D) - H(D|X)    (4)

H(D) = -\sum_{i} p_{i} \log(p_{i})    (5)

H(D|X) = \sum_{i} p(X = x_{i}) \cdot H(D|X = x_{i})    (6)

I(X) = -\sum_{i} p(x_{i}) \log(p(x_{i}))    (7)

where R(D,X) is the information gain ratio of feature X in set D; G(D,X) is the information gain of feature X in set D; I(X) is the split information of feature X, determined by the distribution of its classes; H(D) is the information entropy of set D; H(D|X) is the conditional entropy when feature X is known; p_i is the proportion of each category within the set; p(X = x_i) is the probability of feature X being equal to x_i; H(D|X = x_i) is the entropy when X is equal to x_i; and p(x_i) is the probability of each feature class.
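As a worked illustration of Equations (3) to (7), the sketch below computes the IGR for a conditioning factor that has already been discretized into classes. The function names and the toy data are hypothetical, and the base-2 logarithm is an assumption; the ratio itself does not depend on the base because numerator and denominator scale identically.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy of a sequence of class labels (Equations (5) and (7))."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain_ratio(feature_classes, labels):
    """IGR of a discretized feature X with respect to the labels D (Equation (3))."""
    n = len(labels)
    h_d = entropy(labels)

    # Conditional entropy H(D|X): entropy of the labels within each feature
    # class, weighted by the share of samples in that class (Equation (6)).
    h_d_given_x = 0.0
    for value, count in Counter(feature_classes).items():
        subset = [d for x, d in zip(feature_classes, labels) if x == value]
        h_d_given_x += (count / n) * entropy(subset)

    split_info = entropy(feature_classes)  # I(X)
    if split_info == 0.0:
        return 0.0  # a single-class feature carries no information
    return (h_d - h_d_given_x) / split_info


# Toy example: slope class separates the labels perfectly, NDVI class does not.
labels = [0, 0, 1, 1]
slope_class = ["gentle", "gentle", "steep", "steep"]
ndvi_class = ["low", "high", "low", "high"]

igr_slope = information_gain_ratio(slope_class, labels)
igr_ndvi = information_gain_ratio(ndvi_class, labels)
```

A factor whose classes align with the debris flow labels yields an IGR close to 1, whereas a factor whose classes are unrelated to the labels yields an IGR close to 0 and would be a candidate for elimination.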

3.3. RF

RF is an ensemble learning algorithm with decision trees as weak classifiers [13]. Randomness is a key advantage of RF algorithms: (1) Each decision tree is built based on a subset of samples obtained from bootstrap sampling, which can improve the generalization ability of the model to some extent; and (2) the splitting and growth of each decision tree is based on randomly selected conditioning factors with a strong predictive ability, which enables the model to deal with high-dimensional data without feature selection. For the model parameters, the number of decision trees was specified as 200 and the number of conditioning factors used for decision tree splitting was specified as 3.

3.4. ROC Curves

The ROC curve is a composite reflection of the model’s effectiveness in classifying positive and negative samples. The formulas are as follows:

\mathrm{Sensitivity} = \frac{TP}{TP + FN}    (8)

\mathrm{Specificity} = \frac{TN}{FP + TN}    (9)

where TP and TN represent the numbers of correctly classified positive and negative samples, respectively; FN represents the number of positive samples misclassified as negative, and FP represents the number of negative samples misclassified as positive. By varying the probability threshold for judging positive and negative samples, multiple sets of sensitivity and specificity values can be obtained. Then, the ROC curve can be plotted with 1 − specificity as the x-axis and sensitivity as the y-axis [31]. The area under the curve (AUC) was used to evaluate the performance of the model. The higher the AUC value, the better the model's performance.
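The construction of the ROC curve from Equations (8) and (9) can be made concrete with a short sketch. The function names and the four-sample example are hypothetical; the AUC is obtained with the trapezoidal rule over the points (1 − specificity, sensitivity).

```python
import numpy as np


def sensitivity_specificity(y_true, y_score, threshold):
    """Sensitivity and specificity at one probability threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_score) >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    fp = np.sum(y_pred & (y_true == 0))
    return tp / (tp + fn), tn / (fp + tn)


def roc_auc(y_true, y_score):
    """Area under the ROC curve via the trapezoidal rule."""
    # Sweep thresholds from strictest to loosest so the curve runs
    # from (0, 0) to (1, 1).
    thresholds = np.concatenate(([np.inf], np.unique(y_score)[::-1]))
    points = [sensitivity_specificity(y_true, y_score, t) for t in thresholds]
    tpr = np.array([sens for sens, _ in points])        # y-axis
    fpr = np.array([1.0 - spec for _, spec in points])  # x-axis
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)  # 0.75 for this example
```

In this example one negative sample (0.4) outranks one positive sample (0.35), so three of the four positive-negative pairs are ordered correctly and the AUC is 0.75.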

4. Results

4.1. Negative-Sample Acquisition Results

Based on the negative-sample acquisition strategy proposed in the previous section, reliable negative samples for each dataset were acquired (Figure 5). The process of obtaining negative samples can be divided into the following two steps: (1) Assigning a probability value or anomaly score to all the samples in the study area with the help of a classifier, and (2) determining a threshold value; the reliable negative samples were randomly generated from samples that were below this threshold. It can be seen that the determination of the threshold has a significant impact on the selection of the final negative sample. The SVM-based and IF-based negative-sample acquisition strategies do not have clear requirements for the determination of a threshold. In the present study, the probability and anomaly scores of the samples were classified into four classes, very low (0–0.25), low (0.25–0.5), moderate (0.5–0.75), and high (0.75–1), and the reliable negative samples were generated from the very-low class. The determination of the threshold for the spy technique-based negative sample acquisition strategy was based on the model’s prediction of the spy sample. The probability thresholds determined for the single-grid, multi-grid, and watershed unit datasets were 0.23, 0.19, and 0.28, respectively.
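The two-step procedure for the SVM-based and IF-based strategies can be sketched as follows. The function names are hypothetical, and treating each class as a half-open interval (so that a score of exactly 0.25 falls in the low class) is an assumption, because the class ranges quoted above share their boundary values.

```python
import numpy as np

CLASS_EDGES = [0.25, 0.5, 0.75]
CLASS_NAMES = ["very low", "low", "moderate", "high"]


def classify_scores(scores):
    """Map each score in [0, 1] to a class index 0..3 (see CLASS_NAMES)."""
    return np.digitize(scores, CLASS_EDGES)


def draw_reliable_negatives(scores, n_negatives, seed=0):
    """Randomly draw negative samples from the very-low class only."""
    pool = np.flatnonzero(classify_scores(np.asarray(scores)) == 0)
    if n_negatives > pool.size:
        raise ValueError("not enough very-low samples to draw from")
    rng = np.random.default_rng(seed)
    return rng.choice(pool, size=n_negatives, replace=False)


scores = np.array([0.05, 0.20, 0.25, 0.60, 0.90, 0.10])
classes = classify_scores(scores)
negatives = draw_reliable_negatives(scores, n_negatives=2)
```

For the spy-based strategy, the same drawing step would apply with the fixed edge 0.25 replaced by the dataset-specific spy threshold.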

4.2. IGR of the Conditioning Factors

After the negative samples had been acquired, the complete modeling datasets could be established. The IGR values of the conditioning factors are shown in Table 1. The same conditioning factors showed different predictive abilities in different datasets, and the negative-sample acquisition strategy has a significant effect on the predictive ability of the conditioning factors. For the single-grid and multi-grid datasets, the spy technique is most effective in improving the predictive abilities of the conditioning factors, while the IF algorithm is most effective for the watershed unit datasets. On the other hand, altitude, slope, TWI, rainfall, and DTR showed a strong predictive ability under different sample representation forms and negative-sample acquisition strategies, demonstrating the roles of topography, rainfall, and human engineering activities in controlling debris flow formation in the study area.

4.3. Comparison of Different Models

Based on the prepared datasets, a total of nine RF models were established. The corresponding ROC curves are shown in Figure 6 and Figure 7, and the AUC values are shown in Table 2 and Table 3. In terms of the sample representation forms, models based on the watershed unit datasets performed best overall, followed by the models based on the single-grid datasets. Models based on the multi-grid datasets had the worst performance. As for the negative-sample acquisition strategies, the strategy based on the spy technique allowed the models to perform well on multiple datasets, especially the grid datasets. For the watershed unit datasets, the IF algorithm was the key to improving the performance of the corresponding models. In real-world applications, a high-quality model also needs to be efficient and computationally affordable. In increasing order of computational cost, the sample representation forms are the single grid, watershed unit, and multi-grid. As for the negative-sample acquisition strategies, the IF algorithm can directly use all the samples for predictions, which makes it the most efficient. The SVM-based strategy is in second place because it requires a portion of the unlabeled samples to be prepared beforehand as negative samples. The spy-based strategy is the least efficient because it requires both negative and spy samples to be prepared in advance.

4.4. Debris Flow Susceptibility Maps

Based on the performance of the models, the optimal model under each sample representation form was selected to produce debris flow susceptibility maps (Figure 8). Each debris flow susceptibility map was classified into four classes: low, moderate, high, and very high. Table 4 demonstrates the statistical results of the susceptibility classes and the debris flow samples. The results show that most of the debris flow samples are classified as high or very high, validating the rationality of the debris flow susceptibility maps. In comparison, the model based on the watershed unit dataset allowed a higher percentage of debris flow samples to be correctly classified.

5. Discussion

5.1. Comparison of Different Sample Representation Forms

Machine learning is a typical data-driven approach. Different sample representation forms imply different data structures; thus, exploratory studies of sample representation forms are quite crucial. This study provided a comprehensive comparative study of three sample representation forms for DFSM: single grid, multi-grid, and watershed unit. The results show that the predictive ability of the conditioning factors, the performance of the machine learning models, and the quality of the debris flow susceptibility maps are all significantly affected by the sample representation forms. It is difficult to say which sample representation form is optimal, but the advantages and disadvantages of each form are fairly obvious. For the single-grid form, the advantages of easy collection and processing are obvious. However, a single grid usually describes the localized characterization of a debris flow event, which might be different from the overall characterization [11]. For the multi-grid form, the introduction of more grids better characterized the overall debris flow event, but also greatly expanded the sample size. A larger sample size means a higher computational cost and a higher chance of introducing some flawed data, which affects the model’s performance [32]. As for the watershed unit, this sample representation form allows for a better overall characterization of debris flow events with fewer samples [17]. Moreover, combined with a suitable negative-sample acquisition strategy, watershed units have the potential to provide the most predictive conditioning factors, the best performing models, and the most plausible debris flow susceptibility maps. On the other hand, the shortcomings of watershed units are mainly due to their more time-consuming acquisition and adjustment processes.

To summarize, the purpose of DFSM is to produce high-quality debris flow susceptibility maps. In this regard, the watershed unit demonstrated a major advantage in the present study. On the other hand, the overall performance of the single-grid datasets is largely satisfactory, which is suitable for scenarios that require a rapid evaluation.

5.2. Comparison of Different Negative-Sample Acquisition Strategies

In previous studies, researchers tended to pay little attention to areas with little or no potential for debris flow. The acquisition of negative samples is therefore usually problematic. Random acquisition from unlabeled samples was the most common method used in previous studies [13]. However, both positive and negative samples were present in the unlabeled samples, so the random selection approach is usually unreliable. Zhu et al. [14] quantified the reliability of the negative samples obtained based on the difference between positive and negative samples, which is an important step in negative-sample acquisition. Fu et al. [15] evaluated the correlation between geological hazards and conditioning factors; the reliable negative samples were obtained from areas with a poor correlation with the occurrence of geological hazards. It can be seen that establishing a set of criteria to identify negative samples from unlabeled samples in geographic or data spaces is the key problem to be solved.

This paper proposed three negative-sample acquisition strategies based on the SVM, spy technique, and IF methods. Each strategy makes different assumptions about the datasets. The SVM-based strategy assumes that there are far more negative samples than positive samples among the unlabeled samples, so that most of the randomly selected unlabeled samples are still reliable. A model is therefore able to produce, for each sample, a probability that is proportional to the actual probability, and reliable negative samples can be generated from samples with lower probabilities. The spy-based strategy assumes that spy samples and potential debris flow samples among the unlabeled samples will behave similarly in a model's prediction. By observing the probability threshold of the spy samples, reliable negative samples can be identified. The IF-based strategy assumes that there are more negative samples than positive samples in the study area and that positive and negative samples exhibit large differences. Most of the unlabeled samples can then be considered as normal negative samples, and the potential debris flow samples can be considered as "abnormal samples". Based on the idea of anomaly detection, the IF algorithm can assign an anomaly score to each sample. Samples with lower anomaly scores can be regarded as reliable negative samples.

To summarize, the negative-sample acquisition strategy based on the spy technique has no special requirements for either the datasets or the algorithms, which can explain its good performance on multiple datasets. The IF-based negative-sample acquisition strategy assumes that debris flow samples are abnormal samples, which account for a relatively small portion of the overall datasets. For the grid datasets, grids with the same or similar characteristics as the debris flow samples are widely distributed throughout the study area, so the IF-based strategy performed poorly on the grid datasets. The watershed unit datasets, on the other hand, fit this assumption well, which explains why the IF-based strategy performed well on them. The SVM-based negative-sample acquisition strategy has some requirements for both the datasets and the algorithms, which is the reason for its unstable performance across datasets. In terms of efficiency, the IF-based strategy proved to be the most efficient, followed by the SVM-based strategy, and the spy-based strategy was the least efficient. Overall, in practical applications, choosing a negative-sample acquisition strategy that matches the distribution characteristics of the datasets is the key to improving the reliability of negative samples.

5.3. Limitations

Considering multiple sample representation forms, different negative-sample acquisition strategies with different assumptions were compared and analyzed. Some limitations remain to be addressed: (1) The selection of debris flow conditioning factors has always been subjective to some degree. An approach based on objective data and algorithms could be explored to improve the quality of the datasets. (2) The SVM algorithm was introduced in this paper to assist in obtaining negative samples. More algorithms could be introduced to increase the reliability of the study. (3) This paper explored three negative-sample acquisition strategies with different assumptions for DFSM. For other types of natural hazards, the extent to which their samples conform to the assumptions of these strategies needs to be further explored to improve the universality of this study.

6. Conclusions

This paper explored the optimal negative-sample acquisition strategy for DFSM considering multiple sample representation forms. The following conclusions can be drawn:

(1) Different sample representation forms have their own advantages and disadvantages. Watershed units have more obvious strengths in DFSM studies.
(2) Different negative-sample acquisition strategies have different assumptions. The strategy based on the spy technique is less demanding and thus suitable for multiple datasets, while the IF-based strategy is well adapted to watershed unit datasets.
(3) Watershed units with the negative-sample acquisition strategy based on the IF algorithm are the optimal combination in this study, with the most predictive conditioning factors, the best performing models, and the most plausible debris flow susceptibility maps.

Author Contributions

Conceptualization, R.G. and D.W.; methodology, R.G.; software, R.G.; validation, D.W.; formal analysis, D.W.; investigation, H.L. and X.L.; resources, H.L. and X.L.; data curation, R.G.; writing—original draft preparation, R.G.; writing—review and editing, H.L. and X.L., visualization, H.L. and X.L.; supervision, R.G.; project administration, R.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Research Foundation for Doctors of Huanghe Science and Technology University (No. 02032817).

Data Availability Statement

The data and the code of this study are available from the corresponding authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The debris flow inventory map of Fangshan District.

Figure 2. Schematic of the sample representation forms: (a) single grid; (b) multi-grid; (c) watershed unit.

Figure 3. Conditioning factors based on grids: (a) rainfall; (b) altitude; (c) slope; (d) plane curvature; (e) profile curvature; (f) TWI; (g) NDVI; (h) DTF; (i) DTR.

Figure 4. Conditioning factors based on watershed unit: (a) rainfall; (b) altitude; (c) slope; (d) plane curvature; (e) profile curvature; (f) TWI; (g) NDVI; (h) DTF; (i) DTR.

Figure 5. The negative-sample acquisition results: (a) single-grid-SVM dataset; (b) single-grid-spy dataset; (c) single-grid-IF dataset; (d) multi-grid-SVM dataset; (e) multi-grid-spy dataset; (f) multi-grid-IF dataset; (g) watershed-unit-SVM dataset; (h) watershed-unit-spy dataset; (i) watershed-unit-IF dataset.

Figure 6. ROC curves based on different training datasets: (a) single grid; (b) multi-grid; (c) watershed unit.

Figure 7. ROC curves based on different validation datasets: (a) single grid; (b) multi-grid; (c) watershed unit.

Figure 8. Debris flow susceptibility maps based on the optimal models of (a) the single grid dataset; (b) multi-grid dataset; (c) watershed unit dataset.

Table 1. IGR values of conditioning factors.

Representation Forms	Conditioning Factors	SVM-Based	Spy-Based	IF-Based
Single grid	Rainfall	0.092661	0.096309	0.035283
	Altitude	0.064700	0.072970	0.044267
	Slope	0.052486	0.164884	0.050151
	Plane curvature	0.009772	0.011732	0.002229
	Profile curvature	0.007298	0.010095	0.003221
	TWI	0.102360	0.221782	0.021809
	NDVI	0.029817	0.030426	0.011323
	DTF	0.021958	0.069592	0.014625
	DTR	0.188112	0.217063	0.172086
Multi-grid	Rainfall	0.027362	0.073548	0.014158
	Altitude	0.036237	0.061468	0.013859
	Slope	0.011209	0.033521	0.006381
	Plane curvature	0.003544	0.003607	0.001564
	Profile curvature	0.005132	0.008222	0.002221
	TWI	0.002520	0.004043	0.000903
	NDVI	0.004635	0.027623	0.001348
	DTF	0.018008	0.068456	0.010054
	DTR	0.203351	0.228164	0.134484
Watershed unit	Rainfall	0.051102	0.067690	0.069879
	Altitude	0.054035	0.066648	0.079492
	Slope	0.046257	0.067201	0.077270
	Plane curvature	0.084715	0.103792	0.164850
	Profile curvature	0.097196	0.114082	0.164520
	TWI	0.040089	0.057586	0.070916
	NDVI	0.01849	0.027924	0.037865
	DTF	0.031382	0.042168	0.048835
	DTR	0.099004	0.173459	0.215150
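
For readers who want to reproduce values of this kind, the sketch below computes an information gain ratio (IGR) for one discretized conditioning factor. It uses the standard gain-ratio definition (information gain divided by the split information of the factor). The discretization of the factors, the function names, and the toy data are assumptions, so the details may differ from the procedure used for Table 1.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain_ratio(factor_classes, labels):
    """IGR of a discretized factor with respect to binary sample labels."""
    n = len(labels)
    # Group the labels by the class of the conditioning factor.
    groups = {}
    for factor_class, label in zip(factor_classes, labels):
        groups.setdefault(factor_class, []).append(label)
    # Expected label entropy after splitting on the factor.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    # Split information penalizes factors with many small classes.
    split_info = entropy(factor_classes)
    if split_info == 0:
        return 0.0  # a constant factor carries no information
    return (entropy(labels) - conditional) / split_info


# Toy data: 1 = debris flow sample, 0 = negative sample.
labels = [0, 0, 1, 1]
igr_informative = information_gain_ratio(["low", "low", "high", "high"], labels)
igr_uninformative = information_gain_ratio(["a", "b", "a", "b"], labels)
print(igr_informative, igr_uninformative)
```

A factor that separates the two classes perfectly yields an IGR of 1, and a factor that is independent of the labels yields 0. The values in Table 1 fall between these extremes.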

Table 2. AUC values of the models based on training datasets.

Sample Representation Form	SVM-Based	Spy-Based	IF-Based
Single grid	0.919	0.932	0.902
Multi-grid	0.876	0.895	0.869
Watershed unit	0.937	0.939	0.946

Table 3. AUC values of the models based on validation datasets.

Sample Representation Form	SVM-Based	Spy-Based	IF-Based
Single grid	0.881	0.911	0.868
Multi-grid	0.866	0.888	0.856
Watershed unit	0.909	0.920	0.932
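
As a reading aid for Tables 2 and 3, the following Python sketch computes the gap between training and validation AUC for each dataset and identifies the strategy with the highest validation AUC for each sample representation form. The numbers are transcribed from the two tables; the variable names and the use of the gap as a rough indicator of generalization are illustrative assumptions, not part of the original analysis.

```python
# AUC values transcribed from Tables 2 and 3 (order: SVM, spy, IF).
strategies = ("SVM", "Spy", "IF")
train_auc = {
    "Single grid": (0.919, 0.932, 0.902),
    "Multi-grid": (0.876, 0.895, 0.869),
    "Watershed unit": (0.937, 0.939, 0.946),
}
valid_auc = {
    "Single grid": (0.881, 0.911, 0.868),
    "Multi-grid": (0.866, 0.888, 0.856),
    "Watershed unit": (0.909, 0.920, 0.932),
}

# Training-minus-validation gap per dataset, rounded to the tables' precision.
auc_gap = {
    form: {
        s: round(t - v, 3)
        for s, t, v in zip(strategies, train_auc[form], valid_auc[form])
    }
    for form in train_auc
}

# Strategy with the highest validation AUC for each representation form.
best_strategy = {
    form: max(zip(values, strategies))[1] for form, values in valid_auc.items()
}

for form in train_auc:
    print(form, auc_gap[form], "best on validation:", best_strategy[form])
```

With the tabulated values, the spy-based strategy ranks first on validation for the single grid and multi-grid datasets, and the IF-based strategy ranks first for the watershed unit dataset.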

Table 4. Susceptibility class and sample statistical results.

Sample Representation Form	Susceptibility Class	Percentage of Class (%)	No. of Debris Flow Samples	Percentage of Debris Flow Samples (%)
Single grid	Low	29.5	4	2.6
	Moderate	15.1	8	5.2
	High	12.6	20	12.9
	Very high	42.8	123	79.3
Multi-grid	Low	35.8	9	5.8
	Moderate	9.2	15	9.7
	High	10.2	26	16.8
	Very high	44.8	105	67.7
Watershed unit	Low	31.0	3	1.9
	Moderate	13.2	5	3.3
	High	20.1	34	21.9
	Very high	35.7	113	72.9
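
The figures in Table 4 can be condensed into two simple indicators per map, as in the Python sketch below: the share of debris flow samples located in the High and Very high classes, and a density ratio for the Very high class (share of debris flow samples divided by the share of the class). The data are transcribed from Table 4. The indicator definitions, the names, and the reading of the "Percentage" column as the share of the mapped area or units in each class are assumptions introduced here for illustration.

```python
# Rows transcribed from Table 4, ordered Low, Moderate, High, Very high.
# class_share: the table's "Percentage" column; flow_count: debris flow samples.
table4 = {
    "Single grid": {
        "class_share": (29.5, 15.1, 12.6, 42.8),
        "flow_count": (4, 8, 20, 123),
    },
    "Multi-grid": {
        "class_share": (35.8, 9.2, 10.2, 44.8),
        "flow_count": (9, 15, 26, 105),
    },
    "Watershed unit": {
        "class_share": (31.0, 13.2, 20.1, 35.7),
        "flow_count": (3, 5, 34, 113),
    },
}


def summarize(row):
    """Return (share in High + Very high, density ratio of Very high)."""
    total = sum(row["flow_count"])
    # Percentage of debris flow samples in the two highest classes.
    captured = 100 * sum(row["flow_count"][2:]) / total
    # Sample share of the Very high class divided by its class share.
    very_high_share = 100 * row["flow_count"][3] / total
    ratio = very_high_share / row["class_share"][3]
    return round(captured, 1), round(ratio, 2)


summary = {form: summarize(row) for form, row in table4.items()}
for form, (captured, ratio) in summary.items():
    print(f"{form}: {captured}% in High/Very high, density ratio {ratio}")
```

Under these assumptions, the watershed unit map places 94.8% of the debris flow samples in the two highest classes with a density ratio of about 2.04, compared with 92.3% and 1.85 for the single grid map and 84.5% and 1.51 for the multi-grid map.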

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
