Article

A Comparative Study of Susceptibility and Hazard for Mass Movements Applying Quantitative Machine Learning Techniques—Case Study: Northern Lima Commonwealth, Peru

by Edwin Badillo-Rivera 1,2,*, Manuel Olcese 3, Ramiro Santiago 3, Teófilo Poma 3, Neftalí Muñoz 3, Carlos Rojas-León 3, Teodosio Chávez 3, Luz Eyzaguirre 4, César Rodríguez 5 and Fernando Oyanguren 5

1 Centre for Climate Change and Disaster Risk Research, Universidad Nacional del Callao, Callao 07011, Peru
2 Faculty of Environmental Engineering and Natural Resources, Universidad Nacional del Callao, Callao 07011, Peru
3 Faculty of Geological, Mining and Metallurgical Engineering, Universidad Nacional de Ingeniería, Lima 15333, Peru
4 Faculty of Petroleum, Natural Gas and Petrochemical Engineering, Universidad Nacional de Ingeniería, Lima 15333, Peru
5 Faculty of Electrical and Electronic Engineering, Universidad Nacional del Callao, Callao 07011, Peru
* Author to whom correspondence should be addressed.
Geosciences 2024, 14(6), 168; https://doi.org/10.3390/geosciences14060168
Submission received: 9 May 2024 / Revised: 8 June 2024 / Accepted: 10 June 2024 / Published: 14 June 2024

Abstract

This study addresses the importance of conducting mass movement susceptibility mapping and hazard assessment using quantitative techniques, including machine learning, in the Northern Lima Commonwealth (NLC). A preliminary exploration of the topographic variables revealed high correlation and multicollinearity among some of them, which led to dimensionality reduction through principal component analysis (PCA). Six susceptibility models were generated using weights of evidence, logistic regression, multilayer perceptron, support vector machine, random forest, and naive Bayes methods to produce quantitative susceptibility maps and to assess the hazard associated with two scenarios: the first being an El Niño phenomenon and the second an earthquake exceeding 8.8 Mw. The main findings indicate that machine learning models exhibit excellent predictive performance for the presence and absence of mass movement events, as all models surpassed an AUC value of 0.9, with the random forest model standing out. In terms of hazard levels, in the event of an El Niño phenomenon or an earthquake exceeding 8.8 Mw, approximately 40% and 35%, respectively, of the NLC area would be exposed to the highest hazard levels. The importance of integrating methodologies in mass movement susceptibility models is also emphasized; these methodologies include correlation analysis, multicollinearity assessment, dimensionality reduction of variables, and the coupling of statistical models with machine learning models to improve the predictive accuracy of the latter. The findings of this research are expected to serve as a supportive tool for land managers in formulating effective disaster prevention and risk reduction strategies.

1. Introduction

More than 2.8 million people [1] live on the hillsides of Metropolitan Lima, occupying the territory in a disordered, disjointed, and entirely informal manner, which results in high vulnerability and exposure to natural hazards (Figure 1). Among these hazards, mass movements (MMs) are induced by the El Niño phenomenon and by earthquakes, especially on the central coast of Peru, which is located in an area with a 278-year seismic gap since the 1746 earthquake [2] and where a seismic event of a magnitude greater than 8.8 Mw is expected. If this occurs, the level of ground shaking is expected to trigger landslides, topples, and rockfalls, leading to loss of life, damage to infrastructure, and blockage of the access routes that connect Lima with the most vulnerable districts and the highly populated areas settled on the hillsides of Metropolitan Lima [3]. In addition, the El Niño phenomenon has recurred more frequently in recent years, with torrential rains that caused debris flows, landslides, and collapses in many departments along the Peruvian coast, including Lima, damaging infrastructure and affecting the lives and health of people [4]. In the Northern Lima Commonwealth (NLC), approximately 60% of emergency reports correspond to MMs [5], resulting in human and economic losses. The lack of an effective system for assessing and predicting these hazards, as well as of adequate mitigation strategies, poses a latent threat.
The assessment of mass movement susceptibility (MMS) is part of the first aspect of disaster risk management, which is important for urban planning, response, and post-disaster reconstruction [6]. In recent decades, MMS mapping has been widely used to zone probable areas for future MMs based on identifying areas of past occurrences and areas with similar or identical physical characteristics [7]. To perform MMS mapping, various models based on Geographic Information Systems (GISs) have been used. These models can be classified into qualitative or knowledge-based methods such as the heuristic method [8], semi-quantitative methods such as the analytic hierarchy process [9], quantitative or data-driven methods such as bivariate or multivariate statistical methods [6,8,10,11,12,13], machine learning [14,15,16,17], and hybrid approaches [18,19,20,21,22,23]. Additionally, physics-based models have been employed in the assessment of landslide susceptibility [24,25,26] and could be used for MMs to enhance the understanding of the mechanisms that generate these phenomena. For our study area, as recognized by [27], the availability of input data limits the results that the models can achieve.
The Geological, Mining, and Metallurgical Institute of Peru (INGEMMET) has developed the regional scale MMS model in Metropolitan Lima [28] using a qualitative (heuristic) expert judgment approach. This input is utilized by the National Center for Estimation, Prevention, and Reduction of Disaster Risk (CENEPRED) to conduct risk scenarios by adding triggering variables and hazard exposure analysis. The free availability of conditioning and triggering variables data from national and international technical–scientific institutions, coupled with significant advancements in GIS tools, software, and open-source codes for processing high-resolution spatial information over large areas of land, provides the necessary resources to close the gaps in exhaustive studies for exploring data-driven models for MMS mapping. These models include weight of evidence (WoE) and machine learning (ML) techniques such as logistic regression (LR), multilayer perceptron (MLP), random forest (RF), support vector machine (SVM), and naive Bayes (NB) for estimating susceptibility and hazard due to MMs. Addressing this knowledge gap is imperative to enhance the prediction capability and response to such hazards.

2. Materials and Methods

2.1. Study Area

The study area covers seven districts of the NLC: Los Olivos, San Martin de Porras, Independencia, Comas, Carabayllo, Ancon, and Puente Piedra. It covers an area of 793 km², ranging from 11.572° to 12.039° S latitude and from 77.199° to 76.810° W longitude. The elevation of the study area ranges from 26.5 to 2748.0 m above sea level (m asl), a maximum elevation difference of 2721.5 m. The NLC districts are located on the western–central margin of the South American continent, in the zone where the Nazca plate subducts beneath the South American plate. In addition, precipitation anomalies during the El Niño phenomenon have historically triggered MM phenomena on the coast of Lima [29,30]. The oldest geological units belong to the Upper Jurassic and Pleistocene, which outcrop in the high reliefs of the Andean foothills in the highest areas of the study region; in contrast, the more recent units form the valley and plain fill that descends to the Pacific Ocean coast [31]. The rock types vary in nature, including andesitic volcanic rocks interbedded with marine shales, limestones, and sandstones; intrusive rocks, such as diorites, tonalites, granodiorites, and granites, outcrop in the peripheries of urban areas and are altered and easily disintegrated; fluvio-alluvial deposits, consisting of a matrix of sands and clays that incorporate gravels, pebbles, and boulders, cover most of Lima; lastly, aeolian deposits of fine sand and silt host numerous human settlements [32].

2.2. Mass Movement Inventory (MMI)

The MMI records the spatial distribution of MMs and provides crucial information for studying the relationships between the occurrence of MMs and the conditioning and triggering causal factors [12,33]. The MMI used in this study was compiled by INGEMMET (covering 1960 to 2023, the last update) and photointerpreted in the office at a scale of 1:25,000 from aerial photographs and satellite images from Landsat-5 and the Google Earth platform. In addition, the MMI was verified through extensive fieldwork carried out by the technical team of INGEMMET. The types of MM phenomena considered in this study are landslides, topples, and rockfalls, which could be associated with El Niño and seismic events. Non-MMs were mapped using high spatial-resolution satellite images from Google Earth, employing a hybrid approach. Initially, flat areas such as rivers, streams, and slopes of less than 5° were mapped as proposed by [34]. Furthermore, the MMS model developed by INGEMMET was used to complement the non-MM samples in areas of very low, low, and medium susceptibility that coincide with geological zones of alluvial deposits and geomorphological features such as slopes or alluvial-fan piedmonts. These are moderate slope areas where mass movements such as landslides, rockfalls, and topples do not occur. As suggested by [35,36], the hybrid approach for mapping non-MM samples improves the performance and reliability of the models.
A total of 236 MM polygons were randomly selected from the 329 mapped, using the Vector > Research Tools > Random Selection tool in QGIS. Among these, 189 (80%) were used for training, and 47 (20%) were used for evaluation. Debris flows were not considered because these polygons are mostly located in the lower parts of the valley, which could affect the results of the MMS models.
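The selection and split described above can be sketched as follows. This is only an illustration in plain Python, not the QGIS workflow used by the authors; the polygon IDs, the seed, and the function name `select_and_split` are hypothetical.

```python
import random

def select_and_split(n_mapped=329, n_selected=236, train_fraction=0.8, seed=42):
    """Randomly pick polygon IDs, then split them into training and evaluation sets."""
    rng = random.Random(seed)              # fixed seed: an illustrative assumption
    polygon_ids = list(range(n_mapped))    # hypothetical IDs 0..328
    selected = rng.sample(polygon_ids, n_selected)   # sampling without replacement
    n_train = round(train_fraction * n_selected)     # 0.8 * 236 = 188.8 -> 189
    return selected[:n_train], selected[n_train:]

train_ids, test_ids = select_and_split()
print(len(train_ids), len(test_ids))  # 189 47
```

Because `random.sample` already returns the IDs in random order, slicing the result gives a random 80/20 partition with no polygon in both sets.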

2.3. Data

Table 1 shows the vector and raster data used in the research. All variables, including geological, environmental, and triggering variables, were standardized to a raster format with a 12.5 m pixel size. The final working scale of the MM susceptibility and hazard mapping was 1:10,000.
In addition, Table 2 presents the DEM-derived topographic products and the geological and environmental variables used in the susceptibility and hazard models for the NLC; they are mapped in Figure A1 and Figure A2.

2.4. Methods

Figure 2 shows the flowchart of this study, which is divided into five steps: the first step is data downloading and preparation, followed by variable exploration, geospatial modelling of the MMS, evaluation of MMS models using the area under the curve (AUC), and estimation of MM hazard under two scenarios.

2.4.1. Step 1: Input Datasets

Vector and raster data were downloaded from national and international geospatial repositories as indicated in Table 1. The data were standardized to the raster format at the same spatial resolution, using QGIS. Training and testing data were selected using the random selection tool in QGIS.

2.4.2. Step 2: Exploratory Variable Methods

Before applying the MMS model, it is important to ensure that there is no dependence between variables (variable correlation) and that the variables are not influenced by multicollinearity.

Pearson Correlation

To check for linear correlation among the variables, the Pearson correlation coefficient was applied, which is considered an effective method for this purpose [13]. The following formula was used:
$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \tag{1}$$
where $\sigma_{xy}$ is the covariance of variables x and y, and $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y, respectively. The Pearson coefficient ranges from −1 to 1: if r < 0, the correlation is negative and becomes stronger as r approaches −1; if r > 0, the correlation is positive and becomes stronger as r approaches 1; if r = 0, there is no linear relationship between the variables. Additionally, as indicated by [37], an absolute value greater than 0.8 could lead to multicollinearity issues.
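As a minimal sketch of the Pearson formula, the coefficient can be computed directly from the covariance and the standard deviations. The sample values below are invented for illustration and are not taken from the study's rasters.

```python
import math

def pearson_r(x, y):
    """Pearson coefficient: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov_xy / (std_x * std_y)

# Invented sample values standing in for two topographic variables.
slope = [5.0, 12.0, 20.0, 33.0, 41.0]
roughness = [1.1, 2.3, 4.2, 6.5, 8.4]

r = pearson_r(slope, roughness)
flagged = abs(r) > 0.8   # threshold from the text: |r| > 0.8 suggests multicollinearity
print(round(r, 3), flagged)
```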

Multicollinearity

Multicollinearity is a condition in which two or more independent variables in a multiple regression model are strongly related [38]. To verify multicollinearity, the variance inflation factor (VIF) was used; in practical terms, a VIF value greater than 5 or 10 [39] indicates that the regression coefficients are poorly estimated due to multicollinearity [40].
$$VIF = \frac{1}{1 - R_j^2} \tag{2}$$
where $R_j^2$ is the coefficient of determination for the regression of $x_j$ on the other explanatory variables.
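The VIF definition can be illustrated with a short NumPy sketch that regresses each variable on the others and applies 1 / (1 − R²). The synthetic data and variable names are assumptions made for the example; the source does not state which tool was used to compute the VIF.

```python
import numpy as np

def vif(X):
    """Return the VIF of every column of X (rows = samples, columns = variables)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_vars = X.shape
    out = np.empty(n_vars)
    for j in range(n_vars):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n_samples), others])  # intercept + other variables
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)       # ordinary least squares
        residual = y - A @ coef
        r2 = 1.0 - residual.var() / y.var()                # coefficient of determination
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: "roughness" is almost a copy of "slope"; "aspect" is independent.
rng = np.random.default_rng(0)
slope = rng.normal(size=200)
roughness = slope + 0.05 * rng.normal(size=200)
aspect = rng.normal(size=200)
values = vif(np.column_stack([slope, roughness, aspect]))
print(values > 5)   # the first two variables exceed the threshold of 5
```

A perfectly collinear column would give R² = 1 and a division by zero, so real data may need a guard for that case.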

Principal Component Analysis (PCA)

PCA is a multivariate mathematical procedure that performs the orthogonal transformation of a set of correlated variables into a smaller number of uncorrelated variables or principal components [41], which are mutually independent [42]. Once the multicollinearity among variables has been assessed and confirmed, it is common practice to exclude highly correlated variables that influence each other. However, PCA allows for addressing the multicollinearity issue among influential factors without the need to eliminate variables [10]. This is considered a good option because natural processes are integral, and excluding a variable could lead to loss of information in the analysis of the studied phenomenon. Additionally, PCA allows for evaluating the impact of different influencing factors on MMS. A more extensive discussion on the mathematical foundations of PCA can be found in [43,44,45], and the PCA summary procedure assumes that the original data are as follows:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix} \tag{3}$$
where m is the number of causal factors, n is the number of MMs, and each $x_{nm}$ represents the value of an MM factor; the mean and standard deviation of these factors can be calculated. For this matrix, the eigenvalues and eigenvectors can be determined by the following:
$$(R - \lambda_i I)\, l_i = 0 \tag{4}$$
where R is the correlation (or covariance) matrix computed from X, I is the identity matrix, and $\lambda_i$ and $l_i$ are the eigenvalues and eigenvectors, respectively; $l_i$ corresponds to the principal components, and $\lambda_i$ corresponds to the variance captured by each principal component. For the first k principal components, the cumulative contribution rate can be calculated by the following equation:
$$\alpha = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_m} \times 100\% \tag{5}$$
The majority of the information from the input variables is found in the first principal component; each principal component can be expressed as shown in the following equation [46]:
$$PC_m = a_{m1} y_1 + a_{m2} y_2 + \cdots + a_{mn} y_n \tag{6}$$
where $PC_m$ is the m-th principal component, and $a_{mn}$ is the weight of the n-th variable in the m-th PC.
The number of PCs was determined based on the level of variance explanation, which ranges from 0 to 1 or 0% to 100%. In this study, a minimum threshold of 0.6 for variance explanation was used; however, in some similar studies, values as low as 0.40 [42] or as high as 0.68 [47] for variance explanation based on the number of PC have been found. The choice of the variance explanation value is based on the objective of reducing the correlation and multicollinearity of the variables.
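The procedure above, from eigen-decomposition to the choice of the number of PCs, can be sketched with NumPy. The sketch assumes that the decomposition is applied to the correlation matrix of standardized variables, which the text does not state explicitly, and it uses synthetic data with seven variables only to mirror the number of topographic variables.

```python
import numpy as np

def pca_components_needed(X, threshold=0.6):
    """Return (k, cumulative variance ratios, scores) for the first k components."""
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)                  # correlation matrix (assumed)
    eigenvalues, eigenvectors = np.linalg.eigh(R)     # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]             # sort by decreasing variance
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Smallest k whose cumulative contribution reaches the threshold.
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(eigenvalues))
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized variables
    scores = Z @ eigenvectors[:, :k]                  # PC_m = sum_n a_mn * y_n
    return k, cumulative, scores

# Synthetic data: two correlated pairs plus three independent variables.
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 3))
X_demo = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=300),
    base[:, 2], rng.normal(size=300), rng.normal(size=300),
])
k, cumulative, scores = pca_components_needed(X_demo, threshold=0.6)
print(k, np.round(cumulative, 2))
```

The resulting score columns are mutually uncorrelated, which is the property the study relies on to remove the multicollinearity among the topographic variables.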

2.4.3. Step 3: Mass Movement Susceptibility Modelling

After performing an exploratory analysis of the topographic variables and reducing their dimensionality to PCA-1, PCA-2, and PCA-3, the geological and environmental variables were integrated as U1, U2, U3, and U4. The MMS models were constructed using two composite variables, the sum of the topographic PCAs (PCA_123) and the sum of the geological–environmental components (U_1234), after applying the WoE process. These MMS models were reclassified into quintiles at 20, 40, 60, 80, and 100% using the NumPy Python library. Finally, susceptibility areas were calculated using QGIS.
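One possible way to perform the quintile reclassification with NumPy is sketched below. The use of `np.percentile` and `np.digitize` is an assumption, since the text names only the library and not the functions; nodata handling is omitted.

```python
import numpy as np

def reclassify_quintiles(values):
    """Map continuous values to classes 1..5 using the 20/40/60/80 percentiles."""
    values = np.asarray(values, dtype=float)
    breaks = np.percentile(values, [20, 40, 60, 80])
    # right=True: a value equal to a break falls into the lower class.
    return np.digitize(values, breaks, right=True) + 1

susceptibility = np.arange(1, 101)            # invented index values 1..100
classes = reclassify_quintiles(susceptibility)
print(np.bincount(classes)[1:])               # pixels per class: 20 in each quintile
```

Classes 1 to 5 would correspond to the very low, low, medium, high, and very high levels used later in the article.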

Weights of Evidence (WoE)

Quantitative techniques based on statistics establish functional relationships between instability factors and the past and present distribution of mass movements [42]. The WoE method proposed by [48] is a bivariate statistical technique based on Bayesian probability theory. The stability or instability of certain regions can be estimated through a set of conditioning factors, which are measured by the relationship and spatial distribution of areas known and affected by MM [47]. In this study, WoE was used to determine the relationship between factors influencing the occurrence of MM; to map the MMS; and as input variables for the LR, MLP, SVM, RF, and NB models.
The first step in determining the WoE for each class involves obtaining the prior probability of finding MM, which is estimated as the area affected by MM (L) in the past over the total study area (A).
$$P(L) = \frac{N(L)}{N(A)} \tag{7}$$
Refs. [48,49] present the mathematical foundations of the methodology. The method assigns positive ($W^{+}$) or negative ($W^{-}$) weights to each class of conditioning variables based on the degree of association between the variable class and the spatial distribution and density of evidence of MMs. $B$ and $\bar{B}$ represent the presence and absence of the conditioning factor in potential mass movements, respectively, and $\bar{L}$ indicates the absence of mass movements.
$$W_i^{+} = \ln \frac{P(B \mid L)}{P(B \mid \bar{L})} \tag{8}$$
$$W_i^{-} = \ln \frac{P(\bar{B} \mid L)}{P(\bar{B} \mid \bar{L})} \tag{9}$$
where a positive $W_i^{+}$ (>0) indicates that the variable is present where mass movements are present, and its magnitude represents the positive correlation between the presence of the factor and MMs. A negative $W_i^{+}$ (<0), on the other hand, indicates that the absence of the factor contributes to the generation of MMs. $W_i^{-}$ is used to evaluate the importance of the absence of the factor in the occurrence of an MM: when $W_i^{-}$ is positive (>0), the absence of the variable favors the generation of the MM; when it is negative (<0), it does not [49]. Both $W_i^{+}$ and $W_i^{-}$ are estimated for each class of variables.
The contrast value or final weight, $W_f$ (Equation (10)), measures the correlation between the conditioning factor and the MM. If $W_f = 0$, the spatial distribution of MMs is independent of the considered variable. If $W_f > 0$, there is a positive association between the variable and the generation of the MM. Lastly, if $W_f < 0$, there is a negative association, meaning that the absence of the factor contributes to the generation of the MM [49,50].
$$W_f = W^{+} - W^{-} \tag{10}$$
Finally, to obtain the MMS map of the conditioning factors by WoE, the algebraic sum of the contrast values of all variables is calculated. That is, the weights of each class of the variables are summed pixel by pixel, and the sum is the susceptibility map.
$$MMS_{WoE} = W_{f,\,\text{Variable 1}} + W_{f,\,\text{Variable 2}} + \cdots + W_{f,\,\text{Variable } n} \tag{11}$$
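The weights can be computed from pixel counts per class, as in the following sketch. The conditional probabilities are estimated as pixel-count ratios, a common way to apply WoE that is assumed here rather than taken from the text, and the counts are invented.

```python
import math

def weights_of_evidence(class_pixels, class_mm_pixels):
    """Return (W+, W-, Wf) for each class of one conditioning variable.

    class_pixels[i]    : total pixels in class i
    class_mm_pixels[i] : pixels in class i that are affected by MMs
    """
    total_mm = sum(class_mm_pixels)
    total_stable = sum(class_pixels) - total_mm
    results = []
    for n_class, n_mm in zip(class_pixels, class_mm_pixels):
        n_stable = n_class - n_mm
        p_b_l = n_mm / total_mm                                 # P(B | L)
        p_b_notl = n_stable / total_stable                      # P(B | not L)
        p_notb_l = (total_mm - n_mm) / total_mm                 # P(not B | L)
        p_notb_notl = (total_stable - n_stable) / total_stable  # P(not B | not L)
        w_plus = math.log(p_b_l / p_b_notl)
        w_minus = math.log(p_notb_l / p_notb_notl)
        results.append((w_plus, w_minus, w_plus - w_minus))     # contrast Wf = W+ - W-
    return results

# Invented counts for three classes of a single variable (e.g., slope classes).
weights = weights_of_evidence([4000, 3000, 3000], [20, 80, 300])
for w_plus, w_minus, w_f in weights:
    print(round(w_plus, 2), round(w_minus, 2), round(w_f, 2))
```

A class with no MM pixels, or one that contains all of them, would produce a logarithm of zero, so real inventories usually need special handling of empty classes. The susceptibility value of a pixel is then the sum of the contrasts of the classes it falls into across all variables.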

Logistic Regression (LR)

LR is considered one of the most popular statistical methods for multivariate regression analysis used to investigate binary response from a set of measurements [51] in the earth sciences. In other words, it estimates the relationship between a dependent variable and multiple independent variables [52]. The variables can be continuous, discrete, or a combination of both; they can have a normal or non-normal distribution, and the dependent variable is dichotomous [53].
In the analysis of MMS, the dependent variable is the absence (probability of occurrence, 0) or presence (probability of occurrence, 1) of an MM, which, for this study, was taken from the MMI of INGEMMET. LR transforms the dichotomous dependent variable into a logit variable, which can be used to form a multivariate regression relationship between the dependent variable and the independent variables, which, in this case, are the multiple conditioning factors [54]. The results of the LR estimate the probability of the presence or absence of an MM based on the predictor variables, according to the following equation.
$$p = \frac{1}{1 + e^{-z}} \tag{12}$$
where p is the probability that the dependent variable equals 1 (presence of an MM); it ranges from 0 (minimum probability of MM presence) to 1 (maximum probability of MM presence) along an S-shaped curve; e is Euler's number (the base of the natural logarithm), and z is the linear combination of the independent variables. The linear combination, z, can be expressed by the following formula [51].
$$z = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + \cdots + B_n X_n \tag{13}$$
where X is the vector of independent variables (conditioning factors), $X = (x_1, x_2, \ldots, x_n)$, and B is the vector of estimated coefficients for the independent variables, $B = (b_1, b_2, \ldots, b_n)$. $B_0$ is the intercept of the model, and n is the number of independent variables [55]. Based on Equations (12) and (13), the LR model can be expressed in its extended form as Equation (14). The statsmodels module of Python 3.6 was used.
$$p = \frac{1}{1 + e^{-(B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + \cdots + B_n X_n)}} \tag{14}$$
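How the fitted coefficients turn predictor values into a probability can be shown with a few lines of plain Python. The coefficients and inputs below are hypothetical; in the study they would come from the statsmodels fit on the two composite variables.

```python
import math

def mm_probability(x, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-z)), with z = B0 + sum(B_i * X_i)."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two predictors (e.g., PCA_123 and U_1234 contrasts).
coefficients = [1.2, 0.8]
intercept = -0.5

p_low = mm_probability([-2.0, -1.0], coefficients, intercept)
p_high = mm_probability([2.0, 1.5], coefficients, intercept)
print(round(p_low, 3), round(p_high, 3))
```

With positive coefficients, larger contrast values move z upward and the probability toward 1, which is the S-shaped behavior described above.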

Multilayer Perceptron (MLP)

The MLP is an artificial neural network composed of multiple layers of interconnected neurons, commonly used in supervised learning. It is used to solve complex classification and regression problems, thanks to its ability to learn nonlinear representations of data based on mathematical algorithms that mimic the learning process of the human brain [56]. Structurally, the network consists of an input layer, hidden layers with different numbers of neurons, and an output layer (i.e., the MMS model). The neurons in the layers are connected through weight values, which are trained and tested to form a stable network structure with decision-making capabilities [8]. Considering $X = X_i$ (i = 1, 2, …, n) as the vector of factors conditioning the MM, and $Y_j = (Y_1, Y_2)$ indicating the class of MM or non-MM, MLP can be expressed by the following equation:
$$Y_i = f(X_i) + b_j \tag{15}$$
where $b_j$ is the bias value of the neuron, and $f(X_i)$ is a hidden function that is optimized by the adjustable network weights during the training process for a given network architecture [57,58]. The MLP model was implemented using the scikit-learn module of Python 3.6.
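A forward pass through a very small network gives an idea of how the weights and biases combine the conditioning factors. The architecture here (one hidden layer, ReLU activation, sigmoid output) and all weights are assumptions chosen for illustration; they are not the configuration reported in Table 3.

```python
import math

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with ReLU, followed by a sigmoid output neuron."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)   # ReLU(w . x + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return 1.0 / (1.0 + math.exp(-z))                        # probability of the MM class

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output.
probability = mlp_forward(
    x=[1.0, 2.0],
    hidden_weights=[[1.0, 0.0], [0.0, -1.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[2.0, 3.0],
    output_bias=-1.0,
)
print(round(probability, 3))
```

Training adjusts the weights and biases so that the output matches the MM and non-MM labels; this sketch only evaluates a network whose weights are already given.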

Support Vector Machine (SVM)

Proposed by [59], SVM is a supervised ML method based on the concept of an optimal separating hyperplane in the sample space, chosen so that the margin between the hyperplane and the nearest samples of the two classes is maximized [60,61,62]. This classification capability makes SVM suitable for solving non-linear classification and regression problems; thus, it is one of the most used ML techniques in assessing MMS. Consider a matrix of conditioning factors, $X = X_i$ (i = 1, 2, …, n); $Y_j = (Y_1, Y_2)$ is a vector of MM classes (non-MM and MM), and the optimal hyperplane can be obtained by solving the classification function as follows:
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} a_i Y_j K(X, X_i) + b \right) \tag{16}$$
where $a_i$ is a positive real constant, n is the number of conditioning factors, b is the bias, and $K(X, X_i)$ is the kernel function, whose binary solution can be reviewed in [63,64,65]. The SVM model was implemented using the scikit-learn module of Python 3.6.
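The classification function can be evaluated directly once the multipliers, support vectors, and bias are known. The sketch below assumes an RBF kernel and sums over support vectors, as in the standard SVM formulation; the support vectors, multipliers, and `gamma` value are hypothetical.

```python
import math

def rbf_kernel(x, x_i, gamma=0.5):
    """Radial basis function kernel K(x, x_i) = exp(-gamma * ||x - x_i||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_i)))

def svm_predict(x, support_vectors, labels, alphas, bias, gamma=0.5):
    """Sign of the kernel expansion: +1 for the MM class, -1 for the non-MM class."""
    score = sum(
        a * y * rbf_kernel(x, sv, gamma)
        for a, y, sv in zip(alphas, labels, support_vectors)
    ) + bias
    return 1 if score >= 0 else -1

# Hypothetical trained model with one support vector per class.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]

print(svm_predict([1.9, 2.1], support_vectors, labels, alphas, bias=0.0))
print(svm_predict([0.1, 0.0], support_vectors, labels, alphas, bias=0.0))
```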

Random Forest (RF)

Proposed by [66], random forest is a supervised classification algorithm that is based on classification and regression trees used to create each decision tree. It utilizes a random subset of variables at each node based on a Bootstrap sample, and RF generates thousands of random binary trees to form a forest [17]. In cases with large amounts of sample data, the more feature elements present, the fewer errors and overfitting are generated [61,67]. RF is frequently used to determine the MMS [17,62,67,68]. Equation (17) represents the algorithm of RF.
$$\hat{C}_{rf} = \text{majority vote}\, \{ \hat{C}_n(x) \}_{n=1}^{N} \tag{17}$$
where $\hat{C}_{rf}$ is the final class predicted by the RF, $\hat{C}_n$ is the class predicted by the n-th decision tree in the forest for the observation (x), and N indicates the number of decision trees in the forest [69]. In this study, the Python scikit-learn module was applied to implement the RF model.
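The majority vote itself is simple to illustrate. In this sketch the per-tree predictions are given as a plain list; growing the trees is left to the library, and the votes shown are invented.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Invented votes of N = 5 trees for one pixel (1 = MM, 0 = non-MM).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # 1
```

With an even number of trees a tie is possible; `Counter.most_common` then returns the class that was encountered first, so an odd N or an explicit tie rule avoids ambiguity.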

Naive Bayes (NB)

NB is an important algorithm in the field of ML and data mining [61], applied to various fields, and is based on the Bayes probability theorem [70]; it is well suited to high-dimensional data and is not affected by the distribution of the data [71]. NB is a classifier that assumes absolute independence between attributes [70]. The NB classification process within a set of factors affecting the prior probability of an MM occurrence can be expressed as follows:
$$Y_{NB} = \underset{Y_i \in \{\text{non-MM},\, \text{MM}\}}{\arg\max}\; P(Y_i) \prod_{i=1}^{n} P(X_i \mid Y_i) \tag{18}$$
where $X = (X_1, X_2, \ldots, X_n)$ is the vector of factors affecting the MM, $Y_i = (Y_1, Y_2)$ is the vector of categorical variables (non-MM or MM), $P(Y_i)$ is the prior probability of event $Y_i$, $P(X_i \mid Y_i)$ is the conditional probability, and n is the number of conditioning factors [63,65]. In this study, the Python scikit-learn module was applied to implement the NB model.
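The argmax rule can be sketched for categorical factors as follows. The priors and conditional probabilities are invented, and the table-based form is an assumption made for readability; the scikit-learn variant used in the study is not specified in the text.

```python
def naive_bayes_predict(x, priors, likelihoods):
    """Pick the class maximizing P(Y) * prod_i P(X_i | Y).

    priors      : {class: P(Y)}
    likelihoods : {class: [ {value: P(X_i = value | Y)} for each factor i ]}
    """
    best_class, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            score *= likelihoods[label][i][value]   # independence assumption
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# Invented probabilities for two factors: slope class and lithology.
priors = {"MM": 0.5, "non-MM": 0.5}
likelihoods = {
    "MM": [{"steep": 0.8, "gentle": 0.2}, {"intrusive": 0.7, "alluvial": 0.3}],
    "non-MM": [{"steep": 0.2, "gentle": 0.8}, {"intrusive": 0.3, "alluvial": 0.7}],
}

print(naive_bayes_predict(("steep", "intrusive"), priors, likelihoods))
print(naive_bayes_predict(("gentle", "alluvial"), priors, likelihoods))
```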

Machine Learning Hyperparameters

In this research, there is no intention to evaluate the influence of hyperparameter optimization on the results of the MMS mapping. The choice of hyperparameters has been made based on the literature, specifically on the values and configurations most used in susceptibility mapping studies, as presented in Table 3. As mentioned by [60], most ML models can achieve excellent accuracy with their default set of hyperparameters, as it is nearly impossible to manually search through combinations due to the countless possibilities of trial and error.

2.4.4. Step 4: Model Accuracy Evaluation

In Step 4, the MMS models were evaluated using the Receiver Operating Characteristic (ROC), the area under the curve (AUC), the F-1 score, and the accuracy (ACC).

ROC Curve and AUC

The ROC curve is a graphical representation of sensitivity (true positive rate) on the y-axis and 1-specificity (false positive rate) on the x-axis [10]. The AUC of the ROC allows for the assessment of model fit and prediction [42] and is commonly applied as a criterion for selecting the most appropriate model to determine the MMS [10]. The AUC value indicates the percentage of observed positive pixels that are correctly predicted, quantifying the probability that susceptibility models correctly classify the presence or absence of mass movement phenomena based on a set of independent variables or conditioning factors (Figure A1).
The AUC ranges from 0 to 1. AUC values between 0.5 and 0.6 are considered poor models. AUC values between 0.6 and 0.7 are considered average models. AUC values between 0.7 and 0.8 are considered good models. AUC values between 0.8 and 0.9 can be considered very good models, and an AUC value greater than 0.9 is considered an excellent model [11].
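The AUC and the rating bands above can be sketched in plain Python. The AUC is computed here through its rank interpretation, the probability that a randomly chosen MM sample scores higher than a randomly chosen non-MM sample; the labels and scores are invented, and the treatment of exact band boundaries and of values below 0.5 is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5          # ties count as half
    return wins / (len(positives) * len(negatives))

def rate_model(auc):
    """Qualitative rating following the bands cited in the text."""
    if auc > 0.9:
        return "excellent"
    if auc > 0.8:
        return "very good"
    if auc > 0.7:
        return "good"
    if auc > 0.6:
        return "average"
    return "poor"                    # assumption: anything at or below 0.6

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc, rate_model(auc))  # 0.75 good
```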

F-1 Score

The F-1 score summarizes the performance of the model, and high values are optimal [61]. This indicator combines precision and sensitivity [15]. Precision is determined by dividing the true positives by the total number of pixels classified as MM. Sensitivity is the proportion of actual positive samples that are correctly predicted as positive. Therefore, the F-1 score is the harmonic mean of precision and sensitivity [72].

Accuracy (ACC)

The ACC is the ratio of the number of correct predictions to the total number of evaluation samples [71].
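Both the F-1 score and the ACC follow from the four cells of the confusion matrix, as the following sketch shows. The labels are invented, and the function assumes that at least one sample is predicted as MM and at least one is truly MM, so that no denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Return (precision, sensitivity, F-1 score, accuracy) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)            # correct among those classified as MM
    sensitivity = tp / (tp + fn)          # correct among the actual MM samples
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean
    accuracy = (tp + tn) / len(y_true)    # correct predictions over all samples
    return precision, sensitivity, f1, accuracy

# Invented evaluation labels (1 = MM, 0 = non-MM).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```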

Cross-Validation (CV)

Cross-validation is a resampling method used to test the predictive robustness of a statistical model [73,74]. The idea is to further split the training dataset, fitting the model on one part and evaluating it on another. A data-splitting strategy of five equal parts (k = 5 folds) was used, performed randomly using the scikit-learn library in Python. After splitting the data into k-folds, the model was trained and validated k times. This procedure was applied to all ML models, obtaining the average and standard deviation of AUC (AUC-CV), F-1 score (F-1score-CV), and accuracy (ACC-CV).
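The five-fold procedure can be sketched without any library to show what happens at each iteration. The study used scikit-learn for the split; this sketch only mirrors the idea, and the fold scores passed to `summarize` are invented. The sample standard deviation is used, which is an assumption because the text does not say which estimator was reported.

```python
import random
import statistics

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training indices, validation indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # random assignment to folds
    folds = [indices[i::k] for i in range(k)]      # k folds of near-equal size
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def summarize(scores):
    """Average and standard deviation of a metric over the folds."""
    return statistics.mean(scores), statistics.stdev(scores)

splits = list(k_fold_indices(20, k=5))
print([len(validation) for _, validation in splits])

# Invented per-fold AUC values, standing in for the output of a trained model.
mean_auc, std_auc = summarize([0.96, 0.97, 0.98, 0.97, 0.97])
print(round(mean_auc, 3), round(std_auc, 3))
```

Each sample is used for validation exactly once and for training in the other four iterations.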

2.4.5. Step 5: Mass Movement Hazard

In Step 5, the hazard was determined under two scenarios: the El Niño phenomenon and a seismic event of a magnitude greater than 8.8 Mw. For the El Niño phenomenon, anomalies of the maximum accumulated rainfall values for January, February, March, and April during the most recent El Niño events with extraordinary rainfall in the country were used, namely 1983/84, 1997/98, 2017, and 2023 (precipitation anomalies ranging from 100% to 350% have predominantly been recorded within the study area) [75]. For the seismic event, the seismic–geotechnical microzonation of the soil under a seismic scenario of 8.8 Mw was used, which defines five zones related to the dynamic response of the soil to the earthquake. Although the seismic microzonation does not cover the entire study area, it covers a large part of the area where the population and its livelihoods are located.
To determine the hazard of MMs, the heuristic method was used. The MMS model with the highest training and evaluation metric was assigned a weight of 0.75, and each triggering variable was assigned a weight of 0.25 using the raster calculator in QGIS. Finally, the hazard model for each scenario was obtained and reclassified into quintiles, like the MMS.
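The weighted overlay described above amounts to a per-pixel linear combination. The sketch below applies the 0.75 and 0.25 weights with NumPy instead of the QGIS raster calculator; it assumes that the susceptibility and trigger rasters have already been rescaled to a common 0 to 1 range, which the text does not state, and the tiny arrays are invented.

```python
import numpy as np

def hazard_index(susceptibility, trigger, w_susceptibility=0.75, w_trigger=0.25):
    """Per-pixel weighted sum of the susceptibility model and one triggering variable."""
    susceptibility = np.asarray(susceptibility, dtype=float)
    trigger = np.asarray(trigger, dtype=float)
    return w_susceptibility * susceptibility + w_trigger * trigger

# Invented 2 x 3 rasters, both assumed to be rescaled to the range 0..1.
mms = np.array([[0.9, 0.2, 0.5], [0.1, 0.7, 1.0]])
rainfall_anomaly = np.array([[1.0, 0.0, 0.5], [0.3, 0.6, 0.8]])

hazard = hazard_index(mms, rainfall_anomaly)
breaks = np.percentile(hazard, [20, 40, 60, 80])      # quintile breaks, as for the MMS
hazard_class = np.digitize(hazard, breaks, right=True) + 1
print(np.round(hazard, 3))
print(hazard_class)
```

One such overlay would be produced per scenario, using the rainfall anomaly raster for El Niño and the seismic microzonation raster for the earthquake.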

3. Results

3.1. Exploration of Variables

To determine the MMS using the WoE, LR, MLP, SVM, RF, and NB methods, the correlation of the topographic variables was analyzed, and then the multicollinearity of the variables was analyzed using VIF. Figure 3 shows the Pearson correlation of topographic variables derived from a DEM.
The figure shows that some variables are highly correlated, with correlation coefficients ≥ 0.8. The correlation coefficient between the Terrain Roughness Index (T4) and the slope (T1) was 0.99, and the correlation coefficient between the general curvature (T7) and the profile curvature (T6) was 0.8. This indicates a high correlation between the influencing factors. Therefore, a multicollinearity analysis was conducted using the VIF statistic (Table 4), where a value greater than 5 indicates multicollinearity issues in the fitting of susceptibility models, which are sensitive to the linear correlation of influencing factors.
From the multicollinearity result, it is observed that the slope and Terrain Roughness Index are affected by multicollinearity. Therefore, a dimensionality reduction of the variables was performed using PCA to exclude variable correlation and multicollinearity issues. In Figure 4, the contribution of the variance of the seven PCAs is observed. Assuming a variance explanation level of ≥0.65, 3 PCAs are required, resulting in 0.85 variance explanation, with each component being independent of the others.
Table 5 shows the importance and contribution of each influential variable in the three PCAs. The higher the value (in absolute terms), the greater the contribution to the PCA. For PCA-1, the most relevant variable was slope. For PCA-2, the general curvature variable showed the highest relevance. Lastly, orientation was the most relevant factor for PCA-3. Figure 5 displays the selected PCAs as input variables for the MMS.
Finally, the PCAs were reclassified into quintiles, and contrast values were obtained by applying WoE to both the PCAs and the geological–environmental factors in order to determine the relationship between the study variables and MM phenomena. These variables were used in the WoE model and in the coupled LR, MLP, SVM, RF, and NB models.
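For reference, the contrast value of a single class can be computed as in the sketch below. It assumes the standard weights-of-evidence formulation, in which W+ and W− are log ratios of conditional probabilities and the contrast is C = W+ − W−; the notation may differ from the equations in the methodology section, and the pixel counts are hypothetical.

```python
from math import log

def woe_contrast(mm_in, mm_out, stable_in, stable_out):
    """Return (W+, W-, C) for one class of a conditioning factor.

    mm_in / mm_out: MM pixels inside / outside the class.
    stable_in / stable_out: non-MM pixels inside / outside the class.
    """
    p_class_given_mm = mm_in / (mm_in + mm_out)
    p_class_given_stable = stable_in / (stable_in + stable_out)
    w_plus = log(p_class_given_mm / p_class_given_stable)
    w_minus = log((1 - p_class_given_mm) / (1 - p_class_given_stable))
    return w_plus, w_minus, w_plus - w_minus

# Hypothetical counts: the class holds 40% of MM pixels but only 10% of stable pixels
w_plus, w_minus, contrast = woe_contrast(400, 600, 1000, 9000)
print(round(contrast, 3))
```

A positive contrast indicates that the class is spatially associated with MM occurrence, and a negative contrast indicates the opposite.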

3.2. Mass Movement Susceptibility (MMS) Modelling

The results of the MMS models were reclassified into five quintile classes: very low (VL), low (L), medium (M), high (H), and very high (VH). The area covered by each class is shown in Table 6.
The highest MMS values (very high and high) were generated by the RF models, followed by the NB, SVM, and LR models, all with 20% of the surface of the study area in the very high and high susceptibility levels, as shown in Figure 6. The heuristic MMS levels were generated by INGEMMET based on lithology, hydrogeology, slope, land use, and vegetation cover.

3.3. Model Validation

As indicated in the methodology section, three metrics were used to evaluate and validate the susceptibility models: the AUC value, the F-1 score, and the ACC. The overall performance of the models was evaluated using AUC analysis [76,77]. All ML-based MMS models applied in this study, namely LR, MLP, SVM, RF, and NB, achieved AUC values above 0.9. In training, the highest AUC value was obtained by the RF model (AUC = 1.000), followed by the SVM model (AUC = 0.994) and the MLP and LR models (both with AUC = 0.986). The difference between the maximum and minimum AUC values was only 1.9%. For the evaluation data, all the MMS models performed excellently in correctly classifying the presence and absence of MM phenomena, with AUC values very close to 1 (Figure 7). Finally, the CV performance was calculated as the average of the values from the k folds so that the datasets can be compared clearly. The AUC-CV values of the ML models are high across all datasets, averaging above 0.97.
Regarding the F-1 score, all the ML models exceed 0.950, with the RF model having the highest value (F-1 = 0.991). The F-1 score-CV values are above 0.943, with the RF model presenting the highest average (0.959). Finally, regarding the ACC of the ML models, the RF model was the most accurate (ACC = 0.989); this was also confirmed in the CV, with an average ACC-CV value of 0.947. Table 7 presents the training, evaluation, and CV metrics of the ML models generated in this study.
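To make the three metrics concrete, the sketch below computes them for a small set of hypothetical labels and predicted probabilities. The AUC is obtained with the rank-based (Mann-Whitney) formulation, and the F-1 score and ACC use a 0.5 decision threshold, which is an assumption; none of the numbers come from Table 7.

```python
def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_and_accuracy(y_true, scores, threshold=0.5):
    """F-1 score and accuracy after thresholding the predicted probabilities."""
    pred = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(pred, y_true))
    fp = sum(p == 1 and y == 0 for p, y in zip(pred, y_true))
    fn = sum(p == 0 and y == 1 for p, y in zip(pred, y_true))
    tn = sum(p == 0 and y == 0 for p, y in zip(pred, y_true))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return f1, accuracy

# Hypothetical evaluation set: 1 = MM present, 0 = MM absent
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

auc_value = auc(y_true, scores)
f1_value, acc_value = f1_and_accuracy(y_true, scores)
print(auc_value, f1_value, acc_value)
```

For k-fold cross-validation, the same functions would be applied to each held-out fold and the k results averaged.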

3.4. MM Hazard Scenarios

The MM hazard for the El Niño phenomenon and for earthquakes greater than 8.8 Mw was determined using the heuristic method, generated from the geospatial product between the MMS and the triggering factors. As mentioned in the previous section, the RF-derived MMS model was used because it presented the highest metrics of all the models in both training and evaluation. For the El Niño scenario and the earthquake exceeding 8.8 Mw, approximately 40% and 35% of the study area, respectively, is classified as high and very high hazard. Conversely, the lowest hazard levels for both scenarios are found in the lower areas of the study region, which coincide with alluvial plains not susceptible to MMs. Figure 8 presents the MM hazard maps for both the El Niño scenario and the seismic scenario greater than 8.8 Mw.

4. Discussion

In this study, six quantitative MMS models were generated: the first was generated using WoE, a bivariate statistical method, and the other five were ML models, namely LR, SVM, NB, MLP, and RF, which were coupled with WoE. The models generated with ML exhibit the same spatial pattern; that is, the highest susceptibilities, with values close to 1, are found in the elevated areas where the MMs are distributed. This coincides with geomorphological units of mountains composed of intrusive, volcanic, or sedimentary rocks with steep slopes. From a geomechanical perspective, these rocks are fractured, altered, or weathered, generating high susceptibilities to MMs, which have been recognized by the ML models. WoE captures spatial patterns similar to those of the ML models; however, in some areas where the conditioning factors and the MMIs indicate a high probability of MMs, it did not identify high and very high susceptibilities, particularly in the northern and southeastern extremes of the study area, where steep slopes and fractured, weathered rocks are present. This demonstrates that ML techniques achieve a better representation of MMS in the study area, with the RF model standing out as the one that most homogeneously represented the areas of susceptibility to MM and the conditioning factors that predispose their occurrence. The opposite is observed in the lowland areas, where values close to 0 predominate, with very low and low susceptibility levels for all MMS models; this coincides with geomorphological units of alluvial plains, aeolian deposits, and alluvial-torrential deposits with low slopes. In terms of area, around 33% to 40% of the study area falls into the highest susceptibility levels (high and very high) across the six MMS models, while the lowest susceptibility levels (very low and low) cover a larger area, ranging from 40% to 44% of the study area.
The area covered by each susceptibility level is also similar across the different ML models. Spatially, the highest susceptibilities in the study area are found on the eastern edge of the NLC, where the highest parts of the Andean foothills are located.
The validation process is crucial in MMS mapping. Several studies have suggested that an AUC value between 0.8 and 0.9 indicates a very good model, while a value higher than 0.9 indicates an excellent model [10,11]. The results revealed that all MMS models demonstrate an excellent performance in accurately predicting the presence and absence of MM phenomena during the training and evaluation of the models; among them, the RF model stands out in terms of AUC, F-1 score, and accuracy, both in the training and test data and in cross-validation. For the latter, high values of AUC-CV, F-1 score-CV, and ACC-CV were obtained across different datasets. Similar results were reported in [64,71,78], which compared landslide susceptibility obtained with different ML techniques, such as NB, k-nearest neighbors, RF, deep neural network, LR, boosted regression tree, and SVM. These studies found that the RF model had the best training and evaluation metrics, with AUC values exceeding 0.920 in training and up to 1.000 in evaluation.
To compare the MMS results of WoE, LR, MLP, SVM, RF, and NB with the heuristic method, the susceptibility levels were standardized into five classes. Subsequently, the susceptibility levels were extracted at the MM point vectors (centroids of the test polygons) in the study area. For the heuristic model, 69.7% of the points fall in the high and very high MMS levels. On the other hand, for RF, SVM, LR, NB, and WoE, 97.0%, 90.9%, 90.7%, 90.9%, 87.9%, and 78.8% of the points are at the highest susceptibility levels, namely high and very high (Figure 9). This indicates that the proposed machine learning-based models for determining MMS discriminate MM events better than the heuristic method and that coupling statistical methods with ML models generates accurate and reliable models, as indicated by several authors [20,21,22]. This is because ML models are designed to automatically obtain the optimal nonlinear relationship between the study variables [17,64,78].
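The comparison at the test points reduces to counting how many centroids fall in the two highest classes. The sketch below illustrates this with a hypothetical list of sampled classes; the count of 33 points is an assumption made only because it reproduces a share of 69.7% with whole numbers, and it is not stated in the text.

```python
def share_in_highest_levels(sampled_classes, highest=("H", "VH")):
    """Percentage of test points whose sampled class is high or very high."""
    hits = sum(c in highest for c in sampled_classes)
    return 100 * hits / len(sampled_classes)

# Hypothetical classes sampled at 33 test centroids from one susceptibility map
sampled = ["VH"] * 14 + ["H"] * 9 + ["M"] * 6 + ["L"] * 3 + ["VL"] * 1
share = share_in_highest_levels(sampled)
print(round(share, 1))
```

Repeating the count for each susceptibility map gives one percentage per model, which is the quantity compared in Figure 9.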
The hazard levels suggest that, in the event of an El Niño phenomenon, close to 40% of the surface area of the NLC would be in the highest hazard levels, high and very high. At the district level, more than half of the surface area of the Carabayllo district (54.2%) would be under high and very high hazard levels, followed by the districts of Comas, Independencia, and Ancon, with approximately 37.0%. On the other hand, Los Olivos, Puente Piedra, and San Martin de Porras have less than 8% of their surface area at the highest hazard levels. Regarding the seismic hazard scenario, the seismic microzonation did not cover the entire study area; it represented only 4.9%, 24.6%, 75.3%, 63.2%, 94.3%, 79.7%, and 70.4% of the surface area of Ancon, Carabayllo, Comas, Independencia, Los Olivos, Puente Piedra, and San Martin de Porras, respectively. Therefore, the percentages shown refer to the proportion of the area covered by the seismic microzonation. The districts of Ancon, Carabayllo, Comas, Independencia, and Puente Piedra have between 36% and 52% of their surface area under high and very high seismic hazard levels in the event of an earthquake of magnitude greater than 8.8 Mw. Table 8 shows the hazard levels expressed in surface area for the El Niño phenomenon and the seismic event.
In this study, MMS mapping was implemented to identify the areas most prone to MM and to evaluate the associated hazard under two scenarios: the first considering the El Niño phenomenon and the second considering a seismic event above 8.8 Mw. Susceptibility and hazard mapping are fundamental processes in disaster risk management, as they enable the identification of risk-prone areas in order to propose prevention and risk reduction strategies. Consequently, errors in susceptibility and hazard mapping can lead to false conclusions, resulting in the loss of lives and livelihoods [71].
As evidenced in both this study and previous research [8,79,80], the quantitative approach based on ML techniques offers a precise and efficient methodology for processing large and complex datasets; this includes geological, topographic, hydrological, climatic, environmental, and anthropogenic factors. In contrast, classical qualitative and semi-quantitative methods determine subjective and artificial weights based on expert judgment and experience.
Several variables can introduce uncertainty in the application of ML models, such as the number and type of variables, the data quality, and the number of MM and non-MM inventories used for training, among others [17,79]. In this research, the aim was not to control all variables but to make the most of the available resources; therefore, the uncertainty regarding the number and type of variables was minimized by conducting a comprehensive analysis of the topographic variables in the models, including correlation, multicollinearity, and dimensionality reduction using PCA. This approach allowed us to exclude variable correlations, reduce noise, and mitigate the risk of overfitting, thus improving the accuracy of the models [45,81]. Additionally, the WoE analysis was employed to identify causal relationships between instability factors and the distribution of MM. In summary, methodologies were integrated and combined to enhance MMS accuracy, resulting in hybrid models in which PCA-reduced variables weighted with WoE serve as the inputs for ML models such as LR, MLP, SVM, RF, and NB.
However, it is important to recognize that the success of applying ML models depends on the information provided by experts and the quality of the input data. Therefore, their implementation at a national or regional scale in other territories must be carefully evaluated, ensuring a proper methodological flow and the availability of high-quality inputs, especially regarding geological, topographical, and environmental factors. Additionally, the spatial resolution chosen for the triggering factors must be appropriate to their spatial and temporal variability.

4.1. Limitations

Regarding the limitations of this study, there is a lack of information about the triggering events of the MM; that is, it is not specified whether they were triggered by extreme rainfall, earthquakes, anthropogenic causes, etc. Additionally, the scale of the geological, geomorphological, and hydrogeological inputs used in this study (1:100,000) may not be suitable if decisions need to be made at a detailed scale. The DEM used was generated at the beginning of the last decade, so there could be changes in the topography that are not considered. In the MMS models, the use of different sets of negative samples and their impact on the final MMS results were not evaluated; to mitigate this limitation, a hybrid approach was used so that negative samples were not mapped only in flat areas, as indicated in the methodology section. Additionally, hyperparameter optimization was not carried out, as it was not the objective of the research; nevertheless, satisfactory results were obtained in the training, evaluation, and CV metrics of the models. Finally, further studies are needed to improve seismic microzonation in the study area, especially due to the longitudinal growth in the periphery of Lima.

4.2. Perspectives

In terms of future perspectives, six MMS maps were presented, five of them based on machine learning techniques with excellent results and one based on bivariate statistics with good results. All models showed better classification metrics for MM events than the classic heuristic method. These ML models offer a valuable tool for disaster risk management, particularly in the processes of risk estimation, prevention, reduction, and reconstruction. The application of ML techniques, supported by available data, has the potential to significantly improve MM zoning and, ultimately, contribute to the resilience of communities against these natural events.

5. Conclusions

Six quantitative models were constructed, trained, and evaluated to map MMS: WoE, LR, MLP, SVM, RF, and NB. Their purpose was to identify the areas most susceptible to MM in the NLC and to assess the associated hazards under two scenarios: the El Niño phenomenon and a seismic event with a magnitude greater than 8.8 Mw. Before modelling the MMS, an exploratory analysis of the variables was conducted, including correlation analysis, multicollinearity assessment, and dimensionality reduction using PCA for the topographic variables. Subsequently, a combination of methods was applied, in which the WoE technique was applied to the topographic, geological, and environmental variables, which then served as inputs for the ML models. The models were constructed, trained, and evaluated using metrics including the AUC, F-1 score, and ACC. The findings demonstrate the excellent performance of the ML models, as all exhibit high metrics in training, evaluation, and CV, with the RF model standing out for its predictive capability. Additionally, the results of the quantitative susceptibility models were compared with those of a heuristic model, revealing that the latter exhibits between 10 and 20% less ability to discriminate MM events than the quantitative models. Regarding the hazard levels, in the event of the El Niño phenomenon and a seismic event exceeding 8.8 Mw, approximately 40% and 35% of the NLC area, respectively, would be exposed to the highest hazard levels. The rapid growth of cities on the outskirts of Lima will increase pressure on land occupation. Therefore, the findings of this research are expected to serve as a tool to support decision-makers, the technical–scientific community, and civil society in developing effective strategies for disaster prevention and risk reduction. Ultimately, this will contribute to enhancing the resilience of communities in the face of disasters.

Author Contributions

Conceptualization, E.B.-R.; methodology, E.B.-R.; software, E.B.-R., C.R., F.O., T.P. and R.S.; validation, E.B.-R., M.O. and N.M.; formal analysis, E.B.-R., L.E., C.R.-L., T.C. and T.P.; investigation, E.B.-R. and M.O.; resources, E.B.-R.; data curation, E.B.-R.; writing—original draft preparation, E.B.-R.; writing—review and editing, E.B.-R., L.E., C.R., N.M., T.P., R.S., T.C., M.O., C.R.-L. and F.O.; visualization, E.B.-R. and M.O.; supervision, E.B.-R. and M.O.; project administration, E.B.-R. and M.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in the study are openly available in FigShare at https://doi.org/10.6084/m9.figshare.25996711.v1 (accessed on 9 June 2024). For any other information, please contact the corresponding author.

Acknowledgments

The authors would like to express their gratitude to the public and private institutions that collaborated by sharing information on the geological, topographical, and environmental variables of the study area. Also, thanks to the anonymous reviewers for their valuable feedback on the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Topographic variables of MMS models.
Figure A2. Geological, environmental, and triggering variables of MMS models.

References

  1. El Comercio. Vivir en Las Alturas. 2016. Available online: https://elcomercio.pe/eldominical/actualidad/vivir-alturas-392960-noticia/ (accessed on 27 March 2024).
  2. Tavera, H. Escenario de Sismo y Tsunami en el Borde Occidental de la Región Central del Perú. Lima, Perú, 2014. Available online: https://repositorio.igp.gob.pe/handle/20.500.12816/779 (accessed on 9 June 2024).
  3. INDECI. Escenario Sísmico para Lima Metropolitana y Callao: Sismo 8.8Mw. 2017. Available online: https://portal.indeci.gob.pe/wp-content/uploads/2019/01/201711231521471-1.pdf (accessed on 9 June 2024).
  4. INDECI. Compendio Estadístico del INDECI 2017. 2017. Available online: https://www.indeci.gob.pe/wp-content/uploads/2019/01/201802271714541.pdf (accessed on 9 June 2024).
  5. INDECI. Dashboard de Control—Reporte de Emergencias. 2024. Available online: https://app.powerbi.com/view?r=eyJrIjoiNTFkOWRhYWQtYmMwMS00OWNmLTg4ZTctNjZjYTc1OTIyN2M0IiwidCI6IjNlZWNkMjZlLTlhNTUtNDg4MC04ODEyLWEzMGZjZGU3OGEyZCJ9&pageName=ReportSectioncd99edcca07a5ff10551 (accessed on 30 January 2022).
  6. Chang, L.; Xing, G.; Yin, H.; Fan, L.; Zhang, R.; Zhao, N.; Huang, F.; Ma, J. Landslide susceptibility evaluation and interpretability analysis of typical loess areas based on deep learning. Nat. Hazards Res. 2023, 3, 155–169.
  7. Van Westen, C.J.; van Asch, T.W.J.; Soeters, R. Landslide hazard and risk zonation—Why is it still so difficult? Bull. Eng. Geol. Environ. 2006, 65, 167–184.
  8. Huang, F.; Cao, Z.; Guo, J.; Jiang, S.-H.; Li, S.; Guo, Z. Comparisons of heuristic, general statistical and machine learning models for landslide susceptibility prediction and mapping. CATENA 2020, 191, 104580.
  9. Panchal, S.; Shrivastava, A.K. Landslide hazard assessment using analytic hierarchy process (AHP): A case study of National Highway 5 in India. Ain Shams Eng. J. 2022, 13, 101626.
  10. Sun, X.; Chen, J.; Bao, Y.; Han, X.; Zhan, J.; Peng, W. Landslide susceptibility mapping using logistic regression analysis along the Jinsha river and its tributaries close to Derong and Deqin County, southwestern China. ISPRS Int. J. Geo-Inf. 2018, 7, 438.
  11. Pourghasemi, H.R.; Moradi, H.R.; Aghda, S.M.F. Landslide susceptibility mapping by binary logistic regression, analytical hierarchy process, and statistical index models and assessment of their performances. Nat. Hazards 2013, 69, 749–779.
  12. Pourghasemi, H.R.; Pradhan, B.; Gokceoglu, C. Application of fuzzy logic and analytical hierarchy process (AHP) to landslide susceptibility mapping at Haraz watershed, Iran. Nat. Hazards 2012, 63, 965–996.
  13. Hu, X.; Zhang, H.; Mei, H.; Xiao, D.; Li, Y.; Li, M. Landslide Susceptibility Mapping Using the Stacking Ensemble Machine Learning Method in Lushui, Southwest China. Appl. Sci. 2020, 10, 4016.
  14. Mao, Y.; Li, Y.; Teng, F.; Sabonchi, A.K.S.; Azarafza, M.; Zhang, M. Utilizing Hybrid Machine Learning and Soft Computing Techniques for Landslide Susceptibility Mapping in a Drainage Basin. Water 2024, 16, 380.
  15. Yu, H.; Pei, W.; Zhang, J.; Chen, G. Landslide Susceptibility Mapping and Driving Mechanisms in a Vulnerable Region Based on Multiple Machine Learning Models. Remote Sens. 2023, 15, 1886.
  16. Boussouf, S.; Fernández, T.; Hart, A.B. Landslide susceptibility mapping using maximum entropy (MaxEnt) and geographically weighted logistic regression (GWLR) models in the Río Aguas catchment (Almería, SE Spain). Nat. Hazards 2023, 117, 207–235.
  17. Achu, A.; Aju, C.; Di Napoli, M.; Prakash, P.; Gopinath, G.; Shaji, E.; Chandra, V. Machine-learning based landslide susceptibility modelling with emphasis on uncertainty analysis. Geosci. Front. 2023, 14, 101657.
  18. Goyes-Peñafiel, P.; Hernandez-Rojas, A. Double landslide susceptibility assessment based on artificial neural networks and weights of evidence. Boletin Geol. 2021, 43, 173–191.
  19. Sabokbar, H.A.F.; Roodposhti, M.S.; Tazik, E. Landslide susceptibility mapping using geographically-weighted principal component analysis. Geomorphology 2014, 226, 15–24.
  20. He, W.; Chen, G.; Zhao, J.; Lin, Y.; Qin, B.; Yao, W.; Cao, Q. Landslide Susceptibility Evaluation of Machine Learning Based on Information Volume and Frequency Ratio: A Case Study of Weixin County, China. Sensors 2023, 23, 2549.
  21. Yan, H.; Chen, W. Landslide susceptibility modeling based on GIS and ensemble techniques. Arab. J. Geosci. 2022, 15, 762.
  22. Di Napoli, M.; Carotenuto, F.; Cevasco, A.; Confuorto, P.; di Martire, D.; Firpo, M.; Pepe, G.; Raso, E.; Calcaterra, D. Machine learning ensemble modelling as a tool to improve landslide susceptibility mapping reliability. Landslides 2020, 17, 1897–1914.
  23. Tehrany, M.S.; Pradhan, B.; Jebur, M.N. Flood susceptibility mapping using a novel ensemble weights-of-evidence and support vector machine models in GIS. J. Hydrol. 2014, 512, 332–343.
  24. Anagnostopoulos, G.G.; Fatichi, S.; Burlando, P. An advanced process-based distributed model for the investigation of rainfall-induced landslides: The effect of process representation and boundary conditions. Water Resour. Res. 2015, 51, 7501–7523.
  25. Alvioli, M.; Baum, R.L. Parallelization of the TRIGRS model for rainfall-induced landslides using the message passing interface. Environ. Model. Softw. 2016, 81, 122–135.
  26. Van Westen, C.J.; Terlien, M.T.J. An approach towards deterministic landslide hazard analysis in GIS. A case study from Manizales (Colombia). Earth Surf. Process. Landf. 1996, 21, 853–868.
  27. Büechi, E. Modelling of Landslide Susceptibilities in the Cordillera Blanca (Peru). Master’s Thesis, Geographisches Institut der Universität Zürich, Zurich, Switzerland, 2018.
  28. INGEMMET. Mapa de Susceptibilidad por Movimientos en Masa en Lima Metropolitana. SIGRID, 2015. Available online: https://sigrid.cenepred.gob.pe/sigridv3/documento/3653 (accessed on 27 March 2024).
  29. CAF. El Fenomeno el Niño 1997–1998. Lima, Perú, 1998. Available online: http://scioteca.caf.com/bitstream/handle/123456789/675/Las_lecciones_de_El_Niño._Ecua-dor.pdf?sequence=1&isAllowed=y (accessed on 9 June 2024).
  30. SENAMHI. El Fenómeno EL NIÑO en el Perú. Lima, Perú, 2014. Available online: https://www.minam.gob.pe/wp-content/uploads/2014/07/Dossier-El-Niño-Final_web.pdf (accessed on 9 June 2024).
  31. Villacorta, S.; Nuñez, S.; Obregón, C.; Tatard, L. Modelos de Susceptibilidad por Movimientos en Masa en Lima Metropolitana y El Callao. 2014. Available online: https://repositorio.ingemmet.gob.pe/bitstream/20.500.12544/2724/1/Villacorta-Susceptibilidad_movimientos_en_masa_Lima_Metropolitana-Callao.pdf (accessed on 9 June 2024).
  32. Chambi, S.P.V.; Juárez, S.N.; Pari, W.; Smoll, L.F. Peligros Geológicos en el Área de Lima Metropolitana y la Región Callao—[Boletín C 59]. Lima, Perú, 2015. Available online: https://hdl.handle.net/20.500.12544/309 (accessed on 9 June 2024).
  33. Achour, Y.; Boumezbeur, A.; Hadji, R.; Chouabbi, A.; Cavaleiro, V.; Bendaoud, E.A. Landslide susceptibility mapping using analytic hierarchy process and information value methods along a highway road section in Constantine, Algeria. Arab. J. Geosci. 2017, 10, 194.
  34. Gómez, H.; Kavzoglu, T. Assessment of shallow landslide susceptibility using artificial neural networks in Jabonosa River Basin, Venezuela. Eng. Geol. 2005, 78, 11–27.
  35. Khabiri, S.; Crawford, M.M.; Koch, H.J.; Haneberg, W.C.; Zhu, Y. An Assessment of Negative Samples and Model Structures in Landslide Susceptibility Characterization Based on Bayesian Network Models. Remote Sens. 2023, 15, 3200.
  36. Hu, Q.; Zhou, Y.; Wang, S.; Wang, F. Machine learning and fractal theory models for landslide susceptibility mapping: Case study from the Jinsha River Basin. Geomorphology 2020, 351, 106975.
  37. Field, A. Discovering Statistics Using SPSS (and Sex and Drugs and Rock “n” Roll), 3rd ed.; SAGE Publications: Thousand Oaks, CA, USA, 2009.
  38. Chan, J.Y.-L.; Leow, S.M.H.; Bea, K.T.; Cheng, W.K.; Phoong, S.W.; Hong, Z.-W.; Chen, Y.-L. Mitigating the Multicollinearity Problem and Its Machine Learning Approach: A Review. Mathematics 2022, 10, 1283.
  39. Menard, S. Applied Logistic Regression Analysis; SAGE Publications: Thousand Oaks, CA, USA, 2002.
  40. Alibuhtto, M.C.; Peiris, T.S.G. Principal component regression for solving multicollinearity problem. In Proceedings of the 5th International Symposium, Oluvil, Sri Lanka, 7–8 December 2015; pp. 231–238.
  41. Kelkar, K.A. Mass Movement Phenomena in the Western San Juan Mountains, Colorado. Master’s Thesis, Texas A&M University, College Station, TX, USA, 2017.
  42. Aristizábal-Giraldo, E.; Guarin, M.V.; Ruíz, D. Métodos estadísticos para la evaluación de la susceptibilidad por movimientos en masa. TecnoLógicas 2019, 22, 39–60.
  43. Cao, B.; Li, Q.; Zhu, Y. Comparison of Effects between Different Weight Calculation Methods for Improving Regional Landslide Susceptibility—A Case Study from Xingshan County of China. Sustainability 2022, 14, 11092.
  44. Wang, Q.; Kong, Y.; Zhang, W.; Chen, J.; Xu, P.; Li, H.; Xue, Y.; Yuan, X.; Zhan, J.; Zhu, Y. Regional debris flow susceptibility analysis based on principal component analysis and self-organizing map: A case study in Southwest China. Arab. J. Geosci. 2016, 9, 718.
  45. Tang, Y.; Feng, F.; Guo, Z.; Feng, W.; Li, Z.; Wang, J.; Sun, Q.; Ma, H.; Li, Y. Integrating principal component analysis with statistically-based models for analysis of causal factors and landslide susceptibility mapping: A comparative study from the loess plateau area in Shanxi (China). J. Clean. Prod. 2020, 277, 124159.
  46. Basu, T.; Das, A.; Pal, S. Application of geographically weighted principal component analysis and fuzzy approach for unsupervised landslide susceptibility mapping on Gish River Basin, India. Geocarto Int. 2022, 37, 1294–1317.
  47. Goyes-Peñafiel, P.; Hernandez-Rojas, A. Landslide susceptibility index based on the integration of logistic regression and weights of evidence: A case study in Popayan, Colombia. Eng. Geol. 2021, 280, 105958.
  48. Bonham-Carter, G. Geographic Information Systems for Geoscientists: Modelling with GIS; Pergamon: Oxford, UK, 1995; Volume 21.
  49. Van Westen, C.J. Use of Weights of Evidence Modeling for Landslide Susceptibility Mapping; International Institute for Geoinformation Science and Earth Observation: Enschede, The Netherlands, 2002.
  50. Servicio Geológico Colombiano. Guía Metodológica para Estudios de Amenaza, Vulnerabilidad y Riesgo por Movimientos en Masa Escala 1:25,000; Servicio Geológico Colombiano: Bogotá, Colombia, 2017.
  51. Feby, B.; Achu, A.; Jimnisha, K.; Ayisha, V.; Reghunath, R. Landslide susceptibility modelling using integrated evidential belief function based logistic regression method: A study from Southern Western Ghats, India. Remote Sens. Appl. Soc. Environ. 2020, 20, 100411.
  52. Umar, Z.; Pradhan, B.; Ahmad, A.; Jebur, M.N.; Tehrany, M.S. Earthquake induced landslide susceptibility mapping using an integrated ensemble frequency ratio and logistic regression models in West Sumatera Province, Indonesia. Catena 2014, 118, 124–135.
  53. Yilmaz, I. Landslide susceptibility mapping using frequency ratio, logistic regression, artificial neural networks and their comparison: A case study from Kat landslides (Tokat—Turkey). Comput. Geosci. 2009, 35, 1125–1138.
  54. Huang, J.; Zhou, Q.; Wang, F. Mapping the landslide susceptibility in Lantau Island, Hong Kong, by frequency ratio and logistic regression model. Ann. GIS 2015, 21, 191–208.
  55. Biçer, T.; Ercanoglu, M. A semi-quantitative landslide risk assessment of central Kahramanmaraş City in the Eastern Mediterranean region of Turkey. Arab. J. Geosci. 2020, 13, 732.
  56. Vakhshoori, V.; Pourghasemi, H.R.; Zare, M.; Blaschke, T. Landslide Susceptibility Mapping Using GIS-Based Data Mining Algorithms. Water 2019, 11, 2292.
  57. Pham, B.T.; Tien Bui, D.; Prakash, I.; Dholakia, M.B. Hybrid integration of Multilayer Perceptron Neural Networks and machine learning ensembles for landslide susceptibility assessment at Himalayan area (India) using GIS. Catena 2017, 149, 52–63.
  58. Nguyen, V.V.; Pham, B.T.; Vu, B.T.; Prakash, I.; Jha, S.; Shahabi, H.; Shirzadi, A.; Ba, D.N.; Kumar, R.; Chatterjee, J.M.; et al. Hybrid machine learning approaches for landslide susceptibility modeling. Forests 2019, 10, 157.
  59. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  60. Liu, S.; Wang, L.; Zhang, W.; He, Y.; Pijush, S. A comprehensive review of machine learning-based methods in landslide susceptibility mapping. Geol. J. 2023, 58, 2283–2301.
  61. Li, M.; Li, L.; Lai, Y.; He, L.; He, Z.; Wang, Z. Geological Hazard Susceptibility Analysis Based on RF, SVM, and NB Models, Using the Puge Section of the Zemu River Valley as an Example. Sustainability 2023, 15, 11228.
  62. Merghadi, A.; Yunus, A.P.; Dou, J.; Whiteley, J.; ThaiPham, B.; Bui, D.T.; Avtar, R.; Abderrahmane, B. Machine learning methods for landslide susceptibility studies: A comparative overview of algorithm performance. Earth Sci. Rev. 2020, 207, 103225.
  63. Pham, B.T.; Pradhan, B.; Bui, D.T.; Prakash, I.; Dholakia, M.B. A comparative study of different machine learning methods for landslide susceptibility assessment: A case study of Uttarakhand area (India). Environ. Model. Softw. 2016, 84, 240–250.
  64. Sajadi, P.; Sang, Y.-F.; Gholamnia, M.; Bonafoni, S.; Mukherjee, S. Evaluation of the landslide susceptibility and its spatial difference in the whole Qinghai-Tibetan Plateau region by five learning algorithms. Geosci. Lett. 2022, 9, 9.
  65. Zhang, W.; Zhang, Y.; Gu, X.; Wu, C.; Han, L. Application of Soft Computing, Machine Learning, Deep Learning and Optimizations in Geoengineering and Geoscience; Springer: Berlin/Heidelberg, Germany, 2021.
  66. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  67. Wu, X.; Song, Y.; Chen, W.; Kang, G.; Qu, R.; Wang, Z.; Wang, J.; Lv, P.; Chen, H. Analysis of Geological Hazard Susceptibility of Landslides in Muli County Based on Random Forest Algorithm. Sustainability 2023, 15, 4328.
  68. Karakas, G.; Can, R.; Kocaman, S.; Nefeslioglu, H.A.; Gokceoglu, C. Landslide susceptibility mapping with random forest model for ordu, turkey. ISPRS—Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 1229–1236.
  69. Rodriguez-Galiano, V.F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J.P. An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J. Photogramm. Remote Sens. 2012, 67, 93–104.
  70. Mia, U.; Chowdhury, T.N.; Chakrabortty, R.; Pal, S.C.; Al-Sadoon, M.K.; Costache, R.; Islam, A.R.M.T. Flood Susceptibility Modeling Using an Advanced Deep Learning-Based Iterative Classifier Optimizer. Land 2023, 12, 810.
  71. Nurwatik, N.; Ummah, M.H.; Cahyono, A.B.; Darminto, M.R.; Hong, J.-H. A Comparison Study of Landslide Susceptibility Spatial Modeling Using Machine Learning. ISPRS Int. J. Geo-Inf. 2022, 11, 602.
  72. Pradhan, B.; Sameen, M.I.; Al-Najjar, H.A.H.; Sheng, D.; Alamri, A.M.; Park, H.-J. A Meta-Learning Approach of Optimisation for Spatial Prediction of Landslides. Remote Sens. 2021, 13, 4521. [Google Scholar] [CrossRef]
  73. Friedl, H.; Stampfer, E. Cross-Validation. In Encyclopedia of Environmetrics; El-Shaarawi, A.H., Piegorsch, W.W., Eds.; John Wiley & Sons, Ltd.: Chichester, UK, 2002. [Google Scholar]
  74. Chung, C.-J.; Fabbri, A.G. Predicting landslides for risk analysis—Spatial models tested by a cross-validation technique. Geomorphology 2008, 94, 438–452. [Google Scholar] [CrossRef]
  75. SENAMHI. Lluvias Máximas—Escenarios Críticos con Información Climática Durante el Fenómeno el Niño. Lima, Perú, 2023. Available online: https://repositorio.senamhi.gob.pe/handle/20.500.12542/2867 (accessed on 9 June 2024).
  76. Zhao, Y.; Wang, R.; Jiang, Y.; Liu, H.; Wei, Z. GIS-based logistic regression for rainfall-induced landslide susceptibility mapping under different grid sizes in Yueqing, Southeastern China. Eng. Geol. 2019, 259, 105147. [Google Scholar] [CrossRef]
  77. Thai Pham, B.T.; Shirzadi, A.; Shahabi, H.; Omidvar, E.; Singh, S.K.; Sahana, M.; Asl, D.T.; Bin Ahmad, B.; Quoc, N.K.; Lee, S. Landslide susceptibility assessment by novel hybrid machine learning algorithms. Sustainability 2019, 11, 4386. [Google Scholar] [CrossRef]
  78. Achour, Y.; Pourghasemi, H.R. How do machine learning techniques help in increasing accuracy of landslide susceptibility maps? Geosci. Front. 2020, 11, 871–883. [Google Scholar] [CrossRef]
  79. Aditian, A.; Kubota, T.; Shinohara, Y. Comparison of GIS-based landslide susceptibility models using frequency ratio, logistic regression, and artificial neural network in a tertiary region of Ambon, Indonesia. Geomorphology 2018, 318, 101–111. [Google Scholar] [CrossRef]
  80. Bijukchhen, S.M.; Kayastha, P.; Dhital, M.R. A comparative evaluation of heuristic and bivariate statistical modelling for landslide susceptibility mappings in Ghurmi–Dhad Khola, east Nepal. Arab. J. Geosci. 2013, 6, 2727–2743. [Google Scholar] [CrossRef]
  81. Song, Y.; Yang, D.; Wu, W.; Zhang, X.; Zhou, J.; Tian, Z.; Wang, C.; Song, Y. Evaluating Landslide Susceptibility Using Sampling Methodology and Multiple Machine Learning Models. ISPRS Int. J. Geo-Inf. 2023, 12, 197. [Google Scholar] [CrossRef]
Figure 1. Location map. High vulnerability of housing on the hillsides of Metropolitan Lima.
Figure 2. Flowchart of the study.
Figure 3. Pearson correlation of topographic variables.
Figure 4. Cumulative explained variance as a function of the number of principal components. The blue line represents the cumulative explained variance, and the red line indicates the number of principal components chosen.
Figure 5. Reclassified principal components with weights of evidence.
Figure 6. MMS models obtained with the WoE, LR, MLP, SVM, RF, and NB methods. The red and black dots represent the MM and non-MM points, respectively, used to train and evaluate the ML models.
Figure 7. ROC curves and AUC values for the training (a) and test (b) data. The dashed line drawn from (0, 0) to (1, 1) is the reference diagonal, or line of no discrimination.
Figure 8. MM hazard (a) under the El Niño phenomenon and (b) for seismic events greater than 8.8 Mw.
Figure 9. Predictive capability of the heuristic, WoE, LR, MLP, SVM, RF, and NB methods.
Table 1. Type and description of the variables used in the study. Information provided by national and international sources. All links were last accessed on 2 February 2024.
| Format | Description | Variable type | Scale or spatial resolution | Year | Source | Link |
|---|---|---|---|---|---|---|
| Vector | Lithology | Categorical | 1/100,000 | - | INGEMMET | https://geocatmin.ingemmet.gob.pe/geocatmin/ |
| Vector | Geomorphology | Categorical | 1/100,000 | - | INGEMMET | https://geocatmin.ingemmet.gob.pe/geocatmin/ |
| Vector | Hydrogeology | Categorical | 1/100,000 |  | INGEMMET | https://geocatmin.ingemmet.gob.pe/geocatmin/ |
| Vector | Mass movements inventory | Categorical | 1/50,000 | 2021 | INGEMMET | https://geocatmin.ingemmet.gob.pe/geocatmin/ |
| Raster | Vegetation cover | Categorical | 1/100,000 |  | INGEMMET | https://www.datosabiertos.gob.pe/dataset/cobertura-vegetal-ministerio-del-ambiente |
| Raster | Digital elevation model (DEM) | Continuous | 12.5 m | 2010 | USGS | https://earthexplorer.usgs.gov/ |
| Raster | Seismic microzonation | Categorical | - | - | IGP/CISMID | https://www.igp.gob.pe/servicios/infraestructura-de-datos-espaciales/componentes/webservice |
| Raster | Precipitation anomalies in El Niño phenomenon | Continuous | 100 m | 2021 | SENAMHI | https://idesep.senamhi.gob.pe/portalidesep/ |
INGEMMET, Instituto Geológico, Minero y Metalúrgico; USGS, United States Geological Survey; IGP, Instituto Geofísico del Perú; CISMID, Centro Peruano Japonés de Investigaciones Sísmicas y Mitigación de Desastres; SENAMHI, Servicio Nacional de Meteorología e Hidrología del Perú.
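Table 2. Names of the research variables.
| Class | Name | Variable | PCA | Type of variable |
|---|---|---|---|---|
| Conditioning factors |  |  |  |  |
| Geological and environmental | Lithology | U1 | - | Categorical |
| Geological and environmental | Geomorphology | U2 | - | Categorical |
| Geological and environmental | Hydrology | U3 | - | Categorical |
| Geological and environmental | Vegetation cover | U4 | - | Categorical |
| Topographical | Slope | T1 | PCA-1, PCA-2, PCA-3 | Continuous |
| Topographical | Aspect | T2 | PCA-1, PCA-2, PCA-3 | Continuous |
| Topographical | Topographic Wetness Index (TWI) | T3 | PCA-1, PCA-2, PCA-3 |  |
| Topographical | Terrain Roughness Index (TRI) | T4 | PCA-1, PCA-2, PCA-3 | Continuous |
| Topographical | Flow direction | T5 | PCA-1, PCA-2, PCA-3 | Continuous |
| Topographical | Profile curvature | T6 | PCA-1, PCA-2, PCA-3 | Continuous |
| Topographical | General curvature | T7 | PCA-1, PCA-2, PCA-3 | Continuous |
| Triggering factors |  |  |  |  |
|  | Seismic 8.8 Mw (seismic microzonation) | D1 | - | Categorical |
|  | Precipitation anomalies in El Niño phenomenon | D2 | - | Continuous |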
Table 3. Hyperparameters used in this research.
| Model | Hyperparameters |
|---|---|
| LR | method = "bfgs" |
| MLP | lr = 0.1, architecture [1, 4, 4, 4], epochs = 1000, activation "relu" |
| SVM | kernel = "linear" |
| RF | n_estimators = 360, max_depth = 11, criterion = "gini", min_samples_split = 5, min_samples_leaf = 1 |
| NB | priors = None, var_smoothing = 1 × 10⁻⁹ |
Table 4. Multicollinearity analysis of topographic variables.
| Name | Variable | VIF |
|---|---|---|
| Intercept | - | 10.1 |
| Slope | T1 | 67.3 |
| Aspect | T2 | 1.5 |
| Topographic Wetness Index (TWI) | T3 | - |
| Terrain Roughness Index (TRI) | T4 | 67.3 |
| Flow direction | T5 | 1.4 |
| Profile curvature | T6 | 3 |
| General curvature | T7 | 3 |
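As a reading aid for Table 4, the sketch below converts each reported VIF into the R² it implies, using the standard identity VIF = 1 / (1 − R²), where R² comes from regressing one variable on all the others. The VIF values are copied from the table; the cutoff of 10 is a common rule of thumb assumed here for illustration, not a threshold stated in this section, and the helper name `r_squared_from_vif` is hypothetical.

```python
def r_squared_from_vif(vif: float) -> float:
    """Invert VIF = 1 / (1 - R^2) to recover the implied R^2."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


# VIF values copied from Table 4 (T3 is omitted because the table reports "-").
TABLE4_VIF = {
    "T1 slope": 67.3,
    "T2 aspect": 1.5,
    "T4 TRI": 67.3,
    "T5 flow direction": 1.4,
    "T6 profile curvature": 3.0,
    "T7 general curvature": 3.0,
}

# Assumed rule-of-thumb cutoff; the study's own criterion may differ.
VIF_CUTOFF = 10.0

implied_r2 = {name: r_squared_from_vif(v) for name, v in TABLE4_VIF.items()}
flagged = [name for name, v in TABLE4_VIF.items() if v > VIF_CUTOFF]

if __name__ == "__main__":
    for name, r2 in implied_r2.items():
        print(f"{name:22s} VIF={TABLE4_VIF[name]:5.1f}  implied R^2={r2:.3f}")
    print("Above cutoff:", flagged)
```

Under this reading, a VIF of 67.3 means that about 98.5% of the variance of slope (and likewise of TRI) is explained by the remaining topographic variables, which is consistent with the decision to reduce dimensionality with PCA.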
Table 5. Weights of the topographic variables in the principal components.
| PCA | Weights | T1 | T2 | T3 | T4 | T5 | T6 | T7 |
|---|---|---|---|---|---|---|---|---|
| PCA-1 | 0.377 | 0.565 | −0.102 | −0.502 | 0.566 | −0.038 | 0.237 | 0.203 |
| PCA-2 | 0.250 | 0.240 | 0.104 | −0.065 | 0.238 | 0.087 | −0.648 | −0.667 |
| PCA-3 | 0.218 | −0.019 | −0.689 | 0.035 | −0.034 | −0.705 | −0.112 | −0.114 |
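The values in Table 5 can be sanity-checked with a short sketch. It assumes, without confirmation from this section, that the "Weights" column is the share of variance explained by each component and that the T1 to T7 entries are the component loadings applied to standardized variables. The helper names and the example cell `z_cell` are hypothetical.

```python
import math

# Values copied from Table 5: (weight, loadings for T1..T7).
TABLE5 = {
    "PCA-1": (0.377, [0.565, -0.102, -0.502, 0.566, -0.038, 0.237, 0.203]),
    "PCA-2": (0.250, [0.240, 0.104, -0.065, 0.238, 0.087, -0.648, -0.667]),
    "PCA-3": (0.218, [-0.019, -0.689, 0.035, -0.034, -0.705, -0.112, -0.114]),
}


def loading_norm(loadings):
    """Euclidean length of a loading vector (close to 1 for a unit eigenvector)."""
    return math.sqrt(sum(w * w for w in loadings))


def project(z, loadings):
    """Component score of one cell: dot product of standardized T1..T7 with the loadings."""
    return sum(zi * wi for zi, wi in zip(z, loadings))


norms = {name: loading_norm(l) for name, (_, l) in TABLE5.items()}
cumulative_weight = sum(w for w, _ in TABLE5.values())

# Hypothetical cell: one standard deviation above the mean in slope (T1) and TRI (T4) only.
z_cell = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
score_pc1 = project(z_cell, TABLE5["PCA-1"][1])

if __name__ == "__main__":
    for name, n in norms.items():
        print(f"{name}: loading norm = {n:.4f}")
    print(f"Sum of the three weights = {cumulative_weight:.3f}")
    print(f"PCA-1 score of the example cell = {score_pc1:.3f}")
```

Each loading vector has a length very close to 1, as expected for unit eigenvectors, and the three weights add up to 0.845. Slope (T1) and TRI (T4) carry nearly identical loadings on PCA-1, which matches their equally high VIF in Table 4.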
Table 6. Area of MMS levels for all models.
| Models | VL (km²) | L (km²) | M (km²) | H (km²) | VH (km²) |
|---|---|---|---|---|---|
| WoE | 92.352 | 257.426 | 180.316 | 145.113 | 116.071 |
| LR | 157.959 | 162.505 | 157.243 | 153.230 | 162.130 |
| MLP | 128.209 | 192.514 | 170.778 | 152.981 | 146.793 |
| SVM | 158.280 | 162.000 | 156.579 | 155.760 | 160.448 |
| RF | 122.548 | 194.783 | 158.764 | 154.008 | 162.963 |
| NB | 145.511 | 174.507 | 156.579 | 155.573 | 160.895 |
| Heuristic * | 137.610 | 203.527 | 205.480 | 168.482 | 77.181 |
* MMS model generated by INGEMMET and provided by CENEPRED.
Table 7. MMS methods and model validation metrics.
| Model | AUC-Train | AUC-Test | AUC-CV | F-1 Score | F-1 Score-CV | ACC | ACC-CV |
|---|---|---|---|---|---|---|---|
| LR | 0.986 | 1.000 | 0.981 ± 0.027 | 0.957 | 0.946 ± 0.021 | 0.952 | 0.946 ± 0.036 |
| MLP | 0.986 | 0.998 | 0.980 ± 0.021 | 0.963 | 0.947 ± 0.022 | 0.958 | 0.944 ± 0.039 |
| SVM | 0.994 | 1.000 | 0.979 ± 0.039 | 0.951 | 0.943 ± 0.057 | 0.947 | 0.937 ± 0.064 |
| RF | 1.000 | 0.996 | 0.981 ± 0.029 | 0.991 | 0.959 ± 0.024 | 0.989 | 0.947 ± 0.038 |
| NB | 0.981 | 1.000 | 0.980 ± 0.034 | 0.961 | 0.955 ± 0.042 | 0.958 | 0.945 ± 0.043 |
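For readers less familiar with the metrics in Table 7, the sketch below computes AUC, accuracy, and F1 on a small made-up sample. It is not the study's evaluation code: the labels, scores, and the 0.5 decision threshold are hypothetical, and AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative).

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_and_f1(labels, scores, threshold=0.5):
    """Accuracy and F1 after thresholding the scores into 0/1 predictions."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    acc = (tp + tn) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


# Hypothetical toy sample: 1 = mass movement, 0 = no mass movement.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

auc = auc_from_scores(labels, scores)
acc, f1 = accuracy_and_f1(labels, scores)

if __name__ == "__main__":
    print(f"AUC={auc:.3f}  ACC={acc:.3f}  F1={f1:.3f}")
```

AUC depends only on how the scores rank the two classes, whereas accuracy and F1 depend on the chosen threshold, which is why a model can show a high AUC alongside a somewhat lower accuracy or F1.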
Table 8. Hazard levels for MM under the El Niño phenomenon and a seismic event greater than 8.8 Mw (areas in km²).
| District | El Niño VL | El Niño L | El Niño M | El Niño H | El Niño VH | Seismic VL | Seismic L | Seismic M | Seismic H | Seismic VH |
|---|---|---|---|---|---|---|---|---|---|---|
| Ancon | 38.711 | 73.605 | 80.316 | 47.446 | 69.538 | 1.356 | 5.573 | 2.152 | 6.112 | 0.043 |
| Carabayllo | 41.491 | 50.580 | 50.618 | 81.983 | 86.703 | 35.802 | 1.854 | 2.355 | 11.626 | 25.039 |
| Comas | 15.012 | 9.032 | 6.663 | 8.591 | 9.473 | 21.883 | 0.588 | 0.998 | 4.292 | 9.004 |
| Independencia | 5.441 | 0.678 | 3.817 | 5.087 | 0.987 | 4.220 | 0.332 | 0.333 | 2.095 | 3.174 |
| Los Olivos | 12.621 | 3.600 | 1.325 | 0.678 | 0.000 | 15.772 | 0.239 | 0.442 | 0.729 | 0.000 |
| Puente Piedra | 20.155 | 14.179 | 12.097 | 3.849 | 0.026 | 16.493 | 5.144 | 2.882 | 8.553 | 6.946 |
| San Martin de Porres | 26.692 | 6.196 | 2.599 | 0.477 | 0.000 | 26.636 | 1.572 | 2.499 | 2.948 | 0.109 |
| Sum | 160.122 | 157.868 | 157.435 | 148.114 | 166.725 | 122.163 | 15.302 | 11.661 | 36.355 | 44.316 |
| % | 20.3 | 20.0 | 19.9 | 18.7 | 21.1 | 53.2 | 6.7 | 5.1 | 15.8 | 19.3 |
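The percentage row of Table 8 can be reproduced from the "Sum" row with the short sketch below. The areas are copied from the table; the variable and function names are hypothetical. Each percentage is taken relative to the total of its own scenario, and the two totals differ in the table (about 790.3 km² for El Niño and about 229.8 km² for the seismic scenario).

```python
LEVELS = ["VL", "L", "M", "H", "VH"]

# "Sum" row of Table 8, in km2, ordered as in LEVELS.
SUMS_KM2 = {
    "El Niño": [160.122, 157.868, 157.435, 148.114, 166.725],
    "Seismic": [122.163, 15.302, 11.661, 36.355, 44.316],
}


def shares(areas):
    """Percentage of the scenario total covered by each hazard level."""
    total = sum(areas)
    return [100.0 * a / total for a in areas]


pct = {name: [round(p, 1) for p in shares(areas)] for name, areas in SUMS_KM2.items()}

# Combined share of the two highest levels (H and VH) per scenario.
high_share = {name: sum(shares(areas)[3:]) for name, areas in SUMS_KM2.items()}

if __name__ == "__main__":
    for name in SUMS_KM2:
        row = ", ".join(f"{lvl}={p}%" for lvl, p in zip(LEVELS, pct[name]))
        print(f"{name}: {row}; H+VH={high_share[name]:.1f}%")
```

The combined H and VH shares come out at roughly 40% for El Niño and 35% for the seismic scenario, in line with the figures quoted in the abstract.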
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Badillo-Rivera, E.; Olcese, M.; Santiago, R.; Poma, T.; Muñoz, N.; Rojas-León, C.; Chávez, T.; Eyzaguirre, L.; Rodríguez, C.; Oyanguren, F. A Comparative Study of Susceptibility and Hazard for Mass Movements Applying Quantitative Machine Learning Techniques—Case Study: Northern Lima Commonwealth, Peru. Geosciences 2024, 14, 168. https://doi.org/10.3390/geosciences14060168
