Integrating Sequential Backward Selection (SBS) and CatBoost for Snow Avalanche Susceptibility Mapping at Catchment Scale

Cetinkaya, Sinem; Kocaman, Sultan

doi:10.3390/ijgi13090312

Open Access Article

by Sinem Cetinkaya ¹,² and Sultan Kocaman ²,*

¹ Graduate School of Science and Engineering, Hacettepe University, Beytepe, 06800 Ankara, Türkiye
² Department of Geomatics Engineering, Hacettepe University, Beytepe, 06800 Ankara, Türkiye
* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2024, 13(9), 312; https://doi.org/10.3390/ijgi13090312

Submission received: 14 May 2024 / Revised: 21 August 2024 / Accepted: 27 August 2024 / Published: 29 August 2024

(This article belongs to the Special Issue Advances in Remote Sensing and GIS for Natural Hazards Monitoring and Management)

Abstract:

Snow avalanche susceptibility (AS) mapping is a crucial step in predicting and mitigating avalanche risks in mountainous regions. The conditioning factors used in AS modeling are diverse, and the optimal set of factors depends on the environmental and geological characteristics of the region. Using a sub-optimal set of input features with a data-driven machine learning (ML) method can lead to challenges like dealing with high-dimensional data, overfitting, and reduced model generalization. This study implemented a robust framework involving the Sequential Backward Selection (SBS) algorithm and a decision-tree based ML model, CatBoost, for the automatic selection of predictive variables for AS mapping. A comprehensive inventory of a large avalanche period, previously derived from satellite images, was used for the investigations in three distinct catchment areas in the Swiss Alps. The integrated SBS-CatBoost approach achieved very high classification accuracies between 94% and 97% for the three catchments. In addition, the Shapley additive explanations (SHAP) method was employed to analyze the contributions of each feature to avalanche occurrences. The proposed methodology revealed the benefits of integrating advanced feature selection algorithms with ML techniques for AS assessment. We aimed to contribute to avalanche hazard knowledge by assessing the impact of each feature in model learning.

Keywords: natural hazard assessment; snow avalanche susceptibility; feature selection; CatBoost; feature importance

1. Introduction

A snow avalanche, subsequently referred to as an avalanche, is a natural phenomenon resulting from the interaction of the cryosphere, hydrosphere, atmosphere, and the living ecosystems that seriously affects humankind. Avalanche susceptibility (AS) mapping is a scientific process that involves predicting the likelihood of avalanche occurrences in an area based on a variety of environmental and topographical factors [1]. The field of AS mapping is continuously evolving with advancements in data collection, analysis techniques, and computational methodologies [2]. These advancements provide superior tools for predicting avalanches, helping to save lives and protect settlements and infrastructure by enabling effective risk management and mitigation strategies in mountainous regions.

Researchers have used both qualitative and quantitative methods for AS mapping [3]. Qualitative (knowledge-driven) approaches, such as the analytical hierarchy process (AHP) and fuzzy logic (e.g., see [4,5,6]), are based on expert opinion, which can be subjective. On the other hand, quantitative (data-driven) methods use numerical data, statistical analysis, and mathematical models produced without human intervention. Given the complexity of avalanche characteristics, machine learning (ML) approaches provide valuable insights in data-driven avalanche studies.

The new era of computing has witnessed the great utilization of ML-based approaches with the advancements in geographic information system (GIS) and earth observation (EO) technologies in natural hazard assessments. Although relatively simple statistical analysis methods (e.g., the frequency ratio (FR), see [4,7]) can be used for AS assessments, novel ML approaches have introduced new possibilities to process large datasets and identify complex patterns and relationships among various factors that influence avalanches. In their study, Choubin et al. [8] proposed an AS mapping framework with multivariate discriminant analysis (MDA) and support vector machine (SVM) for the Sirvan watershed. The AS investigations of the Darvan and Zarrinehroud watersheds by Rahmati et al. [9] indicated that the random forest (RF) model was superior to SVM, generalized additive model (GAM), and naïve Bayes (NB) models. Wen et al. [10] presented a case study comparing the prediction capability of the SVM, K-nearest neighbors (KNN), classification and regression tree (CART), and multilayer perceptron (MLP) methods.

Tree-based ML algorithms, such as RF, categorical boosting (CatBoost), light gradient boosting (LightGBM), natural gradient boosting (NGBoost), and extreme gradient boosting (XGBoost), have yielded satisfactory prediction results in the literature [2,11,12]. Liu et al. [2] utilized four tree-based ML methods (CatBoost, LightGBM, RF, and XGBoost) to predict AS using multi-source spatial data, including remote sensing imagery, meteorological data, and topography in the Tianshan Mountains, China. In their study, CatBoost outperformed the other methods in predictive performance, with an accuracy of 0.93. The performance measures indicated that the spatial probability of avalanches can be computed rapidly and accurately using ML methods. However, selecting the appropriate conditioning factors and ML methods remains a challenging task.

The process of feature selection has proven highly effective in investigating the correlations and potential causal relationships between parameters that influence natural hazards [13,14]. However, incorporating a large number of features can lead to overfitting and limit the ability of a model to generalize due to high dimensionality [15]. Costache et al. [13] utilized the correlation-based feature selection (CFS) method to evaluate the predictive capacity of fourteen predictors in the context of flood susceptibility assessment. Pham et al. [14] employed two established feature selection techniques, chi-square (CS) and backward elimination (BE), to determine the optimal set of input features for training artificial neural networks (ANNs) in landslide susceptibility modeling. Although several conditioning factors may influence AS at different levels, some may be redundant or less relevant in a specific environment. This can lead to inefficient computation, resource overconsumption, and reduced accuracy of ML algorithms. Table 1 summarizes the ML methods and the features used for AS mapping in the recent literature.

As shown in Table 1, the features used for AS mapping can be diverse, and the optimal set depends on the regional characteristics and data availability. Even when data availability is not an issue, a feature selection strategy needs to be implemented to identify the optimal set of descriptive features for achieving high model performance. Different feature selection methods proposed in the literature can be utilized to address this issue. These methods are broadly categorized into filter, wrapper, and embedded approaches based on their evaluation criteria [16]. Filter methods are popular for feature selection due to their relatively low computational cost compared to wrapper and embedded methods [17]. Although filter-based methods are computationally efficient, wrapper and embedded methods are indispensable because they consider the relationship between features and the learning algorithm [18]. Embedded methods integrate feature selection within the model training process, providing effective selection with a reduced risk of overfitting through model-specific optimization [19]. In contrast, wrapper methods offer high predictive accuracy by evaluating feature subsets through iterative model training and testing, thereby optimizing the selected feature subset [20].

Yariyan et al. [3] applied multicollinearity analysis and Pearson correlation to assess the suitability of eleven snow avalanche conditioning factors in their study. Their findings demonstrated that all hybrid models performed strongly in evaluating the susceptibility of locations to avalanches. Tiwari et al. [15] emphasized the importance of identifying the most pertinent parameters before training. To demonstrate this, they applied parameter importance assessment (PIA) to predict AS along the Leh-Manali highway, one of the most affected regions in India. Eleven avalanche conditioning parameters were considered, and their relevance was evaluated using the Boruta algorithm.

A number of studies have been conducted on AS mapping in the Alpine region. Ghinoi and Chung [21] employed fuzzy logic with slope, aspect, elevation, distance from crest lines, and concavities/convexities as conditioning factors in Italian Dolomites and found that the accuracies of the AS maps were between 67% and 82%. As preliminary efforts, Cetinkaya and Kocaman [12] investigated the logistic regression (LR) and the RF methods in Davos, Switzerland, using a total of 12 conditioning factors. They stated that LR provided poor accuracy (0.74) when compared to RF (0.96). The inventory employed in the study was manually generated by Hafner and Bühler [22,23] based on satellite images [24] for two avalanche periods in 2018 and 2019. Hafner et al. [25] stated that when compiled from satellite images, small and medium-sized avalanches can be missed. In addition, shadows hinder avalanche detection. In a more recent study, Cetinkaya and Kocaman [26] assessed the effect of different sampling strategies for AS mapping with RF in the Gross Spannort Mountain region, Switzerland, using a total of 18 conditioning factors. They suggested that implementing appropriate feature elimination and backward selection strategies could improve AS maps, though further research is needed. Iban and Bilgilioglu [11] studied AS mapping using DT-based methods in the province of Sondrio, Italy, using a total of 17 conditioning factors and a point-based inventory with 1880 samples representing the avalanche starting points. The geographic distribution of avalanche starting points may not be uniform across the study area. Certain regions might have a higher density of samples due to accessibility, historical avalanche occurrences, or reporting biases. This could lead to an overrepresentation of specific areas and potentially skew the model’s performance in predicting AS in underrepresented regions. 
Regarding conditioning factor elimination, Iban and Bilgilioglu [11] applied a multicollinearity test but found no collinear factors to eliminate. They observed that XGBoost performed best with an AUC of 0.978, although the difference between the best and the worst model outputs was merely 1%, indicating similar performances among the evaluated methods.

The present study addresses the challenges of feature selection for AS mapping by combining a comprehensive set of factors with the sequential backward selection (SBS) strategy and the CatBoost algorithm to identify the minimum set of predictive variables for the target region. The CatBoost was preferred here because it is a decision-tree based method that is known for its success in many natural hazard assessment studies. It also has the advantage of handling categorical features without requiring extensive pre-processing and is often faster than RF, especially on large datasets. To the best of our knowledge, this represents a novel application of SBS in AS mapping. Reducing feature dimensionality would decrease the need for large inventories in ML modeling for AS mapping. This study integrates SBS with CatBoost by using feature selection capabilities of SBS to identify the most relevant features based on accuracy results from CatBoost at each step. The approach was demonstrated in three catchment areas in Alpine terrain—Engelberger Aa, Meienreuss, and Göschenerreuss—using a subset of the inventory compiled by [22]. The main contributions of this study include the following:

(a): An iterative feature selection and validation method, SBS, integrated with CatBoost was proposed for improving AS prediction accuracy.
(b): The Shapley additive explanations (SHAP) method [27,28] was employed to interpret the model results and analyze the influence of individual input features on model learning.

2. Materials and Methods

In Figure 1, a schematic overview of the proposed framework is presented. The framework utilizes the CatBoost algorithm, an advanced ensemble method known for its effective handling of categorical data. The feature selection was performed using SBS, aiming to optimize the model by iteratively removing the least significant variables based on variations in model accuracy. The GridSearchCV method [29] was employed to fine-tune model parameters, ensuring the best possible settings for achieving the highest performance. Precision, recall, specificity, negative predictive value (NPV), accuracy, and F1-score measures were used to validate the results at each iteration, enhancing model reliability and determining significant features. The SHAP values were used to interpret feature importance [27,28] as they have been successful in several other natural hazard susceptibility assessment studies (e.g., [11,30,31]).

2.1. Study Area and the Inventory

As shown in Figure 2, three catchment areas were selected in the Alpine terrain (Engelberger Aa, Meienreuss, and Göschenerreuss). All three catchments are part of the Reuss River basin, located in the Swiss Alps. Each catchment represents unique geomorphological and hydrological characteristics that contribute to the AS within the region. The Engelberger Aa catchment is the largest one (226.5 km²) among the three, with a complex terrain that includes high-altitude peaks and steep slopes. The Meienreuss (71.4 km²) and Göschenerreuss (92.8 km²) catchments are relatively small but still characterized by steep and rugged Alpine landscapes.

The dataset of avalanche inventory polygons was produced by Hafner and Bühler [22]. The investigation focused on extreme avalanche occurrences within a large avalanche period in January 2018, when Switzerland experienced a series of significant snowfall events that resulted in the first use of the highest avalanche danger level (level 5) since 1999. This inventory, compiled using 1.5 m resolution multispectral SPOT6 satellite imagery, provides a valuable foundation for modeling the AS, enabling comprehensive analyses that capture the spatial patterns of avalanche-prone and non-avalanche pixels across the selected catchments. Although the avalanches were provided as polygons (see purple polygons in Figure 2), the inventory was rasterized with a 10 m grid interval. Thus, large numbers of avalanche and non-avalanche pixels were obtained in the Engelberger Aa (126,915 avalanche and 2,138,089 non-avalanche), Meienreuss (47,714 avalanche and 667,143 non-avalanche), and Göschenerreuss (44,175 avalanche and 883,484 non-avalanche) catchments. To ensure a balance between the avalanche and non-avalanche samples, the number of the latter one was reduced to match the number of the avalanche pixels by random sampling, resulting in equal numbers of pixels (1:1 ratio between the classes) to avoid class imbalance problems. The dataset was randomly split into training (70%) and testing (30%) datasets for AS prediction using the model.
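The balancing and splitting steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function name, the seed, and the toy pixel identifiers are hypothetical, and the sketch assumes that the non-avalanche class is the larger one.

```python
import random

def balance_and_split(avalanche, non_avalanche, train_fraction=0.7, seed=42):
    """Undersample the majority class to a 1:1 ratio, then split at random."""
    rng = random.Random(seed)
    # Draw as many non-avalanche pixels as there are avalanche pixels.
    sampled_non = rng.sample(non_avalanche, k=len(avalanche))
    # Label avalanche pixels 1 and non-avalanche pixels 0, then shuffle.
    samples = [(p, 1) for p in avalanche] + [(p, 0) for p in sampled_non]
    rng.shuffle(samples)
    n_train = round(train_fraction * len(samples))
    return samples[:n_train], samples[n_train:]

# Toy pixel identifiers (hypothetical); the real inputs are 10 m raster cells.
avalanche_pixels = list(range(100))
non_avalanche_pixels = list(range(100, 1100))
train, test = balance_and_split(avalanche_pixels, non_avalanche_pixels)
```

In this sketch the split is purely random, as stated in the text, so the class ratio within each subset is only approximately 1:1.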

2.2. Conditioning Factors

As shown in the methodological workflow given in Figure 1, seventeen key factors were extracted to train the susceptibility model, based on their relevance to avalanche dynamics as determined from the literature and the available geospatial data. A digital elevation model (DEM) with 2 m resolution was obtained from swissALTI3D, provided by the Swiss Federal Office of Topography [32] (Figure 3). To increase computational efficiency, the DEM was resampled to 10 m.

Similarly, the land use and land cover (LULC) data sourced from the freely available databases of Swisstopo [32] were resampled to 10 m (Figure 4). As shown in the figure, the Engelberger Aa catchment is largely covered with forest and grassland, with several settlements in the region. The Meienreuss and Göschenerreuss have greater altitudes with almost no settlements. Higher altitudes usually have bareland as the LULC type.

Several of the conditioning factors used in the analysis were derived from the DEM using the SAGA GIS software [33]. Statistical summaries of these numerical factors are provided for each catchment in Table A1, Table A2 and Table A3 in Appendix A. Altitude affects temperature and precipitation patterns, with higher elevations often receiving more snowfall [34]. The gravitational impact on snow is determined by the slope; avalanches generally occur on slopes between 30° and 45° [35]. Aspect affects solar radiation exposure, impacting the melting and refreezing cycles that lead to snow stability. The LULC type, such as forests or bareland, significantly impacts snow stability and avalanche risk. Open areas are more prone to avalanches due to the absence of natural barriers [5,36].

Plan curvature affects the lateral accumulation of water and snow on a slope, whereas profile curvature affects their downslope movement. The topographic ruggedness index (TRI) [37] measures terrain roughness by assessing elevation changes between adjacent cells and was computed from the DEM using Equation (1).

TRI = \frac{\sum_{i=1}^{n} |z_{i} - \bar{z}|}{n}   (1)

where z_{i} is the elevation value of each cell, \bar{z} denotes the mean elevation of all cells within the window, and n is the number of cells in the defined window.
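As a worked illustration of Equation (1), the following sketch evaluates the TRI for a single moving window. The 3 × 3 window size and the elevation values are assumptions chosen for the example; the study itself derived the index with SAGA GIS.

```python
def tri_window(window):
    """Equation (1): mean absolute deviation of elevations from the window mean."""
    cells = [z for row in window for z in row]
    mean_z = sum(cells) / len(cells)
    return sum(abs(z - mean_z) for z in cells) / len(cells)

# Hypothetical 3 x 3 elevation window (metres) on a uniformly tilted surface.
window = [
    [100.0, 102.0, 104.0],
    [101.0, 103.0, 105.0],
    [102.0, 104.0, 106.0],
]
tri = tri_window(window)  # mean elevation is 103 m, deviations sum to 14 m
```

A perfectly flat window yields a TRI of 0, and the value grows as the elevations within the window spread further from their mean.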

Rugged terrain can create topographic barriers or act as a trigger for avalanches [2]. The vector ruggedness measure (VRM) is a 3D measure of ruggedness that accounts for slope and aspect variations. VRM affects snowpack stability due to variable grounding and distribution conditions, influencing avalanche potential.

The diurnal anisotropic heating (DAH) index represents the daytime warming of various slopes caused by solar radiation [38]. This differential warming influences the cycles of melting and refreezing, which are crucial for establishing snowpack stability and subsequent avalanche risk. The DAH was assessed using Equation (2) [39].

DAH = \cos(γ_{max} - γ) \times \arctan(λ)   (2)

where γ is the slope aspect, γ_{max} denotes the aspect with the maximum total heat excess, and λ represents the slope angle.
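Equation (2) can be evaluated per cell as in the sketch below. The default value of 202.5° for γ_max (a south-south-west aspect) is an assumption that is not stated in the text, and the slope angle is assumed to enter the arctangent in radians; both should be checked against the software settings actually used.

```python
import math

def dah(aspect_deg, slope_deg, aspect_max_deg=202.5):
    """Equation (2) for one cell; angles are given in degrees.

    aspect_max_deg is an assumed default, not a value taken from the study.
    """
    aspect_term = math.cos(math.radians(aspect_max_deg - aspect_deg))
    slope_term = math.atan(math.radians(slope_deg))
    return aspect_term * slope_term

# A 30 degree slope facing the assumed warmest aspect, and the opposite aspect.
warm_slope = dah(202.5, 30.0)
cold_slope = dah(22.5, 30.0)
```

Under these assumptions, slopes facing γ_max receive the largest positive values, slopes facing the opposite direction receive the most negative values, and flat terrain yields 0 regardless of aspect.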

The slope length factor (LS-factor) assesses the combined impact of slope length and steepness [40]. Longer, steeper slopes, characterized by higher LS-factor values, accumulate more snow, increasing the avalanche risk due to greater snow depth and pressure. The topographic wetness index (TWI) measures the tendency of the terrain to accumulate moisture by combining the local slope and the upstream drainage area. High TWI values can indicate zones with higher moisture content, potentially destabilizing the snowpack [41]. The convergence index (CI) (Equation (3)) quantifies terrain convergence or divergence [42]. Convergent zones, such as valleys, concentrate snow and water, increasing AS due to higher snow depths.

CI = \frac{\sum \arctan(\tan(θ_{i} - θ_{c}) - 90)}{n}   (3)

where (θ_{i} - θ_{c}) represents the angle between the aspect of the surrounding cell (θ_{i}) and the aspect of the center cell (θ_{c}), while n is the total number of surrounding cells considered within the moving square window.

The valley depth (VD) quantifies the relative depth of a valley compared to the surrounding peaks or ridges. Due to the topographic convergence in valleys, substantial snow accumulation often occurs, significantly increasing AS [43]. Water movement transports sediments deposited along the river, which are subsequently deposited as avalanche debris [44]. The distance to stream (DTS) influences this sediment transport process, affecting the accumulation of snow avalanche debris.

The wind exposition index (WEI) assesses the exposure of the terrain to wind direction [6]. The mid-slope position (MSP) parameter assigns a value of 0 to mid-slope positions, while maximum vertical distances in both valley and crest directions receive a value of 1 [45]. The topographic position index (TPI), as described by De Reu et al. [46], expresses the relative position of a location with respect to the surrounding topography.

2.3. Categorical Boosting (CatBoost)

CatBoost is a gradient boosting algorithm developed by Yandex [47], designed to handle categorical variables efficiently and robustly. It features an innovative approach to processing categorical features, eliminating the need for manual encoding and reducing the risk of overfitting. CatBoost’s ability to handle categorical variables makes it particularly well suited for datasets with mixed data types, common in AS mapping. The method employs an ordered boosting strategy, effectively mitigating target leakage during model training. This is achieved through a novel mathematical formulation that adjusts feature values based on target statistics, as illustrated in Equation (4):

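\hat{x}_{i,k} = \frac{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}] Y_{j}}{\sum_{j=1}^{n} [x_{j,k} = x_{i,k}]}   (4)

where x_{i,k} is the kth categorical feature of the ith sample, Y_{j} is the target value of the jth sample, n is the number of samples, and [·] equals 1 when the condition inside holds and 0 otherwise.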

Equation (4) computes the conditional expectation of the target variable, thereby enhancing the integrity of the data utilized for training decisions. Moreover, CatBoost incorporates permutations within the training process to refine feature values further, adapting to the inherent order of data entries as depicted in Equation (5):

x_{σ(p),k} = \frac{\sum_{j=1}^{p-1} [x_{σ(j),k} = x_{σ(p),k}] Y_{σ(j)} + βP}{\sum_{j=1}^{p-1} [x_{σ(j),k} = x_{σ(p),k}] + β}   (5)

where σ denotes a random permutation of the samples, and β and P denote the weight and the prior value, respectively.

The method facilitates a robust update mechanism that integrates both prior information and empirical data. These mathematical constructs allow CatBoost to effectively manage high-dimensional categorical data, reduce overfitting, and improve generalization across different datasets.
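The ordered target statistic in Equation (5) can be illustrated with the following sketch, which encodes each sample using only the samples that precede it in the permuted order. It is a simplified stand-in rather than CatBoost's internal implementation; the function name, the toy LULC labels, and the choice of the mean target as the prior P are assumptions.

```python
def ordered_target_statistics(categories, targets, prior, beta=1.0):
    """Encode each sample from the targets of earlier samples in the same category.

    The input order is assumed to be the permutation sigma already applied.
    """
    sums, counts, encoded = {}, {}, []
    for category, target in zip(categories, targets):
        previous_sum = sums.get(category, 0.0)
        previous_count = counts.get(category, 0)
        # Numerator and denominator of Equation (5) for the current sample.
        encoded.append((previous_sum + beta * prior) / (previous_count + beta))
        # Only now does the current sample become visible to later samples.
        sums[category] = previous_sum + target
        counts[category] = previous_count + 1
    return encoded

# Hypothetical LULC classes and avalanche labels for five samples.
lulc = ["forest", "bareland", "forest", "bareland", "forest"]
labels = [0, 1, 0, 1, 1]
prior = sum(labels) / len(labels)  # assumed prior: mean target value
encoded = ordered_target_statistics(lulc, labels, prior)
```

Because a sample's own target never enters its encoded value, this construction avoids the target leakage that a statistic computed over all samples would introduce.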

2.4. Sequential Backward Selection (SBS) Algorithm

SBS is a feature selection algorithm that iteratively removes features from the dataset based on specific criteria, such as the decrease in model performance upon removal. It starts with the entire set of features and progressively eliminates one feature at a time until the minimum number of features achieving high accuracy is reached. This technique helps in reducing dimensionality, improving model interpretability, and potentially enhancing model performance.

With SBS, the model is trained multiple times, each time omitting one of the features. The performance of each model configuration (with one feature removed) is evaluated and compared. The feature whose absence results in the smallest decrease in model performance (or even an increase) is permanently eliminated from the dataset. The elimination process is then repeated with the reduced feature set from the previous iteration, until a desired number of features is reached or no further improvement in model performance is observed. If multiple subsets achieve similar performance, the smallest feature set is selected to maintain model simplicity and efficiency.
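The elimination loop described above can be sketched in a model-agnostic way. In the sketch below, `score` is any function that returns a performance value for a feature subset; in the study this role is played by the accuracy of a tuned CatBoost model, whereas here a hypothetical toy scoring function is used so that the example stays self-contained. The sketch runs down to `min_features` and then selects the best subset from the recorded history.

```python
def sequential_backward_selection(features, score, min_features=1):
    """Greedy SBS: repeatedly drop the feature whose removal hurts the score least."""
    current = list(features)
    history = [(tuple(current), score(current))]
    while len(current) > min_features:
        # Evaluate every subset obtained by omitting exactly one feature.
        candidates = [[f for f in current if f != drop] for drop in current]
        scored = [(score(subset), subset) for subset in candidates]
        best_score, current = max(scored, key=lambda pair: pair[0])
        history.append((tuple(current), best_score))
    # Highest score wins; ties are resolved in favour of the smallest subset.
    subset, subset_score = max(history, key=lambda h: (h[1], -len(h[0])))
    return list(subset), subset_score, history

# Hypothetical toy scorer: informative features add to the score and every
# feature carries a small cost. It stands in for a cross-validated accuracy.
USEFUL = {"altitude": 0.30, "slope": 0.10, "aspect": 0.05}

def toy_score(subset):
    gain = sum(USEFUL.get(f, 0.0) for f in subset)
    return round(0.5 + gain - 0.01 * len(subset), 6)

features = ["altitude", "slope", "aspect", "twi", "tri"]
best, best_score, history = sequential_backward_selection(features, toy_score)
```

With a real classifier, each call to `score` involves a full training run, so the number of model fits grows quadratically with the number of features.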

Figure 5 illustrates SBS using the CatBoost algorithm, combined with GridSearchCV, to optimize feature selection for modeling. The process begins with a dataset comprising 17 different features or factors, such as altitude, slope, LULC, among others, which are considered for model training. The entire set of 17 features was used to train the initial CatBoost model, employing GridSearchCV for parameter optimization [48]. To achieve the highest accuracy, different parameters for learning rate, tree depth, and the number of iterations were tested. Learning rates of 0.2 and 0.3 were used, depending on the number of input features. A tree depth of 9 was consistently applied in all runs, with 250 and 500 iterations identified as the optimal values by GridSearchCV. This step involves systematically searching through different combinations of parameter values to find the most effective settings based on a chosen performance measure.

The training and testing subsets were partitioned using the Python-based scikit-learn package [48]. The same package was also used for the implementation of SBS, the tuning of initial parameters, the training of classifiers, and the subsequent performance evaluation. The LULC data, which included classes such as bareland, forest, grassland, settlement, and water bodies, were one-hot encoded, i.e., each class was transformed into a new binary feature indicating the presence (1) or absence (0) of that class. The models were tested with feature sets ranging from 17 to 2 features, revealing how the predictive power changes with varying levels of input complexity. To ensure comprehensive model evaluation, performance metrics, including accuracy, precision, and recall (see Section 2.6 for details), were used alongside a visual inspection of the output AS maps. Although the minimum feature sets for each catchment were determined with SBS, the results for sets with as few as two features are also provided in Appendix B and Appendix C for interpretation.
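The one-hot encoding of the LULC classes can be illustrated with a short sketch. The class list follows the classes named in the text; the identifiers and the function itself are hypothetical and independent of the scikit-learn routines used in the study.

```python
LULC_CLASSES = ["bareland", "forest", "grassland", "settlement", "water"]

def one_hot(label, classes=LULC_CLASSES):
    """Return one binary indicator per class: 1 for the matching class, else 0."""
    if label not in classes:
        raise ValueError(f"unknown LULC class: {label}")
    return [1 if label == name else 0 for name in classes]

forest_vector = one_hot("forest")
```

Each pixel thus contributes exactly one non-zero indicator among the LULC-derived binary features.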

2.5. SHAP (Shapley Additive Explanations)

The SHAP value quantifies the contribution of each feature to the model prediction for a particular sample [27,28]. The general formula for the model prediction is given in Equation (6):

y_{i} = y_{base} + f(x_{i,1}) + f(x_{i,2}) + \dots + f(x_{i,k})   (6)

where y_{i} is the predicted outcome for the ith sample, y_{base} denotes the baseline score of the model (often the mean prediction over the training set), k is the number of features, and f(x_{i,j}) represents the SHAP value corresponding to the jth feature’s contribution to the ith sample’s prediction. Essentially, the predicted value for a sample is composed of the baseline model score plus the cumulative contributions from each feature.
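The additive decomposition in Equation (6) can be checked numerically on a small example. The sketch below computes exact Shapley values by enumerating all feature subsets for a hypothetical three-feature scoring function. As a simplification, absent features are replaced by a single reference sample, whereas SHAP implementations typically average over a background dataset; the model, the feature values, and the reference sample are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, background):
    """Exact Shapley values of model(x) relative to a single reference sample."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's values, the rest the reference's.
        mixed = [x[j] if j in subset else background[j] for j in range(n)]
        return model(mixed)

    contributions = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {j}) - value(members))
        contributions.append(total)
    return contributions

def toy_model(features):
    # Hypothetical score with an altitude-slope interaction term.
    altitude, slope, dts = features
    return 0.001 * altitude + 0.02 * slope - 0.0005 * dts + 0.00001 * altitude * slope

sample = [2000.0, 35.0, 100.0]      # altitude (m), slope (deg), DTS (m)
reference = [1500.0, 20.0, 500.0]   # assumed reference sample
phi = shapley_values(toy_model, sample, reference)
y_base = toy_model(reference)
```

The sum of `y_base` and the entries of `phi` reproduces `toy_model(sample)`, which is the property expressed by Equation (6). The enumeration is exponential in the number of features and is therefore only practical for very small examples.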

The SHAP values provide a detailed interpretation of the model, calculating the individual and collective contributions of features to the predicted output. This enables a comprehensive understanding of the model functioning on both global and local scales. Regarding the AS assessments performed here, the SHAP values helped to interpret the specific contributions of various factors to the likelihood of an avalanche.

2.6. Validation

Here, we assessed the model performances both qualitatively and quantitatively, complemented by 5-fold cross-validation (CV) to analyze model robustness and generalizability. The qualitative assessments are based on visual inspections of the AS maps. Quantitative measures include precision (Equation (7)), recall (Equation (8)), F1-score (Equation (9)), specificity (Equation (10)), NPV (Equation (11)), and accuracy (Equation (12)) given in Table 2. These values were computed using the numbers of true positive (TP), true negative (TN), false negative (FN), and false positive (FP) classifications. TPs refer to correctly identified avalanche samples, TNs to correctly identified non-avalanche samples, FNs to undetected avalanche samples, and FPs to non-avalanche samples incorrectly identified as avalanches. While additional measures such as ROC-AUC (receiver operating characteristic, area under the curve) could also be computed, the six measures used here sufficiently demonstrate the reliability and completeness of the AS predictions for the avalanche class. Although a prediction made with CatBoost is a probability value ranging between 0 and 1 at each sample location, probabilities above 0.5 were classified as “true” (avalanche) and those below as “false” (no avalanche) for calculating the measures in Table 2. The values of precision, recall, F1-score, specificity, NPV, and accuracy also range from 0 to 1, where 0 indicates poor classification performance and 1 indicates perfect classification.
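The six measures can be computed from thresholded probabilities as in the sketch below. It uses the standard confusion-matrix definitions of these measures and the 0.5 threshold stated in the text; the probabilities and labels are hypothetical, and the sketch assumes that no denominator is zero.

```python
def classification_measures(probabilities, labels, threshold=0.5):
    """Compute the six validation measures for the avalanche (positive) class."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical predicted probabilities and reference labels (1 = avalanche).
probs = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05, 0.45]
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
measures = classification_measures(probs, truth)
```

In this toy case there are three TPs, one FN, one FP, and five TNs, so precision and recall are both 0.75 while the accuracy is 0.8.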

2.7. Multicollinearity Analysis

The variance inflation factor (VIF) and tolerance (TOL) are common multicollinearity analyses employed by researchers to identify and eliminate conditioning factors [2,10,11]. Multicollinearity is identified when the VIF value exceeds 10 or the TOL value falls below 0.1 [11]. Table 3 presents the VIF and TOL values for numerical conditioning factors across the three catchments (Engelberger Aa, Meienreuss, and Göschenerreuss). The results shown in the table indicate that most factors exhibit acceptable or moderate multicollinearity. Although the slope slightly exceeds the mentioned limits, it has not been removed from the feature set as it is an essential parameter in predicting AS. Thus, all 17 factors were used as input and analyzed with SBS-CatBoost. The results are given in the next section.
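The VIF and TOL values can be obtained by regressing each factor on the remaining ones, with TOL = 1 − R² and VIF = 1/TOL. The following sketch uses NumPy least squares on synthetic data; the data, the random seed, and the function name are assumptions, and the sketch does not reproduce the values in Table 3.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return (VIF, TOL) for each column of X via ordinary least squares."""
    X = np.asarray(X, dtype=float)
    results = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress factor j on all other factors plus an intercept.
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)
        tolerance = 1.0 - r_squared
        results.append((1.0 / tolerance, tolerance))
    return results

# Synthetic factors: the third is nearly a copy of the first (collinear pair).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
stats = vif_and_tolerance(np.column_stack([a, b, c]))
```

With the thresholds quoted above (VIF above 10 or TOL below 0.1), the first and third synthetic factors would be flagged as collinear, while the second would not.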

3. Results

This study evaluated the performance of the SBS-CatBoost model trained on varying combinations of conditioning factors for AS mapping in the Engelberger Aa, Meienreuss, and Göschenerreuss catchments. The accuracy measures for each catchment are given in Table 4. The Engelberger Aa catchment, the largest of the three with 226.5 km², achieved optimal accuracy with 11 features. In Meienreuss, the optimal accuracy was achieved with seven features, as it is also the smallest one with an area of 71.4 km². All AS maps, generated with predicted values ranging from 0 to 1.0, were reclassified into categories of very low, low, moderate, high, and very high susceptibility areas for visual interpretation, using equal intervals of 0.2. These correspond to the intervals of 0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, and 0.8–1.0, respectively. The final accuracy values are ordered similarly based on the catchment size. The results demonstrate a direct relationship between catchment size, environmental complexity, and accuracy. As expected, smaller regions can be modeled with fewer features. The results for each catchment are explained in the following sub-headings. The 95% confidence intervals for the mean CV scores further validate the reliability of these models, as shown in Table 4 for Engelberger Aa, Meienreuss, and Göschenerreuss. These intervals indicate that the observed high-performance measures are consistent and robust across different CV runs, suggesting that the models are well calibrated and reliable for predicting AS in diverse catchments.
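The equal-interval reclassification of the predicted probabilities into five susceptibility classes can be sketched as follows. The assignment of values lying exactly on a class boundary to the upper class is an assumption, since the text does not specify how boundary values were treated.

```python
from bisect import bisect_right

CLASS_BREAKS = [0.2, 0.4, 0.6, 0.8]
CLASS_NAMES = ["very low", "low", "moderate", "high", "very high"]

def susceptibility_class(probability):
    """Map a predicted probability in [0, 1] to one of five equal-interval classes."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # bisect_right places a value equal to a break into the upper class (assumed).
    return CLASS_NAMES[bisect_right(CLASS_BREAKS, probability)]

example_classes = [susceptibility_class(p) for p in (0.1, 0.3, 0.5, 0.7, 0.95)]
```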

3.1. Engelberger Aa Catchment

The best-performing model utilized 11 features (Figure 6) and achieved an accuracy of 0.94, with precision and recall for the avalanche class notably high at 0.91 and 0.96, respectively (see Table 4). This indicated that the model effectively identifies avalanche-prone and non-avalanche areas with a high degree of reliability. The resulting map (Figure 6a) illustrated regions with high AS, primarily along specific elevations and terrain features. According to the SHAP graph (Figure 6b), in the Engelberger Aa region, altitude, DTS, and aspect have significant impacts on the model outputs. Higher altitude values are positively correlated with avalanche occurrences, while lower altitude values negatively influence them, as interpreted from the SHAP graph (Figure 6b). On the other hand, greater distances from streams are associated with lower susceptibility to avalanches, whereas closer proximity (low DTS value) has a positive correlation. Regarding the aspect, 0°, 90°, 180°, and 270° correspond to north, east, south, and west. Thus, it can be interpreted from Figure 6b that low aspect values (close to north) have a positive influence on avalanche occurrence.

3.2. Meienreuss Catchment

The model, using just seven features, achieved robust performance with an overall accuracy of 0.96. It balances precision (0.94) and recall (0.97) for the avalanche class, indicating strong predictive power despite the reduced number of features. The resulting map (Figure 7a) shows highly susceptible areas concentrated in specific regions, likely reflecting local topographical influences. Figure 7b shows that VD, altitude, and aspect have the greatest impact on AS in this catchment. The graph shows that high VD values lead to lower AS. The relationship between the altitude and aspect features and AS is similar to that observed in the Engelberger Aa catchment.

3.3. Göschenerreuss Catchment

With nine features, this model delivered outstanding performance, achieving an accuracy of 0.97. It demonstrated exceptionally high recall (0.98) for the avalanche class, highlighting its ability to identify AS accurately. The distribution of highly susceptible areas was more widespread (Figure 8a), suggesting that varied terrain or more comprehensive data coverage influenced the predictions. According to the SHAP values (Figure 8b), aspect, altitude, and DTS are highly influential, similar to the results of the Engelberger Aa catchment.

4. Discussion

In this study, a comprehensive feature selection and elimination strategy, SBS, was evaluated for AS mapping using the CatBoost algorithm in three neighboring catchments in the Swiss Alps. As avalanches are complex natural phenomena influenced by numerous environmental and geological conditions, the probability of their spatial occurrence can be controlled by a large number of factors. However, a factor set that is effective in one region may be less influential in another. For this reason, this study started from a large feature set with 17 inputs and sequentially eliminated features by analyzing their influence on model performance. The parameter optimization for CatBoost was also performed using an automated algorithm to achieve the best possible accuracy.
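The elimination loop described above can be sketched generically. The following is a minimal greedy backward selection, not the authors' code: `sequential_backward_selection`, `toy_score`, and the feature names are hypothetical, and the scoring function is a stand-in for what in the study would be a CatBoost model tuned by grid search and evaluated with cross-validation.

```python
from typing import Callable, List, Sequence, Tuple


def sequential_backward_selection(
    features: Sequence[str],
    score_fn: Callable[[Tuple[str, ...]], float],
    min_features: int = 2,
) -> List[Tuple[Tuple[str, ...], float]]:
    """Greedy backward elimination.

    At each step every remaining feature is tentatively dropped, and the
    reduced subset with the best score is kept. Returns the (subset, score)
    history, starting with the full feature set.
    """
    current = tuple(features)
    history = [(current, score_fn(current))]
    while len(current) > min_features:
        # One candidate subset per feature that could be removed.
        candidates = [tuple(f for f in current if f != dropped) for dropped in current]
        scored = [(score_fn(c), c) for c in candidates]
        best_score, best_subset = max(scored, key=lambda t: t[0])
        current = best_subset
        history.append((current, best_score))
    return history


# Toy scorer (hypothetical): three informative features plus a small penalty per feature.
USEFUL = {"altitude": 0.5, "aspect": 0.3, "dts": 0.15}


def toy_score(subset: Tuple[str, ...]) -> float:
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)


history = sequential_backward_selection(
    ["altitude", "aspect", "dts", "slope", "twi"], toy_score, min_features=2
)
best_subset, best_score = max(history, key=lambda t: t[1])
```

With this toy scorer the two uninformative features are removed first, and the score drops once an informative feature has to be eliminated, which mirrors the accuracy curves reported in Appendix B.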

Across the three catchments, the model performances were robust, with accuracies ranging from 0.94 to 0.97. These results are notable, considering the complexity and variability of the factors influencing avalanche occurrences. The iterative process of feature elimination using SBS revealed that not all 17 conditioning factors were equally important in all catchments, even though the catchments are neighboring. Key factors such as altitude, aspect, and valley depth consistently showed a strong influence on model prediction accuracies across all catchments, underscoring their critical role in avalanche dynamics.

Interestingly, the results also highlighted the lesser relevance of factors such as plan and profile curvature, DAH, and TPI, which were excluded from the optimized models at earlier stages (see Appendix B). The LULC features were included mainly in the models with larger feature sets (16 features) across all catchments, which can be attributed to their weaker relation with AS. This suggests that, while LULC is informative, other factors such as altitude, aspect, and distance to stream might carry more predictive weight in the reduced models, particularly when computational efficiency or model simplicity is prioritized.

The initial analysis of multicollinearity indicated that only slope showed high correlations with the other factors. However, it was not omitted from the initial set, as the level of correlation was not excessively high. Nevertheless, slope was not included in the final feature set for any of the catchments, likely because its effect is covered by other topographic features.

Although the inventory used here is comprehensive [22,24], possibly the most extensive one in the literature, it is also subject to limitations arising from data collection conditions, such as shadows and image resolution [23,25], its representativeness for only a large avalanche period, and avalanche boundary uncertainty caused by operator differences [49]. Therefore, the results must be interpreted accordingly.
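The VIF and TOL values in Table 3 are reciprocal (TOL = 1/VIF; for example, 1/2.606 ≈ 0.384 for altitude in Engelberger Aa). A minimal sketch of how such values can be computed is given below, assuming an ordinary least squares regression of each factor on all the others. The function name `vif_and_tol` and the synthetic data are illustrative only and are not taken from the study.

```python
import numpy as np


def vif_and_tol(X):
    """Return (VIF, TOL) arrays for the columns of a 2-D feature matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vif = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining factors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()  # R^2 of factor j explained by the others
        vif[j] = 1.0 / (1.0 - r2)
    return vif, 1.0 / vif


# Synthetic example: column c is nearly a copy of column a, column b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vif, tol = vif_and_tol(np.column_stack([a, b, c]))
```

In this synthetic case the two nearly collinear columns receive large VIF values, while the independent column stays close to 1.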

5. Conclusions and Future Work

This study evaluated the usability of SBS with the CatBoost model for AS modeling. By starting with a large feature set and applying advanced feature selection, the approach enhanced model accuracy and interpretability while providing insights into avalanche conditioning factors. The proposed feature elimination strategy yielded varying numbers of features across the three catchments, ranging from seven to eleven, based on the specific characteristics and sizes of each catchment. The SHAP values provided a unified measure of feature importance and enhanced interpretability by attributing the contribution of each feature to the model outputs. By incorporating SHAP values into the integrated SBS-CatBoost framework, we gained deeper insights into the relative importance of selected features and their impact on AS mapping. Features such as altitude, aspect, and VD were consistently important across all three catchments, highlighting their general significance in AS mapping. Accuracy values ranged from 0.94 to 0.97, showing a linear relationship with catchment size.

The SBS algorithm can be computationally intensive, especially with a large number of features, as it involves iteratively removing features and retraining the model, which is time-consuming and resource-intensive. Future work could explore how multicollinearity screening with VIF and TOL, applied or omitted before SBS, affects the selection results.
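As a rough illustration of this cost, consider the following count. It is a sketch under assumptions not stated in the text: each elimination step with $m$ remaining features evaluates all $m$ single-feature removals, and every evaluation runs a grid search over $G$ hyperparameter combinations with $K$-fold CV. Reducing $n$ features to $k$ then requires

$$N_{\text{subsets}} = \sum_{m=k+1}^{n} m = \frac{n(n+1)}{2} - \frac{k(k+1)}{2}, \qquad N_{\text{fits}} = G \cdot K \cdot N_{\text{subsets}}.$$

For example, with $n = 17$ and $k = 2$, $N_{\text{subsets}} = 153 - 3 = 150$. With hypothetical values $G = 20$ and $K = 5$, this would amount to 15,000 model fits per catchment.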

Despite the strengths of the integrated SBS-CatBoost approach, the study faces limitations related to the inherent challenges of modeling this natural phenomenon. The variability in feature importance across different catchments suggested that regional characteristics can significantly influence model performance and feature selection. Additionally, the reliance on historical data and the assumptions inherent in the modeling process may affect the generalizability of the results to other regions or future scenarios.

Author Contributions

Sinem Cetinkaya contributed to conceptualization, methodology, validation, writing—original draft preparation, software, data curation, investigation, and visualization. Sultan Kocaman contributed to conceptualization, methodology, validation, writing—original draft preparation, writing—review and editing, and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

Sinem Cetinkaya received a PhD grant from the YOK 100/2000 program.

Data Availability Statement

The avalanche data used in this study were originally provided by Hafner and Bühler [22] https://doi.org/10.16904/envidat.77 (accessed on 20 December 2021). The authors do not have permission to share the data.

Acknowledgments

This study is part of the PhD thesis research of Sinem Cetinkaya.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In our study focused on AS modeling in the Engelberger Aa, Meienreuss, and Göschenerreuss catchments, we provide comprehensive statistical summaries of key environmental and topographical factors across three descriptive statistics tables (Table A1, Table A2 and Table A3). These tables delineate the variations in factors related to avalanche (1) and non-avalanche (0) occurrences within each specific catchment area. Each row in the table represents a distinct factor, offering insights into its mean, standard deviation, minimum, first quartile (0.25), median (0.50), third quartile (0.75), and maximum values across both categories.
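Summaries with the columns used in Tables A1 to A3 can be produced per inventory class as sketched below. This is an illustration with made-up altitude values, not the study data; the function name `describe_by_class` is hypothetical, and the use of the sample standard deviation (`ddof=1`) and linearly interpolated quartiles is an assumption, since the text does not state which estimators were used.

```python
import numpy as np


def describe_by_class(values, labels):
    """Per-class mean, std, min, quartiles, and max for one conditioning factor."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    summary = {}
    for cls in np.unique(labels):
        v = values[labels == cls]
        q25, q50, q75 = np.percentile(v, [25, 50, 75])
        summary[int(cls)] = {
            "mean": v.mean(),
            "std": v.std(ddof=1),  # sample standard deviation (assumption)
            "min": v.min(),
            "0.25": q25,
            "0.50": q50,
            "0.75": q75,
            "max": v.max(),
        }
    return summary


# Made-up altitudes (m); label 1 = avalanche, 0 = non-avalanche.
altitude = [1000, 1500, 2000, 2500, 1800, 2000, 2200]
inventory = [0, 0, 0, 0, 1, 1, 1]
stats = describe_by_class(altitude, inventory)
```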

Table A1. Descriptive statistics of Engelberger Aa factors based on avalanche inventory.

Factors	Inventory	Mean	Std Dev	Min	0.25	0.50	0.75	Max
Altitude (m)	0	1593.24	588.45	458.39	1150.01	1572.94	2035.39	3233.93
Altitude (m)	1	1911.27	350.30	846.60	1668.41	1924.14	2171.01	2897.65
Aspect (°)	0	201.05	109.28	−1.00	104.52	220.52	299.49	360.00
Aspect (°)	1	170.60	87.72	0.00	116.92	159.34	224.30	359.99
CI	0	0.07	5.89	−96.81	−1.97	0.04	2.04	96.76
CI	1	−1.17	4.30	−88.14	−2.82	−0.76	1.01	63.09
DAH	0	−0.03	0.35	−0.97	−0.30	−0.03	0.23	0.97
DAH	1	0.10	0.36	−0.95	−0.17	0.15	0.39	0.95
DTS	0	135.55	155.86	0.00	30.00	80.62	183.58	1124.50
DTS	1	61.98	116.85	0.00	10.00	22.36	58.31	842.14
LS-Factor	0	34.72	39.33	0.00	16.34	28.03	41.90	2605.34
LS-Factor	1	50.22	50.25	0.00	27.65	38.34	54.88	1580.63
MSP	0	0.54	0.30	0.00	0.29	0.56	0.81	1.00
MSP	1	0.54	0.29	0.00	0.30	0.57	0.80	1.00
Plan Curv.	0	0.00	0.01	−0.17	0.00	0.00	0.00	0.30
Plan Curv.	1	0.00	0.01	−0.14	−0.01	0.00	0.00	0.13
Profile Curv.	0	0.00	0.01	−0.41	0.00	0.00	0.00	0.35
Profile Curv.	1	0.00	0.01	−0.25	0.00	0.00	0.00	0.24
Slope (°)	0	29.96	15.24	0.00	19.47	30.21	38.41	87.56
Slope (°)	1	33.92	13.21	0.09	25.66	33.51	40.29	86.09
TPI	0	0.40	12.67	−124.38	−4.95	−0.44	4.41	284.12
TPI	1	−4.20	11.93	−100.18	−8.31	−3.17	0.66	134.30
TRI	0	4.55	4.34	0.00	2.40	3.86	5.24	191.15
TRI	1	5.14	4.05	0.01	3.18	4.37	5.66	124.74
TWI	0	5.95	2.14	−0.65	4.53	5.65	6.95	23.64
TWI	1	6.26	1.93	−0.02	4.92	6.12	7.37	18.44
VD	0	118.74	106.56	−332.17	32.25	87.27	181.83	509.81
VD	1	152.48	116.11	−23.52	52.58	125.83	236.44	471.73
VRM	0	0.01	0.02	0.00	0.00	0.00	0.01	0.57
VRM	1	0.01	0.02	0.00	0.00	0.00	0.01	0.46
WEI	0	1.00	0.09	0.77	0.95	1.00	1.06	1.34
WEI	1	1.01	0.07	0.78	0.96	1.01	1.05	1.33

Table A2. Descriptive statistics of Meienreuss factors based on avalanche inventory.

Factors	Inventory	Mean	Std Dev	Min	0.25	0.50	0.75	Max
Altitude (m)	0	2166.36	466.11	819.96	1826.98	2205.81	2517.94	3410.88
Altitude (m)	1	2256.97	382.40	1073.60	2013.04	2304.16	2556.63	2983.17
Aspect (°)	0	159.51	101.57	−1.00	66.47	163.80	226.03	360.00
Aspect (°)	1	136.73	69.27	−1.00	86.04	131.60	178.82	359.98
CI	0	0.12	5.09	−88.31	−1.93	−0.03	1.89	89.33
CI	1	−1.66	4.16	−70.98	−3.52	−1.28	0.63	51.92
DAH	0	−0.01	0.43	−0.95	−0.40	0.00	0.38	0.96
DAH	1	0.06	0.35	−0.91	−0.21	0.08	0.33	0.93
DTS	0	617.98	445.85	0.00	260.00	537.12	881.42	2304.10
DTS	1	597.44	440.60	0.00	199.25	518.65	991.82	1765.11
LS-Factor	0	41.65	53.50	0.00	23.23	34.22	48.49	4021.48
LS-Factor	1	51.87	40.75	0.00	28.98	41.20	60.53	1233.13
MSP	0	0.52	0.29	0.00	0.26	0.52	0.77	1.00
MSP	1	0.52	0.30	0.00	0.25	0.51	0.80	1.00
Plan Curv.	0	0.00	0.01	−0.15	0.00	0.00	0.00	0.16
Plan Curv.	1	0.00	0.01	−0.14	−0.01	0.00	0.00	0.08
Profile Curv.	0	0.00	0.01	−0.14	0.00	0.00	0.00	0.18
Profile Curv.	1	0.00	0.01	−0.13	−0.01	0.00	0.00	0.06
Slope (°)	0	34.68	13.46	0.00	25.83	34.23	43.01	83.71
Slope (°)	1	30.72	12.42	0.00	21.68	30.51	38.08	77.89
TPI	0	0.66	12.02	−65.43	−5.79	−0.83	5.03	146.03
TPI	1	−5.34	8.78	−47.42	−9.79	−4.58	−0.53	83.74
TRI	0	5.18	3.22	0.00	3.23	4.49	6.26	75.05
TRI	1	4.32	2.57	0.00	2.63	3.89	5.21	38.02
TWI	0	5.73	1.97	0.05	4.38	5.61	6.81	21.00
TWI	1	6.87	2.01	1.21	5.52	6.75	8.09	18.78
VD	0	138.90	127.06	−164.37	35.18	99.55	218.08	792.84
VD	1	165.87	115.90	−14.09	83.89	141.40	207.34	485.38
VRM	0	0.01	0.02	0.00	0.00	0.00	0.01	0.66
VRM	1	0.01	0.02	0.00	0.00	0.00	0.01	0.53
WEI	0	1.00	0.08	0.76	0.96	1.00	1.05	1.34
WEI	1	0.98	0.06	0.78	0.95	0.99	1.02	1.32

Table A3. Descriptive statistics of Göschenerreuss factors based on avalanche inventory.

Factors	Inventory	Mean	Std Dev	Min	0.25	0.50	0.75	Max
Altitude (m)	0	2372.25	502.28	0.00	2009.94	2414.89	2755.65	3628.03
Altitude (m)	1	2370.60	370.32	1171.30	2122.57	2392.38	2631.75	3214.11
Aspect (°)	0	160.10	105.27	−1.00	71.91	150.96	226.27	360.00
Aspect (°)	1	166.62	65.73	0.03	116.26	176.04	210.56	359.92
CI	0	0.07	5.32	−89.97	−1.93	−0.06	1.79	93.95
CI	1	−1.37	4.15	−52.58	−3.10	−0.87	1.03	34.15
DAH	0	−0.04	0.41	−0.95	−0.39	−0.08	0.33	0.94
DAH	1	0.22	0.33	−0.85	−0.03	0.28	0.47	0.94
DTS	0	790.18	554.85	0.00	339.41	694.62	1157.67	3035.34
DTS	1	714.37	509.53	0.00	272.03	643.27	1114.14	2308.68
LS-Factor	0	42.17	39.42	0.00	22.98	34.96	50.94	3612.33
LS-Factor	1	54.71	51.88	0.00	26.93	41.13	63.96	1138.44
MSP	0	0.52	0.30	0.00	0.26	0.53	0.78	1.00
MSP	1	0.55	0.31	0.00	0.27	0.59	0.83	1.00
Plan Curv.	0	0.00	0.01	−0.17	0.00	0.00	0.00	0.21
Plan Curv.	1	0.00	0.01	−0.15	−0.01	0.00	0.00	0.09
Profile Curv.	0	0.00	0.01	−0.16	0.00	0.00	0.00	0.19
Profile Curv.	1	0.00	0.01	−0.19	−0.01	0.00	0.00	0.06
Slope (°)	0	33.64	14.49	0.00	24.31	32.88	42.31	84.51
Slope (°)	1	30.66	13.57	0.10	20.99	29.96	37.84	81.42
TPI	0	0.50	12.47	−99.56	−5.53	−0.68	4.53	137.40
TPI	1	−4.56	10.93	−96.50	−8.83	−3.43	0.76	58.42
TRI	0	5.09	3.54	0.00	3.02	4.27	6.11	69.97
TRI	1	4.47	3.23	0.15	2.56	3.81	5.16	52.46
TWI	0	6.04	2.28	0.08	4.54	5.88	7.19	22.93
TWI	1	6.91	1.97	0.61	5.51	6.84	8.14	17.24
VD	0	141.79	143.02	−111.37	28.48	86.56	223.07	710.84
VD	1	182.72	145.79	−25.12	70.43	127.31	300.43	563.13
VRM	0	0.01	0.03	0.00	0.00	0.00	0.01	0.60
VRM	1	0.01	0.02	0.00	0.00	0.00	0.01	0.51
WEI	0	1.00	0.08	0.76	0.96	1.01	1.05	1.34
WEI	1	0.98	0.06	0.80	0.94	0.99	1.02	1.23

Appendix B

The results obtained from the SBS algorithm implemented with the CatBoost model in the three alpine catchments (Engelberger Aa, Meienreuss, and Göschenerreuss) are presented in Table A4, Table A5 and Table A6. The tables show how the classification accuracy evolves as features are systematically removed, illustrating the effectiveness of the combined SBS-CatBoost approach in enhancing the predictive accuracy of AS models.

Table A4. Selected features and classification accuracy for AS models in Engelberger Aa.

No. of Features	Selected Features of Engelberger Aa	Accuracy
16	Altitude, Aspect, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, WEI, Bareland, Forest, Grassland, Settlement, Water Bodies	0.93
15	Altitude, Aspect, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, WEI, Bareland, Forest, Grassland, Water Bodies	0.94
14	Altitude, Aspect, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, WEI, Bareland, Grassland, Water Bodies	0.93
13	Altitude, Aspect, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, WEI, Bareland, Grassland	0.93
12	Altitude, Aspect, DAH, DTS, MSP, TPI, TRI, TWI, VD, WEI, Bareland, Grassland	0.93
11	Altitude, Aspect, DTS, MSP, TPI, TRI, TWI, VD, WEI, Bareland, Grassland	0.94 *
10	Altitude, Aspect, DTS, MSP, TPI, TRI, TWI, VD, WEI, Grassland	0.93
9	Altitude, Aspect, DTS, MSP, TPI, TRI, VD, WEI, Grassland	0.93
8	Altitude, Aspect, DTS, TPI, TRI, VD, WEI, Grassland	0.93
7	Altitude, Aspect, DTS, TRI, VD, WEI, Grassland	0.92
6	Altitude, Aspect, DTS, TRI, VD, WEI	0.92
5	Altitude, Aspect, DTS, VD, WEI	0.90
4	Altitude, Aspect, DTS, VD	0.87
3	Altitude, Aspect, DTS	0.82
2	Altitude, DTS	0.76

* The best performance of the model.

Table A5. Selected features and classification accuracy for AS models in Meienreuss.

No. of Features	Selected Features of Meienreuss	Accuracy
16	Altitude, Aspect, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, VRM, WEI, Forest, Grassland, Settlement, Water Bodies	0.96
15	Altitude, Aspect, DAH, DTS, MSP, TPI, TRI, TWI, VD, VRM, WEI, Forest, Grassland, Settlement, Water Bodies	0.96
14	Altitude, Aspect, DAH, DTS, MSP, TPI, TRI, TWI, VD, WEI, Forest, Grassland, Settlement, Water Bodies	0.96
13	Altitude, Aspect, DTS, MSP, TPI, TRI, TWI, VD, WEI, Forest, Grassland, Settlement, Water Bodies	0.96
12	Altitude, Aspect, DTS, MSP, TPI, TRI, TWI, VD, WEI, Forest, Grassland, Settlement	0.96
11	Altitude, Aspect, DTS, MSP, TPI, TRI, TWI, VD, WEI, Grassland, Settlement	0.96
10	Altitude, Aspect, DTS, MSP, TPI, TRI, TWI, VD, WEI, Grassland	0.96
9	Altitude, Aspect, DTS, MSP, TPI, TWI, VD, WEI, Grassland	0.96
8	Altitude, Aspect, DTS, MSP, TPI, TWI, VD, WEI	0.96
7	Altitude, Aspect, DTS, MSP, TWI, VD, WEI	0.96 *
6	Altitude, Aspect, DTS, MSP, VD, WEI	0.95
5	Altitude, Aspect, DTS, VD, WEI	0.95
4	Altitude, Aspect, DTS, VD	0.94
3	Altitude, DTS, VD	0.89
2	Altitude, VD	0.75

* The best performance of the model.

Table A6. Selected features and classification accuracy for AS models in Göschenerreuss.

No. of Features	Selected Features of Göschenerreuss	Accuracy
16	Altitude, Aspect, CI, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, WEI, Bareland, Forest, Grassland, Settlement	0.96
15	Altitude, Aspect, CI, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, WEI, Bareland, Grassland, Settlement	0.97
14	Altitude, Aspect, CI, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, WEI, Grassland, Settlement	0.96
13	Altitude, Aspect, CI, DAH, DTS, LS-Factor, MSP, TPI, TRI, TWI, VD, WEI, Grassland	0.97
12	Altitude, Aspect, CI, DAH, DTS, LS-Factor, MSP, TPI, TRI, VD, WEI, Grassland	0.96
11	Altitude, Aspect, DAH, DTS, LS-Factor, MSP, TPI, TRI, VD, WEI, Grassland	0.96
10	Altitude, Aspect, DTS, LS-Factor, MSP, TPI, TRI, VD, WEI, Grassland	0.97
9	Altitude, Aspect, DTS, LS-Factor, MSP, TPI, VD, WEI, Grassland	0.97 *
8	Altitude, Aspect, DTS, MSP, TPI, VD, WEI, Grassland	0.96
7	Altitude, Aspect, DTS, MSP, VD, WEI, Grassland	0.96
6	Altitude, Aspect, DTS, MSP, VD, WEI	0.96
5	Altitude, Aspect, DTS, VD, WEI	0.96
4	Altitude, Aspect, DTS, VD	0.94
3	Altitude, DTS, VD	0.90
2	Altitude, DTS	0.76

* The best performance of the model.
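The starred rows in Tables A4 to A6 are consistent with a simple tie-breaking rule: among the feature sets with the highest (rounded) accuracy, take the one with the fewest features. This rule is inferred from the tables and is an assumption rather than a stated procedure; the underlying unrounded accuracies may have decided the ties differently. A minimal sketch, using the values of Table A4 and a hypothetical helper `pick_smallest_best`:

```python
# (number of features, accuracy) pairs transcribed from Table A4 (Engelberger Aa).
TABLE_A4 = [
    (16, 0.93), (15, 0.94), (14, 0.93), (13, 0.93), (12, 0.93),
    (11, 0.94), (10, 0.93), (9, 0.93), (8, 0.93), (7, 0.92),
    (6, 0.92), (5, 0.90), (4, 0.87), (3, 0.82), (2, 0.76),
]


def pick_smallest_best(rows):
    """Return the smallest feature count among the rows with the highest accuracy."""
    best_acc = max(acc for _, acc in rows)
    return min(n for n, acc in rows if acc == best_acc)


selected = pick_smallest_best(TABLE_A4)  # 11, the starred row of Table A4
```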

Appendix C

Here, the AS maps for the Engelberger Aa, Meienreuss, and Göschenerreuss catchments are presented to demonstrate the impact of feature set reduction on the spatial predictions of avalanche occurrences.

Figure A1. The AS maps for Engelberger Aa, Meienreuss, and Göschenerreuss catchments, based on models with varying numbers of features (16, 15, 14, 13, and 12).

Figure A2. The AS maps for Engelberger Aa, Meienreuss, and Göschenerreuss catchments, based on models with varying numbers of features (11, 10, 9, 8, and 7).

Figure A3. The AS maps for Engelberger Aa, Meienreuss, and Göschenerreuss catchments, based on models with varying numbers of features (6, 5, 4, 3, and 2).

References

Bergua, S.B.; Piedrabuena, M.P.; Alfonso, J.L.M. Snow avalanche susceptibility in the eastern hillside of the Aramo Range (Asturian Central Massif, Cantabrian Mountains, NW Spain). J. Maps 2018, 14, 373–381.
Liu, Y.; Chen, X.; Yang, J.; Li, L.; Wang, T. Snow avalanche susceptibility mapping from tree-based machine learning approaches in ungauged or poorly-gauged regions. Catena 2023, 224, 106997.
Yariyan, P.; Omidvar, E.; Karami, M.; Cerdà, A.; Pham, Q.B.; Tiefenbacher, J.P. Evaluating novel hybrid models based on GIS for snow avalanche susceptibility mapping: A comparative study. Cold Reg. Sci. Technol. 2022, 194, 103453.
Varol, N. Avalanche susceptibility mapping with the use of frequency ratio, fuzzy and classical analytical hierarchy process for Uzungol area, Turkey. Cold Reg. Sci. Technol. 2022, 194, 103439.
Akbar, M.; Bhat, M.S.; Chanda, A.; Lone, F.A.; Thoker, I.A. Integrating Traditional Knowledge with GIS for Snow Avalanche Susceptibility Mapping in Kargil-Ladakh Region of Trans-Himalayan India. Spat. Inf. Res. 2022, 30, 773–789.
Durlević, U.; Valjarević, A.; Novković, I.; Ćurčić, N.B.; Smiljić, M.; Morar, C.; Stoica, A.; Barišić, D.; Lukić, T. GIS-Based Spatial Modeling of Snow Avalanches Using Analytic Hierarchy Process. A Case Study of the Šar Mountains, Serbia. Atmosphere 2022, 13, 1229.
Kumar, S.; Srivastava, P.K.; Snehmani Bhatiya, S. Geospatial probabilistic modelling for release area mapping of snow avalanches. Cold Reg. Sci. Technol. 2019, 165, 102813.
Choubin, B.; Borji, M.; Mosavi, A.; Sajedi-Hosseini, F.; Singh, V.P.; Shamshirband, S. Snow avalanche hazard prediction using machine learning methods. J. Hydrol. 2019, 577, 123929.
Rahmati, O.; Ghorbanzadeh, O.; Teimurian, T.; Mohammadi, F.; Tiefenbacher, J.P.; Falah, F.; Pirasteh, S.; Ngo, P.-T.T.; Bui, D.T. Spatial Modeling of Snow Avalanche Using Machine Learning Models and Geo-Environmental Factors: Comparison of Effectiveness in Two Mountain Regions. Remote Sens. 2019, 11, 2995.
Wen, H.; Wu, X.; Liao, X.; Wang, D.; Huang, K.; Wünnemann, B. Application of machine learning methods for snow avalanche susceptibility mapping in the Parlung Tsangpo catchment, southeastern Qinghai-Tibet Plateau. Cold Reg. Sci. Technol. 2022, 198, 103535.
Iban, M.C.; Bilgilioglu, S.S. Snow avalanche susceptibility mapping using novel tree-based machine learning algorithms (XGBoost, NGBoost, and LightGBM) with eXplainable Artificial Intelligence (XAI) approach. Stoch. Environ. Res. Risk Assess. 2023, 37, 2243–2270.
Cetinkaya, S.; Kocaman, S. Snow Avalanche Susceptibility Mapping for Davos, Switzerland. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. 2022, XLIII-B3-2022, 1083–1090.
Costache, R.; Arabameri, A.; Costache, I.; Crăciun, A.; Islam, A.R.M.T.; Abba, S.; Sahana, M.; Pham, B.T. Flood susceptibility evaluation through deep learning optimizer ensembles and GIS techniques. J. Environ. Manag. 2022, 316, 115316.
Pham, B.T.; Van Dao, D.; Acharya, T.D.; Van Phong, T.; Costache, R.; Van Le, H.; Nguyen, H.B.T.; Prakash, I. Performance assessment of artificial neural network using chi-square and backward elimination feature selection methods for landslide susceptibility analysis. Environ. Earth Sci. 2021, 80, 686.
Tiwari, A.; Arun, G.; Vishwakarma, B.D. Parameter importance assessment improves efficacy of machine learning methods for predicting snow avalanche sites in Leh-Manali Highway, India. Sci. Total Environ. 2021, 794, 148738.
Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. Recent advances and emerging challenges of feature selection in the context of big data. Knowl. Based Syst. 2015, 86, 33–45.
Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer Science Business Media: New York, NY, USA, 2013; pp. 27–59.
Hu, S.; Liu, H.; Zhao, W.; Shi, T.; Hu, Z.; Li, Q.; Wu, G. Comparison of Machine Learning Techniques in Inferring Phytoplankton Size Classes. Remote Sens. 2018, 10, 191.
Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. 1997, 97, 273–324.
Ghinoi, A.; Chung, C.J. STARTER: A statistical GIS-based model for the prediction of snow avalanche susceptibility using terrain features—Application to Alta Val Badia, Italian Dolomites. Geomorphology 2005, 66, 305–325.
Hafner, E.; Bühler, Y. SPOT6 Avalanche Outlines 24 January 2018; EnviDat: Zurich, Switzerland, 2019.
Hafner, E.; Leinss, S.; Techel, F.; Bühler, Y. Satellite Avalanche Mapping Validation Data; EnviDat: Zurich, Switzerland, 2021.
Bühler, Y.; Hafner, E.D.; Zweifel, B.; Zesiger, M.; Heisig, H. Where are the avalanches? Rapid SPOT6 satellite data acquisition to map an extreme avalanche period over the Swiss Alps. Cryosphere 2019, 13, 3225–3238.
Hafner, E.D.; Techel, F.; Leinss, S.; Bühler, Y. Mapping avalanches with satellites—Evaluation of performance and completeness. Cryosphere 2021, 15, 983–1004.
Cetinkaya, S.; Kocaman, S. Impact of Learning Set and Sampling for Snow Avalanche Susceptibility Mapping with Random Forest. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, XLVIII-M-1, 57–64.
Lundberg, S.; Lee, S. A Unified Approach to Interpreting Model Predictions. In NIPS’17 Proceedings of the 31st International Conference on Neural Information Processing Systems; ACM Press: New York, NY, USA, 2017; Volume 1705, pp. 4765–4774.
Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
Scikit Learn: GridSearchCV. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html (accessed on 13 May 2024).
Can, R.; Kocaman, S.; Gokceoglu, C. A Comprehensive Assessment of XGBoost Algorithm for Landslide Susceptibility Mapping in the Upper Basin of Ataturk Dam, Turkey. Appl. Sci. 2021, 11, 4993.
Yao, Z.; Chen, M.; Zhan, J.; Zhuang, J.; Sun, Y.; Yu, Q.; Yu, Z. Refined Landslide Susceptibility Mapping by Integrating the SHAP-CatBoost Model and InSAR Observations: A Case Study of Lishui, Southern China. Appl. Sci. 2023, 13, 12817.
swissALTI3D. Available online: https://www.swisstopo.admin.ch/en/height-model-swissalti3d (accessed on 20 January 2024).
Conrad, O.; Bechtel, B.; Bock, M.; Dietrich, H.; Fischer, E.; Gerlitz, L.; Wehberg, J.; Wichmann, V.; Böhner, J. System for Automated Geoscientific Analyses (SAGA) v. 2.1.4. Geosci. Model Dev. 2015, 8, 1991–2007.
McClung, D.; Schaerer, P. The Avalanche Handbook; The Mountaineers Books: Seattle, WA, USA, 2006.
Schweizer, J.; Jamieson, J.B. Snowpack properties for snow profile analysis. Cold Reg. Sci. Technol. 2003, 37, 233–241.
Akay, H. Towards Linking the Sustainable Development Goals and a Novel-Proposed Snow Avalanche Susceptibility Mapping. Water Resour. Manag. 2022, 36, 6205–6222.
Riley, S.J.; De Gloria, S.D.; Elliot, R. A Terrain Ruggedness that Quantifies Topographic Heterogeneity. Intermt. J. Sci. 1999, 5, 23–27.
Revuelto, J.; Billecocq, P.; Tuzet, F.; Cluzet, B.; Lamare, M.; Larue, F.; Dumont, M. Random forests as a tool to understand the snow depth distribution and its evolution in mountain areas. Hydrol. Process. 2020, 34, 5384–5401.
Böhner, J.; Antonić, O. Chapter 8 Land-Surface Parameters Specific to Topo-Climatology. Dev. Soil Sci. 2009, 33, 195–226.
Chen, Y.; Chen, W.; Rahmati, O.; Falah, F.; Kulakowski, D.; Lee, S.; Rezaie, F.; Panahi, M.; Bahmani, A.; Darabi, H.; et al. Toward the development of deep learning analyses for snow avalanche releases in mountain regions. Geocarto Int. 2021, 37, 7855–7880.
Liu, Y.; Chen, X.; Qiu, Y.; Hao, J.; Yang, J.; Li, L. Mapping snow avalanche debris by object-based classification in mountainous regions from Sentinel-1 images and causative indices. Catena 2021, 206, 105559.
Panahi, M.; Sadhasivam, N.; Pourghasemi, H.R.; Rezaie, F.; Lee, S. Spatial prediction of groundwater potential mapping based on convolutional neural network (CNN) and support vector regression (SVR). J. Hydrol. 2020, 588, 125033.
Mosavi, A.; Shirzadi, A.; Choubin, B.; Taromideh, F.; Hosseini, F.S.; Borji, M.; Shahabi, H.; Salvati, A.; Dineva, A.A. Towards an Ensemble Machine Learning Model of Random Subspace Based Functional Tree Classifier for Snow Avalanche Susceptibility Mapping. IEEE Access 2020, 8, 145968–145983.
Choubin, B.; Borji, M.; Hosseini, F.S.; Mosavi, A.; Dineva, A.A. Mass wasting susceptibility assessment of snow avalanches using machine learning models. Sci. Rep. 2020, 10, 18363.
Dietrich, H.; Böhner, J. Cold air production and flow in a low mountain range landscape in Hessia (Germany). Hambg. Beiträge Phys. Geogr. Landschaftsökologie 2008, 19, 37–48.
De Reu, J.; Bourgeois, J.; Bats, M.; Zwertvaegher, A.; Gelorini, V.; De Smedt, P.; Chu, W.; Antrop, M.; De Maeyer, P.; Finke, P.; et al. Application of the topographic position index to heterogeneous landscapes. Geomorphology 2013, 186, 39–49.
Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient boosting with categorical features support. arXiv 2018, arXiv:1810.11363.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Hafner, E.D.; Techel, F.; Daudt, R.C.; Wegner, J.D.; Schindler, K.; Bühler, Y. Avalanche size estimation and avalanche outline determination by experts: Reliability and implications for practice. Nat. Hazards Earth Syst. Sci. 2023, 23, 2895–2914.

Figure 1. A schematic overview of the proposed framework.

Figure 2. Engelberger Aa, Meienreuss, and Göschenerreuss catchments (sub-catchments of the Reuss River Basin).

Figure 3. Altitude maps of Engelberger Aa, Meienreuss, and Göschenerreuss catchments.

Figure 4. LULC classification maps for Engelberger Aa, Meienreuss, and Göschenerreuss.

Figure 5. Flowchart of iterative feature elimination using CatBoost with GridSearchCV.

Figure 6. (a) The AS map of the Engelberger Aa catchment using the optimal feature set and (b) SHAP summary plot.

Figure 7. (a) The AS map of the Meienreuss catchment using the optimal feature set and (b) SHAP summary plot.

Figure 8. (a) The AS map of the Göschenerreuss catchment using the optimal feature set and (b) SHAP summary plot.

Table 1. Brief summary of the ML methods and the features used for AS mapping in the recent literature.

Reference	Year	Method	Location	Features	No. of Features
Choubin et al. [8]	2019	SVM, MDA	Indian Himalayas	Precipitation, Temperature, Elevation, Slope, Aspect, Curvature, Drainage Density, Topographic Wetness Index, Topographic Position Index, Vector Ruggedness Measure, Distance to Stream, Distance to Fault, Land Use, Lithology	14
Rahmati et al. [9]	2019	SVM, RF, NB, GAM	Darvan and Zarrinehroud Watersheds, Iran	Elevation, Topographic Position Index, Terrain Ruggedness Index, Topographic Wetness Index, Vector Ruggedness Measure, Weighted Elevation Index, LS Factor, Relative Slope Position, Slope, Aspect, Profile Curvature, Distance from Stream, Lithology, Land Use	14
Tiwari et al. [15]	2021	SVM (with Linear, Polynomial, Sigmoid, RBF Kernels)	Leh-Manali Highway, India	Elevation, Slope, Aspect, Plan Curvature, Roughness, Topographic Wetness Index, Precipitation, Temperature, Wind Speed, Normalized Difference Vegetation Index, Soil Wetness Index	11
Varol [4]	2022	FR, AHP, Fuzzy AHP	Uzungol, Trabzon, Turkey	Elevation, Slope, Curvature, Aspect, Vegetation	5
Akbar et al. [5]	2022	AHP	Kargil-Ladakh Region, Trans-Himalayas	Elevation, Slope, Profile Curvature, Land Cover, Aspect, Distance from Lineaments	6
Durlević et al. [6]	2022	AHP	Šar Mountains, Serbia	Elevation, Slope, Aspect, Profile Curvature, Terrain Ruggedness Index, Topographic Wetness Index, LS Factor, Air Temperature, Normalized Difference Snow Index, Weighted Elevation Index, Normalized Difference Vegetation Index, Bare Soil Index, Distance from Stream, Lithology	14
Wen et al. [10]	2022	SVM, KNN, CART, MLP	Parlung Tsangpo Catchment, Southeastern Tibet	Elevation, Slope, Aspect, Surface Roughness, Curvature, Relief Amplitude, Earth’s Surface Incision, Variance Coefficient in Elevation, Average Annual Snowfall, Average Annual Snowfall Days, Average Temperature of January, Maximum Snow Depth, Distance to Rivers, Distance to Faults, Normalized Difference Vegetation Index, Land Use	16
Cetinkaya and Kocaman [12]	2022	LR, RF	Davos, Switzerland	Elevation, Slope, Plan Curvature, Profile Curvature, Aspect, Topographic Position Index, Terrain Ruggedness Index, Topographic Wetness Index, Land Use and Land Cover, Lithology, Distance to Road, Distance to River	12
Liu et al. [2]	2023	CatBoost, LightGBM, RF, XGBoost	Tianshan Mountains, China	Aspect, Slope, Elevation, Precipitation, Average Temperature, Slope Length, Drainage Density, Distance to River, Profile Curvature, Plan Curvature, Average Wind Speed, Maximum Snow Depth, Relative Slope Position, Topographic Wetness Index, Topographic Position Index, Terrain Ruggedness Index, Terrain Surface Convexity, Terrain Surface Texture, Vector Ruggedness Measure, Relief Degree of Land Surface	20
Iban and Bilgilioglu [11]	2023	XGBoost, LightGBM, NGBoost, AdaBoost, RF, GB	Province of Sondrio, Italy	Elevation, Slope, Aspect, Plan Curvature, Profile Curvature, Topographic Wetness Index, Terrain Ruggedness Index, Topographic Position Index, Proximity to Road, Proximity to Stream, Land Use, Precipitation, Maximum Temperature, Minimum Temperature, Wind Speed, Solar Radiation, Lithology	17

Table 2. The performance measures used in the study.

Metric	Formula	Equation
Precision	$\frac{TP}{TP + FP}$	(7)
Recall	$\frac{TP}{TP + FN}$	(8)
F1-score	$2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$	(9)
Specificity	$\frac{TN}{TN + FP}$	(10)
Negative Predictive Value	$\frac{TN}{TN + FN}$	(11)
Accuracy	$\frac{TP + TN}{TP + TN + FP + FN}$	(12)
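
As a minimal sketch of how the measures in Table 2 follow from the four confusion-matrix counts, the snippet below evaluates Equations (7) to (12) directly. The function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Evaluate the Table 2 measures from confusion-matrix counts.

    tp/fp/tn/fn are true/false positives/negatives. The sketch assumes
    every denominator is non-zero (no empty class or empty prediction).
    """
    precision = tp / (tp + fp)                           # Eq. (7)
    recall = tp / (tp + fn)                              # Eq. (8)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (9)
    specificity = tn / (tn + fp)                         # Eq. (10)
    npv = tn / (tn + fn)                                 # Eq. (11)
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (12)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        "npv": npv,
        "accuracy": accuracy,
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=8, fp=2, tn=7, fn=3)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Precision and NPV are normalized by the number of predictions of each class, whereas recall and specificity are normalized by the number of reference samples of each class, so the two pairs need not coincide.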

Table 3. The results of multicollinearity analysis.

Factor	Engelberger Aa VIF	Engelberger Aa TOL	Meienreuss VIF	Meienreuss TOL	Göschenerreuss VIF	Göschenerreuss TOL
Altitude	2.606	0.384	2.716	0.368	3.377	0.296
Aspect	1.094	0.914	1.304	0.767	1.234	0.810
CI	2.063	0.485	2.653	0.377	2.460	0.406
DAH	1.134	0.882	1.314	0.761	1.281	0.781
DTS	1.270	0.787	1.699	0.589	1.965	0.509
LS-Factor	2.461	0.406	2.271	0.440	2.591	0.386
MSP	1.238	0.808	1.297	0.771	1.323	0.756
Plan Curv.	2.285	0.438	2.790	0.358	2.701	0.370
Profile Curv.	1.894	0.528	1.709	0.585	1.876	0.533
Slope	6.354	0.157	10.816	0.092	10.409	0.096
TPI	3.022	0.331	3.249	0.308	3.149	0.318
TRI	3.589	0.279	7.972	0.125	7.340	0.136
TWI	4.204	0.238	4.857	0.206	5.077	0.197
VD	2.005	0.499	2.561	0.391	2.855	0.350
VRM	1.338	0.747	1.410	0.709	1.396	0.716
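WEI	3.932	0.254	5.030	0.199	4.477	0.223

The two columns reported per catchment are linked: tolerance is the reciprocal of the variance inflation factor, TOL = 1/VIF. The sketch below checks this for three of the sixteen rows copied from Table 3 and flags entries above a VIF cut-off. The cut-off of 10 is a commonly used rule of thumb and is an assumption here, not a value taken from the table; the names `tolerance` and `flag_collinear` are likewise hypothetical.

```python
# VIF and TOL values copied from Table 3 (Altitude, Slope, and TRI rows only).
CATCHMENTS = ("Engelberger Aa", "Meienreuss", "Göschenerreuss")
VIF = {
    "Altitude": dict(zip(CATCHMENTS, (2.606, 2.716, 3.377))),
    "Slope": dict(zip(CATCHMENTS, (6.354, 10.816, 10.409))),
    "TRI": dict(zip(CATCHMENTS, (3.589, 7.972, 7.340))),
}
TOL = {
    "Altitude": dict(zip(CATCHMENTS, (0.384, 0.368, 0.296))),
    "Slope": dict(zip(CATCHMENTS, (0.157, 0.092, 0.096))),
    "TRI": dict(zip(CATCHMENTS, (0.279, 0.125, 0.136))),
}

# Assumed rule-of-thumb cut-off; not stated in Table 3.
VIF_THRESHOLD = 10.0


def tolerance(vif: float) -> float:
    """Tolerance is the reciprocal of the variance inflation factor."""
    return 1.0 / vif


def flag_collinear(vif_table: dict, threshold: float = VIF_THRESHOLD) -> list:
    """Return (factor, catchment) pairs whose VIF exceeds the threshold."""
    return [
        (factor, catchment)
        for factor, by_catchment in vif_table.items()
        for catchment, vif in by_catchment.items()
        if vif > threshold
    ]


# Largest gap between 1/VIF and the tabulated TOL (rounding error only).
max_gap = max(
    abs(tolerance(VIF[f][c]) - TOL[f][c]) for f in VIF for c in CATCHMENTS
)
print(f"largest |1/VIF - TOL| = {max_gap:.4f}")
print("above threshold:", flag_collinear(VIF))
```

Under the assumed cut-off, only the Slope entries for Meienreuss and Göschenerreuss are flagged among the rows included here.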
WEI	3.932	0.254	5.030	0.199	4.477	0.223

Table 4. Performance measures for the Engelberger Aa, Meienreuss, and Göschenerreuss catchments expressed in precision, recall, F1-score, specificity, NPV, and accuracy.

Metrics	Engelberger Aa (11 Features)	Meienreuss (7 Features)	Göschenerreuss (9 Features)
Precision	0.91	0.95	0.94
Recall	0.96	0.98	0.97
F1-score	0.94	0.97	0.96
Specificity	0.96	0.98	0.97
Negative Predictive Value	0.91	0.95	0.94
Accuracy	0.94	0.96	0.97
5-fold Mean CV Score	0.931	0.958	0.954
Confidence Intervals [Lower Bound, Upper Bound]	[0.929, 0.934]	[0.956, 0.960]	[0.953, 0.955]

© 2024 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
