### **1. Introduction**

Flooding may occur at any time of the year in Canada; however, the risk is generally higher in the spring due to heavy rainfall coupled with a rapidly melting snowpack or ice jamming [1]. The development of flood hazard maps is a provincial responsibility. Thus, to date, the areas selected for mapping have been determined by each province according to its own selection criteria [2]. Hazard maps generally cover a geographically small region using high-resolution datasets, which yields high-quality maps for these targeted areas. The limitation of this approach is that flood maps are generated only for the selected areas, leaving many communities and large geographic areas without flood maps or indicators of their flood risk.

Machine learning (ML) algorithms have shown competency in a variety of fields and are growing in popularity in their application to geospatial science issues. Most recently, and notably, they have been applied to flood susceptibility (FS) mapping [3–6]. There are hundreds of ML algorithms available, which can be generally categorized into groups of classification or regression and may belong to different families, such as neural networks, boosting, random forests, generalized linear models, etc. [7]. These models come from different backgrounds, e.g., statistics (generalized linear models), artificial intelligence and data mining (decision trees), and clustering (K-means), while others represent a connectionist approach (neural networks) [7]. Tools such as the Classification And Regression Training (caret) library in R provide an easy mechanism to execute and compare results from a large number of classifiers [8]. The pre-eminence of Random Forest (RF) in solving problems such as FS is well supported [3,7,9–12].

**Citation:** McGrath, H.; Gohl, P.N. Prediction and Classification of Flood Susceptibility Based on Historic Record in a Large, Diverse, and Data Sparse Country. *Environ. Sci. Proc.* **2023**, *25*, 18. https://doi.org/10.3390/ECWS-7-14235

Academic Editor: Athanasios Loukas

Published: 16 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In the literature, there is no consensus on whether a single model or an ensemble model is more appropriate, and different disciplines report varying outcomes. Ensemble learning and hybrid ML methods have been shown to significantly improve performance, especially in cases of high dimensional data [13–15]. Both ensemble and hybrid approaches employ a type of information fusion, though by different means [14].

Most of the existing literature that has explored ML as applied to the FS problem has covered relatively small geographic regions, often single watersheds. Few studies have explored how these models perform over vast, geographically and meteorologically diverse environments. In this research, two questions are explored to create a national FS map for Canada: (i) can a single ML model outperform a collection of regional models across a geographically large and diverse region, and (ii) how is model performance affected by increasing distance from the training sites?

### **2. Materials and Methods**

Following common practices in ML workflows, datasets that may contribute to flooding were first identified, and historic flood events were used as labeled data and split into two classes: flooded/not flooded. The data were pre-processed: continuous datasets were scaled and centred, and nominal datasets were set as factors. The data were then split into training and test sets (70/30) using the createDataPartition function from the Classification And Regression Training (caret) library to create a balanced split [11]. Next, Variable Selection using Random Forest (VSURF) was run to identify the relevant (important) datasets (factors) by finding the model and associated datasets with the smallest out-of-bag (OOB) error. This subset of factors was then refined by testing for multicollinearity among the independent variables, assessing the Variance Inflation Factor (VIF) and Pearson correlation coefficient, to arrive at a final list of important factors. Then, the selected factors were run through a variety of ML models to determine performance. The models selected include those commonly cited in the literature, found to perform well in FS problems, and aligned with the findings of the comprehensive study in [8]. K-fold cross-validation of the training points (K = 3) was repeated for five resampling iterations and the average score was recorded. All ML models were accessed from the caret library in R. The train function in caret was used, which sets up a grid of tuning parameters for a large number of classification and regression algorithms [7]. These control parameters prescribe the computational nuances of the train function. The test data were then run through the trained model, and the classification and FS prediction were saved. These results were then analyzed in a confusion matrix, summary statistics were calculated, and the results were compared to a historic event database (Figure 1).
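The splitting and resampling steps described above can be illustrated with a minimal sketch. It is written in plain Python rather than with the caret functions used in the study, so it only mirrors the idea (a class-balanced 70/30 split followed by 3-fold cross-validation repeated five times) and is not the authors' implementation. The function names, the synthetic labels, and the majority-class scorer standing in for a real model are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), keeping class proportions similar in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def repeated_kfold(indices, k=3, repeats=5, seed=0):
    """Yield (fit_idx, validation_idx) pairs: k folds, repeated with a fresh shuffle."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for held_out in range(k):
            validation = folds[held_out]
            fit = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield fit, validation


def majority_class_accuracy(labels, fit_idx, val_idx):
    """Placeholder scorer (assumption): predict the most common class in the fit folds."""
    fit_labels = [labels[i] for i in fit_idx]
    majority = max(set(fit_labels), key=fit_labels.count)
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)


# Synthetic labels (hypothetical): 60 flooded (1) and 40 not-flooded (0) points.
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)

# Average the score over all 3 x 5 = 15 resamples of the training points.
scores = [majority_class_accuracy(labels, f, v) for f, v in repeated_kfold(train_idx)]
mean_score = sum(scores) / len(scores)
```

Splitting within each class keeps the flooded/not-flooded ratio the same in the training and test sets, which is the sense of "balanced" assumed here. In a real workflow the placeholder scorer would be replaced by fitting and scoring the chosen ML model on each resample.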

### *2.1. Single Model vs. Multi-Model*

As Canada spans a geographically large area and contains a diverse set of climate characteristics, a regional approach was first considered. This approach would see several models developed. National datasets in Canada have been developed based on a variety of frameworks, such as regional, physiographic, and ecozone approaches. A commonly accepted choice in Canada is the National Ecological Framework of Canada, which provides standard ecological units [16]. This classification system "incorporates all major components of ecosystems: air, water, land, and biota" [16]. A second framework considers physiographic regions and comprises the shield and borderlands, each sub-divided into additional groupings [17]. As undertaken by [3] in their US flood risk work, drainage areas were also considered as natural regional boundaries. In Canada, there are 11 major drainage areas.

In the national approach, all training data were processed together to generate a single model, which was then applied across the whole country.

**Figure 1.** Methodology and tests run to create a flood susceptibility dataset in Canada.

### *2.2. Single Model vs. Ensemble over Large Distance*

Most of the existing literature applying ML to the FS problem has tested and applied the results over relatively small geographic areas, which are not dissimilar from the data found in the training/test set. Given the vast region of interest and the variety in land use, geography, climate, etc., several single ML models, which encompass a variety of classification and regression algorithms, were tested to evaluate their performance (Table 1). In addition, an ensemble model was developed and evaluated. Qualitative and quantitative analyses were performed to evaluate how the different ML models perform over Canada as a whole.
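One hypothetical way to quantify how performance changes with distance from the training sites is to bin test points by their great-circle distance to the nearest training site and compute accuracy per bin. The sketch below is an assumption for illustration only: the text does not state how distance was measured or binned, and the function names, coordinates, class values, and bin edges are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_by_distance(test_points, training_sites, bin_edges_km):
    """Group test points by distance to the nearest training site and score each bin.

    test_points: list of (lat, lon, observed_class, predicted_class).
    Returns {(low, high): accuracy} for bins that contain at least one point.
    """
    hits, totals = {}, {}
    for lat, lon, observed, predicted in test_points:
        nearest = min(haversine_km(lat, lon, s_lat, s_lon) for s_lat, s_lon in training_sites)
        for low, high in zip(bin_edges_km[:-1], bin_edges_km[1:]):
            if low <= nearest < high:
                totals[(low, high)] = totals.get((low, high), 0) + 1
                hits[(low, high)] = hits.get((low, high), 0) + int(observed == predicted)
                break
    return {b: hits[b] / totals[b] for b in totals}


# Hypothetical data: one training site and five test points with made-up outcomes.
training_sites = [(45.42, -75.70)]
test_points = [
    (45.50, -75.60, 1, 1),
    (45.30, -75.90, 0, 0),
    (45.50, -73.57, 1, 0),
    (49.90, -97.14, 1, 1),
    (53.55, -113.49, 0, 1),
]
per_bin = accuracy_by_distance(test_points, training_sites, [0, 100, 500, 5000])
```

A declining accuracy from the nearest bin to the farthest would suggest that the model transfers poorly to areas unlike its training sites. Other summary statistics could be substituted for accuracy in the same structure.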


**Table 1.** Machine learning algorithms tested (C = classification, R = regression).

### *2.3. ML Models Tested*

The single model approach focused on Random Forest, due to its superior performance as reported in the literature [3,7,9–12]. Additional models were considered for the ensemble method and synthesized via a generalized linear model (GLM).
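A common way to synthesize several models with a GLM is stacking: the flood probabilities predicted by each base model become the inputs of a GLM that learns how to weight them. The sketch below illustrates that idea in plain Python under stated assumptions. The text does not specify the GLM family or how it was fitted, so a binomial GLM (logistic regression) trained by gradient descent is assumed, and the base-model probabilities, labels, and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_glm(rows, targets, learning_rate=1.0, epochs=5000):
    """Fit a binomial GLM (logistic regression) by batch gradient descent.

    rows: list of feature lists (here, base-model flood probabilities).
    Returns the weights with the intercept first.
    """
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        gradient = [0.0] * (n_features + 1)
        for features, target in zip(rows, targets):
            x = [1.0] + list(features)
            error = sigmoid(sum(w * v for w, v in zip(weights, x))) - target
            for j, v in enumerate(x):
                gradient[j] += error * v
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, gradient)]
    return weights


def predict_probability(weights, features):
    """Ensemble flood probability for one point from its base-model probabilities."""
    x = [1.0] + list(features)
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


# Hypothetical held-out probabilities from two base models (columns) and true labels.
base_probabilities = [
    [0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.6, 0.7],
    [0.4, 0.3], [0.3, 0.5], [0.2, 0.1], [0.1, 0.2],
]
flooded = [1, 1, 1, 1, 0, 0, 0, 0]

meta_weights = fit_logistic_glm(base_probabilities, flooded)
ensemble_fs = [predict_probability(meta_weights, row) for row in base_probabilities]
ensemble_class = [int(p >= 0.5) for p in ensemble_fs]
```

In practice, the GLM is usually fitted on predictions the base models made for points they were not trained on (for example, cross-validation predictions), so that it does not simply reward the base model that overfits most.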
