#### Elevation:

Water flows from higher to lower elevation, so the continuous flow of rainwater readily creates flash flood conditions in low-elevation areas [48,49]. However, pluvial flash floods can also occur at higher elevations. In this study, an elevation map with five classes was generated from the DEM (Figure 3).
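
As a minimal illustration of this reclassification step (assuming the DEM is available as a NumPy array; the class breaks below are hypothetical, not the ones used in this study):

```python
import numpy as np

# Hypothetical elevation breaks (m); the study's actual class limits may differ.
breaks = [1600, 1900, 2200, 2500]  # four breaks -> five classes

def reclassify(dem, breaks):
    """Assign each DEM cell to one of len(breaks)+1 classes (1-based labels)."""
    return np.digitize(dem, bins=breaks) + 1

dem = np.random.uniform(1400, 2800, size=(100, 100))  # stand-in for the real DEM
elevation_classes = reclassify(dem, breaks)
print(np.unique(elevation_classes))  # -> [1 2 3 4 5]
```

The same pattern applies to the slope map described next, with slope breaks in place of elevation breaks.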

#### Slope:

Many factors affect catchment hydrologic characteristics, which ultimately influence the generation of surface runoff. One of the most important factors controlling runoff is surface slope [50]. On steeper slopes, infiltration is lower and runoff is higher, and this excess runoff can cause flash flooding of the flat areas downslope. Thus, flat areas adjacent to high-gradient slopes generally have a high probability of flash flood occurrence [51]. In the present study, a slope map with five classes was created from the DEM (Figure 3).

#### Rainfall:

Rainfall is the primary source of water for runoff generation over the land surface and for the flooding of low-lying areas. Runoff occurs whenever rainfall intensity exceeds the infiltration capacity of the ground (soil and jointed, weathered rock), and intense, short-duration rainfall may cause flash floods. Rainfall is the most important factor in the flooding of an area [50], although flooding may also occur due to ice melt. To prepare the annual rainfall map, data from four rain-gauge stations covering a period of 30 years were used. The rainfall map was divided into two classes (Figure 3).

#### Distance from faults:

Some major faults exposed at the surface with wide, permeable fault zones may increase infiltration, reduce runoff, and saturate the surrounding ground mass, causing local flooding. However, faults may also trigger structural failure of levees and earthen dams, resulting in flash flooding. In the present study, the distance-from-fault map was prepared with six classes (Figure 3).

#### Soil:

Soil is one of the important factors affecting infiltration and runoff and thus has a great impact on flooding. Clay-rich soils are mostly impermeable, producing more runoff and hence flooding of the area. In the present study, the soil map was developed from data obtained from the Soil Survey Department of Iran (Figure 3).

#### Land use:

Land use types affect the degree and frequency of floods in an area [52,53]. Infiltration and runoff depend on the land use pattern as well as on other factors, and alterations in the land use configuration can change the flooding pattern of a region. Land-cover change due to anthropogenic activities such as urbanization, deforestation, and cultivation increases flash flood frequency and severity. In the present study, the land use map was obtained from the Department of Natural Resources of Markazi Province; Google Earth images and field surveys were used to update it (Figure 3).

#### Lithology:

Variation in lithology can strongly amplify or reduce the degree of flash flood vulnerability [54,55]. Infiltration and runoff depend on the permeability of the lithological units as well as on other geo-environmental factors. In this study, the lithology map, with sixteen groups, was prepared from Geological Survey of Iran data (Table 2 and Figure 3).


**Table 2.** Lithology units in the Tafresh watershed and their relative permeability.

#### **4. Methods Used**

#### *4.1. Frequency Ratio*

Frequency Ratio (FR) quantifies the relationship between flash flood events and their influencing variables [32,56]. To determine the FR, the ratio of flash flood events in each class of an influencing factor to the total flash flood events is calculated, together with the ratio of the area of each class to the total area. The FR of each class is then obtained by dividing the percentage of flash flood events in the class by the percentage of the study area occupied by that class. The FR of each class of each factor affecting flash floods is calculated using the following equation [32,33]:

$$\text{FR} = \left(\frac{A}{B}\right) \bigg/ \left(\frac{C}{D}\right) = \frac{E}{F} \tag{1}$$

where A is the number of flash flood pixels in the class, B the total number of flash flood pixels in the study area, C the number of pixels in the class of the effective factor, D the total number of pixels in the study area, E the percentage of flash flood occurrences in the class, and F the percentage of the total area occupied by the class.
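
A minimal sketch of Equation (1) in Python, assuming the flood inventory and a factor map are co-registered rasters (the class labels and toy data are illustrative):

```python
import numpy as np

def frequency_ratio(factor_map, flood_mask):
    """FR per class: (% of flood pixels in class) / (% of area in class)."""
    fr = {}
    total_floods = flood_mask.sum()   # B in Eq. (1)
    total_pixels = factor_map.size    # D in Eq. (1)
    for cls in np.unique(factor_map):
        in_class = factor_map == cls
        pct_floods = flood_mask[in_class].sum() / total_floods  # E = A / B
        pct_area = in_class.sum() / total_pixels                # F = C / D
        fr[cls] = pct_floods / pct_area
    return fr

# Toy example: a 5-class slope map and a boolean flood-event mask.
slope_classes = np.random.randint(1, 6, size=(200, 200))
flood_mask = np.random.rand(200, 200) < 0.01
print(frequency_ratio(slope_classes, flood_mask))
```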

#### *4.2. Correlation Based Feature Selection*

Irrelevant and redundant factors must be removed to improve data quality for modeling [57]. According to Pham et al. [58], working with a large number of factors slows model execution, lowers modeling accuracy, and causes overfitting when many irrelevant factors are used as model inputs. Many factors influence the flood phenomenon, but factors with higher correlation coefficients are more relevant to the modeling, and vice versa [58]. In this study, correlation based feature selection was used to evaluate the importance of the factors for better modeling of flash flood susceptibility. This method is based on the assumption that features/factors are relevant if their values vary systematically with category membership [57,59]. In other words, a feature is useful if it is correlated with, or predictive of, the class; otherwise it is irrelevant [57,59]. In correlation based feature selection, the evaluation score is defined as the Average Merit (AM), expressed as the following equation [57]:

$$AM_i = \frac{AC_i}{AI_i} \tag{2}$$

where *AM<sub>i</sub>* is the score of the *i*th factor, *AC<sub>i</sub>* is the average correlation between the *i*th subset and the dependent variable, and *AI<sub>i</sub>* is the average intercorrelation within the *i*th subset.
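
As a minimal sketch of Equation (2), using absolute Pearson correlations as a stand-in for the correlation measure (which the text does not specify):

```python
import numpy as np

def average_merit(X_subset, y):
    """AM = (avg |correlation of each factor with target|) /
            (avg |pairwise intercorrelation within the subset|)."""
    k = X_subset.shape[1]
    ac = np.mean([abs(np.corrcoef(X_subset[:, j], y)[0, 1]) for j in range(k)])
    pairs = [abs(np.corrcoef(X_subset[:, i], X_subset[:, j])[0, 1])
             for i in range(k) for j in range(i + 1, k)]
    ai = np.mean(pairs) if pairs else 1.0  # single-factor subset: no intercorrelation
    return ac / ai

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(float)
print(average_merit(X, y))
```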

#### *4.3. AdaBoostM1*

AdaBoostM1 is a popular adaptive boosting algorithm proposed by Freund and Schapire [60] that enhances the predictive ability of a base classifier. The method is employed to solve classification problems using a complicated dataset generated from previous classifiers. First, weight values are allocated to the instances in the learning dataset. The weights are then updated in each iteration of the training process according to the performance of the previous base classifier. Training terminates when the optimal weights have been found, i.e., those that achieve the best performance of the base classifier [61].
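
A minimal sketch using scikit-learn's `AdaBoostClassifier` (the toy data stands in for the per-pixel factor values used in the study; `algorithm="SAMME"` is scikit-learn's discrete boosting variant, which reduces to AdaBoost.M1 in the two-class case):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy binary data standing in for flood / non-flood pixels with 8 factors.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Discrete boosting of the default base learner (a decision stump): instance
# weights are re-weighted each round according to the previous round's errors.
model = AdaBoostClassifier(n_estimators=100, algorithm="SAMME", random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```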

#### *4.4. Bagging*

Bagging, one of the earliest ensemble methods, was proposed by Breiman [62] to improve the accuracy of machine learning algorithms [63]. In this method, the bootstrap sampling technique is used to produce numerous samples for training classifiers. Each generated training set is used to build a decision tree, and the outputs of these trees are then combined in the final model [61]. This method not only enhances generalization capacity but also decreases classification error [64,65]. The optimal classification result is obtained using the following equation:

$$L'(x) = \underset{y \in Y}{\operatorname{argmax}} \sum_{i=1}^{t} \mathbf{1}\left(C_i(x) = y\right) \tag{3}$$

where L'(*x*) denotes the combined classifier, *C<sub>i</sub>*(*x*) is the prediction of the *i*th base classifier, and **1**(·) is the indicator function.
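
A minimal sketch with scikit-learn's `BaggingClassifier` (toy data; the default base learner is a decision tree, and scikit-learn averages class probabilities across trees, a soft-voting analogue of the plurality vote in Equation (3)):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees, each fit on a bootstrap resample of the training set.
model = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```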

#### *4.5. Dagging*

Dagging was initially introduced by Ting and Witten [66] and is recognized as a well-known ensemble technique. The aim of the Dagging method is to improve the prediction accuracy of a classifier by combining models trained on varied samples of the training set [67]. A number of disjoint samples, rather than bootstrap samples, are used to train the base classifiers [66,68]. The method is a powerful technique for a single classifier with poor time complexity; the outputs of the weakly trained models are combined via the majority voting rule [67].
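
scikit-learn has no built-in Dagging, so the following is a rough sketch under the assumption that disjoint stratified folds approximate Ting and Witten's disjoint samples:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def dagging_fit_predict(base, X_train, y_train, X_test, n_folds=10):
    """Train one copy of `base` per disjoint fold; combine by majority vote."""
    models = []
    for _, fold_idx in StratifiedKFold(n_splits=n_folds).split(X_train, y_train):
        models.append(clone(base).fit(X_train[fold_idx], y_train[fold_idx]))
    votes = np.stack([m.predict(X_test) for m in models]).astype(int)
    # Plurality vote across the n_folds classifiers (one per disjoint sample).
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
pred = dagging_fit_predict(DecisionTreeClassifier(random_state=0),
                           X[:800], y[:800], X[800:])
print((pred == y[800:]).mean())
```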

#### *4.6. MultiBoostAB*

MultiBoostAB is a combined ensemble learning algorithm built on the AdaBoostM1 and Wagging methods in order to mitigate the overfitting problem [69]. Wagging is a variant of Bagging that assigns random weights to the training cases, which can remarkably reduce the bias of the AdaBoostM1 technique [70]. Combining Wagging and AdaBoostM1 produces a framework that can transform a weak classifier into a robust one. As MultiBoostAB can perform parallel processing, it is considered a computationally attractive method with advantages over both Wagging and AdaBoostM1 [69]. The MultiBoostAB method involves three main steps: (1) a subset is selected randomly from the original learning data and used to produce base classifier models; (2) the instance weights are adjusted based on the predictive competence of the models; and (3) new subsets are drawn according to the instance weights for training further models [71]. A rough sketch of this idea is given below.
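
The following sketch captures the idea rather than Webb's exact algorithm: several AdaBoost sub-committees are each trained on a "wagged" copy of the training set, with the exponential draw standing in for Webb's continuous-Poisson instance weights, and their probability outputs are averaged:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, y_train, X_test, y_test = X[:800], y[:800], X[800:], y[800:]

committees = []
for seed in range(5):  # five boosted sub-committees
    # "Wagging" step: random per-instance weights instead of bootstrap samples.
    weights = rng.exponential(scale=1.0, size=len(y_train))
    booster = AdaBoostClassifier(n_estimators=20, algorithm="SAMME",
                                 random_state=seed)
    booster.fit(X_train, y_train, sample_weight=weights)
    committees.append(booster)

# Average the sub-committee class probabilities; take the most likely class.
proba = np.mean([c.predict_proba(X_test) for c in committees], axis=0)
print((proba.argmax(axis=1) == y_test).mean())
```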

#### *4.7. Credal Decision Tree*

Abellan and Moral originally proposed the Credal Decision Tree (CDT), which uses an original split criterion built on uncertainty measures and imprecise probabilities [70]. CDT tackles classification problems by employing credal sets [64,72,73]. To avoid generating an overly complicated decision tree during the building process of a CDT, a stopping criterion was introduced: if splitting would increase the total uncertainty, the construction process stops [64,74]. To quantitatively evaluate the entire uncertainty of credal sets, an updated method was recommended based on the theory of Dempster and Shafer [75,76]. The function applied in measuring the total uncertainty is expressed in the following equation [77]:

$$EU(\chi) = \text{NG}(\chi) + \text{RG}(\chi) \tag{4}$$

where EU expresses the value of the entire (i.e., total) uncertainty, NG denotes a general nonspecificity function, and RG is a general randomness function for the credal set χ. The derivation of this summed uncertainty measure is given in the earlier literature of Abellan and Moral [78], and the detailed computation procedure and properties of EU are described in previous studies [72,78]. To analyze the probability intervals of individual variables, the imprecise probability model was adopted [79,80]. Supposing that Z is a variable whose values are expressed by *z<sub>j</sub>*, the probability distribution *p*(*z<sub>j</sub>*) of each value *z<sub>j</sub>* is bounded as per the following formula [73,81]:

$$p(z_j) \in \left[\frac{m_{z_j}}{M+r}, \frac{m_{z_j}+r}{M+r}\right]; \quad j = 1, \ldots, k \tag{5}$$

where *M* and *m<sub>zj</sub>* express the sample size and the frequency of the event (Z = *z<sub>j</sub>*), respectively; and *r* is a hyperparameter taking the value 1 or 2, as stated by Walley [80].
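
As a minimal illustration of Equation (5) (the counts below are invented for the example):

```python
def probability_interval(m_zj, M, r=1):
    """Imprecise probability bounds for P(Z = z_j), per Equation (5)."""
    lower = m_zj / (M + r)
    upper = (m_zj + r) / (M + r)
    return lower, upper

# Example: an event observed 30 times in a sample of 100, hyperparameter r = 1.
print(probability_interval(30, 100))  # -> (0.297..., 0.3069...)
```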

#### *4.8. Validation of the Models*

Validation is important to determine the accuracy of flash flood susceptibility models. To verify the prediction capability of the models, it is desirable to assess and compare them on both the training and validating datasets [17,25,42]. In the present study, several validation criteria were adopted, namely the Area Under the Receiver Operating Characteristic (ROC) curve and statistical measures.

#### 4.8.1. Receiver Operating Characteristic (ROC) Curve

The ROC curve is considered a good tool for analyzing landslide and flood susceptibility models [17,42,82,83]. The *x*-axis of the ROC graph shows 1 − specificity (the false positive rate), whereas the *y*-axis presents the sensitivity [84–88]. The area under the ROC curve, called the AUC, is commonly employed to evaluate the prediction capacity of models [89–93]. The AUC normally ranges from 0.5 to 1.0 [94–96], with higher values indicating better prediction capacity [97–100]. The AUC is calculated by the following equation:

$$\text{AUC} = \frac{\sum EC + \sum IC}{a + b} \tag{6}$$

where EC indicates the number of accurately classified flash flood pixels (true positives), IC denotes the number of accurately classified non-flash-flood pixels (true negatives), and a and b denote the total numbers of flash flood and non-flash-flood pixels, respectively.
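
In practice the AUC is usually computed directly from the model's predicted probabilities, e.g., with scikit-learn (the labels and scores below are toy values):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # 1 = flash flood
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # model output

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # fpr = 1 - specificity
print(roc_auc_score(y_true, y_score))
```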

#### 4.8.2. Statistical Measures

In the present study, seven popular statistical measures, namely Positive Predictive Value (PPV), Negative Predictive Value (NPV), Root Mean Square Error (RMSE), Accuracy (ACC), Sensitivity (SST), Specificity (SPF), and Kappa index (k), were employed for assessing the performance of the flash flood prediction models. These indices are summarized in Table 3.


**Table 3.** List of statistical measures employed in this research [101–105].

where A (true positives) is the number of flash flood pixels classified correctly and C (true negatives) is the number of non-flash-flood pixels classified correctly, whereas B (false positives) and D (false negatives) are the numbers of pixels classified incorrectly. *P<sub>a</sub>* and *P<sub>est</sub>* are the measured and expected agreements, respectively.
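
A minimal sketch computing these measures from a confusion matrix (toy labels; the Table 3 formulas reduce to the ratios below):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ppv = tp / (tp + fp)                    # Positive Predictive Value
npv = tn / (tn + fn)                    # Negative Predictive Value
sst = tp / (tp + fn)                    # Sensitivity
spf = tn / (tn + fp)                    # Specificity
acc = (tp + tn) / (tp + tn + fp + fn)   # Accuracy
kappa = cohen_kappa_score(y_true, y_pred)
print(ppv, npv, sst, spf, acc, kappa)
```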

RMSE is defined as the square root of the mean squared difference between model-simulated and measured values. It is popularly employed to assess flash flood susceptibility maps [17,42]; smaller RMSE values indicate better prediction capacity of the model. RMSE is calculated as follows [59,106–108]:

$$\text{RMSE} = \sqrt{\frac{1}{L}\sum_{i=1}^{L}\left(X_{model} - X_{act}\right)^2} \tag{7}$$

where *X<sub>model</sub>* and *X<sub>act</sub>* denote the model-simulated and actual (i.e., measured) values, respectively, and L is the total number of samples.
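
A one-function sketch of Equation (7) (toy susceptibility scores against 0/1 labels):

```python
import numpy as np

def rmse(x_model, x_act):
    """Root mean square error between predicted and observed values, Eq. (7)."""
    x_model, x_act = np.asarray(x_model), np.asarray(x_act)
    return np.sqrt(np.mean((x_model - x_act) ** 2))

print(rmse([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]))
```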

#### **5. Methodology**

The methodology of the study consists of five main steps: (1) data collection and preparation; (2) generation of the training and testing datasets; (3) building the flash flood models; (4) validation of the models; and (5) generation of the flash flood susceptibility maps (Figure 4). A more detailed description of these steps is given below:
