**5. Bootstrapping Called into Question**

Non-parametric bootstrap techniques, including the one introduced by [9] in the frequentist approach to threshold estimation, were designed to estimate the sampling distribution of a variable from an empirical data set and to assign measures of accuracy to statistical estimates [67]. While [9] acknowledge that the bootstrap may yield optimistic estimates of the parameter uncertainties, because the same data are used both to calculate the regression and to estimate these uncertainties, studies that have incorporated the bootstrap in threshold estimation do not discuss other possible drawbacks [35,41,64]. However, the bootstrap may fail when the data set is incomplete, resulting in overestimation of the uncertainty, or when the data set contains outliers, to which least-squares regression estimates are highly sensitive [67]. Therefore, in light of the observed uncertainty level and the hints of variability in the bootstrap results, we evaluated the pros and cons of this technique by running the threshold estimation without it.

Performing a single threshold calculation (no bootstrap), we obtained the following *AR* thresholds (Figure 6a):

$$AR\left(5\%\right) = 4.6 \times S^{-1.18}\ \left(R^2 = 0.70\right)\tag{8}$$

$$AR\left(10\%\right) = 6.2 \times S^{-1.10}\ \left(R^2 = 0.65\right). \tag{9}$$

Parameters α and β are significant for both threshold levels, with α barely smaller and β barely larger than in the thresholds obtained using the bootstrap method (Equations (6) and (7)), and thus well within the bootstrap-defined uncertainty boundaries. Opposite changes in α and β are to be expected from the inverse correlation that links the coefficient and exponent of power-law fits to a given data set. The two parameter changes therefore offset each other, inducing almost no difference between thresholds calculated with and without bootstrap (Table 1). Without the bootstrap, only the information about fit uncertainty is lost, because date uncertainty is still accounted for through data weighting. Moreover, in the case of the *AR-S* approach, the inherently poor *S*-spread of the data and the presence of large outliers in the data subset used for threshold estimation imply that the bootstrap procedure, which samples *n* data with replacement from a set of size *n* and is thus nothing more than a kind of random data weighting, includes a number of iterations with oversampled outliers. These iterations yield erratic results and may alter the final mean threshold estimate and exaggerate the fit uncertainty to an unknown extent. This is highlighted here by the better coefficients of determination of the *AR-S* thresholds obtained from the no-bootstrap approach. Furthermore, with or without bootstrap, the *AR-S* method does not account for crucial uncertainties affecting the *AR* and *S* data themselves, so that providing bootstrap-derived uncertainties is actually misleading. We thus conclude that the *AR-S* threshold procedure is more meaningful when no bootstrap is applied. The corresponding source code of the *AR-S* threshold method is provided in the Supplementary Material (Code S2). As for the other issue affecting the modified *AR-S* approach mentioned in the previous section, namely the bias in higher exceedance probability threshold estimates (FNR < TPE), it is essentially linked to the lack of data in the low-*S* classes. It is thus independent of the use of a bootstrap technique and cannot be solved by discarding the latter.
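The contrast between a single weighted fit and a bootstrap of the same fit can be illustrated with a minimal Python sketch. It is not the Code S2 implementation: the function names `fit_power_law` and `bootstrap_fits`, the synthetic data, the outlier magnitude and the number of iterations are all assumptions made for illustration. Only the power-law form AR = α × S^β and the weights 0.67 and 0.17 are taken from the text. The sketch fits the power law by weighted least squares in log–log space, then repeats the fit on *n* data resampled with replacement, so that some iterations oversample the outliers.

```python
import numpy as np


def fit_power_law(S, AR, w):
    """Weighted least-squares fit of log10(AR) = log10(alpha) + beta * log10(S).

    Returns (alpha, beta). np.polyfit applies its weights to the unsquared
    residuals, so the square root of the data weights is passed.
    """
    x, y = np.log10(S), np.log10(AR)
    beta, log_alpha = np.polyfit(x, y, 1, w=np.sqrt(w))
    return 10.0 ** log_alpha, beta


def bootstrap_fits(S, AR, w, n_iter=1000, seed=0):
    """Refit the power law on n_iter resamples of size n drawn with replacement.

    Returns an array of shape (n_iter, 2) holding (alpha, beta) per iteration.
    """
    rng = np.random.default_rng(seed)
    n = len(S)
    out = np.empty((n_iter, 2))
    for i in range(n_iter):
        idx = rng.integers(0, n, size=n)  # random reweighting of the data
        out[i] = fit_power_law(S[idx], AR[idx], w[idx])
    return out


if __name__ == "__main__":
    # Synthetic, purely illustrative data (assumption, not the study's data).
    rng = np.random.default_rng(42)
    n = 40
    S = rng.uniform(0.5, 0.9, n)                      # narrow S range
    AR = 5.0 * S ** -1.1 * 10 ** rng.normal(0, 0.1, n)
    AR[:2] *= 8.0                                     # two large outliers
    w = rng.choice([0.67, 0.17], size=n)              # weights quoted in the text

    alpha, beta = fit_power_law(S, AR, w)
    boot = bootstrap_fits(S, AR, w)

    print("single fit:     alpha = %.2f, beta = %.2f" % (alpha, beta))
    print("bootstrap mean: alpha = %.2f, beta = %.2f" % tuple(boot.mean(axis=0)))
    print("bootstrap std:  alpha = %.2f, beta = %.2f" % tuple(boot.std(axis=0)))
    # Correlation between coefficient and exponent across iterations
    r = np.corrcoef(np.log10(boot[:, 0]), boot[:, 1])[0, 1]
    print("corr(log10 alpha, beta) = %.2f" % r)
```

With such a setup, the spread of the bootstrap estimates can be compared with and without the two outliers to gauge how much of the reported fit uncertainty stems from iterations that oversample them.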

**Table 1.** *AR* threshold values (in mm) at 5% and 10% exceedance probability with (Equations (6) and (7)) and without (Equations (8) and (9)) bootstrap for the extreme susceptibility values *S* observed in the data set.



**Figure 6.** Log–log plots of antecedent rain (mm) vs. landslide susceptibility (regional-scale [49]) for the landslide events on the reported day and on the days before and after that date (with the point size proportional to their attributed weights, i.e., 0.67 and 0.17, respectively). Thresholds are based on the calibration inventory (**a**) and on the complete (calibration + validation) inventory (**b**). The threshold method applied is outlined in Figure 4, without adopting the bootstrapping statistical technique. Data subsets used for the calibration of thresholds at the 5% (green dots) and 10% (green and red dots) exceedance probability are highlighted (*T* in Figure 4). Dashed green and red lines in (**b**) present the thresholds based on the calibration data set only, as shown in (**a**). Ndata is the number of data in the respective expanded data set. The dashed lines delimit the log(*S*) classes.
