**3. Problem Statement**

We applied the *AR-S* threshold method according to [35] at the 5% and 10% exceedance probability levels, using the same calibration landslide data set, the same TMPA-RT-based *AR* data, but the new regional-scale *S* data of [49]. We obtained the following general *AR*-*S* relation and threshold equations:

$$AR = (\alpha \pm \Delta\alpha) \times S^{(\beta \pm \Delta\beta)} = (38.8 \pm 1.6) \times S^{-0.06 \pm 0.06} \text{ (}R^2 = 0.00\text{)}\tag{2}$$

$$AR\left(5\%\right) = \left(13.1 \pm 1.7\right) \times S^{0.24 \pm 0.16} \text{ (}R^2 = 0.05\text{)}\tag{3}$$

$$AR\left(10\%\right) = \left(17.2 \pm 1.7\right) \times S^{0.22 \pm 0.16} \text{ (}R^2 = 0.03\text{)}.\tag{4}$$

Contrary to [35], the close-to-zero coefficients of determination *R*<sup>2</sup> (averaged over 5000 bootstrap iterations) associated with the two calculated thresholds show no dependence of the threshold *AR* values on *S* (Equations (3) and (4)). That these threshold estimates are meaningless is further confirmed by the positive slope of the regression lines, which implies, counterintuitively, that more rainfall would be needed to trigger landslides in more susceptible areas (Figure 3). Analysis of the individual bootstrap iterations likewise reveals a major issue in the estimation of parameter β, which is significant in only about one of two iterations, with relative uncertainties Δβ/β of 0.7 on average, much larger than the generally accepted 10% level [9].
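The diagnostics discussed above (mean *R*<sup>2</sup>, significance of β, and relative uncertainty Δβ/β over bootstrap iterations) can be illustrated with a minimal sketch. It is not the implementation of [35]: the function names `fit_loglog` and `bootstrap_fit` are hypothetical, an unweighted ordinary least-squares fit on log10-transformed values is assumed, the fit is applied to the whole input rather than to a threshold subset, and the significance of β is approximated by the large-sample criterion |β| > 1.96 × its standard error. Only the Python standard library is used.

```python
import math
import random


def fit_loglog(s, ar):
    """Unweighted OLS fit of log10(AR) = log10(alpha) + beta * log10(S).

    Returns (alpha, beta, se_beta, r2), where se_beta is the standard
    error of the slope. Assumes all values are strictly positive.
    """
    x = [math.log10(v) for v in s]
    y = [math.log10(v) for v in ar]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    beta = sxy / sxx
    log_alpha = my - beta * mx
    sse = sum((yi - (log_alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    r2 = 1.0 - sse / syy
    se_beta = math.sqrt(sse / (n - 2) / sxx)
    return 10 ** log_alpha, beta, se_beta, r2


def bootstrap_fit(s, ar, n_iter=5000, seed=0):
    """Refit the power law on n_iter resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(s)
    fits = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        fits.append(fit_loglog([s[i] for i in idx], [ar[i] for i in idx]))
    return {
        "beta_mean": sum(f[1] for f in fits) / n_iter,
        "r2_mean": sum(f[3] for f in fits) / n_iter,
        # Relative uncertainty of the slope, averaged over iterations.
        "rel_unc_mean": sum(f[2] / abs(f[1]) for f in fits) / n_iter,
        # Assumed criterion: |beta| > 1.96 * se_beta (approx. 5% level).
        "frac_significant": sum(abs(f[1]) > 1.96 * f[2] for f in fits) / n_iter,
    }


if __name__ == "__main__":
    # Synthetic, purely illustrative data (not the WEAR data set).
    demo = random.Random(42)
    s_demo = [demo.uniform(0.1, 1.0) for _ in range(80)]
    ar_demo = [30.0 * demo.lognormvariate(0.0, 0.5) for _ in s_demo]
    print(bootstrap_fit(s_demo, ar_demo, n_iter=500))
```

With data in which *AR* is unrelated to *S*, as in the synthetic demo, one would expect a mean *R*<sup>2</sup> near zero and a slope that is rarely significant, which is the pattern of symptoms described above.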

Such poor thresholding cannot be ascribed to low-quality *S* data, since the regional-scale data of [49] have been shown to be more accurate than the continental-scale *S* data of [45]. The reason for the very weak and unrealistic positive correlation between *AR* and *S* must therefore be sought elsewhere, most certainly in some hidden deficiency of the *AR-S* threshold method of [35]. We suggest, and test hereafter, that the problem arises from the way the frequentist-based approach defines the data subset used in the threshold calibration, namely by selecting the most negative residuals of the general fit. Indeed, given the relatively small data set available in the WEAR and the unequal spread of the data across the *S* range, the frequentist method's assumption that the data set is large and well spread [40] is not satisfied. In particular, with the regional *S* data, the distribution of the data points within the *AR-S* space is such that the 10% and 20% subsets (i.e., *2x*% of the data for the *x*% threshold) comprise almost no data in the low-*S* domain, partly because the quasi-horizontal general fit forces the most negative residuals into the high-*S* region (Figure S1). This means that a large number of the bootstrap iterations are based on data belonging exclusively to a narrow range of high *S* values, biasing the threshold *AR-S* relation and degrading the method's robustness. In any case, this failed test of the method highlights the need to improve it in order to overcome the limitations imposed by heterogeneously distributed and relatively small data sets. It also points to the possible role of the bootstrap procedure and calls for a critical evaluation of its use in such contexts. We thus propose two major methodological modifications of the *AR-S* approach in the next sections.
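The subset-selection issue described above can be checked with a short sketch that extracts the fraction of points with the most negative residuals from the general log-log fit and reports how many of them fall in the low-*S* domain. This is an illustration under stated assumptions rather than the procedure of [35]: the function names, the unweighted least-squares fit, and the `s_split` value separating "low" from "high" *S* are all hypothetical choices.

```python
import math


def general_fit_residuals(s, ar):
    """Residuals of the unweighted OLS fit log10(AR) ~ log10(S)."""
    x = [math.log10(v) for v in s]
    y = [math.log10(v) for v in ar]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    log_alpha = my - beta * mx
    return [yi - (log_alpha + beta * xi) for xi, yi in zip(x, y)]


def most_negative_subset(s, ar, fraction):
    """Indices of the `fraction` of points with the most negative residuals.

    For a threshold at the x% level, `fraction` would be 2x/100
    (e.g., 0.10 and 0.20 for the 5% and 10% thresholds).
    """
    resid = general_fit_residuals(s, ar)
    k = max(1, round(fraction * len(resid)))
    order = sorted(range(len(resid)), key=resid.__getitem__)
    return order[:k]


def share_below(s, idx, s_split):
    """Share of the selected points whose S lies below s_split (hypothetical cut)."""
    return sum(1 for i in idx if s[i] < s_split) / len(idx)
```

A share close to zero for a given data set would indicate that the calibration subset, and hence every bootstrap resample drawn from it, covers only the high-*S* part of the range.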

**Figure 3.** Log–log plot of antecedent rain (mm) vs. landslide susceptibility (regional-scale [49]) for the landslide events on the reported day and on the days before and after that date (point size is scaled by the associated weights, i.e., 0.67 and 0.17, respectively). Thresholds are obtained by applying the *AR-S* method proposed by [35]. The black line is the regression curve obtained from the whole data set; the green and red curves are the *AR* thresholds at the 5% and 10% exceedance probability levels, respectively, with their uncertainties shown as shaded areas. Ndata is the number of data points in the expanded calibration set.
