**4. Improving the Data Distribution over the** *S* **Range of the Data Used for Threshold Calculation**

#### *4.1. Rationale*

While retaining the choice of [35] to define *x*% thresholds from 2*x*% subsets of the data, we first propose a major modification of the *AR-S* method aimed at optimizing the use of the information available over the entire *S* range. Fundamentally, the data considered for inclusion in the calibration subsets no longer rely on the residuals of an often non-significant fit over the whole data set but rather on minimum *AR* values. The best possible distribution of the latter is obtained by stratified sampling: the actual *S* range of the data set is divided into a number of slices from which (as far as possible) equal numbers of minimum-*AR* data are selected. The slices are of equal size in log(*S*), and their optimal number was fixed at 10 on a trial-and-error basis. As an example, suppose a 10% threshold is to be estimated from a data set containing 150 landslide events, i.e., 450 event dates (see [35] and A.1 below). Homogeneously distributing over 10 *S* classes the 90 data of the 20% subset required for this threshold calculation implies selecting the nine events with the lowest *AR* in each class. Obviously, some classes may contain fewer than nine data and thus contribute less to the composition of the data subset, whose final size will often be slightly smaller than expected. In addition, when an *S* class does not contain enough data to fully contribute to the subset, all its data will be selected, however far their *AR* values are from the minimum. However, when tested by down-weighting the data in proportion to the deficit in contribution of their class of provenance, this possible bias appeared to affect the threshold estimates insignificantly. The modified method is described in detail hereafter (see also Figure 4). The source code is provided in the Supplementary Material (Code S1).
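The stratified selection described above can be sketched as follows. This is an illustrative Python sketch, not the supplied Code S1; the function name, the array-based interface, and the rounding of the per-class count are assumptions:

```python
import numpy as np

def stratified_min_ar_subset(ar, s, tpe, n_classes=10):
    """Select about (2 * tpe * len(ar)) data points by taking, in each of
    n_classes equal-width log(S) slices, the t_c points with the lowest AR.
    Classes holding fewer than t_c points contribute all their data, so the
    returned subset may be slightly smaller than targeted."""
    ar = np.asarray(ar, dtype=float)
    s = np.asarray(s, dtype=float)
    t_c = int(round(2 * tpe * ar.size / n_classes))   # cf. Equation (5)
    log_s = np.log10(s)
    edges = np.linspace(log_s.min(), log_s.max(), n_classes + 1)
    # np.digitize assigns the maximum to bin n_classes + 1; clip it back.
    cls = np.clip(np.digitize(log_s, edges), 1, n_classes)
    selected = []
    for c in range(1, n_classes + 1):
        idx = np.flatnonzero(cls == c)
        # indices of the t_c smallest AR values within this log(S) class
        order = idx[np.argsort(ar[idx])]
        selected.extend(order[:t_c].tolist())
    return np.array(sorted(selected))
```

Note that classes with fewer than *t*C data simply contribute everything they hold, reproducing the behavior (and the slight subset-size deficit) described above.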

	- A.1. *AR* values associated with each day of a reported landslide, plus the days prior to and after these dates, are extracted from the *AR* time series of the corresponding pixels, calculated according to Equation (1) and the parameterization adopted in [35], i.e., *a* = *b* = 1.2, *n* = 42 days, for which the index is relevant for landslide types ranging from shallow to deep-seated [35,63]. Data with *AR* < 5 mm are discarded from the data set, as the corresponding landslides are unlikely to have been triggered by rainfall [35]. The size of the provisional data set *Q* is then *q* ≤ 3*p*, where *p* is the number of landslide events in the raw calibration set.
	- A.2. The data are weighted to account for the event date uncertainty: *w* = 24/36 for the day a landslide was reported, *w* = 6/36 for the days prior to and after the reported date. This weighting is implemented by expanding the data set as described in [35]. The expanded set is denoted *R*.
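One way to materialize the weight-based expansion is plain replication, assuming here that the weights are applied as integer copy counts out of 36 (24 for the reported day, 6 for each adjacent day); the exact expansion scheme of [35] may differ:

```python
def expand_data_set(q):
    """Expand the provisional data set Q into R.
    `q` is a list of (ar, w) pairs, where w is the integer weight
    numerator out of 36 (24 for the reported day, 6 for each of the
    adjacent days); each pair is replicated w times."""
    r = []
    for ar, w in q:
        r.extend([ar] * w)
    return r
```

With this scheme, a single event whose three dates all survive the *AR* ≥ 5 mm filter contributes 36 entries to *R*.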
	- B.1. The number *t*C of data to be selected per *S* class is determined as

$$t_C = \frac{2 \times \text{TPE} \times r}{10},\tag{5}$$

where TPE refers to the desired threshold probability of exceedance, *r* is the number of data in *R*, and 10 is the number of log(*S*) classes.
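Applied to the worked example of Section 4.1 (TPE = 10%, 450 event dates, 10 classes), Equation (5) yields nine data per class; a minimal check (treating *r* here as the 450 dates of the example):

```python
def t_c(tpe, r, n_classes=10):
    """Number of data to select per log(S) class, Equation (5)."""
    return 2 * tpe * r / n_classes

# Worked example from Section 4.1: TPE = 10%, r = 450 event dates.
print(t_c(0.10, 450))  # → 9.0
```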


Threshold quality is evaluated through the correspondence between the obtained false negative rate (FNR, the actual fraction of data in *R* lying below the calculated threshold) and the nominal TPE. Differences may result from *t* being significantly smaller than (2 × TPE × *r*), from large outliers in *T*, and possibly also from bootstrap issues (see Section 5).
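The FNR check can be sketched as below, assuming the calculated threshold has already been evaluated at each data point's *S*, so that both arguments are point-wise *AR* values (the helper name is hypothetical):

```python
import numpy as np

def false_negative_rate(ar, ar_threshold):
    """FNR: fraction of data in R whose AR lies below the calculated
    threshold evaluated at the corresponding S (passed point-wise)."""
    ar = np.asarray(ar, dtype=float)
    ar_threshold = np.asarray(ar_threshold, dtype=float)
    return float(np.mean(ar < ar_threshold))
```

For a well-calibrated threshold, the returned FNR should be close to the nominal TPE.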

**Figure 4.** Workflow of the modified antecedent rainfall (*AR*)–susceptibility (*S*) threshold approach for landslides. The data sets used in or derived from each part of the workflow are highlighted in red. *RT* refers to the number of data in *R* below the threshold.
