### 2.4. Statistical Analysis

#### 2.4.1. Model Derivation

We used XGBoost, a scalable ML system for tree boosting [25], via the available open-source package [26] for Python, Version 3.7.4 (Python Software Foundation, Wilmington, DE, United States) [27].

The first release of the model was trained using data related to 1st April 2020 (training dataset index date), while the second and third versions were derived using data related to 15th July 2020 and 1st November 2020, respectively. We considered all clinics delivering services to at least one patient on the index date as well as during the week before the index date.

#### 2.4.2. Model Accuracy and Feature Importance

The prediction accuracy of each release was tested on the first and fifteenth day of each month (validation dataset index dates). Consequently, the development and validation datasets may include the same set of clinics/patients every two weeks.
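The biweekly validation schedule described above can be generated with the standard library; the helper name and the example date range are ours, chosen to fall within the study period.

```python
from datetime import date

def validation_dates(start, end):
    """Return every 1st and 15th of the month between start and end, inclusive."""
    dates = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        for day in (1, 15):
            d = date(year, month, day)
            if start <= d <= end:
                dates.append(d)
        # advance to the next calendar month
        month += 1
        if month == 13:
            month, year = 1, year + 1
    return dates

# Example: validation index dates between the first two training index dates.
dates = validation_dates(date(2020, 4, 1), date(2020, 7, 15))
```
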

To evaluate model performance, we measured the area under the receiver operating characteristic (ROC) curve (AUC) on the validation datasets [28] using Python, Version 3.7.4 [27]. The AUC provides an aggregate measure of performance, as the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at all classification thresholds. Model discrimination ability over time was monitored by visual inspection of AUC trends. For illustrative purposes, we also reported classification performance in terms of P(Outbreak|Class) (i.e., the probability of an outbreak (Yes/No) given the assigned risk class (L/M/H)) and P(Class|Outbreak) (i.e., the probability of the assigned risk class given an outbreak) for the two action thresholds chosen (0.015 and 0.125). To calculate P(Outbreak|Class) and P(Class|Outbreak), we artificially treated the problem as a binary decision at each threshold. We computed average probability values across the whole study period.
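The two quantities above can be computed directly: the AUC equals the probability that a randomly chosen positive case ranks above a randomly chosen negative one (the Mann-Whitney U formulation), and the conditional probabilities follow from binarizing the risk score at one action threshold. The labels and scores below are synthetic placeholders, not study data.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic (ties receive average rank)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):           # average ranks over tied scores
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def conditional_probs(y_true, scores, threshold):
    """P(outbreak | flagged) and P(flagged | outbreak) after binarizing
    the risk score at a single action threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    return y_true[pred].mean(), pred[y_true == 1].mean()

# Synthetic example: 8 clinics, 3 of which had an outbreak.
y = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p = np.array([0.01, 0.02, 0.05, 0.10, 0.20, 0.03, 0.30, 0.08])
auc = roc_auc(y, p)
p_outbreak_given_class, p_class_given_outbreak = conditional_probs(y, p, 0.125)
```
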

Feature importance was computed using the SHapley Additive exPlanations (SHAP) method [29]. This analysis provides intuitive model explainability through an accurate and efficient estimation of each input variable's contribution to the predicted risk.
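SHAP attributes to each feature its average marginal contribution over all feature orderings. For a toy additive risk model this can be computed exactly by enumerating subsets, which makes the idea concrete; in practice the `shap` package estimates these values efficiently for tree ensembles. Everything below is a conceptual toy with made-up feature names, not the study's model.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, features):
    """Exact Shapley values of value_fn (a set function) for each feature."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            for subset in combinations(others, k):
                # weight of a coalition of size k in the Shapley formula
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                gain = value_fn(set(subset) | {f}) - value_fn(set(subset))
                phi[f] += weight * gain
    return phi

# Toy additive "risk model": baseline 0.02 plus fixed per-feature effects.
contrib = {"positive_tests": 0.10, "clinic_size": 0.03, "local_incidence": 0.05}

def toy_risk(subset):
    return 0.02 + sum(contrib[f] for f in subset)

phi = exact_shapley(toy_risk, list(contrib))
```

For an additive model each Shapley value recovers that feature's own effect, and the values sum to the prediction minus the baseline (the "additive" property the method is named for).
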

#### 2.4.3. Descriptive Statistics

For both the training and validation datasets, we analyzed the number of active clinics, the frequency and incidence of COVID-19 outbreaks, the distribution of clinics across the predicted risk levels (low, medium, high), and the relative risk compared with clinics in the low-risk group, using Python, Version 3.7.4.
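The relative-risk statistic mentioned above is the outbreak incidence in each class divided by the incidence in the low-risk reference class. The helper and the counts below are illustrative assumptions, not study results.

```python
def relative_risk(outbreaks, clinics, reference="low"):
    """Outbreak incidence per risk class, relative to the reference class."""
    incidence = {c: outbreaks[c] / clinics[c] for c in clinics}
    return {c: incidence[c] / incidence[reference] for c in clinics}

# Hypothetical counts: active clinics per class and outbreaks among them.
clinics = {"low": 400, "medium": 150, "high": 50}
outbreaks = {"low": 4, "medium": 9, "high": 15}

rr = relative_risk(outbreaks, clinics)
```
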

### 3. Results

#### 3.1. Dialysis Clinic Characteristics

Model versions 1, 2, and 3 were trained using datasets related to 1st April 2020, 15th July 2020, and 1st November 2020, respectively. On these dates, 589, 597, and 603 clinics were active, while 34 (5.77%), 44 (7.37%), and 233 (38.64%) clinics, respectively, had two or more patients with COVID-19 infection in the fortnight after the index date.

The surveillance system stratifies clinics by their risk of a new local outbreak within two weeks. To facilitate interpretation of the results, we established three risk categories: (1) low, when outbreak risk is less than or equal to 1.5%; (2) medium, when risk is greater than 1.5% and less than or equal to 12.5%; (3) high, when risk is greater than 12.5%. The risk thresholds depend both on the incidence of the pandemic and on the ability of a given clinic to implement containment measures. Figure 2 reports the share of active dialysis clinics in each risk class at each testing date.
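The stratification rule above maps a predicted two-week outbreak probability to a class using the two action thresholds (0.015 and 0.125); the function name is ours.

```python
def risk_class(p):
    """Map a predicted two-week outbreak probability to a risk class."""
    if p <= 0.015:
        return "low"
    if p <= 0.125:
        return "medium"
    return "high"
```

Note that both thresholds are inclusive on the lower class, matching the "less than or equal to" wording of the category definitions.
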

The actual outbreak incidence in the dialysis clinics during the validation period is reported in Figure 3.

**Figure 2.** Number of dialysis clinics at the validation dates. Colors denote risk categories: red, high (>12.5%); yellow, medium (1.5% < risk ≤ 12.5%); green, low (≤1.5%).

**Figure 3.** Model performance and incidence of clinics with outbreaks: the plot reports data related to the one-year observation period.
