*2.4. Evaluation Process*

Hourly precipitation measurements covering the entire study area were used to evaluate model performance in both the CTL and ZTD numerical experiments. The data were collected from a dense network of weather stations operated by NOA [29]. Because data availability varied, a different set of rain gauges was used for each case study; in total, 340 to 360 rain gauges were used for each rainfall event (Figure 1b). The observed and modeled precipitation data were paired in time and space by considering the nine model grid points nearest to each rain gauge. Of these nine, the grid point whose predicted value was closest to the observed one was selected for evaluation, so that the model was not penalized for small spatial displacements of rainfall. The evaluation was performed for the 24 h and 6 h accumulated precipitation at 0000 UTC, 0600 UTC, 1200 UTC, and 1800 UTC.
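The pairing step described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the `(lat, lon, value)` tuple layout, and the use of great-circle distance to rank grid points are assumptions made for illustration only.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def best_of_nearest(gauge_lat, gauge_lon, observed, grid_points, k=9):
    """Return the modeled value paired with one gauge observation.

    grid_points is an iterable of (lat, lon, modeled_value) tuples
    (hypothetical layout). The k grid points nearest to the gauge are
    kept, and the modeled value closest to the observation is returned.
    """
    nearest = sorted(
        grid_points,
        key=lambda p: haversine_km(gauge_lat, gauge_lon, p[0], p[1]),
    )[:k]
    return min(nearest, key=lambda p: abs(p[2] - observed))[2]
```

Restricting the search to the nine nearest points keeps the tolerance for displacement local: a well-matching value further away in the domain is not allowed to stand in for the forecast at the gauge.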

Using the observation-model pairs, qualitative statistical measures were computed on a dichotomous basis (occurrence/no occurrence of precipitation) for six distinct rain thresholds: above 0.2, 1, 2, 5, 10, and 20 mm. The computed scores included the probability of detection (POD), the false alarm ratio (FAR), the equitable threat score (ETS), and the frequency bias (FBIAS). POD is the fraction of observed events that were correctly modeled and ranges from zero (0: wrong forecast) to one (1: perfect forecast). FAR is the fraction of forecast events that were not observed and ranges from zero (0: perfect forecast) to one (1: wrong forecast). ETS measures the skill of a model prediction while accounting for forecasts that are correct by chance. ETS values close to unity (1) indicate a high-accuracy forecast, whereas ETS values close to zero (0), or even negative, indicate poor or random forecast quality. FBIAS is the ratio between the frequencies of forecast and observed events and indicates whether a model tends to underestimate (FBIAS < 1) or overestimate (FBIAS > 1) the frequency of occurrence of the observed events. To determine the statistical significance of the differences in the qualitative scores between the two experiments, a hypothesis test was applied at two confidence levels: (i) 90% and (ii) 95%. The test was based on the construction of a probability density function consistent with the assumption that there was no difference between the qualitative statistical measures computed from the CTL simulations and those computed from the ZTD simulations. A brief description of the implemented method can be found in Giannaros et al. [64].

Quantitative statistical measures, namely the mean bias (MB) and the mean absolute error (MAE), were also calculated for each precipitation threshold in order to account for the magnitude of the errors. MB measures the tendency of the model to underestimate (MB > 0) or overestimate (MB < 0) rainfall, while MAE represents the mean absolute deviation between the observed and modeled precipitation. Following Giannaros et al. [64], the statistical significance of the MAE differences between the CTL and ZTD experiments was assessed by applying the non-parametric Wilcoxon signed-rank test at the 90% and 95% confidence levels. In addition to the statistical evaluation, the modeled differences in the rainfall distribution were thoroughly investigated for two events representing different synoptic conditions.
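As an illustration of the scores described in this section, the following minimal sketch computes POD, FAR, FBIAS, ETS, MB, and MAE from paired observed and modeled values. It uses the standard 2x2 contingency-table definitions and is not the authors' implementation. Three points are assumptions: an event is taken as a value strictly above the threshold, MB is taken as the mean of observed minus modeled values (chosen so that MB > 0 corresponds to underestimation, as stated in the text), and the selection of pairs entering MB and MAE for a given threshold is left to the caller. All function names are hypothetical.

```python
import math

def contingency_table(obs, mod, threshold):
    """Count hits, misses, false alarms, correct negatives for 'rain > threshold'."""
    hits = misses = false_alarms = correct_negatives = 0
    for o, m in zip(obs, mod):
        observed, forecast = o > threshold, m > threshold
        if observed and forecast:
            hits += 1
        elif observed:
            misses += 1
        elif forecast:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives

def _ratio(num, den):
    """Divide, returning NaN when the score is undefined (zero denominator)."""
    return num / den if den else math.nan

def categorical_scores(obs, mod, threshold):
    """Return POD, FAR, FBIAS and ETS for one rain threshold."""
    hits, misses, false_alarms, correct_negatives = contingency_table(obs, mod, threshold)
    n = hits + misses + false_alarms + correct_negatives
    observed_yes = hits + misses
    forecast_yes = hits + false_alarms
    # Number of hits expected from a random forecast with the same event frequencies.
    hits_random = _ratio(observed_yes * forecast_yes, n)
    return {
        "POD": _ratio(hits, observed_yes),
        "FAR": _ratio(false_alarms, forecast_yes),
        "FBIAS": _ratio(forecast_yes, observed_yes),
        "ETS": _ratio(hits - hits_random, hits + misses + false_alarms - hits_random),
    }

def mean_bias(obs, mod):
    """Mean of (observed - modeled); this sign convention is an assumption."""
    diffs = [o - m for o, m in zip(obs, mod)]
    return sum(diffs) / len(diffs)

def mean_absolute_error(obs, mod):
    """Mean of |observed - modeled|."""
    diffs = [abs(o - m) for o, m in zip(obs, mod)]
    return sum(diffs) / len(diffs)
```

For the significance of the MAE differences, the paired absolute errors of the two experiments could be passed to an existing signed-rank routine such as `scipy.stats.wilcoxon`; whether this matches the exact procedure of [64] is not stated in the text.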
