#### *2.5. Pseudo-Reality Framework*

To make inferences on the potential future performance of the selected univariate and multivariate methods, intermodel cross-validation is performed using the so-called pseudo-reality approach (Figure 4). In the first stage, one GCM-RCM at a time is used as the verifying model (i.e., the pseudo-reality) against which the remaining models are adjusted with the selected methods. The bias adjusted simulations are then compared against the pseudo-reality GCM-RCM using a set of performance measures. The resulting cross-validation statistics are averaged over all pseudo-realities to obtain an overall view of bias correction performance under changing climatic conditions. The same framework is applied to the hydrological simulations to see to what extent the relative performance of the selected MOS methods differs when inspected from the hydrological modeling point of view. To this end, the bias adjusted temperature and precipitation time series are used as input to the E-HYPE sub-models, which are then run to simulate future hydrological conditions in the selected catchments. The hydrological simulations are cross-validated in the same manner as the GCM-RCM simulations, using complementary performance measures.
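To make the cross-validation procedure concrete, a minimal sketch of the leave-one-out loop is given below. The helper functions `bias_adjust`, `run_ehype`, and `score` are hypothetical placeholders for the bias adjustment methods, the E-HYPE sub-model runs, and the performance measures of Section 2.6; only the looping and averaging logic reflects the procedure described above.

```python
import numpy as np

def pseudo_reality_cv(gcm_rcm_runs, methods, bias_adjust, run_ehype, score):
    """Leave-one-out cross-validation over GCM-RCM pseudo-realities.

    gcm_rcm_runs : dict mapping model name -> simulated meteorological series
    methods      : identifiers of the MOS methods to be compared
    bias_adjust, run_ehype, score : hypothetical callables standing in for
        the bias adjustment step, the E-HYPE sub-model run, and the
        performance measures, respectively.
    """
    stats = {method: [] for method in methods}
    for ver_name, pseudo_reality in gcm_rcm_runs.items():
        # The remaining models form the predicting ensemble.
        predicting = {k: v for k, v in gcm_rcm_runs.items() if k != ver_name}
        for method in methods:
            # Adjust each predicting model against the pseudo-reality ...
            adjusted = {k: bias_adjust(v, pseudo_reality, method)
                        for k, v in predicting.items()}
            # ... optionally propagate through the hydrological model ...
            hydro = {k: run_ehype(v) for k, v in adjusted.items()}
            # ... and score against the hydrology driven by the pseudo-reality.
            stats[method].append(score(hydro, run_ehype(pseudo_reality)))
    # Average the cross-validation statistics over all pseudo-realities.
    return {method: float(np.mean(values)) for method, values in stats.items()}
```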

**Figure 4.** An illustration of the pseudo-reality framework procedures in the baseline period, applied from both the climate modeling and the hydrological modeling perspective.

In order to improve the applicability of the pseudo-reality approach to hydrological simulations, two ways to construct pseudo-realities were tested (Figure 4): (a) raw GCM-RCM simulations were used as pseudo-realities without taking their biases relative to observations (i.e., WFDEI) into account [25]; (b) the annual cycle of the GCM-RCM acting as pseudo-reality was adjusted for its biases relative to WFDEI by simply removing the mean bias at each day of the annual cycle using a 30-day sliding window. Daily adjustments were applied instead of monthly ones in order to avoid additional jumps in the annual cycle of the pseudo-reality time series. This shift in the mean values obviously alters the bias between the pseudo-reality and the predicting models but leaves the simulated changes in the pseudo-reality relatively untouched (see Figure S2 in the supplementary material). The motivation for the second approach is apparent: biases relative to the observed climate are substantial in some of the selected GCM-RCMs, which leads to unrealistic hydrological model behavior both in the pseudo-reality runs and in the verifying hydrological simulations. For example, substantial cold biases at high-altitude regions in the Sava sub-model and during winter in the Tornio sub-model cause unrealistic volumes of snow to accumulate throughout the simulation periods. We argue that without this additional bias adjustment step, the use of GCM-RCMs as pseudo-realities when cross-validating bias adjustment methods from a hydrological modeling perspective might not be reasonable due to unrealistic shifts in hydrological regimes. One should note that although the intention is to keep the daily variability of the pseudo-reality time series untouched, the multiplicative scaling applied to daily precipitation slightly modifies the spread of the precipitation distributions in both the baseline and scenario periods. This also slightly changes the daily variability of the hydrological simulations accordingly.
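As an illustration of option (b), the sketch below removes the mean bias of the annual cycle relative to observations using a centred 30-day sliding window, additively for temperature and multiplicatively for precipitation. The function names, the 365-day calendar, and the array layout are simplifying assumptions, not part of the original implementation.

```python
import numpy as np

def annual_cycle_stat(values, doy, window=30):
    """Mean of `values` for each day of year, computed over a centred
    sliding window that wraps around the turn of the year (doy in 1..365)."""
    out = np.zeros(365)
    for d in range(1, 366):
        dist = np.minimum(np.abs(doy - d), 365 - np.abs(doy - d))
        out[d - 1] = values[dist <= window // 2].mean()
    return out

def adjust_pseudo_reality(series, doy, obs, obs_doy, multiplicative=False):
    """Remove the mean bias of the annual cycle relative to observations:
    additively (temperature) or multiplicatively (precipitation)."""
    model_cycle = annual_cycle_stat(series, doy)
    obs_cycle = annual_cycle_stat(obs, obs_doy)
    if multiplicative:
        scale = obs_cycle / np.maximum(model_cycle, 1e-6)
        return series * scale[doy - 1]
    return series - (model_cycle - obs_cycle)[doy - 1]
```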

#### *2.6. Metrics for GCM-RCM Simulations*

To assess the general similarity between the empirical cumulative probability distributions of the predicting models *F*pred and of the GCM-RCM acting as pseudo-reality *F*ver, the two-sample Cramér–von Mises (CM) statistic [43] was calculated according to

$$\text{CM} = A \left\langle \frac{mn}{(m+n)^2} \left\{ \sum_{i=1}^{m} \left[ \hat{F}_{\text{pred}}(x_i) - F_{\text{ver}}(x_i) \right]^2 + \sum_{j=1}^{n} \left[ \hat{F}_{\text{pred}}(y_j) - F_{\text{ver}}(y_j) \right]^2 \right\} \right\rangle,\tag{3}$$

where the hat denotes the empirical distribution estimated from the pooled sample of the four predicting GCM-RCM simulations, while *m* and *n* are the numbers of values in the pooled sample (*x*) and in the pseudo-reality (*y*), respectively. The actual calculations were made for binned data using bin widths of 1 °C and 1 mm d<sup>−1</sup> and the same number of bins with identical bin boundaries for both the predicting GCM-RCMs and the pseudo-reality GCM-RCM. *A* indicates an average over the 12 months and the area of a sub-model. CM measures the similarity of two empirical distributions in probability space and puts more weight on discrepancies in the tails of the cumulative distributions than the widely used Kolmogorov–Smirnov statistic, which measures the maximum distance between the cumulative probability distributions. Comparison between these two statistics did not reveal significant differences, and the results are therefore shown only for CM.
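A minimal numpy sketch of the two-sample CM statistic of Equation (3) for a single month and location is given below; the averaging operator *A* is left to the caller, and the binning is approximated here by discretizing both samples onto a common grid, which is an assumption rather than the exact implementation.

```python
import numpy as np

def cramer_von_mises(pred_pooled, ver, bin_width=1.0):
    """Two-sample Cramér-von Mises statistic of Equation (3), without the
    averaging operator A, between the sample pooled over the predicting
    GCM-RCMs (x) and the pseudo-reality sample (y), using binned data."""
    # Discretize both samples onto a common grid of bin_width-wide bins.
    x = np.floor(np.asarray(pred_pooled) / bin_width) * bin_width
    y = np.floor(np.asarray(ver) / bin_width) * bin_width
    m, n = len(x), len(y)

    def ecdf(sample, points):
        # Empirical CDF of `sample` evaluated at `points`.
        return np.searchsorted(np.sort(sample), points, side="right") / len(sample)

    term_x = np.sum((ecdf(x, x) - ecdf(y, x)) ** 2)
    term_y = np.sum((ecdf(x, y) - ecdf(y, y)) ** 2)
    return m * n / (m + n) ** 2 * (term_x + term_y)
```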

The second statistic, the mean absolute error (MAE), was calculated over the quantiles *i* (*i* = 1, ..., 100) of the predicting and verifying (i.e., pseudo-reality) model distributions following

$$\text{MAE} = A \left\langle \left| \hat{F}_{\text{pred}}^{-1}(i) - F_{\text{ver}}^{-1}(i) \right| \right\rangle,\tag{4}$$

where *A* here encompasses averaging over the distribution quantiles in addition to the temporal and spatial averaging. The analysis was also repeated using the mean squared error, but the results did not differ substantially from those for MAE. Thus, the relative method performance is illustrated in terms of MAE in the remainder of the paper.
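For reference, the quantile-based MAE of Equation (4) can be sketched as follows; the spatial and temporal averaging contained in *A* is again assumed to be applied outside the function.

```python
import numpy as np

def quantile_mae(pred_pooled, ver, n_quantiles=100):
    """Mean absolute error over the percentiles of the predicting (pooled)
    and verifying distributions, as in Equation (4) before averaging with A."""
    q = np.arange(1, n_quantiles + 1) / n_quantiles * 100  # percentiles 1..100
    pred_q = np.percentile(pred_pooled, q)
    ver_q = np.percentile(ver, q)
    return np.mean(np.abs(pred_q - ver_q))
```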

Two statistics measuring errors in the inter-variable correlations were calculated. First, to assess to what extent the linear correlation is modified by the different methods, the MAE in the Pearson correlation coefficient was calculated between the average correlation coefficient of the four predicting models and that of the pseudo-reality, averaged in a similar manner as in Equation (3). Second, to evaluate the remaining errors in the full dependence structure, the empirical copula density was approximated from the pseudo-observations (*u*, *v*), estimated for the *i*th temperature (*x*) and precipitation (*y*) values as *u* = rank(*x<sub>i</sub>*)/(*n* + 1) and *v* = rank(*y<sub>i</sub>*)/(*n* + 1), where *n* is the number of values for both variables. These values were binned two-dimensionally, using a bin width of 0.1, and normalized such that the histogram approximately corresponds to the copula density. The MAE between the empirical copula densities of the predicting GCM-RCMs and the pseudo-reality was then calculated according to

$$\text{MAE}_{c} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{c}_{\text{pred}}(i) - c_{\text{ver}}(i) \right|. \tag{5}$$

In Equation (5), *ĉ*pred denotes the empirical copula density averaged over the four predicting models, *c*ver the copula density calculated for the pseudo-reality, and *n* the number of bins used to estimate the copula density. In the following, the subscript c is dropped for brevity. MAE based on kernel density estimates was also tested, but the resulting statistics depended substantially on the chosen kernel method and kernel width and were thus not considered further in this study. To reduce the effect of sampling noise on the results, temperature-precipitation pairs were pooled over the area of each sub-model and over each season before estimating the empirical copula densities. Identical values were handled using the same approach as in Gennaretti et al. [35]: tied values were first assigned ranks randomly before estimating the empirical copula density. This was repeated 10 times, and the final copula density was calculated as the average of the randomly ranked estimates. Despite being simple and not a proper goodness-of-fit measure, this statistic readily illustrates how well each method is capable of adjusting the full dependence structure. Gennaretti et al. [35] briefly pointed out that the measured performance depended on whether dry days were included when estimating the empirical copula density. While the focus here is on the copula density estimated from the full time series, the results for the wet-day copula can be found in the supplementary material (Figure S3).
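The estimation of the empirical copula density and of the MAE in Equation (5) can be sketched as follows; the tie handling follows the randomized-rank idea of Gennaretti et al. [35], while the function names, the fixed random seed, and the 10 × 10 grid construction are illustrative assumptions.

```python
import numpy as np

def random_ranks(values, rng):
    """Ranks 1..n with ties broken at random (shuffle, then ordinal ranking)."""
    values = np.asarray(values)
    n = len(values)
    perm = rng.permutation(n)
    order = np.argsort(values[perm], kind="stable")
    ranks = np.empty(n, dtype=int)
    ranks[perm[order]] = np.arange(1, n + 1)
    return ranks

def empirical_copula_density(temp, prec, bin_width=0.1, n_repeats=10, seed=0):
    """Approximate copula density of (temperature, precipitation) pairs on a
    regular grid, averaged over random rankings of tied values (cf. [35])."""
    rng = np.random.default_rng(seed)
    n = len(temp)
    edges = np.linspace(0.0, 1.0, int(round(1.0 / bin_width)) + 1)
    density = np.zeros((len(edges) - 1, len(edges) - 1))
    for _ in range(n_repeats):
        # Pseudo-observations u = rank(x_i)/(n+1), v = rank(y_i)/(n+1).
        u = random_ranks(temp, rng) / (n + 1)
        v = random_ranks(prec, rng) / (n + 1)
        hist, _, _ = np.histogram2d(u, v, bins=[edges, edges])
        # Normalize so that the histogram approximates a copula density.
        density += hist / n / bin_width**2
    return density / n_repeats

def copula_mae(c_pred, c_ver):
    """MAE between two empirical copula densities, as in Equation (5)."""
    return np.mean(np.abs(np.asarray(c_pred) - np.asarray(c_ver)))
```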
