*4.6. Measures*

To compare the performance of the forecasts, we use five different types of measures. The first three are measures of point forecasts, while the last two are measures of density forecasts. The difference is that measures of point forecasts use only the mean of the simulations, whereas measures of density forecasts use all simulations. Measures of density forecasts therefore give a more complete view of the full simulation and are not averaged out as the point-forecast measures are. However, point-forecast measures still give a good indication of performance and are computationally cheaper.

The first measure is the so-called 95% credible interval, which is obtained from the simulations: the 2.5% and 97.5% quantiles of the simulations form the lower and upper bounds, respectively. The idea behind this credible interval is that in 95% of the cases the actual observation will fall inside it. Another measure is sign predictability, in this paper referred to as the "success rate": the percentage of forecasts that move in the same direction as the actual observations. When the actual observation goes down and the forecast does as well, it counts as a "success"; likewise when both go up. In the two remaining cases it counts as a "fail", and the "success rate" is built from these counts. We do not perform formal sign-predictability tests, for the reason indicated by Christoffersen and Diebold (2006): tests that rely only on the sign ignore volatility dynamics, which carry information that is potentially valuable for detecting sign predictability.
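Both of these measures can be computed directly from the simulation output. The following Python sketch illustrates one way to do so; the array names (`draws`, `forecasts`, `actuals`, `previous`) are ours and not from the paper, and "up or down" is taken relative to the last observed value (if the series are returns, the sign of the value itself could be used instead).

```python
import numpy as np

# Minimal sketch (names are ours): `draws` holds the simulation draws for one
# forecast, while `forecasts`, `actuals` and `previous` hold the point
# forecasts, realizations at t+1 and observations at t for t = R, ..., T-1.

def credible_interval_95(draws):
    """2.5% and 97.5% quantiles of the simulations: the 95% credible interval."""
    return np.quantile(draws, 0.025), np.quantile(draws, 0.975)

def success_rate(forecasts, actuals, previous):
    """Share of forecasts whose direction (up or down from the last
    observation) matches the direction of the actual observation."""
    forecast_dir = np.sign(np.asarray(forecasts) - np.asarray(previous))
    actual_dir = np.sign(np.asarray(actuals) - np.asarray(previous))
    return np.mean(forecast_dir == actual_dir)
```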

The third measure is the Root Mean Squared Error (RMSE). The RMSE is preferred over the Mean Squared Error (MSE) since it is on the same scale as the data. Some authors (e.g., Armstrong (2001)) recommend the RMSE because it is more sensitive to outliers than the commonly used Mean Absolute Error (MAE). The RMSE is computed for each cryptocurrency series, *i* = Bitcoin, Ethereum, Ripple and Litecoin:

$$\text{RMSE}\_{i} = \sqrt{\frac{\sum\_{t=R}^{T-1} (\hat{y}\_{i,t+1} - y\_{i,t+1})^2}{T - R}}$$

where *R* is the length of the rolling window, *T* is the number of observations, $\hat{y}\_{i,t+1}$ is the forecast of the *i*th cryptocurrency made at time *t* for time *t* + 1, and $y\_{i,t+1}$ is the actual observation at time *t* + 1.
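A minimal sketch of this computation, assuming `forecasts` and `actuals` are arrays holding $\hat{y}\_{i,t+1}$ and $y\_{i,t+1}$ for *t* = *R*, ..., *T* − 1 (the names are ours, not the paper's):

```python
import numpy as np

def rmse(forecasts, actuals):
    """Root Mean Squared Error over the T - R out-of-sample forecasts."""
    errors = np.asarray(forecasts) - np.asarray(actuals)
    return np.sqrt(np.mean(errors ** 2))
```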

The fourth measure evaluates the density forecasts and is called the Log Predictive Score (LS). As for the RMSE, it is computed for each series:

$$\text{LS}\_{i} = \sum\_{t=R}^{T-1} \ln f(y\_{i,t+1})$$

where $f(y\_{i,t+1})$ is the predictive density for $y\_{i,t+1}$, given the information up to time *t*. The fifth measure is the Continuous Ranked Probability Score (CRPS). This is a continuous extension of the Ranked Probability Score (RPS) and can be defined by considering an integral of the Brier scores over all possible thresholds *x*. Denoting the predicted cumulative distribution function by $F(x) = p(X \le x)$ and the observed value of *X* by $y\_i$, the continuous ranked probability score can be written for each series as:

$$\text{CRPS}\_{i} = E\left(\int\_{-\infty}^{\infty} \left[F(x) - H(x - y\_i)\right]^2 dx\right),$$

where $H(x - y\_i)$ is the Heaviside function, which takes the value 0 when the threshold *x* is smaller than the observed value, and 1 otherwise (Jolliffe and Stephenson 2003).
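Both density scores can be approximated from the simulation draws for each out-of-sample period. The sketch below is illustrative only: the paper does not state how the predictive density *f* is evaluated, so a Gaussian kernel density estimate of the draws is assumed for the LS, and the CRPS is computed through the standard ensemble identity $E|X - y| - \tfrac{1}{2}E|X - X'|$; all names are ours.

```python
import numpy as np
from scipy.stats import gaussian_kde

def log_score(draws, y):
    """ln f(y): log predictive density at the realization, approximated by a
    Gaussian kernel density estimate of the simulation draws (our assumption)."""
    return np.log(gaussian_kde(draws)(y)[0])

def crps(draws, y):
    """CRPS via the ensemble identity E|X - y| - 0.5 * E|X - X'|.
    Note: the pairwise term is O(n^2) in the number of draws."""
    draws = np.asarray(draws)
    term1 = np.mean(np.abs(draws - y))
    term2 = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
    return term1 - term2

# LS_i sums log_score over t = R, ..., T-1; CRPS_i aggregates crps over the
# same out-of-sample periods.
```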

For the RMSE, LS and CRPS, we apply the *t*-test of Diebold and Mariano (1995) for each model against the benchmark. This test yields a *p*-value indicating the significance level. If a value in a table has one asterisk, the model performs better than the benchmark at the 5% significance level; if it has two asterisks, the model performs better at the 1% level. The first row of each table contains the RMSE, LS and CRPS of the benchmark, which is the BVAR model. The RMSE and CRPS of each model are reported as ratios to the benchmark, so that entries smaller than 1 indicate that the model yields more accurate forecasts than the benchmark. The LS of each model is reported as a difference from the benchmark, so that a positive number indicates that the model beats the benchmark.
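The following sketch shows a simplified Diebold–Mariano comparison, assuming one-step-ahead forecasts so that the loss differential needs no autocorrelation correction (for longer horizons a HAC variance would be used); the function and variable names are ours, not the paper's.

```python
import numpy as np
from scipy import stats

def diebold_mariano(loss_model, loss_bench):
    """DM statistic and one-sided p-value for H1: the model has lower expected
    loss than the benchmark. Losses are per-period, e.g. squared errors for the
    RMSE comparison or -ln f(y) for the LS comparison."""
    d = np.asarray(loss_model) - np.asarray(loss_bench)
    n = d.size
    dm = d.mean() / np.sqrt(d.var(ddof=1) / n)
    p_value = stats.t.cdf(dm, df=n - 1)  # small p-value: model beats the benchmark
    return dm, p_value
```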

The other procedure we use is the model confidence set procedure of Hansen et al. (2011), implemented in the R package *MCS* of Bernardi and Catania (2016). The procedure compares all predictions jointly and eliminates a model if it is significantly worse, ending up with the set of best-performing models among those considered. The models shaded grey in the tables are those that are not significantly worse than the other models.
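The paper carries out this procedure with the R package *MCS*; purely as an illustration, a comparable model confidence set implementation is available in the Python *arch* package. The substitution, the placeholder loss matrix and the interface shown below are our assumptions, not the paper's setup.

```python
import numpy as np
from arch.bootstrap import MCS  # Python analogue of the R MCS package (assumption)

# `losses` is a (T - R) x n_models array of per-period losses, one column per
# model; here it is only a random placeholder for illustration.
losses = np.random.default_rng(0).random((200, 4))

mcs = MCS(losses, size=0.10)  # significance level used in the elimination test
mcs.compute()
print(mcs.included)           # models kept in the confidence set
print(mcs.pvalues)            # MCS p-values per model
```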
