**5. Empirical Results**

First, appropriate ARIMA and NNAR models must be selected to forecast the next-day Bitcoin price for the test sample. ARIMA models are chosen based on the lowest AIC, with stationarity assessed via the PP test, using the auto.arima function of the forecast package in R. Selecting an appropriate NNAR model is more challenging. For the first training-sample period (500 days), 14 different NNAR(p,k) specifications are estimated and evaluated on their forecast performance (without re-estimation) over the first test-sample period (1966 days). The results are presented in Figure A1 in Appendix A. Interestingly, training-sample forecast performance improves as the number of lags and hidden nodes increases (see Figure A1a), but NNAR(2,1) performs best in the test-sample forecast (see Figure A1b). Therefore, NNAR(2,1) is selected for the estimation of the first training and test samples. The same 14 models are estimated and compared for the second training and test samples (see Figure A2), and NNAR(1,2) is selected based on test-sample forecast performance. In the employed NNAR framework, it is noteworthy that test-sample forecast performance is always better with fewer lags and hidden nodes, in contrast to training-sample forecast performance.

For next-day Bitcoin price forecasts without re-estimation of the model at each step, the two models selected for the first training and test samples are ARIMA(4,1,0) and NNAR(2,1), and for the second training and test samples ARIMA(4,1,1) and NNAR(1,2). We adopt the static forecast approach, illustrated in Figure 3. When an autoregressive model is used in the static forecast approach, the actual values of the dependent variable in previous periods are used to compute each one-step forecast. In contrast, when forecasting multiple periods ahead, the dynamic forecast approach uses previously forecasted (out-of-sample) values of the dependent variable to compute each forecast. Table 2 presents the training-sample forecast performance of the ARIMA and NNAR models in terms of RMSE, MAPE and MASE, and Table 3 presents their test-sample forecast performance. According to Table 2, the NNAR models perform better than ARIMA in the first training-sample period, but ARIMA is better in the second. According to Table 3, both without and with re-estimation of the forecast models, the ARIMA models outperform NNAR in the test-sample forecast of the next-day Bitcoin price. The log-transformed Bitcoin price series and its forecasted values using ARIMA and NNAR under the different estimation approaches are presented in Figure 4.


**Figure 3.** Illustration of the static forecast approach.


**Table 2.** Training-sample forecast performance.

Bold numbers indicate best performance.
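The three accuracy measures reported in Tables 2 and 3 can be written out explicitly. The paper computes them via the R forecast package; this is an illustrative Python sketch with made-up numbers.

```python
# RMSE, MAPE and MASE as used to compare the forecast models.
import numpy as np

def rmse(actual, forecast):
    # Root mean squared error
    return np.sqrt(np.mean((actual - forecast) ** 2))

def mape(actual, forecast):
    # Mean absolute percentage error (in percent)
    return 100 * np.mean(np.abs((actual - forecast) / actual))

def mase(actual, forecast, train):
    # Mean absolute scaled error: MAE scaled by the in-sample
    # one-step naive (random-walk) forecast error.
    scale = np.mean(np.abs(np.diff(train)))
    return np.mean(np.abs(actual - forecast)) / scale

# Hypothetical values for illustration only:
actual = np.array([10.0, 10.2, 10.1, 10.4])
fc = np.array([10.1, 10.1, 10.2, 10.3])
train = np.array([9.5, 9.8, 9.7, 10.0, 10.0])
print(rmse(actual, fc), mape(actual, fc), mase(actual, fc, train))
```

MASE is attractive here because, unlike MAPE, it is scale-free without dividing by the price level, so it is robust when prices vary over several orders of magnitude, as Bitcoin's do.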

**Table 3.** Test-sample static forecast performance.


Bold numbers indicate best performance.

To confirm the validity of the forecast models, diagnostic checks are conducted. *p*-values of the Box-Ljung (BL) test (Ljung and Box 1978) suggest that the residuals of all employed models are free from autocorrelation (*p*-values > 0.05 considering eight lags). The BL test on the squared residuals of the ARIMA models indicates the presence of conditional heteroscedasticity (*p*-values < 0.05); thus, future research on Bitcoin price forecasting should consider hybrid models combining ARIMA with ARCH and GARCH. The Jarque-Bera test (Jarque and Bera 1980) results suggest that the residuals are not normally distributed (*p*-values < 0.05). Non-normality of the residuals should not be an issue for the NNAR model, as the error series in such models is assumed to be homoscedastic (and normally distributed) when training the model on the training sample (Hyndman and Athanasopoulos 2018).

(**a**) Actual and forecasted Bitcoin price (training sample: 500 days, test sample: 1966 days). [Figure panel: log-transformed Bitcoin price plotted against the ARIMA and NNAR forecasts, each with and without re-estimation.]

(**c**) Actual and forecasted Bitcoin price (training sample: 2000 days, test sample: 466 days)



(**d**) Concentrated view of the forecast period (test sample: 466 days)

**Figure 4.** Bitcoin price forecast. (**a**,**b**) refer to the first training and test samples forecast in comprehensive and concentrated views, respectively, and (**c**,**d**) refer to the second training and test samples forecast in comprehensive and concentrated views, respectively.

Further, we perform the Diebold-Mariano (DM) test (Diebold and Mariano 1995) to compare the test-sample forecast results obtained from the two employed models, ARIMA and NNAR. DM test results are presented in Table 4. Here, the alternative hypothesis is that the forecast results of the second method are less accurate than those of the first; thus, a *p*-value of less than 0.05 indicates better accuracy of the first method. The DM test results are consistent with Table 3: the ARIMA model is more accurate than NNAR in forecasting the test-sample Bitcoin price. Notably, the forecasts of the ARIMA models with and without re-estimation at each step are identical, whereas the NNAR model with re-estimation at each step performs considerably better than its counterpart without re-estimation.

**Table 4.** DM test of forecast results.


*p* < 0.05 indicates that the forecast results of the first method are better than those of the second method.
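The one-sided DM test described above can be sketched directly. This is a minimal illustration under squared-error loss for one-step-ahead forecasts (forecast horizon h = 1, so no autocovariance terms enter the variance of the loss differential) with no small-sample correction; the forecast errors are synthetic, not the paper's.

```python
# Minimal one-sided Diebold-Mariano test under squared-error loss, h = 1.
import numpy as np
from scipy import stats

def dm_test(e1, e2):
    """H1: method 1 (errors e1) is more accurate than method 2 (errors e2)."""
    d = e1 ** 2 - e2 ** 2            # loss differential per forecast
    n = d.size
    dm = d.mean() / np.sqrt(d.var(ddof=1) / n)
    p = stats.norm.cdf(dm)           # very negative dm -> small p -> method 1 better
    return dm, p

rng = np.random.default_rng(3)
e_arima = rng.normal(0, 1.0, 466)    # hypothetical ARIMA test-sample errors
e_nnar = rng.normal(0, 1.3, 466)     # hypothetical NNAR errors (larger spread)
dm, p = dm_test(e_arima, e_nnar)
print(round(dm, 3), round(p, 4))
```

With these hypothetical errors the first method has the smaller expected loss, so the statistic is negative and the test rejects in favor of method 1, mirroring how Table 4 is read.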
