**5. Numerical Experiments**

In this section, we evaluate the performance of the three presented ensemble strategies, which utilize advanced deep learning models as component learners. The implementation was written in Python 3.4; all deep learning models were built with the Keras library [40] using Theano as the back-end.

For the purpose of this research, we utilized hourly prices of the cryptocurrencies BTC, ETH and XRP from 1 January 2018 to 31 August 2019. For evaluation purposes, the data were divided into a training set and a testing set, as in [7,11]. More specifically, the training set comprised data from 1 January 2018 to 28 February 2019 (10,177 datapoints), covering a wide range of long- and short-term trends, while the testing set consisted of data from 1 March 2019 to 31 August 2019 (4415 datapoints), which ensured a substantial amount of unseen out-of-sample prices for testing.
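This chronological split can be sketched in pandas as follows; this is an illustrative reconstruction, not the authors' code, and the column names (`timestamp`, `close`) are assumptions:

```python
import pandas as pd

# Hypothetical sketch of the date-based train/test split described above.
# Assumes an hourly DataFrame with a "timestamp" and a "close" column.
def split_by_date(df):
    df = df.sort_values("timestamp").set_index("timestamp")
    train = df.loc[:"2019-02-28 23:00"]    # 1 Jan 2018 - 28 Feb 2019
    test = df.loc["2019-03-01 00:00":]     # 1 Mar 2019 - 31 Aug 2019
    return train, test
```

Splitting by date (rather than randomly) preserves the temporal ordering, so the testing set is strictly out-of-sample.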

Next, we concentrate on the experimental analysis of the presented ensemble strategies, using the advanced deep learning models CNN-LSTM and CNN-BiLSTM as base learners. A detailed description of both component models is presented in Table 1. These models and their hyper-parameters were selected in previous research [7] after extensive experimentation, in which they exhibited the best performance on the utilized datasets. Both component models were trained for 50 epochs with the Adaptive Moment Estimation (ADAM) algorithm [41], a batch size of 512 and a mean-squared-error loss function. The ADAM algorithm ensures that the learning steps during training are scale-invariant with respect to the parameter gradients.


**Table 1.** Parameter specification of two base learners.
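The training setup above can be sketched as follows. This is an illustrative reconstruction written against the modern `tf.keras` API (rather than the original Keras/Theano stack); the convolution and LSTM layer sizes are assumptions for the sketch, not the exact values of Table 1.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

# Sketch of a CNN-LSTM base learner: a 1-D convolution extracts local
# features from the window of previous hourly prices, the LSTM models
# their temporal structure, and a dense head outputs the next-hour price.
# Layer sizes are illustrative assumptions.
def build_cnn_lstm(window=9, n_features=1):
    model = Sequential([
        Conv1D(64, kernel_size=2, activation="relu",
               input_shape=(window, n_features)),
        MaxPooling1D(pool_size=2),
        LSTM(70),
        Dense(1),  # next-hour price forecast
    ])
    # Training configuration reported in the text:
    # ADAM optimizer, MSE loss (then fit for 50 epochs, batch size 512).
    model.compile(optimizer="adam", loss="mse")
    return model
```

Training would then be `model.fit(X_train, y_train, epochs=50, batch_size=512)`, matching the settings stated above.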

The performance of all ensemble models was evaluated using the Root Mean Square Error (RMSE). Additionally, the classification accuracy of all ensemble deep models was measured relative to the problem of predicting whether the cryptocurrency price would increase or decrease in the following hour. More specifically, by analyzing a number of previous hourly prices, the model predicts the price in the following hour and, consequently, whether the price will increase or decrease with respect to the current cryptocurrency price. For this binary classification problem, three performance metrics were used: Accuracy (Acc), Area Under Curve (AUC) and *F*1-score (*F*1).
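A hedged sketch of how these four metrics can be computed with scikit-learn; the derivation of the directional labels (comparing each forecast against the current, previous-hour, actual price) follows the description above, while the function name and interface are illustrative:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             f1_score, mean_squared_error)

# Illustrative evaluation helper. y_true/y_pred are next-hour prices,
# y_current the prices at the current hour (all 1-D arrays).
def evaluate(y_true, y_pred, y_current):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # Direction labels: 1 = "Up", 0 = "Down", relative to current price.
    up_true = (np.asarray(y_true) > np.asarray(y_current)).astype(int)
    up_pred = (np.asarray(y_pred) > np.asarray(y_current)).astype(int)
    return {
        "RMSE": rmse,
        "Acc": accuracy_score(up_true, up_pred),
        "AUC": roc_auc_score(up_true, up_pred),
        "F1": f1_score(up_true, up_pred),
    }
```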

All ensemble models were evaluated using 7 and 11 component learners, which reported the best overall performance; any attempt to further increase the number of component learners resulted in no performance improvement. Moreover, stacking was evaluated using the most widely used state-of-the-art algorithms [42] as meta-learners: Support Vector Regression (SVR) [43], Linear Regression (LR) [44], *k*-Nearest Neighbors (*k*NN) [45] and Decision Tree Regression (DTR) [46]. To ensure a fair and objective comparison, the hyper-parameters of all meta-learners were selected to maximize their experimental performance and are briefly presented in Table 2.


**Table 2.** Parameter specification of state-of-the-art algorithms used as meta-learners.
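A minimal stacking sketch with scikit-learn: the meta-learner is fit on the component learners' predictions (one column per base model). The base predictions here are stand-ins for the CNN-LSTM/CNN-BiLSTM outputs, and the default hyper-parameters below do not reproduce those of Table 2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Illustrative stacking step: base_preds_train has shape
# (n_samples, n_component_learners), i.e. one column per base model.
def fit_meta_learner(base_preds_train, y_train, meta="LR"):
    models = {
        "LR": LinearRegression(),
        "kNN": KNeighborsRegressor(n_neighbors=5),
        "SVR": SVR(),
        "DTR": DecisionTreeRegressor(),
    }
    model = models[meta]
    model.fit(base_preds_train, y_train)
    return model
```

At test time, the fitted meta-learner combines the base models' out-of-sample predictions into the final ensemble forecast.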

Summarizing, we evaluate the performance of the following ensemble models:

- Averaging7 and Averaging11: the averaging ensemble with 7 and 11 component learners, respectively;
- Bagging7 and Bagging11: the bagging ensemble with 7 and 11 component learners, respectively;
- Stacking(SVR), Stacking(LR), Stacking(*k*NN) and Stacking(DTR), each with 7 and 11 component learners: the stacking ensemble using SVR, LR, *k*NN and DTR as meta-learner, respectively.


Tables 3 and 4 summarize the performance of all ensemble models using CNN-LSTM as base learner for *m* = 4 and *m* = 9, respectively. Stacking with LR as meta-learner exhibited the best regression performance, reporting the lowest RMSE score for all cryptocurrencies. Notice that Stacking(LR)7 and Stacking(LR)11 reported identical performance, which implies that increasing the number of component learners from 7 to 11 did not affect the regression performance of this ensemble algorithm. In contrast, Stacking(LR)7 exhibited better classification performance than Stacking(LR)11, reporting higher accuracy, AUC and *F*1-score. Additionally, the stacking ensemble reported the worst performance among all ensemble models when DTR and SVR were used as meta-learners, performing even worse than the single CNN-LSTM model, while its best classification performance was obtained with *k*NN as meta-learner in almost all cases.

The averaging and bagging ensembles reported slightly better regression performance than the single CNN-LSTM model. Moreover, both ensembles presented the best classification performance, considerably outperforming all other forecasting models on all datasets, with the bagging ensemble reporting the highest accuracy, AUC and *F*1-score in most cases and thus slightly outperforming the averaging ensemble. Finally, it is worth noticing that neither ensemble improved its performance when the number of component learners increased for *m* = 4, while for *m* = 9 a slight improvement was noticed.
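The two non-stacking strategies can be sketched in a few lines; the interface below (component learners as arrays of predictions, bagging as bootstrap resampling of the training data) is an illustrative assumption consistent with the standard definitions of these ensembles:

```python
import numpy as np

# Averaging: the ensemble forecast is the mean of the component
# learners' forecasts. predictions has shape (n_learners, n_samples).
def average_ensemble(predictions):
    return np.mean(predictions, axis=0)

# Bagging: each component learner is trained on a bootstrap resample
# (sampling with replacement) of the original training set.
def bootstrap_sample(X, y, rng):
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]
```

After training each component on its own bootstrap sample, the bagging forecast is again obtained by averaging the components' predictions.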


**Table 3.** Performance of ensemble models using convolutional neural network and long short-term memory (CNN-LSTM) as base learner for *m* = 4.


**Table 4.** Performance of ensemble models using CNN-LSTM as base learner for *m* = 9.

Tables 5 and 6 present the performance of all ensemble models utilizing CNN-BiLSTM as base learner for *m* = 4 and *m* = 9, respectively. Firstly, it is worth noticing that the stacking model with LR as meta-learner exhibited the best regression performance for all cryptocurrencies. Stacking(LR)7 and Stacking(LR)11 presented almost identical RMSE scores. Moreover, Stacking(LR)7 presented slightly higher accuracy, AUC and *F*1-score than Stacking(LR)11 for the ETH and XRP datasets, while for the BTC dataset Stacking(LR)11 reported slightly better classification performance. This implies that increasing the number of component learners from 7 to 11 did not considerably affect the regression or classification performance of the stacking ensemble. The stacking ensemble reported the worst (highest) RMSE scores with DTR, SVR and *k*NN as meta-learners; in these cases it exhibited the worst performance among all ensemble models, worse even than that of the single CNN-BiLSTM model. However, the stacking ensemble reported the highest classification performance with *k*NN as meta-learner. Additionally, it presented slightly better classification performance with DTR or SVR than with LR as meta-learner for the ETH and XRP datasets, while for the BTC dataset it performed better with LR as meta-learner.

**Table 5.** Performance of ensemble models using CNN and bi-directional LSTM (CNN-BiLSTM) as base learner for *m* = 4.



**Table 6.** Performance of ensemble models using CNN-BiLSTM as base learner for *m* = 9.

Regarding the other two ensemble strategies, averaging and bagging exhibited slightly better regression performance than the single CNN-BiLSTM model. Nevertheless, both reported the highest accuracy, AUC and *F*1-score, which implies that they presented the best classification performance among all models, with bagging exhibiting slightly better classification performance. Furthermore, it is worth mentioning that both ensembles slightly improved their performance in terms of RMSE score and accuracy when the number of component learners increased from 7 to 11.

In the following, we provide deeper insight into the classification performance of the forecasting models by presenting the confusion matrices of Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 for *m* = 4, which exhibited the best overall performance. The confusion matrix provides a compact overview of the classification performance of each model, presenting complete information about mislabeled classes. Notice that each row of a confusion matrix represents the instances of an actual class, while each column represents the instances of a predicted class. Additionally, the stacking ensembles utilizing DTR and SVR as meta-learners were excluded from the rest of our experimental analysis, since they presented the worst regression and classification performance for all cryptocurrencies.
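Matrices of this form can be reproduced from the price forecasts with scikit-learn; the derivation of the directional labels ("Down" = 0, "Up" = 1, relative to the current price) is an assumption consistent with the text, and the function name is illustrative:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Rows of the returned matrix are actual classes, columns are predicted
# classes (Down first, then Up), matching the convention stated above.
def direction_confusion(y_true, y_pred, y_current):
    up_true = (np.asarray(y_true) > np.asarray(y_current)).astype(int)
    up_pred = (np.asarray(y_pred) > np.asarray(y_current)).astype(int)
    return confusion_matrix(up_true, up_pred, labels=[0, 1])
```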

Tables 7–9 present the confusion matrices of the best identified ensemble models using CNN-LSTM as base learner for the BTC, ETH and XRP datasets, respectively. The confusion matrices for BTC and ETH reveal that Stacking(LR)7 is biased, since most instances were misclassified as "Down", meaning that this model was unable to identify possible hidden patterns despite exhibiting the best regression performance. In contrast, Bagging11 exhibited a balanced distribution between "Down" and "Up" predictions, demonstrating its superiority over the other forecasting models, followed by Averaging11 and Stacking(*k*NN)11. Regarding the XRP dataset, Bagging11 and Stacking(*k*NN)11 presented the highest prediction accuracy and the best trade-off between the true positive and true negative rates, meaning that these models may have identified some hidden patterns.

**Table 7.** Confusion matrices of Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 using CNN-LSTM as base model for the Bitcoin (BTC) dataset.


**Table 8.** Confusion matrices of Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 using CNN-LSTM as base model for the Ethereum (ETH) dataset.


**Table 9.** Confusion matrices of Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 using CNN-LSTM as base model for the Ripple (XRP) dataset.


Tables 10–12 present the confusion matrices of Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 using CNN-BiLSTM as base learner for the BTC, ETH and XRP datasets, respectively. The confusion matrices for the BTC dataset demonstrate that both Averaging11 and Bagging11 presented the best performance, while Stacking(LR)7 was biased, since most instances were misclassified as "Down". Regarding the ETH dataset, both Averaging11 and Bagging11 were biased, since most "Up" instances were misclassified as "Down"; in contrast, both stacking ensembles presented the best performance, with Stacking(*k*NN)11 reporting a slightly better trade-off between sensitivity and specificity. Regarding the XRP dataset, Bagging11 presented the highest prediction accuracy and the best trade-off between the true positive and true negative rates, closely followed by Stacking(*k*NN)11.

**Table 10.** Confusion matrices of Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 using CNN-BiLSTM as base model for the BTC dataset.


**Averaging11**

| Actual \ Predicted | Down | Up |
|---|---|---|
| **Down** | 1913 | 337 |
| **Up** | 1721 | 442 |

**Bagging11**

| Actual \ Predicted | Down | Up |
|---|---|---|
| **Down** | 1837 | 413 |
| **Up** | 1631 | 532 |

**Stacking(LR)7**

| Actual \ Predicted | Down | Up |
|---|---|---|
| **Down** | 1322 | 928 |
| **Up** | 1199 | 964 |

**Stacking(*k*NN)11**

| Actual \ Predicted | Down | Up |
|---|---|---|
| **Down** | 1277 | 973 |
| **Up** | 1083 | 1080 |

**Table 11.** Confusion matrices of Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 using CNN-BiLSTM as base model for the ETH dataset.

**Table 12.** Confusion matrices of Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 using CNN-BiLSTM as base model for the XRP dataset.


In the rest of this section, we evaluate the reliability of the best reported ensemble models by examining whether they have properly fitted the time series. In other words, we examine whether the models' residuals, defined by

$$
\hat{\varepsilon}_t = y_t - \hat{y}_t
$$

are identically distributed and asymptotically independent, where $y_t$ denotes the actual price and $\hat{y}_t$ the predicted price at time $t$.

For this purpose, we utilize the AutoCorrelation Function (ACF) plot [47], which is obtained from the linear correlation of each residual $\hat{\varepsilon}_t$ with its lagged values $\hat{\varepsilon}_{t-1}, \hat{\varepsilon}_{t-2}, \ldots$ and illustrates the intensity of the temporal autocorrelation. Notice that if a forecasting model violates the assumption of no autocorrelation in the errors, its predictions may be inefficient, since some additional information is left over that should be accounted for by the model.
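A minimal, dependency-free sketch of this diagnostic: compute the sample ACF of the residuals and flag lags whose autocorrelation falls outside the approximate 95% Gaussian band $\pm 1.96/\sqrt{n}$, the same band drawn as dashed lines in the figures. The function names are illustrative.

```python
import numpy as np

# Sample autocorrelation of a series x at lags 0..nlags.
def sample_acf(x, nlags=20):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:n - k], x[k:]) / denom
                     for k in range(nlags + 1)])

# Lags whose autocorrelation exceeds the approximate 95% Gaussian
# confidence band +/- 1.96/sqrt(n); an empty list is consistent with
# "no autocorrelation" in the residuals.
def significant_lags(residuals, nlags=20):
    rho = sample_acf(residuals, nlags)
    bound = 1.96 / np.sqrt(len(residuals))
    return [k for k in range(1, nlags + 1) if abs(rho[k]) > bound]
```

For example, the residuals of a poorly fitted model behave like an autocorrelated series and produce significant spikes at the first lags, whereas white-noise residuals should rarely exceed the band.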

Figures 4–6 present the ACF plots for the BTC, ETH and XRP datasets, respectively. Notice that the confidence limits (blue dashed lines) are constructed assuming that the residuals follow a Gaussian probability distribution. It is worth noticing that the Averaging11 and Bagging11 ensemble models violate the assumption of no autocorrelation in the errors for the BTC and ETH datasets, which suggests that their forecasts may be inefficient. More specifically, the significant spikes at lags 1 and 2 imply that some additional information is left over which should be accounted for by the models. Regarding the XRP dataset, the ACF plot of Averaging11 shows that the residuals have no autocorrelation, while the ACF plot of Bagging11 shows a spike at lag 1, which violates the assumption of no autocorrelation in the residuals. Both ACF plots of the stacking ensembles are within the 95% confidence interval for all lags for the BTC and XRP datasets, which verifies that the residuals have no autocorrelation. Regarding the ETH dataset, the ACF plot of Stacking(LR)7 reports a small spike at lag 1, revealing some, but not particularly large, autocorrelation of the residuals, while the ACF plot of Stacking(*k*NN)11 reveals small spikes at lags 1 and 2, implying some autocorrelation.

**Figure 4.** Autocorrelation of residuals for BTC dataset of ensemble models using CNN-LSTM as base learner.

**Figure 5.** Autocorrelation of residuals for ETH dataset of ensemble models using CNN-LSTM as base learner.

**Figure 6.** Auto-correlation of residuals for XRP dataset of ensemble models using CNN-LSTM as base learner.

Figures 7–9 present the ACF plots of the Averaging11, Bagging11, Stacking(LR)7 and Stacking(*k*NN)11 ensembles utilizing CNN-BiLSTM as base learner for the BTC, ETH and XRP datasets, respectively. Both Averaging11 and Bagging11 violate the assumption of no autocorrelation in the errors for all cryptocurrencies, implying that these models have not properly fitted the time series. In more detail, the significant spikes at lags 1 and 2 suggest that the residuals are not identically distributed and asymptotically independent for any dataset. The ACF plot of the Stacking(LR)7 ensemble for the BTC dataset verifies that the residuals have no autocorrelation, since it lies within the 95% confidence interval for all lags. In contrast, for the ETH and XRP datasets the spikes at lags 1 and 2 illustrate some autocorrelation of the residuals. The ACF plots of Stacking(*k*NN)11 show some, but not particularly large, autocorrelation in the residuals for the BTC and XRP datasets, while for the ETH dataset the significant spikes at lags 1 and 2 suggest that the model's predictions may be inefficient.

**Figure 7.** Autocorrelation of residuals for BTC dataset of ensemble models using CNN-BiLSTM as base learner.

**Figure 8.** Autocorrelation of residuals for ETH dataset of ensemble models using CNN-BiLSTM as base learner.

**Figure 9.** Autocorrelation of residuals for XRP dataset of ensemble models using CNN-BiLSTM as base learner.
