### **1. Introduction**

The global financial crisis of 2007–2009 was the most severe crisis of the last few decades, with a peak-to-trough contraction of 18 months according to the National Bureau of Economic Research. The consequences were severe in most aspects of life, both economic (investment, productivity, jobs, and real income) and social (inequality, poverty, and social tensions), leading in the long run to political instability and the need for further economic reforms. In an attempt to "think outside the box" and bypass the manipulation and control of governments and financial institutions, Satoshi Nakamoto [1] proposed Bitcoin, an electronic cash system allowing online payments, in which the double-spending problem was elegantly solved using a novel, purely peer-to-peer decentralized blockchain along with a cryptographic hash function as a proof-of-work.

Nowadays, there are over 5000 cryptocurrencies available; however, when it comes to scientific research there are several issues to deal with. The large majority of these are relatively new, so there is an insufficient amount of data for quantitative modeling or price forecasting. Likewise, most are not ranked highly enough in market capitalization to be considered market drivers. A third aspect, which has not attracted attention in the literature, is the separation of cryptocurrencies into mineable and non-mineable. Mineable cryptocurrencies have several advantages: the performance of different mineable coins can be monitored within the same blockchain, which cannot easily be said for non-mineable coins, and they are community-driven and open source, so that different developers can contribute and a consensus has to be reached before any major update is made, in order to avoid splitting. Finally, among the top cryptocurrencies, it appears that mineable cryptocurrencies such as Bitcoin (BTC) and Ethereum (ETH) recovered from the 2018 crash better than Ripple (XRP), which is the highest-ranked pre-mined coin. In addition, transactions in non-mineable coins are powered by a centralized blockchain, which exposes them to price manipulation through insider trading, since the creators keep a given percentage for themselves, or through pump-and-dump market mechanisms. Looking at Coinmarketcap, the number one cryptocurrency exchange in the world, in January 2020, at the time of writing, only 31 of the top 100 cryptocurrencies ranked by market capitalization are mineable.

The classical investing strategy in the cryptocurrency market is the "buy, hold and sell" strategy, in which cryptocurrencies are bought with real money and held until they reach a price high enough for the investor to sell at a profit. Obviously, even a fractional change in the price of a cryptocurrency may create opportunities for huge gains or cause significant investment losses. Thus, accurate prediction of cryptocurrency prices can potentially assist financial investors in shaping their investment policies and decreasing their risks. However, accurate prediction of cryptocurrency prices is generally considered a significantly complex and challenging task, mainly due to their chaotic nature. This problem is traditionally addressed through the investor's personal experience and consistent monitoring of exchange prices. Recently, intelligent decision systems based on complicated mathematical formulas and methods have been adopted to assist investors and support portfolio optimization.

Let $y_1, y_2, \dots, y_n$ be the observations of a time series. Generally, a nonlinear regression model of order $m$ is defined by

$$y_t = f(y_{t-1}, y_{t-2}, \dots, y_{t-m}, \theta), \tag{1}$$

where $y_{t-1}, \dots, y_{t-m}$ are the $m$ previous values of the series and $\theta$ is the parameter vector. After the model structure has been defined, the function $f(\cdot)$ can be determined by traditional time-series methods such as ARIMA (Auto-Regressive Integrated Moving Average) and GARCH-type models and their variations [2–4], or by machine learning methods such as Artificial Neural Networks (ANNs) [5,6]. However, both approaches fail to capture the stochastic and chaotic nature of cryptocurrency time-series and to deliver accurate forecasts [7]. To this end, more sophisticated algorithmic approaches have to be applied, such as deep learning and ensemble learning methods. From the perspective of developing strong forecasting models, deep learning and ensemble learning constitute two fundamental learning strategies. The former is based on neural network architectures and is able to achieve state-of-the-art accuracy by creating and exploiting new, more valuable features while filtering out the noise of the input data; the latter attempts to generate strong prediction models by exploiting multiple learners in order to reduce the bias or variance of the error.
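To make the order-$m$ formulation concrete, the following minimal sketch shows one way the pairs (lagged inputs, target) could be assembled from a series before any model $f$ is fitted. The function name `make_lagged_dataset` and the toy values are illustrative assumptions, not part of the original study.

```python
import numpy as np

def make_lagged_dataset(series, m):
    """Return (X, y) with rows X[k] = [y_{t-1}, ..., y_{t-m}] and targets y[k] = y_t."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    if m < 1 or m >= n:
        raise ValueError("m must satisfy 1 <= m < len(series)")
    # Targets are y_t for t = m, ..., n-1 (0-based indexing).
    y = series[m:]
    # Column j holds the lag-(j+1) value, i.e. y_{t-(j+1)}.
    X = np.column_stack([series[m - j - 1 : n - j - 1] for j in range(m)])
    return X, y

prices = [10.0, 11.0, 12.0, 13.0, 14.0]  # toy values, not real market data
X, y = make_lagged_dataset(prices, m=2)
```

Each row of `X` can then be passed to any regressor playing the role of $f(\cdot)$, with the fitted parameters corresponding to $\theta$.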

During the last few years, researchers have paid special attention to the development of time-series forecasting models which exploit the advantages of deep learning techniques such as convolutional and long short-term memory (LSTM) layers. More specifically, Wen and Yuan [8] and Liu et al. [9] proposed Convolutional Neural Network (CNN) and LSTM prediction models for stock market forecasting. Along this line, Livieris et al. [10] and Pintelas et al. [11] proposed CNN-LSTM models with various architectures for forecasting gold and cryptocurrency time-series prices and movements, reporting some interesting results. Nevertheless, although deep learning models are tailored to cope with temporal correlations and to extract more valuable information from the training set, they have failed to generate reliable forecasting models [7,11]. In contrast, ensemble learning models are an elegant solution for developing stable models and addressing the high variance of individual forecasting models, but their performance depends heavily on the diversity and accuracy of the component learners. Therefore, a time-series prediction model which exploits the benefits of both methodologies may significantly improve prediction performance.

The main contribution of this research is the combination of ensemble learning strategies with advanced deep learning models for forecasting cryptocurrency hourly prices and movement. The proposed ensemble models utilize state-of-the-art deep learning models as component learners, which are based on combinations of Long Short-Term Memory (LSTM), Bi-directional LSTM (BiLSTM) and convolutional layers. An extensive experimental analysis is performed, considering both classification and regression problems, to evaluate the performance of averaging, bagging and stacking ensemble strategies. More analytically, all ensemble models are evaluated on predicting the cryptocurrency price in the next hour (regression) and on predicting whether the price in the following hour will increase or decrease with respect to the current price (classification). It is worth mentioning that, for investors and financial institutions, predicting the movement of a cryptocurrency is probably more significant than predicting its price. Additionally, the efficiency of the predictions of each forecasting model is evaluated by examining the autocorrelation of the errors, which constitutes a significant reliability test of each model.
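The two prediction tasks and the error-autocorrelation check described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, an unchanged price is assigned to class 0 by assumption, and the sample autocorrelation shown is the usual textbook estimator rather than the specific test used in the experiments.

```python
import numpy as np

def next_step_targets(prices):
    """Build both targets for each hour t that has a successor t+1."""
    prices = np.asarray(prices, dtype=float)
    next_price = prices[1:]  # regression target: price in the next hour
    # Classification target: 1 if the next price is higher than the current one.
    # Assumption: an unchanged price is labelled 0 together with decreases.
    movement = (prices[1:] > prices[:-1]).astype(int)
    return next_price, movement

def error_autocorrelation(errors, lag):
    """Sample autocorrelation of the forecast errors at the given lag."""
    e = np.asarray(errors, dtype=float)
    if lag < 1 or lag >= len(e):
        raise ValueError("lag must satisfy 1 <= lag < len(errors)")
    e = e - e.mean()
    denom = float(np.dot(e, e))
    if denom == 0.0:
        raise ValueError("errors are constant; autocorrelation is undefined")
    return float(np.dot(e[lag:], e[:-lag]) / denom)

next_price, movement = next_step_targets([100.0, 102.0, 101.0, 101.0, 105.0])
r1 = error_autocorrelation([1.0, -1.0, 1.0, -1.0], lag=1)
```

Errors whose autocorrelations are close to zero at all lags indicate that the model has left little exploitable structure in its residuals.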

The remainder of the paper is organized as follows: Section 2 presents a brief review of state-of-the-art deep learning methodologies for cryptocurrency forecasting. Section 3 presents the advanced deep learning models, while Section 4 presents the ensemble strategies utilized in our research. Section 5 presents our experimental methodology, including data preparation and preprocessing, as well as the detailed experimental analysis evaluating ensembles of deep learning models. Section 6 discusses the obtained results and summarizes our findings. Finally, Section 7 presents our conclusions and some future directions.

### **2. Deep Learning in Cryptocurrency Forecasting: State-of-the-Art**

Yiying and Yeze [12] focused on the non-stationary price dynamics of three cryptocurrencies: Bitcoin, Ethereum, and Ripple. Their approach aimed at identifying and understanding the factors which influence the value formation of these digital currencies. Their collected data covered 1030 trading days of opening, high, low, and closing prices. They conducted an experimental analysis which revealed the efficiency of LSTM models over classical ANNs, indicating that LSTM models are more capable of exploiting information hidden in historical data. Additionally, the authors stated that the probable reason for the efficiency of LSTM networks is that they tend to depend more on short-term dynamics, while ANNs tend to depend more on long-term history. Nevertheless, if enough historical information is given, ANNs can achieve accuracy similar to that of LSTM networks.

Nakano et al. [13] examined the performance of ANNs for the prediction of Bitcoin intraday technical trading. The authors focused on identifying the key factors which affect prediction performance when extracting useful trading signals for Bitcoin from its technical indicators. For this purpose, they conducted a series of experiments utilizing various ANN models with shallow and deep architectures and different dataset structures. The data utilized in their research were Bitcoin time-series return data at 15-min intervals. Their experiments illustrated that the utilization of multiple technical indicators could possibly prevent the prediction model from overfitting non-stationary financial data, which enhances trading performance. Moreover, they stated that their proposed methodology attained considerably better performance than the primitive technical trading and buy-and-hold strategies, under realistic assumptions of execution costs.

McNally et al. [14] utilized two deep learning models, namely a Bayesian-optimised Recurrent Neural Network and an LSTM network, for Bitcoin price prediction. The utilized data ranged from August 2013 until July 2016 and comprised the open, high, low and close Bitcoin prices as well as the block difficulty and hash rate. Their performance evaluation showed that the LSTM network demonstrated the best prediction accuracy, outperforming the other recurrent model as well as the classical statistical method ARIMA.

Shintate and Pichl [15] proposed a new trend prediction classification framework based on deep learning techniques. Their proposed framework utilized a metric learning-based method, called the Random Sampling method, which measures the similarity between the training samples and the input patterns. They used high-frequency (1-min) data ranging from June 2013 to March 2017, containing historical data from the OkCoin Bitcoin market (Chinese Yuan Renminbi and US Dollars). The authors concluded that the profit rates based on the utilized sampling method considerably outperformed those based on LSTM networks, confirming the superiority of the proposed framework. However, these profit rates were lower than those obtained by the classical buy-and-hold strategy; thus they stated that it does not provide a basis for trading.

Miura et al. [16] attempted to analyze the high-frequency (1-min) Bitcoin time series utilizing machine learning and statistical forecasting models. Due to the large size of the data, they decided to aggregate the realized volatility values over 3-h intervals. Additionally, they pointed out that these values presented a weak correlation based on high-low price extent with the relative values of the 3-h interval. In their experimental analysis, they focused on evaluating various ANN-type models, SVMs, Ridge Regression and the Heterogeneous Auto-Regressive Realized Volatility model. Their results demonstrated that Ridge Regression performed considerably better than the other models, while SVM exhibited poor performance.

Ji et al. [17] evaluated the prediction performance on Bitcoin price of various deep learning models such as LSTM networks, convolutional neural networks, deep neural networks, deep residual networks and their combinations. The data used in their research contained 29 features of the Bitcoin blockchain over 2590 days (from 29 November 2011 to 31 December 2018). They conducted a detailed experimental procedure considering both classification and regression problems, where the former predicts whether the next day's price will increase or decrease and the latter predicts the next day's Bitcoin price. The numerical experiments illustrated that the deep neural network (DNN)-based models performed best for predicting price ups and downs, while the LSTM models slightly outperformed the rest of the models for forecasting Bitcoin's price.

Kumar and Rath [18] focused on forecasting the trends of Ethereum prices utilizing machine learning and deep learning methodologies. They conducted an experimental analysis and compared the prediction ability of LSTM neural networks and the Multi-Layer Perceptron (MLP). They utilized daily, hourly, and minute-based data collected from the CoinMarket and CoinDesk repositories. Their evaluation results illustrated that LSTM only marginally outperformed MLP, although its training time was significantly higher.

Pintelas et al. [7,11] conducted detailed research evaluating advanced deep learning models for predicting major cryptocurrency prices and movements. Additionally, they conducted a detailed discussion regarding the fundamental research questions: Can deep learning algorithms efficiently predict cryptocurrency prices? Are cryptocurrency prices a random walk process? Which is a proper validation method for cryptocurrency price prediction models? Their comprehensive experimental results revealed that even the LSTM-based and CNN-based models, which are generally preferable for time-series forecasting [8–10], were unable to generate efficient and reliable forecasting models. Moreover, the authors stated that cryptocurrency prices probably follow an almost random walk process, although a few hidden patterns may exist. Therefore, new sophisticated algorithmic approaches should be considered and explored for the development of a prediction model that makes accurate and reliable forecasts.

In this work, we advocate combining the advantages of ensemble learning and deep learning for forecasting cryptocurrency prices and movement. Our research contribution aims at exploiting the ability of deep learning models to learn the internal representation of the cryptocurrency data and the effectiveness of ensemble learning in generating powerful forecasting models by exploiting multiple learners to reduce the bias or variance of the error. Furthermore, similar to our previous research [7,11], we provide a detailed performance evaluation for both regression and classification problems. To the best of our knowledge, this is the first research devoted to the adoption and combination of ensemble learning and deep learning for forecasting cryptocurrency prices and movement.

### **3. Advanced Deep Learning Techniques**

### *3.1. Long Short-Term Memory Neural Networks*

Long Short-Term Memory (LSTM) networks [19] constitute a special case of recurrent neural networks, originally proposed to model both short-term and long-term dependencies [20–22]. The major novelty in an LSTM network is the memory block in the recurrent hidden layer, which contains memory cells with self-connections memorizing the temporal state and adaptive gate units for controlling the information flow in the block. By treating the hidden layer as a memory unit, LSTM can cope with correlations within time-series in both the short and the long term [23].

More analytically, the structure of the memory cell $c_t$ is composed of three gates: the input gate, the forget gate and the output gate. At every time step $t$, the input gate $i_t$ determines which information is added to the cell state $S_t$ (memory), the forget gate $f_t$ determines which information is discarded from the cell state through a transformation function in the forget gate layer, and the output gate $o_t$ determines which information from the cell state will be used as output.

With the utilization of gates in each cell, data can be filtered, discarded or added. In this way, LSTM networks are capable of identifying both short- and long-term correlation features within time series. Additionally, it is worth mentioning that a significant advantage of memory cells and adaptive gates which control the information flow is that the vanishing gradient problem is considerably mitigated, which is crucial for the generalization performance of the network [20].

The simplest way to increase the depth and the capacity of LSTM networks is to stack LSTM layers together, so that the output of the $(L-1)$th LSTM layer at time $t$ is treated as the input of the $L$th layer. Notice that these input-output connections are the only connections between the LSTM layers of the network. Based on the above formulation, the structure of the stacked LSTM can be described as follows: Let $h_t^L$ and $h_t^{L-1}$ denote the outputs of the $L$th and $(L-1)$th layer, respectively. Each layer $L$ produces a hidden state $h_t^L$ based on the current output of the previous layer $h_t^{L-1}$ and its own hidden state at the previous time step $h_{t-1}^L$. More specifically, the forget gate $f_t^L$ of the $L$th layer, which scales the previous cell state $c_{t-1}^L$, is calculated by

$$f_t^L = \sigma\left(W_f^L [h_{t-1}^L, h_t^{L-1}] + b_f^L\right),$$

where $\sigma(\cdot)$ is a sigmoid function while $W_f^L$ and $b_f^L$ are the weights matrix and bias vector of layer $L$ regarding the forget gate, respectively. Subsequently, the input gate $i_t^L$ of the $L$th layer computes the values to be added to the memory cell $c_t^L$ by

$$i_t^L = \sigma\left(W_i^L [h_{t-1}^L, h_t^{L-1}] + b_i^L\right),$$

where $W_i^L$ and $b_i^L$ are the weights matrix and bias vector of layer $L$ regarding the input gate, respectively. Then, the output gate $o_t^L$ of the $L$th layer filters the information and calculates the output value by

$$o_t^L = \sigma\left(W_o^L [h_{t-1}^L, h_t^{L-1}] + b_o^L\right),$$

where $W_o^L$ and $b_o^L$ are the weights matrix and bias vector of the output gate in the $L$th layer, respectively. Finally, the output of the memory cell is computed by

$$h_t^L = o_t^L \cdot \tanh\left(c_t^L\right),$$

where $\cdot$ denotes the pointwise vector multiplication, $\tanh$ the hyperbolic tangent function and

$$\begin{array}{rcl} c_t^L &=& f_t^L \cdot c_{t-1}^L + i_t^L \cdot \tilde{c}_t^L, \\ \tilde{c}_t^L &=& \tanh\left(W_{\tilde{c}}^L [h_{t-1}^L, h_t^{L-1}] + b_{\tilde{c}}^L\right). \end{array}$$
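The gate computations of a single layer can be traced step by step in a few lines of NumPy. The sketch below implements the standard LSTM update, with the candidate state computed at the current time step; the dictionary layout of `params`, the initialisation scale and the toy dimensions are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t, params):
    """One time step of layer L.

    h_prev: hidden state of this layer at t-1; c_prev: cell state at t-1;
    x_t: output of the layer below at time t.
    """
    z = np.concatenate([h_prev, x_t])                     # concatenated gate input
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate cell state
    c = f * c_prev + i * c_tilde                          # new cell state
    h = o * np.tanh(c)                                    # new hidden state
    return h, c

def init_params(hidden, inputs, rng):
    """Hypothetical initialisation: small random weights, zero biases."""
    params = {}
    for gate in ("f", "i", "o", "c"):
        params["W_" + gate] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
        params["b_" + gate] = np.zeros(hidden)
    return params

rng = np.random.default_rng(0)
params = init_params(hidden=4, inputs=3, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.normal(size=(5, 3)):  # five time steps of a toy 3-feature input
    h, c = lstm_step(h, c, x_t, params)
```

Stacking amounts to feeding the sequence of `h` values produced by one layer as the `x_t` inputs of the next.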

### *3.2. Bi-Directional Recurrent Neural Networks*

Like LSTM networks, Bi-directional Recurrent Neural Networks (BRNNs) [24] are among the most efficient and widely utilized RNN architectures. In contrast with the LSTM, these networks are composed of two hidden layers, both connected to the input and output. The principal idea of BRNNs is that each training sequence is presented forwards and backwards to two separate recurrent networks [20]. More specifically, the first hidden layer possesses recurrent connections from the past time steps, while in the second one the recurrent connections are reversed, transferring activation backwards along the sequence. Given the input and target sequences, the BRNN can be unfolded across time in order to be efficiently trained utilizing a classical backpropagation algorithm.

In fact, BRNN and LSTM are based on compatible techniques in which the former proposes the wiring of two hidden layers, which compose the network, while the latter proposes a new fundamental unit for composing the hidden layer.

Along this line, Bi-directional LSTM (BiLSTM) networks [25] were proposed in the literature, which incorporate two LSTM networks in the BRNN framework. More specifically, BiLSTM incorporates a forward LSTM layer and a backward LSTM layer in order to learn information from preceding and following tokens. In this way, both past and future contexts for a given time *t* are accessed, hence better prediction can be achieved by taking advantage of more sentence-level information.

In a bi-directional stacked LSTM network, the output of each forward LSTM layer is computed as in the classical stacked LSTM layer, and these layers are iterated from $t = 1$ to $T$. In contrast, each backward LSTM layer is iterated in reverse, i.e., from $t = T$ to $1$, so that its recurrence at time $t$ draws on the backward state at time $t+1$. Hence, at time $t$, the output $\overleftarrow{h}_t^L$ of the backward LSTM layer $L$ can be calculated as follows

$$\begin{array}{rcl} \overleftarrow{f}_t^L &=& \sigma\left(W_{\overleftarrow{f}}^L [\overleftarrow{h}_{t+1}^L, h_t^{L-1}] + b_{\overleftarrow{f}}^L\right), \\ \overleftarrow{i}_t^L &=& \sigma\left(W_{\overleftarrow{i}}^L [\overleftarrow{h}_{t+1}^L, h_t^{L-1}] + b_{\overleftarrow{i}}^L\right), \\ \overleftarrow{o}_t^L &=& \sigma\left(W_{\overleftarrow{o}}^L [\overleftarrow{h}_{t+1}^L, h_t^{L-1}] + b_{\overleftarrow{o}}^L\right), \\ \overleftarrow{c}_t^L &=& \overleftarrow{f}_t^L \cdot \overleftarrow{c}_{t+1}^L + \overleftarrow{i}_t^L \cdot \overleftarrow{\tilde{c}}_t^L, \\ \overleftarrow{\tilde{c}}_t^L &=& \tanh\left(W_{\overleftarrow{\tilde{c}}}^L [\overleftarrow{h}_{t+1}^L, h_t^{L-1}] + b_{\overleftarrow{\tilde{c}}}^L\right), \\ \overleftarrow{h}_t^L &=& \overleftarrow{o}_t^L \cdot \tanh\left(\overleftarrow{c}_t^L\right). \end{array}$$

Finally, the output of this BiLSTM architecture is given by

$$y_t = W\left[\overrightarrow{h}_t, \overleftarrow{h}_t\right] + b.$$
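As a rough illustration of how the forward and backward passes are combined, the sketch below runs one LSTM over the sequence in each direction and applies a linear output layer to the concatenated hidden states. The stacked weight layout, the helper names and the toy sizes are assumptions for illustration and do not reproduce the architectures evaluated in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """Single LSTM step; W stacks forget, input, output and candidate weights row-wise."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o = sigmoid(a[:H]), sigmoid(a[H:2 * H]), sigmoid(a[2 * H:3 * H])
    c = f * c_prev + i * np.tanh(a[3 * H:])
    return o * np.tanh(c), c

def run_direction(xs, W, b, H, reverse=False):
    """Scan the sequence forwards or backwards; hs[t] is stored at its own time index."""
    T = xs.shape[0]
    hs = np.zeros((T, H))
    h, c = np.zeros(H), np.zeros(H)
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h, c = lstm_step(h, c, xs[t], W, b)
        hs[t] = h
    return hs

def bilstm_outputs(xs, fwd, bwd, W_out, b_out, H):
    """Linear output layer applied to the concatenated forward and backward states."""
    h_fwd = run_direction(xs, fwd[0], fwd[1], H)
    h_bwd = run_direction(xs, bwd[0], bwd[1], H, reverse=True)
    return np.concatenate([h_fwd, h_bwd], axis=1) @ W_out.T + b_out

rng = np.random.default_rng(0)
T, D, H = 6, 2, 3  # toy sequence length, input size and hidden size
xs = rng.normal(size=(T, D))
fwd = (rng.normal(scale=0.5, size=(4 * H, H + D)), np.zeros(4 * H))
bwd = (rng.normal(scale=0.5, size=(4 * H, H + D)), np.zeros(4 * H))
W_out, b_out = rng.normal(size=(1, 2 * H)), np.zeros(1)
y = bilstm_outputs(xs, fwd, bwd, W_out, b_out, H)
```

Because the backward scan starts at the end of the sequence, its state at the last time index depends only on the last input, whereas its state at the first index has seen the entire sequence.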

### *3.3. Convolutional Neural Networks*

Convolutional Neural Network (CNN) models [26,27] were originally proposed for image recognition problems, achieving human-level performance in many cases. CNNs have great potential to identify the complex patterns hidden in time series data. The advantage of the utilization of CNNs for time series is that they can efficiently extract knowledge and learn an internal representation from the raw time series data directly and they do not require special knowledge from the application domain to filter input features [10].

A typical CNN consists of two main components: In the first component, mathematical operations, called convolution and pooling, are utilized to develop features of the input data while in the second component, the generated features are used as input to a usually fully-connected neural network.

The convolutional layer constitutes the core of a CNN; it systematically applies trained filters to the input data to generate feature maps. Convolution can be considered as applying and sliding a one-dimensional (time) filter over the time series [28]. Moreover, since the output of a convolution is a new filtered time series, the application of several convolutions generates a multivariate time series whose dimension equals the number of filters utilized in the layer. The rationale behind this strategy is that the application of several convolutions leads to the generation of multiple discriminative features, which usually improve the model's performance. In practice, this kind of layer has proven to be very efficient, and stacking different convolutional layers allows deeper layers to learn high-order or more abstract features and layers close to the input to learn low-level features.

Pooling layers were proposed to address the limitation that the feature maps generated by the convolutional layers record the precise position of features in the input. These layers aggregate over a sliding window on the feature maps, reducing their length, in order to attain some translation invariance of the trained features. More analytically, the feature maps obtained from the previous convolutional layer are pooled separately over a temporal neighborhood by a sum pooling function or a max pooling function in order to develop a new set of pooled feature maps. Notice that the output pooled feature maps constitute a filtered version of the feature maps which are imported as inputs to the pooling layer [28]. This implies that the pooled features become approximately invariant to small translations of the inputs of the CNN, which are usually detected by the convolutional layers.
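The following sketch illustrates the two operations on a univariate series: each filter slides over the series to produce one feature map, and max pooling then shortens every map. The filters are hand-picked for readability rather than trained, the pooling windows are non-overlapping, and the function names are hypothetical; all of these are simplifying assumptions.

```python
import numpy as np

def conv1d(series, filters):
    """Valid 1-D convolution (cross-correlation form, as used in deep learning libraries).

    series: shape (T,); filters: shape (K, width).
    Returns feature maps of shape (T - width + 1, K), one column per filter.
    """
    series = np.asarray(series, dtype=float)
    filters = np.asarray(filters, dtype=float)
    width = filters.shape[1]
    windows = np.stack([series[t:t + width] for t in range(len(series) - width + 1)])
    return windows @ filters.T

def max_pool1d(feature_maps, size):
    """Non-overlapping max pooling along time; an incomplete trailing window is dropped."""
    T, K = feature_maps.shape
    n = T // size
    return feature_maps[: n * size].reshape(n, size, K).max(axis=1)

series = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]  # toy values
filters = [[-1.0, 1.0],   # first difference
           [0.5, 0.5]]    # two-point moving average
maps = conv1d(series, filters)   # two feature maps, i.e. a bivariate series
pooled = max_pool1d(maps, size=2)
```

With two filters the single input series becomes a two-column multivariate series, and pooling with window size 2 halves its length.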

Finally, in addition to convolutional and pooling layers, some architectures include batch normalization layers [29] and dropout layers [30] in order to accelerate the training process and reduce overfitting, respectively.

### **4. Ensemble Deep Learning Models**

Ensemble learning has been proposed as an elegant solution to address the high variance of individual forecasting models and reduce the generalization error [31–33]. The basic principle behind any ensemble strategy is to weigh a number of models and combine their individual predictions to improve forecasting performance; the key point for the effectiveness of the ensemble is that its components should be characterized by accuracy and diversity in their predictions [34]. In general, the combination of multiple models' predictions adds a bias which in turn counters the variance of a single trained model. Therefore, by reducing the variance in the predictions, the ensemble can perform better than any single model.
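A standard identity makes the role of diversity concrete. Assume, purely for illustration, that the $M$ component models have prediction errors $e_1, \dots, e_M$ with a common variance $\sigma^2$ and a common pairwise correlation $\rho$. Then the variance of the averaged error is

$$\operatorname{Var}\left(\frac{1}{M}\sum_{j=1}^{M} e_j\right) = \frac{1}{M^2}\left(M\sigma^2 + M(M-1)\rho\,\sigma^2\right) = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2.$$

When the errors are perfectly correlated ($\rho = 1$) averaging brings no reduction, whereas for uncorrelated errors ($\rho = 0$) the variance falls to $\sigma^2 / M$. Under these simplifying assumptions, the benefit of an ensemble therefore grows as its components become less correlated.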

In the literature, several strategies have been proposed to design and develop ensembles of regression models. Next, we present three of the most efficient and widely employed strategies: ensemble-averaging, bagging, and stacking.

### *4.1. Ensemble-Averaging of Deep Learning Models*

Ensemble-averaging [35] (or averaging) is the simplest combination strategy for exploiting the predictions of different regression models. It constitutes a widely utilized ensemble strategy of individually trained models in which their predictions are treated equally. More specifically, each forecasting model is individually trained and the ensemble-averaging strategy linearly combines all predictions by averaging them to develop the output. Figure 1 illustrates a high-level schematic representation of the ensemble-averaging of deep learning models.

**Figure 1.** Ensemble-averaging of deep learning models.

Ensemble-averaging is based on the philosophy that its component models will not usually make the same error on new unseen data [36]. In this way, the ensemble model reduces the variance in the prediction, which results in better predictions compared to a single model. The advantages of this strategy are its simplicity of implementation and the exploitation of the diversity of errors of its component models without requiring any additional training on large quantities of the individual predictions.
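A minimal sketch of ensemble-averaging is given below. The three "component models" are simulated by adding independent noise to a synthetic target, which is an assumption made only to keep the example self-contained; in practice the rows of `preds` would hold the outputs of trained deep learning models.

```python
import numpy as np

def ensemble_average(predictions):
    """Equal-weight average; predictions has shape (n_models, n_samples)."""
    predictions = np.asarray(predictions, dtype=float)
    return predictions.mean(axis=0)

rng = np.random.default_rng(0)
truth = np.linspace(100.0, 110.0, 200)  # synthetic target, not real prices
# Three hypothetical component models whose errors are independent noise.
preds = truth + rng.normal(scale=2.0, size=(3, truth.size))

avg = ensemble_average(preds)
mse_single = ((preds - truth) ** 2).mean(axis=1)  # one MSE per component model
mse_ensemble = ((avg - truth) ** 2).mean()
```

By the convexity of the squared error, `mse_ensemble` can never exceed the mean of `mse_single`; how far below it falls depends on how weakly correlated the component errors are.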

### *4.2. Bagging Ensemble of Deep Learning Models*

Bagging [33] is one of the most widely used and successful ensemble strategies for improving the forecasting performance of unstable models. Its basic principle is the development of more diverse forecasting models by modifying the distribution of the training set based on a stochastic strategy. More specifically, it applies the same learning algorithm on different bootstrap samples of the original training set and the final output is produced via simple averaging. An attractive property of the bagging strategy is that it reduces variance while simultaneously retaining the bias, which assists in avoiding overfitting [37,38]. Figure 2 demonstrates a high-level schematic representation of the bagging ensemble of *n* deep learning models.

**Figure 2.** Bagging ensemble of deep learning models.

It is worth mentioning that the bagging strategy is particularly useful for dealing with large and high-dimensional datasets, where finding a single model which can exhibit good performance in one step is impossible due to the complexity and scale of the prediction problem.
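The bagging procedure can be sketched as follows. To keep the example short and runnable, an ordinary least-squares regressor stands in for the deep learning component models, and the data are synthetic; both are assumptions for illustration. Note also that this plain bootstrap resamples rows independently and ignores any temporal dependence between them.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in base learner: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def bagging_fit(X, y, n_models, rng):
    """Train the same learner on n_models bootstrap samples of the training set."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # draw n rows with replacement
        models.append(fit_linear(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Final output: simple average of the component predictions."""
    return np.mean([predict_linear(coef, X) for coef in models], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic features
y = 3.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

models = bagging_fit(X, y, n_models=10, rng=rng)
y_hat = bagging_predict(models, X)
```

Replacing `fit_linear` and `predict_linear` with the training and inference routines of a neural network would give a bagging ensemble of deep learning models with the same structure.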

### *4.3. Stacking Ensemble of Deep Learning Models*

Stacked generalization or stacking [39] constitutes a more elegant and sophisticated approach for combining the predictions of different learning models. The motivation of this approach is the limitation of simple ensemble-averaging that each model contributes equally to the ensemble prediction, regardless of how well it performed. Instead, stacking induces a higher-level model for exploiting and combining the predictions of the ensemble's component models. More specifically, the models which comprise the ensemble (Level-0 models) are individually trained using the same training set (Level-0 training set). Subsequently, a Level-1 training set is generated from the collected outputs of the component models.

This dataset is utilized to train a single Level-1 model (meta-model) which ultimately determines how the outputs of the Level-0 models should be efficiently combined, to maximize the forecasting performance of the ensemble. Figure 3 illustrates the stacking of deep learning models.

**Figure 3.** Stacking ensemble of deep learning models.

In general, stacked generalization works by deducing the biases of the individual learners with respect to the training set [33]. This deduction is performed by the meta-model. In other words, the meta-model is a special case of weighted averaging, which utilizes the set of predictions as a context and conditionally decides how to weight the input predictions, potentially resulting in better forecasting performance.
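A minimal sketch of the Level-1 step is shown below. The meta-model here is a linear regressor fitted by least squares, which is only one possible choice and an assumption of this example; the Level-0 predictions are simulated with different noise levels instead of coming from trained networks.

```python
import numpy as np

def fit_meta_model(level0_preds, y):
    """Level-1 linear meta-model: intercept plus one weight per Level-0 model.

    level0_preds has shape (n_samples, n_models).
    """
    A = np.column_stack([np.ones(len(y)), level0_preds])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def stack_predict(w, level0_preds):
    return np.column_stack([np.ones(len(level0_preds)), level0_preds]) @ w

rng = np.random.default_rng(0)
truth = np.linspace(100.0, 110.0, 300)  # synthetic target, not real prices
scales = np.array([0.5, 2.0, 4.0])      # hypothetical models of unequal quality
level0 = truth[:, None] + rng.normal(size=(300, 3)) * scales  # Level-1 training set

w = fit_meta_model(level0, truth)
mse_stack = np.mean((stack_predict(w, level0) - truth) ** 2)
mse_avg = np.mean((level0.mean(axis=1) - truth) ** 2)
```

Since equal weighting is one of the combinations available to the linear meta-model, its error on the data it was fitted on cannot exceed that of simple averaging. For brevity the meta-model is fitted and evaluated on the same samples; a more careful setup would build the Level-1 set from predictions on data the Level-0 models did not see during training.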
