**1. Introduction**

Currently, there is tremendous interest in determining the dynamics and direction of the price of Bitcoin due to its unique characteristics, such as its decentralization, transparency, anonymity, and speed in carrying out international transactions. Recently, these characteristics have attracted the attention of both institutional and retail investors. Thanks to technological developments, investors' trading strategies benefit from digital platforms; therefore, market participants are more likely to digest and create information for this market. Of special interest is its decentralized character, since its value is not determined by a central bank but, essentially, only by supply and demand, recovering the ideal of a free market economy. At the same time, it is accessible to all sectors of society, which breaks down geographic and other entry barriers for investors. The fact that the number of coins is finite and the cost of mining new coins grows exponentially has suggested to some specialists that Bitcoin may be a good instrument for preserving value. That is, unlike fiat money, Bitcoin cannot be arbitrarily issued, so its value is not affected by the excessive issuance of currency that central banks currently practise, or by low interest rates as a strategy to control inflation. In other words, it has recently been suggested that Bitcoin is a safe-haven asset or store of value, playing a role similar to that once played by gold and other metals.

The study of cryptocurrencies and Bitcoin has been approached from different perspectives and research areas. It has been addressed from the point of view of financial
economics, econometrics, data science, and, more recently, econophysics. In these approaches, various methodologies and mathematical techniques have been utilised to understand different aspects of these new financial instruments, on topics ranging from systemic risk, spillover effects, and autoscaling properties to collective patterns, price formation, and forecasting in general. Remarkable work in the line of multiscale analysis of cryptocurrency markets can be found in [1]. This paper, however, is motivated by the econophysics approach, incorporating rigorous control variables to predict Bitcoin price patterns.

We would like to offer a comprehensive review of the determinants of Bitcoin prices. The first pillar can be defined as sentiment and social media content. While Bitcoin is widely considered a digital financial asset, investors pay attention to this largest-capitalization cryptocurrency by searching for its name; hence, the strand of literature on Google search volume has become popular for capturing investor attention [2]. Concomitantly, not only peer-to-peer sentiment (individual Twitter accounts or fear among investors) [3,4] but also influential accounts (the U.S. President, media companies) [5–7] contribute significantly to Bitcoin price movements. Given the great debate on whether Bitcoin should act as a hedging, diversifying, or safe-haven instrument, Bitcoin exhibits a mixture of investing features. More interestingly, uncertainty shocks might cause changes in both the supply and demand of Bitcoin in circulation, implying changes in its price [8]. Thus, the diverse stylized facts of Bitcoin, including heteroskedasticity and long memory, require uncertainty to be controlled for in the model. While uncertainties represent the amount of risk (compensated by Bitcoin returns) [9], our model also includes the price of risk, namely the 'risk aversion index' [10]. These two concepts (the amount of risk and the price of risk) capture discount-rate factors in the time variation of any financial market [11]. In summary, the appearance of these determinants could capture the dynamics of the cryptocurrency market.

Since cryptocurrency is a newly emerging market, the level of dependence in the market structure is likely higher than in other markets [12]. Furthermore, the contagion risk and the connectedness among cryptocurrencies can be considered a risk premium for expected returns [13,14]. More importantly, this market can be driven by coins with small market capitalization, implying vulnerability of the market [15]. Hence, our model should contain alternative coins (altcoins) to capture their movements in the context of Bitcoin price changes. Finally, investors might consider the following assets as alternative investments, precious metals being the first named: they are not only substitute assets [16] but also predictive factors (for instance, gold and platinum) [17]. Other predictive factors include commodity markets (such as crude oil [18,19]), the exchange rate [20], the equity market [21], and Tesla's owner [22]. In summary, there are numerous determinants of Bitcoin prices. Within the scope of this study, we focus on the predictability of our model, especially the inclusion of social media content, which represents highly popular information, in the Bitcoin market. Moreover, the more relevant control variables are included, the higher the potential accuracy of prediction. Our model thus may be a useful tool, combining this large set of predictive factors for training and forecasting the response dynamics of Bitcoin to other relevant information.

This study approaches Bitcoin within the framework of behavioural and financial economics, using tools from econophysics and data science. In this sense, it seeks to understand the speculative character of Bitcoin and the possibilities of arbitrage through a model that includes investor attention and the effect of the news, among other factors. For this, we use a causality measure originally proposed by Schreiber [23] and feed the resulting information into a deep learning model as features. The current literature focuses only on specific sentiment indicators (such as Twitter users [3] or the number of tweets [24,25]), whereas our study crawled the original text of influential Twitter accounts (such as those of the President of the United States, the CEO of Tesla, and well-known organizations such as the United Nations and BBC Breaking News). We then applied language analyses to construct predictive factors for Bitcoin prices. Therefore, our model incorporates a new perspective on Bitcoin's drivers.

In this direction, the work of [26] uses effective transfer entropy as an additional feature to predict the direction of U.S. stock prices under different machine learning approaches. However, that approximation is discrete and based on averages, and the metrics employed are not exhaustive enough to determine the predictive power of the models. In a similar vein, the authors of [27] perform a comparative analysis of machine learning methods for the problem of measuring asset risk premiums; nevertheless, they do not take into account recurrent neural network models or additional nontraditional features. An alternative approach to studying the main drivers of Bitcoin is discussed in [28], where the author explores wavelet coherence to examine short- and long-term interactions in the time and frequency domains. Likewise, recent studies have employed correlation networks and vector error-correction models to explain price prediction and exchange spillovers [29,30]. Of course, Bitcoin prediction is more likely than stock prediction to involve sentiment and 'noise' factors.

On the other hand, there are methodologies to explain machine learning results, known as eXplainable Artificial Intelligence (XAI). Among these, two of the most popular are Local Interpretable Model-agnostic Explanations (LIME) [31] and SHapley Additive exPlanations (SHAP) [32]. Both techniques are based on perturbing the model locally. The former assumes a linear surrogate model to score the features by their importance in making predictions; the latter uses game-theoretic concepts to find the best feature attribution in terms of predictive gain. In [33], these techniques are extended to include temporal dependencies, demonstrating the need to develop XAI techniques applicable to time series. In [34,35], an XAI method applicable to credit risk is proposed. In a similar vein, the authors of [36] mention the difficulty of estimating out-of-sample behaviour in stress scenarios. An interesting work is [37], which considers a gradient-boosted decision tree approximation to predict drops in the S&P 500 using a large number of features. The authors claim that retaining a small, carefully selected set of features improves the learning model's results.

However, as mentioned in the cornerstone work [31], it is not possible to explain a highly non-linear model through local perturbations alone; there is high instability derived from the characteristics of the inherent dynamical system. In addition, the examples in the articles mentioned above run, in most cases, in seconds or minutes. Therefore, the LIME and SHAP methods are appropriate mainly for machine learning models or simple deep learning scenarios [38]. In this spirit, it is not practical to follow the traditional XAI approach here, given the computational demand derived from the number of hyperparameters and configurations to be implemented. Instead, our proposal to use transfer entropy in the variable selection process can be considered an alternative strategy to XAI, of particular interest under highly non-linear dependency conditions such as Bitcoin dynamics.

Our study incorporates a wide range of Bitcoin's drivers, including alternative investments, economic policy uncertainty, and investor attention; however, social media content is our main contribution to the set of predictive factors. Specifically, we study the effect that a set of Twitter accounts belonging to politicians and millionaires has on the behaviour of Bitcoin's price direction. In this work, the statistically significant drivers of Bitcoin are detected in the sense of the continuous estimation of local transfer entropy (local TE) through nearest neighbours and permutation tests. The proposed methodology deals with non-Gaussian data and nonlinear dependencies in the problem of variable selection and forecasting. One main aim is to quantify the effects of investor attention and social media on Bitcoin in the context of behavioural finance. Another is to apply classification metrics to assess the effect of including or excluding the statistically significant features in an LSTM classification problem.

Section 2 introduces local transfer entropy, the nearest-neighbour estimation technique, the deep learning forecasting models, and the classification metrics. Section 3 describes the data and their main descriptive characteristics. Section 4 presents and discusses the main results. Finally, Section 5 highlights the implications of the results and proposes future work.

#### **2. Materials and Methods**

### *2.1. Transfer Entropy*

Transfer Entropy (TE) [23] measures the flow of information from a system $Y$ to a system $X$ in a nonsymmetric way. Denote the sequences of states of systems $X$ and $Y$ by $x_i = x(i)$ and $y_i = y(i)$, $i = 1, \dots, N$. The idea is to model the signals or time series as Markov systems and incorporate the temporal dependencies by considering the states $x_i$ and $y_i$ to predict the next state $x_{i+1}$. If there is no deviation from the generalized Markov property $p(x_{i+1} \mid x_i, y_i) = p(x_{i+1} \mid x_i)$, then $Y$ has no influence on $X$. Hence, TE is derived from this idea and defined as

$$T_{Y \to X}(k, l) = \sum p\left(x_{i+1}, x_i^{(k)}, y_i^{(l)}\right) \log \frac{p\left(x_{i+1} \mid x_i^{(k)}, y_i^{(l)}\right)}{p\left(x_{i+1} \mid x_i^{(k)}\right)}, \tag{1}$$

where $x_i^{(k)} = (x_i, \dots, x_{i-k+1})$ and $y_i^{(l)} = (y_i, \dots, y_{i-l+1})$.
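
For intuition, the following minimal Python sketch computes a plug-in version of Equation (1) with $k = l = 1$, discretizing the series into bins and estimating the transition probabilities from counts. This is only an illustrative discrete estimator, not the continuous KSG estimator the paper actually employs (described below); the bin count is an arbitrary choice.

```python
import numpy as np
from collections import Counter

def transfer_entropy_discrete(x, y, bins=8):
    """Plug-in estimate of T_{Y->X} with k = l = 1 (Eq. (1))."""
    # Symbolize the continuous series with equal-width bins.
    xd = np.digitize(x, np.histogram_bin_edges(x, bins)[1:-1])
    yd = np.digitize(y, np.histogram_bin_edges(y, bins)[1:-1])
    x_next, x_now, y_now = xd[1:], xd[:-1], yd[:-1]
    n = len(x_next)

    c_xxy = Counter(zip(x_next, x_now, y_now))   # counts of (x_{i+1}, x_i, y_i)
    c_xy = Counter(zip(x_now, y_now))            # counts of (x_i, y_i)
    c_xx = Counter(zip(x_next, x_now))           # counts of (x_{i+1}, x_i)
    c_x = Counter(x_now)                         # counts of x_i

    te = 0.0
    for (xn, xc, yc), c in c_xxy.items():
        p_cond_xy = c / c_xy[(xc, yc)]           # p(x_{i+1} | x_i, y_i)
        p_cond_x = c_xx[(xn, xc)] / c_x[xc]      # p(x_{i+1} | x_i)
        te += (c / n) * np.log(p_cond_xy / p_cond_x)
    return te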

TE can be thought of as a global average or expected value of a local transfer entropy at each observation [39]

$$T_{Y \to X}(k, l) = \left\langle \log \frac{p\left(x_{i+1} \mid x_i^{(k)}, y_i^{(l)}\right)}{p\left(x_{i+1} \mid x_i^{(k)}\right)} \right\rangle \tag{2}$$

The main characteristic of the local version of TE is that it is measured at each time step for each destination element $X$ in the system and each causal information source $Y$ of the destination. It can be either positive or negative for a specific event set $(x_{i+1}, x_i^{(k)}, y_i^{(l)})$, which gives the opportunity to measure informativeness or noninformativeness at each point of a pair of time series.
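
Continuing the sketch above, local TE simply keeps the pointwise log-ratios of Equation (2) instead of averaging them; negative values flag misinformative events. A hedged standalone illustration with the same binning scheme:

```python
import numpy as np
from collections import Counter

def local_te_series(x, y, bins=8):
    """Pointwise log-ratios of Eq. (2); their mean recovers the global TE."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins)[1:-1])
    yd = np.digitize(y, np.histogram_bin_edges(y, bins)[1:-1])
    x_next, x_now, y_now = xd[1:], xd[:-1], yd[:-1]

    c_xxy = Counter(zip(x_next, x_now, y_now))
    c_xy = Counter(zip(x_now, y_now))
    c_xx = Counter(zip(x_next, x_now))
    c_x = Counter(x_now)

    local = [np.log((c_xxy[(xn, xc, yc)] / c_xy[(xc, yc)]) /
                    (c_xx[(xn, xc)] / c_x[xc]))
             for xn, xc, yc in zip(x_next, x_now, y_now)]
    return np.array(local)   # individual entries may be negative
```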

On the other hand, there exist several approximations to estimate the probability transition distributions involved in the TE expression. Nevertheless, there is no perfect estimator: it is generally impossible to minimize both the variance and the bias at the same time, so it is important to choose the one that best suits the characteristics of the data under study. That is why finding good estimators remains an open research area [40]. This study follows the Kraskov–Stögbauer–Grassberger (KSG) estimator [41], which is designed for small samples from continuous distributions and is based on nearest neighbours. Although obtaining insight into this estimator is not easy, we attempt to do so in the following.

Let $\mathbf{X} = (x_1, x_2, \dots, x_d)$ now denote a $d$-dimensional continuous random variable whose probability density function is $p : \mathbb{R}^d \to \mathbb{R}$. The continuous or differential Shannon entropy is defined as

$$H(\mathbf{X}) = -\int_{\mathbb{R}^d} p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x} \tag{3}$$

The KSG estimator aims to use similar length scales for the $K$-nearest-neighbour distance in the marginal spaces as in the joint space, in order to reduce the bias [42].

To obtain the explicit expression of the differential entropy under the KSG estimator, consider $N$ i.i.d. samples $\chi = \{\mathbf{X}^{(i)}\}_{i=1}^{N}$ drawn from $p(\mathbf{X})$. Under the assumption that $\varepsilon_{i,K}$ is twice the (maximum-norm) distance from $\mathbf{X}^{(i)}$ to its $K$-th nearest neighbour, the differential entropy can be estimated as

$$\hat{H}_{\mathrm{KSG},K}(\mathbf{X}) \equiv \psi(N) - \psi(K) + \frac{d}{N} \sum_{i=1}^{N} \log \varepsilon_{i,K} \tag{4}$$

where $\psi$ is known as the digamma function and can be defined as the derivative of the logarithm of the gamma function $\Gamma(x)$:

$$\psi(K) = \frac{1}{\Gamma(K)} \frac{d\Gamma(K)}{dK} \tag{5}$$

The parameter $K$ defines the size of the neighbourhood used in the local density estimation. It is a free parameter, but there is a trade-off between smaller and larger values of $K$: the former should be more accurate, while the latter reduces the variance of the estimate. For further intuition, Figure 1 graphically shows the mechanism for choosing the nearest neighbours with $K = 3$.

**Figure 1.** Graphical representation of nearest-neighbour selection. At a given sample point $\mathbf{X}^{(i)}$, the max-norm rectangle contains the $K = 3$ nearest neighbours.
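
Equation (4) translates almost directly into code. The sketch below is a minimal implementation assuming SciPy (its k-d tree and `digamma` are standard library components, not code from the paper):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def entropy_ksg(X, K=3):
    """KSG estimate of differential entropy, Eq. (4).

    X is an (N, d) array of samples; eps_{i,K} is twice the max-norm
    distance from X^(i) to its K-th nearest neighbour.
    """
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    # Ask for K + 1 neighbours because each query point is returned
    # as its own nearest neighbour at distance 0.
    dist, _ = cKDTree(X).query(X, k=K + 1, p=np.inf)
    eps = 2.0 * dist[:, -1]
    return digamma(N) - digamma(K) + (d / N) * np.sum(np.log(eps))
```

As a sanity check, for a large sample of one-dimensional standard normal draws the estimate should approach $\frac{1}{2}\log(2\pi e) \approx 1.42$ nats.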

The KSG estimator of TE can be derived from the previous estimation of the differential entropy. Yet, in most cases, as in this work, no analytic distribution is known. Hence, the distribution of $T_{Y_s \to X}(k, l)$, where $Y_s$ denotes a surrogate time series of $Y$, must be computed empirically. This is done by a resampling method that creates a large number of surrogate time-series pairs $\{Y_s, X\}$ by shuffling (for permutation tests) or redrawing (for bootstrapping) the samples of $Y$. In particular, we compute the distribution of $T_{Y_s \to X}(k, l)$ by permutation, under which the surrogates must preserve $p(x_{n+1} \mid x_n)$ but not $p(x_{n+1} \mid x_n, y_n)$.
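
A minimal sketch of this permutation scheme follows; it accepts any TE estimator (for illustration, the discrete one sketched earlier), and `n_perm` is an arbitrary illustrative choice rather than the paper's setting:

```python
import numpy as np

def te_permutation_test(x, y, te_func, n_perm=200, seed=0):
    """Shuffling y destroys p(x_{n+1} | x_n, y_n) while preserving
    p(x_{n+1} | x_n); compare the observed TE with the surrogate
    distribution T_{Ys->X}."""
    rng = np.random.default_rng(seed)
    observed = te_func(x, y)
    surrogates = np.array([te_func(x, rng.permutation(y))
                           for _ in range(n_perm)])
    # One-sided empirical p-value with the usual +1 correction.
    p_value = (1 + np.sum(surrogates >= observed)) / (1 + n_perm)
    return observed, p_value
```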

### *2.2. Deep Learning Models*

We can think of artificial neural networks (ANNs) as mathematical models whose operation is inspired by the activity of, and interactions between, neuronal cells through their electrochemical signals. The main advantages of ANNs are their non-parametric and nonlinear characteristics. The essential ingredients of an ANN are the neurons, which receive an input vector $\mathbf{x}_i$ and, through the dot product with a vector of weights $\mathbf{w}$, generate an output via the activation function $g(\cdot)$:

$$f(\mathbf{x}_i) = g(\mathbf{x}_i \cdot \mathbf{w}) + b, \tag{6}$$

where $b$ is a bias term to be estimated during the training process. The basic procedure is the following. The first layer of neurons, or input layer, receives each of the elements of the input vector $\mathbf{x}_i$ and transmits them to the second (hidden) layer. The subsequent hidden layers calculate their output values or signals and transmit them as an input vector to the next layer, until reaching the last layer, or output layer, which generates an estimate of the output vector.
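
As a toy illustration of this layered forward pass (a hedged sketch with arbitrary layer sizes; note that, in common practice, the bias is applied inside the activation, $g(\mathbf{x} \cdot \mathbf{w} + b)$, which is the convention used here):

```python
import numpy as np

def dense(x, W, b, g=np.tanh):
    # One layer of neurons: activation of the weighted inputs plus bias.
    return g(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))                            # one input vector x_i
h = dense(x, rng.normal(size=(3, 4)), np.zeros(4))     # hidden layer
out = dense(h, rng.normal(size=(4, 1)), np.zeros(1),
            g=lambda z: 1.0 / (1.0 + np.exp(-z)))      # sigmoid output layer
```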

Further developments of ANNs have brought recurrent neural networks (RNNs), which have connections from the neurons or units of the hidden layers to themselves and are therefore more appropriate for capturing temporal dependencies, making them better models for time-series forecasting problems. Instead of neurons, an RNN is composed of units, each with an input vector $\mathbf{x}_t$ and an output signal or value $h_t$. The unit is designed with a recurrent connection. This property induces a feedback loop, which sends a recurrent signal to the unit as the observations in the training data set are analysed. In the internal process, backpropagation is performed to obtain the optimal weights. Unfortunately, backpropagation is sensitive to long-range dependencies: the gradients involved face the problem of vanishing or exploding. Long short-term memory (LSTM) models were introduced by Hochreiter and Schmidhuber [43] to avoid these problems. The fundamental difference is that LSTM units are provided with memory cells and gates to store information and forget what is unnecessary.
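
A minimal LSTM classifier sketch in Keras illustrates the idea. The shapes are hypothetical (windows of 30 time steps over 12 features, with a binary up/down target); this is not the paper's exact architecture:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(30, 12)),                  # 30 lags, 12 features (assumed)
    tf.keras.layers.LSTM(64),                        # units with memory cells and gates
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of an upward move
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
```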

The final ANNs we need to discuss are convolutional neural networks (CNNs). They can be thought of as a kind of ANN that uses a large number of identical copies of the same neuron. This allows the network to express computationally large models while keeping the number of parameters small. Usually, the construction of these types of ANNs includes a max-pooling layer to capture the largest value over small blocks or patches in each feature map of the previous layers. It is common for CNN and pooling layers to be followed by a dense, fully connected layer that interprets the extracted features; the standard approach is then to use a flatten layer between the CNN layers and the dense layer to reduce the feature maps to a single one-dimensional vector [44].
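
A matching 1D-CNN sketch under the same assumed input shapes shows the convolution, max-pooling, flatten, and dense pipeline just described (again illustrative, not the paper's architecture):

```python
import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(30, 12)),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),  # shared-weight filters
    tf.keras.layers.MaxPooling1D(pool_size=2),     # largest value per patch
    tf.keras.layers.Flatten(),                     # feature maps -> one 1-D vector
    tf.keras.layers.Dense(16, activation="relu"),  # interprets extracted features
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```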

### *2.3. Classification Metrics*

In classification problems, we have the predicted class and the actual class. The possible scenarios under a classification prediction are given by the confusion matrix: true positives (*TP*), true negatives (*TN*), false positives (*FP*), and false negatives (*FN*). Based on these quantities, it is possible to define the usual classification metrics, such as accuracy, precision, and the true positive and true negative rates (*TPR*, *TNR*).


The most complex measure is the area under the curve (AUC) of the receiver operating characteristic (ROC), which expresses the pairs $(TPR_\tau, 1 - TNR_\tau)$ for different thresholds $\tau$. Contrary to the other metrics, the AUC of the ROC is a quality measure that evaluates all the operating points of the model. A model with this metric equal to 0.5 is considered random; a value significantly higher than 0.5 thus indicates a model with predictive power, with 1 being the upper bound.
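
For concreteness, a hedged sketch of the standard confusion-matrix metrics at a threshold $\tau$, together with the threshold-free AUC (scikit-learn assumed; these are the textbook definitions, not formulas copied from the paper):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def binary_metrics(y_true, y_prob, tau=0.5):
    """Confusion-matrix metrics at threshold tau, plus the AUC,
    which evaluates all operating points at once."""
    y_pred = (np.asarray(y_prob) >= tau).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "TPR": tp / (tp + fn),                 # recall / sensitivity
        "TNR": tn / (tn + fp),                 # specificity
        "AUC": roc_auc_score(y_true, y_prob),  # 0.5 = random, 1 = perfect
    }
```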
