**4. Results**

#### *4.1. Variable Selection*

The observed characteristics of the data in the previous section justify a nonparametric approach to determining the explanatory features to be employed in the predictive classification model. The variable selection procedure therefore consisted of computing the continuous transfer entropy from each driver to Bitcoin using the KSG estimator. Figure 5 shows the average transfer entropy when varying the Markov orders *k*, *l* and the neighbour parameter *K* from one to ten, for a total of 1000 different estimations per driver. The higher the colour intensity, the higher the average transfer entropy (measured in nats). The grey cells do not transfer information to BTC; in other words, these cases do not show a statistically significant flow of information, where significance is assessed by a permutation test that constructs 100 surrogate measurements under the null hypothesis of no directed relationship between the given variables.
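The mechanics of the surrogate-based significance test can be illustrated with a minimal sketch. The paper uses the continuous KSG (nearest-neighbour) estimator; the crude binned plug-in estimator below, fixed at Markov orders *k* = *l* = 1, only stands in for it so that the permutation logic is concrete. Shuffling the source series destroys any directed temporal coupling, which gives samples from the null distribution:

```python
import numpy as np

def binned_te(source, target, bins=8):
    """Crude plug-in transfer entropy T(source -> target) with k = l = 1,
    via histogram probabilities. Illustration only: the paper uses the
    continuous KSG estimator, not this binned version."""
    x = np.digitize(source, np.histogram_bin_edges(source, bins))
    y = np.digitize(target, np.histogram_bin_edges(target, bins))
    yt, ytm1, xtm1 = y[1:], y[:-1], x[:-1]

    def H(*cols):
        # Joint Shannon entropy (in nats) of the given discrete columns.
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    # TE = H(y_t | y_{t-1}) - H(y_t | y_{t-1}, x_{t-1})
    return (H(yt, ytm1) - H(ytm1)) - (H(yt, ytm1, xtm1) - H(ytm1, xtm1))

def permutation_test(source, target, n_surrogates=100, seed=0):
    """Permutation test under the null of no directed relationship:
    each surrogate recomputes TE after shuffling the source."""
    rng = np.random.default_rng(seed)
    observed = binned_te(source, target)
    surrogates = [binned_te(rng.permutation(source), target)
                  for _ in range(n_surrogates)]
    p_value = np.mean([s >= observed for s in surrogates])
    return observed, p_value
```

Because the plug-in estimator is biased upward, the comparison against surrogates (which share the same bias) is what makes the test meaningful, rather than the raw TE value itself.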

**Figure 5.** Average transfer entropy from each potential driver to BTC. The y-axis indicates the driver, and the x-axis indicates the Markov order pair *k*, *l* of the source and target. From (**a**) to (**j**), nearest neighbours *K* run from one to ten, respectively.

The tuple of parameters {*k*, *l*, *K*} that gives the highest average transfer entropy from each potential driver to BTC is considered optimal, and the associated local TE is kept as a feature in the classification model of Bitcoin's price direction. Figure 6 shows the local TE from each statistically significant driver to BTC at the optimal parameter tuple {*k*, *l*, *K*}. Note that the set of local TE time series is limited to 23 features. Consequently, the set of originally proposed potential drivers is reduced from 25 to 23. Surprisingly, NASDAQ and Tesla do not send significant information to BTC for any value of {*k*, *l*, *K*} in the grid of 1000 different configurations. The variations are smooth in *K*, but not in the Markov orders *k*, *l*. Also notable are the negligible flows of information at *k* = *l* = 1.
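The selection rule above can be sketched in a few lines. The dictionaries `avg_te` and `significant` below are hypothetical stand-ins for the 10 × 10 × 10 grid of KSG estimates and permutation-test verdicts described in the text:

```python
# Hypothetical sketch of the feature selection rule: for each driver,
# keep the (k, l, K) tuple with the highest average TE to BTC, and drop
# drivers with no statistically significant tuple at all.
def select_features(avg_te, significant):
    """avg_te:      dict driver -> {(k, l, K): mean TE in nats}
    significant:    dict driver -> {(k, l, K): bool from the permutation test}
    Returns a dict driver -> optimal tuple, for significant drivers only."""
    best = {}
    for driver, grid in avg_te.items():
        sig_tuples = {t: v for t, v in grid.items() if significant[driver][t]}
        if sig_tuples:  # e.g. NASDAQ and Tesla have none and are dropped
            best[driver] = max(sig_tuples, key=sig_tuples.get)
    return best
```

Applied to the paper's grid, this rule is what reduces the 25 proposed drivers to the 23 retained local TE features.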

**Figure 6.** Local TE of the highest significant average values at the tuple {*k*, *l*, *K*}. NASDAQ and Tesla are omitted because they do not send significant information to BTC for any considered value on the grid of the tuple {*k*, *l*, *K*}.

#### *4.2. Bitcoin's Price Direction*

The task of detecting Bitcoin's price direction was carried out with a deep learning approach. The first step consisted of splitting the data into training, validation, and test datasets. The chosen training period runs from 1 January 2017 to 4 January 2020, or 75% of the entire original period, and is characterized as a prepandemic scenario. The validation dataset is restricted to the period from 5 January 2020 to 11 July 2020, or 13% of the original data, and is considered the pandemic scenario. The test dataset covers the postpandemic scenario from 12 July 2020 to 9 January 2021 and contains 12% of the complete dataset. Deep learning forecasting requires transforming the original data into a supervised dataset. Here, samples of 74 historical days and a one-step prediction horizon are given to the model to obtain a supervised training dataset whose first dimension is a power of two, which is important for the hyperparameter selection of the batch dimension. Specifically, the sample dimensions are 1024, 114, and 107 for training, validation, and testing, respectively. Because we are interested in predicting the direction of BTC, the time series are not demeaned; they are only scaled by their variance when feeding the deep learning models. An important piece of a deep learning model is the selection of the activation function. In this work, the rectified linear unit (ReLU) was selected for the hidden layers, and the sigmoid function was chosen for the output layer because we are dealing with a classification problem. In addition, an essential ingredient is the selection of the stochastic gradient descent method. Here, Adam optimization is selected, based on adaptive estimation of the first- and second-order moments. In particular, we used the version of [46], which searches the long-term memory of past gradients to improve the convergence of the optimizer.
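The supervised transformation described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it windows a univariate series with a 74-day lookback, labels each window by the sign of the next value, and scales by the standard deviation without demeaning (so that sign information survives):

```python
import numpy as np

def make_supervised(series, lookback=74):
    """series: 1-D array of daily values. Returns (X, y) where each row
    of X holds `lookback` consecutive scaled values and y is 1 when the
    next value is positive (upward direction), else 0."""
    scaled = series / series.std()          # scale only; do not demean
    X = np.stack([scaled[i:i + lookback]
                  for i in range(len(scaled) - lookback)])
    y = (series[lookback:] > 0).astype(int)  # one-step-ahead direction
    return X, y
```

With the paper's training period this windowing yields 1024 samples, a power of two, which plays well with the batch sizes later searched over.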

There are several hyperparameters to take into account when modelling a classification problem with a deep learning approach. These hyperparameters must be calibrated on the training and validation datasets to obtain reliable results on the test dataset. The usual procedure to set them is a grid search. Nevertheless, deeper networks and more computational power are necessary to obtain the optimal values in a reasonable amount of time. To avoid excessive time demands, we vary the most crucial parameters over a small grid and apply some heuristics when required. The number of epochs is selected by the early stopping procedure. Another crucial hyperparameter is the batch size, or the number of samples to work through before updating the internal weights of the model. For this parameter, the selected grid was {32, 64, 128, 256}. Additionally, we consider the initial learning rates at which the optimizer starts the algorithm, which were {0.001, 0.0001}. As an additional method of regularization, dropout between consecutive layers is added. This hyperparameter can take values from 0 to 1; our grid for it is {0.3, 0.5, 0.7}. Finally, because of the stochastic nature of deep learning models, it is necessary to run several realizations and work with averages. We repeat the hyperparameter selection with ten different random seeds for robustness. The covered scenarios are the following: *univariate* (S1), where Bitcoin is self-driven; *all features* (S2), where all the potential drivers listed in Table 1 are included as features of the model; *significant features* (S3), where only statistically significant drivers under the KSG transfer entropy approach are considered as features; *local TE* (S4), where only the local TE of the statistically significant drivers is included as a feature; and finally *significant features + local TE* (S5), which combines scenarios (S3) and (S4).
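The size of the search space follows directly from the grids above. A small sketch (epochs are excluded because early stopping governs them):

```python
from itertools import product

# Hyperparameter grids as stated in the text.
batches = [32, 64, 128, 256]
learning_rates = [0.001, 0.0001]
dropouts = [0.3, 0.5, 0.7]
seeds = range(10)  # ten realizations to average out stochastic effects

# 4 x 2 x 3 = 24 hyperparameter combinations, times 10 seeds = 240 runs
# per (scenario, architecture) pair.
configs = list(product(batches, learning_rates, dropouts, seeds))
```

Note that 240 runs per pair, multiplied by the five scenarios and five architectures introduced next, accounts exactly for the 6000 executed models reported below.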
Finally, five different designs have been proposed for the architectures of the neural networks, which are denoted as *deep LSTM* (D1), *wide LSTM* (D2), *deep bidirectional LSTM* (D3), *wide bidirectional LSTM* (D4), and *CNN* (D5). The specific designs and diagrams of these architectures are displayed in Figure 7. In total, 6000 configurations or models were executed, which included the grid search for the optimal hyperparameters, the different scenarios and architectures, and the realizations on different seeds to avoid biases due to the stochastic nature of the considered machine learning models.

The computation was done on a workstation with the following characteristics: Alienware Aurora R7, Ubuntu 20.10, i9-9900X processor with 8 cores and 16 logical threads, 64 GB RAM, dual NVIDIA RTX 2080 Ti, 3 TB HDD. On this equipment, the computational demand extends the execution to nearly 60 h. Tables 3 and 4 present the main results for the validation and test datasets, respectively. Table 2 explicitly states the best values for the dropout, learning rate (LR), and batch hyperparameters. In both tables, the hashtag (#) column indicates the number of times the specific scenario gives the best score among the different metrics considered so far. Hence, architecture design D3 for case S3 yields the highest number of metrics with the best scores in the validation dataset. In contrast, in the test dataset, the highest number of metrics with the best scores corresponds to design D2 for case S1. Nevertheless, design D5 for case S5 is close in terms of the # value, as it presents the best AUC and PPV scores. An important point to keep in mind is that only during the validation stage do we find models with an AUC greater than 0.6, so this metric does not give evidence of predictive power in the testing stage.
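The kind of per-model tallies behind Tables 3 and 4 can be sketched from the confusion matrix. This is a generic illustration, not the paper's evaluation code; the paper's metric set also includes AUC, which requires predicted probabilities rather than the hard labels used here:

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Binary classification metrics from hard labels (0/1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "ppv": tp / (tp + fp) if tp + fp else 0.0,          # precision
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # recall / TPR
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # TNR
    }
```

The # column in the tables then simply counts, for each (scenario, design) pair, how many of these metrics attain the best score across all competitors.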

**Figure 7.** From (**top**) to (**bottom**): D1, D2, D3, D4, and D5.


**Table 3.** Classification metrics on the validation dataset.

As a robustness discussion, we would like to compare our predictive feature with existing approaches. While current studies follow the conventional econometric approach [29,30], our study sheds light on the deep learning method. Accordingly, we had two samples (a training sample and a test group), which allows us to validate our findings over different periods. The only comparable study that we have found in the area of learning models is [26]. However, they only show results for two accuracy metrics when predicting the direction of the US markets. Even so, the metrics barely exceed 0.6, and it is not clear whether they are considering a test set.


**Table 4.** Classification metrics on the test dataset.

## **5. Discussion**

We start from descriptive statistics as a first approach to intuitively grasp the complex nature of Bitcoin, as well as its proposed heterogeneous drivers. As expected, the variables did not satisfy the normality assumption and presented high kurtosis, highlighting the need to use non-parametric and nonlinear analyses.

The KSG estimation of TE found a consistent flow of information from the potential drivers to Bitcoin across the considered range of *K* nearest neighbours. Even though, in principle, the variance of the estimate decreases with *K*, the results obtained with *K* = 1 do not change abruptly for larger values. In fact, the variation in the structure of the TE matrix across different Markov orders *k*, *l* is more pronounced. Additionally, attention must be paid to the evidence for the order *k* = *l* = 1, whose values are near zero. Practitioners usually assume this scenario under Gaussian estimations. Caution is therefore warranted regarding the Markov memory parameters, at least when working with the KSG estimation. The associated local TE does not show any particular pattern beyond high volatility, reaching values of four nats while the average stays below 0.1. Thus, volatility might be a better proxy for price fluctuations in future studies.

In terms of intuitive explanations, we found that the drivers of Bitcoin might not truly capture its returns in distressed periods. Although we expected the predictive power of these determinants to play an important role across time horizons, it turns out that the prediction model of Bitcoin relies on the choice of a specific period. Thus, our findings also confirm the momentum effect that exists in this market [47]. Due to the momentum effect, the timing of market booms could not be given much further support by our models. With regard to our main social media hypothesis, the popularity of Bitcoin content persists as a predictive component in the model. More noticeably, our study highlights that Bitcoin prices can be driven by momentum on social media [24]. However, the selection of training and testing periods should be cautious about the booms and busts of this cryptocurrency. While the fundamental value of Bitcoin is still debatable [48], behavioural determinants appear to have some merit in predicting Bitcoin. Thus, we believe that media content would support the predictability of Bitcoin prices alongside other financial indicators. Concomitantly, after clustering these factors, we found that the results seem better able to provide insights into Bitcoin's drivers.

On the other hand, the forecasting of Bitcoin's price direction improves in the validation set, but not for all metrics in the test dataset, when including significant drivers or local TE as features. Nonetheless, the last assertion relies on the number of metrics with the best scores. Although the best performance on the test dataset corresponds to the *wide LSTM* (D2) for the *univariate* scenario (S1), this case only leads in three of the eight metrics. The other five metrics are outperformed by scenarios including *significant features* (S3) and *significant features + local TE* (S5). Furthermore, the second-best performances are tied, each leading in two of the eight metrics. Interestingly, the last case shows the best predictive power on the CNN model using significant features as well as local TE indicators (D5–S5). In particular, it outperforms the others on AUC and PPV overall, yet the AUC is on the border of a random model. To delve into the explainable aspect, future work will seek to apply the Shapley–Lorenz decomposition proposed in [49,50]. There, the authors develop a global methodology, which can be associated with a generalization of the AUC-ROC.

Moreover, it is important to note that the selected test period is atypical in the sense of being a bull period for Bitcoin, a result of the turbulence generated by the COVID-19 public health emergency; this might induce safe-haven behaviour towards this asset and increase its price and capitalization. This atypical behaviour opens the door to future work modelling Bitcoin with the self-exciting Hawkes process during times of great turbulence.

We would like to end by emphasizing that we did not aim to be exhaustive in modelling classification forecasting. Rather, our intention was to exemplify the effect of including the significant features and local TE indicators under different configurations of a deep learning model, evaluated through a variety of classification metrics. Two methodological contributions to highlight are the use of nontraditional indicators such as market sentiment, and the continuous estimation of the local TE as a tool for determining additional drivers in the classification model. Finally, the models presented here are easily adaptable to high-frequency data because they are non-parametric and nonlinear in nature.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/1099-4300/23/12/1582/, File S1: preprocessed data.

**Author Contributions:** Conceptualization, A.G.-M. and T.L.D.H.; Data curation, A.G.-M.; Formal analysis, A.G.-M.; Funding acquisition, A.G.-M.; Investigation, A.G.-M. and T.L.D.H.; Methodology, A.G.-M.; Writing—original draft, A.G.-M. and T.L.D.H.; Writing—review and editing, A.G.-M. and T.L.D.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Consejo Nacional de Ciencia y Tecnología (CONACYT, Mexico) through the fund FOSEC SEP-INVESTIGACION BASICA (Grant No. A1-S-43514).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data are available as Supplementary Materials.

**Acknowledgments:** We thank Román A. Mendoza for support in acquiring the financial time series, Rebeca Moreno for kindly drawing the diagrams in the Supplementary Materials, and Victor Muñiz for fruitful discussions. T.L.D.H. acknowledges funding from the University of Economics Ho Chi Minh City (Vietnam) under registered project 2021-08-23-0530.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
