**3. Data**

An important part of the work is the acquisition and preprocessing of data. We focus on the period from 1 January 2017 to 9 January 2021 at a daily frequency, for a total of *n* = 1470 observations. As a priority, we consider the variables listed in Table 1 as potential drivers of the price direction of Bitcoin (BTC). Investor attention is measured through Google Trends with the *query = "Bitcoin"*. Because Google Trends by default weighs the values by a monthly factor, the number of mentions is rescaled so that days from different months can be compared. Then, the log return of the resulting time series is calculated.
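As an illustration of the last step, the log return of a positive series x_t is r_t = ln(x_t / x_{t-1}). The following minimal sketch assumes the rescaled attention series is already available as a list of strictly positive numbers; the function name `log_returns` and the sample values are hypothetical and are not taken from the dataset.

```python
import math


def log_returns(values):
    """Return r_t = ln(x_t / x_{t-1}) for consecutive observations."""
    if any(v <= 0 for v in values):
        # The logarithm is undefined for zero or negative levels.
        raise ValueError("log returns require strictly positive values")
    return [math.log(curr / prev) for prev, curr in zip(values, values[1:])]


# Hypothetical rescaled daily search-interest values (not real data).
attention = [40.0, 50.0, 50.0, 25.0]
returns = log_returns(attention)
```

The output has one element fewer than the input, and the log returns add up to the log of the ratio between the last and the first level.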


**Table 1.** Type of driver and variable name.

The social media data are collected from the Twitter API (https://developer.twitter.com/en/docs/twitter-api, accessed on 15 January 2021). However, the Twitter API only allows downloading the latest 3200 tweets of a public profile, which was generally not enough to cover the period of study. The dataset was therefore completed with the freely available repository at https://polititweet.org/ (accessed on 15 January 2021). In this way, the collected number of tweets was 21,336, 22,808, 24,702, 11,140, and 26,169 for the profiles listed in Table 1 under the social media type, respectively. The text of each tweet in the collected dataset is transformed into a sentiment polarity score through the VADER lexicon [45]. Then, the scores are aggregated daily for each profile. The resulting daily time series have missing values due to user inactivity, so they are interpolated with a third-order spline before their differences are calculated. The differencing is applied to make the polarity time series stationary. It is important to remember that Donald Trump's account was blocked on 8 January 2021, so it was also necessary to impute the last value to obtain series of the same length.
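The daily aggregation step can be sketched as follows. This is only an illustration under stated assumptions: the per-tweet polarity scores are taken as already computed, the daily aggregate is assumed to be the mean (the text does not specify the aggregator), and the function name `daily_polarity` and the sample scores are hypothetical. Days without tweets are marked as `None`; these are the gaps that the spline interpolation would fill before differencing.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def daily_polarity(scored_tweets, start, end):
    """Average per-tweet scores by calendar day; inactive days map to None."""
    by_day = defaultdict(list)
    for day, score in scored_tweets:
        by_day[day].append(score)

    series = {}
    current = start
    while current <= end:
        scores = by_day.get(current)  # .get() does not create missing keys
        series[current] = mean(scores) if scores else None
        current += timedelta(days=1)
    return series


# Hypothetical (date, polarity score) pairs, not real tweets.
tweets = [
    (date(2020, 3, 1), 0.5),
    (date(2020, 3, 1), -0.1),
    (date(2020, 3, 3), 0.8),
]
series = daily_polarity(tweets, date(2020, 3, 1), date(2020, 3, 4))

# Days on which the profile was inactive and a value must be imputed.
missing = [day for day, value in series.items() if value is None]
```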

The economic policy uncertainty index is a Twitter-based uncertainty index (Twitter-EPU). The creators of the index used the Twitter API to extract tweets containing keywords related to uncertainty ("uncertain", "uncertainly", "uncertainties", "uncertainty") and economy ("economic", "economical", "economically", "economics", "economies", "economist", "economists", "economy"). We use the index consisting of the total number of daily tweets containing inflections of the words uncertainty and economy (please consult https://www.policyuncertainty.com/twitter\_uncert.html for further details of the index, accessed on 15 January 2021). The risk aversion category considers the financial proxy to risk aversion and economic uncertainty proposed as a utility-based aversion coefficient [10]. A remarkable feature of the index is that in early 2020, it reacted more strongly to the new COVID-19 infectious cases than did a standard uncertainty proxy.

As complementary drivers, we include a set of highly capitalized cryptocurrencies and a heterogeneous portfolio of financial indices. Specifically, Ethereum (ETH), Litecoin (LTC), Ripple (XRP), Dogecoin (DOGE), and the stablecoin Tether are obtained from Yahoo Finance (https://finance.yahoo.com/, accessed on 15 January 2021). The components of the heterogeneous portfolio are listed in Table 1 and include the Chicago Board Options Exchange's CBOE Volatility Index (VIX). This last information was extracted from Bloomberg (https://www.bloomberg.com/, accessed on 15 January 2021). It is important to point out that risk aversion and the financial indices have no observations on weekends. To obtain a complete database, the Friday values were repeated as a proxy for Saturday and Sunday. Then, the log return of the resulting time series is calculated. This last transformation was also applied to Twitter-EPU and the cryptocurrencies. The complete dataset can be found in the Supplementary Material.
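The weekend imputation described above can be sketched as follows. The function name `fill_weekends` and the sample quotes are hypothetical, and the sketch assumes that the series is stored as a mapping from trading dates to values; it copies only Friday values and does not deal with other non-trading days such as holidays.

```python
from datetime import date, timedelta


def fill_weekends(weekday_values):
    """Copy each Friday value onto the following Saturday and Sunday."""
    filled = dict(weekday_values)
    for day, value in weekday_values.items():
        if day.weekday() == 4:  # Monday is 0, so Friday is 4
            for offset in (1, 2):
                # setdefault keeps any value already present for that day.
                filled.setdefault(day + timedelta(days=offset), value)
    return dict(sorted(filled.items()))


# Hypothetical index quotes; 6 March 2020 was a Friday.
quotes = {
    date(2020, 3, 5): 10.0,
    date(2020, 3, 6): 12.0,
    date(2020, 3, 9): 15.0,
}
filled = fill_weekends(quotes)
```

A direct consequence of this choice is that the log returns of the imputed series are exactly zero on Saturdays and Sundays.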

Usually, the econophysics and data science approaches share the perspective of observing data first and then modelling the phenomena of interest. In this spirit, and with the intention of gaining intuition on the problem, the standardized time series (target and potential drivers), as well as the cumulative returns of the selected cryptocurrencies and financial assets, are plotted in Figures 2 and 3. The former figure shows high volatility in almost all the studied time series around March 2020, which might be due to the declaration of the pandemic by the World Health Organization (WHO) and the consequent fall of the world's main stock markets. The latter figure shows that the best overall cumulative gains correspond to BTC, ETH, LTC, XRP, DOGE, and Tesla. It is worth noting that the only asset with a profit comparable to that of the cryptocurrencies is Tesla, which reaches high cumulative returns starting at the end of 2019 and strengthens its uptrend immediately after the announcement of the worldwide health emergency.

**Figure 2.** Standardized time series after preprocessing, as explained in the main text.

**Figure 3.** Cumulative returns of the selected cryptocurrencies and financial assets. The y-axis is on a logarithmic scale and starts at one, so that the values can be interpreted financially as gains.
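One way to obtain a cumulative return series that starts at one is to exponentiate the running sum of the log returns. The sketch below assumes this definition, which is not spelled out in the text; the function name `cumulative_gains` and the sample returns are hypothetical.

```python
import math
from itertools import accumulate


def cumulative_gains(log_rets):
    """Growth of one unit invested at the start: exp of the running sum."""
    return [1.0] + [math.exp(total) for total in accumulate(log_rets)]


# Hypothetical daily log returns: +10%, -50%, +100% in simple terms.
log_rets = [math.log(1.10), math.log(0.50), math.log(2.00)]
gains = cumulative_gains(log_rets)
```

Under this convention, a final value above one means that the asset gained over the period, and a value below one means that it lost.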

Furthermore, Figure 4 shows the heatmap of the correlation matrix of the preprocessed dataset. We can observe the formation of certain clusters, such as cryptocurrencies, metals, energy, and financial indices, which tells us about the heterogeneity of the data. It should also be noted that the VIX volatility index is anti-correlated with most of the variables.

**Figure 4.** Correlation matrix of the preprocessed time series.

Additionally, the main statistical descriptors of the data are presented in Table 2. The first column lists the variable names or tickers. The subsequent columns report the mean, standard deviation, skewness, kurtosis, Jarque-Bera test statistic (JB), and the associated *p* value of the test for each variable, i.e., the target and the potential drivers. None of the time series passes the normality test, and most of them present a high kurtosis, which is indicative of heavy-tailed behaviour. Finally, stationarity was checked with the Dickey-Fuller and Phillips-Perron unit root tests, and all variables pass both tests.
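For reference, the Jarque-Bera statistic combines the sample skewness S and kurtosis K as JB = (n/6) (S^2 + (K - 3)^2 / 4), and under normality it is asymptotically chi-square distributed with two degrees of freedom, whose upper tail probability is exp(-JB/2). The following is a minimal sketch with a hypothetical function name and toy sample; it uses the biased moment estimators and the raw (not excess) kurtosis, and the text does not state which convention Table 2 follows.

```python
import math


def jarque_bera(sample):
    """Return (skewness, kurtosis, JB statistic, asymptotic p value)."""
    n = len(sample)
    mu = sum(sample) / n
    # Central moments with divisor n (biased estimators).
    m2 = sum((x - mu) ** 2 for x in sample) / n
    m3 = sum((x - mu) ** 3 for x in sample) / n
    m4 = sum((x - mu) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return skewness, kurtosis, jb, p_value


# Toy sample, far too short for the asymptotic p value to be reliable.
skewness, kurtosis, jb, p_value = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])
```

A small *p* value leads to rejecting normality; with series of the length used here, a high kurtosis alone is enough to produce a very large JB statistic.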


**Table 2.** The symbols \*\* and \*\*\* denote significance at the 5% and 1% levels, respectively.
