In order to analyze the informational content in the COVID-19 series of data, we have chosen entropy and statistical complexity as information quantifiers. We focus mainly on entropy; however, in certain cases, we also show the complexity measure and justify its usefulness. As mentioned before, we used as a source the data on the coronavirus disease pandemic collected in Our World in Data. We have implemented our approach for every country as published in that survey; however, for brevity, we have chosen only to show a reduced set of representative examples.
We have computed the entropic and complexity quantifiers introduced in
Section 1, with the probability distribution functions determined from the series of data using the two methodologies recalled in
Section 3: the Bandt–Pompe permutation method and a wavelet analysis. For the permutation method, the calculations have been carried out considering different time intervals, with various lengths, determined by different starting and ending dates. We have taken
, but we have verified consistency comparing with the results corresponding to
. In addition, we have assumed
as is generally done in the literature [
28]. We have also used other values of this parameter, observing that the results are consistent.
As the entropies are normalized, when their values are closer to 1, less information will be available. We will naturally consider that values greater than show a tendency to disorder.
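As an illustration of the permutation quantifier used here, the following is a minimal sketch of the normalized Bandt–Pompe permutation entropy. The function names, the default embedding dimension `d=3`, the delay `tau=1`, and the tie-breaking rule are assumptions made for the example only; they are not necessarily the values or conventions used in our computations.

```python
import math
from collections import Counter


def ordinal_distribution(series, d=3, tau=1):
    """Relative frequencies of the ordinal patterns found in `series`.

    d (embedding dimension) and tau (delay) are illustrative defaults.
    """
    n_patterns = len(series) - (d - 1) * tau
    if n_patterns <= 0:
        raise ValueError("series too short for the chosen d and tau")
    counts = Counter()
    for i in range(n_patterns):
        window = [series[i + k * tau] for k in range(d)]
        # The pattern is the index order that sorts the window. sorted() is
        # stable, so ties are broken by position (an assumed convention).
        pattern = tuple(sorted(range(d), key=lambda k: window[k]))
        counts[pattern] += 1
    return [c / n_patterns for c in counts.values()]


def normalized_permutation_entropy(series, d=3, tau=1):
    """Shannon entropy of the ordinal distribution divided by log(d!)."""
    probs = ordinal_distribution(series, d, tau)
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(math.factorial(d))
```

A strictly monotone series produces a single ordinal pattern and therefore zero entropy, while a series in which all `d!` patterns are equally frequent gives the maximum value 1.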
4.1. Increasing Intervals
In this article, we propose for the first time (as far as we know) to use increasing time intervals. In addition to being a possible solution to the lack of data, our goal is to provide clarity on the global structure of the pandemic dynamics. We do so through the study of the relationship between entropy and statistical complexity as time goes by and the pandemic evolves. First of all, from Equations (
2) and (
4), we define time-dependent normalized entropy and statistical complexity,
and
, respectively. These functions are defined as the entropy and complexity calculated for the interval
, where
is kept fixed and
t is variable. We take consecutive intervals separated by one day. By construction, these functions represent the entropy and complexity for intervals with an increasing number of data points. Their usefulness is explained in what follows. As for the choice of the starting date for our study, there are several possibilities. One option would be to start from the detection of the first cases of infection with the SARS-CoV-2 virus in China, considering series of the same length for all countries, but then the series for the remaining countries would contain zero cases over certain intervals. What seems more convenient for the application of our methodologies is to take an initial date for which all countries have a sufficient amount of previous data. Thus, we have chosen 11 March 2020, the day the World Health Organization declared COVID-19 a pandemic, as the starting date. This day corresponds to the first
t-value in all our computations, and we will consider data until 13 July 2021. Obviously, the total interval considered is very large, and several waves are included, but we will not emphasize these phenomena.
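The increasing-interval construction described above can be sketched as follows. This is only an illustrative sketch: the helper `perm_entropy`, the parameters `d=3` and `tau=1`, and the minimum length `min_len` are assumptions of the example, and the series is taken to start at the fixed initial date.

```python
import math
from collections import Counter


def perm_entropy(x, d=3, tau=1):
    """Compact normalized permutation entropy (illustrative d and tau)."""
    n = len(x) - (d - 1) * tau
    counts = Counter(
        tuple(sorted(range(d), key=lambda k: x[i + k * tau])) for i in range(n)
    )
    h = -sum(c / n * math.log(c / n) for c in counts.values())
    return h / math.log(math.factorial(d))


def entropy_over_increasing_intervals(series, min_len=10, d=3, tau=1):
    """Return [(t, H(t))], where H(t) is computed on the first t points.

    The left end of the interval stays fixed at the first point and the
    right end advances one data point (one day) at a time.
    """
    if min_len <= (d - 1) * tau:
        raise ValueError("min_len must exceed (d - 1) * tau")
    return [
        (t, perm_entropy(series[:t], d, tau))
        for t in range(min_len, len(series) + 1)
    ]
```

Each entry of the returned list corresponds to one point of a time-dependent entropy curve; the complexity curve can be built in the same way by replacing the quantifier.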
In
Figure 1, we show the permutation entropy and statistical complexity for some chosen countries that provide a representative synthesis of the behaviors we found for all other nations. In addition, we depict the behavior for the world as a whole (the interested reader can find more information at the end of the article, where we show a larger set of countries, a total of 20). The graphs obtained for France (FRA) and the world are typical of most developed and developing countries. Australia (AUS), China (CHN), and some other countries exhibit a somewhat atypical behavior and will be discussed in detail separately. Notice that we use ISO-3166 alpha-3 country codes; for details, see
https://www.iso.org/iso-3166-country-codes.html (accessed on 4 August 2021). In
Figure 1 we plot the time dependence of the normalized permutation entropy
and of the statistical complexity
. This is carried out for the statistics of daily new confirmed infected cases and daily new deaths.
In
Figure 1 it can be seen that, about 60 days after 11 March 2020, the entropy curves end their fast growth and begin to flatten (a similar conclusion can be reached by studying other countries, as will be shown below). This makes it reasonable to use this point to determine the minimum length
N, for the following method, where intervals of equal length are used. We take intervals with
data. This particular number is chosen for the sake of comparison with wavelet results.
The first important characteristic we observe in all cases shown is a high entropy value for , regarding infections as well as deaths. We observe that and increases in time to values even very close to the maximum value 1, at least in 2020. These results exhibit a marked lack of information, i.e., a high degree of randomness. Across the world, this phenomenon is observed. Perhaps, in some cases, it might be related to the capability of collecting robust statistical data. However, it might also reflect differences in the dynamics of the spread of COVID-19.
Another peculiarity that we can conclude by looking at the entropy plots is that the general trend is: more data imply less information, i.e., is always increasing. This fact would indicate the difficulty of predictive mathematical methods that could be used, even in the case of those models with stochastic components. Let us remember that our analyses are based only on the data series of infected and deceased people. Of course, if concrete information is supplemented (such as mobility data, contagion factor, etc.), many inferences could be made for decision-making policies. However, our point here is that the high entropy values computed reflect the fact that it was difficult to infer the dynamics of the coronavirus disease by just looking at the considered curves, and that modeling such dynamics was a challenge, as reflected in some works from the analyzed period.
Another possible use of this type of analysis is to assess the fidelity of the original data series. Common sense suggests that the data on deaths will be more reliable than those on infected cases, which means lower entropy for death cases. We can use the two available curves, for infected cases (IC) and death cases (DC), to obtain more information about data handling in each country. For our conclusions, we assume good faith in the data reporting. In general, we expect the IC curves to lie above the DC curves when data collection is correct, with two situations to distinguish: (1) The gap between the curves matters. A greater distance between the curves can indicate either an incorrect collection of infected cases or a good collection of data on deaths; in the context studied, we interpret it as implying a good collection of data on deaths. (2) If the DC curve is slightly above or below the IC curve, or coincides with it in parts, testing can be considered likely adequate.
Most developed and developing countries show similar IC and DC curves, which are also similar to those for the world taken as a whole (corresponding to situation 2, with small differences between the values of the IC and DC curves).
It is seen from the plots that Australia, China, Cuba, and New Zealand (NZL) exhibit a different behavior between the curves corresponding to infected and death cases, starting from approximately the day (12 May 2020), that is, from the value of N (or t) for which we consider the method reliable. The normalized entropy corresponding to the DC curves is below that of the IC ones, by a factor of around 0.3. Therefore, DC curves contain more information than IC ones. Given this marked gap between the curves, these countries correspond to situation 1. Additionally, the DC entropy curve for Australia has decreased monotonically since October 2020, which is an improvement in the information level of the pandemic. This behavior coincides with a decrease observed in the same period in the original series of data on deceased people. Finally, we observe that Congo shows marked differences from the mentioned countries, which may suggest that the data were not correctly recorded in that case. This happens not only for this country but also for some others in Central America or Central Africa, and for small islands.
The curves for complexity follow the growth of the entropies until the complexity reaches its absolute maximum; then they decrease, moving away from the entropy curves towards small values that must also be associated with randomness. This change in behavior occurs around (3 April 2020). The effect is produced because the disequilibrium Q compensates the growth of the entropy up to that point, but then it is overcome. The growth of is noticeable and then it cannot compete. Perhaps one might think that these intervals (and intervals with the same length), where both quantifiers are competitive, are the only reliable ones for making predictions. Similar plots are obtained starting from any day as long as the interval has points.
The results obtained for are all confirmed by those of complexity . Moreover, it can be said that, in this problem, the complexity is virtually determined by the behavior of the entropy, with such large values and with such a speed of growth. For this reason, from here on, we concentrate on the plots for entropies.
Thus, for the countries with typical behavior, the values are very small , while in those special ones, it grows and decreases according to the entropy decrease and growth, showing, in some cases, some degree of complexity.
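To make the competition between entropy and disequilibrium concrete, the following sketch computes a statistical complexity as the product of the normalized Shannon entropy and a normalized Jensen–Shannon disequilibrium with respect to the uniform distribution. This particular form, and the function names, are assumptions of the example, since the defining equations are not reproduced in this section; the input must be the full probability vector, with zeros for ordinal patterns that never occur.

```python
import math


def shannon(p):
    """Shannon entropy (natural log), skipping zero probabilities."""
    return -sum(x * math.log(x) for x in p if x > 0)


def entropy_disequilibrium_complexity(probs):
    """Return (H, Q, C) for a probability vector of length n >= 2.

    H: normalized Shannon entropy.
    Q: Jensen-Shannon divergence to the uniform distribution, scaled so
       that a fully concentrated (delta) distribution gives Q = 1.
    C: product Q * H (assumed form of the statistical complexity).
    """
    n = len(probs)
    uniform = [1.0 / n] * n
    h_norm = shannon(probs) / math.log(n)
    mix = [(p + u) / 2 for p, u in zip(probs, uniform)]
    js = shannon(mix) - shannon(probs) / 2 - shannon(uniform) / 2
    # Normalization constant: inverse of the maximum possible divergence.
    q0 = -2.0 / (
        ((n + 1) / n) * math.log(n + 1) - 2 * math.log(2 * n) + math.log(n)
    )
    q = q0 * js
    return h_norm, q, q * h_norm
```

With this form, the complexity vanishes both for a fully ordered distribution (zero entropy) and for the uniform one (zero disequilibrium), which is consistent with the small complexity values found when the entropy approaches 1.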
We notice the following remarkable facts: (1) most of the countries that appear in the OWID database (which we have examined) show the same characteristics listed above; (2) the curves corresponding to the daily infected data are similar, despite the different geographical and cultural characteristics, different seasons, and applied health policies; and (3) something similar happens with the daily deaths curves, except in the special countries to which we have referred.
This makes us think that the monotonically increasing curves are a representation of the intrinsic or inherent form of spread of the SARS-CoV-2 virus, in terms of entropy (mainly IC curves). Therefore, we could choose, for example, the curve corresponding to the whole world, shown in
Figure 1, as the model.
The deviations of a country's curve from the monotonically increasing behavior coincide with a decrease in the cases publicly communicated by the respective government and registered in the OWID database.
Naturally, it has been thought that the decrease in cases corresponds to effective health policies applied by the governments (among other causes) in the period before the mass use of anti-COVID vaccines.
4.2. Rolling Windows
The golden rule in series analysis is to compare series of the same interval length, but as mentioned before, there was not enough data available for the period considered to draw reliable conclusions. We consider here another way to see the development of the quantifiers as a function of time. Unlike the treatment given in
Section 4.1, we now keep the interval length constant, but the intervals overlap and share data. We do this by using so-called rolling or sliding windows. A similar analysis is conducted in [
8,
9], for time series of financial data using permutation entropy.
Here, we employ the rolling windows method for both the permutation method and the wavelet transform. We will consider intervals (windows) of three different lengths. As we want to carry out a comparative analysis between the permutation and wavelet entropies, these lengths will be $N = 64$, $N = 128$, and $N = 256$, which arise because wavelet analysis requires intervals of length $2^j$, with $j = 6, 7, 8$ (see Section 3.2). In the wavelet decomposition of these series, the maximum number of scales was considered (corresponding to 6, 7, and 8 scales, respectively) and the detail coefficients were used to calculate the wavelet energy.
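As a sketch of how a wavelet entropy can be obtained from the detail coefficients of a full decomposition, the example below uses the Haar wavelet written in plain Python. The choice of the Haar family, the function names, and the convention for constant windows are assumptions of the example; they are not necessarily those used for the figures.

```python
import math


def haar_detail_energies(window):
    """Energy of the Haar detail coefficients at each scale.

    The window length must be a power of two (at least 4); a window of
    length 2**j yields j scales.
    """
    n = len(window)
    if n < 4 or n & (n - 1):
        raise ValueError("window length must be a power of two, at least 4")
    approx = list(window)
    energies = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        energies.append(sum(c * c for c in detail))
    return energies


def normalized_wavelet_entropy(window):
    """Shannon entropy of the relative energies per scale, scaled to [0, 1]."""
    energies = haar_detail_energies(window)
    total = sum(energies)
    if total == 0:
        return 0.0  # constant window: convention assumed for this sketch
    probs = [e / total for e in energies]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(energies))
```

A window of 64 points gives 6 scales, so the entropy is normalized by log 6; a window whose energy is concentrated in a single scale gives an entropy of zero.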
The analysis will be conducted only for the entropies (
2). We will proceed to take consecutive windows
, with
,
being the maximum number of intervals that fit in the total number of data of the original series, according to the value of
N. Thus, we will obtain values for the entropies of the considered windows. Of course, we can associate these windows with the corresponding days-dates. We will take as the first day 11 March 2020 and we will consider data until 13 July 2021.
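The rolling-window procedure can be sketched generically as below. The function name, the shift of one day between consecutive windows, and the choice of labelling each window with the date of its last point are assumptions of the example; any quantifier, such as a permutation or wavelet entropy function, can be passed as `quantifier`.

```python
from datetime import date, timedelta


def rolling_quantifier(series, window_len, quantifier, start=date(2020, 3, 11)):
    """Apply `quantifier` to every window of `window_len` consecutive points.

    Windows are shifted by one point (assumed here), which gives
    len(series) - window_len + 1 windows. Each value is paired with the
    date of the last point of its window, counting from `start`.
    """
    n_windows = len(series) - window_len + 1
    if n_windows < 1:
        raise ValueError("series shorter than the window length")
    return [
        (
            start + timedelta(days=i + window_len - 1),
            quantifier(series[i:i + window_len]),
        )
        for i in range(n_windows)
    ]


# Number of daily data points between the first and last dates considered.
n_days = (date(2021, 7, 13) - date(2020, 3, 11)).days + 1
```

Under these assumptions, the number of windows decreases as the window length grows, so the curves for longer windows contain fewer points.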
Representative examples of the results are shown in the following plots. In
Figure 2,
Figure 3 and
Figure 4, we depict the IC and DC curves of wavelet and permutation entropies, with rolling windows of different lengths. The figures correspond, respectively, to the United States of America (USA) with
, Brazil (BRA) with
, and Australia with
.
The first thing to observe in these figures is a difference in shape with respect to the curves corresponding to increasing intervals. This happens not only with wavelet entropies, as one might first expect, but also with permutation ones. Of course, an information quantifier based on the permutation methodology does not, in general, have to give the same result as the corresponding quantifier evaluated using the wavelet transform. The first method takes account of causality, while the second provides a general representation in time and frequency. What one expects to find is coherence between both methodologies in the main characteristics of the problem. On the other hand, we have observed that, if we look at the figures corresponding to increasing intervals, but from the value 64 onwards, they are consistent with those of rolling windows with
, which represent a large number of points, as can be seen in
Figure 2, thus yielding a good result.
The differences between the curves for wavelet and permutation entropies can be observed in
Figure 2,
Figure 3,
Figure 4,
Figure 5 and
Figure 6. We can see that the wavelet entropy values are in general lower than the permutation ones, although they are also very high. This is the main coherence that interests us in this article.
We note in
Figure 2,
Figure 3 and
Figure 4 a distinction between the IC and DC curves, for both wavelet and permutation entropies. It can also be observed that the fluctuations increase when
N decreases and that they are greater in the wavelet framework. The curves for USA in
Figure 2 (
) appear to show a slight but constant increase in information for the wavelet entropy, a result that is not accompanied by the permutation entropy. Once again, Australia shows an increase in information since the end of 2020, reaching zero when
is taken for the length of the rolling windows (see
Figure 4).
We exhibit in
Figure 5 and
Figure 6 a series of results for a set of 20 chosen countries (they are identified with their ISO-3166 alpha-3 code and appear in the same order—alphabetic by name—as in
Figure 7). In
Figure 5, which corresponds to the wavelet analysis, we see that entropy mean values are greater than 0.7 for both infected and death cases, except for some particular countries. Standard deviations remain below 0.2. For the permutation procedure, shown in
Figure 6, we see that, for infected cases, the entropy mean values remain above 0.8 and mostly close to 1. In the DC curves, we observe that the mean values are grouped close to their maximum value, except for Australia, China, Cuba, and New Zealand, showing standard deviations greater than those of all the other countries studied. Both IC and DC curves have the characteristic of maintaining the relative differences between their values when
N is changed.
All of our analysis leads to the expected result, and in a very marked way: the curves corresponding to daily deaths may contain much more information than those for infected cases. Australia and New Zealand are good examples of this. According to what is observed in the registered cases, the DC curves would provide a better description of the pandemic.
Finally, we must mention Israel, a country for which, in
Figure 7, there is no increase in information towards the last days considered (mid-July), when there was a drastic decrease in mortality. However, in the graphs corresponding to
calculated with the rolling windows method, a significant increase in information is observed for a short time, until a decrease is noted again, probably due to a new virus variant. This could also happen for other countries and is a consequence of the large number of points accumulated in the increasing intervals of
Figure 7.