1. Introduction
Complex dynamical systems consisting of nonlinearly coupled subsystems can be found in many application areas ranging from biomedicine [1] to engineering [2,3]. Teasing apart the subsystems and identifying and characterizing their interactions from observations of the system’s behavior can be extremely difficult, depending on the magnitude and nature of the coupling and the number of variables involved. In fact, the identification of a subsystem can be an ill-posed problem, since the definition of strong or weak coupling is necessarily subjective.
The direction of the coupling between two variables is often thought of in terms of one variable driving another so that the values of one variable at a given time influence the future values of the other. This is a simplistic view based in part on our predilection for linear or “intuitively understandable” systems. In nonlinear systems, there may be mutual coupling across a range of temporal and spatial scales so that it is impossible to describe one variable as driving another without specifying the temporal and spatial scale to be considered.
Even in situations where one can unambiguously describe one variable as driving another, inferring the actual nature of the coupling between two variables from data can still be misleading, since co-varying variables could reflect either a situation where one variable drives another with a time delay or a situation where both variables are driven by an unknown third variable, each with a different time delay. (We use the term co-relation to describe a relationship between the dynamics of two variables; this is to be distinguished from correlation, which technically refers only to a second-order statistical relationship.) While co-relation cannot imply causality [4], one cannot have causality without co-relation. Thus co-relation can serve as a useful index for a potential causal interaction.
However, if past values of one variable enable one to predict future values of another variable, then this can be extremely useful despite the fact that the relationship may not be strictly causal. The majority of tests to identify and quantify co-relation depend on statistical tests that quantify the amount of information that one variable provides about another. The most common of these are based on linear techniques, which rely exclusively on second-order statistics, such as correlation analysis and Principal Component Analysis (PCA), known in geophysical studies as Empirical Orthogonal Functions (EOFs) [5]. However, these techniques are insensitive to higher-order nonlinear interactions, which can dominate the behavior of a complex coupled dynamical system. In addition, such linear methods are generally applied after normalizing the data, which implies that they do not depend on scaling effects.
Information-theoretic techniques rely on directly estimating the amount of information contained in a dataset and, as such, rely not only on second-order statistics, but also on statistics of higher orders [6]. Perhaps most familiar is the Mutual Information (MI), which quantifies the amount of information that one variable provides about another variable. Thus MI can quantify the degree to which two variables co-relate. However, since it is a symmetric measure, MI cannot distinguish the potential directionality, or causality, of the coupling between variables [7].
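As a concrete illustration of the symmetry of MI, consider a minimal plug-in estimate built from a fixed-bin two-dimensional histogram. This sketch is our own (the function name, bin count and toy data are not from the paper), and it is exactly the kind of naive estimator whose bias issues later sections discuss:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Plug-in MI estimate (nats) from a fixed-bin 2-D histogram.

    A naive estimator: its bias grows as the bin count increases
    relative to the sample size.
    """
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                       # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y
    nz = pxy > 0                           # skip empty cells (log 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = pxy / (px * py)
    return float(np.sum(pxy[nz] * np.log(ratio[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + 0.5 * rng.normal(size=5000)        # strongly co-related with x
z = rng.normal(size=5000)                  # independent of x
print(mutual_information(x, y), mutual_information(x, z))
```

Note that `mutual_information(x, y)` and `mutual_information(y, x)` are identical by construction, which is precisely why MI alone cannot resolve directionality.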
The problem of finding a measure that is sensitive to the directionality of the flow of information has been widely explored. Granger Causality [8] was introduced to quantify directional coupling between variables. However, it is based on second-order statistics, and as such it focuses on correlation, which constrains its relevance to linear systems. For this reason, generalizations to quantify nonlinear interactions between bivariate time series have been studied [9]. Schreiber proposed an information-theoretic measure called Transfer Entropy (TE) [7], which can be used to detect the directionality of the flow of information. Transfer Entropy, along with other information-based approaches, is included in the survey paper by Hlavackova-Schindler et al. [10], and the differentiation between information transfer and causal effects is discussed by Lizier and Prokopenko [11]. Kleeman presented both TE and time-lagged MI as applied to ensemble weather prediction [12]. In [13], Liang explored the information flow in dynamical systems that can be modeled by equations obtained from the underlying physical concepts. In such cases, the information flow has been analyzed through the evolution of the joint probability distributions, using the Liouville equations and the Fokker-Planck equations in the cases of deterministic and stochastic systems, respectively [13].
TE has been applied in many areas of science and engineering, such as neuroscience [1,14], structural engineering [2,3], complex dynamical systems [15,16] and environmental engineering [17,18]. In each of these cases, different approaches were used to estimate TE from the respective datasets. TE essentially quantifies the degree to which past information from one variable provides information about future values of the other variable, based solely on the data, without assuming any model of the dynamical relation between the variables or subsystems. In this sense, TE is a non-parametric method. The dependency of the current sample of a time series on its past values is formulated in Schreiber [7] in terms of kth- and lth-order Markov processes to emphasize the fact that the current sample depends only on its own k past values and the other process’s l past values. There also exist parametric approaches in which the spatio-temporal evolution of the dynamical system is explicitly modeled [15,16]. However, in many applications it is precisely this model that we would like to infer from the data. For this reason, we will focus on non-parametric methods.
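For the lowest-order case k = l = 1, the TE from Y to X can be written as a combination of four Shannon entropies, TE_YX = H(x_next, x) + H(x, y) − H(x) − H(x_next, x, y). The sketch below is our own naive fixed-bin illustration of that decomposition (the bin count and the toy coupled pair are assumptions, not the paper’s data):

```python
import numpy as np

def transfer_entropy(x, y, bins=8):
    """Plug-in estimate of TE from y to x for k = l = 1, in nats.

    Uses the Shannon-entropy decomposition
    TE_YX = H(x_next, x) + H(x, y) - H(x) - H(x_next, x, y),
    with every entropy taken from a fixed-bin histogram.
    """
    def H(*cols):
        counts, _ = np.histogramdd(np.column_stack(cols), bins=bins)
        p = counts.ravel() / counts.sum()
        p = p[p > 0]                       # empty cells contribute nothing
        return -np.sum(p * np.log(p))

    x_next, x_now, y_now = x[1:], x[:-1], y[:-1]
    return H(x_next, x_now) + H(x_now, y_now) - H(x_now) - H(x_next, x_now, y_now)

# toy pair: y drives x with a one-step delay; no coupling in reverse
rng = np.random.default_rng(1)
n = 5000
y = rng.normal(size=n)
x = np.zeros(n)
x[1:] = 0.8 * y[:-1] + 0.2 * rng.normal(size=n - 1)
print(transfer_entropy(x, y), transfer_entropy(y, x))
```

Unlike MI, the two calls are not symmetric: the estimate in the driving direction is substantially larger than in the reverse direction, which is zero analytically here.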
Kaiser and Schreiber [19], Knuth et al. [20], and Ruddell and Kumar [17,18] have expressed the TE as a sum of Shannon entropies [21]. In [17,18], individual entropy terms were estimated from the data using histograms, with bin numbers chosen using a graphical method. However, as we discuss in Appendix AI, TE estimates are sensitive to the number of bins used to form the histogram. Unfortunately, it is not clear how to select the number of bins in order to optimize the TE estimate.
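The sensitivity to the bin count is easy to reproduce. In this sketch (the bin counts and sample size are our own choices), the differential entropy of the same Gaussian sample is estimated with different numbers of fixed-width bins, and the estimates disagree noticeably:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=2000)    # true differential entropy: 0.5*ln(2*pi*e) ~ 1.419

estimates = {}
for bins in (4, 16, 64, 256):
    counts, edges = np.histogram(data, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                               # drop empty bins before the log
    width = edges[1] - edges[0]
    # differential-entropy estimate: discrete entropy plus log bin width
    estimates[bins] = -np.sum(p * np.log(p)) + np.log(width)
print(estimates)
```

Coarse binning smooths away structure, while very fine binning leaves many sparsely populated bins and biases the estimate; since TE is a combination of several such entropy terms, these errors compound.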
In the literature, various techniques have been proposed to efficiently estimate information-theoretic quantities, such as the entropy and MI. Knuth [22] proposed a Bayesian approach, implemented in Matlab and Python and known as the Knuth method, to estimate probability distributions using a piecewise-constant model incorporating the optimal bin-width estimated from data. Wolpert and Wolf [23] provided a successful Bayesian approach to estimate the mean and the variance of the entropy from data. Nemenman et al. [24] utilized a prior based on a mixture of Dirichlet distributions in their Bayesian Nemenman, Shafee, and Bialek (NSB) entropy estimator. In another study, Kaiser and Schreiber [19] give different expressions for TE as summations and subtractions of various (conditional/marginal/joint) Shannon entropies and MI terms. However, it has been pointed out that summation and subtraction of information-theoretic quantities can result in large biases [25,26]. Prichard and Theiler [25] discuss the “bias correction” formula proposed by Grassberger [27] and conclude that it is better to estimate MI utilizing a “correlation integral” method by performing a kernel density estimation (KDE) of the underlying probability density functions (pdfs). KDE tends to produce a smoother pdf estimate from data points than its histogram counterpart. In this method, a preselected distribution of values around each data point is averaged to obtain an overall, smoother pdf over the data range. This preselected distribution of values within a certain range, known as a “kernel”, can be thought of as a window with a bandwidth [28]. Commonly-used examples include the “Epanechnikov”, “Rectangular”, and “Gaussian” kernels. Prichard and Theiler showed that pdf models obtained by KDE can be utilized to estimate the entropy [25] and other information-theoretic quantities, such as the generalized entropy and the Time-Lagged Mutual Information (TLMI), using the correlation integral and its approximation through correlation sums [7]. In [25], Prichard and Theiler demonstrated that the utilization of correlation integrals corresponds to using a kernel that is far from optimal, also known as the “naïve estimator” described in [28]. They also showed that the relationship between the correlation integral and information-theoretic statistics allows defining “local” versions of many information-theoretic quantities. Based on these concepts, Prichard and Theiler demonstrated the interactions among the components of a three-dimensional chaotic Lorenz model with a fractal nature [25]. The predictability of dynamical systems, including the same Lorenz model, has been explored by Kleeman in [29,30], where a practical approach for estimating the entropy of dynamical systems with non-integral information dimension was developed.
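To make the KDE idea concrete, the sketch below estimates entropy by resubstitution: a smooth pdf model is fit to the sample and evaluated back at the sample points. This is our own minimal illustration; it uses a Gaussian kernel with Scott’s-rule bandwidth (scipy’s default), not a choice made in the paper:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_entropy(samples):
    """Resubstitution entropy estimate (nats) from a Gaussian KDE.

    The bandwidth is chosen by Scott's rule inside gaussian_kde;
    other kernels (Epanechnikov, rectangular) behave similarly.
    """
    kde = gaussian_kde(samples)          # smooth pdf model of the data
    densities = kde(samples)             # pdf evaluated at each sample
    return float(-np.mean(np.log(densities)))

rng = np.random.default_rng(3)
sample = rng.normal(size=2000)
print(kde_entropy(sample))               # analytic value: 0.5*ln(2*pi*e) ~ 1.419
```

Because the kernel smooths the density, the estimate varies much less with the bandwidth than a histogram estimate varies with its bin count, although the bandwidth (radius) must still be chosen sensibly.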
In the estimation of information-theoretic quantities, the KDE approach requires the selection of an appropriate radius (aka bandwidth, or rectangular kernel width) for the estimation of the correlation integral. In general, this can be accomplished by the Grassberger-Procaccia algorithm, as in [31–33]. In order to compute the TE from data using a KDE of the pdf, Sabesan and colleagues proposed a methodology to explore an appropriate region of radius values to be utilized in the estimation of the correlation sum [14].
The TE can be expressed as the difference between two relevant MI terms [19], which can be computed by several efficient MI estimation techniques using variable bin-width histograms. Fraser and Swinney [34] and Darbellay and Vajda [35] proposed adaptive partitioning of the observation space to estimate histograms with variable bin-widths, thereby increasing the accuracy of MI estimation. However, problems can arise due to the subtraction of the two MI terms, as described in [19] and explained in [25,26].
Another adaptive and more data-efficient method was developed by Kraskov et al. [36], where MI estimation is based on k-nearest-neighbor distances. This technique estimates smooth probability densities from the distances between each data sample and its k-th nearest neighbor and applies a bias correction to estimate MI. It has been demonstrated [36] that no fine tuning of specific parameters is necessary, unlike the case of the adaptive partitioning method of Darbellay and Vajda [35], and the efficiency of the method has been shown for Gaussian and three other non-Gaussian distributed datasets. Herrero et al. extended this technique to TE in [37], and, due to its advantages, it has been utilized in many applications where TE is estimated [38–40].
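The first Kraskov et al. estimator can be sketched in a few lines. This is our own brute-force O(N²) simplification for clarity (the cited implementations use k-d trees for the neighbor searches and may differ in detail):

```python
import numpy as np
from scipy.special import digamma

def ksg_mi(x, y, k=4):
    """Kraskov-Stoegbauer-Grassberger MI estimate (algorithm 1), in nats.

    Brute-force pairwise distance matrices for clarity; a production
    implementation would use a k-d tree for the neighbor searches.
    """
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])        # pairwise |x_i - x_j|
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)                     # Chebyshev metric in (x, y)
    np.fill_diagonal(dz, np.inf)                # exclude self-distances
    eps = np.sort(dz, axis=1)[:, k - 1]         # distance to the k-th neighbor
    nx = np.sum(dx < eps[:, None], axis=1) - 1  # marginal neighbors within eps
    ny = np.sum(dy < eps[:, None], axis=1) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

rng = np.random.default_rng(4)
x = rng.normal(size=1000)
y = 0.8 * x + 0.6 * rng.normal(size=1000)       # correlation 0.8; analytic MI ~ 0.51 nats
print(ksg_mi(x, y))
```

The estimate adapts to the local data density through the k-th neighbor distances, which is why no histogram-style binning parameter needs to be tuned.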
We note that a majority of the proposed approaches to estimate TE rely on specific parameters that have to be selected prior to applying the procedure. However, there are no clear prescriptions available for picking these ad hoc parameter values, which may differ according to the specific application. Our main contribution is to synthesize three established techniques to be used together to perform TE estimation. With this composite approach, if one of the techniques does not agree with the others in terms of the direction of information flow between the variables, we can conclude that method-specific parameter values have been poorly chosen. We propose using three methods to validate the conclusions drawn about the directions of the information flow between the variables, as we generally do not possess a priori facts about the physical phenomenon we explore.
In this paper, we propose an approach that employs the efficient use of histogram-based methods, the adaptive partitioning technique of Darbellay and Vajda, and KDE-based TE estimation, each of which requires fine tuning of parameters. We also propose a Bayesian approach that estimates the probability distributions from data by selecting the width of the bins in a fixed bin-width histogram method.
In the rest of the paper, we focus on demonstrating how these three established techniques can be used together to perform TE estimation. As the TE estimation based on the k-th nearest-neighbor approach of Kraskov et al. [36] has been demonstrated to be robust to parameter settings, it does not require fine tuning to select parameter values. It has therefore been left for future exploration, as our main goal is to develop a strategy for the selection of parameters in the case of non-robust methods.
The paper is organized as follows. In Section 2, background material is presented on the three TE methods utilized. In Section 3, the performance of each method is demonstrated by applying it to both a linearly coupled autoregressive (AR) model and the Lorenz system equations [41] in both the chaotic and sub-chaotic regimes. The latter represents a simplified model of atmospheric circulation in a convection cell that exhibits attributes of nonlinear coupling, including sensitive dependence on model parameter values that can lead to either periodic or chaotic variations. Finally, conclusions are drawn in Section 4.
3. Experiments
In the preceding section, we described three different methods for estimating the TE from data, namely: the Generalized Knuth method, the adaptive bin-width histogram method and the KDE method. We emphasized that these three methods can yield different TE values, as the TE estimates depend on various factors, such as the selected fixed bin-width, the bias resulting from the subtraction and addition of various Shannon entropies, the embedding dimensions and the chosen KDE radius value. Due to this uncertainty in the TE estimates, we propose to use these three main techniques together to compute the TE values and to consistently identify the direction of relative information flow between two variables. With this approach, if one of the techniques does not agree with the others in terms of the direction of information flow between the variables, we determine that we need to fine tune the relevant parameters until all three methods agree with each other in the estimation of the NetTE direction between each pair of variables. The NetTE between two variables X and Y is defined as the difference of the TE magnitudes in opposite directions between X and Y:

NetTE_XY = TE_XY − TE_YX    (27)

The NetTE allows us to compare the relative values of information flow in both directions and to conclude which flow is larger, giving a sense of the main interaction direction between the two variables X and Y.
In order to use the three methods together, we demonstrate our procedure on a synthetic dataset generated by the bivariate autoregressive model given by Equation (24). In Section 2.3.1, we described the KDE method using this autoregressive model example, and we explored different radius values in the KDE method by utilizing the Grassberger-Procaccia approach in conjunction with different selections of k values. In Section 3.1, we continue demonstrating the results using the same bivariate autoregressive model. We focus on the analysis of the adaptive partitioning and the Generalized Knuth methods. First, we analyze the performance of the adaptive partitioning method at a preferred statistical significance level. Then, we investigate different β values to estimate the optimal fixed bin-width using Equation (15) in the Generalized Knuth method. If a consensus on the information flow direction is not reached among the three methods, we try different values for the fine-tuning parameters until the NetTE directions agree. When each method has been fine-tuned to produce the same NetTE direction, we conclude that the information flow direction has been correctly identified. In Section 3.2, we apply our procedure to explore the information flow among the variables of the nonlinear dynamical system used by Lorenz to model an atmospheric convection cell.
3.1. Linearly-Coupled Bivariate Autoregressive Model
In this section, we apply the adaptive partitioning and the Generalized Knuth methods to estimate the TE between the processes defined by the same bivariate linearly-coupled autoregressive model (with variable coupling values) given by Equation (24). We demonstrate the performance of each TE estimation method by averaging over an ensemble of 10 members. The length of each synthetically generated process is taken to be 1000 samples, after eliminating the first 10,000 samples as the transient. For each method, TE estimates versus the value of the coupling coefficient are shown in Figures 5–7 for both directions between processes X and Y. It should be noted that the process X is coupled to Y through the coefficient c. Thus, there is no information flow from X to Y in this example, i.e., TE_XY = 0 analytically. The analytical values of TE_YX have been obtained using the equations in [19] for k = 1 and l = 1. The performance of the three methods has been compared for the case of k = 1 and l = 1.
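A dataset of this kind can be generated as follows. Since Equation (24) is not reproduced in this section, the lag-1 coefficients and unit-variance innovations below are our own assumptions; only the one-way coupling structure, in which Y drives X through the coefficient c, matches the text:

```python
import numpy as np

def coupled_ar(n=11000, c=0.5, transient=10000, seed=0):
    """Synthetic linearly coupled AR(1) pair in which y drives x.

    Equation (24) is not reproduced in this excerpt, so the lag-1
    coefficients (0.5) and unit-variance innovations below are
    assumptions; only the one-way coupling through c matches the text.
    """
    rng = np.random.default_rng(seed)
    x, y = np.zeros(n), np.zeros(n)
    for t in range(1, n):
        y[t] = 0.5 * y[t - 1] + rng.normal()
        x[t] = 0.5 * x[t - 1] + c * y[t - 1] + rng.normal()
    return x[transient:], y[transient:]     # discard the transient

x, y = coupled_ar(c=0.5)                    # 1000 retained samples each
```

Sweeping c from 0.01 to 1 with a fresh seed per ensemble member reproduces the experimental setup described above.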
Below, TE is estimated in both directions using coupling values ranging from c = 0.01 to c = 1 in Equation (24). The information flows are consistently estimated to be in the same direction by all three methods, i.e., TE_YX ≥ TE_XY. If we compare the magnitudes of these TE estimates, we observe that the biases between the analytic solution and the TE_YX estimates of the adaptive partitioning, KDE and Generalized Knuth methods increase as the coupling coefficient of the autoregressive model increases to c = 1.
Above, we demonstrated the TE estimates obtained using the KDE method with different embedding dimensions and different radius values. In Figure 5A,B, we observe that the direction of each TE is estimated correctly, i.e., TE_YX ≥ TE_XY, for the model given in Equation (24), demonstrating that we can obtain the same information flow directions, but with different bias values. Below, the results in Figure 5A are compared with those of the other two techniques for k = l = 1.
When the magnitudes of the TE_YX estimates are compared in Figures 5A and 6, we observe bias in both TE_YX and TE_XY, whereas there is no bias in the TE_XY estimate of the Generalized Knuth method using β = 10^−10. On the other hand, the adaptive partitioning method provides the least bias for TE_YX, whereas the KDE method produces larger bias for low coupling values and lower bias for high coupling values in Figure 5A, compared to the Generalized Knuth method with β = 10^−10 in Figure 7.
For example, for c = 1, we note from the three graphs that the estimated transfer entropies are TE_YX ≅ 0.52, TE_YX ≅ 0.43 and TE_YX ≅ 0.2 for the adaptive partitioning method, the KDE method with k = l = 1 and the Generalized Knuth method with β = 10^−10, respectively. As the bias is the difference between the analytical value (TE_YX = 0.55 for k = l = 1) and the estimate, it attains its largest value in the case of the Generalized Knuth method with β = 10^−10. On the other hand, we know that there is no information flow from variable X to variable Y, i.e., TE_XY = 0. This fact is reflected in Figure 7, but not in Figures 5A and 6, where TE_XY is estimated to be non-zero, implying bias. As the same computation is also utilized to estimate TE_YX (in the other direction), we choose to analyze the NetTE defined in Equation (27). Before comparing the NetTE obtained by each method, we present the performance of the proposed Generalized Knuth method for different β values.
3.1.1. Fine-Tuning the Generalized Knuth Method
In this sub-section, we investigate the effect of β on the TE estimation bias in the case of the Generalized Knuth method. The piecewise-constant model of the Generalized Knuth method approaches a purely likelihood-dependent model as β goes to zero in Equation (18). In this case, the mean posterior bin heights approach the sample frequencies in each bin, i.e., ⟨π_i⟩ → n_i/N. In this particular case, empty bins of the histogram cause large biases in the entropy estimate, especially in higher dimensions, as the data become sparser. This approach only becomes unbiased asymptotically [54]. However, as shown in Equation (18), the Dirichlet prior with exponent β artificially fills each bin by an amount β, reducing the bias problem. In Appendix III, Figure A3 illustrates the effect of the free parameter β on the performance of the marginal and joint entropy estimates. We find that the entropy estimates fall within one to two standard deviations for β ≅ 0.1. The performance degrades for much smaller and much larger β values.
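The bin-filling mechanism can be sketched directly. This is our own simplified illustration of the Dirichlet-regularized bin probabilities (n_i + β)/(N + Mβ) with a fixed bin count; the full Generalized Knuth method additionally selects the number of bins via the Bayesian posterior, which is omitted here:

```python
import numpy as np

def regularized_entropy(data, bins=32, beta=0.1):
    """Histogram entropy (nats) under a Dirichlet prior with exponent beta.

    The mean posterior bin probabilities (n_i + beta) / (N + M*beta)
    'fill' empty bins, mitigating the sparse-histogram bias; beta -> 0
    recovers the raw frequency (maximum-likelihood) estimate.
    """
    counts, _ = np.histogram(data, bins=bins)
    N, M = counts.sum(), len(counts)
    p = (counts + beta) / (N + M * beta)
    return float(-np.sum(p * np.log(p)))

rng = np.random.default_rng(5)
data = rng.normal(size=200)                 # sparse sample: many near-empty bins
for beta in (1e-10, 0.1, 0.5):
    print(beta, round(regularized_entropy(data, beta=beta), 3))
```

Increasing β flattens the estimated distribution toward uniform, raising the entropy estimate of a sparse histogram, which is the trade-off explored in Figures 8 and 9.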
Figures 8 and 9 illustrate less bias in the TE_YX estimates for β = 0.1 and β = 0.5, unlike the case shown in Figure 7, where we use β = 10^−10. However, the bias increases for low coupling values in these two cases. To illustrate the net effect of the bias, we explore the NetTE estimates of Equation (27) for these cases in Section 3.1.2.
3.1.2. Analysis of NetTE for the Bivariate AR Model
Since we are mainly interested in the direction of the information flow, we show that the NetTE estimates exhibit more quantitative similarity among the methods for the case where k = l = 1 (Figure 10). In the KDE method (Figure 5A), the adaptive partitioning method (Figure 6) and the Generalized Knuth method with β = 0.1 and β = 0.5 (Figures 8 and 9), a non-zero TE_XY is observed. Nevertheless, the NetTE between the variables X and Y of the bivariate autoregressive model in Equation (24) still behaves similarly, giving a net information flow in the direction of the coupling, from Y to X, as expected. Thus, in this case we find that the NetTE behaves in the same way, even though the individual TE estimates of each method have different biases. Above, we observe that the NetTE estimate of the adaptive partitioning method outperforms those of the Generalized Knuth method with β = 0.1 and β = 0.5 and the KDE method. The largest bias in the NetTE is obtained by the Generalized Knuth method with β = 10^−10. However, all methods agree that the information flow from Y to X is greater than that from X to Y, which is in agreement with the theoretical result obtained from Equation (24) using the equations in [19]. In the literature, the bias in the estimation has been obtained using surrogate TEs estimated by shuffling the data samples [38]. These approaches will be explored in future work.
3.2. Lorenz System
In this section, the three methods of Section 2 are applied to a more challenging problem involving the detection of the direction of information flow among the three components of the Lorenz system, which is a simplified atmospheric circulation model that exhibits significant nonlinear behavior. The Lorenz system is defined by a set of three coupled first-order differential equations [41]:

dx/dt = σ(y − x)
dy/dt = x(R − z) − y        (28)
dz/dt = xy − bz

where σ = 10, b = 8/3, and R = 24 (sub-chaotic) or R = 28 (chaotic). These equations derive from a simple model of an atmospheric convection cell, where the variables x, y and z denote the convective velocity, the vertical temperature difference and the mean convective heat flow, respectively. These equations are used to generate a synthetic time series, which is then used to test our TE estimation procedure. In the literature, the estimation of the TE between two Lorenz systems with nonlinear couplings has found applications in neuroscience [14,39,55]. Here, we explore the performance of our approach on a single Lorenz system that is not coupled to another one. Our goal is to estimate the interactions among the three variables of a single Lorenz system, not the coupling from one system to another.
In our experiments, we tested the adaptive partitioning, KDE and Generalized Knuth methods for the case of Rayleigh number R = 28, which is well known to result in chaotic dynamics, and also for the sub-chaotic case of R = 24. For each variable, we generated 15,000 samples using a Runge-Kutta-based differential equation solver in MATLAB (ode45) and used the last 5000 samples after discarding the transient. In both the chaotic and sub-chaotic cases, β = 0.1 was used in the Generalized Knuth method and a 5% significance level was selected in the adaptive partitioning method. Embedding dimensions of k = l = 1 were selected for these two methods.
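An equivalent synthetic series can be generated in Python, with scipy’s RK45 integrator standing in for MATLAB’s ode45; the sampling interval and the initial condition below are our own assumptions:

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz_series(R=28.0, sigma=10.0, b=8.0 / 3.0, n=15000, dt=0.01):
    """Integrate the Lorenz system and return the sampled trajectory.

    scipy's RK45 stands in for MATLAB's ode45 used in the text; the
    sampling interval dt and the initial condition are assumptions.
    """
    def rhs(t, s):
        x, y, z = s
        return [sigma * (y - x), x * (R - z) - y, x * y - b * z]

    t = np.arange(n) * dt
    sol = solve_ivp(rhs, (0.0, t[-1]), [1.0, 1.0, 1.0],
                    t_eval=t, rtol=1e-6, atol=1e-9)
    return sol.y                            # rows: x(t), y(t), z(t)

x, y, z = lorenz_series(R=28.0)             # chaotic regime
x, y, z = x[-5000:], y[-5000:], z[-5000:]   # keep the last 5000 samples
```

Setting R = 24 instead yields the sub-chaotic regime analyzed later in this section.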
In the KDE method, the embedding dimension values were selected according to Section 2.3.2: the log C(ε) versus log ε curves were estimated for the chaotic and sub-chaotic cases. In the chaotic case, the first minimum of the TLMI was found at k = 17, and ε = e^−1 occurred in the middle of the range of radii spanning the linear part of the curve. The value l = 1 was selected for both the chaotic and sub-chaotic cases. The curves for different k values are illustrated in Figure 11 for the analysis of the interaction between X and Y. Similar curves were observed for the analysis of the interactions between the other pairs in the model.
In the sub-chaotic case, values around k = 15 were observed to provide the first local minimum of the TLMI. However, consistency of the NetTE direction with the other two techniques, namely the adaptive partitioning and the Generalized Knuth methods, could not be obtained. Therefore, as our method prescribes, the k value was fine-tuned along with the radius until we obtained consistency of the NetTE directions among the three methods. The selection of k = 3, l = 1 and ε = e^−2 provided this consistency; the resulting NetTE directions are illustrated in Figure 15. Figure 12 illustrates the log C(ε) versus log ε curves used in the selection of the appropriate region for ε in the sub-chaotic case.
We estimated the TE in both directions for each pair of variables (x, y), (x, z) and (y, z) using each of the three methods described in Section 2. Similar to the MI normalization of Equation (26) recommended in [53], we adapt the normalization for the NetTE as given in Equation (29), where δ_XY denotes the normalized NetTE between variables X and Y, taking values in the range [0,1]. In Figures 13 and 14, we illustrate the information flow between each pair of the Lorenz equation variables using both the un-normalized TE values obtained by each of the three methods and the normalized NetTE estimates showing the net information flow between each pair of variables.
Above, the un-normalized TE values are denoted by solid lines between each pair of variables, and the normalized NetTE estimates of Equation (29) are illustrated with dashed lines. The direction of the NetTE is the same as that of the larger of the two un-normalized TE estimates between each pair, the magnitudes of which are shown in rectangles. For example, in the case of the adaptive partitioning method, the un-normalized TE values between variables Y and Z are estimated to be TE_YZ = −0.45 and TE_ZY = −0.50, due to the biases originating from the subtraction used in Equation (21). However, the normalized NetTE indicates a net information flow from variable Y to Z. Thus, we conclude that variable Y affects variable Z.
In Figure 14, we illustrate the TE estimates between each pair of variables of the Lorenz system, Equation (28), in the sub-chaotic regime with R = 24.
Above, we demonstrated the concept of our method: if the directions of the information flows obtained by the three methods are not consistent, then we explore new parameter values until consistency of the directions is achieved. For the selected parameters, the Generalized Knuth method and the adaptive partitioning method provided consistent NetTE directions between the pairs of variables in the chaotic case. However, in the sub-chaotic case, we needed to explore a new parameter set for the KDE method, as its NetTE directions differed from those of the other two, mutually consistent methods.
Based on the fact that the NetTE directions obtained using each of the three methods agree, we conclude that the information flow directions between the pairs of the Lorenz equation variables are as shown in Figure 15.
Note that these information flow directions are not only not obvious, but also not obviously obtainable from the Lorenz system equations in Equation (28), despite the fact that these equations comprise a complete description of the system (sensitive dependence on initial conditions notwithstanding). However, given that this system of equations is derived from a well-understood physical system, one can evaluate these results based on the corresponding physics. In an atmospheric convection roll, it is known that both the velocity (X) and the heat flow (Z) are driven by the temperature difference (Y), and that it is the velocity (X) that mediates the heat flow (Z) in the system. This demonstrates that complex nonlinear relationships between different subsystems can be revealed by a TE analysis of the time series of the system variables. Furthermore, such an analysis reveals information about the system that is not readily accessible even with an analytic model, such as Equation (28), in hand.