Article

Minimising the Kullback–Leibler Divergence for Model Selection in Distributed Nonlinear Systems

1 Australian Centre for Field Robotics, The University of Sydney, Sydney NSW 2006, Australia
2 Complex Systems Research Group, The University of Sydney, Sydney NSW 2006, Australia
3 Centre for Autonomous Systems, University of Technology Sydney, Ultimo NSW 2007, Australia
* Author to whom correspondence should be addressed.
Entropy 2018, 20(2), 51; https://doi.org/10.3390/e20020051
Submission received: 21 December 2017 / Revised: 17 January 2018 / Accepted: 18 January 2018 / Published: 23 January 2018
(This article belongs to the Special Issue New Trends in Statistical Physics of Complex Systems)

Abstract

The Kullback–Leibler (KL) divergence is a fundamental measure of information geometry that is used in a variety of contexts in artificial intelligence. We show that, when system dynamics are given by distributed nonlinear systems, this measure can be decomposed as a function of two information-theoretic measures, transfer entropy and stochastic interaction. More specifically, these measures are applicable when selecting a candidate model for a distributed system, where individual subsystems are coupled via latent variables and observed through a filter. We represent this model as a directed acyclic graph (DAG) that characterises the unidirectional coupling between subsystems. Standard approaches to structure learning are not applicable in this framework due to the hidden variables; however, we can exploit the properties of certain dynamical systems to formulate exact methods based on differential topology. We approach the problem by using reconstruction theorems to derive an analytical expression for the KL divergence of a candidate DAG from the observed dataset. Using this result, we present a scoring function based on transfer entropy to be used as a subroutine in a structure learning algorithm. We then demonstrate its use in recovering the structure of coupled Lorenz and Rössler systems.

1. Introduction

Distributed information processing systems are commonly studied in complex systems and machine learning research. We are interested in inferring data-driven models of such systems, specifically in the case where each subsystem can be viewed as a nonlinear dynamical system. In this context, the Kullback–Leibler (KL) divergence is commonly used to measure the quality of a statistical model [1,2,3]. When a model is compared with fully observed data, computing the KL divergence can be straightforward. However, in the case of spatially distributed dynamical systems, where individual subsystems are coupled via latent variables and observed through a filter, the presence of hidden variables renders typical approaches unusable. We derive the KL divergence in such systems as a function of two information-theoretic measures using methods from differential topology.
The model selection problem has applications in a wide variety of areas due to its usefulness in performing efficient inference and understanding the underlying phenomena being studied. Dynamical systems are an expressive model characterised by a map that describes their evolution over time and a read-out function through which we observe the latent state. Our research focuses on the more general case of a multivariate system, where a set of these subsystems are distributed and unidirectionally coupled to one another. The problem of inferring this coupling is an important multidisciplinary study in fields such as ecology [4], neuroscience [5,6], multi-agent systems [7,8,9], and various others that focus on artificial and biological networks [10].
We represent such a spatially distributed system as a probabilistic graphical model termed a synchronous graph dynamical system (GDS) [11,12], whose structure is given by a DAG. Model selection in this context is the problem of inferring directed relationships between hidden variables from an observed dataset, also known as structure learning. A main challenge in structure learning for DAGs is the case where variables are unobserved. Exact methods are known for fully observable systems (i.e., Bayesian networks (BNs)) [13]; however, these are not applicable in the more expressive case where the state variables of the dynamical systems are latent. The main focus of this paper is to analytically derive a measure for comparing a candidate graph to the underlying graph that generated a measured dataset. Such a measure can then be used to solve the two subproblems that comprise structure learning, evaluation and identification [14], and hence find the optimal model that explains the data.
For the evaluation problem, it is desirable to select the simplest model that incorporates all statistical knowledge. This concept is commonly expressed via information theory, where an established technique is to evaluate the encoding length of the data, given the model [1,15,16]. The simplest model should aim to minimise code length [2], and therefore we can simplify our problem to that of minimising KL divergence for the synchronous GDS. Using this measure, we find a factorised distribution (given by the graph structure) that is closest to the complete (unfactorised) distribution. We first analytically derive an expression for this divergence, and build on this result to present a scoring function for evaluating candidate graphs based on a dataset.
The main result of this paper is an exact decomposition of the KL divergence for synchronous GDSs. We show that this measure can be decomposed as the difference between two well-known information-theoretic measures, stochastic interaction [17,18] and collective transfer entropy [19]. We establish this result by first representing discrete-time multivariate dynamical systems as dynamic Bayesian networks (DBNs) [20]. In this form, neither the complete nor the factorised distribution can be directly computed due to the hidden system state. Thus, we draw on state space reconstruction methods from differential topology to reformulate the KL divergence in terms of computable distributions. Using this expression, we show that the maximum transfer entropy graph is the most likely to have generated the data. This is experimentally validated using toy examples of a Lorenz–Rössler system and a network of coupled Lorenz attractors (Figure 1) of up to four nodes. These results support the conjecture that transfer entropy can be used to infer effective connectivity in complex networks.

2. Related Work

Networks of coupled dynamical systems have been introduced under a variety of terms, such as complex networks [10], distributed dynamical systems [6] and master–slave configurations [21]. The defining feature of these networks is that the dynamics of each subsystem are given by a set of either discrete-time maps or first-order ordinary differential equations (ODEs). In this paper, we use the discrete-time formulation, where a map can be obtained numerically by integrating ODEs or recording observations at discrete-time intervals [22].
An important precursor to network reconstruction is inferring causality and coupling strength between complex nonlinear systems. Causal inference is intractable when the experimenter cannot intervene with the dataset [23], and so we focus our attention on methods that determine conditional independence (coupling) rather than causality. In seminal work, Granger [24] proposed Granger causality for quantifying the predictability of one variable from another; however, a key requirement of this measure is linearity of the system, implying subsystems are separable [4]. Schreiber [25] extended these ideas and introduced transfer entropy using the concept of finite-order Markov processes to quantify the information transfer between coupled nonlinear systems. Transfer entropy and Granger causality are equivalent for linearly-coupled Gaussian systems (e.g., Kalman models) [26]; however, there are clear distinctions between the concepts of information transfer and causal effect [27]. Although transfer entropy has received criticism over spuriously identifying causality [28,29,30], we are concerned with statistical modelling and not causality of the underlying process.
Recently, a number of measures have been proposed to infer coupling between distributed dynamical systems based on reconstruction theorems. Sugihara et al. [4] proposed convergent cross-mapping that involves collecting a history of observed data from one subsystem and uses this to predict the outcome of another subsystem. This history is the delay reconstruction map described by Takens’ Delay Embedding Theorem [31]. Similarly, Schumacher et al. [6] used the Bundle Delay Embedding Theorem [32,33] to infer causality and perform inference via Gaussian processes. Although the algorithms presented in these papers can infer driving subsystems in a spatially distributed dynamical system, the results obtained differ from ours as inference is not considered for an entire network structure, nor is a formal derivation presented. Contrasting this, we recently derived an information criterion for learning the structure of distributed dynamical systems [12]. However, the criterion proposed required parametric modelling of the probability distributions, and thus a detailed understanding of the physical phenomena being studied. In this paper, we extend this framework by first showing that KL divergence can be decomposed as information-theoretically useful measures, and then arriving at a similar result but employing non-parametric density estimation techniques to allow for no assumptions about the underlying distributions.
It is important to distinguish our approach from dynamic causal modelling (DCM), which attempts to infer the parameters of explicit dynamic models that cause (generate) data. In DCM, the set of potential models is specified a priori (typically in the form of ODEs) and then scored via marginal likelihood or evidence. The parameters of these models include effective connectivity such that their posterior estimates can be used to infer coupling among distributed dynamical systems [34]. As a consequence, these approaches can be used to recover networks that reveal the effective structure of observed systems [35,36]. In contrast, our approach does not require an explicitly specified model because the scoring function can be computed directly from the data. However, it does assume an implicit model in the form of a DAG where the subsystem processes are generated by generic functions.
Unlike effective connectivity, which is defined in relation to a (dynamic causal) model, the concept of functional connectivity refers to recovering statistical dependencies [37]. Consequently, statistical measures such as Granger causality and transfer entropy are typically used to identify functional, rather than effective structure. For example, transfer entropy has been used previously to infer networks in numerous fields, e.g., computational neuroscience [5,38], multi-agent systems [8], financial markets [39], supply-chain networks [40], and biology [41]. However, most of these results build on the work of Schreiber [25] by assuming the system is composed of finite-order Markov chains and thus there is a dearth of work that provides formal derivations for the use of this measure for inferring effective connectivity. Our work allows us to compute scoring functions directly from multivariate time series (as in functional connectivity), yet still assumes an implicit model (albeit with weaker assumptions on the model than those considered in inferring effective connectivity).

3. Background

3.1. Notation

We use the convention that $(\cdot)$ denotes a sequence, $\{\cdot\}$ a set, and $\langle\cdot\rangle$ a vector. In this work, we consider a collection of stationary stochastic temporal processes $\mathbf{Z}$. Each process $Z^i$ comprises a sequence of random variables $(Z_1^i, \ldots, Z_N^i)$ with realisation $(z_1^i, \ldots, z_N^i)$ for countable time indices $n \in \mathbb{N}$. Given these processes, we can compute probability distributions of each variable by counting relative frequencies or by density estimation techniques [42,43]. We use bold to denote the set of all variables, e.g., $\mathbf{z}_n = \{z_n^1, \ldots, z_n^M\}$ is the collection of $M$ realisations at index $n$. Furthermore, unless otherwise stated, $X_n^i$ is a latent (hidden) variable, $Y_n^i$ is an observed variable, and $Z_n^i$ is an arbitrary variable; thus, $\mathbf{Z}_n = \{\mathbf{X}_n, \mathbf{Y}_n\}$ is the set of all hidden and observed variables at temporal index $n$. Given a graphical model $G$, the $p_i$ parents of variable $Z_{n+1}^i$ are given by the parent set $\Pi_G(Z_{n+1}^i) = \{Z_n^{ij}\}_j = \{Z_n^{i1}, \ldots, Z_n^{ip_i}\}$. Finally, let the superscript $\mathbf{z}_n^{i,(k)} = \langle z_n^i, z_{n-1}^i, \ldots, z_{n-k+1}^i \rangle$ denote the vector of $k$ previous values taken by variable $Z_n^i$.

3.2. Representing Distributed Dynamical Systems as Probabilistic Graphical Models

We are interested in modelling discrete-time multivariate dynamical systems, where the state is a vector of real numbers given by a point $\mathbf{x}_n$ lying on a compact $d$-dimensional manifold $\mathcal{M}$. A map $f : \mathcal{M} \to \mathcal{M}$ describes the temporal evolution of the state at any given time, such that the state at the next time index is $\mathbf{x}_{n+1} = f(\mathbf{x}_n)$. Furthermore, in many practical scenarios, we do not have access to $\mathbf{x}_n$ directly, and can instead observe it through a measurement function $\psi : \mathcal{M} \to \mathbb{R}^M$ that yields a scalar representation $\mathbf{y}_n = \psi(\mathbf{x}_n)$ of the latent state [22,44]. We assume the multivariate system can be factorised and modelled as a DAG with spatially distributed dynamical subsystems, termed a synchronous GDS (see Figure 2a). This definition is restated from [12] as follows.
Definition 1. (Synchronous GDS)
A synchronous GDS $(G, \mathbf{x}_n, \mathbf{y}_n, \{f^i\}, \{\psi^i\})$ is a tuple that consists of: a finite, directed graph $G = (\mathcal{V}, \mathcal{E})$ with edge-set $\mathcal{E} = \{E^i\}$ and $M$ vertices comprising the vertex set $\mathcal{V} = \{V^i\}$; a multivariate state $\mathbf{x}_n = \langle \mathbf{x}_n^i \rangle$, composed of states for each vertex $V^i$ confined to a $d_i$-dimensional manifold $\mathbf{x}_n^i \in \mathcal{M}^i$; an $M$-variate observation $\mathbf{y}_n = \langle y_n^i \rangle$, composed of scalar observations for each vertex $y_n^i \in \mathbb{R}$; a set of local maps $\{f^i\}$ of the form $f^i : \mathcal{M} \to \mathcal{M}^i$, which update synchronously and induce a global map $f : \mathcal{M} \to \mathcal{M}$; and a set of local observation functions $\{\psi^1, \psi^2, \ldots, \psi^M\}$ of the form $\psi^i : \mathcal{M}^i \to \mathbb{R}$.
The global dynamics and observations can therefore be described by the set of local functions [12]:
$$\mathbf{x}_{n+1}^i = f^i\big(\mathbf{x}_n^i, \langle \mathbf{x}_n^{ij} \rangle_j\big) + \upsilon_{f^i}, \tag{1}$$
$$y_{n+1}^i = \psi^i\big(\mathbf{x}_{n+1}^i\big) + \upsilon_{\psi^i}, \tag{2}$$
where $\upsilon_{f^i}$ and $\upsilon_{\psi^i}$ are additive noise terms. The subsystem dynamics (1) are a function of the subsystem state $\mathbf{x}_n^i$ and the subsystem parents' state $\langle \mathbf{x}_n^{ij} \rangle_j$ at the previous time index, i.e., $f^i : (\mathcal{M}^i \times_j \mathcal{M}^{ij}) \to \mathcal{M}^i$. However, the observation $y_{n+1}^i$ is a function of the subsystem state alone, i.e., $\psi^i : \mathcal{M}^i \to \mathbb{R}$. We assume that the maps $\{f^i\}$ and $\{\psi^i\}$, as well as the graph $G$, are time-invariant.
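To make the update rule in (1) and (2) concrete, the following is a minimal sketch of a toy synchronous GDS. It is an illustration only, not the systems studied in this paper: the noisy logistic map, the identity read-out, the linear coupling scheme, and the names `simulate_gds`, `coupling` and `noise` are all assumptions made for the example. Note that the logistic map is an endomorphism rather than a diffeomorphism.

```python
import random


def simulate_gds(parents, n_steps, coupling=0.3, noise=0.01, seed=0):
    """Simulate a toy synchronous GDS (hypothetical example system).

    parents[i] lists the parent subsystems of subsystem i; the caller is
    responsible for making the induced graph a DAG.
    Returns (states, observations), one list entry per time index.
    """
    rng = random.Random(seed)
    m = len(parents)
    x = [rng.uniform(0.1, 0.9) for _ in range(m)]
    states, observations = [], []
    for _ in range(n_steps):
        new_x = []
        for i in range(m):
            # Local dynamics f^i: logistic map on the subsystem's own state.
            value = 3.9 * x[i] * (1.0 - x[i])
            if parents[i]:
                # Coupling uses the parents' states at the PREVIOUS index n.
                drive = sum(3.9 * x[j] * (1.0 - x[j]) for j in parents[i])
                drive /= len(parents[i])
                value = (1.0 - coupling) * value + coupling * drive
            value += rng.gauss(0.0, noise)           # additive dynamics noise
            new_x.append(min(max(value, 0.0), 1.0))  # keep state in [0, 1]
        x = new_x                                    # synchronous update
        states.append(list(x))
        # Read-out psi^i: identity plus additive observation noise.
        observations.append([xi + rng.gauss(0.0, noise) for xi in x])
    return states, observations


# Subsystem 0 has no parents; subsystem 1 is driven by subsystem 0.
states, observations = simulate_gds(parents=[[], [0]], n_steps=500)
```

Only `observations` would be available to a structure learning algorithm; `states` plays the role of the latent variables.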
The discrete-time mapping for the dynamics (1) and measurement functions (2) can be modelled as a DBN in order to facilitate structure learning of the graph [12] (see Figure 2b). DBNs are probabilistic graphical models that represent probability distributions over trajectories of random variables $(\mathbf{Z}_1, \mathbf{Z}_2, \ldots)$ using a prior BN and a two-time-slice BN (2TBN) [45]. To model the maps, however, we need only consider the 2TBN $B = (G, \Theta_G)$, which can model a first-order Markov process $p_B(\mathbf{z}_{n+1} \mid \mathbf{z}_n)$ graphically via a DAG $G$ and a set of conditional probability distribution (CPD) parameters $\Theta_G$ [45]. Given a set of stochastic processes $(\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_N)$, the realisation of which constitutes the sample path $(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N)$, the 2TBN distribution is given by $p_B(\mathbf{z}_{n+1} \mid \mathbf{z}_n) = \prod_i \Pr(z_{n+1}^i \mid \pi_G(Z_{n+1}^i))$, where $\pi_G(Z_{n+1}^i)$ denotes the (index-ordered) set of realisations $\{z_o^j : Z_o^j \in \Pi_G(Z_{n+1}^i)\}$.
To model the synchronous GDS as a DBN, we associate each subsystem vertex $V^i$ with a state variable $X_n^i$ and an observation variable $Y_n^i$. The parents of subsystem $V^i$ are denoted $\Pi_G(V^i)$ [12]. From the dynamics (1), variables in the set $\Pi_G(X_{n+1}^i)$ come strictly from the preceding time slice, and additionally, from the measurement function (2), $\Pi_G(Y_{n+1}^i) = X_{n+1}^i$. Thus, we can build the edge set $\mathcal{E}$ in the GDS by means of the edges in the DBN [12], i.e., given an edge $X_n^i \to X_{n+1}^j$ of the DBN, the equivalent edge $V^i \to V^j$ exists for the GDS. The distributions for the dynamics (1) and observation (2) maps of $M$ arbitrary subsystems can therefore be factorised according to the DBN structure such that [12]
$$p_B(\mathbf{z}_{n+1} \mid \mathbf{z}_n) = \prod_{i=1}^{M} \Pr\big(\mathbf{x}_{n+1}^i \mid \mathbf{x}_n^i, \langle \mathbf{x}_n^{ij} \rangle_j\big) \cdot \Pr\big(y_{n+1}^i \mid \mathbf{x}_{n+1}^i\big). \tag{3}$$
The goal of learning nonlinear dynamical networks thus becomes that of inferring the parent set $\Pi_G(X_n^i)$ for each latent variable $X_n^i$.
Finally, recall that the parents of each observation are constrained such that $\Pi_G(Y_{n+1}^i) = X_{n+1}^i$. As a consequence, we use the shorthand notation $y_n^{ij}$ to denote the observation of the $j$-th parent of the $i$-th subsystem at time $n$ (and the same for $\mathbf{x}_n^{ij}$).

3.3. Network Scoring Functions

A number of exact and approximate DBN structure learning algorithms exist that are based on Bayesian statistics and information theory. We have shown in prior work how to compute the log-likelihood function for synchronous GDSs. In this section, we will briefly summarise the problem of structure learning for DBNs, focusing on the factorised distribution (3).
The score and search paradigm [46] is a common method for recovering graphical models from data. Given a dataset $D = (\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N)$, the objective is to find a DAG $G^*$ such that
$$G^* = \arg\max_{G \in \mathcal{G}} \, g(B : D), \tag{4}$$
where $g(B : D)$ is a scoring function measuring the degree of fitness of a candidate DAG $G$ to the dataset $D$, and $\mathcal{G}$ is the set of all DAGs. Finding the optimal graph $G^*$ in Equation (4) requires solutions to the two subproblems that comprise structure learning: the evaluation problem and the identification problem [14]. The main problem we focus on in this paper is the evaluation problem, i.e., determining a score that quantifies the quality of a graph, given data. Later, we will address the identification problem by discussing the attributes of this scoring function in efficiently finding the optimal graph structure.
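The search in Equation (4) can be sketched for very small networks by brute force. The snippet below is an illustrative assumption rather than the search procedure used in this paper: it enumerates every DAG on a handful of vertices and returns the one with the highest score. The names `is_acyclic`, `all_dags` and `best_dag` are hypothetical, and `score` stands in for any scoring function $g(B : D)$ evaluated on a fixed dataset.

```python
from itertools import combinations


def is_acyclic(n_nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = [0] * n_nodes
    for _, v in edges:
        indegree[v] += 1
    queue = [v for v in range(n_nodes) if indegree[v] == 0]
    visited = 0
    while queue:
        u = queue.pop()
        visited += 1
        for a, b in edges:
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    queue.append(b)
    return visited == n_nodes


def all_dags(n_nodes):
    """Yield every DAG on n_nodes labelled vertices as a tuple of edges."""
    possible = [(u, v) for u in range(n_nodes)
                for v in range(n_nodes) if u != v]
    for r in range(len(possible) + 1):
        for edges in combinations(possible, r):
            if is_acyclic(n_nodes, edges):
                yield edges


def best_dag(n_nodes, score):
    """Exhaustive score-and-search: arg max of score over all DAGs."""
    return max(all_dags(n_nodes), key=score)


# Placeholder score that rewards edges of a known target graph and
# penalises all others; a real score would be computed from data.
target = {(0, 1), (1, 2)}
toy_score = lambda edges: len(set(edges) & target) - len(set(edges) - target)
best = best_dag(3, toy_score)
```

The number of DAGs grows super-exponentially with the number of vertices, so this enumeration is only practical for networks of a few nodes.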
In prior work, we developed a score based on the posterior probability of the network structure $G$, given data $D$. That is, we considered maximising the expected log-likelihood [12]
$$\ell(\hat{\Theta}_G : D) = \mathbb{E}\big[\log \Pr(D \mid G, \hat{\Theta}_G)\big] = \mathbb{E}\big[\log\big(p_B(\mathbf{z}_{n+1} \mid \mathbf{z}_n)\big)\big], \tag{5}$$
where the expectation $\mathbb{E}[Z] = \int z \Pr(z)\, dz$. It was shown that state space reconstruction techniques (see Appendix A) can be used to compute the log-likelihood of Equation (3) as a difference of conditional entropy terms [12]. In the same work, we illustrated that the log-likelihood ratio of a candidate DAG $G$ to the empty network $G_{\emptyset}$ is given by collective transfer entropy (see Appendix B), i.e.,
$$\ell(\hat{\Theta}_G : D) - \ell(\hat{\Theta}_{G_{\emptyset}} : D) = N \cdot \sum_{i=1}^{M} T_{\langle Y^{ij} \rangle_j \to Y^i}. \tag{6}$$
For the nested log-likelihoods above, the statistics of $2\big(\ell(\hat{\Theta}_G : D) - \ell(\hat{\Theta}_{G_{\emptyset}} : D)\big)$ asymptotically follow the $\chi_q^2$-distribution, where $q$ is the difference between the number of parameters of each model [47,48]. We will draw on this log-likelihood decomposition in later sections for statistical significance testing.

4. Computing Conditional KL Divergence

In this section, we present our main result, which is an analytical expression of KL divergence that facilitates structure learning in distributed nonlinear systems. We begin by considering the problem of finding an optimal DBN structure as searching for a parsimonious factorised distribution $p_B$ that best represents the complete digraph distribution $p_{K_M}$. That is, $p_{K_M}$ is the joint distribution yielded by assuming no factorisation (the complete graph $K_M$) and thus no information loss. The distribution is expressed as:
$$p_{K_M}\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big) = \Pr\big(\{z_{n+1}^1, \ldots, z_{n+1}^M\} \mid \{z_n^1, \ldots, z_n^M\}, \{z_{n-1}^1, \ldots, z_{n-1}^M\}, \ldots, \{z_1^1, \ldots, z_1^M\}\big). \tag{7}$$
We quantify the similarity of the factorised distribution $p_B$ to this joint distribution via KL divergence. In prior work, de Campos [3] derived the MIT scoring function for BNs by this approach and it was later used for DBN structure learning with complete data [49]. We extend the analysis to DBNs with latent variables, i.e., we compare the joint and factorised distributions of time slices, given the entire history,
$$\begin{aligned} D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big) &= D_{\mathrm{KL}}\big(p_{K_M}(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}) \,\|\, p_B(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)})\big) \\ &= \sum_{\mathbf{z}_n^{(n)}} \Pr\big(\mathbf{z}_n^{(n)}\big) \sum_{\mathbf{z}_{n+1}} \Pr\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big) \log \frac{\Pr\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big)}{p_B\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big)} \\ &= \mathbb{E}\left[\log \frac{\Pr\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big)}{p_B\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n\big)}\right]. \end{aligned} \tag{8}$$
Substituting the synchronous GDS model (3) into Equation (8), we get
$$D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big) = \mathbb{E}\left[\log \frac{\Pr\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big)}{\prod_{i=1}^{M} \Pr\big(\mathbf{x}_{n+1}^i \mid \mathbf{x}_n^i, \langle \mathbf{x}_n^{ij} \rangle_j\big) \cdot \Pr\big(y_{n+1}^i \mid \mathbf{x}_{n+1}^i\big)}\right]. \tag{9}$$
However, Equation (9) comprises maximum likelihood distributions with unobserved (latent) states $\mathbf{x}_n$. It is common in model selection to decompose the KL divergence as
$$D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big) = \mathbb{E}\big[\log \Pr\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big)\big] - \mathbb{E}\big[\log\big(p_B(\mathbf{z}_{n+1} \mid \mathbf{z}_n)\big)\big], \tag{10}$$
where the second term is simply the log-likelihood (5). In this form, $p_{K_M}$ is often identical for all models considered and, in practice, it suffices to ignore this term and thus avoid the problem of computing distributions of latent variables. The resulting simpler expression can be viewed as log-likelihood maximisation (as in our previous work outlined in Section 3.3). However, as we show in this section, $p_{K_M}$ is not equivalent for all models unless certain parameters of the dynamical systems are known. Hence, for now, we cannot ignore the first term of Equation (10) and we instead propose an alternative decomposition of KL divergence that comprises only observed variables.

4.1. A Tractable Expression via Embedding Theory

In order to compute the distributions in (9), we use the Bundle Delay Embedding Theorem [32,33] to reformulate the factorised distribution (denominator), and the Delay Embedding Theorem for Multivariate Observation Functions [50] for the joint distribution (numerator). We describe these theorems in detail in Appendix A, along with the technical assumptions required for $(f, \psi)$. Although the following theorems assume a diffeomorphism, we also discuss application of the theory towards inferring the structure of endomorphisms (e.g., coupled map lattices [51]) in the same appendix.
The first step is to reproduce a prior result for computing the factorised distribution (denominator) in Equation (9). First, we define the delay embedding
$$\mathbf{y}_n^{i,(\kappa_i)} = \big\langle y_n^i, y_{n-\tau_i}^i, \ldots, y_{n-(\kappa_i-1)\tau_i}^i \big\rangle, \tag{11}$$
where $\tau_i$ is the (strictly positive) lag, and $\kappa_i$ is the embedding dimension of the $i$-th subsystem (together, the embedding parameters). Note that, although we can take either the future or past delay embedding (11) for diffeomorphisms, we explicitly consider a history of values to account for both endomorphisms and diffeomorphisms. Moreover, an important assumption of our approach is that the structure (enforced by coupling between subsystems) is a DAG; this comes from the Bundle Delay Embedding Theorem [32,33] (see Lemma 1 of [12] for more detail). Our previous result is expressed as follows.
Lemma 1 (Cliff et al. [12]).
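Given an observed dataset $D$, where $\mathbf{y}_n \in \mathbb{R}^M$, generated by a directed and acyclic synchronous GDS $(G, \mathbf{x}_n, \mathbf{y}_n, \{f^i\}, \{\psi^i\})$, the 2TBN distribution can be written as
$$\prod_{i=1}^{M} \Pr\big(\mathbf{x}_{n+1}^i \mid \mathbf{x}_n^i, \langle \mathbf{x}_n^{ij} \rangle_j\big) \cdot \Pr\big(y_{n+1}^i \mid \mathbf{x}_{n+1}^i\big) = \frac{\prod_{i=1}^{M} \Pr\big(y_{n+1}^i \mid \mathbf{y}_n^{i,(\kappa_i)}, \langle \mathbf{y}_n^{ij,(\kappa_{ij})} \rangle_j\big)}{\Pr\big(\mathbf{x}_n \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)}. \tag{12}$$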
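The delay vectors that appear on the right-hand side of Lemma 1 are built from the observed time series according to the embedding (11). A minimal sketch follows; the function names `delay_vector` and `delay_embedding` are hypothetical, and the lag and embedding dimension are assumed to be given rather than estimated.

```python
def delay_vector(series, n, kappa, tau):
    """Return <y_n, y_{n-tau}, ..., y_{n-(kappa-1)tau}> for one subsystem."""
    if kappa <= 0 or tau <= 0:
        raise ValueError("kappa and tau must be strictly positive")
    if n - (kappa - 1) * tau < 0 or n >= len(series):
        raise IndexError("not enough history to embed at index n")
    return tuple(series[n - k * tau] for k in range(kappa))


def delay_embedding(series, kappa, tau):
    """Delay vectors for every index that has a complete history."""
    start = (kappa - 1) * tau
    return [delay_vector(series, n, kappa, tau)
            for n in range(start, len(series))]


# A scalar observation sequence y^i_0, ..., y^i_9 for one subsystem.
y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
vectors = delay_embedding(y, kappa=3, tau=2)
```

Each subsystem may use its own $\kappa_i$ and $\tau_i$, so the embedding would be applied once per observed series.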
Next, we present a method for computing the joint distribution (numerator) in Lemma 3. For convenience, Lemma 2 restates part of the delay embedding theorem in [50] in terms of subsystems of a synchronous GDS and establishes existence of a map $\mathbf{G}$ for predicting future observations from a history of observations.
Lemma 2.
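Consider a diffeomorphism $f : \mathcal{M} \to \mathcal{M}$ on a $d$-dimensional manifold $\mathcal{M}$, where the multivariate state $\mathbf{x}_n$ consists of $M$ subsystem states $\langle \mathbf{x}_n^1, \ldots, \mathbf{x}_n^M \rangle$. Each subsystem state $\mathbf{x}_n^i$ is confined to a submanifold $\mathcal{M}^i \subseteq \mathcal{M}$ of dimension $d_i \le d$, where $\sum_i d_i = d$. The multivariate observation is given, for some map $\mathbf{G}$, by $\mathbf{y}_{n+1} = \mathbf{G}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)$.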
Proof. 
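The proof restates part of the proof of Theorem 2 of Deyle and Sugihara [50] in terms of subsystems.
Given $M$ inhomogeneous observation functions $\{\psi^i\}$, the following map
$$\Phi_{f,\psi}(\mathbf{x}) = \big\langle \Phi_{f^1,\psi^1}(\mathbf{x}), \Phi_{f^2,\psi^2}(\mathbf{x}), \ldots, \Phi_{f^M,\psi^M}(\mathbf{x}) \big\rangle \tag{13}$$
is an embedding, where each subsystem (local) map $\Phi_{f^i,\psi^i} : \mathcal{M} \to \mathbb{R}^{\kappa_i}$ is smooth (at least $C^2$) and, at time index $n$, is described by
$$\Phi_{f^i,\psi^i}(\mathbf{x}_n) = \big\langle \psi^i(\mathbf{x}_n), \psi^i(\mathbf{x}_{n-\tau_i}), \ldots, \psi^i(\mathbf{x}_{n-(\kappa_i-1)\tau_i}) \big\rangle = \mathbf{y}_n^{i,(\kappa_i)}, \tag{14}$$
where $\sum_i \kappa_i = 2d + 1$ [50]. Note that, from (13) and (14), we have the global map
$$\Phi_{f,\psi}(\mathbf{x}_n) = \big\langle \mathbf{y}_n^{i,(\kappa_i)} \big\rangle = \big\langle \mathbf{y}_n^{1,(\kappa_1)}, \ldots, \mathbf{y}_n^{M,(\kappa_M)} \big\rangle.$$
Now, since $\Phi_{f,\psi}$ is an embedding, it follows that the map $\mathbf{F} = \Phi_{f,\psi} \circ f \circ \Phi_{f,\psi}^{-1}$ is well defined and a diffeomorphism between two observation sequences $\mathbf{F} : \mathbb{R}^{2d+1} \to \mathbb{R}^{2d+1}$, i.e.,
$$\big\langle \mathbf{y}_{n+1}^{i,(\kappa_i)} \big\rangle = \Phi_{f,\psi}(\mathbf{x}_{n+1}) = \Phi_{f,\psi}\big(f(\mathbf{x}_n)\big) = \Phi_{f,\psi}\Big(f\big(\Phi_{f,\psi}^{-1}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)\big)\Big) = \mathbf{F}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big).$$
The last $2d + 1$ components of $\mathbf{F}$ are trivial, i.e., the set $\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle$ is observed; denote the first $M$ components by $\mathbf{G} : \Phi_{f,\psi} \to \mathbb{R}^M$, and then we have $\mathbf{y}_{n+1} = \mathbf{G}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)$. ☐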
We now use the result of Lemma 2 to obtain a computable form of the KL divergence.
Lemma 3.
Consider a discrete-time multivariate dynamical system with generic $(f, \psi)$ modelled as a directed and acyclic synchronous GDS $(G, \mathbf{x}_n, \mathbf{y}_n, \{f^i\}, \{\psi^i\})$ with $M$ subsystems. The KL divergence of a candidate graph $G$ from the observed dataset $D$ can be computed from tractable probability distributions:
$$D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big) = \mathbb{E}\left[\log \frac{\Pr\big(\mathbf{y}_{n+1} \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)}{\prod_{i=1}^{M} \Pr\big(y_{n+1}^i \mid \mathbf{y}_n^{i,(\kappa_i)}, \langle \mathbf{y}_n^{ij,(\kappa_{ij})} \rangle_j\big)}\right]. \tag{15}$$
Proof. 
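By Lemma 1, we can substitute (12) into (9), and express the KL divergence $D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big)$ as
$$D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big) = \mathbb{E}\left[\log \frac{\Pr\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big) \cdot \Pr\big(\mathbf{x}_n \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)}{\prod_{i=1}^{M} \Pr\big(y_{n+1}^i \mid \mathbf{y}_n^{i,(\kappa_i)}, \langle \mathbf{y}_n^{ij,(\kappa_{ij})} \rangle_j\big)}\right]. \tag{16}$$
We now focus on $p_{K_M}\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big)$. Using the chain rule,
$$p_{K_M}\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big) = \Pr\big(\mathbf{x}_{n+1} \mid \mathbf{z}_n^{(n)}\big) \cdot \Pr\big(\mathbf{y}_{n+1} \mid \mathbf{x}_{n+1}, \mathbf{z}_n^{(n)}\big).$$
Given the Markov property of the dynamics (1) and observation (2) maps, we get
$$p_{K_M}\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big) = \Pr\big(\mathbf{X}_{n+1} = f(\mathbf{x}_n) \mid \mathbf{x}_n\big) \cdot \Pr\big(\mathbf{Y}_{n+1} = \psi(\mathbf{x}_{n+1}) \mid \mathbf{x}_{n+1}\big). \tag{17}$$
Now, recall from Lemma 2 that the global equations for the entire system state $\mathbf{x}_n$ and observation $\mathbf{y}_n$ are
$$\mathbf{x}_{n+1} = f(\mathbf{x}_n) + \upsilon_f = f\big(\Phi_{f,\psi}^{-1}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)\big) + \upsilon_f, \tag{18}$$
$$\mathbf{y}_{n+1} = \psi(\mathbf{x}_{n+1}) + \upsilon_\psi = \mathbf{G}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) + \upsilon_\psi. \tag{19}$$
Given the assumption of i.i.d. noise on the function $f$, from (18), we express the probability of the dynamics $\mathbf{x}_{n+1}$, given by the embedding, as
$$\begin{aligned} \Pr\big(\mathbf{x}_{n+1} \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) &= \Pr\big(\mathbf{X}_{n+1} = f\big(\Phi_{f,\psi}^{-1}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)\big) \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) \\ &= \Pr\big(\mathbf{X}_n = \Phi_{f,\psi}^{-1}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) \cdot \Pr\big(\mathbf{X}_{n+1} = f(\mathbf{x}_n) \mid \mathbf{x}_n\big). \end{aligned} \tag{20}$$
By assumption, the observation noise is i.i.d. or dependent only on the state $\mathbf{x}_{n+1}$, and thus the probability of observing $\mathbf{y}_{n+1}$, from (19), is
$$\begin{aligned} \Pr\big(\mathbf{y}_{n+1} \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) &= \Pr\big(\mathbf{Y}_{n+1} = \mathbf{G}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) \\ &= \Pr\big(\mathbf{X}_{n+1} = f\big(\Phi_{f,\psi}^{-1}\big(\langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)\big) \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big) \times \Pr\big(\mathbf{Y}_{n+1} = \psi(\mathbf{x}_{n+1}) \mid \mathbf{x}_{n+1}\big). \end{aligned} \tag{21}$$
By (20) and (21), we have that
$$\Pr\big(\mathbf{x}_{n+1} \mid \mathbf{x}_n\big) \cdot \Pr\big(\mathbf{y}_{n+1} \mid \mathbf{x}_{n+1}\big) = \frac{\Pr\big(\mathbf{y}_{n+1} \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)}{\Pr\big(\mathbf{x}_n \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)}. \tag{22}$$
Substituting Equation (22) into (17) gives
$$p_{K_M}\big(\mathbf{z}_{n+1} \mid \mathbf{z}_n^{(n)}\big) = \frac{\Pr\big(\mathbf{y}_{n+1} \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)}{\Pr\big(\mathbf{x}_n \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)}. \tag{23}$$
Finally, substituting (23) back into (16) cancels the term $\Pr\big(\mathbf{x}_n \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)$ and yields the statement of the lemma. ☐
Given that all variables in (15) are observed, it is now straightforward to compute KL divergence; however, as we will see, it is more convenient to express (15) as a function of known information-theoretic measures.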

4.2. Information-Theoretic Interpretation

The main theorem of this paper states KL divergence in terms of transfer entropy and stochastic interaction. These information-theoretic concepts are defined in Appendix B for convenience.
Theorem 4.
Consider a discrete-time multivariate dynamical system with generic $(f, \psi)$ represented as a directed and acyclic synchronous GDS $(G, \mathbf{x}_n, \mathbf{y}_n, \{f^i\}, \{\psi^i\})$ with $M$ subsystems. The KL divergence $D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big)$ of a candidate graph $G$ from the observed dataset $D$ can be expressed as the difference between stochastic interaction (A9) and collective transfer entropy (A8), i.e.,
$$D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big) = S_{\mathbf{Y}} - \sum_{i=1}^{M} T_{\{Y^{ij}\}_j \to Y^i}. \tag{24}$$
Proof. 
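We can reformulate the KL divergence in (15) as
$$\begin{aligned} D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big) &= \mathbb{E}\big[\log \Pr\big(\mathbf{y}_{n+1} \mid \langle \mathbf{y}_n^{i,(\kappa_i)} \rangle\big)\big] - \mathbb{E}\Big[\log \prod_{i=1}^{M} \Pr\big(y_{n+1}^i \mid \mathbf{y}_n^{i,(\kappa_i)}, \langle \mathbf{y}_n^{ij,(\kappa_{ij})} \rangle_j\big)\Big] \\ &= -H\big(\mathbf{Y}_{n+1} \mid \{\mathbf{Y}_n^{(\kappa_i)}\}\big) + \sum_{i=1}^{M} H\big(Y_{n+1}^i \mid \mathbf{Y}_n^{i,(\kappa_i)}, \{\mathbf{Y}_n^{ij,(\kappa_{ij})}\}_j\big) \\ &= -H\big(\mathbf{Y}_{n+1} \mid \{\mathbf{Y}_n^{(\kappa_i)}\}\big) + \sum_{i=1}^{M} H\big(Y_{n+1}^i \mid \mathbf{Y}_n^{i,(\kappa_i)}\big) \\ &\quad + \sum_{i=1}^{M} \Big[ H\big(Y_{n+1}^i \mid \mathbf{Y}_n^{i,(\kappa_i)}, \{\mathbf{Y}_n^{ij,(\kappa_{ij})}\}_j\big) - H\big(Y_{n+1}^i \mid \mathbf{Y}_n^{i,(\kappa_i)}\big) \Big]. \end{aligned} \tag{25}$$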
Substituting in the definitions of transfer entropy (A8) and stochastic interaction (A9) completes the proof. ☐
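The identity in Theorem 4 can be checked numerically on a small discrete example. The sketch below is an illustration under strong assumptions that are not part of the derivation: binary observations, embedding parameters fixed to $\kappa_i = \tau_i = 1$, and plug-in (relative frequency) estimates of all distributions, with entropies in nats. It takes stochastic interaction and transfer entropy to be the conditional entropy differences that appear in (25). All function names are hypothetical.

```python
import math
import random
from collections import Counter


def cond_entropy(pairs):
    """Plug-in estimate of H(A | B) in nats from (a, b) samples."""
    joint, marg, n = Counter(pairs), Counter(b for _, b in pairs), len(pairs)
    return -sum(c / n * math.log(c / marg[b]) for (_, b), c in joint.items())


def samples(data, i, parents):
    """Pairs (y^i_{n+1}, (y^i_n, parent observations at n))."""
    return [(nxt[i], (cur[i],) + tuple(cur[j] for j in parents))
            for cur, nxt in zip(data, data[1:])]


def joint_samples(data):
    return [(tuple(nxt), tuple(cur)) for cur, nxt in zip(data, data[1:])]


def transfer_entropy(data, i, parents):
    return (cond_entropy(samples(data, i, []))
            - cond_entropy(samples(data, i, parents)))


def stochastic_interaction(data):
    m = len(data[0])
    return (sum(cond_entropy(samples(data, i, [])) for i in range(m))
            - cond_entropy(joint_samples(data)))


def kl_divergence(data, parent_sets):
    """Direct plug-in estimate of the expectation in (15)."""
    joint = joint_samples(data)
    jc, jm = Counter(joint), Counter(b for _, b in joint)
    local = []
    for i, parents in enumerate(parent_sets):
        s = samples(data, i, parents)
        local.append((s, Counter(s), Counter(b for _, b in s)))
    total = 0.0
    for k, (a, b) in enumerate(joint):
        log_num = math.log(jc[(a, b)] / jm[b])
        log_den = sum(math.log(c[s[k]] / mg[s[k][1]]) for s, c, mg in local)
        total += log_num - log_den
    return total / len(joint)


def toy_data(n, seed=0):
    """Binary toy series: subsystem 1 noisily copies subsystem 0."""
    rng = random.Random(seed)
    y = [[rng.randint(0, 1), rng.randint(0, 1)]]
    for _ in range(n - 1):
        a, _b = y[-1]
        a_next = a if rng.random() < 0.7 else 1 - a
        b_next = a if rng.random() < 0.8 else rng.randint(0, 1)
        y.append([a_next, b_next])
    return y


data = toy_data(2000)
parent_sets = [[], [0]]
lhs = kl_divergence(data, parent_sets)
rhs = stochastic_interaction(data) - sum(
    transfer_entropy(data, i, p) for i, p in enumerate(parent_sets))
```

Under these assumptions `lhs` and `rhs` agree up to floating-point error, because the plug-in estimates satisfy the same algebra as the exact quantities. For continuous observations, density estimators would be needed in place of the counts used here.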
To conclude this section, we present the following corollary showing that, when we assume a maximum or fixed embedding dimension $\kappa_i$ and time delay $\tau_i$, it suffices to maximise the collective transfer entropy alone in order to minimise KL divergence for a synchronous GDS.
Corollary 1.
Fix an embedding dimension $\kappa_i$ and time delay $\tau_i$ for each subsystem $V^i \in \mathcal{V}$. Then, the graph $G$ that minimises the KL divergence $D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big)$ is equivalent to the graph that maximises transfer entropy, i.e.,
$$\arg\min_{G \in \mathcal{G}} D_{\mathrm{KL}}\big(p_{K_M} \,\|\, p_B\big) = \arg\max_{G \in \mathcal{G}} \sum_{i=1}^{M} T_{\{Y^{ij}\}_j \to Y^i}. \tag{26}$$
Proof. 
The first term of (24) is constant, given a constant vertex set $\mathcal{V}$, time delay $\tau$ and embedding dimension $\kappa$, and is thus unaffected by the parent set $\Pi_G(V^i)$ of a variable. As a result, $S_{\mathbf{Y}}$ does not depend on the graph $G$ being considered, and, therefore, we only need to consider transfer entropy when optimising KL divergence (24). ☐
As mentioned above, Corollary 1 is, in practice, equivalent to the maximum log-likelihood (5) and log-likelihood ratio (6) approaches. However, the statement only holds for constant embedding parameters. In the general case, where these parameters are unknown, one requires Theorem 4 to perform structure learning. Given this result, we can now confidently derive scoring functions from Corollary 1.

5. Application to Structure Learning

We now employ the results above in selecting a synchronous GDS that best fits data generated by a multivariate dynamical system. The most natural way to find an optimal model based on Theorem 4 is to minimise KL divergence. Here, we assume constant embedding parameters and use Corollary 1 to present the transfer entropy score and discuss some attributes of this score. We then use this scoring function as a subroutine for learning the structure of coupled Lorenz and Rössler attractors.
From Corollary 1, a naive scoring function can be defined as
$$g_{TE}(B : D) = \sum_{i=1}^{M} T_{\{Y^{ij}\}_j \to Y^i}. \tag{27}$$
Given parameterised probability distributions, this score is insufficient, since the sum of transfer entropy in (27) is non-decreasing when including more parents in the graph [38]. Thus, we use statistical significance tests in our scoring functions to mitigate this issue.
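The non-decreasing behaviour of the naive score can be seen on a small example. The sketch below assumes binary observations, $\kappa_i = \tau_i = 1$, and plug-in estimates in nats; the data generator and function names are hypothetical. Subsystem 1 is driven by subsystem 0 only, while subsystem 2 is independent noise, yet adding subsystem 2 as a spurious parent never lowers the estimated transfer entropy.

```python
import math
import random
from collections import Counter


def cond_entropy(pairs):
    """Plug-in estimate of H(A | B) in nats from (a, b) samples."""
    joint, marg, n = Counter(pairs), Counter(b for _, b in pairs), len(pairs)
    return -sum(c / n * math.log(c / marg[b]) for (_, b), c in joint.items())


def transfer_entropy(data, target, parents):
    """Collective transfer entropy from the parents to the target."""
    steps = list(zip(data, data[1:]))
    own = [(nxt[target], (cur[target],)) for cur, nxt in steps]
    full = [(nxt[target], (cur[target],) + tuple(cur[j] for j in parents))
            for cur, nxt in steps]
    return cond_entropy(own) - cond_entropy(full)


rng = random.Random(1)
data = [[0, 0, 0]]
for _ in range(999):
    a = data[-1][0]
    data.append([
        rng.randint(0, 1),                               # subsystem 0: noise
        a if rng.random() < 0.8 else rng.randint(0, 1),  # subsystem 1: driven by 0
        rng.randint(0, 1),                               # subsystem 2: unrelated
    ])

te_true = transfer_entropy(data, 1, [0])       # true parent set
te_extra = transfer_entropy(data, 1, [0, 2])   # spurious extra parent
```

Because conditioning on additional variables cannot increase a plug-in conditional entropy, `te_extra` is at least as large as `te_true`, which is why an unpenalised score favours densely connected graphs.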

5.1. Penalising Transfer Entropy by Independence Tests

Building on the maximum likelihood score (27), we propose using independence tests to define two new scores of practical value. Here, we draw on the result of de Campos [3], who derived a scoring function for BN structure learning based on conditional mutual information and statistical significance tests, called MIT. The central idea is to use collective transfer entropy $T_{\{Y^{ij}\}_j \to Y^i}$ to measure the degree of interaction between each subsystem $V^i$ and its parent subsystems $\Pi_G(V^i)$, but also to penalise this term with a value based on significance testing. As with the MIT score, this gives a principled way to re-scale the transfer entropy when including more edges in the graph.
To develop our scores, we form a null hypothesis $H_0$ that there is no interaction, i.e., $T_{\{Y^{ij}\}_j \to Y^i} = 0$, and then compute a test statistic to penalise the measured transfer entropy. To compute the test statistic, it is necessary to consider the measurement distribution in the case where the hypothesis is true. Unfortunately, this distribution is only analytically tractable in the case of discrete and linear-Gaussian systems, where $2N\,T_{\{Y^{ij}\}_j \to Y^i}$ is known to asymptotically approach the $\chi^2$-distribution [48]. Since this distribution is a function of the parents of $Y^i$, we let it be described by the function $\chi^2(\{l_{ij}\}_j)$. Now, given this distribution, we can fix some confidence level $\alpha$ and determine the value $\chi_{\alpha,\{l_{ij}\}_j}$ such that $p(\chi^2(\{l_{ij}\}_j) \leq \chi_{\alpha,\{l_{ij}\}_j}) = \alpha$. This represents a conditional independence test: if $2N\,T_{\{Y^{ij}\}_j \to Y^i} \leq \chi_{\alpha,\{l_{ij}\}_j}$, then we accept the hypothesis of conditional independence between $Y^i$ and $\{Y^{ij}\}_j$; otherwise, we reject it. We express this idea as the TEA score:
$$g_{TEA}(B : D) = \sum_{i=1}^{M} \left( 2N\,T_{\{Y^{ij}\}_j \to Y^i} - \chi_{\alpha,\{l_{ij}\}_j} \right). \tag{28}$$
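As a minimal sketch of how the TEA score could be evaluated, assume the transfer entropy estimates (in nats) and the degrees of freedom of each $\chi^2$-distribution are already available. The function names, the `dofs` input, and the use of the Wilson–Hilferty approximation in place of an exact $\chi^2$ quantile are assumptions made for illustration and are not taken from the paper.

```python
from statistics import NormalDist


def chi2_critical_value(alpha, dof):
    """Approximate c such that P(chi2(dof) <= c) = alpha.

    Uses the Wilson-Hilferty cube-root approximation (an assumption of this
    sketch); an exact quantile function could be substituted.
    """
    z = NormalDist().inv_cdf(alpha)
    h = 2.0 / (9.0 * dof)
    return dof * (1.0 - h + z * h ** 0.5) ** 3


def tea_score(te_nats, dofs, n_samples, alpha=0.95):
    """Sum over subsystems of 2*N*T minus the chi-square critical value.

    te_nats[i] is the collective transfer entropy into subsystem i (nats) and
    dofs[i] is the hypothetical degrees of freedom of its null distribution.
    """
    return sum(
        2.0 * n_samples * te - chi2_critical_value(alpha, dof)
        for te, dof in zip(te_nats, dofs)
    )
```

With this form, a subsystem whose measured transfer entropy is indistinguishable from zero contributes a negative term, so adding its parents lowers the score.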
In general, we only have access to continuous measurements of dynamical systems, and so are limited by the discrete or linear-Gaussian assumption. We can, however, use surrogate measurements $T_{\{Y^{ij}\}_j^s \to Y^i}$ to empirically compute the distribution under the assumption of $H_0$ [52]. This same technique has been used by [38] to derive a greedy structure learning algorithm for effective network analysis. Here, $\{Y^{ij}\}_j^s$ are surrogate sets of variables for $\{Y^{ij}\}_j$, which have the same statistical properties as $\{Y^{ij}\}_j$, but the correlation between $\{Y^{ij}\}_j^s$ and $Y^i$ is removed. Let the distribution of these surrogate measurements be represented by some general function $T(s_i)$ where, for the discrete and linear-Gaussian systems, we could compute $T(s_i)$ analytically as an independent set of $\chi^2$-distributions $\chi^2(\{l_{ij}\}_j)$. When no analytic distribution is known, we use a resampling method (i.e., permutation or bootstrapping), creating a large number of surrogate time-series pairs $\{\{Y^{ij}\}_j^s, Y^i\}$ by shuffling (for permutations, or redrawing for bootstrapping) the samples of $Y^i$ and computing a population of $T_{\{Y^{ij}\}_j^s \to Y^i}$. As with the TEA score, we fix some confidence level $\alpha$ and determine the value $T_{\alpha,s_i}$, such that $p(T(s_i) \leq T_{\alpha,s_i}) = \alpha$. This results in the TEE scoring function as
$$g_{TEE}(B : D) = \sum_{i=1}^{M} \left( T_{\{Y^{ij}\}_j \to Y^i} - T_{\alpha,s_i} \right). \tag{29}$$
We can obtain the value $T_{\alpha,s_i}$ by (1) drawing $S$ samples $T_{\{Y^{ij}\}_j^s \to Y^i}$ from the distribution $T(s_i)$ (by permutation or bootstrapping), (2) fixing $\alpha \in \{0, 1/S, 2/S, \ldots, 1\}$, and then (3) taking $T_{\alpha,s_i}$ such that
$$\alpha = \frac{1}{S} \sum_{T_{\{Y^{ij}\}_j^s \to Y^i}} \mathbf{1}\!\left[ T_{\{Y^{ij}\}_j^s \to Y^i} \leq T_{\alpha,s_i} \right].$$
We can alternatively limit the number of surrogates $S$ to $\alpha / (1 - \alpha)$ and take the maximum as $T_{\alpha,s_i}$ [22]; however, taking a larger number of surrogates will improve the validity of the distribution $T(s_i)$.
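The three steps above can be sketched as follows, assuming a population of surrogate transfer entropy values has already been computed for each subsystem (for example, by permutation). The function names and the list-based inputs are hypothetical; the threshold is taken as the smallest surrogate value whose empirical cumulative fraction reaches $\alpha$.

```python
import math


def surrogate_threshold(surrogate_te, alpha):
    """Empirical T_alpha: smallest surrogate value t such that the fraction
    of surrogates less than or equal to t is at least alpha."""
    ordered = sorted(surrogate_te)
    s = len(ordered)
    # Number of surrogates that must lie at or below the threshold; the small
    # offset guards against floating-point round-up of alpha * s.
    k = max(1, math.ceil(alpha * s - 1e-12))
    return ordered[k - 1]


def tee_score(measured_te, surrogate_sets, alpha=0.95):
    """Sum over subsystems of measured transfer entropy minus its threshold."""
    return sum(
        te - surrogate_threshold(surrogates, alpha)
        for te, surrogates in zip(measured_te, surrogate_sets)
    )
```

Setting `alpha=1.0` reduces the threshold to the maximum surrogate value, which corresponds to the limited-surrogate variant mentioned above.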
Both the analytical (TEA) and empirical (TEE) scoring functions are illustrated in Figure 3. Note that the approach of significance testing is functionally equivalent to considering the log-likelihood ratio in (6), where, as stated, nested log-likelihoods (and thus transfer entropy) follow the above $\chi^2$-distribution [48].

5.2. Implementation Details and Algorithm Analysis

The two main implementation challenges that arise when performing structure learning are: (1) computing the score for every candidate network and (2) obtaining a sufficient number of samples to recover the network. The main contributions of this work are theoretical justifications for measures already in use and, fortunately, algorithmic performance has already been addressed extensively using various heuristics. Here, we present an exact, exhaustive implementation for the purpose of validating our theoretical contributions.
First, for computing collective transfer entropy for the score (29), we require CPDs to be estimated from data. Given these CPDs, collective transfer entropy (A8) decomposes as a sum of $p$ conditional transfer entropy (A7) terms, where $p = |\{Y^{ij}\}_j|$ is the size of the parent set (see Appendix B for details). Since most observations of dynamical systems are expected to be continuous, we employ a non-parametric, nearest-neighbour based approach to density estimation called the Kraskov–Stögbauer–Grassberger (KSG) estimator [43]. For any arbitrary decomposition of collective transfer entropy (i.e., any ordering of the parent set), this density estimation can be computed in time $O(\kappa (p+1) K N^{\kappa (p+1)} \log(N))$, where $K$ is the number of nearest neighbours for each observation in a dataset of size $N$, and $\kappa$ is the embedding dimension [52]. We upper bound this as $O(\kappa M K N^{\kappa M} \log(N))$ since the maximum $p$ is $M - 1$.
Now, the above density estimation was described for an arbitrary ordering of the parent set. In the case of parametric (discrete or linear-Gaussian) density estimation, every permutation of the parent set yields equivalent results, with potentially different $\chi_{\alpha,\{l_{ij}\}_j}$ values for each permutation [3]; however, this is not the case for non-parametric density estimation techniques, e.g., the KSG estimator. Hence, as a conservative estimate of the score, we compute all $p!$ permutations of the parent set and take the minimum collective transfer entropy. In order to obtain the surrogate distribution, we require $S$ uncorrelated samples of the density. Since the surrogate distributions decompose in a similar manner, the score for a candidate network can be computed in time $O(S \cdot M! \cdot \kappa M K N^{\kappa M} \log(N))$, where, again, we have upper bounded $p!$ as $M!$.
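The conservative estimate described above, i.e., the minimum over all $p!$ orderings of the chain-rule decomposition (A8), can be sketched as follows. The callable `cond_te` stands in for any conditional transfer entropy estimator (such as a KSG-based one); its signature and the function names are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def collective_te(ordered_parents, target, cond_te):
    """Chain-rule sum: each parent's transfer entropy to the target,
    conditioned on the parents that precede it in the given ordering."""
    total = 0.0
    for k, source in enumerate(ordered_parents):
        total += cond_te(source, target, tuple(ordered_parents[:k]))
    return total


def conservative_collective_te(parents, target, cond_te):
    """Minimum collective transfer entropy over all orderings of the parents.

    For an empty parent set there is a single (empty) ordering and the
    result is 0.0.
    """
    return min(
        collective_te(list(order), target, cond_te)
        for order in permutations(parents)
    )
```

Because every ordering is evaluated, the cost grows as $p!$ estimator calls per parent set, which is why this sketch is only practical for the small networks considered here.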
Using this approach, we can now compute the score (29), and thus the optimal graph $G^*$ can be found using any search procedure over DAGs. Exhaustive search, where all DAGs are enumerated, is typically intractable because the search space is super-exponential in the number of variables (about $2^{O(M^2)}$), and so heuristics are often applied for efficiency. We restrict our attention to a relatively small network (a maximum of $M = 4$ nodes) and thus we are able to employ the dynamic programming (DP) approach of Silander and Myllymäki [53] to search through the space of all DAGs efficiently. This approach requires first computing the scores for all local parent sets, i.e., $2^M$ scores. Once each score is calculated, the DP algorithm runs in time $O(M \cdot 2^{M-1})$ and the entire search procedure runs in time $O(M \cdot 2^{M-1} + 2^M \cdot S \cdot M! \cdot \kappa M K N^{\kappa M} \log(N))$. As a consequence, the time complexity of the exhaustive algorithm is dominated by computing the $2^M$ scores and, in smaller networks, most of the time is spent on density estimation for surrogate distributions.
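To give a sense of the super-exponential search space mentioned above, the number of labelled DAGs on $M$ nodes can be counted with Robinson's recurrence. This is an illustrative aside rather than part of the authors' algorithm; the function name is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(m):
    """Number of labelled DAGs on m nodes (Robinson's recurrence)."""
    if m == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * count_dags(m - k)
        for k in range(1, m + 1)
    )


for m in range(1, 5):
    # Compare the size of the DAG space with the number of parent subsets 2^m.
    print(m, count_dags(m), 2 ** m)
```

For $M = 4$ there are 543 DAGs, so exhaustive enumeration is still feasible at the network sizes studied here, while the count grows rapidly for larger $M$.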
Finally, the problem of inferring optimal embedding parameters is well studied in the literature. In our experimental evaluation, we set the embedding dimension to the maximum, i.e., $\kappa = 2d + 1$, where $d$ is the dimensionality of the entire latent state space (e.g., if $M = 3$ and $d^i = 3$ for each subsystem, then $\kappa = 2\sum_i d^i + 1 = 19$). However, determining these parameters would give more insight into the system and reduce the number of samples required for inference. There are numerous criteria for optimising these parameters (e.g., [54]); most notably, the work of [55] suggests an information-theoretic approach that could be integrated into the scoring function (29) to search over the embedding parameters and DAG space simultaneously.

6. Experimental Validation

The dynamics (1) and observation (2) maps can be obtained by either differential equations, discrete-time maps, or real-world measurements. To validate our approach, we use the toy example of distributed flows, whereby the dynamics of each node are given by either the Lorenz [56] or the Rössler system of ODEs [57]. The discrete-time measurements are obtained by integrating these ODEs over constant intervals. In this section, we formally introduce this model, study the effect of changing the parameters of a coupled Lorenz–Rössler system, and finally apply our scoring function to learn the structure of up to four coupled Lorenz attractors with arbitrary graph topology. To compute the scores, we use the Java Information Dynamics Toolkit (JIDT) [52], which includes both the KSG estimator and methods for generating the surrogate distributions.

6.1. Distributed Lorenz and Rössler Attractors

For validating our scoring function, we study coupled Lorenz and Rössler attractors. The Lorenz attractor exhibits chaotic solutions for certain parameter values and has been used to describe numerous phenomena of practical interest [56,58,59]. Each Lorenz system comprises three components ($d^i = 3$), which we denote $x = \langle u, v, w \rangle$; the state dynamics are given by:
$$\dot{x} = g(x) = \begin{cases} \dot{u} = \sigma (v - u), \\ \dot{v} = u (\rho - w) - v, \\ \dot{w} = u v - \beta w, \end{cases} \tag{30}$$
with free parameters $\{\sigma, \rho, \beta\}$. Similarly, the Rössler attractor has state dynamics given by:
$$\dot{x} = g(x) = \begin{cases} \dot{u} = -v - w, \\ \dot{v} = u + a v, \\ \dot{w} = b + w (u - c), \end{cases} \tag{31}$$
with free parameters $\{a, b, c\}$ [57].
In the distributed case, the components of each state vector $x_t^i$ are also driven by components of another subsystem. A number of different schemes have been proposed for coupling these variables, e.g., using the product [21,60] and the difference [61,62] of components. Our model uses the latter approach of linear differencing between one or more subsystem variables to couple the network. Let $\lambda$ denote the coupling strength, $C$ denote a three-dimensional vector of binary values, and $A$ denote an adjacency (coupling) matrix (i.e., an $M \times M$ matrix of zeros with $A_{ij} = 1$ iff $V^i \in \Pi_G(V^j)$). Then, the state equations for $M$ spatially distributed systems can be expressed as
$$\dot{x}_t^i = g^i(x_t^i) + \nu_f + \lambda\, C \sum_{j=1}^{M} A_{ij}\, (x_t^j - x_t^i),$$
where $g^i(\cdot)$ represents the $i$-th chaotic attractor and $\nu_f$ is additive noise. In our simulations, we use $\lambda = 2$, $C = \langle 1, 0, 0 \rangle$ (each subsystem is coupled via variable $u$), and the adjacency matrices shown in Figure 4. In our experiments, we use common parameters for both attractors, i.e., $\sigma = 10$, $\beta = 8/3$, $\rho = 28$ and $a = 0.1$, $b = 0.1$, $c = 14$. For the observation $y_t^i$, it is common to use one component of the state as the read-out function [4,32,33]; we therefore let $y_t^i = u_t^i + \nu_\psi$. The noise terms are normally distributed with $\nu_f \sim \mathcal{N}(0, \sigma_f)$ and $\nu_\psi \sim \mathcal{N}(0, \sigma_\psi)$. Figure 1 illustrates example trajectories of Lorenz–Lorenz attractors coupled via this model.
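A minimal sketch of one integration step of the coupled model is given below for Lorenz subsystems. It uses a forward Euler step and omits the dynamics and observation noise terms; both simplifications, as well as the function names and argument layout, are assumptions of this sketch rather than the integration scheme used in the experiments. The adjacency matrix is indexed exactly as in the state equation above.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Lorenz vector field for a single subsystem state (u, v, w)."""
    u, v, w = state
    return (sigma * (v - u), u * (rho - w) - v, u * v - beta * w)


def euler_step(states, adjacency, dt, lam=2.0, coupling=(1, 0, 0)):
    """Advance all subsystems by one forward Euler step of size dt.

    states: list of (u, v, w) tuples, one per subsystem.
    adjacency: M x M nested list; entry [i][j] weights the pull of
    subsystem j on subsystem i, as written in the state equation.
    coupling: binary vector C selecting which components are coupled.
    """
    m = len(states)
    new_states = []
    for i in range(m):
        drift = lorenz(states[i])
        updated = []
        for c in range(3):
            pull = sum(
                adjacency[i][j] * (states[j][c] - states[i][c]) for j in range(m)
            )
            rate = drift[c] + lam * coupling[c] * pull
            updated.append(states[i][c] + dt * rate)
        new_states.append(tuple(updated))
    return new_states


def observe(states):
    """Noise-free read-out: the u component of each subsystem."""
    return [s[0] for s in states]
```

In practice, a higher-order integrator and explicit noise sampling would be needed to reproduce the reported setting; the sketch only shows how $\lambda$, $C$ and $A$ enter the update.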

6.2. Case Study: Coupled Lorenz–Rössler System

In order to characterise the effect of coupling on our score, we begin our evaluation by measuring the transfer entropy of a coupled Lorenz–Rössler attractor. In this setup, $M = 2$, $\Pi_G(V^1) = \emptyset$, and $\Pi_G(V^2) = V^1$; $g^1(x)$ is given by (30), and $g^2(x)$ is given by (31). The transfer entropy was computed with a finite sample size of $N = 100{,}000$.
Figure 5 shows the transfer entropy as a function of numerous parameters. In particular, the figure illustrates the effect of varying the coupling strength λ , embedding dimension κ , dynamics noise σ f , and observation noise σ ψ . As expected, increasing λ , or reducing either noise σ , increases the transfer entropy. The embedding dimension, however, increases to a set point, remains approximately constant, and then decreases. The κ -value above which transfer entropy remains constant illustrates the embedding dimension at which the dynamics are reconstructed; the decrease in transfer entropy after this point, however, is likely due to the finite sample size used for density estimation.
There are two interesting features in Figure 5 due to the dynamical systems studied. First, in the bottom row (Figure 5g–i), there is a bifurcation around κ = 6 . The theoretical embedding dimension for this system is κ = 2 ( d 1 + d 2 ) + 1 = 7 , and, in this case, for κ < 6 , the embedding does not suffice to reconstruct the dynamics. Second, in Figure 5i, the transfer entropy decreases after about λ = 2 . This appears to be the case of synchrony due to strong coupling, where the dynamics of the forced variable become subordinate to the forcing [4], thus reducing the information transferred between the two subsystems.

6.3. Case Study: Network of Lorenz Attractors

In this section, we evaluate the score (27) in learning the structure of distributed dynamical systems. We will look at systems of three and four nodes of coupled Lorenz subsystems with arbitrary topologies. Unfortunately, a significantly higher number of nodes becomes computationally expensive due to the increased embedding dimension $\kappa$, number of data points $N$, and number of permutations required to calculate the collective transfer entropy. To evaluate the performance of the score (27), the dynamics noise is held constant at $\sigma_f = 0.01$, whereas the observation noise $\sigma_\psi$ and the number of observations taken $N$ are varied. We selected the theoretical maximum embedding dimension $\kappa = 2d + 1$ and $\tau = 1$, as is common given discrete-time measurements [22]. It should be noted from the results of Section 6.2 that transfer entropy is sensitive to the numerous parameters used to generate the data, and thus, depending on the scenario, a significant sample size can be required for recovering the underlying graph structure. We do not make an effort to reduce this sample size and instead show the effect of using a different number of samples on the accuracy of the structure learning procedure.
In order to evaluate the scoring function, we compute the recall (R, or true positive rate), fallout (F, or false positive rate), and precision (P, or positive predictive value) of the recovered graph. Let TP denote the number of true positives (correct edges); TN denote the number of true negatives (correctly rejected edges); FP denote the number of false positives (incorrect edges); and FN denote the number of false negatives (incorrectly rejected edges). Then, $R = TP/(TP + FN)$, $F = FP/(FP + TN)$, and $P = TP/(TP + FP)$. Finally, the $F_1$-score is the harmonic mean of precision and recall and gives a measure of the test's accuracy, i.e., $F_1 = 2 \cdot R \cdot P/(R + P)$. Note that the ideal recall, precision and $F_1$-score are 1, and the ideal fallout is 0. Furthermore, a ratio of $R/F > 1$ suggests the classifier is better than random. As a summary statistic, Table 1 and Table 2 present the $F_1$-scores for all networks illustrated in Figure 4, and the full classification results (e.g., precision, recall, and fallout) are given in Appendix C. The $F_1$-scores are thus a measure of how relevant the recovered network is to the original (generating) network from our data-driven approach.
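The metrics defined above can be computed from a pair of adjacency matrices as in the following sketch. Counting every entry of the $M \times M$ matrix (including the diagonal) is an assumption of this sketch; it appears consistent with the fallout values reported in Appendix C, which are multiples of $1/9$ for the empty three-node graph. Undefined ratios are returned as `None`, and the function name is hypothetical.

```python
def edge_metrics(true_adj, est_adj):
    """Recall, fallout, precision and F1 of an estimated adjacency matrix."""
    tp = tn = fp = fn = 0
    for true_row, est_row in zip(true_adj, est_adj):
        for t, e in zip(true_row, est_row):
            if t and e:
                tp += 1
            elif t and not e:
                fn += 1
            elif e:
                fp += 1
            else:
                tn += 1

    def ratio(num, den):
        # A ratio with a zero denominator is undefined.
        return num / den if den else None

    recall = ratio(tp, tp + fn)
    fallout = ratio(fp, fp + tn)
    precision = ratio(tp, tp + fp)
    if recall is None or precision is None or recall + precision == 0:
        f1 = None
    else:
        f1 = 2 * recall * precision / (recall + precision)
    return {"R": recall, "F": fallout, "P": precision, "F1": f1}
```

For example, recovering both edges of a two-edge, three-node graph together with one spurious edge gives $R = 1$, $P = 2/3$, $F = 1/7$ and $F_1 = 0.8$.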
In general, the results of Table 1 and Table 2 show that the scoring function is capable of recovering the network with high precision and recall, as well as low fallout. In the tables, the cell colours are shaded to indicate higher (white) to lower (black) $F_1$-scores. The best performing score is that with a p-value of 0.01, and no penalisation (a p-value of $\infty$) has the second highest classification results. As expected, the graphs recovered from data with low observational noise ($\sigma_\psi = 1$) are more accurate than those inferred from noisier data ($\sigma_\psi = 10$). The results for three-node networks (shown in Table 1) yield mostly full recovery of the structure for a higher number of observations ($N \geq 75$K), whereas the four-node networks (shown in Table 2) are more difficult to classify.
Interestingly, the statistical significance testing does not have a strong effect on the results. It is unclear if this is due to the use of the non-parametric density estimators, which, in effect, are parsimonious in nature since transfer entropy will likely reduce when conditioning on more variables with a fixed sample size. One challenging case is the empty networks $G_1$ and $G_5$; this is shown in Appendix C, where the fallout is rarely 0 for any of the p-values or sample sizes (although a large number of observations, $N = 100$K, appears to reduce spurious edges). It would be expected that significance testing on these networks would outperform the naive score (27) given that a non-zero bias is introduced for a finite number of observations. Further investigation is required to understand why the null case fails.

7. Discussion and Future Work

We have presented a principled method to compute the KL divergence for model selection in distributed dynamical systems based on concepts from differential topology. The results presented in Figure 5 and Table 1 and Table 2 illustrate that this approach is suitable for recovering synchronous GDSs from data. Further, KL divergence is related to model encoding, which is a fundamental measure used in complex systems analysis. Our result, therefore, has potential implications for other areas of research. For example, the notion of equivalence classes in BN structure learning [63] should lend insight into the area of effective network analysis [35,36].
More specifically, the approach proposed here complements explicit Bayesian identification and comparison of state space models. In dynamic causal modelling (DCM), and more generally in approximate Bayesian inference, models are identified in terms of their parameters via an optimisation of an approximate posterior density over model parameters with respect to a variational (free energy) bound on log evidence [64]. After these parameters have been identified, this bound can be used directly for model comparison and selection. Interestingly, free energy is derived from the KL divergence between the approximate and true posterior and thus automatically penalises more complex models; however, in Equation (8), these distributions are inverted. In future work, it would be interesting to explore the relationship between transfer entropy and the variational free energy bound. Specifically, computing an evidence bound directly from the transfer entropy may allow us to avoid the significance testing described in Section 5 and instead use an approximation to evidence for structure learning.
Multivariate extensions to transfer entropy are known to eliminate redundant pairwise relationships and take into account the influence of confounding relationships in a network (i.e., synergistic effects) [65,66]. In this work, we have shown that this intuition holds for distributed dynamical systems when confined to a DAG topology. We conjecture that these methods are also applicable when cyclic dependencies exist within a graph, given any generic observation can be used in reconstructing the dynamics [50]; however, the methods presented are more likely to reveal one source in the cycle, rather than all information sources due to redundancy.
There are a number of extensions that should be considered for further practical implementations of this algorithm. Currently, we assume that the dimensionality of each subsystem is known, and thus we can bound the embedding dimension κ for recovering the hidden structure. However, this is generally infeasible in practice and a more general algorithm would infer the embedding dimension and time delay for an unknown system. Fortunately, there are numerous techniques to recover these parameters [54,55]. Furthermore, evaluating the quality of large graphs is infeasible with our current approach. However, our exact algorithm illustrates the feasibility of state space reconstruction in recovering a graph in practice. In the future, we aim to leverage the structure learning literature on reducing the search space and approximating scoring functions to produce more efficient algorithms.
Finally, the theoretical results of this work supplement understanding in fields where transfer entropy is commonly employed. Point processes are being increasingly viewed as models for a variety of information processing systems, e.g., as spiking neural trains [67] and adversaries in robotic patrolling models [68]. It was recently shown how transfer entropy can be computed for continuous time point processes such as these [67], allowing for efficient use of our analytical scoring function $g_{TEA}$ in a number of contexts. Another intriguing line of research is the physical and thermodynamic interpretation of transfer entropy [69], particularly its relationship to the arrow of time [70]; this relationship between endomorphisms as discussed here and time asymmetry of thermodynamics should be explored further.

Acknowledgments

This work was supported in part by the Australian Centre for Field Robotics; the New South Wales Government; and the Faculty of Engineering & Information Technologies, The University of Sydney, under the Faculty Research Cluster Program. Special thanks go to Jürgen Jost, Michael Small, Joseph Lizier, and Wolfram Martens for their useful discussions.

Author Contributions

O.C., M.P. and R.F. conceived and designed the experiments; O.C. performed the experiments; O.C. and M.P. analyzed the data; O.C., M.P., and R.F. wrote the paper. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Embedding Theory

We refer here to embedding theory as the study of inferring the (hidden) state $x_n \in \mathcal{M}$ of a dynamical system from a sequence of observations $y_n \in \mathbb{R}$. This section covers reconstruction theorems that define the conditions under which we can use delay embeddings for recovering the original dynamics $f$ from this observed time series.
In differential topology, a smooth map $\Phi : \mathcal{M} \to \mathcal{N}$ between manifolds $\mathcal{M}$ and $\mathcal{N}$ is an embedding if it maps $\mathcal{M}$ diffeomorphically onto its image. In Takens’ seminal work on turbulent flow [31], he proposed that a map $\Phi_{f,\psi} : \mathcal{M} \to \mathbb{R}^{\kappa}$, composed of delayed observations, can be used to reconstruct the dynamics for typical $(f, \psi)$. That is, for a fixed $\kappa$ (the embedding dimension) and $\tau$ (the time delay), the delay embedding map, given by
$$\Phi_{f,\psi}(x_n) = y_n^{(\kappa)} = \langle y_n, y_{n+\tau}, y_{n+2\tau}, \ldots, y_{n+(\kappa-1)\tau} \rangle, \tag{A1}$$
is an embedding. More formally, denote by $D^r(\mathcal{M}, \mathcal{M})$ the space of $C^r$-diffeomorphisms on $\mathcal{M}$ and by $C^r(\mathcal{M}, \mathbb{R})$ the space of $C^r$-functions on $\mathcal{M}$; then the theorem can be expressed as follows.
Theorem A1 (Delay Embedding Theorem for Diffeomorphisms [31]).
Let $\mathcal{M}$ be a compact manifold of dimension $d \geq 1$. If $\kappa \geq 2d + 1$ and $r \geq 1$, then there exists an open and dense set of $(f, \psi) \in D^r(\mathcal{M}, \mathcal{M}) \times C^r(\mathcal{M}, \mathbb{R})$ for which the map $\Phi_{f,\psi}$ is an embedding of $\mathcal{M}$ into $\mathbb{R}^{\kappa}$.
The implication of Theorem A1 is that, for typical $(f, \psi)$, the image $\Phi_{f,\psi}(\mathcal{M})$ of $\mathcal{M}$ under the delay embedding map $\Phi_{f,\psi}$ is completely equivalent to $\mathcal{M}$ itself, apart from the smooth invertible change of coordinates given by the mapping $\Phi_{f,\psi}$. An important consequence of this result is that we can define a map $F = \Phi_{f,\psi} \circ f \circ \Phi_{f,\psi}^{-1}$ on $\Phi_{f,\psi}(\mathcal{M})$, such that $y_{n+1}^{(\kappa)} = F(y_n^{(\kappa)})$ [44]. The open and dense set referred to in Theorem A1 is characterised by a number of technical assumptions. Denote by $(Df)_x$ the derivative of the function $f$ at a point $x$ in the domain of $f$. The set of periodic points $A$ of $f$ with period less than $\tau$ has finitely many points. In addition, the eigenvalues of $(Df)_x$ at each $x$ in a compact neighbourhood $A$ are distinct and not equal to 1.
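As a small illustration of the delay map (A1), the following sketch builds the delay vectors $y_n^{(\kappa)}$ from a scalar time series for a positive time delay. The function name and the list-of-tuples return type are choices made for this example only.

```python
def delay_embed(series, kappa, tau=1):
    """Return the delay vectors (y_n, y_{n+tau}, ..., y_{n+(kappa-1)tau})
    for every n at which the full vector is available."""
    span = (kappa - 1) * tau
    return [
        tuple(series[n + k * tau] for k in range(kappa))
        for n in range(len(series) - span)
    ]


# Ten observations embedded with kappa = 3 and tau = 2.
vectors = delay_embed(list(range(10)), kappa=3, tau=2)
print(vectors[0], vectors[-1], len(vectors))
```

Each delay vector uses $(\kappa - 1)\tau + 1$ consecutive samples, so a series of length $N$ yields $N - (\kappa - 1)\tau$ vectors; larger embedding parameters therefore reduce the number of usable samples.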
Theorem A1 was established for diffeomorphisms $D^r$; by definition, the dynamics are thus invertible in time. Thus, the time delay $\tau$ in (A1) can be either positive (delay lags) or negative (delay leads). Takens later proved a similar result for endomorphisms, i.e., non-invertible maps, which restricts the time delay to a negative integer. Denote by $E(\mathcal{M}, \mathcal{M})$ the space of $C^r$-endomorphisms on $\mathcal{M}$; then the reconstruction theorem for endomorphisms can be expressed as follows.
Theorem A2 (Delay Embedding Theorem for Endomorphisms [71]).
Let $\mathcal{M}$ be a compact $d$-dimensional manifold. If $\kappa \geq 2d + 1$ and $r \geq 1$, then there exists an open and dense set of $(f, \psi) \in E(\mathcal{M}, \mathcal{M}) \times C^r(\mathcal{M}, \mathbb{R})$ for which there is a map $\pi_\kappa : X_\kappa \to \mathcal{M}$ with $\pi_\kappa \circ \Phi_{f,\psi} = f^{\kappa - 1}$. Moreover, the map $\pi_\kappa$ has bounded expansion or is Lipschitz continuous.
As a result of Theorem A2, a sequence of $\kappa$ successive measurements from a system determines the system state at the end of the sequence of measurements [71]. That is, there exists an endomorphism $F = \Phi_{f,\psi} \circ f \circ \Phi_{f,\psi}^{-1}$ to predict the next observation if one takes a negative time (lead) delay $\tau$ in (A1).
In this work, we consider two important generalisations of the Delay Embedding Theorem A1. Both of these theorems follow similar proofs to the original and have thus been derived for diffeomorphisms, not endomorphisms. However, encouraging empirical results in [6] support the conjecture that they can both be generalised to the case of endomorphisms by taking a negative time delay, as is done in Theorem A2 above. This would allow for not only distributed flows that are used in our work, but endomorphic maps, e.g., the well-studied coupled map lattice structure [51].
The first generalisation is by Stark et al. [44] and deals with a skew-product system. That is, $f$ is now forced by some second, independent system $g : \mathcal{N} \to \mathcal{N}$. The dynamical system on $\mathcal{M} \times \mathcal{N}$ is thus given by the set of equations
$$x_{n+1} = f(x_n, \omega_n), \qquad \omega_{n+1} = g(\omega_n). \tag{A2}$$
In this case, the delay map is written as
$$\Phi_{f,g,\psi}(x, \omega) = \langle y_n, y_{n+\tau}, y_{n+2\tau}, \ldots, y_{n+(\kappa-1)\tau} \rangle, \tag{A3}$$
and the theorem can be expressed as follows.
Theorem A3 (Bundle Delay Embedding Theorem [44]).
Let $\mathcal{M}$ and $\mathcal{N}$ be compact manifolds of dimension $d \geq 1$ and $e$, respectively. Suppose that $\kappa \geq 2(d + e) + 1$ and the periodic orbits of period $d$ of $g \in D^r(\mathcal{N})$ are isolated and have distinct eigenvalues. Then, for $r \geq 1$, there exists an open and dense set of $(f, \psi) \in D^r(\mathcal{M} \times \mathcal{N}, \mathcal{M}) \times C^r(\mathcal{M}, \mathbb{R})$ for which the map $\Phi_{f,g,\psi}$ is an embedding of $\mathcal{M} \times \mathcal{N}$ into $\mathbb{R}^{\kappa}$.
Finally, all theorems up until now have assumed a single read-out function for the system in question. Recently, Sugihara et al. [4] showed that multivariate mappings also form an embedding, with minor changes to the technical assumptions underlying Takens’ original theorem. That is, given $M \leq 2d + 1$ different observation functions, the delay map can be written as
$$\Phi_{f,\langle \psi^i \rangle}(x) = \langle \Phi_{f,\psi^1}(x), \Phi_{f,\psi^2}(x), \ldots, \Phi_{f,\psi^M}(x) \rangle, \tag{A4}$$
where each delay map $\Phi_{f,\psi^i}$ is as per (A1) for individual embedding dimension $\kappa^i \leq \kappa$. The theorem can then be stated as follows.
Theorem A4 (Delay Embedding Theorem for Multivariate Observation Functions [50]).
Let $\mathcal{M}$ be a compact manifold of dimension $d \geq 1$. Consider a diffeomorphism $f \in D^r(\mathcal{M}, \mathcal{M})$ and a set of at most $2d + 1$ observation functions $\langle \psi^i \rangle$, where each $\psi^i \in C^r(\mathcal{M}, \mathbb{R})$ and $r \geq 2$. If $\sum_i \kappa^i \geq 2d + 1$, then, for generic $(f, \langle \psi^i \rangle)$, the map $\Phi_{f,\langle \psi^i \rangle}$ is an embedding.

Appendix B. Information Theory

In this section, we introduce some key concepts of information theory: conditional entropy; conditional and collective transfer entropy; and stochastic interaction.
Consider two arbitrary random variables $X$ and $Y$; the conditional entropy $H(X \mid Y)$ represents the uncertainty of $X$ after taking into account the outcomes of another random variable $Y$ by the equation
$$H(X \mid Y) = -\sum_{x,y} \Pr(x, y) \log \Pr(x \mid y) = -\mathbb{E}\left[ \log \Pr(x \mid y) \right]. \tag{A5}$$
Transfer entropy detects the directed exchange of information between random processes by marginalising out common history and static correlations between variables; it is thus considered a measure of information transfer within a system [25]. Let the processes $X$ and $Y$ have associated embedding dimensions $\kappa_X$ and $\kappa_Y$. The transfer entropy of $X$ to $Y$ is given in terms of conditional entropy:
$$T_{X \to Y} = H(Y_{n+1} \mid Y_n^{(\kappa_Y)}) - H(Y_{n+1} \mid X_n^{(\kappa_X)}, Y_n^{(\kappa_Y)}). \tag{A6}$$
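To make the definition concrete, the following sketch estimates transfer entropy for discrete-valued series with a simple plug-in (frequency count) estimator and embedding dimensions $\kappa_X = \kappa_Y = 1$. This is a swapped-in technique for illustration only: the paper uses the KSG estimator for continuous data, and the function names here are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Plug-in estimate of H(A | B) in bits from a list of (a, b) samples."""
    joint = Counter(pairs)
    marginal = Counter(b for _, b in pairs)
    n = len(pairs)
    return -sum(
        count / n * math.log2(count / marginal[b])
        for (_, b), count in joint.items()
    )


def transfer_entropy(x, y):
    """Plug-in T_{X->Y} = H(Y_{n+1} | Y_n) - H(Y_{n+1} | X_n, Y_n) in bits."""
    steps = range(len(y) - 1)
    h_self = conditional_entropy([(y[n + 1], y[n]) for n in steps])
    h_both = conditional_entropy([(y[n + 1], (x[n], y[n])) for n in steps])
    return h_self - h_both
```

If $Y$ simply copies the previous value of an independent fair binary source $X$, the estimate of $T_{X \to Y}$ is close to one bit, while $T_{Y \to X}$ is close to zero up to a small positive finite-sample bias.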
Now, given a third process $Z$ with embedding dimension $\kappa_Z$, we can compute the information transfer of $X$ to $Y$ in the context of $Z$ as:
$$T_{X \to Y \mid Z} = H(Y_{n+1} \mid Y_n^{(\kappa_Y)}, Z_n^{(\kappa_Z)}) - H(Y_{n+1} \mid X_n^{(\kappa_X)}, Y_n^{(\kappa_Y)}, Z_n^{(\kappa_Z)}). \tag{A7}$$
The collective transfer entropy computes the information transfer between a set of $M$ source processes and a single destination process [19]. Consider the set $Y = \{Y^i\}$ of source processes. We can compute the collective transfer entropy from $Y$ to the destination process $X$ as a function of conditional entropy (A5) terms:
$$T_{Y \to X} = T_{Y^1 \to X} + \sum_{i=2}^{M} T_{Y^i \to X \mid \{Y^1, \ldots, Y^{i-1}\}}, \tag{A8}$$
where the ordering of the source processes is arbitrary.
Stochastic interaction measures the complexity of dynamical systems by quantifying the excess of information processed, in time, by the system beyond the information processed by each of the nodes [17,18,72,73]. Using the same notation, stochastic interaction of the collection of processes $Y$ is
$$S_Y = -H(Y_{n+1} \mid \{Y_n^{i,(\kappa^i)}\}) + \sum_{i=1}^{M} H(Y_{n+1}^i \mid Y_n^{i,(\kappa^i)}). \tag{A9}$$
The standard definition assumes a first-order Markov process [17,18]; in (A9), we generalise stochastic interaction to arbitrary $\kappa$-order Markov chains.

Appendix C. Extended Results

Here, we present the extended results of Table 1 and Table 2. That is, we give the precision, recall, fallout, and $F_1$-scores for the eight networks of Lorenz attractors shown in Figure 4. These results are given for a number of different sample sizes to illustrate the sample complexity of this problem: $N = 5000$ (Table A1 and Table A2), $N = 10{,}000$ (Table A3 and Table A4), $N = 25{,}000$ (Table A5 and Table A6), $N = 50{,}000$ (Table A7 and Table A8), and $N = 100{,}000$ (Table A9 and Table A10). Each table has results for various p-values (with a p-value of $\infty$ denoting the maximum likelihood score (27)), as well as two different observation noise variances, $\sigma_\psi = 1$ and $\sigma_\psi = 10$.
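Table A1. Classification results for three-node ($M = 3$) networks for $N = 5000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | p = ∞, $\sigma_\psi$ = 1 | p = ∞, $\sigma_\psi$ = 10 | p = 0.01, $\sigma_\psi$ = 1 | p = 0.01, $\sigma_\psi$ = 10 | p = 0.001, $\sigma_\psi$ = 1 | p = 0.001, $\sigma_\psi$ = 10 | p = 0.0001, $\sigma_\psi$ = 1 | p = 0.0001, $\sigma_\psi$ = 10 |
|---|---|---|---|---|---|---|---|---|---|
| $G_1$ | R | - | - | - | - | - | - | - | - |
| $G_1$ | F | 0.33 | 0.22 | 0.33 | 0.22 | 0.22 | 0.33 | 0.33 | 0.22 |
| $G_1$ | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $G_1$ | $F_1$ | - | - | - | - | - | - | - | - |
| $G_2$ | R | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 |
| $G_2$ | F | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 |
| $G_2$ | P | 0.67 | 0.5 | 0.67 | 0.5 | 0.67 | 0.5 | 0.67 | 0.5 |
| $G_2$ | $F_1$ | 0.8 | 0.5 | 0.8 | 0.5 | 0.8 | 0.5 | 0.8 | 0.5 |
| $G_3$ | R | 1 | 0.5 | 1 | 1 | 1 | 1 | 1 | 0.5 |
| $G_3$ | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $G_3$ | P | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| $G_3$ | $F_1$ | 1 | 0.67 | 1 | 1 | 1 | 1 | 1 | 0.67 |
| $G_4$ | R | 1 | 0 | 1 | 1 | 1 | 0.5 | 1 | 0 |
| $G_4$ | F | 0.14 | 0.43 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.43 |
| $G_4$ | P | 0.67 | 0 | 0.67 | 0.67 | 0.67 | 0.5 | 0.67 | 0 |
| $G_4$ | $F_1$ | 0.8 | - | 0.8 | 0.8 | 0.8 | 0.5 | 0.8 | - |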
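Table A2. Classification results for four-node ($M = 4$) networks for $N = 5000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | p = ∞, $\sigma_\psi$ = 1 | p = ∞, $\sigma_\psi$ = 10 | p = 0.01, $\sigma_\psi$ = 1 | p = 0.01, $\sigma_\psi$ = 10 | p = 0.001, $\sigma_\psi$ = 1 | p = 0.001, $\sigma_\psi$ = 10 | p = 0.0001, $\sigma_\psi$ = 1 | p = 0.0001, $\sigma_\psi$ = 10 |
|---|---|---|---|---|---|---|---|---|---|
| $G_5$ | R | - | - | - | - | - | - | - | - |
| $G_5$ | F | 0.31 | 0.25 | 0.31 | 0.19 | 0.31 | 0.25 | 0.31 | 0.19 |
| $G_5$ | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| $G_5$ | $F_1$ | - | - | - | - | - | - | - | - |
| $G_6$ | R | 0.67 | 0.67 | 0.67 | 0.33 | 0.67 | 0.33 | 0.67 | 0 |
| $G_6$ | F | 0.15 | 0.23 | 0.15 | 0.23 | 0.15 | 0.23 | 0.15 | 0.31 |
| $G_6$ | P | 0.5 | 0.4 | 0.5 | 0.25 | 0.5 | 0.25 | 0.5 | 0 |
| $G_6$ | $F_1$ | 0.57 | 0.5 | 0.57 | 0.29 | 0.57 | 0.29 | 0.57 | - |
| $G_7$ | R | 1 | 0.25 | 1 | 0.25 | 0.75 | 0.25 | 0.75 | 0.5 |
| $G_7$ | F | 0 | 0.25 | 0 | 0.17 | 0.083 | 0.25 | 0.083 | 0.083 |
| $G_7$ | P | 1 | 0.25 | 1 | 0.33 | 0.75 | 0.25 | 0.75 | 0.67 |
| $G_7$ | $F_1$ | 1 | 0.25 | 1 | 0.29 | 0.75 | 0.25 | 0.75 | 0.57 |
| $G_8$ | R | 1 | 0.25 | 1 | 0.5 | 1 | 0.75 | 1 | 0.25 |
| $G_8$ | F | 0 | 0.25 | 0 | 0.083 | 0 | 0.083 | 0 | 0.25 |
| $G_8$ | P | 1 | 0.25 | 1 | 0.67 | 1 | 0.75 | 1 | 0.25 |
| $G_8$ | $F_1$ | 1 | 0.25 | 1 | 0.57 | 1 | 0.75 | 1 | 0.25 |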
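Table A3. Classification results for three-node ($M = 3$) networks for $N = 10{,}000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G1 | R | - | - | - | - | - | - | - | - |
| G1 | F | 0.22 | 0.11 | 0.22 | 0.11 | 0.22 | 0.22 | 0.22 | 0.11 |
| G1 | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G1 | $F_1$ | - | - | - | - | - | - | - | - |
| G2 | R | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 |
| G2 | F | 0 | 0.14 | 0 | 0.14 | 0 | 0.14 | 0 | 0.14 |
| G2 | P | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 |
| G2 | $F_1$ | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 |
| G3 | R | 1 | 0.5 | 1 | 1 | 1 | 0 | 1 | 0.5 |
| G3 | F | 0 | 0.14 | 0 | 0 | 0 | 0.29 | 0 | 0.14 |
| G3 | P | 1 | 0.5 | 1 | 1 | 1 | 0 | 1 | 0.5 |
| G3 | $F_1$ | 1 | 0.5 | 1 | 1 | 1 | - | 1 | 0.5 |
| G4 | R | 1 | 1 | 1 | 0.5 | 1 | 0.5 | 1 | 1 |
| G4 | F | 0.14 | 0.14 | 0 | 0 | 0.14 | 0.14 | 0.14 | 0.14 |
| G4 | P | 0.67 | 0.67 | 1 | 1 | 0.67 | 0.5 | 0.67 | 0.67 |
| G4 | $F_1$ | 0.8 | 0.8 | 1 | 0.67 | 0.8 | 0.5 | 0.8 | 0.8 |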
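Table A4. Classification results for four-node ($M = 4$) networks for $N = 10{,}000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G5 | R | - | - | - | - | - | - | - | - |
| G5 | F | 0.31 | 0.25 | 0.31 | 0.19 | 0.31 | 0.19 | 0.31 | 0.25 |
| G5 | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G5 | $F_1$ | - | - | - | - | - | - | - | - |
| G6 | R | 0.67 | 0.33 | 0.67 | 0 | 1 | 1 | 0.67 | 0.33 |
| G6 | F | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 |
| G6 | P | 0.5 | 0.33 | 0.5 | 0 | 0.6 | 0.6 | 0.5 | 0.33 |
| G6 | $F_1$ | 0.57 | 0.33 | 0.57 | - | 0.75 | 0.75 | 0.57 | 0.33 |
| G7 | R | 0.75 | 0.5 | 1 | 0.5 | 1 | 0.25 | 0.75 | 0.5 |
| G7 | F | 0.083 | 0.083 | 0 | 0.083 | 0 | 0.17 | 0.083 | 0.083 |
| G7 | P | 0.75 | 0.67 | 1 | 0.67 | 1 | 0.33 | 0.75 | 0.67 |
| G7 | $F_1$ | 0.75 | 0.57 | 1 | 0.57 | 1 | 0.29 | 0.75 | 0.57 |
| G8 | R | 1 | 0.25 | 1 | 0.25 | 1 | 0 | 1 | 0.25 |
| G8 | F | 0 | 0.17 | 0 | 0.17 | 0 | 0.25 | 0 | 0.17 |
| G8 | P | 1 | 0.33 | 1 | 0.33 | 1 | 0 | 1 | 0.33 |
| G8 | $F_1$ | 1 | 0.29 | 1 | 0.29 | 1 | - | 1 | 0.29 |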
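Table A5. Classification results for three-node ($M = 3$) networks for $N = 25{,}000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G1 | R | - | - | - | - | - | - | - | - |
| G1 | F | 0.22 | 0.11 | 0.22 | 0.11 | 0.22 | 0.22 | 0.22 | 0.11 |
| G1 | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G1 | $F_1$ | - | - | - | - | - | - | - | - |
| G2 | R | 1 | 1 | 1 | 0.5 | 1 | 0.5 | 1 | 1 |
| G2 | F | 0 | 0.14 | 0 | 0.14 | 0 | 0.14 | 0 | 0.14 |
| G2 | P | 1 | 0.67 | 1 | 0.5 | 1 | 0.5 | 1 | 0.67 |
| G2 | $F_1$ | 1 | 0.8 | 1 | 0.5 | 1 | 0.5 | 1 | 0.8 |
| G3 | R | 1 | 1 | 1 | 0.5 | 1 | 1 | 1 | 1 |
| G3 | F | 0 | 0 | 0 | 0.14 | 0 | 0 | 0 | 0 |
| G3 | P | 1 | 1 | 1 | 0.5 | 1 | 1 | 1 | 1 |
| G3 | $F_1$ | 1 | 1 | 1 | 0.5 | 1 | 1 | 1 | 1 |
| G4 | R | 1 | 1 | 1 | 1 | 1 | 0.5 | 1 | 1 |
| G4 | F | 0 | 0 | 0 | 0 | 0 | 0.14 | 0 | 0 |
| G4 | P | 1 | 1 | 1 | 1 | 1 | 0.5 | 1 | 1 |
| G4 | $F_1$ | 1 | 1 | 1 | 1 | 1 | 0.5 | 1 | 1 |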
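Table A6. Classification results for four-node ($M = 4$) networks for $N = 25{,}000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G5 | R | - | - | - | - | - | - | - | - |
| G5 | F | 0.31 | 0.19 | 0.31 | 0.19 | 0.31 | 0.19 | 0.31 | 0.19 |
| G5 | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G5 | $F_1$ | - | - | - | - | - | - | - | - |
| G6 | R | 1 | 0.33 | 1 | 0.33 | 1 | 0.33 | 1 | 0.33 |
| G6 | F | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.23 | 0.15 | 0.15 |
| G6 | P | 0.6 | 0.33 | 0.6 | 0.33 | 0.6 | 0.25 | 0.6 | 0.33 |
| G6 | $F_1$ | 0.75 | 0.33 | 0.75 | 0.33 | 0.75 | 0.29 | 0.75 | 0.33 |
| G7 | R | 1 | 0.5 | 1 | 0.75 | 1 | 0.75 | 1 | 0.5 |
| G7 | F | 0 | 0.17 | 0 | 0 | 0 | 0 | 0 | 0.17 |
| G7 | P | 1 | 0.5 | 1 | 1 | 1 | 1 | 1 | 0.5 |
| G7 | $F_1$ | 1 | 0.5 | 1 | 0.86 | 1 | 0.86 | 1 | 0.5 |
| G8 | R | 1 | 0.75 | 1 | 0.75 | 1 | 0.75 | 1 | 0.75 |
| G8 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G8 | P | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G8 | $F_1$ | 1 | 0.86 | 1 | 0.86 | 1 | 0.86 | 1 | 0.86 |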
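Table A7. Classification results for three-node ($M = 3$) networks with $N = 50{,}000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G1 | R | - | - | - | - | - | - | - | - |
| G1 | F | 0 | 0.11 | 0 | 0 | 0 | 0.11 | 0 | 0.22 |
| G1 | P | - | 0 | - | - | - | 0 | - | 0 |
| G1 | $F_1$ | - | - | - | - | - | - | - | - |
| G2 | R | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 |
| G2 | F | 0 | 0.14 | 0 | 0.14 | 0 | 0.14 | 0 | 0.14 |
| G2 | P | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 |
| G2 | $F_1$ | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 | 1 | 0.5 |
| G3 | R | 1 | 1 | 1 | 0.5 | 1 | 1 | 1 | 1 |
| G3 | F | 0 | 0.14 | 0 | 0.14 | 0 | 0.14 | 0 | 0 |
| G3 | P | 1 | 0.67 | 1 | 0.5 | 1 | 0.67 | 1 | 1 |
| G3 | $F_1$ | 1 | 0.8 | 1 | 0.5 | 1 | 0.8 | 1 | 1 |
| G4 | R | 1 | 0.5 | 1 | 1 | 1 | 0.5 | 1 | 1 |
| G4 | F | 0 | 0.14 | 0 | 0 | 0 | 0.14 | 0 | 0 |
| G4 | P | 1 | 0.5 | 1 | 1 | 1 | 0.5 | 1 | 1 |
| G4 | $F_1$ | 1 | 0.5 | 1 | 1 | 1 | 0.5 | 1 | 1 |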
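Table A8. Classification results for four-node ($M = 4$) networks with $N = 50{,}000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G5 | R | - | - | - | - | - | - | - | - |
| G5 | F | 0.19 | 0.062 | 0.19 | 0.19 | 0.19 | 0.12 | 0.19 | 0.12 |
| G5 | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G5 | $F_1$ | - | - | - | - | - | - | - | - |
| G6 | R | 1 | 0.33 | 1 | 0 | 1 | 0.33 | 1 | 0.33 |
| G6 | F | 0 | 0.15 | 0 | 0 | 0 | 0.23 | 0.15 | 0.15 |
| G6 | P | 1 | 0.33 | 1 | - | 1 | 0.25 | 0.6 | 0.33 |
| G6 | $F_1$ | 1 | 0.33 | 1 | - | 1 | 0.29 | 0.75 | 0.33 |
| G7 | R | 1 | 0.75 | 1 | 0.5 | 1 | 0.5 | 1 | 0.75 |
| G7 | F | 0 | 0 | 0 | 0.17 | 0 | 0.083 | 0 | 0 |
| G7 | P | 1 | 1 | 1 | 0.5 | 1 | 0.67 | 1 | 1 |
| G7 | $F_1$ | 1 | 0.86 | 1 | 0.5 | 1 | 0.57 | 1 | 0.86 |
| G8 | R | 1 | 0.75 | 1 | 0.75 | 1 | 0.75 | 1 | 0.75 |
| G8 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G8 | P | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G8 | $F_1$ | 1 | 0.86 | 1 | 0.86 | 1 | 0.86 | 1 | 0.86 |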
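Table A9. Classification results for three-node ($M = 3$) networks with $N = 100{,}000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G1 | R | - | - | - | - | - | - | - | - |
| G1 | F | 0 | 0.22 | 0 | 0.11 | 0 | 0.22 | 0 | 0.11 |
| G1 | P | - | 0 | - | 0 | - | 0 | - | 0 |
| G1 | $F_1$ | - | - | - | - | - | - | - | - |
| G2 | R | 1 | 0.5 | 1 | 1 | 1 | 1 | 1 | 1 |
| G2 | F | 0 | 0.14 | 0 | 0 | 0 | 0 | 0 | 0.14 |
| G2 | P | 1 | 0.5 | 1 | 1 | 1 | 1 | 1 | 0.67 |
| G2 | $F_1$ | 1 | 0.5 | 1 | 1 | 1 | 1 | 1 | 0.8 |
| G3 | R | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G3 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G3 | P | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G3 | $F_1$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G4 | R | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G4 | F | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G4 | P | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G4 | $F_1$ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |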
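Table A10. Classification results for four-node ($M = 4$) networks with $N = 100{,}000$ samples. We present the precision (P), recall (R), fallout (F), and $F_1$-score for the eight arbitrary topologies of coupled Lorenz systems represented by Figure 4.

| Graph | Metric | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G5 | R | - | - | - | - | - | - | - | - |
| G5 | F | 0.19 | 0.062 | 0.19 | 0.062 | 0.19 | 0.19 | 0.19 | 0.12 |
| G5 | P | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| G5 | $F_1$ | - | - | - | - | - | - | - | - |
| G6 | R | 1 | 0.33 | 1 | 0.67 | 1 | 0.33 | 1 | 0.33 |
| G6 | F | 0 | 0.15 | 0 | 0.15 | 0 | 0.077 | 0 | 0.15 |
| G6 | P | 1 | 0.33 | 1 | 0.5 | 1 | 0.5 | 1 | 0.33 |
| G6 | $F_1$ | 1 | 0.33 | 1 | 0.57 | 1 | 0.4 | 1 | 0.33 |
| G7 | R | 1 | - | 1 | - | 1 | - | 1 | - |
| G7 | F | 0 | - | 0 | - | 0 | - | 0 | - |
| G7 | P | 1 | - | 1 | - | 1 | - | 1 | - |
| G7 | $F_1$ | 1 | - | 1 | - | 1 | - | 1 | - |
| G8 | R | 1 | 0.75 | 1 | 0.75 | 1 | 0.5 | 1 | 0.75 |
| G8 | F | 0 | 0 | 0 | 0 | 0 | 0.083 | 0 | 0 |
| G8 | P | 1 | 1 | 1 | 1 | 1 | 0.67 | 1 | 1 |
| G8 | $F_1$ | 1 | 0.86 | 1 | 0.86 | 1 | 0.57 | 1 | 0.86 |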

References

  1. Akaike, H. Information theory and an extension of the maximum likelihood principle. In Proceedings of the Second International Symposium on Information Theory, Tsahkadsor, Armenia, USSR, 2–8 September 1971; pp. 267–281.
  2. Lam, W.; Bacchus, F. Learning Bayesian belief networks: An approach based on the MDL principle. Comput. Intell. 1994, 10, 269–293.
  3. de Campos, L.M. A Scoring Function for Learning Bayesian Networks Based on Mutual Information and Conditional Independence Tests. J. Mach. Learn. Res. 2006, 7, 2149–2187.
  4. Sugihara, G.; May, R.; Ye, H.; Hsieh, C.H.; Deyle, E.; Fogarty, M.; Munch, S. Detecting causality in complex ecosystems. Science 2012, 338, 496–500.
  5. Vicente, R.; Wibral, M.; Lindner, M.; Pipa, G. Transfer entropy—A model-free measure of effective connectivity for the neurosciences. J. Comput. Neurosci. 2011, 30, 45–67.
  6. Schumacher, J.; Wunderle, T.; Fries, P.; Jäkel, F.; Pipa, G. A statistical framework to infer delay and direction of information flow from measurements of complex systems. Neural Comput. 2015, 27, 1555–1608.
  7. Best, G.; Cliff, O.M.; Patten, T.; Mettu, R.R.; Fitch, R. Decentralised Monte Carlo Tree Search for Active Perception. In Proceedings of the International Workshop on the Algorithmic Foundations of Robotics (WAFR), San Francisco, CA, USA, 18–20 December 2016.
  8. Cliff, O.M.; Lizier, J.T.; Wang, X.R.; Wang, P.; Obst, O.; Prokopenko, M. Delayed Spatio-Temporal Interactions and Coherent Structure in Multi-Agent Team Dynamics. Art. Life 2017, 23, 34–57.
  9. Best, G.; Forrai, M.; Mettu, R.R.; Fitch, R. Planning-aware communication for decentralised multi-robot coordination. In Proceedings of the International Conference on Robotics and Automation, Brisbane, Australia, 21 May 2018.
  10. Boccaletti, S.; Latora, V.; Moreno, Y.; Chavez, M.; Hwang, D.U. Complex networks: Structure and dynamics. Phys. Rep. 2006, 424, 175–308.
  11. Mortveit, H.; Reidys, C. An Introduction to Sequential Dynamical Systems; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2007.
  12. Cliff, O.M.; Prokopenko, M.; Fitch, R. An Information Criterion for Inferring Coupling in Distributed Dynamical Systems. Front. Robot. AI 2016, 3.
  13. Daly, R.; Shen, Q.; Aitken, J.S. Learning Bayesian networks: Approaches and issues. Knowl. Eng. Rev. 2011, 26, 99–157.
  14. Chickering, D.M. Learning equivalence classes of Bayesian-network structures. J. Mach. Learn. Res. 2002, 2, 445–498.
  15. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
  16. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471.
  17. Ay, N.; Wennekers, T. Temporal infomax leads to almost deterministic dynamical systems. Neurocomputing 2003, 52, 461–466.
  18. Ay, N. Information geometry on complexity and stochastic interaction. Entropy 2015, 17, 2432–2458.
  19. Lizier, J.T.; Prokopenko, M.; Zomaya, A.Y. Information modification and particle collisions in distributed computation. Chaos 2010, 20, 037109.
  20. Murphy, K. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Thesis, UC Berkeley, Berkeley, CA, USA, 2002.
  21. Kocarev, L.; Parlitz, U. Generalized synchronization, predictability, and equivalence of unidirectionally coupled dynamical systems. Phys. Rev. Lett. 1996, 76, 1816–1819.
  22. Kantz, H.; Schreiber, T. Nonlinear Time Series Analysis; Cambridge University Press: Cambridge, UK, 2004.
  23. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann: Burlington, MA, USA, 2014.
  24. Granger, C.W.J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 1969, 37, 424–438.
  25. Schreiber, T. Measuring information transfer. Phys. Rev. Lett. 2000, 85, 461–464.
  26. Barnett, L.; Barrett, A.B.; Seth, A.K. Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables. Phys. Rev. Lett. 2009, 103, e238701.
  27. Lizier, J.T.; Prokopenko, M. Differentiating information transfer and causal effect. Eur. Phys. J. B 2010, 73, 605–615.
  28. Smirnov, D.A. Spurious causalities with transfer entropy. Phys. Rev. E 2013, 87, 042917.
  29. James, R.G.; Barnett, N.; Crutchfield, J.P. Information flows? A critique of transfer entropies. Phys. Rev. Lett. 2016, 116, 238701.
  30. Liang, X.S. Information flow and causality as rigorous notions ab initio. Phys. Rev. E 2016, 94, 052201.
  31. Takens, F. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence; Lecture Notes in Math; Springer: Berlin/Heidelberg, Germany, 1981; Volume 898, pp. 366–381.
  32. Stark, J. Delay embeddings for forced systems. I. Deterministic forcing. J. Nonlinear Sci. 1999, 9, 255–332.
  33. Stark, J.; Broomhead, D.S.; Davies, M.E.; Huke, J. Delay embeddings for forced systems. II. Stochastic forcing. J. Nonlinear Sci. 2003, 13, 519–577.
  34. Valdes-Sosa, P.A.; Roebroeck, A.; Daunizeau, J.; Friston, K. Effective connectivity: Influence, causality and biophysical modeling. Neuroimage 2011, 58, 339–361.
  35. Sporns, O.; Chialvo, D.R.; Kaiser, M.; Hilgetag, C.C. Organization, development and function of complex brain networks. Trends Cogn. Sci. 2004, 8, 418–425.
  36. Park, H.J.; Friston, K. Structural and functional brain networks: From connections to cognition. Science 2013, 342, 1238411.
  37. Friston, K.; Moran, R.; Seth, A.K. Analysing connectivity with Granger causality and dynamic causal modelling. Curr. Opin. Neurobiol. 2013, 23, 172–178.
  38. Lizier, J.T.; Rubinov, M. Multivariate Construction of Effective Computational Networks from Observational Data; Preprint 25/2012; Max Planck Institute for Mathematics in the Sciences: Leipzig, Germany, 2012.
  39. Sandoval, L. Structure of a global network of financial companies based on transfer entropy. Entropy 2014, 16, 4443–4482.
  40. Rodewald, J.; Colombi, J.; Oyama, K.; Johnson, A. Using Information-theoretic Principles to Analyze and Evaluate Complex Adaptive Supply Network Architectures. Procedia Comput. Sci. 2015, 61, 147–152.
  41. Crosato, E.; Jiang, L.; Lecheval, V.; Lizier, J.T.; Wang, X.R.; Tichit, P.; Theraulaz, G.; Prokopenko, M. Informative and misinformative interactions in a school of fish. arXiv, 2017; arXiv:1705.01213.
  42. Kozachenko, L.F.; Leonenko, N.N. Sample estimate of the entropy of a random vector. Probl. Peredachi Inf. 1987, 23, 9–16.
  43. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
  44. Stark, J.; Broomhead, D.S.; Davies, M.E.; Huke, J. Takens embedding theorems for forced and stochastic systems. Nonlinear Anal. Theory Methods Appl. 1997, 30, 5303–5314.
  45. Friedman, N.; Murphy, K.; Russell, S. Learning the structure of dynamic probabilistic networks. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI, USA, 24–26 July 1998; pp. 139–147.
  46. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
  47. Wilks, S.S. The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 1938, 9, 60–62.
  48. Barnett, L.; Bossomaier, T. Transfer entropy as a log-likelihood ratio. Phys. Rev. Lett. 2012, 109, 138105.
  49. Vinh, N.X.; Chetty, M.; Coppel, R.; Wangikar, P.P. GlobalMIT: Learning globally optimal dynamic Bayesian network with the mutual information test criterion. Bioinformatics 2011, 27, 2765–2766.
  50. Deyle, E.R.; Sugihara, G. Generalized theorems for nonlinear state space reconstruction. PLoS ONE 2011, 6, e18295.
  51. Lloyd, A.L. The coupled logistic map: A simple model for the effects of spatial heterogeneity on population dynamics. J. Theor. Biol. 1995, 173, 217–230.
  52. Lizier, J.T. JIDT: An information-theoretic toolkit for studying the dynamics of complex systems. Front. Robot. AI 2014, 1.
  53. Silander, T.; Myllymaki, P. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, Cambridge, MA, USA, 13–16 July 2006; pp. 445–452.
  54. Ragwitz, M.; Kantz, H. Markov models from data by simple nonlinear time series predictors in delay embedding spaces. Phys. Rev. E 2002, 65, 056201.
  55. Small, M.; Tse, C.K. Optimal embedding parameters: A modelling paradigm. Physica 2004, 194, 283–296.
  56. Lorenz, E.N. Deterministic nonperiodic flow. J. Atmos. Sci. 1963, 20, 130–141.
  57. Rössler, O.E. An equation for continuous chaos. Phys. Lett. A 1976, 57, 397–398.
  58. Haken, H. Analogy between higher instabilities in fluids and lasers. Phys. Lett. A 1975, 53, 77–78.
  59. Cuomo, K.M.; Oppenheim, A.V. Circuit implementation of synchronized chaos with applications to communications. Phys. Rev. Lett. 1993, 71, 65–68.
  60. He, R.; Vaidya, P.G. Analysis and synthesis of synchronous periodic and chaotic systems. Phys. Rev. A 1992, 46, 7387–7392.
  61. Fujisaka, H.; Yamada, T. Stability theory of synchronized motion in coupled-oscillator systems. Prog. Theor. Phys. 1983, 69, 32–47.
  62. Rulkov, N.F.; Sushchik, M.M.; Tsimring, L.S.; Abarbanel, H.D. Generalized synchronization of chaos in directionally coupled chaotic systems. Phys. Rev. E 1995, 51, 980–994.
  63. Acid, S.; de Campos, L.M. Searching for Bayesian network structures in the space of restricted acyclic partially directed graphs. J. Artif. Intell. Res. 2003, 18, 445–490.
  64. Friston, K.; Kilner, J.; Harrison, L. A free energy principle for the brain. J. Physiol. Paris 2006, 100, 70–87.
  65. Williams, P.L.; Beer, R.D. Generalized measures of information transfer. arXiv, 2011; arXiv:1102.1507.
  66. Vakorin, V.A.; Krakovska, O.A.; McIntosh, A.R. Confounding effects of indirect connections on causality estimation. J. Neurosci. Methods 2009, 184, 152–160.
  67. Spinney, R.E.; Prokopenko, M.; Lizier, J.T. Transfer entropy in continuous time, with applications to jump and neural spiking processes. Phys. Rev. E 2017, 95, 032319.
  68. Hefferan, B.; Cliff, O.M.; Fitch, R. Adversarial Patrolling with Reactive Point Processes. In Proceedings of the Australasian Conference on Robotics and Automation (ACRA), Brisbane, Australia, 5–7 December 2016.
  69. Prokopenko, M.; Einav, I. Information thermodynamics of near-equilibrium computation. Phys. Rev. E 2015, 91, 062143.
  70. Spinney, R.E.; Lizier, J.T.; Prokopenko, M. Transfer entropy in physical systems and the arrow of time. Phys. Rev. E 2016, 94, 022135.
  71. Takens, F. The reconstruction theorem for endomorphisms. Bull. Braz. Math. Soc. 2002, 33, 231–262.
  72. Ay, N.; Wennekers, T. Dynamical properties of strongly interacting Markov chains. Neural Netw. 2003, 16, 1483–1497.
  73. Edlund, J.A.; Chaumont, N.; Hintze, A.; Koch, C.; Tononi, G.; Adami, C. Integrated information increases with fitness in the evolution of animats. PLoS Comput. Biol. 2011, 7, e1002236.
Figure 1. Trajectory of a pair of coupled Lorenz systems. Top row: original state of the subsystems. Bottom row: time-series measurements of the subsystems. In each figure, the black lines represent an uncoupled simulation ($\lambda = 0$), and teal lines illustrate a simulation where the first (leftmost) subsystem was coupled to the second ($\lambda = 10$). (a) $\sigma = 10$, $\beta = 8/3$, $\rho = 28$; (b) $\sigma = 10$, $\beta = 8/3$, $\rho = 90$.
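The kind of simulation summarised in Figure 1 can be sketched as below, using the panel (a) parameters ($\sigma = 10$, $\beta = 8/3$, $\rho = 28$) and the two coupling strengths $\lambda = 0$ and $\lambda = 10$. This is only an illustrative sketch: the diffusive coupling on the $x$-component, the Runge–Kutta integrator, the step size, the initial state and all function names are assumptions, and the observation filter that produces the measured time series is omitted.

```python
def lorenz(state, sigma=10.0, beta=8.0 / 3.0, rho=28.0):
    """Right-hand side of the Lorenz equations for a single subsystem."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def coupled_rhs(s, lam, rho=28.0):
    """Two Lorenz subsystems with joint state s = (x1, y1, z1, x2, y2, z2)."""
    d1 = lorenz(s[:3], rho=rho)
    d2 = lorenz(s[3:], rho=rho)
    # Assumed coupling (not taken from the paper): subsystem 1 drives the
    # x-component of subsystem 2 diffusively with strength lam.
    return d1 + (d2[0] + lam * (s[0] - s[3]), d2[1], d2[2])


def rk4_step(f, s, dt):
    """One classical fourth-order Runge-Kutta step of size dt."""
    k1 = f(s)
    k2 = f(tuple(a + 0.5 * dt * b for a, b in zip(s, k1)))
    k3 = f(tuple(a + 0.5 * dt * b for a, b in zip(s, k2)))
    k4 = f(tuple(a + dt * b for a, b in zip(s, k3)))
    return tuple(
        a + dt / 6.0 * (b + 2.0 * c + 2.0 * d + e)
        for a, b, c, d, e in zip(s, k1, k2, k3, k4)
    )


def simulate(lam, n_steps=2000, dt=0.01, rho=28.0,
             s0=(1.0, 1.0, 1.0, -1.0, 2.0, 20.0)):
    """Return the list of visited states, including the initial state s0."""
    traj = [tuple(s0)]
    for _ in range(n_steps):
        traj.append(rk4_step(lambda s: coupled_rhs(s, lam, rho), traj[-1], dt))
    return traj


uncoupled = simulate(lam=0.0)   # corresponds to the black lines (lambda = 0)
coupled = simulate(lam=10.0)    # corresponds to the teal lines (lambda = 10)
```

Because the coupling is unidirectional, the driving subsystem follows the same trajectory in both runs, while the driven subsystem departs from its uncoupled trajectory once $\lambda > 0$. Passing `rho=90.0` to `simulate` gives the parameter setting of panel (b).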
Figure 2. Representation of (a) the synchronous GDS with two vertices ($V^1$ and $V^2$), and (b) the rolled-out DBN of the equivalent structure. Subsystems $V^1$ and $V^2$ are coupled by virtue of the edge $X_n^1 \to X_{n+1}^2$.
Figure 3. Distributions of the (a) TEA penalty function (28) and the (b) TEE penalty function (28). Both distributions were generated by observing the outcome of 1000 samples from two Gaussian variables with a correlation of 0.05. The figures illustrate: the distribution as a set of 100 sampled points (black dots); the area considered independent (grey regions); the measured transfer entropy (black line); and the difference between measurement and penalty term (dark grey region). Both tests use a value of $\alpha = 0.9$ (a p-value of 0.1). The distribution in (a) was estimated by assuming the variables were linearly coupled Gaussians, and the distribution in (b) was computed via a kernel box method (computed by the Java Information Dynamics Toolkit (JIDT); see [52] for details).
Figure 4. The network topologies used in this paper. The top row (a–d) shows four arbitrary networks with three nodes ($M = 3$) and the bottom row (e–h) shows four arbitrary networks with four nodes ($M = 4$).
Figure 5. Transfer entropy as a function of the parameters of a coupled Lorenz–Rössler system. The parameters varied are: coupling strength $\lambda$ and embedding dimension $\kappa$ in the top row (a–c); coupling strength $\lambda$ and observation noise $\sigma_\psi$ in the middle row (d–f); and observation noise $\sigma_\psi$ and embedding dimension $\kappa$ in the bottom row (g–i).
Table 1. $F_1$-scores for three-node ($M = 3$) networks. We present the classification summary for the three arbitrary topologies of coupled Lorenz systems represented by Figure 4b–d (network G1 has no edges and thus an undefined $F_1$-score). The p-value of the TEE score is given in the top row of each table, with the first pair of columns using no significance testing, i.e., score (27).

| Graph | N | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G2 | 5 K | 0.8 | 0.5 | 0.8 | 0.5 | 0.8 | 0.5 | 0.8 | 0.5 |
| G2 | 25 K | 1 | 0.8 | 1 | 0.5 | 1 | 0.5 | 1 | 0.8 |
| G2 | 100 K | 1 | 0.5 | 1 | 1 | 1 | 1 | 1 | 0.8 |
| G3 | 5 K | 1 | 0.67 | 1 | 1 | 1 | 1 | 1 | 0.67 |
| G3 | 25 K | 1 | 1 | 1 | 0.5 | 1 | 1 | 1 | 1 |
| G3 | 100 K | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| G4 | 5 K | 0.8 | - | 0.8 | 0.8 | 0.8 | 0.5 | 0.8 | - |
| G4 | 25 K | 1 | 1 | 1 | 1 | 1 | 0.5 | 1 | 1 |
| G4 | 100 K | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Table 2. $F_1$-scores for four-node ($M = 4$) networks. We present the classification summary for the three arbitrary topologies of coupled Lorenz systems represented by Figure 4f–h (network G5 has no edges and thus an undefined $F_1$-score). The p-value of the TEE score is given in the top row of each table, with the first pair of columns using no significance testing, i.e., score (27).

| Graph | N | Score (27), $\sigma_\psi = 1$ | Score (27), $\sigma_\psi = 10$ | $p = 0.01$, $\sigma_\psi = 1$ | $p = 0.01$, $\sigma_\psi = 10$ | $p = 0.001$, $\sigma_\psi = 1$ | $p = 0.001$, $\sigma_\psi = 10$ | $p = 0.0001$, $\sigma_\psi = 1$ | $p = 0.0001$, $\sigma_\psi = 10$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| G6 | 5 K | 0.57 | 0.5 | 0.57 | 0.29 | 0.57 | 0.29 | 0.57 | - |
| G6 | 25 K | 0.75 | 0.33 | 0.75 | 0.33 | 0.75 | 0.29 | 0.75 | 0.33 |
| G6 | 100 K | 1 | 0.33 | 1 | 0.57 | 1 | 0.4 | 1 | 0.33 |
| G7 | 5 K | 1 | 0.25 | 1 | 0.29 | 0.75 | 0.25 | 0.75 | 0.57 |
| G7 | 25 K | 1 | 0.5 | 1 | 0.86 | 1 | 0.86 | 1 | 0.5 |
| G7 | 100 K | 1 | 0.86 | 1 | 0.86 | 1 | 0.86 | 1 | 0.86 |
| G8 | 5 K | 1 | 0.25 | 1 | 0.57 | 1 | 0.75 | 1 | 0.25 |
| G8 | 25 K | 1 | 0.86 | 1 | 0.86 | 1 | 0.86 | 1 | 0.86 |
| G8 | 100 K | 1 | 0.86 | 1 | 0.86 | 1 | 0.57 | 1 | 0.86 |

Share and Cite

MDPI and ACS Style

Cliff, O.M.; Prokopenko, M.; Fitch, R. Minimising the Kullback–Leibler Divergence for Model Selection in Distributed Nonlinear Systems. Entropy 2018, 20, 51. https://doi.org/10.3390/e20020051

AMA Style

Cliff OM, Prokopenko M, Fitch R. Minimising the Kullback–Leibler Divergence for Model Selection in Distributed Nonlinear Systems. Entropy. 2018; 20(2):51. https://doi.org/10.3390/e20020051

Chicago/Turabian Style

Cliff, Oliver M., Mikhail Prokopenko, and Robert Fitch. 2018. "Minimising the Kullback–Leibler Divergence for Model Selection in Distributed Nonlinear Systems" Entropy 20, no. 2: 51. https://doi.org/10.3390/e20020051
