Heterogeneous Graphical Granger Causality by Minimum Message Length

Hlaváčková-Schindler, Kateřina; Plant, Claudia

doi:10.3390/e22121400

by Kateřina Hlaváčková-Schindler ^1,2,* and Claudia Plant ^1,3

¹ Faculty of Computer Science, University of Vienna, 1090 Wien, Austria
² Institute of Computer Science of the Czech Academy of Sciences, 18207 Prague, Czech Republic
³ ds:UniVie, University of Vienna, 1090 Wien, Austria
* Author to whom correspondence should be addressed.

Entropy 2020, 22(12), 1400; https://doi.org/10.3390/e22121400

Submission received: 2 November 2020 / Revised: 26 November 2020 / Accepted: 7 December 2020 / Published: 11 December 2020

(This article belongs to the Special Issue Information Flow in Neural Systems)

Abstract

The heterogeneous graphical Granger model (HGGM) for causal inference among processes with distributions from an exponential family is efficient in scenarios when the number of time observations is much greater than the number of time series, normally by several orders of magnitude. However, in the case of “short” time series, the inference in HGGM often suffers from overestimation. To remedy this, we use the minimum message length (MML) principle to determine the causal connections in the HGGM. The minimum message length, as a Bayesian information-theoretic method for statistical model selection, applies Occam’s razor in the following way: even when models are equal in their measure of fit-accuracy to the observed data, the one generating the most concise explanation of data is more likely to be correct. Based on the dispersion coefficient of the target time series and on the initial maximum likelihood estimates of the regression coefficients, we propose a minimum message length criterion to select the subset of causally connected time series with each target time series and derive its form for various exponential distributions. We propose two algorithms to find the subset: the genetic-type algorithm (HMMLGA) and the exhaustive search exHMML. We demonstrated the superiority of both algorithms in synthetic experiments with respect to the comparison methods Lingam, HGGM and statistical framework Granger causality (SFGC). In the real data experiments, we used the methods to discriminate between pregnancy and labor phase using electrohysterogram data of Icelandic mothers from the Physionet database. We further analysed the Austrian climatological time measurements and their temporal interactions in rain and sunny days scenarios. In both experiments, the results of HMMLGA had the most realistic interpretation with respect to the comparison methods. We provide our code in Matlab. To the best of our knowledge, this is the first work using the MML principle for causal inference in HGGM.

Keywords:

Granger causality; graphical Granger model; overestimation; information theory; minimum message length

1. Introduction

Granger causality is a popular method for causality analysis in time series due to its computational simplicity. Its application to time series with non-Gaussian distributions can, however, be misleading. Recently, Behzadi et al. in [1] proposed the heterogeneous graphical Granger model (HGGM) for detecting causal relations among time series with distributions from the exponential family, which includes a wider class of common distributions. HGGM employs regression in generalized linear models (GLM) with adaptive Lasso penalization [2] as a variable selection method and applies it to time series with a given lag. This approach allows one to apply causal inference among time series, also with discrete values. HGGM, using penalization by adaptive Lasso, showed its efficiency in scenarios when the number of time observations is much greater than the number of time series, normally by several orders of magnitude. However, on “short” time series, the inference in HGGM often suffers from overestimation.

Overestimation on short time series is a problem which also occurs in general forecasting problems. For example, when forecasting demand for a new product or a new customer, there are usually very few time series observations available. For such short time series, the traditional forecasting methods may be inaccurate. To overcome this problem in forecasting, Ref. [3] proposed to utilize prior information derived from the data and applied a Bayesian inference approach. Similarly, for another data mining problem, a Bayesian approach has been shown to be efficient for the clustering of short time series [4].

Motivated by the efficiency of the Bayesian approaches in these problems on short time series, we propose to use the Bayesian approach called the minimum message length principle, as introduced in [5], for causal inference in HGGM. The contributions of our paper are the following:

(1): We used the minimum message length (MML) principle for determination of causal connections in the heterogeneous graphical Granger model.
(2): Based on the dispersion coefficient of the target time series and on the initial maximum likelihood estimates of the regression coefficients, we proposed a minimum message length criterion to select the subset of causally connected time series with each target time series; furthermore, we derived its form for various exponential distributions.
(3): We found this subset in two ways: by a proposed genetic-type algorithm (HMMLGA), as well as by exhaustive search (exHMML). We evaluated the complexities of these algorithms and provided the code in Matlab.
(4): We demonstrated the superiority of both methods with respect to the comparison methods Lingam [6], HGGM [1] and statistical framework Granger causality (SFGC) [7] in the synthetic experiments with short time series. In the real data experiments without known ground truth, the interpretation of causal connections achieved by HMMLGA was the most realistic with respect to the comparison methods.
(5): To the best of our knowledge, this is the first work applying the minimum message length principle to the heterogeneous graphical Granger model.

The paper is organized as follows. Section 2 presents definitions of the graphical Granger causal model and of the heterogeneous graphical Granger causal model as well as of the minimum message length principle. Our method, including the derived criteria and algorithms, is described in Section 3. Related work is discussed in Section 4. Our experiments are summarized in Section 5. Section 6 is devoted to the conclusions, and the derivation of the criteria from Section 3 can be found in Appendix A and Appendix B.

2. Preliminaries

To make this paper self-contained and to introduce the notation, we briefly summarize the basics of the graphical Granger causal model in Section 2.1. The heterogeneous graphical Granger model, as introduced in [1], is presented in Section 2.2. Section 2.3 discusses the strengths and limitations of the Granger causal models. The idea of the minimum message length principle is briefly explained in Section 2.4.

2.1. Graphical Granger Model

The (Gaussian) graphical Granger model extends the autoregressive concept of Granger causality to $p \geq 2$ time series [8]. Let $x_{1}^{t}, \dots, x_{p}^{t}$ be the time instances of p time series, $t = 1, \dots, n$. As is common, we will use bold font in notation of vectors or matrices. Consider the vector auto-regressive (VAR) models with time lag $d \geq 1$ for $i = 1, \dots, p$

$x_{i}^{t} = X_{t,d}^{Lag} β_{i}^{'} + ε_{i}^{t}$

(1)

where $X_{t,d}^{Lag} = (x_{1}^{t - d}, \dots, x_{1}^{t - 1}, \dots, x_{p}^{t - d}, \dots, x_{p}^{t - 1})$, $β_{i}$ is a matrix of the regression coefficients and $ε_{i}^{t}$ is white noise. One can easily show that $X_{t,d}^{Lag} β_{i}^{'} = \sum_{j = 1}^{p} \sum_{l = 1}^{d} x_{j}^{t - l} β_{j}^{l}$.
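The identity between the matrix product and the double sum can be checked numerically. The following is a minimal sketch in plain Python; the function names `lagged_row` and `var_prediction`, the toy data and the storage convention `beta[j][l-1]` for $β_{j}^{l}$ are assumptions made for illustration only.

```python
def lagged_row(series, t, d):
    """Row X_{t,d}^{Lag}: for each series, the values x^{t-d}, ..., x^{t-1} (t is 1-based)."""
    row = []
    for x in series:
        # x^{t-l} in the 1-based notation of the text is x[t - l - 1] in Python
        row.extend(x[t - l - 1] for l in range(d, 0, -1))
    return row


def var_prediction(series, beta, t, d):
    """Double sum over series j and lags l of x_j^{t-l} * beta_j^l."""
    return sum(series[j][t - l - 1] * beta[j][l - 1]
               for j in range(len(series))
               for l in range(1, d + 1))


# Toy data (hypothetical): p = 2 series with n = 4 observations, lag d = 2
series = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, -0.5, 1.0]]
beta = [[0.1, 0.2], [0.0, 0.3]]

row = lagged_row(series, t=4, d=2)
# Flatten beta in the same column order as the row (lag d first, lag 1 last)
flat_beta = [beta[j][l - 1] for j in range(2) for l in range(2, 0, -1)]
dot_product = sum(a * b for a, b in zip(row, flat_beta))
double_sum = var_prediction(series, beta, t=4, d=2)
```

Both `dot_product` and `double_sum` evaluate the same linear predictor; only the ordering of the terms differs.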

Definition 1.

One says time series $x_{j}$ Granger–causes time series $x_{i}$ for a given lag d, denoted $x_{j} \to x_{i}$, for $i, j = 1, \dots, p$, if and only if at least one of the d coefficients in the j-th row of $β_{i}$ in (1) is non-zero.

The solution of problem (1) has been approached by various forms of penalization methods in the literature, e.g., Lasso in [8], truncated Lasso in [9] or group Lasso [10].

2.2. Heterogeneous Graphical Granger Model

The heterogeneous graphical Granger model (HGGM) [1] considers time series $x_{i}$ whose likelihood function belongs to the exponential family with a canonical parameter $θ_{i}$. The generic density form for each $x_{i}$ can be written as

$p (x_{i} | X_{t,d}^{Lag}, θ_{i}) = h (x_{i}) \exp (x_{i} θ_{i} - η_{i} (θ_{i}))$

(2)

where $θ_{i} = X_{t,d}^{Lag} {(β_{i}^{*})}^{'}$ ($β_{i}^{*}$ is the optimum) and $η_{i}$ is a link function corresponding to time series $x_{i}$. (The sign $^{'}$ denotes the transpose of a matrix.) The heterogeneous graphical Granger model uses the idea of generalized linear models (GLM, see e.g., [11]) and applies them to time series in the following form

$x_{i}^{t} \approx μ_{i}^{t} = η_{i}^{t} (X_{t,d}^{Lag} β_{i}^{'}) = η_{i}^{t} (\sum_{j = 1}^{p} \sum_{l = 1}^{d} x_{j}^{t - l} β_{j}^{l})$

(3)

for $x_{i}^{t}$, $i = 1, \dots, p$, $t = d + 1, \dots, n$, each having a probability density from the exponential family; $μ_{i}$ denotes the mean of $x_{i}$ and $\mathrm{var} (x_{i} | μ_{i}, ϕ_{i}) = ϕ_{i} v_{i} (μ_{i})$, where $ϕ_{i}$ is a dispersion parameter and $v_{i}$ is a variance function dependent only on $μ_{i}$; $η_{i}^{t}$ is the t-th coordinate of $η_{i}$.

Causal inference in (3) can be solved as

${\hat{β}}_{i} = \arg\min_{β_{i}} \sum_{t = d + 1}^{n} (- x_{i}^{t} (X_{t,d}^{Lag} β_{i}^{'}) + η_{i}^{t} (X_{t,d}^{Lag} β_{i}^{'})) + λ_{i} R (β_{i})$

(4)

for a given lag $d > 0$, $λ_{i} > 0$, and all $t = d + 1, \dots, n$, with $R (β_{i})$ being the adaptive Lasso penalty function [1]. (The first two summands in (4) correspond to the maximum likelihood estimates in the GLM.)

Definition 2.

One says time series $x_{j}$ Granger–causes time series $x_{i}$ for a given lag d, denoted $x_{j} \to x_{i}$, for $i, j = 1, \dots, p$, if and only if at least one of the d coefficients in the j-th row of ${\hat{β}}_{i}$ of the solution of (4) is non-zero [1].

Remark 1.

In practice, non-zero values in Definitions 1 and 2 are distinguished by considering values bigger than a given threshold, which is a positive number “close” to zero.
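The thresholded reading of Definitions 1 and 2 can be sketched as follows. The function name `granger_causes`, the default threshold and the use of the absolute value of each coefficient (so that negative coefficients also count) are assumptions of this sketch, not specifications from the text.

```python
def granger_causes(beta_rows, threshold=1e-6):
    """Return the 1-based indices j for which x_j -> x_i is declared.

    beta_rows[j-1] holds the d lag coefficients of series j in the
    estimate for target series i. A series is declared causal if at
    least one of its coefficients exceeds the threshold in magnitude
    (the absolute value is an assumption of this sketch).
    """
    return [j + 1 for j, row in enumerate(beta_rows)
            if any(abs(b) > threshold for b in row)]


# Hypothetical estimate for one target with p = 3 series and lag d = 2
beta_hat = [[0.0, 1e-9], [0.3, 0.0], [0.0, -0.2]]
causes = granger_causes(beta_hat)
```

Repeating this for every target $x_{i}$ gives the columns of the adjacency matrix of the causal graph.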

For example, Equation (4) for the Poisson graphical Granger model [12], where $η_{i}^{t} := \exp$ is considered for each $i = 1, \dots, p$, can be written as

${\hat{β}}_{i} = \arg\min_{β_{i}} \sum_{t = d + 1}^{n} (- x_{i}^{t} (X_{t,d}^{Lag} β_{i}^{'}) + \exp (X_{t,d}^{Lag} β_{i}^{'})) + λ_{i} R (β_{i}) .$

(5)

Equation (4) for the binomial graphical Granger model can be written as

${\hat{β}}_{i} = \arg\min_{β_{i}} \sum_{t = d + 1}^{n} (- x_{i}^{t} (X_{t,d}^{Lag} β_{i}^{'}) + \log (1 + \exp (X_{t,d}^{Lag} β_{i}^{'}))) + λ_{i} R (β_{i})$

(6)

and finally Equation (4) for the Gaussian graphical Granger model reduces to the least squares error of (1) with $R (β_{i})$ being the adaptive Lasso penalty function. The heterogeneous graphical Granger model can be applied to causal inference among processes, for example in climatology: Ref. [1] investigated the causal inference among precipitation time series (having gamma distribution) and time series of sunny days (having Poisson distribution).
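The unpenalized parts of the objectives (4) to (6) can be sketched in a few lines. This is an illustration only: the function name `glm_loss` and the family labels are hypothetical, the penalty $λ_{i} R (β_{i})$ is left out, and the Gaussian case is written as half the squared error, which agrees with the least squares error only up to a constant factor.

```python
import math


def glm_loss(x, eta, family):
    """Sum over t of the unpenalized summands of (4) for one target series.

    x   : observed values x_i^t
    eta : linear predictors, i.e. the products X_{t,d}^{Lag} beta_i'
    """
    if family == "poisson":      # summand of Eq. (5)
        return sum(-a * e + math.exp(e) for a, e in zip(x, eta))
    if family == "binomial":     # summand of Eq. (6)
        return sum(-a * e + math.log1p(math.exp(e)) for a, e in zip(x, eta))
    if family == "gaussian":     # half squared error (assumed scaling)
        return sum(0.5 * (a - e) ** 2 for a, e in zip(x, eta))
    raise ValueError("unknown family: " + family)


poisson_value = glm_loss([1.0], [0.0], "poisson")
binomial_value = glm_loss([1.0], [0.0], "binomial")
gaussian_value = glm_loss([1.0], [0.0], "gaussian")
```

The three branches differ only in the second term of the summand, which is what makes one penalized solver applicable to time series with different distributions.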

2.3. Granger Causality and Graphical Granger Models

Since its introduction, Granger causality [13] has faced criticism, for example because it does not take counterfactuals into account [14,15]. In defense of his method, Granger in [16] wrote: “Possible causation is not considered for any arbitrarily selected group of variables, but only for variables for which the researcher has some prior belief that causation is, in some sense, likely.” In other words, drawing conclusions about the existence of a causal relation between time series and about its direction is possible only if theoretical knowledge of mechanisms connecting the time series is accessible.

Concerning the graphical causal models, including the Granger ones, Lindquist and Sobel in [17] claim that (1) they are not able to discover causal effects; (2) the theory of graphical causal models developed by Spirtes et al. in [18] makes no counterfactual claims; and (3) causal relations cannot be determined non-experimentally from samples that are a combination of systems with different propensities. However, Glymour in [19] argues that each of these claims is false or exaggerated. For arguments against (1) and (3), we refer the reader to [19]. We focus here only on his arguments against (2). Quoting Glymour, claims about what the outcome would be of a hypothetical experiment that has not been done are one form of counterfactual claims. Such claims say that if such and such were to happen then the result would be thus and so, where such and such has not happened or has not yet happened. (Of course, if the experiment is later done, then the proposition becomes factually true or factually false.) Glymour argues that it is not true that the graphical model framework does not represent or entail any counterfactual claims and emphasizes that no counterfactual variables are used or needed in the graphical causal model framework. In the potential outcomes framework, if nothing is known about which of many variables are causes of the others, then for each variable, and for each value of the other variables, a new counterfactual variable is required. In practice that would require an astronomical number of counterfactual variables for even a few actual variables. To summarize, as also confirmed by a recent Nature publication [20], if the theoretical background of investigated processes is insufficient, graphical causal methods (including Granger causality), which infer causal relations from data rather than from knowledge of mechanisms, are helpful.

2.4. Minimum Message Length Principle

The minimum message length principle of statistical and inductive inference and machine learning was developed by C.S. Wallace and D.M. Boulton in 1968 in the seminal paper [5]. The minimum message length principle is a formal information-theoretic restatement of Occam’s razor: even when models are not equal in goodness of fit accuracy to the observed data, the one generating the shortest overall message is more likely to be correct (where the message consists of a statement of the model, followed by a statement of data encoded concisely using that model). The MML principle selects the model which most compresses the data (i.e., the one with the “shortest message length”) as the most descriptive for the data. To be able to decompress this representation of the data, the details of the statistical model used to encode the data must also be part of the compressed data string. The calculation of the exact message length is an NP-hard problem; the most widely used, less computationally intensive alternative is the Wallace–Freeman approximation, called MML87 [21]. MML is Bayesian (i.e., it incorporates prior beliefs) and information-theoretic. It has the desirable properties of statistical invariance (i.e., the inference transforms with a re-parametrisation), statistical consistency (i.e., even for very hard problems, MML will converge to any underlying model) and efficiency (i.e., the MML model will converge to any true underlying model about as quickly as is possible). Wallace and Dowe (1999) showed in [22] a formal connection between MML and Kolmogorov complexity, i.e., the length of the shortest computer program that produces the object as output.
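For orientation, the Wallace–Freeman (MML87) message length is commonly written in the form below. The symbols are notation assumed here for illustration and are not taken from this paper: $\pi$ is the prior density of the parameter vector $\theta$, $J$ the Fisher information matrix, $k$ the number of parameters and $\kappa_{k}$ the quantizing lattice constant in k dimensions.

```latex
I_{87}(x, \theta) =
  \underbrace{-\log \pi(\theta) + \frac{1}{2} \log \det J(\theta)
              + \frac{k}{2} \log \kappa_{k}}_{\text{statement of the model}}
  \; \underbrace{- \log p(x \mid \theta) + \frac{k}{2}}_{\text{statement of the data given the model}}
```

The two groups of terms correspond to the two parts of the message described above: a more complex model lengthens the first part and is preferred only if it shortens the second part by more.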

3. Method

In this section, we will describe our method in detail. First, in Section 3.1, we will derive a fixed design matrix for HGGM, so that the minimum message length principle can be applied. In Section 3.2, we propose our minimum message length criterion for HGGM. The exact forms of the criterion for various exponential distributions are derived in Section 3.3. Then, we present our two variable selection algorithms and their computational complexity in Section 3.4 and Section 3.5.

3.1. Heterogeneous Graphical Granger Model with Fixed Design Matrix

We can see that the models from Section 2 do not have fixed design matrices. Since the MML principle proposed for generalized linear models in [23] requires a fixed design matrix, it cannot be directly applied to them. In the following, we derive the heterogeneous graphical Granger model (3) with a fixed lag d as an instance of regression in generalized linear models (GLM) with a fixed design matrix.

Consider the full model for p variables $x_{i}^{t}$ and an integer lag $d \geq 1$ corresponding to the optimization problem (3). To be able to use the maximum likelihood (ML) estimation over the regression parameters, we reformulate the matrix of lagged time series $X_{t,d}^{Lag}$ from (1) into a fixed design matrix form. Assume $n - d > p d$ and denote $x_{i} = (x_{i}^{d + 1}, x_{i}^{d + 2}, \dots, x_{i}^{n})$. We construct the $(n - d) \times (d \times p)$ design matrix

$X = \begin{bmatrix} x_{1}^{d} & \dots & x_{1}^{1} & \dots & x_{p}^{d} & \dots & x_{p}^{1} \\ x_{1}^{d + 1} & \dots & x_{1}^{2} & \dots & x_{p}^{d + 1} & \dots & x_{p}^{2} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_{1}^{n - 1} & \dots & x_{1}^{n - d} & \dots & x_{p}^{n - 1} & \dots & x_{p}^{n - d} \end{bmatrix}$

(7)

and the $1 \times (d \times p)$ vector $β_{i} = (β_{1}^{1}, \dots, β_{1}^{d}, \dots, β_{p}^{1}, \dots, β_{p}^{d})$. We can see that problem

$x_{i}^{'} \approx μ_{i} = η_{i} (X β_{i}^{'})$

(8)

for $i = 1, \dots, p$ is equivalent to problem (3) in matrix form, where $μ_{i} = (μ_{i}^{d + 1}, \dots, μ_{i}^{n})$ and the link function $η_{i}$ operates on each coordinate.
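The construction of the design matrix in (7) can be sketched as follows, assuming each time series is stored as a Python list with $x^{1}$ at index 0. The function name `design_matrix` and the toy data are hypothetical.

```python
def design_matrix(series, d):
    """Build the (n - d) x (d * p) matrix of Eq. (7).

    The row for target time t (t = d+1, ..., n, 1-based) contains, for
    every series in turn, the lagged values x^{t-1}, x^{t-2}, ..., x^{t-d}.
    """
    n = len(series[0])
    return [[x[t - l - 1] for x in series for l in range(1, d + 1)]
            for t in range(d + 1, n + 1)]


# Toy data (hypothetical): p = 2 series, n = 4 observations, lag d = 2
series = [[1, 2, 3, 4], [10, 20, 30, 40]]
X = design_matrix(series, d=2)
# First row: x_1^2, x_1^1, x_2^2, x_2^1; last row: x_1^3, x_1^2, x_2^3, x_2^2
```

With this column order, the first column of each block holds lag 1, which matches the ordering $β_{j}^{1}, \dots, β_{j}^{d}$ of the coefficient vector.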

Denote now by $γ_{i} \subset Γ = \{1, \dots, p\}$ the subset of indices of the regressor variables and by $k_{i} := | γ_{i} |$ its cardinality. Let $β_{i} := β_{i} (γ_{i}) \in R^{1 \times (d \times k_{i})}$ be the vector of unknown regression coefficients with a fixed ordering within the $γ_{i}$ subset. For illustration purposes and without loss of generality, we can assume that the first $k_{i}$ indices out of p vectors belong to $γ_{i}$. Considering only the columns from matrix $X$ in (7) which correspond to $γ_{i}$, we define the $(n - d) \times (d \times k_{i})$ matrix of lagged vectors with indices from $γ_{i}$ as

$X_{i} := X (γ_{i}) = \begin{bmatrix} x_{1}^{d} & \dots & x_{1}^{1} & \dots & x_{k_{i}}^{d} & x_{k_{i}}^{d - 1} & \dots & x_{k_{i}}^{1} \\ x_{1}^{d + 1} & \dots & x_{1}^{2} & \dots & x_{k_{i}}^{d + 1} & x_{k_{i}}^{d} & \dots & x_{k_{i}}^{2} \\ x_{1}^{d + 2} & \dots & x_{1}^{3} & \dots & x_{k_{i}}^{d + 2} & x_{k_{i}}^{d + 1} & \dots & x_{k_{i}}^{3} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_{1}^{n - 1} & \dots & x_{1}^{n - d} & \dots & x_{k_{i}}^{n - 1} & x_{k_{i}}^{n - 2} & \dots & x_{k_{i}}^{n - d} \end{bmatrix}$

(9)

The problem (8) for explanatory variables with indices from $γ_{i}$ is expressed as

$x_{i}^{'} \approx μ_{i} = E (x_{i}^{'} | X_{i}) = η_{i} (X_{i} β_{i}^{'}) .$

(10)

with $β_{i} := β_{i} (γ_{i})$ being a $1 \times (d k_{i})$ matrix of unknown coefficients, where $η_{i}$ operates on each coordinate. Wherever it is clear from context, we will simplify the notation and write $β_{i}$ instead of $β_{i} (γ_{i})$ and $X_{i}$ instead of $X (γ_{i})$.
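Restricting the full design matrix to a subset $γ_{i}$ amounts to keeping d consecutive columns per selected series. A minimal sketch, assuming the full matrix is stored as a list of rows with d columns per series in series order; the function name `select_columns` is hypothetical.

```python
def select_columns(X, gamma, d):
    """Return X(gamma): the d lag columns of every series index in gamma.

    X     : full design matrix as a list of rows, d columns per series
    gamma : iterable of 1-based series indices
    """
    cols = [(j - 1) * d + l for j in sorted(gamma) for l in range(d)]
    return [[row[c] for c in cols] for row in X]


# Hypothetical full matrix for p = 2 series and lag d = 2
X_full = [[2, 1, 20, 10], [3, 2, 30, 20]]
X_sub = select_columns(X_full, gamma={2}, d=2)
```

Sorting the indices keeps a fixed ordering within $γ_{i}$, so that the columns of $X (γ_{i})$ and the entries of $β_{i} (γ_{i})$ stay aligned.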

3.2. Minimum Message Length Criterion for Heterogeneous Graphical Granger Model

As before, we assume that each $x_{i}^{t}$, where $i = 1, \dots, p$, $t = d + 1, \dots, n$, has a density from the exponential family; furthermore, $μ_{i}$ is the mean of $x_{i}$ and $\mathrm{var} (x_{i} | μ_{i}, ϕ_{i}) = ϕ_{i} v_{i} (μ_{i})$, where $ϕ_{i}$ is a dispersion parameter and $v_{i}$ a variance function dependent only on $μ_{i}$. In the concrete case of Poisson regression, it is well known that the model can still be used in over- or underdispersed settings. However, the standard error for Poisson regression would not be correct in the overdispersed situation. In the Poisson graphical Granger model, this is the case when the dispersion of at least one time series satisfies $ϕ_{i} \neq 1$. In the following, we assume that an estimate of $ϕ_{i}$ is given. Denote by $Γ$ the set of all subsets of covariates $x_{i}, i = 1, \dots, p$. Assume now a fixed set $γ_{i} \in Γ$ of covariates with size $k_{i} \leq p$ and the corresponding design matrix $X_{i}$ from (9). Furthermore, we assume that the targets $x_{i}$ are independent random variables, conditioned on the features given by $X_{i}$, so that the likelihood function can be factorized into the product $p (x_{i} | β_{i}, X_{i}, γ_{i}) = \prod_{t = 1}^{n - d} p (x_{i}^{t} | β_{i}, X_{i}, γ_{i})$. The log-likelihood function $L_{i}$ then has the form

$L_{i} := \log p (x_{i} | β_{i}, X_{i}, γ_{i}) = \sum_{t = 1}^{n - d} \log p (x_{i}^{t} | β_{i}, X_{i}, γ_{i}) .$

Since $X_{i}$ is highly collinear, to make the ill-posed problem (8) for coefficients $β_{i}$ a well-posed one, we could use regularization by the ridge regression for GLM (see e.g., [24]). Ridge regression requires an initial estimate of $β_{i}$, which can be set as the maximum likelihood estimator of (10) obtained by the iteratively reweighted least squares algorithm (IRLS). For a fixed $λ_{i} > 0$, the ridge estimates of coefficients ${\hat{β}}_{i, λ_{i}}$ satisfy

${\hat{β}}_{i, λ_{i}} = \arg\min_{β_{i} \in R^{1 \times d k_{i}}} \{- L_{i} + λ_{i} β_{i}^{'} Σ_{i} β_{i}\} .$

(11)

In our paper, however, we will not use the GLM ridge regression in form (11); instead, we apply the principle of minimum description length. Ridge regression in the minimum description length framework is equivalent to allowing the prior distribution to depend on a hyperparameter (= the ridge regularization parameter). To compute the message length of HGGM using the MML87 approximation, we need the negative log-likelihood function, a prior distribution over the parameters and an appropriate Fisher information matrix, similarly to [23], where this is done for a general GLM regression. Moreover, Ref. [23] proposed the corrected form of the Fisher information matrix for a GLM regression with ridge penalty. In our work, we will use this form of ridge regression and apply it to the heterogeneous graphical Granger model. In the following, we will construct the MML code for every subset of covariates in HGGM. The derivation of the criterion can be found in Appendix A.

The MML criterion for inference in HGGM. Assume $x_{i}, i = 1, \dots, p$ are given time series of length n having distributions from the exponential family, and for each of them, the estimate of the dispersion parameter ${\hat{ϕ}}_{i}$ is given. Let ${\hat{β}}_{i}$ be an initial solution of (8) with a fixed $d \geq 1$ achieved as the maximum likelihood estimate. Then

(i): the causal graph of the heterogeneous graphical Granger problem (8) can be inferred from the solutions of p variable selection problems, where for each $i = 1, \dots, p$, the set ${\hat{γ}}_{i}$ of Granger–causal variables to $x_{i}$ is found;
(ii): for the estimated set ${\hat{γ}}_{i}$ it holds

${\hat{γ}}_{i} = \arg\min_{γ_{i} \in Γ} \{\mathrm{HMML} (x_{i}, X_{i}, γ_{i})\} = \arg\min_{γ_{i} \in Γ} \{I (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, {\hat{λ}}_{i}, X_{i}, γ_{i}) + I (γ_{i})\}$

(12)

where $I (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, {\hat{λ}}_{i}, X_{i}, γ_{i}) = \min_{λ_{i} \in R^{+}} \{\mathrm{MML} (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, λ_{i}, X_{i}, γ_{i})\}$ and $\mathrm{MML} (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, λ_{i}, X_{i}, γ_{i})$ is the minimum message length code of set $γ_{i}$. It can be expressed as

$\mathrm{MML} (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, λ_{i}, X_{i}, γ_{i}) = - L_{i} + \frac{1}{2} \log \det (X_{i}^{'} W_{i} X_{i} + λ_{i} Σ_{i}) + \frac{k_{i}}{2} \log (\frac{2 π}{λ_{i}}) + (\frac{λ_{i}}{2 {\hat{ϕ}}_{i}}) {\hat{β}}_{i}^{'} Σ_{i} {\hat{β}}_{i} + \frac{1}{2} \log (n - d) - \frac{k_{i} + 1}{2} \log (2 π) + \frac{1}{2} \log ((k_{i} + 1) π)$

(13)

where $| {\hat{γ}}_{i} | = k_{i}$, $Σ_{i}$ is the identity matrix of size $d k_{i} \times d k_{i}$, $I (γ_{i}) = \log \binom{p}{k_{i}} + \log (p + 1)$, $L_{i}$ is the log-likelihood function depending on the density function of $x_{i}$, and $W_{i}$ is a diagonal matrix depending on the link function $η_{i}$.
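The terms of (13) and the structure cost $I (γ_{i})$ can be evaluated directly once $L_{i}$ and the matrix $X_{i}^{'} W_{i} X_{i}$ are available. The following plain-Python sketch assumes $Σ_{i}$ is the identity matrix, as stated above; the function names and the Cholesky-based log-determinant are choices of this illustration and are not taken from the authors' Matlab code.

```python
import math


def log_det_spd(A):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    m = len(A)
    L = [[0.0] * m for _ in range(m)]
    for r in range(m):
        for c in range(r + 1):
            s = A[r][c] - sum(L[r][q] * L[c][q] for q in range(c))
            L[r][c] = math.sqrt(s) if r == c else s / L[c][c]
    return 2.0 * sum(math.log(L[r][r]) for r in range(m))


def structure_code_length(p, k):
    """I(gamma_i) = log C(p, k_i) + log(p + 1)."""
    return math.log(math.comb(p, k)) + math.log(p + 1)


def mml_code(log_lik, XtWX, beta, lam, phi, k, n, d):
    """Eq. (13), term by term, with Sigma_i taken as the identity matrix."""
    m = len(beta)  # equals d * k_i
    A = [[XtWX[r][c] + (lam if r == c else 0.0) for c in range(m)]
         for r in range(m)]
    return (-log_lik
            + 0.5 * log_det_spd(A)
            + 0.5 * k * math.log(2.0 * math.pi / lam)
            + lam / (2.0 * phi) * sum(b * b for b in beta)
            + 0.5 * math.log(n - d)
            - 0.5 * (k + 1) * math.log(2.0 * math.pi)
            + 0.5 * math.log((k + 1) * math.pi))


# Tiny hypothetical instance: k_i = 1, d = 1, n = 2, zero coefficient
value = mml_code(log_lik=0.0, XtWX=[[1.0]], beta=[0.0],
                 lam=1.0, phi=1.0, k=1, n=2, d=1)
```

In this instance only the determinant term and the three $π$-terms are non-zero, and the $π$-terms cancel, so `value` equals $\frac{1}{2} \log 2$.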

Remark 2. Ref. [23] compared the $AIC_{c}$ criterion with the MML code for generalized linear models. We constructed the $AIC_{c}$ criterion also for HGGM. This criterion, however, requires the computation of the pseudoinverse of a matrix product which includes the matrices $X_{i}$. Since the matrices $X_{i}$ are highly collinear, these matrix products had, in our experiments, very high condition numbers. Consequently, the application of $AIC_{c}$ to HGGM gave spurious results, and therefore we do not report them in our paper.

3.3. Log-Likelihood $L_{i}$ , Matrix $W_{i}$ and Dispersion $ϕ_{i}$ for $x_{i}$ with Various Exponential Distributions

In this section, we will present the form of the log-likelihood function and of matrix $W_{i}$ for Gaussian, binomial, Poisson, gamma and inverse-Gaussian distributed time series $x_{i}$. The derivation for each case can be found in Appendix B. In each case, $μ_{i} = η_{i} (X_{i} β_{i}^{'})$ holds for the link function as in (10). By ${[X_{i} β_{i}^{'}]}^{t}$, we denote the t-th coordinate of vector $X_{i} β_{i}^{'}$.

Case $x_{i}$ is Gaussian. This is the case when $x_{i}$ is an independent Gaussian random variable and link function $η_{i}$ is identity. Assume ${\hat{ϕ}}_{i} = σ_{i}^{2}$ to be the variance of the Gaussian random variable. We assume that in model (10) $x_{i}$ follows Gaussian distribution with the density function

$p (x_{i} | {\hat{β}}_{i}, σ_{i}^{2}, X_{i}, γ_{i}) = \prod_{t = d + 1}^{n} p (x_{i}^{t} | {\hat{β}}_{i}, σ_{i}^{2}, X_{i}, γ_{i}) = {(\frac{1}{2 π σ_{i}^{2}})}^{(n - d) / 2} \exp [- \frac{1}{2 σ_{i}^{2}} \sum_{t = d + 1}^{n} {(x_{i}^{t} - {[X_{i} {\hat{β}}_{i}^{'}]}^{t})}^{2}] .$

(14)

Then

$L_{i} = \log p (x_{i} | {\hat{β}}_{i}, σ_{i}^{2}, X_{i}, γ_{i}) = - \frac{n - d}{2} \log (2 π σ_{i}^{2}) - \frac{1}{2 σ_{i}^{2}} \sum_{t = d + 1}^{n} {(x_{i}^{t} - {[X_{i} {\hat{β}}_{i}^{'}]}^{t})}^{2}$

(15)

and $W_{i} := I_{n - d, n - d}$ is an identity matrix of dimension $(n - d) \times (n - d)$.
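The Gaussian log-likelihood (15) can be evaluated in a few lines. In this sketch the function name `gaussian_log_lik` is hypothetical, `x` holds the n − d target values $x_{i}^{d + 1}, \dots, x_{i}^{n}$, and `fitted` holds the corresponding coordinates of $X_{i} {\hat{β}}_{i}^{'}$.

```python
import math


def gaussian_log_lik(x, fitted, sigma2):
    """Eq. (15): Gaussian log-likelihood of the targets given fitted values."""
    m = len(x)  # number of usable observations, n - d
    rss = sum((a - b) ** 2 for a, b in zip(x, fitted))
    return -0.5 * m * math.log(2.0 * math.pi * sigma2) - rss / (2.0 * sigma2)


# Perfect fit with unit variance: only the normalizing term remains
perfect = gaussian_log_lik([1.0, 2.0], [1.0, 2.0], sigma2=1.0)
# A residual of 1 on one observation lowers the log-likelihood by 1/2
worse = gaussian_log_lik([1.0, 2.0], [1.0, 3.0], sigma2=1.0)
```

Since $W_{i}$ is the identity in this case, the determinant term in (13) reduces to $\log \det (X_{i}^{'} X_{i} + λ_{i} Σ_{i})$.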

Case $x_{i}$ is binomial. This is the case when $x_{i}$ is an independent Bernoulli random variable and it can achieve only two different values. For the link function, it holds $η_{i} = \log (\frac{μ_{i}}{1 - μ_{i}})$. Without loss of generality, we consider ${\hat{ϕ}}_{i} = 1$ and the density function

$p (x_{i} | {\hat{β}}_{i}, σ_{i}^{2}, X_{i}, γ_{i}) = \prod_{t = d + 1}^{n} p (x_{i}^{t} | {\hat{β}}_{i}, σ_{i}^{2}, X_{i}, γ_{i}) = \prod_{t = d + 1}^{n} {({[X_{i} {\hat{β}}_{i}^{'}]}^{t})}^{x_{i}^{t}} {(1 - ({[X_{i} {\hat{β}}_{i}^{'}]}^{t}))}^{(1 - x_{i}^{t})} .$

(16)

Then

$L_{i} = \log (p (x_{i} | {\hat{β}}_{i}, X_{i}, γ_{i})) = \sum_{t = d + 1}^{n} (x_{i}^{t} {[X_{i} {\hat{β}}_{i}^{'}]}^{t} - \log (1 + \exp {[X_{i} {\hat{β}}_{i}^{'}]}^{t}))$

(17)

and

$W_{i} := \mathrm{diag} (\frac{\exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{1})}{{(1 + \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{1}))}^{2}}, \dots, \frac{\exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d})}{{(1 + \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d}))}^{2}}) .$

(18)

In the case that we cannot assume accurate fitting to one of the two values, for robust estimation we can consider the sandwich estimate of the covariance matrix of ${\hat{β}}_{i}$ with

$W_{i} = \mathrm{diag} ({[x_{i}^{1} - \frac{\exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{1})}{{(1 + \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{1}))}^{2}}]}^{2}, \dots, {[x_{i}^{n - d} - \frac{\exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d})}{{(1 + \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d}))}^{2}}]}^{2}) .$

(19)

Case $x_{i}$ is Poisson. If $x_{i}$ is an independent Poisson random variable with link function $η_{i}^{t} = \log (μ_{i}^{t}) = \log ({[X_{i} {\hat{β}}_{i}^{'}]}^{t})$, the density is

$p (x_{i} | {\hat{β}}_{i}, X_{i}, γ_{i}) = \prod_{t = d + 1}^{n} \frac{{(\exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{t}))}^{x_{i}^{t}} \exp (- \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{t}))}{x_{i}^{t}!} .$

(20)

Then

$L_{i} = \log (p (x_{i} | {\hat{β}}_{i}, X_{i}, γ_{i})) = \sum_{t = d + 1}^{n} (x_{i}^{t} {[X_{i} {\hat{β}}_{i}^{'}]}^{t} - \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{t}) - \log (x_{i}^{t}!))$

(21)

and diagonal matrix

$W_{i} := \mathrm{diag} (\exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{1}), \dots, \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d}))$

(22)

for Poisson $x_{i}$ with ${\hat{ϕ}}_{i} = 1$ and

$W_{i} := \mathrm{diag} ({[x_{i}^{d + 1} - \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{1})]}^{2}, \dots, {[x_{i}^{d + (n - d)} - \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d})]}^{2})$

(23)

for over- or underdispersed Poisson $x_{i}$, i.e., when ${\hat{ϕ}}_{i} \neq 1$ and is positive. In both cases the diagonal entries are indexed by $t = 1, \dots, n - d$.

Case $x_{i}$ is gamma. If $x_{i}$ is an independent gamma random variable, we consider the inverse of the shape parameter, $κ_{i}$, and for each t the rate parameter $κ_{i} μ_{i}^{t}$; for the link function it holds

$μ_{i}^{t} = \frac{1}{η_{i}^{t}} = \frac{1}{{[X_{i} β_{i}^{'}]}^{t}} .$

For the parameters $a_{i}, b_{i}$ of the gamma distribution, we take $a_{i} = \frac{1}{κ_{i}}, b_{i}^{t} = κ_{i} {\hat{μ}}_{i}^{t}$ and assume for the dispersion ${\hat{ϕ}}_{i} = κ_{i}$. Then, we have the density function

$p (x_{i} | {\hat{β}}_{i}, \frac{1}{κ_{i}}, κ_{i} {\hat{μ}}_{i}, X_{i}, γ_{i}) = \prod_{t = d + 1}^{n} \frac{{(x_{i}^{t})}^{(\frac{1}{κ_{i}} - 1)} \exp (- \frac{x_{i}^{t}}{κ_{i} μ_{i}^{t}})}{{(κ_{i} μ_{i}^{t})}^{\frac{1}{κ_{i}}} Γ (\frac{1}{κ_{i}})}$

(24)

and log-likelihood

$L_{i} = \log (p (x_{i} | {\hat{β}}_{i}, \frac{1}{κ_{i}}, κ_{i} {\hat{μ}}_{i}, X_{i}, γ_{i})) = \sum_{t = d + 1}^{n} ((\frac{1}{κ_{i}} - 1) \log x_{i}^{t} - \frac{x_{i}^{t}}{κ_{i} {\hat{μ}}_{i}^{t}} - \frac{1}{κ_{i}} \log (κ_{i} {\hat{μ}}_{i}^{t}) - \log Γ (\frac{1}{κ_{i}}))$

(25)

and diagonal matrix

$W_{i} := \mathrm{diag} ({({\hat{μ}}_{i}^{1})}^{2}, \dots, {({\hat{μ}}_{i}^{n - d})}^{2}) = \mathrm{diag} (\frac{1}{{({[X_{i} {\hat{β}}_{i}^{'}]}^{1})}^{2}}, \dots, \frac{1}{{({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d})}^{2}}) .$

(26)

Case $x_{i}$ is inverse-Gaussian. If $x_{i}$ is an independent inverse-Gaussian random variable, we consider the inverse of the shape parameter $ξ_{i}$ and link function $η_{i}^{t} = \log (μ_{i}^{t}) = \log ({[X_{i} {\hat{β}}_{i}^{'}]}^{t})$. Assume dispersion ${\hat{ϕ}}_{i} = ξ_{i}$. Then we have the density function

$p (x_{i} | {\hat{β}}_{i}, ξ_{i}, {\hat{μ}}_{i}, X_{i}, γ_{i}) = \prod_{t = d + 1}^{n} \frac{1}{2 π ξ_{i} {(x_{i}^{t})}^{3}} \exp [- \frac{1}{2 ξ_{i}} \sum_{t = d + 1}^{n} \frac{{(x_{i}^{t} - {\hat{μ}}_{i}^{t})}^{2}}{{({\hat{μ}}_{i}^{t})}^{2} x_{i}^{t}}]$

(27)

and log-likelihood

$L_{i} = \log (p (x_{i} | {\hat{β}}_{i}, ξ_{i}, {\hat{μ}}_{i}, X_{i}, γ_{i})) = \sum_{t = d + 1}^{n} (- \frac{1}{2 ξ_{i}} \sum_{t = d + 1}^{n} \frac{{(x_{i}^{t} - {\hat{μ}}_{i}^{t})}^{2}}{{({\hat{μ}}_{i}^{t})}^{2} x_{i}^{t}} - \log (2 π ξ_{i}) + 3 \log (x_{i}^{t}))$

(28)

and diagonal matrix

$W_{i} := \mathrm{diag} (\frac{1}{{\hat{μ}}_{i}^{1}}, \dots, \frac{1}{{\hat{μ}}_{i}^{n - d}}) = \mathrm{diag} (\frac{1}{({[X_{i} {\hat{β}}_{i}^{'}]}^{1})}, \dots, \frac{1}{({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d})}) .$

(29)

One could similarly express $L_{i}$ and $W_{i}$ for other common exponential distributions applied in GLMs.
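The diagonal entries of $W_{i}$ given in this section can be collected in one helper. This is a sketch only: the function name `glm_weights` and the family labels are hypothetical, `eta` stands for the coordinates ${[X_{i} {\hat{β}}_{i}^{'}]}^{t}$, and only the non-robust forms (identity for the Gaussian case, and Equations (18), (22), (26) and (29)) are covered, not the sandwich variants (19) and (23).

```python
import math


def glm_weights(eta, family):
    """Diagonal of W_i for the coordinates eta = [X_i beta_i']^t, t = 1..n-d."""
    if family == "gaussian":            # identity matrix
        return [1.0 for _ in eta]
    if family == "binomial":            # Eq. (18)
        return [math.exp(e) / (1.0 + math.exp(e)) ** 2 for e in eta]
    if family == "poisson":             # Eq. (22)
        return [math.exp(e) for e in eta]
    if family == "gamma":               # Eq. (26), requires non-zero eta
        return [1.0 / e ** 2 for e in eta]
    if family == "inverse_gaussian":    # Eq. (29), requires non-zero eta
        return [1.0 / e for e in eta]
    raise ValueError("unknown family: " + family)


binomial_w = glm_weights([0.0], "binomial")
poisson_w = glm_weights([0.0], "poisson")
gamma_w = glm_weights([2.0], "gamma")
inverse_gaussian_w = glm_weights([2.0], "inverse_gaussian")
```

The returned list is the diagonal of $W_{i}$ and enters the criterion (13) only through the product $X_{i}^{'} W_{i} X_{i}$.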

3.4. Variable Selection by MML in Heterogeneous Graphical Granger Model

For all considered cases of exponential distributions of $x_{i}$ we define the family of models $M (γ_{i}) := \{p (x_{i} | {\hat{β}}_{i}, {\hat{ϕ}}_{i}, X_{i}, γ_{i}), γ_{i} \in Γ\}$ with the corresponding exponential density $p (x_{i} | {\hat{β}}_{i}, {\hat{ϕ}}_{i}, X_{i}, γ_{i})$. First, we present the procedure which for each $x_{i}$ computes the MML code for a set $γ_{i} \subset Γ$ in Algorithm 1. Then we present Algorithm 2 for computation of ${\hat{γ}}_{i}$.

Algorithm 1 MML Code for

γ_{i}

Input: $γ_{i} \in Γ, d \geq 1$ , $| γ_{i} | = k_{i}$ , series is the matrix of $x_{i}^{t}$ , ${\hat{ϕ}}_{i}$ the dispersion parameter,
$i = 1, \dots, p, t = 1, \dots, n - d$ , $Σ_{i}$ an identity matrix of size $d k_{i} \times d k_{i}$ , H a set of positive numbers;
$I (γ_{i}) = log (\binom{p}{k_{i}}) + log (p + 1)$ .
Output: For each i, the minimum of $HMML (x_{i}, X_{i}, γ_{i})$ over H;
for all $x_{i}$ do
// Construct the d-lagged matrix $X_{i}$ from the time series with indices in $γ_{i}$ .
// Compute matrix $W_{i}$ .
for all $λ_{i} \in H$ do
// Compute $L_{i}$ .
// Find the initial estimates of ${\hat{β}}_{i}$ .
// Compute $MML (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, λ_{i}, X_{i}, γ_{i})$ from (13).
end for // over $λ_{i}$
// Compute $I (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, {\hat{λ}}_{i}, X_{i}, γ_{i}) = {min}_{λ_{i} \in H} MML (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, λ_{i}, X_{i}, γ_{i})$ .
// $HMML (x_{i}, X_{i}, γ_{i}) : = I (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, {\hat{λ}}_{i}, X_{i}, γ_{i}) + I (γ_{i})$ .
end for // over $x_{i}$
return $HMML (x_{i}, X_{i}, γ_{i})$ for each i.
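The outer structure of Algorithm 1 can be sketched in Python as follows. The MML code length from (13) is not reproduced here; it is passed in as a user-supplied function `mml` of the penalty parameter, which is a placeholder of this sketch. The function names and the use of the natural logarithm are likewise assumptions.

```python
import math


def structure_code_length(p, k):
    """Code length log C(p, k) + log(p + 1) of a subset of size k (natural log assumed)."""
    return math.log(math.comb(p, k)) + math.log(p + 1)


def hmml(mml, lambdas, p, k):
    """Minimise a user-supplied MML code length over the grid H, then add the subset code.

    mml     : callable mapping a penalty value to a code length (placeholder for (13))
    lambdas : iterable of positive candidate penalty values (the set H)
    Returns (HMML value, minimising penalty value).
    """
    best_lambda = min(lambdas, key=mml)
    return mml(best_lambda) + structure_code_length(p, k), best_lambda
```

For example, with a toy code length `lambda lam: (lam - 2.0) ** 2 + 1.0` on the grid `[0.5, 1.0, 2.0, 4.0]`, the minimum is attained at the grid point 2.0, and the subset code for p = 5, k = 2 adds log 10 + log 6.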

In general, selecting the best structure

γ_{i}

amounts to evaluating

HMML (γ_{i})

for all

γ_{i} \subset Γ

, i.e., for all

2^{p}

possible subsets, and then picking the subset that attains the minimum.
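A minimal Python sketch of such an enumeration is given below. The objective is an arbitrary user-supplied function standing in for HMML, and the function name is hypothetical. Subsets are visited in order of increasing size and only strict improvements are accepted, so that ties are resolved in favour of the smallest subset, in line with the parsimony rule described in Section 3.5.

```python
from itertools import combinations


def exhaustive_search(p, objective):
    """Return (subset, value) minimising objective over all subsets of {0, ..., p-1}.

    Ties are broken in favour of the subset with the fewest elements.
    """
    best_subset, best_value = None, float("inf")
    for k in range(p + 1):  # increasing size: the first minimiser found is the sparsest
        for subset in combinations(range(p), k):
            value = objective(subset)
            if value < best_value:  # strict inequality keeps the earlier, smaller subset
                best_subset, best_value = subset, value
    return best_subset, best_value
```

The loop evaluates the objective 2^p times, which is why this approach is only practical for a small number of time series.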

3.5. Search Algorithms

We search for the best structure

γ_{i}

with respect to the MML code by two approaches. The first is the exhaustive search exHMML; the second, called HMMLGA and introduced below, minimizes HMML by a genetic-type algorithm. Since HMML in (12) has multiple local minima, neither approach is, in general, guaranteed to reach the global minimum. In [12], a similar genetic algorithm, MMLGA, was proposed for the Poisson GGM. Here we propose a modification that is better suited to the objective functions considered in this paper.

The idea of HMMLGA is as follows. Consider an arbitrary

γ_{i} \subset Γ

with size

k_{i}

for a fixed i. Define a Boolean vector

Q_{i}

of length p corresponding to a given

γ_{i}

, so that it has ones in the positions of the indices of covariates from

γ_{i}

, otherwise zeros. Define

HMML (Q_{i}) : = HMML (γ_{i})

where

HMML (γ_{i})

is from (12). The genetic algorithm HMMLGA executes genetic operations on populations of

Q_{i}

. In the first step, a population of size m (m an even integer) is generated randomly from the set of all

2^{p}

binary strings (individuals) of length p. Then, we select

m / 2

individuals in the current population with the lowest value of (12) as the elite subpopulation of parents of the next population. For a predefined number of generated populations

n_{g}

, the crossover operation on pairs of parents and the mutation operation on single parents are executed on the elite to create the rest of the new population. A mutation corresponds to a random change in

Q_{i}

and a crossover combines the vector entries of a pair of parents. In HMMLGA, the mutation position is selected randomly for each individual, in contrast to MMLGA, where the position was the same for all individuals and was given as an input parameter. Similarly, the crossover position in HMMLGA is selected randomly for each pair of individuals. After each run of these two operations on the current population, the current population is replaced by the children with the lowest value of (12) to form the next generation. The algorithm stops after the number of population generations

n_{g}

is reached. Since HMML in (12) has multiple local minima, HMMLGA, in contrast to MMLGA, uses the following selection strategy: instead of taking the first

Q_{i}

in the list sorted by ascending HMML value, we follow the parsimony principle and take, among all

Q_{i}

attaining the minimum HMML value, the one with the fewest ones. The exhaustive search exHMML applies the same rule: among all

Q_{i}

with the minimum value of HMML, it selects the one with the fewest ones. The algorithm HMMLGA is summarized in Algorithm 2.

Algorithm 2 HMMLGA

Input: $Γ$ , $d \geq 1, p, n_{g}, m$ an even integer;
series is the matrix of $x_{i}^{t}, i = 1, \dots, p, t = 1, \dots, n - d$ ;
Output: $Adj$ := adjacency matrix of the output causal graph;
// For every $x_{i}$ , the $Q_{i}$ with the minimum of (12) is found;
for all $x_{i}$ do
Create initial population ${Q_{i}^{j}, j = 1, \dots, m}$ at random; Compute
$HMML (Q_{i}^{j}) : = I (x_{i}, {\hat{β}}_{i}, {\hat{ϕ}}_{i}, {\hat{λ}}_{i}, X_{i}, Q_{i}^{j}) + log (\binom{p}{k_{i}^{j}}) + log (p + 1)$ for each $j = 1, \dots, m$ where
$k_{i}^{j}$ is the number of ones in $Q_{i}^{j}$ ; v:=1;
while $v \leq n_{g}$ do
u:=1;
while $u \leq m$ do
Sort $HMML (Q_{i}^{j})$ in ascending order and create the elite population; By crossover of $Q_{i}^{j}$ and $Q_{i}^{r}$ , $r \neq j$ ,
at a random crossing position create children and add them to the elite; Compute $HMML (Q_{i}^{j})$
for each j; Mutate a single parent $Q_{i}^{j}$ at a random position; Compute $HMML (Q_{i}^{j})$ for each j;
Add the children with minimum $HMML (Q_{i}^{j})$ until the new population is filled;
u := u + 1;
end while // over u
v := v + 1;
end while // over v
end for // over $x_{i}$
The i-th row of Adj: $Adj_{i} : = Q_{i}$ with the minimum of (12) such that $| Q_{i} |$ is minimum.
return $(Adj)$

Our code in Matlab is publicly available at: https://t1p.de/26f3.
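The genetic search of Algorithm 2 can also be sketched in Python for a single target series. This is not the Matlab implementation referenced above but a simplified illustration: the objective is a user-supplied stand-in for HMML, the function name, default parameters and seed are hypothetical, and the loop structure (one crossover child and one mutant per step) is a simplification chosen for brevity.

```python
import random


def hmmlga_sketch(p, objective, m=10, n_g=20, seed=0):
    """Toy elitist genetic search over Boolean vectors of length p (p >= 2, even m >= 4).

    objective maps a tuple of 0/1 entries to a code length (stand-in for HMML).
    Returns the vector with minimum objective; ties go to the fewest ones.
    """
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(p)) for _ in range(m)]
    for _ in range(n_g):
        # Elite: the m/2 individuals with the lowest objective value.
        elite = sorted(population, key=objective)[: m // 2]
        children = []
        while len(children) < m:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, p)      # random crossover position
            child = a[:cut] + b[cut:]
            pos = rng.randrange(p)         # random mutation position
            mutant = list(a)
            mutant[pos] = 1 - mutant[pos]
            children.extend([child, tuple(mutant)])
        # Next generation: the elite plus the best children.
        children = sorted(children, key=objective)[: m - len(elite)]
        population = elite + children
    # Parsimony rule: among minimisers, prefer the vector with the fewest ones.
    return min(population, key=lambda q: (objective(q), sum(q)))
```

Because the elite is carried over unchanged, the best objective value in the population never increases from one generation to the next; reaching the global minimum is, however, not guaranteed.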

Computational Complexity of HMMLGA and of exHMML

We used the Matlab function fminsearch to compute

HMML (x_{i}, {\hat{β}}_{i}, {\hat{λ}}_{i}, X_{i}, γ_{i})

. It is well known that an upper bound on the computational complexity of a genetic algorithm is of the order of the product of the size of an individual, the size of each population, the number of generated populations and the complexity of the function to be minimized. Therefore, an upper bound on the computational complexity of HMMLGA for p time series, with p the size of an individual, m the population size and

n_{g}

the number of population generations is

O (p m n_{g}) \times O (fminsearch) \times p

, where

O (fminsearch)

can also be estimated. The most expensive steps in fminsearch are the computation of the Hessian matrix, which costs the same as that of the Fisher information matrix (our matrix

W_{i}

), and the computation of the determinant. For fixed i, the computational complexity of the Hessian for an

(n - d) \times (n - d)

matrix is

O (\frac{(n - d) (n - d + 1)}{2})

. An upper bound on the complexity of determinant in (13) is

O ({(p d)}^{3})

(for proof see e.g., [25]). Denote

M = max {{(p d)}^{3}, \frac{(n - d) (n - d + 1)}{2}} .

Since we have p optimization functions, our upper bound on the computational complexity of HMMLGA is then

O (p^{2} m n_{g} M)

. The computational complexity of

exHMML

is

O (p 2^{p}) \times O (fminsearch) = O (p 2^{p} M)

.
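To give a feeling for these bounds, the following Python sketch evaluates M and the two leading terms for concrete parameter values. The function name is hypothetical, and the values of m and n_g used in the example afterwards are illustrative choices, not settings reported in this paper.

```python
def complexity_terms(p, d, n, m, n_g):
    """Leading terms of the complexity bounds for HMMLGA and exHMML.

    Constants hidden in the O-notation are ignored, so only the relative
    growth of the two terms is meaningful.
    """
    det_cost = (p * d) ** 3                      # determinant term
    hessian_cost = (n - d) * (n - d + 1) // 2    # Hessian term
    big_m = max(det_cost, hessian_cost)
    return {
        "M": big_m,
        "HMMLGA": p ** 2 * m * n_g * big_m,
        "exHMML": p * 2 ** p * big_m,
    }
```

For p = 8, d = 4 and n = 1000, the Hessian term dominates and M = 496506. With the illustrative values m = 10 and n_g = 20, the exHMML term is then smaller than the HMMLGA term; because of the factor 2^p, the ordering reverses as p grows, e.g., for p = 16.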

4. Related Work

In this section, we discuss the related work on the application of two description length based compression schemes for generalized linear models, further the related work on these compression principles applied to causal inference in graphical models, and finally, other papers on causal inference in graphical models for non-Gaussian time series.

Minimum description length (MDL) is another principle based on compression. Similarly as for MML, by viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework (Rissanen [26], Barron et al. [27], and Hansen and Yu [28]) discriminates between competing model classes based on the complexity of each description. The minimum description length principle is based on the idea that one chooses the model that gives the shortest description of data. The methods based on MML and MDL appear mostly equivalent, but there are some differences, especially in interpretation. MML is a Bayesian approach: it assumes that the data-generating process has a given prior distribution. MDL avoids assumptions about the data-generating process. Both methods make use of two-part codes: the first part always represents the information that one is trying to learn, such as the index of a model class (model selection) or parameter values (parameter estimation); the second part is an encoding of the data, given the information in the first part.

Hansen and Yu (2003) [29] derived objective functions for one-dimensional GLM regression by the minimum description length principle. The extension to the multi-dimensional case is, however, not straightforward. Schmidt and Makalic [23] used MML87 to derive the MML code of a multivariate GLM ridge regression. Since these works were not designed for time series and do not consider any lag, the mentioned codes cannot be directly used for Granger models.

Marx and Vreeken [30,31] and Budhathoki and Vreeken [32] applied the MDL principle to Granger causal inference. The inference in these papers is, however, done for bivariate Granger causality, and the extension to graphical Granger methods is not straightforward. Hlaváčková-Schindler and Plant [33] applied both the MML and the MDL principle to inference in graphical Granger models for Gaussian time series. Inference in graphical Granger models for Poisson distributed data using the MML principle was done by the same authors in [12]. To the best of our knowledge, papers on compression criteria for the heterogeneous graphical Granger model have not been published yet.

Among causal inference methods for time series, Kim et al. [7] proposed the statistical framework Granger causality (SFGC), which can operate on point processes, including neural spike trains. The proposed framework uses multiple statistical hypothesis testing for each pair of involved neurons. A pair-wise hypothesis test was used for each pair of possible connections among all time series and the false discovery rate (FDR) was applied. The method can also be used for time series from the exponential family.

For a fair comparison with our method, we selected the causal inference methods, which are designed for

p \geq 3

non-Gaussian processes. In our experiments, we used SFGC as a comparison method, and as another comparison method, we selected the method LINGAM from Shimizu et al. [6], which estimates causal structure in Bayesian networks among non-Gaussian time series using structural equation models and independent component analysis. Finally, as a comparison method, we used the HGGM with the adaptive Lasso penalisation method, as introduced in [1] and described in Section 2.2. The experiments reported in the papers with comparison methods were done only in scenarios when the number of time observations is several orders of magnitude greater than the number of time series.

5. Experiments

We performed experiments with HMMLGA and with exHMML on processes, which have an exponential distribution of types given in Section 3.3. We used the methods HGGM [1], LINGAM [6] and SFGC [7] for comparison. To assess similarity between the target and output causal graphs in synthetic experiments by all methods, we used the commonly applied F-measure, which takes both precision and recall into account.
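For reference, the F-measure of an output graph with respect to a target graph can be computed from their adjacency matrices as in the following Python sketch. The function name and the representation of graphs as nested lists of 0/1 entries are assumptions of this illustration; returning 0 when no edge is correctly detected is a common convention.

```python
def f_measure(target, output):
    """F-measure (harmonic mean of precision and recall) of detected directed edges.

    target, output : square adjacency matrices given as nested lists of 0/1 entries.
    """
    tp = fp = fn = 0
    for row_t, row_o in zip(target, output):
        for t, o in zip(row_t, row_o):
            if o and t:
                tp += 1          # edge detected and present in the target graph
            elif o and not t:
                fp += 1          # edge detected but absent from the target graph
            elif t and not o:
                fn += 1          # edge of the target graph that was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, an output graph with two correct edges, one spurious edge and one missed edge has precision 2/3, recall 2/3 and therefore F-measure 2/3.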

5.1. Implementation and Parameter Setting

The comparison method HGGM uses the Matlab package

penalized

from [34] with adaptive Lasso penalty. The algorithm in this package employs the Fisher scoring algorithm to estimate the regression coefficients. As recommended by the author of

penalized

in [34] and employed in [1], we used adaptive Lasso with

λ_{max} = 5

, applying cross validation and taking the best result with respect to F-measure from the interval

(0, λ_{max}]

. We also followed the recommendation of the authors of LINGAM in [6] and used threshold = 0.05 and the number of boots n/2, where n is the length of the time series. In the method SFGC, we used the setting recommended by the authors, the significance level

0.05

of FDR.

In HMMLGA and exHMML, the initial estimates of

β_{i}

were obtained by the iteratively reweighted least squares (IRLS) procedure implemented in the Matlab function glmfit; the same function also provided the estimates of the dispersion parameters of the time series. (Computing the initial estimates of

β_{i}

by the IRLS procedure using the function penalized with ridge penalty gave poor results in the experiments.) In the case of the gamma distribution, we obtained the estimates of the parameters

κ_{i}

by statistical fitting, specifically by the Matlab function gamfit. The minimization over

λ_{i}

was done by the function fminsearch, with the set H from Algorithm 1 defined as the positive numbers from the interval [0.1, 1000].

5.2. Synthetically Generated Processes

To evaluate the performance of HMML and to compare it with other methods, the ground truth, i.e., the target causal graph in the experiments, should be known. In this series of experiments, we examined randomly generated processes with exponential-family distributions of the Gaussian and gamma types from Section 3.3, together with the correspondingly generated target causal graphs. The performance of all tested algorithms depends on various parameters, including the number of time series (features), the number of causal relations in the Granger causal graph (dependencies), the length of the time series and the lag parameter. An appropriate lag for each time series can, in theory, be calculated by AIC or BIC. However, the calculation of AIC and BIC assumes that the degrees of freedom are equal to the number of nonzero parameters, which is only known to be true for the Lasso penalty [35], but not for adaptive Lasso. In our experiments, we followed the recommendation of [1] on how to select the lag of time series in HGGM. It was observed that varying the lag parameter from 3 to 50 did not significantly influence the performance of either HGGM or SFGC. Based on that, we considered lags 3 and 4 in our experiments.

We examined causal graphs with mixed types of time series for

p = 5

and

p = 8

features. For each case, we considered causal graphs with higher edge density (dense case) and lower edge density (sparse case), which corresponds to the parameter “dependency” in the code; for p time series, the full graph has

p (p - 1)

possible directed edges. Since we concentrate on short time series in this paper, the length of the generated time series varied from 100 to 1000.
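The edge density implied by a given number of dependencies follows directly from this count, as the short Python sketch below shows (the function name is hypothetical).

```python
def edge_density(num_edges, p):
    """Fraction of the p * (p - 1) possible directed edges (no self-loops) that are present."""
    if p < 2:
        raise ValueError("at least two time series are needed")
    return num_edges / (p * (p - 1))
```

For p = 5, a graph with 18 edges has density 18/20 = 0.9 and a graph with 8 edges has density 8/20 = 0.4; for p = 8, there are 56 possible directed edges.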

5.2.1. Causal Networks with 5 and 8 Time Series

We considered 5 time series with 2 gamma, 2 Gaussian and 1 Poisson distributions, which we generated randomly together with the corresponding network. For the denser case with 5 time series, we generated randomly graphs with 18 edges, and for the sparser case, random graphs with 8 edges. The results of our experiments on causal graphs with 5 features (

p = 5

) are presented in Table 1. Each value in Table 1 represents the mean value of all F-measures over 10 random generations of causal graphs for length n and lag d. For dependency 8, we took strength = 0.9; for dependency 18, we took strength = 0.5 of causal connections.

One can see from Table 1 that HMMLGA and exHMML gave considerably higher precision in terms of F-measure than three other comparison methods, for all considered n up to 1000.

In the second network, we considered 8 time series with 7 gamma and 1 Gaussian distributions, which we generated randomly together with a corresponding network. For the denser case, we randomly generated graphs with 52 edges and for the sparser case random graphs with 15 edges. The results are presented in Table 2. Each value in Table 2 represents the mean value of all F-measures over 10 random generations of causal graphs for length n and lag d. For graph with 52 dependencies, we had strength = 0.3; for graph with 15 dependencies, strength = 0.9. Similarly as in the experiments with

p = 5

, one can see in Table 2 for

p = 8

that both exHMML and HMMLGA gave considerably higher F-measure than the comparison methods for the considered n up to 1000. The pair-wise hypothesis test used in SFGC for each pair of possible connections among all time series showed its efficiency for long time series in [1,7]; however, in all experiments in our short time series scenarios, it was outperformed by LINGAM. The performance of HGGM, which is efficient in long-term scenarios [1], was comparable to that of LINGAM for 5 time series; for 8 time series, HGGM performed the weakest of all the methods.

5.2.2. Performance of exHMML and HMMLGA

The strategy to select the set

γ_{i}

with minimum HMML and with minimum number of regressors is applied in both methods. In exHMML, all

2^{p}

possible values of HMML were sorted in ascending order. Among those with the same minimum value, the one selected is that which has the minimum number of ones (regressors) and is the last in the list. Similarly, this strategy is applied iteratively in HMMLGA on populations of individuals which have size

m < 2^{p}

. This strategy is an improvement with respect to MMLGA [12], where the first

γ_{i}

in the list with minimum MML function was selected. However, since the function HMML has multiple local minima, convergence to the global minimum cannot be guaranteed for either exHMML or HMMLGA. The different performance of exHMML and HMMLGA for various p and various causal graph densities is due to the nature of the objective function in (12), which has multiple local minima. Without any prior knowledge of the ground truth causal graph, the implementations of the exhaustive search and of the genetic algorithm described above can therefore perform differently. However, as shown in the experiments, the local minima achieved by both methods are much closer to the global one than those of the three rival methods.

5.3. Climatological Data

We studied dynamics among seven climatic time series in a time interval. All time series were measured in the station of the Institute for Meteorology of the University of Life Science in Vienna 265 m above sea level [36]. Since weather is a very changeable process, it makes sense to focus on a shorter time interval. We considered time series of dewpoint temperature (degree C, dew p), air temperature (degree C, air tmp), relative humidity (%, rel hum), global radiation (W

m^{- 2}

, gl rad), wind speed (km/h, w speed), wind direction (degree, w dir), and air pressure (hPa, air pr). All processes were measured every ten minutes, which corresponds to

n = 432

time observations for each time series. We concentrated on the temporal interactions of these processes during two scenarios. The first one corresponded to 7 to 9 July 2020 which were days with no rain. The second one corresponded to 16 to 18 July 2020 which were rainy days.

Before we applied the methods, we tested the distributions of each time series. In the rainy-days scenario, air temperature (2) and global radiation (4) followed a gamma distribution and the remaining processes, the dew point temperature (1), relative humidity (3), wind speed (5), wind direction (6), and air pressure (7), followed a Gaussian distribution. In the dry-days scenario, wind direction (6) and air pressure (7) followed a Gaussian distribution, while the dew point temperature (1), air temperature (2), relative humidity (3), global radiation (4) and wind speed (5) followed a gamma distribution. To decide which of the algorithms exHMML or HMMLGA would be preferable in this real-valued experiment, we executed synthetic experiments for constellations of 5 gamma and 2 Gaussian (dry days), as well as of 2 gamma and 5 Gaussian (rainy days) time series, with

n = 432

for sparse and dense graphs with

d = 4

and 5, each for 10 random graphs. A higher F-measure was obtained for HMMLGA; therefore, we applied the HMMLGA procedure in the climatological experiments.

For both rainy and dry days, the methods HMMLGA, Lingam and HGGM each produced the same output graph for both lags, except that HGGM produced a different graph for each lag on dry days. We were interested in (a) how realistic the temporal interactions detected by each method were in each scenario and (b) how realistic the detected temporal interactions arising from the difference between the graphs for dry and rainy days were for each method. For (b), we focused only on the connections that differed between the two graphs for each method. The output graphs of the methods HMMLGA, Lingam, SFGC and HGGM for rainy and dry days are shown for lag

d = 4

in Figure 1 and Figure 2.

For Lingam, the output graphs for rainy and dry days were identical and complete, so we omitted this method from further analysis.

Based on expert knowledge [37], the temporal interactions in the HMMLGA output graphs correspond to reality in both the rainy and the dry scenario. In

HMMLGA_{D - R}

, the subgraph of HMMLGA containing the connections detected for dry days but not for rainy days, the following directed edges in the form (cause, effect) were detected: (air tmp, air pr) and (dew p, air pr). The (direct) influence of dew point on air pressure is more strongly observable during sunny days, since the dew point cannot be determined during rainy days. Similarly, the causal influence of air temperature on air pressure is stronger during sunny days than during rainy days. So, both edges detected by HMMLGA were realistic.

HMMLGA_{R - D}

was empty. The output graph

HGGM_{D - R}

had no edges. For

HGGM_{R - D}

, we obtained the following directed edges: (dew p, air pr), which is not observable during rain, whereas the detected influence (rel hum, dew p) is observable also during rain. Moreover, (rel hum, air pr) is observable (as humidity increases, pressure decreases). The edge (w speed, w dir) is not observable in reality, while (w speed, air pr) is observable (higher wind speeds will show lower air pressure); (w speed, air tmp) and (w speed, gl rad) are also observable, but the direct effect (w dir, rel hum) is not observable in reality. So,

HGGM_{R - D}

had 2 falsely detected directions out of 8. Graph

SFGC_{R - D}

gave the edge (dew p, air pr); as in the case of HGGM, this edge is not observable during rain. Of its remaining edges, (dew p, air tmp), (dew p, w speed), (dew p, rel hum), (dew p, gl rad), (gl rad, w speed) and (gl rad, w dir) are not observable during rain, while (rel hum, gl rad) is observable during rain. So,

SFGC_{R - D}

had 7 falsely detected directions out of 8. The output of

SFGC_{D - R}

gave the following edges, all of which are observable during a dry period: (rel hum, dew p), (rel hum, air tmp), (gl rad, w speed), (dew p, air tmp), (air pr, w dir), (w speed, air pr) and (air pr, w speed). So,

SFGC_{D - R}

had 7 correctly detected directions out of 7.
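The difference graphs used in this comparison can be obtained from two adjacency matrices as in the following Python sketch. It assumes that the subscript D - R denotes the edges present in the dry-days graph but absent from the rainy-days graph, and that row i of an adjacency matrix marks the causes of series i, as in the output of Algorithm 2; the function name and the toy matrices in the test are hypothetical.

```python
def difference_edges(adj_a, adj_b, names):
    """Edges in the form (cause, effect) present in graph A but absent from graph B.

    Assumed convention: adj[i][j] == 1 means that series j is a cause of series i.
    """
    edges = []
    for i, (row_a, row_b) in enumerate(zip(adj_a, adj_b)):
        for j, (a, b) in enumerate(zip(row_a, row_b)):
            if a and not b:
                edges.append((names[j], names[i]))
    return edges
```

Calling the function with the two graphs in the opposite order yields the R - D difference; an empty list corresponds to an empty difference graph.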

We conclude that, in this climatological experiment, method HMMLGA, followed by SFGC, gave the most realistic causal connections with respect to the comparison methods.

5.4. Electrohysterogram Time Series

In current obstetrics, there is no effective way of preventing preterm birth. The main reason is that no good objective method is known for evaluating the stepwise progression of pregnancy through to labor [38]. Any better understanding of the underlying labor dynamics can contribute to preventing preterm birth, which is the main cause of mortality in newborns. External recordings of the electrohysterogram (EHG) can provide new knowledge on uterine electrical activity associated with contractions.

We considered a database of 4-by-4 electrode EHG recordings performed on pregnant women, which were recorded in Iceland between 2008 and 2010 and are available via PhysioNet (PhysioBank ATM) [39]. This EHG grid (in matrix form) was placed on the abdomen of the pregnant women. The electrode numbering, as considered in [38], can be found in Figure 3.

We applied all the methods to the recordings, specifically to the EHG signals of women in the third trimester of pregnancy and during labor. We selected all (five) mothers for whom recordings were performed both in the third trimester and during labor. Since there is no known ground truth on what the dynamics among the electrodes should look like for both modalities, we set ourselves a modest objective: to test whether HMMLGA and the comparison methods are able to distinguish labor from pregnancy from the EHG recordings. During labor, a higher density of interactions among electrodes is expected than during pregnancy, due to the higher occurrence of contractions of the uterine smooth muscles, which is also supported by some recent research in obstetrics, e.g., [40].

The 16 electromyographic time series (channels) were taken for all women (women 11, 27, 30, 31 and 34), for each in the third trimester (P) and during labor (L). The observations in the time series correspond to a time resolution of every 5th microsecond. The time series in the database are annotated with information about contraction, possible contraction, participant movement, participant change of position, fetal movement and equipment manipulation. By statistical fitting, we found that all 16 time series followed a Poisson distribution (setting raw ADC units in the Physionet database). We analysed the causal connections of each method for labor and pregnancy for all five women.

Since HMMLGA had a higher F-measure than exHMML in the synthetic experiments with 16 Poisson time series, we considered only HMMLGA in this real data experiment. In the synthetic experiments in [12], Poisson time series showed the highest F-measure on short time series, i.e., when the number of time observations is smaller than approximately two orders of magnitude times the number of time series. Based on this, we took the last 1200 observations for labor, since in this last phase it was certain that labor had already started and the contractions had increased over time. For each of the five women, labor continued for another few hours after the EHG recording finished. For the pregnancy time series, we also took 1200 observations, starting from the moment when all electrodes had been fixed. The hypothesis that all electrodes were activated during labor was confirmed by HMMLGA, HGGM and Lingam for all mothers. The hypothesis that the causal graph during labor had a higher density of causal connections than in the pregnancy case was confirmed by HMMLGA for all mothers and by HGGM for mothers 30 and 31, but could not be confirmed by SFGC or Lingam. In fact, Lingam gave identical complete causal graphs for both the labor and the pregnancy cases. The real computational time of Lingam (with 100 boots, as recommended by the authors) for 16 time series and both the labor and pregnancy modalities was approximately 12 h (on an HP Elite Notebook); by contrast, the time for the other methods was of the order of minutes. We present the causal graphs of all methods for the labor and pregnancy phases of mother 31 in Figure 4.

One can see that the density of connections by HMMLGA for labor is higher than for pregnancy. For all mothers, the causal graphs of HMMLGA for labor were also denser than those for pregnancy. To make more concrete hypotheses about the temporal interactions among the electrodes based on contractions, we would probably have to consider only intervals known to contain no artifacts, or only a limited number of them, in terms of participant movement, participant change of position, etc.

6. Conclusions

Common graphical Granger models, including the heterogeneous graphical Granger model, often suffer from overestimation in scenarios with short time series. To remedy this, in this paper we proposed to use the minimum message length principle to determine the causal connections in the heterogeneous graphical Granger model. Based on the dispersion coefficient of the target time series and on the initial maximum likelihood estimates of the regression coefficients, we proposed a minimum message length criterion to select the subset of causally connected time series with each target time series, and we derived its concrete form for various exponential distributions. We found this subset by a genetic-type algorithm (HMMLGA), which we proposed, as well as by exhaustive search (exHMML). We evaluated the complexity of these algorithms. The code in Matlab is provided. We demonstrated the superiority of both methods with respect to the comparison methods in synthetic experiments in short data scenarios. In two real data experiments, the interpretation of the causal connections resulting from HMMLGA was the most realistic with respect to the comparison methods. The superiority of HMMLGA over the comparison methods for short time series can be explained by the use of the dispersion of the time series in the criterion as additional (prior) information, as well as by the fact that this criterion is optimized over a finite search space.

Author Contributions

Conceptualization, K.H.-S.; Data curation, K.H.-S.; Formal analysis, K.H.-S.; Investigation, K.H.-S.; Methodology, K.H.-S.; Resources, C.P.; Software, K.H.-S.; Supervision, C.P.; Validation, K.H.-S.; Visualization, K.H.-S.; Writing—original draft, K.H.-S.; Writing—review & editing, C.P., K.H.-S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Czech Science Foundation grant number GA19-16066S.

Acknowledgments

This work was supported by the Czech Science Foundation, project GA19-16066S. The authors thank Dr. Irene Schicker and Dipl.-Ing. Petrina Papazek from [37] for their help with analysing the results of the climatological experiments. Open Access Funding by the University of Vienna.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of the MML Criterion for HGGM

Assume p independent random variables from the exponential family which are represented by time series

x_{i}^{t}, t = d + 1, \dots, n

and, for each i, let

{\hat{ϕ}}_{i}

be the given estimate of its dispersion. Consider the problem (10) for a given lag

d > 0

.

We consider now

γ_{i}

fixed, so for simplicity of writing we omit it from the list of variables of the functions. Having function

L_{i}

, we can now compute an initial estimate of

{\hat{β}}_{i}

from (10) which is the solution to the system of score equations. Since

L_{i}

forms a convex function, one can use standard convex optimization techniques (e.g., Newton-Raphson method) to solve these equations numerically. (In our code, we use the Matlab implementation of an iteratively reweighted least squares (IRLS) algorithm of the Newton–Raphson method). Assume now we have an initial solution

{\hat{β}}_{i}

from (10).

Having parameters

{\hat{β}}_{i}

,

{\hat{ϕ}}_{i}

,

Σ_{i}

,

W_{i}

and

λ_{i}

, we need to construct the function

HMML (γ_{i})

. For each

i = 1, \dots, p

and for regression (10), we use Formula (18) from [23], i.e., the case when we plug in the variables

α : = 0

and

β : = {\hat{β}}_{i}

and

X : = X_{i}

,

y : = x_{i}

,

n : = n - d

,

k : = k_{i}

,

θ : = {\hat{β}}_{i}

,

λ : = λ_{i}

,

ϕ : = {\hat{ϕ}}_{i}

,

S : = Σ_{i}

, the identity matrix of dimension

d k_{i}

. The corrected Fisher information matrix for the parameters

β_{i}

is then

J ({\hat{β}}_{i} | {\hat{ϕ}}_{i}, λ_{i}) = (\frac{1}{{\hat{ϕ}}_{i}}) X_{i}^{'} W_{i} X_{i} + λ_{i} Σ_{i}

. Function

c (m)

for

m : = k_{i} + 1

is then

c (k_{i} + 1) = - \frac{k_{i} + 1}{2} log (2 π) + \frac{1}{2} log ((k_{i} + 1) π) - 0.5772

; the constants independent of

k_{i}

were omitted from the HMML code, since the optimization w.r.t.

γ_{i}

does not depend on them. Among all subsets

γ_{i} \in Γ

, there are

(\binom{p}{k_{i}})

subsets of size

k_{i}

. If nothing is known a priori about the likelihood of any covariate

x_{i}

being included in the final model, a prior that treats all subset sizes equally likely

π (| γ_{i} |) = 1 / (p + 1)

is appropriate [23]. This gives the code length

I (γ_{i}) = log (\binom{p}{k_{i}}) + log (p + 1)

as in (12).
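The two closed-form terms above are easy to evaluate directly. The following Python sketch is an illustration only: the function names `c` and `subset_code_length` are hypothetical, natural logarithms are assumed (code lengths in nats), and the constant 0.5772 is used as rounded in the text.

```python
import math

EULER_TERM = 0.5772  # constant as rounded in the text

def c(m):
    """c(m) = -(m / 2) log(2 pi) + (1 / 2) log(m pi) - 0.5772."""
    return -0.5 * m * math.log(2 * math.pi) + 0.5 * math.log(m * math.pi) - EULER_TERM

def subset_code_length(p, k):
    """I(gamma) = log C(p, k) + log(p + 1) for a subset of size k out of p."""
    return math.log(math.comb(p, k)) + math.log(p + 1)

p = 5  # number of time series (example value)
for k in range(p + 1):
    print(k, round(subset_code_length(p, k), 4), round(c(k + 1), 4))

# Sanity check: the implied prior exp(-I) sums to one over all 2^p subsets.
total_prior = sum(math.comb(p, k) * math.exp(-subset_code_length(p, k))
                  for k in range(p + 1))
print(total_prior)
```

The final check reflects the construction of the prior: each subset size receives mass $1 / (p + 1)$, shared equally among the $\binom{p}{k_{i}}$ subsets of that size.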

Appendix B. Derivation of $L_{i}$, $W_{i}$, $ϕ_{i}$ for Various Exponential Distributions of $x_{i}$

Case $x_{i}$ is Gaussian. Since in this case $ϕ_{i} = σ_{i}^{2}$ is its variance, we omit $ϕ_{i}$ from the list of parameters that condition the function $p$. $L_{i}$ in (15) is obtained directly from (14) by taking the logarithm. By substituting the values for the identity link corresponding to the Gaussian case, $η_{i}^{t} = μ_{i}^{t} = {[X_{i} β_{i}]}^{t}$ and $\frac{δ η_{i}^{t}}{δ μ_{i}^{t}} = 1$, into Formula (13) from [23], the matrix $W_{i} = I_{k_{i} d \times k_{i} d}$ is directly obtained.

Case $x_{i}$ is binomial. Assuming $ϕ_{i}$ to be a constant, we can omit $ϕ_{i}$ from the list of parameters that condition the function $p$. $L_{i}$ in (17) is obtained directly from (16) by taking the logarithm. As in the previous case, $W_{i}$ is obtained from Formula (13) in [23]: the value of $W_{i}$ in (19) results from substituting the values for the logit link corresponding to the binomial case, $η_{i}^{t} = {[X_{i} β_{i}]}^{t} = \log (\frac{μ_{i}^{t}}{1 - μ_{i}^{t}})$ and $\frac{δ η_{i}^{t}}{δ μ_{i}^{t}} = \frac{1}{μ_{i}^{t} (1 - μ_{i}^{t})}$, into that formula. If we cannot assume $ϕ_{i} = 1$, we apply the sandwich estimate of the covariance matrix of ${\hat{β}}_{i}$ for robust estimation (for a general logistic regression it can be found in, e.g., [41]); in our case it gives the matrix $W_{i}$ in the form

$$W_{i} = \mathrm{diag} ({(x_{i}^{1} - \frac{\exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{1})}{{(1 + \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{1}))}^{2}})}^{2}, \dots, {(x_{i}^{n - d} - \frac{\exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d})}{{(1 + \exp ({[X_{i} {\hat{β}}_{i}^{'}]}^{n - d}))}^{2}})}^{2}) .$$

Case $x_{i}$ is Poisson. First we express the log-likelihood function $L_{i}$ in terms of the parameters $β_{i}$. Since we use the Poisson model for $x_{i}$ having either the Poisson or the overdispersed Poisson distribution, we omit $ϕ_{i}$ from the list of parameters that condition the function $p$. For a given set of parameters $β_{i}$, the probability of attaining $x_{i}^{d + 1}, \dots, x_{i}^{n}$ is given by

$$p (x_{i}^{d + 1}, \dots, x_{i}^{n} | X_{i}, β_{i}) = \prod_{t = d + 1}^{n} \frac{{(μ_{i}^{t})}^{x_{i}^{t}} \exp (- μ_{i}^{t})}{x_{i}^{t}!} = \prod_{t = d + 1}^{n} \frac{{\exp ({[X_{i} β_{i}^{'}]}^{t})}^{x_{i}^{t}} \exp (- \exp ({[X_{i} β_{i}^{'}]}^{t}))}{x_{i}^{t}!} ,$$

where $μ_{i}^{t} = \exp ({[X_{i} β_{i}^{'}]}^{t})$ (recalling the notation from Section 3.2, ${[X_{i} β_{i}^{'}]}^{t}$ denotes the $t$-th coordinate of the vector $X_{i} β_{i}^{'}$). The log-likelihood in terms of $β_{i}$ is

$$L_{i} = \log p (β_{i} | x_{i}, X_{i}) = \sum_{t = d + 1}^{n} [x_{i}^{t} {[X_{i} β_{i}^{'}]}^{t} - \exp ({[X_{i} β_{i}^{'}]}^{t}) - \log (x_{i}^{t}!)] .$$

Now we derive the matrix $W_{i}$ for $x_{i}$ with the (exact) Poisson distribution. The Fisher information matrix $J_{i} = J (β_{i}) = - E_{β_{i}} (\nabla^{2} L_{i} (β_{i} | x_{i}, X_{i}))$ may be obtained by computing the second-order partial derivatives of $L_{i}$ for $r, s = 1, \dots, k_{i}$. This gives

$$\begin{matrix} \frac{δ^{2} L_{i} (β_{i} | x_{i}, X_{i})}{δ β_{i}^{r} δ β_{i}^{s}} & = & \frac{δ}{δ β_{i}^{s}} \sum_{t = d + 1}^{n} [x_{i}^{t} \sum_{l = 1}^{d} x_{r}^{t - l} - \exp (\sum_{j = 1}^{k_{i}} \sum_{l = 1}^{d} x_{j}^{t - l} β_{j}^{l}) \sum_{l = 1}^{d} x_{r}^{t - l}] \\ & = & - \sum_{t = d + 1}^{n} \exp (\sum_{j = 1}^{k_{i}} \sum_{l = 1}^{d} x_{j}^{t - l} β_{j}^{l}) (\sum_{l = 1}^{d} x_{s}^{t - l}) (\sum_{l = 1}^{d} x_{r}^{t - l}) . \end{matrix}$$

(A1)

If we denote

$$W_{i} := \mathrm{diag} (\exp (\sum_{j = 1}^{k_{i}} \sum_{l = 1}^{d} x_{j}^{d + 1 - l} β_{j}^{l}), \dots, \exp (\sum_{j = 1}^{k_{i}} \sum_{l = 1}^{d} x_{j}^{n - l} β_{j}^{l})) ,$$

then we have the Fisher information matrix $J (β_{i}) = {(X_{i})}^{'} W_{i} X_{i}$. Alternatively, $W_{i}$ can be obtained from Formula (13) in [23]: the value of $W_{i}$ in (22) results from substituting the values for the log link corresponding to the Poisson case, $η_{i}^{t} = {[X_{i} β_{i}]}^{t} = \log (μ_{i}^{t})$ and $\frac{δ η_{i}^{t}}{δ μ_{i}^{t}} = \frac{1}{μ_{i}^{t}}$, into that formula.
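The identity $J (β_{i}) = {(X_{i})}^{'} W_{i} X_{i}$ for the Poisson case can be checked numerically. The Python sketch below compares it with a finite-difference Hessian of the log-likelihood. It is an illustration under stated assumptions: the random matrix `X` merely stands in for the lagged design matrix $X_{i}$, the coefficient values are arbitrary, and all function names are hypothetical.

```python
import numpy as np

def poisson_loglik(beta, X, x):
    """Poisson log-likelihood up to the constant -sum(log(x!))."""
    eta = X @ beta
    return np.sum(x * eta - np.exp(eta))

def poisson_fisher(beta, X):
    """J(beta) = X' W X with W = diag(exp(X beta))."""
    w = np.exp(X @ beta)
    return (X.T * w) @ X

def numerical_hessian(f, beta, h=1e-3):
    """Central finite-difference Hessian of f at beta."""
    k = beta.size
    H = np.zeros((k, k))
    for r in range(k):
        for s in range(k):
            er = np.zeros(k)
            es = np.zeros(k)
            er[r] = h
            es[s] = h
            H[r, s] = (f(beta + er + es) - f(beta + er - es)
                       - f(beta - er + es) + f(beta - er - es)) / (4 * h * h)
    return H

rng = np.random.default_rng(1)
X = rng.normal(scale=0.5, size=(200, 3))   # stand-in for the design matrix X_i
beta = np.array([0.2, -0.1, 0.3])          # arbitrary coefficients
x = rng.poisson(np.exp(X @ beta))          # stand-in for the target series x_i

J = poisson_fisher(beta, X)
H = numerical_hessian(lambda b: poisson_loglik(b, X, x), beta)
print(np.max(np.abs(J + H)))   # small: J equals minus the Hessian of L_i
```

Because the log link is canonical for the Poisson distribution, the Hessian does not depend on the observations, so the observed and the expected information coincide here.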

Derivation of the matrix $W_{i}$ for $x_{i}$ with the overdispersed Poisson distribution. Assume now that the dispersion parameter satisfies $ϕ_{i} > 0$, $ϕ_{i} \neq 1$. The variance of the overdispersed Poisson distribution is $ϕ_{i} μ_{i}$. The Poisson regression model can still be used in overdispersed settings, and the function $L_{i}$ is the same as $L_{i} (β_{i})$ derived above. We use the robust sandwich estimate of the covariance of ${\hat{β}}_{i}$, as proposed in [42] for general Poisson regression. The Fisher information matrix of the overdispersed problem is $J_{i} = J (β_{i}) = {(X_{i})}^{'} W_{i} X_{i}$, where $W_{i}$ is constructed for Poisson $x_{i}$ based on [42] and has the form

$$W_{i} = \mathrm{diag} ({[x_{i}^{d + 1} - \exp (\sum_{j = 1}^{k_{i}} \sum_{l = 1}^{d} x_{j}^{d + 1 - l} β_{j}^{l})]}^{2}, \dots, {[x_{i}^{n} - \exp (\sum_{j = 1}^{k_{i}} \sum_{l = 1}^{d} x_{j}^{n - l} β_{j}^{l})]}^{2}) .$$
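The effect of replacing the model-based weights by the squared residuals can be seen in a small simulation. The Python sketch below is an illustration only: it assumes that overdispersed counts with variance $ϕ_{i} μ_{i}$ can be emulated by a negative binomial draw, evaluates the weights at the true coefficients rather than at an estimate, and uses hypothetical function names and parameter values.

```python
import numpy as np

def poisson_weights(X, beta):
    """Diagonal of the model-based W_i = diag(mu), with mu = exp(X beta)."""
    return np.exp(X @ beta)

def sandwich_weights(X, x, beta):
    """Diagonal of the robust W_i = diag((x - mu)^2)."""
    mu = np.exp(X @ beta)
    return (x - mu) ** 2

rng = np.random.default_rng(2)
n, phi = 50000, 3.0                        # sample size and dispersion (assumed values)
X = np.column_stack([np.ones(n), rng.normal(scale=0.3, size=n)])
beta = np.array([1.0, 0.5])
mu = np.exp(X @ beta)

# Negative binomial draw with mean mu and variance phi * mu:
# success probability 1 / phi and size parameter mu / (phi - 1).
x = rng.negative_binomial(mu / (phi - 1.0), 1.0 / phi)

w_model = poisson_weights(X, beta)
w_robust = sandwich_weights(X, x, beta)
ratio = w_robust.mean() / w_model.mean()
print(ratio)   # close to phi, since E[(x - mu)^2] = phi * mu
```

On average the robust weights are the model-based weights scaled by the dispersion, which is how the sandwich form accounts for overdispersion without changing $L_{i}$.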

Case $x_{i}$ is gamma. $L_{i}$ in (25) is obtained directly from (24) by taking the logarithm. By substituting the values for the inverse link corresponding to the gamma case, $η_{i}^{t} = \frac{1}{μ_{i}^{t}}$ and $\frac{δ η_{i}^{t}}{δ μ_{i}^{t}} = \frac{1}{{(μ_{i}^{t})}^{2}}$, into Formula (13) from [23], the matrix $W_{i}$ from (26) is directly obtained.

Case $x_{i}$ is inverse-Gaussian. $L_{i}$ in (28) is obtained directly from (27) by taking the logarithm. By substituting the values for the log link corresponding to the inverse-Gaussian case, $η_{i}^{t} = {[X_{i} β_{i}]}^{t} = \log (μ_{i}^{t})$ and $\frac{δ η_{i}^{t}}{δ μ_{i}^{t}} = \frac{1}{μ_{i}^{t}}$, into Formula (13) from [23], the matrix $W_{i}$ from (29) is directly obtained.
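The link substitutions used in this appendix can be illustrated with the standard GLM working-weight formula $w = 1 / (V (μ) (δ η / δ μ)^{2})$. The Python sketch below assumes that this standard formula plays the role of Formula (13) from [23], which is not reproduced here. The variance functions $V (μ)$ are the textbook ones and the function name `working_weights` is hypothetical. Only the Gaussian, binomial and Poisson cases are shown.

```python
import numpy as np

def working_weights(mu, variance, deta_dmu):
    """Standard GLM working weights w = 1 / (V(mu) * (d eta / d mu)^2)."""
    return 1.0 / (variance(mu) * deta_dmu(mu) ** 2)

mu = np.array([0.2, 0.5, 0.8])   # example means (valid for all three cases)

# Gaussian, identity link: V(mu) = 1, d eta / d mu = 1, hence w = 1.
w_gauss = working_weights(mu, lambda m: np.ones_like(m), lambda m: np.ones_like(m))

# Binomial, logit link: V(mu) = mu (1 - mu), d eta / d mu = 1 / (mu (1 - mu)),
# hence w = mu (1 - mu).
w_binom = working_weights(mu, lambda m: m * (1 - m), lambda m: 1 / (m * (1 - m)))

# Poisson, log link: V(mu) = mu, d eta / d mu = 1 / mu, hence w = mu.
w_pois = working_weights(mu, lambda m: m, lambda m: 1 / m)

print(w_gauss, w_binom, w_pois)
```

The Gaussian and Poisson results agree with the identity matrix and with the diagonal of $\exp ({[X_{i} β_{i}]}^{t})$ stated above for those cases.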

References

1. Behzadi, S.; Hlaváčková-Schindler, K.; Plant, C. Granger Causality for Heterogeneous Processes. In Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Cham, Switzerland, 2019.
2. Zou, H. The adaptive lasso and its oracle property. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
3. Hryniewicz, O.; Kaczmarek, K. Forecasting short time series with the bayesian autoregression and the soft computing prior information. In Strengthening Links Between Data Analysis and Soft Computing; Springer: Cham, Switzerland, 2015; pp. 79–86.
4. Bréhélin, L. A Bayesian approach for the clustering of short time series. Rev. D’Intell. Artif. 2006, 20, 697–716.
5. Wallace, C.S.; Boulton, D.M. An information measure for classification. Comput. J. 1968, 11, 185–194.
6. Shimizu, S.; Inazumi, T.; Sogawa, Y.; Hyvärinen, A.; Kawahara, Y.; Washio, T.; Hoyer, P.O.; Bollen, K. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. J. Mach. Learn. Res. 2011, 12, 1225–1248.
7. Kim, S.; Putrino, D.; Ghosh, S.; Brown, E.N. A Granger causality measure for point process models of ensemble neural spiking activity. PLoS Comput. Biol. 2011, 7, e1001110.
8. Arnold, A.; Liu, Y.; Abe, N. Temporal causal modeling with graphical Granger methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, 12–15 August 2007; pp. 66–75.
9. Shojaie, A.; Michailidis, G. Discovering graphical Granger causality using the truncating lasso penalty. Bioinformatics 2010, 26, i517–i523.
10. Lozano, A.C.; Abe, N.; Liu, Y.; Rosset, S. Grouped graphical Granger modeling methods for temporal causal modeling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 577–586.
11. Nelder, J.; Wedderburn, R. Generalized Linear Models. J. R. Stat. Soc. Ser. A (General) 1972, 135, 370–384.
12. Hlaváčková-Schindler, K.; Plant, C. Poisson Graphical Granger Causality by Minimum Message Length. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2020 (ECML/PKDD), Ghent, Belgium, 14–18 September 2020.
13. Granger, C.W. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 1969, 37, 424–438.
14. Mannino, M.; Bressler, S.L. Foundational perspectives on causality in large-scale brain networks. Phys. Life Rev. 2015, 15, 107–123.
15. Maziarz, M. A review of the Granger-causality fallacy. J. Philos. Econ. Reflect. Econ. Soc. Issues 2015, 8, 86–105.
16. Granger, C.W. Some recent development in a concept of causality. J. Econom. 1988, 39, 199–211.
17. Lindquist, M.A.; Sobel, M.E. Graphical models, potential outcomes and causal inference: Comment on Ramsey, Spirtes and Glymour. NeuroImage 2011, 57, 334–336.
18. Spirtes, P.; Glymour, C.N.; Scheines, R.; Heckerman, D. Causation, Prediction, and Search; MIT Press: Cambridge, MA, USA, 2000.
19. Glymour, C. Counterfactuals, graphical causal models and potential outcomes: Response to Lindquist and Sobel. NeuroImage 2013, 76, 450–451.
20. Marinescu, I.E.; Lawlor, P.N.; Kording, K.P. Quasi-experimental causality in neuroscience and behavioural research. Nat. Hum. Behav. 2018, 2, 891–898.
21. Wallace, C.S.; Freeman, P.R. Estimation and inference by compact coding. J. R. Stat. Soc. Ser. B 1987, 49, 240–252.
22. Wallace, C.S.; Dowe, D.L. Minimum message length and Kolmogorov complexity. Comput. J. 1999, 42, 270–283.
23. Schmidt, D.F.; Makalic, E. Minimum message length ridge regression for generalized linear models. In Australasian Joint Conference on Artificial Intelligence; Springer: Cham, Switzerland, 2013; pp. 408–420.
24. Segerstedt, B. On ordinary ridge regression in generalized linear models. Commun. Stat. Theory Methods 1992, 21, 2227–2246.
25. Computational Complexity of Mathematical Operations. Available online: https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations (accessed on 2 October 2020).
26. Rissanen, J. Stochastic Complexity in Statistical Inquiry; World Scientific: Singapore, 1989; Volume 15, p. 188.
27. Barron, A.; Rissanen, J.; Yu, B. The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 1998, 44, 2743–2760.
28. Hansen, M.; Yu, B. Model selection and minimum description length principle. J. Am. Stat. Assoc. 2001, 96, 746–774.
29. Hansen, M.H.; Yu, B. Minimum description length model selection criteria for generalized linear models. Lect. Notes Monogr. Ser. 2003, 40, 145–163.
30. Marx, A.; Vreeken, J. Telling cause from effect using MDL-based local and global regression. In Proceedings of the 2017 IEEE International Conference on Data Mining, New Orleans, LA, USA, 18–21 November 2017; pp. 307–316.
31. Marx, A.; Vreeken, J. Causal inference on multivariate and mixed-type data. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Dublin, Ireland, 10–14 September 2018; Volume 2018, pp. 655–671.
32. Budhathoki, K.; Vreeken, J. Origo: Causal inference by compression. Knowl. Inf. Syst. 2018, 56, 285–307.
33. Hlaváčková-Schindler, K.; Plant, C. Graphical Granger causality by information-theoretic criteria. In Proceedings of the European Conference on Artificial Intelligence 2020 (ECAI), Santiago de Compostela, Spain, 29 August–2 September 2020; pp. 1459–1466.
34. McIlhagga, W.H. Penalized: A MATLAB toolbox for fitting generalized linear models with penalties. J. Stat. Softw. 2016, 72.
35. Zou, H.; Hastie, T.; Tibshirani, R. On the “degrees of freedom” of the lasso. Ann. Stat. 2007, 35, 2173–2192.
36. Available online: https://meteo.boku.ac.at/wetter/mon-archiv/2020/202009/202009.html (accessed on 5 September 2020).
37. Zentralanstalt für Meteorologie und Geodynamik 1190 Vienna, Hohe Warte 38. Available online: https://www.zamg.ac.at/cms/de/aktuell (accessed on 5 September 2020).
38. Alexandersson, A.; Steingrimsdottir, T.; Terrien, J.; Marque, C.; Karlsson, B. The Icelandic 16-electrode electrohysterogram database. Nat. Sci. Data 2015, 2, 1–9.
39. Available online: https://www.physionet.org (accessed on 5 September 2020).
40. Mikkelsen, E.; Johansen, P.; Fuglsang-Frederiksen, A.; Uldbjerg, N. Electrohysterography of labor contractions: Propagation velocity and direction. Acta Obstet. Gynecol. Scand. 2013, 92, 1070–1078.
41. Agresti, A. Categorical Data Analysis; Section 12.3.3.; John Wiley and Sons: Hoboken, NJ, USA, 2003; Volume 482.
42. Huber, P.J. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; Volume 1, pp. 221–233.

Figure 1. Output causal graphs for the methods HMMLGA and Lingam in the rainy-day and dry-day scenarios.

Figure 2. Output causal graphs for the methods HGGM and SFGC in the rainy-day and dry-day scenarios.

Figure 3. The ordering of the electrodes as mounted on the abdomen of women.

Figure 4. Output causal graphs for mother 31 during (a) labor and (b) pregnancy for all methods.

Table 1. $p = 5$, average F-measure for each method; HMML with $n_{g} = 10$, $m = 20$; HGGM with $λ_{max} = 5$; LINGAM with $n / 2$ boots. The first subtable is for $d = 3$, the second one for $d = 4$.

Subtable for $d = 3$:
dense g. 18, $n =$	100	300	500	1000	sparse g. 8, $n =$	100	300	500	1000
exHMML	0.69	0.83	0.82	0.88	exHMML	0.70	0.72	0.72	0.67
HMMLGA	0.73	0.90	0.89	0.90	HMMLGA	0.73	0.76	0.74	0.67
HGGM	0.5	0.48	0.54	0.52	HGGM	0.52	0.36	0.66	0.36
LINGAM	0.57	0.58	0.62	0.58	LINGAM	0.58	0.54	0.69	0.45
SFGC	0.33	0.26	0.26	0.33	SFGC	0.14	0.35	0.44	0.31

Subtable for $d = 4$:
dense g. 18, $n =$	100	300	500	1000	sparse g. 8, $n =$	100	300	500	1000
exHMML	0.71	0.73	0.83	0.83	exHMML	0.67	0.80	0.80	0.68
HMMLGA	0.82	0.79	0.87	0.92	HMMLGA	0.67	0.73	0.77	0.70
HGGM	0.44	0.37	0.40	0.39	HGGM	0.53	0.47	0.65	0.36
LINGAM	0.71	0.58	0.58	0.65	LINGAM	0.33	0.52	0.74	0.46
SFGC	0.43	0.55	0.42	0.63	SFGC	0.35	0.59	0.42	0.38

Table 2. $p = 8$, average F-measure for each method; HMML with $d = 3$, $n_{g} = 10$, $m = 20$; HGGM with $λ_{max} = 5$; LINGAM with $n / 2$ boots. The first subtable is for $d = 3$, the second one for $d = 4$.

Subtable for $d = 3$:
dense g. 52, $n =$	100	300	500	1000	sparse g. 15, $n =$	100	300	500	1000
exHMML	0.68	0.78	0.79	0.82	exHMML	0.69	0.73	0.77	0.64
HMMLGA	0.84	0.67	0.66	0.87	HMMLGA	0.57	0.69	0.7	0.56
HGGM	0.16	0.17	0.17	0.17	HGGM	0.2	0.09	0.18	0.17
LINGAM	0.62	0.54	0.51	0.55	LINGAM	0.28	0.33	0.4	0.19
SFGC	0.32	0.21	0.35	0.20	SFGC	0.3	0.24	0.22	0.19

Subtable for $d = 4$:
dense g. 52, $n =$	100	300	500	1000	sparse g. 15, $n =$	100	300	500	1000
exHMML	0.59	0.64	0.56	0.75	exHMML	0.58	0.84	0.80	0.69
HMMLGA	0.77	0.72	0.63	0.79	HMMLGA	0.42	0.69	0.70	0.56
HGGM	0.16	0.16	0.18	0.17	HGGM	0.17	0.10	0.18	0.19
LINGAM	0.62	0.54	0.51	0.55	LINGAM	0.27	0.33	0.40	0.18
SFGC	0.36	0.45	0.82	0.83	SFGC	0.29	0.29	0.24	0.20

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hlaváčková-Schindler, K.; Plant, C. Heterogeneous Graphical Granger Causality by Minimum Message Length. Entropy 2020, 22, 1400. https://doi.org/10.3390/e22121400

