1. Introduction
Graphical models are instrumental in the analysis of multivariate data. Originally, these models were employed for independently sampled data, but their use has been extended to multivariate, stationary time series [1,2], which has triggered their popularity in statistics, machine learning, signal processing and neuroinformatics.
For a better understanding of the significance of graphical models, let $X$ be a random vector having a Gaussian distribution with zero mean and positive definite covariance matrix $\Sigma$. A graph $G = (V, E)$ can be assigned to $X$ in order to visualize the conditional independence between its components. The symbol $V$ denotes the vertices of $G$, while $E$ is the set of its edges. There are no loops from a vertex to itself, nor multiple edges between two vertices; hence, $E$ is a subset of $V \times V$. Each vertex of the graph is assigned to an entry of $X$. We conventionally draw an edge between two vertices $a$ and $b$ if the random variables $X_a$ and $X_b$ are not conditionally independent, given all other components of $X$. The description above follows the main definitions from [3] and assumes that the graph $G$ is undirected. Proposition 1 from the same reference provides a set of equivalent conditions for conditional independence. The most interesting one states that $X_a$ and $X_b$ are conditionally independent if and only if the $(a,b)$ entry of $\Sigma^{-1}$ is zero. This shows that the missing edges of $G$ correspond to zero entries in the inverse of the covariance matrix, which is called the concentration matrix.
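As a small numerical illustration of this equivalence (the numbers below are hypothetical and serve only as an example), consider a three-dimensional Gaussian vector whose concentration matrix has a zero in position $(1,3)$:

```python
import numpy as np

# Hypothetical 3 x 3 concentration matrix K = Sigma^{-1}; the zero in position (1, 3)
# encodes that the first and third components are conditionally independent given the second.
K = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(K)

print(Sigma)        # the covariance matrix itself is dense: Sigma[0, 2] != 0
print(K[0, 2])      # 0.0 -> missing edge between vertices 1 and 3 in the graph G
```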
There is an impressive amount of literature on graphical models. In this work, we focus on a generalization of this problem to time series. The main difference between the static case and the dynamic case is that the former relies on the sparsity pattern of the concentration matrix, whereas the latter looks for zeros in the inverse of the spectral density matrix. One of the main difficulties stems from the fact that the methods developed in the static case cannot be applied straightforwardly to time series.
Parametric as well as non-parametric methods have already been proposed in the literature dedicated to graphical models for time series. Some of the recently introduced estimation methods are based on convex optimization. We briefly discuss below the most important algorithms belonging to this class.
Reference [4] extends the static case by allowing the presence of latent variables. The key point of their approach is to express the manifest concentration matrix as the sum of a sparse matrix and a low-rank matrix. Additionally, they provide conditions under which the decomposition is unique, in order to guarantee identifiability. The two matrices are estimated by minimizing a penalized likelihood function, where the penalty involves both the $\ell_1$-norm and the nuclear norm. Interestingly enough, the authors of the discussion paper [5] pointed out that an alternative solution, which relies on the Expectation-Maximization algorithm, can be easily obtained.
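To make the structure of such an estimator concrete, the following CVXPY sketch shows a generic penalized-likelihood decomposition in the spirit of [4]. Here the manifest concentration matrix is written as $S - L$ with $L \succeq 0$ (one common sign convention; equivalently a sum with a negative semidefinite low-rank term), and the names `Sigma_hat`, `alpha`, `beta` are illustrative assumptions rather than the notation or implementation of [4]:

```python
import cvxpy as cp

def sparse_plus_lowrank(Sigma_hat, alpha, beta):
    """Sketch of a penalized ML decomposition in the spirit of [4]: the manifest
    concentration matrix is modeled as S - L, with S sparse (l1 penalty) and
    L positive semidefinite with small trace (equal to its nuclear norm)."""
    m = Sigma_hat.shape[0]
    S = cp.Variable((m, m), symmetric=True)
    L = cp.Variable((m, m), PSD=True)
    K = S - L                                  # concentration matrix of the manifest variables
    objective = (-cp.log_det(K) + cp.trace(Sigma_hat @ K)
                 + alpha * cp.norm1(S) + beta * cp.trace(L))
    problem = cp.Problem(cp.Minimize(objective))
    problem.solve()
    return S.value, L.value
```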
In the dynamic case, reference [6] made an important contribution by showing that graphical models for multivariate autoregressive processes can be estimated by solving a convex optimization problem that follows from the application of the Maximum Entropy principle. This paved the way for the development of efficient algorithms dedicated to topology selection in graphical models of autoregressive processes [7,8] and autoregressive moving average processes [9].
A happy marriage between the approach from [4] and the use of Maximum Entropy led to the solution proposed in [10] for the identification of graphical models of autoregressive processes with latent variables. Similarly to [4], the estimation is done by minimizing a cost function whose penalty term is a linear combination of the $\ell_1$-norm and the nuclear norm. The two coefficients of this linear combination are chosen by the user, and they have a strong influence on the estimated model. The method introduced in [10] performs the estimation for various pairs of coefficients, which yield a set of candidate models; the winner is decided by using a score function.
To the best of our knowledge, there is no other work that extends the estimation method from [5] to the case of latent-variable autoregressive models. The main contribution of this paper is to propose an algorithm of this type, which combines the strengths of Expectation-Maximization and convex optimization. The key point for achieving this goal is the application of the Maximum Entropy principle.
The rest of the paper is organized as follows. In the next section, we introduce the notation and present the method from [10]. Section 3 outlines the newly proposed algorithm. The outcome of the algorithm is a set of models, from which we choose the best one by employing information theoretic (IT) criteria. Section 4 is focused on the description of these criteria: we discuss the selection rules from the previous literature and propose a novel criterion. The experimental results are reported in Section 5. Section 6 concludes the paper.
2. Preliminaries and Previous Work
Let $y(t)$ be an $m$-dimensional ($m > 1$) time series generated by a stationary and stable VAR (vector autoregressive) process of order $p$. We assume that the spacing of observation times is constant and $y(t) = [\,y_1(t)\ \cdots\ y_m(t)\,]^\top$. The symbol $\top$ denotes transposition. The difference equation of the process is
$$y(t) = \sum_{k=1}^{p} A_k\, y(t-k) + \epsilon(t), \qquad (1)$$
where $A_1, \ldots, A_p$ are matrix coefficients of size $m \times m$ and $\epsilon(t)$ is a sequence of independently and identically distributed random $m$-vectors. We assume that the vectors $\epsilon(t)$ are drawn from an $m$-variate Gaussian distribution with zero mean vector and covariance matrix $\Sigma$. Additionally, the vectors … are assumed to be constant.
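For concreteness, here is a minimal simulation sketch of such a process, assuming the recursion in (1); the helper name, the burn-in length and the random seed are arbitrary choices of this sketch:

```python
import numpy as np

def simulate_var(A, Sigma, T, burn_in=500, seed=0):
    """Simulate y(t) = A_1 y(t-1) + ... + A_p y(t-p) + eps(t), with eps(t) ~ N(0, Sigma).
    A is a list [A_1, ..., A_p] of m x m coefficient matrices."""
    p, m = len(A), A[0].shape[0]
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(Sigma)
    y = np.zeros((T + burn_in, m))
    for t in range(p, T + burn_in):
        y[t] = sum(A[k] @ y[t - 1 - k] for k in range(p)) + chol @ rng.standard_normal(m)
    return y[burn_in:]              # discard the transient so the sample is close to stationarity
```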
The conditional independence relations between the variables in $y(t)$ are provided by the inverse of the spectral density matrix (ISDM) of the VAR process, denoted $\Phi^{-1}(\omega)$. The ISDM has the expression
$$\Phi^{-1}(\omega) = A^{*}(\omega)\, \Sigma^{-1} A(\omega), \qquad \omega \in [-\pi, \pi], \qquad (2)$$
where $A(\omega) = I - \sum_{k=1}^{p} A_k e^{-\mathrm{i}\omega k}$, $I$ stands for the identity matrix of appropriate size, and $(\cdot)^{*}$ is the operator for conjugate transpose. In particular, the ISDM is a matrix trigonometric polynomial of degree $p$. The sparse structure of the ISDM contains the conditional dependence relations between the variables of $y(t)$, i.e., two variables $y_a(t)$ and $y_b(t)$ are independent, conditional on the other variables, if and only if [1,11]
$$\big[\Phi^{-1}(\omega)\big]_{ab} = 0 \quad \text{for all } \omega \in [-\pi, \pi].$$
In the graph corresponding to the ISDM, the nodes stand for the variables of the model, and the edges stand for conditional dependence, i.e., there is no edge between conditionally independent variables.
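A minimal numerical sketch of this object is given below; it assumes the normalization used in (2) above, so constant factors (such as $2\pi$) and sign conventions may differ from other sources:

```python
import numpy as np

def isdm_on_grid(A, Sigma, omegas):
    """ISDM of a VAR(p) process evaluated on a frequency grid, under the convention
    Phi^{-1}(w) = A(w)^H Sigma^{-1} A(w) with A(w) = I - sum_k A_k exp(-1j * w * k)."""
    p, m = len(A), A[0].shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    values = []
    for w in omegas:
        Aw = np.eye(m, dtype=complex)
        for k in range(p):
            Aw -= A[k] * np.exp(-1j * w * (k + 1))
        values.append(Aw.conj().T @ Sigma_inv @ Aw)
    return values

# A pair (a, b) corresponds to a missing edge when [Phi^{-1}(w)]_{ab} is zero at every frequency.
```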
In a latent-variable graphical model it is assumed that $y(t) = [\, y_{\mathrm{M}}(t)^\top \; y_{\mathrm{L}}(t)^\top \,]^\top$, where the components of $y_{\mathrm{M}}(t)$ are accessible to observation (they are called manifest variables) and the $\ell$ components of $y_{\mathrm{L}}(t)$ are latent, i.e., not accessible to observation, but playing a significant role in the conditional independence pattern of the overall model. The existence of latent variables in a model can be described in terms of the ISDM by the block decomposition
$$\Phi^{-1}(\omega) = \begin{bmatrix} \Upsilon_{\mathrm{M}}(\omega) & \Upsilon_{\mathrm{ML}}(\omega) \\ \Upsilon_{\mathrm{ML}}^{*}(\omega) & \Upsilon_{\mathrm{L}}(\omega) \end{bmatrix}, \qquad (4)$$
where the blocks are partitioned according to the manifest and latent components of the spectral density matrix, respectively. Using the Schur complement, the ISDM of the manifest component has the form [10] (Equation (21)):
$$\Phi_{\mathrm{M}}^{-1}(\omega) = \Upsilon_{\mathrm{M}}(\omega) - \Upsilon_{\mathrm{ML}}(\omega)\, \Upsilon_{\mathrm{L}}^{-1}(\omega)\, \Upsilon_{\mathrm{ML}}^{*}(\omega). \qquad (5)$$
When building latent-variable graphical models, we assume that $\ell$ is small, i.e., few latent variables are sufficient to characterize the conditional dependence structure of the model. The previous formula can therefore be written as the combination of a term that is sparse and a term that has (constant) low rank almost everywhere in $[-\pi, \pi]$. Furthermore, both terms can be parameterized as in [12] (Equation (4)) by means of a shift matrix and two positive semidefinite matrices. We split all such matrices into $(p+1) \times (p+1)$ arrays of square blocks. The block trace operator for such a matrix, $\operatorname{tr}_j(\cdot)$, sums the blocks lying on the $j$-th block diagonal; for negative indices, the relation $\operatorname{tr}_{-j}(X) = \operatorname{tr}_{j}(X)^\top$ holds. Note that (8) can be rewritten in terms of these block traces.
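The sketch below illustrates one common definition of the block trace (summing the square blocks on the $j$-th block diagonal); the exact indexing convention of the paper may differ, so this is only an assumption-laden illustration:

```python
import numpy as np

def block_trace(X, block_size, j):
    """Sum of the blocks of X lying on the j-th block (super-)diagonal; X is split into
    square blocks of the given size. Negative indices are handled through the transpose
    relation mentioned in the text, which is valid for symmetric X."""
    if j < 0:
        return block_trace(X, block_size, -j).T
    n_blocks = X.shape[0] // block_size
    acc = np.zeros((block_size, block_size), dtype=X.dtype)
    for h in range(n_blocks - j):
        rows = slice(h * block_size, (h + 1) * block_size)
        cols = slice((h + j) * block_size, (h + j + 1) * block_size)
        acc += X[rows, cols]
    return acc
```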
The first $p+1$ sample covariances of the VAR process are [13]:
$$\hat{R}_i = \frac{1}{N} \sum_{t=1}^{N-i} y(t+i)\, y(t)^\top, \qquad i = 0, \ldots, p, \qquad (11)$$
where $N$ denotes the sample size. However, only the upper left blocks, corresponding to the manifest variables, can be computed from data; they are denoted $\hat{R}_{\mathrm{M},i}$. With $\hat{R}_{\mathrm{M},0}, \ldots, \hat{R}_{\mathrm{M},p}$, we build the block Toeplitz matrix
$$\mathcal{T}(\hat{R}_{\mathrm{M},0}, \ldots, \hat{R}_{\mathrm{M},p}), \qquad (12)$$
whose block on position $(j,k)$ is $\hat{R}_{\mathrm{M},j-k}$, with the convention $\hat{R}_{\mathrm{M},-i} = \hat{R}_{\mathrm{M},i}^\top$.
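A sketch of how such a block Toeplitz matrix can be assembled from the sample covariances is given below; the placement of the transposed blocks depends on the covariance convention, so this follows one common choice and may need to be flipped to match (11) and (12) exactly:

```python
import numpy as np

def sample_covariances(y, p):
    """Biased sample covariances R_i = (1/N) * sum_t y(t + i) y(t)^T for i = 0..p;
    y has shape (N, m), one observation per row."""
    N = y.shape[0]
    return [(y[i:].T @ y[:N - i]) / N for i in range(p + 1)]

def block_toeplitz(R):
    """Block Toeplitz matrix whose (j, k) block is R_{j-k}, with R_{-i} = R_i^T."""
    p, m = len(R) - 1, R[0].shape[0]
    T = np.zeros(((p + 1) * m, (p + 1) * m))
    for j in range(p + 1):
        for k in range(p + 1):
            blk = R[j - k] if j >= k else R[k - j].T
            T[j * m:(j + 1) * m, k * m:(k + 1) * m] = blk
    return T
```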
It was proposed in [10] to estimate the two positive semidefinite matrices mentioned above by solving the optimization problem (13), where $\operatorname{tr}(\cdot)$ is the trace operator, $\log$ denotes the natural logarithm and $\det$ stands for the determinant. Minimizing the trace of the matrix associated with the low-rank term induces low rank, and the two weights $\lambda$ and $\gamma$ are trade-off constants. The function $h_\infty$ is a group sparsity promoter whose expression is given in (14). Note that the quantity inside the maximization in (14) is the $i$-th degree coefficient of the polynomial that occupies position $(j,k)$ in the matrix polynomial. Sparsity is encouraged by minimizing the $\ell_1$-norm of the vector formed by the coefficients of maximum absolute value at each position $(j,k)$.
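The following sketch implements the group-sparsity idea just described, i.e., an $\ell_1$-norm of per-position maxima over the polynomial degrees; how exactly the positions $(j,k)$ and $(k,j)$ are grouped in (14) is an assumption of this sketch:

```python
import numpy as np

def group_sparsity_promoter(Delta):
    """Delta has shape (p + 1, m, m): Delta[i] holds the i-th degree coefficients.
    For every off-diagonal position (j, k), take the largest absolute coefficient
    over all degrees, then sum these maxima over the positions (an l1 norm of maxima)."""
    per_position = np.abs(Delta).max(axis=0)                     # max over the degrees
    per_position = np.maximum(per_position, per_position.T)      # merge (j, k) and (k, j)
    rows, cols = np.triu_indices(Delta.shape[1], k=1)            # strictly off-diagonal pairs
    return per_position[rows, cols].sum()
```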
3. New Algorithm
The obvious advantage of the optimization problem (13) is its convexity, which allows the safe computation of the solution. However, a possible drawback is the presence of two parameters, $\lambda$ and $\gamma$, whose values must be chosen. A way to eliminate one of the parameters is to assume that the number $\ell$ of latent variables is known. At least for parsimony reasons, it is natural to suppose that $\ell$ is very small. Since a latent variable influences all manifest variables in the ISDM (5), there cannot be too many independent latent variables. Therefore, giving $\ell$ a fixed small value is unlikely to be restrictive.
In this section, we describe an estimation method which is clearly different from the one in [10]. More precisely, we generalize the Expectation-Maximization algorithm from [5], developed there for independent and identically distributed random variables, to a VAR process. For this purpose, we work with the full model (4), which includes the ISDM part pertaining to the latent variables. Without loss of generality, we assume that the block of (4) corresponding to the latent variables equals the identity matrix $I_\ell$; the effect of the latent variables on the manifest ones in (5) can be modeled by the cross block alone. Combining with (2), the model is (15), where the matrix coefficients have to be found.
The main difficulty of this approach is the unavailability of the latent part of the matrices (11). Were such matrices available, we could work with SDM estimators (confined to order $p$) of the form (16), where $\hat{R}_i$ denotes the $i$-th covariance lag of the VAR process (see also (1) and (11)). We split the matrix coefficients from (15) and (16) according to the sizes of the manifest and latent variables, e.g., as in (17).
To overcome this difficulty, the Expectation-Maximization algorithm alternately keeps fixed either the model parameters or the covariance matrices, estimating or optimizing the remaining unknowns. The expectation step of Expectation-Maximization assumes that the ISDM from (15) is completely known. Standard matrix identities [5] can be easily extended to matrix trigonometric polynomials in order to write down the formula (18). Identifying (16) with (18) gives expressions for estimating the covariance matrices in (17), depending on the matrix coefficients from (15). The upper left corner of (18) needs no special computation, since the natural estimator is (19), where the sample covariances $\hat{R}_{\mathrm{M},i}$ are directly computable from the time series. It results that the upper left blocks of (17) are simply the sample covariances of the manifest variables. The other blocks from (17) result from convolution expressions associated with the polynomial multiplications from (18). The lower left block of the coefficients is given by (20). Note that the trigonometric polynomial in (20) has degree $2p$, since its factors have degree $p$. With (20) available, we can compute (21), where the Kronecker delta term equals one if $i = 0$ and is zero otherwise. Although the degree of the polynomial from the lower right block of (18) is $2p$, we need to truncate it to degree $p$, since this is the degree of the ISDM from (15). This is the reason for computing only the coefficients of degree at most $p$ in (21). The same truncation is applied to (20); note that there we cannot compute only the coefficients that are finally needed, since all of them are required in (21).
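To illustrate the mechanics of these convolution-and-truncation steps (not the exact block expressions in (20) and (21), which also involve the manifest/latent splitting), the sketch below multiplies two one-sided matrix polynomials and keeps only the coefficients up to degree $p$:

```python
import numpy as np

def poly_product_truncated(B, C, p):
    """Coefficients of the product of two matrix polynomials B(z) = sum_i B[i] z^i and
    C(z) = sum_j C[j] z^j, truncated to degree p: the degree-d coefficient is the
    convolution sum over i + j = d of B[i] @ C[j]."""
    rows, cols = B[0].shape[0], C[0].shape[1]
    out = [np.zeros((rows, cols)) for _ in range(p + 1)]
    for i, Bi in enumerate(B):
        for j, Cj in enumerate(C):
            if i + j <= p:                 # degrees above p are not needed by the order-p ISDM
                out[i + j] += Bi @ Cj
    return out
```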
In the maximization step of Expectation-Maximization, the covariance matrices are assumed to be known and are kept fixed; the ISDM can be estimated by solving an optimization problem that will be detailed below. The overall solution we propose is outlined in Algorithm 1 and explained in what follows.
Algorithm 1: Identification of the ISDM (AlgoEM)

Input: sample covariances of the manifest variables, VAR order $p$, number of latent variables $\ell$, an information theoretic criterion (ITC).

Initialization:
  Evaluate $\hat{R}_{\mathrm{M},i}$ for $i = 0, \ldots, p$ (see (11) and the discussion below it);
  Solve the First Maximum Entropy Problem (see (22));
  Compute the initial estimate of the latent part from the EIG of the resulting matrix;
for all values of $\lambda$ do
  Maximum Entropy Expectation-Maximization (penalized setting):
  for each iteration do
    Use the current estimates to compute the covariance matrices (see (16)–(18));
    Solve the Second Maximum Entropy Problem (see (23));
    Get the ISDM from its coefficients (see (15));
  end for
  Use the estimated coefficients to compute the ISDM;
  Determine the sparsity pattern from the PSC (see (24));
  if ADAPTIVE then retain the current estimate as initialization for the next value of $\lambda$; end if
  Maximum Entropy Expectation-Maximization (constrained setting):
  for each iteration do
    Use the current estimates to compute the covariance matrices (see (16)–(18));
    Solve the Third Maximum Entropy Problem (see (25));
    Get the ISDM from its coefficients (see (15));
  end for
  Use the estimated coefficients to compute the ISDM;
  Find the matrix coefficients of the VAR model by spectral factorization of the ISDM and compute the value of the ITC;
end for
Output: the model that minimizes the ITC.
The initialization stage provides a first estimate of the ISDM, from which the Expectation-Maximization alternations can begin. An estimate of the upper left corner of the ISDM is obtained by solving the classical Maximum Entropy problem for a VAR model, using the sample covariances of the manifest variables. We present below the matrix formulation of this problem, which allows an easy implementation in CVX (a Matlab-based modeling system for convex optimization) [14]. The mathematical derivation of the matrix formulation from the information theoretic formulation can be found in [6,9].

First Maximum Entropy Problem (22):

The block Toeplitz operator $\mathcal{T}(\cdot)$ is defined in (12). The size of the positive semidefinite matrix variable is $(p+1)(m-\ell) \times (p+1)(m-\ell)$. For all $\omega \in [-\pi, \pi]$, the estimate of the ISDM (15) is computed from the optimal matrix variable.
In order to compute an initial value for the part of the model pertaining to the latent variables, we resort to the eigenvalue decomposition (EIG) of the matrix produced at the previous step. More precisely, after arranging the eigenvalues in the decreasing order of their magnitudes, we retain only the $\ell$ dominant eigenvalues together with their eigenvectors; the remaining eigenvalues are set to zero.
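The truncation step itself can be sketched as follows; which matrix is decomposed and how the retained part is mapped into the model follow the description above, so the helper below (whose name is hypothetical) only illustrates the generic rank-$\ell$ truncation:

```python
import numpy as np

def rank_ell_truncation(M, ell):
    """Keep the ell eigenvalues of largest magnitude (and their eigenvectors) of a
    symmetric matrix M and zero out the rest, yielding a rank-ell approximation."""
    eigvals, eigvecs = np.linalg.eigh(M)
    keep = np.argsort(np.abs(eigvals))[::-1][:ell]      # indices of the ell dominant eigenvalues
    return (eigvecs[:, keep] * eigvals[keep]) @ eigvecs[:, keep].T
```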
When the covariance matrices are fixed in the maximization step of the Expectation-Maximization algorithm, the coefficients of the matrix polynomial that constitutes the ISDM (15) are estimated from the solution of the following optimization problem:

Second Maximum Entropy Problem (23):

Since we now work with the full model, the size of the positive semidefinite matrix variable is $(p+1)m \times (p+1)m$. The function $h_\infty$ is the sparsity promoter defined in (14) and depends only on the entries of the block corresponding to the manifest variables. The equality constraints in (23) guarantee that the latent variables have unit variance and are independent, given the manifest variables, which corresponds to the lower right block of (15).
The estimates obtained after these iterations are further employed to compute the ISDM by using (15). If $\lambda$ is large enough, then the manifest block of the ISDM is expected to have a certain sparsity pattern. Since the objective of (23) does not ensure exact sparsification, and also because of the numerical calculations, the entries that should belong to this pattern are small, but not exactly zero. In order to turn them to zero, we apply a method similar to the one from [6] (Section 4.1.3). We first compute the maximum of the partial spectral coherence (PSC), as in (24), for all pairs $(j,k)$ with $j \neq k$. Then the sparsity pattern comprises all the pairs $(j,k)$ for which the maximum PSC is not larger than a given threshold. The discussion on the selection of these parameters is deferred to Section 5.
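A sketch of this thresholding step is given below, using the standard definition of the PSC as the ISDM entry normalized by the corresponding diagonal entries; the frequency grid and the threshold value are left as inputs, and the helper name is hypothetical:

```python
import numpy as np

def missing_edges_from_psc(Phi_inv_grid, threshold):
    """Phi_inv_grid is a list of Hermitian ISDM matrices evaluated on a frequency grid.
    For each pair (j, k), compute the maximum over the grid of the PSC magnitude
    |Phi^{-1}_{jk}| / sqrt(Phi^{-1}_{jj} * Phi^{-1}_{kk}); pairs whose maximum does not
    exceed the threshold are declared conditionally independent (missing edges)."""
    m = Phi_inv_grid[0].shape[0]
    max_psc = np.zeros((m, m))
    for P in Phi_inv_grid:
        d = np.sqrt(np.real(np.diag(P)))
        max_psc = np.maximum(max_psc, np.abs(P) / np.outer(d, d))
    return [(j, k) for j in range(m) for k in range(j + 1, m)
            if max_psc[j, k] <= threshold]
```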
The regularized estimate of the ISDM is further improved by solving a problem similar to (23), but with the additional constraint that the sparsity pattern of the ISDM is the one determined above; more precisely:

Third Maximum Entropy Problem (25):

This step of the algorithm has a strong theoretical justification, which stems from the fact that its solution is the Maximum Entropy solution of a covariance extension problem (see [10] (Remark 2.1)). The number of iterations is the same as in the case of the first loop.
The spectral factorization of the positive matrix trigonometric polynomial that represents the estimated ISDM is computed by solving a semidefinite programming problem. The implementation is the same as in [8], except that in our case the model contains latent variables. Therefore, the matrix coefficients produced by the spectral factorization are altered so as to keep only the entries that correspond to the manifest variables. The resulting VAR model is fitted to the data, and then various IT criteria are evaluated. The accuracy of the selected model depends on the criterion that is employed, as well as on the strategy used for generating the $\lambda$-values that yield the competing models. In the next section, we list the model selection rules that we apply; the problem of generating the $\lambda$-values is treated in Section 5.
As already mentioned, the estimation problem is solved for several values of $\lambda$, say $\lambda_1 < \lambda_2 < \cdots$. From the description above we know that, for each value of the parameter $\lambda$, the ISDM gets the same initialization, which is based on (22). It is likely that this initialization is poor. A better approach is an ADAPTIVE algorithm, which takes into consideration the fact that the difference $\lambda_{n+1} - \lambda_n$ is small for all $n$. This algorithm initializes the ISDM as explained above only when $\lambda = \lambda_1$. For $\lambda = \lambda_n$ with $n > 1$, the initial value of the ISDM is taken to be the estimate previously obtained by solving the optimization problem in (23) for $\lambda_{n-1}$. The effect of the ADAPTIVE procedure will be investigated empirically in Section 5.
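The ADAPTIVE strategy thus amounts to a warm start across the $\lambda$ grid; a schematic sketch is shown below, where `solve_em` stands for one full run of the EM iterations for a fixed $\lambda$ (its name and signature are hypothetical):

```python
def run_adaptive(lambdas, eig_based_init, solve_em):
    """Process the lambda values in increasing order and warm-start each run from the
    estimate obtained for the previous lambda; only the first run uses the EIG-based
    initialization described in Section 3."""
    estimates = {}
    current_init = eig_based_init
    for lmbd in sorted(lambdas):
        estimates[lmbd] = solve_em(lmbd, current_init)
        current_init = estimates[lmbd]      # consecutive lambdas are close, so this is a good start
    return estimates
```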