Properties of the Statistical Complexity Functional and Partially Deterministic HMMs

Löhr, Wolfgang

doi:10.3390/e110300385


Properties of the Statistical Complexity Functional and Partially Deterministic HMMs

by

Wolfgang Löhr

Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, 04103 Leipzig, Germany

Entropy 2009, 11(3), 385-401; https://doi.org/10.3390/e110300385

Submission received: 31 March 2009 / Accepted: 5 August 2009 / Published: 11 August 2009


Abstract

Statistical complexity is a measure of complexity of discrete-time stationary stochastic processes, which has many applications. We investigate its more abstract properties as a non-linear function on the space of processes and show its close relation to Knight’s prediction process. We prove lower semi-continuity, concavity, and a formula for the ergodic decomposition of statistical complexity. On the way, we show that the discrete version of the prediction process has a continuous Markov transition. We also prove that, given the past output of a partially deterministic hidden Markov model (HMM), the uncertainty of the internal state is constant over time and knowledge of the internal state gives no additional information on the future output. Using this fact, we show that the causal state distribution is the unique stationary representation on prediction space that may have finite entropy.

Keywords:

statistical complexity; lower semi-continuity; ergodic decomposition; concavity; prediction process; partially deterministic hidden Markov models (HMMs)

1. Introduction

An important task of complex systems sciences is to define “complexity”. Measures that quantify complexity are of both theoretical (e.g., [1]) and practical interest. In applications, they are widely used to identify “interesting” parts of simulations and real-world data (e.g., [2]). There exist various measures of different kinds of complexity. In particular, statistical complexity constitutes a complexity measure for stationary stochastic processes in doubly infinite discrete time and discrete state space. It was introduced by Jim Crutchfield and co-workers within a theory called computational mechanics [3,4,5]. Note that here “computational mechanics” is unrelated to computer simulations of mechanical systems. Statistical complexity is applied to a variety of real-world data, e.g., in [6]. An important, closely related concept of computational mechanics is the so-called ε-machine. It is a particular partially deterministic HMM that encodes the mechanisms of prediction. Partially deterministic HMMs are often called deterministic stochastic automata to emphasize their close connection to a key concept of theoretical computer science, namely deterministic finite state automata [7].

In this paper, we look at more abstract features of statistical complexity as well as partially deterministic HMMs. We consider statistical complexity to be a non-linear functional from the space of Δ-valued stationary processes (Δ countable) to the set

{\bar{R}}_{+} = R_{+} \cup \{\infty\}

of non-negative extended real numbers. Here, we identify stationary processes with their law, i.e., with shift-invariant probability measures on the sequence space

Δ^{Z}

, and equip the space of measures with the usual weak-* topology (often called “weak topology”). Because Δ is discrete, this topology is equal to the topology of finite-dimensional convergence. In ergodic theory, Kolmogorov-Sinai entropy is studied as a function of the (invariant) measure, and the questions of continuity properties, affinity, and behaviour under ergodic decomposition arise naturally (e.g., [8]). We believe that these questions are worth considering also for complexity measures. A formula for the ergodic decomposition of excess entropy, which is another complexity measure for stochastic processes, was obtained in [9,10]. Our results presented here include the corresponding formula for statistical complexity, and this formula directly implies concavity. The most important result is lower semi-continuity of statistical complexity. We consider this a desirable property for a complexity measure, as it means that a process cannot be complex if it can be approximated by non-complex ones.

In Section 2., we define statistical complexity and show its relations to a discrete version of Frank Knight’s prediction process [11,12]. The prediction process is the measure-valued process of conditional probabilities of the future given the past. It takes values in the space

P (Δ^{N})

of probability measures on Δ^N, called prediction space. In our formulation, statistical complexity is the marginal entropy of the prediction process. This is equivalent to the classical definition as entropy of a certain partition of the past. We only replace equivalence classes with the respective induced probabilities on the future. In this section, we also show that the discrete (and thus technically vastly simplified) version of the prediction process has a continuous Markov transition kernel (Proposition 5).

In Section 3., we investigate properties of partially deterministic HMMs. Here, we use a general notion of HMM (sometimes called edge-emitting HMM), where new internal state and output symbol are jointly determined and may have dependencies conditioned on the last internal state. Partial determinism means that this dependence is extreme in the sense that the last internal state and the output altogether uniquely determine the following internal state. We show that, if one knows the past output trajectory, the remaining uncertainty (measured by entropy) of the internal state is constant over time, although it may depend on the ergodic component (Proposition 18). Furthermore, the distribution of future output is the same for any internal state that is compatible with the past output (Corollary 20). In Section 3.3., we construct a canonical Markov kernel, such that taking any measure

ν

on prediction space

P (Δ^{N})

(i.e.,

ν

is a measure on measures) as initial distribution, we obtain a partially deterministic HMM of a process

P \in P (Δ^{N})

. This process P coincides with the measure

r (ν)

represented by

ν

in the sense of integral representation theory, and if

ν

is appropriately chosen, we obtain the ε-machine of computational mechanics (or something isomorphic) as a special case. Using the properties of partially deterministic HMMs, we obtain that there is no invariant representation on prediction space with finite entropy other than, possibly, the causal state distribution, which may have finite or infinite entropy (Proposition 23).

Section 4. contains our results about statistical complexity. We show that the complexity of a process is the average complexity of its ergodic components plus the entropy of the mixture (Proposition 26). As a direct consequence, statistical complexity is concave (Corollary 27) and non-continuous (even w.r.t. variational topology). But it does have a continuity property. Namely, using the results of the previous sections, we show in Theorem 32 that it is weak-* lower semi-continuous.

2. Prediction Dynamic and Statistical Complexity

For the whole article, fix a countable set Δ with at least two elements and discrete topology. We identify Δ-valued stochastic processes

X_{Z} : = {(X_{k})}_{k \in Z}

, defined on some probability space

(Ω, A, P)

, with their respective laws

P : = P \circ X_{Z}^{- 1} \in P (Δ^{Z})

. Here,

P

denotes the set of probability measures. If

X_{Z}

is stationary, P is in the set

P_{inv} (Δ^{Z})

of shift-invariant probability measures. Let

ξ_{k} : Δ^{Z} \to Δ

be the canonical projections. Then

ξ_{Z}

is a process on (

Δ^{Z}, B (Δ^{Z}), P

) with the same distribution as

X_{Z}

. Here,

B

denotes the Borel σ-algebra. We often decompose the time set

Z

into the “future”

N

and the “past”

Z \setminus N = - N_{0}

, where

N_{0} = N \cup \{0\}

. For simplicity of notation, we denote the canonical projections on

Δ^{N}

with the same symbols, ξ_k, as the projections on

Δ^{Z}

. If not stated otherwise, product spaces are equipped with the product topology and spaces of probability measures are equipped with the weak-* topology. We use the arrow

\overset{*}{⇀}

to denote weak-* convergence.

2.1. Discrete Version of Knight’s Prediction Process

For every measurable stochastic process with time set

R_{+}

on some Lusin space, Frank Knight defines the corresponding prediction process as a process of conditional probabilities of the future given the past. This theory originated in [11] and was further developed in [12,13,14]. The most important properties of the prediction process are that its paths are right continuous with left limits (cadlag), it has the strong Markov property and determines the original process. The continuity of the time set and the generality of the state space lead to a lot of technical difficulties. In our simpler, discrete setting, these difficulties mostly disappear, and useful properties of the prediction process, such as having cadlag paths, become meaningless. A new aspect, however, is added by considering infinite pasts of stationary processes via the time-set

Z

. The marginal distribution (unique because of stationarity) of the prediction process is an important characteristic, which is used to define statistical complexity. For this subsection, fix a stationary process

X_{Z}

with distribution

P \in P_{inv} (Δ^{Z})

.

We use the following notation concerning Markov kernels and conditional probabilities. If K is a kernel from Ω to a measurable space M, we consider K as measurable function from Ω to

P (M)

and write

K (ω; A) : = K (ω) (A)

for the probability of a measurable set A w.r.t. the measure

K (ω)

. Given random variables

X, Y

on Ω, we write

K = P (X ∣ Y)

if K is the conditional probability kernel of X given Y, i.e.,

K (ω; A) = P (\{X \in A\} | Y) (ω)

.

Definition 1.

Let

Z_{Z} = Z_{Z}^{P}

be the

P (Δ^{N})

-valued stochastic process of conditional probabilities defined by

Z_{k} : = P (ξ_{[k + 1, \infty [} ∣ ξ_{] - \infty, k]})

for

k \in Z

. Then

Z_{Z}

is called prediction process of

X_{Z}

.

P (Δ^{N})

is called prediction space.

It is evident that the Markov property of the prediction process in continuous time also holds in discrete time. Nevertheless, we give a proof, because it is elementary in our discrete setting. The corresponding transition kernel works as follows. Assume the prediction process is in state

z \in P (Δ^{N})

. The transition kernel maps z to a measure on measures, namely

P (Z_{1} ∣ Z_{0} = z) \in P (P (Δ^{N}))

. Note that z is a state of the prediction process but at the same time a probability measure. Thus it makes sense to consider the conditional probability given

ξ_{1} = d

w.r.t. the measure z. It is intuitively plausible that the next state will be one of those conditional probabilities with d distributed according to the marginal of z. The resulting measure has to be shifted by one as time proceeds. With

ς : Δ^{N} \to Δ^{N}

, we denote the left shift.

Proposition 2.

For

z \in P (Δ^{N})

, let

ϕ_{z} : Δ^{N} \to P (Δ^{N})

,

ϕ_{z} (ω) : = z (ς^{- 1} (\cdot) ∣ ξ_{1}) (ω)

. The prediction process

Z_{Z}

is a stationary Markov process. The kernel

S : P (Δ^{N}) \to P (P (Δ^{N}))

with

S (z) = z \circ ϕ_{z}^{- 1}

, i.e.

S (z) (B) : = S (z; B) : = z (\{ϕ_{z} \in B\}), \quad z \in P (Δ^{N}), B \in B (P (Δ^{N})),

satisfies

P (Z_{k} ∣ Z_{k - 1}) = S \circ Z_{k - 1}

a.s. In other words, S is the transition kernel of the prediction process.

Proof.

Stationarity is obvious from stationarity of

X_{Z}

. We obtain a.s.

\begin{matrix} S (Z_{0}; B) & = Z_{0} (\{Z_{0} (ς^{- 1} (\cdot) ∣ ξ_{1}) \in B\}) = P (\{P (ξ_{[2, \infty [} ∣ ξ_{] - \infty, 1]}) \in B\} | ξ_{- N_{0}}) \\ & = P (\{Z_{1} \in B\} | ξ_{- N_{0}}) . \end{matrix}

In particular,

P (\{Z_{1} \in B\} | ξ_{- N_{0}})

is

σ (Z_{0})

-measurable (modulo P) and together with

σ (Z_{0}) \subseteq σ (ξ_{- N_{0}})

we obtain

P (\{Z_{1} \in B\} | Z_{0}) = P (\{Z_{1} \in B\} | ξ_{- N_{0}}) = S (Z_{0}; B),

(1)

as claimed. We still have to verify the Markov property. But because the

σ

-algebra induced by

Z_{- N_{0}}

is nested between those induced by

Z_{0}

and

ξ_{- N_{0}}

, i.e.

σ (Z_{0}) \subseteq σ (Z_{- N_{0}}) \subseteq σ (ξ_{- N_{0}})

, we obtain the Markov property from the first equality in (1). □

Definition 3.

We call the Markov transition S of the prediction process prediction dynamic.

Note that although the prediction process

Z_{Z}

obviously depends on P, the prediction space

P (Δ^{N})

and the prediction dynamic S do not. In the case of general Lusin state space, it is non-trivial to prove the existence of the regular versions of conditional probability such that

ϕ_{z} (ω)

is jointly measurable in

(z, ω)

(see [9]). For countable Δ, however, we can obtain essential continuity in an elementary way. This enables us to prove continuity of the prediction dynamic.

Lemma 4.

Let

z, z_{n} \in P (Δ^{N})

and

z_{n} \overset{*}{⇀} z

. There is a clopen (i.e. closed and open) set

Ω_{z} \subseteq Δ^{N}

with

z (Ω_{z}) = 1

such that

ϕ_{z_{n}} \overset{*}{⇀} ϕ_{z}

, uniformly on compact subsets of

Ω_{z}

.

Proof.

Let

A_{ω} : = {ξ_{1}}^{- 1} (ξ_{1} (ω))

and

Ω_{z} : = \{ω \in Δ^{N} | z (A_{ω}) > 0\}

. Because Δ is discrete and countable, Ω_z is clopen with z(Ω_z) = 1. Uniform convergence on compacta is equivalent to

ϕ_{z_{n}} (ω_{n}) \overset{*}{⇀} ϕ_{z} (ω)

whenever ω_n → ω in Ω_z. For sufficiently large n, ξ₁ (ω_n) = ξ₁ (ω) and, because ς⁻¹ maps cylinder sets to cylinder sets,

ϕ_{z_{n}} (ω_{n}) = \frac{z_{n} (A_{ω} \cap ς^{- 1} (\cdot))}{z_{n} (A_{ω})} \overset{*}{⇀} ϕ_{z} (ω)

. □

Proposition 5.

The prediction dynamic S is continuous.

Proof.

Let

z_{n}, z \in P (Δ^{N})

with

z_{n} \overset{*}{⇀} z

and

Ω_{z}

as in Lemma 4. We have to show

\int g d S (z_{n}) = \int g \circ ϕ_{z_{n}} d z_{n} \overset{n \to \infty}{⟶} \int g \circ ϕ_{z} d z = \int g d S (z)

(2)

for any bounded continuous g. According to Prokhorov’s theorem, the sequence

{(z_{n})}_{n \in N}

is uniformly tight and we can restrict the integrations to compact subsets. Because

{lim}_{n \to \infty} z_{n} (Ω_{z}) = z (Ω_{z}) = 1

, we can restrict to compact subsets of

Ω_{z}

. There, the convergence of

ϕ_{z_{n}}

is uniform, thus (2) holds. □

2.2. Statistical Complexity

In integral representation theory, a measure

ν \in P (P (Δ^{N}))

represents the measure

z \in P (Δ^{N})

if

z = r (ν) : = \int_{P (Δ^{N})}^{} {id}_{P (Δ^{N})} d ν,

(3)

where

r : P (P (Δ^{N})) \to P (Δ^{N})

is called resolvent or barycentre map (see [15]) and

id

is the identity map. Here, measure valued integrals are Gel’fand integrals. That is,

μ = \int K d ν

for some kernel K means

\int f d μ = \int \int f d K (\cdot) d ν

for all continuous, real-valued f or, equivalently,

μ (B) = \int K (\cdot; B) d ν

for all measurable sets B.

z = r (ν)

means that z is a mixture (convex combination) of other processes, and the mixture is described by

ν

. A trivial representation for z is given by

δ_{z}

, the Dirac measure in z. The measure

ν

is called S-invariant if

ν S = ν

, where

ν S : = \int S d ν

. In other words, it is S-invariant if the iteration with the prediction dynamic S does not change it. We see in the following lemma that general iteration with S shifts the represented measure, i.e.,

ν S

represents

z \circ ς^{- 1}

.

Lemma 6.

r (ν S) = r (ν) \circ ς^{- 1} .

In particular, S-invariant

ν

represent stationary processes.

Proof.

Because

r (ν S) = \int \int {id}_{P (Δ^{N})} d S d ν

, it is sufficient to consider Dirac measures

δ_{z}

,

z \in P (Δ^{N})

(the general claim follows by integration over

ν

). For Dirac measures we have

r (δ_{z} S) = \int {id}_{P (Δ^{N})} d S (z) = \int ϕ_{z} d z = \int z (ς^{- 1} (\cdot) | ξ_{1}) d z = z \circ ς^{- 1}

□
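The action of S and the identity of Lemma 6 can be checked on a finite horizon. The following Python sketch is an illustration only and not part of the construction above: it assumes that a measure z is truncated to its law on words of length n (a dict from tuples to exact fractions), so that the truncated ϕ_z yields laws on words of length n − 1. The helper names `phi`, `S`, `barycentre` and `shift` are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def phi(z, d):
    """Truncated phi_z: law of (xi_2, ..., xi_n) under z, conditioned on xi_1 = d."""
    mass = sum(p for w, p in z.items() if w[0] == d)
    out = defaultdict(Fraction)
    for w, p in z.items():
        if w[0] == d:
            out[w[1:]] += p / mass
    return dict(out)


def S(z):
    """Truncated prediction dynamic: atoms (z(xi_1 = d), phi_z(d)) of S(z).

    Atoms with identical measures are not merged; this does not affect
    the barycentre computed below.
    """
    first = defaultdict(Fraction)
    for w, p in z.items():
        first[w[0]] += p
    return [(p, phi(z, d)) for d, p in first.items() if p > 0]


def barycentre(nu):
    """r(nu): mixture of the atoms of nu, weighted by their masses."""
    out = defaultdict(Fraction)
    for weight, measure in nu:
        for w, p in measure.items():
            out[w] += weight * p
    return dict(out)


def shift(z):
    """Image of z under the left shift, i.e. the law of (xi_2, ..., xi_n)."""
    out = defaultdict(Fraction)
    for w, p in z.items():
        out[w[1:]] += p
    return dict(out)


# An arbitrary law on words of length 3 over the alphabet {0, 1}.
z = {
    (0, 0, 1): Fraction(1, 4),
    (0, 1, 1): Fraction(1, 4),
    (1, 0, 0): Fraction(1, 3),
    (1, 1, 0): Fraction(1, 6),
}

# Lemma 6 for nu = delta_z: the barycentre of S(z) is the shifted measure.
assert barycentre(S(z)) == shift(z)
```

Exact fractions are used so that the comparison of the two measures is not affected by rounding.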

If

ν

is S-invariant, we also say that

ν

represents the stationary extension of

r (ν)

to

Δ^{Z}

. The marginal of the prediction process is an important such representation, which we call causal state distribution because of its close relation to the causal states of computational mechanics.

Definition 7.

For

P \in P_{inv} (Δ^{Z})

, the causal state distribution

μ_{C} (P)

is the marginal distribution of the prediction process, i.e.,

μ_{C} (P) : = P \circ Z_{0}^{- 1} \in P (P (Δ^{N}))

.

The causal state distribution of P is an S-invariant representation of P.

Lemma 8.

Let

P \in P_{inv} (Δ^{Z})

. Then

μ_{C} (P)

is S-invariant and represents P.

Proof.

From Proposition 2 we know that

P (Z_{1} ∣ Z_{0}) = S \circ Z_{0}

and

Z_{Z}

is stationary. Thus

\int S d μ_{C} (P) = \int S \circ Z_{0} d P = \int P (Z_{1} ∣ Z_{0}) d P = P \circ Z_{1}^{- 1} = μ_{C} (P)

Furthermore,

μ_{C} (P)

represents P because we have

r (μ_{C} (P)) = \int Z_{0} d P = \int P (ξ_{N} ∣ ξ_{- N_{0}}) d P = P \circ {ξ_{N}}^{- 1}

□

Remark.

The definitions in computational mechanics are slightly different. There, one works with equivalence classes of past trajectories (called causal states) instead of probability distributions on future trajectories. Because past trajectories

x, y \in Δ^{- N_{0}}

are identified if

P (ξ_{N} ∣ ξ_{- N_{0}} = x) = P (ξ_{N} ∣ ξ_{- N_{0}} = y)

, the two approaches are equivalent. The advantage of working on prediction space

P (Δ^{N})

is that it has a natural topology and the prediction processes of all Δ-valued stochastic processes are described in a unified manner on the same space with the same transition kernel.

Example 9.

μ_{C}

is not continuous. Let P be a non-deterministic i.i.d. (independent, identically distributed) process. Obviously, the causal state distribution of an i.i.d. process is the Dirac measure

δ_{P_{N}}

in its restriction

P_{N} : = P \circ {ξ_{N}}^{- 1}

to positive time. According to [16], periodic measures are dense in the stationary measures and we find an approximating sequence

P_{n} \overset{*}{⇀} P

of periodic measures

P_{n}

. But the past of a periodic process determines its future. Thus its causal state distribution is supported by the set of Dirac measures on

Δ^{N}

. Because the set of Dirac measures is closed in

P (Δ^{N})

, the topological supports

supp μ_{C} (P_{n})

are disjoint from the support

supp μ_{C} (P) = \{P_{N}\}

. Consequently,

μ_{C} (P_{n})

cannot converge to

μ_{C} (P)

.

◊

With statistical complexity, we measure complexity of a process P by the “diversity” of its expected futures, given observed pasts (i.e., of

μ_{C} (P)

). The Shannon entropy

H (μ)

is used as the measure of “diversity” of a probability measure

μ

. With

φ (x) : = - x log (x)

, it is defined as

H (μ) : = \sup \{\sum_{i = 1}^{n} φ (μ (B_{i})) ∣ n \in N, B_{i} \text{ disjoint, measurable}\}

(4)

Definition 10.

For

P \in P_{inv} (Δ^{Z})

, the quantity

C_{C} (P) : = H (μ_{C} (P)) \in {\bar{R}}_{+}

is called statistical complexity of P.
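As a minimal sketch of Definition 10, consider the special case of a stationary first-order Markov chain on a finite alphabet, which is an assumption made here for illustration and not a case singled out in the text. For such a chain the conditional law of the future given the past depends on the past only through the last symbol, and two symbols induce the same future law exactly when their transition rows coincide. The causal state distribution is then obtained by merging symbols with equal rows. The function name `statistical_complexity_markov` and the input format are hypothetical; the entropy is computed with the natural logarithm.

```python
from fractions import Fraction
from math import log


def statistical_complexity_markov(rows, pi):
    """C_C of a stationary first-order Markov chain.

    rows[x] maps each symbol y to the transition probability x -> y,
    pi is a stationary distribution; both use exact Fractions.
    """
    # Sanity check: pi must be invariant under the transition rows.
    for y in pi:
        assert sum(pi[x] * rows[x].get(y, 0) for x in pi) == pi[y]

    # Merge symbols of positive probability whose rows coincide.
    causal = {}
    for x, p in pi.items():
        if p > 0:
            key = frozenset((y, q) for y, q in rows[x].items() if q > 0)
            causal[key] = causal.get(key, 0) + p

    return sum(-float(p) * log(float(p)) for p in causal.values())


half = Fraction(1, 2)

# A chain that never emits two consecutive 1s: two distinct rows.
no_11 = {0: {0: half, 1: half}, 1: {0: Fraction(1)}}
pi_no_11 = {0: Fraction(2, 3), 1: Fraction(1, 3)}

# An i.i.d. fair coin written as a Markov chain: both rows coincide.
coin = {0: {0: half, 1: half}, 1: {0: half, 1: half}}
pi_coin = {0: half, 1: half}

c_no_11 = statistical_complexity_markov(no_11, pi_no_11)
c_coin = statistical_complexity_markov(coin, pi_coin)
```

For the i.i.d. example all symbols are merged into a single causal state, so the computed complexity vanishes, in agreement with the Dirac causal state distribution of Example 9.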

Note that if the probability space is sufficiently regular (e.g., separable, metrisable),

H (μ)

can only be finite if

μ

is supported by a countable set A. In this case

H (μ) = \sum_{a \in A} φ (μ (\{a\})) .

Probably, lower semi-continuity of this entropy functional is well-known. We give a proof in the appendix.

Lemma 11.

Let M be a separable, metrisable space. Then the entropy

H : P (M) \to {\bar{R}}_{+}

is weak-* lower semi-continuous.
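Lower semi-continuity cannot be improved to continuity. The following Python sketch uses a sequence chosen here purely for illustration (it does not appear in the text): ν_n puts mass 1 − 1/n on the point 0 and spreads the remaining mass 1/n uniformly over 2^(n²) further points. Then ν_n converges weak-* to δ_0, which has entropy 0, while H(ν_n) grows roughly like n log 2. The names `entropy` and `spread` are hypothetical.

```python
from math import fsum, log


def entropy(mu):
    """Shannon entropy (natural log) of a measure with countable support."""
    return fsum(-p * log(p) for p in mu.values() if p > 0)


def spread(n):
    """nu_n: mass 1 - 1/n at 0, mass 1/n uniform on 2**(n*n) other points."""
    k = 2 ** (n * n)
    mu = {0: 1.0 - 1.0 / n}
    for a in range(1, k + 1):
        mu[a] = 1.0 / (n * k)
    return mu


# Entropies along the sequence (kept small, the support has 2**(n*n) + 1 points).
entropies = [entropy(spread(n)) for n in (2, 3, 4)]
limit_entropy = entropy({0: 1.0})
```

The entropies increase along the sequence although the measures concentrate at a single point, so the entropy of the limit lies below the limit inferior, as Lemma 11 permits, but is not attained as a limit.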

3. Partially Deterministic HMMs

The probability measures on prediction space induce hidden Markov models (HMMs) with an additional partial determinism property, and it turns out to be helpful to investigate such HMMs. In Section 3.1., we define HMMs and introduce the notation we need for the further discussion. In Section 3.2., we define the partial determinism property and obtain our results about the HMMs satisfying this property. In Section 3.3., we show how measures on prediction space induce partially deterministic HMMs and apply the results from Section 3.2. to prove that the causal state distribution is the only invariant representation on prediction space that can have finite entropy.

3.1. HMMs

We use the term HMM in a wide sense, meaning a pair

(T, μ)

, where

μ

is an initial probability measure on some Polish space M of internal states and T is a Markov kernel from M to Δ × M. The HMM generates on

(Ω, A, P)

a Δ-valued output process

X_{N}

and a (coupled) M-valued internal process

W_{N_{0}}

, such that W₀ is μ-distributed and the joint process is Markovian with

P (\{X_{k} \in D, W_{k} \in B\} | X_{k - 1}, W_{k - 1}) = T (W_{k - 1}; D \times B) \quad a.s.

We call (T,μ) an HMM of

z \in P (Δ^{N})

if

z = P \circ X_{N}^{- 1}

. If

μ (B) = \int T (\cdot; Δ \times B) d μ

, we say that the HMM is invariant and extend the generated processes to stationary processes

X_{Z}

and

W_{Z}

. We need some further notation.

Definition 12.

Let

(T, μ)

be an HMM,

m \in M

,

d \in Δ

, and

ν \in P (M)

.

a): The output kernel K: $M \to P (Δ)$ is defined by $K (m) : = K_{m} : = T (m; \cdot \times M) \in P (Δ)$ . We also use the notations ${\hat{K}}_{d} (m) : = K_{m} (d) : = K_{m} (\{d\})$ and $K_{ν} : = \int K d ν$ .
b): The internal operators $L_{d} : P (M) \to P (M) \cup \{0\}$ are defined as follows. $L_{d} (ν) = 0$ if $K_{ν} (d) = 0$ , and

$L_{d} (ν) (B) : = \frac{\int T (\cdot; \{d\} \times B) d ν}{K_{ν} (d)} \quad \text{otherwise} .$

Remark.

a): $K_{m}$ is the distribution of the next output symbol when the internal state is m, i.e. $K_{m} = P (X_{1} ∣ W_{0} = m)$ a.s. Further, $K_{μ}$ is the law of $X_{1}$ .
b): The internal operator $L_{d}$ describes the update of knowledge of the internal state when the symbol $d \in Δ$ is observed. For Dirac measures, we obtain

$L_{d} (δ_{m}) = P (W_{1} ∣ W_{0} = m, X_{1} = d) a.s.$

Be warned that L_d is not induced by a kernel in the following sense. There is no kernel $l_{d} : M \to P (M)$ such that $L_{d} (ν) = \int l_{d} d ν$ . To see this, note that $L_{d} (ν) \neq \int L_{d} \circ ι d ν$ for $ι (m) = δ_{m}$ , because L_d(ν) is normalised outside the integral as opposed to an individual normalisation of the L_d(δ_m) inside the integral on the right-hand side.
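The output kernel, the internal operators and the warning above can be made concrete for a finite internal space. The following Python sketch is a toy illustration under assumptions of our own: the two-state HMM `T`, the measure `nu` and the helper names `K`, `L` and `mix_of_dirac_updates` are hypothetical and not taken from the text.

```python
from fractions import Fraction as F

# T[m][(d, m2)]: probability of emitting d and moving to m2 from state m.
T = {
    "a": {(0, "a"): F(1, 2), (1, "b"): F(1, 2)},
    "b": {(1, "a"): F(1)},
}


def K(nu, d):
    """K_nu(d): probability of output d when the internal state is nu-distributed."""
    return sum(p * q for m, p in nu.items() for (e, _), q in T[m].items() if e == d)


def L(d, nu):
    """Internal operator L_d(nu); None stands for the value 0 of the definition."""
    norm = K(nu, d)
    if norm == 0:
        return None
    out = {}
    for m, p in nu.items():
        for (e, m2), q in T[m].items():
            if e == d:
                out[m2] = out.get(m2, 0) + p * q / norm
    return out


def mix_of_dirac_updates(d, nu):
    """Integral of L_d(delta_m) with respect to nu (normalisation inside)."""
    out = {}
    for m, p in nu.items():
        update = L(d, {m: F(1)})
        if update is not None:
            for m2, q in update.items():
                out[m2] = out.get(m2, 0) + p * q
    return out


nu = {"a": F(2, 3), "b": F(1, 3)}
updated = L(1, nu)                    # normalised once, outside the integral
mixed = mix_of_dirac_updates(1, nu)   # each Dirac update normalised separately
```

In this example the two results differ, which is exactly the obstruction described in the remark: normalising after integrating is not the same as integrating the individually normalised updates.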

It directly follows from the definition of

(X_{N}, W_{N_{0}})

by a Markov kernel that the conditional probability, given that the internal state is m, is obtained by starting the HMM in m. In other words, it is generated by the HMM

(T, δ_{m})

. Similarly, the conditional probability given an observed symbol

X_{1} = d

is obtained by starting the HMM in the updated initial distribution

L_{d} (μ)

. We formulate these observations in the following lemma and give a formal proof in the appendix.

Lemma 13.

Let

(T, μ)

be an HMM with internal and output processes

W_{N_{0}}

,

X_{N}

as above. Then a.s.

(T, δ_{W_{0} (ω)})

is an HMM of

P (X_{N} ∣ W_{0}) (ω)

, and

(T, L_{X_{1} (ω)} (μ))

is an HMM of

P (X_{[2, \infty [} ∣ X_{1}) (ω)

.

Definition 14.

(processes

Y_{Z}

and

H_{Z}

) Given an invariant HMM, let

Y_{Z}

be the

P (M)

-valued stochastic process of expectations over internal states, given by

Y_{k} : = P (W_{k} ∣ X_{] - \infty, k]})

. Let

H_{Z}

be the process of entropies of the random measures

Y_{k}

, i.e.,

H_{k} (ω) : = H (Y_{k} (ω))

, where entropy H is defined by (4).

Remark.

Y_{k}

describes the current knowledge of the internal state, given the past.

H_{k}

is the entropy of the value of

Y_{k}

and measures “how uncertain” the knowledge of the internal state is. It is important to bear in mind that this is different from the entropy of the random variable

Y_{k}

. To avoid confusion, we always write

H^{P} (X)

when referring to the entropy of a random variable X defined on a probability space with measure P.

The following lemma justifies the idea of the internal operator

L_{d}

being an update of knowledge of the internal state. Furthermore, it enables us to condition on

Y_{0}

instead of

X_{- N_{0}}

. The conditional probability of the internal state given the past,

Y_{0}

, contains as much information about

X_{1}

(and in fact

X_{N}

, but we do not need that here) as the past

X_{- N_{0}}

does.

Lemma 15.

a): $Y_{1} (ω) = L_{X_{1} (ω)} (Y_{0} (ω))$ a.s.
b): $P (\{X_{1} = d\} | Y_{0}) (ω) = P (\{X_{1} = d\} | X_{- N_{0}}) (ω) = K_{Y_{0} (ω)} (d)$ a.s.

Proof.

Conditional independence of

(X_{1}, W_{1})

and

X_{- N_{0}}

given

W_{0}

implies that a.s.

P (X_{1}, W_{1} ∣ W_{0}) = P (X_{1}, W_{1} ∣ W_{0}, X_{- N_{0}})

and thus

\int T d Y_{0} = \int P (X_{1}, W_{1} ∣ W_{0}) d P (\cdot ∣ X_{- N_{0}}) = P (X_{1}, W_{1} ∣ X_{- N_{0}})

(5)

a): Let $d = X_{1} (ω)$ and for $B \in B (M)$ set $F_{B} : = \{X_{1} = d, W_{1} \in B\}$ . We obtain a.s.

$L_{d} (Y_{0}) (B) \overset{(5)}{=} \frac{P (F_{B} ∣ X_{- N_{0}})}{P (F_{M} ∣ X_{- N_{0}})} \overset{(d = X_{1} (ω))}{=} P (\{W_{1} \in B\} | X_{- N_{0}}, X_{1}) = Y_{1} (\cdot) (B)$
b): The second equality follows directly from (5). The first follows because, due to the second equality, $P (\{X_{1} = d\} | X_{- N_{0}})$ is $σ (Y_{0})$ -measurable modulo P. □

The previous lemma enables us to prove that

Y_{Z}

is Markovian and compute its transition kernel. We already know that

L_{d} (ν)

is the updated expectation of the internal state when it was previously

ν

and the symbol d is now observed. Thus it is not surprising that the conditional probability of

Y_{k}

given

Y_{k - 1} = ν

is a convex combination of Dirac measures in

L_{d} (ν)

for different d (note that

Y_{k}

is a measure-valued random variable, thus its conditional probability distribution is indeed a distribution on distributions). The mixture is given by the output kernel K, more precisely by

K_{ν}

.

Lemma 16.

For an invariant HMM,

Y_{Z}

and

H_{Z}

are stationary.

Y_{Z}

is a Markov process with transition kernel

P (Y_{k + 1} ∣ Y_{k} = ν) = \sum_{d \in Δ} K_{ν} (d) \cdot δ_{L_{d} (ν)} \in P (P (M)) \forall ν \in P (M) .

Proof.

Stationarity is obvious. For

ν_{0}, \dots, ν_{k} \in P (M)

and

ν : = ν_{k}

we obtain

\begin{matrix} P (Y_{k + 1} ∣ Y_{[0, k]} = ν_{[0, k]}) & \overset{(lem . 15 a)}{=} & P (L_{X_{k + 1} (\cdot)} (ν) | Y_{[0, k]} = ν_{[0, k]}) \\ & = & \sum_{d \in Δ} P (\{X_{k + 1} = d\} | Y_{[0, k]} = ν_{[0, k]}) \cdot δ_{L_{d} (ν)} . \end{matrix}

Because

σ (Y_{[0, k]})

is nested between

σ (Y_{k})

and

σ (X_{] - \infty, k]})

, Lemma 15 b) implies that

P (\{X_{k + 1} = d\} | Y_{[0, k]} = ν_{[0, k]}) = K_{ν_{k}} (d) = K_{ν} (d)

and hence the claim. □

3.2. Partial Determinism

If the transition T of an HMM is deterministic, i.e., if the internal state determines the next state and output (and thus the whole future) uniquely, the HMM is called (completely) deterministic. In a deterministic HMM, all randomness is due to the initial distribution. This is a very strong property, and a weaker partial determinism property is useful. In a partially deterministic HMM, the output symbol is determined randomly, but the new internal state is a function

f (m, d)

of the last internal state m and the new output symbol d. If the internal space M is finite, such HMMs are stochastic versions of deterministic finite state automata (DFAs), an important concept of theoretical computer science (see [7]). The function f directly corresponds to the transition function of the DFA, but the start state is replaced by the initial distribution and the HMM assigns probabilities to the outputs via the output kernel K. A difference in interpretation is that the symbols from Δ are considered input of the DFA and output of HMMs. To emphasise their close connection to DFAs, partially deterministic HMMs are often called deterministic stochastic automata, although they are not completely deterministic.

Definition 17.

An HMM

(T, μ)

is called partially deterministic if there is a measurable function

f : M \times Δ \to M

, called transition function, such that

T (m) = K_{m} \otimes δ_{f_{m} (\cdot)}

for all

m \in M

, i.e.,

T (m; D \times B) = K_{m} (D \cap f_{m}^{- 1} (B)) \quad \forall m \in M, D \subseteq Δ, B \in B (M) ,

where

f_{m} (d) : = {\hat{f}}_{d} (m) : = f (m, d)

and

B (M)

is the Borel

σ

-algebra on M.

Remark.

For partially deterministic HMMs we obtain

L_{d} (ν) (B) = \frac{1}{K_{ν} (d)} \int_{{\hat{f}}_{d}^{- 1} (B)} {\hat{K}}_{d} d ν \quad \text{and} \quad L_{d} (δ_{m}) = δ_{f_{m} (d)}

(6)

The second equation implies that

W_{k} = f_{W_{k - 1}} (X_{k})

a.s., justifying the name transition function for f.
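Definition 17 and the identities in (6) can be illustrated for a finite internal space. In the following Python sketch the output kernel `K`, the transition function `f` and the helper names `T`, `L` and `run` are hypothetical choices made for illustration; the sketch assumes that f is only specified on pairs (m, d) with K_m(d) > 0.

```python
from fractions import Fraction as F

# Output kernel K_m(d) and transition function f(m, d) of a toy example.
K = {"a": {0: F(1, 2), 1: F(1, 2)}, "b": {1: F(1)}}
f = {("a", 0): "a", ("a", 1): "b", ("b", 1): "a"}


def T(m):
    """T(m) = K_m (x) delta_{f_m(.)}: mass K_m(d) on the pair (d, f(m, d))."""
    return {(d, f[(m, d)]): p for d, p in K[m].items() if p > 0}


def L(d, nu):
    """First identity in (6): push nu, weighted by K_m(d), forward along f(., d)."""
    norm = sum(p * K[m].get(d, 0) for m, p in nu.items())
    if norm == 0:
        return None  # stands for the value 0 of the definition
    out = {}
    for m, p in nu.items():
        weight = p * K[m].get(d, 0)
        if weight > 0:
            m2 = f[(m, d)]
            out[m2] = out.get(m2, 0) + weight / norm
    return out


def run(m, word):
    """Internal states W_1, ..., W_n obtained from W_k = f(W_{k-1}, X_k)."""
    states = []
    for d in word:
        m = f[(m, d)]
        states.append(m)
    return states


dirac_update = L(1, {"a": F(1)})      # second identity in (6)
trajectory = run("a", [0, 1, 1, 0])   # states visited along the word 0110
```

Once the internal state is known at some time, `run` shows how it remains known along any compatible output word, which is the mechanism exploited in the proposition below.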

The following proposition is crucial for understanding partially deterministic representations. It states that, given the past output, the uncertainty

H_{k} = H (Y_{k})

about the internal state is constant over time and the next output symbol is independent of the internal state. The proof is along the following lines. If we know the internal state at one point in time, we can maintain knowledge of the internal state due to partial determinism. More generally, the uncertainty

H_{k}

of the internal state cannot increase on average and thus is a supermartingale. But because it is also stationary, the trajectories have to be constant. If two possible internal states would lead to different probabilities for the next output symbol, we could increase our knowledge of the internal state by observing the next output. But because of partial determinism, this would also decrease the uncertainty of the following internal state, in contradiction to the constant trajectories of

H_{Z}

.

Proposition 18.

Let

(T, μ)

be a partially deterministic, invariant HMM with

H (μ) < \infty

. Then

H_{Z}

has a.s. constant trajectories, i.e.,

H_{k} = H_{0}

a.s. Furthermore, the restriction

K ↾_{supp (Y_{0})}

of the output kernel K to the support

supp (Y_{0}) \subseteq M

of the random measure

Y_{0}

is a.s. a constant kernel, i.e.,

K_{m} = K_{\hat{m}} \forall m, \hat{m} \in supp (Y_{0} (ω)) a.s.

(7)

Proof.

We show that

H_{Z}

is a supermartingale to use the following well-known property.

Lemma.

Every stationary supermartingale has a.s. constant trajectories.

Because

H (μ) < \infty

, we may assume w.l.o.g. that M is countable. Note that

φ (x) = - x log (x)

satisfies

φ (\sum x_{i}) \leq \sum φ (x_{i})

. We obtain

H (L_{d} (ν)) \overset{(6)}{=} \sum_{\hat{m} \in M} φ (\sum_{m \in {\hat{f}}_{d}^{- 1} (\hat{m})} ν (m) \frac{K_{m} (d)}{K_{ν} (d)}) \leq \sum_{m \in {\hat{f}}_{d}^{- 1} (M) = M} φ (ν (m) \frac{K_{m} (d)}{K_{ν} (d)}) .

We use the filtration

F_{k} : = σ (Y_{] - \infty, k]})

. Markovianity of

Y_{Z}

yields

E (H_{k + 1} ∣ F_{k}) = E (H_{k + 1} ∣ Y_{k})

.

\begin{matrix} E (H_{k + 1} | Y_{k} = ν) & \overset{(lem . 16)}{=} & \sum_{d \in Δ} K_{ν} (d) \cdot H (L_{d} (ν)) \leq - \sum_{d, m} ν (m) K_{m} (d) \cdot \log (ν (m) \frac{K_{m} (d)}{K_{ν} (d)}) \\ & = & H^{P} (W_{k} ∣ X_{k + 1}, Y_{k} = ν) \leq H^{P} (W_{k} ∣ Y_{k} = ν) = H (ν) , \end{matrix}

where the second equality holds because

P (\{W_{k} = m, X_{k + 1} = d\} | Y_{k} = ν) = ν (m) K_{m} (d)

and

P (\{X_{k + 1} = d\} | Y_{k} = ν) = K_{ν} (d)

. Thus

H_{Z}

is a supermartingale w.r.t.

{(F_{k})}_{k \in Z}

and has a.s. constant trajectories. In particular, inequality (8) is actually an equality. Because

H (μ) < \infty

and

μ = \int Y_{k} d P

, the entropy of

Y_{k} (ω)

is a.s. finite. Thus,

H^{P} (W_{k} ∣ X_{k + 1}, Y_{k} = ν) = H^{P} (W_{k} ∣ Y_{k} = ν)

implies that W_k and X_{k+1} are independent given

Y_{k} = ν

, i.e.

K ↾_{supp (ν)}

is constant. □
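The key inequality of the proof, E(H_{k+1} | Y_k = ν) ≤ H(ν), can be checked numerically for a finite partially deterministic HMM. The following Python sketch is a toy check under our own assumptions, not a substitute for the proof: the output kernel `K`, the transition function `f`, the alphabet (0, 1) and the helper names `entropy`, `step` and `expected_next_entropy` are hypothetical.

```python
from math import log

# Toy partially deterministic HMM: output kernel K_m(d), transition function f(m, d).
K = {"a": {0: 0.5, 1: 0.5}, "b": {1: 1.0}}
f = {("a", 0): "a", ("a", 1): "b", ("b", 1): "a"}


def entropy(nu):
    """Shannon entropy (natural log) of a finitely supported measure."""
    return sum(-p * log(p) for p in nu.values() if p > 0)


def step(nu):
    """Transition of Lemma 16: atoms (K_nu(d), L_d(nu)) for all d with K_nu(d) > 0."""
    atoms = []
    for d in (0, 1):
        norm = sum(p * K[m].get(d, 0.0) for m, p in nu.items())
        if norm > 0:
            update = {}
            for m, p in nu.items():
                weight = p * K[m].get(d, 0.0)
                if weight > 0:
                    m2 = f[(m, d)]
                    update[m2] = update.get(m2, 0.0) + weight / norm
            atoms.append((norm, update))
    return atoms


def expected_next_entropy(nu):
    """E(H_{k+1} | Y_k = nu) = sum_d K_nu(d) * H(L_d(nu))."""
    return sum(weight * entropy(update) for weight, update in step(nu))


nu = {"a": 2 / 3, "b": 1 / 3}
gap = entropy(nu) - expected_next_entropy(nu)             # non-negative
dirac_gap = entropy({"a": 1.0}) - expected_next_entropy({"a": 1.0})
```

For the non-degenerate ν above the inequality is strict, whereas for a Dirac measure both sides vanish. Proposition 18 asserts equality only for the values that Y_k actually takes almost surely, so a strict gap for an arbitrary ν is not a contradiction.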

Note that the finite-entropy assumption is indeed necessary for the second statement of Proposition 18. For example, the shift defines a deterministic HMM that does not (in general) satisfy (7).

Example 19. (shift HMM)

The shift HMM is defined as follows. The internal state consists of the whole trajectory,

M : = Δ^{Z}

.

T = T^{ς}

outputs the symbol at position one and shifts the sequence to the left. More formally with

m = {(m_{k})}_{k \in Z} \in M

and

ς (m) = {(m_{k + 1})}_{k \in Z}

we have

T^{ς} (m) = δ_{m_{1}} \otimes δ_{ς (m)} = δ_{(m_{1}, ς (m))}

If

P \in P_{inv} (Δ^{Z})

, it is obvious that

(T^{ς}, P)

is an invariant, deterministic (in particular partially deterministic) HMM of P. Here, P is the law of both

X_{Z}

and

W_{0}

; in fact even

X_{Z} = W_{0}

. We claim that, generically,

(T^{ς}, P)

does not satisfy (7) (and of course the internal state entropy

H (P)

is infinite). Indeed,

K_{m} = δ_{m_{1}}

and thus

K_{m} = K_{\hat{m}}

implies

m_{1} = {\hat{m}}_{1}

. Because

Y_{0} (ω) = P (X_{Z} ∣ X_{- N_{0}}) (ω)

, (7) implies that

X_{- N_{0}}

determines

X_{1}

uniquely, which is generically not true. The analogously defined one-sided shift on

M = Δ^{N}

also does not satisfy (7). Note that, because future trajectories are equivalent to internal states, the associated process

Y_{Z}

is essentially the prediction process in the sense that

Y_{k} = Z_{k} \circ X_{Z}

.

◊
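The shift HMM can be illustrated with a small sketch. It assumes that an internal state m is represented as a Python function from integer positions to symbols, and it uses a hypothetical periodic trajectory as the initial state; neither choice comes from the paper.

```python
def shift_hmm_step(m):
    """One step of the shift HMM.

    m maps integer positions k to symbols m_k. The step emits m_1 and
    moves to the shifted trajectory k -> m_{k+1}.
    """
    return m(1), (lambda k: m(k + 1))

# Hypothetical internal state: a periodic trajectory with period 3.
m0 = lambda k: (0, 1, 1)[k % 3]

def run(m, n):
    """Emit n symbols, starting from internal state m."""
    out = []
    for _ in range(n):
        d, m = shift_hmm_step(m)
        out.append(d)
    return out

print(run(m0, 6))
```

The output simply reads off the positive part of the initial trajectory, which reflects that the internal state already contains the whole output sequence.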

Proposition 18 tells us that the next output symbol of a partially deterministic HMM is conditionally independent of the internal state, given the past output. But even more is true. The whole future output is conditionally independent of the internal state. Thus, if we know the past, the internal state provides no additional information useful for the prediction of the future output.

Corollary 20.

Let

(T, μ)

be partially deterministic, invariant, and

H (μ) < \infty

. Then

P (X_{N} ∣ W_{0} = m) = P (X_{N} ∣ W_{0} = \hat{m}) \forall m, \hat{m} \in supp (Y_{0}) a.s.

Proof.

According to Proposition 18,

P (X_{1} ∣ W_{0} = \cdot) = K

is constant on

supp (Y_{0})

. To obtain the statement for

X_{[1, n]}

, we consider the n-tuple HMM defined as follows. The output space is Δⁿ, the internal space is M, whereas the output and internal processes

{\hat{X}}_{Z}

and

{\hat{W}}_{Z}

are given by

{\hat{X}}_{k} = X_{[(k - 1) n + 1, k n]}

and

{\hat{W}}_{k} = W_{n k}

. This is achieved by the HMM

(\hat{T}, μ)

with

\hat{T} : M \to P (Δ^{n} \times M)

,

\hat{T} (m) = P (X_{[1, n]}, W_{n} ∣ W_{0} = m)

. The HMM is obviously partially deterministic with transition function

f_{d_{n}} \circ \dots \circ f_{d_{1}}

and invariant. Thus Proposition 18 implies that

P (X_{[1, n]} ∣ W_{0} = \cdot) = P ({\hat{X}}_{1} ∣ {\hat{W}}_{0} = \cdot)

is constant on

supp ({\hat{Y}}_{0})

. Because we can couple the processes such that

{\hat{Y}}_{0} = Y_{0}

, the claim follows. □
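The n-tuple construction used in this proof can be sketched for a finite model. The two-state HMM below (kernel `K`, transition function `f`) is a hypothetical example; the function `block_hmm` is an illustrative name. It builds the output kernel on Δⁿ and the composed transition function f_{d_n} ∘ … ∘ f_{d_1}.

```python
from itertools import product

# Hypothetical partially deterministic HMM (an assumption for illustration):
# K[m][d] = probability of output d in state m, f[(m, d)] = next state.
K = {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.2, 1: 0.8}}
f = {("a", 0): "a", ("a", 1): "b", ("b", 0): "a", ("b", 1): "b"}

def block_hmm(n):
    """Return (K_hat, f_hat) for the n-tuple HMM with outputs in Delta^n."""
    K_hat, f_hat = {}, {}
    for m in K:
        K_hat[m] = {}
        for word in product((0, 1), repeat=n):
            prob, state = 1.0, m
            for d in word:  # applies f_{d_1} first and f_{d_n} last
                prob *= K[state][d]
                state = f[(state, d)]
            K_hat[m][word] = prob
            f_hat[(m, word)] = state
    return K_hat, f_hat

K2, f2 = block_hmm(2)
print(K2["a"][(1, 0)], f2[("a", (1, 0))])
```

Because the next state of the block model is again a function of the current state and the emitted word, the block model inherits partial determinism.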

3.3. Representations on Prediction Space

We can interpret any probability measure

μ

on prediction space

P (Δ^{N})

as initial distribution of an HMM. The “internal state update” of the corresponding transition

T^{C}

follows the same rule as the prediction dynamic S, described by the conditional probability given the last observation. The difference is that now we include output symbols from Δ. We want to construct the HMM in such a way that if it is started in the internal state

z \in P (Δ^{N})

, its output process is distributed according to z (which is also a measure on the future). Thus, the distribution of the next output d has to be equal to the marginal of z. The next internal state has to be the conditional z-probability of the future given

ξ_{1} = d

. Recall that

ϕ_{z} (ω) = z (ς^{- 1} (\cdot) | ξ_{1}) (ω)

.

Definition 21.

We define the Markov kernel

T^{C}

from

P (Δ^{N})

to

Δ \times P (Δ^{N})

by

T^{C} (z; D \times B) : = z ({ξ_{1} \in D, ϕ_{z} \in B}), z \in P (Δ^{N}), D \subseteq Δ, B \in B (P (Δ^{N})) .
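To make the definition concrete, here is a finite-horizon sketch. It assumes that a measure z on Δ^N is replaced by a law on words of a fixed finite length, which is a simplification and not the setting of the paper; the function name `T_C` and the example law are hypothetical. For each symbol d it returns the probability z(ξ_1 = d) together with the conditional law of the remaining symbols, which plays the role of the next internal state.

```python
from collections import defaultdict

def T_C(z):
    """Finite-horizon stand-in for T^C(z).

    z: dict mapping words (tuples of length n >= 1) to probabilities.
    Returns {d: (z(xi_1 = d), conditional law of the remaining n-1 symbols)}.
    """
    marginal = defaultdict(float)
    joint = defaultdict(lambda: defaultdict(float))
    for word, p in z.items():
        marginal[word[0]] += p
        joint[word[0]][word[1:]] += p
    return {
        d: (marginal[d], {w: p / marginal[d] for w, p in joint[d].items()})
        for d in marginal if marginal[d] > 0
    }

# Hypothetical example: three symbols that are all equal, either 0 or 1.
z = {(0, 0, 0): 0.25, (1, 1, 1): 0.75}
step = T_C(z)
print(step)
```

The next state is a function of the current state z and the emitted symbol d alone, which is the partial determinism discussed in the text.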

Note that

T^{C} (z; Δ \times B) = S (z; B)

, i.e., marginalising

T^{C} (z)

to the internal component yields the prediction dynamic. Thus, if

μ = μ_{C} (P)

is the causal state distribution (Definition 7) of some process

P \in P_{inv} (Δ^{Z})

, then the internal state process of the induced HMM

(T^{C}, μ)

coincides with the prediction process

Z_{Z}

of P. From the following lemma we conclude that the output process

X_{Z}

is, as expected, distributed according to P. More generally, if

μ \in P (P (Δ^{N}))

represents a process

z \in P (Δ^{N})

in the sense of integral representation theory as a mixture of other processes, it also induces an HMM of z, namely

(T^{C}, μ)

. Recall that r is the resolvent, defined in (3), and associates the represented process to

μ

.

Lemma 22.

Let

μ \in P (P (Δ^{N}))

. Then

(T^{C}, μ)

is a partially deterministic HMM of

r (μ)

. In particular,

(T^{C}, μ_{C} (P))

is an invariant HMM of

P \in P_{inv} (Δ^{Z})

.

Proof.

Partial determinism follows directly from the definition of

T^{C}

. We have

K_{z} = z \circ {ξ_{1}}^{- 1}

and the transition function f is given by

f_{z} \circ ξ_{1} : = ϕ_{z}

. It is well defined due to the

σ (ξ_{1})

-measurability of

ϕ_{z}

and obviously

T^{C} (z; D \times B) = K_{z} (D \cap f_{z}^{- 1} (B))

. We assume w.l.o.g. that

μ

is a Dirac measure (the general claim follows by integration over

μ

). Thus let

μ = δ_{z}

with

z = r (μ)

. Recall that, according to Lemma 13,

(T^{C}, T_{d}^{C} (δ_{z}))

is an HMM of the conditional probability of

ξ_{[2, \infty [}

given that

ξ_{1} = d

(w.r.t. the output process of

(T^{C}, δ_{z})

). Using

T^{C} (z; {d} \times P (Δ^{N})) = z ({ξ_{1} = d})

and

r (T_{d}^{C} (δ_{z})) \overset{(6)}{=} r (δ_{f_{z} (d)}) = f_{z} (d) = z (ς^{- 1} (\cdot) ∣ ξ_{1} = d),

the claim follows by induction. □

Remark. (

ε

-machine).

(T^{C}, μ_{C} (P))

corresponds to the so-called

ε

-machine of computational mechanics. It is in some sense a minimal predictive model but in general not the minimal HMM of P (see [17]).
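As an illustration of how statistical complexity is obtained from such a machine, the sketch below uses a standard two-state example from the computational mechanics literature (often called the golden mean process, in which the symbol 0 never occurs twice in a row). The example is not taken from this paper, and the assumption that its two states are the causal states is part of the illustration. The code computes the stationary state distribution by iteration and then its entropy.

```python
import math

# Hypothetical two-state machine (assumed to be the causal state model):
# trans[state][symbol] = (probability, next_state)
trans = {
    "A": {1: (0.5, "A"), 0: (0.5, "B")},
    "B": {1: (1.0, "A")},
}

def stationary(trans, iters=200):
    """Approximate the stationary state distribution by power iteration."""
    pi = {s: 1.0 / len(trans) for s in trans}
    for _ in range(iters):
        new = {s: 0.0 for s in trans}
        for s, edges in trans.items():
            for p, t in edges.values():
                new[t] += pi[s] * p
        pi = new
    return pi

def entropy(pi):
    return -sum(p * math.log(p) for p in pi.values() if p > 0)

pi = stationary(trans)
complexity = entropy(pi)  # entropy of the state distribution, in nats
print(pi, complexity)
```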

Given a process

P \in P_{inv} (Δ^{Z})

, there are (usually) many invariant representations on prediction space (i.e., S-invariant

ν \in P (P (Δ^{N}))

with

r (ν) = P_{N}

). The next proposition shows that the causal state distribution of P is distinguished among them as the only one that can have finite entropy.

Proposition 23.

Let

ν \in P (P (Δ^{N}))

be S-invariant, and

P \in P_{inv} (Δ^{Z})

the measure it represents. If

ν \neq μ_{C} (P)

, then

H (ν) = \infty

.

Proof.

Let

H (ν) < \infty

. According to Lemma 22,

(T^{C}, ν)

is an invariant HMM of P and satisfies the conditions of Corollary 20. Let

W_{Z}

be the corresponding

M = P (Δ^{N})

-valued internal process. For a.e. fixed

ω

, Lemma 13 tells us that

(T^{C}, δ_{W_{0} (ω)})

is an HMM of

P (X_{N} ∣ W_{0}) (ω)

, but it is also an HMM of

r (δ_{W_{0} (ω)}) = W_{0} (ω)

due to Lemma 22. Thus,

P (X_{N} ∣ W_{0}) = W_{0}

and

z = P (X_{N} ∣ W_{0} = z) \overset{(cor . 20)}{=} P (X_{N} ∣ W_{0} = \hat{z}) = \hat{z} \forall z, \hat{z} \in supp (Y_{0} (ω))

This means

|supp (Y_{0})| = 1

, i.e.

Y_{0} (ω)

is a Dirac measure. Thus

Y_{0} = P (W_{0} ∣ X_{- N_{0}}) = δ_{W_{0}}

a.s. and

Z_{0} \circ X_{Z} = P (X_{N} ∣ X_{- N_{0}}) = \int P (X_{N} ∣ W_{0} = \cdot) d Y_{0} = P (X_{N} ∣ W_{0}) = W_{0} a.s.

Because

W_{0}

is

ν

-distributed and

μ_{C} (P)

is the law of

Z_{0}

, we obtain

ν = μ_{C} (P)

. □

We conclude this section with two examples of representations on prediction space. They are extreme cases. The first one,

ν_{1}

, is maximally concentrated, namely

ν_{1}

is the Dirac measure in (the future part of) the process we want to represent. Thus it has no uncertainty in itself, but the (unique) process in its support can be arbitrary. The second example,

ν_{2}

, is supported by maximally concentrated processes, i.e. by Dirac measures on

(Δ^{N})

, but the mixture

ν_{2}

is as diverse as the original process. The HMM corresponding to

ν_{2}

is equivalent to the one-sided shift (Example 19).

Example 24.

Let

P \in P_{inv} (Δ^{Z})

,

P_{N} = P \circ X_{N}^{- 1}

and

ν = δ_{P_{N}}

. Then

ν

is a representation of

P_{N}

with

H (ν) = 0

. This is no contradiction to Proposition 23 because

ν

is not S-invariant (if P is not i.i.d.).

◊

Example 25. (lifted shift).

Let

P \in P_{inv} (Δ^{Z})

and

ν = P_{N} \circ ι^{- 1}

, where

ι : Δ^{N} \to P (Δ^{N})

,

ι (x) = δ_{x}

is the embedding as Dirac measures.

ν

is an S-invariant representation of P and

(T^{C}, ν)

is equivalent to the one-sided shift. The only difference is that trajectories

x \in Δ^{N}

are replaced by the corresponding Dirac measures

δ_{x} \in P (Δ^{N})

. In other words,

ι

is an isomorphism. This is no contradiction to Proposition 23 because

H (ν) = \infty

(if P is not concentrated on countably many trajectories).

◊

4. Properties of the Statistical Complexity Functional

Recall that the statistical complexity

C_{C} (P)

(Definition 10) of a process

P \in P_{inv} (Δ^{Z})

is defined as the entropy

H (μ_{C} (P))

of its causal state distribution. In this section, we investigate

C_{C}

as a functional on the space of processes. First, we consider the problem of ergodic decomposition. By the ergodic decomposition of P, we mean the probability measure

ν

on the ergodic measures

P_{e} (Δ^{Z}) \subseteq P_{inv} (Δ^{Z})

that satisfies

P = r (ν) = \int_{P_{e} (Δ^{Z})} {id}_{P_{e} (Δ^{Z})} d ν .

Such a measure

ν

always exists and is uniquely determined by P. In [9,10], Łukasz Dębowski investigated another complexity measure, excess entropy, and gave a formula for its ergodic decomposition. Here, we obtain the corresponding result for statistical complexity. It is the average complexity of the ergodic components plus the entropy of the mixture.

Proposition 26.

(ergodic decomposition). Let

ν \in P (P_{e} (Δ^{Z}))

be the ergodic decomposition of

P \in P_{inv} (Δ^{Z})

. Then

C_{C} (P) = \int C_{C} d ν + H (ν)

Proof.

First note that

μ_{C} (P_{1})

and

μ_{C} (P_{2})

are singular for distinct ergodic

P_{1}, P_{2} \in P_{e} (Δ^{Z})

. Indeed, there exist disjoint

A_{1}, A_{2} \in σ (ξ_{- N_{0}})

and disjoint

B_{1}, B_{2} \in σ (ξ_{N})

s.t.

P_{k} (A_{k}) = 1

and

P_{k} (B_{k} ∣ ξ_{- N_{0}}) ↾_{A_{k}} \equiv 1

. Consequently, if

ν

is not supported by a countable set,

μ_{C} (P)

cannot be supported by a countable set and

C_{C} (P) = H (ν) = \infty

. Thus assume

ν = \sum_{k \in N} ν_{k} δ_{P_{k}}

for some

ν_{k} \geq 0

and distinct

P_{k} \in P_{e} (Δ^{Z})

. Then there are disjoint

A_{k} \in σ (ξ_{- N_{0}})

s.t.

P_{k} (A_{k}) = 1

. We claim

P (\cdot ∣ ξ_{- N_{0}}) = \sum_{k \in N} 1_{A_{k}} P_{k} (\cdot ∣ ξ_{- N_{0}}) P -a.s.

Indeed, the

σ (ξ_{- N_{0}})

-measurability is clear, and for

A \in σ (ξ_{- N_{0}})

,

F \in B (Δ^{Z})

we have

\begin{matrix} \int_{A} \sum_{k \in N} 1_{A_{k}} P_{k} (F ∣ ξ_{- N_{0}}) d P & = & \sum_{j \in N} ν_{j} \int_{A} \sum_{k \in N} 1_{A_{k}} P_{k} (F ∣ ξ_{- N_{0}}) d P_{j} \\ & \overset{(P_{j} (A_{j}) = 1)}{=} & \sum_{j} ν_{j} \int_{A \cap A_{j}} P_{j} (F ∣ ξ_{- N_{0}}) d P_{j} \\ & = & \sum_{j} ν_{j} P_{j} (F \cap A \cap A_{j}) = P (F \cap A) . \end{matrix}

As

P (A_{k}) = ν_{k}

, it follows that

μ_{C} (P) = \sum_{k} ν_{k} μ_{C} (P_{k})

. Mutual singularity of the

μ_{C} (P_{k})

implies

C_{C} (P) = H (\sum_{k} ν_{k} μ_{C} (P_{k})) = \sum_{k} ν_{k} H (μ_{C} (P_{k})) + H (ν)

□
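The last step of this proof rests on the fact that the entropy of a mixture of mutually singular measures splits into the average entropy of the components plus the entropy of the weights. A minimal numerical check, with hypothetical component distributions on disjoint finite supports, might look as follows.

```python
import math

def entropy(mu):
    return -sum(p * math.log(p) for p in mu.values() if p > 0)

def mix(weights, components):
    """Convex combination of discrete distributions given as dicts."""
    out = {}
    for w, mu in zip(weights, components):
        for x, p in mu.items():
            out[x] = out.get(x, 0.0) + w * p
    return out

# Hypothetical causal state distributions of two ergodic components,
# supported on disjoint sets of states (mutual singularity).
mu1 = {"s1": 2 / 3, "s2": 1 / 3}
mu2 = {"t1": 1.0}
nu = [0.25, 0.75]

lhs = entropy(mix(nu, [mu1, mu2]))
rhs = (sum(w * entropy(mu) for w, mu in zip(nu, [mu1, mu2]))
       + entropy(dict(enumerate(nu))))
print(lhs, rhs)
```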

Several corollaries follow directly from this proposition. The set

P_{C} : = C_{C}^{- 1} (R)

of stationary processes with finite statistical complexity is convex,

C_{C}

is concave but not continuous, and the set

P_{\infty} : = P_{inv} (Δ^{Z}) ∖ P_{C}

of processes with infinite statistical complexity is dense.

Corollary 27.

(concavity).

P_{C}

is a convex set and

C_{C}

is concave. Moreover, for all

ν \in P (N)

,

ν_{k} : = ν (k)

and

P_{k} \in P_{inv} (Δ^{Z})

\sum_{k \in N} ν_{k} C_{C} (P_{k}) \leq C_{C} (\sum_{k \in N} ν_{k} P_{k}) \leq \sum_{k \in N} ν_{k} C_{C} (P_{k}) + H (ν)

Proof.

Use ergodic decomposition of the

P_{k}

and Proposition 26. □
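The two bounds can be checked on a toy example in which processes are represented only by their ergodic decompositions. Everything concrete here is an assumption made for illustration: the component labels, their complexities in the dictionary `c`, and the weights. The complexity of each process is evaluated with the formula of Proposition 26.

```python
import math

def H(w):
    return -sum(p * math.log(p) for p in w.values() if p > 0)

# Hypothetical ergodic components with assumed complexities (in nats).
c = {"E1": 0.0, "E2": 0.7, "E3": 1.2}

def C(decomp):
    """Proposition 26: average component complexity plus mixture entropy."""
    return sum(w * c[e] for e, w in decomp.items()) + H(decomp)

# Two processes given by their ergodic decompositions; they share E2.
P1 = {"E1": 0.5, "E2": 0.5}
P2 = {"E2": 0.25, "E3": 0.75}
nu = [0.4, 0.6]

# Ergodic decomposition of the mixture nu_1 P1 + nu_2 P2.
mixture = {}
for w, P in zip(nu, (P1, P2)):
    for e, q in P.items():
        mixture[e] = mixture.get(e, 0.0) + w * q

lower = sum(w * C(P) for w, P in zip(nu, (P1, P2)))
upper = lower + H(dict(enumerate(nu)))
middle = C(mixture)
print(lower, middle, upper)
```

The upper bound is attained when the decompositions of the P_k have disjoint supports; the shared component E2 makes it strict in this example.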

Corollary 28.

(non-continuity).

C_{C} ↾_{P_{C}}

is not continuous in any

P \in P_{C}

w.r.t. variational topology, let alone w.r.t. weak-* topology.

Proof.

Let

Q_{n} \in P_{C}

with

{lim}_{n \to \infty} \frac{1}{n} C_{C} (Q_{n}) = \infty

and

P_{n} : = \frac{n - 1}{n} P + \frac{1}{n} Q_{n}

. Then

P_{n} \to P

in variational topology, but

C_{C} (P_{n}) \geq \frac{1}{n} C_{C} (Q_{n}) \to \infty

by Corollary 27. □

Corollary 29.

P_{\infty}

is dense in

P_{inv} (Δ^{Z})

w.r.t. variational- and a fortiori w.r.t. weak-*-topology.

Proof.

Let

P, Q \in P_{inv} (Δ^{Z})

with

C_{C} (Q) = \infty

. Then

P_{\infty} ∋ \frac{n - 1}{n} P + \frac{1}{n} Q \to P

□

We give a simple example of a situation where statistical complexity is not continuous.

Example 30.

(non-continuity). Let

Q_{p}

be the Bernoulli process on

Δ = {0, 1}

with parameter

0 < p < 1

, i.e.

Q_{p} (ξ_{1} = 1) = p

. Consider the process of throwing a coin which is either slightly biased to 0 or 1, each with probability

\frac{1}{2}

, i.e.

P_{ε} = \frac{1}{2} Q_{\frac{1}{2} + ε} + \frac{1}{2} Q_{\frac{1}{2} - ε}

with

0 < ε < \frac{1}{2}

. Then

P_{ε} \overset{*}{⇀} P_{0} = Q_{\frac{1}{2}}

for

ε \to 0

, but

C_{C} (P_{ε}) = log (2)

for

ε > 0

and

C_{C} (P_{0}) = 0

.

◊
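The example can be explored numerically. The sketch below compares cylinder probabilities of P_ε and Q_{1/2} on words of a fixed length (an indication of weak-* closeness, not a proof) and evaluates the complexity through the formula of Proposition 26, using that each Bernoulli component has complexity 0. All function names are illustrative.

```python
import math
from itertools import product

def bernoulli_word_prob(p, word):
    """Probability of a finite 0/1 word under the Bernoulli process Q_p."""
    ones = sum(word)
    return p ** ones * (1 - p) ** (len(word) - ones)

def P_eps(eps, word):
    return (0.5 * bernoulli_word_prob(0.5 + eps, word)
            + 0.5 * bernoulli_word_prob(0.5 - eps, word))

def max_cylinder_gap(eps, n):
    """Largest difference between P_eps and Q_{1/2} on words of length n."""
    return max(abs(P_eps(eps, w) - 0.5 ** n)
               for w in product((0, 1), repeat=n))

def complexity(eps):
    """Entropy of the mixture weights; the two components merge at eps = 0."""
    weights = {}
    for p in (0.5 + eps, 0.5 - eps):
        weights[p] = weights.get(p, 0.0) + 0.5
    return -sum(w * math.log(w) for w in weights.values())

gaps = [max_cylinder_gap(eps, 4) for eps in (0.1, 0.01, 0.001)]
print(gaps, [complexity(eps) for eps in (0.1, 0.01, 0.001, 0.0)])
```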

The proof of our most important result about statistical complexity, namely its lower semi-continuity, makes use of the propositions given in Section 2.1 and Section 3. It also uses a compactness argument. To this end we need, in the case of infinite Δ, a lemma guaranteeing that

μ_{C}

preserves relative compactness.

Lemma 31.

Let

M \subseteq P_{inv} (Δ^{Z})

be relatively compact. Then

μ_{C} (M) : = \{μ_{C} (P) | P \in M\}

is relatively compact in

P (P (Δ^{N}))

.

Proof.

Using Prokhorov’s theorem, we have to show that

μ_{C} (M)

is tight provided that

M

is tight. Let

ε > 0

and

K_{n} \subseteq Δ^{Z}

compact with

P (K_{n}) \geq 1 - ε \frac{2^{- n}}{n}

for all

P \in M

. We define

K_{n}^{'} : = ξ_{N} (K_{n})

,

\tilde{K} : = \{z \in P (Δ^{N}) | z (K_{n}^{'}) \geq 1 - \frac{1}{n} \forall n \in N\}

and

f_{n} : = P ({ξ_{N} \in K_{n}^{'}} | ξ_{- N_{0}})

. For

P \in M

\int f_{n} d P \geq \int P (K_{n} ∣ ξ_{- N_{0}}) d P = P (K_{n}) \geq 1 - ε \frac{2^{- n}}{n} .

We obtain

P (⋃_{n} {f_{n} < 1 - \frac{1}{n}}) \leq \sum_{n} n (1 - \int f_{n} d P) \leq \sum ε 2^{- n} = ε

and consequently

μ_{C} (P) (\tilde{K}) = P ({Z_{0} \in \tilde{K}}) = P (⋂_{n} {f_{n} \geq 1 - \frac{1}{n}}) \geq 1 - ε

for all

P \in M

. We still have to show compactness of

\tilde{K}

. It is closed because

z_{k} \overset{*}{⇀} z

implies

z (K_{n}^{'}) \geq {lim sup}_{k} z_{k} (K_{n}^{'})

due to closedness of

K_{n}^{'}

. It is tight by definition because the

K_{n}^{'}

are compact. Therefore,

\tilde{K}

is compact. □

Theorem 32.

(lower semi-continuity). The statistical complexity functional,

C_{C} : P_{inv} (Δ^{Z}) \to {\bar{R}}_{+}

, is weak-* lower semi-continuous.

Proof.

Let

P_{n}, P \in P_{inv} (Δ^{Z})

with

P_{n} \overset{*}{⇀} P

. Every subsequence of

(μ_{C} (P_{n}))_{n \in N}

has an accumulation point (a.p.), according to Lemma 31. Consequently,

\underset{n \to \infty}{lim inf} C_{C} (P_{n}) = \underset{n \to \infty}{lim inf} H (μ_{C} (P_{n})) \overset{(H lsc)}{\geq} inf \{H (ν) | ν a.p. of {(μ_{C} (P_{n}))}_{n \in N}\} .

Every

μ_{C} (P_{n})

is S-invariant. According to Proposition 5, S is continuous and thus every a.p.

ν

of

(μ_{C} (P_{n}))_{n \in N}

is also S-invariant. The resolvent

r : P (P (Δ^{N})) \to P (Δ^{N})

is continuous (see [15]), and thus

ν

represents P. Therefore, according to Proposition 23,

H (ν) \geq C_{C} (P)

. In total we obtain

\underset{n \to \infty}{lim inf} C_{C} (P_{n}) \geq C_{C} (P) .

□

We argue that, from a theoretical point of view, every complexity measure should be lower semi-continuous. While it is not counter-intuitive that it is possible to approximate a simple system by unnecessarily complex ones (and hence the complexity is not continuous), it would be strange to consider a process complex if there is an approximating sequence with (uniformly) simple processes. Therefore, an axiomatic characterisation of complexity measures (although, of course, we are far from having such a characterisation) should include lower semi-continuity. There are also slightly more practical reasons why semi-continuity is a nice property.

In a model selection task, for instance, it might be desirable to impose some upper bound

a \in R_{+}

on the complexity of considered processes (e.g. to avoid overfitting). An important consequence of lower semi-continuity is that the set

C_{C}^{- 1} ([0, a]) = \{P \in P_{inv} (Δ^{Z}) | C_{C} (P) \leq a\}

of processes with complexity bounded by a is closed. This makes the complexity constraint technically easier. Consider any complete metric on

P_{inv} (Δ^{Z})

compatible with weak-* (or any stronger) topology (e.g. Prokhorov, Kantorovich-Rubinshtein or variational metric). Then due to the closedness, for every

P \in P_{inv} (Δ^{Z})

with arbitrary complexity, there is a (not necessarily unique) closest “sufficiently simple” process

P_{a}

with complexity not exceeding a. Another consequence is that the set of processes with infinite complexity is generic in the following sense.

Corollary 33.

P_{\infty}

contains a dense

G_{δ}

-set.

Proof.

Because all

C_{C}^{- 1} ([0, n])

are closed,

P_{\infty}

is a

G_{δ}

-set. It is dense according to Corollary 29. □

Example 34.

Consider the experiment of first choosing a random coin with success probability p uniformly in

[0, 1]

and then generating an i.i.d. sequence with this coin. More precisely, let

Q_{p}

be the Bernoulli process with parameter p on

Δ = {0, 1}

and

P = \int Q_{p} d p

. Then P has infinite statistical complexity according to Proposition 26. We might approximate P by

P_{n} \overset{*}{⇀} P

(e.g. with ergodic

P_{n}

). Then Theorem 32 implies that the complexity of

P_{n}

necessarily tends to infinity.

◊
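One possible approximating sequence (an assumption for illustration, and not ergodic) replaces the uniform mixture by a mixture of n Bernoulli processes on an evenly spaced grid of parameters. By Proposition 26, and because each i.i.d. component has complexity 0, the complexity of such a mixture is the entropy of its weights, which grows like log(n).

```python
import math

def discretized_mixture(n):
    """Weights of a hypothetical approximation of P: n Bernoulli processes
    with parameters (k + 1/2) / n, k = 0..n-1, each with weight 1/n."""
    return {(k + 0.5) / n: 1.0 / n for k in range(n)}

def complexity_of_mixture(weights):
    """Entropy of the mixture weights (Proposition 26 with C_C(Q_p) = 0)."""
    return -sum(w * math.log(w) for w in weights.values() if w > 0)

sizes = (2, 10, 100, 1000)
values = [complexity_of_mixture(discretized_mixture(n)) for n in sizes]
print(values)
```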

Example 35.

Let Δ be finite. Then

P_{inv} (Δ^{Z})

is compact. Assume we made observations of a Δ-valued process and want to fit some

P \in P_{inv} (Δ^{Z})

. From the observations, we might derive a set of closed constraints, e.g.,

P ({ξ_{1} = ξ_{2}}) \in [a, b]

,

P ({ξ_{1} = d}) \geq ε

, and

P ({ξ_{2} = d} | ξ_{1} = d) \in [a, b]

(the third is closed only in presence of the second). Further closed constraints may be given by modelling assumptions. Because the resulting set of admissible processes is compact, lower semi-continuity implies that there is at least one process of minimal complexity satisfying all constraints.

◊

Appendix

Proof of Lemma 11 (lower semi-continuity of the entropy). Recall that

φ (x) : = - x log (x)

and denote the boundary of a set B by

\partial B

. Define

\hat{H} (μ) : = sup {\sum_{i = 1}^{n} φ (μ (B_{i})) | n \in N, B_{i} disjoint, μ (\partial B_{i}) = 0} .

Obviously,

\hat{H} \leq H

. Recall that

μ_{n} \overset{*}{⇀} μ

implies

μ_{n} (A) \to μ (A)

for all A with

μ (\partial A) = 0

(e.g., [18]). Thus

\hat{H}

is clearly lower semi-continuous and it is sufficient to show

H (μ) \leq \hat{H} (μ) .

If

μ

is not supported by any countable set,

\hat{H} (μ) = \infty

due to separability of M. Let

μ = \sum_{i = 1}^{\infty} a_{i} δ_{x_{i}}

(a_{i} \in [0, 1], x_{i} \in M)

, and d a compatible metric on M. For fixed

n \in N

, we can choose a radius

r_{n} > 0

, such that

B_{i}^{n} : = {x \in M ∣ d (x_{i}, x) < r_{n}}

,

i = 1, \dots, n

, are disjoint and

μ (\partial B_{i}^{n}) = 0

. We get

\sum_{i = 1}^{n} φ (a_{i}) \overset{(φ^{'} \geq - 1)}{\leq} \sum_{i = 1}^{n} φ (μ (B_{i}^{n})) + \sum_{i = 1}^{n} (μ (B_{i}^{n}) - a_{i}) \leq \hat{H} (μ) + \sum_{i = n + 1}^{\infty} a_{i} .

Therefore,

H (μ) = {lim}_{n \to \infty} \sum_{i = 1}^{n} φ (a_{i}) \leq \hat{H} (μ)

. □
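A simple example shows why entropy is only lower semi-continuous and not continuous. The sequence below is hypothetical: two atoms of mass 1/2 on M = [0, 1] that merge in the limit. Each μ_n has entropy log(2), while the weak-* limit is a Dirac measure with entropy 0.

```python
import math

def entropy(mu):
    return -sum(p * math.log(p) for p in mu.values() if p > 0)

def mu_n(n):
    """Hypothetical sequence: atoms of mass 1/2 at the points 0 and 1/n."""
    return {0.0: 0.5, 1.0 / n: 0.5}

limit = {0.0: 1.0}  # weak-* limit of mu_n as n grows
entropies = [entropy(mu_n(n)) for n in (1, 10, 100)]
print(entropies, entropy(limit))
```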

Proof of Lemma 13. We first prove that

(T, δ_{W_{0}})

is an HMM of

P (X_{N} ∣ W_{0})

. Let

G_{T} (m) \in P (Δ^{N})

be the distribution of the output process of

(T, δ_{m})

. Because

G_{T}

is measurable,

G_{T} \circ W_{0}

is

σ (W_{0})

-measurable. From the definition of

(W_{N_{0}}, X_{N})

it follows for measurable

B \subseteq M, A \subseteq Δ^{N}

that

P ({W_{0} \in B} \cap {X_{N} \in A}) = \int_{B}^{} G_{T} (\cdot; A) d μ = \int_{W_{0}^{- 1} (B)}^{} G_{T} (W_{0} (\cdot); A) d P,

where the second equality holds because

W_{0}

is distributed according to

μ

. Thus

G_{T} \circ W_{0}

is the claimed conditional probability. To see that

(T, L_{X_{1}} (μ))

is an HMM of

P (X_{[2, \infty [} ∣ X_{1})

, note for

d \in Δ

that

\int G_{T} (\cdot; A) d L_{d} (μ) = \frac{1}{K_{μ} (d)} \int \int_{{d} \times M}^{} G_{T} (\cdot; A) d T d μ = \frac{P ({X_{1} = d, X_{[2, \infty [} \in A})}{P ({X_{1} = d})} .

□
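The second statement can be verified by direct computation for a small finite HMM. The model `T` and the initial law `mu` below are hypothetical. The check compares the law of the first output of the HMM started in L_d(μ) with the conditional probability of X_2 given X_1 = d.

```python
# Hypothetical HMM (an assumption for illustration):
# T[m][(d, m_next)] = joint probability of output d and next state m_next.
T = {
    "a": {(0, "a"): 0.3, (0, "b"): 0.2, (1, "b"): 0.5},
    "b": {(0, "a"): 0.6, (1, "a"): 0.1, (1, "b"): 0.3},
}
mu = {"a": 0.4, "b": 0.6}

def K(dist, d):
    """Probability that the next output is d when the state has law dist."""
    return sum(p * q for m, p in dist.items()
               for (e, _), q in T[m].items() if e == d)

def L(dist, d):
    """Law of the next internal state, conditioned on the output d."""
    out = {}
    for m, p in dist.items():
        for (e, m_next), q in T[m].items():
            if e == d:
                out[m_next] = out.get(m_next, 0.0) + p * q / K(dist, d)
    return out

def joint(dist, d1, d2):
    """P(X_1 = d1, X_2 = d2), computed directly from the HMM."""
    return sum(
        p * q * K({m_next: 1.0}, d2)
        for m, p in dist.items()
        for (e, m_next), q in T[m].items()
        if e == d1
    )

lhs = K(L(mu, 0), 1)              # first output of (T, L_d(mu)) for d = 0
rhs = joint(mu, 0, 1) / K(mu, 0)  # P(X_2 = 1 | X_1 = 0)
print(lhs, rhs)
```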

Acknowledgements

I am thankful to Nihat Ay for introducing me to computational mechanics, discussions, and all kinds of scientific support. I also thank the anonymous reviewers for their many helpful comments.

References

1. Olbrich, E.; Bertschinger, N.; Ay, N.; Jost, J. How Should Complexity Scale with System Size? Eur. Phys. J. B 2008, 63, 407–415.
2. Jänicke, H.; Wiebel, A.; Scheuermann, G.; Kollmann, W. Multifield Visualization Using Local Statistical Complexity. IEEE Trans. Visual. Comput. Gr. 2007, 13, 1384–1391.
3. Crutchfield, J.P.; Young, K. Inferring Statistical Complexity. Phys. Rev. Lett. 1989, 63, 105–108.
4. Shalizi, C.R.; Crutchfield, J.P. Computational Mechanics: Pattern and Prediction, Structure and Simplicity. J. Statist. Phys. 2001, 104, 817–879.
5. Ay, N.; Crutchfield, J.P. Reductions of Hidden Information Sources. J. Statist. Phys. 2005, 120, 659–684.
6. Clarke, R.W.; Freeman, M.P.; Watkins, N.W. Application of Computational Mechanics to the Analysis of Natural Data: An Example in Geomagnetism. Phys. Rev. E 2003, 67, 016203.1–016203.15.
7. Hopcroft, J.; Ullman, J. Introduction to Automata Theory, Languages, and Computation; Addison-Wesley: Reading, Massachusetts, USA, 1979.
8. Keller, G. Equilibrium States in Ergodic Theory; London Mathematical Society: New York, USA, 1998.
9. Dębowski, Ł. Ergodic Decomposition of Excess Entropy and Conditional Mutual Information. IPI PAN Reports, nr 993. 2006.
10. Dębowski, Ł. A General Definition of Conditional Information and Its Application to Ergodic Decomposition. Stat. Probab. Lett. 2009, 79, 1260–1268.
11. Knight, F. A Predictive View of Continuous Time Processes. The Annals of Probability 1975, 573–596.
12. Knight, F. Foundations of the Prediction Process; Oxford Science Publications: New York, USA, 1992.
13. Meyer, P. La théorie de la prédiction de F. Knight. Séminaire de Probabilités 1976, X, 86–103.
14. Knight, F. Essays on the Prediction Process; Vol. 1, Lecture Notes Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1981.
15. Choquet, G. Lectures on Analysis, Volume II (Representation Theory); W. A. Benjamin, Inc.: New York, USA, 1969.
16. Parthasarathy, K.R. On the Category of Ergodic Measures. Illinois J. Math. 1961, 5, 648–656.
17. Löhr, W.; Ay, N. On the Generative Nature of Prediction. Adv. Complex. Syst. 2009, 12, 169–194.
18. Billingsley, P. Convergence of Probability Measures, 2nd ed.; Wiley: New York, USA, 1968.

© 2009 by the author; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.
