Natural Gradient Flow in the Mixture Geometry of a Discrete Exponential Family

Malagò, Luigi; Pistone, Giovanni

doi:10.3390/e17064215

Open Access Article

Natural Gradient Flow in the Mixture Geometry of a Discrete Exponential Family^†

by Luigi Malagò ^{1,2,*} and Giovanni Pistone ^{3}

^1 Department of Electrical and Electronic Engineering, Shinshu University, Nagano, Japan

^2 Inria Saclay, Île-de-France, Orsay Cedex, France

^3 De Castro Statistics, Collegio Carlo Alberto, Moncalieri, Italy

^* Author to whom correspondence should be addressed.

^† This paper is an extended version of our paper published in the Proceedings of MaxEnt 2014 Conference on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Amboise, France, 21–26 September 2014.

Entropy 2015, 17(6), 4215-4254; https://doi.org/10.3390/e17064215

Submission received: 31 January 2015 / Revised: 21 May 2015 / Accepted: 2 June 2015 / Published: 18 June 2015

(This article belongs to the Special Issue Information, Entropy and Their Geometric Structures)


Abstract

In this paper, we study Amari’s natural gradient flows of real functions defined on the densities belonging to an exponential family on a finite sample space. Our main example is the minimization of the expected value of a real function defined on the sample space. In such a case, the natural gradient flow converges to densities with reduced support that belong to the border of the exponential family. We have suggested in previous works to use the natural gradient evaluated in the mixture geometry. Here, we show that in some cases, the differential equation can be extended to a bigger domain in such a way that the densities at the border of the exponential family are actually internal points in the extended problem. The extension is based on the algebraic concept of an exponential variety. We study in full detail a toy example and obtain positive partial results in the important case of a binary sample space.

Keywords:

information geometry; stochastic relaxation; natural gradient flow; expectation parameters; toric models


1. Introduction

To present our approach to the geometry of statistical models clearly, we start with a recap of the nonparametric statistical manifold; see, e.g., the review paper [1]. However, we will shortly move to the actual setup of the present paper, i.e., the finite state space case.

Let

(Ω, A, µ)

be a measure space of sample points x ∈ Ω. We denote by

P_{\geq} \subset L^{1} (µ)

the simplex of (probability) densities and by

P_{>} \subset P_{\geq}

the convex set of strictly positive densities. If Ω is finite, then

P_{>}

is the topological interior of

P_{\geq}

. We denote by

P^{1}

the affine space generated by

P_{\geq}

.

The set

P_{>}

holds the exponential geometry, which is an affine geometry, whose geodesics are curves of the form

t \mapsto p_{t} \propto p_{0}^{1 - t} p_{1}^{t}

. The set

P^{1}

holds the mixture geometry, whose geodesics are of the form t ↦ p_t = (1 − t)p₀ + tp₁. A proper definition of the exponential and mixture geometry, where probability densities are considered points, requires the definition of the proper tangent space to hold the vectors representing the velocity of a curve. In both cases, the tangent space T_p at a point p is a space of random variables V with zero expected value, E_p [V] = 0. On the tangent space T_p, a natural scalar product is defined, 〈U, V〉_p = E_p [UV], so that a pseudo-Riemannian structure is available. Note that the Riemannian structure is a third geometry, different from both the exponential and the mixture geometries. Note also that both the expected value and the covariance can be naturally extended to be defined on

P^{1}

.

For each lower bounded objective function

f : Ω \to ℝ

and each statistical model

M \subset P_{>}

, the (stochastic) relaxation of f to

M

is the function

F (p) = E_{p} [f] \in ℝ

,

p \in M

; cf. [2]. The minimization of the stochastic relaxation as a tool to minimize the objective function has been studied by many authors [3–7].

If we have a parameterization ξ ↦ p_ξ of

M

, the parametric expression of the relaxed function is

\hat{F} (ξ) = E_{p_{ξ}} [f]

. Under integrability and differentiability conditions on both ξ ↦ p_ξ and x ↦ f(x),

\hat{F}

is differentiable, with

\partial_{j} \hat{F} (ξ) = E_{p_{ξ}} [\partial_{j} \log (p_{ξ}) f]

and

E_{p_{ξ}} [\partial_{j} \log (p_{ξ})] = 0

; see [1,8]. In order to properly describe the gradient flow of a relaxed random variable, these classical computations are better cast into the formal language of information geometry (see [9]) and, even better, in the language of non-parametric differential geometry [10] that was used in [11]. The previous computations suggest taking the Fisher score

\partial_{j} \log (p_{ξ})

as the definition of a tangent vector at the j-th coordinate curve. While the development of this analogy in the finite state space case does not require a special setup, in the non-finite state space, some care has to be taken.

In this paper, we follow the non-parametric setup discussed in [1] and, in particular, the notion of an exponential family ℇ and the identification of the tangent space at each

p \in ℰ

with a space of p-centered random variables.

The paper is organized as follows. We discuss in Section 2 the generalities of the finite state space case; in particular, we carefully define the various notions of the Fisher information matrix and natural gradient that arise from a given parameterization. In Section 3, we discuss a toy example in order to introduce the construction of an algebraic variety extending the exponential family from positive probabilities

P_{>}

to signed probabilities

P^{1}

; this construction is applied to the natural gradient flow in the expectation parameters; moreover, it is shown that this model has a variety that is ruled. The last Section 4 is devoted to the treatment of the special important case when the sample space is binary.

The present paper is a development of the paper [12], which was presented as a poster at the MaxEnt Conference 2014. While the topic is the same, the actual overlapping between the two papers is minimal and concerns mainly the generalities that are repeated for the convenience of the reader.

2. Gradient Flow of Relaxed Optimization

Let Ω be a finite set of points x = (x₁, …, x_n) and µ the counting measure of Ω. In this case, a density

p \in P_{\geq}

is a probability function, i.e.,

p : Ω \to ℝ_{+}

, such that

\sum_{x \in Ω} p (x) = 1

.

Let

B = {T_{1}, \dots, T_{d}}

be a set of random variables, such that, if

\sum_{j = 1}^{d} c_{j} T_{j}

is constant, then c₁ = ⋯ = c_d = 0; for instance consider

B

such that

\sum_{x \in Ω} T_{j} (x) = 0

, j = 1,…,d, and

B

is a linear basis. We say that

B

is a set of affinely independent random variables. If

B

is a set of linearly independent random variables, it is affinely independent if and only if {1, T₁, …, T_d} is linearly independent.

We consider the statistical model ℇ whose elements are uniquely identified by the natural parameters θ in the exponential family with sufficient statistics

B

namely:

p_{θ} \in ℰ \Leftrightarrow \log p_{θ} (x) = \sum_{i = 1}^{d} θ_{i} T_{i} (x) - ψ (θ),

see [13].

The proper convex function

ψ : ℝ^{d} \to ℝ

,

θ \mapsto ψ (θ) = \log \sum_{x \in Ω} e^{θ \cdot T (x)} = θ \cdot E_{p_{θ}} [T] - E_{p_{θ}} [\log (p_{θ})]

is the cumulant generating function of the sufficient statistics T, in particular,

\nabla ψ (θ) = E_{θ} [T], Hess ψ (θ) = {Cov}_{θ} (T, T) .

Moreover, the entropy of p_θ is:

H (p_{θ}) = - E_{p_{θ}} [\log (p_{θ})] = ψ (θ) - θ \cdot \nabla ψ (θ) .

The mapping ∇ψ is one-to-one onto the interior M° of the marginal polytope, that is, the convex hull M = con {T (x)|x ∈ Ω} of the values of the sufficient statistics. Note that no extra condition is required, because on a finite state space, all random variables are bounded. Nonetheless, even in this case, the proof is not trivial; see [13].
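The identities above can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical four-point sample space with two sufficient statistics (the values of `T`, the point `theta0`, and all function names are illustrative choices, not taken from the paper). It compares a finite-difference gradient of ψ with E_θ[T] and evaluates the entropy in two ways.

```python
import math

# Hypothetical values (T1(x), T2(x)) on a four-point sample space.
T = [(0, 0), (0, 1), (1, 0), (2, 1)]

def psi(theta):
    """Cumulant generating function: log of the sum over x of exp(theta . T(x))."""
    return math.log(sum(math.exp(theta[0] * a + theta[1] * b) for a, b in T))

def density(theta):
    """Probabilities p_theta(x) = exp(theta . T(x) - psi(theta))."""
    z = psi(theta)
    return [math.exp(theta[0] * a + theta[1] * b - z) for a, b in T]

def mean_T(theta):
    """Expectation parameters eta = E_theta[T]."""
    p = density(theta)
    return [sum(px * t[j] for px, t in zip(p, T)) for j in range(2)]

def cov_T(theta):
    """Covariance matrix Cov_theta(T, T), which equals Hess psi(theta)."""
    p, m = density(theta), mean_T(theta)
    return [[sum(px * (t[i] - m[i]) * (t[j] - m[j]) for px, t in zip(p, T))
             for j in range(2)] for i in range(2)]

def entropy(theta):
    """Entropy computed directly as -E[log p_theta]."""
    return -sum(px * math.log(px) for px in density(theta))

def grad_psi_fd(theta, h=1e-6):
    """Central finite-difference approximation of the gradient of psi."""
    g = []
    for j in range(2):
        up, dn = list(theta), list(theta)
        up[j] += h
        dn[j] -= h
        g.append((psi(up) - psi(dn)) / (2 * h))
    return g

theta0 = [0.3, -0.7]          # arbitrary natural parameters
eta0 = mean_T(theta0)         # should agree with grad_psi_fd(theta0)
entropy_via_psi = psi(theta0) - sum(a * b for a, b in zip(theta0, eta0))
```

In this sketch `entropy(theta0)` and `entropy_via_psi` should agree up to rounding, and `grad_psi_fd(theta0)` should approximate `eta0` up to the finite-difference error.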

Convex conjugation applies [14] (Section 25) with the definition:

ψ_{*} (η) = \sup \{ θ \cdot η - ψ (θ) \mid θ \in ℝ^{d} \}, η \in ℝ^{d} .

The concave function θ ↦ η · θ − ψ(θ) has gradient mapping θ ↦ η − ∇ψ(θ), and the equation η = ∇ψ(θ) has a solution if and only if η belongs to the interior M° of the marginal polytope. The restriction

ϕ = ψ_{*} |_{M °}

is the Legendre conjugate of ψ, and it is computed by:

ϕ : M ° ∋ η \mapsto {(\nabla ψ)}^{- 1} (η) \cdot η - ψ \circ {(\nabla ψ)}^{- 1} (η) \in ℝ .

The Legendre conjugate ϕ is such that ∇ϕ = (∇ψ)⁻¹, and it provides an alternative parameterization of ℇ with the so-called expectation or mixture parameter η = ∇ψ(θ),

p_{η} = \exp ((T - η) \cdot \nabla ϕ (η) + ϕ (η)) .

(1)

While in the θ parameters, the entropy is H(p_θ) = ψ(θ) − θ · ∇ψ(θ), in the η parameters, the ϕ function gives the negative entropy:

- H (p_{η}) = E_{p_{η}} [\log p_{η}] = ϕ (η)

.

Proposition 1.

Hess ϕ (η) = (Hess ψ(θ))⁻¹ when η = ∇ψ(θ).
The Fisher information matrix of the statistical model given by the exponential family in the θ parameters is I_e(θ) = Cov_{p_θ} (∇ log p_θ, ∇ log p_θ) = Hess ψ(θ).
The Fisher information matrix of the statistical model given by the exponential family in the η parameters is I_m(η) = Cov_{p_η} (∇ log p_η, ∇ log p_η) = Hess ϕ (η).

Proof. Derivation of the equality ∇ ϕ = (∇ψ)⁻¹ gives the first item. The second item is a property of the cumulant generating function ψ. The third item follows from Equation (1). □

2.1. Statistical Manifold

The exponential family ℇ is an elementary manifold in either the θ or the η parameterization, named respectively exponential or mixture parameterization. We discuss now the proper definition of the tangent bundle T ℇ.

Definition 1 (Velocity). If I ∋ t ↦ p_t, I open interval, is a differentiable curve in ℇ, then its velocity vector is identified with its Fisher score:

\frac{D}{d t} p (t) = \frac{d}{d t} \log (p_{t}) .

The capital D notation is taken from differential geometry; see the classical monograph [15].

Definition 2 (Tangent space). In the expression of the curve by the exponential parameters, the velocity is:

\frac{D}{d t} p (t) = \frac{d}{d t} \log (p_{t}) = \frac{d}{d t} (θ (t) \cdot T - ψ (θ (t))) = \dot{θ} (t) \cdot (T - E_{θ (t)} [T]),

(2)

that is, it equals the statistic whose coordinates are

\dot{θ} (t)

in the basis of the sufficient statistics centered at p_t. As a consequence, we identify the tangent space at each p ∈ ℇ with the vector space of centered sufficient statistics, that is:

T_{p} ℰ = Span (T_{j} - E_{p} [T_{j}] | j = 1, \dots, d) .

In the mixture parameterization of Equation (1), the computation of the velocity is:

\begin{array}{l} \frac{D}{d t} p (t) = \frac{d}{d t} \log (p_{t}) = \frac{d}{d t} (\nabla ϕ (η (t)) \cdot (T - η (t)) + ϕ (η (t))) = \\ (Hess ϕ (η (t)) \dot{η} (t)) \cdot (T - η (t)) = \dot{η} (t) \cdot [Hess ϕ (η (t)) (T - η (t))] . \end{array}

(3)

The last equality provides the interpretation of

\dot{η} (t)

as the coordinate of the velocity in the conjugate vector basis Hess ϕ (η(t)) (T − η(t)), that is the basis of velocities along the η coordinates.

In conclusion, the first order geometry is characterized as follows.

Definition 3 (Tangent bundle T ℇ). The tangent space at each p ∈ ℇ is a vector space of random variables T_pℇ = Span (T_j − E_p [T_j]|j = 1, …, d), and the tangent bundle T ℇ = {(p, V)|p ∈ ℇ, V ∈ T_p ℇ}, as a manifold, is defined by the chart:

T ℰ ∋ (e^{θ \cdot T - ψ (θ)}, v \cdot (T - E_{θ} [T])) \mapsto (θ, v) .

(4)

Proposition 2.

If V = v · (T − η) ∈ T_pηℇ, then V is represented in the conjugate basis as:

$\begin{array}{l} V = v \cdot (T - η) = v \cdot {(Hess ϕ (η))}^{- 1} Hess ϕ (η) (T - η) = \\ (Hess ϕ {(η)}^{- 1} v) \cdot Hess ϕ (η) (T - η) . \end{array}$

(5)
The mapping (Hess ϕ (η))⁻¹ maps the coordinates v of a tangent vector V ∈ T_pη ℇ with respect to the basis of centered sufficient statistics to the coordinates v^* with respect to the conjugate basis.
In the θ parameters, the transformation is v ↦ v^* = Hess ψ(θ)v.

Remark 1. In the finite state space case, it is not necessary to go on to the formal construction of a dual tangent bundle, because all finite dimensional vector spaces are isomorphic. However, this step is compulsory in the infinite state space case, as was done in [1]. Moreover, the explicit construction of natural connections and natural parallel transports of the tangent and dual tangent bundle is unavoidable when considering the second-order calculus, as was done in [1,8], in order to compute Hessians and implement Newton methods of optimization. However, the scope of the present paper is restricted to a basic study of gradient flows; hence, from now on, we focus on the Riemannian structure and disregard all second-order topics.

Proposition 3 (Riemannian metric). The tangent bundle has a Riemannian structure with the natural scalar product of each T_pℇ, 〈V, W〉_p = E_p [VW]. In the basis of sufficient statistics, the metric is expressed by the Fisher information matrix I(p) = Cov_p (T, T), while in the conjugate basis, it is expressed by the inverse Fisher matrix I⁻¹(p).

Proof. In the basis of the sufficient statistics, V = v · (T − E_p [T]), W = w · (T − E_p [T]), so that:

{〈 V, W 〉}_{p} = v^{'} E_{p} [(T - E_{p} [T]) {(T - E_{p} [T])}^{'}] w = v^{'} {Cov}_{p} (T, T) w = v^{'} I (p) w,

(6)

where I(p) = Cov_p (T, T) is the Fisher information matrix.

If p = p_θ = p_η, the conjugate basis at p is:

Hess ϕ (η) (T - η) = Hess ψ {(θ)}^{- 1} (T - \nabla ψ (θ)) = I^{- 1} (p) (T - E_{p} [T]),

(7)

so that for elements of the tangent space expressed in the conjugate basis, we have V = v^* · I⁻¹(p) (T − E_p [T]), W = w^* · I⁻¹(p) (T − E_p [T]); thus:

{〈 V, W 〉}_{p} = {v^{*}}^{'} E_{p} [I^{- 1} (p) (T - E_{p} [T]) {(T - E_{p} [T])}^{'} I^{- 1} (p)] w^{*} = {v^{*}}^{'} I^{- 1} (p) w^{*} .

(8)

2.2. Gradient

For each C¹ real function

F : ℰ \to ℝ

, its gradient is defined by taking the derivative along a C¹ curve I ∋ t ↦ p(t), p = p(0), and writing it with the Riemannian metric,

{\frac{d}{d t} F (p (t)) |}_{t = 0} = {〈 \nabla F (p), {\frac{D}{d t} p (t) |}_{t = 0} 〉}_{p}, \nabla F (p) \in T_{p} ℰ .

(9)

If

θ \mapsto \hat{F} (θ)

is the expression of F in the parameter θ and t ↦ θ (t) is the expression of the curve, then

\frac{d}{d t} \hat{F} (θ (t)) = \nabla \hat{F} (θ (t)) \cdot \dot{θ} (t)

, so that at p = p_θ₍₀₎, with velocity

V = \frac{D}{d t} p (t) |_{t = 0} = \dot{θ} (0) \cdot (T - \nabla ψ (θ (0)))

, we obtain the celebrated Amari’s natural gradient of [16]:

{〈 \nabla F (p), V 〉}_{p} = {(Hess ψ {(θ (0))}^{- 1} \nabla \hat{F} (θ (0)))}^{'} Hess ψ (θ (0)) \dot{θ} (0) .

(10)

If

η \mapsto \overset{⌣}{F} (η)

is the expression of F in the parameter η and t ↦ η (t) is the expression of the curve, then

\frac{d}{d t} \overset{⌣}{F} (η (t)) = \nabla \overset{⌣}{F} (η (t)) \cdot \dot{η} (t)

, so that at p = p_η₍₀₎, with velocity

V = \frac{d}{d t} \log (p (t)) |_{t = 0} = \dot{η} (0) \cdot Hess ϕ (η (0)) (T - η (0))

, we obtain:

{〈 \nabla F (p), V 〉}_{p} = {(Hess ϕ {(η (0))}^{- 1} \nabla \overset{⌣}{F} (η (0)))}^{'} Hess ϕ (η (0)) \dot{η} (0) .

(11)

We summarize all notions of gradient in the following definition.

Definition 4 (Gradients).

The random variable ∇F (p) uniquely defined by Equation (9) is called the (geometric) gradient of F at p. The mapping ∇F : ℇ ∋ p ↦ ∇F (p) is a vector field of T ℇ.
The vector $\tilde{\nabla} \hat{F} (θ) = Hess ψ {(θ)}^{- 1} \nabla \hat{F} (θ)$ of Equation (10) is the expression of the geometric gradient in the θ parameter and in the basis of sufficient statistics, and it is called the natural gradient, while $\nabla \hat{F} (θ)$ , which is the expression in the conjugate basis of the sufficient statistics, is called the vanilla gradient.
The vector $\tilde{\nabla} \overset{⌣}{F} (η) = Hess ϕ {(η)}^{- 1} \nabla \overset{⌣}{F} (η)$ of Equation (11) is the expression of the geometric gradient in the η parameter and in the conjugate basis of sufficient statistics, and it is called the natural gradient, while $\nabla \overset{⌣}{F} (η)$ , which is the expression in the basis of sufficient statistics, is called the vanilla gradient.

Given a vector field of ℇ, i.e., a mapping G defined on ℇ, such that G(p) ∈ T_p ℇ, which is called a section of the tangent bundle in the standard differential geometric language, an integral curve from p is a curve I ∋ t ↦ p(t), such that p(0) = p and

\frac{D}{d t} p (t) = G (p (t))

. In the θ parameters, G(p_θ) = Ĝ(θ) · (T − ∇ψ(θ)), so that the differential equation is expressed by

\dot{θ} (t) = \hat{G} (θ (t))

. In the η parameters,

G (p_{η}) = \overset{⌣}{G} (η) \cdot Hess ϕ (η) (T - η)

, and the differential equation is

\dot{η} (t) = \overset{⌣}{G} (η (t))

.

Definition 5 (Gradient flow). The gradient flow of the real function F : ℇ → ℝ is the flow of the differential equation

\frac{D}{d t} p (t) = \nabla F (p (t))

, i.e.,

\frac{d}{d t} p (t) = p (t) \nabla F (p (t))

. The expression in the θ parameters is

\dot{θ} (t) = \tilde{\nabla} \hat{F} (θ (t))

, and the expression in the η parameters is

\dot{η} (t) = \tilde{\nabla} \overset{⌣}{F} (η (t))

.
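As an illustration of the θ expression of the flow, the following is a minimal sketch of an explicit Euler discretization, assuming a hypothetical four-point sample space, a hypothetical objective `f`, and the relaxation F(p) = E_p[f] from the Introduction. Since the interest is in minimization, the sketch follows the negative natural gradient. It uses the fact that, by the score identity of the Introduction, the vanilla gradient of the relaxed function in the θ parameters is Cov_θ(f, T). The step size and the number of steps are arbitrary choices.

```python
import math

T = [(0, 0), (0, 1), (1, 0), (2, 1)]   # hypothetical statistics (T1, T2)
f = [1.0, 3.0, 0.5, 2.0]               # hypothetical objective on the 4 points

def density(theta):
    w = [math.exp(theta[0] * a + theta[1] * b) for a, b in T]
    z = sum(w)
    return [wi / z for wi in w]

def moments(theta):
    """Return E[f], Cov(f, T) and Cov(T, T) under p_theta."""
    p = density(theta)
    ef = sum(pi * fi for pi, fi in zip(p, f))
    et = [sum(pi * t[j] for pi, t in zip(p, T)) for j in range(2)]
    cft = [sum(pi * (fi - ef) * (t[j] - et[j]) for pi, fi, t in zip(p, f, T))
           for j in range(2)]
    ctt = [[sum(pi * (t[i] - et[i]) * (t[j] - et[j]) for pi, t in zip(p, T))
            for j in range(2)] for i in range(2)]
    return ef, cft, ctt

def natural_gradient(theta):
    """Solve Hess psi(theta) g = Cov(f, T) by the explicit 2x2 inverse."""
    _, c, fisher = moments(theta)
    det = fisher[0][0] * fisher[1][1] - fisher[0][1] * fisher[1][0]
    return [(fisher[1][1] * c[0] - fisher[0][1] * c[1]) / det,
            (fisher[0][0] * c[1] - fisher[1][0] * c[0]) / det]

def descend(theta, step=0.05, n_steps=40):
    """Euler steps along minus the natural gradient; records E[f] at each step."""
    values = [moments(theta)[0]]
    for _ in range(n_steps):
        g = natural_gradient(theta)
        theta = [theta[0] - step * g[0], theta[1] - step * g[1]]
        values.append(moments(theta)[0])
    return theta, values

theta_end, values = descend([0.0, 0.0])
```

The list `values` records the relaxed objective along the discretized flow; in this sketch it is expected to end below its starting value while staying above the minimum of `f`.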

The cases of gradient computation we have discussed above are just a special case of a generic argument. Let us briefly study the gradient flow in a general chart f : ζ ↦ p_ζ. Consider the change of parametrization from ζ to θ,

ζ \mapsto p_{ζ} \mapsto θ (p_{ζ}) = I {(p_{ζ})}^{- 1} {Cov}_{p_{ζ}} (T, \log p_{ζ}),

and denote the Jacobian matrix of the parameters’ change by J(ζ). We have:

\begin{array}{l} \log p_{ζ} = T \cdot θ (ζ) - ψ (θ (ζ)) \\ = T \cdot I {(p_{ζ})}^{- 1} {Cov}_{p_{ζ}} (T, \log p_{ζ}) - ψ (I {(p_{ζ})}^{- 1} {Cov}_{p_{ζ}} (T, \log p_{ζ})), \end{array}

and the ζ coordinate basis of the tangent space

T_{p_{ζ}} ℰ

consists of the components of the gradient with respect to ζ,

\nabla (ζ \mapsto \log p_{ζ}) = J^{- 1} (ζ) (T - E_{p_{ζ}} [T])

It should be noted that in this case, the expression of the Fisher information matrix does not have the form of a Hessian of a potential function. In fact, the case of the exponential and the mixture parameters point to a special structure, which is called the Hessian manifold; see [17].

2.3. Gradient Flow in the Mixture Geometry

From now on, we are going to focus on the expression of the gradient flow in the η parameters. From Definition 4, we have:

\tilde{\nabla} \overset{⌣}{F} (η) = Hess ϕ {(η)}^{- 1} \nabla \overset{⌣}{F} (η) = Hess ψ (\nabla ϕ (η)) \nabla \overset{⌣}{F} (η) = I (p_{η}) \nabla \overset{⌣}{F} (η),

where I(p) = Cov_p (T, T). As p ↦ Cov_p (T, T) is the restriction to the simplex of a quadratic function, while p ↦ η is the restriction to the exponential family ℇ of a linear function, in some cases, we can naturally consider the extension of the gradient flow equation outside M°. One notable case is when the function F is a relaxation of a non-constant state space function f : Ω → ℝ, as it is defined in, e.g., [3].

Proposition 4. Let f : Ω → ℝ, and let F (p) = E_p [f] be its relaxation on p ∈ ℇ. It follows:

∇F (p) is the least square projection of f onto T_pℇ, that is:

$\nabla F (p) = I {(p)}^{- 1} {Cov}_{p} (f, T) \cdot (T - E_{p} [T]) .$
The expressions in the exponential parameters θ are $\tilde{\nabla} \hat{F} (θ) = {(Hess ψ (θ))}^{- 1} {Cov}_{θ} (f, T)$ , $\nabla \hat{F} (θ) = {Cov}_{θ} (f, T)$ respectively.
The expressions in the mixture parameters η are $\tilde{\nabla} \overset{⌣}{F} (η) = {Cov}_{η} (f, T)$ and $\nabla \overset{⌣}{F} (η) = Hess ϕ (η) {Cov}_{η} (f, T)$ , respectively.

Proof. On a generic curve through p with velocity V, we have

\frac{d}{d t} E_{p (t)} [f] |_{t = 0} = {Cov}_{p} (f, V) = {〈 f, V 〉}_{p}

. If V ∈ T_pℇ, we can orthogonally project f to get

{〈 \nabla F, V 〉}_{p} = {〈 (I^{- 1} (p) {Cov}_{p} (f, T)) \cdot (T - E_{p} [T]), V 〉}_{p}

.
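The projection property in Proposition 4 can be checked numerically. The following is a minimal sketch, assuming the same kind of hypothetical four-point model and a hypothetical objective `f` (all names and values are illustrative). It builds the random variable I(p)⁻¹ Cov_p(f, T) · (T − E_p[T]) and verifies that the residual of the centered f is orthogonal in L²(p) to each centered sufficient statistic.

```python
import math

T = [(0, 0), (0, 1), (1, 0), (2, 1)]   # hypothetical statistics (T1, T2)
f = [1.0, 3.0, 0.5, 2.0]               # hypothetical objective function
theta = [0.4, -0.2]                    # arbitrary point of the model

def density(th):
    w = [math.exp(th[0] * a + th[1] * b) for a, b in T]
    z = sum(w)
    return [wi / z for wi in w]

def expect(p, g):
    """Expected value of the random variable g under p."""
    return sum(pi * gi for pi, gi in zip(p, g))

def relaxed(th):
    """Relaxed objective in the theta parameters: E_theta[f]."""
    return expect(density(th), f)

p = density(theta)
eta = [expect(p, [t[j] for t in T]) for j in range(2)]
Tc = [[t[j] - eta[j] for t in T] for j in range(2)]      # centered T_j
fc = [fi - expect(p, f) for fi in f]                     # centered f

cov_fT = [expect(p, [a * b for a, b in zip(fc, Tc[j])]) for j in range(2)]
fisher = [[expect(p, [a * b for a, b in zip(Tc[i], Tc[j])]) for j in range(2)]
          for i in range(2)]

# Natural gradient in theta: I(p)^{-1} Cov(f, T), via the 2x2 inverse.
det = fisher[0][0] * fisher[1][1] - fisher[0][1] * fisher[1][0]
coef = [(fisher[1][1] * cov_fT[0] - fisher[0][1] * cov_fT[1]) / det,
        (fisher[0][0] * cov_fT[1] - fisher[1][0] * cov_fT[0]) / det]

# Geometric gradient as a random variable on the sample space.
grad_F = [coef[0] * Tc[0][x] + coef[1] * Tc[1][x] for x in range(4)]

# Residual of the projection and its inner products with the centered T_j.
residual = [fc[x] - grad_F[x] for x in range(4)]
orth = [expect(p, [r * t for r, t in zip(residual, Tc[j])]) for j in range(2)]
```

Here `orth` should vanish up to rounding, and a finite-difference derivative of `relaxed` should agree with `cov_fT`, the vanilla gradient in the θ parameters.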

Remark 2. Let us briefly recall the behavior of the gradient flow in the relaxation case. Let θ_n, n = 1, 2, …, be a minimizing sequence for

\hat{F}

, and let

\bar{p}

be a limit point of the sequence

{(p_{θ_{n}})}_{n}

. It follows that

\bar{p}

has a defective support, in particular

\bar{p} \notin ℰ

; see [18,19]. For a proof along lines coherent with the present paper, see [20] (Theorem 1). It is found that the support

\underline{F} \subset Ω

is exposed, that is

T (\underline{F})

is a face of the marginal polytope M = con {T (x)|x ∈ Ω}. In particular,

E_{\bar{p}} [T] = \bar{η}

belongs to a face of the marginal polytope M. If a is the (interior) orthogonal of the face, that is a · T (x) + b ≥ 0 for all x ∈ Ω and a · T (x) + b = 0 on the exposed set, then

a \cdot (T (x) - \bar{η}) = 0

on the face, so that

a \cdot {Cov}_{\bar{p}} (f, T) = 0

. If we extend the mapping η ↦ Cov_η (f, T) on the closed marginal polytope M to be the limit of the vector field of the gradient on the faces of the marginal polytope, we expect to see that such a vector field is tangent to the faces. This remark is further elaborated below in the binary case.

2.4. The Saturated Model

A case of special tutorial interest is obtained when the exponential family contains all probability densities, that is when

ℰ = P_{>}

. This case has been treated by many authors; here, we use the presentation of [21].

It is convenient to recode the sample space as Ω = {0, …, d}, where x = 0 is a distinguished point. If X is the identity on Ω, we define the sufficient statistics to be the indicator functions of points T_j = (X = j), j = 1, …, d. The saturated exponential family consists of all of the positive densities written as:

p (x; θ) = \exp (\sum_{j = 1}^{d} θ_{j} (X = j) - ψ (θ)),

where:

ψ (θ) = \log (1 + \sum_{j = 1}^{d} e^{θ_{j}}) .

Note that, in this case, the expectation parameter η_j = E ((X = j)) is the probability of case x = j and the marginal polytope is the probability simplex Δ_d.

The gradient mapping is:

η = \nabla ψ (θ) = (\frac{e^{θ_{j}}}{1 + \sum_{i = 1}^{d} e^{θ_{i}}} | j = 1, \dots, d),

the inverse gradient mapping is defined for η ∈]0, 1[^d by:

θ = {(\nabla ψ)}^{- 1} (η) = \nabla ϕ (η) = (\log (\frac{η_{j}}{1 - \sum_{i = 1}^{d} η_{i}}) | j = 1, \dots, d),

the negative entropy (Legendre conjugate) is:

ϕ (η) = η \cdot \nabla ϕ (η) - ψ \circ \nabla ϕ (η) = \sum_{j = 1}^{d} η_{j} \log (\frac{η_{j}}{1 - \sum_{i = 1}^{d} η_{i}}) + \log (1 - \sum_{i = 1}^{d} η_{i}),

the η parameterization (1) of the probability is:

\begin{array}{l} p_{η} = \exp ((T - η) \cdot \nabla ϕ (η) + ϕ (η)) = \\ \exp (\sum_{j = 1}^{d} ((X = j) - η_{j}) \log (\frac{η_{j}}{1 - \sum_{i = 1}^{d} η_{i}}) + \sum_{j = 1}^{d} η_{j} \log (\frac{η_{j}}{1 - \sum_{i = 1}^{d} η_{i}}) + \log (1 - \sum_{i = 1}^{d} η_{i})) = \\ \exp (\sum_{j = 1}^{d} (X = j) \log (\frac{η_{j}}{1 - \sum_{i = 1}^{d} η_{i}}) + \log (1 - \sum_{i = 1}^{d} η_{i})) = \\ \prod_{j = 1}^{d} {(\frac{η_{j}}{1 - \sum_{i = 1}^{d} η_{i}})}^{(X = j)} (1 - \sum_{i = 1}^{d} η_{i}) = {(1 - \sum_{i = 1}^{d} η_{i})}^{(X = 0)} \prod_{j = 1}^{d} η_{j}^{(X = j)} . \end{array}

Remark 3. The previous equation prompts three crucial remarks:

The expression of the probability in the η parameters is a normalized monomial in the parameters.
The expression continuously extends the exponential family to the probabilities in $P_{\geq}$ .
The expression actually is a polynomial parameterization of the signed densities $P^{1}$ .

We proceed to approach the three issues above. The Hessian functions are:

\begin{array}{l} Hess ψ (θ) = diag (p) - p \otimes p, p = {(1 + \sum_{j = 1}^{d} e^{θ_{j}})}^{- 1} e^{θ}, \\ Hess ϕ (η) = diag {(η)}^{- 1} + η_{0}^{- 1} {[1]}_{i, j = 1}^{d}, η_{0} = 1 - \sum_{j = 1}^{d} η_{j} . \end{array}

The matrix Hess ψ(θ) is the Fisher information matrix I(p) of the exponential family at p = p_θ, and the matrix Hess ϕ (η) is the inverse Fisher information matrix I⁻¹(p) at p = p_η. It follows that the natural gradient of a function η ↦ h(η) will be:

\tilde{\nabla} h (η) = {(Hess ϕ (η))}^{- 1} \nabla h (η) = I (p_{η}) \nabla h (η),

whose behavior depends on the following theorem; see [21] (Proposition 3).

Proposition 5.

The matrix I(p) = Hess ψ(θ) = (Hess ϕ (η))⁻¹, that is, the inverse of the Fisher information matrix I_m(η) in the η parameters, is zero on the vertexes of the simplex, only.
The determinant of I(p) is:

$\det (I (p)) = (1 - \sum_{i = 1}^{d} p_{i}) \prod_{i = 1}^{d} p_{i} .$
The determinant of I(p) is zero on the borders of the simplex, only.
On the interior of each facet, the rank of I(p) is (d − 1), and the (d − 1) linearly independent column vectors generate the subspace parallel to the facet itself.
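For a concrete check of the saturated case, the following is a minimal sketch with d = 2, assuming a hypothetical interior point `eta` and hypothetical objective values `f` (both illustrative). It uses Hess ψ = diag(p) − p ⊗ p and Hess ϕ = diag(η)⁻¹ + η₀⁻¹[1], verifies that the two matrices are inverse to each other, compares the determinant of Hess ψ with η₀η₁η₂, and computes the natural gradient of the relaxed function h(η) = E_η[f] in the η parameters.

```python
eta = [0.2, 0.5]              # hypothetical interior point, d = 2
eta0 = 1.0 - sum(eta)         # probability of the distinguished point x = 0
f = [1.0, 4.0, 2.0]           # hypothetical objective values f(0), f(1), f(2)
d = 2

# Hess psi = diag(p) - p p', with p_j = eta_j in the saturated model.
hess_psi = [[(eta[i] if i == j else 0.0) - eta[i] * eta[j] for j in range(d)]
            for i in range(d)]

# Hess phi = diag(eta)^{-1} + (1 / eta0) * (matrix of ones).
hess_phi = [[(1.0 / eta[i] if i == j else 0.0) + 1.0 / eta0 for j in range(d)]
            for i in range(d)]

# Product of the two Hessians; it should be the identity matrix.
product = [[sum(hess_psi[i][k] * hess_phi[k][j] for k in range(d))
            for j in range(d)] for i in range(d)]

# Determinant of Hess psi, to be compared with eta0 * eta1 * eta2.
det_hess_psi = hess_psi[0][0] * hess_psi[1][1] - hess_psi[0][1] * hess_psi[1][0]

# Relaxation h(eta) = f(0) + sum_j eta_j (f(j) - f(0)); its vanilla gradient.
vanilla = [f[j + 1] - f[0] for j in range(d)]

# Natural gradient in the eta parameters: Hess psi times the vanilla gradient.
nat_from_hessian = [sum(hess_psi[i][j] * vanilla[j] for j in range(d))
                    for i in range(d)]

# The same vector written as Cov(f, T_j) = eta_j (f(j) - E[f]).
mean_f = eta0 * f[0] + sum(eta[j] * f[j + 1] for j in range(d))
nat_as_cov = [eta[j] * (f[j + 1] - mean_f) for j in range(d)]
```

Each component of `nat_as_cov` carries the factor η_j, so in this sketch the natural gradient field vanishes in the direction of any coordinate that reaches zero.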

A generic statistical model can be seen as a submanifold of the saturated model, so that the form of the gradient in the submanifold is derived according to the general results in differential geometry. We do not do that here, and we switch to some very specific examples.

3. Toric Models: A Tutorial Example

Exponential families whose sample space is an integer lattice, such as finite subsets of ℤ² or {+1, −1}^d, have special algebro-combinatorial features that fall under the name of algebraic statistics. Seminal papers have been [22,23]. Monographs on the topic are [24–26]. The book [27] covers both information geometry and algebraic statistics.

We do not assume the reader has detailed information about algebraic statistics. In this section, we work on a toy example intended to show both the basic mechanism of algebraic statistics and how the algebraic concepts are applied to the gradient flow problem as it was described in the previous section.

First, we give a general definition of the object on which we focus. A toric model is an exponential family, such that the orthogonal space of the space generated by the sufficient statistics and the constant has a vector basis of integer-valued random variables. We consider this example:

\begin{array}{l} Ω & T_{1} & T_{2} & T_{3} \\ 1 & 0 & 0 & - 2 \\ 2 & 0 & 1 & 1 \\ 3 & 1 & 0 & 2 \\ 4 & 2 & 1 & - 1 \end{array},

(12)

which corresponds to a variation of the classical independence model, where the design corresponds to the vertices of a square. In this example, we moved the point {4} from (1, 1) to (2, 1).

In Equation (12), T₁ and T₂ are the sufficient statistics of the exponential family:

p_{θ} = \exp (θ_{1} T_{1} + θ_{2} T_{2} - ψ (θ)), ψ (θ) = \log (1 + e^{θ_{2}} + e^{θ_{1}} + e^{2 θ_{1} + θ_{2}}),

(13)

T₃ is an integer-valued vector basis of the orthogonal space Span (1, T₁, T₂)^⊥.

For the purpose of the generalization to less trivial examples, it should be noted that

T_{3} = T_{3}^{+} - T_{3}^{-}

, that is (−2, 1, 2,−1) = (0, 1, 2, 0) − (2, 0, 0, 1). The couple

(T_{3}^{+}, T_{3}^{-})

connects the lattice defined by:

ℒ = \{(Y, Z) \in ℤ_{\geq}^{4} \times ℤ_{\geq}^{4} \mid B^{T} Y = B^{T} Z\}, B = [1 T_{1} T_{2}] .

Such a set of generators is called a Markov basis of the lattice; see [22]. Algorithms are available to compute such a set of generators and are implemented, for instance, in the software suite 4ti2; see [28].

The sample space can be identified with the value of the sufficient statistics, hence with a finite subset of ℚ² ⊃ Ω, Ω = {(0, 0), (0, 1), (1, 0), (2, 1)}; see Figure 1. Given a finite subset of ℝ^d, it is a general algebraic fact that there exists a filtering set of monomial functions that is a vector basis of all real functions on the subset itself; see an exposition and the applications to statistics in [24] or [27]. In our case, the monomial basis is 1, T₁, T₂, T₁T₂, and we define the matrix of the saturated model to be:

\begin{array}{l} 1 T_{1} T_{2} T_{1} T_{2} \\ A = \begin{array}{l} 1 \\ 2 \\ 3 \\ 4 \end{array} [\begin{array}{l} 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 2 & 1 & 2 \end{array}], A^{- 1} = \frac{1}{2} [\begin{array}{l} 2 & 0 & 0 & 0 \\ - 2 & 0 & 2 & 0 \\ - 2 & 2 & 0 & 0 \\ 2 & - 1 & - 2 & 1 \end{array}] . \end{array}

(14)

The matrix A one-to-one maps probabilities into expected values,

[\begin{array}{l} 1 & η_{1} & η_{2} & η_{12} \end{array}] = [\begin{array}{l} 1 & E [T_{1}] & E [T_{2}] & E [T_{1} T_{2}] \end{array}] = [\begin{array}{l} p_{1} & p_{2} & p_{3} & p_{4} \end{array}] [\begin{array}{l} 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 2 & 1 & 2 \end{array}],

(15)

and vice versa,

[\begin{array}{l} p_{1} & p_{2} & p_{3} & p_{4} \end{array}] = [\begin{array}{l} 1 & η_{1} & η_{2} & η_{12} \end{array}] [\begin{array}{l} 1 & 0 & 0 & 0 \\ - 1 & 0 & 1 & 0 \\ - 1 & 1 & 0 & 0 \\ 1 & - \frac{1}{2} & - 1 & \frac{1}{2} \end{array}] .

(16)
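Equations (14) to (16) can be verified with exact rational arithmetic. The following is a minimal sketch using Python's `fractions` module; the probability vector `p` is a hypothetical example, and the helper names are illustrative.

```python
from fractions import Fraction as Fr

# Matrix A of Equation (14): rows are the sample points, columns are
# the monomials 1, T1, T2, T1*T2 evaluated at each point.
A = [[1, 0, 0, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 2, 1, 2]]

# Inverse of A as it appears in Equation (16).
A_inv = [[Fr(1), Fr(0), Fr(0), Fr(0)],
         [Fr(-1), Fr(0), Fr(1), Fr(0)],
         [Fr(-1), Fr(1), Fr(0), Fr(0)],
         [Fr(1), Fr(-1, 2), Fr(-1), Fr(1, 2)]]

def row_times(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(4)) for j in range(4)]

def mat_mul(M, N):
    """Matrix product, computed row by row."""
    return [row_times(row, N) for row in M]

identity = [[Fr(int(i == j)) for j in range(4)] for i in range(4)]

# Hypothetical probability vector (p1, p2, p3, p4).
p = [Fr(1, 10), Fr(2, 10), Fr(3, 10), Fr(4, 10)]

moments = row_times(p, A)            # [1, eta1, eta2, eta12], Equation (15)
p_back = row_times(moments, A_inv)   # back to probabilities, Equation (16)
```

For this choice of `p` the moments are (1, 11/10, 3/5, 4/5), and `p_back` recovers `p` exactly.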

On Model (13), the (positive) probabilities are constrained by the model:

\begin{array}{l} Ω & p_{θ} & \exp (θ_{1} T_{1} + θ_{2} T_{2} - \log (1 + e^{θ_{2}} + e^{θ_{1}} + e^{2 θ_{1} + θ_{2}})) \\ 1 & p (1; θ) & \exp (- \log (1 + e^{θ_{2}} + e^{θ_{1}} + e^{2 θ_{1} + θ_{2}})) \\ 2 & p (2; θ) & \exp (θ_{2} - \log (1 + e^{θ_{2}} + e^{θ_{1}} + e^{2 θ_{1} + θ_{2}})) \\ 3 & p (3; θ) & \exp (θ_{1} - \log (1 + e^{θ_{2}} + e^{θ_{1}} + e^{2 θ_{1} + θ_{2}})) \\ 4 & p (4; θ) & \exp (2 θ_{1} + θ_{2} - \log (1 + e^{θ_{2}} + e^{θ_{1}} + e^{2 θ_{1} + θ_{2}})) \end{array} .

(17)

If we introduce the parameters ζ₁ = exp (θ₁), ζ₂ = exp (θ₂), the model is shown to be a (piece of an) algebraic variety, that is a set described by the rational parametric equations:

\begin{array}{l} Ω & p_{ζ} & ζ_{1}^{T_{1}} ζ_{2}^{T_{2}} / (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) \\ 1 & p (1; ζ) & 1 / (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) \\ 2 & p (2; ζ) & ζ_{2} / (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) \\ 3 & p (3; ζ) & ζ_{1} / (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) \\ 4 & p (4; ζ) & ζ_{1}^{2} ζ_{2} / (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) \end{array} .

(18)

The peculiar structure of the toric model is best seen by considering the unnormalized probabilities:

\begin{array}{l} Ω & q_{ζ} & ζ_{1}^{T_{1}} ζ_{2}^{T_{2}} \\ 1 & q (1; ζ) & 1 \\ 2 & q (2; ζ) & ζ_{2} \\ 3 & q (3; ζ) & ζ_{1} \\ 4 & q (4; ζ) & ζ_{1}^{2} ζ_{2} \end{array}, p (x; ζ) = \frac{q (x; ζ)}{1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}} .

(19)

In algebraic terms, the homogeneous coordinates [q₁ : q₂ : q₃ : q₄] belong to the projective space ℙ³. Precisely, the (real) projective space ℙ³ is the set of all non-zero points of ℝ⁴ together with the equivalence relation

[q_{1} : q_{2} : q_{3} : q_{4}] = [{\bar{q}}_{1} : {\bar{q}}_{2} : {\bar{q}}_{3} : {\bar{q}}_{4}]

if, and only if,

[q_{1} : q_{2} : q_{3} : q_{4}] = k [{\bar{q}}_{1} : {\bar{q}}_{2} : {\bar{q}}_{3} : {\bar{q}}_{4}]

, k ≠ 0. The domain of unnormalized signed probabilities as projective points is the open subset

ℙ_{*}^{3}

of ℙ³ where q₁ + q₂ + q₃ + q₄ ≠ 0. On this set, we can compute the normalization:

ℙ_{*}^{3} ∋ [q_{1} : q_{2} : q_{3} : q_{4}] \mapsto [q_{1}, q_{2}, q_{3}, q_{4}] / (q_{1} + q_{2} + q_{3} + q_{4}) \in {}^{*} ℰ,

where ^*ℇ is the affine space generated by the simplex Δ₃. Notice that this embedding produces a number of natural geometrical structures on ^*ℇ.

Because of the form of (13), a positive density p belongs to that family if, and only if, log p ∈ Span (1, T₁, T₂), which, in turn, is equivalent to log p ⊥ T₃. We can rewrite the orthogonality as:

\begin{array}{l} 0 = \sum_{x \in Ω} \log p (x) T_{3} (x) = \sum_{x : T_{3} (x) > 0} \log p (x) T_{3}^{+} (x) - \sum_{x : T_{3} (x) < 0} \log p (x) T_{3}^{-} (x) \\ = \log (\prod_{x : T_{3} (x) > 0} p {(x)}^{T_{3}^{+} (x)}) - \log (\prod_{x : T_{3} (x) < 0} p {(x)}^{T_{3}^{-} (x)}) . \end{array}

Dropping the log function in the last expression, we observe that the positive probabilities described by either Equation (17) with θ₁, θ₂ ∈ ℝ or Equation (18) with ζ₁, ζ₂ ∈ ℝ_> are equivalently described by the equations:

p_{1} + p_{2} + p_{3} + p_{4} - 1 = 0,

(20)

p_{1}^{2} p_{4} - p_{2} p_{3}^{2} = 0.

(21)
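The equivalence between the parametric form (18) and the implicit Equations (20) and (21) can be checked exactly on rational parameter values. The following is a minimal sketch; the values of ζ₁, ζ₂ and the function names are illustrative choices.

```python
from fractions import Fraction as Fr

# Columns of the table in Equation (12).
T1 = [0, 0, 1, 2]
T2 = [0, 1, 0, 1]
T3 = [-2, 1, 2, -1]

def model(z1, z2):
    """Probabilities of Equation (18) for positive parameters z1, z2."""
    q = [z1 ** a * z2 ** b for a, b in zip(T1, T2)]   # unnormalized monomials
    total = sum(q)
    return [qi / total for qi in q]

def binomial_invariant(p):
    """Left-hand side of Equation (21): p1^2 p4 - p2 p3^2."""
    return p[0] ** 2 * p[3] - p[1] * p[2] ** 2

# T3 is orthogonal to the constant 1 and to T1, T2.
orthogonality = [sum(T3), sum(a * c for a, c in zip(T1, T3)),
                 sum(b * c for b, c in zip(T2, T3))]

p_model = model(Fr(3, 2), Fr(1, 4))                   # a point of the model
p_other = [Fr(1, 2), Fr(1, 6), Fr(1, 6), Fr(1, 6)]    # a density outside it
```

In this sketch `binomial_invariant(p_model)` is exactly zero, while `binomial_invariant(p_other)` is not, although both vectors satisfy the normalization (20).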

Equation (21) identifies a surface within the probability simplex Δ₃, which is represented in Figure 2 by the triangulation of a grid of points that satisfy the invariant.

By choosing a basis for the orthogonal space Span (1, T₁, T₂)^⊥, we can embed the marginal polytope of Figure 1 into the associated full marginal polytope. By expressing probabilities as a function of the expectation parameters, Equation (21) identifies a relationship between η₁, η₂ and the expected values of the chosen basis for the orthogonal space. This corresponds to an equivalent invariant in the expectation parameters, which, in turn, identifies a surface in the full marginal polytope.

For instance, consider the full marginal polytope parametrized by η = (η₁, η₂, η₃), with

η_{3} = E [T_{3}]

, which corresponds to the choice of T₃ as a basis for the space orthogonal to the span of the sufficient statistics of the model, together with the constant 1, as in Equation (12). We introduce the following matrix:

\begin{array}{l} 1 T_{1} T_{2} T_{3} \\ B = \begin{array}{l} 1 \\ 2 \\ 3 \\ 4 \end{array} [\begin{array}{l} 1 & 0 & 0 & - 2 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 2 \\ 1 & 2 & 1 & - 1 \end{array}], \end{array}

(22)

and similarly to Equation (15), we use the matrix B to map probabilities one-to-one into expected values, that is:

[\begin{array}{l} 1 \\ η_{1} \\ η_{2} \\ η_{3} \end{array}] = [\begin{array}{l} 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 2 \\ 0 & 1 & 0 & 1 \\ - 2 & 1 & 2 & - 1 \end{array}] [\begin{array}{l} p_{1} \\ p_{2} \\ p_{3} \\ p_{4} \end{array}],

(23)

and:

[\begin{array}{l} p_{1} \\ p_{2} \\ p_{3} \\ p_{4} \end{array}] = [\begin{array}{l} \frac{3}{5} & - \frac{1}{5} & - \frac{2}{5} & - \frac{1}{5} \\ \frac{1}{5} & - \frac{2}{5} & \frac{7}{10} & \frac{1}{10} \\ \frac{2}{5} & \frac{1}{5} & - \frac{3}{5} & \frac{1}{5} \\ - \frac{1}{5} & \frac{2}{5} & \frac{3}{10} & - \frac{1}{10} \end{array}] [\begin{array}{l} 1 \\ η_{1} \\ η_{2} \\ η_{3} \end{array}] .

(24)

Then, by expressing probabilities as a function of the expectation parameters in Equation (21), we obtain the following invariant in η associated with the model:

(4 η_{1} + 3 η_{2} - η_{3} - 2) {(η_{1} + 2 η_{2} + η_{3} - 3)}^{2} + (4 η_{1} - 7 η_{2} - η_{3} - 2) {(η_{1} - 3 η_{2} + η_{3} + 2)}^{2} = 0.

(25)
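
The change of coordinates between p and η and the resulting invariant can be checked with a few lines of Python. This is a sketch with illustrative names, using exact fractions; it relies on the observation that the left-hand side of Equation (25) equals 250 (p₁²p₄ − p₂p₃²) once Equation (24) is substituted into Equation (21).

```python
from fractions import Fraction as Fr

# Matrix of Equation (23): rows are 1, T1, T2, T3 evaluated on the four points.
B_T = [[1, 1, 1, 1],
       [0, 0, 1, 2],
       [0, 1, 0, 1],
       [-2, 1, 2, -1]]

# Matrix of Equation (24), written with exact fractions.
B_T_INV = [[Fr(3, 5), Fr(-1, 5), Fr(-2, 5), Fr(-1, 5)],
           [Fr(1, 5), Fr(-2, 5), Fr(7, 10), Fr(1, 10)],
           [Fr(2, 5), Fr(1, 5), Fr(-3, 5), Fr(1, 5)],
           [Fr(-1, 5), Fr(2, 5), Fr(3, 10), Fr(-1, 10)]]

def matvec(m, v):
    """Matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def eta_from_p(p):
    """(eta1, eta2, eta3) from a (possibly signed) p summing to one."""
    return matvec(B_T, p)[1:]

def p_from_eta(eta):
    """p from (eta1, eta2, eta3), as in Equation (24)."""
    return matvec(B_T_INV, [Fr(1)] + list(eta))

def invariant_p(p):
    """Left-hand side of Equation (21)."""
    p1, p2, p3, p4 = p
    return p1 ** 2 * p4 - p2 * p3 ** 2

def invariant_eta(eta):
    """Left-hand side of Equation (25)."""
    e1, e2, e3 = eta
    return ((4 * e1 + 3 * e2 - e3 - 2) * (e1 + 2 * e2 + e3 - 3) ** 2
            + (4 * e1 - 7 * e2 - e3 - 2) * (e1 - 3 * e2 + e3 + 2) ** 2)
```

Since the map is affine, both directions also apply to signed probabilities, which is what the extension outside the simplex requires.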

From the linear relationship between probabilities and expectation parameters, we know that on the interior of the full marginal polytope, there exists a unique η₃ which can be computed as a function of the other expectation parameters. Solving Equation (25) for η₃ allows one to express explicitly the value of η₃ given (η₁, η₂) and represent the surface associated with the invariant in the full marginal polytope. However, the cubic polynomial in Equation (25) in general admits three roots. The unique value of η₃ can be obtained from the roots of the cubic polynomial, by imposing that η₃ must be real and belong to the full marginal polytope given by Conv {(T₁(x), T₂(x), T₃(x))|x ∈ Ω}.

We recall that the discriminant Δ associated with the cubic function in Equation (25) in the η₃ variable:

a η_{3}^{3} + b η_{3}^{2} + c η_{3} + d = 0,

(26)

with:

a = 1

(27)

b = - 2 η_{1} + η_{2} + 1

(28)

\begin{array}{l} c = - (4 η_{1} + 3 η_{2} - 2) (η_{1} + 2 η_{2} - 3) + \frac{1}{2} {(η_{1} + 2 η_{2} - 3)}^{2} - (4 η_{1} - 7 η_{2} - 2) (η_{1} - 3 η_{2} + 2) + \\ + \frac{1}{2} {(η_{1} - 3 η_{2} + 2)}^{2} \end{array}

(29)

d = - \frac{1}{2} (4 η_{1} + 3 η_{2} - 2) {(η_{1} + 2 η_{2} - 3)}^{2} - \frac{1}{2} (4 η_{1} - 7 η_{2} - 2) {(η_{1} - 3 η_{2} + 2)}^{2}

(30)

is given by:

Δ = 18 a b c d - 4 b^{3} d + b^{2} c^{2} - 4 a c^{3} - 27 a^{2} d^{2} .

(31)

For Δ = 0, the polynomial has a multiple root and all of its roots are real (the root has multiplicity three when, in addition, Δ₀ = 0, with Δ₀ as in Equation (37)); for Δ < 0, we have one real root and two complex conjugate roots, while for Δ > 0, there exist three distinct real roots. The three roots of the polynomial as a function of the coefficients are given by:

η_{3, k} = - \frac{1}{3} (b + u_{k} C + \frac{Δ_{0}}{u_{k} C}),

(32)

for k ∈ {1, 2, 3}, with:

u_{1} = 1,

(33)

u_{2} = \frac{- 1 + i \sqrt{3}}{2},

(34)

u_{3} = \frac{- 1 - i \sqrt{3}}{2},

(35)

and:

C = \sqrt[3]{\frac{Δ_{1} + \sqrt{(Δ_{1}^{2} - 4 Δ_{0}^{3})}}{2}},

(36)

Δ_{0} = b^{2} - 3 a c,

(37)

Δ_{1} = 2 b^{3} - 9 a b c + 27 a^{2} d .

(38)

For the cubic polynomial in η₃ of Equation (25), Δ < 0 for η₂ − 1 ≠ 0 and for:

4 η_{1}^{4} - 8 η_{1}^{3} η_{2} + 24 η_{1}^{2} η_{2}^{2} - 20 η_{1} η_{2}^{3} - 2 η_{2}^{4} - 8 η_{1}^{3} - 12 η_{1}^{2} η_{2} + 4 η_{2}^{3} + 8 η_{1}^{2} + 16 η_{1} η_{2} - η_{2}^{2} - 4 η_{1} - 2 η_{2} + 1 > 0.

(39)

In Figure 3(a), we represent in blue the region of the space (η₁, η₂) where Δ < 0, in red where Δ > 0, and the points where Δ = 0 with a dashed line. For Δ < 0, the only real root is η₃,₁, which identifies the blue surface in the full marginal polytope in Figure 3(b). For Δ > 0, it is easy to verify that only η₃,₂ belongs to the interior of the full marginal polytope parametrized by (η₁, η₂, η₃), since it satisfies the inequalities given by the facets of the marginal polytope, and is represented in Figure 3(b) by the red surface. Finally, real roots coincide for Δ = 0, that is, for η₂ = 1 or where:

4 η_{1}^{4} - 8 η_{1}^{3} η_{2} + 24 η_{1}^{2} η_{2}^{2} - 20 η_{1} η_{2}^{3} - 2 η_{2}^{4} - 8 η_{1}^{3} - 12 η_{1}^{2} η_{2} + 4 η_{2}^{3} + 8 η_{1}^{2} + 16 η_{1} η_{2} - η_{2}^{2} - 4 η_{1} - 2 η_{2} + 1 = 0.

(40)
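
The selection of η₃ among the roots of the cubic can be sketched numerically as follows. The coefficients are obtained by expanding Equation (25) and dividing by −2, the roots use the general cubic formula in its standard form (with Δ₁ = 2b³ − 9abc + 27a²d), and a root is kept when it is real and the probabilities of Equation (24) are non-negative. Function names and tolerances are illustrative assumptions, not part of the original derivation.

```python
import cmath

def cubic_coefficients(eta1, eta2):
    """Coefficients (a, b, c, d) of the monic cubic in eta3 from Equation (25)."""
    A1, B1 = 4 * eta1 + 3 * eta2 - 2, eta1 + 2 * eta2 - 3
    A2, B2 = 4 * eta1 - 7 * eta2 - 2, eta1 - 3 * eta2 + 2
    a = 1.0
    b = -(A1 + A2) / 2 + B1 + B2            # equals -2*eta1 + eta2 + 1
    c = -(A1 * B1 + A2 * B2) + (B1 ** 2 + B2 ** 2) / 2
    d = -(A1 * B1 ** 2 + A2 * B2 ** 2) / 2
    return a, b, c, d

def cubic_roots(a, b, c, d):
    """Three complex roots of a x^3 + b x^2 + c x + d by the general cubic formula."""
    d0 = b * b - 3 * a * c
    d1 = 2 * b ** 3 - 9 * a * b * c + 27 * a * a * d
    s = cmath.sqrt(d1 * d1 - 4 * d0 ** 3)
    C = ((d1 + s) / 2) ** (1 / 3)
    if abs(C) < 1e-12:                      # avoid a zero divisor: use the other sign
        C = ((d1 - s) / 2) ** (1 / 3)
    if abs(C) < 1e-12:                      # d0 = d1 = 0: triple root
        return [complex(-b / (3 * a))] * 3
    units = [1, complex(-0.5, 3 ** 0.5 / 2), complex(-0.5, -(3 ** 0.5) / 2)]
    return [-(b + u * C + d0 / (u * C)) / (3 * a) for u in units]

def probabilities(eta1, eta2, eta3):
    """p as a function of eta, Equation (24)."""
    return [(3 - eta1 - 2 * eta2 - eta3) / 5,
            (2 - 4 * eta1 + 7 * eta2 + eta3) / 10,
            (2 + eta1 - 3 * eta2 + eta3) / 5,
            (-2 + 4 * eta1 + 3 * eta2 - eta3) / 10]

def eta3_on_model(eta1, eta2, tol=1e-7):
    """Real roots eta3 such that (eta1, eta2, eta3) lies in the full marginal polytope."""
    admissible = []
    for r in cubic_roots(*cubic_coefficients(eta1, eta2)):
        if abs(r.imag) < tol and min(probabilities(eta1, eta2, r.real)) > -tol:
            admissible.append(r.real)
    return admissible
```

For (η₁, η₂) in the interior of the marginal polytope with η₂ ≠ 1, the list returned by `eta3_on_model` is expected to contain a single value.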

In the polynomial ring ℚ [p₁, p₂, p₃, p₄], the model ideal:

I = 〈 p_{1} + p_{2} + p_{3} + p_{4} - 1, p_{1}^{2} p_{4} - p_{2} p_{3}^{2} 〉

(41)

consists of all the polynomials of the form:

A (p_{1} + p_{2} + p_{3} + p_{4} - 1) + B (p_{1}^{2} p_{4} - p_{2} p_{3}^{2}), \quad A, B \in ℚ [p_{1}, p_{2}, p_{3}, p_{4}] .

The algebraic variety of

I

uniquely extends the exponential family outside the positive octant. In the language of commutative algebra, it is the real Zariski closure of the exponential family model, cf. [29]. It is a notable example of toric variety. The general theory is in the monograph [30], and the applications to statistical models were first discussed in [31,32].

Let us discuss in some detail the parameterization of the toric variety as the submanifold of ℝ⁴ defined by Equations (20) and (21). The Jacobian matrix is:

J = [\begin{array}{l} 1 & 1 & 1 & 1 \\ 2 p_{1} p_{4} & - p_{3}^{2} & - 2 p_{2} p_{3} & p_{1}^{2} \end{array}] .

It has rank one, that is, there is a singularity, if, and only if,

2 p_{1} p_{4} = - p_{3}^{2} = - 2 p_{2} p_{3} = p_{1}^{2} .

This is equivalent to

p_{1}^{2} = p_{3}^{2} = 0

, which is a subspace of dimension two, whose intersection with Equation (20), is a line

C

in the affine space ^*ℇ = {p ∈ ℝ⁴|p₁ + p₂ + p₃ + p₄ = 1}. This (double) critical line intersects the simplex along the edge δ₂ ↔ δ₄. Outside

C

, that is in the open complement set, the equations of the toric variety are locally solvable in two among the p_i’s under the condition that the corresponding minor is not zero. To have a picture of what this critical set looks like, let us intersect our surface with the plane p₃ = 0. On the affine space p₁ + p₂ + p₄ = 1 we have

p_{1}^{2} p_{4} = 0

, that is the union of the double line

p_{1}^{2} = 0

with the line p₄ = 0.

In the following, we derive a parameterization based on an algebraic argument, the Bézout theorem. It is remarkable that the cubic surface defined by Equations (20) and (21) is a well-known example of a ruled surface; see Exercise 5.8.15 in [33]. Indeed, the singular line is a double line, so that the residual intersection of the cubic surface with any plane through the singular line is of degree 1 = 3 − 2, by the Bézout theorem, and thus, it is a line.

The line

C

is said to be double because the polynomial

p_{1}^{2} p_{4} - p_{2} p_{3}^{2}

belongs to the ideal generated by

p_{1}^{2}

and

p_{3}^{2}

. Let us consider the sheaf of planes through the singular line defined for each [α : β] ∈ P¹ by the equations:

P [α : β] = {p_{1} + p_{2} + p_{3} + p_{4} - 1 = 0, α p_{1} + β p_{3} = 0} .

Let us intersect each plane

P [α : β]

of the sheaf with the model variety

M

by solving the system of equations:

{\begin{cases} p_{1} + p_{2} + p_{3} + p_{4} & = 1 \\ p_{1}^{2} p_{4} - p_{2} p_{3}^{2} & = 0 \\ α p_{1} + β p_{3} & = 0 \end{cases} .

(42)

On the critical line

C

, a generic point is parameterized as p(τ, 0) = (0, τ, 0, 1 − τ), which satisfies Equation (42) for τ ∈ ℝ. If 0 ≤ τ ≤ 1, then p(τ, 0) belongs to the edge δ₂ ↔ δ₄.

As the critical line is double and the intersection of the model variety with the plane of the sheaf is a cubic curve, we expect the remaining part to be of degree 3 − 2 = 1, that is to be a line. Assume first α, β ≠ 0. Outside the critical line, as p₁, p₃ are not both zero and αp₁ + βp₃ = 0, then αp₁ = − βp₃ ≠ 0. It follows (αp₁)² = (βp₃)²≠ 0; hence:

p_{1}^{2} p_{4} - p_{2} p_{3}^{2} = 0 \Rightarrow β^{2} {(α p_{1})}^{2} p_{4} - α^{2} p_{2} {(β p_{3})}^{2} = 0 \Rightarrow β^{2} p_{4} - α^{2} p_{2} = 0.

We have found that for α, β ≠ 0, the intersection between the plane

P [α : β]

and the model variety

M

is the union of the critical line

C

and the line of equations:

{\begin{cases} p_{1} + p_{2} + p_{3} + p_{4} & = 1 \\ α p_{1} + β p_{3} & = 0 \\ - α^{2} p_{2} + β^{2} p_{4} & = 0 \end{cases} .

(43)

This line intersects the critical line where:

p_{1} = p_{3} = 0, p_{2} + p_{4} = 1, - α^{2} p_{2} + β^{2} p_{4} = 0,

that is in the point:

p ([α : β], 0) = (0, \frac{β^{2}}{α^{2} + β^{2}}, 0, \frac{α^{2}}{α^{2} + β^{2}}) .

In parametric form, the line in Equations (43) is:

p ([α : β], t) = p ([α : β], 0) + u t,

with

u = (β, \frac{β^{2} (α - β)}{α^{2} + β^{2}}, - α, \frac{α^{2} (α - β)}{α^{2} + β^{2}}),

\begin{array}{l} p_{1} ([α : β], t) = β t \\ p_{2} ([α : β], t) = \frac{β^{2}}{α^{2} + β^{2}} + \frac{β^{2} (α - β)}{α^{2} + β^{2}} t \\ p_{3} ([α : β], t) = - α t \\ p_{4} ([α : β], t) = \frac{α^{2}}{α^{2} + β^{2}} + \frac{α^{2} (α - β)}{α^{2} + β^{2}} t . \end{array}

(44)

The same equations hold in the previously excluded case αβ = 0.
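
The parametric form in Equation (44) can be checked on a few planes of the sheaf, including the cases αβ = 0. The following is a sketch with illustrative names; it uses exact fractions and also accepts values of t for which the point is a signed probability.

```python
from fractions import Fraction

def line_point(alpha, beta, t):
    """Point p([alpha : beta], t) of Equation (44); (alpha, beta) != (0, 0)."""
    alpha, beta, t = Fraction(alpha), Fraction(beta), Fraction(t)
    n = alpha ** 2 + beta ** 2
    common = 1 + (alpha - beta) * t
    return [beta * t,
            beta ** 2 / n * common,
            -alpha * t,
            alpha ** 2 / n * common]

def on_model_variety(p):
    """True when p satisfies Equations (20) and (21) exactly."""
    p1, p2, p3, p4 = p
    return sum(p) == 1 and p1 ** 2 * p4 - p2 * p3 ** 2 == 0
```

For instance, `line_point(0, 1, t)` gives (t, 1 − t, 0, 0), that is, the line through the edge δ₁ ↔ δ₂.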

Positive values of components 1 and 3 of the probability are obtained in Equation (44) for αβ < 0 and βt > 0, say α < 0, β > 0, t > 0. In this case, we have for component 2:

\frac{β^{2}}{α^{2} + β^{2}} + \frac{β^{2} (α - β)}{α^{2} + β^{2}} t = \frac{β^{2}}{α^{2} + β^{2}} (1 - (β - α) t),

which is positive if t < (β − α)⁻¹. The same condition applies to component 4. As

[α : β] = [\frac{α}{β - α} : \frac{β}{β - α}]

, we can always assume β > 0 and β − α = 1 that is, α = β − 1; hence β < 1. The parameterization of the positive probabilities in the model becomes:

\begin{array}{l} p_{1} (α, t) = (α + 1) t \\ p_{2} (α, t) = \frac{α^{2} - (α^{2} + 2 α + 1) t + 2 α + 1}{2 α^{2} + 2 α + 1} \\ p_{3} (α, t) = - α t \\ p_{4} (α, t) = - \frac{α^{2} t - α^{2}}{2 α^{2} + 2 α + 1} \end{array}, 0 < t < 1, - 1 < α < 0.

(45)

For example, with

α = - \frac{1}{2}

, we have:

\begin{array}{l} p_{1} (α, t) = \frac{1}{2} t \\ p_{2} (α, t) = \frac{1}{2} (1 - t) \\ p_{3} (α, t) = \frac{1}{2} t \\ p_{4} (α, t) = \frac{1}{2} (1 - t) \end{array}, 0 < t < 1.
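
As a quick check of Equation (45) and of the example above, the normalized parameterization can be written as a short Python function. The sketch below uses an illustrative name and exact fractions; nothing prevents evaluating it outside −1 < α < 0, 0 < t < 1, where it returns signed probabilities that still satisfy Equations (20) and (21).

```python
from fractions import Fraction

def ruled_surface_point(alpha, t):
    """p(alpha, t) of Equation (45); positive for -1 < alpha < 0 and 0 < t < 1."""
    alpha, t = Fraction(alpha), Fraction(t)
    d = 2 * alpha ** 2 + 2 * alpha + 1      # never zero for real alpha
    return [(alpha + 1) * t,
            (alpha + 1) ** 2 * (1 - t) / d,
            -alpha * t,
            alpha ** 2 * (1 - t) / d]
```

With α = −1/2 and t = 1/3, the function returns (1/6, 1/3, 1/6, 1/3), in agreement with the line displayed above.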

In Figure 4(a), we represented the surface associated with the invariant of Equation (21) as a ruled surface in the probability simplex, according to Equations (45), where the blue line corresponds to the case

α = - \frac{1}{2}

. The ruled surface corresponds to the surface in Figure 2 that was approximated by the triangularization of a grid of points satisfying the invariant. In Figure 4(b), we represent the same lines of Figure 4(a) in the chart (α, t).

From Equation (45), we can express the expectation parameters η as a function of (α, t), i.e.,

η_{1} = \frac{2 α^{2} - (2 α^{3} + 4 α^{2} + α) t}{2 α^{2} + 2 α + 1},

(46)

η_{2} = - t + 1,

(47)

η_{3} = - \frac{(8 α^{3} + 12 α^{2} + 10 α + 3) t - 2 α - 1}{2 α^{2} + 2 α + 1} .

(48)

Notice that the dependence on (α, t) is rational. In Figure 5(a), the ruled surface has been represented in the full marginal polytope, while in Figure 5(b), the lines have been projected over the marginal polytope.

Let us invert Equation (45) to obtain the corresponding chart p ↦ (β, t). From p₁ and p₃, we obtain β = p₁/(p₁ + p₃). As p₂ + p₄ = 1 − t, we have the chart:

\begin{array}{l} β = \frac{p_{1}}{p_{1} + p_{3}}, \\ t = 1 - p_{2} - p_{4} = p_{1} + p_{3} . \end{array}

It is remarkable that the model depends on the probability restricted to {1, 3}; similarly, the expectation parameters depend on p₁ and p₃ only.
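
The chart and its inverse can be sketched as follows, which also makes explicit that a point of the model is recovered from p₁ and p₃ alone. The function names are illustrative, and the inverse is Equation (45) rewritten with α = β − 1.

```python
from fractions import Fraction

def chart(p1, p3):
    """(beta, t) from the two coordinates p1 and p3; requires p1 + p3 != 0."""
    p1, p3 = Fraction(p1), Fraction(p3)
    t = p1 + p3
    return p1 / t, t

def point_from_chart(beta, t):
    """Inverse of the chart: Equation (45) with alpha = beta - 1."""
    beta, t = Fraction(beta), Fraction(t)
    alpha = beta - 1
    d = alpha ** 2 + beta ** 2              # equals 2 alpha^2 + 2 alpha + 1
    return [beta * t,
            beta ** 2 * (1 - t) / d,
            -alpha * t,
            alpha ** 2 * (1 - t) / d]
```

Composing the two functions, `point_from_chart(*chart(p1, p3))` reconstructs p₂ and p₄ from p₁ and p₃.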

From the theory of exponential families, we know that the gradient mapping:

(θ_{1}, θ_{2}) \mapsto \nabla ψ (θ_{1}, θ_{2}) = [\frac{2 e^{(2 θ_{1} + θ_{2})} + e^{θ_{1}}}{e^{(2 θ_{1} + θ_{2})} + e^{θ_{1}} + e^{θ_{2}} + 1} \frac{e^{(2 θ_{1} + θ_{2})} + e^{θ_{2}}}{e^{(2 θ_{1} + θ_{2})} + e^{θ_{1}} + e^{θ_{2}} + 1}]

is one-to-one from ℝ² onto the interior of the marginal polytope M; see Figure 3(b). The equations:

\begin{array}{l} η_{1} = \frac{ζ_{1} + 2 ζ_{1}^{2} ζ_{2}}{1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}}, \\ η_{2} = \frac{ζ_{2} + ζ_{1}^{2} ζ_{2}}{1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}}, \end{array}

are uniquely solvable for (η₁, η₂) ∈ M°. We study the local solvability in ζ₁, ζ₂ of:

\begin{array}{l} (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) η_{1} = ζ_{1} + 2 ζ_{1}^{2} ζ_{2}, \\ (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) η_{2} = ζ_{2} + ζ_{1}^{2} ζ_{2}, \end{array}

that is,

\begin{array}{l} 0 = η_{1} + (η_{1} - 1) ζ_{1} + η_{1} ζ_{2} + (η_{1} - 2) ζ_{1}^{2} ζ_{2}, \\ 0 = η_{2} + η_{2} ζ_{1} + (η_{2} - 1) ζ_{2} + (η_{2} - 1) ζ_{1}^{2} ζ_{2} . \end{array}

The Jacobian is:

[\begin{array}{l} (η_{1} - 1) + 2 (η_{1} - 2) ζ_{1} ζ_{2} & η_{1} + (η_{1} - 2) ζ_{1}^{2} \\ η_{2} + 2 (η_{2} - 1) ζ_{1} ζ_{2} & (η_{2} - 1) + (η_{2} - 1) ζ_{1}^{2} \end{array}] .

If we introduce the extra variable η₁₂, from Equations (15) and (18) we have the system:

\begin{array}{l} (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) η_{1} = ζ_{1} + 2 ζ_{1}^{2} ζ_{2}, \\ (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) η_{2} = ζ_{2} + ζ_{1}^{2} ζ_{2}, \\ (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) η_{12} = 2 ζ_{1}^{2} ζ_{2}, \end{array}

Instead, if we use the variable η₃, from Equations (16) and (41), it is possible to derive the equation of the model variety in the η₁, η₂, η₃ parameters. From Equation (18), we have:

\begin{array}{l} η_{1} = E_{ζ} [T_{1}] = \frac{ζ_{1} + 2 ζ_{1}^{2} ζ_{2}}{1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}}, \\ η_{2} = E_{ζ} [T_{2}] = \frac{ζ_{2} + ζ_{1}^{2} ζ_{2}}{1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}}, \\ η_{3} = E_{ζ} [T_{3}] = \frac{- 2 + ζ_{2} + 2 ζ_{1} - ζ_{1}^{2} ζ_{2}}{1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}} . \end{array}

Let us solve for the ζ, that is:

There is another way to derive the model constraint in the η. In the example, the sample space has four points; the monomials 1, T₁, T₂, T₁T₂ are a vector basis of the linear space of the columns of the matrix A, in particular T₃ is a linear combination:

\begin{array}{l} Ω & 1 & T_{1} & T_{2} & T_{1} T_{2} & T_{3} \\ 1 & 1 & 0 & 0 & 0 & - 2 \\ 2 & 1 & 0 & 1 & 0 & 1 \\ 3 & 1 & 1 & 0 & 0 & 2 \\ 4 & 1 & 2 & 1 & 2 & - 1 \\ & - 2 & 4 & 3 & - 5 & = \end{array}

It follows that:

\begin{array}{l} η_{3} = E_{θ} [T_{3}] = E_{θ} [- 2 + 4 T_{1} + 3 T_{2} - 5 T_{1} T_{2}] \\ = - 2 + 4 E_{θ} [T_{1}] + 3 E_{θ} [T_{2}] - 5 {Cov}_{θ} (T_{1}, T_{2}) - 5 E_{θ} [T_{1}] E_{θ} [T_{2}] \\ = - 2 + 4 \partial_{1} ψ (θ) + 3 \partial_{2} ψ (θ) - 5 \partial_{1} \partial_{2} ψ (θ) - 5 \partial_{1} ψ (θ) \partial_{2} ψ (θ) \\ = - 2 + 4 η_{1} + 3 η_{2} - 5 \partial_{1} \partial_{2} ψ (θ) - 5 η_{1} η_{2} . \end{array}
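
The linear combination T₃ = −2 + 4T₁ + 3T₂ − 5T₁T₂ and its consequence for η₃ are easy to verify pointwise. The sketch below uses illustrative names; the identity for η₃ is linear in p, so it holds for any p summing to one and not only on the model.

```python
from fractions import Fraction

T1 = [0, 0, 1, 2]
T2 = [0, 1, 0, 1]
T3 = [-2, 1, 2, -1]

def expectation(p, g):
    """Expected value of the function g (a list of four values) under p."""
    return sum(px * gx for px, gx in zip(p, g))

def eta3_from_moments(p):
    """eta3 = -2 + 4 E[T1] + 3 E[T2] - 5 E[T1 T2], valid when sum(p) == 1."""
    t1t2 = [a * b for a, b in zip(T1, T2)]
    return (-2 + 4 * expectation(p, T1) + 3 * expectation(p, T2)
            - 5 * expectation(p, t1t2))
```

Writing E[T₁T₂] as the covariance plus the product of the means gives the second line of the display above.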

3.1. Border

Let us consider the points in the model variety that are probabilities, that is,

p_{1} + p_{2} + p_{3} + p_{4} = 1, p_{1}^{2} p_{4} = p_{2} p_{3}^{2}, p_{1}, p_{2}, p_{3}, p_{4} \geq 0.

(49)

From the equation above, we see that single zeros are not allowed, that is to say there are no intersections between the model in Equation (49) and the open facets of the probability simplex. We now consider the full marginal polytope obtained by adding the sufficient statistics T₁T₂, and parametrized by (η₁, η₂, η₁₂). By Equation (16), the marginal polytope is represented by the inequalities:

\begin{array}{l} p_{1} = 1 - η_{1} - η_{2} + η_{12} \geq 0, \\ p_{2} = η_{2} - \frac{1}{2} η_{12} \geq 0, \\ p_{3} = η_{1} - η_{12} \geq 0, \\ p_{4} = \frac{1}{2} η_{12} \geq 0, \end{array}

which is a convex set with vertices (0, 0, 0), (0, 1, 0), (1, 0, 0), (2, 1, 2), namely the full marginal polytope associated with the sufficient statistics {T₁, T₂, T₁T₂}. As the critical set is the edge δ₂ ↔ δ₄ in the p space, it is the edge (0, 1, 0) ↔ (2, 1, 2) in the η space.

We have the following possible models on the border of the probability simplex and on the border of the full marginal polytope, where the values for η₁ and η₂ are obtained from Equation (15).

\begin{array}{l} p_{1} & p_{2} & p_{3} & p_{4} & η_{1} & η_{2} \\ 0 & 0 & + & + & p_{3} + 2 p_{4} & p_{4} \\ 0 & + & 0 & + & 2 p_{4} & p_{2} + p_{4} \\ + & 0 & + & 0 & p_{3} & 0 \\ + & + & 0 & 0 & 0 & p_{2} \end{array} \qquad \begin{array}{l} p_{1} & p_{2} & p_{3} & p_{4} & η_{1} & η_{2} \\ + & 0 & 0 & 0 & 0 & 0 \\ 0 & + & 0 & 0 & 0 & 1 \\ 0 & 0 & + & 0 & 1 & 0 \\ 0 & 0 & 0 & + & 2 & 1 \end{array}

That is, the domains that can be support of probabilities in the algebraic model are the faces of the marginal polytope. This is general; see [20,34].

3.2. Fisher Information

Let us consider the covariance matrix of the sufficient statistics. Let us denote by A_{|12} the block of the two central columns of A in Equation (14) and by p the row vector of probabilities. Then, the covariance matrix is:

A_{| 12}^{T} diag (p) A_{| 12} - {(p A_{| 12})}^{T} p A_{| 12} = A_{| 12}^{T} diag (p) A_{| 12} - A_{| 12}^{T} p^{T} p A_{| 12} = A_{| 12}^{T} (diag (p) - p^{T} p) A_{| 12} .

In each of the cases of probabilities supported by a single point, the matrix diag (p) − p^T p is zero; hence, the covariance matrix is zero. In each of the cases where the probability is supported by a facet, say {1, 2}, the matrix diag (p) − p^T p reduces to the corresponding block, and the covariance matrix is:

\begin{array}{l} [\begin{array}{l} 0 & 0 & 1 & 2 \\ 0 & 1 & 0 & 1 \end{array}] [\begin{array}{l} p_{1} - p_{1}^{2} & - p_{1} p_{2} & 0 & 0 \\ - p_{1} p_{2} & p_{2} - p_{2}^{2} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{array}] [\begin{array}{l} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 2 & 1 \end{array}] \\ = [\begin{array}{l} 0 & 0 \\ 0 & 1 \end{array}] [\begin{array}{l} p_{1} - p_{1}^{2} & - p_{1} p_{2} \\ - p_{1} p_{2} & p_{2} - p_{2}^{2} \end{array}] [\begin{array}{l} 0 & 0 \\ 0 & 1 \end{array}] \\ = [\begin{array}{l} 0 & 0 \\ 0 & p_{2} - p_{2}^{2} \end{array}] . \end{array}

The space generated by the covariance matrix is ℚ (0, 1), that is the affine space that contains the facet itself. Analogous results hold for each facet, and this result is general.
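
The degeneracy of the covariance matrix on the border can be reproduced numerically. The following sketch computes A_{|12}^T (diag(p) − p^T p) A_{|12} entry by entry with exact fractions; the names are illustrative.

```python
from fractions import Fraction

# Columns T1 and T2 of the design matrix, one row per sample point.
A12 = [[0, 0], [0, 1], [1, 0], [2, 1]]

def covariance(p):
    """Covariance matrix of (T1, T2) under the probability vector p."""
    mean = [sum(p[x] * A12[x][i] for x in range(4)) for i in range(2)]
    return [[sum(p[x] * A12[x][i] * A12[x][j] for x in range(4)) - mean[i] * mean[j]
             for j in range(2)]
            for i in range(2)]

def determinant(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]
```

On a vertex the matrix is zero, and on a facet such as {1, 2} or {3, 4} it has rank one, so that its determinant vanishes.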

We note that the determinant of the covariance matrix is a polynomial of degree six in the indeterminates p₁, p₂, p₃. This polynomial is zero on each facet.

The η parameters can be given as a function of either θ or ζ. We have:

\begin{array}{l} η & A^{T} [p ζ] \\ η_{1} & (ζ_{1} + 2 ζ_{1}^{2} ζ_{2}) / (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) \\ η_{2} & (ζ_{2} + ζ_{1}^{2} ζ_{2}) / (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) \\ η_{3} & (- 2 + ζ_{2} + 2 ζ_{1} - ζ_{1}^{2} ζ_{2}) / (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) \end{array}

(50)

We know from the theory of exponential families that the mapping:

] 0, \infty [ \times ] 0, \infty [ \ni (ζ_{1}, ζ_{2}) \mapsto (η_{1}, η_{2}) \in {Conv \{(T_{1} (x), T_{2} (x)) | x \in Ω\}}^{\circ}

is one-to-one. We look for an algebraic inversion of the equations:

\begin{array}{l} (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) η_{1} = ζ_{1} + 2 ζ_{1}^{2} ζ_{2}, \\ (1 + ζ_{2} + ζ_{1} + ζ_{1}^{2} ζ_{2}) η_{2} = ζ_{2} + ζ_{1}^{2} ζ_{2} . \end{array}

If we rewrite Equations (50) as polynomials in ζ₁, ζ₂, we obtain:

η_{1} + (η_{1} - 1) ζ_{1} + η_{1} ζ_{2} + (η_{1} - 2) ζ_{1}^{2} ζ_{2} = 0,

(51)

η_{2} + η_{2} ζ_{1} + (η_{2} - 1) ζ_{2} + (η_{2} - 1) ζ_{1}^{2} ζ_{2} = 0,

(52)

η_{3} + 2 + (η_{3} - 2) ζ_{1} + (η_{3} - 1) ζ_{2} + (η_{3} + 1) ζ_{1}^{2} ζ_{2} = 0.

(53)

Gauss elimination produces a linear system in ζ₁, ζ₂ with coefficients that are polynomials in η₁, η₂, η₃ to be considered with the implicit equation derived from

p_{1}^{2} p_{4} - p_{2} p_{3}^{2} = 0

. The system is:

\begin{matrix} - 2 η_{2} η_{3} - 2 η_{1} + 2 η_{2} = (- 2 η_{2} η_{3} - 2 η_{1} + 2) ζ_{1} + (- 2 η_{2} η_{3} + 2 η_{2} + 2 η_{3} - 2) ζ_{2}, \\ η_{2} = η_{2} ζ_{1} + (η_{2} - 1) ζ_{2} . \end{matrix}

3.3. Extension of the Model

In this subsection, we study an extension to signed probabilities of the exponential family in Equations (12) and (13) based on the representation of the statistical model as a ruled surface in the probability simplex. Our motivation for such an analysis is the study of the stability of the critical points of a gradient field in the η parameters, in particular when the critical points belong to the boundary of the model. Indeed, by extending the gradient field outside the marginal polytope, we can identify open neighborhoods for critical points on the boundary of the polytope, which allow one to study the convergence of the differential equations associated with the gradient flows, for instance by means of Lyapunov stability.

In the following, we describe in more detail how the extension can be obtained. Let a be a point along the edge δ₂ ↔ δ₄ of the full marginal polytope parametrized by (η₁, η₂, η₃) and b the coordinates of the corresponding point over δ₁ ↔ δ₃ obtained by intersecting the line of the ruled surface through a with the edge δ₁ ↔ δ₃. The values of the η₂ coordinate for a and b are one and zero, respectively. The other coordinates of b depend on those of a through α. First, we obtain the values of the η₃ coordinates as a function of the η₁ coordinate. For a, we find the equation of the line to which δ₂ ↔ δ₄ belongs, given by:

(\begin{array}{l} η_{1} \\ η_{2} \\ η_{3} \end{array}) = (\begin{array}{l} 0 \\ 1 \\ 1 \end{array}) + u (\begin{array}{l} 2 \\ 0 \\ - 2 \end{array}) = (\begin{array}{l} 2 u \\ 1 \\ 1 - 2 u \end{array}),

(54)

from which we obtain η₃ = 1 − η₁. Similarly, for the η₃ coordinate of b, we consider the line through δ₁ ↔ δ₃, that is:

(\begin{array}{l} η_{1} \\ η_{2} \\ η_{3} \end{array}) = (\begin{array}{l} 0 \\ 0 \\ - 2 \end{array}) + u (\begin{array}{l} 1 \\ 0 \\ 4 \end{array}) = (\begin{array}{l} u \\ 0 \\ 4 u - 2 \end{array}),

(55)

which gives us η₃ = 4η₁ − 2. Finally, for the η₁ coordinate, we use Equations (44). In a, since t = 0 and p₁ = p₃ = 0, then

p_{2} = \frac{β^{2}}{α^{2} + β^{2}}

and

p_{4} = \frac{α^{2}}{α^{2} + β^{2}}

. From Equation (23), it follows that:

η_{1} = \frac{2 α^{2}}{2 α^{2} + 2 α + 1} .

(56)

Similarly, for b, we have p₂ = p₄ = 0 and t = 1, so that p₁ = α + 1 and p₃ = −α. From Equation (23), it follows that:

η_{1} = - α .

(57)

As a result, the coordinates of a and b both depend on α as follows,

a = (\frac{2 α^{2}}{2 α^{2} + 2 α + 1}, 1, \frac{2 α + 1}{2 α^{2} + 2 α + 1})

(58)

b = (- α, 0, - 4 α - 2)

(59)

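The ruled surface in the full marginal polytope is given by the lines through a and b described by the following parametric representation, for −1 < α < 0 and 0 < t < 1, where the parameter t runs from b (t = 0) to a (t = 1) and therefore corresponds to 1 − t in Equations (45)–(48),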

[\begin{array}{l} η_{1} \\ η_{2} \\ η_{3} \end{array}] = [\begin{array}{l} - α \\ 0 \\ - 4 α - 2 \end{array}] + t [\begin{array}{l} \frac{2 α^{3} + 4 α^{2} + α}{2 α^{2} + 2 α + 1} \\ 1 \\ \frac{8 α^{3} + 12 α^{2} + 10 α + 3}{2 α^{2} + 2 α + 1} \end{array}] .

(60)
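
The points a and b of Equations (58) and (59) and the line joining them can be checked against Equations (46)–(48). In the sketch below, which uses illustrative names and exact fractions, the line is written as b + s (a − b), so that s = 1 − t in terms of the parameter t of Equation (45).

```python
from fractions import Fraction

def endpoint_a(alpha):
    """Point a on the line through delta_2 and delta_4, Equation (58)."""
    alpha = Fraction(alpha)
    d = 2 * alpha ** 2 + 2 * alpha + 1
    return [2 * alpha ** 2 / d, Fraction(1), (2 * alpha + 1) / d]

def endpoint_b(alpha):
    """Point b on the line through delta_1 and delta_3, Equation (59)."""
    alpha = Fraction(alpha)
    return [-alpha, Fraction(0), -4 * alpha - 2]

def eta_on_line(alpha, s):
    """b + s (a - b): s = 0 gives b and s = 1 gives a."""
    a, b = endpoint_a(alpha), endpoint_b(alpha)
    return [bi + s * (ai - bi) for ai, bi in zip(a, b)]

def eta_from_chart(alpha, t):
    """(eta1, eta2, eta3) as in Equations (46)-(48)."""
    alpha, t = Fraction(alpha), Fraction(t)
    d = 2 * alpha ** 2 + 2 * alpha + 1
    return [(2 * alpha ** 2 - (2 * alpha ** 3 + 4 * alpha ** 2 + alpha) * t) / d,
            1 - t,
            -((8 * alpha ** 3 + 12 * alpha ** 2 + 10 * alpha + 3) * t - 2 * alpha - 1) / d]
```

Both functions accept any real α and any value of the line parameter, which is how the surface is prolonged outside the marginal polytope.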

The ruled surface can be extended outside the marginal polytope by taking values of α, t ∈ ℝ and considering the set of lines through a and b for different values of α. For α → ±∞, the η₁ coordinate of b tends to ∓∞, while the η₁ of a tends to one. For α → ±∞, the ruled surface admits the same limit given by the line parallel to δ₁ ↔ δ₃ passing through (1, 1, 0). The surface intersects the interior of the marginal polytope for t ∈ (0, 1) and α ∈ (−1, 0). Moreover, the surface intersects the critical line twice, for t = 0, α ∈ [−1, 0] and for t = 0, α ∉ [−1, 0].

In Figures 6 and 7, we represent the extension of the ruled surface outside the probability simplex and in the (α, t) chart, while in Figures 8 and 9, the extended surface has been represented in the full marginal polytope parametrized by (η₁, η₂, η₃) and in the marginal polytope parametrized by (η₁, η₂).

3.4. Optimization and Natural Gradient Flows

We are interested in the study of natural gradient flows of functions defined over statistical models. Our motivation is the study of the optimization of the stochastic relaxation of a function, i.e., the optimization of the expected value of the function itself with respect to a distribution p in a statistical model. Natural gradient flows associated with the stochastic relaxation converge to the boundary of the model, where the probability mass is concentrated on some instances of the search space. To study the convergence over the boundary, we proposed to extend the natural gradient field outside the marginal polytope and the probability simplex, by employing a parameterization that describes the model as a ruled surface, as we described in the tutorial example of this section.

In the following, we focus on the optimization of a function f : Ω → ℝ, and we consider its stochastic relaxation with respect to a probability distribution in the exponential family in Equations (12) and (13). First, we compute a basis for all real-valued functions defined over Ω using algebraic arguments. Consider the zero-dimensional ideal I associated with the set of points in Ω, and let R be the polynomial ring with the field of real coefficients; a vector space basis for the quotient ring R/I defines a basis for all functions defined over Ω. In CoCoA [36], this can be computed with the command QuotientBasis.

Coming back to our example, with Ω = {1, 2, 3, 4}, by fixing the graded reverse lexicographical monomial order, which is the default one in CoCoA [36], we obtain a basis given by {1, x₁, x₂, x₁ x₂}, so that any f : Ω → ℝ can be written as:

f = c_{0} + c_{1} x_{1} + c_{2} x_{2} + c_{12} x_{1} x_{2} .

(61)

We are interested in the study of the natural gradient field of

F (p) = E_{p} [f]

. Recall that T₃ = 4x₁ + 3x₂ − 5x₁x₂ − 2 and

η_{3} = E [T_{3}]

, so that:

E [x_{1} x_{2}] = \frac{1}{5} (4 η_{1} + 3 η_{2} - η_{3} - 2),

(62)

which implies:

F_{η} (η) = c_{0} - \frac{2}{5} c_{12} + (c_{1} + \frac{4}{5} c_{12}) η_{1} + (c_{2} + \frac{3}{5} c_{12}) η_{2} - \frac{1}{5} c_{12} η_{3} .

(63)
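
The expression of the stochastic relaxation in the η parameters can be cross-checked against the direct computation of E_p[f]. The sketch below substitutes Equation (62) into the expectation of Equation (61); the function names are illustrative and exact fractions are used.

```python
from fractions import Fraction

X1 = [0, 0, 1, 2]
X2 = [0, 1, 0, 1]

def relaxation_direct(p, c0, c1, c2, c12):
    """E_p[f] for f = c0 + c1 x1 + c2 x2 + c12 x1 x2."""
    return sum(px * (c0 + c1 * a + c2 * b + c12 * a * b)
               for px, a, b in zip(p, X1, X2))

def relaxation_eta(eta, c0, c1, c2, c12):
    """Same quantity from (eta1, eta2, eta3), using Equation (62) for E[x1 x2]."""
    e1, e2, e3 = eta
    e12 = Fraction(1, 5) * (4 * e1 + 3 * e2 - e3 - 2)
    return c0 + c1 * e1 + c2 * e2 + c12 * e12
```

Collecting terms in `relaxation_eta`, the coefficient of η₂ is c₂ + (3/5) c₁₂ and the coefficient of η₃ is −(1/5) c₁₂.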

In order to study the gradient field of F_η(η) over the marginal polytope parameterized by (η₁, η₂), we need to express η₃ as a function of η₁ and η₂. In order to do that, we parametrize the exponential family as a ruled surface by means of the (α, t) parameters. Moreover, this parametrization has a natural extension outside the marginal polytope, which allows one to study the stability of the critical points on the boundary of the marginal polytope. We start by evaluating the gradient field of F_{α,t}(α, t) in the (α, t) parametrization, then we map it to the marginal polytope in the η parameterization.

By expressing (η₁, η₂, η₃) as a function of (α, t), we obtain:

F_{α, t} (α, t) = \frac{2 α^{2} (c_{1} + c_{12}) + (2 α^{2} + 2 α + 1) (c_{0} + c_{2}) - (2 α^{2} (c_{1} + c_{12}) + (2 α^{2} + 2 α + 1) (c_{1} α + c_{2})) t}{2 α^{2} + 2 α + 1} .

(64)

If we take partial derivatives of Equation (64) with respect to α and t, we have:

\partial_{α} F_{α, t} (α, t) = \frac{4 (α^{2} + α) (c_{1} + c_{12}) - ((4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) c_{1} + 4 (α^{2} + α) c_{12}) t}{4 α^{4} + 8 α^{3} + 8 α^{2} + 4 α + 1},

(65)

\partial_{t} F_{α, t} (α, t) = - \frac{2 α^{2} c_{12} + (2 α^{3} + 4 α^{2} + α) c_{1} + (2 α^{2} + 2 α + 1) c_{2}}{2 α^{2} + 2 α + 1} .

(66)

In the (α, t) parameterization, the Fisher information matrix reads:

I_{α, t} (α, t) = E_{α, t} [- \partial^{2} \log p (x; α, t)] = [\begin{matrix} \frac{4 α^{2} - (4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) t + 4 α}{4 α^{6} + 12 α^{5} + 16 α^{4} + 12 α^{3} + 5 α^{2} + α} & 0 \\ 0 & - {(t^{2} - t)}^{- 1} \end{matrix}] .

(67)

Finally, the natural gradient becomes:

\begin{array}{l} \tilde{\nabla} F_{α, t} (α, t) = I_{α, t} {(α, t)}^{- 1} \nabla F_{α, t} (α, t) \\ = [\begin{matrix} \frac{(4 α^{6} + 12 α^{5} + 16 α^{4} + 12 α^{3} + 5 α^{2} + α) (4 (α^{2} + α) c_{1} + 4 (α^{2} + α) c_{12} - ((4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) c_{1} + 4 (α^{2} + α) c_{12}) t)}{(4 α^{4} + 8 α^{3} + 8 α^{2} + 4 α + 1) (4 α^{2} - (4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) t + 4 α)} \\ \frac{(2 α^{2} c_{12} + (2 α^{3} + 4 α^{2} + α) c_{1} + (2 α^{2} + 2 α + 1) c_{2}) (t^{2} - t)}{2 α^{2} + 2 α + 1} \end{matrix}] \end{array}

(68)

We obtained a rational formula for the natural gradient in the (α, t) parameterization, which can be easily extended outside the marginal polytope. However, notice that the inverse Fisher information matrix and the natural gradient are not defined for:

t = \frac{4 (α^{2} + α)}{4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1} .

(69)
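
The Fisher information matrix and the natural gradient in the (α, t) chart can be sketched in floating point as follows. The closed forms are those of Equations (67) and (68), written in terms of u = α² + α and D = 2α² + 2α + 1 (so that 4α⁶ + 12α⁵ + 16α⁴ + 12α³ + 5α² + α = u D²); the finite-difference check and all function names are illustrative assumptions rather than part of the original derivation.

```python
def surface_point(alpha, t):
    """p(alpha, t) of Equation (45)."""
    d = 2 * alpha ** 2 + 2 * alpha + 1
    return [(alpha + 1) * t, (alpha + 1) ** 2 * (1 - t) / d,
            -alpha * t, alpha ** 2 * (1 - t) / d]

def fisher_closed_form(alpha, t):
    """Diagonal entries of Equation (67); the off-diagonal entries are zero."""
    u = alpha * (alpha + 1)
    d2 = (2 * u + 1) ** 2
    q = d2 + 4 * u                 # 4a^4 + 8a^3 + 12a^2 + 8a + 1
    return (4 * u - q * t) / (u * d2), 1.0 / (t * (1 - t))

def fisher_numeric(alpha, t, h=1e-6):
    """Fisher matrix sum_x dp_i dp_j / p from central finite differences."""
    p = surface_point(alpha, t)
    da = [(x - y) / (2 * h) for x, y in
          zip(surface_point(alpha + h, t), surface_point(alpha - h, t))]
    dt = [(x - y) / (2 * h) for x, y in
          zip(surface_point(alpha, t + h), surface_point(alpha, t - h))]
    def inner(v, w):
        return sum(a * b / px for a, b, px in zip(v, w, p))
    return [[inner(da, da), inner(da, dt)], [inner(dt, da), inner(dt, dt)]]

def natural_gradient(alpha, t, c1, c2, c12):
    """Natural gradient of Equation (68), as a pair (alpha component, t component)."""
    u = alpha * (alpha + 1)
    d = 2 * u + 1
    q = d * d + 4 * u
    g = 2 * alpha ** 2 * c12 + (2 * alpha ** 3 + 4 * alpha ** 2 + alpha) * c1 + d * c2
    num = 4 * u * (c1 + c12) - (q * c1 + 4 * u * c12) * t
    den = 4 * u - q * t            # vanishes on the set of Equation (69)
    return u * num / den, g * t * (t - 1) / d
```

In the interior of the model, that is for −1 < α < 0 and 0 < t < 1, the denominator 4u − (D² + 4u) t = 4u (1 − t) − D² t is strictly negative, so the singular set of Equation (69) is not met there.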

We also remark that over the boundary of the model, for t ∈ {0, 1} and α ∈ {−1, 0}, the determinant of the inverse Fisher information vanishes, so that the matrix is not full rank. It follows that the trajectories associated with natural gradient flows with initial conditions in the interior of the marginal polytope remain in the marginal polytope.

In order to study the natural gradient field over the marginal polytope, we apply a reparameterization of a tangent vector from the (α, t) parameterization to the (η₁, η₂) parameterization. Indeed, by the chain rule and the inverse function theorem, we have:

\nabla F_{η} (α, t) = \nabla F_{α, t} {(α, t)}^{T} J {(α, t)}^{- 1}

(70)

The Jacobian of the map (α, t) ↦ (η₁, η₂) is:

J (α, t) = [\begin{matrix} - \frac{(6 α^{2} + 8 α + 1) t - 4 α}{2 α^{2} + 2 α + 1} - \frac{2 (2 α^{2} - (2 α^{3} + 4 α^{2} + α) t) (2 α + 1)}{{(2 α^{2} + 2 α + 1)}^{2}} & - \frac{2 α^{3} + 4 α^{2} + α}{2 α^{2} + 2 α + 1} \\ 0 & - 1 \end{matrix}],

(71)

with inverse:

J {(α, t)}^{- 1} = [\begin{matrix} \frac{4 α^{4} + 8 α^{3} + 8 α^{2} + 4 α + 1}{4 α^{2} - (4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) t + 4 α} & - \frac{4 α^{5} + 12 α^{4} + 12 α^{3} + 6 α^{2} + α}{4 α^{2} - (4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) t + 4 α} \\ 0 & - 1 \end{matrix}] .

(72)

It follows that:

\nabla F_{η} (α, t) = [\begin{matrix} \frac{4 (α^{2} + α) c_{1} + 4 (α^{2} + α) c_{12} - ((4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) c_{1} + 4 (α^{2} + α) c_{12}) t}{4 α^{2} - (4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) t + 4 α} \\ - \frac{4 (α^{3} + α^{2}) c_{12} - 4 (α^{2} + α) c_{2} + (2 (2 α^{4} - α^{2}) c_{12} + (4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) c_{2}) t}{4 α^{2} - (4 α^{4} + 8 α^{3} + 12 α^{2} + 8 α + 1) t + 4 α} \end{matrix}] .

(73)

Notice that, as for the inverse Fisher information matrix, the inverse Jacobian J(α, t)⁻¹ is not defined for t which satisfies Equation (69).

We compute the inverse Fisher information matrix by evaluating the covariance between the sufficient statistics of the exponential family. Since over Ω, we have

x_{1}^{2} = x_{1} + x_{1} x_{2}

and

x_{2}^{2} = x_{2}

, it follows that:

I_{η} {(η)}^{- 1} = [\begin{matrix} \frac{1}{5} (9 η_{1} + 3 η_{2} - η_{3} - 2) - η_{1}^{2} & \frac{1}{5} (4 η_{1} + 3 η_{2} - η_{3} - 2) - η_{1} η_{2} \\ \frac{1}{5} (4 η_{1} + 3 η_{2} - η_{3} - 2) - η_{1} η_{2} & η_{2} - η_{2}^{2} \end{matrix}] .

(74)

By parameterizing

I_{η}^{- 1}

with (α, t), we have:

\begin{array}{l} I_{η} {(α, t)}^{- 1} \\ = [\begin{matrix} \frac{4 α^{4} + 8 α^{3} - (4 α^{6} + 16 α^{5} + 20 α^{4} + 8 α^{3} + α^{2}) t^{2} + 4 α^{2} + (4 α^{5} - 12 α^{3} - 8 α^{2} - α) t}{4 α^{4} + 8 α^{3} + 8 α^{2} + 4 α + 1} & - \frac{(2 α^{3} + 4 α^{2} + α) t^{2} - (2 α^{3} + 4 α^{2} + α) t}{2 α^{2} + 2 α + 1} \\ - \frac{(2 α^{3} + 4 α^{2} + α) t^{2} - (2 α^{3} + 4 α^{2} + α) t}{2 α^{2} + 2 α + 1} & - t^{2} + t \end{matrix}] . \end{array}

(75)
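
Since the inverse Fisher information matrix in the η parameters is the covariance matrix of (x₁, x₂), its rational form in (α, t) can be compared with a direct computation from the probabilities of Equation (45). The sketch below uses illustrative names and exact fractions; the matrix is symmetric, and its off-diagonal entry is written as (2α³ + 4α² + α)(t − t²)/(2α² + 2α + 1).

```python
from fractions import Fraction

X1 = [0, 0, 1, 2]
X2 = [0, 1, 0, 1]

def surface_point(alpha, t):
    """p(alpha, t) of Equation (45)."""
    d = 2 * alpha ** 2 + 2 * alpha + 1
    return [(alpha + 1) * t, (alpha + 1) ** 2 * (1 - t) / d,
            -alpha * t, alpha ** 2 * (1 - t) / d]

def covariance_direct(alpha, t):
    """Covariance of (x1, x2) computed from the probabilities."""
    p = surface_point(alpha, t)
    e1 = sum(px * a for px, a in zip(p, X1))
    e2 = sum(px * b for px, b in zip(p, X2))
    e11 = sum(px * a * a for px, a in zip(p, X1))
    e22 = sum(px * b * b for px, b in zip(p, X2))
    e12 = sum(px * a * b for px, a, b in zip(p, X1, X2))
    return [[e11 - e1 * e1, e12 - e1 * e2], [e12 - e1 * e2, e22 - e2 * e2]]

def covariance_closed_form(alpha, t):
    """Rational entries in (alpha, t), as in Equation (75)."""
    d = 2 * alpha ** 2 + 2 * alpha + 1
    k = 2 * alpha ** 3 + 4 * alpha ** 2 + alpha   # k^2 is the t^2 coefficient
    v11 = (4 * alpha ** 4 + 8 * alpha ** 3 + 4 * alpha ** 2 - k * k * t * t
           + (4 * alpha ** 5 - 12 * alpha ** 3 - 8 * alpha ** 2 - alpha) * t) / d ** 2
    v12 = k * (t - t * t) / d
    return [[v11, v12], [v12, t - t * t]]
```

The two functions agree identically in (α, t), so the comparison also holds for the signed probabilities of the extended model.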

Finally, we derive the following rational formula for the natural gradient over the marginal polytope parametrized as a ruled surface by (α, t):

\begin{array}{l} \tilde{\nabla} F_{η} (α, t) = I_{η} {(α, t)}^{- 1} \nabla F_{η} (α, t) \\ = [\begin{matrix} - \frac{((4 α^{6} + 16 α^{5} + 20 α^{4} + 8 α^{3} + α^{2}) c_{1} + 2 (2 α^{5} + 4 α^{4} + α^{3}) c_{12} + (4 α^{5} + 12 α^{4} + 12 α^{3} + 6 α^{2} + α) c_{2}) t^{2} - 4 (α^{4} + 2 α^{3} + α^{2}) c_{1} - 4 (α^{4} + 2 α^{3} + α^{2}) c_{12} - ((4 α^{5} - 12 α^{3} - 8 α^{2} - α) c_{1} + 2 (2 α^{5} + 2 α^{4} - 3 α^{3} - 2 α^{2}) c_{12} + (4 α^{5} + 12 α^{4} + 12 α^{3} + 6 α^{2} + α) c_{2}) t}{4 α^{4} + 8 α^{3} + 8 α^{2} + 4 α + 1} \\ - \frac{(2 α^{2} c_{12} + (2 α^{3} + 4 α^{2} + α) c_{1} + (2 α^{2} + 2 α + 1) c_{2}) t^{2} - (2 α^{2} c_{12} + (2 α^{3} + 4 α^{2} + α) c_{1} + (2 α^{2} + 2 α + 1) c_{2}) t}{2 α^{2} + 2 α + 1} \end{matrix}] . \end{array}

(76)
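The rational formula in Equation (76) can be cross-checked against the covariance Cov(T, f), which is what I_η⁻¹∇F_η reduces to when I_η⁻¹ is the covariance of the sufficient statistics. The sketch below is a hedged illustration, not the authors' code: the ruled-surface density `ruled_surface_density` is an assumption (the parameterization is not restated in this section and is chosen here so that it reproduces the entries of Equation (75)), f is assumed to be c₁x₁ + c₂x₂ + c₁₂x₁x₂ up to an additive constant, and all function names are hypothetical.

```python
from fractions import Fraction as Fr

X1 = [0, 0, 1, 2]
X2 = [0, 1, 0, 1]

def ruled_surface_density(alpha, t):
    # ASSUMPTION: a line joining a point of the edge {1,3} (weight t)
    # to a point of the edge {2,4} (weight 1 - t); not taken from this section.
    d = 2 * alpha ** 2 + 2 * alpha + 1
    a = -alpha              # mass of point 3 on the edge {1,3}
    b = alpha ** 2 / d      # mass of point 4 on the edge {2,4}
    return [t * (1 - a), (1 - t) * (1 - b), t * a, (1 - t) * b]

def cov(p, u, v):
    eu = sum(pi * ui for pi, ui in zip(p, u))
    ev = sum(pi * vi for pi, vi in zip(p, v))
    return sum(pi * ui * vi for pi, ui, vi in zip(p, u, v)) - eu * ev

def natural_gradient_cov(alpha, t, c1, c2, c12):
    """Natural gradient in the eta coordinates computed as Cov(T, f)."""
    p = ruled_surface_density(alpha, t)
    f = [c1 * a + c2 * b + c12 * a * b for a, b in zip(X1, X2)]
    return [cov(p, X1, f), cov(p, X2, f)]

def natural_gradient_eq76(alpha, t, c1, c2, c12):
    """The two components of Equation (76), transcribed term by term."""
    a = alpha
    d = 2 * a ** 2 + 2 * a + 1
    k1 = 4 * a ** 6 + 16 * a ** 5 + 20 * a ** 4 + 8 * a ** 3 + a ** 2
    k2 = 4 * a ** 5 + 12 * a ** 4 + 12 * a ** 3 + 6 * a ** 2 + a
    k3 = 4 * (a ** 4 + 2 * a ** 3 + a ** 2)
    num1 = ((k1 * c1 + 2 * (2 * a ** 5 + 4 * a ** 4 + a ** 3) * c12 + k2 * c2) * t ** 2
            - k3 * c1 - k3 * c12
            - ((4 * a ** 5 - 12 * a ** 3 - 8 * a ** 2 - a) * c1
               + 2 * (2 * a ** 5 + 2 * a ** 4 - 3 * a ** 3 - 2 * a ** 2) * c12
               + k2 * c2) * t)
    big_a = 2 * a ** 2 * c12 + (2 * a ** 3 + 4 * a ** 2 + a) * c1 + d * c2
    return [-num1 / d ** 2, -(big_a * t ** 2 - big_a * t) / d]
```

Under these assumptions the two functions return the same pair of rational numbers, for instance at α = −1/2, t = 1/2 with the coefficients of either example of Section 3.5.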

3.5. Examples with Global and Local Optima

We conclude this section with two examples of natural gradient flows associated with two different functions f. First, consider the case where c₀ = 0, c₁ = 1, c₂ = 2, c₃ = 3 (here c₃ is the coefficient of x₁x₂, written c₁₂ in Equation (76)), so that:

\begin{array}{l} Ω & x_{1} & x_{2} & f_{1} \\ 1 & 0 & 0 & 0 \\ 2 & 0 & 1 & 2 \\ 3 & 1 & 0 & 1 \\ 4 & 2 & 1 & 10 \end{array} .

(77)

The function admits a minimum on {1}. In Figure 10, we plotted the vector fields associated with the vanilla and natural gradient, together with some gradient flows for different initial conditions, in the (α, t) parameterization. In Figure 11, we represent the vanilla and natural gradient field over the marginal polytope in the (η₁, η₂) parameterization. Notice that, as expected and unlike the vanilla gradient flows, the natural gradient flows converge to the unique global optimum, which corresponds to the vertex where all of the probability is concentrated on {1}. In the (α, t) parameterization, the flows have been extended outside the statistical model by prolonging the lines of the ruled surface. As we can see, they remain compatible with the flows in the interior of the model, in the sense that the nature of each critical point is the same for trajectories with initial conditions in the interior and in the exterior of the model. In other words, the global optimum is an attractor from both the interior and the exterior of the model, and the same holds for the other critical points on the vertices where the natural gradient vanishes, both the saddle points and the unstable points.
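The convergence to the vertex {1} in the first example can be reproduced with a few lines of code. The following is a minimal sketch under stated assumptions, not the procedure used for the figures: it integrates the natural gradient flow with explicit Euler steps in the natural parameters θ (the same flow written in another coordinate system), and the step size, the number of steps, the starting point θ = (0, 0) and all function names are hypothetical choices.

```python
import math

X1 = [0, 0, 1, 2]
X2 = [0, 1, 0, 1]

def density(theta):
    """Density of the exponential family with sufficient statistics (x1, x2)."""
    w = [math.exp(theta[0] * a + theta[1] * b) for a, b in zip(X1, X2)]
    z = sum(w)
    return [wi / z for wi in w]

def cov(p, u, v):
    eu = sum(pi * ui for pi, ui in zip(p, u))
    ev = sum(pi * vi for pi, vi in zip(p, v))
    return sum(pi * (ui - eu) * (vi - ev) for pi, ui, vi in zip(p, u, v))

def natural_gradient(theta, f):
    """Solve Cov(T) g = Cov(T, f) for the 2x2 case by Cramer's rule."""
    p = density(theta)
    s11, s12, s22 = cov(p, X1, X1), cov(p, X1, X2), cov(p, X2, X2)
    g1, g2 = cov(p, X1, f), cov(p, X2, f)
    det = s11 * s22 - s12 * s12
    return [(s22 * g1 - s12 * g2) / det, (s11 * g2 - s12 * g1) / det]

def flow(theta0, f, step=0.05, n_steps=160):
    """Explicit Euler integration of theta' = -natural_gradient (minimization)."""
    theta = list(theta0)
    for _ in range(n_steps):
        g = natural_gradient(theta, f)
        theta = [theta[0] - step * g[0], theta[1] - step * g[1]]
    return density(theta)

f1 = [0, 2, 1, 10]                 # values of f1 on the points 1, 2, 3, 4
p_end = flow([0.0, 0.0], f1)       # start from the uniform density
```

With these settings nearly all of the probability mass ends up on the first sample point. The same `flow` function can be called with the values of f₂ and different starting points to explore the second example.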

In the second example, we set c₀ = 0, c₁ = 1, c₂ = 2, c₃ = −5/2, and we have:

\begin{array}{l} Ω & x_{1} & x_{2} & f_{2} \\ 1 & 0 & 0 & 0 \\ 2 & 0 & 1 & 2 \\ 3 & 1 & 0 & 1 \\ 4 & 2 & 1 & - 1 \end{array}

(78)

so that f₂ admits a minimum on {4}. In Figures 12 and 13, we plotted the vector fields associated with the vanilla and natural gradient, together with some gradient flows for different initial conditions, in the (α, t) and (η₁, η₂) parameterizations, respectively. As in the previous example, natural gradient flows converge to the vertices of the model; however, in this case, we have one local optimum in {1} and one global optimum in {4}, together with a saddle point in the interior of the model. Similarly to the previous example, in the (α, t) parameterization, the flows have been extended outside the statistical model, and the nature of the critical points is the same for trajectories with initial conditions in the statistical model and in the extension of the statistical model.

We conclude the section by noticing that in both examples, for certain values of t in Equation (69), the natural gradient flows are not defined on the extension of the statistical model. As represented in the figures, once a trajectory encounters the dashed blue line in the (α, t) parameterization, the flow stops at that point.

4. Pseudo-Boolean Functions

We turn to discuss a case of considerable practical interest to see which of the results obtained in the example of the previous section we are able to extend.

For binary variables, we use the coding ±1, that is x = (x₁,…,x_n) ∈ {+1, −1}ⁿ = Ω. For any function f : Ω → ℝ, with multi-index notation, f(x) = ∑_{α∈L} a_α x^α, with L = {0, 1}ⁿ and x^α = ∏_{i=1}^{n} x_i^{α_i}, 0⁰ = 1. If M ⊂ L* = L \ {0}, the model ε, where p ∈ ε if:

p \propto \exp (\sum_{α \in M} θ_{α} X^{α}) = \prod_{α \in M} {(e^{θ_{α}})}^{X^{α}}

has been considered in a number of papers on combinatorial optimization; see [3–5]. The following statements are results in algebraic statistics; cf. [20,35]. Let P¹ = {p ∈ ℝ^Ω | ∑_{x∈Ω} p(x) = 1}.

Proposition 6 (Implicitization of the exponential family). Given a function p: Ω → ℝ, p ∈ ε if, and only if, the following conditions all hold:

1. p(x) > 0, x ∈ Ω;
2. ∑_{x∈Ω} p(x) = 1;
3. $\prod_{x : x^{β} = 1} p (x) = \prod_{x : x^{β} = - 1} p (x)$ for all β ∈ L* \ M.

Proof. (⇒) If p ∈ ε, then p(x) > 0, x ∈ Ω (Item 1) and ∑_{x∈Ω} p(x) = 1 (Item 2). Moreover, log p(x) = ∑_{α∈M} θ_α x^α − ψ(θ). The function log p is orthogonal to each X^β, β ∈ L* \ M, because ∑_{x∈Ω} x^α x^β = 0 whenever α ≠ β. Hence:

0 = \sum_{x \in Ω} \log p (x) x^{β} = \sum_{x : x^{β} = 1} \log p (x) - \sum_{x : x^{β} = - 1} \log p (x) = \log \prod_{x : x^{β} = 1} p (x) - \log \prod_{x : x^{β} = - 1} p (x),

(79)

which is equivalent to Item 3.

(⇐) Conversely, the computation in Equation (79) implies that log p is orthogonal to each X^β, β ∈ L* \ M; hence, there exists θ, such that log p = ∑_{α∈M} θ_α X^α + C. Now, Item 2 implies C = −ψ(θ).
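The orthogonality argument of Equation (79) is easy to test numerically. The sketch below is only an illustration: it takes n = 3 and the model M containing every nonzero multi-index except (1, 1, 1), with hypothetical parameter values, and evaluates ∑_x x^β log p(x), which vanishes exactly when Item 3 holds for β. The function names are not from the paper.

```python
import itertools
import math

def monomial(x, alpha):
    """x^alpha for x in {+1, -1}^n and alpha in {0, 1}^n."""
    return math.prod(xi for xi, ai in zip(x, alpha) if ai)

def exponential_family_density(theta, n):
    """theta maps each multi-index alpha in M to its natural parameter."""
    omega = list(itertools.product([1, -1], repeat=n))
    w = {x: math.exp(sum(th * monomial(x, al) for al, th in theta.items()))
         for x in omega}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

def log_binomial_gap(p, beta):
    """sum_x x^beta * log p(x); zero iff the binomial condition for beta holds."""
    return sum(monomial(x, beta) * math.log(px) for x, px in p.items())

# Hypothetical parameters for the model without the three-way interaction.
theta = {(0, 0, 1): 0.3, (0, 1, 0): -0.2, (0, 1, 1): 0.1,
         (1, 0, 0): 0.4, (1, 0, 1): -0.1, (1, 1, 0): 0.2}
p = exponential_family_density(theta, 3)

# Adding the interaction (1, 1, 1) leaves the model and breaks Item 3.
theta_full = dict(theta)
theta_full[(1, 1, 1)] = 0.5
q = exponential_family_density(theta_full, 3)
```

For a density outside the model the gap equals 2ⁿ times the parameter of the missing interaction, so `log_binomial_gap(q, (1, 1, 1))` is 8 × 0.5 = 4 up to rounding.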

Let ℝ[Ω] denote the ring of polynomials in the indeterminates {p(x) | x ∈ Ω}. Given a binary model M, the set of polynomials:

\{ \prod_{x : x^{β} = 1} p (x) - \prod_{x : x^{β} = - 1} p (x) \mid β \in L^{*} \setminus M \},

generates an ideal J(M), which is called the toric ideal of the model M. Its variety V(M) is called the exponential variety of M.

Proposition 7.

1. The exponential variety of M is the Zariski closure of the exponential model ε.
2. The closure $\bar{ℰ}$ of ε in $P_{\geq}$ is characterized by p(x) ≥ 0, x ∈ Ω, together with Items 2 and 3 of Proposition 6.
3. The algebraic variety of the ring ℝ[p(x): x ∈ Ω], which is generated by the polynomials ∑_{x∈Ω} p(x) − 1, $\prod_{x : x^{β} = 1} p (x) - \prod_{x : x^{β} = - 1} p (x)$, β ∈ L* \ M, is an extension ε¹ of ε to $P^{1}$.
4. Define the moments $η_{α} = \sum_{x \in Ω} x^{α} p (x)$, α ∈ L, i.e., the discrete Fourier transform of p, with inverse $p (x) = 2^{- n} \sum_{α \in L} x^{α} η_{α}$. There exists an algebraic extension of the moment function ε ∋ p ↔ η(p) ∈ M° to a mapping defined on ε¹.

Proof.

1. According to the implicitization Proposition 6, the exponential family is characterized by the positivity condition together with the algebraic binomial conditions.
2. This follows from the implicit form, and it is proven, for example, in [20].
3. By definition.
4. As the mapping from the probabilities to the moments is affine and one-to-one, such a transformation extends to a one-to-one mapping from the extended model to the affine space of the marginal polytope.
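The moment map of Item 4 and its inverse can be written down directly. The sketch below is a hedged illustration with hypothetical helper names and hypothetical densities: it uses n = 2 and exact rational arithmetic, and it also applies the round trip to a point of P¹ with negative entries, in line with the affine extension described in the proof.

```python
import itertools
import math
from fractions import Fraction as Fr

def monomial(x, alpha):
    """x^alpha for x in {+1, -1}^n and alpha in {0, 1}^n."""
    return math.prod(xi for xi, ai in zip(x, alpha) if ai)

def moments(p, n):
    """eta_alpha = sum_x x^alpha p(x) for every alpha in L = {0, 1}^n."""
    big_l = list(itertools.product([0, 1], repeat=n))
    return {al: sum(px * monomial(x, al) for x, px in p.items()) for al in big_l}

def density_from_moments(eta, n):
    """Inverse transform p(x) = 2^(-n) sum_alpha x^alpha eta_alpha."""
    omega = itertools.product([1, -1], repeat=n)
    return {x: sum(monomial(x, al) * e for al, e in eta.items()) / 2 ** n
            for x in omega}

# A probability density on {+1, -1}^2 (hypothetical values).
p = {(1, 1): Fr(1, 2), (1, -1): Fr(1, 4), (-1, 1): Fr(1, 8), (-1, -1): Fr(1, 8)}

# A point of P^1 that is not a density: entries sum to 1 but two are negative.
q = {(1, 1): Fr(3, 2), (1, -1): Fr(-1, 4), (-1, 1): Fr(-1, 4), (-1, -1): Fr(0)}
```

The moment indexed by α = 0 is the total mass, so it equals 1 on all of P¹; the remaining moments are unconstrained by the normalization.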

We conclude this section by introducing the so-called no three-way interaction example. On Ω = {0, 1}³, the full model in the statistics 0 ↦ 1, 1 ↦ −1, that is t = (−1)^x = 1 − 2x, is described by the matrix:

D_{3} = \begin{array}{c|rrrrrrrr} & 1 & T_{3} & T_{2} & T_{2} T_{3} & T_{1} & T_{1} T_{3} & T_{1} T_{2} & T_{1} T_{2} T_{3} \\ \hline 000 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 001 & 1 & - 1 & 1 & - 1 & 1 & - 1 & 1 & - 1 \\ 010 & 1 & 1 & - 1 & - 1 & 1 & 1 & - 1 & - 1 \\ 011 & 1 & - 1 & - 1 & 1 & 1 & - 1 & - 1 & 1 \\ 100 & 1 & 1 & 1 & 1 & - 1 & - 1 & - 1 & - 1 \\ 101 & 1 & - 1 & 1 & - 1 & - 1 & 1 & - 1 & 1 \\ 110 & 1 & 1 & - 1 & - 1 & - 1 & - 1 & 1 & 1 \\ 111 & 1 & - 1 & - 1 & 1 & - 1 & 1 & 1 & - 1 \end{array} .

(80)

Note the lexicographic order of both the sample points and the statistics’ exponents.
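With both orderings lexicographic, the matrix D₃ is the threefold Kronecker product of the 2 × 2 matrix with rows (1, 1) and (1, −1). The short sketch below, with hypothetical helper names, builds it that way and compares it with the entries of Equation (80).

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

H = [[1, 1], [1, -1]]          # design matrix of a single +/-1 coded factor
D3 = kron(H, kron(H, H))       # rows: points 000..111; columns: 1, T3, T2, ...

# Entries of Equation (80), copied row by row.
D3_PAPER = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, -1, -1, 1, 1, -1, -1, 1],
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, -1, 1, -1, 1],
    [1, 1, -1, -1, -1, -1, 1, 1],
    [1, -1, -1, 1, -1, 1, 1, -1],
]

def gram(m):
    """Matrix of inner products between the columns of m."""
    cols = list(zip(*m))
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
```

The columns are mutually orthogonal with squared norm 8, which is the orthogonality of the monomials used in the proof of Proposition 6.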

The exponential family without the interaction term T₁T₂T₃ is the same model as the toric model without the three-way interaction, which is based on the matrix:

B = \begin{array}{c|rrrrrrr} & c & ζ_{1} & ζ_{2} & ζ_{3} & ζ_{4} & ζ_{5} & ζ_{6} \\ \hline 000 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 001 & 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 010 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\ 011 & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\ 100 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\ 101 & 1 & 1 & 0 & 1 & 1 & 0 & 1 \\ 110 & 1 & 0 & 1 & 1 & 1 & 1 & 0 \\ 111 & 1 & 1 & 1 & 0 & 1 & 0 & 0 \end{array}

(81)

that is, the probabilities as functions of the ζ’s are:

\begin{cases} p_{1} = c \\ p_{2} = c ζ_{1} ζ_{3} ζ_{5} \\ p_{3} = c ζ_{2} ζ_{3} ζ_{6} \\ p_{4} = c ζ_{1} ζ_{2} ζ_{5} ζ_{6} \\ p_{5} = c ζ_{4} ζ_{5} ζ_{6} \\ p_{6} = c ζ_{1} ζ_{3} ζ_{4} ζ_{6} \\ p_{7} = c ζ_{2} ζ_{3} ζ_{4} ζ_{5} \\ p_{8} = c ζ_{1} ζ_{2} ζ_{4} \end{cases} .

(82)

The toric ideal of the toric model in Equation (82) is generated by the polynomial:

p_{2} p_{3} p_{5} p_{8} - p_{1} p_{4} p_{6} p_{7} = 0,

(83)

this means that the closure of the exponential family is given by the solution of the equations:

\begin{cases} p_{1} + p_{2} + p_{3} + p_{4} + p_{5} + p_{6} + p_{7} + p_{8} = 1 \\ p_{2} p_{3} p_{5} p_{8} - p_{1} p_{4} p_{6} p_{7} = 0 \end{cases} .

(84)
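The toric parameterization of Equation (82) satisfies both equations in (84) for any positive ζ once c is chosen to normalize the probabilities. The sketch below checks this with exact rational arithmetic; the values of ζ and the helper names are hypothetical, and the vector `U` simply records the signs with which p₁, …, p₈ enter the binomial in Equation (83).

```python
import math
from fractions import Fraction as Fr

# Rows of the matrix B in Equation (81): exponents of (c, zeta_1, ..., zeta_6).
B = [
    [1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0, 0],
]

def toric_point(c, zeta):
    """p_i = product of the parameters raised to the exponents in row i of B."""
    params = [c] + list(zeta)
    return [math.prod(q ** e for q, e in zip(params, row)) for row in B]

def binomial_invariant(p):
    """Left-hand side of Equation (83), with p[0] = p_1, ..., p[7] = p_8."""
    return p[1] * p[2] * p[4] * p[7] - p[0] * p[3] * p[5] * p[6]

zeta = [Fr(1, 2), Fr(2), Fr(3), Fr(1, 3), Fr(5), Fr(1, 5)]  # hypothetical values
c = 1 / sum(toric_point(1, zeta))                          # normalizing constant
p = toric_point(c, zeta)

# Exponent vector of the binomial: +1 for p2, p3, p5, p8 and -1 for p1, p4, p6, p7.
U = [-1, 1, 1, -1, 1, -1, -1, 1]
kernel_check = [sum(u * row[j] for u, row in zip(U, B)) for j in range(7)]
```

The vector `U` lies in the left kernel of B, which is the reason the binomial vanishes identically on the image of the parameterization.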

The η parameters are the expected values of the sufficient statistics of the full model,

[\begin{matrix} η_{1} \\ η_{2} \\ η_{3} \\ η_{4} \\ η_{5} \\ η_{6} \\ η_{7} \end{matrix}] = [\begin{array}{c|rrrrrrrr} & 000 & 001 & 010 & 011 & 100 & 101 & 110 & 111 \\ \hline 001 & 1 & - 1 & 1 & - 1 & 1 & - 1 & 1 & - 1 \\ 010 & 1 & 1 & - 1 & - 1 & 1 & 1 & - 1 & - 1 \\ 011 & 1 & - 1 & - 1 & 1 & 1 & - 1 & - 1 & 1 \\ 100 & 1 & 1 & 1 & 1 & - 1 & - 1 & - 1 & - 1 \\ 101 & 1 & - 1 & 1 & - 1 & - 1 & 1 & - 1 & 1 \\ 110 & 1 & 1 & - 1 & - 1 & - 1 & - 1 & 1 & 1 \\ 111 & 1 & - 1 & - 1 & 1 & - 1 & 1 & 1 & - 1 \end{array}] [\begin{matrix} p_{1} \\ p_{2} \\ p_{3} \\ p_{4} \\ p_{5} \\ p_{6} \\ p_{7} \\ p_{8} \end{matrix}] .

(85)

In the ring:

R = ℚ [p_{1}, p_{2}, p_{3}, p_{4}, p_{5}, p_{6}, p_{7}, p_{8}, η_{1}, η_{2}, η_{3}, η_{4}, η_{5}, η_{6}, η_{7}]

(86)

we can consider the ideal I generated by Equations (84) together with Equations (85). The elimination ideal:

J = I \cap ℚ [η_{1}, η_{2}, η_{3}, η_{4}, η_{5}, η_{6}, η_{7}]

(87)

will express the model as a dependence between the η’s.

Computation with CoCoA [36] gives the following polynomial:

\begin{array}{l} f (η_{1}, η_{2}, η_{3}, η_{4}, η_{5}, η_{6}; η_{7}) = \\ η_{1}^{2} η_{3} η_{4} + η_{2}^{2} η_{3} η_{4} - η_{3}^{3} η_{4} - η_{3} η_{4}^{3} + η_{1}^{2} η_{2} η_{5} - η_{2}^{3} η_{5} + η_{2} η_{3}^{2} η_{5} + η_{2} η_{4}^{2} η_{5} + η_{3} η_{4} η_{5}^{2} - η_{2} η_{5}^{3} - η_{1}^{3} η_{6} + η_{1} η_{2}^{2} η_{6} + η_{1} η_{3}^{2} η_{6} \\ + η_{1} η_{4}^{2} η_{6} + η_{1} η_{5}^{2} η_{6} + η_{3} η_{4} η_{6}^{2} + η_{2} η_{5} η_{6}^{2} - η_{1} η_{6}^{3} - 2 η_{1} η_{2} η_{4} - 2 η_{1} η_{3} η_{5} - 2 η_{2} η_{3} η_{6} - 2 η_{4} η_{5} η_{6} + η_{3} η_{4} + η_{2} η_{5} + η_{1} η_{6} \\ + (- 2 η_{1} η_{2} η_{3} - 2 η_{1} η_{4} η_{5} - 2 η_{2} η_{4} η_{6} - 2 η_{3} η_{5} η_{6} + η_{1}^{2} + η_{2}^{2} + η_{3}^{2} + η_{4}^{2} + η_{5}^{2} + η_{6}^{2} - 1) η_{7} \\ + (η_{3} η_{4} + η_{2} η_{5} + η_{1} η_{6}) η_{7}^{2} + (- 1) η_{7}^{3} . \end{array}

(88)

The equation:

f (η_{1}, η_{2}, η_{3}, η_{4}, η_{5}, η_{6}; η_{7}) = 0

(89)

is an expression of the model in the expectation parameters, and this expression is a polynomial equation. We know that it is uniquely solvable in η₇ if (η₁, η₂, η₃, η₄, η₅, η₆) is in the interior of the marginal polytope. As in the example of the previous section, it is possible to intersect the polynomial invariant in Equation (83) with one or more sheaves of hyperplanes around some faces of the simplex, in order to lower the degree of the invariant and thus decompose the model as the convex hull of probabilities on the boundary of the model. We do not describe the details here, and we postpone the discussion of this example to a paper in preparation.
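The polynomial in Equation (88) can be compared with the binomial invariant of Equation (83) by substituting the moments of Equation (85). The sketch below is a check under stated assumptions rather than a replacement for the elimination computation: the function names are hypothetical, the test points are arbitrary elements of P¹, and the constant 512 = 8³ is the normalization that makes the two expressions agree in this sketch (a generator of the elimination ideal is defined only up to a nonzero constant).

```python
import itertools
from fractions import Fraction as Fr

POINTS = list(itertools.product([0, 1], repeat=3))   # 000, 001, ..., 111

def moments(p):
    """eta_1..eta_7 of Equation (85): expectations of t^alpha with t = (-1)^x."""
    return [sum(px * (-1) ** sum(a * xi for a, xi in zip(alpha, x))
                for px, x in zip(p, POINTS))
            for alpha in POINTS[1:]]

def binomial_invariant(p):
    """Left-hand side of Equation (83), with p[0] = p_1, ..., p[7] = p_8."""
    return p[1] * p[2] * p[4] * p[7] - p[0] * p[3] * p[5] * p[6]

def f88(e1, e2, e3, e4, e5, e6, e7):
    """Polynomial of Equation (88), grouped by powers of eta_7."""
    free = (e1**2*e3*e4 + e2**2*e3*e4 - e3**3*e4 - e3*e4**3
            + e1**2*e2*e5 - e2**3*e5 + e2*e3**2*e5 + e2*e4**2*e5
            + e3*e4*e5**2 - e2*e5**3
            - e1**3*e6 + e1*e2**2*e6 + e1*e3**2*e6 + e1*e4**2*e6 + e1*e5**2*e6
            + e3*e4*e6**2 + e2*e5*e6**2 - e1*e6**3
            - 2*e1*e2*e4 - 2*e1*e3*e5 - 2*e2*e3*e6 - 2*e4*e5*e6
            + e3*e4 + e2*e5 + e1*e6)
    lin = (-2*e1*e2*e3 - 2*e1*e4*e5 - 2*e2*e4*e6 - 2*e3*e5*e6
           + e1**2 + e2**2 + e3**2 + e4**2 + e5**2 + e6**2 - 1)
    quad = e3*e4 + e2*e5 + e1*e6
    return free + lin * e7 + quad * e7**2 - e7**3

# Arbitrary rational points of P^1 (entries sum to 1), used only as test inputs.
p_a = [Fr(k, 36) for k in range(1, 9)]
p_b = [Fr(1, 4), Fr(0), Fr(0), Fr(1, 4), Fr(0), Fr(1, 4), Fr(1, 4), Fr(0)]
```

At `p_b`, which puts mass 1/4 on each of the points 000, 011, 101 and 110, both sides equal −2, so this point does not belong to the model.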

5. Conclusions

Geometry and algebra play a fundamental role in the study of statistical models, and in particular in the exponential family. In the first part of the paper, starting from the definition of the natural gradient over an exponential family, we described the relationship between its expression in the basis of the sufficient statistics and in the conjugate basis. From this perspective, the terms natural gradient and vanilla gradient, to denote gradients evaluated with respect to the Fisher and the Euclidean geometry, together with their duality in the natural and expectation parameters, assume a new meaning, since these definitions depend on the choice of the basis for the tangent space.

In order to study natural gradient flows for a generic discrete exponential model and, in particular, their convergence, it is convenient to move to the mixture geometry of the expectation parameters and to study trajectories over the marginal polytope. However, in order to obtain explicit equations for the flows, it is necessary to determine the dependence between the moments associated with the sufficient statistics of the model, which are constrained to belong to the marginal polytope, and the remaining moments, which, on the other hand, are not free. Such a relationship, which for finite search spaces is given by a system of polynomial invariants, cannot be easily solved explicitly in general. In the second part of the paper, by using algebraic tools, we proposed a novel parameterization based on ruled surfaces for an exponential family, which does not require solving the polynomial invariant explicitly. We applied our approach to a simple example, and we showed that the surface associated with the model in the full marginal polytope is a ruled surface. We claim that these results are not peculiar to the example we described, and we are working towards an extension of this approach to a more general case.

Acknowledgments

The authors would like to thank Gianfranco Casnati from Politecnico di Torino for the useful discussions on the geometry of ruled surfaces. Giovanni Pistone is supported by de Castro Statistics of Collegio Carlo Alberto at Moncalieri and is a member of INdAM/GNAMPA.

Author Contributions

Both authors contributed to the design of the research. The research was carried out by all of the authors. The manuscript was written by Luigi Malagò and Giovanni Pistone. Both authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Pistone, G. Nonparametric information geometry. In Geometric Science of Information, Proceedings of the First International Conference (GSI 2013), Paris, France, 28–30 August 2013; Nielsen, F., Barbaresco, F., Eds.; Springer: Heidelberg, Germany, 2013; 8085, pp. 5–36.
2. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming, 2009, Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
3. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA ’11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242.
4. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by estimation of empirical covariances. Proceedings of the 2011 IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
5. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective, Proceedings of the 2013 IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
6. Wierstra, D.; Schaul, T.; Peters, J.; Schmidhuber, J. Natural evolution strategies. Proceedings of the 2008 IEEE Congress on Evolutionary Computation, Hong Kong, China, 1–6 June 2008; pp. 3381–3387.
7. Ollivier, Y.; Arnold, L.; Auger, A.; Hansen, N. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles 2011. arXiv: 1106.3708.
8. Malagò, L.; Pistone, G. Combinatorial Optimization with Information Geometry: Newton method. Entropy 2014, 16, 4260–4289.
9. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; Translated from the 1993 Japanese original by Daishi Harada.
10. Bourbaki, N. Variétés differentielles et analytiques. Fascicule de résultats / Paragraphes 1 à 7; Number XXXIII in Éléments de mathématiques; Hermann: Paris, France, 1971.
11. Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 1995, 23, 1543–1561.
12. Malagò, L.; Pistone, G. Gradient Flow of the Stochastic Relaxation on a Generic Exponential Family. Proceedings of Conference of Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Clos Lucé, Amboise, France, 21–26 September 2014; Mohammad-Djafari, A., Barbaresco, F., Eds.; pp. 353–360.
13. Brown, L.D. Fundamentals of Statistical Exponential Families With Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes, Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986.
14. Rockafellar, R.T. Convex Analysis; Princeton Mathematical Series No. 28; Princeton University Press: Princeton, NJ, USA, 1970.
15. Do Carmo, M.P. Riemannian Geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston, MA, USA, 1992; Translated from the second Portuguese edition by Francis Flaherty.
16. Amari, S.I. Natural gradient works efficiently in learning. Neur. Comput. 1998, 10, 251–276.
17. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007.
18. Rinaldo, A.; Fienberg, S.E.; Zhou, Y. On the geometry of discrete exponential families with application to exponential random graph models. Electron. J. Stat. 2009, 3, 446–484.
19. Rauh, J.; Kahle, T.; Ay, N. Support Sets in Exponential Families and Oriented Matroid Theory. Int. J. Approx. Reas. 2011, 52, 613–626.
20. Malagò, L.; Pistone, G. A note on the border of an exponential family 2010. arXiv:1012.0637v1.
21. Pistone, G.; Rogantin, M. The gradient flow of the polarization measure. With an appendix 2015. arXiv:1502.06718.
22. Diaconis, P.; Sturmfels, B. Algebraic algorithms for sampling from conditional distributions. Ann. Stat. 1998, 26, 363–397.
23. Pistone, G.; Wynn, H.P. Generalised confounding with Gröbner bases. Biometrika 1996, 83, 653–666.
24. Pistone, G.; Riccomagno, E.; Wynn, H.P. Algebraic Statistics: Computational Commutative Algebra in Statistics; Volume 89, Monographs on Statistics and Applied Probability; Chapman & Hall/CRC: Boca Raton, FL, USA, 2001.
25. Drton, M.; Sturmfels, B.; Sullivant, S. Lectures on Algebraic Statistics; Volume 39, Oberwolfach Seminars; Birkhäuser Verlag: Basel, Germany, 2009.
26. Pachter, L.; Sturmfels, B. (Eds.) Algebraic Statistics for Computational Biology; Cambridge University Press: Cambridge, UK, 2005.
27. Gibilisco, P.; Riccomagno, E.; Rogantin, M.P.; Wynn, H.P. Algebraic and Geometric Methods in Statistics; Cambridge University Press: Cambridge, UK, 2010.
28. 4ti2 team. 4ti2—A software package for algebraic, geometric and combinatorial problems on linear spaces. Available online: http://www.4ti2.de (accessed on 2 June 2015).
29. Michałek, M.; Sturmfels, B.; Uhler, C.; Zwiernik, P. Exponential Varieties 2014. arXiv:1412.6185.
30. Sturmfels, B. Gröbner Bases and Convex Polytopes; American Mathematical Society: Providence, RI, USA, 1996.
31. Geiger, D.; Meek, C.; Sturmfels, B. On the toric algebra of graphical models. Ann. Stat. 2006, 34, 1463–1492.
32. Rapallo, F. Toric statistical models: Parametric and binomial representations. Ann. Inst. Stat. Math. 2007, 59, 727–740.
33. Beltrametti, M.; Carletti, E.; Gallarati, D.; Monti Bragadin, G. Lectures on Curves, Surfaces and Projective Varieties: A Classical View of Algebraic Geometry; EMS textbooks in mathematics; European Mathematical Society: Zürich, Switzerland, 2009.
34. Rinaldo, A.; Fienberg, S.E.; Zhou, Y. On the geometry of discrete exponential families with application to exponential random graph models. Electron. J. Stat. 2009, 3, 446–484.
35. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2009; Chapter 21; pp. 339–363.
36. Abbott, J.; Bigatti, A.; Lagorio, G. CoCoA-5: A system for doing Computations in Commutative Algebra. Available online: http://cocoa.dima.unige.it (accessed on 2 June 2015).

Figure 1. Marginal polytope of the exponential family in Equations (12) and (13). The coordinates of the vertices are given by (T₁, T₂).

Figure 2. Representation of the exponential family in Equations (12) and (13) as a surface that intersects the probability simplex ∆₃. The surface is obtained by the triangularization of a grid of points that satisfy the invariant in Equation (21).

Figure 3. Marginal polytope of the exponential family in Equations (12) and (13) (a). The dashed lines correspond to the points where ∆ = 0, where ∆ is the discriminant in Equation (31); over the red regions ∆ > 0 and over the blue regions ∆ < 0. Representation of the exponential family as a surface in the full marginal polytope parametrized by (η₁, η₂, η₃) (b). The blue surface is given by the unique real root η_{3,1} in Equation (32); the red surface corresponds to the unique real root η_{3,2}, which belongs to the full marginal polytope; over the dashed lines, which have been computed solving Equation (40) numerically, Equation (26) admits a real root with multiplicity equal to three.

Figure 4. Representation of the exponential family in Equations (12) and (13) as a ruled surface in the probability simplex (a) and in the parameter space (α, t) (b). The dashed line corresponds to the critical edge δ₂ ↔ δ₄ and the blue line to the case α = −1/2.

Figure 5. Representation of the exponential family in Equations (12) and (13) as a ruled surface in the marginal polytope (η₁, η₂) (a) and in the full marginal polytope parametrized by (η₁, η₂, η₃) (b). The dashed line corresponds to the critical line δ₂ ↔ δ₄

α = - \frac{1}{2}

Figure 6. The segments that form the ruled surface in Figure 4 have been extended, for −0.5 < t < 1.5. New lines described by Equations (60) have been represented for 0 < α < exp(0.7) (shading from red to black for increasing values of α) and for exp(0.7) − 1 < α < −1 (shading from red to white for decreasing values of α). The simplex in (b) has been rotated with respect to Figure 4(a) to better visualize the intersection of the lines with the critical edge δ₂ ↔ δ₄.

Figure 7. Extension of the ruled surface associated with the exponential family in Equations (12) and (13) as in Figure 6(b), for exp(3.5) − 1 < α < exp(3.5) and −0.5 < t < 1.5; for α → ±∞, the lines of the extended surface admit the same limit.

Figure 8. The segments that form the ruled surface in Figure 5 have been extended, for −0.5 < t < 1.5. New lines described by Equations (60) have been represented for 0 < α < exp(1) (shading from blue to black for increasing values of α) and exp(1) − 1 < α < −1 (shading from blue to white for decreasing values of α). The full marginal polytope in (b) has been rotated with respect to Figure 5(b) to better visualize the intersection of the lines with the critical edge δ₂ ↔ δ₄.

Figure 9. Extension of the ruled surface associated with the exponential family in Equations (12) and (13) as in Figure 8(b), for exp(3)−1 < α < exp(3) and −0.5 < t < 1.5; notice that for α → ±∞, the lines of the extended surface admit the same limit.

Figure 10. Vanilla gradient field and flows in blue (a) and natural gradient field and flows in red (b), together with level lines associated with F_{α,t}(α, t) in the (α, t) parameterization, for c₀ = 0, c₁ = 1, c₂ = 2 and c₃ = 3; the dashed blue lines in (b) represent the points where ∇̃F_{α,t}(α, t) is not defined; see Equation (68).

Figure 11. Vanilla gradient field in blue (a) and natural gradient field and flows in red (b), together with level lines associated with F_η(α, t) over the marginal polytope, for c₀ = 0, c₁ = 1, c₂ = 2 and c₃ = 3.

Figure 12. Vanilla gradient field and flows in blue (a) and natural gradient field and flows in red (b) as in Figure 10, for c₀ = 0, c₁ = 1, c₂ = 2 and c₃ = −5/2.

Figure 13. Vanilla gradient field in blue (a) and natural gradient field and flows in red (b) as in Figure 11, for c₀ = 0, c₁ = 1, c₂ = 2 and c₃ = −5/2.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

