Information Geometry of Positive Measures and Positive-Definite Matrices: Decomposable Dually Flat Structure

Amari, Shun-ichi

doi:10.3390/e16042131


RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama 351-0198, Japan

Entropy 2014, 16(4), 2131-2145; https://doi.org/10.3390/e16042131

Submission received: 14 February 2014 / Revised: 9 April 2014 / Accepted: 10 April 2014 / Published: 14 April 2014

(This article belongs to the Special Issue Information Geometry)


Abstract

Information geometry studies the dually flat structure of a manifold, highlighted by the generalized Pythagorean theorem. The present paper studies a class of Bregman divergences called the (ρ, τ)-divergence. A (ρ, τ)-divergence generates a dually flat structure in the manifold of positive measures, as well as in the manifold of positive-definite matrices. The class is composed of decomposable divergences, which are written as a sum of componentwise divergences. Conversely, a decomposable dually flat divergence is shown to be a (ρ, τ)-divergence. A (ρ, τ)-divergence is determined from two monotone scalar functions, ρ and τ. The class includes the KL-divergence, α-, β- and (α, β)-divergences as special cases. The transformation between an affine parameter and its dual is easily calculated in the case of a decomposable divergence. Therefore, such a divergence is useful for obtaining the center for a cluster of points, which will be applied to classification and information retrieval in vision. For the manifold of positive-definite matrices, in addition to dual flatness and decomposability, we require invariance under linear transformations, in particular under orthogonal transformations. This opens a way to define a new class of divergences, called the (ρ, τ)-structure, in the manifold of positive-definite matrices.

Keywords:

information geometry; dually flat structure; decomposable divergence; (ρ,τ )-structure

1. Introduction

Information geometry, which originated from the invariant structure of a manifold of probability distributions, consists of a Riemannian metric and affine connections that are dually coupled with respect to the metric [1]. A manifold having a dually flat structure is particularly interesting and important. In such a manifold, there are two dually coupled affine coordinate systems and a canonical divergence, which is a Bregman divergence. The highlight is given by the generalized Pythagorean theorem and projection theorem. Information geometry is useful not only for statistical inference, but also for machine learning, pattern recognition, optimization and even for neural networks. It is also related to the statistical physics of Tsallis q-entropy [2–4].

The present paper studies a general and unique class of decomposable divergence functions in $R_{+}^{n}$, the manifold of n-dimensional positive measures. These are the (ρ, τ)-divergences, introduced by Zhang [5,6] from the point of view of “representation duality”. They are Bregman divergences generating a dually flat structure. The class includes the well-known Kullback-Leibler divergence, α-divergence, β-divergence and (α, β)-divergence [1,7–9] as special cases. The merit of a decomposable Bregman divergence is that the θ-η Legendre transformation is computationally tractable, where θ and η are two affine coordinate systems coupled by the Legendre transformation. When one uses a dually flat divergence to define the center of a cluster of elements, the center is easily given by the arithmetic mean of the dual coordinates of the elements [10,11]. However, we also need to calculate its primal coordinates, which requires the θ-η transformation. Hence, our new type of divergence has the advantage that the θ-coordinates are easy to calculate in clustering and related pattern matching problems. The most general class of dually flat divergences in $R_{+}^{n}$, not necessarily decomposable, is also given. These are again (ρ, τ)-divergences, with ρ and τ no longer required to be componentwise functions.

Positive-definite (PD) matrices appear in many engineering problems, such as convex programming, diffusion tensor analysis and multivariate statistical analysis [12–20]. The manifold, PD_n, of n × n PD matrices forms a cone, and its geometry is by itself an important subject of research. If we consider the submanifold consisting of only diagonal matrices, it is equivalent to the manifold of positive measures. Hence, PD matrices can be regarded as a generalization of positive measures. There are many studies on the geometry and divergences of the manifold of positive-definite matrices. We introduce a general class of dually flat divergences, the (ρ, τ)-divergence. We analyze the cases when a (ρ, τ)-divergence is invariant under the general linear transformations, Gl(n), and invariant under the orthogonal transformations, O(n). These cases not only include many well-known divergences of PD matrices, but also give new important divergences.

The present paper is organized as follows. Section 2 is preliminary, giving a short introduction to a dually flat manifold and the Bregman divergence. It also defines the cluster center due to a divergence. Section 3 defines the (ρ, τ)-structure in $R_{+}^{n}$. It gives dually flat decomposable affine coordinates and a related canonical divergence (Bregman divergence). Section 4 is devoted to the (ρ, τ)-structure of the manifold, PD_n, of PD matrices. We first study the class of divergences that are invariant under O(n). We further study a decomposable divergence that is invariant under Gl(n). It coincides with the invariant divergence derived from zero-mean Gaussian distributions with PD covariance matrices. These classes include not only various known divergences, but also remarkable new ones. Section 5 discusses a general class of non-decomposable flat divergences and miscellaneous topics. Section 6 gives concluding remarks.

2. Preliminaries to Information Geometry of Divergence

2.1. Dually Flat Manifold

A manifold is said to have the dually flat Riemannian structure, when it has two affine coordinate systems θ = (θ¹, · · · , θⁿ) and η = (η₁, · · · , η_n) (with respect to two flat affine connections) together with two convex functions, ψ(θ) and ϕ(η), such that the two coordinates are connected by the Legendre transformations:

η = \nabla ψ (θ), θ = \nabla ϕ (η),

(1)

where ∇ is the gradient operator. The Riemannian metric is given by:

(g_{i j} (θ)) = \nabla \nabla ψ (θ), (g^{i j} (η)) = \nabla \nabla ϕ (η)

(2)

in the respective coordinate systems. A curve that is linear in the θ-coordinates is called a θ-geodesic, and a curve linear in the η-coordinates is called an η-geodesic.

A dually flat manifold has a unique canonical divergence, which is the Bregman divergence defined by the convex functions,

D [P : Q] = ψ (θ_{P}) + ϕ (η_{Q}) - θ_{P} \cdot η_{Q},

(3)

where θ_P denotes the θ-coordinates of P, η_Q denotes the η-coordinates of Q and $θ_{P} \cdot η_{Q} = \sum_{i} θ_{P}^{i} η_{Q i}$, where $θ_{P}^{i}$ and $η_{Q i}$ are the components of θ_P and η_Q, respectively. The Pythagorean and projection theorems hold in a dually flat manifold:

Pythagorean Theorem

Given three points, P,Q,R, when the η-geodesic connecting P and Q is orthogonal to the θ-geodesic connecting Q and R with respect to the Riemannian metric,

D [P : Q] + D [Q : R] = D [P : R] .

(4)

Projection Theorem

Given a smooth submanifold, S, let P_S be the minimizer of the divergence from P to S,

P_{S} = \underset{Q \in S}{arg min} D [P : Q] .

(5)

Then, P_S is the η-geodesic projection of P to S, that is the η-geodesic connecting P and P_S is orthogonal to S.

We have the dual of the above theorems where θ- and η-geodesics are exchanged and D [P : Q] is replaced by its dual D [Q : P].
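The algebra behind the Pythagorean relation can be checked numerically. The following is a minimal sketch, assuming the decomposable potential ψ(θ) = ∑ exp(θⁱ) (chosen only for illustration, it is not singled out in the text) and hypothetical helper names. It verifies the three-point identity D[P : Q] + D[Q : R] − D[P : R] = (θ_P − θ_Q) · (η_R − η_Q), which follows directly from Equation (3), so that Equation (4) holds exactly when this inner product of coordinate differences vanishes.

```python
import math

# Assumed example potential: psi(theta) = sum_i exp(theta_i).
# Its Legendre dual is phi(eta) = sum_i (eta_i * log(eta_i) - eta_i),
# and the dual coordinates are eta_i = exp(theta_i).
def psi(theta):
    return sum(math.exp(t) for t in theta)

def phi(eta):
    return sum(e * math.log(e) - e for e in eta)

def to_eta(theta):
    return [math.exp(t) for t in theta]

def divergence(theta_p, theta_q):
    """Canonical divergence D[P : Q] of Equation (3)."""
    eta_q = to_eta(theta_q)
    dot = sum(t * e for t, e in zip(theta_p, eta_q))
    return psi(theta_p) + phi(eta_q) - dot

def three_point_gap(theta_p, theta_q, theta_r):
    """D[P : Q] + D[Q : R] - D[P : R]."""
    return (divergence(theta_p, theta_q)
            + divergence(theta_q, theta_r)
            - divergence(theta_p, theta_r))

def bilinear_term(theta_p, theta_q, theta_r):
    """(theta_P - theta_Q) . (eta_R - eta_Q)."""
    eta_q, eta_r = to_eta(theta_q), to_eta(theta_r)
    return sum((tp - tq) * (er - eq)
               for tp, tq, er, eq in zip(theta_p, theta_q, eta_r, eta_q))

# Three arbitrary points given by their theta-coordinates.
P = [0.2, -0.5, 1.0]
Q = [0.1, 0.3, -0.4]
R = [-0.7, 0.6, 0.9]

gap = three_point_gap(P, Q, R)
term = bilinear_term(P, Q, R)
print(gap, term)  # the two numbers agree up to rounding
```

The check uses a generic triangle, so the gap is nonzero; it only confirms that the deviation from Equation (4) is given by the bilinear term.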

2.2. Decomposable Divergence

A divergence, D [P : Q], is said to be decomposable, when it is written as a sum of component-wise divergences,

D [P : Q] = \sum_{i = 1}^{n} d (θ_{P}^{i}, θ_{Q}^{i}),

(6)

where $θ_{P}^{i}$ and $θ_{Q}^{i}$ are the components of θ_P and θ_Q and $d (θ_{P}^{i}, θ_{Q}^{i})$ is a scalar divergence function.

An f-divergence:

D_{f} [P : Q] = \sum p_{i} f (\frac{q_{i}}{p_{i}})

(7)

is a typical example of a decomposable divergence in the manifold of probability distributions, where P = (p_i) and Q = (q_i) are two probability vectors with ∑p_i = ∑q_i = 1. A convex function, ψ(θ), is said to be decomposable, when it is written as:

ψ (θ) = \sum_{i = 1}^{n} \tilde{ψ} (θ^{i})

(8)

by using a scalar convex function, ψ̃(θ). The Bregman divergence derived from a decomposable convex function is decomposable.

When ψ(θ) is a decomposable convex function, its Legendre dual is also decomposable. The Legendre transformation is given componentwise as:

η_{i} = {\tilde{ψ}}^{'} (θ^{i}),

(9)

where ′ denotes differentiation, so that it is computationally tractable. Its inverse transformation is also componentwise,

θ^{i} = {\tilde{ϕ}}^{'} (η_{i}),

(10)

where ϕ̃ is the Legendre dual of ψ̃.
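As an illustration of Equations (9) and (10), the sketch below applies the componentwise Legendre transformation for one assumed scalar convex function, ψ̃(θ) = exp(θ), whose Legendre dual is ϕ̃(η) = η log η − η. The function names are hypothetical and the choice of ψ̃ is only an example.

```python
import math

# Assumed scalar convex function: psi_tilde(theta) = exp(theta).
def psi_tilde_prime(theta):
    """Equation (9): eta = psi_tilde'(theta) = exp(theta)."""
    return math.exp(theta)

def phi_tilde_prime(eta):
    """Equation (10): theta = phi_tilde'(eta) = log(eta)."""
    return math.log(eta)

def theta_to_eta(theta):
    # The transformation acts on each component independently.
    return [psi_tilde_prime(t) for t in theta]

def eta_to_theta(eta):
    return [phi_tilde_prime(e) for e in eta]

theta = [-1.0, 0.0, 2.5]
eta = theta_to_eta(theta)
theta_back = eta_to_theta(eta)
print(eta, theta_back)
```

Each direction costs one scalar function evaluation per component, which is the sense in which the transformation is tractable.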

2.3. Cluster Center

Consider a cluster of points P₁, · · · , P_m whose θ-coordinates are θ₁, · · · , θ_m and whose η-coordinates are η₁, · · · , η_m. The center, R, of the cluster with respect to the divergence, D [P : Q], is defined by:

R = \underset{Q}{arg min} \sum_{i = 1}^{m} D [Q : P_{i}] .

(11)

By differentiating ∑D [Q : P_i] by θ (the θ-coordinates of Q), we have:

\nabla ψ (θ_{R}) = \frac{1}{m} \sum_{i = 1}^{m} η_{i} .

(12)

Hence, the cluster-center theorem due to Banerjee et al. [10] follows; see also [11]:

Cluster-Center Theorem

The η-coordinates η_R of the cluster center are given by the arithmetic average of the η-coordinates of the points in the cluster:

η_{R} = \frac{1}{m} \sum_{i = 1}^{m} η_{i} .

(13)

When we need the θ-coordinates of the cluster center, they are given by the θ-η transformation from η_R,

θ_{R} = \nabla ϕ (η_{R}) .

(14)

However, in many cases, the transformation is computationally heavy and intractable when the dimension of the manifold is large. The transformation is easy in the case of a decomposable divergence. This is the motivation for considering a general class of decomposable Bregman divergences.
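The cluster-center computation of Equations (13) and (14) can be sketched for a decomposable divergence as follows. The potential ψ̃(θ) = exp(θ) and all function names are assumptions made for illustration; the sketch averages the η-coordinates, maps the result back to θ-coordinates componentwise, and compares the objective of Equation (11) at the center with its value at a nearby point.

```python
import math

def divergence(theta_p, theta_q):
    """D[P : Q] of Equation (3) for the assumed psi(theta) = sum exp(theta_i)."""
    total = 0.0
    for tp, tq in zip(theta_p, theta_q):
        eta_q = math.exp(tq)
        # psi_tilde(theta_p) + phi_tilde(eta_q) - theta_p * eta_q
        total += math.exp(tp) + (eta_q * tq - eta_q) - tp * eta_q
    return total

def cluster_center(thetas):
    """Average the eta-coordinates (13), then map back to theta (14)."""
    m, n = len(thetas), len(thetas[0])
    eta_r = [sum(math.exp(th[i]) for th in thetas) / m for i in range(n)]
    theta_r = [math.log(e) for e in eta_r]  # componentwise inverse transform
    return theta_r, eta_r

def total_divergence(theta_q, thetas):
    """Objective of Equation (11): sum_i D[Q : P_i]."""
    return sum(divergence(theta_q, th) for th in thetas)

points = [[0.0, 1.0], [1.0, -1.0], [0.5, 0.2]]
theta_r, eta_r = cluster_center(points)
best = total_divergence(theta_r, points)
nearby = total_divergence([theta_r[0] + 0.05, theta_r[1] - 0.05], points)
print(theta_r, best, nearby)
```

Because the objective is convex in θ, any perturbation away from the computed center should increase it, which is what the final comparison probes.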

3. (ρ, τ ) Dually Flat Structure in $R_{+}^{n}$

3.1. (ρ, τ )-Coordinates of $R_{+}^{n}$

Let $R_{+}^{n}$ be the manifold of positive measures over n elements x₁, · · · , x_n. A measure (or a weight) of x_i is given by:

ξ_{i} = m (x_{i}) > 0

(15)

and ξ = (ξ₁, · · · , ξ_n) is a distribution of measures. When ∑ξ_i = 1 is satisfied, it is a probability measure. We write:

R_{+}^{n} = {ξ ∣ ξ_{i} > 0}

(16)

and ξ forms a coordinate system of $R_{+}^{n}$ .

Let ρ(ξ) and τ (ξ) be two monotonically increasing differentiable functions. We call:

θ = ρ (ξ), η = τ (ξ)

(17)

the ρ- and τ -representations of positive measure ξ. This is a generalization of the ±α representations [1] and was introduced in [5] for a manifold of probability distributions. See also [6].

By using these functions, we construct new coordinate systems θ and η of $R_{+}^{n}$ . They are given, for θ = (θⁱ) and η = (η_i), by componentwise relations,

θ^{i} = ρ (ξ_{i}), η_{i} = τ (ξ_{i}) .

(18)

They are called the ρ- and τ-representations of $ξ \in R_{+}^{n}$, respectively. We search for convex functions, ψ_ρ,τ (θ) and ϕ_ρ,τ (η), which are Legendre duals to each other, such that θ and η are two dually coupled affine coordinate systems.

3.2. Convex Functions

We introduce two scalar functions of θ and η by:

{\tilde{ψ}}_{ρ, τ} (θ) = \int_{0}^{ρ^{- 1} (θ)} τ (ξ) ρ^{'} (ξ) d ξ,

(19)

{\tilde{ϕ}}_{ρ, τ} (η) = \int_{0}^{τ^{- 1} (η)} ρ (ξ) τ^{'} (ξ) d ξ .

(20)

Then, the first and second derivatives of ψ̃_ρ,τ are:

{\tilde{ψ}}_{ρ, τ}^{'} (θ) = τ (ξ),

(21)

{\tilde{ψ}}_{ρ, τ}^{''} (θ) = \frac{τ^{'} (ξ)}{ρ^{'} (ξ)} .

(22)

Since ρ′(ξ) > 0, τ′ (ξ) > 0, we see that ψ̃_ρ,τ (θ) is a convex function. So is ϕ̃_ρ,τ (η). Moreover, they are Legendre duals, because:

{\tilde{ψ}}_{ρ, τ} (θ) + {\tilde{ϕ}}_{ρ, τ} (η) - θ η = \int_{0}^{ξ} τ (ξ) ρ^{'} (ξ) d ξ + \int_{0}^{ξ} ρ (ξ) τ^{'} (ξ) d ξ - ρ (ξ) τ (ξ)

(23)

= 0.

(24)
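The step from Equation (23) to Equation (24) is the product rule written in integral form. A short derivation, assuming that the boundary term ρ(0)τ(0) vanishes (an assumption not stated explicitly in the text; it holds, for example, for power functions with positive exponents):

```latex
\frac{d}{d\xi}\bigl[\rho(\xi)\,\tau(\xi)\bigr]
  = \tau(\xi)\,\rho'(\xi) + \rho(\xi)\,\tau'(\xi)
\;\Longrightarrow\;
\int_{0}^{\xi} \tau(s)\,\rho'(s)\,ds + \int_{0}^{\xi} \rho(s)\,\tau'(s)\,ds
  = \rho(\xi)\,\tau(\xi) - \rho(0)\,\tau(0) .
```

With ρ(0)τ(0) = 0, the right-hand side equals θη, so the three terms in Equation (23) cancel.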

We then define two decomposable convex functions of θ and η by:

ψ_{ρ, τ} (θ) = \sum {\tilde{ψ}}_{ρ, τ} (θ^{i}),

(25)

ϕ_{ρ, τ} (η) = \sum {\tilde{ϕ}}_{ρ, τ} (η_{i}) .

(26)

They are Legendre duals to each other.

3.3. (ρ, τ )-Divergence

The (ρ, τ)-divergence between two points, ξ, $ξ^{'} \in R_{+}^{n}$, is defined by:

D_{ρ, τ} [ξ : ξ^{'}] = ψ_{ρ, τ} (θ) + ϕ_{ρ, τ} (η^{'}) - θ \cdot η^{'}

(27)

= \sum_{i = 1}^{n} [\int_{0}^{ξ_{i}} τ (ξ) ρ^{'} (ξ) d ξ + \int_{0}^{ξ_{i}^{'}} ρ (ξ) τ^{'} (ξ) d ξ - ρ (ξ_{i}) τ (ξ_{i}^{'})],

(28)

where θ and η′ are ρ- and τ -representations of ξ and ξ′, respectively.

The (ρ, τ )-divergence gives a dually flat structure having θ and η as affine and dual affine coordinate systems. This is originally due to Zhang [5] and a generalization of our previous results concerning the q and deformed exponential families [4]. The transformation between θ and η is simple in the (ρ, τ )-structure, because it can be done componentwise,

θ^{i} = ρ {τ^{- 1} (η_{i})},

(29)

η_{i} = τ {ρ^{- 1} (θ^{i})} .

(30)
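A minimal numerical sketch of the (ρ, τ)-divergence and of the componentwise transformations (29) and (30) is given below. It assumes the illustrative pair ρ(ξ) = ξ and τ(ξ) = ξ²/2, and evaluates the integrals of Equations (19) and (20) with a composite Simpson rule; the pair, the quadrature routine and all names are hypothetical choices, not taken from the text.

```python
import math

def simpson(f, a, b, n=200):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3

# Assumed monotone pair: rho(xi) = xi, tau(xi) = xi^2 / 2.
def rho(x): return x
def rho_prime(x): return 1.0
def rho_inverse(t): return t
def tau(x): return 0.5 * x * x
def tau_prime(x): return x
def tau_inverse(e): return math.sqrt(2.0 * e)

def rho_tau_divergence(xi, xi_prime):
    """psi(theta) + phi(eta') - theta . eta', summed over components."""
    total = 0.0
    for x, y in zip(xi, xi_prime):
        psi_t = simpson(lambda s: tau(s) * rho_prime(s), 0.0, x)  # Eq. (19)
        phi_t = simpson(lambda s: rho(s) * tau_prime(s), 0.0, y)  # Eq. (20)
        total += psi_t + phi_t - rho(x) * tau(y)
    return total

def theta_to_eta(theta):
    """Equation (30): eta_i = tau(rho^{-1}(theta^i))."""
    return [tau(rho_inverse(t)) for t in theta]

def eta_to_theta(eta):
    """Equation (29): theta^i = rho(tau^{-1}(eta_i))."""
    return [rho(tau_inverse(e)) for e in eta]

d_same = rho_tau_divergence([1.0, 2.0], [1.0, 2.0])
d_diff = rho_tau_divergence([1.0], [2.0])
round_trip = eta_to_theta(theta_to_eta([0.5, 3.0]))
print(d_same, d_diff, round_trip)
```

For this pair the integrands are quadratic, so the Simpson rule is exact up to rounding; for a general pair one would need closed forms or a more careful quadrature.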

The Riemannian metric is:

g_{i j} (ξ) = \frac{τ^{'} (ξ_{i})}{ρ^{'} (ξ_{i})} δ_{i j},

(31)

and hence Euclidean, because the Riemann-Christoffel curvature due to the Levi-Civita connection vanishes, too.

The following theorem is new, characterizing the (ρ, τ )-divergence.

Theorem 1

The (ρ, τ )-divergences form a unique class of divergences in $R_{+}^{n}$ that are dually flat and decomposable.

3.4. Biduality: α-(ρ, τ ) Divergence

We have dually flat connections, (∇_ρ,τ, $\nabla_{ρ, τ}^{*}$ ), represented in terms of covariant derivatives, which are derived from D_ρ,τ. This is called the representation duality by Zhang [5]. We further have the α-(ρ, τ ) connections defined by:

\nabla_{ρ, τ}^{(α)} = \frac{1 + α}{2} \nabla_{ρ, τ} + \frac{1 - α}{2} \nabla_{ρ, τ}^{*} .

(32)

The α-(−α) duality is called the reference duality [5]. Therefore, $\nabla_{ρ, τ}^{(α)}$ possesses the biduality, one concerning α and (−α), and the other with respect to ρ and τ.

The Riemann-Christoffel curvature of $\nabla_{ρ, τ}^{(α)}$ is:

R_{ρ, τ}^{(α)} = \frac{1 - α^{2}}{4} R_{ρ, τ}^{(0)} = 0

(33)

for any α. Hence, there exist a unique canonical divergence $D_{ρ, τ}^{(α)}$ and α-(ρ, τ) affine coordinate systems. It is an interesting future problem to obtain their explicit forms.

3.5. Various Examples

As a special case of the (ρ, τ )-divergence, we have the (α, β)-divergence obtained from the following power functions,

ρ (ξ) = \frac{1}{α} ξ^{α}, τ (ξ) = \frac{1}{β} ξ^{β} .

(34)

This was introduced by Cichocki, Cruces and Amari in [7,8].

The affine and dual affine coordinates are:

θ^{i} = \frac{1}{α} {(ξ_{i})}^{α}, η_{i} = \frac{1}{β} {(ξ_{i})}^{β}

(35)

and the convex functions are:

ψ (θ) = c_{α, β} \sum θ_{i}^{\frac{α + β}{α}}, ϕ (η) = c_{β, α} \sum η_{i}^{\frac{α + β}{β}},

(36)

where:

c_{α, β} = \frac{1}{β (α + β)} α^{\frac{α + β}{α}} .

(37)

The induced (α, β)-divergence has a simple form,

D_{α, β} [ξ : ξ^{'}] = \frac{1}{α β (α + β)} \sum {α ξ_{i}^{α + β} + β {(ξ_{i}^{'})}^{α + β} - (α + β) ξ_{i}^{α} {(ξ_{i}^{'})}^{β}},

(38)

for ξ, $ξ^{'} \in R_{+}^{n}$. It is defined similarly in the manifold, S_n, of probability distributions, but it is not a Bregman divergence in S_n. This is because the total mass constraint ∑ξ_i = 1 is not linear in θ- or η-coordinates in general.
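The closed form of the (α, β)-divergence can be cross-checked against its Bregman form ψ(θ) + ϕ(η′) − θ · η′ built from Equations (35) to (37). The sketch below does this for one arbitrary choice of positive exponents; the function names and test values are hypothetical.

```python
def ab_divergence(xi, xi_p, a, b):
    """Closed form: sum of a*x^(a+b) + b*y^(a+b) - (a+b)*x^a*y^b, scaled."""
    s = sum(a * x ** (a + b) + b * y ** (a + b) - (a + b) * x ** a * y ** b
            for x, y in zip(xi, xi_p))
    return s / (a * b * (a + b))

def ab_bregman(xi, xi_p, a, b):
    """Bregman form psi(theta) + phi(eta') - theta . eta'."""
    c_ab = a ** ((a + b) / a) / (b * (a + b))   # Equation (37)
    c_ba = b ** ((a + b) / b) / (a * (a + b))
    theta = [x ** a / a for x in xi]            # Equation (35)
    eta_p = [y ** b / b for y in xi_p]
    psi = c_ab * sum(t ** ((a + b) / a) for t in theta)   # Equation (36)
    phi = c_ba * sum(e ** ((a + b) / b) for e in eta_p)
    return psi + phi - sum(t * e for t, e in zip(theta, eta_p))

xi = [0.5, 1.2, 2.0]
xi_p = [0.7, 1.0, 1.5]
closed = ab_divergence(xi, xi_p, 0.5, 1.5)
bregman = ab_bregman(xi, xi_p, 0.5, 1.5)
print(closed, bregman)  # the two values agree up to rounding
```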

The α-divergence is a special case of the (α, β)-divergence, so that it is a (ρ, τ )-divergence. By putting:

ρ (ξ) = \frac{2}{1 - α} ξ^{\frac{1 - α}{2}}, τ (ξ) = \frac{2}{1 + α} ξ^{\frac{1 + α}{2}},

(39)

we have:

D_{α} [ξ : ξ^{'}] = \frac{4}{1 - α^{2}} \sum {\frac{1 - α}{2} ξ_{i} + \frac{1 + α}{2} ξ_{i}^{'} - ξ_{i}^{\frac{1 - α}{2}} {(ξ_{i}^{'})}^{\frac{1 + α}{2}}} .

(40)
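The relation between the α-divergence and the KL-divergence can be probed numerically. The sketch below uses the α-divergence obtained from the power functions of Equation (39), namely (4/(1 − α²)) ∑ [((1 − α)/2) ξ_i + ((1 + α)/2) ξ′_i − ξ_i^{(1−α)/2} ξ′_i^{(1+α)/2}], and compares it, for α close to −1, with the KL-divergence extended to positive measures, ∑ [ξ_i log(ξ_i/ξ′_i) − ξ_i + ξ′_i]. That extended KL form and the function names are assumptions of this sketch; the form is not written out in the text.

```python
import math

def alpha_divergence(xi, xi_p, alpha):
    """alpha-divergence built from the power functions of Equation (39)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    s = sum(a * x + b * y - x ** a * y ** b for x, y in zip(xi, xi_p))
    return 4.0 / (1 - alpha ** 2) * s

def kl_positive(xi, xi_p):
    """Assumed KL-divergence for positive measures (not normalized)."""
    return sum(x * math.log(x / y) - x + y for x, y in zip(xi, xi_p))

xi = [0.5, 1.2, 2.0]
xi_p = [0.7, 1.0, 1.5]
near_limit = alpha_divergence(xi, xi_p, -1 + 1e-6)
kl = kl_positive(xi, xi_p)
print(near_limit, kl)  # close to each other
```

By the symmetry of the formula, letting α approach +1 instead yields the same comparison with the arguments of the KL-divergence exchanged.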

The β-divergence [19] is obtained from:

ρ (ξ) = ξ, τ (ξ) = \frac{1}{β} ξ^{β} .

(41)

It is written as:

D_{β} [ξ : ξ^{'}] = \frac{1}{β (β + 1)} \sum_{i} [ξ_{i}^{β + 1} + β {(ξ_{i}^{'})}^{β + 1} - (β + 1) ξ_{i} {(ξ_{i}^{'})}^{β}] .

(42)

The β-divergence is special in the sense that it gives a dually flat structure, even in S_n. This is because ρ(ξ) is linear in ξ.

The classes of α-divergences and β-divergences intersect at the KL-divergence, and their duals are different in general. They are the only intersecting points of the two classes.

When ρ(ξ) = ξ and τ (ξ) = U′ (ξ) where U is a convex function, (ρ, τ )-divergence is Eguchi’s U-divergence [21].

Zhang had already introduced an (α, β)-divergence in [5], which is not a (ρ, τ)-divergence in $R_{+}^{n}$ and is different from ours. We regret the confusion caused by our naming of the (α, β)-divergence.

4. Invariant, Flat Decomposable Divergences in the Manifold of Positive-Definite Matrices

4.1. Invariant and Decomposable Convex Function

Let P be a positive-definite matrix and ψ(P) be a convex function. Then, a Bregman divergence is defined between two positive definite matrices, P and Q, by:

D [P : Q] = ψ (P) - ψ (Q) - \nabla ψ (Q) \cdot (P - Q)

(43)

where ∇ is the gradient operator with respect to matrix P = (P_ij), so that ∇ψ(P) is a matrix and the inner product of two matrices is defined by:

\nabla ψ (Q) \cdot P = tr {\nabla ψ (Q) P},

(44)

where tr is the trace of a matrix.

It induces a dually flat structure to the manifold of positive-definite matrices, where the affine coordinate system (θ-coordinates) is Θ = P and the dual affine coordinate system (η-coordinates) is:

H = \nabla ψ (P) .

(45)

A convex function, ψ(P), is said to be invariant under the orthogonal group O(n), when:

ψ (P) = ψ (O^{T} PO)

(46)

holds for any orthogonal transformation O, where O^T is the transpose of O. An invariant function is written as a symmetric function of n eigenvalues λ₁, · · · , λ_n of P. See Dhillon and Tropp [12]. When an invariant convex function of P is written, by using a convex function, f, of one variable, in the additive form:

ψ (P) = \sum f (λ_{i}),

(47)

it is said to be decomposable. We have:

ψ (P) = tr f (P) .

(48)

4.2. Invariant, Flat and Decomposable Divergence

A divergence D [P : Q] is said to be invariant under O(n), when it satisfies:

D [P : Q] = D [O^{T} PO : O^{T} QO] .

(49)

When it is derived from a decomposable convex function, ψ(P), it is invariant, flat and decomposable.

We give well-known examples of decomposable convex functions and the divergences derived from them:

(1)

For f(λ) = (1/2)λ², we have:

ψ (P) = \frac{1}{2} \sum λ_{i}^{2},

(50)

D [P : Q] = \frac{1}{2} ∣ ∣ P - Q ∣ ∣^{2},

(51)

where ||P||² is the Frobenius norm:

∣ ∣ P ∣ ∣^{2} = \sum P_{i j}^{2} .

(52)

(2)

For f(λ) = −log λ

ψ (P) = - log (det ∣ P ∣),

(53)

D [P : Q] = tr (P Q^{- 1}) - log (det ∣ P Q^{- 1} ∣) - n .

(54)

The affine coordinate system is P, and the dual coordinate system is P⁻¹. The derived geometry is the same as that of multivariate Gaussian probability distributions with mean zero and covariance matrix P.

(3)

For f(λ) = λ log λ − λ,

ψ (P) = tr (P log P - P),

(55)

D [P : Q] = tr (P log P - P log Q - P + Q) .

(56)

This divergence is used in quantum information theory. The affine coordinate system is P, and the dual affine coordinate system is log P; ψ(P) is called the negative von Neumann entropy.
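As a worked instance of how these examples follow from the matrix Bregman form ψ(P) − ψ(Q) − tr{∇ψ(Q)(P − Q)}, the derivation of Equation (54) can be sketched as follows. It uses the standard matrix-calculus fact that the gradient of −log det Q is −Q⁻¹ for symmetric positive-definite Q, which is assumed here rather than taken from the text.

```latex
\psi(P) = -\log\det P, \qquad \nabla\psi(Q) = -Q^{-1},
\\[4pt]
\begin{aligned}
D[P:Q] &= \psi(P) - \psi(Q) - \operatorname{tr}\{\nabla\psi(Q)\,(P - Q)\} \\
       &= -\log\det P + \log\det Q + \operatorname{tr}\{Q^{-1}(P - Q)\} \\
       &= \operatorname{tr}(P Q^{-1}) - \log\det(P Q^{-1}) - n .
\end{aligned}
```

The last line uses tr(Q⁻¹P) = tr(PQ⁻¹) and tr(Q⁻¹Q) = n. The other two examples follow in the same way with ∇ψ(Q) = Q and ∇ψ(Q) = log Q, respectively.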

4.3. (ρ, τ )-Structure in Positive Definite Matrices

We extend the (ρ, τ )-structure in the previous section to the matrix case and introduce a general dually flat invariant decomposable divergence in the manifold of positive-definite matrices. Let:

Θ = ρ (P), H = τ (P)

(57)

be ρ- and τ -representations of matrices. We use two functions, ψ̃_ρ,τ (θ) and ϕ̃_ρ,τ (η), defined in Equations (19) and (20), for defining a pair of dually coupled invariant and decomposable convex functions,

ψ (Θ) = tr {\tilde{ψ}}_{ρ, τ} {Θ},

(58)

ϕ (H) = tr {\tilde{ϕ}}_{ρ, τ} {H} .

(59)

They are not convex with respect to P, but are convex with respect to Θ and H, respectively. The derived Bregman divergence is:

D [P : Q] = ψ {Θ (P)} + ϕ {H (Q)} - Θ (P) \cdot H (Q) .

(60)

Theorem 2

The (ρ, τ)-divergences form a unique class of invariant, decomposable and dually flat divergences in the manifold of positive-definite matrices.

The Euclidean, Gaussian and von Neumann divergences given in Equations (51), (54) and (56) are special examples of (ρ, τ)-divergences. They are given, respectively, by:

(1)

ρ (ξ) = τ (ξ) = ξ,

(61)

(2)

ρ (ξ) = ξ, τ (ξ) = - \frac{1}{ξ},

(62)

(3)

ρ (ξ) = ξ, τ (ξ) = log ξ .

(63)

When ρ and τ are power functions, we have the (α, β)-structure in the manifold of positive-definite matrices.

(4)

(α, β)-divergence.

By using the power functions of Equation (34) without the normalizing factors 1/α and 1/β, that is, ρ(ξ) = ξ^α and τ(ξ) = ξ^β, we have:

ψ (Θ) = \frac{α}{α + β} tr Θ^{\frac{α + β}{α}} = \frac{α}{α + β} tr P^{α + β},

(64)

ϕ (H) = \frac{β}{α + β} tr H^{\frac{α + β}{β}} = \frac{β}{α + β} tr P^{α + β}

(65)

so that the (α, β)-divergence of matrices is:

D [P : Q] = tr {\frac{α}{α + β} P^{α + β} + \frac{β}{α + β} Q^{α + β} - P^{α} Q^{β}} .

(66)

This is a Bregman divergence, where the affine coordinate system is Θ = P^α and its dual is H = P^β.

(5)

The α-divergence is derived as:

Θ (P) = \frac{2}{1 - α} P^{\frac{1 - α}{2}},

(67)

ψ (Θ) = \frac{2}{1 + α} tr P,

(68)

D_{α} [P : Q] = \frac{4}{1 - α^{2}} tr (- P^{\frac{1 - α}{2}} Q^{\frac{1 + α}{2}} + \frac{1 - α}{2} P + \frac{1 + α}{2} Q) .

(69)

The affine coordinate system is $\frac{2}{1 - α} P^{\frac{1 - α}{2}}$ , and its dual is $\frac{2}{1 + α} P^{\frac{1 + α}{2}}$ .

(6)

The β-divergence is derived from Equation (41) as:

D_{β} [P : Q] = \frac{1}{β (β + 1)} tr [P^{β + 1} + β Q^{β + 1} - (β + 1) P Q^{β}] .

(70)
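The O(n)-invariance of these matrix divergences can be illustrated on 2 × 2 matrices. The sketch below evaluates the α-divergence of Equation (69) using a closed-form eigen-decomposition of symmetric 2 × 2 matrices and checks that it is unchanged when P and Q are both conjugated by the same rotation. The helper names, the test matrices and the restriction to n = 2 are assumptions made to keep the example dependency-free.

```python
import math

def sym_fun(m, f):
    """Apply scalar f to a symmetric 2x2 matrix via its eigen-decomposition."""
    (a, b), (_, c) = m
    mean, half = (a + c) / 2, (a - c) / 2
    r = math.hypot(half, b)
    f1, f2 = f(mean + r), f(mean - r)          # f at the two eigenvalues
    ang = 0.5 * math.atan2(2 * b, a - c)       # angle of the first eigenvector
    cs, sn = math.cos(ang), math.sin(ang)
    off = (f1 - f2) * cs * sn
    return [[f1 * cs * cs + f2 * sn * sn, off],
            [off, f1 * sn * sn + f2 * cs * cs]]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(x):
    return x[0][0] + x[1][1]

def alpha_divergence(p, q, alpha):
    """Matrix alpha-divergence of Equation (69)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    pa = sym_fun(p, lambda lam: lam ** a)
    qb = sym_fun(q, lambda lam: lam ** b)
    return 4.0 / (1 - alpha ** 2) * (a * trace(p) + b * trace(q)
                                     - trace(matmul(pa, qb)))

def rotate(m, angle):
    """Return O^T m O for the rotation O by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    o, ot = [[c, -s], [s, c]], [[c, s], [-s, c]]
    return matmul(matmul(ot, m), o)

P = [[2.0, 0.5], [0.5, 1.0]]
Q = [[1.5, -0.3], [-0.3, 0.8]]
d_plain = alpha_divergence(P, Q, 0.3)
d_rotated = alpha_divergence(rotate(P, 0.7), rotate(Q, 0.7), 0.3)
print(d_plain, d_rotated)  # equal up to rounding
```

For larger n, one would use a numerical eigen-solver instead of the closed-form 2 × 2 decomposition; the structure of the computation stays the same.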

4.4. Invariance Under Gl(n)

We extend the concept of invariance under the orthogonal group to that under the general linear group, Gl(n), that is the set of invertible matrices, L, det |L| ≠ 0. This is a stronger condition. A divergence is said to be invariant under Gl(n), when:

D [P : Q] = D [L^{T} PL : L^{T} QL]

(71)

holds for any L ∈ Gl(n).

We identify matrix P with the zero-mean Gaussian distribution:

p (x, P) = exp {- \frac{1}{2} x^{T} P^{- 1} x - \frac{1}{2} log det ∣ P ∣ - c},

(72)

where c is a constant. We know that an invariant divergence belongs to the class of f-divergences in the case of a manifold of probability distributions, where the invariance means the geometry does not change under a one-to-one mapping of x to y. Moreover, the only invariant flat divergence is the KL-divergence [22]. These facts suggest the following conjecture.

Proposition

The invariant, flat and decomposable divergence under Gl(n) is the KL-divergence given by:

D_{K L} [P : Q] = tr (P Q^{- 1}) - log (det ∣ P Q^{- 1} ∣) - n .

(73)
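That the KL-divergence of Equation (73) satisfies the invariance condition of Equation (71) can be checked numerically on 2 × 2 matrices. The sketch below does so for one arbitrary invertible L; it only illustrates the invariance of this particular divergence and says nothing about uniqueness. The helper names and test matrices are hypothetical.

```python
import math

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(x):
    return [[x[0][0], x[1][0]], [x[0][1], x[1][1]]]

def det(x):
    return x[0][0] * x[1][1] - x[0][1] * x[1][0]

def inverse(x):
    d = det(x)
    return [[x[1][1] / d, -x[0][1] / d], [-x[1][0] / d, x[0][0] / d]]

def kl_divergence(p, q):
    """tr(P Q^{-1}) - log det(P Q^{-1}) - n with n = 2."""
    m = matmul(p, inverse(q))
    return m[0][0] + m[1][1] - math.log(det(m)) - 2

def congruence(l, m):
    """Return L^T m L, the transformation appearing in Equation (71)."""
    return matmul(matmul(transpose(l), m), l)

P = [[2.0, 0.5], [0.5, 1.0]]
Q = [[1.5, -0.3], [-0.3, 0.8]]
L = [[1.0, 2.0], [0.5, -1.5]]   # invertible: det(L) = -2.5

d_before = kl_divergence(P, Q)
d_after = kl_divergence(congruence(L, P), congruence(L, Q))
print(d_before, d_after)  # equal up to rounding
```

The invariance follows because (LᵀPL)(LᵀQL)⁻¹ = Lᵀ(PQ⁻¹)L⁻ᵀ is similar to PQ⁻¹, so the trace and determinant are unchanged.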

5. Non-Decomposable Divergence

We have focused on flat and decomposable divergences. There are many interesting non-decomposable divergences. We first discuss a general class of flat divergences in $R_{+}^{n}$ and then touch upon interesting flat and non-flat divergences in the manifold of positive-definite matrices.

5.1. General Class of Flat Divergences in $R_{+}^{n}$

We can describe a general class of flat divergences in $R_{+}^{n}$, which are not necessarily decomposable. This class is introduced in [23], which studies the conformal structure of general total Bregman divergences ([11,13]). When $R_{+}^{n}$ is endowed with a dually flat structure, it has a θ-coordinate system given by:

θ = ρ (ξ)

(74)

which is not necessarily a componentwise function. Any pair of invertible θ = ρ(ξ) and convex function ψ(θ) defines a dually flat structure and, hence, a Bregman divergence in $R_{+}^{n}$ .

The dual coordinates η = τ (ξ) are given by:

η = \nabla ψ (θ)

(75)

so that we have:

η = τ (ξ) = \nabla ψ {ρ (ξ)} .

(76)

This implies that a pair (ρ, τ ) of coordinate systems can define dually coupled affine coordinates and, hence, a dually flat structure, when and only when η = τ {ρ⁻¹(θ)} is a gradient of a convex function.

This is different from the case of decomposable divergence, where any monotone pair of ρ(ξ) and τ (ξ) gives a dually flat structure.

5.2. Non-Decomposable Flat Divergence in PD_n

Ohara and Eguchi [15,16] introduced the following function:

ψ_{V} (P) = V (det ∣ P ∣),

(77)

where V (ξ) is a monotonically decreasing scalar function. ψ_V is convex when and only when:

1 + \frac{V^{″} (ξ) ξ}{V^{'} (ξ)} < \frac{1}{n} .

(78)

In such a case, we can introduce dually flat structure to PD_n, where P is an affine coordinate system with convex ψ_V (P), and the dual affine coordinate system is:

H = V^{'} (det ∣ P ∣) det ∣ P ∣ P^{- 1} .

(79)

The derived divergence is:

D_{V} [P : Q] = V (det ∣ P ∣) - V (det ∣ Q ∣)

(80)

+ V^{'} (det ∣ Q ∣) det ∣ Q ∣ tr {Q^{- 1} (Q - P)} .

(81)

When V (ξ) = −log ξ, it reduces to the case of Equation (54), which is invariant under Gl(n) and decomposable. However, the divergence D_V [P : Q] is not decomposable. It is invariant under O(n) and more strongly so under SGl(n) ⊂ Gl(n), defined by det |L| = ±1.

5.3. Flat Structure Derived from q-Escort Distribution

A dually flat structure is introduced in the manifold of probability distributions [4] as:

{\tilde{D}}_{q} [p : q] = \frac{1}{1 - q} \frac{1}{H_{q} (p)} (1 - \sum p_{i}^{1 - q} q_{i}^{q}),

(82)

where:

H_{q} (p) = \sum p_{i}^{q},

(83)

q = \frac{1 + α}{2} .

(84)

The dual affine coordinates are the q-escort distribution [4]:

η_{i} = \frac{1}{H_{q} (p)} p_{i}^{q} .

(85)

The divergence, D̃_q, is flat, but not decomposable.
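The q-escort distribution of Equation (85) is straightforward to compute. A minimal sketch, with a hypothetical function name and an arbitrary example distribution:

```python
def escort(p, q):
    """q-escort distribution: eta_i = p_i^q / H_q(p), with H_q(p) = sum_j p_j^q."""
    h_q = sum(x ** q for x in p)
    return [x ** q / h_q for x in p]

p = [0.5, 0.3, 0.2]
eta = escort(p, 0.5)
print(eta, sum(eta))
```

Each η_i depends on every component of p through the normalizer H_q(p), which is one way to see why this structure is not decomposable.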

We can generalize it to the case of PD_n,

{\tilde{D}}_{q} [P : Q] = \frac{1}{1 - q} \frac{1}{tr P^{q}} {(1 - q) tr (P) + q tr (Q) - tr (P^{1 - q} Q^{q})} .

(86)

This is flat, but not decomposable.

5.4. γ-Divergence in PD_n

The γ-divergence is introduced by Fujisawa and Eguchi [24]. It gives a super-robust estimator. It is interesting to generalize it to PD_n,

D_{γ} [P : Q] = \frac{1}{γ (γ - 1)} {log tr P^{γ} + (γ - 1) log tr Q^{γ} - γ log tr P Q^{γ - 1}} .

(87)

This is neither flat nor decomposable. It is a projective divergence in the sense that, for any c, c′ > 0,

D_{γ} [c P : c^{'} Q] = D_{γ} [P : Q] .

(88)

Therefore, it can be defined in the submanifold of tr P = 1.

6. Concluding Remarks

We have shown that the (ρ, τ)-divergence introduced by Zhang [5] is a general dually flat decomposable structure of the manifold of positive measures. We then extended it to the manifold of positive-definite matrices, where the criterion of invariance under linear transformations (in particular, under orthogonal transformations) was added. The decomposability is useful from the computational point of view, because the θ-η transformation is tractable. This is the motivation for studying decomposable flat divergences.

When we treat the manifold of probability distributions, it is a submanifold of the manifold of positive measures, where the total sum of measures is restricted to one. This is a nonlinear constraint in the θ or η coordinates, so that the manifold is not flat, but curved in general. Hence, our arguments hold in this case only when at least one of the ρ and τ functions is linear. The U-divergence [21] and β-divergence [19] are such cases. However, for clustering, we can take the average of the η-coordinates of member probability distributions in the larger manifold of positive measures and then project it to the manifold of probability distributions. This is called the exterior average, and the projection is simply a normalization of the result. Therefore, the (ρ, τ)-structure is useful in the case of probability distributions. The same situation holds in the case of positive-definite matrices.

Quantum information theory deals with positive-definite Hermitian matrices of trace one [25,26]. We need to extend our discussions to the case of complex matrices. The trace one constraint is not linear with respect to θ- or η-coordinates, just as in the case of probability distributions. Many interesting divergence functions have been introduced in the manifold of positive-definite Hermitian matrices. It is an interesting future problem to apply our theory to quantum information theory.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society and Oxford University Press: Rhode Island, RI, USA, 2000.
2. Tsallis, C. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World; Springer: Berlin/Heidelberg, Germany, 2009.
3. Naudts, J. Generalized Thermostatistics; Springer: Berlin/Heidelberg, Germany, 2011.
4. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometries. Physica A 2012, 391, 4308–4319.
5. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput 2004, 16, 159–195.
6. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
7. Cichocki, A.; Amari, S. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568.
8. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170.
9. Minami, M.; Eguchi, S. Robust blind source separation by beta-divergence. Neural Comput 2002, 14, 1859–1886.
10. Banerjee, A.; Merugu, S.; Dhillon, I.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res 2005, 6, 1705–1749.
11. Liu, M.; Vemuri, B.C.; Amari, S.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Learn 2012, 24, 3192–3212.
12. Dhillon, I.S.; Tropp, J.A. Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl 2007, 29, 1120–1146.
13. Vemuri, B.C.; Liu, M.; Amari, S.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imaging 2011, 30, 475–483.
14. Ohara, A.; Suda, N.; Amari, S. Dualistic differential geometry of positive definite matrices and its applications to related problems. Linear Algebra Appl 1996, 247, 31–53.
15. Ohara, A.; Eguchi, S. Group invariance of information geometry on q-Gaussian distributions induced by beta-divergence. Entropy 2013, 15, 4732–4747.
16. Ohara, A.; Eguchi, S. Geometry on positive definite matrices induced from V-potential functions. In Geometric Science of Information; Nielsen, F., Barbaresco, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 621–629.
17. Chebbi, Z.; Moakher, M. Means of Hermitian positive-definite matrices based on the log-determinant alpha-divergence function. Linear Algebra Appl 2012, 436, 1872–1889.
18. Tsuda, K.; Ratsch, G.; Warmuth, M.K. Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res 2005, 6, 995–1018.
19. Nock, R.; Magdalou, B.; Briys, E.; Nielsen, F. Mining matrix data with Bregman matrix divergences for portfolio selection. In Matrix Information Geometry; Nielsen, F., Bhatia, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume Chapter 15, pp. 373–402.
20. Nielsen, F., Bhatia, R., Eds.; Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013.
21. Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expo 2006, 19, 197–216.
22. Amari, S. α-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
23. Nock, R.; Nielsen, F.; Amari, S. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2014. submitted for publication.
24. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal 2008, 99, 2053–2081.
25. Petz, D. Monotone metrics on matrix spaces. Linear Algebra Appl 1996, 244, 81–96.
26. Hasegawa, H. α-divergence of the non-commutative information geometry. Rep. Math. Phys 1993, 33, 87–93.

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
