Article

On Monotone Embedding in Information Geometry

Department of Psychology and Department of Mathematics, University of Michigan, 530 Church Street, Ann Arbor, MI 48109, USA
Entropy 2015, 17(7), 4485-4499; https://doi.org/10.3390/e17074485
Submission received: 27 January 2015 / Revised: 28 February 2015 / Accepted: 17 March 2015 / Published: 25 June 2015
(This article belongs to the Special Issue Information, Entropy and Their Geometric Structures)

Abstract

A paper was published (Harsha and Subrahamanian Moosath, 2014) in which the authors claimed to have discovered an extension to Amari’s α-geometry through a general monotone embedding function. It will be pointed out here that this so-called (F, G)-geometry (which includes F-geometry as a special case) is identical to Zhang’s (2004) extension to the α-geometry, where the pair of monotone embedding functions was named ρ and τ instead of the F and H used in Harsha and Subrahamanian Moosath (2014). Their weighting function G for the Riemannian metric appears only cosmetically, due to a rewrite of the score function in log-representation as opposed to (ρ, τ)-representation in Zhang (2004). It is further shown here that the resulting metric and α-connections obtained by Zhang (2004) through arbitrary monotone embeddings are a unique extension of the α-geometric structure. As a special case, Naudts’ (2004) ϕ-logarithm embedding (using the so-called log_ϕ function) is recovered with the identification ρ = ϕ, τ = log_ϕ, with the ϕ-exponential exp_ϕ given by the associated convex function linking the two representations.

In a recent paper that appeared in Entropy (Harsha and Subrahamanian Moosath, 2014) [1], the authors proposed an extension to Amari’s α-geometry, which they call F- or (F, G)-geometry, where F is a monotone embedding function and G is the weighting function for taking the expectation of random variables in calculating the Riemannian metric (G = 1 reduces to F-geometry, with the standard Fisher–Rao metric). This paper serves the purpose of pointing out that (F, G)-geometry as proposed is the same as what Zhang (2004) [2] obtained in extending the α-geometry and captured in his subsequent work [4–8]. The metric and affine connections proposed by [1] are identical to those of [2] apart from notation: the embedding functions F and H in [1] were denoted as ρ and τ in [2], and the weighting function G in [1] is a trivial rewriting of the convex function f used by [2].
This paper will start in Section 1 with a review of Amari’s α-geometry and α-embedding, a review of Zhang’s (2004) [2] extension to ρ-embedding with an arbitrary monotone function and a summary of Harsha and Subrahamanian Moosath (2014) [1]. Then, the equivalence of [1] to [2] is shown. In Section 2, after analyzing the group of monotone embedding functions, a stronger statement is made: the construction of [2] is a unique dualistic extension of Amari’s α-geometry through arbitrary monotone embedding in place of α-embedding. As an important special case, we illustrate how the deformed logarithm log_ϕ associated with an arbitrary strictly increasing function ϕ, as investigated by Naudts (2004) [3], arises naturally from identifying ϕ with ρ and with a proper choice of the auxiliary function f as a part of Zhang’s theory.

1. Equivalence of (F, G)-Geometry to Zhang’s (2004) [2] (ρ, τ)-Geometry

1.1. Amari’s α-Geometry and α-Embedding

The now standard differential geometric characterization of the manifold Θ = {p(·|θ), θ ∈ Θ ⊆ ℝ^n} of parametric probability functions p (probability density or probability distributions) is through the Fisher–Rao metric g_ij as its Riemannian metric:
$$g_{ij}(\theta) = E_\mu\left\{ p(\zeta|\theta)\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^i}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^j} \right\} \tag{1}$$
and a family of α-connections (given by Amari [9,10]) with coefficients Γ^(α) (α ∈ ℝ):
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ \left( \frac{1-\alpha}{2}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^i}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^j} + \frac{\partial^2 \log p(\zeta|\theta)}{\partial \theta^i \partial \theta^j} \right) \frac{\partial p(\zeta|\theta)}{\partial \theta^k} \right\}. \tag{2}$$
Here, E_µ denotes the expectation with respect to a background measure µ of the random variable denoted by ζ:
$$E_\mu\{\cdot\} = \int (\cdot)\, d\mu(\zeta). \tag{3}$$
The α-connection is constructed as a convex combination of a pair of conjugate connections Γ, Γ*:
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2}\, \Gamma_{ij,k}(\theta) + \frac{1-\alpha}{2}\, \Gamma^{*}_{ij,k}(\theta), \tag{4}$$
where Γ ≡ Γ^(1) is frequently called the e-connection (α = 1) and Γ* ≡ Γ^(−1) the m-connection (α = −1). A Riemannian manifold M_µ with its metric g and the family of α-connections Γ^(α) in the form of (1) and (2) has been called α-geometry. Amari’s α-geometry can be specified in terms of a symmetric (0, 2)-tensor g_ij (the Fisher–Rao metric) and a totally symmetric (0, 3)-tensor T_ijk (sometimes called the Amari–Chentsov tensor), which is linked to the α-connections via:
$$\Gamma^{(\alpha)}_{ij,k} = \Gamma^{LC}_{ij,k}(\theta) - \frac{\alpha}{2}\, T_{ijk}(\theta), \tag{5}$$
where $\Gamma^{LC}_{ij,k}$ is the Levi–Civita connection corresponding to the Riemannian metric g.
As an extension of the logarithmic embedding l(p) = log p of the probability density function p, an α-embedding function [10] is defined through l^(α): ℝ+ → ℝ:
$$l^{(\alpha)}(t) = \begin{cases} \log t, & \alpha = 1, \\ \dfrac{2}{1-\alpha}\, t^{(1-\alpha)/2}, & \alpha \neq 1. \end{cases} \tag{6}$$
It is an interesting observation (e.g., p. 46 in [11]) that the α-geometry can be recovered under such an α-representation (scaling) of the probability function; that is, the Fisher–Rao metric turns out to be α-independent (i.e., embedding independent) and the ±1-connections are precisely the α-connections:
$$g_{ij}(\theta) = E_\mu\left\{ \frac{\partial l^{(\alpha)}(p(\zeta|\theta))}{\partial \theta^i}\, \frac{\partial l^{(-\alpha)}(p(\zeta|\theta))}{\partial \theta^j} \right\}, \tag{7}$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ \frac{\partial^2 l^{(\alpha)}(p(\zeta|\theta))}{\partial \theta^i \partial \theta^j}\, \frac{\partial l^{(-\alpha)}(p(\zeta|\theta))}{\partial \theta^k} \right\}. \tag{8}$$
A variant of the α-embedding of a probability function plays an important role in Tsallis statistics; see [12–14]. On the geometric side, [15,16] showed that the α-scaling of the probability functions leads to a conformal transformation.
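The α-independence of the metric in (7) can be checked numerically. The following is a minimal sketch, assuming a Bernoulli family with counting measure as background measure and central finite differences for the θ-derivatives; the helper names (`l_alpha`, `bernoulli`, `metric_alpha`) are illustrative and not taken from the cited works.

```python
import math

def l_alpha(t, alpha):
    """Alpha-embedding of Equation (6)."""
    if alpha == 1:
        return math.log(t)
    return 2.0 / (1.0 - alpha) * t ** ((1.0 - alpha) / 2.0)

def bernoulli(theta):
    """Probabilities of the two outcomes (assumed example family)."""
    return [theta, 1.0 - theta]

def d_theta(func, theta, h=1e-6):
    """Central finite difference with respect to theta."""
    return (func(theta + h) - func(theta - h)) / (2.0 * h)

def metric_alpha(theta, alpha):
    """Equation (7): sum over outcomes of d l^(alpha)(p) * d l^(-alpha)(p)."""
    total = 0.0
    for k in range(2):
        d_plus = d_theta(lambda th: l_alpha(bernoulli(th)[k], alpha), theta)
        d_minus = d_theta(lambda th: l_alpha(bernoulli(th)[k], -alpha), theta)
        total += d_plus * d_minus
    return total

theta = 0.3
fisher = 1.0 / (theta * (1.0 - theta))  # Fisher information of the Bernoulli family
for alpha in (-1.0, 0.0, 0.5, 1.0, 3.0):
    print(alpha, metric_alpha(theta, alpha), fisher)
```

For every α in the loop, the value computed from (7) should agree with the Fisher information 1/(θ(1 − θ)) up to finite-difference error.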

1.2. Zhang (2004) [2] Extension: ρ-Embedding and (ρ, τ)-Geometry

Zhang [2,4,6] obtained generalizations of the α-geometry for a pair of monotone embeddings, called ρ- and τ-embeddings, generalizing the α-embedding. Given any smooth strictly convex function f: ℝ → ℝ, with convex conjugate f* given by:
$$f^*(t) = t\,(f')^{-1}(t) - f\big((f')^{-1}(t)\big), \tag{9}$$
Zhang (2004) defines a pair of conjugate representations [2] (Section 3.2) using two strictly increasing functions ρ, τ: ℝ → ℝ:
  • we call ρ-representation of a probability function p the mapping p ↦ ρ(p);
  • we say the τ-representation of the probability function, p ↦ τ(p), is conjugate to the ρ-representation with respect to a smooth and strictly convex function f, or simply τ is f-conjugate to ρ, if:
    $$\tau(p) = f'(\rho(p)) = \big((f^*)'\big)^{-1}(\rho(p)), \tag{10}$$
    which can be equivalently written as:
    $$\rho(p) = (f')^{-1}(\tau(p)) = (f^*)'(\tau(p)). \tag{11}$$
The equalities in (10) and (11) hold, and they are equivalent, because f′ and (f*)′ are both strictly increasing (due to the strict convexity of f and f*) and because (f*)* = f, (f*)′ = (f′)^{−1}. Sometimes, we write f′ = σ, (f*)′ = σ^{−1} for convenience, so σ(ρ) = τ, σ^{−1}(τ) = ρ, for a strictly increasing function σ.
As a first example, we may set ρ(t) = t, τ(t) = log t. Then, we can derive that f*(t) = exp(t) and f(t) = t log t − t + 1. That ρ(p) and τ(p) are just the p and log p representations reflects the conventional dual embeddings that have later been extended to ϕ- and log_ϕ-embedding in [3]. In Section 2.2, it will be shown that Naudts’ ϕ-logarithm formulation is recovered as a special case of the (ρ, τ)-embedding.
As another example, we may set ρ(p) = l(β)(p) to be the β-representation given by Equation (6); this would have been traditionally called “alpha-embedding”, except we use the symbol β, so that the α-parameter will be reserved for indexing α-connections. In this case, the conjugate representation is the (−β)-representation τ(p) = l(−β)(p):
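$$\rho(p) = l^{(\beta)}(p) \;\longleftrightarrow\; \tau(p) = l^{(-\beta)}(p). \tag{12}$$
In this case, ρ and τ are conjugate with respect to f, where f is given by:
$$f(t) = \frac{2}{1+\beta} \left( \left( \frac{1-\beta}{2} \right) t \right)^{\frac{2}{1-\beta}}, \qquad f^*(t) = \frac{2}{1-\beta} \left( \left( \frac{1+\beta}{2} \right) t \right)^{\frac{2}{1+\beta}}. \tag{13}$$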
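The conjugacy relation τ(p) = f′(ρ(p)) of (10) can be verified numerically for the β-embeddings (12) and the function f in (13). The sketch below assumes |β| < 1 and approximates f′ by a central finite difference; the function names are hypothetical helpers introduced only for this illustration.

```python
import math

def l_beta(p, beta):
    """Beta-representation, i.e., Equation (6) with parameter beta."""
    if beta == 1:
        return math.log(p)
    return 2.0 / (1.0 - beta) * p ** ((1.0 - beta) / 2.0)

def f(t, beta):
    """Convex function f of Equation (13); assumes |beta| < 1 and t > 0."""
    return 2.0 / (1.0 + beta) * ((1.0 - beta) / 2.0 * t) ** (2.0 / (1.0 - beta))

def f_prime(t, beta, h=1e-6):
    """Central finite-difference approximation of f'."""
    return (f(t + h, beta) - f(t - h, beta)) / (2.0 * h)

def conjugacy_gap(p, beta):
    """|f'(rho(p)) - tau(p)| with rho = l^(beta), tau = l^(-beta)."""
    return abs(f_prime(l_beta(p, beta), beta) - l_beta(p, -beta))

for beta in (-0.5, 0.0, 0.3):
    for p in (0.2, 0.7):
        print(beta, p, conjugacy_gap(p, beta))
```

A side observation that follows directly from (13): f(ρ(p)) = 2p/(1 + β), so under this choice f composed with ρ is linear in p.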
Based on divergence functions constructed under monotone embedding, Zhang ([2]) showed:
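Proposition 1. ([2], Proposition 7) Using an arbitrary monotone embedding function ρ and an arbitrary smooth strictly convex function f, a generalization of α-geometry is obtained, with metric and α-connections taking the form:
$$g_{ij}(\theta) = E_\mu\left\{ f''(\rho(p(\zeta|\theta)))\, \frac{\partial \rho(p(\zeta|\theta))}{\partial \theta^i}\, \frac{\partial \rho(p(\zeta|\theta))}{\partial \theta^j} \right\}, \tag{14}$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ \frac{1-\alpha}{2}\, f'''(\rho(p(\zeta|\theta)))\, A_{ijk} + f''(\rho(p(\zeta|\theta)))\, B_{ijk} \right\}, \tag{15}$$
where:
$$A_{ijk}(\zeta,\theta) = \frac{\partial \rho(p(\zeta|\theta))}{\partial \theta^i}\, \frac{\partial \rho(p(\zeta|\theta))}{\partial \theta^j}\, \frac{\partial \rho(p(\zeta|\theta))}{\partial \theta^k}, \qquad B_{ijk}(\zeta,\theta) = \frac{\partial^2 \rho(p(\zeta|\theta))}{\partial \theta^i \partial \theta^j}\, \frac{\partial \rho(p(\zeta|\theta))}{\partial \theta^k}. \tag{16}$$
As special cases,
$$\Gamma_{ij,k}(\theta) = E_\mu\left\{ f''(\rho(p))\, \frac{\partial^2 \rho(p)}{\partial \theta^i \partial \theta^j}\, \frac{\partial \rho(p)}{\partial \theta^k} \right\}, \tag{17}$$
$$\Gamma^{*}_{ij,k}(\theta) = E_\mu\left\{ \frac{\partial \rho(p)}{\partial \theta^k} \left( f'''(\rho(p))\, \frac{\partial \rho(p)}{\partial \theta^i}\, \frac{\partial \rho(p)}{\partial \theta^j} + f''(\rho(p))\, \frac{\partial^2 \rho(p)}{\partial \theta^i \partial \theta^j} \right) \right\}. \tag{18}$$
Furthermore, taking a pair of monotone representations, the metric tensor and affine connections stated in Proposition 1 have dualistic expressions:
Corollary 1. ([2], Proposition 8) Using two arbitrary monotone embedding functions ρ and τ, the metric and α-connections of (14)–(16) are:
$$g_{ij}(\theta) = E_\mu\left\{ \frac{\partial \rho(p(\zeta|\theta))}{\partial \theta^i}\, \frac{\partial \tau(p(\zeta|\theta))}{\partial \theta^j} \right\}, \tag{19}$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ \frac{1-\alpha}{2}\, \frac{\partial^2 \tau(p(\zeta|\theta))}{\partial \theta^i \partial \theta^j}\, \frac{\partial \rho(p(\zeta|\theta))}{\partial \theta^k} + \frac{1+\alpha}{2}\, \frac{\partial^2 \rho(p(\zeta|\theta))}{\partial \theta^i \partial \theta^j}\, \frac{\partial \tau(p(\zeta|\theta))}{\partial \theta^k} \right\}. \tag{20}$$
As a special case, when ρ, τ take the familiar alpha-embeddings (12) (using β as the parameter), the α-connections become (αβ)-connections:
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ \left( \frac{1-\alpha\beta}{2}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^i}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^j} + \frac{\partial^2 \log p(\zeta|\theta)}{\partial \theta^i \partial \theta^j} \right) \frac{\partial p(\zeta|\theta)}{\partial \theta^k} \right\}, \tag{21}$$
with the product α · β playing the role of the alpha-parameter indexing the family of connections.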

1.3. Harsha and Subrahamanian Moosath’s (2014) Work [1]

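Using a monotone embedding function denoted as F and a weighting function denoted as G (G = 1 is the special case that reduces to what they called F-geometry), these authors [1] proposed the (F, G)-metric as (their Equation (33) in [1]):
$$g^{F,G}_{ij} = E_\mu\left\{ p\, G(p)\, \frac{\partial \log p}{\partial \theta^i}\, \frac{\partial \log p}{\partial \theta^j} \right\} \tag{22}$$
with the affine connection given as (their Equation (34)):
$$\Gamma^{F,G}_{ijk} = E_\mu\left\{ p\, G(p)\, \frac{\partial \log p}{\partial \theta^k} \left( \frac{\partial^2 \log p}{\partial \theta^i \partial \theta^j} + \left( 1 + \frac{p F''(p)}{F'(p)} \right) \frac{\partial \log p}{\partial \theta^i}\, \frac{\partial \log p}{\partial \theta^j} \right) \right\}. \tag{23}$$
Note that E_p{(·)} = E_µ{(·)p}. Equation (23) is the expression for the e-connection (α = 1), $\Gamma^{F,G}_{ijk}$. To express the conjugate connection (m-connection, α = −1), $\Gamma^{H,G}_{ijk}$, a dual embedding function H is introduced, which is shown ([1], Theorem 3.2) to be related to F and G via (their Equation (36)):
$$H'(p) = \frac{G(p)}{p\, F'(p)}. \tag{24}$$
In such a case, the conjugate connection $\Gamma^{H,G}_{ijk}$ (sic, more accurately $(\Gamma^{F,G})^{*}_{ijk}$) is expressed as (their Equation (37)):
$$\Gamma^{H,G}_{ijk} = E_\mu\left\{ p\, G(p)\, \frac{\partial \log p}{\partial \theta^k} \left( \frac{\partial^2 \log p}{\partial \theta^i \partial \theta^j} + \left( \frac{p G'(p)}{G(p)} - \frac{p F''(p)}{F'(p)} \right) \frac{\partial \log p}{\partial \theta^i}\, \frac{\partial \log p}{\partial \theta^j} \right) \right\}. \tag{25}$$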
We now show the equivalence of the three expressions (14), (17), (18) from the work [2] with the three corresponding expressions (22), (23), (25) from the work [1].
Statement 1. Equations (14) and (22) give the same Riemannian metric; Equations (17) and (23) give the same affine connection; and Equations (18) and (25) give the same conjugate connection, as long as:
$$F(p) = \rho(p), \qquad G(p) = (\rho'(p))^2\, p\, f''(\rho(p)). \tag{26}$$
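Proof. Re-writing (14), and keeping in mind:
$$\frac{\partial \rho(p)}{\partial \theta^i} = \rho'(p)\, \frac{\partial p}{\partial \theta^i} = p\, \rho'(p)\, \frac{\partial \log p}{\partial \theta^i}, \tag{27}$$
we obtain:
$$g_{ij}(\theta) = E_\mu\left\{ f''(\rho(p))\, (p\, \rho'(p))^2\, \frac{\partial \log p}{\partial \theta^i}\, \frac{\partial \log p}{\partial \theta^j} \right\}. \tag{28}$$
Comparing the above with (22), F is just ρ, and G is linked to f and ρ:
$$G(p) = (\rho'(p))^2\, p\, f''(\rho(p)) = p\, \rho'(p)\, \tau'(p), \tag{29}$$
where we have used (10).
Next, differentiate (27); we obtain:
$$\frac{\partial^2 \rho(p)}{\partial \theta^i \partial \theta^j} = \frac{\partial p}{\partial \theta^j}\, \rho'(p)\, \frac{\partial \log p}{\partial \theta^i} + p\, \rho''(p)\, \frac{\partial p}{\partial \theta^j}\, \frac{\partial \log p}{\partial \theta^i} + p\, \rho'(p)\, \frac{\partial^2 \log p}{\partial \theta^i \partial \theta^j} \tag{30}$$
$$= p\, \rho'(p) \left( \frac{\partial \log p}{\partial \theta^i}\, \frac{\partial \log p}{\partial \theta^j} + \frac{\partial^2 \log p}{\partial \theta^i \partial \theta^j} + \frac{p\, \rho''(p)}{\rho'(p)}\, \frac{\partial \log p}{\partial \theta^i}\, \frac{\partial \log p}{\partial \theta^j} \right) \tag{31}$$
$$= p\, \rho'(p) \left( \frac{\partial^2 \log p}{\partial \theta^i \partial \theta^j} + \left( 1 + \frac{p\, \rho''(p)}{\rho'(p)} \right) \frac{\partial \log p}{\partial \theta^i}\, \frac{\partial \log p}{\partial \theta^j} \right). \tag{32}$$
Identifying F = ρ and making use of (29), we see that (17) is precisely (23).
Finally, differentiate (29):
$$G'(p) = (\rho'(p))^2 f''(\rho(p)) + (\rho'(p))^3\, p\, f'''(\rho(p)) + 2 \rho'(p)\, \rho''(p)\, p\, f''(\rho(p)). \tag{33}$$
Therefore,
$$\frac{p\, G'(p)}{G(p)} - \frac{p\, F''(p)}{F'(p)} = 1 + p\, \rho'(p)\, \frac{f'''(\rho(p))}{f''(\rho(p))} + \frac{p\, \rho''(p)}{\rho'(p)}. \tag{34}$$
After substituting (34) and (29) into (25) and making use of (31), the expression (18) results.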
Statement 2. The conjugate embedding function H is the same as τ. The conjugate connection (25), when expressed using H, has the same form as (23) for $\Gamma^{F,G}_{ijk}$ using F.
Proof. Applying Definition (24) immediately yields H′ = τ′. Therefore, (apart from an additive constant) H(p) = τ(p). Next, we will express (25) explicitly using the conjugate embedding function H (rather than F) and the weighting function G. That is to say, we will simplify the terms in the middle parenthesis of (25):
$$\frac{p\, G'(p)}{G(p)} - \frac{p\, F''(p)}{F'(p)} = p \left( \log \frac{G(p)}{F'(p)} \right)' = p\, \big( \log (p\, H'(p)) \big)' = p\, \big( \log p + \log H'(p) \big)' \tag{35}$$
$$= p \left( \frac{1}{p} + \frac{H''(p)}{H'(p)} \right) = 1 + \frac{p\, H''(p)}{H'(p)}. \tag{36}$$
Hence, (25) has the same expression as (23), showing the duality between the embedding function H and the embedding function F.
By Statement 1, starting from F (that is, ρ) and G and imposing conjugacy requirement on the pair of affine connections, one is guaranteed to derive H (that is, τ) as the conjugate embedding function.
From Statements 1 and 2, we conclude that Harsha and Moosath’s F-embedding [1] replicates the ρ-embedding of Zhang (2004) [2]; the conjugate H-embedding turns out to be identical to the τ-embedding of [2]. Contrary to the authors’ claim (Remark 3.7 of [1], p. 2480), (F, G)-geometry is identical to Zhang’s (ρ, τ)-geometry [2]. In particular, their F-geometry is recovered by simply choosing f to satisfy f″(t) = 1/(ρ^{−1}(t) (ρ′(ρ^{−1}(t)))^2), for a given ρ. The subsequent development in their paper [1], e.g., the definition of the F-affine manifold (their Equation (50)), replicates the definition of the ρ-affine manifold in [2] (Section 3.4).
During the review of their manuscript [1] and in subsequent personal communications, these authors argued that they used a different approach: (F, G)-geometry is derived by embedding the manifold into the space of random variables and suitably defining the inner product through using the F -expectation (their Equation (15)) and (F, G)-expectation (their Equation (32)) as a general weighted expectation of a random variable, while Zhang (2004) [2] derived the geometry through constructing a divergence function. This difference, however, is entirely superficial, because the relationship between divergence functions and geometric structure (metric and affine connection) is well-established by Eguchi’s work [17,18] and known to information geometers. Therefore, neither the approach nor the results of Harsha and Moosath’s proposed (F, H, G) extension to Amari’s α-geometry differs from Zhang’s proposed (ρ, τ, f) extension, with the following correspondence in different symbols by the two papers:
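$$F \longleftrightarrow \rho, \qquad H \longleftrightarrow \tau, \tag{37}$$
$$G(t) \longleftrightarrow t\, \rho'(t)\, \tau'(t) = t\, f''(\rho(t))\, (\rho'(t))^2 = t\, (f^*)''(\tau(t))\, (\tau'(t))^2; \tag{38}$$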
the difference in the representation of score function as log-representation in [1] or under ρ or τ-representation in [2] is cosmetic.
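The correspondence between the two notations can be illustrated numerically. The sketch below is only an illustration under stated assumptions: a three-outcome exponential family with counting measure, the embeddings ρ(p) = √p and τ(p) = p + p³ (both chosen arbitrarily for the example, not taken from [1] or [2]), and finite-difference derivatives. It compares the (F, G)-metric (22), with G(p) = p ρ′(p) τ′(p) as in (29), against the (ρ, τ)-metric (19).

```python
import math

X = (0.0, 1.0, 2.0)  # outcome labels of the assumed example family

def probs(theta):
    """p_k(theta) proportional to exp(theta * x_k)."""
    weights = [math.exp(theta * x) for x in X]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical strictly increasing embeddings on (0, 1).
def rho(p): return math.sqrt(p)
def tau(p): return p + p ** 3
def rho_prime(p): return 0.5 / math.sqrt(p)
def tau_prime(p): return 1.0 + 3.0 * p * p

def G(p):
    """Weighting function from Equation (29): G(p) = p rho'(p) tau'(p)."""
    return p * rho_prime(p) * tau_prime(p)

def d_theta(func, theta, h=1e-6):
    """Central finite difference with respect to theta."""
    return (func(theta + h) - func(theta - h)) / (2.0 * h)

def metric_rho_tau(theta):
    """Equation (19): E_mu{ d rho(p) * d tau(p) }."""
    total = 0.0
    for k in range(len(X)):
        d_rho = d_theta(lambda th: rho(probs(th)[k]), theta)
        d_tau = d_theta(lambda th: tau(probs(th)[k]), theta)
        total += d_rho * d_tau
    return total

def metric_F_G(theta):
    """Equation (22): E_mu{ p G(p) (d log p)^2 } with F = rho."""
    p = probs(theta)
    total = 0.0
    for k in range(len(X)):
        d_log = d_theta(lambda th: math.log(probs(th)[k]), theta)
        total += p[k] * G(p[k]) * d_log * d_log
    return total

print(metric_rho_tau(0.4), metric_F_G(0.4))
```

Both expressions should agree up to finite-difference error, reflecting that the two formulations differ only in how the score function is represented.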

2. Uniqueness of (ρ, τ)-Geometry and Representation Duality

2.1. Monotone Embedding as a Transformation Group

Monotone representations of any given probability function form a transformation group, with functional composition as group composition operation and the functional inverse as the group inverse operation. This was pointed out by Zhang [6] (Section 2.2.2). We state it as a lemma here.
Lemma 1. Denote Ω as the set of strictly increasing functions from ℝ → ℝ. Then, (Ω, ○) forms a group, with ○ denoting functional composition.
Proof. We easily verify that:
  • closure for ○: for any ρ1, ρ2 ∈ Ω, ρ2 ○ ρ1, defined as ρ2(ρ1(·)), is strictly increasing, and hence, ρ2 ○ ρ1 ∈ Ω;
  • existence of unique identity element: the identity function ι, which satisfies ρ ○ ι = ι ○ ρ = ρ, is strictly increasing, and hence, ι ∈ Ω and is unique;
  • existence of inverse: for any ρ ∈ Ω, its functional inverse ρ^{−1}, which satisfies ρ^{−1} ○ ρ = ρ ○ ρ^{−1} = ι, is also strictly increasing, and hence, ρ^{−1} ∈ Ω;
  • associativity of ○: for any three ρ1, ρ2, ρ3 ∈ Ω, (ρ1 ○ ρ2) ○ ρ3 = ρ1 ○ (ρ2 ○ ρ3).
Recall that the derivatives of smooth strictly convex functions are strictly increasing functions. From this perspective, f′ = τ ○ ρ^{−1} = τ(ρ^{−1}(·)) and (f*)′ = ρ ○ τ^{−1} = ρ(τ^{−1}(·)), encountered above, are themselves two mutually inverse strictly increasing functions. This is the rationale behind Zhang’s [2] choice of f (and f*) as the auxiliary function to capture conjugate embedding, rather than using G as in [1]. The following identities are useful; they are obtained by differentiating (10) and (11):
$$f''(\rho(t))\, \rho'(t) = \tau'(t), \qquad (f^*)''(\tau(t))\, \tau'(t) = \rho'(t); \tag{39}$$
therefore:
$$f''(\rho(t))\, (\rho'(t))^2 = (f^*)''(\tau(t))\, (\tau'(t))^2, \tag{40}$$
and:
$$f''(\rho(t))\, (f^*)''(\tau(t)) = 1. \tag{41}$$
With respect to (41), taking the logarithm on both sides yields:
$$\log f''(\rho(t)) + \log (f^*)''(\tau(t)) = 0. \tag{42}$$
Move one term to the right-hand side and differentiate:
$$\frac{f'''(\rho(t))\, \rho'(t)}{f''(\rho(t))} = -\frac{(f^*)'''(\tau(t))\, \tau'(t)}{(f^*)''(\tau(t))}. \tag{43}$$
Making use of (40) yields:
$$f'''(\rho(t))\, (\rho'(t))^3 = -(f^*)'''(\tau(t))\, (\tau'(t))^3. \tag{44}$$
Note the coupling between f and ρ, τ given by (10), (11), (40) and (44). They allow us to cast (14) and (15) in terms of f* and τ.
Among the triple (f, ρ, τ), given any two, the third is specified. In particular, if we arbitrarily choose two strictly increasing functions ρ and τ as embedding functions and require them to be conjugate embeddings, then f is specified by f′(t) = τ(ρ^{−1}(t)). In terms of the conjugate function f*, the relation is (f*)′(t) = ρ(τ^{−1}(t)). The function f (or f*) is important in constructing the general class of divergence functions.
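As a quick sanity check of the identities (39), (41) and (44), one can take the first example from Section 1.2, ρ(t) = t and τ(t) = log t, for which f′ = log and (f*)′ = exp. The sketch below hard-codes the resulting higher derivatives (f″(u) = 1/u, f‴(u) = −1/u², (f*)″ = (f*)‴ = exp); the function names are illustrative only.

```python
import math

# Embeddings of the first example: rho(t) = t, tau(t) = log t.
def rho(t): return t
def tau(t): return math.log(t)
def rho_prime(t): return 1.0
def tau_prime(t): return 1.0 / t

# Derivatives of f and f* for this pair (f' = log, (f*)' = exp).
def f2(u): return 1.0 / u              # f''
def f3(u): return -1.0 / (u * u)       # f'''
def fs2(v): return math.exp(v)         # (f*)''
def fs3(v): return math.exp(v)         # (f*)'''

def identity_residuals(t):
    """Residuals of (39) (both parts), (41) and (44); all should vanish."""
    r39a = f2(rho(t)) * rho_prime(t) - tau_prime(t)
    r39b = fs2(tau(t)) * tau_prime(t) - rho_prime(t)
    r41 = f2(rho(t)) * fs2(tau(t)) - 1.0
    r44 = f3(rho(t)) * rho_prime(t) ** 3 + fs3(tau(t)) * tau_prime(t) ** 3
    return r39a, r39b, r41, r44

for t in (0.1, 0.5, 2.0):
    print(t, identity_residuals(t))
```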

2.2. Naudts’ ϕ-Logarithm as a Special Case

In his 2004 publication [3], Naudts considered the “deformed” logarithm function as an extension to the exponential family of densities that is log-linear. Given a strictly increasing and strictly positive function ϕ: ℝ+ → ℝ+, the ϕ-logarithm is defined as:
$$\log_\phi(t) = \int_1^t \frac{1}{\phi(s)}\, ds, \qquad (t > 0). \tag{45}$$
The deformed exponential, denoted exp_ψ, is defined by:
$$\exp_\psi(t) = 1 + \int_0^t \psi(s)\, ds. \tag{46}$$
(Naudts (2004) used the notation exp_ϕ, so our current rendition has a subtle difference, shown as (48) and (49) below.) It can be shown that the deformed functions log_ϕ and exp_ψ are in fact inverse functions of each other if:
$$\psi(\log_\phi(t)) = \phi(t), \qquad \psi(t) = \phi(\exp_\psi(t)). \tag{47}$$
Stated alternatively, the deformed logarithmic function h(t) = log_ϕ(t) can be viewed as the solution to the following integral equation and its equivalent differential equation:
$$h(t) = \int_1^t \frac{1}{\psi(h(s))}\, ds \;\Longleftrightarrow\; \frac{dh}{dt} = \frac{1}{\psi(h(t))}, \tag{48}$$
whereas the deformed exponential function h(t) = exp_ψ(t) can be viewed as the solution to the following integral equation and its equivalent differential equation:
$$h(t) = 1 + \int_0^t \phi(h(s))\, ds \;\Longleftrightarrow\; \frac{dh}{dt} = \phi(h(t)). \tag{49}$$
We now show that the above formulation can be re-written as (ρ, τ)-embeddings with a particular choice of the f (or equivalently, f*) function. Set ϕ(t) = ρ(t) and f*(t) = exp_ψ(t), so that (f*)′(t) = ψ(t) from (46). Therefore, we derive:
$$\log_\phi(t) = \psi^{-1}(\phi(t)) = \big((f^*)'\big)^{-1}(\rho(t)) = f'(\rho(t)) = \tau(t).$$
That is, when ϕ is chosen as the ρ-representation, the deformed logarithm log_ϕ turns out to be the τ-representation, while the deformed exponential is nothing but f*. The relationship (47) is identical to (10) and (11).
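A concrete instance may help fix ideas. The sketch below assumes the power function ϕ(s) = s^q with q = 0.5 (an illustrative choice, not one singled out in the text), for which (45) has the closed form (t^{1−q} − 1)/(1 − q). It compares that closed form with a midpoint-rule evaluation of the integral in (45) and checks the inverse relations in (47); all function names are hypothetical.

```python
Q = 0.5  # assumed exponent for the example phi(s) = s**Q

def phi(s):
    return s ** Q

def log_phi_closed(t):
    """Closed form of (45) for phi(s) = s**Q."""
    return (t ** (1.0 - Q) - 1.0) / (1.0 - Q)

def log_phi_quadrature(t, n=20000):
    """Midpoint-rule approximation of the integral of 1/phi from 1 to t."""
    h = (t - 1.0) / n
    return sum(h / phi(1.0 + (i + 0.5) * h) for i in range(n))

def exp_psi(u):
    """Inverse of log_phi_closed; valid for 1 + (1 - Q) * u > 0."""
    return (1.0 + (1.0 - Q) * u) ** (1.0 / (1.0 - Q))

def psi(u):
    """psi = phi composed with exp_psi, the second relation in (47)."""
    return phi(exp_psi(u))

for t in (0.25, 1.0, 2.5):
    tau_t = log_phi_closed(t)  # plays the role of tau(t) when rho = phi
    print(t, tau_t, log_phi_quadrature(t), exp_psi(tau_t), psi(tau_t), phi(t))
```

In the (ρ, τ) language, `phi` plays the role of ρ, `log_phi_closed` the role of τ, and `exp_psi` the role of f*.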
In the ϕ-logarithm approach, once ϕ (that is, ρ) is specified, then logϕ (that is, τ) is specified, through the integral relation (45). Viewing τ(·) = f′(ρ(·)), the relation (45) essentially specifies a strictly convex function f, through its derivative f′, which operates on ρ.
Proposition 2. Denote ρ ≡ ϕ. The deformed logarithmic transformation ϕ → log_ϕ given by (45) can be viewed as the function composition f′: ρ → f′(ρ), where f is given by:
$$f(\rho(t)) = \rho(t)\, f'(\rho(t)) - t. \tag{50}$$
Equivalently, using the conjugate function f* given by (9),
$$\rho = (f^*)' \circ (f^*)^{-1}, \tag{51}$$
or
$$\rho = \frac{1}{\big((f^*)^{-1}\big)'}. \tag{52}$$
Proof. From (45), we write:
$$f'(\rho(t)) = \int_1^t \frac{1}{\rho(s)}\, ds, \tag{53}$$
with unknown f. Multiply both sides by ρ′(t) and then integrate from one to x; the left-hand side of (53) is:
$$\int_1^x f'(\rho(t))\, \rho'(t)\, dt = \int_1^x f'(\rho(t))\, d(\rho(t)) = f(\rho(x)) - f(\rho(1)).$$
The right-hand side of (53), after the same operation, is:
$$\int_1^x \rho'(t)\, dt \int_1^t \frac{1}{\rho(s)}\, ds = \int_1^x \frac{1}{\rho(s)}\, ds \int_s^x \rho'(t)\, dt = \int_1^x \frac{\rho(x) - \rho(s)}{\rho(s)}\, ds = \int_1^x \left( \frac{\rho(x)}{\rho(s)} - 1 \right) ds = \rho(x) \left( \int_1^x \frac{1}{\rho(s)}\, ds \right) - \int_1^x ds = \rho(x)\, f'(\rho(x)) - (x - 1).$$
Clearly, f′(ρ(1)) = 0 by (53). We set f(ρ(1)) = −1. Comparing expressions from the left- and right-hand side, we obtain (50).
Applying (9), we obtain the equivalent expression:
$$f^*\big(f'(\rho(t))\big) = t.$$
That is, f is chosen such that f* ○ f′ is the inverse function of ρ, or:
$$\rho = (f^* \circ f')^{-1} = (f')^{-1} \circ (f^*)^{-1} = (f^*)' \circ (f^*)^{-1}.$$
Hence, (51) holds.
Finally, differentiating the identity:
$$f^*\big((f^*)^{-1}(t)\big) = t,$$
we obtain:
$$1 = (f^*)'\big((f^*)^{-1}(t)\big) \cdot \big((f^*)^{-1}\big)'(t) = \rho(t) \cdot \big((f^*)^{-1}\big)'(t)$$
upon substituting (51). Hence, (52) holds.
The expression (51) in Proposition 2 shows that for any ρ, if one can find a decomposition ρ = g′ ○ g^{−1} in terms of g, then g would be the ρ-exponential, g^{−1} the ρ-logarithm and g′ the linking function. In the case of the ϕ ↦ log_ϕ transformation, g = f*(t).
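Equation (50) can be checked on an example. The sketch below again assumes ρ(t) = ϕ(t) = t^q with q = 0.5 (an illustrative choice), defines the composite t ↦ f(ρ(t)) through (50), and verifies by finite differences that its derivative equals f′(ρ(t)) ρ′(t), as the chain rule requires, together with the normalization f(ρ(1)) = −1 used in the proof. The names `f_of_rho` and `chain_rule_gap` are hypothetical.

```python
Q = 0.5  # assumed exponent for the example rho(t) = t**Q

def rho(t):
    return t ** Q

def rho_prime(t):
    return Q * t ** (Q - 1.0)

def tau(t):
    """f'(rho(t)) = log_phi(t) from (53), in closed form for this rho."""
    return (t ** (1.0 - Q) - 1.0) / (1.0 - Q)

def f_of_rho(t):
    """Composite t -> f(rho(t)) as given by Equation (50)."""
    return rho(t) * tau(t) - t

def chain_rule_gap(t, h=1e-6):
    """|d/dt f(rho(t)) - f'(rho(t)) rho'(t)|, which should be near zero."""
    lhs = (f_of_rho(t + h) - f_of_rho(t - h)) / (2.0 * h)
    return abs(lhs - tau(t) * rho_prime(t))

for t in (0.3, 1.0, 4.0):
    print(t, f_of_rho(t), chain_rule_gap(t))
```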
Naudts’ ([3]) deformed logarithm/exponential embedding approach and Zhang’s ([2]) (ρ, τ)-embedding approach can be seen as playing complementary roles in information geometry: the former makes it easy to generalize the exponentiation and logarithm as inverse operations obeying desired differential/integral equations, while the latter makes it apparent how conjugate (ρ, τ)-embeddings lead to bidualistic expressions for the underlying geometric structures (metric and conjugate connections).

2.3. Uniqueness of (ρ, τ)-Geometry

It is known [19,20] that the Fisher–Rao metric and α-connections (equivalently, Amari–Chentsov tensor T) are the only invariants of sufficient statistics under the Markov morphism of a random variable. In [22,23], the Fisher–Rao metric has been extended to allow a weighting function. In [2,6], general weighting functions for affine connections were made compatible with the generalized (i.e., weighted) Fisher–Rao metric, since they result from divergence functions that are allowed to have the freedom of monotone embedding. The recent reinvention [1] constructed weighted connections that turned out to be identical to the expressions given by [2]. A natural question is, then, whether Zhang’s (ρ, τ) geometry is the unique construction given the freedom of arbitrary monotone embedding. Below, arguments will be provided, along with a proof, for a positive answer to this question.
First, when a probability function p(ζ|θ) (as a function of a random variable indexed by ζ and a background measure of µ) is embedded into the parametric manifold Θ, there are several traditional choices for tangent vectors: ∂_i p, ∂_i log p, ∂_i √p, etc. Each of these is linked with a weighting function (expectation operator), so that the tangent vectors are zero-mean random variables:
$$0 = E_\mu\{\partial_i p\} = E_\mu\{(p)\, \partial_i \log p\} = E_\mu\{(\sqrt{p})\, \partial_i(\sqrt{p})\} = \cdots$$
where the weighting functions are, respectively, 1, p, √p:
$$0 = E_\mu\{\partial_i p\} = E_p\{\partial_i \log p\} = E_{\sqrt{p}}\{\partial_i(\sqrt{p})\} = \cdots$$
For these various choices, the directions of the tangent vectors are all the same. We can consider the above as special cases of ρ-embedding, with ρ(t) = t, log t, √t, respectively. Because ∂_i(ρ(p)) = ρ′(p) ∂_i p, a tangent vector retains its direction with any choice of monotone embedding function.
To investigate the weighting function for a general monotone ρ-embedding, let us consider the f-normalization (foliation) condition, cf. [21],
$$E_\mu\{ f(\rho(p)) \} = 1,$$
where f is a given convex function. Differentiating the above, we get:
$$0 = E_\mu\left\{ f'(\rho(p))\, \frac{\partial \rho(p)}{\partial \theta^i} \right\} = E_\mu\{ \tau(p)\, \partial_i \rho \}.$$
Therefore, we can see that τ(p) = f′(ρ(p)), what we have called the f-conjugate of ρ, is precisely the weighting function that makes ∂_i ρ a zero-mean random function at any point of Θ (i.e., for any value of θ ∈ Θ).
Next, consider the Fisher–Rao metric (1), which can be written as E_µ{∂_i p ∂_j log p} = E_µ{∂_i log p ∂_j p}, the pairing of a random function with a random functional under two embeddings p and log p. A natural generalization (see [6]) is to use two (independently chosen) monotone embeddings ρ, τ:
$$g_{ij}(\theta) = E_\mu\{ \partial_i \rho\, \partial_j \tau \} = E_\mu\{ \partial_j \rho\, \partial_i \tau \} = E_\mu\{ \rho'(p)\, \tau'(p)\, \partial_i p\, \partial_j p \}. \tag{57}$$
This is precisely (14), with the weighting function for the Riemannian metric being f″(ρ(p))(ρ′(p))^2 = τ′(p)ρ′(p) when tangent vectors are expressed as ∂_i p (identity representation). When the ρ-representation or τ-representation is adopted, the weighting function is simply f″(ρ(p)) or (f*)″(τ(p)), respectively.
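Third, given the ρ, τ embeddings, we can construct two affine connections on the manifold as follows. Differentiate (57),
$$\frac{\partial g_{ij}(\theta)}{\partial \theta^k} = E_\mu\left\{ \frac{\partial^2 \rho(p)}{\partial \theta^k \partial \theta^i}\, \frac{\partial \tau(p)}{\partial \theta^j} + \frac{\partial^2 \tau(p)}{\partial \theta^k \partial \theta^j}\, \frac{\partial \rho(p)}{\partial \theta^i} \right\}, \tag{58}$$
and compare with the relation that defines conjugate connections:
$$\frac{\partial g_{ij}(\theta)}{\partial \theta^k} = \Gamma_{ki,j}(\theta) + \Gamma^{*}_{kj,i}(\theta); \tag{59}$$
we can identify:
$$E_\mu\left\{ \frac{\partial^2 \rho(p)}{\partial \theta^k \partial \theta^i}\, \frac{\partial \tau(p)}{\partial \theta^j} \right\} \tag{60}$$
with $\Gamma_{ki,j}$ and:
$$E_\mu\left\{ \frac{\partial^2 \tau(p)}{\partial \theta^k \partial \theta^j}\, \frac{\partial \rho(p)}{\partial \theta^i} \right\} \tag{61}$$
with $\Gamma^{*}_{kj,i}$, respectively. The difference of the two connections, taken at the same indices, is by definition the Amari–Chentsov (0,3)-tensor T:
$$T_{ijk}(\theta) \equiv E_\mu\left\{ \frac{\partial^2 \tau(p)}{\partial \theta^i \partial \theta^j}\, \frac{\partial \rho(p)}{\partial \theta^k} - \frac{\partial^2 \rho(p)}{\partial \theta^i \partial \theta^j}\, \frac{\partial \tau(p)}{\partial \theta^k} \right\}. \tag{62}$$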
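Proposition 3. T as given by (62) is a totally symmetric (0,3)-tensor.
Proof. First, we prove that T(θ) is totally symmetric:
$$T_{ijk} = T_{jik} = T_{ikj} = T_{jki} = T_{kij} = T_{kji}. \tag{63}$$
Since (62) clearly implies T_ijk = T_jik, we only need to establish T_ijk = T_ikj. Applying the product rule of differentiation,
$$\frac{\partial}{\partial \theta^i} \left( \frac{\partial \tau(p)}{\partial \theta^j}\, \frac{\partial \rho(p)}{\partial \theta^k} \right) = \frac{\partial^2 \tau(p)}{\partial \theta^i \partial \theta^j}\, \frac{\partial \rho(p)}{\partial \theta^k} + \frac{\partial^2 \rho(p)}{\partial \theta^i \partial \theta^k}\, \frac{\partial \tau(p)}{\partial \theta^j}, \tag{64}$$
$$\frac{\partial}{\partial \theta^i} \left( \frac{\partial \rho(p)}{\partial \theta^j}\, \frac{\partial \tau(p)}{\partial \theta^k} \right) = \frac{\partial^2 \rho(p)}{\partial \theta^i \partial \theta^j}\, \frac{\partial \tau(p)}{\partial \theta^k} + \frac{\partial^2 \tau(p)}{\partial \theta^i \partial \theta^k}\, \frac{\partial \rho(p)}{\partial \theta^j}, \tag{65}$$
and taking into account:
$$\frac{\partial \tau(p)}{\partial \theta^j}\, \frac{\partial \rho(p)}{\partial \theta^k} = \frac{\partial \tau(p)}{\partial \theta^k}\, \frac{\partial \rho(p)}{\partial \theta^j} = \tau'(p)\, \rho'(p)\, \frac{\partial p}{\partial \theta^j}\, \frac{\partial p}{\partial \theta^k}, \tag{66}$$
(62) becomes:
$$T_{ijk}(\theta) = E_\mu\left\{ -\left( \frac{\partial^2 \rho(p)}{\partial \theta^i \partial \theta^k}\, \frac{\partial \tau(p)}{\partial \theta^j} - \frac{\partial^2 \tau(p)}{\partial \theta^i \partial \theta^k}\, \frac{\partial \rho(p)}{\partial \theta^j} \right) \right\} = T_{ikj}(\theta). \tag{67}$$
Next, we prove that T_ijk is indeed a (0,3)-tensor. This is done by examining the behavior of T under a coordinate transform θ ↦ θ̄, with the (inverse) Jacobian matrix ∂θ^k/∂θ̄^l, which affects:
$$\frac{\partial \rho(p)}{\partial \bar\theta^i} = \sum_l \frac{\partial \rho(p)}{\partial \theta^l}\, \frac{\partial \theta^l}{\partial \bar\theta^i}, \qquad \frac{\partial \tau(p)}{\partial \bar\theta^i} = \sum_l \frac{\partial \tau(p)}{\partial \theta^l}\, \frac{\partial \theta^l}{\partial \bar\theta^i}, \tag{68}$$
and:
$$\frac{\partial^2 \rho(p)}{\partial \bar\theta^i \partial \bar\theta^j} = \sum_{l,m} \frac{\partial^2 \rho(p)}{\partial \theta^l \partial \theta^m}\, \frac{\partial \theta^l}{\partial \bar\theta^i}\, \frac{\partial \theta^m}{\partial \bar\theta^j} + \sum_l \frac{\partial \rho(p)}{\partial \theta^l}\, \frac{\partial^2 \theta^l}{\partial \bar\theta^i \partial \bar\theta^j}, \tag{69}$$
$$\frac{\partial^2 \tau(p)}{\partial \bar\theta^i \partial \bar\theta^j} = \sum_{l,m} \frac{\partial^2 \tau(p)}{\partial \theta^l \partial \theta^m}\, \frac{\partial \theta^l}{\partial \bar\theta^i}\, \frac{\partial \theta^m}{\partial \bar\theta^j} + \sum_l \frac{\partial \tau(p)}{\partial \theta^l}\, \frac{\partial^2 \theta^l}{\partial \bar\theta^i \partial \bar\theta^j}. \tag{70}$$
Therefore:
$$\bar T_{ijk}(\bar\theta) \equiv E_\mu\left\{ \frac{\partial^2 \tau(p)}{\partial \bar\theta^i \partial \bar\theta^j}\, \frac{\partial \rho(p)}{\partial \bar\theta^k} - \frac{\partial^2 \rho(p)}{\partial \bar\theta^i \partial \bar\theta^j}\, \frac{\partial \tau(p)}{\partial \bar\theta^k} \right\} = \sum_{l,m,n} \frac{\partial \theta^l}{\partial \bar\theta^i}\, \frac{\partial \theta^m}{\partial \bar\theta^j}\, \frac{\partial \theta^n}{\partial \bar\theta^k}\, T_{lmn}(\theta), \tag{71}$$
after substituting (69), (70) and (62); the terms involving second derivatives of the coordinate transform cancel because of (66). T indeed transforms to T̄ in the manner that defines a (0, 3)-tensor. Therefore, the proposition is proven.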
We now cast the Amari–Chentsov tensor T in an alternative form that makes the weighting function explicit. Given ρ, τ, because of Lemma 1, there exists another monotone embedding σ, such that σ(ρ) = τ. Differentiating,
$$\frac{\partial \sigma(\rho(p))}{\partial \theta^i} = \sigma'(\rho(p))\, \frac{\partial \rho(p)}{\partial \theta^i}. \tag{72}$$
Differentiating again, we obtain:
$$\frac{\partial^2 \sigma(\rho(p))}{\partial \theta^i \partial \theta^j} = \sigma''(\rho(p))\, \frac{\partial \rho(p)}{\partial \theta^i}\, \frac{\partial \rho(p)}{\partial \theta^j} + \sigma'(\rho(p))\, \frac{\partial^2 \rho(p)}{\partial \theta^i \partial \theta^j}. \tag{73}$$
Substituting the above into (62), we obtain an expression of T in terms of ρ (which plays the role of embedding function) and σ (which plays the role of weighting function):
$$T_{ijk}(\theta) = E_\mu\left\{ \sigma''(\rho(p))\, \frac{\partial \rho(p)}{\partial \theta^i}\, \frac{\partial \rho(p)}{\partial \theta^j}\, \frac{\partial \rho(p)}{\partial \theta^k} \right\}. \tag{74}$$
Similarly, we can obtain:
$$T_{ijk}(\theta) = -E_\mu\left\{ (\sigma^{-1})''(\tau(p))\, \frac{\partial \tau(p)}{\partial \theta^i}\, \frac{\partial \tau(p)}{\partial \theta^j}\, \frac{\partial \tau(p)}{\partial \theta^k} \right\}. \tag{75}$$
Therefore, under the τ-representation, σ^{−1} (the inverse function of σ) serves as the weighting function. Note that σ = f′, σ^{−1} = (f*)′ when ρ and τ are said to be conjugate. Furthermore, note the negative sign in (75) compared with (74); this precisely reflects “representation duality” with a ρ ↔ τ exchange.
To summarize, because α-geometry {ℳ, g, T} is uniquely specified given a Riemannian metric g and the Amari–Chentsov tensor T, the above derivations show that they both enjoy the freedom of two monotone/convex functions, with the freedom in specifying g coupled to the freedom in specifying T in the same way that the metric and connections are coupled via the Codazzi relation for statistical manifolds. That the weighting functions used to construct linear, symmetric bilinear and totally symmetric trilinear functionals (on random functions) turn out to be f′(ρ(·)), f″(ρ(·)), f‴(ρ(·)), respectively, is noteworthy. See [6] for more discussion.
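The agreement between the two forms (62) and (74) of the Amari–Chentsov tensor can be illustrated numerically for a one-parameter family. The sketch below assumes a three-outcome exponential family with counting measure and the classical pair ρ(p) = p, τ(p) = log p, so that σ = log and σ″(u) = −1/u²; derivatives in θ are approximated by finite differences, and all helper names are hypothetical.

```python
import math

X = (0.0, 1.0, 2.0)  # outcome labels of the assumed example family

def probs(theta):
    """p_k(theta) proportional to exp(theta * x_k)."""
    weights = [math.exp(theta * x) for x in X]
    total = sum(weights)
    return [w / total for w in weights]

def d1(func, theta, h=1e-4):
    """Central first difference in theta."""
    return (func(theta + h) - func(theta - h)) / (2.0 * h)

def d2(func, theta, h=1e-4):
    """Central second difference in theta."""
    return (func(theta + h) - 2.0 * func(theta) + func(theta - h)) / (h * h)

def T_from_62(theta):
    """Equation (62) with rho(p) = p and tau(p) = log p (single parameter)."""
    total = 0.0
    for k in range(len(X)):
        rho_k = lambda th, k=k: probs(th)[k]
        tau_k = lambda th, k=k: math.log(probs(th)[k])
        total += d2(tau_k, theta) * d1(rho_k, theta) - d2(rho_k, theta) * d1(tau_k, theta)
    return total

def T_from_74(theta):
    """Equation (74) with sigma = log, so sigma''(u) = -1/u**2."""
    p = probs(theta)
    total = 0.0
    for k in range(len(X)):
        rho_k = lambda th, k=k: probs(th)[k]
        total += (-1.0 / p[k] ** 2) * d1(rho_k, theta) ** 3
    return total

print(T_from_62(0.4), T_from_74(0.4))
```

The two values should coincide up to finite-difference error; exchanging the roles of ρ and τ in `T_from_62` flips the sign, in line with the remark on representation duality.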

2.4. Representation Duality versus Reference Duality

Going beyond extending α-embedding to dual monotonic embeddings, Reference [2] illuminated two different senses of duality in the α-geometry. Prior to [2], there have been several different usages of α-parameter in Amari’s theory of information geometry [10,11]:
  (1) parameterizing the divergence functions (α-divergences);
  (2) parameterizing monotone embedding of probability functions (α-embedding);
  (3) parameterizing the convex mixture of connections (α-connections).
Zhang (2004) [2] showed that (1) and (2) reflect two different types of duality in information geometry, with (1) concerning the reference/comparison status of a pair of points (functions) expressed in the divergence function (“reference duality”) and (2) concerning their representation under arbitrary monotone scaling (“representation duality”). Both can lead to (3), the family of α-connections. Therefore, care has to be taken in delineating these two kinds of duality; for instance, the αβ-connection we derived in Equation (21) reflects how reference duality and representation duality interact in the alpha-connections.
The present analysis elaborated representation duality in information geometry by working out the freedom in allowing two (independently chosen) embedding functions ρ, τ or, equivalently, one embedding function ρ along with a weighting function f, while the (ρ, f) pair can be dually chosen to be the (τ, f*) pair. Naudts’ (2004) [3] ϕ-logarithm is but a special case of the (ρ, τ) duality, in which f′ plays the role of the “integral-of-the-reciprocal” operation, that is, taking the deformed logarithm of a function. This linkage then leads to f* and τ as inverse functions. The phenomenon of biduality emerges when exchanging ρ ↔ τ or (ρ, f) ↔ (τ, f*) leaves the Riemannian metric invariant, but switches the two connections (the latter half of the statement is equivalent to changing the sign of the Amari–Chentsov tensor). Therefore, the present paper, while elaborating the theory developed in [2], re-asserts the distinction between two kinds of duality that were originally confounded in Amari’s theory of α-geometry, one through the freedom of selecting monotone embedding functions (“representation duality”) and the other through the freedom of assigning referential status to points for pair comparison (“reference duality”).
Finally, it is noted that the (bi)dualistic structure of the (ρ, τ)-geometry (generalizing α-geometry) is preserved in the non-parametric (infinite-dimensional) setting, as well [4,6], with the α-connection structure cast in a more general way. Theorem 1 of [4] gives non-parametric expressions of the metric and connections under monotone embedding, mirroring the forms (14) and (15) in the parametric case.

3. Conclusion

The Riemannian metric, together with the pair of conjugate connections, derived by Harsha and Moosath [1] is identical to the (ρ, τ)-geometry obtained by Zhang in [2]. The (ρ, τ)-embedding also recovers Naudts’ deformed logarithm/exponential formulation. It is further shown in this paper that, when α-embedding is relaxed to arbitrary monotone embeddings, the (ρ, τ)-geometry so obtained is the unique extension of Amari’s α-geometry in terms of its representational freedom.

Acknowledgments

The writing of this paper was supported by research grant ARO W911NF-12-1-0163 awarded to Jun Zhang.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Harsha, K.V.; Subrahamanian Moosath, K.S. F-geometry and Amari’s α-geometry on a statistical manifold. Entropy 2014, 16, 2472–2487.
  2. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
  3. Naudts, J. Estimators, escort probabilities, and ϕ-exponential families in statistical physics. J. Inequal. Pure Appl. Math. 2004, 5, 102.
  4. Zhang, J. Referential duality and representational duality on statistical manifolds. In Proceedings of the Second International Symposium on Information Geometry and Its Applications, Tokyo, Japan, 12–16 December 2005; pp. 58–67.
  5. Zhang, J. Referential duality and representational duality in the scaling of multi-dimensional and infinite-dimensional stimulus space. In Measurement and Representation of Sensations: Recent Progress in Psychological Theory; Dzhafarov, E., Colonius, H., Eds.; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2006.
  6. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
  7. Zhang, J. Divergence functions and geometric structures they induce on a manifold. In Geometric Theory of Information; Nielsen, F., Ed.; Springer: Cham, Switzerland, 2014; pp. 1–30.
  8. Zhang, J. Reference duality and representation duality in information geometry. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt2014), Proceedings of 34th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Amboise, France, 21–26 September 2014; 1641, pp. 130–146.
  9. Amari, S. Differential geometry of curved exponential families—curvatures and information loss. Ann. Stat. 1982, 10, 357–385.
  10. Amari, S. Differential Geometric Methods in Statistics; Lecture Notes in Statistics, Volume 28; Springer: New York, NY, USA, 1985.
  11. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000.
  12. Ohara, A. Geometry of distributions associated with Tsallis statistics and properties of relative entropy minimization. Phys. Lett. A 2007, 370, 184–193.
  13. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149.
  14. Ohara, A.; Matsuzoe, H.; Amari, S. A dually flat structure on the space of escort distributions. J. Phys. Conf. Ser. 2010, 201, 012012.
  15. Amari, S.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185.
  16. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometry. Physica A 2012, 391, 4308–4319.
  17. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803.
  18. Eguchi, S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985, 15, 341–391.
  19. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Society: Providence, RI, USA, 1982.
  20. Ay, N.; Jost, J.; Le, H.V.; Schwachhöfer, L. Information geometry and sufficient statistics. Probab. Theory Relat. Fields 2014.
  21. Zhang, J.; Hästö, P. Statistical manifold as an affine space: A functional equation approach. J. Math. Psychol. 2006, 50, 60–65.
  22. Burbea, J.; Rao, C.R. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multivar. Anal. 1982, 12, 575–596.
  23. Burbea, J.; Rao, C.R. Differential metrics in probability spaces. Probab. Math. Stat. 1984, 3, 241–258.

Share and Cite

MDPI and ACS Style

Zhang, J. On Monotone Embedding in Information Geometry. Entropy 2015, 17, 4485-4499. https://doi.org/10.3390/e17074485

