Article

Group Invariance of Information Geometry on q-Gaussian Distributions Induced by Beta-Divergence

1
Department of Electrical and Electronics Engineering, University of Fukui, Fukui 910-8507, Japan
2
The Institute of Statistical Mathematics and the Graduate University of Advanced Studies, Tachikawa Tokyo 190-8562, Japan
*
Author to whom correspondence should be addressed.
Entropy 2013, 15(11), 4732-4747; https://doi.org/10.3390/e15114732
Submission received: 6 October 2013 / Revised: 26 October 2013 / Accepted: 29 October 2013 / Published: 4 November 2013
(This article belongs to the Collection Advances in Applied Statistical Mechanics)

Abstract

We demonstrate that, among deformed exponential families, the q-exponential family admits particularly natural geometrical structures. The property is the invariance of the structures with respect to the general linear group, which acts transitively on the space of positive definite matrices. We prove this property via the correspondence between the information geometry induced by a deformed potential on that space and the one induced by what we call the β-divergence defined on the q-exponential family with q = β + 1. The results are fundamental in robust multivariate analysis using the q-Gaussian family.

1. Introduction

Generalizations of the exponential family have recently attracted much attention in mathematical statistics and statistical physics [1,2,3,4,5,6]. One of their goals is to treat a wider class of statistical problems than those covered by the well-established theory of the exponential family [7]. Among such generalizations, the q-exponential family, in which the ordinary exponential function is replaced by the q-exponential [8], often appears naturally and plays important roles in both experimental and theoretical respects.
For example, the family not only describes phenomena obeying power laws well [6], but is also theoretically proven to include the velocity distribution of a classical gas with N particles [9,10], an attracting invariant manifold of porous-media flow [11], and so on. In statistics, it is reported to provide a reasonable statistical model for robust inference from data that deviate from normality [1,12,13,14]. In addition, quite interesting and abundant mathematical structures have been developed for the q-exponential function itself [15,16,17,18,19].
On certain families of elliptical distributions [20,21], we can introduce information geometric structures [22,23] starting from what is called the U-divergence [1] instead of the Kullback-Leibler divergence, in order to geometrically tackle various statistical problems, including the above robustness analysis. Zero-mean elliptical density functions f_P(x) = u(−x^T P x/2 − c_U(det P)), with a fixed function, u, and a normalizing constant, c_U(det P), can be specified by positive definite matrices, P. Hence, we can discuss the geometric structures of such density functions explicitly, e.g., with no integrations, via the corresponding information geometry on the parameter space of positive definite matrices, called the V-geometry [24].
In the present paper, we focus on investigating the geometric structures of the q-Gaussian family induced from the β-divergence [1] via the V-geometry following the above idea. For this purpose, we establish the correspondence between two geometries and derive explicit formulas of important geometric quantities, such as the Riemannian metric and mutually dual affine connections. Consequently, we can prove that the information geometry on the q-Gaussian family enjoys fairly natural group invariance properties. These invariances or homogeneities are important in multivariate analysis (see, e.g., [21,25]). Further, they practically assure that the statistical inferences based on the geometrical theory are independent of linear transformations of multivariate random variables, such as scaling or numerical conditioning in computations. It should be additionally mentioned that our results might shed new light on the rich mathematical structures of the q-exponential or power functions.
The organization of the paper is as follows: Section 2 collects necessary results on the V-geometry of positive definite matrices and the U-divergence defined on elliptical distributions. In Section 3, we discuss the group invariances of fundamental structures of the V-geometry induced by the power potentials. We find that its pair of mutually dual connections and orthogonality are G L ( n , R ) -invariant, which is a natural requirement for geometries on positive definite matrices. Section 4 is devoted to demonstrating that the dualistic geometry on the q-Gaussian family induced by the β-divergence coincides with the V-geometry with the power potentials. Finally, Section 5 gives concluding remarks.

2. Preliminaries: Geometries on Positive Definite Matrices and the U-Model

We recall the relation between information geometry on positive definite matrices induced by V-potentials and that on a multivariate statistical model called the U-model. Details described in this section and the ideas behind them can be found in [1,24].

2.1. V-Potential Function and the Induced Geometry on Positive Definite Matrices

Denote by Sym(n,ℝ) the vector space of n × n real symmetric matrices and by PD(n,ℝ) the convex cone of n × n positive definite matrices in Sym(n,ℝ). For two matrices, X and Y, in Sym(n,ℝ), we denote tr(XY) by ⟨X, Y⟩. For an arbitrary set of basis matrices {E_i}_{i=1}^{n(n+1)/2} of Sym(n,ℝ), we can introduce a coordinate system, (x^i), by X = Σ_{i=1}^{n(n+1)/2} x^i E_i. Note that Sym(n,ℝ) is isomorphic to the tangent vector space at each P ∈ PD(n,ℝ). Hence, we particularly identify E_i with the tangent vector (∂/∂x^i)_P.
Definition 1. 
Let V ( s ) be a smooth function of real numbers s > 0 . The function defined by:
φ ( V ) ( P ) = V ( det P )
is called a V-potential on P D ( n , R ) .
When V(s) = −log s, the V-potential reduces to the standard one, called the characteristic function on PD(n,ℝ), which plays a fundamental role in the geometrical theory of PD(n,ℝ) [26,27,28].
Let ν_i(s), i = 1, 2, …, be the functions defined by
\nu_i(s) = \frac{d\nu_{i-1}(s)}{ds}\, s, \quad i = 1, 2, \dots, \qquad \text{where } \nu_0(s) = V(s)
We assume that V ( s ) satisfies the following two conditions:
\mathrm{i)}\ \nu_1(s) < 0 \ (s > 0), \qquad \mathrm{ii)}\ \beta^{(V)}(s) = \frac{\nu_2(s)}{\nu_1(s)} < \frac{1}{n} \ (s > 0)
which are later shown to ensure the convexity of φ ( V ) ( P ) on P D ( n , R ) . Note that the first condition, ν 1 ( s ) < 0 , for all s > 0 implies the function, V ( s ) , is strictly decreasing on s > 0 .
Using the formula grad det P = (det P)P^{-1}, we have the gradient mapping, grad φ^{(V)}:
\mathrm{grad}\, \varphi^{(V)} : P \mapsto P^{*} = \nu_1(\det P)\, P^{-1}
The Hessian of φ ( V ) at P P D ( n , R ) , which we write as g P ( V ) , is given by:
g_P^{(V)}(X, Y) = -\nu_1(\det P)\, \mathrm{tr}(P^{-1} X P^{-1} Y) + \nu_2(\det P)\, \mathrm{tr}(P^{-1} X)\, \mathrm{tr}(P^{-1} Y)
for arbitrary tangent vectors, X and Y, in S y m ( n , R ) .
Proposition 1. 
[24] The Hessian, g ( V ) , is positive definite on P D ( n , R ) , if and only if the conditions in Equation (3) hold.
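As a quick numerical illustration of these formulas, the following sketch compares the stated gradient mapping and the Hessian in Equation (5) with finite differences of φ^{(V)}(P) = V(det P) for the power potential V(s) = c₁ + c₂s^β. The dimension, the constants and the random matrices are ad hoc illustrative choices, not values taken from the paper.

```python
import numpy as np

# Finite-difference sanity check (illustrative values only) of the gradient
# mapping and the Hessian metric for phi(P) = V(det P), V(s) = c1 + c2*s**beta.
rng = np.random.default_rng(0)
n, beta, c1, c2 = 3, -0.1, 0.0, 1.0            # beta*c2 < 0 and beta < 1/n hold
A = rng.standard_normal((n, n))
P = A @ A.T + n * np.eye(n)                    # a positive definite point
X = rng.standard_normal((n, n)); X = X + X.T   # symmetric tangent vectors
Y = rng.standard_normal((n, n)); Y = Y + Y.T

phi = lambda P: c1 + c2 * np.linalg.det(P) ** beta
nu1 = lambda s: c2 * beta * s ** beta          # nu_1(s) = s V'(s)
nu2 = lambda s: c2 * beta ** 2 * s ** beta     # nu_2(s) = s nu_1'(s)

s, Pi = np.linalg.det(P), np.linalg.inv(P)
grad = nu1(s) * Pi                             # gradient mapping P -> P*
g_XY = (-nu1(s) * np.trace(Pi @ X @ Pi @ Y)
        + nu2(s) * np.trace(Pi @ X) * np.trace(Pi @ Y))   # Hessian g_P(X, Y)

e = 1e-5
print(np.trace(grad @ X), (phi(P + e * X) - phi(P - e * X)) / (2 * e))
e = 1e-4
print(g_XY, (phi(P + e*(X + Y)) - phi(P + e*X) - phi(P + e*Y) + phi(P)) / e**2)
```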
To establish the Legendre relation on P D ( n , R ) , we consider the conjugate function of φ ( V ) denoted by φ ( V ) * . Define the Legendre transform:
\varphi^{(V)*}(P^{*}) = \sup_{P \in PD(n,\mathbb{R})} \left\{ \langle P^{*}, P \rangle - \varphi^{(V)}(P) \right\}
Since the extremal condition is:
P^{*} = \mathrm{grad}\, \varphi^{(V)}(P) = \nu_1(\det P)\, P^{-1}
and grad φ ( V ) is invertible by the positive definiteness of g ( V ) , we have the following expression for φ ( V ) * with respect to P:
\varphi^{(V)*}(P^{*}) = n\, \nu_1(\det P) - \varphi^{(V)}(P)
Hence, the canonical divergence [23] D^{(V)} of (PD(n,ℝ), ∇, g^{(V)}) is obtained as:
D^{(V)}(P, Q) = \varphi^{(V)}(P) + \varphi^{(V)*}(Q^{*}) - \langle Q^{*}, P \rangle = V(\det P) - V(\det Q) + \langle Q^{*}, Q - P \rangle
Regarding g^{(V)} as a Riemannian metric, we can consider PD(n,ℝ) as a Riemannian manifold. Further, using the canonical flat affine connection on Sym(n,ℝ), denoted by ∇, define the dual affine connection [23], ∇*^{(V)}, satisfying:
X\, g^{(V)}(Y, Z) = g^{(V)}(\nabla_X Y, Z) + g^{(V)}(Y, \nabla^{*(V)}_X Z)
for arbitrary tangent vector fields, X, Y and Z, on PD(n,ℝ); then, we can introduce a dually flat structure [23] or a Hessian structure [29], (PD(n,ℝ), g^{(V)}, ∇, ∇*^{(V)}).
Their covariant derivatives at P are actually given by:
\left( \nabla_{\partial/\partial x^i}\, \partial/\partial x^j \right)_P = 0, \qquad \left( \nabla^{*(V)}_{\partial/\partial x^i}\, \partial/\partial x^j \right)_P = -E_i P^{-1} E_j - E_j P^{-1} E_i + \Phi(E_i, E_j, P) + \tilde{\Phi}(E_i, E_j, P)
where:
\Phi(X, Y, P) = \frac{\nu_2(s)}{\nu_1(s)} \left\{ \mathrm{tr}(P^{-1}X)\, Y + \mathrm{tr}(P^{-1}Y)\, X \right\}, \qquad \tilde{\Phi}(X, Y, P) = \frac{\left\{ \nu_3(s)\nu_1(s) - 2\nu_2^2(s) \right\} \mathrm{tr}(P^{-1}X)\, \mathrm{tr}(P^{-1}Y) + \nu_2(s)\nu_1(s)\, \mathrm{tr}(P^{-1}XP^{-1}Y)}{\nu_1(s) \left\{ n\nu_2(s) - \nu_1(s) \right\}}\, P
and s = det P.
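As a consistency check of the displayed formulas, the following sketch numerically verifies the duality relation of Equation (10) for the power potential V(s) = c₂s^β, treating X, Y and Z as constant coordinate fields so that ∇_X Y = 0. All numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Numerical check (illustrative values) that the displayed formula for the dual
# connection satisfies X g(Y,Z) = g(nabla_X Y, Z) + g(Y, nabla*_X Z) for the
# power potential, with constant coordinate fields (so nabla_X Y = 0).
rng = np.random.default_rng(4)
n, beta, c2 = 3, -0.2, 1.0
nu = lambda k, s: c2 * beta**k * s**beta           # nu_k(s) for V(s) = c2*s**beta

def g(P, X, Y):
    s, Pi = np.linalg.det(P), np.linalg.inv(P)
    return -nu(1, s) * np.trace(Pi @ X @ Pi @ Y) + nu(2, s) * np.trace(Pi @ X) * np.trace(Pi @ Y)

def nabla_star(P, X, Z):                           # (nabla*_X Z)_P from Equation (11)
    s, Pi = np.linalg.det(P), np.linalg.inv(P)
    r = nu(2, s) / nu(1, s)
    Phi = r * np.trace(Pi @ X) * Z + r * np.trace(Pi @ Z) * X
    Phit = ((nu(3, s) * nu(1, s) - 2 * nu(2, s)**2) * np.trace(Pi @ X) * np.trace(Pi @ Z)
            + nu(2, s) * nu(1, s) * np.trace(Pi @ X @ Pi @ Z)) / (nu(1, s) * (n * nu(2, s) - nu(1, s))) * P
    return -X @ Pi @ Z - Z @ Pi @ X + Phi + Phit

A = rng.standard_normal((n, n)); P = A @ A.T + n * np.eye(n)
sym = lambda M: M + M.T
X, Y, Z = (sym(rng.standard_normal((n, n))) for _ in range(3))

eps = 1e-5
lhs = (g(P + eps * X, Y, Z) - g(P - eps * X, Y, Z)) / (2 * eps)   # X g(Y, Z) at P
rhs = g(P, Y, nabla_star(P, X, Z))                                # nabla_X Y = 0
print(lhs, rhs)                                                   # agree numerically
```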
Since several properties of the pair of mutually dual connections, ∇ and ∇*^{(V)}, are stated in [24], we omit them here for the sake of simplicity. However, from a geometrical viewpoint, we should note the following two important properties, which are related to the invariance of the structures. In Section 3, we shall return to these points and discuss them in detail.
Proposition 2. 
[24]
1. 
The dually flat structure, (PD(n,ℝ), g^{(V)}, ∇, ∇*^{(V)}), is SL(n,ℝ)-invariant for general V(s) satisfying Equation (3), while it is GL(n,ℝ)-invariant for V(s) = c₁ + c₂ log s for real constants, c₁ and c₂ < 0.
2. 
When V ( s ) is a power function (with a constant term) of the form:
V ( s ) = c 1 + c 2 s β
for real constants, c₁, c₂ and β, satisfying Equation (3), both affine connections, ∇ and ∇*^{(V)}, are GL(n,ℝ)-invariant. Further, the orthogonality with respect to g^{(V)} is also GL(n,ℝ)-invariant, while g^{(V)} itself is not.
Remark 1. 
One interesting implication of the above second point is that, for the power function in Equation (12), both the ∇- and ∇*^{(V)}-projections [23], which can also be variationally characterized by the divergence in Equation (9) and its dual, are GL(n,ℝ)-invariant; hence, so is the Pythagorean theorem [23], as a one-dimensional special case of this implication. Conversely, GL(n,ℝ)-invariance of the Pythagorean theorem implies that of both projections, because ∇ and ∇*^{(V)} are torsion-free [23].

2.2. Relation between Information Geometries on the U-model and Positive Definite Matrices

We briefly introduce the U-divergence and U-model and show how the dualistic geometries induced from U-divergence and V-potential are related.
In the field of statistical inference, the most well-established method is the maximum likelihood method, which is based on the Kullback-Leibler divergence. To improve the robustness of the method while maintaining its theoretical advantages, such as efficiency, methods of minimizing general divergences have been proposed as alternatives to the maximum likelihood method [1,13,30,31,32].
Definition 2. 
Let U(s) be a smooth convex function with positive derivative u(s) = U′(s) > 0 on ℝ or a (semi-infinite) interval thereof, and let ξ be the inverse function of u there. If the following functional for two functions, f(x) and g(x), on ℝⁿ:
D_U(f, g) = \int \left[ U(\xi(g)) - U(\xi(f)) - \left\{ \xi(g) - \xi(f) \right\} f \right] dx
exists, we call it the U-divergence.
It follows that D_U(f,g) ≥ 0 and that D_U(f,g) = 0 if and only if f = g, because the integrand, U(ξ_g) − {U(ξ_f) + u(ξ_f)(ξ_g − ξ_f)}, where ξ_f = ξ(f) and ξ_g = ξ(g), is the difference between the convex function, U, and its supporting function. If we set U(s) = \frac{1}{\beta+1}(1+\beta s)^{(\beta+1)/\beta} for β ∈ ℝ, then the corresponding U-divergence is the β-divergence [1] defined by:
D_\beta(f, g) = \int \left[ \frac{g(x)^{\beta+1} - f(x)^{\beta+1}}{\beta+1} - \frac{f(x)\left\{ g(x)^{\beta} - f(x)^{\beta} \right\}}{\beta} \right] dx
As β goes to zero, it reduces to the Kullback-Leibler divergence; on the other hand, as β goes to one, it reduces to the squared L₂-distance. Thus, the efficiency increases as β goes to zero, while the robustness increases as β goes to one. In this sense, we could find an appropriate β between zero and one as a trade-off between efficiency and robustness. The β-divergence is strongly connected to the Tsallis entropy [33].
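For concreteness, the following small sketch evaluates the β-divergence of Equation (14) on a grid for two arbitrary one-dimensional Gaussian densities (the densities and the grid are illustrative assumptions) and exhibits the two limits mentioned above.

```python
import numpy as np

# Illustrative check of the beta-divergence and its limits (beta -> 0 and beta = 1).
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
f = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)             # N(0, 1)
g = np.exp(-0.5 * (x - 1)**2 / 4) / np.sqrt(8 * np.pi)   # N(1, 4)

def beta_div(f, g, beta):
    return np.sum((g**(beta + 1) - f**(beta + 1)) / (beta + 1)
                  - f * (g**beta - f**beta) / beta) * dx

kl = np.sum(f * np.log(f / g)) * dx
print(beta_div(f, g, 1e-4), kl)                          # nearly equal as beta -> 0
print(beta_div(f, g, 1.0), 0.5 * np.sum((f - g)**2) * dx)  # beta = 1: half the squared L2 norm
```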
When we consider the family of functions parametrized by elements in a manifold, M , the divergences on the family induce the dualistic structure on M [1]. Concretely, we here confine our attention to the family of multivariate probability density functions specified by P in M = P D ( n , R ) and then study its structure on P D ( n , R ) . The family is natural in the sense that it is a dually flat statistical manifold with respect to the dualistic geometry induced by the U-divergence.
Definition 3. 
Let U and u be the functions given in Definition 2. The family of elliptical distributions with the following density functions:
M_U = \left\{ f_U(x, P) = u\!\left( -\tfrac{1}{2} x^T P x - c_U(\det P) \right) \,\middle|\, P \in PD(n, \mathbb{R}) \right\}
is called the U-model associated with the U-divergence. Here, we set f_U(x, P) = 0 if the right-hand side is nonpositive or undefined, and c_U(det P) is a normalizing constant.
Note that if u satisfies the following self-similarity property:
\forall t = c_U(\det P), \ \exists a_t > 0, \ b_t > 0 : \quad u(s - t) = a_t\, u(b_t s)
where a_t and b_t are positive constants depending on det P, the density function, f_U, in the U-model can be alternatively expressed in the usual form of an elliptical distribution [20,21], i.e.,
f_U(x, P) = a_t\, u\!\left( -x^T \tilde{P} x / 2 \right), \qquad \tilde{P} = b_t P
One such example is the β-model, to be discussed in Section 4 and the Appendix. Thus, the probability density function, f_U(x,P), has a mean vector of zero and the variance matrix cP^{-1}, where c is a positive constant obtained from the characteristic function of f_U(x,P), and P is called the precision matrix.
Now, we consider the correspondence between the dualistic geometry induced by D U on the U-model and that on P D ( n , R ) induced by the V-potential function discussed.
Proposition 3. 
[24] Define the V-potential function, φ ( V ) , via:
V(s) = s^{-1/2} \int U\!\left( -\tfrac{1}{2} x^T x - c_U(s) \right) dx + c_U(s), \qquad s > 0
Assume that V satisfies the conditions in Equation (3); then, the dually flat structure, (g^{(V)}, ∇, ∇*^{(V)}), on PD(n,ℝ) coincides with that on the U-model induced by the U-divergence, D_U.

2.3. Statistical Estimation on the U-Model

We discuss a statistical estimation for the precision matrix parameter, P, in the U-model, M_U. The U-divergence, D_U(f,g), is decomposed into the difference of the U-cross entropy, C_U(f,g), and the U-entropy, H_U(f) [1], that is, D_U(f,g) = C_U(f,g) − H_U(f), where
C_U(f, g) = -\int \xi(g(x))\, f(x)\, dx + \int U(\xi(g(x)))\, dx
and H U ( f ) = C U ( f , f ) . Consider a maximum U-entropy distribution on the moment equal space. Let F be the space of all probability density functions on R n and
\mathcal{F}(P) = \left\{ f \in \mathcal{F} : E_f(X) = 0,\ E_f(XX^T) = cP^{-1} \right\}
the zero-mean and equal variance space in F , where E f denotes the statistical expectation with respect to f, and c is the aforementioned constant. Then, we observe that:
f_U(\cdot, P) = \operatorname*{argmax}_{f \in \mathcal{F}(P)} H_U(f)
In effect, for any f in F ( P ) ,
H_U(f_U(\cdot, P)) - H_U(f) = D_U(f, f_U(\cdot, P)) \geq 0
with equality only if f = f U ( · , P ) . Thus, the U-model, M U , is characterized by maximum U-entropy distributions.
Let { X i } 1 i N be random samples from a probability density function, f U ( x , P ) . Then, the U-loss function is defined by
L_U(P) = -\frac{1}{N} \sum_{i=1}^{N} \xi(f_U(X_i, P)) + \int U(\xi(f_U(x, P)))\, dx
in which the U-estimator, P̂_U, is defined as the minimizer of L_U(P). The U-loss function is an empirical analogue of the U-cross entropy, in the sense that E_f L_U(P) = C_U(f, f_U(·,P)). By definition, P̂_U is a solution of ∂L_U(P)/∂P = 0 if the solution is unique. Hence, we conclude that if N ≥ n, then P̂_U = cS^{-1}, where S is the sample variance matrix defined by \frac{1}{N}\sum_{i=1}^{N} X_i X_i^T, because
\frac{\partial L_U(P)}{\partial P} = \frac{1}{2} \left( S - \int x x^T f_U(x, P)\, dx \right)
and cP^{-1} = \int x x^T f_U(x,P)\, dx. We remark that S is positive definite with probability one if N ≥ n. The derivation of P̂_U is confirmed by the following fact:
L_U(P) - L_U(\hat{P}_U) = D_U\!\left( f_U(\cdot, \hat{P}_U), f_U(\cdot, P) \right) \geq 0
with equality, only if P = P ^ U .
Surprisingly, such U-estimators are independent of the choice of U, which implies that the U-estimator for a U-model equals the maximum likelihood estimator for the Gaussian model. On the other hand, assume that {X_i}_{1≤i≤N} are random samples from a Gaussian density function, that is, f_U(x,P) with U = exp. Then, unless U = exp, the U-estimator, P̂_U, of the precision parameter, P, in the Gaussian model has no exact expression. This is a rather different situation from the one discussed above, and P̂_U, obtained by an iterative algorithm, is shown to be robust against heavy outliers if the generator function, U, satisfies a tail condition.
For example, if we select U(s) = \frac{1}{\beta+1}(1+\beta s)^{(\beta+1)/\beta} with a fixed β > 0, then the corresponding divergence is the β-divergence given in Equation (14), and the corresponding U-estimator is called the β-estimator. The estimator is associated with an iteration algorithm, say {P_t, t = 1, 2, …}, with an initial value, P₁, in which the update, P_{t+1}, from the t-th step, P_t, is given by:
P_{t+1}^{-1} = \frac{1}{\sum_{i=1}^{N} w(X_i, P_t) - N d_\beta} \sum_{i=1}^{N} w(X_i, P_t)\, X_i X_i^{T}
where w(x,P) = exp(−\frac{\beta}{2} x^T P x) and d_β = β/(β+1)^{n/2+1}. See [1] for a detailed discussion. We remark that the β-estimator, P̂_β, satisfies the fixed-point condition obtained by setting P_{t+1} = P_t = P̂_β in Equation (26). Therefore, if the i-th observation, X_i, has an extremely large value of X_i^T P̂_β X_i, then the i-th weight, w(X_i, P̂_β), in the weighted variance form of Equation (26) becomes negligible, so that the β-estimator is automatically robust against these outliers. The degree of robustness of P̂_β depends on the value of β. In this way, it is also possible to introduce a dualistic structure on the product space of (U, U′) for U-models and U′-estimators.
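A minimal sketch of the reweighting iteration of Equation (26) is given below. The explicit constant d_β = β/(β+1)^{n/2+1}, the sample data and the stopping rule are illustrative assumptions (they should be checked against [1]); the sketch only serves to show how the weights w(X_i, P_t) suppress outlying observations.

```python
import numpy as np

# Sketch of the reweighting iteration for the beta-estimator of a zero-mean
# Gaussian precision matrix.  d_beta and the data are illustrative assumptions,
# not values taken from the paper.
def beta_estimator(X, beta, n_iter=200, tol=1e-10):
    N, n = X.shape
    d_beta = beta / (beta + 1.0) ** (n / 2.0 + 1.0)
    P = np.linalg.inv(X.T @ X / N)                 # initial value P_1
    for _ in range(n_iter):
        w = np.exp(-0.5 * beta * np.einsum('ij,jk,ik->i', X, P, X))
        P_new = np.linalg.inv((X.T * w) @ X / (w.sum() - N * d_beta))
        if np.max(np.abs(P_new - P)) < tol:
            break
        P = P_new
    return P_new

rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(2), [[2.0, 0.5], [0.5, 1.0]], size=500)
X[:5] *= 20.0                                      # a few gross outliers
print(np.linalg.inv(beta_estimator(X, beta=0.3)))  # roughly recovers [[2, .5], [.5, 1]]
print(X.T @ X / len(X))                            # the sample variance S is inflated
```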

3. G L ( n , R ) -Invariance Induced from the Power Function

The transformation group {τ_G | G ∈ GL(n,ℝ)} acts transitively on PD(n,ℝ), i.e., for all P, P′ ∈ PD(n,ℝ), there exists G ∈ GL(n,ℝ) such that τ_G P = GPG^T = P′. We denote by τ_{G*} the differential of τ_G. The invariance of the geometry under these transformations is defined as follows:
Definition 4. 
We say a dually flat structure, ( P D ( n , R ) , g ( V ) , , * ( V ) ) , is G L ( n , R ) -invariant if the Riemannian metric, g ( V ) , and the pair of mutually dual connections, and * ( V ) , satisfy:
g_P^{(V)}(X, Y) = g_{P'}^{(V)}(X', Y')
\tau_{G*} \left( \nabla_X Y \right)_P = \left( \nabla_{X'} Y' \right)_{P'}, \qquad \tau_{G*} \left( \nabla^{*(V)}_X Y \right)_P = \left( \nabla^{*(V)}_{X'} Y' \right)_{P'}
for arbitrary G ∈ GL(n,ℝ), where P′ = τ_G P, X′ = τ_{G*}X and Y′ = τ_{G*}Y, respectively.
We can similarly define SL(n,ℝ)-invariance, although SL(n,ℝ) does not act transitively on PD(n,ℝ). These invariances mean that the geometries are homogeneous with respect to the corresponding transformations. They practically imply that the obtained geometrical results are not influenced by scaling (unit change), numerical conditioning, and so on. Note that (PD(n,ℝ), g^{(V)}, ∇, ∇*^{(V)}) is GL(n,ℝ)-invariant if and only if the corresponding canonical divergence is also, i.e.,
\forall G \in GL(n, \mathbb{R}), \qquad D^{(V)}(P, Q) = D^{(V)}(\tau_G P, \tau_G Q)
because the dually flat structure can be alternatively derived from the canonical divergence [1,23].
Now, we fix the form of V as V ( s ) = c 1 + c 2 s β to actually confirm the invariance property described in the second statement of Proposition 2. First, the convexity conditions in Equation (3) reduce to:
\mathrm{(i)}\ \beta c_2 < 0, \qquad \mathrm{(ii)}\ \beta < \frac{1}{n}
The dual variables, P * , and the conjugate function, φ ( V ) * , are expressed by P as:
P^{*} = c_2 \beta (\det P)^{\beta} P^{-1}, \qquad \varphi^{(V)*}(P^{*}) = n c_2 \beta (\det P)^{\beta} - c_1 - c_2 (\det P)^{\beta}
The corresponding Riemannian metric in Equation (5) and divergence in Equation (9) are respectively given by:
g_P^{(V)}(X, Y) = c_2 \beta (\det P)^{\beta} \left\{ -\mathrm{tr}(P^{-1}XP^{-1}Y) + \beta\, \mathrm{tr}(P^{-1}X)\, \mathrm{tr}(P^{-1}Y) \right\}
D^{(V)}(P, Q) = c_2 \left\{ (\det P)^{\beta} - (\det Q)^{\beta} + \beta (\det Q)^{\beta}\, \mathrm{tr}\!\left( I - Q^{-1} P \right) \right\}
When we particularly set c₁ = −c₂ = 1/β, i.e., V(s) = (1 − s^β)/β, and let β go to zero, we find that they converge to the standard Riemannian metric and divergence [27] for V(s) = −log s. We immediately see that the above g^{(V)} and D^{(V)} are not GL(n,ℝ)- but SL(n,ℝ)-invariant. However, g_P^{(V)}(X,Y) = 0 if and only if g_{P′}^{(V)}(X′,Y′) = 0 for any G ∈ GL(n,ℝ), where P′ = τ_G P, X′ = τ_{G*}X and Y′ = τ_{G*}Y. Thus, the orthogonality is GL(n,ℝ)-invariant.
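These invariance statements can be observed numerically. The sketch below uses random matrices and ad hoc values of n, β, c₂ (illustrative assumptions, not an example from the paper); it illustrates that the divergence just displayed is preserved by SL(n,ℝ) but not by general GL(n,ℝ) transformations, while the vanishing of g^{(V)} (orthogonality) is preserved by any τ_G.

```python
import numpy as np

# Illustrative check of SL(n,R)- vs GL(n,R)-invariance for V(s) = c2*s**beta.
rng = np.random.default_rng(2)
n, beta, c2 = 3, -0.2, 1.0

def div(P, Q):
    dP, dQ = np.linalg.det(P), np.linalg.det(Q)
    return c2 * (dP**beta - dQ**beta
                 + beta * dQ**beta * np.trace(np.eye(n) - np.linalg.solve(Q, P)))

def metric(P, X, Y):
    Pi = np.linalg.inv(P)
    return c2 * beta * np.linalg.det(P)**beta * (
        -np.trace(Pi @ X @ Pi @ Y) + beta * np.trace(Pi @ X) * np.trace(Pi @ Y))

def spd():
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

tau = lambda G, M: G @ M @ G.T                     # action on points and tangent vectors
P, Q = spd(), spd()
G = rng.standard_normal((n, n))
G_sl = G / np.abs(np.linalg.det(G)) ** (1 / n)     # (det G_sl)^2 = 1, which suffices here

print(div(P, Q), div(tau(G_sl, P), tau(G_sl, Q)))  # equal: SL(n,R)-invariance
print(div(P, Q), div(tau(G, P), tau(G, Q)))        # generally different

X = rng.standard_normal((n, n)); X = X + X.T
Y = rng.standard_normal((n, n)); Y = Y + Y.T
Y = Y - metric(P, X, Y) / metric(P, X, X) * X      # enforce g_P(X, Y) = 0
print(metric(P, X, Y), metric(tau(G, P), tau(G, X), tau(G, Y)))   # both ~ 0
```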
Next, the covariant derivatives of ∇ vanish everywhere, since it is the canonical flat affine connection. For the covariant derivatives of ∇*^{(V)}, it is seen that the first and second terms in Equation (11) are GL(n,ℝ)-invariant. The third and fourth terms of the dual covariant derivatives respectively reduce to:
\Phi(X, Y, P) = \beta\, \mathrm{tr}(P^{-1}X)\, Y + \beta\, \mathrm{tr}(P^{-1}Y)\, X
\tilde{\Phi}(X, Y, P) = \frac{\beta^2\, \mathrm{tr}(P^{-1}X)\, \mathrm{tr}(P^{-1}Y) - \beta\, \mathrm{tr}(P^{-1}XP^{-1}Y)}{1 - n\beta}\, P
which are independent of det P. Thus, we find that both ∇ and ∇*^{(V)} are GL(n,ℝ)-invariant.
Finally, consider two smooth curves, γ = {C_γ(t) | −ε < t < ε} and γ* = {C_{γ*}(t) | −ε < t < ε}, in PD(n,ℝ), satisfying:
C_\gamma(0) = C_{\gamma^*}(0) = Q, \qquad \frac{dC_\gamma}{dt}(0) = X, \qquad \frac{dC_{\gamma^*}}{dt}(0) = Y
Since ∇ is the canonical flat connection on Sym(n,ℝ), γ is a ∇-geodesic iff it is represented by C_γ(t) = Q + tX. On the other hand, γ* is a ∇*^{(V)}-geodesic iff it is represented as a straight line in the dual variable, C^*_{γ*}(t) [23]. Let us obtain its explicit form. Since it follows that:
\frac{d}{dt} C^{*}_{\gamma^*} = c_2 \beta \left\{ \frac{d}{dt}(\det C_{\gamma^*})^{\beta}\, C_{\gamma^*}^{-1} + (\det C_{\gamma^*})^{\beta} \frac{d}{dt} C_{\gamma^*}^{-1} \right\} = c_2 \beta^2 (\det C_{\gamma^*})^{\beta-1} \left\langle \mathrm{grad}\det C_{\gamma^*}, \frac{dC_{\gamma^*}}{dt} \right\rangle C_{\gamma^*}^{-1} - c_2 \beta (\det C_{\gamma^*})^{\beta}\, C_{\gamma^*}^{-1} \frac{dC_{\gamma^*}}{dt} C_{\gamma^*}^{-1}
we have:
\tilde{Y} = \frac{dC^{*}_{\gamma^*}}{dt}(0) = c_2 \beta (\det Q)^{\beta} \left\{ \beta\, \mathrm{tr}(Q^{-1}Y)\, Q^{-1} - Q^{-1} Y Q^{-1} \right\}
by substituting Equation (35). Thus, γ* is a ∇*^{(V)}-geodesic iff it is represented in the dual variables by:
C γ * * ( t ) = Q * + t Y ˜
Assume that X and Y are mutually orthogonal at Q, and that two points, P and R, in PD(n,ℝ) are, respectively, located on the ∇-geodesic γ and the ∇*^{(V)}-geodesic γ* satisfying Equation (35), i.e.,
P = Q + t_1 X, \qquad R^{*} = Q^{*} + t_2 \tilde{Y}, \qquad g_Q^{(V)}(X, Y) = 0
for some real numbers, t 1 and t 2 . Then, we have:
D^{(V)}(P, Q) + D^{(V)}(Q, R) - D^{(V)}(P, R) = \langle R^{*} - Q^{*}, P - Q \rangle = t_1 t_2 \langle \tilde{Y}, X \rangle = t_1 t_2\, g_Q^{(V)}(X, Y) = 0
which results in the Pythagorean theorem. If we, respectively, transform Q, X, Y in Equation (35) to Q′ = τ_G Q, X′ = τ_{G*}X, Y′ = τ_{G*}Y, we see that Equations (39) and (40) hold with t₂ replaced by t₂(det G)^{2β}. Thus, even if the three points, P, Q, R, are, respectively, transformed by τ_G to P′, Q′, R′ with an arbitrary, but common, G ∈ GL(n,ℝ), the Pythagorean theorem still holds for P′, Q′, R′.
The GL(n,ℝ)-invariances of both the ∇- and ∇*^{(V)}-projections are similarly confirmed.
Thus, we have confirmed the second statement of Proposition 2, i.e., the above G L ( n , R ) -invariance holds if V ( s ) = c 1 + c 2 s β . In fact, we can show that the converse of Proposition 2 is also true.
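The Pythagorean relation above can also be observed numerically. The sketch below (illustrative values of n, β, c₂ and random Q, X, Y; the helper functions simply repeat the formulas displayed earlier) constructs P on the ∇-geodesic and R on the ∇*^{(V)}-geodesic and evaluates both sides of Equation (40).

```python
import numpy as np

# Numerical illustration of the Pythagorean relation for V(s) = c2*s**beta
# (ad hoc values; not an example from the paper).
rng = np.random.default_rng(3)
n, beta, c2 = 3, -0.2, 1.0

def dual(P):                        # gradient map P -> P* = c2*beta*(det P)**beta * P^{-1}
    return c2 * beta * np.linalg.det(P)**beta * np.linalg.inv(P)

def dual_inv(Ps):                   # invert the gradient map
    detR = (np.linalg.det(Ps) / (c2 * beta)**n) ** (1.0 / (n * beta - 1.0))
    return c2 * beta * detR**beta * np.linalg.inv(Ps)

def div(P, Q):
    return (c2 * (np.linalg.det(P)**beta - np.linalg.det(Q)**beta)
            + np.trace(dual(Q) @ (Q - P)))

def metric(P, X, Y):
    Pi = np.linalg.inv(P)
    return c2 * beta * np.linalg.det(P)**beta * (
        -np.trace(Pi @ X @ Pi @ Y) + beta * np.trace(Pi @ X) * np.trace(Pi @ Y))

A = rng.standard_normal((n, n)); Q = A @ A.T + n * np.eye(n)
X = rng.standard_normal((n, n)); X = X + X.T
Y = rng.standard_normal((n, n)); Y = Y + Y.T
Y = Y - metric(Q, X, Y) / metric(Q, X, X) * X           # enforce g_Q(X, Y) = 0

Qi = np.linalg.inv(Q)
Ytil = c2 * beta * np.linalg.det(Q)**beta * (
    beta * np.trace(Qi @ Y) * Qi - Qi @ Y @ Qi)         # the expression for Y-tilde above

t1, t2 = 0.05, 0.05
P = Q + t1 * X                                          # point on the nabla-geodesic
R = dual_inv(dual(Q) + t2 * Ytil)                       # point on the dual geodesic
print(div(P, Q) + div(Q, R), div(P, R))                 # equal up to round-off
```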
Theorem 1. 
Assume that the function, V, meets Equation (3). The mutually dual connections, ∇ and ∇*^{(V)}, and the orthogonality with respect to g^{(V)} are GL(n,ℝ)-invariant, if and only if V(s) = c₁ + c₂ s^β or V(s) = c₁ + c₂ log s.
Proof. 
We only show the 'only if' part. The covariant derivatives of ∇ are independent of V and clearly invariant. For those of ∇*^{(V)} in Equation (11), the first and second terms can be readily understood to be invariant. The third term, Φ, and the fourth term, Φ̃, are invariant only if the ratios, ν₂(s)/ν₁(s) and ν₃(s)/ν₁(s), are also invariant (i.e., constant in s), because the coefficients in these terms are respectively expressed by:
\frac{\nu_2}{\nu_1}\, \mathrm{tr}(P^{-1}X), \qquad \frac{\nu_2}{\nu_1}\, \mathrm{tr}(P^{-1}Y)
\frac{\nu_3/\nu_1 - 2(\nu_2/\nu_1)^2}{n\, \nu_2/\nu_1 - 1}\, \mathrm{tr}(P^{-1}X)\, \mathrm{tr}(P^{-1}Y) + \frac{\nu_2/\nu_1}{n\, \nu_2/\nu_1 - 1}\, \mathrm{tr}(P^{-1}XP^{-1}Y)
where tr ( P 1 X ) , tr ( P 1 Y ) and tr ( P 1 X P 1 Y ) are invariant.
From the definition of ν i ( s ) ’s, the invariance of ν 2 ( s ) / ν 1 ( s ) is satisfied by the solutions, ν 1 ( s ) , of the ordinary differential equation:
\frac{1}{\nu_1} \frac{d\nu_1}{ds} = \frac{\beta}{s}
for a real constant, β. By solving the ODE and integrating again, we have V(s) = c₁ + c₂ s^β or V(s) = c₁ + c₂ log s for real constants, c_i, i = 1, 2. These forms also meet the invariance of ν₃(s)/ν₁(s). The invariance of the orthogonality for such functions, V, has already been confirmed.      ☐
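The last step of the proof can be spot-checked symbolically; the snippet below is an illustrative sketch (with β taken positive for definiteness) showing that solving the ODE and integrating once more recovers the stated forms of V.

```python
import sympy as sp

# Symbolic spot check of the last step of the proof (illustrative only).
s, beta, C = sp.symbols('s beta C', positive=True)
nu1 = sp.Function('nu1')
sol = sp.dsolve(sp.Eq(nu1(s).diff(s) / nu1(s), beta / s), nu1(s))
print(sol)                                   # nu1(s) proportional to s**beta
print(sp.integrate(C * s**beta / s, s))      # power case, up to an integration constant
print(sp.integrate(C / s, s))                # beta = 0 case: C*log(s), up to a constant
```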
Note that the above theorem does not exclude the possibility of GL(n,ℝ)-invariance for potential functions of forms other than the V-potentials.

4. Geometry on the q-Gaussian Family Induced by the β-Divergence and the V-Geometry

This section demonstrates the main result.
Let β be a real parameter satisfying β ≠ 0 and β ≠ −1, and define a function, U, by:
U(s) = \begin{cases} \dfrac{1}{\beta+1}(1+\beta s)^{(\beta+1)/\beta}, & s \in I_\beta = \{ s \in \mathbb{R} \mid 1+\beta s > 0 \} \\ 0, & \text{otherwise} \end{cases}
Using its derivative, we define a function, u, by:
u(s) = \dfrac{dU(s)}{ds} = \begin{cases} (1+\beta s)^{1/\beta}, & s \in I_\beta \\ 0, & \text{otherwise} \end{cases}
and the inverse, ξ, of u by:
\xi(t) = \dfrac{t^{\beta} - 1}{\beta}, \qquad t > 0
Note that U is convex and u is positive where s > −1/β if β > 0 and s < −1/β if β < 0, respectively. Further, u and ξ, respectively, approach the usual exponential and logarithmic functions as β goes to zero. Hence, by introducing a parameter, q = 1 + β, they are called the q-exponential and q-logarithmic functions in the literature of nonextensive statistical physics [4,6].
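The following small sketch (with arbitrary sample points, chosen for illustration) checks that u and ξ defined above are mutually inverse on their domains and approach exp and log as β → 0.

```python
import numpy as np

# u and xi in the beta-parametrization (q = 1 + beta); illustrative check only.
def u(s, beta):
    return (1 + beta * s) ** (1 / beta)        # valid on I_beta = {s : 1 + beta*s > 0}

def xi(t, beta):
    return (t ** beta - 1) / beta              # inverse of u for t > 0

s = np.linspace(-0.5, 2.0, 6)                  # contained in I_beta for both betas below
for beta in (0.5, -0.25):
    print(np.allclose(xi(u(s, beta), beta), s))         # True: xi inverts u
print(np.max(np.abs(u(s, 1e-8) - np.exp(s))))           # ~ 0: u -> exp as beta -> 0
print(np.max(np.abs(xi(np.exp(s), 1e-8) - s)))          # ~ 0: xi -> log as beta -> 0
```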
Let us fix the parameter, β, arbitrarily and consider an elliptical density function, f ( x , P ) , specified by P P D ( n , R ) using the q-exponential function, u:
f(x, P) = u\!\left( -\tfrac{1}{2} x^T P x - c_\beta(\det P) \right)
where c_β(det P) is the normalizing constant. The density function, f, is said to be the (zero-mean) q-Gaussian [4,6], and the family of all such densities, denoted by M_β, is called the q-Gaussian family or the β-model.
By starting from the β-divergence, we can define a dualistic structure on M β , which invokes the corresponding V-geometry on the parameter space, P D ( n , R ) . The V-potential for the V-geometry is obtained as follows:
Theorem 2. 
The information geometry on M β induced from the β-divergence is characterized by a dually flat structure, ( g ( V ) , , * ( V ) ) , on P D ( n , R ) induced by the V-potentials:
V(s) = \begin{cases} \dfrac{1}{\beta} + c_{+}\, s^{1/(2 n_\beta)}, & \beta > 0 \\[4pt] \dfrac{1}{\beta} + c_{-}\, s^{1/(2 n_\beta)}, & -\dfrac{2}{n+2} < \beta < 0 \end{cases}
satisfying Equation (30), where s = det P, n_β = n/2 + 1/β and c_± are constants depending on β and n.
The proof can be found in the Appendix.
The above theorem implies that geometric structure on M β induced from the β-divergence admits the natural invariance properties discussed in Section 3.

5. Conclusions

We have proved that the information geometry on the q-Gaussian family induced by the β-divergence is equivalently characterized by the V-geometry on the space of positive definite matrices induced by the power potential. Studying the corresponding V-geometry, we have shown that some of the dually flat structures of the q-Gaussian family admit GL(n,ℝ)-invariances. This fact implies the importance of the family in multivariate statistical analysis, and it also gives a geometrical viewpoint on the mathematical properties of the q-exponential functions. Following the approach given in Section 2.1, we can introduce other dually flat structures via arbitrary convex potentials in addition to the V-potential, by defining Riemannian metrics as their Hessians and dual flat connections satisfying Equation (10) for ∇. The relations between such dually flat structures and the ones on the other deformed exponential families are left for future work.
Robustness in statistical estimations involving the q-Gaussian family (β-model) and the β-divergence, which is another important aspect, is roughly explained at the end of Section 2.3.
Recently, the theory of optimal transportation has developed greatly, providing geometrical insights closely related to the resolution of the Poincaré conjecture [34]. We can find some ideas and arguments similar to those established in this paper, although there is no direct link between the two objectives. In effect, a divergence is defined in [35] and shown to play a significant role in checking a condition for the existence of optimal transportation; there, geometry is explored through a family of probability density functions on a space, rather than by investigating properties of the space directly. It is expected that a coupling of that theory and information geometry will give fruitful results in the near future.

Acknowledgments

We thank the anonymous referees for their useful comments and suggestions, in particular, for pointing out a relation with the optimal transportation theory. Atsumi Ohara was partially supported by Japan Society for the Promotion of Science (JSPS) Grant-in-Aid (C) 23540134. Shinto Eguchi was supported by Japan Science and Technology Agency (JST), Core Research for Evolutionary Science and Technology (CREST).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Eguchi, S. Information geometry and statistical pattern recognition. Sugaku Expos. 2006, 19, 197–216. [Google Scholar]
  2. Grünwald, P.D.; Dawid, A.P. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Ann. Stat. 2004, 32, 1367–1433. [Google Scholar]
  3. Kaniadakis, G.; Lissia, M.; Scarfone, A.M. Deformed logarithms and entropies. Physica A 2004, 340, 41–49. [Google Scholar] [CrossRef]
  4. Naudts, J. Generalized thermostatics; Springer: London, UK, 2010. [Google Scholar]
  5. Ollila, E.; Tyler, D.; Koivunen, V.; Poor, V. Complex elliptically symmetric distributions: Survey, new results and applications. IEEE Trans. Signal Process. 2012, 60, 5597–5623. [Google Scholar] [CrossRef]
  6. Tsallis, C. Introduction to Nonextensive Statistical Mechanics; Springer: New York, NY, USA, 2009. [Google Scholar]
  7. Barndorff-Nielsen, O.E. Information and Exponential Families in Statistical Theory; Wiley: Chichester, UK, 1978. [Google Scholar]
  8. Tsallis, C. What are the numbers that experiments provide? Quimica Nova 1994, 17, 468–471. [Google Scholar]
  9. Naudts, J. The q-exponential family in statistical physics. Cent. Eur. J. Phys. 2009, 7, 405–413. [Google Scholar] [CrossRef]
  10. Naudts, J.; Baeten, M. Non-extensivity of the configuration density distribution in the classical microcanonical ensemble. Entropy 2009, 11, 285–294. [Google Scholar] [CrossRef]
  11. Ohara, A.; Wada, T. Information geometry of q-Gaussian densities and behaviors of solutions to related diffusion equations. J. Phys. A: Math. Theor. 2010, 43, 035002. [Google Scholar] [CrossRef]
  12. Eguchi, S.; Copas, J. A class of logistic-type discriminant functions. Biometrika 2002, 89, 1–22. [Google Scholar] [CrossRef]
  13. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  14. Higuchi, I.; Eguchi, S. Robust principal component analysis with adaptive selection for Tuning Parameters. J. Mach. Learn. Res. 2004, 5, 453–471. [Google Scholar]
  15. Borges, E.P. A possible deformed algebra and calculus inspired in nonextensive thermostatistics. Physica A 2004, 340, 95–101. [Google Scholar] [CrossRef]
  16. Nivanen, L.; Le Mehaute, A.; Wang, Q.A. Generalized algebra within a nonextensive statistics. Rep. Math. Phys. 2003, 52, 437–444. [Google Scholar] [CrossRef]
  17. Suyari, H. Mathematical structures derived from the q-multinomial coefficient in Tsallis statistics. Physica A 2006, 368, 63–82. [Google Scholar] [CrossRef]
  18. Suyari, H.; Wada, T. Multiplicative duality, q-triplet and μ,ν,q-relation derived from the one-to-one correspondence between the (μ,ν)-multinomial coefficient and Tsallis entropy Sq. Physica A 2008, 387, 71–83. [Google Scholar] [CrossRef]
  19. Tsallis, C.; Gell-Mann, M.; Sato, Y. Asymptotically scale-invariant occupancy of phase space makes the entropy Sq extensive. Proc. Natl. Acad. Sci. USA 2005, 102, 15377–15382. [Google Scholar] [CrossRef] [PubMed]
  20. Fang, K.T.; Kotz, S.; Ng, K.W. Symmetric multivariate and related distributions; Chapman and Hall: London, UK, 1990. [Google Scholar]
  21. Muirhead, R.J. Aspects of Multivariate Statistical Theory; Wiley: New York, NY, USA, 1982. [Google Scholar]
  22. Amari, S. Differential-Geometrical Methods in Statistics; Lecture notes in statistics; Volume 28, Springer: New York, NY, USA, 1985. [Google Scholar]
  23. Amari, S.; Nagaoka, H. Methods of Information Geometry (Translations of Mathematical Monographs); Oxford University Press: Oxford, UK, 2000. [Google Scholar]
  24. Ohara, A.; Eguchi, S. Geometry on Positive Definite Matrices Induced from V-Potential Function; Nielsen, F., Barbaresco, F., Eds.; Geometric Science of Information; Lecture Notes in Computer Science 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 621–629. [Google Scholar]
  25. Eaton, M.L. Group Invariance Applications in Statistics; Institute of Mathematical Statistics: Hayward, CA, USA, 1989. [Google Scholar]
  26. Faraut, J.; Korányi, A. Analysis on Symmetric Cones; Oxford University Press: New York, NY, USA, 1994. [Google Scholar]
  27. Ohara, A.; Suda, N.; Amari, S. Dualistic differential geometry of positive definite matrices and its applications to related problems. Linear Algebra Appl. 1996, 247, 31–53. [Google Scholar] [CrossRef]
  28. Vinberg, E.B. The theory of convex homogeneous cones. Trans. Moscow Math. Soc. 1963, 12, 340–430. [Google Scholar]
  29. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, Singapore, 2007. [Google Scholar]
  30. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. [Google Scholar] [CrossRef]
  31. Eguchi, S.; Komori, O.; Kato, S. Projective power entropy and maximum Tsallis entropy Distributions. Entropy 2011, 13, 1746–1764. [Google Scholar] [CrossRef]
  32. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef] [PubMed]
  33. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
  34. Villani, C. Optimal Transport: Old and New; Springer: Heidelberg, Germany, 2009. [Google Scholar]
  35. Lott, J.; Villani, C. Ricci curvature for metric-measure spaces via optimal transport. Ann. Math. 2009, 169, 903–991. [Google Scholar] [CrossRef]

Appendix: Proof of Theorem 2

Proof. 
We prove the theorem by deriving the corresponding V-potential according to Equation (18).
Derivation for the case β > 0
In the case β > 0 , we see that the density function, f ( x , P ) , has a bounded support, and f ( 0 , P ) > 0 implies c β ( det P ) < 1 / β by considering the positivity of u.
We first calculate c β ( det P ) . Set y = P 1 / 2 x ; then the normalizing condition is:
\int_{\mathbb{R}^n} f(x, P)\, dx = (\det P)^{-1/2} \int_{\mathbb{R}^n} u\!\left( -\tfrac{1}{2} y^T y - c_\beta(\det P) \right) dy = 1
The equality is modified as:
(\det P)^{-1/2}\, \frac{\pi^{n/2}}{\Gamma(n/2)} \int_0^{\rho} u\!\left( -\tfrac{r}{2} - c_\beta(\det P) \right) r^{n/2 - 1}\, dr = 1
where ρ = min{ r | u(−r/2 − c_β) = 0 } = −2c_β + 2/β > 0 and Γ is Euler's gamma function.
For a positive number, α, let us define the integral appearing in the above as:
I_{\alpha,\beta}(c_\beta) = \int_0^{\rho} u\!\left( -\tfrac{r}{2} - c_\beta \right) r^{\alpha-1}\, dr = \int_0^{\rho} \left( 1 - \beta c_\beta - \tfrac{\beta}{2} r \right)^{1/\beta} r^{\alpha - 1}\, dr
then, by a change of variables r = ρ z , we have:
I_{\alpha,\beta}(c_\beta) = \left( \tfrac{\beta}{2} \right)^{1/\beta} \int_0^{\rho} (\rho - r)^{1/\beta}\, r^{\alpha-1}\, dr = \left( \tfrac{\beta}{2} \right)^{1/\beta} \int_0^{1} \rho^{1/\beta} (1-z)^{1/\beta}\, \rho^{\alpha-1} z^{\alpha-1}\, \rho\, dz = \left( \tfrac{\beta}{2} \right)^{1/\beta} \rho^{\alpha + 1/\beta}\, B\!\left( \tfrac{1}{\beta}+1, \alpha \right) = 2^{\alpha} \beta^{1/\beta}\, B\!\left( \tfrac{1}{\beta}+1, \alpha \right) \left( \tfrac{1}{\beta} - c_\beta \right)^{\alpha + 1/\beta}
where B is Euler’s beta function.
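The closed form in Equation (52) can be spot-checked numerically; the values of α, β and c_β below are arbitrary illustrations (any c_β < 1/β would do).

```python
import numpy as np
from math import gamma

# Compare a direct quadrature of I_{alpha,beta}(c) with the closed form above.
alpha, beta, c = 1.5, 0.5, 0.8
rho = -2 * c + 2 / beta
r = np.linspace(0.0, rho, 200001)
integrand = (1 - beta * c - beta / 2 * r) ** (1 / beta) * r ** (alpha - 1)
numeric = np.sum((integrand[1:] + integrand[:-1]) / 2) * (r[1] - r[0])   # trapezoid rule
B = gamma(1 / beta + 1) * gamma(alpha) / gamma(1 / beta + 1 + alpha)
closed = 2 ** alpha * beta ** (1 / beta) * B * (1 / beta - c) ** (alpha + 1 / beta)
print(numeric, closed)   # agree to several digits
```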
From Equations (50) and (52), the normalizing constant for α = n / 2 is given by:
c_\beta(s) = \frac{1}{\beta} - \left( k_{+}\, s^{1/2} \right)^{1/n_\beta}
where s = det P , n β = n / 2 + 1 / β > 0 and:
k_{+} = \left( \frac{1}{\beta} \right)^{1/\beta} (2\pi)^{-n/2}\, \frac{\Gamma(n/2)}{B\!\left( \frac{1}{\beta}+1, \frac{n}{2} \right)} = \left( \frac{1}{\beta} \right)^{1/\beta} (2\pi)^{-n/2}\, \frac{\Gamma(n_\beta + 1)}{\Gamma\!\left( \frac{1}{\beta}+1 \right)} > 0
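As a numerical illustration of Equation (53) and the constant k₊ (the dimension, β and the matrix P below are arbitrary test values), one can verify that the resulting density integrates to one:

```python
import numpy as np
from math import gamma, pi

# Check that c_beta(s) built from k_+ normalizes f(x, P); n = 2, beta = 0.5 (test values).
n, beta = 2, 0.5
n_beta = n / 2 + 1 / beta
k_plus = (1 / beta) ** (1 / beta) * (2 * pi) ** (-n / 2) * gamma(n_beta + 1) / gamma(1 / beta + 1)

P = np.array([[2.0, 0.3], [0.3, 1.0]])
c = 1 / beta - (k_plus * np.sqrt(np.linalg.det(P))) ** (1 / n_beta)

t = np.linspace(-6.0, 6.0, 1201)
h = t[1] - t[0]
X1, X2 = np.meshgrid(t, t)
quad = P[0, 0] * X1**2 + 2 * P[0, 1] * X1 * X2 + P[1, 1] * X2**2
arg = 1 + beta * (-0.5 * quad - c)
f = np.where(arg > 0, arg, 0.0) ** (1 / beta)
print(f.sum() * h * h)    # approximately 1
```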
Now, to derive V ( s ) for β > 0 , substitute U ( s ) into Equation (18). Then, we have:
V(s) = s^{-1/2}\, \frac{\pi^{n/2}}{(\beta+1)\Gamma(n/2)} \int_0^{\rho} \left( 1 - \beta c_\beta(s) - \tfrac{\beta}{2} r \right)^{(\beta+1)/\beta} r^{n/2-1}\, dr + c_\beta(s)
Since the integral in the above equation can be computed similarly to Equation (52), it holds that:
V(s) = s^{-1/2}\, \frac{(2\pi)^{n/2} \beta^{1 + \frac{1}{\beta}}}{(\beta+1)\Gamma(n/2)}\, B\!\left( \tfrac{1}{\beta}+2, \tfrac{n}{2} \right) \left( \tfrac{1}{\beta} - c_\beta(s) \right)^{n_\beta + 1} + c_\beta(s)
Substituting c β ( s ) in Equation (53), we see that:
V(s) = \kappa_1\, s^{-1/2} \left( \kappa_2\, s^{1/(2 n_\beta)} \right)^{n_\beta + 1} + \frac{1}{\beta} - \kappa_2\, s^{1/(2 n_\beta)} = \frac{1}{\beta} + \left( \kappa_1 \kappa_2^{\,n_\beta} - 1 \right) \kappa_2\, s^{1/(2 n_\beta)}
where κ i , i = 1 , 2 are constants respectively represented by:
\kappa_1 = \frac{(2\pi)^{n/2} \beta^{1+\frac{1}{\beta}}}{\beta+1}\, \frac{B\!\left( \frac{1}{\beta}+2, \frac{n}{2} \right)}{\Gamma(n/2)} = \frac{(2\pi)^{n/2} \beta^{1+\frac{1}{\beta}}}{\beta+1}\, \frac{\Gamma\!\left( \frac{1}{\beta}+2 \right)}{\Gamma(n_\beta+2)}, \qquad \kappa_2 = (k_{+})^{1/n_\beta}
Using the well-known formula Γ(x+1) = xΓ(x) for x > 0, we see that the coefficient satisfies (κ₁κ₂^{n_β} − 1)κ₂ < 0. Further, the exponent satisfies 1/(2n_β) < 1/n, since β is positive. These two conditions assure the positive definiteness of the Riemannian metric, g^{(V)}, given by Equation (30).
Derivation for the Case β < 0
In the case β < 0 , the support of f ( x , P ) is R n , and f ( 0 , P ) > 0 implies that c β ( det P ) > 1 / β .
Again, we first calculate c β ( det P ) . The normalizing condition in Equation (49) is modified as:
(\det P)^{-1/2}\, \frac{\pi^{n/2}}{\Gamma(n/2)} \int_0^{\infty} r^{n/2-1} \left( 1 - \beta c_\beta(\det P) - \tfrac{\beta}{2} r \right)^{1/\beta} dr = 1
where we assume that the integral converges. Let us define ρ = −2c_β + 2/β < 0 and consider the changes of variables r = (−ρ)z and z = (1 − z̃)/z̃; then, for a positive number, α, the integral in the above is:
\int_0^{\infty} r^{\alpha-1} \left( 1 - \beta c_\beta - \tfrac{\beta}{2} r \right)^{1/\beta} dr = \left( -\tfrac{\beta}{2} \right)^{1/\beta} \int_0^{\infty} r^{\alpha-1} \left( (-\rho) + r \right)^{1/\beta} dr = \left( -\tfrac{\beta}{2} \right)^{1/\beta} (-\rho)^{\alpha+1/\beta} \int_0^{\infty} z^{\alpha-1} (1+z)^{1/\beta}\, dz = \left( -\tfrac{\beta}{2} \right)^{1/\beta} (-\rho)^{\alpha+1/\beta} \int_0^{1} \tilde{z}^{-1/\beta - \alpha - 1} (1-\tilde{z})^{\alpha-1}\, d\tilde{z} = 2^{\alpha} (-\beta)^{1/\beta}\, B\!\left( -\tfrac{1}{\beta} - \alpha, \alpha \right) \left( c_\beta - \tfrac{1}{\beta} \right)^{\alpha + 1/\beta}
Thus, with α = n/2, the convergence condition for β is:
\beta > -\frac{2}{n}
From Equations (59) and (60), the normalizing constant is given by:
c_\beta(s) = \frac{1}{\beta} + \left( k_{-}\, s^{1/2} \right)^{1/n_\beta}
where s = det P , n β = n / 2 + 1 / β < 0 and:
k_{-} = \left( -\frac{1}{\beta} \right)^{1/\beta} (2\pi)^{-n/2}\, \frac{\Gamma(n/2)}{B\!\left( -\frac{1}{\beta} - \frac{n}{2}, \frac{n}{2} \right)} = \left( -\frac{1}{\beta} \right)^{1/\beta} (2\pi)^{-n/2}\, \frac{\Gamma\!\left( -\frac{1}{\beta} \right)}{\Gamma(-n_\beta)} > 0
Now, to derive V ( s ) for β < 0 , substitute U ( s ) into Equation (18):
V(s) = s^{-1/2}\, \frac{\pi^{n/2}}{(\beta+1)\Gamma(n/2)} \int_0^{\infty} \left( 1 - \beta c_\beta(s) - \tfrac{\beta}{2} r \right)^{1+1/\beta} r^{n/2-1}\, dr + c_\beta(s)
Here, we assume that the exponent at least meets 1 + 1 / β < 0 for the above integral to converge. Then, the integral can be computed similarly to Equation (60) as:
V(s) = s^{-1/2}\, \frac{(2\pi)^{n/2} (-\beta)^{1+\frac{1}{\beta}}}{(\beta+1)\Gamma(n/2)}\, B\!\left( -n_\beta - 1, \tfrac{n}{2} \right) \left( c_\beta(s) - \tfrac{1}{\beta} \right)^{n_\beta+1} + c_\beta(s)
Thus, the exact convergence condition is n_β + 1 < 0 or, equivalently, β > −2/(n+2), which is stronger than Equation (61) and 1 + 1/β < 0. Substituting the expression of c_β(s) in Equation (62), we see:
V(s) = \kappa_3\, s^{-1/2} \left( \kappa_4\, s^{1/(2 n_\beta)} \right)^{n_\beta+1} + \frac{1}{\beta} + \kappa_4\, s^{1/(2 n_\beta)} = \frac{1}{\beta} + \left( \kappa_3 \kappa_4^{\,n_\beta} + 1 \right) \kappa_4\, s^{1/(2 n_\beta)}
where κ i , i = 3 , 4 are constants respectively represented by:
\kappa_3 = \frac{(2\pi)^{n/2} (-\beta)^{1+\frac{1}{\beta}}}{\beta+1}\, \frac{B\!\left( -n_\beta - 1, \frac{n}{2} \right)}{\Gamma(n/2)} = \frac{(2\pi)^{n/2} (-\beta)^{1+\frac{1}{\beta}}}{\beta+1}\, \frac{\Gamma(-n_\beta - 1)}{\Gamma\!\left( -\frac{1}{\beta} - 1 \right)}, \qquad \kappa_4 = (k_{-})^{1/n_\beta}.
It is readily confirmed that (κ₃κ₄^{n_β} + 1)κ₄ > 0, so that \frac{1}{2n_\beta}(κ₃κ₄^{n_β} + 1)κ₄ < 0 holds. Further, we see that 1/(2n_β) < 1/n, since n_β is negative. These two conditions again assure the positive definiteness of g^{(V)} given by Equation (30).  ☐
