Article

Combinatorial Optimization with Information Geometry: The Newton Method

1 Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39/41, 20135 Milano, Italy
2 de Castro Statistics, Collegio Carlo Alberto, Via Real Collegio 30, 10024 Moncalieri, Italy
* Author to whom correspondence should be addressed.
Entropy 2014, 16(8), 4260-4289; https://doi.org/10.3390/e16084260
Submission received: 31 March 2014 / Revised: 10 July 2014 / Accepted: 11 July 2014 / Published: 28 July 2014
(This article belongs to the Special Issue Information Geometry)

1. Introduction

In this paper, statistical exponential families [1] are thought of as differentiable manifolds along the approach called information geometry [2] or the exponential statistical manifold [3]. Specifically, our aim is to discuss optimization on statistical manifolds using the Newton method, as is suggested in ([4] (Ch. 5 and 6)); see also the monograph [5]. This method is based on classical Riemannian geometry [6], but here, we put our emphasis on coordinate-free differential geometry; see [7,8].
We mainly refer to the above-mentioned references [2,4], with one notable exception in the description of the tangent space. Our manifold will be an exponential family E_V of positive densities, V being a vector space of sufficient statistics. Given a one-dimensional statistical model p(t) ∈ E_V, t ∈ I, we define its velocity at time t to be its Fisher score s(t) = (d/dt) ln p(t) [9]. The Fisher score s(t) is a random variable with zero expectation with respect to p(t), E_{p(t)}[s(t)] = 0. Because of that, the tangent space at p ∈ E_V is a vector space of random variables with zero expectation at p. A vector field is a mapping from p to a random variable V(p), such that for all p, the random variable V(p) is centered at p, E_p[V(p)] = 0. In other words, each point of the manifold has a different tangent space, and this tangent space can be used as a non-parametric model space of the manifold. In this formalism, a vector field is a mapping from densities to centered random variables, that is, it is what in statistics is called a pivot of the statistical model. To avoid confusion with the product of random variables, we do not use the standard notation for the action of a vector field on a real function. This approach is possibly unusual in differential geometry, but it is fully natural from the statistical point of view, where the Fisher score has a central place. Moreover, this approach scales nicely from the finite state space to the general state space; see the discussion in [9] and the review in [3].
A complete construction of the geometric framework based on the idea of using Fisher scores as elements of the tangent bundle has already been worked out. In this paper, we build on it by considering a second order geometry in the same non-parametric setting.
Our main motivation for this geometrical construction is its application to combinatorial optimization with exponential families, whose first order version was developed in [10–14]. We illustrate the method with the following toy example.
Consider the function f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2 + a_12 x_1 x_2, with x_1, x_2 = ±1 and a_0, a_1, a_2, a_12 ∈ ℝ. The function f is a real random variable on the sample space Ω = {+1, −1}² with the uniform probability λ. Note that the coordinate mappings X_1, X_2 of Ω generate an orthonormal basis 1, X_1, X_2, X_1X_2 of L²(Ω, λ) and that f is the general form of a real random variable on such a space. Let ℘_> be the open simplex of positive densities on (Ω, λ), and let E_V be a statistical model, i.e., a subset of ℘_>. The relaxed mapping F: E_V → ℝ,
F(p) = E_p[f] = a_0 + a_1 E_p[X_1] + a_2 E_p[X_2] + a_12 E_p[X_1X_2],
is strictly bounded by the maximum of f, F(p) = E_p[f] < max_{x∈Ω} f(x), unless f is constant. We are looking for a sequence p_n, n ∈ ℕ, such that E_{p_n}[f] → max_{x∈Ω} f(x) as n → ∞. The existence of such a sequence is a nontrivial condition on the model E_V. Precisely, the closure of E_V must contain a density whose support is contained in the set of maxima {x ∈ Ω | f(x) = max f}. This condition is satisfied by the independence model, V = Span{X_1, X_2}, where we can write:
F(η_1, η_2) = a_0 + a_1 η_1 + a_2 η_2 + a_12 η_1 η_2,    η_i = E_p[X_i].
The gradient of Equation (2) has components ∂_1 F = a_1 + a_12 η_2, ∂_2 F = a_2 + a_12 η_1, and the flow along the gradient produces increasing values of F; however, the gradient flow does not converge to the maximum of F; see the dotted line in Figure 2. One can instead follow the suggestion of [15] and use a modified gradient (the "natural" gradient) flow, which produces better results in our problem; see Figure 3. Full details on this example are given in Section 2.5.2.
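As a quick numerical illustration of the relaxation (not code from the paper), the following Python sketch evaluates f, its relaxation F on the independence model and the Euclidean gradient of Equation (2); the coefficients a_1 = 1, a_2 = 2, a_12 = 3 are those used in the figures, while a_0 = 0 is an arbitrary choice.

    import itertools
    import numpy as np

    a0, a1, a2, a12 = 0.0, 1.0, 2.0, 3.0   # a1, a2, a12 as in the figures; a0 arbitrary

    def f(x1, x2):
        return a0 + a1 * x1 + a2 * x2 + a12 * x1 * x2

    def F(eta1, eta2):
        """Relaxed function on the independence model, eta_i = E_p[X_i]."""
        return a0 + a1 * eta1 + a2 * eta2 + a12 * eta1 * eta2

    def grad_F(eta1, eta2):
        """Euclidean gradient of F in the eta parameters."""
        return np.array([a1 + a12 * eta2, a2 + a12 * eta1])

    f_max = max(f(x1, x2) for x1, x2 in itertools.product([+1, -1], repeat=2))
    print(f_max)                  # 6.0, attained at the vertex (+1, +1)
    print(F(0.0, 0.0))            # 0.0: the uniform density stays below max f
    print(F(0.9, 0.9))            # 5.13: F approaches max f near the vertex (+1, +1)
    print(grad_F(-0.25, -0.25))   # gradient at the starting point of Figure 3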
In combinatorial optimization, the values of the function f are assumed to be available at each point, and the curve of steepest ascent of the relaxed function is learned through a simulation procedure based on exponential statistical models.
In this paper, we introduce, in Section 2, the geometry of exponential families and its first order calculus. The second order calculus and the Hessian are discussed in Section 3. Finally, in Section 4, we apply the formalism to the discussion of the Newton method in the context of the maximization of the relaxed function.

2. Models on a Finite State Space

We consider here the exponential statistical manifold on the set of positive densities on a measure space (Ω, μ) with Ω finite and counting measure μ. The setup we describe below is not strictly required in the finite case, because in such a case, other approaches are possible, but it provides a mathematical formalism that has its own pros and that scales naturally to the infinite case.
We provide below a schematic presentation of our formalism as an introduction to this section.
  • Two different exponential families can be the same statistical model, that is, the two sets of densities can coincide. This is due both to the arbitrariness of the reference density and to the fact that the sufficient statistics are just a vector basis of the vector space they generate. In a non-parametric approach, we can refer directly to the vector space of centered log-densities, while the change of reference density is geometrically interpreted as a change of chart. The set of all such charts defines a manifold.
  • We make a specific interpretation of the tangent bundle as the vector space of Fisher’s scores at each density and use such tangent spaces as the space of coordinates. This produces a different tangent space/space of coordinates at each density, and different tangent spaces are mapped one onto another by a proper parallel transport, which is nothing else than the re-centering of random variables.
  • If a basis is chosen, a parametrization is given, and such a parametrization is, in fact, a new chart, whose values are real vectors. In the real parametrization, the natural scalar product in each scores space is given by Fisher’s information matrix.
  • Riemannian gradients are defined in the usual way. It is customary in information geometry to call "natural gradient" the real-coordinate presentation of the Riemannian gradient. The natural gradient is computed by applying the inverse of the Fisher information matrix to the Euclidean gradient. It may seem that there are three gradients involved, but they all represent the same object when correctly understood.
  • The classical notion of expectation parameters for exponential families carries on as another chart on the statistical manifold, which gives rise to a further presentation of a geometrical object.
  • While the statistical manifold is unique, there are at least three relevant connections as structures on the vector bundles of the manifold: one relating to the exponential charts, one relating to the expectation charts and one depending on the Riemannian structure.

2.1. Exponential Families As Manifolds

On the finite sample space Ω, #Ω = n, let a set of random variables B = {X_1, ..., X_m} be given, such that Σ_j α_j X_j is constant if, and only if, the α_j's are zero, or, equivalently, such that X_0 = 1, X_1, ..., X_m are affinely independent. The condition implies, necessarily, the linear independence of B. A common choice is to take a set of linearly independent and μ-centered random variables.
We write V = Span{X_1, ..., X_m} and, given a reference density p ∈ ℘_>, define the following exponential family of positive densities:
E_V = { q ∈ ℘_> : q ∝ e^V · p, V ∈ V }.
Given any couple p, q ∈ E_V, there exists a unique set of parameters θ = θ_p(q), such that:
q = exp( Σ_j θ_j U^e_p X_j − ψ_p(θ) ) · p,
where U^e_p is the centering at p, that is,
U^e_p : V ∋ U ↦ U − E_p[U] ∈ U^e_p V.
The linear mapping U^e_p is one-to-one on V, and U^e_p X_j, j = 1, ..., m, is a basis of U^e_p V. We view each choice of a specific reference p as providing a chart centered at p on the exponential family E_V, namely:
σ_p : exp( Σ_j θ_j U^e_p X_j − ψ_p(θ) ) · p ↦ θ.
If:
U = U^e_p U + E_p[U] = Σ_{j=1}^m θ_j U^e_p X_j + E_p[U],
then:
E_p[ U U^e_p X_i ] = Σ_{j=1}^m θ_j E_p[ U^e_p X_i U^e_p X_j ],
so that θ = I_B^{-1}(p) E_p[ U U^e_p X ], where:
I_B(p) = [ Cov_p(X_i, X_j) ]_{ij} = E_p[ X X′ ] − E_p[X] E_p[X]′
is the Fisher information matrix of the basis B = {X_1, ..., X_m}.
The mappings:
s_p, σ_p : E_V ∋ q ↦ U ↦ θ ∈ ℝ^m,
where:
s_p : q ↦ U = log(q/p) − E_p[ log(q/p) ],
σ_p : q ↦ θ = I_B^{-1}(p) E_p[ U U^e_p X ] = I_B^{-1}(p) E_p[ log(q/p) U^e_p X ],
are global charts in the non-parametric and parametric coordinates, respectively. Notice that Equation (12) provides the regression coefficients of the least squares estimate on U^e_p V of the log-likelihood.
We denote by e_p : ℝ^m → E_V the inverse of σ_p, i.e.,
e_p(θ) = exp( Σ_{j=1}^m θ_j U^e_p X_j − ψ_p(θ) ) · p,
so that the representation of the divergence q ↦ D(p‖q) in the chart σ_p is ψ_p:
ψ_p(θ) = log( E_p[ e^{Σ_{j=1}^m θ_j U^e_p X_j} ] ) = E_p[ log( p / e_p(θ) ) ] = D( p ‖ e_p(θ) ).
The mapping I_B : p ↦ Cov_p(X, X) ∈ ℝ^{m×m} is represented in the chart centered at p by:
I_{B,p}(θ) = I_B( e_p(θ) ) = [ Cov_{e_p(θ)}(X_i, X_j) ]_{i,j} = Hess ψ_p(θ);
see [1].
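On a finite sample space, the chart σ_p and the Fisher information matrix I_B(p) are elementary matrix computations. The following Python sketch (numpy only; the sample space Ω = {+1, −1}², the basis and the densities are illustrative choices, not prescribed by the paper) recovers θ = σ_p(q) for two members p, q of the same exponential family and checks the reconstruction q = exp(Σ_j θ_j U^e_p X_j − ψ_p(θ)) · p.

    import numpy as np

    rng = np.random.default_rng(0)

    # Finite sample space Omega = {+1,-1}^2, basis B = {X1, X2} (illustrative choice).
    Omega = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
    X = Omega  # column j of X is the random variable X_j evaluated on Omega

    def density_in_EV(theta, p):
        """exp(sum_j theta_j U^e_p X_j - psi_p(theta)) * p on the points of Omega."""
        U = X - p @ X                     # U^e_p X_j = X_j - E_p[X_j], pointwise
        w = np.exp(U @ theta) * p
        return w / w.sum()                # dividing by the sum is exp(-psi_p(theta))

    def fisher_info(p):
        """I_B(p) = Cov_p(X, X)."""
        Xc = X - p @ X
        return (Xc * p[:, None]).T @ Xc

    # Reference density p and another member q of the same exponential family.
    p = density_in_EV(rng.normal(size=2), np.full(4, 0.25))
    theta_true = np.array([0.3, -0.7])
    q = density_in_EV(theta_true, p)

    # Chart sigma_p: theta = I_B(p)^{-1} E_p[log(q/p) U^e_p X].
    U_X = X - p @ X
    theta = np.linalg.solve(fisher_info(p), (p * np.log(q / p)) @ U_X)
    print(theta)                                    # recovers theta_true
    print(np.allclose(density_in_EV(theta, p), q))  # True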

2.2. Change of Chart

Fix p, p̄ ∈ E_V; then, we can express p in the chart centered at p̄,
p = exp( Ū − k_{p̄}(Ū) ) · p̄,    Ū ∈ U^e_{p̄} V,    k_{p̄}(Ū) = log( E_{p̄}[ e^{Ū} ] ).
In coordinates, Ū = Σ_{j=1}^m θ̄_j U^e_{p̄} X_j.
For all q ∈ E_V, q = exp( U − k_p(U) ) · p, U ∈ U^e_p V, k_p(U) = log( E_p[e^U] ), in coordinates U = Σ_{j=1}^m θ_j U^e_p X_j, and we can write:
q = exp( U − k_p(U) ) · p
  = exp( U − k_p(U) ) exp( Ū − k_{p̄}(Ū) ) · p̄
  = exp( U + Ū − k_p(U) − k_{p̄}(Ū) ) · p̄
  = exp( ( (U + Ū) − E_{p̄}[U] ) − ( k_p(U) + k_{p̄}(Ū) − E_{p̄}[U] ) ) · p̄;
hence, the non-parametric coordinate of q in the chart centered at p̄ is U + Ū − E_{p̄}[U] = U^e_{p̄}(U) + Ū, and
σ_{p̄}(q) = I_B^{-1}(p̄) E_{p̄}[ ( U^e_{p̄} U + Ū ) U^e_{p̄} X ] = θ + θ̄.
This provides the change of charts σ_{p̄} ∘ σ_p^{-1} : θ ↦ θ + θ̄. This atlas of charts defines the affine manifold (E_V, (σ_p)). This fact has deep consequences that we do not discuss here; e.g., our manifold is an instance of a Hessian manifold [16].

2.3. Tangent Bundle

The space of Fisher scores at p is U^e_p V, and it is identified with the tangent space of the manifold at p, T_p E_V; see the discussion in [3,9]. Let us check the consistency of this statement with our θ-parametrization.
Let:
q(τ) = exp( Σ_{j=1}^m θ_j(τ) U^e_{q(0)} X_j − ψ_{q(0)}(θ(τ)) ) · q(0),
τ ∈ I, I an open interval containing zero, be a curve in E_V. In the chart centered at q(0), we have from Equation (12):
σ_{q(0)}(q(τ)) = I_B^{-1}(q(0)) E_{q(0)}[ log( q(τ)/q(0) ) U^e_{q(0)} X ]
  = I_B^{-1}(q(0)) E_{q(0)}[ ( Σ_{j=1}^m θ_j(τ) U^e_{q(0)} X_j − ψ_{q(0)}(θ(τ)) ) U^e_{q(0)} X ]
  = I_B^{-1}(q(0)) Σ_{j=1}^m θ_j(τ) E_{q(0)}[ U^e_{q(0)} X_j U^e_{q(0)} X ]
  = I_B^{-1}(q(0)) E_{q(0)}[ U^e_{q(0)} X U^e_{q(0)} X′ ] θ(τ) = θ(τ).
The vector space U^e_p V is represented by the coordinates in the basis U^e_p B. The tangent bundle T E_V as a manifold is defined by the charts (σ_p, σ̇_p) on the domain:
T E_V = { (p, v) : p ∈ E_V, v ∈ T_p E_V }
with:
(σ_p, σ̇_p) : (q, V) ↦ ( I_B^{-1}(p) E_p[ log(q/p) U^e_p X ], I_B^{-1}(p) E_p[ V U^e_p X ] ).
The dot notation σ̇_p for the charts on the tangent spaces is justified by the computation in Equation (23) below:
d/dτ σ_{q(0)}(q(τ))|_{τ=0} = I_B^{-1}(q(0)) E_{q(0)}[ d/dτ log(q(τ))|_{τ=0} U^e_{q(0)} X ] = I_B^{-1}(q(0)) E_{q(0)}[ δq(0) U^e_{q(0)} X ] = σ̇_{q(0)}(δq(0)).
The velocity at τ = 0 is δq(0) = d/dτ log(q(τ))|_{τ=0} ∈ T_{q(0)} E_V and:
d/dτ θ(τ)|_{τ=0} = I_B^{-1}(q(0)) E_{q(0)}[ d/dτ log(q(τ))|_{τ=0} U^e_{q(0)} X ] = I_B^{-1}(q(0)) E_{q(0)}[ δq(0) U^e_{q(0)} X ],
which is consistent both with the definition of the tangent space as the set of Fisher scores and with the chart of the tangent bundle as defined in Equation (22).
The velocity at a generic τ is δq(τ) = d/dτ log(q(τ)) ∈ T_{q(τ)} E_V and has coordinates in the chart centered at q(0):
d/dτ θ(τ) = I_B^{-1}(q(0)) E_{q(0)}[ d/dτ log(q(τ)) U^e_{q(0)} X ] = I_B^{-1}(q(0)) E_{q(0)}[ δq(τ) U^e_{q(0)} X ].
If V, W are vector fields on T E_V, i.e., V(p), W(p) ∈ T_p E_V = U^e_p V, p ∈ E_V, we define a Riemannian metric g(V, W) by:
g(V, W)(p) = g_p( V(p), W(p) ) = E_p[ V(p) W(p) ].
In coordinates at p, V(p) = Σ_j σ̇_p^j(V) U^e_p X_j, W(p) = Σ_j σ̇_p^j(W) U^e_p X_j, so that:
g_p( V(p), W(p) ) = σ̇_p(V)′ I_B(p) σ̇_p(W).

2.4. Gradients

Given a function φ: E_V → ℝ, let φ_p = φ ∘ e_p, e_p = σ_p^{-1}, be its representation in the chart centered at p:
φ_p : ℝ^m ∋ θ ↦ φ( e_p(θ) ).
The derivative of θ ↦ φ_p(θ) at θ = 0 along α ∈ ℝ^m is:
∇φ_p(0) α = ∇φ_p(0) I_B^{-1}(p) I_B(p) α = ( I_B^{-1}(p) ∇φ_p(0)′ )′ I_B(p) α = g_p( I_B^{-1}(p) ∇φ_p(0)′, α ).
The mapping ∇̃φ : p ↦ I_B^{-1}(p) ∇φ_p(0)′ ∈ ℝ^m that appears in Equation (29) is Amari's natural gradient of φ: E_V → ℝ; see [15]. It is a standard notion in Riemannian geometry; cf. [4] (p. 46).
More generally, the derivative of θ ↦ φ_p(θ) at θ along α ∈ ℝ^m is:
∇φ_p(θ) α = ∇φ_p(θ) I_B^{-1}(e_p(θ)) I_B(e_p(θ)) α = ( I_B^{-1}(e_p(θ)) ∇φ_p(θ)′ )′ I_B(e_p(θ)) α = g_{e_p(θ)}( I_B^{-1}(e_p(θ)) ∇φ_p(θ)′, α ).
Let us compare ∇φ_q(0) and ∇φ_p(θ) when q = e_p(θ). As φ_p = φ ∘ e_p and φ_q = φ ∘ e_q, we have the change of charts:
φ_q = φ ∘ e_q = φ ∘ e_p ∘ σ_p ∘ e_q = φ_p ∘ σ_p ∘ e_q;
hence ∇φ_q(0) = ∇φ_p(σ_p(q)) J(σ_p ∘ e_q)(0), where J(σ_p ∘ e_q) is the Jacobian of σ_p ∘ e_q. As σ_p ∘ e_q(θ) = θ + σ_p(q), we have J(σ_p ∘ e_q) = Id and, in conclusion, ∇φ_{e_p(θ)}(0) = ∇φ_p(θ). For all p ∈ E_V and θ ∈ ℝ^m,
∇̃φ( e_p(θ) ) = I_B^{-1}( e_p(θ) ) ∇φ_p(θ)′.
Alternatively, for all q, p ∈ E_V, ∇̃φ: E_V → ℝ^m is defined by:
∇̃φ(q) = I_B^{-1}(q) ∇φ_p( σ_p(q) )′.
The Riemannian gradient of φ: E_V → ℝ is the vector field ∇φ, such that D_Y φ = g(∇φ, Y). Note that the Riemannian gradient takes values in the tangent bundle, while the natural gradient takes values in ℝ^m. We compute the Riemannian gradient at p as follows. If y = σ̇_p(Y(p)),
D_Y φ(p) = dφ_p(0) y = g_p( ∇̃φ(p), y ) = E_p[ ∇φ(p) Y(p) ];
hence ∇̃φ(p) = I_B^{-1}(p) ∇φ_p(0)′ is the representation in the chart centered at p of the gradient vector field ∇φ. Explicitly, we have (see Equation (22)):
∇̃φ(p) = I_B^{-1}(p) ∇φ_p(0)′ = I_B^{-1}(p) E_p[ ∇φ(p) U^e_p X ],
∇φ(p) = Σ_j ( ∇̃φ(p) )_j U^e_p X_j.
The Euclidean gradient ∇φ_p(θ) is sometimes called the "vanilla gradient." It is equal to the covariance between the Riemannian gradient ∇φ(p) and the basis X: ∇φ_p(0)′ = E_p[ ∇φ(p) U^e_p X ].
We summarize in a display the relations between our three gradients, the Euclidean gradient ∇φ_p(0), the natural gradient ∇̃φ(p) and the Riemannian gradient ∇φ(p):
∇̃φ(p) = I_B^{-1}(p) ∇φ_p(0)′,    ∇φ(p) = Σ_{j=1}^m ( ∇̃φ(p) )_j U^e_p X_j,    ∇φ_p(0)′ = E_p[ ∇φ(p) U^e_p X ].
In the following, we shall frequently use the fact that the representation of the gradient vector field ∇φ in a generic chart centered at p is:
(∇φ)_p(θ) = σ̇_{e_p(θ)}( ∇φ( e_p(θ) ) ) = ∇̃φ( e_p(θ) ) = I_{B,p}^{-1}(θ) ∇φ_p(θ)′.
It should be noted that the leftmost term (∇φ)p(θ) is the presentation of the gradient in the charts of the tangent bundle, while in the rightmost term, ∇φp(θ) denotes the Euclidean gradient of the presentation of the function φ in the charts of the manifold.

2.4.1. Expectation Parameters

As ψ_p is strictly convex, the gradient mapping θ ↦ ∇ψ_p(θ)′ is a homeomorphism from the space of parameters ℝ^m to the interior of the convex set generated by the image of U^e_p X; see [1]. The function μ_p : E_V → ℝ^m defined by:
μ_p(q) = E_q[ U^e_p X ] = E_q[X] − E_p[X] = ∇ψ_p(θ)′,    θ = σ_p(q),
is a chart for all p ∈ E_V. The value of the inverse q = L_p(μ) is characterized as the unique q ∈ E_V such that μ = E_q[U^e_p X], i.e., the maximum likelihood estimator.
Let us compute the change of chart from p to p̄:
μ_{p̄} ∘ μ_p^{-1}(η) = η̄ = η + E_p[X] − E_{p̄}[X].
In fact, μ = E_{L_p(μ)}[ U^e_p X ] and μ̄ = μ_{p̄}( L_p(μ) ) = E_{L_p(μ)}[ U^e_{p̄} X ].
We do not discuss here the rich theory, started in [2], about the duality between σ_p and μ_p. We limit ourselves to the computation of the Riemannian gradient in the expectation parameters. If φ: E_V → ℝ,
φ_p(θ) = φ ∘ e_p(θ) = φ ∘ L_p ∘ μ_p ∘ e_p(θ) = (φ ∘ L_p)( ∇ψ_p(θ)′ ),
because μ_p ∘ e_p(θ) = E_{e_p(θ)}[ U^e_p X ] = ∇ψ_p(θ)′; hence:
∇φ_p(θ) = ∇(φ ∘ L_p)( ∇ψ_p(θ)′ ) Hess ψ_p(θ),
∇̃φ(p) = I_B^{-1}(p) ( ∇(φ ∘ L_p)(0) Hess ψ_p(0) )′ = ( ∇(φ ∘ L_p)(0) )′,
∇φ(p) = ∇(φ ∘ L_p)(0) U^e_p X,
that is, the natural gradient ∇̃φ at p = L_p(0) is equal to the Euclidean gradient of μ ↦ φ ∘ L_p(μ) at μ = 0.

2.4.2. Vector Fields

If V is a vector field of T E_V and φ: E_V → ℝ is a real function, then we define the action of V on φ, ∇_V φ, to be the real function:
∇_V φ : E_V ∋ p ↦ ∇_V φ(p) = ∇φ_p(0) σ̇_p( V(p) ).
We prefer to avoid the standard notation V φ, because in our setting, V (p) is a random variable, and the product V (p)φ(p) is otherwise defined as the ordinary product.
Let us represent ∇_V φ in the chart centered at p:
(∇_V φ)_p(θ) = ∇_V φ( e_p(θ) ) = ∇φ_{e_p(θ)}(0) σ̇_{e_p(θ)}( V( e_p(θ) ) ) = ∇φ_p(θ) V_p(θ),
where we have used the equality ∇φ_{e_p(θ)}(0) = ∇φ_p(θ) and V_p(θ) = σ̇_{e_p(θ)}( V( e_p(θ) ) ).
If W is a vector field, we can compute ∇_W ∇_V φ at p as:
∇_W ∇_V φ(p) = ∇(∇_V φ)_p(0) σ̇_p( W(p) ) = V_p(0)′ Hess φ_p(0) W_p(0) + ∇φ_p(0) JV_p(0) W_p(0),
where J denotes the Jacobian matrix.
The Lie bracket [W, V]φ (see [7] (§4.2), [8] (V, §1), [4] (Section 5.3.1)) is given by:
[W, V]φ(p) = ∇_W ∇_V φ(p) − ∇_V ∇_W φ(p) = ∇φ_p(0)( JV_p(0) W_p(0) − JW_p(0) V_p(0) ),
because of Equation (47) and the symmetry of the Hessian.
The flow of the smooth vector field V: E_V → T E_V is a family of curves γ(t, p), p ∈ E_V, t ∈ J_p, J_p an open real interval containing zero, such that for all p ∈ E_V and t ∈ J_p,
γ(0, p) = p,
δγ(t, p) = V( γ(t, p) ).
As uniqueness holds in Equation (50) (see [8] (VI, §1) or [7] (§4.1)), we have the semi-group property γ(s + t, p) = γ(s, γ(t, p)), and Equation (50) is equivalent to δγ(0, p) = V(γ(0, p)), p ∈ E_V.
If a flow of V is available, we have an interpretation of ∇_V φ as a derivative of φ along t ↦ γ(t, p):
d/dt φ( γ(t, p) )|_{t=0} = ∇φ_p( σ_p(γ(t, p)) ) ( d/dt σ_p(γ(t, p)) )|_{t=0} = ∇φ_p(0) σ̇_p( V(p) ) = ∇_V φ(p).

2.5. Examples

The following examples are intended to show how the formalism of gradients is usable in performing basic computations.

2.5.1. Expectation

Let f be any random variable, and define F: E_V → ℝ by F(p) = E_p[f]. In the chart centered at p, we have:
F_p(θ) = ∫ f exp( Σ_j θ_j U^e_p X_j − ψ_p(θ) ) · p dμ
and the Euclidean gradient:
∇F_p(0) = Cov_p(f, X) ∈ (ℝ^m)′.
The natural gradient is:
∇̃F(p) = Cov_p(X, X)^{-1} Cov_p(X, f) ∈ ℝ^m,
and the Riemannian gradient is:
∇F(p) = ( ∇̃F(p) )′ U^e_p X = Cov_p(f, X) Cov_p(X, X)^{-1} U^e_p X ∈ T_p E_V.
From Equation (55), it follows that ∇F(p) is the L²(p)-projection of f onto U^e_p V, while ∇̃F(p) in Equation (54) gives the coordinates of the projection. Let us consider the family of curves:
γ(t, p) = exp( Σ_{j=1}^m t ( ∇̃F(p) )_j U^e_p X_j − ψ_p( t ∇̃F(p) ) ) · p,    t ∈ ℝ.
The velocity is:
δγ(t, p) = d/dt ( Σ_{j=1}^m t ( ∇̃F(p) )_j U^e_p X_j − ψ_p( t ∇̃F(p) ) ) = ∇F(p) − E_{γ(t,p)}[ ∇F(p) ],
which is different from ∇F(γ(t, p)), unless f ∈ V ⊕ ℝ. Then, γ is not, in general, the flow of ∇F, but it is a local approximation, as δγ(0, p) = ∇F(p).
These computations are the basis of model-based methods in combinatorial optimization; see [10–14].
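For a finite Ω, the three gradients of the relaxed function are plain covariance computations. The sketch below (Python/numpy; the sample space, the basis, the density and the coefficients of f are illustrative choices, with a_1 = 1, a_2 = 2, a_12 = 3 as in the figures) evaluates the Euclidean gradient Cov_p(f, X), the natural gradient Cov_p(X, X)^{-1} Cov_p(X, f) and the Riemannian gradient as a centered random variable on Ω, and checks that f minus its projection is orthogonal to U^e_p V.

    import numpy as np

    # Omega = {+1,-1}^2 with basis B = {X1, X2}; f is the quadratic function of
    # the Introduction with a1 = 1, a2 = 2, a12 = 3 (a0 = 0 is an arbitrary choice).
    Omega = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
    X = Omega
    f = 1.0 * X[:, 0] + 2.0 * X[:, 1] + 3.0 * X[:, 0] * X[:, 1]

    # An arbitrary positive density p with respect to the counting measure.
    p = np.array([0.4, 0.3, 0.2, 0.1])

    Xc = X - p @ X                        # U^e_p X_j, pointwise on Omega
    cov_Xf = (p * (f - p @ f)) @ Xc       # Cov_p(X, f)
    cov_XX = (Xc * p[:, None]).T @ Xc     # I_B(p) = Cov_p(X, X)

    grad_euclid = cov_Xf                            # Euclidean gradient (as a vector)
    grad_natural = np.linalg.solve(cov_XX, cov_Xf)  # natural gradient
    grad_riemann = Xc @ grad_natural                # Riemannian gradient, a centered
                                                    # random variable on Omega
    print(grad_euclid, grad_natural)
    print(p @ grad_riemann)   # ~0: the Riemannian gradient is centered at p
    # The Riemannian gradient is the L2(p)-projection of f onto U^e_p V:
    print(p @ ((f - p @ f - grad_riemann) * Xc[:, 0]),
          p @ ((f - p @ f - grad_riemann) * Xc[:, 1]))  # ~0, ~0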

2.5.2. Binary Independent Variables

Here, we present, in full generality, the toy example of the Introduction; see [17] for more information on the application to combinatorial optimization. Our example is a very special case of the exactly solvable Ising models [18], our aim here being to explore the geometric framework.
Let Ω = {+1, −1}^d with counting measure μ, and let the space V be generated by the coordinate projections B = {X_1, ..., X_d}. Note that we use here the coding +1, −1 (from physics) instead of the coding 0, 1, which is more common in combinatorial optimization. The exponential family is E_V = { exp( Σ_{j=1}^d θ_j X_j − ψ_λ(θ) ) · 2^{−d} }, λ(x) = 2^{−d} for x ∈ Ω being the uniform density. The independence of the sufficient statistics X_j under all distributions in E_V implies:
ψ_λ(θ) = Σ_{j=1}^d ψ(θ_j),    ψ(θ) = log( cosh(θ) ).
We have:
∇ψ_λ(θ) = [ tanh(θ_j) : j = 1, ..., d ] = η_λ(θ),
Hess ψ_λ(θ) = diag( cosh^{-2}(θ_j) : j = 1, ..., d ) = diag( e^{−2ψ(θ_j)} : j = 1, ..., d ) = I_{B,λ}(θ),
I_{B,λ}(θ)^{-1} = diag( cosh²(θ_j) : j = 1, ..., d ) = diag( e^{2ψ(θ_j)} : j = 1, ..., d ).
The quadratic function f(X) = a_0 + Σ_j a_j X_j + Σ_{{i,j}} a_{i,j} X_i X_j has expected value at p = e_λ(θ), i.e., relaxed value, equal to:
F(p) = F_λ(θ) = E_θ[ f(X) ] = a_0 + Σ_j a_j tanh(θ_j) + Σ_{{i,j}} a_{i,j} tanh(θ_i) tanh(θ_j),
and covariance with X_k equal to:
Cov_θ( f(X), X_k ) = Σ_j a_j Cov_θ( X_j, X_k ) + Σ_{{i,j}} a_{i,j} Cov_θ( X_i X_j, X_k )
  = a_k Var_θ( X_k ) + Σ_{i≠k} a_{i,k} E_θ[X_i] Var_θ( X_k )
  = cosh^{-2}(θ_k) ( a_k + Σ_{i≠k} a_{i,k} tanh(θ_i) ).
In the computation, we have used the independence and the special algebra of ±1, which implies X_i² = 1, so that Cov_θ(X_iX_j, X_k) = 0 if i, j ≠ k, and otherwise Cov_θ(X_iX_k, X_k) = E_θ[X_i] − E_θ[X_i] E_θ[X_k]²; see [13].
The Euclidean gradient, the natural gradient and the Riemannian gradient are, respectively,
∇F_λ(θ) = [ cosh^{-2}(θ_j) ( a_j + Σ_{i≠j} a_{i,j} tanh(θ_i) ) : j = 1, ..., d ],
∇̃F( e_λ(θ) ) = [ a_j + Σ_{i≠j} a_{i,j} tanh(θ_i) : j = 1, ..., d ],
∇F( e_λ(θ) ) = Σ_{j=1}^d ( a_j + Σ_{i≠j} a_{i,j} E_θ[X_i] ) ( X_j − E_θ[X_j] ).
The (natural) gradient flow equations are:
θ̇_j(t) = a_j + Σ_{i≠j} a_{i,j} tanh( θ_i(t) ),    j = 1, ..., d.
Equations (64)–(66) are usable in practice if the a_j's and the a_{i,j}'s are estimable. Otherwise, one can use Equation (63) and the following forms of the gradients:
∇F_λ(θ) = [ Cov_θ( X_j, f(X) ) : j = 1, ..., d ],
∇̃F( e_λ(θ) ) = [ cosh²(θ_j) Cov_θ( f(X), X_j ) : j = 1, ..., d ],
in which case the gradient flow equations are:
θ̇_j(t) = cosh²( θ_j(t) ) Cov_{θ(t)}( f(X), X_j ),    j = 1, ..., d.
Let us study the relaxed function in the expectation parameters η_j = η_j(θ), j = 1, ..., d,
F_λ(η) = a_0 + Σ_j a_j η_j + Σ_{{i,j}} a_{i,j} η_i η_j,    η ∈ ]−1, +1[^d.
The Euclidean gradient with respect to η has components:
∂_j F_λ(η) = a_j + Σ_{i≠j} a_{i,j} η_i,
which are equal to the components of the natural gradient; see Section 2.4.1. As:
η̇_j(t) = d/dt tanh( θ_j(t) ) = cosh^{-2}( θ_j(t) ) θ̇_j(t) = ( 1 − η_j(t)² ) θ̇_j(t),    j = 1, ..., d,
the gradient flow expressed in the η-parameters has equations:
η̇_j(t) = ( 1 − η_j(t)² ) ( a_j + Σ_{i≠j} a_{i,j} η_i(t) ),    j = 1, ..., d.
Alternatively, in vector form,
η̇(t) = diag( 1 − η_j(t)² : j = 1, ..., d ) ( a + A η(t) ),
where a = [ a_j : j = 1, ..., d ]′ and A_{ij} = 0 if i = j, A_{ij} = a_{i,j} otherwise. The matrix A is symmetric with zero diagonal, and it has the meaning of the adjacency matrix of the (weighted) interaction graph. We do not know a closed-form solution of Equation (74). An example of a numerical solution is shown in Figure 3.
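A numerical solution of the gradient flow in the η parameters can be obtained with any ODE integrator. The sketch below uses a plain explicit Euler scheme in Python; the coefficients a_1 = 1, a_2 = 2, a_12 = 3 and the starting point (−1/4, −1/4) are those of Figure 3, while the step size, the number of iterations and the Euler scheme itself are illustrative choices, not necessarily those used to produce the figure.

    import numpy as np

    # Coefficients of f as in the figures: a1 = 1, a2 = 2, a12 = 3.
    a = np.array([1.0, 2.0])
    A = np.array([[0.0, 3.0],
                  [3.0, 0.0]])   # symmetric, zero diagonal (interaction graph)

    def eta_dot(eta):
        """Natural gradient flow in the expectation parameters."""
        return (1.0 - eta**2) * (a + A @ eta)

    eta = np.array([-0.25, -0.25])   # starting point of Figure 3
    dt = 0.01
    for _ in range(5000):            # explicit Euler integration
        eta = eta + dt * eta_dot(eta)

    print(eta)   # approaches the vertex (+1, +1), where f attains its maximum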

2.5.3. Escort Probabilities

For a given a > 0, consider the function C^(a): E_V → ℝ defined by C^(a)(p) = ∫ p^a dμ. We have:
C_p^(a)(θ) = ∫ exp( a Σ_{j=1}^m θ_j U^e_p X_j − a ψ_p(θ) ) p^a dμ
and:
dC_p^(a)(0) α = ∫ a ( Σ_{j=1}^m α_j U^e_p X_j ) p^a dμ = Σ_{j=1}^m α_j ∫ a U^e_p X_j p^a dμ = Σ_{j=1}^m α_j Cov_p( X_j, a p^{a−1} ),
that is, the Euclidean gradient is ∇C_p^(a)(0) = Cov_p( a p^{a−1}, X ) (a row vector). The natural gradient is computed from Equation (35) as:
∇̃C^(a)(p) = I_B^{-1}(p) ( ∇C_p^(a)(0) )′ = Cov_p(X, X)^{-1} Cov_p( X, a p^{a−1} ),
while the Riemannian gradient follows from Equation (36):
∇C^(a)(p) = Cov_p( a p^{a−1}, X ) Cov_p(X, X)^{-1} U^e_p X.
Note that the Riemannian gradient is the orthogonal projection of the random variable a p^{a−1} onto the tangent space T_p E_V = U^e_p V.
The probability density p^a/C^(a)(p) is called the escort density in the literature on non-extensive statistical mechanics; see, e.g., [19] (Section 7.4).
We now compute the tangent mapping of E_V ∋ p ↦ p^a/C^(a)(p) ∈ ℘_>. Let us extend the basis X_1, ..., X_m to a basis X_1, ..., X_n, n ≥ m, whose exponential family is full, i.e., equal to ℘_>. The non-parametric coordinate of q = ( exp( Σ_{j=1}^m θ_j U^e_p X_j − ψ_p(θ) ) p )^a / C_p^(a)(θ) in the chart centered at p̄ = p^a/C_p^(a)(0) is the p̄-centering of the random variable:
log( q/p̄ ) = log( ( exp( Σ_{j=1}^m θ_j U^e_p X_j − ψ_p(θ) ) p )^a C_p^(a)(0) / ( C_p^(a)(θ) p^a ) ) = a Σ_{j=1}^m θ_j U^e_p X_j − a ψ_p(θ) + log C_p^(a)(0) − log C_p^(a)(θ),
that is,
v = a Σ_{j=1}^m θ_j U^e_{p̄} X_j.
The coordinates of v in the basis U^e_{p̄} X_1, ..., U^e_{p̄} X_n are ( aθ_1, ..., aθ_m, 0, ..., 0 ), and the Jacobian of θ ↦ ( aθ, 0_{n−m} ) is the m × n matrix [ aI_m | 0_{m×(n−m)} ].

2.5.4. Polarization Measure

The polarization measure was introduced in Economics by [20]. Here, we consider the qualitative version of [21]. If π is a distribution on a finite set, the probability that in three independent samples from π there are exactly two equal is 3 Σ_j π_j² ( 1 − π_j ). If p ∈ E_V, define:
G(p) = ∫ p² ( 1 − p ) dμ = C^(2)(p) − C^(3)(p),
where C^(2) and C^(3) are defined as in Section 2.5.3.
From Equation (78), we find the natural gradient:
∇̃G(p) = Cov_p(X, X)^{-1} Cov_p( X, 2p − 3p² ).
Note that ∇̃G(p) = 0 if p is constant; see Figure 4.
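On a finite Ω, the natural gradient of G is again a ratio of covariances. The following sketch (Python/numpy, reusing the illustrative two-bit sample space of the previous sketches; not code from the paper) evaluates ∇̃G(p) = Cov_p(X, X)^{-1} Cov_p(X, 2p − 3p²) and checks that it vanishes at the uniform density.

    import numpy as np

    Omega = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
    X = Omega

    def natural_grad_G(p):
        """Natural gradient of G(p) = sum_x p(x)^2 (1 - p(x))."""
        Xc = X - p @ X                       # centered sufficient statistics
        h = 2.0 * p - 3.0 * p**2             # the random variable 2p - 3p^2
        cov_Xh = (p * (h - p @ h)) @ Xc      # Cov_p(X, 2p - 3p^2)
        cov_XX = (Xc * p[:, None]).T @ Xc    # Cov_p(X, X)
        return np.linalg.solve(cov_XX, cov_Xh)

    print(natural_grad_G(np.full(4, 0.25)))           # ~[0, 0] at the uniform density
    print(natural_grad_G(np.array([0.4, 0.3, 0.2, 0.1])))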

3. Second Order Calculus

In this section, we turn to considering second order calculus, in particular Hessians, in order to prepare the discussion of the Newton method for the relaxed optimization of Section 4.

3.1. Metric Derivative (Levi–Civita connection)

Let V, W: E_V → T E_V be vector fields, that is, V(p), W(p) ∈ T_p E_V = U^e_p V, p ∈ E_V. Consider the real function R = g(V, W): E_V → ℝ, whose value at p ∈ E_V is R(p) = g_p(V(p), W(p)) = E_p[V(p) W(p)]. Assuming smoothness, we want to compute the derivative of R along the vector field Y: E_V → T E_V, that is, (D_Y R)(p) = dR_p(0)α, with α = σ̇_p(Y(p)). The expression of R in the chart centered at p is, according to Equation (27),
R_p(θ) = σ̇_{e_p(θ)}( V( e_p(θ) ) )′ I_B( e_p(θ) ) σ̇_{e_p(θ)}( W( e_p(θ) ) ) = V_p(θ)′ I_{B,p}(θ) W_p(θ),
where V_p and W_p are the presentations in the chart of the vector fields V and W, respectively.
The i-th component ∂_i R_p(θ) of the Euclidean gradient ∇R_p(θ) is:
∂_i R_p(θ) = ∂_i ( V_p(θ)′ I_{B,p}(θ) W_p(θ) )
  = ∂_i V_p(θ)′ I_{B,p}(θ) W_p(θ) + V_p(θ)′ ∂_i I_{B,p}(θ) W_p(θ) + V_p(θ)′ I_{B,p}(θ) ∂_i W_p(θ)
  = ( ∂_i V_p(θ) + (1/2) I_{B,p}^{-1}(θ) ∂_i I_{B,p}(θ) V_p(θ) )′ I_{B,p}(θ) W_p(θ) + V_p(θ)′ I_{B,p}(θ) ( ∂_i W_p(θ) + (1/2) I_{B,p}^{-1}(θ) ∂_i I_{B,p}(θ) W_p(θ) ),
so that the derivative at θ along α = σ̇_{e_p(θ)}( Y( e_p(θ) ) ) is:
dR_p(θ) α = ( dV_p(θ)α + (1/2) I_{B,p}^{-1}(θ) ( dI_{B,p}(θ)α ) V_p(θ) )′ I_{B,p}(θ) W_p(θ) + V_p(θ)′ I_{B,p}(θ) ( dW_p(θ)α + (1/2) I_{B,p}^{-1}(θ) ( dI_{B,p}(θ)α ) W_p(θ) ).

Proposition 1

If we define D_Y V to be the vector field on E_V, whose value at q = e_p(θ) has coordinates centered at p given by:
σ̇_p( D_Y V(q) ) = dV_p(θ)α + (1/2) I_B^{-1}(p) ( dI_{B,p}(θ)α ) V_p(θ),    α = σ̇_p( Y(q) ),
then:
D_Y g(V, W) = g( D_Y V, W ) + g( V, D_Y W ),
i.e., Equation (87) is a metric covariant derivative; see [6] (Ch. 2, §3), [8] (VIII, §4), [4] (§5.3.2).
The metric derivative in Equation (87) could be computed from the flow of the vector field Y. Let (t, p) ↦ γ(t, p) be the flow of Y, i.e., δγ(t, p) = Y(γ(t, p)) and γ(0, p) = p. Using Equation (23), we have:
d/dt σ̇( V( γ(t, p) ) )|_{t=0} = d/dt V_p( σ_p( γ(t, p) ) )|_{t=0} = dV_p( σ_p( γ(t, p) ) ) d/dt σ_p( γ(t, p) )|_{t=0} = dV_p(0) σ̇_p( δγ(0, p) ) = dV_p(0) σ̇_p( Y(p) ),
and:
d/dt I_B( γ(t, p) )|_{t=0} = d/dt I_{B,p}( σ_p( γ(t, p) ) )|_{t=0} = dI_{B,p}(0) σ̇_p( δγ(0, p) ) = dI_{B,p}(0) σ̇_p( Y(p) ),
so that:
σ̇_p( D_Y V(p) ) = d/dt σ̇( V( γ(t, p) ) )|_{t=0} + (1/2) I_B^{-1}(p) ( d/dt I_B( γ(t, p) )|_{t=0} ) V_p(0).
Let us check the symmetry of the metric covariant derivative to show that it is actually the unique Riemannian or Levi–Civita affine connection; see [6] (Th. 3.6).
The Lie bracket of the vector fields V and W is the vector field [V, W], whose coordinates at p are:
[V, W]_p(0) = dV_p(0) σ̇_p( W(p) ) − dW_p(0) σ̇_p( V(p) ).
As the ij entry of ∂_k I_{B,p}(0) is ∂_k ∂_i ∂_j ψ_p(0), the symmetry ( dI_{B,p}(0)α ) β = ( dI_{B,p}(0)β ) α holds, and we have:
σ̇_p( D_W V(p) − D_V W(p) ) = dV_p(0) σ̇_p( W(p) ) + (1/2) I_B^{-1}(p) ( dI_{B,p}(0) σ̇_p( W(p) ) ) V_p(0) − dW_p(0) σ̇_p( V(p) ) − (1/2) I_B^{-1}(p) ( dI_{B,p}(0) σ̇_p( V(p) ) ) W_p(0) = σ̇_p( [V, W](p) ).
The term Γ_k(p) = (1/2) I_{B,p}^{-1}(0) ∂_k I_{B,p}(0) of Equation (87) is sometimes referred to as the Christoffel matrix, but we do not use this terminology in this paper. As:
I_{B,p}(θ) = I_B( e_p(θ) ) = [ Cov_{e_p(θ)}( X_i, X_j ) ]_{i,j=1,...,m} = [ ∂_i ∂_j ψ_p(θ) ]_{i,j=1,...,m},
we have ∂_k I_B( e_p(θ) ) = [ ∂_i ∂_j ∂_k ψ_p(θ) ]_{i,j=1,...,m} = [ Cov_{e_p(θ)}( X_i, X_j, X_k ) ]_{i,j=1,...,m} and:
Γ_k(p) = (1/2) [ Cov_p( X_i, X_j ) ]_{i,j=1,...,m}^{-1} [ Cov_p( X_i, X_j, X_k ) ]_{i,j=1,...,m}.
If V, W are vector fields of T E_V, we have:
Γ(p, V, W) = (1/2) I_B^{-1}(p) Cov_p( X, V, W ) = (1/2) I_B^{-1}(p) E_p[ U^e_p X · V W ],
which gives the projection of V(p) W(p)/2 onto U^e_p V.
Notice also that:
( dI_{B,p}^{-1}(0)α ) I_{B,p}(0) = −I_{B,p}^{-1}(0) ( dI_{B,p}(0)α ) I_{B,p}^{-1}(0) I_{B,p}(0) = −I_{B,p}^{-1}(0) ( dI_{B,p}(0)α ).
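The Christoffel matrices Γ_k(p) involve only second and third central moments, so on a finite Ω they can be computed directly. The sketch below (Python/numpy; the sample space, basis and density are the same illustrative choices as above) builds the tensor of third covariances Cov_p(X_i, X_j, X_k) and the matrices Γ_k(p).

    import numpy as np

    Omega = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
    X = Omega
    p = np.array([0.4, 0.3, 0.2, 0.1])      # an arbitrary positive density

    Xc = X - p @ X                          # U^e_p X_j, pointwise
    I_B = (Xc * p[:, None]).T @ Xc          # Cov_p(X_i, X_j)
    # Third central moments Cov_p(X_i, X_j, X_k) = E_p[U^e_p X_i U^e_p X_j U^e_p X_k].
    T3 = np.einsum('x,xi,xj,xk->ijk', p, Xc, Xc, Xc)

    I_B_inv = np.linalg.inv(I_B)
    Gamma = np.array([0.5 * I_B_inv @ T3[:, :, k] for k in range(X.shape[1])])
    print(Gamma)   # Gamma[k] is the k-th Christoffel matrix at p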

3.2. Acceleration

Let p(t), t ∈ I, be a smooth curve in E_V. Then, the velocity δp(t) = d/dt log(p(t)) is a vector field V(p(t)) = δp(t), defined on the support p(I) of the curve. As the curve is the flow of its velocity field, we can compute the metric derivative of the velocity along the velocity itself, D_{δp}δp, from Equation (91) with V(p(0)) = δp(0):
σ̇_p( D_{δp}δp )( p(0) ) = d/dt σ̇_{p(0)}( δp(t) )|_{t=0} + (1/2) I_B^{-1}( p(0) ) ( d/dt I_B( p(t) )|_{t=0} ) σ̇_{p(0)}( δp(0) )
  = d²/dt² σ_{p(0)}( p(t) )|_{t=0} + (1/2) I_B^{-1}( p(0) ) ( d/dt I_B( p(t) )|_{t=0} ) σ̇_{p(0)}( δp(0) ),
which can be defined to be the Riemannian acceleration of the curve at t = 0.
Let us write θ(t) = σ_p( p(t) ), p = p(0) and:
p(t) = exp( Σ_{j=1}^m θ_j(t) U^e_p X_j − ψ_p( θ(t) ) ) · p,
so that σ̇_p( δp(0) ) = θ̇(0) and d²/dt² σ_p( p(t) )|_{t=0} = θ̈(0). We have:
d/dt I_B( p(t) )|_{t=0} = d/dt I_{B,p}( θ(t) )|_{t=0} = d/dt Hess ψ_p( θ(t) )|_{t=0} = Cov_p( X, X, Σ_{j=1}^m θ̇_j(0) X_j ),
so that the acceleration at p has coordinates:
θ̈(0) + (1/2) Σ_{i,j=1}^m θ̇_i(0) θ̇_j(0) Cov_p(X, X)^{-1} Cov_p( X, X_i, X_j ) = θ̈(0) + (1/2) Cov_p(X, X)^{-1} Cov_p( X, Σ_{i=1}^m θ̇_i(0) X_i, Σ_{j=1}^m θ̇_j(0) X_j ).
A geodesic is a curve whose acceleration is zero at each point. The exponential map is the mapping Exp: T E_V → E_V defined by:
( p, U ) ↦ Exp_p U = p(1),
where t ↦ p(t) is the geodesic such that p(0) = p and δp(0) = U, for all U such that the geodesic exists at t = 1.
The exponential map is a particular retraction, that is, a family of mappings R_p, p ∈ E_V, from the tangent space at p to the manifold; here R_p: T_p E_V → E_V, such that R_p(0) = p and dR_p(0) = Id; see [4] (§5.4). It should be noted that exponential manifolds have natural retractions other than Exp, a notable one being given by the exponential family itself. A retraction provides a crucial step in gradient search algorithms by mapping a direction of increase of the objective function to a new trial point.

3.2.1. Example: Binary Independent 2.5.2 Continued

Let us consider the binary independent model of Section 2.5.2. We have:
I_B( e_λ(θ) ) = I_{B,λ}(θ) = diag( cosh^{-2}(θ_j) : j = 1, ..., d );
it follows that:
∂_k I_{B,λ}(θ) = ∂_k diag( cosh^{-2}(θ_j) : j = 1, ..., d ) = −2 cosh^{-3}(θ_k) sinh(θ_k) E_{kk},
where E_{kk} is the d × d matrix with entry one at (k, k) and zero otherwise. The k-th Christoffel matrix in the second term of the definition of the metric derivative (a.k.a. the Levi–Civita connection) is:
Γ_B^k( e_λ(θ) ) = Γ_λ^k(θ) = (1/2) I_{B,λ}^{-1}(θ) ∂_k I_{B,λ}(θ) = −tanh(θ_k) E_{kk}.
In terms of the moments, we have I_{B,λ}(θ) = Cov_θ(X, X′) = Hess ψ_λ(θ). As ∂_k ∂_i ∂_j ψ_λ(θ) = Cov_θ( X_k, X_i, X_j ), we can write:
∂_k I_{B,λ}(θ) = ∂_k diag( Var_θ(X_j) : j = 1, ..., d ) = Cov_θ( X_k, X_k, X_k ) E_{kk}
and:
Γ_λ^k(θ) = (1/2) Cov_θ( X_k, X_k )^{-1} Cov_θ( X_k, X_k, X_k ) E_{kk} = (1/2) ( 1 − (η_k)² )^{-1} ( −2η_k + 2(η_k)³ ) E_{kk} = −η_k E_{kk}.
The equations for the geodesics starting from θ(0) with velocity θ̇(0) = u are:
θ̈_k(t) + Σ_{i,j=1}^d Γ_{ij}^k( θ(t) ) θ̇_i(t) θ̇_j(t) = θ̈_k(t) − tanh( θ_k(t) ) ( θ̇_k(t) )² = 0,    k = 1, ..., d.
The ordinary differential equation:
θ̈ − tanh(θ) θ̇² = 0
has the closed-form solution:
θ(t) = gd^{-1}( gd(θ(0)) + θ̇(0) cosh(θ(0)) t ) = tanh^{-1}( sin( gd(θ(0)) + θ̇(0) cosh(θ(0)) t ) )
for all t such that:
−π/2 < gd(θ(0)) + θ̇(0) cosh(θ(0)) t < π/2,
where gd: ℝ → ]−π/2, +π/2[ is the Gudermannian function, that is, gd′(x) = 1/cosh(x), gd(0) = 0; in closed form, gd(x) = arcsin(tanh(x)). In fact, if θ is a solution of Equation (109), then:
d/dt gd(θ(t)) = θ̇(t) / cosh(θ(t)),
d²/dt² gd(θ(t)) = −sinh(θ(t)) (θ̇(t))² / cosh²(θ(t)) + θ̈(t) / cosh(θ(t)) = ( 1/cosh(θ(t)) ) ( θ̈(t) − tanh(θ(t)) (θ̇(t))² ) = 0,
so that t ↦ gd(θ(t)) coincides (where it is defined) with an affine function characterized by the initial conditions.
In particular, at t = 1, the geodesic in Equation (110) defines the Riemannian exponential Exp: T E_V → E_V. If (p, U) ∈ T E_V, that is, p ∈ E_V and U ∈ T_p E_V, then σ_λ(p) = θ(0) and U = Σ_j u_j U^e_p X_j, with coordinates σ̇_λ(U) = u. If:
−π/2 < gd(θ_j) + u_j cosh(θ_j) < π/2,
then we can take θ̇(0) = u and t = 1, so that:
Exp_p : U = Σ_{j=1}^d u_j U^e_p X_j ↦ e_λ( [ gd^{-1}( gd(θ_j) + u_j cosh(θ_j) ) : j = 1, ..., d ] ) = ∏_{j=1}^d exp( gd^{-1}( gd(θ_j) + u_j cosh(θ_j) ) X_j − ψ( gd^{-1}( gd(θ_j) + u_j cosh(θ_j) ) ) ) · 2^{−d}.
We have:
exp( gd^{-1}(v) ) = exp( tanh^{-1}( sin(v) ) ) = ( (1 + sin v) / (1 − sin v) )^{1/2}
and:
ψ( gd^{-1}(v) ) = log( cosh( gd^{-1}(v) ) ) = log( 1 / cos(v) );
hence u ↦ Exp_p( Σ_{j=1}^d u_j U^e_p X_j ) is given, for:
u ∈ ∏_{j=1}^d ] cosh^{-1}(θ_j) ( −π/2 − gd(θ_j) ), cosh^{-1}(θ_j) ( π/2 − gd(θ_j) ) [,
by:
Exp_θ(u) = ∏_{j=1}^d cos( gd(θ_j) + u_j cosh(θ_j) ) ( ( 1 + sin( gd(θ_j) + u_j cosh(θ_j) ) ) / ( 1 − sin( gd(θ_j) + u_j cosh(θ_j) ) ) )^{X_j/2} · 2^{−d} = ∏_{j=1}^d ( 1 + sin( gd(θ_j) + u_j cosh(θ_j) ) X_j ) · 2^{−d}.
The expectation parameters along the geodesic are:
η_i(t) = E_{θ=0}[ X_i ∏_{j=1}^d ( 1 + sin( gd(θ_j) + t u_j cosh(θ_j) ) X_j ) ] = sin( gd(θ_i) + t u_i cosh(θ_i) ),
and:
gd(θ_j) = arcsin(η_j),    cosh(θ_j) = 1 / ( 1 − (η_j)² )^{1/2},
so that the exponential map in terms of the expectation parameters is:
Exp_η(u) = ( sin( arcsin(η_j) + ( 1 − (η_j)² )^{1/2} u_j ) : j = 1, ..., d ).
The inverse of the Riemannian exponential provides a notion of translation between two elements of the exponential model, which is a particular parametrization of the model:
η_1 ⊖ η_2 = Exp_{η_1}^{-1}( η_2 ) = [ ( 1 − (η_{1,j})² )^{-1/2} ( arcsin(η_{2,j}) − arcsin(η_{1,j}) ) : j = 1, ..., d ].
In particular, at θ = 0, we have the geodesic:
t ↦ ∏_{j=1}^d ( 1 + sin( t u_j ) X_j ) · 2^{−d},    |t| < π / ( 2 max_j |u_j| ).
Some geodesic curves are shown in Figure 5.
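The closed form of the geodesics is easy to verify numerically. The sketch below (Python/numpy; the initial point and velocity are illustrative choices) implements gd, gd^{-1} and the one-dimensional geodesic θ(t) = gd^{-1}(gd(θ(0)) + θ̇(0) cosh(θ(0)) t), checks by finite differences that it satisfies θ̈ − tanh(θ) θ̇² = 0, and evaluates Exp_η(u) in the expectation parameters.

    import numpy as np

    gd = lambda x: np.arcsin(np.tanh(x))        # Gudermannian function
    gd_inv = lambda v: np.arctanh(np.sin(v))    # its inverse on ]-pi/2, +pi/2[

    theta0, thetadot0 = 0.5, 0.8                # illustrative initial conditions

    def theta(t):
        """Geodesic of the one-dimensional binary model in the theta parameter."""
        return gd_inv(gd(theta0) + thetadot0 * np.cosh(theta0) * t)

    # Finite-difference check of the geodesic equation theta'' - tanh(theta) theta'^2 = 0.
    t, h = 0.3, 1e-4
    d1 = (theta(t + h) - theta(t - h)) / (2 * h)
    d2 = (theta(t + h) - 2 * theta(t) + theta(t - h)) / h**2
    print(d2 - np.tanh(theta(t)) * d1**2)       # ~0

    # Riemannian exponential in the expectation parameters.
    def exp_eta(eta, u):
        return np.sin(np.arcsin(eta) + np.sqrt(1.0 - eta**2) * u)

    print(exp_eta(np.array([0.75, 0.75]), np.array([0.2, -0.4])))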

3.3. Riemannian Hessian

Let φ: E_V → ℝ with Riemannian gradient ∇φ(p) = Σ_i ( ∇̃φ )_i(p) U^e_p X_i, ∇̃φ(p) = I_B^{-1}(p) ∇φ_p(0)′. The Riemannian Hessian of φ is the metric derivative of the gradient ∇φ along the vector field Y, that is, Hess_Y φ = D_Y ∇φ; see [6] (Ch. 6, Ex. 11), [4] (§5.5). In the following, we denote by the symbol Hess, without a subscript, the ordinary Hessian matrix.
From Equation (87), we have the coordinates of Hess_Y φ(p). Given a generic tangent vector α, we compute from Equation (38):
d(∇φ)_p(θ)α|_{θ=0} = d( I_{B,p}^{-1}(θ) ∇φ_p(θ)′ )α|_{θ=0} = ( dI_{B,p}^{-1}(0)α ) ∇φ_p(0)′ + I_{B,p}^{-1}(0) Hess φ_p(0) α = −I_B^{-1}(p) ( dI_{B,p}(0)α ) ∇̃φ(p) + I_B^{-1}(p) Hess φ_p(0) α
and, upon substitution of (∇φ)_p for V_p in Equation (87),
σ̇_p( Hess_Y φ(p) ) = d(∇φ)_p(0)α + (1/2) I_B^{-1}(p) ( dI_{B,p}(0)α ) (∇φ)_p(0),    α = σ̇_p( Y(p) )
  = −I_B^{-1}(p) ( dI_{B,p}(0)α ) ∇̃φ(p) + I_B^{-1}(p) Hess φ_p(0) α + (1/2) I_B^{-1}(p) ( dI_{B,p}(0)α ) ∇̃φ(p)
  = I_B^{-1}(p) Hess φ_p(0) α − (1/2) I_B^{-1}(p) ( dI_{B,p}(0)α ) ∇̃φ(p)
  = I_B^{-1}(p) ( Hess φ_p(0) α − (1/2) ( dI_{B,p}(0)α ) ∇̃φ(p) ).
Hess_Y φ is characterized by knowing the value of g( Hess_Y φ, X ): E_V → ℝ for all vector fields X. We have from Equation (126), with α = σ̇_p( Y(p) ) and β = σ̇_p( X(p) ),
g_p( Hess_{Y(p)} φ(p), X(p) ) = β′ Hess φ_p(0) α − (1/2) β′ ( dI_{B,p}(0)α ) ∇̃φ(p).
This is the presentation of the Riemannian Hessian as a bi-linear form on T E_V; see the comments in [4] (Prop. 5.5.2-3). Note that the Riemannian Hessian is positive definite if:
α′ Hess φ_p(0) α > (1/2) α′ ( dI_{B,p}(0)α ) ∇̃φ(p),    α ∈ ℝ^m, α ≠ 0.

4. Application to Combinatorial Optimization

We conclude our paper by showing how the geometric method applies to the problem of finding the maximum of the expected value of a function.

4.1. Hessian of a Relaxed Function

Here is a key example of a vector field. Let f be any bounded random variable, and define the relaxed function to be φ(p) = E_p[f], p ∈ ℘_>. Define F(p) to be the projection of f, as an element of L²(p), onto T_p E_V = U^e_p V, i.e., F(p) is the element of U^e_p V such that:
E_p[ ( f − F(p) ) v ] = 0,    v ∈ U^e_p V.
In the basis U^e_p B, we have F(p) = Σ_i f̂_{p,i} U^e_p X_i and:
Cov_p( f, X_j ) = Σ_i f̂_{p,i} E_p[ U^e_p X_i U^e_p X_j ],    j = 1, ..., m,
so that f̂_p = I_B^{-1}(p) Cov_p(X, f) and:
F(p) = f̂_p′ U^e_p X = Cov_p(f, X) I_B^{-1}(p) U^e_p X.
Let us compute the gradient of the relaxed function φ = E_·[f]: E_V → ℝ. We have φ_p(θ) = E_{e_p(θ)}[f], and from the properties of exponential families, the Euclidean gradient is ∇φ_p(0) = Cov_p(f, X). It follows that the natural gradient is:
∇̃φ(p) = I_B^{-1}(p) Cov_p(X, f) = f̂_p,
and the Riemannian gradient is ∇φ(p) = F(p).
From the properties of exponential families, we have:
Hess φ_p(0) = Cov_p(X, X, f),
so that, in this case, Equation (127), when written in terms of the moments, is:
β′ Cov_p(X, X, f) α − (1/2) β′ Cov_p(X, X, α·X) Cov_p(X, X)^{-1} Cov_p(X, f).

4.1.1. Example: Binary Independent 2.5.2 and 3.2.1 Continued

We list below the computation of the Hessian in the case of two binary independent variables. Computations were done with Sage [22], which allows both the reduction x_i² = 1 in the ring of polynomials and the simplifications in the symbolic ring of the parameters.
Cov_η(X, f) = ( −(η_1² − 1)a_1 − (η_1²η_2 − η_2)a_12 , −(η_2² − 1)a_2 − (η_1η_2² − η_1)a_12 )′ = ( −(η_1 − 1)(η_1 + 1)(a_12η_2 + a_1) , −(η_2 − 1)(η_2 + 1)(a_12η_1 + a_2) )′,
Cov_η(X, X) = diag( −η_1² + 1 , −η_2² + 1 ) = diag( −(η_1 − 1)(η_1 + 1) , −(η_2 − 1)(η_2 + 1) ),
Cov_η(X, X)^{-1} Cov_η(X, f) = ( a_12η_2 + a_1 , a_12η_1 + a_2 )′ = F(η),
Cov_η(X, X, f) = [[ 2(η_1³ − η_1)a_1 + 2(η_1³η_2 − η_1η_2)a_12 , (η_1²η_2² − η_1² − η_2² + 1)a_12 ], [ (η_1²η_2² − η_1² − η_2² + 1)a_12 , 2(η_1η_2³ − η_1η_2)a_12 + 2(η_2³ − η_2)a_2 ]] = [[ 2(η_1 − 1)(η_1 + 1)(a_12η_2 + a_1)η_1 , (η_1 − 1)(η_1 + 1)(η_2 − 1)(η_2 + 1)a_12 ], [ (η_1 − 1)(η_1 + 1)(η_2 − 1)(η_2 + 1)a_12 , 2(η_2 − 1)(η_2 + 1)(a_12η_1 + a_2)η_2 ]],
Cov_η(X, X)^{-1} Cov_η(X, X, f) = [[ −2(a_12η_2 + a_1)η_1 , −a_12η_2² + a_12 ], [ −a_12η_1² + a_12 , −2(a_12η_1 + a_2)η_2 ]],
Cov_η(X, X, F(η)) = diag( 2(a_12η_2 + a_1)(η_1 + 1)(η_1 − 1)η_1 , 2(a_12η_1 + a_2)(η_2 + 1)(η_2 − 1)η_2 ),
Cov_η(X, X)^{-1} Cov_η(X, X, F(η)) = diag( −2(a_12η_2 + a_1)η_1 , −2(a_12η_1 + a_2)η_2 ).
The Riemannian Hessian as a matrix in the basis of the tangent space is:
Hess F(η) = Cov_η(X, X)^{-1} ( Cov_η(X, X, f) − (1/2) Cov_η(X, X, F(η)) ) = [[ −(a_12η_2 + a_1)η_1 , −a_12(η_2 + 1)(η_2 − 1) ], [ −a_12(η_1 + 1)(η_1 − 1) , −(a_12η_1 + a_2)η_2 ]].
As a check, let us compute the Riemannian Hessian as a natural Hessian in the Riemannian parameters, Hess (φ ∘ Exp_p)(u)|_{u=0}; see [4] (Prop. 5.5.4). We have:
F ∘ Exp_η(u) = a_12 sin( √(1 − η_1²) u_1 + arcsin(η_1) ) sin( √(1 − η_2²) u_2 + arcsin(η_2) ) + a_1 sin( √(1 − η_1²) u_1 + arcsin(η_1) ) + a_2 sin( √(1 − η_2²) u_2 + arcsin(η_2) )
and:
Hess (F ∘ Exp_η)(u)|_{u=0} = [[ (η_1² − 1)a_12η_1η_2 + (η_1² − 1)a_1η_1 , (η_1² − 1)(η_2² − 1)a_12 ], [ (η_1² − 1)(η_2² − 1)a_12 , (η_2² − 1)a_12η_1η_2 + (η_2² − 1)a_2η_2 ]] = [[ (a_12η_2 + a_1)(η_1 + 1)(η_1 − 1)η_1 , a_12(η_1 + 1)(η_1 − 1)(η_2 + 1)(η_2 − 1) ], [ a_12(η_1 + 1)(η_1 − 1)(η_2 + 1)(η_2 − 1) , (a_12η_1 + a_2)(η_2 + 1)(η_2 − 1)η_2 ]].
Note the presence of the factor Covη (X,X).

4.2. Newton Method

The Newton method is an iterative method that generates a sequence of points p_t, t = 0, 1, ..., converging towards a stationary point p̂ of F(p) = E_p[f], p ∈ E_V, that is, a critical point of the vector field p ↦ ∇F(p), ∇F(p̂) = 0. Here, we follow [4] (Ch. 5–6), and in particular Algorithm 5 on page 113.
Let ∇F be a gradient field. We reproduce in our case the basic derivation of the Newton method in the following. Note that, in this section, we use the notation Hess •[α] to denote Hess_α •. Using the definition of the metric derivative, we have, for a geodesic curve [0, 1] ∋ t ↦ p(t) ∈ E_V connecting p = p(0) to p̂ = p(1), that:
d/dt g_{p(t)}( ∇F(p(t)), δp(t) ) = g_{p(t)}( Hess F(p(t))[δp(t)], δp(t) );
hence the increment from p to p̂ is:
g_{p̂}( ∇F(p̂), δp(1) ) − g_p( ∇F(p), δp(0) ) = ∫_0^1 g_{p(t)}( Hess F(p(t))[δp(t)], δp(t) ) dt.
Now, we assume that ∇F(p̂) = 0 and that in Equation (145) the integral is approximated by the initial value of the integrand, that is to say, the Hessian is approximately constant on the geodesic from p to p̂; we obtain:
−g_p( ∇F(p), δp(0) ) ≈ g_p( Hess F(p)[δp(0)], δp(0) ).
If we can solve the Newton equation:
Hess F(p)[u] = −∇F(p),
then u is approximately equal to the initial velocity of the geodesic connecting p to p̂, that is, p̂ ≈ Exp_p(u).
The particular structure of the exponential manifold suggests at least two natural retractions that could be used to move from u to p̂. Namely, we have the Riemannian exponential (θ_t, θ_{t+1}) ↦ Exp_{θ_t}( θ_{t+1} − θ_t ) and the e-retraction coming from the exponential family itself and defined by (θ_t, θ_{t+1}) ↦ e_{θ_t}( θ_{t+1} − θ_t ), with θ_{t+1} − θ_t = u_t.
In the θ parameters, with the e-retraction, the Newton method generates a sequence (θ_t) according to the following updating rule:
θ_{t+1} = θ_t − λ Hess F(θ_t)^{-1} ∇̃F(θ_t),
where λ > 0 is an extra parameter intended to control the step size and, in turn, the convergence to θ̂; see [5].
We can rewrite Equation (148) in terms of covariances as:
θ_{t+1} = θ_t − λ ( Cov_{θ_t}(X, X, f) − (1/2) Cov_{θ_t}(X, X, ∇̃F(θ_t)) )^{-1} ∇̃F(θ_t).
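For the two-variable example, the updating rule can be iterated directly from covariances computed on Ω. The sketch below (Python/numpy) is an illustration, not the authors' implementation: the starting point, λ and the number of iterations are arbitrary choices, and the term Cov_{θ_t}(X, X, ∇̃F(θ_t)) is read, as in Section 4.1.1, as the third covariance with the Riemannian gradient regarded as a random variable.

    import numpy as np

    # Two binary variables, f with a1 = 1, a2 = 2, a12 = 3 as in the figures.
    Omega = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
    X = Omega
    f = 1.0 * X[:, 0] + 2.0 * X[:, 1] + 3.0 * X[:, 0] * X[:, 1]

    def density(theta):
        """e_lambda(theta): exponential family through the uniform density."""
        w = np.exp(X @ theta) * 0.25
        return w / w.sum()

    def newton_step(theta):
        p = density(theta)
        Xc = X - p @ X                                  # centered statistics
        fc = f - p @ f
        cov_XX = (Xc * p[:, None]).T @ Xc               # Cov_theta(X, X)
        nat_grad = np.linalg.solve(cov_XX, (p * fc) @ Xc)
        F_riem = Xc @ nat_grad                          # Riemannian gradient of F
        T_f = np.einsum('x,xi,xj,x->ij', p, Xc, Xc, fc)       # Cov_theta(X, X, f)
        T_F = np.einsum('x,xi,xj,x->ij', p, Xc, Xc, F_riem)   # Cov_theta(X, X, grad)
        hess = np.linalg.solve(cov_XX, T_f - 0.5 * T_F)       # Riemannian Hessian matrix
        return -np.linalg.solve(hess, nat_grad), nat_grad

    theta, lam = np.array([-0.6, -0.3]), 0.5            # illustrative start and step size
    for _ in range(100):
        u, nat_grad = newton_step(theta)
        theta = theta + lam * u                         # e-retraction in the theta chart
    print(theta, nat_grad)   # the iterates approach the interior critical point of F
                             # (a saddle), where the natural gradient vanishes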

4.3. Example: Binary Independent

In the η parameters, the Newton step is:
u = −Hess F(η)^{-1} ∇F(η) = ( 1/D(η) ) ( a_12²η_1 + a_12a_2 + (a_1a_12η_1 + a_1a_2)η_2 , a_1a_2η_1 + a_1a_12 + (a_12a_2η_1 + a_12²)η_2 )′,
where the common denominator is D(η) = a_12²η_1² + (a_12a_2η_1 + a_12²)η_2² − a_12² + (a_1a_12η_1² + a_1a_2η_1)η_2, and the new η in the Riemannian retraction is:
Exp_η(u) = ( sin( ( a_12²η_1 + a_12a_2 + (a_1a_12η_1 + a_1a_2)η_2 ) √(1 − η_1²) / D(η) + arcsin(η_1) ) , sin( ( a_1a_2η_1 + a_1a_12 + (a_12a_2η_1 + a_12²)η_2 ) √(1 − η_2²) / D(η) + arcsin(η_2) ) ).
In Figure 6, we represent the vector field associated with the Newton step in the η parameters, with λ = 0.05, using the Riemannian retraction, for the case a_1 = 1, a_2 = 2 and a_12 = 3, with:
Exp_η(λu) = ( sin( λ √(1 − η_1²) ( (3η_1 + 2)η_2 + 9η_1 + 6 ) / ( 3(2η_1 + 3)η_2² + 9η_1² + (3η_1² + 2η_1)η_2 − 9 ) + arcsin(η_1) ) , sin( λ ( 3(2η_1 + 3)η_2 + 2η_1 + 3 ) √(1 − η_2²) / ( 3(2η_1 + 3)η_2² + 9η_1² + (3η_1² + 2η_1)η_2 − 9 ) + arcsin(η_2) ) ).
The red dotted lines represented in the figure identify the basins of attraction of the vector field and correspond to the solutions of the explicit equation in η for which the Newton step u is not defined. This vector field can be compared to the one in Figure 7, associated with the Newton step for F(η) using the Euclidean geometry. In the Euclidean geometry, F(η) is a quadratic function with one saddle point, so that from any η, the Newton step points in the direction of the critical point. This makes the Newton step unsuitable for an optimization algorithm. On the other hand, in the Riemannian geometry, the vertices of the polytope are critical points for F(η), and they determine the presence of multiple basins of attraction, as expected.
Figure 8 shows the Newton step in the θ parameters based on the e-retraction of Equation (149), while Figure 9 represents the Newton step evaluated with respect to the Euclidean geometry. A comparison of the two vector fields shows that, differently from the η parameters, the number of basins of attraction is the same in the two geometries; however, the scale of the vectors is different. In particular, notice how on the plateau, for diverging θ, the Newton step in the Euclidean geometry vanishes, while in the Riemannian geometry, it gets larger. This behavior suggests better convergence properties for an optimization algorithm based on the Newton step evaluated using the proper Riemannian geometry. In the θ parameters, the boundaries of the basins of attraction represented by the red dotted lines have been computed numerically and correspond to the values of θ for which the update step is not defined.
Finally, notice that in both the η and θ parameters, the step is not always in the direction of descent for the function, a common behavior of the Newton method, which converges to the critical points.
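The η-parameter Newton step with the Riemannian retraction can likewise be iterated; the following sketch (Python/numpy) uses the closed-form gradient and Hessian of Section 4.1.1 with the values λ = 0.05 and a_1 = 1, a_2 = 2, a_12 = 3 of Figure 6, while the starting point and the number of iterations are illustrative choices.

    import numpy as np

    a1, a2, a12, lam = 1.0, 2.0, 3.0, 0.05   # coefficients and step size of Figure 6

    def newton_step_eta(eta):
        e1, e2 = eta
        grad = np.array([a1 + a12 * e2, a2 + a12 * e1])   # natural gradient in eta
        hess = np.array([[-(a12 * e2 + a1) * e1, a12 * (1 - e2**2)],
                         [a12 * (1 - e1**2), -(a12 * e1 + a2) * e2]])  # Riemannian Hessian
        return -np.linalg.solve(hess, grad)

    def exp_retraction(eta, u):
        """Riemannian retraction Exp_eta(u) of the binary independent model."""
        return np.sin(np.arcsin(eta) + np.sqrt(1.0 - eta**2) * u)

    eta = np.array([-0.25, -0.25])           # illustrative starting point
    for _ in range(200):
        eta = exp_retraction(eta, lam * newton_step_eta(eta))
    print(eta)   # converges to the interior critical point (-2/3, -1/3), a saddle of F:
                 # the Newton step follows critical points, not necessarily maxima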

5. Discussion and Conclusions

In this paper, we introduced second-order calculus over a statistical manifold, following the approach described in [4], which has been adapted to the special case of exponential statistical models [2,3]. By defining the Riemannian Hessian and using the notion of retraction, we developed the proper machinery necessary for the definition of the updating rule of the Newton method for the optimization of a function defined over an exponential family.
The examples discussed in the paper show that, by taking into account the proper Riemannian geometry of a statistical exponential family, the vector fields associated with the Newton step in the different parametrizations change profoundly. Not only do new basins of attraction associated with local and global minima appear, as for the expectation parameters, but the magnitude of the Newton step is also affected, as over the plateau in the natural parameters. Such differences are expected to have a strong impact on the performance of an optimization algorithm based on the Newton step, from the point of view of both the achievable convergence and the speed of convergence to the optimum.
The Newton method is a popular second order optimization technique based on the computation of the Hessian of the function to be optimized and is well known for its super-linear convergence properties. However, the use of the Newton method poses a number of issues in practice.
First of all, as the examples in Figures 6 and 8 show, the Newton step does not always point in the direction of the natural gradient, and the algorithm may not converge to a (local) optimum of the function. Such behavior is not unexpected; indeed, the Newton method tends to converge to critical points of the function to be optimized, which include local minima, local maxima and saddle points. In order to obtain a direction of ascent for the function to be optimized, the Hessian must be negative-definite, i.e., its eigenvalues must be strictly negative, which is not guaranteed in the general case. Another important remark is related to the computational complexity associated with the evaluation of the Hessian, compared to the (natural) gradient. Indeed, to obtain the Newton step, the Christoffel matrices have to be evaluated, together with the third order covariances between the sufficient statistics and the function, and the Hessian has to be inverted. Finally, notice that when the Hessian is close to being non-invertible, numerical problems may arise in the computation of the Newton step, and the algorithm may become unstable and diverge.
In the literature, different methods have been proposed to overcome these issues. Among them, we mention quasi-Newton methods, where the update vector is obtained using a modified Hessian, which has been made negative-definite, for instance, by adding a proper correction matrix.
This paper represents the first step in the design of an algorithm based on the Newton method for the optimization over a statistical model. The authors are working on the computational aspects related to the implementation of the method, and a new paper with experimental results is in progress.

Acknowledgments

Luigi Malagò was supported by the Xerox University Affairs Committee Award and by de Castro Statistics, Collegio Carlo Alberto, Moncalieri. Giovanni Pistone is supported by de Castro Statistics, Collegio Carlo Alberto, Moncalieri, and is a member of GNAMPA–INdAM, Roma.

Author Contributions

All authors contributed to the design of the research. The research was carried out by all authors. The study of the Hessian and of the Newton method in statistical manifolds was originally suggested by Luigi Malagò. The manuscript was written by Luigi Malagò and Giovanni Pistone. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Brown, L.D. Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory; Number 9 in IMS Lecture Notes. Monograph Series; Institute of Mathematical Statistics: Hayward, CA, USA, 1986; p. 283. [Google Scholar]
  2. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2000; p. 206. [Google Scholar]
  3. Pistone, G. Nonparametric Information Geometry. In Geometric Science of Information; Proceedings of the First International Conference, GSI 2013, Paris, France, 28–30 August 2013, Nielsen, F., Barbaresco, F., Eds.; Lecture Notes in Computer Science, Volume 8085; Springer: Berlin/Heidelberg, Germany, 2013; pp. 5–36. [Google Scholar]
  4. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008; p. xvi+224. [Google Scholar]
  5. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006; p. xxii+664. [Google Scholar]
  6. Do Carmo, M.P. Riemannian geometry; Mathematics: Theory & Applications; Birkhäuser Boston Inc.: Boston, MA, USA, 1992; p. xiv+300. [Google Scholar]
  7. Abraham, R.; Marsden, J.E.; Ratiu, T. Manifolds, Tensor Analysis, and Applications, 2nd ed; Applied Mathematical Sciences, Volume 75; Springer: New York, NY, USA, 1988; p. x+654. [Google Scholar]
  8. Lang, S. Differential and Riemannian Manifolds, 3rd ed.; Graduate Texts in Mathematics; Springer: New York, NY, USA, 1995; p. xiv+364. [Google Scholar]
  9. Pistone, G. Algebraic varieties vs. differentiable manifolds in statistical models. In Algebraic and Geometric Methods in Statistics; Gibilisco, P., Riccomagno, E., Rogantin, M.P., Wynn, H.P., Eds.; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
  10. Malagò, L.; Matteucci, M.; Dal Seno, B. An information geometry perspective on estimation of distribution algorithms: Boundary analysis. Proceedings of the 2008 GECCO Conference Companion On Genetic and Evolutionary Computation (GECCO ’08); ACM: New York, NY, USA, 2008; pp. 2081–2088. [Google Scholar]
  11. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Relaxation as a Unifying Approach in 0/1 Programming. Proceedings of the NIPS 2009 Workshop on Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra (DISCML), Whistler Resort & Spa, BC, Canada, 11–12 December 2009.
  12. Malagò, L.; Matteucci, M.; Pistone, G. Stochastic Natural Gradient Descent by Estimation of Empirical Covariances. Proceedings of the IEEE Congress on Evolutionary Computation (CEC), New Orleans, LA, USA, 5–8 June 2011; pp. 949–956.
  13. Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. Proceedings of the 11th Workshop on Foundations of Genetic Algorithms (FOGA ’11), Schwarzenberg, Austria, 5–8 January 2011; ACM: New York, NY, USA, 2011; pp. 230–242. [Google Scholar]
  14. Malagò, L.; Matteucci, M.; Pistone, G. Natural gradient, fitness modelling and model selection: A unifying perspective. Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 486–493.
  15. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput 1998, 10, 251–276. [Google Scholar]
  16. Shima, H. The Geometry of Hessian Structures; World Scientific Publishing Co. Pte. Ltd.: Hackensack, NJ, USA, 2007; p. xiv+246. [Google Scholar]
  17. Malagò, L. On the Geometry of Optimization Based on the Exponential Family Relaxation. Ph.D. Thesis, Politecnico di Milano, Milano, Italy, 2012. [Google Scholar]
  18. Gallavotti, G. Statistical Mechanics: A Short Treatise; Texts and Monographs in Physics; Springer: Berlin, Germany, 1999; p. xiv+339. [Google Scholar]
  19. Naudts, J. Generalised exponential families and associated entropy functions. Entropy 2008, 10, 131–149. [Google Scholar]
  20. Esteban, J.; Ray, D. On the Measurement of Polarization. Econometrica 1994, 62, 819–851. [Google Scholar]
  21. Montalvo, J.; Reynal-Querol, M. Ethnic polarization, potential conflict, and civil wars. Am. Econ. Rev 2005, 796–816. [Google Scholar]
  22. Stein, W.; et al. Sage Mathematics Software (Version 6.0); The Sage Development Team, 2013. Available online: http://www.sagemath.org (accessed on 27 March 2014).
Figure 1. Relaxation of the Function (2) on the independence model. a1 = 1, a2 = 2, a12 = 3.
Figure 2. Gradient flow of the Function (2). The domain has been increased to include values outside the square [−1, +1]2.
Figure 3. Gradient flow (blue line) and natural gradient flow (black line) for the Function (2), starting at (−1/4, −1/4).
Figure 4. Normalized polarization.
Figure 5. Geodesics from η = (0.75, 0.75).
Figure 6. The Newton step in the η parameters, Riemannian retraction, λ = 0.05. The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined; cf. Equation (150). The instability close to the critical lines is represented by the longer arrows.
Figure 7. The Newton step in the η parameters, Euclidean geometry, λ = 0.05.
Figure 8. The Newton step in the θ parameters, exponential retraction, λ = 0.015. The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined. The instability along the critical lines, which identifies the basins of attraction, is not represented.
Figure 9. The Newton step in the θ parameters, Euclidean geometry, λ = 0.15. The red dotted lines identify the different basins of attraction and correspond to the points for which the Newton step is not defined. The instability along the critical lines, which identifies the basins of attraction, is not represented.
