6.1. Definitions
In this section, we recall the xNES and pure rank-μ CMA-ES, and we describe them in the IGO framework, thus allowing a reasonable comparison with the GIGO algorithms.
6.1.1. xNES
We recall a restriction of the xNES algorithm, introduced in [19] (this restriction is sufficient to describe the numerical experiments in [19]).
Definition 9 (xNES algorithm).
The xNES algorithm with sample size N, weights wi and learning rates ημ and ηΣ updates the parameters μ ∈ ℝd, A ∈ Md(ℝ) with the following rule: At each step, N points x1, …, xN are sampled from the distribution N(μ, AAᵀ). Without loss of generality, we assume f(x1) < … < f(xN). The parameter is updated according to:

μ ← μ + ημ AGμ,
A ← A exp(ηΣ GM/2),

where, setting zi = A⁻¹(xi − μ):

Gμ = ∑i wi zi,
GM = ∑i wi (zi ziᵀ − I).
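For concreteness, the following minimal Python sketch implements one step of this restricted xNES update exactly as written above; the objective f, the weights w and the learning rates are user-supplied placeholders, the function name and signature are ours (not from [19]), and scipy.linalg.expm is used for the matrix exponential.

```python
import numpy as np
from scipy.linalg import expm

def xnes_step(mu, A, f, w, eta_mu, eta_sigma, rng=None):
    """One step of the restricted xNES update (equal rates for sigma and B)."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = len(w), len(mu)
    z = rng.standard_normal((N, d))            # z_i ~ N(0, I_d)
    x = mu + z @ A.T                           # x_i ~ N(mu, A A^T)
    z = z[np.argsort([f(xi) for xi in x])]     # rank so that f(x_1) < ... < f(x_N)
    G_mu = sum(w[i] * z[i] for i in range(N))
    G_M = sum(w[i] * (np.outer(z[i], z[i]) - np.eye(d)) for i in range(N))
    mu_new = mu + eta_mu * A @ G_mu
    A_new = A @ expm(eta_sigma * G_M / 2.0)    # A <- A exp(eta_Sigma G_M / 2)
    return mu_new, A_new
```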
The more general version decomposes the matrix A as σB, where det B = 1, and uses two different learning rates for σ and for B. We gave the version where these two learning rates are equal (in particular, for the default parameters in [19], these two learning rates are equal). This restriction of the xNES algorithm can be described in the IGO framework, provided all of the learning rates are equal (most of the elements of the proof can be found in [19] (the proposition below essentially states that xNES is a natural gradient update) or in [1]):
Proposition 6 (xNES as IGO).
The xNES algorithm with sample size N, weights wi and learning rates ημ = ηΣ = δt coincides with the IGO algorithm with sample size N, weights wi, step size δt and in which, given the current position (μt, At), the set of Gaussians is parametrized by (δ, M), with δ ∈ ℝm and M ∈ Sym(ℝm). The parameters maintained by the algorithm are (μ, A), and the xi are sampled from N(μ, AAᵀ).
Proof. Let us compute the IGO update in this parametrization: the current position is (δ, M) = (0, 0), and by using Proposition 1, we can see that for this parametrization, the Fisher information matrix at (0, 0) is the identity matrix. The IGO update is therefore the sum of the gradients of the log-likelihoods of the sampled points, weighted by the wi. Since tr(M) = log(det(exp(M))), a straightforward computation yields these gradients, and rewriting the resulting update in terms of the mean and the covariance matrix shows that it is exactly the xNES update. □
6.1.2. Using a Square Root of the Covariance Matrix
Firstly, we recall that the IGO framework (on the manifold of Gaussian distributions, for example) emphasizes its Riemannian manifold structure. All of the algorithms studied here (including GIGO, which is not strictly speaking an IGO algorithm) define a trajectory in this manifold (a new point for each step), and to go from a point θ to the next one, θ′, we follow some curve γ, with γ(0) = θ, γ(δt) = θ′ and initial speed γ̇(0) given by the natural gradient.
To be compatible with this point of view, an algorithm giving an update rule for a square root A of the covariance matrix (any matrix A such that Σ = AAᵀ; since we do not force A to be symmetric, the decomposition is not unique) has to satisfy the following condition: for a given initial speed, the covariance matrix Σt+δt after one step must depend only on Σt and not on the square root At chosen for Σt.
The xNES algorithm does satisfy this condition: consider two xNES algorithms with the same learning rates, starting from the same mean μt and from two square roots of the same covariance matrix Σt, and using the same samples xi to compute the natural gradient update; then the two updated square roots again define the same covariance matrix Σt+δt. Using the definitions of Section 6.3, we have just shown that what we will call the “xNES trajectory” is well defined.
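This property can be checked numerically. The sketch below assumes the update formulas of Definition 9 (with the weights applied to samples already ranked by f): it runs one xNES update from two different square roots of the same Σt, with the same samples, and verifies that the resulting means and covariance matrices agree.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, N = 3, 6
eta_mu, eta_sigma = 1.0, 0.5
w = np.linspace(1.0, 0.0, N)
w /= w.sum()

B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)                  # a covariance matrix
A1 = np.linalg.cholesky(Sigma)                   # first square root
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A2 = A1 @ Q                                      # another square root of the same Sigma

mu = rng.standard_normal(d)
x = rng.multivariate_normal(mu, Sigma, size=N)   # same samples for both runs
# (weights are applied to the samples in the given order; ranking by f is omitted)

def xnes_update(mu, A, x):
    z = (x - mu) @ np.linalg.inv(A).T            # z_i = A^{-1}(x_i - mu)
    G_mu = sum(w[i] * z[i] for i in range(N))
    G_M = sum(w[i] * (np.outer(z[i], z[i]) - np.eye(d)) for i in range(N))
    return mu + eta_mu * A @ G_mu, A @ expm(eta_sigma * G_M / 2.0)

mu1, A1n = xnes_update(mu, A1, x)
mu2, A2n = xnes_update(mu, A2, x)
print(np.allclose(mu1, mu2))                     # True: same mean update
print(np.allclose(A1n @ A1n.T, A2n @ A2n.T))     # True: same covariance update
```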
It is also important to notice that, in order to be well defined, a natural gradient algorithm updating a square root of the covariance matrix has to specify more conditions than simply following the natural gradient. The reason for this is that the natural gradient is a vector tangent to the manifold of Gaussians: it lives in a space of dimension d(d + 3)/2 (the dimension of this manifold), whereas the vector (μ, A) lives in a space of dimension d(d + 1) (the dimension of ℝd × GLd(ℝ)), which is too large: there exist infinitely many applications t ↦ At such that a given curve t ↦ Σt can be written Σt = At Atᵀ. This is why Theorem 5 is simply an implication, whereas Theorem 4 is an equivalence.
More precisely, let us consider A in GLd(ℝ) and two infinitesimal updates vA and v′A of A. Since Σ = AAᵀ, the infinitesimal update of Σ corresponding to vA (resp. v′A) is vAAᵀ + AvAᵀ (resp. v′AAᵀ + Av′Aᵀ). It is now easy to see that vA and v′A define the same direction for Σ (i.e., vAAᵀ + AvAᵀ = v′AAᵀ + Av′Aᵀ) if and only if AMᵀ + MAᵀ = 0, where M := vA − v′A. This is equivalent to A⁻¹M being antisymmetric.
For any A ∈ GLd(ℝ), let us denote by TA the space of the matrices M such that A⁻¹M is antisymmetric or, in other words, TA := {u ∈ Md(ℝ), Auᵀ + uAᵀ = 0}. Having a subspace SA in direct sum with TA for all A is sufficient (but not necessary) to have a well-defined update rule. Namely, consider the (linear) application:

φA : Md(ℝ) → Sym(ℝd), u ↦ Auᵀ + uAᵀ,

sending an infinitesimal update of A to the corresponding update of Σ. It is not bijective, but as we have seen before, Ker φA = TA, and therefore, if we have, for some UA,

UA ⊕ TA = Md(ℝ),

then φA|UA is an isomorphism. Let vΣ be an infinitesimal update of Σ. We choose the following update of A corresponding to vΣ:

vA := (φA|UA)⁻¹(vΣ).
Any UA such that UA ⊕ TA = Md(ℝ) is a reasonable choice to pick vA for a given vΣ. The choice SA := {u ∈ Md(ℝ), Auᵀ − uAᵀ = 0} has an interesting additional property: it is the orthogonal of TA for the norm

‖u‖A² := tr((A⁻¹u)ᵀ(A⁻¹u)),

and consequently, it can be defined without referring to the parametrization, which makes it a canonical choice. To prove this, remark that TA = {M ∈ Md(ℝ), A⁻¹M antisymmetric} and SA = {M ∈ Md(ℝ), A⁻¹M symmetric}, and that if M is symmetric and N is antisymmetric, then tr(MN) = 0.
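The following sketch illustrates this decomposition numerically. It assumes the inner product ⟨u, v⟩A = tr((A⁻¹u)ᵀ A⁻¹v) associated with the norm above, and the closed form vA = ½ vΣ A⁻ᵀ for the SA-representative of a symmetric vΣ is derived here for illustration rather than quoted from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d)) + d * np.eye(d)   # a generic invertible matrix
Ainv = np.linalg.inv(A)

def phi(A, v):
    """phi_A(v) = A v^T + v A^T: the update of Sigma = A A^T induced by v."""
    return A @ v.T + v @ A.T

u = rng.standard_normal((d, d))
W = Ainv @ u
u_S = A @ (W + W.T) / 2            # S_A-part: A^{-1} u_S is symmetric
u_T = A @ (W - W.T) / 2            # T_A-part: A^{-1} u_T is antisymmetric

print(np.allclose(u_S + u_T, u))                                # M_d(R) = S_A + T_A
print(np.allclose(phi(A, u_T), 0))                              # T_A = Ker(phi_A)
print(np.isclose(np.trace((Ainv @ u_S).T @ (Ainv @ u_T)), 0))   # S_A orthogonal to T_A

# For a symmetric v_Sigma, v_A = (1/2) v_Sigma A^{-T} is its S_A-representative:
v_Sigma = rng.standard_normal((d, d))
v_Sigma = v_Sigma + v_Sigma.T
v_A = 0.5 * v_Sigma @ Ainv.T
print(np.allclose(phi(A, v_A), v_Sigma))               # phi_A(v_A) = v_Sigma
print(np.allclose(A @ v_A.T - v_A @ A.T, 0))           # v_A lies in S_A
```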
Let us now show that this is the choice made by xNES and GIGO-A (which are well-defined algorithms updating a square root of the covariance matrix).
Proposition 7. Let A ∈ GLd(ℝ). The vA given by the xNES and GIGO-A algorithms lies in SA = {u ∈ Md(ℝ), Auᵀ − uAᵀ = 0}.
Proof. For xNES, the square root is updated as A(t) = A exp(tGM/2), so its initial speed is vA = AGM/2 and A⁻¹vA = GM/2. Therefore, forcing M (and GM) to be symmetric in xNES is equivalent to A⁻¹vA = (A⁻¹vA)ᵀ, which can be rewritten as AvAᵀ − vAAᵀ = 0, i.e., vA ∈ SA. For GIGO-A, Equation (40) shows that the matrix appearing in the update of A is symmetric, and with this fact in mind, Equation (42) shows that we have AvAᵀ − vAAᵀ = 0. □
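As a quick numerical confirmation of the xNES part of this argument (a sketch based on the update of Definition 9, where the initial speed of t ↦ A exp(tGM/2) is vA = AGM/2):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.standard_normal((d, d)) + d * np.eye(d)
G_M = rng.standard_normal((d, d))
G_M = (G_M + G_M.T) / 2                        # G_M is symmetric by construction

v_A = A @ G_M / 2                              # d/dt [A exp(t G_M / 2)] at t = 0
print(np.allclose(A @ v_A.T - v_A @ A.T, 0))   # True: v_A lies in S_A
```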
6.1.3. Pure Rank-μ CMA-ES
We now recall the pure rank-μ CMA-ES algorithm. The general CMA-ES algorithm is described in [21].
Definition 10 (Pure rank-μ CMA-ES algorithm).
The pure rank-μ CMA-ES algorithm with sample size N, weights wi and learning rates ημ and ηΣ is defined by the following update rule: At each step, N points x1, …, xN are sampled from the distribution N(μ, Σ). Without loss of generality, we assume f(x1) < … < f(xN). The parameter is updated according to:

μ ← μ + ημ ∑i wi (xi − μ),
Σ ← Σ + ηΣ ∑i wi ((xi − μ)(xi − μ)ᵀ − Σ).

The pure rank-μ CMA-ES can also be described in the IGO framework; see, for example, [20].
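As for xNES, here is a minimal Python sketch of one step of this update, written directly from the formulas above; the objective f, the weights w and the learning rates are placeholders, and the function name is ours.

```python
import numpy as np

def pure_rank_mu_cmaes_step(mu, Sigma, f, w, eta_mu, eta_Sigma, rng=None):
    """One step of the pure rank-mu CMA-ES update in the (mu, Sigma) parametrization."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(w)
    x = rng.multivariate_normal(mu, Sigma, size=N)       # x_i ~ N(mu, Sigma)
    x = x[np.argsort([f(xi) for xi in x])]               # rank: f(x_1) < ... < f(x_N)
    y = x - mu
    mu_new = mu + eta_mu * sum(w[i] * y[i] for i in range(N))
    Sigma_new = Sigma + eta_Sigma * sum(
        w[i] * (np.outer(y[i], y[i]) - Sigma) for i in range(N))
    return mu_new, Sigma_new
```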
Proposition 8 (Pure rank-μ CMA-ES as IGO). The pure rank-μ CMA-ES algorithm with sample size N, weights wi and learning rates ημ = ηΣ = δt coincides with the IGO algorithm with sample size N, weights wi, step size δt and the parametrization (μ, Σ).
6.2. Twisting the Metric
As we can see, the IGO framework does not allow one to recover the learning rates for xNES and pure rank-μ CMA-ES, which is a problem, since usually, the covariance learning rate is set much smaller than the mean learning rate (see either [19] or [21]).
A way to recover these learning rates is to incorporate them directly into the metric (see also blockwise GIGO, in Section 6.4). More precisely:
Definition 11 (Twisted Fisher metric).
Let ημ, ηΣ ∈ ℝ, and let (Pθ)θ∈Θ be a family of normal probability distributions: Pθ = N(μ(θ), Σ(θ)), with μ and Σ C1. We call the “(ημ, ηΣ)-twisted Fisher metric” the metric defined by:

I(ημ,ηΣ)(θ)i,j = (1/ημ) (∂μ/∂θi)ᵀ Σ(θ)⁻¹ (∂μ/∂θj) + (1/(2ηΣ)) tr(Σ(θ)⁻¹ (∂Σ/∂θi) Σ(θ)⁻¹ (∂Σ/∂θj)).

All of the remainder of this section is simply a rewriting of the work in Section 2 with the twisted Fisher metric instead of the regular Fisher metric. We will use the term “twisted geodesic” instead of “geodesic for the twisted metric”.
This approach seems to be somewhat arbitrary: arguably, the mean and the covariance play a “different role” in the definition of a Gaussian (only the covariance can affect diversity, for example), but we lack a reasonable intrinsic characterization that would make this choice of twisting more natural. This construction can be slightly generalized (see the Appendix).
The IGO flow and the IGO algorithms can be modified to take into account the twisting of the metric. The (ημ, ηΣ)-twisted IGO flow is defined exactly as the IGO flow (9), the only difference being that I⁻¹(θ) has been replaced by I(ημ,ηΣ)⁻¹(θ). This leads us to the twisted IGO algorithms.
Definition 12. The (ημ, ηΣ)-twisted IGO algorithm associated with parametrization θ, sample size N, step size δt and selection scheme w is given by the IGO update rule in which the Fisher metric is replaced by the (ημ, ηΣ)-twisted Fisher metric of Definition 11.

Definition 13. The (ημ, ηΣ)-twisted geodesic IGO algorithm associated with sample size N, step size δt and selection scheme w is given by the GIGO update rule in which the natural gradient and the geodesics are those of the (ημ, ηΣ)-twisted Fisher metric.

By definition, the twisted geodesic IGO algorithm does not depend on the parametrization (but it does depend on ημ and ηΣ).
There is some redundancy between δt, ημ and ηΣ: the only values actually appearing in the equations are δtημ and δtηΣ. More formally:
Proposition 9. Let k, d, N ∈ N, ημ, ηΣ, δt, λ1, λ2 ∈ ℝ and w : [0; 1] → ℝ.
The (ημ, ηΣ)-twisted IGO algorithm with sample size N, step size δt and selection scheme w coincides with the (λ1ημ, λ1ηΣ)-twisted IGO algorithm with sample size N, step size λ2δt and selection scheme w/(λ1λ2). The same is true for geodesic IGO.
In order to obtain the twisted algorithms, the Fisher metric in IGO has to be replaced by the metric from Definition 11. In practice, the equations found by twisting the metric are exactly the equations without twisting, except that we have “forced” the learning rates ημ, ηΣ to appear by multiplying the increments of μ and Σ by ημ and ηΣ.
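In the Gaussian case with the (μ, Σ) parametrization, this means that a twisted IGO step is obtained from the untwisted natural-gradient increments simply by rescaling them. The sketch below makes this explicit for pure rank-μ CMA-ES-type increments (an illustration under that reading; the function name is ours). Note that rescaling (ημ, ηΣ, δt, w) by (λ1, λ1, λ2, 1/(λ1λ2)) leaves the step unchanged, in line with Proposition 9.

```python
import numpy as np

def twisted_igo_step_gaussian(mu, Sigma, x_sorted, w, dt, eta_mu, eta_Sigma):
    """(eta_mu, eta_Sigma)-twisted IGO step in the (mu, Sigma) parametrization.

    x_sorted: samples already ranked by f-value; w: the corresponding weights.
    Twisting only rescales the mu- and Sigma-increments by eta_mu and eta_Sigma.
    """
    y = x_sorted - mu
    inc_mu = sum(wi * yi for wi, yi in zip(w, y))
    inc_Sigma = sum(wi * (np.outer(yi, yi) - Sigma) for wi, yi in zip(w, y))
    return mu + dt * eta_mu * inc_mu, Sigma + dt * eta_Sigma * inc_Sigma
```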
We can now describe pure rank-μ CMA-ES and xNES with separate learning rates as twisted IGO algorithms:
Proposition 10 (xNES as IGO).
The xNES algorithm with sample size N, weights wi and learning rates ημ and ησ = ηB = ηΣ coincides with the (ημ/δt, ηΣ/δt)-twisted IGO algorithm with sample size N, weights wi, step size δt and in which, given the current position (μt, At), the set of Gaussians is parametrized by (δ, M) as in Proposition 6, with δ ∈ ℝm and M ∈ Sym(ℝm). The parameters maintained by the algorithm are (μ, A), and the xi are sampled from N(μ, AAᵀ).
Proposition 11 (Pure rank-μ CMA-ES as IGO). The pure rank-μ CMA-ES algorithm with sample size N, weights wi and learning rates ημ and ηΣ coincides with the (ημ/δt, ηΣ/δt)-twisted IGO algorithm with sample size N, weights wi, step size δt and the parametrization (μ, Σ).
The proofs of these two statements are an easy rewriting of their non-twisted counterparts: one can return to the non-twisted metric (up to a ηΣ factor) by changing μ to √(ηΣ/ημ) μ.
We give the equations of the twisted geodesics of the Gaussian manifold in the Appendix.
6.3. Trajectories of Different IGO Steps
As we have seen, two different IGO algorithms (or an IGO algorithm and the GIGO algorithm) coincide at first order in δt when δt → 0. In this section, we study the differences between pure rank-μ CMA-ES, xNES and GIGO by looking at the second order in δt, and in particular, we show that xNES and GIGO do not coincide in the general case.
We view the updates done by one step of the algorithms as paths on the manifold of Gaussian distributions, from (μ(t), Σ(t)) to (μ(t + δt), Σ(t + δt)), where δt is the time step of our algorithms, seen as IGO algorithms. More formally:
Definition 14. (1) We call the GIGO update trajectory the application TGIGO sending (μ, Σ, vμ, vΣ) to the geodesic starting at (μ, Σ) with initial speed (ημvμ, ηΣvΣ), written with the exponential map exp of the Riemannian manifold. (2) We call the xNES update trajectory the application TxNES sending (μ, Σ, vμ, vΣ) to the curve followed by the xNES update with the same starting point and initial speed, with AAᵀ = Σ; this application does not depend on the choice of the square root A. (3) We call the CMA-ES update trajectory the application TCMA sending (μ, Σ, vμ, vΣ) to the curve followed by the pure rank-μ CMA-ES update with the same starting point and initial speed. These applications map tangent vectors of the manifold to curves in the manifold.
We will also use the following notation: μGIGO := ϕμ○TGIGO, μxNES := ϕμ○TxNES, μCMA := ϕμ○TCMA, ΣGIGO := ϕΣ ○ TGIGO, ΣxNES := ϕΣ ○ TxNES and ΣCMA := ϕΣ ○ TCMA, where ϕμ (resp. ϕΣ) extracts the μ-component (resp. the Σ-component) of a curve.
In particular, Im(ϕμ) ⊂ ℝd and Im(ϕΣ) ⊂ Pd, where Pd (the set of real symmetric positive-definite matrices of dimension d) is seen as a subset of ℝd².
For instance, TGIGO(μ, Σ, vμ, vΣ)(δt) gives the position (mean and covariance matrix) of the GIGO algorithm after a step of size δt, while μGIGO and ΣGIGO give, respectively, the mean component and the covariance component of this position.
This formulation ensures that the trajectories we are comparing have the same initial position and the same initial speed, which is the case provided the sampled points (the values directly sampled from N(0, Id), not from N(μ, Σ) and then transformed) are the same.
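The sketch below compares the three update trajectories in dimension one, in the untwisted case ημ = ηΣ = 1. It assumes that the CMA-ES trajectory is affine in (μ, Σ), that the xNES trajectory is multiplicative in Σ, and that the GIGO trajectory can be obtained by numerically integrating the geodesic equations of the Fisher metric on Gaussians, μ″ = Σ′Σ⁻¹μ′ and Σ″ = Σ′Σ⁻¹Σ′ − μ′μ′ᵀ, specialized to d = 1.

```python
import numpy as np
from scipy.integrate import solve_ivp

def T_cma(mu, s, v_mu, v_s, t):       # additive update of (mu, sigma^2)
    return mu + t * v_mu, s + t * v_s

def T_xnes(mu, s, v_mu, v_s, t):      # multiplicative update of sigma^2
    return mu + t * v_mu, s * np.exp(t * v_s / s)

def T_gigo(mu, s, v_mu, v_s, t):      # numerical integration of the geodesic ODE
    def rhs(_, y):
        m, sig2, dm, ds = y
        return [dm, ds, ds * dm / sig2, ds * ds / sig2 - dm * dm]
    sol = solve_ivp(rhs, (0.0, t), [mu, s, v_mu, v_s], rtol=1e-10, atol=1e-12)
    return sol.y[0, -1], sol.y[1, -1]

mu, s, v_mu, v_s, dt = 0.0, 1.0, 1.0, 0.5, 0.3
for T in (T_cma, T_xnes, T_gigo):
    print(T.__name__, T(mu, s, v_mu, v_s, dt))
# The three trajectories agree at first order in dt and differ at second order.
```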
Different IGO algorithms coincide at first order in δt. The following proposition gives the second order expansion of the trajectories of the algorithms.
Proposition 12 (Second derivatives of the trajectories).
We have:

μxNES(μ, Σ, vμ, vΣ)″(0) = μCMA(μ, Σ, vμ, vΣ)″(0) = 0,
ΣCMA(μ, Σ, vμ, vΣ)″(0) = 0,
ΣxNES(μ, Σ, vμ, vΣ)″(0) = ηΣ² vΣ Σ⁻¹ vΣ,
μGIGO(μ, Σ, vμ, vΣ)″(0) = ημηΣ vΣ Σ⁻¹ vμ,
ΣGIGO(μ, Σ, vμ, vΣ)″(0) = ηΣ² vΣ Σ⁻¹ vΣ − ημηΣ vμ vμᵀ.

Proof. We can immediately see that the second derivatives of μxNES, μCMA and ΣCMA are zero. Next, we have ΣxNES(μ, Σ, vμ, vΣ)(t) = A exp(tηΣ A⁻¹vΣA⁻ᵀ) Aᵀ, with AAᵀ = Σ; the expression of ΣxNES(μ, Σ, vμ, vΣ)″(0) follows by differentiating twice at t = 0. Now, for GIGO, let us consider the geodesic starting at (μ0, Σ0) with initial speed (ημvμ, ηΣvΣ). By writing Jμ(0) = Jμ(t), we find Σ(t)⁻¹μ′(t) = Σ0⁻¹μ′(0). We then easily have μ″(t) = Σ′(t)Σ0⁻¹μ′(0); in other words, μGIGO(μ, Σ, vμ, vΣ)″(0) = ημηΣ vΣ Σ0⁻¹ vμ. Finally, by using Theorem 4 and differentiating, we find the expression of ΣGIGO(μ, Σ, vμ, vΣ)″(0). □
In order to interpret these results, we will look at what happens in dimension one. In higher dimensions, we can suppose that the algorithms exhibit a similar behavior, but an exact interpretation is more difficult for GIGO.
In [19], it has been noted that xNES converges to quadratic minima slower than CMA-ES and that it is less subject to premature convergence. That fact can be explained by observing that the mean update is exactly the same for CMA-ES and xNES, whereas xNES tends to have a higher variance (Proposition 12 shows this at order two, and it is easy to see that in dimension one, for any μ, Σ, vμ, vΣ, we have ΣxNES(μ, Σ, vμ, vΣ) > ΣCMA(μ, Σ, vμ, vΣ)).
At order two, GIGO moves the mean faster than xNES and CMA-ES if the standard deviation is increasing and more slowly if it is decreasing. This seems to be a reasonable behavior (if the covariance is decreasing, then the algorithm is presumably close to a minimum, and it should not leave the area too quickly). This remark holds only for isolated steps, because we do not take into account the evolution of the variance.
The geodesics of the one-dimensional Gaussian manifold are half-circles (see Figure 2 below; we recall that this manifold is the Poincaré half-plane). Consequently, if the mean is supposed to move (which always happens), then σ → 0 when δt → ∞. For example, a step whose initial speed has no component on the standard deviation will always decrease it. See also Proposition 15, about the optimization of a linear function.
For the same reason, for a given initial speed, the update of μ always stays bounded as a function of δt: it is not possible to make one step of the GIGO algorithm go further than a fixed point by increasing δt. Still, the geodesic followed by GIGO changes at each step, so the mean of the overall algorithm is not bounded.
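These observations can be checked numerically by following a single geodesic for larger and larger δt; the sketch below (reusing the one-dimensional geodesic equations assumed above, with an initial speed that has no component on the variance) shows σ² decreasing towards 0 while μ stays bounded.

```python
import numpy as np
from scipy.integrate import solve_ivp

def geodesic_1d(mu0, s0, v_mu, v_s, t):
    """Follow the 1D geodesic (mu, sigma^2) for a time t."""
    rhs = lambda _, y: [y[2], y[3], y[3] * y[2] / y[1], y[3] ** 2 / y[1] - y[2] ** 2]
    sol = solve_ivp(rhs, (0.0, t), [mu0, s0, v_mu, v_s], rtol=1e-10, atol=1e-14)
    return sol.y[0, -1], sol.y[1, -1]

for t in (0.5, 2.0, 5.0, 10.0):
    m, s = geodesic_1d(0.0, 1.0, 1.0, 0.0, t)   # initial speed: no sigma-component
    print(f"dt = {t:4.1f}   mu = {m:.4f}   sigma^2 = {s:.3e}")
# mu stays bounded while sigma^2 decreases towards 0 as dt grows.
```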
We now show that xNES follows the geodesics of the Gaussian manifold if the mean is fixed, but that xNES and GIGO do not coincide otherwise.
Proposition 13 (xNES is not GIGO in the general case). Let μ, vμ ∈ ℝd, A ∈ GLd, vΣ ∈ Md. Then, the GIGO and xNES updates starting at N(μ, AAᵀ) with initial speeds vμ and vΣ follow the same trajectory if and only if the mean remains constant; in other words, TGIGO(μ, Σ, vμ, vΣ) = TxNES(μ, Σ, vμ, vΣ) if and only if vμ = 0.

Proof. If vμ = 0, then we can compute the GIGO update by using Theorem 4: since Jμ = 0, the mean satisfies μ′(t) = 0 for all t, and μ remains constant. Now, the invariant JΣ is determined by the initial speed; this is enough information to compute the update. Since this quantity is also preserved by the xNES algorithm (see, for example, the proof of Proposition 14), the two updates coincide.
Figure 2. One step of the geodesic IGO (GIGO) update.
If vμ ≠ 0, then by Proposition 12, ΣGIGO(μ, Σ, vμ, vΣ)″(0) and ΣxNES(μ, Σ, vμ, vΣ)″(0) differ by −ημηΣ vμvμᵀ ≠ 0 and, in particular, TGIGO(μ, Σ, vμ, vΣ) ≠ TxNES(μ, Σ, vμ, vΣ). □
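A quick numerical illustration of this proposition in dimension one, under the same assumptions as the previous sketches (geodesics integrated from the one-dimensional Fisher geodesic equations, xNES trajectory multiplicative in the variance): with vμ = 0 the two trajectories coincide, while with vμ ≠ 0 they visibly differ.

```python
import numpy as np
from scipy.integrate import solve_ivp

def gigo_traj(mu, s, v_mu, v_s, t):
    rhs = lambda _, y: [y[2], y[3], y[3] * y[2] / y[1], y[3] ** 2 / y[1] - y[2] ** 2]
    sol = solve_ivp(rhs, (0.0, t), [mu, s, v_mu, v_s], rtol=1e-11, atol=1e-13)
    return sol.y[0, -1], sol.y[1, -1]

def xnes_traj(mu, s, v_mu, v_s, t):
    return mu + t * v_mu, s * np.exp(t * v_s / s)

t = 0.7
print(gigo_traj(0.0, 1.0, 0.0, 0.8, t), xnes_traj(0.0, 1.0, 0.0, 0.8, t))  # coincide
print(gigo_traj(0.0, 1.0, 1.0, 0.8, t), xnes_traj(0.0, 1.0, 1.0, 0.8, t))  # differ
```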
6.4. Blockwise GIGO
Although xNES is not GIGO, it is possible to define a family of algorithms extending GIGO and including xNES, by decomposing our family of probability distributions as a product and by following the restricted geodesics simultaneously.
Definition 15 (Splitting). Let Θ be a Riemannian manifold. A splitting of Θ is n manifolds Θ1, …, Θn and a diffeomorphism Θ ≅ Θ1 × … × Θn. If, for all x ∈ Θ and all 1 ≤ i < j ≤ n, we also have Ti,xΘ ⊥ Tj,xΘ as subspaces of TxΘ (see Notation 2), then the splitting is said to be compatible with the Riemannian structure. If the Riemannian manifold is not ambiguous, we will simply write a “compatible splitting”.
We now give some notation, and we define the blockwise GIGO update:
Notation 2. Let Θ be a Riemannian manifold, Θ1, …, Θn a splitting of Θ, θ = (θ1, …, θn) ∈ Θ, Y ∈ TθΘ and 1 ≤ i ≤ n. We denote by Θθ,i the submanifold {θ1} × … × {θi−1} × Θi × {θi+1} × … × {θn} of Θ, by Ti,θΘ its tangent space at θ, and by Yi the component of Y along Ti,θΘ.

Definition 16 (Blockwise GIGO update).
Let Θ1, …, Θn be a compatible splitting. The blockwise GIGO algorithm in Θ with splitting Θ1, …, Θn associated with sample size N, step sizes δt1, …, δtn and selection scheme w is given by the following update rule: each block is updated by following, for a time δtk, the geodesic of Θθt,k starting at θt with initial speed Yk, where Y is the natural gradient used in the GIGO update (of the function defined in Section 2.2) and Yk is the TΘθt,k-component of Y. This update only depends on the splitting (and not on the parametrization inside each Θk).
The compatibility condition ensures that the natural gradient of the function defined in Section 2.2 in the whole manifold Θ really is the sum of the gradients of this same function in the submanifolds Θk. A practical consequence is that the Yk in Equation (62) can be computed simply by taking the natural gradient in Θk, with Ik the metric of Θk.
Since blockwise GIGO only depends on the splitting (and the tunable parameters: sample size, step sizes and selection scheme), it can be thought of as almost parametrization-invariant.
Notice that blockwise GIGO updates and twisted GIGO updates are two different things: firstly, blockwise GIGO can be defined on any manifold with a compatible splitting, whereas twisted GIGO (and twisted IGO) are only defined for Gaussians. However, even for Gaussians, with the splitting (μ, Σ), these two algorithms are different: for instance, if ημ = ηΣ and δt = 1, then the twisted GIGO is the regular GIGO algorithm, whereas blockwise GIGO is not (actually, we will prove that it is the xNES algorithm). The only thing blockwise GIGO and twisted GIGO have in common is that they are compatible with the (ημ, ηΣ)-twisted IGO flow Equation (57): a parameter θt following these updates with δt → 0 and N → ∞ is a solution of Equation (57).
We now have a new description of the xNES algorithm:
Proposition 14 (xNES is a Blockwise GIGO algorithm). The Blockwise GIGO algorithm on the Gaussian manifold with the splitting (μ, Σ), sample size N, step sizes δtμ, δtΣ and selection scheme w coincides with the xNES algorithm with sample size N, weights wi and learning rates ημ = δtμ, ησ = ηB = δtΣ.
Proof. Firstly, notice that the splitting (μ, Σ) is compatible, by Proposition 1.
Now, let us compute the Blockwise GIGO update: we have Θ1 = ℝd and Θ2 = Pd, where Pd is the space of real positive-definite matrices of dimension d. The two submanifolds through θt = (μt, Σt) are Θθt,1 = ℝd × {Σt} and Θθt,2 = {μt} × Pd. The induced metric on Θθt,1 is the Euclidean metric, so its geodesics are straight lines and the μ-block is updated as μ ← μ + δtμ Yμ. Since we have already shown (using the notation in Definition 9) that Yμ = AGμ (in the proof of Proposition 6), we find μ ← μ + δtμ AGμ, which is the xNES mean update with ημ = δtμ.
On Θθt,2, the Lagrangian for the geodesics is proportional to tr((Σ⁻¹Σ′)²). By applying Noether’s theorem, we find that JΣ := Σ⁻¹Σ′ is invariant along the geodesics of Θθt,2, so they are defined by the first-order equation Σ⁻¹Σ′ = constant (and therefore, any update preserving the invariant JΣ will satisfy this first-order differential equation and follow the geodesics of Θθt,2). The xNES update for the covariance matrix is given by A(t) = A0 exp(tGM/2). Therefore, we have Σ(t) = A0 exp(tGM) A0ᵀ, Σ′(t) = A0 GM exp(tGM) A0ᵀ and, finally, Σ(t)⁻¹Σ′(t) = A0⁻ᵀ GM A0ᵀ, which does not depend on t. Therefore, xNES preserves JΣ, and therefore, xNES follows the geodesics of Θθt,2 (notice that we had already proven this in Proposition 13, since we are looking at the geodesics of the Gaussian manifold with a fixed mean). Thus, with ημ = δtμ and ησ = ηB = δtΣ, the blockwise GIGO update coincides with the xNES update. □
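The proof translates into a very short update rule. The sketch below implements the blockwise GIGO step for the (μ, Σ) splitting directly from Gμ and GM (computed from ranked samples as in Definition 9); by the proposition, it is the xNES update with ημ = δtμ and ησ = ηB = δtΣ. The function name and signature are ours.

```python
import numpy as np
from scipy.linalg import expm

def blockwise_gigo_step(mu, A, z_sorted, w, dt_mu, dt_Sigma):
    """Blockwise GIGO step for the (mu, Sigma) splitting of the Gaussian family.

    z_sorted: the samples z_i = A^{-1}(x_i - mu), ranked by f(x_i).
    The mu-block follows a straight line (flat induced metric); the Sigma-block
    follows the fixed-mean geodesic Sigma(t) = A exp(t G_M) A^T.
    """
    d = len(mu)
    G_mu = sum(wi * zi for wi, zi in zip(w, z_sorted))
    G_M = sum(wi * (np.outer(zi, zi) - np.eye(d)) for wi, zi in zip(w, z_sorted))
    mu_new = mu + dt_mu * A @ G_mu                 # geodesic of the flat mu-block
    A_new = A @ expm(dt_Sigma * G_M / 2.0)         # square root of A exp(dt G_M) A^T
    return mu_new, A_new
```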
Although blockwise GIGO is somewhat “less natural” than GIGO, it can be easier to compute for some splittings (as we have just seen), and in the case of the Gaussian distributions, the mean-covariance splitting seems reasonable.