**About the Editors**

#### **Alexandru Agapie**

Alexandru Agapie is a Full Professor in the Department of Applied Mathematics at The Bucharest University of Economic Studies, Bucharest, Romania, and a Senior Researcher with the "Gheorghe Mihoc—Caius Iacob" Institute of Mathematical Statistics and Applied Mathematics of the Romanian Academy, Bucharest. He received his Lic. and Ph.D. degrees in Mathematics from the University of Bucharest, Romania, in 1993 and 2001, respectively. He has worked with the National Institute for Microtechnology, Bucharest, and the Fraunhofer Institute for Autonomous Intelligent Systems, Sankt Augustin, Germany. His fields of interest are stochastic processes applied to evolutionary algorithms and to socio-economic models. Dr. Agapie was a DAAD Research Fellow at TU Dortmund, Germany, in 2011, a NATO-COBASE Research Fellow at Montana University, Missoula, USA, in 2000 and received a Student Fellowship Award at the IEEE International Conference on Evolutionary Computation, Nagoya, Japan, in 1996. He has also been serving as the Scientific Secretary of the Romanian Society for Probability and Statistics since 2012, has been a Member of the European Cooperation in Science and Technology (COST) since 2009, and has been an Expert for the National Research Council, Romania, since 2011.

#### **Denis Enachescu**

Denis Enăchescu is currently a Ph.D. supervisor at the Computer Science Doctoral School and an emeritus professor at the University of Bucharest, Bucharest, Romania. He received his Lic. and Ph.D. degrees in Mathematics from the University of Bucharest, Romania, in 1975 and 1980, respectively. He was the Deputy Dean of the Faculty of Mathematics and Computer Science from 2008 to 2015. His research fields include simulation methods, particularly the Monte Carlo method; data mining methods, especially factorial methods, cluster and discriminant analysis; artificial intelligence, neural networks and support vector machines for supervised and unsupervised statistical learning; and biostatistics, especially statistical methods for bioavailability and bioequivalence. He was awarded the "Gheorghe Lazăr" Prize of the Romanian Academy in 1985. He has been a member of the International Association for Statistical Computing—IASC, a member of the Gesellschaft für Angewandte Mathematik und Mechanik—GAMM, and an elected ordinary member of the International Statistical Institute—ISI. He is also a founding member of the Romanian Society of Probability and Mathematical Statistics, a member of the Association of Balkan Statisticians—ABS, a member of the International Biometric Society—IBS, and a member of the International Society for Clinical Biostatistics—ISCB.

#### **Vlad Stefan Barbu**

Vlad Stefan Barbu has been an Associate Professor at the Laboratory of Mathematics Raphaël Salem (LMRS) at the University of Rouen, Normandy, France, since 2006. He received his Ph.D. in Statistics from the University of Technology of Compiègne, France, in 2005. In 2017, he received his HDR (Habilitation to Conduct Research) in Statistics. His main research fields lie in Markov and hidden Markov processes, statistical inference for stochastic processes, parametric and nonparametric estimation, hypothesis testing, and stochastic methods in reliability and survival analysis. He is also a serving member of the French Statistical Society—SFdS (elected member of the Council of the group Reliability and Uncertainty), a member of the Romanian Society of Probability and Statistics—SPSR (elected scientific secretary), a member of the French Society of Applied and Industrial Mathematics—SMAI, a member of the Romanian Society of Applied and Industrial Mathematics—ROMAI (elected vice-president), a member of the Society for International Stochastic Modeling Techniques and Data Analysis—SMTDA, and a member of the Institute of Complex Systems in Normandy—ISCN.

#### **Bogdan Iftimie**

Bogdan Iftimie is a Full Professor in the Department of Applied Mathematics and the Vice Dean of the Faculty of Economic Cybernetics, Statistics and Informatics at the Bucharest University of Economic Studies, Bucharest, Romania. He received his Ph.D. in mathematics from the Institute of Mathematics of the Romanian Academy in 2001. His research interests include stochastic processes in finance, stochastic partial differential equations, and financial models with advanced or delayed information.

**Alexandru Agapie**


**Abstract:** Performance of evolutionary algorithms in real space is evaluated by local measures such as success probability and expected progress. In high-dimensional landscapes, most algorithms rely on the normal multi-variate, easy to assemble from independent, identically distributed components. This paper analyzes a different distribution, also spherical, yet with dependent components and compact support: uniform in the sphere. Under a simple setting of the parameters, two algorithms are compared on a quadratic fitness function. The success probability and the expected progress of the algorithm with uniform distribution are proved to dominate their normal mutation counterparts by order *n*!!.

**Keywords:** probabilistic optimization; spherical distribution; multi-variate calculus; hypergeometric functions; transition kernel

#### **1. Introduction**

Probabilistic algorithms are among the most popular optimization techniques due to their easy implementation and high efficiency. Their roots can be traced back to the first random walk problem, proposed by Pearson in 1905: "*A man starts from a point O, and walks ℓ yards in a straight line; he then turns through any angle whatever, and walks another ℓ yards in a second straight line. He repeats this process n times. I require the probability that after these n stretches he is at a distance between r and r* + *dr from his starting point O.*" [1–4].

Using the ability of computers to generate and store large samples from multi-variate distributions, physicists and engineers have transformed the original random walk into a powerful optimization tool. Probabilistic algorithms do not require additional information on the fitness function, they simply *generate* potential candidate solutions, *select* the best, and move on.

Sharing the same random generator, yet differing with respect to the selection phase, two classes of probabilistic algorithms became the most popular over the last decades: *simulated annealing* (also known as the Metropolis–Hastings algorithm) [5] and *evolutionary algorithms* (EAs) [6,7]. Only the latter will be discussed in this paper.

EAs are assessed by local quantities, such as success probability and progress rate, and by global measures, like the expected convergence time. Performance depends both on the fitness landscape and on the particular probabilistic scheme (leading to a probability distribution) involved in the generation-selection mechanism. A popular test problem consists of minimizing the quadratic *SPHERE* function (to avoid confusion, we use uppercase for the fitness function, and lowercase for the uniform distribution in/on the sphere), with optimum at the origin.

$$\mathcal{F}: \mathfrak{R}^n \to \mathfrak{R} \qquad \mathcal{F}(\mathfrak{x}\_1, \dots, \mathfrak{x}\_n) = \sum\_{i=1}^n \mathfrak{x}\_i^2. \tag{1}$$


An elitist (that is, always keeping the best solution found so far), one-individual, mutation+selection EA is depicted below (Algorithm 1).

#### **Algorithm 1** An elitist, one-individual, mutation+selection EA.

Set *t* = 0 and the initial point of the algorithm, **x**<sup>0</sup>
Repeat
- *t* := *t* + 1
- *Mutation*: generate a new point **x** ∈ ℜ<sup>*n*</sup> according to a multi-variate distribution
- *Selection*: if F(**x**) < F(**x**<sub>*t*−1</sub>) then **x**<sub>*t*</sub> := **x**, else **x**<sub>*t*</sub> := **x**<sub>*t*−1</sub>
Until *t* = *t*<sub>max</sub>, for some fixed *t*<sub>max</sub>
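For concreteness, here is a minimal Python sketch of Algorithm 1; it is an illustration under our own naming conventions (`sphere`, `elitist_ea`, and the mutation callback are hypothetical helpers), not code from the paper.

```python
import numpy as np

def sphere(x):
    # Fitness (1): F(x1, ..., xn) = sum of squares; the minimum is at the origin.
    return float(np.sum(x**2))

def elitist_ea(x0, mutate, t_max=10_000, seed=0):
    # Algorithm 1: elitist, one-individual, mutation+selection EA.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(t_max):
        y = x + mutate(len(x), rng)      # mutation: propose a new point
        if sphere(y) < sphere(x):        # selection: keep the best point so far
            x = y
    return x

# Example run with standard normal mutation, the classical choice discussed below.
x_best = elitist_ea(np.full(5, 2.0), lambda n, rng: rng.standard_normal(n))
```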

The region of success of Algorithm 1 is

$$R\_x^S = \{ y \in \mathfrak{R}^n | \mathcal{F}(y) < \mathcal{F}(x) \}. \tag{2}$$

A rigorous description of the long-term behavior of EAs involves renewal processes, drift analysis, order statistics, martingales, or other stochastic processes [7–13]. However, the basic structure is that of a Markov chain, as the algorithm's state at the next iteration depends only on its current state. Difficulties occur when, even for simple fitness functions, SPHERE included, the actual position of the algorithm significantly affects the local behavior, so that the process lacks homogeneity; hence the search for powerful mathematical tools able to describe the transition kernel, which encapsulates the local probabilistic structure of the algorithm [6,7].

For any fitness function F and fixed point *x* ∈ ℜ<sup>*n*</sup>, assumed as the current state, the transition kernel provides the probability for the algorithm to be in a set *A* ⊂ ℜ<sup>*n*</sup> at the next step. Even if the mutation distribution has a probability density function (pdf), a discontinuity occurs due to the disruptive effect of elitist selection. To make this clear, let us denote the mutation random variable by **Y**, its pdf by *f*, and its cumulative distribution function (cdf) by *F*. The singular (Dirac) distribution, which loads only one point, is denoted *δ*. The index *x* designates conditioning on the current state. The finite-time behavior of the algorithm is inscribed in the random variable **Z**<sub>*x*</sub>, the next state of the algorithm given that the current state is *x*:

$$\mathbf{Z}\_{\mathbf{x}}(\omega) = \begin{cases} \mathbf{Y}\_{\mathbf{x}}(\omega), & \mathbf{Y}\_{\mathbf{x}}(\omega) \in R\_x^S \\ \mathbf{x}, & \mathbf{Y}\_{\mathbf{x}}(\omega) \in \mathfrak{R}^n \backslash R\_x^S \end{cases}. \tag{3}$$

while the (Markov) *transition kernel* carries the local transition probabilities

$$P\_{x}(A) = \int\_{A \cap R\_x^S} f(y) dy + \left[ 1 - \int\_{R\_x^S} f(y) dy \right] \cdot \delta\_{x}(A). \tag{4}$$

As the mutation distribution is entirely responsible for the evolution of the algorithm, let us take a look at the possible candidates, the multi-variate distributions.

Let **x** = (**x**<sub>**1**</sub>, ... , **x**<sub>**n**</sub>) denote some *n*-dimensional random variable, and let the Euclidean norm in ℜ<sup>*n*</sup> be $||x|| = (x^\top x)^{1/2} = \left(\sum\_{i=1}^n x\_i^2\right)^{1/2}$. The *n*-dimensional sphere of radius 1, its surface, and its volume are given by [14]

$$\mathcal{S}\_n = \{ \mathbf{x} \in \mathfrak{R}^n \mid ||\mathbf{x}|| \le 1 \}, \quad \delta \mathcal{S}\_n = \{ \mathbf{x} \in \mathfrak{R}^n \mid ||\mathbf{x}|| = 1 \}, \quad V\_n = \frac{2\pi^{\frac{n}{2}}}{n\Gamma\_{\frac{n}{2}}}.\tag{5}$$

We use bold face for (single or multi-variate) random variables, and normal face for real numbers (or vectors). When partitioning the *n* dimensions into two sets {1, ... , *m*} and {*m* + 1, ... , *n*} with 1 ≤ *m* < *n*, we use the compact notation *x* = (*x*<sub>(1)</sub>, *x*<sub>(2)</sub>) = ((*x*<sub>1</sub>, ... , *x*<sub>*m*</sub>), (*x*<sub>*m*+1</sub>, ... , *x*<sub>*n*</sub>)), for either vectors or random variables. Unless otherwise stated, **Beta**<sub>*a*,*b*</sub> denotes the beta random variable with support (0, 1) and parameters *a*, *b*, while *β* and Γ stand for the corresponding coefficients. The (one-dimensional) uniform random variable with support (*a*, *b*) is denoted **U**<sub>(*a*,*b*)</sub>. The sign $\stackrel{d}{=}$ denotes two random variables with identical cdf.

The class of spherical distributions can be defined in a number of equivalent ways, two of which are depicted below ([15], pp. 30, 35) (See the excellent monograph of Fang et al. [15] for an exhaustive introduction to spherical distributions.):

#### **Definition 1.**

• *An n-dimensional random variable* **x** *is said to have spherically symmetric distribution* (or simply spherical distribution) *if*

$$\mathbf{x} \stackrel{d}{=} \mathbf{r} \cdot \mathbf{u}^{\mathbf{n}} \tag{6}$$

*for some one-dimensional random variable (radius)* **r***, and the uniform distribution on the unit sphere* **u<sup>n</sup>***. Moreover,* **r** *and* **u<sup>n</sup>** *are independent, and also*

$$\mathbf{r} = ||\mathbf{x}|| \ge 0, \qquad \qquad \mathbf{u}^{\mathbf{n}} \stackrel{d}{=} \frac{\mathbf{x}}{||\mathbf{x}||}. \tag{7}$$

• *If the spherical distribution has pdf g, then g depends on x only through the norm, g*(*x*) = *g*(||*x*||<sup>2</sup>)*, and there is a special connection between g and f , the pdf of* **r***, namely,*

$$f(r) = \frac{2\pi^{n/2}}{\Gamma\_{\frac{n}{2}}} r^{n-1} g(r^2). \tag{8}$$

Three spherical distributions are of particular interest in our analysis: the uniform on the unit sphere **u<sup>n</sup>**, the standard normal multi-variate (NORMAL), and the uniform in the unit sphere (UNIFORM).


A comparison of the previous distributions can be performed from many angles. NORMAL was the first to be discovered, applied, and thoroughly analyzed in statistics, being the only spherical distribution with independent and identically distributed marginals. By contrast, the components of the uniform distributions on/in the sphere are neither independent nor uniform (however, the conditional marginals are uniform, see Theorem 5). Recently, scientific interest has shifted to uniform multi-variate distributions, following the increasing application of directional statistics to earth sciences and quantum mechanics [16,17], and also the application of the Dirichlet distribution (which lies at the basis of spherical analysis) to Bayesian inference, involved in medicine, genetics, and text mining [18].

From the computer science point of view, uniform and normal distributions share an entangled history in at least two areas: random number generators and probabilistic optimization algorithms. With respect to the first area, early approaches to sampling from the uniform distribution on the sphere actually used a multi-normal random generator to produce a sample *x*, which was further divided by ||*x*||, following Equation (7). Nowadays, the situation has changed, with the appearance of a new class of algorithms which circumvent the normal generator by using properties of marginal uniform distributions on/in spheres [17,19]. A comparison of mean computation times shows that the uniform sampling method outperforms the normal generator for dimensions up to *n* = 7 [20].

Concerning global optimization, the probabilistic algorithms based on the two distribution types evolved at the same time, although with few overlaps. In the theory and practice of real-space (continuous) EAs, with the *evolution strategy* as their most popular representative, the normal distribution has played the central role from the beginning. Therefore, there is a large amount of literature stressing the advantages of this distribution [6,21,22].

Occasionally, the supremacy of the normal operator has been challenged by theoretical studies that proposed different mutation distributions, such as uniform in the cube (which is non-spherical) [10], uniform *on* the sphere [7,23], uniform *in* the sphere [9], or even the Cauchy distribution [24]. An attempt to solve the problem globally, by considering the whole class of *spherical* (*isotropic*) distributions, was made in [25,26]. These approaches yielded only limited results (valid either for small space dimension *n* or for *n* → ∞), or loose lower/upper bounds for the expected progress of the algorithm.

The study of EAs with uniform distribution in the sphere recently culminated in two systematic studies, one for RIDGE [27], the other for SPHERE [28], comparable to the classical theory of evolution strategies [6,29]. Under a carefully constructed normalization of the mutation parameters (equalizing the expectations of the normal and uniform multi-variates as *n* → ∞), those studies demonstrate the same behavior for the respective EA variants. Intuitively, the explanation is that, for large dimensions, both the normal and the uniform distribution concentrate *on* the surface of the sphere. The present paper differs from the previous analyses in that it does not apply any normalization of parameters. As a consequence, the results differ from those in [28], and an actual comparison between the two algorithms can be achieved.

Section 2 discusses the general framework of spherical multi-variate distributions, with special focus on uniform and normal. Then, two algorithms, one with uniform mutation and the other with normal mutation, are compared on the SPHERE fitness function with respect to their local performance in Section 3.

#### **2. Materials and Methods. Spherical Distributions**

In light of Definition 1, the spherical distributions are very much alike. They all exhibit the stochastic representation (6); that is, each can be generated as a product of two independent distributions: the *n*-dimensional uniform on the sphere **u<sup>n</sup>** and some scalar, positive random variable **r**. As the distribution of **r** makes the whole difference, we point out the form of this random variable in the three cases of interest.

• For **u<sup>n</sup>**, **r** is obviously the Dirac distribution at 1:

$$
\delta\_1(\mathbf{x}) = 1, \quad \mathbf{x} = 1. \tag{9}
$$

• For NORMAL, **r** is the *χ* distribution with *n* degrees of freedom, with pdf ([7], p. 20):

$$f(\mathbf{x}) = \frac{1}{2^{\frac{n}{2} - 1} \Gamma\_{\frac{n}{2}}} e^{-\frac{\mathbf{x}^{2}}{2}} \mathbf{x}^{n-1}, \quad \mathbf{x} \in (0, \infty). \tag{10}$$

• For UNIFORM, **r** is distributed **Beta**<sub>*n*,1</sub>, with pdf ([15], p. 75):

$$f(\mathbf{x}) = n\mathbf{x}^{n-1}, \quad \mathbf{x} \in (0, 1). \tag{11}$$
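The stochastic representation (6) translates directly into samplers. Below is a sketch (the helper names are our own) pairing each of the three radii with **u<sup>n</sup>** generated by normalizing a Gaussian vector, as in Equation (7).

```python
import numpy as np

def uniform_on_sphere(n, rng):
    # u^n via Equation (7): normalize a standard normal vector.
    g = rng.standard_normal(n)
    return g / np.linalg.norm(g)

def sample_spherical(n, rng, kind):
    # Stochastic representation (6): x = r * u^n, with r independent of u^n.
    u = uniform_on_sphere(n, rng)
    if kind == "on_sphere":              # r is the Dirac distribution at 1, Eq. (9)
        r = 1.0
    elif kind == "normal":               # r is chi with n degrees of freedom, Eq. (10)
        r = np.sqrt(rng.chisquare(n))
    else:                                # "uniform_in_sphere": r is Beta(n, 1), Eq. (11)
        r = rng.beta(n, 1.0)
    return r * u
```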

Using as primary source the monograph [15], we next discuss in more detail the stochastic properties of the UNIFORM and NORMAL multi-variates, the two candidates for the mutation operator of the algorithm.

#### *2.1. Uniform in the Sphere*

The local analysis of the EA is based on two particular marginal distributions: the first component **x1**, and the joint marginal of the remaining *n* − 1 components, **x**(**2**). As already pointed out, the marginals of UNIFORM are not independent random variables, and we shall see that neither are they uniform. A general formula for the marginal density is provided in [15] (p. 75):

**Theorem 1.** *If* **x** = (**x**(**1**), **x**(**2**)) *is uniformly distributed in the unit sphere, with* **x**(**1**) *of dimension k,* 1 ≤ *k* < *n, then the marginal density of* **x**(**1**) *is*

$$f(\mathbf{x}\_{(1)}) = \frac{\pi^{-\frac{k}{2}}\Gamma\_{\frac{n+2}{2}}}{\Gamma\_{\frac{n-k+2}{2}}} \left(1 - ||\mathbf{x}\_{(1)}||^2\right)^{\frac{n-k}{2}}, \quad ||\mathbf{x}\_{(1)}||^2 < 1. \tag{12}$$

**Corollary 1.** *The pdf of the first component of UNIFORM is*

$$f(\mathbf{x}) = \frac{1}{\beta\_{\frac{n+1}{2}, \frac{1}{2}}} \left(1 - \mathbf{x}^2\right)^{\frac{n-1}{2}}, \quad \mathbf{x} \in (-1, 1). \tag{13}$$

Using the symmetry with respect to the origin and substituting *x*<sup>2</sup> = *t* in the density (13), we obtain an interesting result, previously unreported in the spherical distributions literature.

**Corollary 2.** *The square of the first component of UNIFORM is* **Beta**<sub>1/2, (*n*+1)/2</sub>*, with pdf*

$$f(\mathbf{x}) = \frac{1}{\beta\_{\frac{1}{2}, \frac{n+1}{2}}} \, \mathbf{x}^{-\frac{1}{2}} \left(1 - \mathbf{x}\right)^{\frac{n-1}{2}}, \quad \mathbf{x} \in (0, 1). \tag{14}$$

The density of the last *n* − 1 components can also be derived from Theorem 1.

**Corollary 3.** *The joint pdf of the last n* − 1 *components of UNIFORM is*

$$f(\mathbf{x}\_{(2)}) = n\pi^{-\frac{n}{2}}\Gamma\_{\frac{n}{2}}\left(1 - ||\mathbf{x}\_{(2)}||^2\right)^{\frac{1}{2}}, \quad ||\mathbf{x}\_{(2)}||^2 < 1. \tag{15}$$

As a function of several variables, formula (15) might not look very appealing; however, a basic result from spherical distribution theory transforms the corresponding multiple integral into a scalar one ([15], p. 23).

**Theorem 2** (Dimension reduction)**.**

$$\int f\left(\sum\_{i=1}^{m} x\_i^2\right) dx\_1 \cdots dx\_m = \frac{\pi^{\frac{m}{2}}}{\Gamma\_{\frac{m}{2}}} \int\_0^\infty y^{\frac{m}{2}-1} f(y) dy. \tag{16}$$

One can see now that, if **x** is uniformly distributed in the unit sphere, the sum of squares of the last *n* − 1 components is **Beta** distributed.

**Corollary 4.** *Let* **x** = (**x**<sub>**1**</sub>, **x**<sub>(**2**)</sub>) *be UNIFORM. Then, the one-dimensional random variable* ||**x**<sub>(2)</sub>||<sup>2</sup> *is* **Beta**<sub>(*n*−1)/2, 3/2</sub>*, with pdf*

$$f(\mathbf{x}) = \frac{1}{\beta\_{\frac{n-1}{2}, \frac{3}{2}}} \, \mathbf{x}^{\frac{n-3}{2}} \left(1 - \mathbf{x}\right)^{\frac{1}{2}}, \quad \mathbf{x} \in (0, 1). \tag{17}$$

**Proof.** Apply transformation (16) to (15), with *m* = *n* − 1 and *f*(*y*) = (1 − *y*)<sup>1/2</sup>.
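A quick Monte Carlo sanity check of Corollaries 2 and 4, assuming the sampling recipe sketched above: the empirical means of **x**<sub>1</sub><sup>2</sup> and ||**x**<sub>(2)</sub>||<sup>2</sup> should approach the Beta means 1/(*n*+2) and (*n*−1)/(*n*+2).

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 5, 500_000
g = rng.standard_normal((N, n))
u = g / np.linalg.norm(g, axis=1, keepdims=True)       # uniform on the sphere
x = rng.beta(n, 1.0, size=(N, 1)) * u                  # UNIFORM, via Eq. (6)

# Corollary 2: E[x_1^2] = 1/(n+2); Corollary 4: E[||x_(2)||^2] = (n-1)/(n+2).
print((x[:, 0]**2).mean(), 1 / (n + 2))
print((np.linalg.norm(x[:, 1:], axis=1)**2).mean(), (n - 1) / (n + 2))
```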

As the components of the uniform distribution on/in the sphere are not independent, a better understanding of the nature of such distributions is provided by conditioning one component with respect to the others. In case of uniform distribution *on* the sphere, the work in [15] (p. 74) states that all the conditional marginals are also uniform *on* the sphere.

We shall see that a similar characterization holds true for the uniform *in* the sphere. This result is not presented in [15] nor, to the best of our knowledge, in any other reference on spherical distributions. Therefore, an additional theorem is needed ([30], p. 375).

**Theorem 3.** *Let* **x** $\stackrel{d}{=}$ **r** · **u<sup>n</sup>** *be a spherical distribution and* **x** = (**x**<sub>(**1**)</sub>, **x**<sub>(**2**)</sub>)*, where* **x**<sub>(**1**)</sub> *is m-dimensional,* 1 ≤ *m* < *n. Then the conditional distribution of* **x**<sub>(**1**)</sub> *given* **x**<sub>(**2**)</sub> = *h with* ||*h*|| = *a is given by*

$$\left(\mathbf{x}\_{(1)} \mid \mathbf{x}\_{(2)} = h\right) \stackrel{d}{=} \mathbf{r}\_{a^2} \cdot \mathbf{u}^{\mathbf{m}}.\tag{18}$$

*For each a* <sup>≥</sup> <sup>0</sup>*,* **<sup>r</sup>***a*<sup>2</sup> *and* **<sup>u</sup><sup>m</sup>** *are independent, and the cdf of* **<sup>r</sup>***a*<sup>2</sup> *is given by*

$$prob(\mathbf{r}\_{a^2} \le t) = \frac{\int\_a^{\sqrt{t^2 + a^2}} (r^2 - a^2)^{m/2 - 1} \, r^{-(n-2)} dF(r)}{\int\_a^{\infty} (r^2 - a^2)^{m/2 - 1} \, r^{-(n-2)} dF(r)},\tag{19}$$

*for t* ≥ 0, *a* > 0 *and F*(*a*) < 1*, F being the cdf of* **r***.*

We now prove the result on the conditional marginals of the uniform distribution *in* the unit sphere. As conditioning the first component with respect to all the others is the most relevant case for EA analysis, this particular case is stressed.

First, an old result from probability theory is needed, similar to the convolution operation, but for the *product* of two independent random variables [31].

**Theorem 4** (Mellin's formula)**.** *Let* **y** *and* **z** *be two independent, non-negative random variables, with densities g and h. Then,* **x** = **y** · **z** *has pdf*

$$f(\mathbf{x}) = \int\_0^\infty \frac{1}{z} \, g\left(\frac{\mathbf{x}}{z}\right) h(z) dz, \quad \mathbf{x} \in (0, \infty). \tag{20}$$

Note that Mellin's formula still holds, if only one of the random variables is continuous, the other being discrete, see, e.g., in [15] (p. 41).

#### **Theorem 5.**

*Let* **x** *be UNIFORM and* **x** = (**x**<sub>(**1**)</sub>, **x**<sub>(**2**)</sub>)*, where* **x**<sub>(**1**)</sub> *is m-dimensional,* 1 ≤ *m* < *n. Then the conditional distribution of* **x**<sub>(**1**)</sub> *given* **x**<sub>(**2**)</sub> = *h with* ||*h*|| = *a* < 1 *is UNIFORM in the m-dimensional sphere centered at the origin, with radius* $\sqrt{1-a^2}$*. In particular, for m* = 1*,*

$$\left(\mathbf{x}\_{\mathbf{1}} \mid \mathbf{x}\_{(\mathbf{2})} = h\right) \stackrel{d}{=} \mathbf{U}\_{\left(-\sqrt{1-a^2},\, \sqrt{1-a^2}\right)}. \tag{21}$$

**Proof.** We begin with the last part, case *m* = 1.

Equation (18) gives the conditional first component as a product of two independent random variables, the second being **u<sup>1</sup>**, the one-dimensional uniform on the unit sphere, that is, the discrete random variable that loads −1 and 1 with equal probability 1/2.

The cdf of the first random variable is given by (19), as a ratio of two integrals, both with respect to *F*, the cdf of **r**. In the case of UNIFORM, **r** is **Beta**<sub>*n*,1</sub> given by (11), thus *dF*(*r*) = *f*(*r*)*dr* = *nr*<sup>*n*−1</sup>*dr*.

If *t* ≥ $\sqrt{1-a^2}$, the upper and lower integrals in (19) are equal, so the probability is 1. If 0 ≤ *t* < $\sqrt{1-a^2}$, the upper integral is

$$n\int\_{a}^{\sqrt{t^2+a^2}} r(r^2-a^2)^{-1/2} dr = nt,$$

while the lower one is

$$n\int\_{a}^{1}r(r^2-a^2)^{-1/2}dr = n\sqrt{1-a^2}.$$

Thus, for any fixed *a* < 1, the cdf of the conditional radius is

$$prob(\mathbf{r}\_{a^2} \le t) = \frac{t}{\sqrt{1 - a^2}}, \quad t \in (0, \sqrt{1 - a^2}),$$

which identifies the conditional radius as uniform:

$$\mathbf{r}\_{a^2} \stackrel{d}{=} \mathbf{U}\_{\left(0, \sqrt{1-a^2}\right)}.$$

Returning to the application of Theorem 3, Equation (18): the conditional first component of UNIFORM is a product of two independent random variables, one continuous, the other discrete. This is the easy version of Mellin's formula, and the result is the continuous uniform random variable with support $(-\sqrt{1-a^2}, \sqrt{1-a^2})$ from Equation (21).

As for a larger dimension, *m* > 1, the cdf in (19) becomes

$$prob(\mathbf{r}\_{a^2} \le t) = \frac{t^m}{(1 - a^2)^{\frac{m}{2}}}, \quad t \in \left(0, \sqrt{1-a^2}\right),$$

with corresponding pdf

$$f(t) = \frac{mt^{m-1}}{(1-a^2)^{\frac{m}{2}}}, \quad t \in \left(0, \sqrt{1-a^2}\right),$$

which is **Beta***m*,1, yet with reduced support.

Summing up, Equation (18) provides the conditional marginal as a product of the uniform on the unit sphere **u<sup>m</sup>** and a rescaled **Beta**<sub>*m*,1</sub>. According to representation (6) and the radius pdf (11), the result is UNIFORM in *m* dimensions, centered at the origin, with reduced radius $\sqrt{1-a^2}$.
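Theorem 5 can also be checked numerically by rejection conditioning (a sketch; the tolerance `eps` is an arbitrary choice of ours): keep the samples with ||**x**<sub>(2)</sub>|| ≈ *a* and compare the conditional variance of the first component with that of **U**<sub>(−√(1−a²), √(1−a²))</sub>, namely (1 − *a*²)/3.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, a, eps = 3, 2_000_000, 0.6, 0.005
g = rng.standard_normal((N, n))
x = rng.beta(n, 1.0, size=(N, 1)) * g / np.linalg.norm(g, axis=1, keepdims=True)

# Keep samples with ||x_(2)|| close to a; by Theorem 5 the conditional first
# component is then near U(-sqrt(1-a^2), sqrt(1-a^2)), with variance (1-a^2)/3.
mask = np.abs(np.linalg.norm(x[:, 1:], axis=1) - a) < eps
print(x[mask, 0].var(), (1 - a**2) / 3)
```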

#### *2.2. Normal*

As we did with UNIFORM, we denote the first component of the standard normal multi-variate distribution by **x**<sub>**1**</sub>, and the remaining *n* − 1 components by **x**<sub>(**2**)</sub>. Due to the independence of the marginals, one can write a compact equivalent of Corollaries 1 and 3, see, e.g., in [6] (p. 54).

**Proposition 1.** *Let* **x** = (**x1**, **x**(**2**)) *be NORMAL.*

• *The pdf of the first component,* **x1***, is*

$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}, \quad x \in \mathfrak{R}.\tag{22}$$

• *The joint pdf of the last n* − 1 *components,* **x**(**2**)*, is*

$$f(\mathbf{x}\_{(2)}) = \frac{1}{\left(\sqrt{2\pi}\right)^{n-1}} e^{-\frac{||\mathbf{x}\_{(2)}||^2}{2}}, \quad \mathbf{x}\_{(2)} \in \mathfrak{R}^{n-1}.\tag{23}$$

Due to sphericity of the joint *n* − 1 components, one obtains again a compact form for the sum of squares.

**Corollary 5.** *The one-dimensional random variable* ||**x**<sub>(2)</sub>||<sup>2</sup> *is χ*<sup>2</sup> *with n* − 1 *degrees of freedom, with pdf*

$$f(x) = \frac{2^{-\frac{n-1}{2}}}{\Gamma\_{\frac{n-1}{2}}} x^{\frac{n-3}{2}} e^{-\frac{x}{2}}, \quad x \in (0, \infty). \tag{24}$$

#### **3. Results**

We restrict the study to the case of *P* (the current algorithm position) near *O* (the optimum of SPHERE) and analyze the local performance of two EAs, one with uniform, the other with normal mutation, in terms of success probability and expected progress. Namely, we assume *R* = |*OP*| ≤ 1/2, such that the success region *R*<sup>*S*</sup><sub>*P*</sub> is the sphere with center *O* and radius *R* (Figure 1).

**Figure 1.** Success region of Algorithm 1 with uniform mutation on SPHERE.

Re-set *P* as the origin of the coordinate system, and measure the algorithm's progress in the positive direction of the first axis: write *x* for *x*<sub>1</sub>, *h* for *x*<sub>(2)</sub>, and *u* for ||*h*||<sup>2</sup>. The success probability and the expected progress are provided by the first term of the transition kernel (4), respectively by the upper part of the random variable (3). They are governed by the continuous uniform mutation distribution with pdf *f*(*x*) = 1/*V*<sub>*n*</sub> and compact support, the sphere with center *P* and radius 1.

The calculus of the success probability and expected progress resides in integrating the random variable (3) over *R*<sup>*S*</sup><sub>*P*</sub>. For UNIFORM mutation, this calculus is analytically tractable. For NORMAL mutation, analytic integration is impossible, see [6] (p. 56) and Theorem 8, but the theory of incomplete Gamma functions makes the comparison tractable.

Note that, while the success probability bears only one possible definition (the probability mass of the success region), the situation is different with respect to the expected progress. As the random variable **Z**<sub>*x*</sub> from (3) characterizes the local behavior of the algorithm, one would naturally associate the expected progress with the expected value of this random variable. However, **Z**<sub>*x*</sub> is *n*-dimensional, and so is *E*(**Z**<sub>*x*</sub>), so there is a need to mediate somehow among the *n* components.

One could consider only the first component of the expected value, the one pointing towards the optimum, an approach applied on a different fitness landscape, the inclined plane [21]. (In yet another landscape, the RIDGE, it is customary to consider progress along the perpendicular component *h*, see [6,27]; for an inventory of fitness functions used in EA testing, the reader is referred to [32].) For UNIFORM mutation, a simplified version of the expected progress may be defined as the centroid of the corresponding success region [9,10]. However, a more traditional view is followed here ([6], p. 54):

$$\mathbf{progress} = R - \sqrt{(R - x)^2 + u}.\tag{25}$$

This corresponds to the difference in distance |*OP*| − |*OC*|, where **C** is a random point generated by mutation (Figure 1).

#### *3.1. Uniform Mutation*

If the UNIFORM mutation in the unit sphere with center *P* is applied, one cannot integrate using the product of the marginal densities from Corollaries 1 and 3, as the marginals of UNIFORM are not independent. Instead, one should use the conditional first component from Theorem 5, together with the joint *n* − 1 dimensional distribution from Corollary 3. The integration region is

$$\begin{cases} u = ||h||^2 \in \left(0, R^2\right) \\ x \in \left(R - \sqrt{R^2 - u}, \; R + \sqrt{R^2 - u}\right) \end{cases}$$

**Theorem 6.** *Let an EA with UNIFORM mutation minimizing the SPHERE be situated at current distance R from the origin, R* ∈ (0, 1/2)*. The* success probability *is*

$$prob^{U} = R^n. \tag{26}$$

**Proof.** The use of Equations (4), (15), (16) and (21) yields

$$\begin{split} prob^{U} &= \int\_{R\_P^S} f(x, h) \, dx \, dh \\ &= n\pi^{-\frac{n}{2}} \Gamma\_{\frac{n}{2}} \int\_{||h||^2 < R^2} \left(1 - ||h||^{2}\right)^{\frac{1}{2}} \cdot \frac{1}{2\sqrt{1 - ||h||^{2}}} \left( \int\_{R-\sqrt{R^{2} - ||h||^{2}}}^{R+\sqrt{R^{2} - ||h||^{2}}} 1 \, dx \right) dh \\ &= \frac{n\pi^{\frac{n-1}{2} - \frac{n}{2}} \Gamma\_{\frac{n}{2}}}{\Gamma\_{\frac{n-1}{2}}} \int\_{0}^{R^{2}} u^{\frac{n-1}{2} - 1} \sqrt{R^{2} - u} \, du = \frac{n\Gamma\_{\frac{n}{2}}}{\sqrt{\pi}\,\Gamma\_{\frac{n-1}{2}}} \int\_{0}^{R^{2}} u^{\frac{n-3}{2}} \sqrt{R^{2} - u} \, du = R^{n}. \end{split} \tag{27}$$
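A direct Monte Carlo illustration of Theorem 6 (a sketch; the values of *n* and *R* are arbitrary choices): mutate from a point *P* at distance *R* from the origin and count the fraction of candidates landing inside the success sphere.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N, R = 4, 1_000_000, 0.4
g = rng.standard_normal((N, n))
step = rng.beta(n, 1.0, size=(N, 1)) * g / np.linalg.norm(g, axis=1, keepdims=True)
y = np.array([R] + [0.0] * (n - 1)) + step        # uniform mutation from P, |OP| = R
print(np.mean(np.sum(y**2, axis=1) < R**2), R**n) # empirical vs. R^n = 0.0256
```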

**Theorem 7.** *Let an EA with UNIFORM mutation minimizing the SPHERE be situated at current distance R from the origin, R* ∈ (0, 1/2)*. The* expected progress *is*

$$\phi^{U} = \frac{R^{n+1}}{n+1}. \tag{28}$$

**Proof.** Following the proof of Theorem 6 and inserting factor (25), one gets

$$\begin{split} \phi^{U} &= \frac{1}{2} \frac{n\Gamma\_{\frac{n}{2}}}{\sqrt{\pi}\,\Gamma\_{\frac{n-1}{2}}} \int\_{u=0}^{R^{2}} u^{\frac{n-3}{2}} \int\_{R-\sqrt{R^{2}-u}}^{R+\sqrt{R^{2}-u}} \left(R - \sqrt{(R-x)^{2} + u}\right) dx \, du \\ &= C \int\_{u=0}^{R^{2}} u^{\frac{n-3}{2}} \int\_{R-\sqrt{R^{2}-u}}^{R+\sqrt{R^{2}-u}} R \, dx \, du - C \int\_{u=0}^{R^{2}} u^{\frac{n-3}{2}} \int\_{R-\sqrt{R^{2}-u}}^{R+\sqrt{R^{2}-u}} \sqrt{(R-x)^{2} + u} \, dx \, du \\ &= C\, I\_{1} - C\, I\_{2}, \end{split}$$

where *C* denotes the constant in front of the first double integral.

We treat separately *I*<sup>1</sup> and *I*2. *I*<sup>1</sup> is simply (27), multiplied by the constant *R*, thus

$$C\, I\_1 = R^{n+1}. \tag{29}$$

For *I*<sup>2</sup> one can apply formula ([33], p. 13)

$$\int (x^2 + a)^{\frac{1}{2}} dx = \frac{x}{2} (x^2 + a)^{\frac{1}{2}} + \frac{a}{2} \ln\left(x + (x^2 + a)^{\frac{1}{2}}\right),$$

in order to get

$$I\_2 = \int\_{u=0}^{R^2} u^{\frac{n-3}{2}} R \sqrt{R^2 - u} \, du + \int\_{u=0}^{R^2} u^{\frac{n-3}{2}} \frac{u}{2} \ln \frac{R + \sqrt{R^2 - u}}{R - \sqrt{R^2 - u}} \, du = I\_3 + I\_4. \tag{30}$$

Again, *I*<sup>3</sup> is the integral (27), multiplied by *R*/2, thus

$$C \, I\_3 = \frac{R^{n+1}}{2}. \tag{31}$$

The substitution *y* = *u*/*R*<sup>2</sup> on *I*<sup>4</sup> provides

$$I\_4 = \frac{R^{n+1}}{2} \int\_0^1 y^{\frac{n-1}{2}} \ln \frac{1 + \sqrt{1-y}}{1 - \sqrt{1-y}} \, dy$$

while partial integration gives

$$I\_4 = \frac{R^{n+1}}{n+1} \int\_0^1 y^{\frac{n+1}{2}} \cdot \frac{1}{y\sqrt{1-y}} \, dy = \frac{R^{n+1}}{n+1} \beta\_{\frac{n+1}{2}, \frac{1}{2}}.$$

Bringing in the constant *C*, one gets

$$C \, I\_4 = \frac{1}{2} \frac{n \Gamma\_{\frac{n}{2}}}{\sqrt{\pi} \Gamma\_{\frac{n-1}{2}}} \cdot \frac{R^{n+1}}{n+1} \beta\_{\frac{n+1}{2}, \frac{1}{2}} = \frac{1}{2} \frac{n-1}{n+1} R^{n+1}.\tag{32}$$

Summing up (29)–(32), one gets the desired result.
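The same simulation as for Theorem 6 also verifies Theorem 7 (a sketch), inserting the progress factor (25) and counting unsuccessful mutations as zero progress.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, R = 4, 2_000_000, 0.4
g = rng.standard_normal((N, n))
step = rng.beta(n, 1.0, size=(N, 1)) * g / np.linalg.norm(g, axis=1, keepdims=True)
dist = np.linalg.norm(np.array([R] + [0.0] * (n - 1)) + step, axis=1)  # |OC|
progress = np.where(dist < R, R - dist, 0.0)      # Eq. (25); zero under rejection
print(progress.mean(), R**(n + 1) / (n + 1))      # empirical vs. R^{n+1}/(n+1)
```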

The results from Theorems 6 and 7 are also presented in [28], yet with different proofs. Equations (26) and (28) point out a remarkable property of the EA with UNIFORM mutation on the SPHERE.

**Corollary 6.** *In the conditions of Theorems 6 and 7, the success probability is the derivative of the expected progress.*

#### *3.2. Normal Mutation*

Setting *σ* = 1 and avoiding the normalization *σ*<sup>∗</sup> = *σn*/*R*, one obtains the success probability and the expected progress for the EA with standard normal mutation by following closely the proof in ([6], pp. 54–56). The incomplete Gamma function is ([33], p. 260):

$$P(a,x) = \frac{1}{\Gamma\_a} \int\_0^x e^{-t} t^{a-1} dt. \tag{33}$$

The following expressions are not restricted to the case of the algorithm near the optimum, due to the unbounded support of the normal distribution. Unfortunately, closed-form integration is impossible.

**Theorem 8.** *Let an EA with NORMAL mutation minimizing the SPHERE be situated at current distance R from the origin.*

• *The* success probability *is*

$$prob^N = \frac{1}{\sqrt{2\pi}} \int\_{x=0}^{2R} e^{-\frac{x^2}{2}} P\left(\frac{n-1}{2}, Rx - \frac{x^2}{2}\right) dx. \tag{34}$$

• *The* expected progress *is*

$$\phi^{N} = \frac{2^{-\frac{n-1}{2}}}{\sqrt{2\pi}\,\Gamma\_{\frac{n-1}{2}}} \int\_{x=0}^{2R} \int\_{u=0}^{2Rx-x^{2}} e^{-\frac{x^{2}}{2}}\, u^{\frac{n-1}{2}-1} e^{-\frac{u}{2}} \; \mathbf{progress} \; du \, dx. \tag{35}$$

**Proof.**

$$prob^{N} = \frac{2^{-\frac{n-1}{2}}}{\sqrt{2\pi}\Gamma\_{\frac{n-1}{2}}} \int\_{x=0}^{2R} \int\_{u=0}^{2Rx-x^{2}} e^{-\frac{x^{2}}{2}} u^{\frac{n-1}{2}-1} e^{-\frac{u}{2}} du\, dx$$

$$= \frac{2^{-\frac{n-1}{2}}}{\sqrt{2\pi}\Gamma\_{\frac{n-1}{2}}} \int\_{x=0}^{2R} e^{-\frac{x^{2}}{2}} \left( \int\_{u=0}^{2Rx-x^{2}} u^{\frac{n-1}{2}-1} e^{-\frac{u}{2}} du \right) dx$$

$$= \frac{1}{\sqrt{2\pi}} \int\_{x=0}^{2R} e^{-\frac{x^{2}}{2}} \left( \frac{1}{\Gamma\_{\frac{n-1}{2}}} \int\_{t=0}^{Rx-\frac{x^{2}}{2}} t^{\frac{n-1}{2}-1} e^{-t} dt \right) dx$$

$$= \frac{1}{\sqrt{2\pi}} \int\_{x=0}^{2R} e^{-\frac{x^{2}}{2}} P\left(\frac{n-1}{2}, Rx - \frac{x^{2}}{2}\right) dx. \tag{36}$$

The same calculus applies for the expected progress, with the addition of the factor corresponding to the one-dimensional progress along the first axis, Equation (25).
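Although (34) has no closed form, it is a one-dimensional integral and evaluates easily by quadrature; below is a sketch using SciPy, where `scipy.special.gammainc` is the regularized incomplete Gamma *P*(*a*, *x*), cross-checked against direct Monte Carlo.

```python
import numpy as np
from scipy import integrate, special

def prob_normal(n, R):
    # Equation (34); special.gammainc(a, x) is the regularized P(a, x) of (33).
    f = lambda x: np.exp(-x**2 / 2) * special.gammainc((n - 1) / 2, R * x - x**2 / 2)
    return integrate.quad(f, 0.0, 2 * R)[0] / np.sqrt(2 * np.pi)

# Monte Carlo cross-check with standard normal mutation from P, |OP| = R.
rng = np.random.default_rng(5)
n, R = 4, 0.4
y = np.array([R] + [0.0] * (n - 1)) + rng.standard_normal((1_000_000, n))
print(prob_normal(n, R), np.mean(np.sum(y**2, axis=1) < R**2))  # should agree
```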

#### *3.3. Comparison*

Due to the analytic intractability of the integral representations (34) and (35), a theoretical comparison between the two variants of Algorithm 1 (one with UNIFORM, the other with NORMAL mutation) must resort to inequalities. Therefore, a deeper insight into the prolific theory of Euler and hypergeometric functions is required. We start with an upper bound for the incomplete Gamma function (33) ([34], p. 1213).

**Proposition 2.** *The following inequality holds*

$$P(a, x) \le \frac{x^a}{a(a+1)\,\Gamma\_a} \left(1 + a e^{-x}\right). \tag{37}$$

More results are gathered from [35], [36] (p. 240), [37] (pp. 890, 894), and [38] (pp. 53, 57).

**Proposition 3** (Hypergeometric functions)**.**

• *For any real set of parameters a*, *b*, *ai*, *bi and any real number x, define*

$$\,\_1F\_1\left(a,b\mid \mathbf{x}\right) = \sum\_{k=0}^{\infty} \frac{(a)\_k}{(b)\_k} \frac{\mathbf{x}^k}{k!} \tag{38}$$

$$\,\_2F\_2\left(a\_1, a\_2; b\_1, b\_2 \mid \mathbf{x}\right) = \sum\_{k=0}^{\infty} \frac{(a\_1)\_k (a\_2)\_k}{(b\_1)\_k (b\_2)\_k} \frac{\mathbf{x}^k}{k!} \tag{39}$$

*where* (*a*)*<sup>k</sup>* = *a*(*a* + 1)...(*a* + *k* − 1) = Γ*a*+*k*/Γ*<sup>a</sup> is the Pochhammer symbol, with*

$$(a)\_{2k} = \left(\frac{a}{2}\right)\_k \left(\frac{a+1}{2}\right)\_k 2^{2k}. \tag{40}$$

• *If A* = (*a*1,..., *aq*) *and B* = (*b*1,..., *bq*)*, we write B* <sup>≺</sup>*<sup>W</sup> A, if*

$$0 < a\_1 \le \dots \le a\_q, \quad 0 < b\_1 \le \dots \le b\_q,$$

$$\sum\_{i=1}^k a\_i \le \sum\_{i=1}^k b\_i, \quad k = 1, \dots, q.$$

• *If B* <sup>≺</sup>*<sup>W</sup> A, the following inequality holds:*

$${}\_{p}F\_{p}\left(A,B\mid \mathbf{x}\right) \le 1 - \theta + \theta e^{\mathbf{x}},\tag{41}$$

$$\theta = \begin{cases} \frac{a}{b}, & p = 1, \ (A,B) = (a,b) \\ \frac{a\_{1}a\_{2}}{b\_{1}b\_{2}}, & p = 2, \ (A,B) = (a\_{1}, a\_{2}; b\_{1}, b\_{2})\end{cases}.$$

We can now prove the main result, stating that, for an EA acting on the SPHERE with current position at maximal range 1/2 from the origin, the UNIFORM mutation provides a larger success probability than the NORMAL mutation, for arbitrary dimension *n*. We denote by *n*!! the double factorial (semi-factorial), that is, the product of all integers from 1 to *n* of the same parity as *n*.

**Theorem 9.** *Let an EA minimizing the SPHERE be situated at current distance R from the origin, such that R* ≤ 1/2*. For any n* ≥ 3*, the following holds:*

$$\frac{prob^N}{prob^{U}} < \frac{1}{n!!}. \tag{42}$$

**Proof.** Apply inequality (37) to Equation (34)

$$\begin{split} prob^{N} &\leq \frac{1}{\sqrt{2\pi}\,\Gamma\_{\frac{n-1}{2}}} \int\_{x=0}^{2R} e^{-\frac{x^{2}}{2}} \frac{4\left(Rx - \frac{x^{2}}{2}\right)^{\frac{n-1}{2}}}{n^{2}-1} \left(1 + \frac{n-1}{2}e^{\frac{x^{2}}{2}-Rx}\right) dx \\ &= \frac{4}{\sqrt{2\pi}(n^{2}-1)\Gamma\_{\frac{n-1}{2}}} \int\_{x=0}^{2R} e^{-\frac{x^{2}}{2}} \left(Rx - \frac{x^{2}}{2}\right)^{\frac{n-1}{2}} dx \\ &\quad + \frac{2}{\sqrt{2\pi}(n+1)\Gamma\_{\frac{n-1}{2}}} \int\_{x=0}^{2R} e^{-Rx} \left(Rx - \frac{x^{2}}{2}\right)^{\frac{n-1}{2}} dx \\ &= \frac{R^{n}2^{\frac{n+5}{2}}}{\sqrt{2\pi}(n^{2}-1)\Gamma\_{\frac{n-1}{2}}} \int\_{t=0}^{1} e^{-2R^{2}t^{2}} t^{\frac{n-1}{2}} (1-t)^{\frac{n-1}{2}} dt \\ &\quad + \frac{R^{n}2^{\frac{n+3}{2}}}{\sqrt{2\pi}(n+1)\Gamma\_{\frac{n-1}{2}}} \int\_{t=0}^{1} e^{-2R^{2}t} t^{\frac{n-1}{2}} (1-t)^{\frac{n-1}{2}} dt, \end{split} \tag{43}$$

where the last equality used the substitution *x* = 2*Rt*.

Using the series expansion of the exponential, the second integral in (43) becomes a hypergeometric function of type (38). The interchange of the integral and the sum is justified by the absolute convergence of the series.


$$\begin{split} \int\_{0}^{1} e^{-2R^2t} \, t^{\frac{n-1}{2}} (1-t)^{\frac{n-1}{2}} dt &= \int\_{0}^{1} t^{\frac{n-1}{2}} (1-t)^{\frac{n-1}{2}} \sum\_{k=0}^{\infty} \frac{\left(-2R^2t\right)^k}{k!} dt \\ &= \sum\_{k=0}^{\infty} \frac{\left(-2R^2\right)^k}{k!} \int\_{0}^{1} t^{\frac{n-1}{2} + k} (1-t)^{\frac{n-1}{2}} dt = \sum\_{k=0}^{\infty} \frac{\left(-2R^2\right)^k}{k!} \frac{\Gamma\_{\frac{n+1}{2} + k}\, \Gamma\_{\frac{n+1}{2}}}{\Gamma\_{n+1+k}} \\ &= \beta\_{\frac{n+1}{2}, \frac{n+1}{2}} \sum\_{k=0}^{\infty} \frac{\left(-2R^2\right)^k}{k!} \frac{\left(\frac{n+1}{2}\right)\_k}{(n+1)\_k} = 2^{-n} \beta\_{\frac{n+1}{2}, \frac{1}{2}} \; {}\_1F\_1\left(\frac{n+1}{2}, n+1 \mid -2R^2\right). \end{split} \tag{44}$$

Identity (40) reduces the first integral in (43) to a hypergeometric function (39).

$$\begin{split} \int\_{0}^{1} e^{-2R^2t^2} t^{\frac{n-1}{2}} (1-t)^{\frac{n-1}{2}} dt &= \int\_{0}^{1} t^{\frac{n-1}{2}} (1-t)^{\frac{n-1}{2}} \sum\_{k=0}^{\infty} \frac{(-2R^2t^2)^k}{k!} dt \\ &= \sum\_{k=0}^{\infty} \frac{(-2R^2)^k}{k!} \int\_{0}^{1} t^{\frac{n-1}{2} + 2k} (1-t)^{\frac{n-1}{2}} dt = \sum\_{k=0}^{\infty} \frac{(-2R^2)^k}{k!} \frac{\Gamma\_{\frac{n+1}{2} + 2k}\, \Gamma\_{\frac{n+1}{2}}}{\Gamma\_{n+1+2k}} \\ &= \beta\_{\frac{n+1}{2}, \frac{n+1}{2}} \sum\_{k=0}^{\infty} \frac{(-2R^2)^k}{k!} \frac{\left(\frac{n+1}{2}\right)\_{2k}}{(n+1)\_{2k}} = 2^{-n} \beta\_{\frac{n+1}{2}, \frac{1}{2}} \; {}\_2F\_2\left(\frac{n+1}{4}, \frac{n+3}{4};\, \frac{n+1}{2}, \frac{n+2}{2} \mid -2R^2\right). \end{split} \tag{45}$$

Summing up Equations (43)–(45), and using inequality (41) for *p* = 1, 2, one gets

$$\begin{split} prob^{N} &\leq \frac{R^{n}}{2^{\frac{n+2}{2}}(n+1)\Gamma\_{\frac{n+2}{2}}} \left[ (n-1)\left(1+e^{-2R^{2}}\right) + 4 - \frac{n+3}{n+2} \left(1-e^{-2R^{2}}\right) \right] \\ &< \frac{R^{n}}{2^{\frac{n+2}{2}}(n+1)\Gamma\_{\frac{n+2}{2}}} \left[2(n-1)+4\right] = \frac{R^{n}}{2^{\frac{n}{2}}\,\Gamma\_{\frac{n+2}{2}}} = \frac{R^{n}}{2^{\frac{n}{2}} \cdot \frac{n}{2} \cdot \frac{n-2}{2} \cdots} = prob^{U}\, \frac{1}{n!!}\,. \end{split}$$

In the last equality, we have used Equation (26) and the definition of the double factorial, for *n* even. Obviously, for *n* odd, the constant $\sqrt{2}/\sqrt{\pi}$ will appear at the tail of the product, yet this is a minor difference that may be neglected.
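A numerical illustration of Theorem 9 (a sketch reusing the quadrature above; the chosen dimensions are arbitrary): the ratio *prob*<sup>*N*</sup>/*prob*<sup>*U*</sup> stays below 1/*n*!! at *R* = 1/2, with *prob*<sup>*U*</sup> = *R*<sup>*n*</sup> by Theorem 6.

```python
import numpy as np
from scipy import integrate, special

def prob_normal(n, R):
    # Equation (34), evaluated by one-dimensional quadrature.
    f = lambda x: np.exp(-x**2 / 2) * special.gammainc((n - 1) / 2, R * x - x**2 / 2)
    return integrate.quad(f, 0.0, 2 * R)[0] / np.sqrt(2 * np.pi)

def double_factorial(n):
    # Product of all integers from 1 to n of the same parity as n.
    return int(np.prod(np.arange(n, 0, -2)))

R = 0.5
for n in (3, 5, 8):
    print(n, prob_normal(n, R) / R**n, 1 / double_factorial(n))  # ratio < 1/n!!
```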

The result for the expected progress now follows easily.

**Theorem 10.** *Let an EA minimizing the SPHERE be situated at current distance R from the origin, such that R* ∈ (0, 1/2)*. For any n* > 3*, the following holds:*

$$\frac{\phi^N}{\phi^{U}} < \frac{n+1}{n!!} \approx \frac{1}{(n-2)!!}.\tag{46}$$

**Proof.** The expected progress of NORMAL mutation (35) differs from the success probability (34) only by the integration factor (25). Since **progress** < *R*, inequality (46) is a simple consequence of Theorems 7 and 9.

#### **4. Discussion**

Within evolutionary algorithms acting on real space, the use of the normal distribution makes implementation easy: in order to generate an *n*-dimensional point, one simply generates *n* times from the normal uni-variate. Unfortunately, the simplicity of the practical algorithm does not transfer to the theoretical analysis, making EA experts go to great lengths in order to estimate performance quantities like the success probability and the expected progress. In the end, normal mutation theory provides only asymptotic formulas, valid for large *n*.

This paper analyzes a different mutation operator, based on the uniform multi-variate in the sphere, with dependent components. Using deeper insights into spherical distribution theory, the local performance of the algorithm with uniform mutation was measured on the SPHERE fitness function. Closed-form expressions for the success probability and the expected progress of the EA with uniform mutation have been derived, valid for arbitrary *n*. Compared to the performance of the normal operator (which, due to the intractability of the integral formulas in Theorem 8, required inequalities with hypergeometric functions), the success probability and the expected progress of the algorithm with uniform mutation are both larger, by a factor of order *n*!!.

#### **5. Conclusions**

From a broader perspective, this paper can be seen, together with the works in [27,28], as an attempt to revisit the classical theory of continuous evolutionary algorithms. Even if practitioners in the field will continue to use the normal multi-variate as the mutation distribution, we claim that the theory can benefit from the uniform distribution inside the sphere. First, as demonstrated in this paper, a particular setting of parameters (the natural choice *ρ* = *σ* = 1) provides better performance for the uniform mutation operator on the SPHERE landscape, if the current position of the algorithm is near the optimum. However, in light of the "no free lunch theorem for optimization" paradigm [39], one cannot expect general dominance of an algorithm over all others, irrespective of the fitness function. Rather, specific algorithms with particular operators should be analyzed separately, on different optimization landscapes. This is where the second advantage of the new uniform distribution occurs, in terms of more tractable mathematical analysis, yielding closed formulas previously not attained by normal mutation theory—see the studies of the RIDGE landscape in [27] and of the elitist evolutionary algorithm with mutation and crossover on SPHERE in [28].

A theory of continuous evolutionary algorithms could not be complete without the analysis of global behavior and adaptive mutation parameter. These cases have already been treated in [27,28]—under a normalization of mutation sphere radius which makes algorithm behave similarly to the one with normal mutation, in terms of difference and differential equations, following the works in [6,29]. This opens the way for the challenging task, previously unattempted in probabilistic optimization literature, of linking the theory of continuous evolutionary algorithms to that of differential optimization techniques such as particle swarm optimization [40] and differential evolution [41].

**Funding:** The work was supported by the Bucharest University of Economic Studies, Romania, through Project "Mathematical modeling of factors leading to subscription or un-subscription from email and SMS lists, forums and social media groups".

**Acknowledgments:** The author is grateful to his colleague Ovidiu Solomon for guidance through the jungle of Hypergeometric functions.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


### *Review* **Evolution Strategies under the 1/5 Success Rule**

**Alexandru Agapie**


**Abstract:** For large space dimensions, the log-linear convergence of the elitist evolution strategy with the 1/5 success rule on the sphere fitness function has been observed, experimentally, from the very beginning. Finding a mathematical proof took considerably more time. This paper presents a review and comparison of the most consistent theories developed so far, in the critical interpretation of the author, concerning both global convergence and the estimation of convergence rates. I discuss the local theory of the one-step expected progress and success probability for the (1+1) ES with normal or uniform-inside-the-sphere mutation, minimizing the SPHERE function, but also the adjacent global convergence and convergence rate theory, essentially based on the 1/5 rule. Small digressions into complementary theories (martingale, irreducible Markov chain, drift analysis) and different types of algorithms (population-based, recombination, covariance matrix adaptation and self-adaptive ES) complete the review.

**Keywords:** continuous evolutionary algorithm; Markov chain; martingale; drift analysis; Wald's equation; computational complexity

**MSC:** 68W50

#### **1. Introduction**

It is within human nature to favor short, simple and intuitive constructs in abstract sciences. This is the case with the 1/5 success rule, proposed in 1965 by Rechenberg for the adaptation of the normal mutation parameter in evolution strategies (ES) [1]. According to Auger and Hansen [2], the idea of step-size adaptation for probabilistic algorithms was also independently proposed by other authors around that time, e.g., [3]. Without any theoretical explanation, the adaptation rule performed surprisingly well in experiments, providing global convergence for various algorithmic designs and fitness landscapes. The magical aura around the rule began to unravel in 2000, with the appearance of convergence proofs for the ES on SPHERE [4–6]. We use uppercase when referring to the fitness function and lowercase for the uniform distribution inside the sphere. Obviously, the optimum (minimum) of SPHERE is located at the origin of ℜ<sup>*n*</sup>.

$$\mathcal{F}: \mathfrak{R}^n \to \mathfrak{R} \quad \mathcal{F}(\mathfrak{x}\_1, \dots, \mathfrak{x}\_n) = \sum\_{i=1}^n \mathfrak{x}\_i^2. \tag{1}$$

Apart from the adaptation mechanism provided by the 1/5 rule, the ES design is very simple: a random walk for generating new individuals (mutation), plus elitist selection, the natural principle of discarding worse offspring. Together, the three procedures build up an efficient algorithm, simple in form but complicated in theory. The local behavior is difficult to estimate because of the discontinuous and inhomogeneous Markov transition kernel induced by mutation and selection, while global convergence is hard to prove due to the empirical application of the 1/5 rule. The Markov character of the ES is lost upon


the application of the 1/5 rule, which observes not one, but several previous iterations of the algorithm.

In order to gain a better grip on these difficulties, we first introduce the spherical distributions, a class of multi-variate random variables (r.v.s) which includes the normal distribution and also the uniform on/inside the sphere [7].

#### **Definition 1.**

• *An n-dimensional r.v.* **x** *is said to have a spherical distribution if*

$$\mathbf{x} \stackrel{d}{=} \mathbf{r} \cdot \mathbf{u}^{\mathbf{n}} \tag{2}$$

*for some one-dimensional r.v. (radius)* **r***, and a uniform distribution on the unit sphere* **un***. Moreover,* **r** *and* **u<sup>n</sup>** *are independent, and also*

$$\mathbf{r} = ||\mathbf{x}|| \ge 0, \qquad \qquad \mathbf{u}^{\mathbf{n}} \stackrel{d}{=} \frac{\mathbf{x}}{||\mathbf{x}||}. \tag{3}$$

• *If the spherical distribution has pdf g, then g*(**x**) = *g*(||**x**||)*, and there is a special connection between g and f , the pdf of* **r***:*

$$f(r) = \frac{2\pi^{n/2}}{\Gamma\_{\frac{n}{2}}} r^{n-1} g(r).\tag{4}$$

The basic algorithm discussed in this paper is the following.

The mutation operator yields new, potentially better solutions from the fitness landscape. Classical ES theory, developed mainly by Rechenberg, Schwefel and Beyer, uses the normal mutation distribution with normalized standard deviation *σ* = *ρ*/√*n* [1,4,8]. Rudolph applied the uniform *on* the sphere [9], and a recent study proved that, under proper scaling of the mutation parameter, the uniform distribution inside the sphere of radius *ρ* and the averaged sum of uniforms both perform, locally and globally, similarly to the normal operator [10].

Under a constant mutation rate, a real ES stagnates in the vicinity of the optimum. This is where the 1/5 rule applies, by modifying (decreasing) the mutation parameter *ρ*. Algorithm 1 depicts Rechenberg's original (symmetric) version of the 1/5 rule, commonly applied in both practical and theoretical studies [1,4,6,11,12]. A simplified, asymmetric rule that only decreases but never increases *ρ* works as well [10].

There are different ways to describe the sequence of r.v.s {*Z*<sub>*t*</sub>}<sub>*t*∈*N*</sub> generated by the ES evolution over successive iterations. Contrary to the random walk, the Markov kernel induced by mutation and selection is inhomogeneous, since the current state changes the one-step transition (success) probabilities. For arbitrary space dimension *n*, an exact calculus involving the local transition kernel is intractable, opening the way for various approximations and making ES convergence one of the most studied problems in the literature. The stochastic models proposed so far include renewal processes, drift analysis and martingales [6,9,13–15]. Markov chain models have also been tested, first for constant mutation ES, without (or prior to) the step-size adaptation procedure provided by the 1/5 rule [16,17]. Decomposing the algorithm into a sequence of constant mutation cycles (mathematically, a sequence of Markov chains) has also been considered [2,10,18,19], which is similar to theoretical studies of closely related probabilistic algorithms such as simulated annealing and random heuristic search [20,21]. It is also worth mentioning the large number of theoretical studies that do not presume any stochastic structure at all [4,12,22–24].

#### **Algorithm 1** Elitist ES with 1/5 success rule.

1. Set *t* := 0; initialize the starting point **x**<sup>0</sup> and the mutation radius *ρ*
2. Repeat:
	- *t* := *t* + 1
	- *Mutation*: generate a new point **x** in ℜ<sup>*n*</sup> using some spherical mutation distribution with radius mean *ρ*
	- *Selection*: if F(**x**) < F(**x**<sub>*t*−1</sub>) then **x**<sub>*t*</sub> := **x**, else **x**<sub>*t*</sub> := **x**<sub>*t*−1</sub>
	- *1/5 success rule*: if *t* = 0 modulo *n*, compute the success frequency over the last *n* iterations, *SF* = (number of successes)/*n*, and change *ρ* according to: (*i*) *ρ* = *ρ*/2, if *SF* < 1/5; (*ii*) *ρ* = 2*ρ*, if *SF* > 1/5
3. Until *t* = *t*<sub>max</sub>
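Below is a minimal Python sketch of Algorithm 1 (a hypothetical implementation with our own naming, using uniform-in-sphere mutation of radius *ρ* as one admissible spherical choice):

```python
import numpy as np

def es_one_fifth(x0, t_max=20_000, rho=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n, successes = len(x), 0
    for t in range(1, t_max + 1):
        g = rng.standard_normal(n)
        y = x + rho * rng.beta(n, 1.0) * g / np.linalg.norm(g)  # uniform-in-sphere step
        if np.sum(y**2) < np.sum(x**2):                         # elitist selection on SPHERE
            x, successes = y, successes + 1
        if t % n == 0:                                          # 1/5 success rule, every n steps
            sf = successes / n
            if sf < 0.2:
                rho /= 2
            elif sf > 0.2:
                rho *= 2
            successes = 0
    return x

# Example: starting far from the optimum in dimension 10.
x_final = es_one_fifth(np.full(10, 5.0))
```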

Let us take a closer look at the success region, a key concept in the algorithm's local behavior. This is the integration region for both the success probability and the expected progress. Let *x* ∈ ℜ<sup>*n*</sup> be the current ES position, *A* ⊂ ℜ<sup>*n*</sup> an open set, and let the mutation be defined by a probability density function (pdf) *f*. The cumulative effect of elitist selection yields the success region *R*<sup>*S*</sup><sub>*x*</sub> and the transition kernel *P*<sub>*x*</sub>(*A*). The transition kernel is discontinuous due to *δ*<sub>*x*</sub>(*A*), the Dirac measure in *x*—defined as 1 if *x* ∈ *A* and 0 otherwise.

$$R\_x^S = \{ y \in \mathfrak{R}^n | \mathcal{F}(y) < \mathcal{F}(x) \} \tag{5}$$

$$P\_{x}(A) = \int\_{A \cap R\_x^S} f(y) dy + \left[ 1 - \int\_{R\_x^S} f(y) dy \right] \delta\_{x}(A). \tag{6}$$

If *A* = *R*<sup>*S*</sup><sub>*x*</sub>, Equation (6) reduces to its first term, understood as the success probability and denoted *P*<sub>*S*</sub>.

Under uniform mutation inside the sphere, the modification of the success region—and also the inhomogeneous Markov kernel—can be observed as the ES approaches the optimum in Figure 1, from right to left. In case (a), *R*<sup>*S*</sup><sub>*x*</sub> is the intersection of two spheres (corresponding to mutation and fitness). As the algorithm approaches the optimum, assuming *ρ* has not changed, *R*<sup>*S*</sup><sub>*x*</sub> becomes a full (fitness) sphere, case (b). The ES stagnates at this point, the 1/5 rule is activated and *ρ* is halved, so that *R*<sup>*S*</sup><sub>*x*</sub> becomes again the intersection of two spheres, case (c).

**Figure 1.** Elitist ES on SPHERE: success region (blue) of uniform mutation. The solid line circle (centered in optimum O) represents the fitness sphere; the dotted circle (centered in current position P) stands for the mutation sphere. As the ES approaches O from sub-figure (**a**–**c**), mutation radius *ρ* is halved under the 1/5 success rule.

Among the many studies devoted to the convergence of the elitist ES with the 1/5 success rule on SPHERE and similar quadratic fitness functions, we review in this paper only what we consider to be complete theories, which follow the general pattern presented in Algorithm 1 and also achieve proofs of global convergence. Consequently, we distinguish between classic ES theory, in Section 2, and general theories, in Section 3. In classic ES theory, the accent falls on local behavior, with global convergence being a consequence of the best-case local scenario. General theories, on the other hand, provide a unitary solution to the convergence paradigm. To classic ES theory we assimilate the works of Rechenberg, Schwefel, Beyer, Rudolph and Jägersküpper; to general theories, the analyses of Auger and Hansen, and of Akimoto, Auger and Glasmachers. Note that the papers of Jägersküpper fall somewhere in between, since they build on local behavior, but also provide a global convergence proof outside the best-case local scenario. The adaptation mechanisms used in population-based ES (with multiple offspring, multiple parents and recombination, self-adaptation or covariance matrix adaptation) are briefly discussed as generalizations of the 1/5 success rule in Section 4.

We underline the fact that this review paper is not an exhaustive survey of the ES state of the art (we defer to [25] for that purpose), but a personal reading of the convergence theories built in this field at the intersection of probability theory, computational complexity and statistics.

#### **2. Classic ES Theory**

As in Figure 1b, assume the algorithm and the center of the coordinate system are both in *P*; the distance to the global optimum *O* is *R* = |*OP*|; denote *x*_1 by *x* and the remaining *n* − 1 components (*x*_2, ... , *x*_n) by *h*. The classic ES theory focuses on the local, one-step behavior of the algorithm, where the mutation radius *ρ* can be assumed constant.

If *C* is a random point generated by mutation, the progress becomes a one-dimensional r.v. corresponding to the difference in distance to optimum between the current ES position and the next. Due to the elitist selection, the progress is non-negative. For a successful mutation, *C* is inside the region *R_x^S*, the blue area in Figure 1. Apply Pythagoras to *OC*, with *u* = ‖*h*‖²; then the progress becomes ([10]; [4], p. 54):

$$\text{progress} = |\text{OP}| - |\text{OC}| = R - \sqrt{(R - x\_1)^2 + \sum\_{i=2}^{n} x\_i^2} = R - \sqrt{(R - x)^2 + u}.\tag{7}$$

Insert the progress into the integral and set *A* = *R_x^S*; then, Equations (6) and (7) build the so-called ES expected progress:

$$
\phi = \int\_{R\_x^S} \left( R - \sqrt{(R - x)^2 + u} \right) f(x, u)\, dx\, du. \tag{8}
$$

In order to approximate the above integral, two distinct cases occur: (i) ES close to optimum, or large step size (Figure 1b); (ii) ES far from the optimum, or small step size (Figure 1a,c).

With the same uniform mutation distribution, ref. [24] refined the results of both [13,26] and came closer to the analysis reviewed below by considering, in the estimation of the local expected progress, the same two cases. However, without a deeper insight into spherical distributions, the study yielded only upper bounds for the expected progress and thus lower bounds for the global convergence time. Additionally, [23] analyzed an EA with *λ* > 1 individuals in the population, acting on the SPHERE, with uniform mutation inside the sphere. However, the analysis is confined to the case where the optimum is within the mutation sphere (that is, large step size), for which only asymptotic (*λ* → ∞) results are derived; again, a comparison to (32) is intractable. Noteworthy, yet again incomparable to the results reviewed below, are theoretical analyses of mutation distributions that are uniform but non-spherical [22,27].

#### *2.1. Large Step Size*

This case is defined by the inequality *ρ* ≥ 2*R*, such that the success region *R_x^S* is completely included in the mutation sphere, Figure 1b. We assume uniform mutation inside the sphere, but an ES with normal mutation performs similarly, for large *n* and a proper parameter scaling [10,28].

Mathematically, this is the unique situation when Formula (8) is tractable, allowing for a closed-form derivation of the expected progress. However, the case did not receive much attention in the literature, since it corresponds to the worst case scenario (stagnation), calling for an emergency application of the mutation adaptation rule.

A first rigorous result concerning this case was provided by Rudolph, with the analysis of pure random search, an algorithm that generates offspring independently, using the uniform distribution inside the sphere with a fixed center and constant radius *ρ* ([9], pp. 168–169). The minimal distance to optimum over the first *t* trials, *R_t*, is computed, for large *t*, by order statistics [29]. The second approximation is with respect to large space dimension *n*.

$$R\_t \approx \rho \, t^{-1/n} \, \Gamma\!\left(1 + \tfrac{1}{n}\right) \approx \rho \, t^{-1/n}. \tag{9}$$
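As a quick numerical illustration of Equation (9), the following sketch assumes the sampling sphere is centered at the optimum, so the distance to optimum of each trial equals its radius; all constants are illustrative.

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(1)
n, rho, t, runs = 10, 1.0, 10_000, 200

# Pure random search: the distance to optimum of a trial is the radius r.v.,
# with cdf (r/rho)^n; keep the minimum over t trials in each run.
best = [(rho * rng.random(t) ** (1.0 / n)).min() for _ in range(runs)]

print(np.mean(best))                                  # empirical E[R_t]
print(rho * t ** (-1.0 / n) * gamma(1.0 + 1.0 / n))   # asymptotic prediction (9)
```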

The following definitions are used in computational complexity analysis [5,10].

#### **Definition 2.**


Rudolph argues that the convergence rate of the elitist, constant mutation ES decreases to the asymptotics (9), estimated, after setting *R_t* = *ε*, as

$$t = \Theta\left[\left(\frac{1}{\epsilon}\right)^n\right] \tag{10}$$

and classified as poor, compared to the performance of adaptive mutation ES.

One can note in Figure 1b that the progress is minimal (zero) if the randomly generated point *C* is *P* and maximal (*R*) if the generated point is *O*. The success probability in the large step size case is simply the ratio of two *n*-sphere volumes, (*R*/*ρ*)^*n*. A complete description of the progress r.v. (7) is provided in the following.

**Proposition 1.** *Assume the elitist ES with one individual and uniform mutation inside the sphere of radius ρ, minimizing the SPHERE, is at distance R from the origin, ρ* ≥ 2*R. Then, the progress is*

• *cdf*

$$F(x) = P(R - |OC| \le x) = \begin{cases} 0, & x < 0 \\ 1 - \left(\frac{R}{\rho}\right)^n, & x = 0 \\ 1 - \left(\frac{R - x}{\rho}\right)^n, & x \in (0, R) \\ 1, & x \ge R \end{cases} \tag{11}$$

• *and partial pdf*

$$f(x) = \begin{cases} \frac{n}{\rho^n} (R - x)^{n-1}, & x \in (0, R) \\ 0, & \text{otherwise} \end{cases} \tag{12}$$

**Proof.** It is easy to see that the progress is non-negative, discontinuous, and *P*(*progress* = 0) = 1 − (*R*/*ρ*)^*n*. For *x* ∈ (0, *R*), the point *C* can be seen as generated by the uniform distribution inside the sphere of radius *ρ*, but with center *O*. Then, |*OC*| corresponds to **r**, the radius of the uniform distribution inside the *ρ*-sphere, truncated to *S_O*(*R*), the *n*-sphere with center *O* and radius *R*. Note the difference between the radius r.v. **r** from Definition 1 and the positive real number *ρ*, the radius of the mutation sphere.

Indeed, **r** is a scaled *Beta*(*n*, 1) r.v. in the case of the uniform inside the sphere of radius *ρ*, with pdf [10]:

$$g(r) = \frac{n\, r^{n-1}}{\rho^n}, \quad r \in (0, \rho). \tag{13}$$

Restricted to (0, *R*), the pdf (13) provides the (partial) cdf:

$$G(r) = \left(\frac{r}{\rho}\right)^n, \quad r \in (0, R), \tag{14}$$

such that

$$F(x) = P(R - \mathbf{r} \le x) = 1 - P(\mathbf{r} \le R - x) = 1 - G(R - x) \tag{15}$$

$$= 1 - \left(\frac{R - x}{\rho}\right)^n, \qquad x \in (0, R). \tag{16}$$

A simple differentiation with respect to *x* provides the pdf (12).

The progress r.v. is depicted in Figure 2, for *ρ* = 1, *R* = 1/2 and different space dimensions *n*. To each *n* correspond a bar at zero (the Dirac measure *δ*_0, that is, the discrete part of the progress) and a thin line of the same color representing the partial pdf of the continuous part of the progress.

**Figure 2.** Progress of elitist ES on SPHERE: Dirac (bar) and pdf of continuous part (line)—large step size.

The following result is a slight generalization of a similar result (for the particular case *ρ* = 1) from [30]. However, the proof presented below is simpler than the one in [30], building on the exact formulas provided by Proposition 1.

**Theorem 1.** *In the conditions of Proposition 1,*

• *the success probability is*

$$P^S = \left(\frac{R}{\rho}\right)^n \,\tag{17}$$

• *the expected progress is*

$$\phi = \frac{R^{n+1}}{\rho^n (n+1)}.\tag{18}$$

**Proof.** The success probability has already been derived geometrically. However, in a unitary setting, both success probability and expected progress are obtained by integration of partial pdf (12) over the success region (0, *R*).

$$P^S = \int\_0^R \frac{n}{\rho^n} (R - x)^{n-1} dx = \left(\frac{R}{\rho}\right)^n.$$

$$\phi = E(\text{progress}) = \int\_0^R \frac{n\, x}{\rho^n} (R - x)^{n-1}\, dx = \frac{R^{n+1}}{\rho^n (n+1)}.$$

where the second calculation follows from integration by parts.

Theorem 1 points out an outstanding analytical property of the uniform mutation operator, in case of the algorithm with a large step size: *The success probability is the derivative, with respect to distance to optimum, of the expected progress.*

This result relies on the non-centrality property of the uniform distribution inside the sphere. Namely, we can regard the random point *C* as being generated from a uniform distribution centered in *O*, not in *P*, and then apply the radius r.v. to the random distance |*OC*|. This procedure can be applied only for Figure 1b, not for Figure 1a or 1c, and neither for other spherical mutations such as the normal distribution. The uniform distribution *on* the sphere yields zero progress if *ρ* ≥ 2*R*, so that case is also tractable but not interesting, corresponding to pure stagnation.
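Both formulas of Theorem 1 are easy to check by Monte Carlo simulation; the following sketch, with illustrative values of *n*, *R* and *ρ* satisfying *ρ* ≥ 2*R*, estimates the success probability and the expected progress empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
n, R, rho, N = 5, 0.5, 1.0, 200_000    # rho >= 2R: large step size case

# Mutations uniform inside the rho-sphere, centered at P = (R, 0, ..., 0).
d = rng.standard_normal((N, n))
d /= np.linalg.norm(d, axis=1, keepdims=True)
steps = (rho * rng.random(N) ** (1.0 / n))[:, None] * d

dist = np.linalg.norm(steps + np.r_[R, np.zeros(n - 1)], axis=1)  # |OC|
progress = np.where(dist < R, R - dist, 0.0)   # zero progress if mutation rejected

print(np.mean(dist < R), (R / rho) ** n)                       # success prob., Eq. (17)
print(progress.mean(), R ** (n + 1) / (rho ** n * (n + 1)))    # expected progress, Eq. (18)
```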

Finally, we confirm and generalize Rudolph's result (10) on the convergence time of the elitist ES on the SPHERE, under constant, large-step-size mutation.

**Theorem 2.** *In the conditions of Proposition 1, the time required by the algorithm to reach distance ε from the optimum is*

$$t = \Theta\left[\left(\frac{\rho}{\epsilon}\right)^n\right].\tag{19}$$

**Proof.** Start from Equation (18) and the definition of progress (7).

$$\phi = |OP| - |OC| = R\_t - R\_{t+1} = \frac{R\_t^{n+1}}{\rho^n (n+1)} \tag{20}$$

$$
\Rightarrow\; R\_{t+1} - R\_t = -\frac{R\_t^{n+1}}{\rho^n (n+1)}. \tag{21}
$$

We transform the above difference equation into a differential equation, with separable variables, which we solve.

$$y' = -\frac{y^{n+1}}{\rho^n (n+1)}\tag{22}$$

$$
\Rightarrow \int \frac{dy}{y^{n+1}} = -\int \frac{dx}{\rho^n (n+1)} \tag{23}
$$

$$\Rightarrow y^{-n} = \frac{n}{n+1}\, \frac{x}{\rho^n} \tag{24}$$

$$\Rightarrow y = \left(\frac{n}{n+1}\right)^{-\frac{1}{n}} \left(\frac{x}{\rho^n}\right)^{-\frac{1}{n}} \approx x^{-\frac{1}{n}}\, \rho. \tag{25}$$

At the last step, we remove the factors/exponents that converge to one as *n* → ∞.

If we denote *y* = *R_t* = *ε* and *x* = *t*, we get

$$
\epsilon \approx t^{-\frac{1}{n}} \rho \quad \Rightarrow \quad t \approx \left(\frac{\rho}{\epsilon}\right)^n. \tag{26}
$$

One should note that the relevance of the above analysis is purely theoretical. Inspection of Equations (17) and (18) shows that both success probability and expected progress attain their maximum for *ρ* = 2*R*, where the two quantities are of order 1/2^*n*, a rather low value, leading to stagnation. The situation is avoided in Algorithm 1 by halving *ρ*, such that the ES enters (again) the small step size situation of Figure 1c.
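The Θ-estimate (19) can be illustrated directly on the deterministic difference Equation (21); a minimal sketch, with illustrative values of *n*, *ρ* and *ε*, counts the iterations until the expected-progress recursion drops below *ε*.

```python
rho, n, eps = 1.0, 3, 0.05
R, t = rho / 2.0, 0                # start at R_0 = rho/2
while R > eps:
    R -= R ** (n + 1) / (rho ** n * (n + 1))   # expected progress per step, Eq. (18)
    t += 1
# Same order of magnitude as (rho/eps)^n, cf. Eq. (19); the derivation (24)-(25)
# suggests a constant factor close to (n + 1)/n.
print(t, (rho / eps) ** n)
```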

#### *2.2. Small Step Size*

This case is defined by the inequality *ρ* < 2*R*, such that the success region *R_x^S* is the intersection of the mutation and fitness spheres; see Figure 1a,c.

If the ES uses uniform mutation inside the *ρ*-sphere, the following equation transforms the original radius into a new parameter, equivalent (asymptotically) to the standard deviation (*σ*) of the normal mutation [10,28].

$$a = \frac{\rho}{\sqrt{n}}.\tag{27}$$

Under the assumption *a* ≈ *σ*, the local behavior of Algorithm 1 does not depend on the mutation distribution used: normal with standard deviation *σ* or uniform inside the sphere of radius *ρ*—one can treat these two ES versions as one. Apply next a second normalization of both new mutation parameter *a* and expected progress *φ* ([4], p. 32):

$$a^\* = a \frac{n}{R}, \qquad \qquad \phi^\* = \phi \frac{n}{R}. \tag{28}$$

The random local behavior of the algorithm can be expressed by the following compact formulas ([4], pp. 67–68), [10]. The cumulative distribution function (cdf) of the standard normal (Gaussian) distribution *N*(0, 1) is

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int\_{-\infty}^{x} e^{-\frac{t^2}{2}}\, dt.$$

**Theorem 3.** *Let an elitist ES minimize the SPHERE, with either uniform mutation inside the sphere of radius ρ or normal mutation with standard deviation a* = *ρ*/√*n. Then, for large n, the following approximations hold:*

• *Success probability:*

$$Prob \approx 1 - \Phi\left(\frac{a^\*}{2}\right). \tag{29}$$

• *Normalized expected progress:*

$$\phi^\* \approx \frac{a^\*}{\sqrt{2\pi}}\, e^{-\frac{a^{\*2}}{8}} - \frac{a^{\*2}}{2} \left[ 1 - \Phi\left(\frac{a^\*}{2}\right) \right]. \tag{30}$$

The asymptotics of success probability and expected progress are depicted in Figure 3. The particular form of *φ*∗ as a function of *a*∗ is essential for proving the global convergence of the adaptive ES. Note that the function (30) is uni-modal, with its maximum at *a*∗ = 1.224, *φ*∗_max = *φ*∗(1.224) ≈ 0.202.

**Figure 3.** Success probability (29), expected progress (30) and evolution window (blue) of elitist ES on SPHERE.

The blue area corresponding to the *a*∗-interval [0.84, 1.68] is the 'evolution window' of Algorithm 1. In classic ES theory, the term is used in the broader sense of the region where the expected progress is significantly greater than zero, e.g., *a*∗ ∈ [0.1, 5] in Figure 3 [1], ([4], p. 69), [31]. However, we use here the more restrictive definition from [10], which also ensures global convergence of the ES under the 1/5 rule. Using the success probability Formula (29), we first identify the critical value *a*∗ = 1.68, corresponding to *Prob*(1.68) = 1/5 = 0.2. Then, we apply the expected progress Formula (30) to get *φ*∗(1.68) = 0.188, and use (30) again to get *φ*∗(0.84) = 0.188. One could say that the evolution window in this case is defined by the condition *φ*∗ ≥ 0.188.
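The numerical values quoted above are easy to reproduce from Formulas (29) and (30); a short sketch, using the error function for the normal cdf:

```python
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob(a):
    """Asymptotic success probability, Formula (29)."""
    return 1.0 - Phi(a / 2.0)

def progress(a):
    """Normalized expected progress, Formula (30)."""
    return a / math.sqrt(2.0 * math.pi) * math.exp(-a * a / 8.0) \
        - a * a / 2.0 * (1.0 - Phi(a / 2.0))

print(prob(1.68))                      # ~0.20: upper limit of the evolution window
print(progress(0.84), progress(1.68))  # both ~0.188: the window boundaries
print(max(range(1, 500), key=lambda k: progress(k / 100)) / 100)  # maximizer ~1.22
```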

Observing the possible benefits of different mutation distributions, the authors of [3,9] developed an analysis based on uniform mutation *on* the sphere. Using the random angle *θ* = ∠*CPO* between the mutated point and the optimum direction, and the same parameter normalization, Rudolph solved the low-dimensional case, *n* = 3. The general case proved intractable, so he resorted to the same approximation of a random variable through its expected value, yielding an asymptotic progress which is double that of Formula (30) ([9], pp. 170–172). Note that a different progress definition applies, usually referred to as *quality gain*:

$$
\text{progress}' = R^2 - |OC|^2 \tag{31}
$$

$$
\phi' = 2\, \phi^\*. \tag{32}
$$

By applying another spherical mutation operator to the same problem, the Cauchy distribution, Rudolph obtained a different expression of the progress, valid for the case *n* = 3 [32]. Like the multivariate normal, the Cauchy distribution has unbounded support, can be constructed from independent identical components and exhibits the rare property of being closed under addition. Under the quadratic definition (31), the expected progress *φ_C* depends on the mutation parameter *δ*:

$$\phi\_{\mathbb{C}} = 1 - \frac{1}{\delta \pi} \left[ \frac{3}{\delta} \arctan(2\delta) + \frac{2\delta^2 - 1}{2\delta^2} \log(4\delta^2 + 1) - 4 \right]. \tag{33}$$

Unfortunately, the generalization to larger dimensions failed, due to the intractability of radius **r** from Equation (2), in the case of the Cauchy distribution.

Considering yet another spherical distribution, the (averaged) sum of two independent uniforms inside the sphere, **x** = (**x**_1 + **x**_2)/2, reference [10] also showed that *φ_S*, the expected progress of this new mutation operator, is

$$
\phi\_{\mathbb{S}}(a^\*) = \phi^\*\left(\frac{a^\*}{\sqrt{2}}\right). \tag{34}
$$

Note that if **x**_1, **x**_2 are uniformly distributed inside the sphere with radius *ρ*, the expected value of the radius r.v. **r** from Equation (1) is also *ρ*, asymptotically for large space dimension *n*. On the other hand, the expected value of the radius of **x** is *ρ*/√2, so the scaling factor of the argument in Equation (34) is actually the ratio between the two expected radii [10].

#### *2.3. Global Convergence*

For decades, the algorithm's global convergence was only a marginal subject in classic ES theory, regarded as a consequence of the ability (unexplained theoretically, though supported by empirical evidence) of the 1/5 success rule to keep the expected progress around its maximal value during the whole evolution. We illustrate this with Beyer's reasoning and apply Formula (30) to express the ES's expected progress between time *t* and *t* + 1 ([4], pp. 48–50).

$$R\_t - R\_{t+1} = \phi(a\_t) \quad \Rightarrow \quad R\_{t+1} - R\_t = -R\_t \frac{\phi^\*(a\_t^\*)}{n}.\tag{35}$$

One obtains a separable differential equation from the difference equation.

$$R'\_t = -R\_t \frac{\phi^\*(a\_t^\*)}{n}.\tag{36}$$

If the 1/5 rule manages to keep *a*∗_*t* approximately constant at its maximizer, *φ*∗(*a*∗_*t*) ≈ *φ*∗_max = 0.202, the differential equation solves to the following (note that *R*_0 is the initial distance to optimum):

$$R\_t = R\_0 \, e^{-\frac{0.202\, t}{n}}. \tag{37}$$

Apply next the logarithm to get

$$
\ln\left(\frac{R\_t}{R\_0}\right) = \frac{-0.202\text{ t}}{n}\text{,}\tag{38}
$$

and then invert Equation (38); denote *R_t* = *ε* > 0 and 1/0.202 = *C*, such that

$$t = C\, n \log \frac{R\_0}{\epsilon}. \tag{39}$$

According to Definition 2, Equation (39) reads as linear convergence time with respect to both the space dimension *n* and the logarithm of the initial distance to optimum *R*_0. The only problem is that the above analysis is based on the optimistic assumption that the 1/5 rule keeps the expected progress around the value 0.202. Rigorously, this is only a best-case scenario, so one should actually read the above equalities as inequalities and (39) as a lower bound on the convergence time [10]. The ES convergence time is Ω[*n* · log(*R*_0/*ε*)].

$$t \ge C\, n \log \frac{R\_0}{\epsilon}. \tag{40}$$
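For illustration (with example values not taken from the source), set *n* = 10, *R*_0 = 1 and *ε* = 10^{−5}; the best-case estimate (39) then gives *t* = (1/0.202) · 10 · ln(10^5) ≈ 570 iterations.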

However, the practical efficiency of Equation (37), expressed by its ability to predict the behavior of the real algorithm, is undeniable. Following [10], we present here another derivation of Formula (37), with a slightly different exponential parameter (0.178 instead of 0.202) but obtained as an average, not an extreme case value.

Rudolph considered first the stochastic nature of the ES, using a martingale model to derive sufficient conditions for global convergence. The following definitions and results are from ([9], pp. 25–26, 52, 166) and ([33], pp. 94, 109, 127–128, 131).

#### **Definition 3.**

*A sequence of r.v.s* {*R_t*}_*t*∈N *is called:*

**–** *Non-negative supermartingale if, for all t,*

$$E(\left|R\_t\right|) < \infty \text{ and } E(R\_{t+1} \mid R\_t) \le R\_t \tag{41}$$

**–** *Uniformly integrable (UI) if, for any ε* > 0*, there is K* ≥ 0 *such that, for all t,*

$$E\left(|\mathcal{R}\_t| \cdot \mathbf{1}\_{\{|\mathcal{R}\_t| > \mathcal{K}\}}\right) < \epsilon. \tag{42}$$

**Proposition 2.** *Let* {*R_t*}_*t*∈N *be a non-negative, uniformly integrable supermartingale. Then R_t converges a.s. and in mean to a finite r.v.*

Let *X_t* be the r.v. elitist ES position at iteration *t*, *f* a fitness function with minimum zero and *R_t* = *f*(*X_t*). Then, {*R_t*}_*t*∈N is a non-negative supermartingale due to the elitist selection, and UI since *R_t* ≤ *R*_0. Due to Proposition 2, *R_t* converges a.s. and in mean to a finite r.v. What remains to be proved is that this limit is zero, the global minimum of *f*.


Sufficient convergence conditions are provided in ([9], pp. 165–167). Note that part (a) of Theorem 4 is also presented in [34], and part (b), in terms of dynamical systems and *Lyapunov functions*, is also in ([15], p. 154).

**Theorem 4.** *Let* {*X_t*}_*t*∈N *be generated by some ES optimizing a fitness function f with global minimum at zero, and R_t* = *f*(*X_t*) > 0*.*

*(a) If the ES employs an elitist selection rule and there exist sequences ε_t*, *δ_t* ∈ (0, 1) *such that for all t*

$$
\delta\_t \le \text{Prob}\{R\_{t+1} \le (1 - \epsilon\_t)R\_t | R\_t\} \tag{43}
$$

*and*

$$\sum\_{t=0}^{\infty} \epsilon\_t \cdot \delta\_t = \infty \tag{44}$$

*then the ES converges to zero a.s. and in mean; for constant sequences ε_t* = *ε and δ_t* = *δ, the approach is exponentially fast with rate c* = 1 − *δ* · *ε* ∈ (0, 1)*.*

$$E(R\_{t+1}) \le E(R\_t)\, c \implies E(R\_t) \le E(R\_0)\, c^t. \tag{45}$$

*(b) Regardless of the selection rule, if there is a constant c* ∈ (0, 1) *such that for all t*

$$E(R\_{t+1} \mid R\_t) \le c \cdot R\_t \quad a.s.\tag{46}$$

*then the ES converges to zero a.s. and in mean, and the approach is exponentially fast with rate c.*

The key r.v.s and parameters used in this study are summarized in Table 1.

Rudolph used Theorem 4 (b) to justify global convergence on SPHERE, for the elitist ES with uniform mutation *on* the sphere ([9], pp. 170–172). However, the reasoning is unrealistic: it assumes a mutation radius proportional, at each iteration, to the distance to optimum, *ρ_t* = *γR_t*, with the parameter set to the optimal value *γ*∗ = 1.224.

The problem is that a real ES, modeled as a sequence of (decreasing) constant-mutation phases, delimited by the application of the 1/5 rule, does not fulfill Equations (43)–(46) per se. Assuming a 'good' starting point (see Theorem 7), there is a constant *c* > 0 such that (46) holds as long as the ES is within some narrow evolution window, such as the one depicted in Figure 3. However, as the algorithm reaches the upper limit of the evolution window, at some random time *T*, there is only an exponentially small probability for the ES to continue the descent, such that (46) is precluded. Since mutation adaptation is necessary for global convergence, we are interested in the (large probability) event 'the 1/5 rule applies at iteration *T*'. The situation resembles Theorem 4 (a), with the difference that *E*(*R*_{T+1}|*R_T*) = *R_T*, for the 1/5 rule applies only in unsuccessful iterations, such that *ε_T* = 0 and (43) is precluded as well.


**Table 1.** Mathematical concepts used in ES analysis.

Summing up, Theorem 4 is not strong enough to derive upper bounds on the global convergence time of the elitist ES with the 1/5 success rule. As one can see in the following, the mathematical difficulty resides not in the multivariate calculus, but in the computational complexity analysis of the 1/5 rule. The linear upper bounds were proved for the first time by Jägersküpper, who regarded the elitist ES as a sequence of *n*-length *phases* with a constant mutation rate in each phase, during which the success frequency of mutation is observed before the application of the 1/5 rule [5,26,35,36].

Removing line (ii) from the 1/5 rule in Algorithm 1 (that is, the parameter *ρ* decreases if the success frequency is less than 1/5, but never increases), and using uniform mutation inside the sphere instead of normal mutation, [10] proved that all of Jägersküpper's results still hold. Moreover, under the simplified 1/5 rule, the constant-mutation phase extends to a number of phases, called a cycle. The r.v. 'length of a cycle' will play a key role in connecting the two (otherwise distinct) parts of the convergence analysis, local and global, leading to an exponential formula able to predict the behavior of the real ES, similar to Equation (37).

We summarize next the results from [5,10,26,35,36] in a unitary setting, covering both types of 1/5 rule and both mutation distributions, normal and uniform inside the sphere. The results hold also for the sum of two uniforms; see [10] for details. For the local behavior, Jägersküpper avoided the calculus from Section 2.2 and used instead the decomposition (2) of the normal mutation distribution with standard deviation *σ*, *N*(0, *σI_n*), into a r.v. uniform on the sphere, **u**_*n*, and a radius r.v. Recall that *ρ* is the radius parameter of the uniform mutation inside the sphere, *R* the current distance to optimum, *Prob* the success probability and **r** the radius r.v. from Equation (1). We also apply the normalization *a* = *ρ*/√*n* in order to equalize the asymptotic mean radii of the two distributions, normal and uniform, such that *a* can be identified with *σ* and **r** with the normal's radius r.v.

**Lemma 1.** *Let Algorithm 1 with any spherical mutation minimizing the SPHERE be in the current point P;* |*OP*| = *R. The mutant C is accepted with Prob* ∈ [*ε*, 1/2 − *ε*]*, ε* > 0*, if and only if* |*PC*| = Θ(*R*/√*n*)*.*

**Lemma 2.** *If X is uniform inside the sphere of radius ρ or normal with standard deviation σ* = *ρ*/ <sup>√</sup>*n, and* **<sup>r</sup>** *is the corresponding r.v. radius, then*

$$P(|\mathbf{r} - \boldsymbol{\rho}| \le \delta \boldsymbol{\rho}) \ge 1 - \mathcal{O}\left(\frac{1}{n\delta^2}\right). \tag{47}$$

*If X*_1, ... , *X_n are independent copies of X, then for any λ* ∈ (0, 1)*, there exist a_λ*, *b_λ* > 0 *such that the r.v. cardinality of* {*i* | *a_λ ρ* ≤ *r_i* ≤ *b_λ ρ*} *is* ≥ *λn w.o.p.*

The first part of Lemma 2 is actually the Chebyshev inequality. Note that the inequality |**r** − *ρ*| ≤ *δρ* is equivalent to |**r** − *ρ*|/*ρ* ≤ *δ*, so the result holds for the original (that is, before the normalization *ρ* = *a*√*n*) uniform distribution inside the sphere, for the normalized uniform and also for the normal distribution.
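The concentration stated in Lemma 2 is easy to visualize numerically; a minimal sketch for the uniform distribution inside the unit sphere (*ρ* = 1, *δ* = 0.1, sample sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
for n in (10, 100, 1000):
    # Radius r.v. of the uniform inside the unit sphere, cdf r^n on (0, 1).
    r = rng.random(100_000) ** (1.0 / n)
    # Empirical P(|r - rho| <= 0.1 rho): approaches 1 as n grows, cf. (47).
    print(n, np.mean(np.abs(r - 1.0) <= 0.1))
```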

**Lemma 3.** *Let Algorithm 1 with mutation X, as in Lemma 2, minimize the SPHERE. The following are equivalent:*


**Lemma 4.** *If* **r** = Θ(*R*/√*n*)*, then the progress is* Θ(*R*/*n*)*, with probability* Ω(1) *and within expectation.*

The radius **r** in Lemma 4 corresponds to the normalized (*ρ* = *a*√*n*) uniform inside the sphere, and hence also to the normal distribution.

Let *i*, *i* + 1 denote the states of the algorithm at the beginning and end of a constant-mutation phase, respectively.

**Lemma 5.** *Let Algorithm 1 with mutation X as in Lemma 2 minimize the SPHERE. Consider the two variants of the 1/5 rule: the simplified (i) and the complete (i–ii).*


**Lemma 6.** *Let Algorithm 1 be as in Lemma 5. If the 1/5 rule causes a* (*k* + 1)*-sequence of phases,* 1 ≤ *k* = *n*^O(*n*)*, such that in the first phase ρ is halved and in all the following it is doubled (respectively, left unchanged), or the other way around, then w.o.p. the distance from optimum is reduced k times by a constant fraction in these phases.*

We can state now the main global convergence result, valid for the ES with either uniform or normal mutation, with either complete (i–ii) or simplified (i) 1/5 success rule [5,10,35].

**Theorem 5.** *Let Algorithm 1, defined as in Lemma 5, start at distance R*_0 *from the optimum, with mutation parameter ρ*_0 = Θ(*R*_0/√*n*)*. If t satisfies* 1 ≤ *t* = *n*^O(1)*, then the number of iterations to reach a distance R_t with R_t* ≤ *R*_0/2^*t* *is* Θ(*t* · *n*)*, w.o.p. and within expectation.*

For *t* = 1, Theorem 5 states the existence of two constants *a*, *b* > 0 such that, for large enough *n*, the random time *T* required to halve the initial distance to optimum *R*_0 satisfies

$$
a\, n \le E(T) \le b\, n. \tag{48}
$$

Even if the result does not state convergence to zero of the r.v. *R_t* as *t* → ∞ (neither in expectation, nor a.s., nor w.o.p.), it accounts for a form of linear convergence, with respect to both *t* and *n*.

**Remark 1.** *For an arbitrary initial point, outside the prescribed range ρ*_0 = Θ(*R*_0/√*n*)*, Jägersküpper conjectured in [35] that: 'For other starting conditions, the number of steps until the theorem's assumption is met must be estimated before the theorem can be applied—by estimating the number of steps until the scaling factor is halved at least once. This is a rather simple task when utilizing the strong results presented in Lemma 5'. Empirical evidence for this fact can be found in [10].*

This is the point where Jägersküpper's convergence time analysis stops, without providing a formula similar to (37), to be tested against the behavior of the real ES. To fill in the gap, reference [10] applied the uniform mutation inside the sphere and the simplified 1/5 rule—obtained by removing line (ii) from Algorithm 1—and made use of the r.v. *T* = 'length of a constant-mutation ES cycle', defined as a stopping time ([33], pp. 97–98).

**Definition 4.** *A r.v. T with state space* {0, 1, ... , ∞} *is said to be a stopping time for the sequence of r.v.s* {*Xt*}*t*∈<sup>N</sup> *if one can decide whether the event* {*<sup>T</sup>* = *<sup>t</sup>*} *has occurred only by observing X*1, ... , *Xt. Note that, since the ES is a Markov chain, we do not assume independence of X*1, *X*2, ...*, as in Wald's equation and renewal theory [37].*

The stopping time considered here is the first hitting time of distance *d* > 0 from the initial point *X*_0.

$$T\_d = \min\{t \mid X\_1 + \dots + X\_t \ge d\}. \tag{49}$$

A key role in the analysis is played by the following generalization of Wald's equation ([37], p. 38), as introduced in [10]. Note that we define, for some r.v. *X* and *P*(*A*) > 0, the conditional expectation *E*(*X*|*A*) = *E*(*X* · **1***A*)/*P*(*A*).

**Proposition 3** (Wald's inequality [10])**.** *Let c*, *d*, *δ_ℓ*, *δ_u* > 0 *and* {*X_t*}_*t*∈N *be non-negative r.v.s such that X*_0 = *c and δ_ℓ* ≤ *E*(*X_t*|*T_d* ≥ *t*) ≤ *δ_u for t* ≥ 1*. Then*

$$\frac{d}{\delta\_{\rm u}} \le E(T\_d) \le \frac{d + \delta\_{\rm u}}{\delta\_{\ell}}.\tag{50}$$

We set in our ES analysis *X*_0 = *c* = *R*_0 > *d* and *X*_{t+1} = *R_t* − *R*_{t+1}; hence, *X*_1 + ... + *X_t* = *R*_0 − *R_t* for all *t* ≥ 0.
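As a sanity check of Proposition 3 in the i.i.d. special case, where it reduces to the classical Wald equation, one can simulate the hitting time directly; the step distribution and constants below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(3)
d, lo, hi = 10.0, 0.5, 1.5     # i.i.d. steps uniform on [0.5, 1.5]; E(X_t) = 1

def hitting_time():
    """First t with X_1 + ... + X_t >= d, cf. Equation (49)."""
    s, t = 0.0, 0
    while s < d:
        s += rng.uniform(lo, hi)
        t += 1
    return t

times = [hitting_time() for _ in range(20_000)]
# With delta_l = delta_u = 1, the bounds (50) give 10 <= E(T_d) <= 11.
print(np.mean(times))   # ~10.5, inside the predicted interval
```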

Another form of Wald's inequality can be obtained from the additive drift theorem, derived by Lehre and Witt in [38]—which, as the authors mention, adapts for the continuous case the discrete space drift theorem of He and Yao [14]. We apply a formalization similar to Proposition 3.

**Theorem 6** (Additive Drift)**.** *Let* {*X_t*}_*t*∈N *be a stochastic process, adapted to a filtration* {F_*t*}_*t*∈N*, over some state space S* ⊆ ℝ*; let d*, *δ_ℓ*, *δ_u* > 0 *and T*_0 = min{*t* | *X_t* = 0}*. Then, if* 0 ≤ *X_t* ≤ *d* = *X*_0 *and δ_ℓ* ≤ *E*(*X_t* − *X*_{t+1} ; *X_t* > 0 | F_*t*) ≤ *δ_u for all t,*

$$\frac{d}{\delta\_u} \le E(T\_0 | \mathcal{F}\_0) \le \frac{d}{\delta\_\ell}.\tag{51}$$

To be comparable with Wald's inequality, one should set *X*_{t+1} = *d* − (*R*_0 − *R*_{t+1}) in the Additive Drift theorem. However, we consider the assumption *X_t* ≥ 0 for all *t*, implying *X*_{T_d} = 0, to be unrealistic for *d* < *R*_0. On the other hand, if one sets *d* = *R*_0 as above, the lower bound *δ_ℓ* > 0 does not exist (independent of *t*) in the continuous case, e.g., the SPHERE. The existence of a strictly positive lower bound on either success probability or expected progress can be seen as a sufficient convergence condition for continuous space algorithms (see also Theorem 4), yet it is only satisfied in discrete space.

Let us return to the application of Wald's inequality in deriving convergence rates for the ES model. Let *n* be arbitrarily fixed and *d* < *R*_0, and consider the two *a*∗-values corresponding to the lower and upper limits of the evolution window depicted in Figure 3, *a*∗_1 = 0.84 and *a*∗_2 = 1.68. The normalized distance *d*∗ between these points corresponds to the un-normalized distance

$$d = R\_{0.84} - R\_{1.68} = \rho \sqrt{n} \left( \frac{1}{0.84} - \frac{1}{1.68} \right) \tag{52}$$

which further leads, using Wald's inequality (see [10] for details), to

**Theorem 7.** *Let Algorithm 1 be as in Theorem 5, starting in the point a*∗ = 0.84*, and let d*∗ = 1.68 − 0.84*. Then, for large n,*

$$2.5n \le E(T\_d) \stackrel{<}{\sim} 5.3n.\tag{53}$$

The resemblance between Equations (48) and (53) is obvious. According to Theorem 7, if the elitist ES is currently (initially) in the point *a*∗_0 = 0.84 = *ρ*√*n*/*R*_0, the expected time to reach 2*a*∗_0 = 1.68 = *ρ*√*n*/(*R*_0/2) is within [2.5*n*, 5.3*n*]. One could use these limits as lower and upper bounds on *E*(*T*), or search for some value in between, to be used as an estimate for *E*(*T*). In [10], the simple arithmetic mean 3.9 was used, yet we choose here the estimate 3.5, as suggested by the new empirical evidence presented in Figure 4. The experiments were conducted with Algorithm 1 and uniform mutation inside the sphere, on the SPHERE fitness function, with different space dimensions and initial points, all corresponding to *a*∗ = 0.84. The results were averaged over 100 independent runs. In each run, the maximal number of iterations was set to *t*_max = 50,000, and the last, incomplete cycle was discarded.

**Figure 4.** Expected number of iterations per cycle, as a function of space dimension *n*. Trend-line equation with *R*<sup>2</sup> value, displayed by Excel.
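A minimal sketch of the experiment behind Figure 4 is given below, assuming uniform mutation inside the sphere and the simplified (halving-only) 1/5 rule; the initial distance *R*_0 = 1, the seed and the tested dimensions are illustrative choices, not taken from the source.

```python
import numpy as np

def uniform_in_sphere(n, rho, rng):
    d = rng.standard_normal(n)
    d /= np.linalg.norm(d)
    return rho * rng.random() ** (1.0 / n) * d

def mean_cycle_length(n, t_max=50_000, seed=None):
    """Average iterations between consecutive rho-halvings, simplified 1/5 rule."""
    rng = np.random.default_rng(seed)
    R0 = 1.0
    rho = 0.84 * R0 / np.sqrt(n)       # initial point corresponds to a* = 0.84
    x = np.zeros(n)
    x[0] = R0                          # start at distance R0 from the optimum O
    successes, last, cycles = 0, 0, []
    for t in range(1, t_max + 1):
        y = x + uniform_in_sphere(n, rho, rng)
        if y @ y < x @ x:              # SPHERE comparison (elitist selection)
            x, successes = y, successes + 1
        if t % n == 0:
            if successes / n < 0.2:    # simplified rule: halve only, never double
                rho /= 2.0
                cycles.append(t - last)
                last = t
            successes = 0
    return np.mean(cycles)             # the trailing incomplete cycle is never appended

for n in (32, 64, 128):
    print(n, mean_cycle_length(n, seed=0) / n)   # around 3.5 per Figure 4
```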

Using the value 3.5*n* as an estimate of the expected value of the r.v. *T* = 'number of iterations per constant-mutation ES cycle', we obtain

$$R\_{3.5\,n} \approx \frac{R\_0}{2}. \tag{54}$$

By iterating Equation (54) *s* times and substituting 3.5*ns* = *t*, we get

$$R\_{3.5\,ns} = \frac{R\_0}{2^s} \implies R\_t = R\_0 \, 2^{-\frac{t}{3.5\,n}}, \tag{55}$$

which further implies, with ln 2/3.5 ≈ 0.198,

$$R\_t = R\_0 \, e^{-\frac{0.198\, t}{n}}. \tag{56}$$

Using *R_t* = *ε* and a derivation similar to the one following Equation (37), the linear convergence of the elitist ES on SPHERE is finally obtained.

$$\text{elitist ES convergence time} = \Theta\left(n \cdot \log \frac{R\_0}{\epsilon}\right).$$

As demonstrated in [10], Formula (56) can be used, with very good experimental results, as a theoretical predictor of the algorithm's global behavior. Note that the slightly different exponential coefficient, 0.198 here instead of 0.178 in [10], is not disruptive.

On the other hand, the above model, indicating identical behavior of the elitist ES within each constant-mutation cycle, offers an intuitive insight into the algorithm's dynamics, seen as a sequence of cycles that are independent and have identical expected length.

#### **3. General 1/5-Rule Theories**

Different convergence theories for the elitist ES with the 1/5 success rule exist, building up mathematical rigor and complexity; yet, by imposing supplementary conditions, they usually lack compact results such as Theorem 3 and Formula (56). Some relevant examples are reviewed in the following.

Rudolph's stochastic analysis based on martingales, presented already in Section 2.3, provides only sufficient global convergence conditions for the algorithm. A different stochastic process (a random system with complete connections) was used with a similar outcome in [16]. With a deeper insight into the theory of irreducible Markov chains, which identifies the discrete states with the so-called small sets and extrapolates the basic features of a discrete homogeneous Markov chain onto the continuous space [39,40], Dorea proved, under elitist selection, the global convergence of ES and of a continuous version of simulated annealing to an *ε*-vicinity of the optimum [17]. However, since the mutation distribution is decoupled from the current position, the result is of little practical use. Bienvenüe and François applied a different Markov model to an adaptive ES, but under a different, problem-related adaptation rule that multiplies at each iteration the mutation parameter *ρ* by the current distance to optimum *R_t*, then searches for 'optimal universal step lengths on the basis of the convergence of the dynamics' [41]. The same adaptation rule was used by Rudolph in his convergence analysis of the elitist ES on SPHERE ([9], p. 70). Connecting the mutation rate to the current position works for the SPHERE centered on the origin, but fails for a slightly modified problem, e.g., for a SPHERE with a different center; there, Algorithm 1, adapting the mutation rate according to the 1/5 rule, works very well.

Unaware of Dorea's work, Auger and Hansen re-iterated the irreducible Markov chain modeling, but on the more practical premises used by Jägersküpper [2,18]. Similarly to Bienvenüe and François, but without their over-simplifying assumption, Auger and Hansen achieved, at the mathematical peak of the ES literature, linear convergence of the elitist ES with the 1/5 rule on scaling-invariant functions (SPHERE included) by studying the stability of the normalized Markov chain *Z_t* = *R_t*/*σ_t* [19]. A compact formula similar to (56) was proved:

$$R\_t = R\_0 \, e^{-C\_R\, t}, \tag{57}$$

where *C_R* > 0 depends on the asymptotic success probability *P^S* and on the 1/5 rule parameters *γ*, *q*:

$$C\_R = -\ln \gamma \left( \frac{q+1}{q}\, P^S - \frac{1}{q} \right), \qquad \begin{array}{ll} \text{increasing 1/5 rule factor:} & \gamma \\ \text{decreasing 1/5 rule factor:} & \gamma^{-1/q}. \end{array} \tag{58}$$

A supplementary condition is imposed on the parameters, which reads, for the SPHERE,

$$
\frac{1}{2} \left( \frac{1}{\gamma} + \gamma^{\frac{1}{q}} \right) < 1. \tag{59}
$$

Bringing some clarity and a new stochastic formalization to the previous approach, Akimoto, Auger and co-workers employed drift analysis to prove the linear convergence time [6,11,42]. However, since their theory still avoids the expected progress calculus, a verification against the behavior of real algorithms can be performed only in terms of upper and lower bounds (see also the critique of the additive drift theorem in Section 2.3).

#### **4. Extensions of the 1/5-Rule—Population-Based ES**

As noticed from the very beginning, the ES performance depends strongly on the value of the mutation parameter *σ* (or *ρ*). The 1/5 success rule discussed so far is the simplest, yet not the only way of adapting the parameter during the algorithm's evolution. Two of the most efficient techniques are the *σ*-self-adaptation [1,4,8] and the covariance matrix adaptation (CMA), which allows for different *σ*-values on the independent components (of normal mutation), aiming at accelerating the evolution in certain directions of the *n*-dimensional space [12,25,43,44].

As Beyer pointed out, each of these popular adaptation techniques borrows something from the original 1/5 success rule. The CMA techniques 'have an operating mechanism similar to the 1/5 rule: they analyze the statistical features of the selected mutations in order to change the strategy parameters towards the optimal value' ([4], p. 258). Namely, the pdf of multivariate normal mutation in CMA-ES is

$$f(\mathbf{x}) = \frac{1}{(\sqrt{2\pi})^n \sqrt{\det(C)}}\, e^{-\frac{\mathbf{x}^T C^{-1} \mathbf{x}}{2}} \tag{60}$$

with symmetric covariance matrix *C* depending on *n*(*n* + 1)/2 mutation parameters. If one considers all these parameters as independent r.v.s and switches from a single to a multiple offspring algorithm—one parent, *λ* offspring and no elitist selection being the simplest case, known as (1, *λ*) ES—Rudolph argues that a very large number of individuals is required in order to obtain a good approximation of the optimal matrix *C*. Pointing out that, in case of quadratic-convex fitness functions, the optimal *C* is the inverse of the function's Hessian matrix, he suggests that the CMA update rules should follow the deterministic iterative methods of approximating the Hessian matrix, based on the information gathered from previous samples ([9], p. 198); see also [43]. According to the recent survey of the state-of-art in ES for continuous optimization [25], 'we still lack a rigorous analysis of the one-step approximation of the covariance matrix'.
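For reference, the density (60) can be evaluated and sampled straightforwardly; a minimal sketch, where the covariance matrix and the helper names are arbitrary illustrations:

```python
import numpy as np

def cma_mutation_pdf(x, C):
    """Multivariate normal density with zero mean and covariance C, Equation (60)."""
    n = len(x)
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(C))
    return float(np.exp(-0.5 * x @ np.linalg.solve(C, x)) / norm)

def cma_sample(mean, C, rng):
    """Draw one mutation: mean + L z, with C = L L^T (Cholesky factor)."""
    L = np.linalg.cholesky(C)
    return mean + L @ rng.standard_normal(len(mean))

rng = np.random.default_rng(4)
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # example covariance: n(n+1)/2 = 3 free parameters
print(cma_mutation_pdf(np.zeros(2), C))
print(cma_sample(np.zeros(2), C, rng))
```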

As a general remark, in the case of population-based algorithms (with multiple offspring, as for the (1, *λ*) ES discussed above, or with multiple parents, offspring and recombination/crossover, as for the (*μ*/*μ*, *λ*) ES [45,46]), the explicit reduction of the mutation parameter induced by the 1/5 success rule is no longer necessary, its exploitation effect being accomplished by the (minimum) order statistics of the *λ*-offspring sample and/or by the *μ*/*μ* recombination of the parent population. Intuitively, the explanation is provided by the following fact: the average of two multivariate uniform distributions inside the sphere of radius *ρ* is also spherical, but with a smaller (expected value) radius, *ρ*/√2 instead of *ρ*; see [28] for details.

On the other hand, in self-adaptation, the dynamics of the mutation parameter is not based on some success-related statistics, but included in the evolution itself, subject to the basic algorithmic operators. Since mutation parameter *σ* becomes in this case part of the individual, 'the question reduces to the manner in which *σ* should be mutated. The answer is: *multiplicatively*, in contrast to the additive mutations of object parameters' ([4], p. 259). Obviously, the multiplicative character of the adaptation is inspired by the 1/5 rule.

#### **5. Conclusions**

Using the elitist, single-individual algorithm with mutation and the 1/5 success rule as the adaptation procedure, and the SPHERE fitness function, the paper provides a bird's-eye view of the theory of evolution strategies, an important class of probabilistic optimization algorithms for the multi-dimensional real space.

Despite their easy implementation and huge success in applications, the mathematical models are complicated, and global convergence results are rather scarce. Taking into account some recent studies applying the uniform distribution inside the sphere as a mutation operator, the review centered on the classic evolution strategy theory, built on the one-step behavior of the algorithm and on local quantities such as success probability and expected progress. For the first time in the literature, the three main building blocks of classic theory were presented together in a coherent formalization:


Different theories, based on martingales, irreducible Markov chain and drift analysis, were also reviewed, and their results were compared with those of classic theory.

As for population-based algorithms ((1, *λ*), (*μ*/*μ*, *λ*), CMA and self-adaptive ES), the role of the 1/5 rule is transferred to the selection and crossover operators, yielding a reduction in the mutation parameter for local improvement.

Since the 1/5 rule is devoted to exploitation, not to the exploration phase of the algorithm, the case of multi-modal fitness functions was not tackled in this paper.

**Funding:** This research received no external funding.

**Data Availability Statement:** No new data were created.

**Acknowledgments:** The author acknowledges support over the years by the old Chair Informatics 11, Dortmund University, especially from Günter Rudolph, Hans-Paul Schwefel, Hans-Georg Beyer and Thomas Bäck.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
