A Stochastic Convergence Result for the Nelder–Mead Simplex Method

Aurél Galántai

doi:10.3390/math11091998

Abstract

We prove that the Nelder–Mead simplex method converges in the sense that the simplex vertices converge to a common limit point with a probability of one. The result may explain the practical usefulness of the Nelder–Mead method.

Keywords:

Nelder–Mead simplex method; convergence; stochastic convergence

MSC:

65K10; 90C56

1. Introduction

The Nelder–Mead (NM) simplex method [1] is a direct method for the solution of the minimization problem

f (x) \to min (f : R^{n} \to R),

where f is continuous. It is an “incredibly popular method” (see [2]) in derivative-free optimization [3,4,5,6,7] and in various application areas (see, e.g., [8]). The Nelder–Mead simplex method became especially popular in computational chemistry, as shown by the book [3] and the references therein. The original Nelder–Mead paper [1] had 36,809 citations in Google Scholar as of 26 February 2023, reflecting a great variety of applications and a large number of mainly heuristic variants, occasionally combined with other techniques. The Nelder–Mead algorithm can also be found in many software libraries and systems, such as IMSL, NAG, Matlab, Scilab, Python SciPy and R [9]. The popularity of the method is due to its observed good performance in practice. In spite of this, only a few theoretical results are known on its convergence (see, e.g., [2,10,11,12]).

The counterexample of McKinnon [13] is a strictly convex function

f : R^{2} \to R

with continuous derivatives on which the Nelder–Mead algorithm converges to a nonstationary point of f. For strictly convex functions

f : R^{2} \to R

with bounded level sets, Lagarias, Reeds, Wright and Wright [10] proved that the function values at all simplex vertices converge to the same value and that the diameters of the simplices converge to zero. Kelley [4,14] gave a sufficient-decrease condition for the average of the objective function values (evaluated at the simplex vertices) and proved that if this condition is satisfied during the process, then any accumulation point of the simplices is a critical point of f. Han and Neumann [15] investigated the convergence to 0 and the effect of dimensionality on the function

f (x) = x^{T} x

(

x \in R^{n}

). For the restricted Nelder–Mead algorithm, Lagarias, Poonen and Wright [11] later proved that if

f : R^{2} \to R

is a twice continuously differentiable function with bounded level sets and an everywhere positive definite Hessian, then the algorithm converges to the unique minimizer of f.

If the objective function f does not satisfy the conditions of [10] or [11], then a number of counterexamples show that the Nelder–Mead method may exhibit different types of convergence behavior. It is possible that the function values at the simplex vertices converge to a common value, while the function f has no finite minimum and the simplex sequence is unbounded (Examples 1 and 2 of [16]). It is also possible that the simplex vertices converge to the same point, but the limit point is not a stationary point of f ([13], Examples 3 and 4 of [16]). Other examples indicate that the simplex sequence may converge to a limit simplex of positive diameter, resulting in different limit values of f at the vertices of the limit simplex (Examples 4 and 5 of [16]).

Here, we study the convergence of the simplex vertices to a common limit point. In papers [16,17], we proved this type of convergence under sufficient conditions for

1 \leq n \leq 3

and

1 \leq n \leq 8

, respectively. However, the key assumption of papers [16,17] was related to an algorithmically undecidable problem and required ways to circumvent it.

In this paper, we prove two new theorems for the convergence of the Nelder–Mead method in low dimensional spaces (

n = 2, 3, 4, 5, 6

). Theorem 1 replaces the key assumption of [16,17] with an algorithmically computable one. It is the basis of Theorem 2 of Section 6, which proves that the Nelder–Mead method converges with a probability of one. This result may explain the good behavior of the Nelder–Mead method experienced in practice. The case of two-dimensional strictly convex functions is discussed in Remarks 2 and 4.

2. The Nelder–Mead Simplex Method

There are several forms and variants of the Nelder–Mead method. We use the version of Lagarias, Reeds, Wright and Wright [10]. The vertices of the initial simplex

S^{(0)}

are denoted by

x_{1}^{(0)}, x_{2}^{(0)}, \dots, x_{n + 1}^{(0)} \in R^{n}

. It is assumed that vertices

x_{1}^{(0)}, x_{2}^{(0)}, \dots, x_{n + 1}^{(0)}

are ordered such that

f (x_{1}^{(0)}) \leq f (x_{2}^{(0)}) \leq \dots \leq f (x_{n + 1}^{(0)})

(1)

and this condition is maintained during the iterations of the Nelder–Mead algorithm. The simplex of iteration k is denoted by

S^{(k)} = [x_{1}^{(k)}, x_{2}^{(k)}, \dots, x_{n + 1}^{(k)}] \in R^{n \times (n + 1)}

. Define

x_{c}^{(k)} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}^{(k)}

and

x^{(k)} (λ) = (1 + λ) x_{c}^{(k)} - λ x_{n + 1}^{(k)}

. The reflection, expansion and contraction points of simplex

S^{(k)}

are defined by

x_{r}^{(k)} = x^{(k)} (1), x_{e}^{(k)} = x^{(k)} (2), x_{o c}^{(k)} = x^{(k)} (\frac{1}{2}), x_{i c}^{(k)} = x^{(k)} (- \frac{1}{2}),

respectively. The function values at the vertices

x_{j}^{(k)}

and the points

x_{r}^{(k)}

,

x_{e}^{(k)}

,

x_{o c}^{(k)}

and

x_{i c}^{(k)}

are denoted by

f (x_{j}^{(k)}) = f_{j}^{(k)}

(

j = 1, \dots, n + 1

),

f_{r}^{(k)} = f (x_{r}^{(k)})

,

f_{e}^{(k)} = f (x_{e}^{(k)})

,

f_{o c}^{(k)} = f (x_{o c}^{(k)})

and

f_{i c}^{(k)} = f (x_{i c}^{(k)})

, respectively.
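As a small illustration of these definitions, the following Python sketch evaluates the centroid of the n best vertices and the four trial points x(λ) for λ = 1, 2, 1/2, −1/2. The function name `trial_point` and the example simplex are assumptions made for illustration only; exact rational arithmetic is used so that the values can be checked by hand.

```python
from fractions import Fraction

def trial_point(simplex, lam):
    """Return x(lam) = (1 + lam) * x_c - lam * x_{n+1} for an ordered simplex.

    `simplex` is a list of n + 1 vertices (each a list of n numbers), ordered
    from best to worst; the centroid x_c uses the n best vertices only.
    """
    n = len(simplex) - 1
    worst = simplex[-1]
    centroid = [sum(v[d] for v in simplex[:n]) / n for d in range(n)]
    return [(1 + lam) * c - lam * w for c, w in zip(centroid, worst)]

# Hypothetical ordered simplex in R^2 (best vertex first, worst vertex last).
S = [[Fraction(0), Fraction(0)],
     [Fraction(2), Fraction(0)],
     [Fraction(0), Fraction(2)]]

x_r = trial_point(S, 1)                  # reflection point
x_e = trial_point(S, 2)                  # expansion point
x_oc = trial_point(S, Fraction(1, 2))    # outside contraction point
x_ic = trial_point(S, Fraction(-1, 2))   # inside contraction point
```

For this simplex the centroid is (1, 0), so the reflection point is (2, −2) and the inside contraction point is (1/2, 1), the midpoint between the centroid and the worst vertex.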

The Nelder–Mead simplex method (Algorithm 1) is a nonstationary iterative method where the kernel of the iteration loop consists of ordering the simplex vertices and exactly one of four possible operations. The logical conditions for these operations (reflection, expansion, contraction and shrinking) are mutually exclusive.

For the order operation, there are two rules that apply to reindexing after each iteration. If a nonshrink step occurs, then

x_{n + 1}^{(k)}

is replaced by a new point

v \in \{x_{r}^{(k)}, x_{e}^{(k)}, x_{o c}^{(k)}, x_{i c}^{(k)}\}

. The following cases are possible:

f (v) < f (x_{1}^{(k)}), f (x_{1}^{(k)}) \leq f (v) \leq f (x_{n}^{(k)}), f (v) < f (x_{n + 1}^{(k)}) .

If

j = \{\begin{matrix} 1, & if f (v) < f (x_{1}^{(k)}) \\ max \{ℓ : 2 \leq ℓ \leq n + 1, f (x_{ℓ - 1}^{(k)}) \leq f (v) \leq f (x_{ℓ}^{(k)})\}, & otherwise \end{matrix},

then the new simplex vertices are

x_{i}^{(k + 1)} = x_{i}^{(k)} (1 \leq i \leq j - 1), x_{j}^{(k + 1)} = v, x_{i}^{(k + 1)} = x_{i - 1}^{(k)} (i = j + 1, \dots, n + 1) .

(2)

This rule inserts v into the ordering with the highest possible index. If shrinking occurs, then

z_{1} = x_{1}^{(k)}, z_{i} = (x_{i}^{(k)} + x_{1}^{(k)}) / 2 (i = 2, \dots, n + 1)

plus a reordering takes place. If

f (x_{1}^{(k)}) \leq f (z_{i})

(

i = 2, \dots, n + 1

), then by convention

x_{1}^{(k + 1)} = x_{1}^{(k)}

. Hence, it is guaranteed that

f (x_{1}^{(k)}) \leq f (x_{2}^{(k)}) \leq \dots \leq f (x_{n + 1}^{(k)}) (k \geq 0) .

(3)

The insertion rule (2) implies that if function f is bounded below on

R^{n}

and only a finite number of shrink iterations occur, then each sequence

\{f_{i}^{(k)}\}

converges to some

f_{i}^{\infty}

for

i = 1, \dots, n + 1

(see Lemma 3.3 of [10]).

Algorithm 1: Nelder–Mead algorithm
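A minimal Python sketch of one common statement of the ordered Nelder–Mead iteration of [10] is given below, with coefficients 1, 2, 1/2 and −1/2 and the insertion rule (2). The acceptance tests follow the usual textbook form of that variant and should be checked against [10]; the function name `nelder_mead`, the iteration count and the test problem are assumptions made for illustration.

```python
import bisect

def nelder_mead(f, simplex, iterations=200):
    """Run `iterations` ordered Nelder-Mead steps; return [(f(x), x), ...]."""
    n = len(simplex) - 1
    # Keep (value, vertex) pairs ordered so that f_1 <= ... <= f_{n+1}.
    pts = sorted(((f(x), list(x)) for x in simplex), key=lambda p: p[0])

    def x_of(lam):
        # x(lam) = (1 + lam) * x_c - lam * x_{n+1}; x_c is the centroid of the n best.
        worst = pts[n][1]
        cen = [sum(p[1][d] for p in pts[:n]) / n for d in range(n)]
        return [(1 + lam) * c - lam * w for c, w in zip(cen, worst)]

    for _ in range(iterations):
        f_1, f_n, f_n1 = pts[0][0], pts[n - 1][0], pts[n][0]
        x_r = x_of(1.0)
        f_r = f(x_r)
        accepted = None
        if f_r < f_1:                      # expansion step
            x_e = x_of(2.0)
            f_e = f(x_e)
            accepted = (f_e, x_e) if f_e < f_r else (f_r, x_r)
        elif f_r < f_n:                    # reflection step
            accepted = (f_r, x_r)
        elif f_r < f_n1:                   # outside contraction
            x_oc = x_of(0.5)
            f_oc = f(x_oc)
            if f_oc <= f_r:
                accepted = (f_oc, x_oc)
        else:                              # inside contraction
            x_ic = x_of(-0.5)
            f_ic = f(x_ic)
            if f_ic < f_n1:
                accepted = (f_ic, x_ic)

        if accepted is None:               # shrink towards the best vertex
            x_1 = pts[0][1]
            shrunk = [pts[0]]
            for _value, x in pts[1:]:
                z = [(a + b) / 2 for a, b in zip(x, x_1)]
                shrunk.append((f(z), z))
            # The sort is stable, so x_1 stays first when values tie.
            pts = sorted(shrunk, key=lambda p: p[0])
        else:                              # insertion rule (2): highest possible index
            best = pts[:n]
            pos = bisect.bisect_right([p[0] for p in best], accepted[0])
            pts = best[:pos] + [accepted] + best[pos:]
    return pts

# Hypothetical test problem: f(x) = x^T x in R^2.
result = nelder_mead(lambda x: x[0] ** 2 + x[1] ** 2,
                     [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
```

Using `bisect_right` places the accepted point after all vertices with an equal function value, which is how the sketch realizes the "highest possible index" convention of rule (2).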

3. A Matrix Form of the Nelder–Mead Method

Assume that simplex

S^{(k)} = [x_{1}^{(k)}, x_{2}^{(k)}, \dots, x_{n + 1}^{(k)}]

is such that condition (3) holds. If the incoming vertex v is of the form

v = \frac{1 + α}{n} \sum_{i = 1}^{n} x_{i}^{(k)} - α x_{n + 1}^{(k)} = x^{(k)} (α)

for some

α \in \{1, 2, \frac{1}{2}, - \frac{1}{2}\}

, we can define the transformation matrix

T (α) = [\begin{matrix} I_{n} & \frac{1 + α}{n} e \\ 0 & - α \end{matrix}] (e = {[1, 1, \dots, 1]}^{T}) .

Since

S^{(k)} T (α) = [x_{1}^{(k)}, \dots, x_{n}^{(k)}, x (α)]

, we have to reorder the matrix columns according to the insertion rule (2). Define the permutation matrix

P_{j} = [e_{1}, \dots, e_{j - 1}, e_{n + 1}, e_{j}, \dots, e_{n}] \in R^{(n + 1) \times (n + 1)} (j = 1, \dots, n + 1) .

Then,

S^{(k)} T (α) P_{j}

is the new simplex

S^{(k + 1)}

. The following cases are possible:

Operation	New simplex
Reflection ( $v = x_{r}^{(k)}$ )	$S^{(k + 1)} = S^{(k)} T (1) P_{j}$ ( $j = 2, \dots, n$ )
Expansion ( $v = x_{e}^{(k)}$ )	$S^{(k + 1)} = S^{(k)} T (2) P_{1}$
Expansion ( $v = x_{r}^{(k)}$ )	$S^{(k + 1)} = S^{(k)} T (1) P_{1}$
Outside contraction ( $v = x_{o c}^{(k)}$ )	$S^{(k + 1)} = S^{(k)} T (\frac{1}{2}) P_{j}$ ( $j = 1, \dots, n + 1$ )
Inside contraction ( $v = x_{i c}^{(k)}$ )	$S^{(k + 1)} = S^{(k)} T (- \frac{1}{2}) P_{j}$ ( $j = 1, \dots, n + 1$ )
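The matrix form can be checked on a small example. The sketch below builds T(α) and P_j with exact rational entries and verifies that S T(α) P_j inserts the new vertex x(α) as column j. The helper names `matmul`, `T` and `P` and the example simplex are assumptions made for illustration.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def T(alpha, n):
    """T(alpha) = [[I_n, (1 + alpha)/n * e], [0, -alpha]] of order n + 1."""
    M = [[Fraction(int(r == c)) for c in range(n + 1)] for r in range(n + 1)]
    for r in range(n):
        M[r][n] = (1 + Fraction(alpha)) / n
    M[n][n] = -Fraction(alpha)
    return M

def P(j, n):
    """P_j = [e_1, ..., e_{j-1}, e_{n+1}, e_j, ..., e_n], with j counted from 1."""
    order = list(range(j - 1)) + [n] + list(range(j - 1, n))
    return [[Fraction(int(r == src)) for src in order] for r in range(n + 1)]

# Hypothetical ordered simplex in R^2; the vertices are the COLUMNS of S.
S = [[0, 2, 0],
     [0, 0, 2]]

# Reflection (alpha = 1) with the new vertex inserted at position j = 2.
new_S = matmul(matmul(S, T(1, 2)), P(2, 2))
```

Here the reflection point is (2, −2), and `new_S` has the columns x_1, x_r, x_2, that is, the old worst vertex is dropped and the old second vertex moves to the last position.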

For shrinking, the new simplex is

S^{(k + 1)} = S^{(k)} T_{s h r} P,

where

T_{s h r} = \frac{1}{2} I_{n + 1} + \frac{1}{2} e_{1} e^{T},

the permutation matrix

P \in P_{n + 1}

is defined by the ordering condition (3), and

P_{n + 1}

is the set of all possible permutation matrices of order

n + 1

.

Hence, for

k \geq 1

,

S^{(k)} = S^{(k - 1)} T_{k} P^{(k)} = S^{(0)} B_{k},

(4)

where

B_{k} = \prod_{i = 1}^{k} T_{i} P^{(i)} (T_{i} P^{(i)} \in T)

(5)

and

\begin{matrix} T = \{T (α) P_{j} : α \in \{- \frac{1}{2}, \frac{1}{2}\}, j = 1, \dots, n + 1\} \\ \cup \{T_{s h r} P : P \in P_{n + 1}\} \cup \{T (1) P_{j} : j = 1, \dots, n\} \cup \{T (2) P_{1}\} \end{matrix}

(6)

Note that

T

contains

3 n + 3 + (n + 1)!

matrices.

Observe that the transformation matrices

T (α)

,

T_{s h r}

,

T (α) P

and

T_{s h r} P

(

P \in P_{n + 1}

) are nonsingular and have the property that their column sums are equal to one (see, e.g., [16,17]). The latter property implies that

∥T P∥ \geq 1

(

T P \in T

) holds in any induced matrix norm.
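The size of the set in (6) and the column-sum property can be verified by direct enumeration. The following sketch does this for n = 3 with exact rational arithmetic; the helper names are assumptions made for illustration, and only the induced 1-norm (maximum absolute column sum) is checked, not every induced norm.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def perm_matrix(order):
    """Matrix whose c-th column is the unit vector e_{order[c]} (0-based indices)."""
    size = len(order)
    return [[Fraction(int(r == order[c])) for c in range(size)] for r in range(size)]

def operation_set(n):
    """List the matrices of the set (6) for dimension n."""
    def T(alpha):
        M = perm_matrix(list(range(n + 1)))          # identity of order n + 1
        for r in range(n):
            M[r][n] = (1 + Fraction(alpha)) / n
        M[n][n] = -Fraction(alpha)
        return M

    def P(j):                                        # j is counted from 1
        return perm_matrix(list(range(j - 1)) + [n] + list(range(j - 1, n)))

    half = Fraction(1, 2)
    ops = [matmul(T(a), P(j)) for a in (half, -half) for j in range(1, n + 2)]
    # T_shr = I/2 + e_1 e^T / 2
    T_shr = [[half * int(r == c) + half * int(r == 0) for c in range(n + 1)]
             for r in range(n + 1)]
    ops += [matmul(T_shr, perm_matrix(p)) for p in permutations(range(n + 1))]
    ops += [matmul(T(1), P(j)) for j in range(1, n + 1)]
    ops.append(matmul(T(2), P(1)))
    return ops

n = 3
ops = operation_set(n)
column_sums_ok = all(sum(col) == 1 for M in ops for col in zip(*M))
one_norms = [max(sum(abs(x) for x in col) for col in zip(*M)) for M in ops]
```

For n = 3 the list has 3·3 + 3 + 4! = 36 entries, every column sums to one, and consequently every induced 1-norm is at least one.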

We seek conditions that guarantee that

lim_{k \to \infty} x_{i}^{(k)} = \hat{x} (i = 1, 2, \dots, n + 1)

holds for some vector

\hat{x}

. If so, then

{lim}_{k \to \infty} S^{(k)} = \hat{x} e^{T}

and for

k \to \infty

, both

f_{i}^{(k)} \to f (\hat{x})

(

i = 1, 2, \dots, n + 1

) and diam

(S^{(k)}) \to 0

follow. In fact, we prove the convergence of the right infinite matrix product

\prod_{i = 1}^{\infty} T_{i} P^{(i)} (T_{i} P^{(i)} \in T)

to a rank-one matrix of the form

B = w e^{T}

, from which

S^{(k)} = S^{(0)} B_{k} \to \hat{x} e^{T}

(

\hat{x} = S^{(0)} w

) and the speed estimate

diam (S^{(k)}) \leq \sqrt{2} ∥S^{(0)}∥ ∥B_{k} - B∥

(7)

also follow (see also [16,17]).
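A simple worked case may help to fix ideas. Assume, purely for illustration, that every step is a shrink and that the ordering never changes, so that each factor equals T_{shr} (with P = I). Writing E = e_{1} e^{T}, the product can then be evaluated in closed form:

```latex
% Assumed scenario: only shrink steps, no reordering (P = I).
% Since e^{T} e_{1} = 1, the matrix E = e_{1} e^{T} satisfies E^{2} = E.
T_{shr} = \tfrac{1}{2} I_{n + 1} + \tfrac{1}{2} E,
\qquad
B_{k} = T_{shr}^{k} = 2^{-k} I_{n + 1} + (1 - 2^{-k}) E
\;\longrightarrow\; e_{1} e^{T} = B \quad (k \to \infty),
\\
S^{(k)} = S^{(0)} B_{k} \to x_{1}^{(0)} e^{T},
\qquad
B_{k} - B = 2^{-k} (I_{n + 1} - E).
```

In this special case w = e_{1}, the limit point is the best initial vertex, and the error B_{k} − B is halved at every step.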

Counterexamples of [16,17,18] show that even if the simplex sequence

S^{(k)}

converges to some limit

S^{\infty}

, it may happen that diam

(S^{\infty}) > 0

. If diam

(S^{\infty}) = 0

holds, that is

S^{\infty} = \hat{x} e^{T}

for some vector

\hat{x}

, it may happen that

\hat{x}

is not a stationary or minimum point (see McKinnon [13] and also [16,17,18]).

4. Properties of the Transformation Matrices

The spectra of the transformation matrices

T (α) P_{j}

and

T_{s h r} P

are fully characterized in Section 3 of [16]. Furthermore, these matrices have a common similarity form (9). Define the matrix

F = [\begin{matrix} 1 & - e^{T} \\ 0 & I_{n} \end{matrix}] .

(8)

Lemma 1

([16,17]). For all

T_{i} P^{(i)} \in T

, matrix

F^{- 1} T_{i} P^{(i)} F

has the form

F^{- 1} T_{i} P^{(i)} F = [\begin{matrix} 1 & 0 \\ b_{i} & C_{i} \end{matrix}],

(9)

where

b_{i} \in R^{n}

and

C_{i} \in R^{n \times n}

depends on

T_{i} P^{(i)}

.
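Lemma 1 can be checked numerically on a single operation. The sketch below forms F^{-1} M F for M = T(1) P_{2} with n = 2 (a reflection inserted at position 2, written out by hand) and extracts the first row, the vector b and the block C. The function name `similarity_blocks` is an assumption made for illustration.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def similarity_blocks(M):
    """Return (first_row, b, C) of F^{-1} M F for F = [[1, -e^T], [0, I_n]]."""
    size = len(M)
    F = [[Fraction(int(r == c)) for c in range(size)] for r in range(size)]
    F_inv = [row[:] for row in F]
    for c in range(1, size):
        F[0][c] = Fraction(-1)      # first row of F is [1, -1, ..., -1]
        F_inv[0][c] = Fraction(1)   # first row of F^{-1} is [1, 1, ..., 1]
    G = matmul(matmul(F_inv, M), F)
    b = [G[r][0] for r in range(1, size)]
    C = [G[r][1:] for r in range(1, size)]
    return G[0], b, C

# T(1) P_2 for n = 2: columns of T(1) reordered as [column 1, column 3, column 2].
M = [[1, 1, 0],
     [0, 1, 1],
     [0, -1, 0]]
first_row, b, C = similarity_blocks(M)
```

The first row is (1, 0, 0) because the columns of M sum to one, and b vanishes here because this operation leaves the best vertex x_{1} unchanged.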

For a more general result, see Hartfiel [19]. Note that a constant

γ > 0

exists such that

∥b_{i}∥ \leq γ

holds for all

T_{i} P^{(i)} \in T

. For later use, we make the following numbering of the elements

T_{s} P^{(s)} \in T

and their corresponding matrices

C_{s}

:

$T_{s} P^{(s)} \in T$	↔	$C_{s}$
$T (1) P_{j + 1}$	↔	$C_{j} (j = 1, \dots, n - 1)$
$T (2) P_{1}$	↔	$C_{n}$
$T (1) P_{1}$	↔	$C_{n + 1}$
$T (\frac{1}{2}) P_{j}$	↔	$C_{n + 1 + j} (j = 1, \dots, n + 1)$
$T (- \frac{1}{2}) P_{j}$	↔	$C_{2 n + 2 + j} (j = 1, \dots, n + 1)$
$T_{s h r} P$ $(P \in P_{n + 1})$	↔	$C_{3 n + 3 + j} (j = 1, \dots, (n + 1)!)$

where the numbering of the permutations

P \in P_{n + 1}

follows the perms function of Matlab (in actual computations).

5. An Improved Convergence Result

Here, we prove a new convergence theorem where the key condition is numerically checkable at least for low dimensions unlike in [16,17], where the key assumption was algorithmically undecidable and required ways to circumvent this problem. The new result is the basis of the stochastic convergence result of Section 6. The latter theorem may explain, in a sense, the experienced good behavior of the Nelder–Mead method in practice.

Formula (9) implies that

B_{k} = \prod_{j = 1}^{k} T_{i_{j}} P^{(i_{j})} = F L_{k} F^{- 1} (T_{i_{j}} P^{(i_{j})} \in T),

(10)

where

L_{k} = \prod_{j = 1}^{k} [\begin{matrix} 1 & 0 \\ b_{i_{j}} & C_{i_{j}} \end{matrix}],

(11)

and

B_{k}

is convergent if and only if

L_{k}

is convergent to some matrix

[\begin{matrix} 1 & 0 \\ \tilde{x} & \tilde{Y} \end{matrix}] .

We use the following simple result (see, e.g., [16,17]). For

i \geq 1

, let

A_{i} = [\begin{matrix} 1 & 0 \\ b_{i} & C_{i} \end{matrix}] \in R^{(n + 1) \times (n + 1)} (C_{i} \in R^{n \times n}) .

(12)

Lemma 2.

Assume that

∥\prod_{j = 1}^{k} C_{j}∥ \leq c_{k}

,

\sum_{k = 1}^{\infty} c_{k}

is convergent (

< \infty

) and

∥b_{k}∥ \leq γ

for all k. Then,

L_{k} = \prod_{j = 1}^{k} A_{j}

converges and

lim_{k \to \infty} L_{k} = [\begin{matrix} 1 & 0 \\ \tilde{x} & 0 \end{matrix}]

(13)

for some

\tilde{x}

.

Proof.

It is easy to see that

L_{k} = \prod_{j = 1}^{k} A_{j} = [\begin{matrix} 1 & 0 \\ \sum_{i = 1}^{k} (\prod_{j = 1}^{i - 1} C_{j}) b_{i} & \prod_{j = 1}^{k} C_{j} \end{matrix}] = [\begin{matrix} 1 & 0 \\ x_{k} & \prod_{j = 1}^{k} C_{j} \end{matrix}] .

(14)

If

\sum_{k = 1}^{\infty} c_{k}

is convergent, then

c_{k} \to 0

. Hence,

\prod_{j = 1}^{k} C_{j} \to 0

as

k \to \infty

. Since

s_{k} = \sum_{j = 1}^{k} c_{j}

is convergent, for any

ε > 0

there is a number

k_{0} = k_{0} (ε)

such that for

m > k \geq k_{0}

,

|s_{m} - s_{k}| < ε

. Thus, for

m > k \geq k_{0}

, we obtain

∥x_{m} - x_{k}∥ \leq \sum_{i = k + 1}^{m} ∥\prod_{j = 1}^{i - 1} C_{j}∥ ∥b_{i}∥ \leq γ \sum_{i = k + 1}^{m} c_{i - 1} \leq γ ε .

Hence,

x_{k} \to

\tilde{x}

for some

\tilde{x}

. □
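The block form (14) used at the start of the proof follows by induction on k. The base case and the induction step, written with the notation of (12) and (14), are:

```latex
% Base case k = 2 and the induction step behind formula (14).
A_{1} A_{2}
  = \begin{bmatrix} 1 & 0 \\ b_{1} & C_{1} \end{bmatrix}
    \begin{bmatrix} 1 & 0 \\ b_{2} & C_{2} \end{bmatrix}
  = \begin{bmatrix} 1 & 0 \\ b_{1} + C_{1} b_{2} & C_{1} C_{2} \end{bmatrix},
\\
L_{k + 1} = L_{k} A_{k + 1}
  = \begin{bmatrix} 1 & 0 \\ x_{k} & \prod_{j = 1}^{k} C_{j} \end{bmatrix}
    \begin{bmatrix} 1 & 0 \\ b_{k + 1} & C_{k + 1} \end{bmatrix}
  = \begin{bmatrix} 1 & 0 \\
      x_{k} + \bigl( \prod_{j = 1}^{k} C_{j} \bigr) b_{k + 1} &
      \prod_{j = 1}^{k + 1} C_{j} \end{bmatrix}.
```

The lower-left entry of L_{k+1} is therefore x_{k+1} = \sum_{i = 1}^{k + 1} (\prod_{j = 1}^{i - 1} C_{j}) b_{i}, in agreement with (14).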

If

∥C_{j}∥ \leq q < 1

for

j \geq 1

, then

∥\prod_{j = 1}^{k} C_{j}∥ \leq q^{k}

and the series

\sum_{i = 1}^{\infty} q^{i}

is convergent.

Assume now that

n \geq 2

and consider all possible products of

T_{i_{j}} P^{(i_{j})}

of fixed length ℓ (

ℓ \geq 2

). For each product

\prod_{j = 1}^{ℓ} T_{i_{j}} P^{(i_{j})}

, there is a corresponding product

\prod_{j = 1}^{ℓ} C_{i_{j}}

. Define the sets

C = \{C_{i} : T_{i} P^{(i)} \in T\}

and

C_{ℓ} = \{\prod_{j = 1}^{ℓ} C_{i_{j}} : C_{i_{j}} \in C\} = \{Y_{i} : i = 1, 2, \dots, N\},

where

N = {(3 n + 3 + (n + 1)!)}^{ℓ}

. Consider the norm of the elements of

C_{ℓ}

and decompose the set

C_{ℓ}

in the form

C_{ℓ} =

C_{ℓ}^{q} \cup C_{ℓ}^{Q}

, where

C_{ℓ}^{q} = \{\prod_{j = 1}^{ℓ} C_{i_{j}} \in C_{ℓ} : ∥\prod_{j = 1}^{ℓ} C_{i_{j}}∥ \leq q\} (q < 1),

C_{ℓ}^{Q} = \{\prod_{j = 1}^{ℓ} C_{i_{j}} \in C_{ℓ} : q < ∥\prod_{j = 1}^{ℓ} C_{i_{j}}∥ \leq Q\} (Q > 1),

0 < q < 1

is a fixed number, and for simplicity, Q is selected such that

∥C_{i}∥ \leq Q

also holds for all

C_{i} \in C

. That such a q exists and that

C_{ℓ}^{Q}

and

C_{ℓ}^{q}

are not empty follows from [16]. This fact is also indicated by Equation (21).

We investigate the product (11). For any

k \geq ℓ

, write

k = m ℓ + r

with

m, r \in N

and

0 \leq r < ℓ

. Note that

m = ⌊\frac{k}{ℓ}⌋

, where

⌊\cdot⌋

stands for the floor function. Then,

\prod_{j = 1}^{k} C_{i_{j}} = [\prod_{j = 1}^{m} (C_{i_{(j - 1) ℓ + 1}} \dots C_{i_{j ℓ}})] C_{i_{m ℓ + 1}} \dots C_{i_{m ℓ + r}}

(15)

and

∥\prod_{j = 1}^{k} C_{i_{j}}∥ \leq (\prod_{j = 1}^{m} ∥C_{i_{(j - 1) ℓ + 1}} \dots C_{i_{j ℓ}}∥) Q^{r} .

(16)

Assume that

r_{1} (m)

ℓ-products belong to

C_{ℓ}^{q}

and

r_{2} (m)

ℓ-products belong to

C_{ℓ}^{Q}

. Clearly

r_{1} (m) + r_{2} (m) = m

. Then,

∥\prod_{j = 1}^{k} C_{i_{j}}∥ \leq q^{r_{1} (m)} Q^{r_{2} (m) + ℓ - 1}

. There exists an integer

κ \geq 1

such that

q^{1 - κ} \leq Q \leq q^{- κ}

. Hence,

∥\prod_{j = 1}^{k} C_{i_{j}}∥ \leq Q^{ℓ - 1} q^{r_{1} (m) - κ r_{2} (m)}

. Moreover, assume that

r_{1} (m) - κ r_{2} (m) \geq μ m

for some

μ \in (0, 1)

. This assumption guarantees that the elements from

C_{ℓ}^{q}

counterbalance the effect of those from

C_{ℓ}^{Q}

. It also implies the density inequalities

r_{1} (m) \geq \frac{μ + κ}{1 + κ} m > \frac{1 - μ}{1 + κ} m \geq r_{2} (m) .

It follows that

∥\prod_{j = 1}^{k} C_{i_{j}}∥ \leq Q^{ℓ - 1} q^{μ m} = Q^{ℓ - 1} {(q^{μ})}^{m} =: c_{k} .

Now,

q^{μ} < 1

and

\frac{k}{ℓ} - 1 < m = \frac{k - r}{ℓ} \leq \frac{k}{ℓ}

. Hence,

c_{k} := Q^{ℓ - 1} {(q^{μ})}^{m} \leq Q^{ℓ - 1} {(q^{μ})}^{\frac{k}{ℓ} - 1} = [Q^{ℓ - 1} q^{- μ}] {(q^{\frac{μ}{ℓ}})}^{k} = Γ_{1} {(q^{\frac{μ}{ℓ}})}^{k} \to 0,

since

q^{\frac{μ}{ℓ}} < 1

and

\sum_{k = 1}^{\infty} c_{k} \leq Γ_{1} \sum_{k = 1}^{\infty} {(q^{\frac{μ}{ℓ}})}^{k} < \infty

. Hence, by Lemma 2

lim_{k \to \infty} L_{k} = [\begin{matrix} 1 & 0 \\ \tilde{x} & 0 \end{matrix}] = \tilde{L}

holds for some vector

\tilde{x}

. Since there exists a constant

γ > 0

such that

∥b_{i}∥ \leq γ

,

∥\tilde{x} - x_{k}∥ = ∥\sum_{i = k + 1}^{\infty} (\prod_{j = 1}^{i - 1} C_{i_{j}}) b_{i}∥ \leq γ Γ_{1} \sum_{i = k}^{\infty} {(q^{\frac{μ}{ℓ}})}^{i} \leq Γ_{2} {(q^{\frac{μ}{ℓ}})}^{k},

{∥L_{k} - \tilde{L}∥}_{ϑ} \leq Γ_{3} {(q^{\frac{μ}{ℓ}})}^{k}

holds with a suitable constant

Γ_{3} > 0

. It follows that

B_{k} \to F [\begin{matrix} 1 & 0 \\ \tilde{x} & 0 \end{matrix}] F^{- 1} = [\begin{matrix} 1 - e^{T} \tilde{x} \\ \tilde{x} \end{matrix}] e^{T} = w e^{T} = B

(17)

and

{∥B_{k} - B∥}_{ϑ} \leq Γ_{4} cond (F) {(q^{\frac{μ}{ℓ}})}^{k} .

(18)
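To get a feeling for the constants in this derivation, the following sketch computes the integer κ with q^{1−κ} ≤ Q ≤ q^{−κ}, the resulting norm bound, and the minimal fraction (μ + κ)/(1 + κ) of ℓ-products that must lie in C_{ℓ}^{q} according to the density inequalities. The function names and the values Q = 2 and μ = 0.1 are assumptions made for illustration; only q = 0.99 is taken from the text.

```python
import math

def kappa_for(q, Q):
    """Smallest integer kappa >= 1 with Q <= q**(-kappa), given 0 < q < 1 < Q.

    For this choice q**(1 - kappa) <= Q holds as well.
    """
    return max(1, math.ceil(math.log(Q) / -math.log(q)))

def product_norm_bound(q, Q, ell, r1, r2):
    """Q**(ell - 1) * q**(r1 - kappa * r2): bound for a product made of r1 blocks
    with norm <= q, r2 blocks with norm <= Q and at most ell - 1 leftover factors."""
    return Q ** (ell - 1) * q ** (r1 - kappa_for(q, Q) * r2)

def min_small_fraction(q, Q, mu):
    """Lower bound (mu + kappa) / (1 + kappa) on r1 / m from the density inequalities."""
    kappa = kappa_for(q, Q)
    return (mu + kappa) / (1 + kappa)

# q = 0.99 as in the text; Q = 2 and mu = 0.1 are made-up illustrative values.
kappa = kappa_for(0.99, 2.0)
fraction = min_small_fraction(0.99, 2.0, 0.1)
```

With these illustrative values κ = 69, so at least a fraction (0.1 + 69)/70 ≈ 0.987 of the ℓ-products would have to belong to C_{ℓ}^{q}; a q close to one makes the density condition demanding when Q is noticeably larger than one.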

We can summarize the obtained results in the following.

Theorem 1.

Assume that

n \geq 2

,

S^{(0)}

is nondegenerate,

ℓ \geq 2

is fixed and

C_{ℓ}^{q}

is not empty. Let

r_{1} (⌊\frac{k}{ℓ}⌋)

be the number of ℓ-products that belong to

C_{ℓ}^{q}

and

r_{2} (⌊\frac{k}{ℓ}⌋)

be the number of those ℓ-products that belong to

C_{ℓ}^{Q}

during the first k iterations of the Nelder–Mead method. Moreover, assume that for

κ \in N

,

q^{1 - κ} \leq Q \leq q^{- κ}

and for some

μ \in (0, 1)

,

r_{1} (⌊\frac{k}{ℓ}⌋) \geq μ ⌊\frac{k}{ℓ}⌋ + κ r_{2} (⌊\frac{k}{ℓ}⌋)

holds (

k \geq k_{0}

). Then, the Nelder–Mead algorithm converges in the sense that

lim_{k \to \infty} x_{j}^{(k)} = \hat{x} (j = 1, \dots, n + 1)

(19)

with a convergence speed proportional to

O (q^{\frac{μ}{ℓ} k})

. If f is continuous at

\hat{x}

, then

lim_{k \to \infty} f (x_{j}^{(k)}) = f (\hat{x}) (j = 1, \dots, n + 1)

(20)

holds as well.

A simple but time-consuming computation of the elements of

C_{ℓ}

shows the feasibility of the assumptions of Theorem 1. Equation (21) shows the ratios of

|C_{ℓ}^{q}| / |C_{ℓ}|

for the specified

(n, ℓ)

pairs, in case of spectral norm and

q = 0.99

.

\begin{matrix} n ∖ ℓ & 2 & 3 & 4 & 5 \\ 2 & 0.7111 & 0.8361 & 0.9020 & 0.9409 \\ 3 & 0.8518 & 0.9374 & 0.9738 & 0.9891 \\ 4 & 0.8507 & 0.9760 & 0.9956 \\ 5 & 0.9641 & 0.9973 \\ 6 & 0.9935 \end{matrix}

(21)
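The kind of computation behind Equation (21) can be sketched for n = 2, where the blocks C_{i} are 2 × 2 and the spectral norm has a closed form. The code below is an illustration under assumptions, not the author's program: the helper names are hypothetical, the blocks C_{i} are taken as the lower-right blocks of F^{-1} T_{i} P^{(i)} F as in Lemma 1, and all ordered ℓ-tuples are counted.

```python
from fractions import Fraction
from itertools import permutations, product
from math import sqrt

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def perm_matrix(order):
    """Matrix whose c-th column is the unit vector e_{order[c]} (0-based indices)."""
    size = len(order)
    return [[Fraction(int(r == order[c])) for c in range(size)] for r in range(size)]

def operation_matrices(n):
    """The 3n + 3 + (n + 1)! matrices of the set (6)."""
    def T(alpha):
        M = perm_matrix(list(range(n + 1)))
        for r in range(n):
            M[r][n] = (1 + Fraction(alpha)) / n
        M[n][n] = -Fraction(alpha)
        return M

    def P(j):  # j is counted from 1
        return perm_matrix(list(range(j - 1)) + [n] + list(range(j - 1, n)))

    half = Fraction(1, 2)
    ops = [matmul(T(1), P(j)) for j in range(1, n + 1)]
    ops.append(matmul(T(2), P(1)))
    ops += [matmul(T(a), P(j)) for a in (half, -half) for j in range(1, n + 2)]
    T_shr = [[half * int(r == c) + half * int(r == 0) for c in range(n + 1)]
             for r in range(n + 1)]
    ops += [matmul(T_shr, perm_matrix(p)) for p in permutations(range(n + 1))]
    return ops

def C_block(M):
    """Lower-right n x n block of F^{-1} M F; its (r, c) entry is M[r][c] - M[r][0]."""
    size = len(M)
    return [[M[r][c] - M[r][0] for c in range(1, size)] for r in range(1, size)]

def spectral_norm_2x2(M):
    """Largest singular value of a 2 x 2 matrix via trace and determinant of M^T M."""
    a, b = float(M[0][0]), float(M[0][1])
    c, d = float(M[1][0]), float(M[1][1])
    s = a * a + b * b + c * c + d * d
    det = a * d - b * c
    return sqrt((s + sqrt(max(0.0, s * s - 4.0 * det * det))) / 2.0)

def count_small_products(ell, q=0.99):
    """Count the ell-products of C blocks (n = 2) with spectral norm <= q."""
    blocks = [C_block(M) for M in operation_matrices(2)]
    small = total = 0
    for combo in product(blocks, repeat=ell):
        Y = combo[0]
        for factor in combo[1:]:
            Y = matmul(Y, factor)
        total += 1
        if spectral_norm_2x2(Y) <= q:
            small += 1
    return small, total

small, total = count_small_products(ell=2)
ratio = small / total
```

For (n, ℓ) = (2, 2) there are 15² = 225 products; Equation (21) reports the ratio 0.7111 for this pair, which this sketch is meant to approximate under the stated assumptions.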

The greater the ratio

|C_{ℓ}^{q}| / |C_{ℓ}|

, the better the chance for convergence since more elements can be selected from

C_{ℓ}^{q}

than from

C_{ℓ}^{Q}

. For any

n \geq 2

, there are problems on which the NM algorithm does not converge in the above sense (see [16]). Hence, in the general case, the density assumption seems to be necessary.

Corollary 1.

For

n = 2, 3, 4, 5, 6

, Equation (21) implies the convergence of the Nelder–Mead method under the density condition of Theorem 1.

For strictly convex functions

f : R^{n} \to R

, Lagarias et al. proved ([10], Lemma 3.5) that no shrinking occurs when the Nelder–Mead algorithm is applied to f. The following observation is useful when certain operations or steps of the Nelder–Mead method can be ruled out for a given function f.

Remark 1.

Theorem 1 remains true, if sets

T

,

C

,

C_{ℓ}^{q}

and

C_{ℓ}^{Q}

are replaced by the subsets

\tilde{T} \subset T

,

\tilde{C} = \{C_{i} : T_{i} P^{(i)} \in \tilde{T}\}

and

{\tilde{C}}_{ℓ} = \{\prod_{j = 1}^{ℓ} C_{i_{j}} : C_{i_{j}} \in \tilde{C}\}

, such that

{\tilde{C}}_{ℓ} = {\tilde{C}}_{ℓ}^{q} \cup {\tilde{C}}_{ℓ}^{Q}

,

{\tilde{C}}_{ℓ}^{q} = \{\prod_{j = 1}^{ℓ} C_{i_{j}} \in {\tilde{C}}_{ℓ} : ∥\prod_{j = 1}^{ℓ} C_{i_{j}}∥ \leq q\},

{\tilde{C}}_{ℓ}^{Q} = \{\prod_{j = 1}^{ℓ} C_{i_{j}} \in {\tilde{C}}_{ℓ} : q < ∥\prod_{j = 1}^{ℓ} C_{i_{j}}∥ \leq Q\}

and

{\tilde{C}}_{ℓ}^{q}

is nonempty.

Remark 2.

Assume that

n = 2

and the operation set

T

is restricted to

\tilde{T} = T ∖ \{T_{s h r} P : P \in P_{n + 1}\}

(no shrinking occurs). For

q = 0.99

, the ratios

|{\tilde{C}}_{ℓ}^{q}| / |{\tilde{C}}_{ℓ}|

are given in the next equation

\begin{matrix} ℓ & 2 & 3 & 4 & 5 & 6 & 7 \\ |{\tilde{C}}_{ℓ}^{q}| / |{\tilde{C}}_{ℓ}| & 0.3456 & 0.4691 & 0.5468 & 0.6143 & 0.6715 & 0.7187 \end{matrix}

(22)

Hence, we have convergence for strictly convex functions

f : R^{2} \to R

under the conditions of Theorem 1.

Theorem 1 is based on the behavior of ℓ consecutive steps of the Nelder–Mead method. The difference between Theorem 1 and Theorem 9 of [16] is the following. In the case of Theorem 9 of [16], we identified a subset

W_{1} \subset T

and constructed a matrix norm

{∥\cdot∥}_{ϑ}

, such that for every

T_{i} P^{(i)} \in W_{1}

,

{∥C_{i}∥}_{ϑ} \leq q < 1

held. The existence of such norm was related to an algorithmically undecidable problem. The ratio of operations from

W_{1}

and

T ∖ W_{1}

then decided the convergence. Here, in Theorem 1, we avoided the construction of the matrix norm

{∥\cdot∥}_{ϑ}

, by computing the (spectral) norms of ℓ consecutive steps (products of ℓ operators) and sorting them into two subsets

C_{ℓ}^{q}

and

C_{ℓ}^{Q}

. The rest of the proof was quite similar to that of Theorem 9 of [16]. The norms of ℓ-products can be easily computed, although the computation time and required memory quickly increase with n and ℓ.

If one can show that for a given function f, the NM method takes only steps that belong to

C_{ℓ}^{q}

for some ℓ, then the convergence immediately follows. However, if this is not the case, then one either has to impose some additional condition as in Theorem 1 or seek some statistical characterization such as Theorem 2 of the next section.

6. A Random Convergence Result

The counterexamples of [16,17,18] showed that there was no sure convergence for the NM method. One may ask, however, if Theorem 1 is sharp enough in some sense. Here, we study a simple random model using the proof of Theorem 1. Formulas (15) and (16) imply that it is enough to study the infinite product

\prod_{j = 1}^{\infty} C_{i_{j}} = [\prod_{j = 1}^{\infty} (C_{i_{(j - 1) ℓ + 1}} \dots C_{i_{j ℓ}})] = \prod_{j = 1}^{\infty} Y_{i_{j}} (Y_{i_{j}} \in C_{ℓ}) .

Assume that the ℓ-products

Y_{i} \in C_{ℓ}

are randomly chosen with the probability

p_{i} \geq 0

and

\sum_{i = 1}^{N} p_{i} = 1

. Moreover, assume that the subsequent ℓ-products

C_{i_{(j - 1) ℓ + 1}} \dots C_{i_{j ℓ}}

(

T_{i_{(j - 1) ℓ + 1}} \dots T_{i_{j ℓ}}

) are randomly chosen independently of each other.

Let

a_{i} = {∥Y_{i}∥}_{2}

(

Y_{i} \in C_{ℓ}

,

i = 1, \dots, N

). Note that

a_{i} > 0

for

1 \leq i \leq N

and there are numbers

a_{j}

such that

a_{j} > 1

(see also Section 3 of [16]). Let X be a random variable and let

a_{1}, a_{2}, \dots, a_{N}

be the values which it assumes. Let

P (X = a_{i}) = p_{i}

(

1 \leq i \leq N

) be the probability distribution of X. Assume that for the expected value of X,

μ = E X = \sum_{i = 1}^{N} p_{i} a_{i} < 1

holds.

If X is uniformly distributed, the expected values

μ = E X

belonging to the cases of Equation (21) are given in the following equation (with four-decimal digit precision).

\begin{matrix} n ∖ ℓ & 2 & 3 & 4 & 5 \\ 2 & 0.8512 & 0.6515 & 0.4931 & 0.3725 \\ 3 & 0.7435 & 0.4891 & 0.3202 & 0.2093 \\ 4 & 0.5963 & 0.3305 & 0.1825 \\ 5 & 0.5704 & 0.2918 \\ 6 & 0.6015 \end{matrix}

(23)

Hence, the condition

μ = E X < 1

holds in these cases.

We need the following simple results.

Lemma 3.

Let the positive random variables

{\{X_{i}\}}_{i = 1}^{\infty}

be independent and identically distributed with the same distribution as X, and assume that

μ = E X < 1

. Then,

Z_{k} = \prod_{i = 1}^{k} X_{i} \to 0

holds with a probability of one. Moreover, there exist numbers

{\{c_{k}\}}_{k = 1}^{\infty}

such that

\sum_{k = 1}^{\infty} c_{k} < \infty

and only a finite number of the events

\{Z_{k} \geq c_{k}\}

can occur with a probability of one.

Proof.

The independence of

X_{i}

’s implies that

E Z_{k} = \prod_{i = 1}^{k} (E X_{i}) = μ^{k}

. For any

c_{k} > 0

, it follows from the Markov inequality that

P (Z_{k} \geq c_{k}) \leq \frac{E Z_{k}}{c_{k}} = \frac{μ^{k}}{c_{k}}

. Select

c_{k} = μ^{k / 2}

(

k \geq 1

). Then,

\sum_{k = 1}^{\infty} P (Z_{k} \geq c_{k}) \leq \sum_{k = 1}^{\infty} c_{k} = \frac{\sqrt{μ}}{1 - \sqrt{μ}} < \infty .

It follows from the Borel–Cantelli lemma (see, e.g., Theorem 1.5.1 of [20] or Borovkov [21]) that

Z_{k} \to 0

with a probability of one and only a finite number of the events

\{Z_{k} \geq c_{k}\}

(

k = 1, 2, \dots

) can occur. □
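The Markov step of this proof can be illustrated on a toy distribution. In the sketch below X takes only two values, so that P(Z_{k} ≥ c_{k}) with c_{k} = μ^{k/2} can be computed exactly from the binomial distribution and compared with the bound μ^{k}/c_{k} = μ^{k/2}. The two values 0.5 and 1.5, their probabilities 0.7 and 0.3, and the function name are assumptions made for illustration.

```python
from math import comb, log

def tail_probability(k, values=(0.5, 1.5), probs=(0.7, 0.3)):
    """Return (P(Z_k >= mu**(k/2)), mu**(k/2)) for Z_k = X_1 * ... * X_k.

    X is a hypothetical two-valued random variable; the X_i are independent.
    """
    (lo, hi), (p_lo, p_hi) = values, probs
    mu = lo * p_lo + hi * p_hi            # expected value, here 0.8 < 1
    log_threshold = 0.5 * k * log(mu)     # log of c_k = mu**(k/2)
    total = 0.0
    for j in range(k + 1):                # j = number of factors equal to `hi`
        log_z = (k - j) * log(lo) + j * log(hi)
        if log_z >= log_threshold:
            total += comb(k, j) * p_hi ** j * p_lo ** (k - j)
    return total, mu ** (k / 2)

# Exact tail probabilities and Markov bounds for the first few k.
table = [(k,) + tail_probability(k) for k in range(1, 11)]
```

For k = 1 the event {Z_{1} ≥ c_{1}} is {X_{1} = 1.5}, so the exact probability is 0.3, well below the Markov bound √0.8 ≈ 0.894; the bounds form a convergent geometric series, which is what the Borel–Cantelli lemma requires.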

Corollary 2.

Under the conditions of Lemma 3, there exist numbers

{\tilde{c}}_{k} \geq 0

(

k \geq 1

) such that

Z_{k} \leq {\tilde{c}}_{k}

(

k \geq 1

) and

\sum_{k = 1}^{\infty} {\tilde{c}}_{k} < \infty

.

Proof.

There is a random index

n (ω)

such that

P (n (ω) < \infty) = 1

, and

Z_{k} < c_{k}

if

k \geq n (ω)

. Since

\tilde{a} = {max}_{i} a_{i} > 1

and for any k,

P (Z_{k} = {\tilde{a}}^{k}) \geq P (X_{1} = \tilde{a}, \dots, X_{k} = \tilde{a}) > 0

, there is no fixed number

n_{0}

such that

Z_{k} < ε

, for

k \geq n_{0}

and for all

ε > 0

. However, for

k < n (ω)

,

Z_{k} \leq {\tilde{c}}_{k} := {\tilde{a}}^{k}

and

Z_{k} < {\tilde{c}}_{k} := c_{k}

for

k \geq n (ω)

. Thus, it follows that

Z_{k} \leq {\tilde{c}}_{k}

(

k \geq 1

) and

\sum_{k = 1}^{\infty} {\tilde{c}}_{k} < \infty

. □

Theorem 2.

Assume that

n \geq 2

,

S^{(0)}

is nondegenerate,

ℓ \geq 2

is fixed,

C_{ℓ}^{q} \neq \emptyset

, and the ℓ-products

Y_{i} \in C_{ℓ}

are randomly chosen with probability

p_{i} \geq 0

and

\sum_{i = 1}^{N} p_{i} = 1

. Furthermore, assume that the subsequent ℓ-products are randomly chosen independently of each other. If

μ = \sum_{i = 1}^{N} a_{i} p_{i} < 1

, then for the Nelder–Mead algorithm,

lim_{k \to \infty} x_{j}^{(k)} = \hat{x} (j = 1, \dots, n + 1)

(24)

holds with a probability of one. If, in addition, f is continuous at

\hat{x}

, then

lim_{k \to \infty} f (x_{j}^{(k)}) = f (\hat{x}) (j = 1, \dots, n + 1)

(25)

also holds with a probability of one.

Proof.

Let X be a random variable defined on the positive numbers

{\{a_{i}\}}_{i = 1}^{N}

with the probability distribution

{\{p_{i}\}}_{i = 1}^{N}

. By assumption,

μ = E X = \sum_{i = 1}^{N} a_{i} p_{i} < 1

. Lemma 3 and Corollary 2 imply that

∥\prod_{j = 1}^{k ℓ} C_{i_{j}}∥ \leq \prod_{j = 1}^{k} ∥C_{i_{(j - 1) ℓ + 1}} \dots C_{i_{j ℓ}}∥ = \prod_{j = 1}^{k} a_{s_{j}} \leq {\tilde{c}}_{k} (k \geq 1)

and

\sum_{k = 1}^{\infty} {\tilde{c}}_{k} < \infty

. The result now follows from Lemma 2, the proof of Theorem 1 and the “continuity theorem” (see, e.g., Borovkov [21], Theorem 6.1.4, p. 134). □

Note that assumption

r_{1} (⌊\frac{k}{ℓ}⌋) \geq μ ⌊\frac{k}{ℓ}⌋ + κ r_{2} (⌊\frac{k}{ℓ}⌋)

does not occur here. Instead, we use the assumption

μ = E X < 1

. Furthermore, note that the smaller the

μ

, the faster the convergence.

If we assume a uniform distribution for X, then we have the following.

Corollary 3.

Under the assumption of uniform distribution, Equation (23) implies Theorem 2 for

n = 2, 3, 4, 5, 6

.

Theorem 2 and Corollary 3 ensure the convergence of the NM algorithm with a probability of one, although the limit point is not necessarily a minimum point of f. Concerning the speed of convergence, we note that some classical stochastic approximation algorithms (Robbins–Monro, Kiefer–Wolfowitz methods) also converge with a probability of one (see, e.g., [22]).

Remark 3.

Theorem 2 remains true, if the sets

C_{ℓ}^{q}

and

C_{ℓ}

are replaced by the subsets

{\tilde{C}}_{ℓ}^{q}

and

{\tilde{C}}_{ℓ}

, respectively, and the probability assumptions for these also hold.

Remark 4.

Assume that

n = 2

and the operation set

T

is restricted to

\tilde{T} = T ∖ \{T_{s h r} P : P \in P_{n + 1}\}

(no shrinking occurs). For

q = 0.99

and a uniform distribution, the expected values μ are given in the next equation

\begin{matrix} ℓ & 2 & 3 & 4 & 5 & 6 & 7 \\ μ & 1.2961 & 1.2139 & 1.1215 & 1.0334 & 0.9489 & 0.8701 \end{matrix}

(26)

Hence, Theorem 2 holds for

ℓ = 6, 7

. It follows that for strictly convex functions

f : R^{2} \to R

, the Nelder–Mead method converges with a probability of one under the assumption of uniform distribution.

7. Conclusions

Although the convergence of the Nelder–Mead simplex method to a minimum point of a function f cannot be guaranteed in general, the main result indicates that in the stochastic sense, it converges to a point at which the value of f is less than the best function value at the start. It also may explain the good behavior of the Nelder–Mead method on average.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

The author is highly indebted to László Szeidl for his help and comments on Section 6. The author is also indebted to the unknown referees for their observations and remarks that improved the paper.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Nelder, J.A.; Mead, R. A simplex method for function minimization. Comput. J. 1965, 7, 308–313.
2. Larson, J.; Menickelly, M.; Wild, S. Derivative-free optimization methods. Acta Numer. 2019, 28, 287–404.
3. Walters, F.; Morgan, S.; Parker, L.P.L.; Deming, S. Sequential Simplex Optimization; CRC Press LLC: Boca Raton, FL, USA, 1991.
4. Kelley, C. Iterative Methods for Optimization; Society for Industrial and Applied Mathematics (SIAM): Philadelphia, PA, USA, 1999.
5. Conn, A.; Scheinberg, K.; Vicente, L. Introduction to Derivative-Free Optimizations; Society for Industrial and Applied Mathematics (SIAM): Philadelphia, PA, USA, 2009.
6. Audet, C.; Hare, W. Derivative-Free and Blackbox Optimization; Springer: Berlin/Heidelberg, Germany, 2017.
7. Kochenderfer, M.; Wheeler, T. Algorithms for Optimization; The MIT Press: Cambridge, MA, USA, 2019.
8. Tekile, H.; Fedrizzi, M.; Brunelli, M. Constrained Eigenvalue Minimization of Incomplete Pairwise Comparison Matrices by Nelder-Mead Algorithm. Algorithms 2021, 14, 222.
9. Nash, C.J. On Best Practice Optimization Methods in R. J. Stat. Softw. 2014, 60, 1–14.
10. Lagarias, J.; Reeds, J.; Wright, M.; Wright, P. Convergence properties of the Nelder-Mead simplex method in low dimensions. SIAM J. Optimiz. 1998, 9, 112–147.
11. Lagarias, J.; Poonen, B.; Wright, M. Convergence of the restricted Nelder-Mead algorithm in two dimensions. SIAM J. Optimiz. 2012, 22, 501–532.
12. Wright, M. Nelder, Mead, and the other simplex method. Extra Volume: Optimization Stories. Doc. Math. 2012, 271–276.
13. McKinnon, K. Convergence of the Nelder-Mead simplex method to a nonstationary point. SIAM J. Optimiz. 1998, 9, 148–158.
14. Kelley, C. Detection and remediation of stagnation in the Nelder-Mead algorithm using an sufficient decrease condition. SIAM J. Optimiz. 1999, 10, 43–55.
15. Han, L.; Neumann, M. Effect of dimensionality on the Nelder-Mead simplex method. Optim. Method. Softw. 2006, 21, 1–16.
16. Galántai, A. Convergence of the Nelder-Mead method. Numer. Algorithms 2022, 90, 1043–1072.
17. Galántai, A. Convergence theorems for the Nelder-Mead method. J. Comput. Appl. Mech. 2020, 15, 115–133.
18. Galántai, A. A convergence analysis of the Nelder-Mead simplex method. Acta Polytech. Hung. 2021, 18, 93–105.
19. Hartfiel, D. Nonhomogeneous Matrix Products; World Scientific: Singapore, 2002.
20. Chandra, T. The Borel-Cantelli Lemma; Springer: Berlin/Heidelberg, Germany, 2012.
21. Borovkov, A. Probability Theory; Springer: Berlin/Heidelberg, Germany, 2013.
22. Kushner, H.; Clark, D. Stochastic Approximation Methods for Constrained and Unconstrained Systems; Springer: New York, NY, USA, 1978.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
