Article

On the Convergence Rate of Quasi-Newton Methods on Strongly Convex Functions with Lipschitz Gradient

by Vladimir Krutikov 1,2, Elena Tovbis 3, Predrag Stanimirović 1,4 and Lev Kazakovtsev 1,3,*
1 Laboratory “Hybrid Methods of Modeling and Optimization in Complex Systems”, Siberian Federal University, 79 Svobodny Prospekt, Krasnoyarsk 660041, Russia
2 Department of Applied Mathematics, Kemerovo State University, 6 Krasnaya Street, Kemerovo 650043, Russia
3 Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, 31, Krasnoyarskii Rabochii Prospekt, Krasnoyarsk 660037, Russia
4 Faculty of Sciences and Mathematics, University of Niš, 18000 Niš, Serbia
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(23), 4715; https://doi.org/10.3390/math11234715
Submission received: 21 October 2023 / Revised: 19 November 2023 / Accepted: 20 November 2023 / Published: 21 November 2023
(This article belongs to the Section Mathematics and Computer Science)

Abstract
The main results on the convergence rate of quasi-Newton minimization methods were obtained under the assumption that the method operates in the region of the extremum of the function, where there is a stable quadratic representation of the function. Methods based on the quadratic model of the function in the extremum area show significant advantages over classical gradient methods. However, when solving a specific problem with a quasi-Newton method, a huge number of iterations occur outside the extremum area, where there is no stable quadratic approximation of the function. In this paper, we study the convergence rate of quasi-Newton-type methods on strongly convex functions with a Lipschitz gradient, without using local quadratic approximations of the function based on the properties of its Hessian. We prove that quasi-Newton methods converge on strongly convex functions with a Lipschitz gradient at the rate of a geometric progression, and that the estimate of the convergence rate improves as the number of iterations grows, which reflects the fact that the learning (adaptation) effect accumulates as the method operates. Another important fact discovered during the theoretical study is the ability of quasi-Newton methods to eliminate the background that slows down the convergence rate. This elimination is achieved through a certain linear transformation that normalizes the elongation of the function level surfaces in different directions. All studies were carried out without any assumptions regarding the matrix of second derivatives of the function being minimized.

1. Introduction

Quasi-Newton (QN) methods for solving nonlinear optimization problems are based on the idea of reconstructing the matrix of second derivatives of a function from its gradients. The reconstructed matrix is used similarly to the second derivative matrix in Newton’s method. Quasi-Newton methods are effective tools for solving smooth minimization problems. Methods from the QN class are less costly than Newton’s method for large-scale optimization problems because their iterations do not use second-order derivatives. QN methods are applied in various areas, such as physics, biology, engineering, geophysics, chemistry, and industry, to solve nonlinear systems of equations. QN methods can be applied in deep learning for empirical risk minimization, where both the number of samples and the number of variables are large [1,2,3]. In microscopy, QN methods help to achieve high-resolution imaging [4]. In modeling the spread of infections, QN is useful for identifying unknown model coefficients [5]. QN methods are also useful for modeling complex crack propagation [6], fluid–structure interaction [7,8,9], melting and solidification of alloys [10], heat transfer systems [11], etc.
Nowadays, there are a significant number of matrix reconstruction formulas in QN methods, and hundreds of papers have been written on quasi-Newton methods (see, for example, [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] and their bibliographies). The first QN method was proposed in [20] and improved in [24]; this matrix update formula is known as DFP (Davidon–Fletcher–Powell). The Symmetric Rank 1 (SR1) method [19,20] is another way to update the Hessian. Today, it is generally accepted [12,13,14,30] that the BFGS (Broyden–Fletcher–Goldfarb–Shanno) matrix updating formula [19,23,26,31] is the most efficient in the family of QN methods.
The main results of the convergence of quasi-Newton methods in the extremum region are given in [13,15]. These results refer to the convergence rate of quasi-Newton minimization methods under the assumption that the method operates in the extremum area of the function, where there is a stable quadratic representation of the function and where methods based on the quadratic model show significant advantages over standard gradient methods.
An incremental quasi-Newton method with a local superlinear convergence rate is presented in [32,33]. The proposed incremental algorithm reduces the computational cost by restricting the update to a single function per iteration and, relative to incremental second-order methods, by removing the need to compute the inverse of the Hessian.
The authors in [34] showed that a range of QN methods are first-order methods in the Nesterov definition [35]. They extended the complexity analysis for smooth strongly convex problems to finite dimensions and showed that in a worst-case scenario, the local superlinear or faster convergence rates of QN methods cannot be improved unless the number of iterations exceeds half of the problem size.
In [36], the authors confirmed certain superlinear convergence rates for QN methods, depending on the problem size and a specifically defined condition number. The analysis was developed based on the trace potential function, which was scaled by the logarithm of the determinant of the inverse Hessian approximation to extend the proof to the general nonlinear case. The results of [36] were further improved in [37], where the authors demonstrated that the convergence rate of the BFGS method depends only on the product of the problem dimensionality and the logarithm of its condition number.
Another analysis of the local non-asymptotic superlinear convergence of the DFP and BFGS methods was presented in [38]. The authors showed that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of (1/k)^{k/2}, where k is the number of performed iterations.
A limited-memory version of the BFGS method, L-BFGS [39], was proposed to handle high-dimensional problems. The algorithm stores only a few vectors that represent the approximation of the Hessian instead of the entire matrix. A version with bound constraints was proposed in [40]. The algorithm developed in [1] generates points randomly around the current iterate at every iteration to produce approximations that do not depend on information about past iterations.
Randomized variants of QN algorithms have been recently investigated. Such random methods employ a random direction at each iteration for updating the approximate Hessian matrix. The online L-BFGS method [41] adapts the L-BFGS method to make use of subsampled gradients. The regularized BFGS method [42,43] modifies the BFGS update by adding a regularizer to the metric matrix. The stochastic block BFGS method was proposed in [44]. This method enables the incorporation of curvature information in stochastic approximation. The estimate of the inverse Hessian matrix is updated at each iteration using a randomly generated compressed form of the Hessian. Such an approach was called a “sketch” technique. Then, the authors developed an adaptive variant of the randomized block BFGS, AdaRBFGS in [45], in which they modified the distribution underlying the stochasticity of the method throughout the iterative process. Further, in [46], it was shown that the block BFGS method also converges superlinearly, and a framework using a curvature-adaptive step size was introduced. In [47], a stochastic QN method is proposed that employs the classical BFGS update formula in its limited memory form and is based on collecting curvature information pointwise and at regular intervals through Hessian-vector products. In [48], the authors study stochastic QN methods in the interpolation setting and prove that these algorithms, including L-BFGS, can achieve global linear convergence with a constant batch size. The authors in [30] provide a semi-local convergence rate for the randomized BFGS method under the assumption that the function is self-concordant. An extension of BFGS proposed in [49] generates an estimate of the true objective function by taking the empirical mean over a sample drawn at each step and attains R-superlinear convergence. A regularized stochastic accelerated QN method (RES-NAQ) that combines the concept of the regularized stochastic BFGS method (RES) with the Nesterov accelerating technique by introducing a new momentum coefficient was proposed in [50].
Greedy variants of the QN method were introduced in [51]. In contrast to the classical QN methods, which use the difference of successive iterates for updating the Hessian approximations, the method in [51] applies basis vectors, greedily selected to maximize a certain measure of progress. An explicit non-asymptotic bound on the local superlinear convergence rate was established. This approach was further improved in [52,53], which proposed methods with a condition-number-free superlinear convergence rate.
When solving a specific problem using the QN method, a huge number of iterations occur outside the extremum area where there is no stable quadratic approximation of the function. In this paper, our aim is to study the convergence rate of quasi-Newton-type methods without assuming the existence of second derivatives of the function. Strongly convex functions with a Lipschitz gradient are studied, and local quadratic approximations of the function based on information about the properties of its Hessian are not used.
The obtained results provide estimates of the convergence rate of quasi-Newton methods on strongly convex functions with a Lipschitz gradient in the form of a geometric progression. It is shown that the indicators estimating the convergence speed improve as the number of iterations of the method increases, which indicates the benefit of adjusting the metric matrices in the method.
It is known that it is possible to both reduce the spread of elongation of level surfaces along different directions and increase it with the help of a linear transformation of coordinates. Quasi-Newton methods eliminate this scatter on quadratic functions. The work shows that in the case of strongly convex functions, there is a scattering that can be eliminated using a linear transformation of coordinates, and then, the quasi-Newton method also eliminates it. That is, it is possible to improve the behavior of a strongly convex function using some linear coordinate transformation, and then, the quasi-Newton method can recreate this coordinate transformation. This property of the method is based on its invariance under a linear coordinate transformation, which allows us to consider the method in a coordinate system with better characteristics in terms of estimates of the convergence rate.
The rest of the paper is organized as follows. In Section 2, we provide basic information about quasi-Newton methods. In Section 3, we restate necessary information about strongly convex functions and obtain an estimate for the convergence rate of quasi-Newton methods on strongly convex functions with a Lipschitz gradient, depending on the convexity constants and Lipschitz constants. Accelerating properties of quasi-Newton methods on strongly convex functions with a Lipschitz gradient are considered in Section 4. Numerical results are presented in Section 5. A short conclusion of the obtained results is given in the last section.

2. Quasi-Newton Methods

The iteration of the quasi-Newton method has the following form (see, for example, [12]):

x_{k+1} = x_k + β_k s_k,    (1)

s_k = −H_k ∇f(x_k),  β_k = arg min_{β ≥ 0} f(x_k + β s_k),    (2)

where ∇f(x_k) is the gradient of the objective function f at x_k, H_k denotes an approximation of the inverse Hessian [∇²f(x_k)]^{−1}, s_k is a search direction, and β_k is chosen to satisfy the Wolfe conditions. The following notation will be used:

Δx_k = x_{k+1} − x_k,  y_k = ∇f(x_{k+1}) − ∇f(x_k),    (3)

H_{k+1} = H(H_k, Δx_k, y_k),    (4)

where H_{k+1} = H(H_k, Δx_k, y_k) denotes an appropriate formula for updating the matrices H_k. The initial iterative point is denoted by x_0, and the initial approximation of [∇²f(x_0)]^{−1} satisfies H_0 > 0 (H_0 = I is usually assumed).
We will denote by A(A_k, Δx_k, y_k) the operator approximating the Hessian ∇²f(x_k). The process (1)–(4) is a certain approximation of the Newton optimization method. We will be interested in the accelerating properties of QN methods and the conditions for their appearance.
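To make the scheme concrete, the following minimal Python sketch implements the iteration (1)–(4) for a user-supplied update rule. It is an illustration, not the authors' code: the names (quasi_newton, update), the step bound, and the tolerances are our assumptions, and SciPy's bounded scalar minimizer merely stands in for the exact one-dimensional search (2).

```python
import numpy as np
from scipy.optimize import minimize_scalar  # stand-in for the exact search (2)

def quasi_newton(f, grad, x0, update, H0=None, tol=1e-8, max_iter=500):
    """Generic quasi-Newton iteration (1)-(4); `update` implements H_{k+1} = H(H_k, dx, y)."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size) if H0 is None else np.array(H0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s = -H @ g                                   # search direction s_k = -H_k g_k
        beta = minimize_scalar(lambda b: f(x + b * s),
                               bounds=(0.0, 1e3), method="bounded").x
        x_new = x + beta * s
        dx, y = x_new - x, grad(x_new) - g           # Delta x_k and y_k, Formula (3)
        if y @ dx > 1e-12:                           # curvature condition (17)
            H = update(H, dx, y)                     # metric update, Formula (4)
        x = x_new
    return x
```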
Well-known rules for updating the matrices H_k and A_k are as follows. The Davidon–Fletcher–Powell (DFP) updating formula [20,24] is given by the following:

H_{k+1} = H_DFP(H_k, Δx_k, y_k) = H_k − H_k y_k y_k^T H_k/(y_k, H_k y_k) + Δx_k Δx_k^T/(Δx_k, y_k),    (5)

A_{k+1} = A_DFP(A_k, Δx_k, y_k) = A_k − ((y_k − A_k Δx_k, Δx_k)/(y_k, Δx_k)²) y_k y_k^T + ((y_k − A_k Δx_k) y_k^T + y_k (y_k − A_k Δx_k)^T)/(y_k, Δx_k),    (6)

such that H_k = A_k^{−1}.
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) updating formula [19,23,26,31] is defined as follows:

H_{k+1} = H_BFGS(H_k, Δx_k, y_k) = H_k − ((Δx_k − H_k y_k, y_k)/(y_k, Δx_k)²) Δx_k Δx_k^T + ((Δx_k − H_k y_k) Δx_k^T + Δx_k (Δx_k − H_k y_k)^T)/(y_k, Δx_k).    (7)
The one-parameter family of formulas combining (5) and (7) is given by (see, for example, [12]):

H(H_k, Δx_k, y_k) = γ H_BFGS(H_k, Δx_k, y_k) + (1 − γ) H_DFP(H_k, Δx_k, y_k),  γ ∈ [0, 1].    (8)

Equation (8) can be represented as follows:

H_{k+1} = H(H_k, Δx_k, y_k) = H_k − H_k y_k y_k^T H_k/(y_k, H_k y_k) + Δx_k Δx_k^T/(Δx_k, y_k) + γ v_k v_k^T,    (9)

where

v_k = (y_k, H_k y_k)^{1/2} [Δx_k/(Δx_k, y_k) − H_k y_k/(y_k, H_k y_k)],  γ ∈ [0, 1].    (10)
Under the exact one-dimensional search (2), the approximations x_k obtained from Formulas (1)–(4) coincide whichever update formula of the one-parameter family (8) is used, i.e., for an arbitrary choice of γ ∈ [0, 1] (see [22]).
In what follows, symmetric positive definiteness of a matrix H will be denoted by H > 0. If H_0 > 0, then the family (8) generates symmetric matrices H_k.
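For reference, the update formulas of this section can be written compactly in NumPy. This is a hedged sketch: the helper names dfp_update, bfgs_update, and broyden_family_update are ours, intended to plug into the quasi_newton driver sketched above.

```python
import numpy as np

def dfp_update(H, dx, y):
    """DFP update (5) of the inverse-Hessian approximation."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(dx, dx) / (dx @ y)

def bfgs_update(H, dx, y):
    """BFGS update (7), written in the form used in this section."""
    Hy = H @ y
    w = dx - Hy
    return (H - (w @ y) / (y @ dx) ** 2 * np.outer(dx, dx)
            + (np.outer(w, dx) + np.outer(dx, w)) / (y @ dx))

def broyden_family_update(H, dx, y, gamma=1.0):
    """One-parameter family (9)-(10): gamma = 0 gives DFP, gamma = 1 gives BFGS."""
    Hy = H @ y
    v = np.sqrt(y @ Hy) * (dx / (dx @ y) - Hy / (y @ Hy))
    return dfp_update(H, dx, y) + gamma * np.outer(v, v)
```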

3. Convergence Rate of Quasi-Newton Methods on Strongly Convex Functions with Lipschitz Gradient

The main known studies of the convergence rate of QN methods were carried out in the region of the function extremum in the presence of a stable quadratic representation of the function. The fact that the DFP method converges at the rate of a geometric progression was established in [15] under the condition that the function is three times continuously differentiable and the matrix of second derivatives is bounded. In this work, an estimate of the convergence rate of a one-parameter family of QN methods on strongly convex functions is obtained without the assumption of the existence of second derivatives, and the accelerating properties of the QN family are substantiated in comparison with the gradient method. The obtained results indicate that QN methods, originally based on the assumption that a quadratic representation of the function exists in the neighborhood of a point, are able to approximate a coordinate transformation that reduces the degree of degeneracy of the function even in the absence of such a quadratic representation. Due to this fact, QN methods have an advantage in convergence speed compared to the steepest descent method. This result forms the content of this section.
For simplicity, the notations g(x) and g(x_k) will be used instead of ∇f(x) and ∇f(x_k). In what follows, we will assume the following condition.
Condition 1. 
The objective function f(x), x ∈ R^n, is differentiable and strongly convex in R^n, i.e., there exists ρ > 0 such that the inequality

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − α(1 − α)ρ‖x − y‖²/2    (11)

holds for all x, y ∈ R^n and α ∈ [0, 1], and the gradient g(x) satisfies the Lipschitz condition

‖g(x) − g(y)‖ ≤ L‖x − y‖  ∀x, y ∈ R^n,  L > 0.    (12)
Functions which fulfill Condition 1 satisfy the following relations [5]:
(g(x) − g(y), x − y) ≥ ρ‖x − y‖²  ∀x, y ∈ R^n,    (13)

f(x) − f* ≤ ‖g(x)‖²/(2ρ)  ∀x ∈ R^n,    (14)

(g(x) − g(y), x − y) ≥ ‖g(x) − g(y)‖²/L  ∀x, y ∈ R^n,    (15)

f(x) − f* ≥ ρ‖x − x*‖²/2  ∀x ∈ R^n,    (16)
where x* is the minimum point and f* = f(x*) is the function value at the minimum point.
Since the sequences of approximations x_k obtained by Formulas (1)–(4) with an exact one-dimensional search (2) coincide for an arbitrary choice of the matrix transformation formula (9) [22], all further reasoning will be carried out using the sequence of matrices H_k generated by the DFP Formula (5).
If the matrix H_0 is symmetric, then the family (8) generates symmetric matrices. If the condition

(y_k, Δx_k) > 0    (17)

holds and the matrix H_0 is strictly positive definite, then the matrices H_k retain strict positive definiteness [13]. If the function f satisfies Condition 1, then (13) implies the validity of (17). Hence, the sequence H_k obtained by rule (8) or (9) remains strictly positive definite when the objective function satisfies Condition 1.
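The following small NumPy check (our illustration, not part of the paper; it assumes the dfp_update helper sketched above) verifies on a convex quadratic with an exact line search that the curvature condition (17) holds at every step and that the DFP matrices remain positive definite.

```python
import numpy as np

A = np.diag([1.0, 10.0, 100.0])        # quadratic f(x) = x^T A x / 2 with rho = 1, L = 100
H = np.eye(3)
x = np.array([3.0, -2.0, 1.0])
for _ in range(3):                     # at most n = 3 exact-search steps are needed for a quadratic
    g = A @ x
    s = -H @ g
    beta = (g @ H @ g) / (s @ A @ s)   # exact minimizer of f(x + beta * s) along s
    x_new = x + beta * s
    dx, y = x_new - x, A @ x_new - g
    assert y @ dx > 0                                 # curvature condition (17)
    H = dfp_update(H, dx, y)
    assert np.all(np.linalg.eigvalsh(H) > 0)          # H_k stays strictly positive definite
    x = x_new
```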
In Lemma 1, we estimate the per-iteration reduction coefficient of the function in terms of the change in the gradient.
Lemma 1. 
Let the objective function f satisfy Condition 1. Then, as a result of iterations defined by (1)–(4), the function decreases with the estimate as follows:
f_{k+1} − f* ≤ q_k (f_k − f*),    (18)

where

q_k = [1 + ρ²‖y_k‖²/(L²‖g_{k+1}‖²)]^{−1},  ‖g_{k+1}‖ > 0.    (19)

In addition, the following inequality holds for the sequence of iterations f_j, j = 0, 1, …, k:

f_{k+1} − f* ≤ Q_k (f_0 − f*),  Q_k = ∏_{j=0}^{k} q_j.    (20)
Proof of Lemma 1. 
The exact value of the indicator is as follows:

q_k = (f_{k+1} − f*)/(f_k − f*) = (f_{k+1} − f*)/[(f_{k+1} − f*) + (f_k − f_{k+1})] = [1 + (f_k − f_{k+1})/(f_{k+1} − f*)]^{−1}.    (21)

Let us estimate the terms appearing in (21). According to (14),

f_{k+1} − f* ≤ ‖g_{k+1}‖²/(2ρ).    (22)

The Lipschitz condition (12) gives the following inequality:

‖Δx_k‖² ≥ ‖y_k‖²/L².    (23)

In view of (16), which is also valid for the one-dimensional function along the direction Δx_k, the following holds:

f_k − f_{k+1} ≥ ρ‖Δx_k‖²/2.    (24)

Combining (24) with (23) leads to the following:

f_k − f_{k+1} ≥ ρ‖y_k‖²/(2L²).    (25)

Applying the estimates (22) and (25) in (21) yields (18) with q_k given by (19). A recursive application of (18) produces (20). □
Denote by Sp(H) the trace of a matrix H. Applying the formulas for the traces of the matrices H_k and A_k = H_k^{−1} obtained in [15], we derive the estimates on which the evaluation of the convergence rate is built.
Lemma 2. 
Let the function satisfy Condition 1. The following estimates hold for sequences {Hj}, {yj}, {gj}, j = 0, 1,…, k generated as a result of the iterative process (1)–(4):
Σ_{j=0}^{k} a_j ≤ (k + 1) c_a,  a_j = (g_{j+1}, H_j g_{j+1})/‖y_j‖²,  c_a = 1/ρ + Sp(H_0)/(k + 1),    (26)

Σ_{j=0}^{k} b_j ≤ (k + 1) c_b,  b_j = ‖g_{j+1}‖²/(g_{j+1}, H_j g_{j+1}),  c_b = L + Sp(A_0)/(k + 1),    (27)

f_{k+1} − f* ≤ q_k (f_k − f*),    (28)

where

q_k = [1 + ρ²‖y_k‖²/(L²‖g_{k+1}‖²)]^{−1} = [1 + c_0/(a_k b_k)]^{−1} = a_k b_k/(c_0 + a_k b_k),  c_0 = ρ²/L²,  ‖g_{k+1}‖ > 0.    (29)

In addition, the following inequality holds for the sequence of iterations f_j, j = 0, 1, …, k:

f_{k+1} − f* ≤ Q_k (f_0 − f*),  Q_k = ∏_{j=0}^{k} q_j = ∏_{j=0}^{k} a_j b_j/(c_0 + a_j b_j).    (30)
Proof of Lemma 2. 
Expressions for the traces of the matrices H_k and A_k were calculated in [15] through Formulas (5) and (6) as follows:

Sp(H_{k+1}) = Sp(H_0) − Σ_{j=0}^{k} [ ‖H_j y_j‖²/(y_j, H_j y_j) − ‖Δx_j‖²/(y_j, Δx_j) ],    (31)

Sp(A_{k+1}) = Sp(A_0) + ‖g_{k+1}‖²/(g_{k+1}, H_{k+1} g_{k+1}) − ‖g_0‖²/(g_0, H_0 g_0) − Σ_{j=0}^{k} ‖g_{j+1}‖²/(g_{j+1}, H_j g_{j+1}) + Σ_{j=0}^{k} ‖y_j‖²/(Δx_j, y_j).    (32)
As noted earlier, the matrices H_k, and therefore A_k, are strictly positive definite. Due to the remark made above about the identity of the sequences x_k for different γ ∈ [0, 1] in (9), we will carry out the proof for the sequence H_k generated by (5).
Due to the inequality

‖z‖² ≤ Sp(A_{k+1}) (z, H_{k+1} z),

valid for every z ∈ R^n, the next inequality follows:

Sp(A_{k+1}) − ‖g_{k+1}‖²/(g_{k+1}, H_{k+1} g_{k+1}) ≥ 0.

From (32), considering the last inequality and (15), one obtains the following:

Σ_{j=0}^{k} ‖g_{j+1}‖²/(g_{j+1}, H_j g_{j+1}) = Sp(A_0) − ‖g_0‖²/(g_0, H_0 g_0) − [ Sp(A_{k+1}) − ‖g_{k+1}‖²/(g_{k+1}, H_{k+1} g_{k+1}) ] + Σ_{j=0}^{k} ‖y_j‖²/(Δx_j, y_j) ≤ Sp(A_0) + Σ_{j=0}^{k} ‖y_j‖²/(Δx_j, y_j) ≤ Sp(A_0) + (k + 1) L.    (33)
The inequality (13) implies the following:

‖Δx_j‖²/(y_j, Δx_j) ≤ 1/ρ.    (34)

The Schwarz inequality leads to the following:

(y_j, H_j y_j)² ≤ ‖H_j y_j‖² ‖y_j‖².    (35)

Due to the exact one-dimensional search condition (2), which yields (g_{j+1}, Δx_j) = 0, and since Δx_j = −β_j H_j g_j, the next equality holds:

(g_{j+1}, H_j g_j) = −β_j^{−1} (g_{j+1}, Δx_j) = 0.

From this and from the positive definiteness of the matrices H_j, the following holds:

(y_j, H_j y_j) = (g_{j+1}, H_j g_{j+1}) + (g_j, H_j g_j) − 2(g_{j+1}, H_j g_j) = (g_{j+1}, H_j g_{j+1}) + (g_j, H_j g_j) ≥ (g_{j+1}, H_j g_{j+1}).    (36)

Considering (34)–(36), the equality (31) is transformed as follows:

Σ_{j=0}^{k} (g_{j+1}, H_j g_{j+1})/‖y_j‖² ≤ Σ_{j=0}^{k} ‖H_j y_j‖²/(y_j, H_j y_j) = Sp(H_0) − Sp(H_{k+1}) + Σ_{j=0}^{k} ‖Δx_j‖²/(y_j, Δx_j) ≤ Sp(H_0) + (k + 1)/ρ.    (37)
From (37) and (33), we arrive at the following inequalities:

Σ_{j=0}^{k} (g_{j+1}, H_j g_{j+1})/‖y_j‖² ≤ Sp(H_0) + (k + 1)/ρ = ((k + 1)/ρ) [1 + ρ Sp(H_0)/(k + 1)],    (38)

Σ_{j=0}^{k} ‖g_{j+1}‖²/(g_{j+1}, H_j g_{j+1}) ≤ Sp(A_0) + (k + 1) L = (k + 1) L [1 + Sp(A_0)/(L(k + 1))].    (39)

Inequalities (38) and (39) are identical to (26) and (27). Estimate (30) follows from (20). □
The convergence rate of the QN method is determined by the indicator Q_k in (20). In the next lemma, we find an upper bound for this indicator by solving the following extremal problem:

Q_k → max over a, b, subject to (26) and (27),    (40)

where a^T = (a_0, a_1, …, a_k), b^T = (b_0, b_1, …, b_k).
Lemma 3. 
Under the conditions of Lemma 2, the solution to the problem (40) has the following form:

a_j* = c_a = 1/ρ + Sp(H_0)/(k + 1),  b_j* = c_b = L + Sp(A_0)/(k + 1),  j = 0, 1, …, k.    (41)

The optimal value is equal to the following:

Q_k* = (q_k*)^{k+1},    (42)

where

q_k* = c_a c_b/(c_0 + c_a c_b) = [1 + c_0/(c_a c_b)]^{−1}.    (43)
Proof of Lemma 3. 
To solve the problem (40), we form the Lagrange function

L(a, b, y_a, y_b) = ∏_{j=0}^{k} a_j b_j/(c_0 + a_j b_j) + y_a [ Σ_{j=0}^{k} a_j − (k + 1) c_a ] + y_b [ Σ_{j=0}^{k} b_j − (k + 1) c_b ].    (44)

The partial derivatives of L are equal to the following:

∂L/∂a_j = (Q_k/q_j) [ b_j (a_j b_j + c_0) − b_j (a_j b_j) ]/(c_0 + a_j b_j)² + y_a = (Q_k/q_j) b_j c_0/(c_0 + a_j b_j)² + y_a = Q_k c_0/[a_j (c_0 + a_j b_j)] + y_a = 0,

which implies

a_j (c_0 + a_j b_j) = −c_0 Q_k/y_a.    (45)

Similarly, the coefficients b_j fulfill the following:

b_j (c_0 + a_j b_j) = −c_0 Q_k/y_b.    (46)

From the expressions (45) and (46), it is easy to obtain that all elements a_j are equal; the same is true for the elements b_j.
Since the objective is increasing in each a_j and b_j, the solution lies on the boundary of the constraints, and the parameters a_j and b_j satisfy the following equalities:

Σ_{j=0}^{k} a_j = (k + 1) c_a,  Σ_{j=0}^{k} b_j = (k + 1) c_b.

From here, we obtain the statement (41).
The matrix of second derivatives of the Lagrange function is diagonal, and its elements are positive. Consequently, the sufficient conditions for an extremum are satisfied at the point with parameters (41), and these parameters provide a solution to the problem (40). □
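As an informal sanity check of Lemma 3 (our illustration with arbitrary constants c_0, c_a, c_b and k, not taken from the paper), one can sample random feasible points of problem (40) and confirm that none of them exceeds the value attained at the equal parameters (41).

```python
import numpy as np

rng = np.random.default_rng(0)
k, c0, ca, cb = 9, 0.25, 2.0, 3.0                       # arbitrary illustrative constants

def Q(a, b):                                            # objective of problem (40), cf. (30)
    return np.prod(a * b / (c0 + a * b))

def random_feasible_Q():
    # random positive vectors on the boundary of the constraints (26) and (27)
    a = rng.dirichlet(np.ones(k + 1)) * (k + 1) * ca
    b = rng.dirichlet(np.ones(k + 1)) * (k + 1) * cb
    return Q(a, b)

Q_star = Q(np.full(k + 1, ca), np.full(k + 1, cb))      # value at the solution (41), cf. (42)-(43)
print(max(random_feasible_Q() for _ in range(10000)) <= Q_star)   # expected: True
```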
Theorem 1. 
Let the function f satisfy Condition 1. Then, for the sequence of iterations f_j, j = 0, 1, …, k, the convergence rate of the goal function is estimated as follows:

f_{k+1} − f* ≤ Q_k* (f_0 − f*),  Q_k* = (q_k*)^{k+1},    (47)

where

q_k* ≤ 1 − (ρ³/(2L³)) [1 + ρ n M_0/(k + 1)]^{−1} [1 + n/(m_0 L (k + 1))]^{−1}.    (48)

Here, M_0 and m_0 are the maximal and minimal eigenvalues of the matrix H_0, respectively.
Proof of Theorem 1. 
We use the worst-case value (43) of the reduction indicator in (42) and transform it considering (26), (27), (40), and (41):

q_k* = [1 + c_0/(c_a c_b)]^{−1} = { 1 + (ρ³/L³) [1 + ρ Sp(H_0)/(k + 1)]^{−1} [1 + Sp(H_0^{−1})/(L(k + 1))]^{−1} }^{−1}.    (49)

To transform (49), we use the following inequality:

1/(1 + d) ≤ 1 − d/2,  d ∈ [0, 1].    (50)

Multiplying (50) by 1 + d gives the following:

1 ≤ (1 + d)(1 − d/2) = 1 + (d/2)(1 − d).

Based on this inequality, (50) is valid for d ∈ [0, 1].
In (49), the quantity

d = (ρ³/L³) [1 + ρ Sp(H_0)/(k + 1)]^{−1} [1 + Sp(H_0^{−1})/(L(k + 1))]^{−1}

is positive and does not exceed ρ³/L³ ≤ 1; hence, we can use (50) to transform (49). The result is the following inequality:

q_k* ≤ 1 − (ρ³/(2L³)) [1 + ρ Sp(H_0)/(k + 1)]^{−1} [1 + Sp(H_0^{−1})/(L(k + 1))]^{−1}.    (51)

In view of the relations for the matrix traces

Sp(H_0) ≤ n M_0,  Sp(H_0^{−1}) ≤ n/m_0,

(51) leads to the estimate (48). □
For a sufficiently large k, the convergence rate estimate for QN methods can be clearly represented in the following form:

q* ≈ 1 − ρ³/(2L³).    (52)
Note that we did not involve any information about the matrix of second derivatives when obtaining the estimates of the convergence rate. If the objective function is twice differentiable, then the eigenvalues of the matrix of second derivatives lie in the interval [ρ, L] determined by the strong convexity parameter and the Lipschitz parameter.
Let us analyze the convergence rate indicator (48).
  • The estimate (48) is determined by the strong convexity parameter ρ, Lipschitz parameter L and the properties of the initial matrix H0.
  • As the number of iterations k increases, the estimate of the indicator q_k* decreases and tends to the value given by (52). This fact is consistent with the expected improvement in the convergence rate resulting from the matrix transformation process in QN methods.
Due to the invariance of QN methods, estimates (48) can be considered in different coordinate systems. In the next section, we will consider the method in a coordinate system, where the ratio ρ/L is maximal.
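For orientation, the bound (48) is easy to evaluate numerically. The helper below is an illustration with made-up parameter values (not data from the paper); it shows how the estimate improves with k and approaches the limiting value (52).

```python
def qn_rate_bound(rho, L, n, M0, m0, k):
    """Upper bound (48) on the reduction factor q_k* of Theorem 1."""
    brackets = (1.0 + rho * n * M0 / (k + 1)) * (1.0 + n / (m0 * L * (k + 1)))
    return 1.0 - rho**3 / (2.0 * L**3) / brackets

# the estimate decreases with k and tends to 1 - rho^3/(2 L^3), cf. (52)
for k in (0, 10, 100, 10000):
    print(k, qn_rate_bound(rho=1.0, L=10.0, n=100, M0=1.0, m0=1.0, k=k))
print(1.0 - 1.0**3 / (2.0 * 10.0**3))   # limiting value from (52)
```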

4. Accelerating Properties of QN Methods on Strongly Convex Functions with Lipschitz Gradient

Further research is aimed at determining the conditions, beyond the trivial case of minimizing a quadratic function, under which QN methods are superior in convergence rate to the steepest descent method. Quasi-Newton methods are invariant under variable transformations

x̂ = P x,    (53)

where P is a non-singular (n × n) matrix [13]. This means that the form of the process (1)–(4) is completely identical to the form of the process in the new coordinate system. In this case, the corresponding quantities of the two processes are related by

Δx̂ = P Δx,  ŷ = P^{−T} y,  ĝ = P^{−T} g,  Ĥ = P H P^T,  x̂ = P x,

provided that the initial conditions are related by the following relations:

x̂_0 = P x_0,  Ĥ_0 = P H_0 P^T.    (54)

Here, P^{−T} = (P^T)^{−1}. In the new coordinate system, the process (1)–(4) is equivalent to minimizing the function

φ(x̂) = f(P^{−1} x̂) = f(x),

which, as is easy to show, satisfies Condition 1 with a strong convexity constant ρ_P and a Lipschitz constant L_P.
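The invariance relations above are easy to check numerically. The following sketch (our illustration, reusing the hypothetical quasi_newton and bfgs_update helpers from the sketches in Section 2) runs the method on a strongly convex test function and on its image under a random non-singular P, with the initial conditions matched as in (54), and compares the attained function values, which agree up to line-search accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([1.0, 50.0, 400.0])
f = lambda x: 0.5 * x @ A @ x + 0.1 * np.sum(x**4)     # strongly convex; Lipschitz gradient on bounded sets
grad = lambda x: A @ x + 0.4 * x**3

P = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)          # (almost surely) non-singular transformation
P_inv = np.linalg.inv(P)
f_hat = lambda z: f(P_inv @ z)                         # phi(x_hat) = f(P^{-1} x_hat)
grad_hat = lambda z: P_inv.T @ grad(P_inv @ z)         # g_hat = P^{-T} g

x0, H0 = np.array([10.0, 10.0, 10.0]), np.eye(3)
x_end = quasi_newton(f, grad, x0, bfgs_update, H0=H0)
z_end = quasi_newton(f_hat, grad_hat, P @ x0, bfgs_update, H0=P @ H0 @ P.T)   # conditions (54)
print(f(x_end), f_hat(z_end))       # approximately equal: the two processes are identical in form
```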
Define the following transformation:

x̂ = V x,

where V is a non-singular matrix such that

ρ_V/L_V ≥ ρ_P/L_P    (55)

for an arbitrary non-singular (n × n) matrix P.
Theorem 2. 
Let the conditions of Theorem 1 be satisfied. Then, the sequence f(x_k), k = 0, 1, 2, …, generated by the process (1)–(4) fulfills the following estimate:

f_k − f* ≤ (f_0 − f*)(q_k*)^k,

q_k* ≤ 1 − (ρ_V³/(2L_V³)) [1 + ρ_V n M_{0V}/(k + 1)]^{−1} [1 + n/(m_{0V} L_V (k + 1))]^{−1},    (56)

where M_{0V} and m_{0V} are the maximal and minimal eigenvalues of the matrix Ĥ_0 = V H_0 V^T, respectively.
Proof of Theorem 2. 
Due to the identical form of the minimization process in the old and new coordinate systems and the connection (54) between the initial conditions, the rate of the process (1)–(4) can be estimated in any coordinate system and, in particular, in the system x̂ = V x. This fact, together with (48), proves (56). □
The estimate (56) was obtained in the coordinate system selected by condition (55). If the function satisfies the relation

ρ_V/L_V ≫ ρ/L,

then the advantages of QN methods, compared to the steepest descent method, are indisputable. The result (56) was obtained without the assumption of the existence of second derivatives of the function being minimized. Under such weakened conditions on the goal function given in Condition 1, QN methods converge at the rate of a geometric progression and can eliminate the background that slows down the convergence rate. The elimination is enabled by the corresponding linear transformation of coordinates.

5. Numerical Experiment

The purpose of the numerical experiment is to study the ability of quasi-Newton methods to eliminate the background that slows down the convergence rate through a linear transformation which normalizes the elongation of the function level surfaces in different directions. For comparison, methods were chosen in which this background remains active during the solution process. The gradient descent (GR) method, the Hestenes–Stiefel (HS) conjugate gradient method, and the quasi-Newton (BFGS) method, each with a one-dimensional search procedure based on cubic interpolation, were implemented and compared.
Since the use of QN methods is justified primarily for highly ill-conditioned functions, on which conjugate gradient methods fail, the test functions were selected from this standpoint. The QN method is based on a quadratic model of a function; therefore, its local convergence rate in a certain neighborhood of the current minimum is largely determined by its efficiency in minimizing ill-conditioned quadratic functions. Therefore, the studies were carried out on quadratic functions and on functions derived from them.
If the function is twice differentiable, then the eigenvalues of the matrix of second derivatives lie in the interval [ρ, L] determined by the strong convexity and Lipschitz parameters. When creating the tests, we used the representation of a quadratic function and the analysis of its conditioning based on its eigenvalues. The test functions simulate an oscillatory behavior of the second derivatives of the function.
The following function is accepted as the basic quadratic function:

f_1(x, [a_max]) = (1/2) Σ_{i=1}^{n} a_i x_i²,  a_i = a_max^{(i−1)/(n−1)}.

To simulate random fluctuations of the second derivatives, a function f_2 was created based on the basic function f_1:

f_2 = f_1(x, [a_max]).

Its gradients were distorted randomly according to the following scheme:

∇f_2 = ∇f_1 (1 + r ξ),

where ξ ∈ [−1, 1] is a uniformly distributed random number and r = 0.3 is the gradient distortion parameter. It should be noted that the distortion of the gradients significantly reduces the accuracy of the one-dimensional search, where gradients are used to estimate directional derivatives in the cubic approximation.
The point x_0 = (100, 100, …, 100) was chosen as the starting point. The stopping criterion was as follows:

f(x_k) − f* ≤ ε = 10^{−10}.
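For reproducibility, the test problem of this section can be set up as follows. This is a sketch under our own conventions: in particular, whether the distortion factor 1 + rξ is drawn per component or once per gradient call is not specified in the text, and we draw it per component; the helper names are illustrative.

```python
import numpy as np

def make_test_problem(n, a_max, r=0.3, rng=None):
    """Quadratic test function f1 and its randomly distorted gradient (function f2)."""
    rng = np.random.default_rng() if rng is None else rng
    a = a_max ** (np.arange(n) / (n - 1))            # a_i = a_max^{(i-1)/(n-1)}, i = 1..n
    f = lambda x: 0.5 * np.sum(a * x**2)
    grad_exact = lambda x: a * x
    grad_noisy = lambda x: grad_exact(x) * (1.0 + r * rng.uniform(-1.0, 1.0, size=n))
    return f, grad_noisy

n = 100
f, grad = make_test_problem(n, a_max=1e3)
x0 = np.full(n, 100.0)                               # starting point (100, ..., 100)
eps = 1e-10                                          # stopping rule: f(x_k) - f* <= eps, with f* = 0 here
```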
The minimization results are presented in Table 1 and Table 2. For each method, the tables report N_it, the number of iterations (one-dimensional searches along the direction); nfg, the number of calls to the procedure that simultaneously computes the function and its gradient; and f, the achieved function value.
Table 1 shows that the BFGS method is the most effective, the HS method is second, and the GR method is last. Table 2 shows that, unlike the HS and GR methods, the BFGS method remains operational due to the removal of the background noise that worsens the convergence rate.

6. Conclusions

An overwhelming number of iterations occur outside the extremum area when solving a specific minimization problem using a quasi-Newton method if there is no stable quadratic approximation of the objective function. This paper presents a study of the convergence rate of quasi-Newton-type methods on strongly convex functions with a Lipschitz gradient without assuming the existence of second derivatives of the goal function. In our work, the convergence of quasi-Newton methods on strongly convex functions with a Lipschitz gradient is estimated in the form of a geometric progression.
The estimate of the convergence rate includes the dependence on the strong convexity parameter, Lipschitz parameter, and the initial matrix parameter. The convergence rate is determined by the ratio of constants ρ/L, which characterizes the spread of elongation of level surfaces in different directions. The greater this ratio, the higher the convergence rate.
It is shown that an increase in the number of iterations of the method improves the indicators for estimating the convergence rate, which demonstrates the benefit of adjusting the metric matrices in the method.
The property of invariance of quasi-Newton methods with respect to a linear transformation of coordinates enables us to consider the method in a coordinate system where the ratio ρ/L is maximal, that is, the spread of elongation of level surfaces in different directions is minimal, and to obtain a conclusion about the accelerating properties of quasi-Newton methods without relying on the matrix of the second derivatives of the function.
The computational experiment numerically confirms the theoretically predicted ability of quasi-Newton methods to eliminate the background noise that slows down the convergence rate. The research results can be applied in practice, for example, when choosing a method for training neural networks. As a suggestion for future work, the numerical experiment can be extended to other functions.
The study of the convergence rate of quasi-Newton minimization methods was carried out under the assumption of the exact line search (2). One area for future research may be a convergence analysis based on various inexact line search procedures.

Author Contributions

Conceptualization, V.K.; methodology, V.K., E.T. and P.S.; software, V.K.; validation, L.K., E.T. and P.S.; formal analysis, P.S.; investigation, E.T.; resources, L.K.; data curation, P.S.; writing—original draft preparation, V.K.; writing—review and editing, E.T., P.S. and L.K.; visualization, V.K.; supervision, V.K. and L.K.; project administration, L.K.; funding acquisition, P.S. and L.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Higher Education of the Russian Federation (Grant No. 075-15-2022-1121).

Data Availability Statement

Data are contained within the article.

Acknowledgments

Vladimir Krutikov, Predrag Stanimirović and Lev Kazakovtsev are grateful to the Ministry of Science and Higher Education of the Russian Federation (Grant No. 075-15-2022-1121). Predrag Stanimirović is grateful to the Science Fund of the Republic of Serbia (No. 7750185, Quantitative Automata Models: Fundamental Problems and Applications—QUAM).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Berahas, A.S.; Jahani, M.; Richtárik, P.; Takác, M. Quasi-Newton Methods for Machine Learning: Forget the Past, Just Sample. Optim. Methods Softw. 2022, 37, 1668–1704. [Google Scholar] [CrossRef]
  2. Rafati, J. Quasi-Newton Optimization Methods for Deep Learning Applications. 2019. Available online: https://arxiv.org/abs/1909.01994.pdf (accessed on 12 October 2023).
  3. Indrapriyadarsini, S.; Mahboubi, S.; Ninomiya, H.; Kamio, T.; Asai, H. Accelerating Symmetric Rank-1 Quasi-Newton Method with Nesterov’s Gradient for Training Neural Networks. Algorithms 2022, 15, 6. [Google Scholar] [CrossRef]
  4. Zhang, J.; Tao, X.; Sun, P.; Zheng, Z. A positional misalignment correction method for Fourier ptychographic microscopy based on the quasi-Newton method with a global optimization module. Opt. Commun. 2019, 452, 296–305. [Google Scholar] [CrossRef]
  5. Kokurin, M.M.; Kokurin, M.Y.; Semenova, A.V. Iteratively regularized Gauss–Newton type methods for approximating quasi–solutions of irregular nonlinear operator equations in Hilbert space with an application to COVID–19 epidemic dynamics. Appl. Math. Comput. 2022, 431, 127312. [Google Scholar] [CrossRef]
  6. Lampron, O.; Therriault, D.; Lévesque, M. An efficient and robust monolithic approach to phase-field quasi-static brittle fracture using a modified Newton method. Comput. Methods Appl. 2021, 386, 114091. [Google Scholar] [CrossRef]
  7. Spenke, T.; Hosters, N.; Behr, M. A multi-vector interface quasi-Newton method with linear complexity for partitioned fluid–structure interaction. Comput. Methods Appl. Mech. Eng. 2020, 361, 112810. [Google Scholar] [CrossRef]
  8. Zorrilla, R.; Rossi, R. A memory-efficient MultiVector Quasi-Newton method for black-box Fluid-Structure Interaction coupling. Comput. Struct. 2023, 275, 106934. [Google Scholar] [CrossRef]
  9. Davis, K.; Schulte, M.; Uekermann, B. Enhancing Quasi-Newton acceleration for Fluid-Structure Interaction. Math. Comput. Appl. 2022, 27, 40. [Google Scholar] [CrossRef]
  10. Tourn, B.; Hostos, J.; Fachinotti, V. Extending the inverse sequential quasi-Newton method for on-line monitoring and controlling of process conditions in the solidification of alloys. Int. Commun. Heat Mass Transf. 2023, 142, 1106647. [Google Scholar] [CrossRef]
  11. Hong, D.; Li, G.; Wei, L.; Li, D.; Li, P.; Yi, Z. A self-scaling sequential quasi-Newton method for estimating the heat transfer coefficient distribution in the air jet impingement. Int. J. Therm. Sci. 2023, 185, 108059. [Google Scholar] [CrossRef]
  12. Gill, P.E.; Murray, W.; Wright, M.H. Practical Optimization; SIAM: Philadelphia, PA, USA, 2020. [Google Scholar]
  13. Dennis, J.E.; Schnabel, R.B. Numerical Methods for Unconstrained Optimization and Nonlinear Equations; SIAM: Philadelphia, PA, USA, 1996. [Google Scholar]
  14. Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 2006. [Google Scholar]
  15. Polak, E. Computational Methods in Optimization; Mir: Moscow, Russia, 1974. [Google Scholar]
  16. Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987. [Google Scholar]
  17. Biggs, M.C. Minimization algorithms making use of non-quadratic properties of the objective function. J. Inst. Math. Appl. 1971, 8, 315–327. [Google Scholar] [CrossRef]
  18. Brodlie, K.W. An assessment of two approaches to variable metric methods. Math. Program. 1972, 7, 344–355. [Google Scholar] [CrossRef]
  19. Broyden, C.G. The convergence of a class of double−rank minimization algorithms. J. Inst. Math. Appl. 1970, 6, 76–79. [Google Scholar] [CrossRef]
  20. Davidon, W.C. Variable Metric Methods for Minimization; A.E.C. Res. and Develop. Report ANL−5990; Argonne National Laboratory: Argonne, IL, USA, 1959. [Google Scholar]
  21. Davidon, W.C. Optimally conditioned optimization algorithms without line searches. Math. Program. 1975, 9, 1–30. [Google Scholar] [CrossRef]
  22. Dixon, L.C. Quasi-Newton algorithms generate identical points. Math. Program. 1972, 2, 383–387. [Google Scholar] [CrossRef]
  23. Fletcher, R. A new approach to variable metric algorithms. Comput. J. 1970, 13, 317–322. [Google Scholar] [CrossRef]
  24. Fletcher, R.; Powell, M.J.D. A rapidly convergent descent method for minimization. Comput. J. 1963, 6, 163–168. [Google Scholar] [CrossRef]
  25. Fletcher, R.; Reeves, C.M. Function minimization by conjugate gradients. Comput. J. 1964, 7, 149–154. [Google Scholar] [CrossRef]
  26. Goldfarb, D. A family of variable metric methods derived by variational means. Math. Comput. 1970, 24, 23–26. [Google Scholar] [CrossRef]
  27. Oren, S.S. Self-scaling variable metric (SSVM) algorithms I: Criteria and sufficient conditions for scaling a class of algorithms. Manag. Sci. 1974, 20, 845–862. [Google Scholar] [CrossRef]
  28. Oren, S.S. Self-scaling variable metric (SSVM) algorithms II: Implementation and experiments. Manag. Sci. 1974, 20, 863–874. [Google Scholar] [CrossRef]
  29. Powell, M.J.D. Convergence Properties of a Class of Minimization Algorithms. In Nonlinear Programming; Mangasarian, O.L., Meyer, R.R., Robinson, S.M., Eds.; Academic Press: New York, NY, USA, 1975; Volume 2, pp. 1–27. [Google Scholar] [CrossRef]
  30. Kovalev, D.; Gower, R.M.; Richtarik, P.; Rogozin, A. Fast Linear Convergence of Randomized BFGS. 2020. Available online: https://arxiv.org/pdf/2002.11337.pdf (accessed on 12 October 2023).
  31. Shanno, D.F. Conditioning of quasi-Newton methods for function minimization. Math. Comput. 1970, 24, 647–656. [Google Scholar] [CrossRef]
  32. Mokhtari, A.; Eisen, M.; Ribeiro, A. An incremental quasi-Newton method with a local superlinear convergence rate. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4039–4043. [Google Scholar] [CrossRef]
  33. Mokhtari, A.; Eisen, M.; Ribeiro, A. IQN: An incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 2018, 28, 1670–1698. [Google Scholar] [CrossRef]
  34. Jensen, T.L.; Diehl, M. An Approach for Analyzing the global rate of convergence of Quasi-Newton and Truncated-Newton methods. J. Optim. Theory Appl. 2017, 172, 206–221. [Google Scholar] [CrossRef]
  35. Nesterov, Y. A method of solving a convex programming problem with convergence rate o(1/k2). Sov. Math. Dokl. 1983, 27, 372–376. [Google Scholar]
  36. Rodomanov, A.; Nesterov, Y. Rates of superlinear convergence for classical quasi-Newton methods. Math. Program. 2022, 194, 159–190. [Google Scholar] [CrossRef]
  37. Rodomanov, A.; Nesterov, Y. New results on superlinear convergence of classical Quasi-Newton methods. J. Optim. Theory Appl. 2021, 188, 744–769. [Google Scholar] [CrossRef]
  38. Jin, Q.; Mokhtari, A. Non-asymptotic superlinear convergence of standard quasi-Newton methods. Math. Program. 2023, 200, 425–473. [Google Scholar] [CrossRef]
  39. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528. [Google Scholar] [CrossRef]
  40. Zhu, C.; Byrd, R.H.; Lu, P.; Nocedal, J. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization. ACM Trans. Math. Softw. 1997, 23, 550–560. [Google Scholar] [CrossRef]
  41. Schraudolph, N.; Gunter, S.; Jin, Y. A stochastic quasi-Newton method for online convex optimization. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico, 21–24 March 2007; pp. 436–443. [Google Scholar]
  42. Mokhtari, A.; Ribeiro, A. Regularized stochastic BFGS algorithm. IEEE Trans. Signal Proc. 2014, 62, 1109–1112. [Google Scholar] [CrossRef]
  43. Mokhtari, A.; Ribeiro, A. Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 2015, 16, 3151–3181. [Google Scholar]
  44. Gower, R.; Goldfarb, D.; Richtárik, P. Stochastic block BFGS: Squeezing more curvature out of data. In Proceedings of the 33rd International Conference on Machine Learning (ICML’16), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1869–1878. [Google Scholar]
  45. Gower, R.; Richtárik, P. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 2017, 38, 1380–1409. [Google Scholar] [CrossRef]
  46. Gao, W.; Goldfarb, D. Quasi-Newton methods: Superlinear convergence without line searches for self-concordant functions. Optim. Methods Softw. 2019, 34, 194–217. [Google Scholar] [CrossRef]
  47. Byrd, R.H.; Hansen, S.L.; Nocedal, J.; Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim 2016, 26, 1008–1031. [Google Scholar] [CrossRef]
  48. Meng, S.; Vaswani, S.; Laradji, I.; Schmidt, M.; Lacoste-Julien, S. Fast and Furious Convergence: Stochastic Second Order Methods under Interpolation. 2019. Available online: https://arxiv.org/pdf/1910.04920.pdf (accessed on 12 October 2023).
  49. Zhou, C.; Gao, W.; Goldfarb, D. Stochastic adaptive quasi-Newton methods for minimizing expected values. In Proceedings of the 34th ICML (PMLR), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 4150–4159. [Google Scholar]
  50. Makmuang, D.; Suppalap, S.; Wangkeeree, R. The regularized stochastic Nesterov’s accelerated Quasi-Newton method with applications. J. Comput. Appl. Math. 2023, 428, 115190. [Google Scholar] [CrossRef]
  51. Rodomanov, A.; Nesterov, Y. Greedy quasi-Newton methods with explicit superlinear convergence. SIAM J. Optim. 2021, 31, 785–811. [Google Scholar] [CrossRef]
  52. Lin, D.; Ye, H.; Zhang, Z. Greedy and Random Quasi-Newton Methods with Faster Explicit Superlinear Convergence. In Proceedings of the 34th Conference on Advances in Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021; Volume 34, pp. 6646–6657. [Google Scholar]
  53. Lin, D.; Ye, H.; Zhang, Z. Explicit Convergence Rates of Greedy and Random Quasi-Newton Methods. J. Mach. Learn. Res. 2022, 23, 7272–7311. [Google Scholar]
Table 1. Function f_2 minimization results. Parameter [a_max] = 10³.

n | GR N_it | GR nfg | GR f | HS N_it | HS nfg | HS f | BFGS N_it | BFGS nfg | BFGS f
100 | 596 | 1337 | 9.2491 × 10⁻¹¹ | 464 | 998 | 9.7246 × 10⁻¹¹ | 193 | 416 | 9.7600 × 10⁻¹¹
200 | 1006 | 2223 | 6.8119 × 10⁻¹¹ | 442 | 953 | 8.8826 × 10⁻¹¹ | 223 | 477 | 8.3226 × 10⁻¹¹
300 | 1218 | 2650 | 3.0812 × 10⁻¹¹ | 452 | 971 | 6.7694 × 10⁻¹¹ | 242 | 522 | 8.9270 × 10⁻¹¹
400 | 545 | 1202 | 2.3238 × 10⁻¹² | 454 | 974 | 8.5506 × 10⁻¹¹ | 266 | 579 | 9.1770 × 10⁻¹¹
500 | 1110 | 2417 | 9.9534 × 10⁻¹¹ | 465 | 1012 | 9.3773 × 10⁻¹¹ | 247 | 544 | 8.8586 × 10⁻¹¹
600 | 499 | 1109 | 3.4604 × 10⁻¹¹ | 494 | 1071 | 9.6048 × 10⁻¹¹ | 265 | 575 | 8.5970 × 10⁻¹¹
700 | 941 | 2081 | 9.5699 × 10⁻¹¹ | 458 | 994 | 7.8899 × 10⁻¹¹ | 272 | 586 | 8.7071 × 10⁻¹¹
800 | 761 | 1689 | 9.9708 × 10⁻¹¹ | 442 | 963 | 9.1321 × 10⁻¹¹ | 270 | 593 | 9.7620 × 10⁻¹¹
900 | 736 | 1636 | 6.5657 × 10⁻¹¹ | 472 | 1010 | 9.5092 × 10⁻¹¹ | 284 | 626 | 9.0551 × 10⁻¹¹
1000 | 944 | 2111 | 9.5688 × 10⁻¹¹ | 435 | 945 | 9.2528 × 10⁻¹¹ | 285 | 625 | 8.4472 × 10⁻¹¹
Table 2. Function f_2 minimization results. Parameter [a_max] = 10⁶.

n | GR N_it | GR nfg | GR f | HS N_it | HS nfg | HS f | BFGS N_it | BFGS nfg | BFGS f
100 | 10,001 | 22,163 | 18,623 | 10,001 | 23,266 | 1,380,105 | 452 | 1010 | 1.7175 × 10⁻¹¹
200 | 10,001 | 22,146 | 261,315 | 10,001 | 23,274 | 756,738 | 753 | 1696 | 4.1613 × 10⁻¹¹
300 | 10,001 | 22,199 | 0.8183 | 10,001 | 23,235 | 1,319,239 | 972 | 2183 | 9.3183 × 10⁻¹¹
400 | 10,001 | 22,094 | 541,444 | 10,001 | 23,218 | 823,225 | 1272 | 2909 | 7.2557 × 10⁻¹¹
500 | 10,001 | 22,182 | 3,456,399 | 10,001 | 23,250 | 83,606 | 1354 | 3072 | 8.7855 × 10⁻¹¹
600 | 10,001 | 21,986 | 3,485,875 | 10,001 | 23,238 | 1,303,868 | 1544 | 3525 | 9.8429 × 10⁻¹¹
700 | 10,001 | 22,184 | 1,875,235 | 10,001 | 23,297 | 1,016,413 | 1784 | 4066 | 9.8578 × 10⁻¹¹
800 | 10,001 | 22,141 | 1,892,176 | 10,001 | 23,262 | 1,484,428 | 1830 | 4203 | 9.5674 × 10⁻¹¹
900 | 10,001 | 22,239 | 344,032 | 10,001 | 23,187 | 1,368,998 | 2154 | 4912 | 8.0695 × 10⁻¹¹
1000 | 10,001 | 22,246 | 51,892 | 10,001 | 23,170 | 1,627,625 | 2141 | 4879 | 8.8202 × 10⁻¹¹

