Article

On the Convergence Rate of Quasi-Newton Methods on Strongly Convex Functions with Lipschitz Gradient

by Vladimir Krutikov 1,2, Elena Tovbis 3, Predrag Stanimirović 1,4 and Lev Kazakovtsev 1,3,*
1 Laboratory “Hybrid Methods of Modeling and Optimization in Complex Systems”, Siberian Federal University, 79 Svobodny Prospekt, Krasnoyarsk 660041, Russia
2 Department of Applied Mathematics, Kemerovo State University, 6 Krasnaya Street, Kemerovo 650043, Russia
3 Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, 31, Krasnoyarskii Rabochii Prospekt, Krasnoyarsk 660037, Russia
4 Faculty of Sciences and Mathematics, University of Niš, 18000 Niš, Serbia
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(23), 4715; https://doi.org/10.3390/math11234715
Submission received: 21 October 2023 / Revised: 19 November 2023 / Accepted: 20 November 2023 / Published: 21 November 2023
(This article belongs to the Section Mathematics and Computer Science)

Abstract
The main results on the convergence rate of quasi-Newton minimization methods were obtained under the assumption that the method operates in the region of the extremum of the function, where there is a stable quadratic representation of the function. Methods based on the quadratic model of the function in the extremum area show significant advantages over classical gradient methods. However, when solving a specific problem with a quasi-Newton method, a huge number of iterations occur outside the extremum area, where there is no stable quadratic approximation of the function. In this paper, we study the convergence rate of quasi-Newton-type methods on strongly convex functions with a Lipschitz gradient, without using local quadratic approximations of the function based on the properties of its Hessian. We prove that quasi-Newton methods converge on strongly convex functions with a Lipschitz gradient at the rate of a geometric progression, and that the estimate of the convergence rate improves as the number of iterations grows, which reflects the fact that the learning (adaptation) effect accumulates as the method operates. Another important fact discovered during the theoretical study is the ability of quasi-Newton methods to eliminate the background that slows down the convergence rate. This elimination is achieved through a certain linear transformation that normalizes the elongation of the function level surfaces in different directions. All studies were carried out without any assumptions regarding the matrix of second derivatives of the function being minimized.

1. Introduction

Quasi-Newton (QN) methods for solving nonlinear optimization problems are based on the idea of reconstructing the matrix of second derivatives of a function from its gradients. The reconstructed matrix is used similarly to the second derivative matrix in Newton’s method. Quasi-Newton methods are effective tools for solving smooth minimization problems. Methods from the QN class are less costly than Newton’s method for large-scale optimization problems because their iterations do not use second-order derivatives. QN methods are applied in various areas, such as physics, biology, engineering, geophysics, chemistry, and industry, to solve nonlinear systems of equations. QN methods can be applied in deep learning for empirical risk minimization, where both the number of samples and the number of variables are large [1,2,3]. In microscopy, QN methods help to achieve high-resolution imaging [4]. In modeling the spread of infections, QN is useful for identifying unknown model coefficients [5]. QN methods are also useful for modeling complex crack propagation [6], fluid–structure interaction [7,8,9], melting and solidification of alloys [10], heat transfer systems [11], etc.
Nowadays, there are a significant number of matrix reconstruction formulas in QN methods, and hundreds of papers have been written on quasi-Newton methods (see, for example, [12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] and their bibliographies). The first QN method was proposed in [20] and improved in [24]; this matrix update formula is known as DFP (Davidon–Fletcher–Powell). The Symmetric Rank 1 (SR1) method [19,20] is another way to update the Hessian. Today, it is generally accepted [12,13,14,30] that the BFGS (Broyden–Fletcher–Goldfarb–Shanno) matrix updating formula [19,23,26,31] is the most efficient in the family of QN methods.
The main results of the convergence of quasi-Newton methods in the extremum region are given in [13,15]. These results refer to the convergence rate of quasi-Newton minimization methods under the assumption that the method operates in the extremum area of the function, where there is a stable quadratic representation of the function and where methods based on the quadratic model show significant advantages over standard gradient methods.
An incremental quasi-Newton method with a local superlinear convergence rate is presented in [32,33]. The proposed incremental algorithm reduces the computational cost by restricting the update to a single function per iteration and, relative to incremental second-order methods, by removing the need to compute the inverse of the Hessian.
The authors in [34] showed that a range of QN methods are first-order methods in the Nesterov definition [35]. They extended the complexity analysis for smooth strongly convex problems to finite dimensions and showed that in a worst-case scenario, the local superlinear or faster convergence rates of QN methods cannot be improved unless the number of iterations exceeds half of the problem size.
In [36], the authors confirmed certain superlinear convergence rates for QN methods, depending on the problem size and a specifically defined condition number. The analysis was developed based on the trace potential function, which was scaled by the logarithm of the determinant of the inverse Hessian approximation to extend the proof to the general nonlinear case. The results of [36] were further improved in [37], where the authors demonstrated that the convergence rate of the BFGS method depends only on the product of the problem dimensionality and the logarithm of its condition number.
Another analysis of the local non-asymptotic superlinear convergence of the DFP and BFGS methods was presented in [38]. The authors showed that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of (1/k)^{k/2}, where k is the number of performed iterations.
A limited-memory version of the BFGS method, L-BFGS [39], was proposed to handle high-dimensional problems. The algorithm stores only a few vectors that represent the approximation of the Hessian instead of the entire matrix. A version with bound constraints was proposed in [40]. The algorithm developed in [1] generates points randomly around the current iterate at every iteration to produce approximations that do not depend on information about past iterations.
Randomized variants of QN algorithms have been recently investigated. Such random methods employ a random direction at each iteration for updating the approximate Hessian matrix. The online L-BFGS method [41] adapts the L-BFGS method to make use of subsampled gradients. The regularized BFGS method [42,43] modifies the BFGS update by adding a regularizer to the metric matrix. The stochastic block BFGS method was proposed in [44]. This method enables the incorporation of curvature information in stochastic approximation. The estimate of the inverse Hessian matrix is updated at each iteration using a randomly generated compressed form of the Hessian. Such an approach was called a “sketch” technique. Then, the authors developed an adaptive variant of the randomized block BFGS, AdaRBFGS in [45], in which they modified the distribution underlying the stochasticity of the method throughout the iterative process. Further, in [46], it was shown that the block BFGS method also converges superlinearly, and a framework using a curvature-adaptive step size was introduced. In [47], a stochastic QN method is proposed that employs the classical BFGS update formula in its limited memory form and is based on collecting curvature information pointwise and at regular intervals through Hessian-vector products. In [48], the authors study stochastic QN methods in the interpolation setting and prove that these algorithms, including L-BFGS, can achieve global linear convergence with a constant batch size. The authors in [30] provide a semi-local convergence rate for the randomized BFGS method under the assumption that the function is self-concordant. An extension of BFGS proposed in [49] generates an estimate of the true objective function by taking the empirical mean over a sample drawn at each step and attains R-superlinear convergence. A regularized stochastic accelerated QN method (RES-NAQ) that combines the concept of the regularized stochastic BFGS method (RES) with the Nesterov accelerating technique by introducing a new momentum coefficient was proposed in [50].
Greedy variants of the QN method were introduced in [51]. In contrast to the classical QN methods, which use the difference of successive iterates for updating the Hessian approximations, the method in [51] applies basis vectors, greedily selected to maximize a certain measure of progress. An explicit non-asymptotic bound on the local superlinear convergence rate was established. This approach was further improved in [52,53], which proposed methods with a condition-number-free superlinear convergence rate.
When solving a specific problem using the QN method, a huge number of iterations occur outside the extremum area where there is no stable quadratic approximation of the function. In this paper, our aim is to study the convergence rate of quasi-Newton-type methods without assuming the existence of second derivatives of the function. Strongly convex functions with a Lipschitz gradient are studied, and local quadratic approximations of the function based on information about the properties of its Hessian are not used.
The obtained results provide estimates of the convergence rate of quasi-Newton methods on strongly convex functions with a Lipschitz gradient in the form of a geometric progression. It is shown that the indicators estimating the convergence speed improve as the number of iterations of the method increases, which indicates the benefit of adjusting the metric matrices in the method.
It is known that it is possible to both reduce the spread of elongation of level surfaces along different directions and increase it with the help of a linear transformation of coordinates. Quasi-Newton methods eliminate this scatter on quadratic functions. The work shows that in the case of strongly convex functions, there is a scattering that can be eliminated using a linear transformation of coordinates, and then, the quasi-Newton method also eliminates it. That is, it is possible to improve the behavior of a strongly convex function using some linear coordinate transformation, and then, the quasi-Newton method can recreate this coordinate transformation. This property of the method is based on its invariance under a linear coordinate transformation, which allows us to consider the method in a coordinate system with better characteristics in terms of estimates of the convergence rate.
The rest of the paper is organized as follows. In Section 2, we provide basic information about quasi-Newton methods. In Section 3, we restate necessary information about strongly convex functions and obtain an estimate for the convergence rate of quasi-Newton methods on strongly convex functions with a Lipschitz gradient, depending on the convexity constants and Lipschitz constants. Accelerating properties of quasi-Newton methods on strongly convex functions with a Lipschitz gradient are considered in Section 4. Numerical results are presented in Section 5. A short conclusion of the obtained results is given in the last section.

2. Quasi-Newton Methods

The iteration of the quasi-Newton method has the following form (see, for example, [12]):

x_{k+1} = x_k + β_k s_k,    (1)

s_k = −H_k ∇f(x_k),  β_k = arg min_{β ≥ 0} f(x_k + β s_k),    (2)

where ∇f(x_k) is the gradient of the objective function f at x_k, H_k denotes an approximation of the inverse Hessian [∇²f(x_k)]^{−1}, s_k is a search direction, and β_k is chosen to satisfy the Wolfe conditions. The following notation will be used:

Δx_k = x_{k+1} − x_k,  y_k = ∇f(x_{k+1}) − ∇f(x_k),    (3)

H_{k+1} = H(H_k, Δx_k, y_k),    (4)

where H_{k+1} = H(H_k, Δx_k, y_k) denotes an appropriate formula for updating the matrices H_k. The initial iterative point is denoted by x_0, and the initial approximation of [∇²f(x_0)]^{−1} satisfies H_0 > 0 (H_0 = I is usually assumed).
We will denote by A(A_k, Δx_k, y_k) the operator approximating the Hessian ∇²f(x_k). The process (1)–(4) is a certain approximation of the Newton optimization method. We will be interested in the accelerating properties of QN methods and the conditions for their appearance.
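To make the scheme concrete, the following minimal Python sketch implements the iteration (1)–(4) for a user-supplied update rule. It is an illustration, not the authors' code: the names (quasi_newton, update), the step bound, and the tolerances are our assumptions, and SciPy's bounded scalar minimizer merely stands in for the exact one-dimensional search (2).

```python
import numpy as np
from scipy.optimize import minimize_scalar  # stand-in for the exact search (2)

def quasi_newton(f, grad, x0, update, H0=None, tol=1e-8, max_iter=500):
    """Generic quasi-Newton iteration (1)-(4); `update` implements H_{k+1} = H(H_k, dx, y)."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size) if H0 is None else np.array(H0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s = -H @ g                                   # search direction s_k = -H_k g_k
        beta = minimize_scalar(lambda b: f(x + b * s),
                               bounds=(0.0, 1e3), method="bounded").x
        x_new = x + beta * s
        dx, y = x_new - x, grad(x_new) - g           # Delta x_k and y_k, Formula (3)
        if y @ dx > 1e-12:                           # curvature condition (17)
            H = update(H, dx, y)                     # metric update, Formula (4)
        x = x_new
    return x
```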
Well-known rules for updating the matrices H_k and A_k are as follows. The Davidon–Fletcher–Powell (DFP) updating formula [20,24] is given by the following:

H_{k+1} = H_DFP(H_k, Δx_k, y_k) = H_k − H_k y_k y_k^T H_k/(y_k, H_k y_k) + Δx_k Δx_k^T/(Δx_k, y_k),    (5)

A_{k+1} = A_DFP(A_k, Δx_k, y_k) = A_k − ((y_k − A_k Δx_k, Δx_k)/(y_k, Δx_k)²) y_k y_k^T + ((y_k − A_k Δx_k) y_k^T + y_k (y_k − A_k Δx_k)^T)/(y_k, Δx_k),    (6)

such that H_k = A_k^{−1}.
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) updating formula [19,23,26,31] is defined as follows:

H_{k+1} = H_BFGS(H_k, Δx_k, y_k) = H_k − ((Δx_k − H_k y_k, y_k)/(y_k, Δx_k)²) Δx_k Δx_k^T + ((Δx_k − H_k y_k) Δx_k^T + Δx_k (Δx_k − H_k y_k)^T)/(y_k, Δx_k).    (7)
The one-parameter family of formulas combining (5) and (7) is given by (see, for example, [12]):

H(H_k, Δx_k, y_k) = γ H_BFGS(H_k, Δx_k, y_k) + (1 − γ) H_DFP(H_k, Δx_k, y_k),  γ ∈ [0, 1].    (8)

Equation (8) can be represented as follows:

H_{k+1} = H(H_k, Δx_k, y_k) = H_k − H_k y_k y_k^T H_k/(y_k, H_k y_k) + Δx_k Δx_k^T/(Δx_k, y_k) + γ v_k v_k^T,    (9)

where

v_k = (y_k, H_k y_k)^{1/2} [Δx_k/(Δx_k, y_k) − H_k y_k/(y_k, H_k y_k)],  γ ∈ [0, 1].    (10)
Under the exact one-dimensional search (2), the approximations x_k obtained from Formulas (1)–(4) coincide whichever update formula of the one-parameter family (8) is used, i.e., for an arbitrary choice of γ ∈ [0, 1] (see [22]).
In what follows, symmetric positive definiteness of a matrix H will be denoted by H > 0. If H_0 > 0, then the family (8) generates symmetric matrices H_k.
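For reference, the update formulas of this section can be written compactly in NumPy. This is a hedged sketch: the helper names dfp_update, bfgs_update, and broyden_family_update are ours, intended to plug into the quasi_newton driver sketched above.

```python
import numpy as np

def dfp_update(H, dx, y):
    """DFP update (5) of the inverse-Hessian approximation."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(dx, dx) / (dx @ y)

def bfgs_update(H, dx, y):
    """BFGS update (7), written in the form used in this section."""
    Hy = H @ y
    w = dx - Hy
    return (H - (w @ y) / (y @ dx) ** 2 * np.outer(dx, dx)
            + (np.outer(w, dx) + np.outer(dx, w)) / (y @ dx))

def broyden_family_update(H, dx, y, gamma=1.0):
    """One-parameter family (9)-(10): gamma = 0 gives DFP, gamma = 1 gives BFGS."""
    Hy = H @ y
    v = np.sqrt(y @ Hy) * (dx / (dx @ y) - Hy / (y @ Hy))
    return dfp_update(H, dx, y) + gamma * np.outer(v, v)
```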

3. Convergence Rate of Quasi-Newton Methods on Strongly Convex Functions with Lipschitz Gradient

The main known studies of the convergence rate of QN methods were carried out in the region of the function extremum in the presence of a stable quadratic representation of the function. The fact that the DFP method converges at the rate of a geometric progression was established in [15] under the condition that the function is three times continuously differentiable and the matrix of second derivatives is bounded. In this work, an estimate of the convergence rate of a one-parameter family of QN methods on strongly convex functions is obtained without the assumption of the existence of second derivatives, and the accelerating properties of the QN family are substantiated in comparison with the gradient method. The obtained results indicate that QN methods, originally based on the assumption that a quadratic representation of the function exists in the neighborhood of a point, are able to approximate a coordinate transformation that reduces the degree of degeneracy of the function even in the absence of such a quadratic representation. Due to this fact, QN methods have an advantage in convergence speed compared to the steepest descent method. This result forms the content of this section.
For simplicity, the notations g(x) and g(x_k) will be used instead of ∇f(x) and ∇f(x_k). In what follows, we will assume the following condition.
Condition 1. 
The objective function f(x), x ∈ R^n, is differentiable and strongly convex in R^n, i.e., there exists ρ > 0 such that the inequality

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − α(1 − α)ρ‖x − y‖²/2    (11)

holds for all x, y ∈ R^n and α ∈ [0, 1], and the gradient g(x) satisfies the Lipschitz condition

‖g(x) − g(y)‖ ≤ L‖x − y‖  ∀x, y ∈ R^n,  L > 0.    (12)
Functions which fulfill Condition 1 satisfy the following relations [5]:
(g(x) − g(y), x − y) ≥ ρ‖x − y‖²  ∀x, y ∈ R^n,    (13)

f(x) − f* ≤ ‖g(x)‖²/(2ρ)  ∀x ∈ R^n,    (14)

(g(x) − g(y), x − y) ≥ ‖g(x) − g(y)‖²/L  ∀x, y ∈ R^n,    (15)

f(x) − f* ≥ ρ‖x − x*‖²/2  ∀x ∈ R^n,    (16)
where x* is the minimum point and f* = f(x*) is the function value at the minimum point.
Since the sequences of approximations x_k obtained by Formulas (1)–(4) with an exact one-dimensional search (2) coincide for an arbitrary choice of the matrix transformation formula (9) [22], all further reasoning will be carried out using the sequence of matrices H_k generated by the DFP Formula (5).
If the matrix H_0 is symmetric, then the family (8) generates symmetric matrices. If the condition

(y_k, Δx_k) > 0    (17)

holds and the matrix H_0 is strictly positive definite, then the matrices H_k retain strict positive definiteness [13]. If the function f satisfies Condition 1, then (13) implies the validity of (17). Hence, the sequence H_k obtained by rule (8) or (9) remains strictly positive definite when the objective function satisfies Condition 1.
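The following small NumPy check (our illustration, not part of the paper; it assumes the dfp_update helper sketched above) verifies on a convex quadratic with an exact line search that the curvature condition (17) holds at every step and that the DFP matrices remain positive definite.

```python
import numpy as np

A = np.diag([1.0, 10.0, 100.0])        # quadratic f(x) = x^T A x / 2 with rho = 1, L = 100
H = np.eye(3)
x = np.array([3.0, -2.0, 1.0])
for _ in range(3):                     # at most n = 3 exact-search steps are needed for a quadratic
    g = A @ x
    s = -H @ g
    beta = (g @ H @ g) / (s @ A @ s)   # exact minimizer of f(x + beta * s) along s
    x_new = x + beta * s
    dx, y = x_new - x, A @ x_new - g
    assert y @ dx > 0                                 # curvature condition (17)
    H = dfp_update(H, dx, y)
    assert np.all(np.linalg.eigvalsh(H) > 0)          # H_k stays strictly positive definite
    x = x_new
```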
In Lemma 1, we estimate the per-iteration reduction coefficient of the function in terms of the change in the gradient.
Lemma 1. 
Let the objective function f satisfy Condition 1. Then, as a result of iterations defined by (1)–(4), the function decreases with the estimate as follows:
f_{k+1} − f* ≤ q_k (f_k − f*),    (18)

where

q_k = [1 + ρ²‖y_k‖²/(L²‖g_{k+1}‖²)]^{−1},  ‖g_{k+1}‖ > 0.    (19)

In addition, the following inequality holds for the sequence of iterations f_j, j = 0, 1, …, k:

f_{k+1} − f* ≤ Q_k (f_0 − f*),  Q_k = ∏_{j=0}^{k} q_j.    (20)
Proof of Lemma 1. 
The exact value of the indicator is as follows:

q_k = (f_{k+1} − f*)/(f_k − f*) = (f_{k+1} − f*)/[(f_{k+1} − f*) + (f_k − f_{k+1})] = [1 + (f_k − f_{k+1})/(f_{k+1} − f*)]^{−1}.    (21)

Let us estimate the terms appearing in (21). According to (14),

f_{k+1} − f* ≤ ‖g_{k+1}‖²/(2ρ).    (22)

The Lipschitz condition (12) gives the following inequality:

‖Δx_k‖² ≥ ‖y_k‖²/L².    (23)

In view of (16), which is also valid for the one-dimensional function along the direction Δx_k, the following holds:

f_k − f_{k+1} ≥ ρ‖Δx_k‖²/2.    (24)

Combining (24) with (23) leads to the following:

f_k − f_{k+1} ≥ ρ‖y_k‖²/(2L²).    (25)

Applying the estimates (22) and (25) in (21) yields (18) with q_k given by (19). A recursive application of (18) produces (20). □
Denote by Sp(H) the trace of a matrix H. Applying the formulas for the traces of the matrices H_k and A_k = H_k^{−1} obtained in [15], we derive the estimates on which the evaluation of the convergence rate is built.
Lemma 2. 
Let the function satisfy Condition 1. The following estimates hold for sequences {Hj}, {yj}, {gj}, j = 0, 1,…, k generated as a result of the iterative process (1)–(4):
Σ_{j=0}^{k} a_j ≤ (k + 1) c_a,  a_j = (g_{j+1}, H_j g_{j+1})/‖y_j‖²,  c_a = 1/ρ + Sp(H_0)/(k + 1),    (26)

Σ_{j=0}^{k} b_j ≤ (k + 1) c_b,  b_j = ‖g_{j+1}‖²/(g_{j+1}, H_j g_{j+1}),  c_b = L + Sp(A_0)/(k + 1),    (27)

f_{k+1} − f* ≤ q_k (f_k − f*),    (28)

where

q_k = [1 + ρ²‖y_k‖²/(L²‖g_{k+1}‖²)]^{−1} = [1 + c_0/(a_k b_k)]^{−1} = a_k b_k/(c_0 + a_k b_k),  c_0 = ρ²/L²,  ‖g_{k+1}‖ > 0.    (29)

In addition, the following inequality holds for the sequence of iterations f_j, j = 0, 1, …, k:

f_{k+1} − f* ≤ Q_k (f_0 − f*),  Q_k = ∏_{j=0}^{k} q_j = ∏_{j=0}^{k} a_j b_j/(c_0 + a_j b_j).    (30)
Proof of Lemma 2. 
Expressions for the traces of the matrices H_k and A_k were calculated in [15] through Formulas (5) and (6) as follows:

Sp(H_{k+1}) = Sp(H_0) − Σ_{j=0}^{k} [ ‖H_j y_j‖²/(y_j, H_j y_j) − ‖Δx_j‖²/(y_j, Δx_j) ],    (31)

Sp(A_{k+1}) = Sp(A_0) + ‖g_{k+1}‖²/(g_{k+1}, H_{k+1} g_{k+1}) − ‖g_0‖²/(g_0, H_0 g_0) − Σ_{j=0}^{k} ‖g_{j+1}‖²/(g_{j+1}, H_j g_{j+1}) + Σ_{j=0}^{k} ‖y_j‖²/(Δx_j, y_j).    (32)
As noted earlier, the matrices H_k, and therefore A_k, are strictly positive definite. Due to the remark made above about the identity of the sequences x_k for different γ ∈ [0, 1] in (9), we will carry out the proof for the sequence H_k generated by (5).
Due to the inequality

‖z‖² ≤ Sp(A_{k+1}) (z, H_{k+1} z),

valid for every z ∈ R^n, the next inequality follows:

Sp(A_{k+1}) − ‖g_{k+1}‖²/(g_{k+1}, H_{k+1} g_{k+1}) ≥ 0.

From (32), considering the last inequality and (15), one obtains the following:

Σ_{j=0}^{k} ‖g_{j+1}‖²/(g_{j+1}, H_j g_{j+1}) = Sp(A_0) − ‖g_0‖²/(g_0, H_0 g_0) − [ Sp(A_{k+1}) − ‖g_{k+1}‖²/(g_{k+1}, H_{k+1} g_{k+1}) ] + Σ_{j=0}^{k} ‖y_j‖²/(Δx_j, y_j) ≤ Sp(A_0) + Σ_{j=0}^{k} ‖y_j‖²/(Δx_j, y_j) ≤ Sp(A_0) + (k + 1) L.    (33)
The inequality (13) implies the following:

‖Δx_j‖²/(y_j, Δx_j) ≤ 1/ρ.    (34)

The Schwarz inequality leads to the following:

(y_j, H_j y_j)² ≤ ‖H_j y_j‖² ‖y_j‖².    (35)

Due to the exact one-dimensional search condition (2), which yields (g_{j+1}, Δx_j) = 0, and since Δx_j = −β_j H_j g_j, the next equality holds:

(g_{j+1}, H_j g_j) = −β_j^{−1} (g_{j+1}, Δx_j) = 0.

From this and from the positive definiteness of the matrices H_j, the following holds:

(y_j, H_j y_j) = (g_{j+1}, H_j g_{j+1}) + (g_j, H_j g_j) − 2(g_{j+1}, H_j g_j) = (g_{j+1}, H_j g_{j+1}) + (g_j, H_j g_j) ≥ (g_{j+1}, H_j g_{j+1}).    (36)

Considering (34)–(36), the equality (31) is transformed as follows:

Σ_{j=0}^{k} (g_{j+1}, H_j g_{j+1})/‖y_j‖² ≤ Σ_{j=0}^{k} ‖H_j y_j‖²/(y_j, H_j y_j) = Sp(H_0) − Sp(H_{k+1}) + Σ_{j=0}^{k} ‖Δx_j‖²/(y_j, Δx_j) ≤ Sp(H_0) + (k + 1)/ρ.    (37)
From (37) and (33), we arrive at the following inequalities:

Σ_{j=0}^{k} (g_{j+1}, H_j g_{j+1})/‖y_j‖² ≤ Sp(H_0) + (k + 1)/ρ = ((k + 1)/ρ) [1 + ρ Sp(H_0)/(k + 1)],    (38)

Σ_{j=0}^{k} ‖g_{j+1}‖²/(g_{j+1}, H_j g_{j+1}) ≤ Sp(A_0) + (k + 1) L = (k + 1) L [1 + Sp(A_0)/(L(k + 1))].    (39)

Inequalities (38) and (39) are identical to (26) and (27). Estimate (30) follows from (20). □
The convergence rate of the QN method is determined by the indicator Q_k in (20). In the next lemma, we find an upper bound for this indicator by solving the following extremal problem:

Q_k → max over a, b, subject to (26) and (27),    (40)

where a^T = (a_0, a_1, …, a_k), b^T = (b_0, b_1, …, b_k).
Lemma 3. 
Under the conditions of Lemma 2, the solution to the problem (40) has the following form:

a_j* = c_a = 1/ρ + Sp(H_0)/(k + 1),  b_j* = c_b = L + Sp(A_0)/(k + 1),  j = 0, 1, …, k.    (41)

The optimal value is equal to the following:

Q_k* = (q_k*)^{k+1},    (42)

where

q_k* = c_a c_b/(c_0 + c_a c_b) = [1 + c_0/(c_a c_b)]^{−1}.    (43)
Proof of Lemma 3. 
To solve the problem (40), we form the Lagrange function

L(a, b, y_a, y_b) = ∏_{j=0}^{k} a_j b_j/(c_0 + a_j b_j) + y_a [ Σ_{j=0}^{k} a_j − (k + 1) c_a ] + y_b [ Σ_{j=0}^{k} b_j − (k + 1) c_b ].    (44)

The partial derivatives of L are equal to the following:

∂L/∂a_j = (Q_k/q_j) [ b_j (a_j b_j + c_0) − b_j (a_j b_j) ]/(c_0 + a_j b_j)² + y_a = (Q_k/q_j) b_j c_0/(c_0 + a_j b_j)² + y_a = Q_k c_0/[a_j (c_0 + a_j b_j)] + y_a = 0,

which implies

a_j (c_0 + a_j b_j) = −c_0 Q_k/y_a.    (45)

Similarly, the coefficients b_j fulfill the following:

b_j (c_0 + a_j b_j) = −c_0 Q_k/y_b.    (46)

From the expressions (45) and (46), it is easy to obtain that all elements a_j are equal; the same is true for the elements b_j.
Since the objective is increasing in each a_j and b_j, the solution lies on the boundary of the constraints, and the parameters a_j and b_j satisfy the following equalities:

Σ_{j=0}^{k} a_j = (k + 1) c_a,  Σ_{j=0}^{k} b_j = (k + 1) c_b.

From here, we obtain the statement (41).
The matrix of second derivatives of the Lagrange function is diagonal, and its elements are positive. Consequently, the sufficient conditions for an extremum are satisfied at the point with parameters (41), and these parameters provide a solution to the problem (40). □
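As an informal sanity check of Lemma 3 (our illustration with arbitrary constants c_0, c_a, c_b and k, not taken from the paper), one can sample random feasible points of problem (40) and confirm that none of them exceeds the value attained at the equal parameters (41).

```python
import numpy as np

rng = np.random.default_rng(0)
k, c0, ca, cb = 9, 0.25, 2.0, 3.0                       # arbitrary illustrative constants

def Q(a, b):                                            # objective of problem (40), cf. (30)
    return np.prod(a * b / (c0 + a * b))

def random_feasible_Q():
    # random positive vectors on the boundary of the constraints (26) and (27)
    a = rng.dirichlet(np.ones(k + 1)) * (k + 1) * ca
    b = rng.dirichlet(np.ones(k + 1)) * (k + 1) * cb
    return Q(a, b)

Q_star = Q(np.full(k + 1, ca), np.full(k + 1, cb))      # value at the solution (41), cf. (42)-(43)
print(max(random_feasible_Q() for _ in range(10000)) <= Q_star)   # expected: True
```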
Theorem 1. 
Let the function f satisfy Condition 1. Then, for the sequence of iterations f_j, j = 0, 1, …, k, the convergence rate of the goal function is estimated as follows:

f_{k+1} − f* ≤ Q_k* (f_0 − f*),  Q_k* = (q_k*)^{k+1},    (47)

where

q_k* ≤ 1 − (ρ³/(2L³)) [1 + ρ n M_0/(k + 1)]^{−1} [1 + n/(m_0 L (k + 1))]^{−1}.    (48)

Here, M_0 and m_0 are the maximal and minimal eigenvalues of the matrix H_0, respectively.
Proof of Theorem 1. 
We use the worst-case value (43) of the reduction indicator in (42) and transform it considering (26), (27), (40), and (41):

q_k* = [1 + c_0/(c_a c_b)]^{−1} = { 1 + (ρ³/L³) [1 + ρ Sp(H_0)/(k + 1)]^{−1} [1 + Sp(H_0^{−1})/(L(k + 1))]^{−1} }^{−1}.    (49)

To transform (49), we use the following inequality:

1/(1 + d) ≤ 1 − d/2,  d ∈ [0, 1].    (50)

Multiplying (50) by 1 + d gives the following:

1 ≤ (1 + d)(1 − d/2) = 1 + (d/2)(1 − d).

Based on this inequality, (50) is valid for d ∈ [0, 1].
In (49), the quantity

d = (ρ³/L³) [1 + ρ Sp(H_0)/(k + 1)]^{−1} [1 + Sp(H_0^{−1})/(L(k + 1))]^{−1}

is positive and does not exceed ρ³/L³ ≤ 1; hence, we can use (50) to transform (49). The result is the following inequality:

q_k* ≤ 1 − (ρ³/(2L³)) [1 + ρ Sp(H_0)/(k + 1)]^{−1} [1 + Sp(H_0^{−1})/(L(k + 1))]^{−1}.    (51)

In view of the relations for the matrix traces

Sp(H_0) ≤ n M_0,  Sp(H_0^{−1}) ≤ n/m_0,

(51) leads to the estimate (48). □
For a sufficiently large k, the convergence rate estimate for QN methods can be clearly represented in the following form:

q* ≈ 1 − ρ³/(2L³).    (52)
Note that we did not involve any information about the matrix of second derivatives when obtaining the estimates of the convergence rate. If the objective function is twice differentiable, then the eigenvalues of the matrix of second derivatives lie in the interval [ρ, L] determined by the strong convexity parameter and the Lipschitz parameter.
Let us analyze the convergence rate indicator (48).
  • The estimate (48) is determined by the strong convexity parameter ρ, Lipschitz parameter L and the properties of the initial matrix H0.
  • As the number of iterations k increases, the estimate of the indicator q_k* decreases and tends to the value given by (52). This fact is consistent with the expected improvement in the convergence rate resulting from the matrix transformation process in QN methods.
Due to the invariance of QN methods, estimates (48) can be considered in different coordinate systems. In the next section, we will consider the method in a coordinate system, where the ratio ρ/L is maximal.
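For orientation, the bound (48) is easy to evaluate numerically. The helper below is an illustration with made-up parameter values (not data from the paper); it shows how the estimate improves with k and approaches the limiting value (52).

```python
def qn_rate_bound(rho, L, n, M0, m0, k):
    """Upper bound (48) on the reduction factor q_k* of Theorem 1."""
    brackets = (1.0 + rho * n * M0 / (k + 1)) * (1.0 + n / (m0 * L * (k + 1)))
    return 1.0 - rho**3 / (2.0 * L**3) / brackets

# the estimate decreases with k and tends to 1 - rho^3/(2 L^3), cf. (52)
for k in (0, 10, 100, 10000):
    print(k, qn_rate_bound(rho=1.0, L=10.0, n=100, M0=1.0, m0=1.0, k=k))
print(1.0 - 1.0**3 / (2.0 * 10.0**3))   # limiting value from (52)
```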

4. Accelerating Properties of QN Methods on Strongly Convex Functions with Lipschitz Gradient

Further research is aimed at determining the conditions, beyond the trivial case of minimizing a quadratic function, under which QN methods are superior in convergence rate to the steepest descent method. Quasi-Newton methods are invariant under variable transformations

x̂ = P x,    (53)

where P is a non-singular (n × n) matrix [13]. This means that the form of the process (1)–(4) is completely identical to the form of the process in the new coordinate system. In this case, the corresponding quantities of the two processes are related by

Δx̂ = P Δx,  ŷ = P^{−T} y,  ĝ = P^{−T} g,  Ĥ = P H P^T,  x̂ = P x,

provided that the initial conditions are related by the following relations:

x̂_0 = P x_0,  Ĥ_0 = P H_0 P^T.    (54)

Here, P^{−T} = (P^T)^{−1}. In the new coordinate system, the process (1)–(4) is equivalent to minimizing the function

φ(x̂) = f(P^{−1} x̂) = f(x),

which, as is easy to show, satisfies Condition 1 with a strong convexity constant ρ_P and a Lipschitz constant L_P.
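The invariance relations above are easy to check numerically. The following sketch (our illustration, reusing the hypothetical quasi_newton and bfgs_update helpers from the sketches in Section 2) runs the method on a strongly convex test function and on its image under a random non-singular P, with the initial conditions matched as in (54), and compares the attained function values, which agree up to line-search accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([1.0, 50.0, 400.0])
f = lambda x: 0.5 * x @ A @ x + 0.1 * np.sum(x**4)     # strongly convex; Lipschitz gradient on bounded sets
grad = lambda x: A @ x + 0.4 * x**3

P = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)          # (almost surely) non-singular transformation
P_inv = np.linalg.inv(P)
f_hat = lambda z: f(P_inv @ z)                         # phi(x_hat) = f(P^{-1} x_hat)
grad_hat = lambda z: P_inv.T @ grad(P_inv @ z)         # g_hat = P^{-T} g

x0, H0 = np.array([10.0, 10.0, 10.0]), np.eye(3)
x_end = quasi_newton(f, grad, x0, bfgs_update, H0=H0)
z_end = quasi_newton(f_hat, grad_hat, P @ x0, bfgs_update, H0=P @ H0 @ P.T)   # conditions (54)
print(f(x_end), f_hat(z_end))       # approximately equal: the two processes are identical in form
```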
Define the following transformation:

x̂ = V x,

where V is a non-singular matrix such that

ρ_V/L_V ≥ ρ_P/L_P    (55)

for an arbitrary non-singular (n × n) matrix P.
Theorem 2. 
Let the conditions of Theorem 1 be satisfied. Then, the sequence f(x_k), k = 0, 1, 2, …, generated by the process (1)–(4) fulfills the following estimate:

f_k − f* ≤ (f_0 − f*)(q_k*)^k,

q_k* ≤ 1 − (ρ_V³/(2L_V³)) [1 + ρ_V n M_{0V}/(k + 1)]^{−1} [1 + n/(m_{0V} L_V (k + 1))]^{−1},    (56)

where M_{0V} and m_{0V} are the maximal and minimal eigenvalues of the matrix Ĥ_0 = V H_0 V^T, respectively.
Proof of Theorem 2. 
Due to the identical form of the minimization process in the old and new coordinate systems and the connection (54) between the initial conditions, the rate of the process (1)–(4) can be estimated in any coordinate system and, in particular, in the system x̂ = V x. This fact, together with (48), proves (56). □
The estimate (56) was obtained in the coordinate system selected by condition (55). If the function satisfies the relation

ρ_V/L_V ≫ ρ/L,

then the advantages of QN methods, compared to the steepest descent method, are indisputable. The result (56) was obtained without the assumption of the existence of second derivatives of the function being minimized. Under such weakened conditions on the goal function given in Condition 1, QN methods converge at the rate of a geometric progression and can eliminate the background that slows down the convergence rate. The elimination is enabled by the corresponding linear transformation of coordinates.

5. Numerical Experiment

The purpose of the numerical experiment is to study the ability of quasi-Newton methods to eliminate the background that slows down the convergence rate through a linear transformation which normalizes the elongation of the function level surfaces in different directions. For comparison, methods were chosen in which this background remains active during the solution process. The gradient descent (GR) method, the Hestenes–Stiefel (HS) conjugate gradient method, and the quasi-Newton (BFGS) method, each with a one-dimensional search procedure based on cubic interpolation, were implemented and compared.
Since the use of QN methods is justified primarily for highly ill-conditioned functions, on which conjugate gradient methods fail, the test functions were selected from this standpoint. The QN method is based on a quadratic model of a function; therefore, its local convergence rate in a certain neighborhood of the current minimum is largely determined by its efficiency in minimizing ill-conditioned quadratic functions. Therefore, the studies were carried out on quadratic functions and on functions derived from them.
If the function is twice differentiable, then the eigenvalues of the matrix of second derivatives lie in the interval [ρ, L] determined by the strong convexity and Lipschitz parameters. When creating the tests, we used the representation of a quadratic function and the analysis of its conditioning based on its eigenvalues. The test functions simulate an oscillatory behavior of the second derivatives of the function.
The following function is accepted as the basic quadratic function:

f_1(x, [a_max]) = (1/2) Σ_{i=1}^{n} a_i x_i²,  a_i = a_max^{(i−1)/(n−1)}.

To simulate random fluctuations of the second derivatives, a function f_2 was created based on the basic function f_1:

f_2 = f_1(x, [a_max]).

Its gradients were distorted randomly according to the following scheme:

∇f_2 = ∇f_1 (1 + r ξ),

where ξ ∈ [−1, 1] is a uniformly distributed random number and r = 0.3 is the gradient distortion parameter. It should be noted that the distortion of the gradients significantly reduces the accuracy of the one-dimensional search, where gradients are used to estimate directional derivatives in the cubic approximation.
The point x_0 = (100, 100, …, 100) was chosen as the starting point. The stopping criterion was as follows:

f(x_k) − f* ≤ ε = 10^{−10}.
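For reproducibility, the test problem of this section can be set up as follows. This is a sketch under our own conventions: in particular, whether the distortion factor 1 + rξ is drawn per component or once per gradient call is not specified in the text, and we draw it per component; the helper names are illustrative.

```python
import numpy as np

def make_test_problem(n, a_max, r=0.3, rng=None):
    """Quadratic test function f1 and its randomly distorted gradient (function f2)."""
    rng = np.random.default_rng() if rng is None else rng
    a = a_max ** (np.arange(n) / (n - 1))            # a_i = a_max^{(i-1)/(n-1)}, i = 1..n
    f = lambda x: 0.5 * np.sum(a * x**2)
    grad_exact = lambda x: a * x
    grad_noisy = lambda x: grad_exact(x) * (1.0 + r * rng.uniform(-1.0, 1.0, size=n))
    return f, grad_noisy

n = 100
f, grad = make_test_problem(n, a_max=1e3)
x0 = np.full(n, 100.0)                               # starting point (100, ..., 100)
eps = 1e-10                                          # stopping rule: f(x_k) - f* <= eps, with f* = 0 here
```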
The minimization results are presented in Table 1 and Table 2. For each method, the tables report N_it, the number of iterations (one-dimensional searches along the direction); nfg, the number of calls to the procedure that simultaneously computes the function and its gradient; and f, the achieved function value.
Table 1 shows that the BFGS method is the most effective, the HS method is second, and the GR method is last. Table 2 shows that, unlike the HS and GR methods, the BFGS method remains operational due to the removal of the background noise that worsens the convergence rate.

6. Conclusions

An overwhelming number of iterations occur outside the extremum area when solving a specific minimization problem using a quasi-Newton method if there is no stable quadratic approximation of the objective function. This paper presents a study of the convergence rate of quasi-Newton-type methods on strongly convex functions with a Lipschitz gradient without assuming the existence of second derivatives of the goal function. In our work, the convergence of quasi-Newton methods on strongly convex functions with a Lipschitz gradient is estimated in the form of a geometric progression.
The estimate of the convergence rate includes the dependence on the strong convexity parameter, Lipschitz parameter, and the initial matrix parameter. The convergence rate is determined by the ratio of constants ρ/L, which characterizes the spread of elongation of level surfaces in different directions. The greater this ratio, the higher the convergence rate.
It is shown that an increase in the number of iterations of the method improves the indicators for estimating the convergence rate, which demonstrates the benefit of adjusting the metric matrices in the method.
The property of invariance of quasi-Newton methods with respect to a linear transformation of coordinates enables us to consider the method in a coordinate system where the ratio ρ/L is maximal, that is, the spread of elongation of level surfaces in different directions is minimal, and to obtain a conclusion about the accelerating properties of quasi-Newton methods without relying on the matrix of the second derivatives of the function.
The computational experiment numerically confirms the theoretically predicted ability of quasi-Newton methods to eliminate the background noise that slows down the convergence rate. The research results can be applied in practice, for example, when choosing a method for training neural networks. As a suggestion for future work, the numerical experiment can be extended to other functions.
The study of the convergence rate of quasi-Newton minimization methods was carried out under the assumption of the exact line search (2). One area for future research may be a convergence analysis based on various inexact line search procedures.

Author Contributions

Conceptualization, V.K.; methodology, V.K., E.T. and P.S.; software, V.K.; validation, L.K., E.T. and P.S.; formal analysis, P.S.; investigation, E.T.; resources, L.K.; data curation, P.S.; writing—original draft preparation, V.K.; writing—review and editing, E.T., P.S. and L.K.; visualization, V.K.; supervision, V.K. and L.K.; project administration, L.K.; funding acquisition, P.S. and L.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Higher Education of the Russian Federation (Grant No. 075-15-2022-1121).

Data Availability Statement

Data are contained within the article.

Acknowledgments

Vladimir Krutikov, Predrag Stanimirović and Lev Kazakovtsev are grateful to the Ministry of Science and Higher Education of the Russian Federation (Grant No. 075-15-2022-1121). Predrag Stanimirović is grateful to the Science Fund of the Republic of Serbia (No. 7750185, Quantitative Automata Models: Fundamental Problems and Applications—QUAM).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Berahas, A.S.; Jahani, M.; Richtárik, P.; Takác, M. Quasi-Newton Methods for Machine Learning: Forget the Past, Just Sample. Optim. Methods Softw. 2022, 37, 1668–1704. [Google Scholar] [CrossRef]
  2. Rafati, J. Quasi-Newton Optimization Methods for Deep Learning Applications. 2019. Available online: https://arxiv.org/abs/1909.01994.pdf (accessed on 12 October 2023).
  3. Indrapriyadarsini, S.; Mahboubi, S.; Ninomiya, H.; Kamio, T.; Asai, H. Accelerating Symmetric Rank-1 Quasi-Newton Method with Nesterov’s Gradient for Training Neural Networks. Algorithms 2022, 15, 6. [Google Scholar] [CrossRef]
  4. Zhang, J.; Tao, X.; Sun, P.; Zheng, Z. A positional misalignment correction method for Fourier ptychographic microscopy based on the quasi-Newton method with a global optimization module. Opt. Commun. 2019, 452, 296–305. [Google Scholar] [CrossRef]
  5. Kokurin, M.M.; Kokurin, M.Y.; Semenova, A.V. Iteratively regularized Gauss–Newton type methods for approximating quasi–solutions of irregular nonlinear operator equations in Hilbert space with an application to COVID–19 epidemic dynamics. Appl. Math. Comput. 2022, 431, 127312. [Google Scholar] [CrossRef]
  6. Lampron, O.; Therriault, D.; Lévesque, M. An efficient and robust monolithic approach to phase-field quasi-static brittle fracture using a modified Newton method. Comput. Methods Appl. 2021, 386, 114091. [Google Scholar] [CrossRef]
  7. Spenke, T.; Hosters, N.; Behr, M. A multi-vector interface quasi-Newton method with linear complexity for partitioned fluid–structure interaction. Comput. Methods Appl. Mech. Eng. 2020, 361, 112810. [Google Scholar] [CrossRef]
  8. Zorrilla, R.; Rossi, R. A memory-efficient MultiVector Quasi-Newton method for black-box Fluid-Structure Interaction coupling. Comput. Struct. 2023, 275, 106934. [Google Scholar] [CrossRef]
  9. Davis, K.; Schulte, M.; Uekermann, B. Enhancing Quasi-Newton acceleration for Fluid-Structure Interaction. Math. Comput. Appl. 2022, 27, 40. [Google Scholar] [CrossRef]
  10. Tourn, B.; Hostos, J.; Fachinotti, V. Extending the inverse sequential quasi-Newton method for on-line monitoring and controlling of process conditions in the solidification of alloys. Int. Commun. Heat Mass Transf. 2023, 142, 1106647. [Google Scholar] [CrossRef]
  11. Hong, D.; Li, G.; Wei, L.; Li, D.; Li, P.; Yi, Z. A self-scaling sequential quasi-Newton method for estimating the heat transfer coefficient distribution in the air jet impingement. Int. J. Therm. Sci. 2023, 185, 108059. [Google Scholar] [CrossRef]
  12. Gill, P.E.; Murray, W.; Wright, M.H. Practical Optimization; SIAM: Philadelphia, PA, USA, 2020. [Google Scholar]
  13. Dennis, J.E.; Schnabel, R.B. Numerical Methods for Unconstrained Optimization and Nonlinear Equations; SIAM: Philadelphia, PA, USA, 1996. [Google Scholar]
  14. Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 2006. [Google Scholar]
  15. Polak, E. Computational Methods in Optimization; Mir: Moscow, Russia, 1974. [Google Scholar]
  16. Polyak, B.T. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987. [Google Scholar]
  17. Biggs, M.C. Minimization algorithms making use of non-quadratic properties of the objective function. J. Inst. Math. Appl. 1971, 8, 315–327. [Google Scholar] [CrossRef]
  18. Brodlie, K.W. An assessment of two approaches to variable metric methods. Math. Program. 1972, 7, 344–355. [Google Scholar] [CrossRef]
  19. Broyden, C.G. The convergence of a class of double−rank minimization algorithms. J. Inst. Math. Appl. 1970, 6, 76–79. [Google Scholar] [CrossRef]
  20. Davidon, W.C. Variable Metric Methods for Minimization; A.E.C. Res. and Develop. Report ANL−5990; Argonne National Laboratory: Argonne, IL, USA, 1959. [Google Scholar]
  21. Davidon, W.C. Optimally conditioned optimization algorithms without line searches. Math. Program. 1975, 9, 1–30. [Google Scholar] [CrossRef]
  22. Dixon, L.C. Quasi-Newton algorithms generate identical points. Math. Program. 1972, 2, 383–387. [Google Scholar] [CrossRef]
  23. Fletcher, R. A new approach to variable metric algorithms. Comput. J. 1970, 13, 317–322. [Google Scholar] [CrossRef]
  24. Fletcher, R.; Powell, M.J.D. A rapidly convergent descent method for minimization. Comput. J. 1963, 6, 163–168. [Google Scholar] [CrossRef]
  25. Fletcher, R.; Reeves, C.M. Function minimization by conjugate gradients. Comput. J. 1964, 7, 149–154. [Google Scholar] [CrossRef]
  26. Goldfarb, D. A family of variable metric methods derived by variational means. Math. Comput. 1970, 24, 23–26. [Google Scholar] [CrossRef]
  27. Oren, S.S. Self-scaling variable metric (SSVM) algorithms I: Criteria and sufficient conditions for scaling a class of algorithms. Manag. Sci. 1974, 20, 845–862. [Google Scholar] [CrossRef]
  28. Oren, S.S. Self-scaling variable metric (SSVM) algorithms II: Implementation and experiments. Manag. Sci. 1974, 20, 863–874. [Google Scholar] [CrossRef]
  29. Powell, M.J.D. Convergence Properties of a Class of Minimization Algorithms. In Nonlinear Programming; Mangasarian, O.L., Meyer, R.R., Robinson, S.M., Eds.; Academic Press: New York, NY, USA, 1975; Volume 2, pp. 1–27. [Google Scholar] [CrossRef]
  30. Kovalev, D.; Gower, R.M.; Richtarik, P.; Rogozin, A. Fast Linear Convergence of Randomized BFGS. 2020. Available online: https://arxiv.org/pdf/2002.11337.pdf (accessed on 12 October 2023).
  31. Shanno, D.F. Conditioning of quasi-Newton methods for function minimization. Math. Comput. 1970, 24, 647–656. [Google Scholar] [CrossRef]
  32. Mokhtari, A.; Eisen, M.; Ribeiro, A. An incremental quasi-Newton method with a local superlinear convergence rate. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4039–4043. [Google Scholar] [CrossRef]
  33. Mokhtari, A.; Eisen, M.; Ribeiro, A. IQN: An incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 2018, 28, 1670–1698. [Google Scholar] [CrossRef]
  34. Jensen, T.L.; Diehl, M. An Approach for Analyzing the global rate of convergence of Quasi-Newton and Truncated-Newton methods. J. Optim. Theory Appl. 2017, 172, 206–221. [Google Scholar] [CrossRef]
  35. Nesterov, Y. A method of solving a convex programming problem with convergence rate o(1/k2). Sov. Math. Dokl. 1983, 27, 372–376. [Google Scholar]
  36. Rodomanov, A.; Nesterov, Y. Rates of superlinear convergence for classical quasi-Newton methods. Math. Program. 2022, 194, 159–190. [Google Scholar] [CrossRef]
  37. Rodomanov, A.; Nesterov, Y. New results on superlinear convergence of classical Quasi-Newton methods. J. Optim. Theory Appl. 2021, 188, 744–769. [Google Scholar] [CrossRef]
  38. Jin, Q.; Mokhtari, A. Non-asymptotic superlinear convergence of standard quasi-Newton methods. Math. Program. 2023, 200, 425–473. [Google Scholar] [CrossRef]
  39. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528. [Google Scholar] [CrossRef]
  40. Zhu, C.; Byrd, R.H.; Lu, P.; Nocedal, J. L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization. ACM Trans. Math. Softw. 1997, 23, 550–560. [Google Scholar] [CrossRef]
  41. Schraudolph, N.; Gunter, S.; Jin, Y. A stochastic quasi-Newton method for online convex optimization. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico, 21–24 March 2007; pp. 436–443. [Google Scholar]
  42. Mokhtari, A.; Ribeiro, A. Regularized stochastic BFGS algorithm. IEEE Trans. Signal Proc. 2014, 62, 1109–1112. [Google Scholar] [CrossRef]
  43. Mokhtari, A.; Ribeiro, A. Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 2015, 16, 3151–3181. [Google Scholar]
  44. Gower, R.; Goldfarb, D.; Richtárik, P. Stochastic block BFGS: Squeezing more curvature out of data. In Proceedings of the 33rd International Conference on Machine Learning (ICML’16), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1869–1878. [Google Scholar]
  45. Gower, R.; Richtárik, P. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 2017, 38, 1380–1409. [Google Scholar] [CrossRef]
  46. Gao, W.; Goldfarb, D. Quasi-Newton methods: Superlinear convergence without line searches for self-concordant functions. Optim. Methods Softw. 2019, 34, 194–217. [Google Scholar] [CrossRef]
  47. Byrd, R.H.; Hansen, S.L.; Nocedal, J.; Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim 2016, 26, 1008–1031. [Google Scholar] [CrossRef]
  48. Meng, S.; Vaswani, S.; Laradji, I.; Schmidt, M.; Lacoste-Julien, S. Fast and Furious Convergence: Stochastic Second Order Methods under Interpolation. 2019. Available online: https://arxiv.org/pdf/1910.04920.pdf (accessed on 12 October 2023).
  49. Zhou, C.; Gao, W.; Goldfarb, D. Stochastic adaptive quasi-Newton methods for minimizing expected values. In Proceedings of the 34th ICML (PMLR), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 4150–4159. [Google Scholar]
  50. Makmuang, D.; Suppalap, S.; Wangkeeree, R. The regularized stochastic Nesterov’s accelerated Quasi-Newton method with applications. J. Comput. Appl. Math. 2023, 428, 115190. [Google Scholar] [CrossRef]
  51. Rodomanov, A.; Nesterov, Y. Greedy quasi-Newton methods with explicit superlinear convergence. SIAM J. Optim. 2021, 31, 785–811. [Google Scholar] [CrossRef]
  52. Lin, D.; Ye, H.; Zhang, Z. Greedy and Random Quasi-Newton Methods with Faster Explicit Superlinear Convergence. In Proceedings of the 34th Conference on Advances in Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021; Volume 34, pp. 6646–6657. [Google Scholar]
  53. Lin, D.; Ye, H.; Zhang, Z. Explicit Convergence Rates of Greedy and Random Quasi-Newton Methods. J. Mach. Learn. Res. 2022, 23, 7272–7311. [Google Scholar]
Table 1. Function f_2 minimization results. Parameter [a_max] = 10³.

n | GR N_it | GR nfg | GR f | HS N_it | HS nfg | HS f | BFGS N_it | BFGS nfg | BFGS f
100 | 596 | 1337 | 9.2491 × 10⁻¹¹ | 464 | 998 | 9.7246 × 10⁻¹¹ | 193 | 416 | 9.7600 × 10⁻¹¹
200 | 1006 | 2223 | 6.8119 × 10⁻¹¹ | 442 | 953 | 8.8826 × 10⁻¹¹ | 223 | 477 | 8.3226 × 10⁻¹¹
300 | 1218 | 2650 | 3.0812 × 10⁻¹¹ | 452 | 971 | 6.7694 × 10⁻¹¹ | 242 | 522 | 8.9270 × 10⁻¹¹
400 | 545 | 1202 | 2.3238 × 10⁻¹² | 454 | 974 | 8.5506 × 10⁻¹¹ | 266 | 579 | 9.1770 × 10⁻¹¹
500 | 1110 | 2417 | 9.9534 × 10⁻¹¹ | 465 | 1012 | 9.3773 × 10⁻¹¹ | 247 | 544 | 8.8586 × 10⁻¹¹
600 | 499 | 1109 | 3.4604 × 10⁻¹¹ | 494 | 1071 | 9.6048 × 10⁻¹¹ | 265 | 575 | 8.5970 × 10⁻¹¹
700 | 941 | 2081 | 9.5699 × 10⁻¹¹ | 458 | 994 | 7.8899 × 10⁻¹¹ | 272 | 586 | 8.7071 × 10⁻¹¹
800 | 761 | 1689 | 9.9708 × 10⁻¹¹ | 442 | 963 | 9.1321 × 10⁻¹¹ | 270 | 593 | 9.7620 × 10⁻¹¹
900 | 736 | 1636 | 6.5657 × 10⁻¹¹ | 472 | 1010 | 9.5092 × 10⁻¹¹ | 284 | 626 | 9.0551 × 10⁻¹¹
1000 | 944 | 2111 | 9.5688 × 10⁻¹¹ | 435 | 945 | 9.2528 × 10⁻¹¹ | 285 | 625 | 8.4472 × 10⁻¹¹
Table 2. Function f_2 minimization results. Parameter [a_max] = 10⁶.

n | GR N_it | GR nfg | GR f | HS N_it | HS nfg | HS f | BFGS N_it | BFGS nfg | BFGS f
100 | 10,001 | 22,163 | 18,623 | 10,001 | 23,266 | 1,380,105 | 452 | 1010 | 1.7175 × 10⁻¹¹
200 | 10,001 | 22,146 | 261,315 | 10,001 | 23,274 | 756,738 | 753 | 1696 | 4.1613 × 10⁻¹¹
300 | 10,001 | 22,199 | 0.8183 | 10,001 | 23,235 | 1,319,239 | 972 | 2183 | 9.3183 × 10⁻¹¹
400 | 10,001 | 22,094 | 541,444 | 10,001 | 23,218 | 823,225 | 1272 | 2909 | 7.2557 × 10⁻¹¹
500 | 10,001 | 22,182 | 3,456,399 | 10,001 | 23,250 | 83,606 | 1354 | 3072 | 8.7855 × 10⁻¹¹
600 | 10,001 | 21,986 | 3,485,875 | 10,001 | 23,238 | 1,303,868 | 1544 | 3525 | 9.8429 × 10⁻¹¹
700 | 10,001 | 22,184 | 1,875,235 | 10,001 | 23,297 | 1,016,413 | 1784 | 4066 | 9.8578 × 10⁻¹¹
800 | 10,001 | 22,141 | 1,892,176 | 10,001 | 23,262 | 1,484,428 | 1830 | 4203 | 9.5674 × 10⁻¹¹
900 | 10,001 | 22,239 | 344,032 | 10,001 | 23,187 | 1,368,998 | 2154 | 4912 | 8.0695 × 10⁻¹¹
1000 | 10,001 | 22,246 | 51,892 | 10,001 | 23,170 | 1,627,625 | 2141 | 4879 | 8.8202 × 10⁻¹¹

