Article

Machine Learning in Quasi-Newton Methods

by Vladimir Krutikov 1,2, Elena Tovbis 3, Predrag Stanimirović 1,4, Lev Kazakovtsev 1,3,* and Darjan Karabašević 5,6,*
1 Laboratory “Hybrid Methods of Modeling and Optimization in Complex Systems”, Siberian Federal University, 79 Svobodny Prospekt, 660041 Krasnoyarsk, Russia
2 Department of Applied Mathematics, Kemerovo State University, 6 Krasnaya Street, 650043 Kemerovo, Russia
3 Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, 31, Krasnoyarskii Rabochii Prospekt, 660037 Krasnoyarsk, Russia
4 Faculty of Sciences and Mathematics, University of Niš, 18000 Niš, Serbia
5 College of Global Business, Korea University, Sejong 30019, Republic of Korea
6 Faculty of Applied Management, Economics and Finance, University Business Academy in Novi Sad, Jevrejska 24, 11000 Belgrade, Serbia
* Authors to whom correspondence should be addressed.
Axioms 2024, 13(4), 240; https://doi.org/10.3390/axioms13040240
Submission received: 14 February 2024 / Revised: 22 March 2024 / Accepted: 2 April 2024 / Published: 5 April 2024

Abstract: In this article, we consider the correction of metric matrices in quasi-Newton methods (QNM) from the perspective of machine learning theory. Based on training information for estimating the matrix of the second derivatives of a function, we formulate a quality functional and minimize it by using gradient machine learning algorithms. We demonstrate that this approach leads us to the well-known ways of updating metric matrices used in QNM. The learning algorithm for finding metric matrices performs minimization along a system of directions, the orthogonality of which determines the convergence rate of the learning process. The degree of learning vectors’ orthogonality can be increased both by choosing a QNM and by using additional orthogonalization methods. It has been shown theoretically that the orthogonality degree of learning vectors in the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method is higher than in the Davidon–Fletcher–Powell (DFP) method, which determines the advantage of the BFGS method. In our paper, we discuss some orthogonalization techniques. One of them is to include iterations with orthogonalization or an exact one-dimensional descent. As a result, it is theoretically possible to detect the cumulative effect of reducing the optimization space on quadratic functions. Another way to increase the orthogonality degree of learning vectors at the initial stages of the QNM is a special choice of initial metric matrices. Our computational experiments on problems with a high degree of conditionality have confirmed the stated theoretical assumptions.

1. Introduction

The problem of unconstrained minimization of smooth functions in a finite-dimensional Euclidean space has received a lot of attention in the literature [1,2]. In unconstrained optimization, in contrast to constrained optimization [3], the process of optimizing the objective function is carried out in the absence of restrictions on variables. Unconstrained problems arise also as reformulations of constrained optimization problems, in which the constraints are replaced by penalization terms in the objective function that have the effect of discouraging constraint violations [2].
Well-known methods [1,2] that enable us to solve such a problem include the gradient method, which is based on the idea of function local linear approximation, or Newton’s method, which uses its quadratic approximation. The Levenberg–Marquardt method is a modification of Newton’s method, where the direction of descent differs from that specified by Newton’s method. The conjugate gradient method is a two-step method in which the parameters are found from the solution of a two-dimensional optimization problem.
Quasi-Newton minimization methods are effective tools for solving smooth minimization problems when the function level curves have a high degree of elongation [4,5,6,7]. QNMs are commonly applied in a wide range of areas, such as biology [8], image processing [9], engineering [10,11,12,13,14,15], and deep learning [16,17,18].
The QNM is based on the idea of using a matrix of second derivatives reconstructed from the gradients of a function. The first QNM was proposed in [19] and improved in [20]. The generally accepted notation for the matrix updating formula in this method is DFP. Nowadays, there are a significant number of equations for updating matrices in the QNM [4,5,6,7,21,22,23,24,25,26,27,28], and it is generally accepted [4,5] that among a variety of QNMs, the best methods use the BFGS matrix updating equation [29,30,31]. However, it has been experimentally established, but not theoretically explained, why the BFGS generates the best results among the QNMs [5].
A limited-memory version of the BFGS method, L-BFGS [32], was presented to handle high-dimensional problems. The algorithm stores only a few vectors that represent the approximation of the Hessian instead of the entire matrix. A version with bound constraints was proposed in [33].
The penalty method [2] was developed for solving constrained optimization problems. The unconstrained problems are formed by adding a term, called a penalty function, to the objective function. The penalty is zero for feasible points and non-zero for infeasible points.
The development of QNMs occurred spontaneously through the search for matrix updating equations that satisfy certain properties of data approximation obtained in the problem solving process. In this paper, we consider a method for deriving matrix updating equations in QNMs by forming a quality functional based on learning relations for matrices, followed by obtaining matrix updating equations in the form of a step of the gradient method for minimizing the quality functional. This approach has shown high efficiency in organizing subgradient minimization methods [34,35].
In machine learning theory, the system in which the average risk (mathematical expectation of the total loss function) is minimal is considered optimal [36,37]. The goal of learning represents the state that has to be reached by the learning system in the process of learning. The selection of such a desired state is actually achieved by a proper choice of a certain functional that has an extremum which corresponds to the desired state [38]. Thus, in the matrix learning process, it is necessary to formulate a quality functional.
In QNMs, for each row of the matrix, its product with a vector provides a learning relation. Consequently, we have a linear model with the coefficients of the matrix row as its parameters. Thus, we may formulate a quadratic learning quality functional for a linear model and obtain a gradient machine learning (ML) algorithm. This paper shows how one can obtain known methods for updating matrices in QNMs based on a gradient learning algorithm. Based on the general properties of convergence of gradient learning algorithms, it seems relevant to study the origins of the effectiveness of metric updating equations in QNMs.
In a gradient learning algorithm, the sequence of steps is represented as a method of minimization along a system of directions. The degree of orthogonality of these directions determines the convergence rate of the algorithm. The use of gradient learning algorithms for deriving matrix updating equations in QNMs enables us to analyze the quality of matrix updating algorithms based on the convergence rate properties of the learning algorithms. This paper shows that the higher degree of orthogonality of learning vectors in the BFGS method determines its advantage compared to the DFP method.
Studies on quadratic functions identify conditions under which the space dimension is reduced during the QNM iterations. The dimension of the minimization space is reduced when the QNM includes iterations with an exact one-dimensional descent or an iteration with additional orthogonalization. It is possible to increase the orthogonality of the learning vectors and thereby increase the convergence rate of the method through special normalization of the initial matrix.
The computational experiment was carried out on functions with a high degree of conditionality. Various ways of increasing the orthogonality of learning vectors were assessed. The theoretically predicted effects of increasing the efficiency of QNMs confirmed their effectiveness in practice. It turned out that with an approximate one-dimensional descent, additional orthogonalization in iterations of the algorithm significantly increased the efficiency of the method. In addition, the efficiency of the method also increased significantly with the correct normalization of the initial matrix.
The rest of this paper is organized as follows. In Section 2, we provide basic information about matrix learning algorithms in QNMs. Section 3 contains an analysis of matrix updating formulas in QNMs. A symmetric positive definite metric is considered in Section 4. Section 5 gives a qualitative analysis of the BFGS and DFP matrix updating equations. Methods for reducing the minimization space of QNMs on quadratic functions are presented in Section 6. Methods for increasing the orthogonality of learning vectors in QNMs are considered in Section 7. In Section 8, we present a numerical study, and the last section summarizes the work.

2. Matrix Learning Algorithms in Quasi-Newton Methods

Consider the minimization problem
$$f(x) \to \min, \quad x \in \mathbb{R}^n.$$
The QNM for this problem is iterated as follows:
$$x^{k+1} = x^k + \beta_k s^k, \quad s^k = -H^k \nabla f(x^k), \qquad (1)$$
$$\beta_k = \arg\min_{\beta \ge 0} f(x^k + \beta s^k), \qquad (2)$$
$$\Delta x^k = x^{k+1} - x^k, \quad y^k = \nabla f(x^{k+1}) - \nabla f(x^k), \qquad (3)$$
$$H^{k+1} = H(H^k, \Delta x^k, y^k). \qquad (4)$$
Here, ∇f(x) is the gradient of the function, s^k is the search direction, and β_k is chosen to satisfy the Wolfe conditions [2]. Further, H^k ∈ R^{n×n} is a symmetric matrix which is used as an approximation of the inverse Hessian. The operator
$$H(H, \Delta x, y) \in \mathbb{R}^{n \times n}, \quad H \in \mathbb{R}^{n \times n}, \quad \Delta x, y \in \mathbb{R}^n, \qquad (5)$$
specifies a certain equation for updating the matrix H. At the input of the algorithm, the starting point x^0 and a symmetric strictly positive definite matrix H^0 must be specified. Such a matrix will be denoted as H^0 > 0.
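To make the scheme in (1)–(4) concrete, the following minimal NumPy sketch performs one quasi-Newton iteration for a user-supplied metric update H(H, Δx, y). It is only an illustration: the function names (`qn_step`, `backtracking`) are ours, and the crude backtracking routine stands in for a proper Wolfe-condition line search.

```python
import numpy as np

def backtracking(f, x, s, beta0=1.0, shrink=0.5, c1=1e-4, max_iter=50):
    """Crude inexact line search along direction s (a stand-in for a Wolfe search)."""
    fx, g = f(x)
    beta = beta0
    for _ in range(max_iter):
        fx_new, _ = f(x + beta * s)
        if fx_new <= fx + c1 * beta * np.dot(g, s):  # sufficient decrease test
            break
        beta *= shrink
    return beta

def qn_step(f, x, H, update):
    """One iteration of (1)-(4); f returns the pair (value, gradient)."""
    _, g = f(x)
    s = -H @ g                      # search direction, Equation (1)
    beta = backtracking(f, x, s)    # step size, Equation (2)
    x_new = x + beta * s
    _, g_new = f(x_new)
    dx, y = x_new - x, g_new - g    # learning pair, Equation (3)
    return x_new, update(H, dx, y)  # metric update, Equation (4)
```

Concrete choices of `update` (Broyden's second method, the Greenstadt update, BFGS, and DFP) are sketched in the sections below.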
Let us consider the relations used to obtain updating equations for the matrices H^k on quadratic functions:
$$f(x) = \frac{1}{2}\langle x - x^*, A(x - x^*)\rangle + d, \quad A > 0, \qquad (6)$$
where x* is the minimum point. Here and below, the expression ⟨·,·⟩ means a scalar product of vectors. Without a loss of generality, we assume d = 0. The gradient of a quadratic function f(x) is ∇f(x) = A(x − x*). For Δx ∈ R^n, the gradient difference y = ∇f(x + Δx) − ∇f(x) satisfies the relation
$$A\Delta x = y \quad \text{or} \quad A^{-1}y = \Delta x. \qquad (7)$$
The equalities in (7) are used to obtain various equations for updating matrices H^k, which are approximations of A^{−1}, or matrices B^k = (H^k)^{−1}, which are approximations of A. An arbitrary equation for updating matrices H or B, the result of which is a matrix satisfying (7), will be denoted by H(H, Δx, y) or B(B, Δx, y), respectively.
Denoting by A_i and A^{−1}_i the i-th rows of the matrices A and A^{−1}, respectively, we obtain from (7) the learning relations necessary to formulate algorithms for learning the matrix rows:
$$A_i \Delta x = y_i, \quad A^{-1}_i y = \Delta x_i, \quad i = 1, 2, \ldots, n, \qquad (8)$$
where yi and Δxi are the components of the vectors in (7). The relations in (8) make it possible to use machine learning algorithms of a linear model in the parameters to estimate the rows of the corresponding matrices.
Let us formulate the problem of estimating the parameters of a linear model from observational data.
ML problem: find unknown parameters c* ∈ Rn of the linear model
$$y = \langle z, c\rangle, \quad z, c \in \mathbb{R}^n, \quad y \in \mathbb{R}^1, \qquad (9)$$
from observational data
$$y_k \in \mathbb{R}^1, \quad z^k \in \mathbb{R}^n, \quad k = 0, 1, 2, \ldots, \qquad (10)$$
where yk = <c*, zk>. We will use an indicator of training quality,
$$Q(z, c) = \frac{1}{2}\left(\langle z, c\rangle - y\right)^2, \qquad (11)$$
which is an estimate of the quality functional required to find c*.
Function (11) is a loss function. Due to the large dimension of the problem of estimating the elements of metric matrices, the use of the classical least squares method becomes difficult. We use the adaptive least squares method (recurrent least squares formulas).
The gradient learning algorithm based on (11) has the following form:
$$c^{k+1} = c^k - h_k \nabla Q(z^k, c^k) = c^k - h_k\left(\langle z^k, c^k\rangle - y_k\right) z^k. \qquad (12)$$
Due to the orthogonality of the training vectors, the stochastic gradient method in the form "receive an observation, train, forget the observation" enables us, in quasi-Newton methods, to obtain good approximations of the inverse matrices of second derivatives while maintaining their symmetry and positive definiteness.
In this paper, the value of such a consideration is that it allows us to identify the advantages of the BFGS method, to obtain a method with orthogonalization of the learning vectors, and to confirm these statements through testing.
The Kaczmarz algorithm [39] is a special case of (12) with the form
$$c^{k+1} = c^k - \frac{\langle z^k, c^k\rangle - y_k}{\langle z^k, z^k\rangle}\, z^k. \qquad (13)$$
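A minimal sketch of the Kaczmarz step (13) in NumPy (the function name is ours):

```python
import numpy as np

def kaczmarz_step(c, z, y):
    """One Kaczmarz iteration (13): project c onto the hyperplane <z, c> = y."""
    return c - (np.dot(z, c) - y) / np.dot(z, z) * z
```

After the step, np.dot(z, c_new) equals y up to rounding, which is exactly Property 1 below.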
Let us list some of the properties of process (13), which we use to justify the properties of matrix updating in QNMs.
Property 1.
Process (13) ensures the equality
$$y_k = \langle z^k, c^{k+1}\rangle, \qquad (14)$$
and this solution is achieved with the minimum change in the parameter values, i.e., with minimal ‖c^{k+1} − c^k‖.
Property 2.
If yk = <c*, zk> then the iteration of process (13) is equivalent to the step of minimizing the quadratic function
$$\phi(c) = \langle c - c^*, c - c^*\rangle / 2 \qquad (15)$$
from the point ck along the direction zk.
Proof. 
Property 2 is verified directly by carrying out the step of minimizing the function in (15) along the direction z^k, which is presented in Figure 1. Property 1 follows from the fact that the movement to the point c^{k+1} is carried out along the normal to the hyperplane ⟨z^k, c⟩ = y_k, that is, along the shortest path (Figure 1). Movements to other points on the hyperplane, for example to point A, satisfy only the condition in (14). □
Let us denote the residual as rk = ckc*. By subtracting c* from both sides of (13) and making transformations, we obtain the following learning algorithm in the form of residuals:
$$r^{k+1} = W(z^k)\, r^k, \quad W(z) = I - \frac{z z^T}{z^T z}, \qquad (16)$$
where I is the identity matrix. The sequence of minimization steps can be represented in the form of the residual transformation, where m is the number of iterations:
$$r^{k+1} = W_{k-m}^{k}(z)\, r^{k-m}, \quad W_{k-m}^{k}(z) = W(z^k)\, W(z^{k-1}) \cdots W(z^{k-m}). \qquad (17)$$
The convergence rate of process (13) is significantly affected by the degree of orthogonality of the learning vectors z. The following property reflects the well-known fact that minimization along orthogonal directions terminates finitely for the quadratic form in (15), whose Hessian has equal eigenvalues.
Property 3.
Let the vectors z^k, k = l, l + 1, …, l + n − 1, for a sequence of n iterations of (13) be mutually orthogonal. Then, the solution c* minimizing the function in (15) is obtained in no more than n steps of the process in (13) for an arbitrary initial c^l, wherein
$$r^{l+n} = W_l^{l+n-1}(z)\, r^l = 0, \quad W_l^{l+n-1}(z) = 0. \qquad (18)$$
The following results are useful to estimate the convergence rate of the process in (13) as a method for minimizing the function in (15) without orthogonality of the descent vectors.
Consider a cycle of iterations for minimizing a function θ(x), x ∈ R^n, along the column vectors z^k, ‖z^k‖ = 1, k = 1, …, n, of a matrix Z ∈ R^{n×n}:
$$x^{k+1} = x^k + \beta_k z^k, \quad \beta_k = \arg\min_{\beta \ge 0} \theta(x^k + \beta z^k), \quad k = 1, \ldots, n. \qquad (19)$$
Here and below, we use the Euclidean vector norm ‖x‖ = ⟨x, x⟩^{1/2}. Let us present the result of the iterations in (19) in the form of the operator x^{n+1} = XP(x^1, Z). Consider the process
$$u^{q+1} = XP(u^q, Z^q), \quad q = 0, 1, \ldots, \qquad (20)$$
where matrices Zq and the initial approximation u0 are given. To estimate the convergence rate of the QNM and the convergence rate of the metric matrix approximation, we need the following assumption about the properties of the function.
Assumption 1.
Let the function be strongly convex, with a constant ρ > 0, and differentiable, and its gradient satisfy the Lipschitz condition with a constant L > 0.
We assume that the function f(x), x ∈ R^n, is differentiable and strongly convex in R^n, i.e., there exists ρ > 0 such that for all x, y ∈ R^n and α ∈ [0, 1], the following inequality holds:
$$f(\alpha x + (1 - \alpha)y) \le \alpha f(x) + (1 - \alpha) f(y) - \alpha(1 - \alpha)\rho \|x - y\|^2 / 2,$$
and its gradient ∇f(x) satisfies the Lipschitz condition:
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \quad \forall x, y \in \mathbb{R}^n, \quad L > 0.$$
Let us denote the minimum point of the function θ(x) by x*. The following theorem [40] establishes the convergence rate of the iteration cycle (20).
Theorem 1.
Let the function θ(x), x ∈ R^n, satisfy Assumption 1, and let the matrices Z^q of the process in (20) be such that the minimum eigenvalues μ_q of the matrices (Z^q)^T Z^q satisfy the constraint μ_q ≥ μ_0 > 0. Then, the following inequality estimates the convergence rate of the process in (20):
$$\theta(u^m) - \theta(x^*) \le \left[\theta(u^0) - \theta(x^*)\right]\exp\left(-\frac{m\rho^2\mu_0^2}{2L^2 n^3}\right). \qquad (22)$$
Estimate (22) enables us to formulate the following property of the process in (13).
Property 4.
Let the vectors z^k, k = 0, 1, …, n − 1, in (13) be given, let the columns of the matrix Z be composed of the vectors z^k/‖z^k‖, and let the minimum eigenvalue μ of the matrix Z^T Z satisfy the constraint μ ≥ μ_0 > 0. Then, the following inequality estimates the convergence rate:
$$\|c^n - c^*\|^2 \le \|c^0 - c^*\|^2 \exp\left(-\frac{\mu_0^2}{2n^3}\right). \qquad (23)$$
Proof. 
Let us apply the results of Theorem 1 to the process in (13). The strong convexity and Lipschitz constants for the gradient of the quadratic function in (15) are the same: ρ = L = 1. Using Property 2 and the estimate in (22) for m = 1, we obtain (23). □
The property of the operators W_l^{l+n−1}, when the conditions of Property 4 are met, is determined by the estimate in (23), which can be represented in the following form:
$$\|r^n\|^2 = \|W_0^{n-1}(z)\, r^0\|^2 \le \|r^0\|^2 \exp\left(-\frac{\mu_0^2}{2n^3}\right). \qquad (24)$$
Thus, the Kaczmarz algorithm provides a solution to the equality in (14) for the last observation, while it implements a local learning strategy, i.e., a strategy for iteratively improving the approximation quality from the point of view of the functional in (15). If the learning vectors are orthogonal, the solution is found in no more than n iterations. When n learning vectors are linearly independent, the convergence rate in (23) is determined by the degree of the learning vectors' orthogonality. The degree of the vectors' orthogonality is indicated by the boundedness of the minimum eigenvalue μ ≥ μ_0 > 0 of the matrix Z^T Z defined in Property 4.
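The finite-termination claim of Property 3 is easy to check numerically: with mutually orthogonal learning vectors z^k, the Kaczmarz process recovers c* after n steps. The sketch below (illustrative only, reusing the `kaczmarz_step` helper defined above) verifies this on a random example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
c_star = rng.normal(size=n)
c = np.zeros(n)

# Orthogonal learning vectors: an orthonormal basis from a QR factorization.
Z, _ = np.linalg.qr(rng.normal(size=(n, n)))
for k in range(n):
    z = Z[:, k]
    c = kaczmarz_step(c, z, np.dot(c_star, z))   # y_k = <c*, z^k>

print(np.linalg.norm(c - c_star))  # ~1e-15: c* recovered in n steps
```

Replacing Z with a set of nearly collinear vectors slows the process down dramatically, in line with the estimate in (23).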
Using the learning relations in (8), we obtain machine learning algorithms for estimating the rows of the corresponding matrices in the form of the process in (13). Consequently, the question of analyzing the quality of algorithms for updating matrices in QNMs will consist of analyzing learning relations like (8) and the degree of orthogonality of the vectors involved in training.

3. Gradient Learning Algorithms for Deriving and Analyzing Matrix Updating Equations in Quasi-Newton Methods

Well-known equations for matrix updating in QNMs were found as equations that eliminate the mismatch on a new portion of training information. In machine learning theory, a quality measure is formulated, and a gradient minimization algorithm is used to minimize this measure. Our goal is to give an account of QNMs from the standpoint of machine learning theory, i.e., to formulate quality measures of training and construct algorithms for their minimization. This approach enables us to obtain a unified method for deriving matrix updating equations and to extend the known facts and algorithms of learning theory to the analysis and improvement of QNMs.
Let us obtain formulas for updating matrices in QNMs using the quadratic model of the minimized function in (6) and learning relations in (7). For one of the learning relations in (7), we present a complete study of Properties 1–4.
Let the current approximation H of the matrix H* = A−1 be known. It is required to construct a new approximation using the learning relations in (7) for the rows of the matrix in (8):
$$H^* y = \Delta x \quad \text{or} \quad H^*_i y = \Delta x_i, \quad i = 1, 2, \ldots, n. \qquad (25)$$
To evaluate each row of the matrix H* based on (25), we apply Algorithm (13). As a result, we obtain the following matrix updating equation:
$$H^+ = H_{B2}(H, \Delta x, y) = H + \frac{(\Delta x - Hy)\, y^T}{y^T y}, \qquad (26)$$
which is known as the 2nd Broyden method for estimating matrices when solving systems of non-linear equations [5,6].
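In code, the update (26) is a rank-one correction of H; a minimal sketch (the function name is ours) that can serve as the `update` argument of the `qn_step` sketch above:

```python
import numpy as np

def broyden2_update(H, dx, y):
    """Broyden's second update (26): enforces H_new @ y == dx via a rank-one change."""
    return H + np.outer(dx - H @ y, y) / np.dot(y, y)
```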
Equation (26) determines a step of minimizing a functional of the type of (15) for each of the rows H_i of the matrix H along the direction y:
$$\phi(H_i) = \|H_i - H^*_i\|^2 / 2, \quad i = 1, 2, \ldots, n. \qquad (27)$$
The matrix residual is R = HH*. Because of the iteration of (26), the residual is transformed according to the rule
$$R^+ = R\, W(y). \qquad (28)$$
Let us denote the scalar product of matrices A, B ∈ R^{n×n} as
$$\langle A, B\rangle = \sum_{i=1}^n \langle A_i, B_i\rangle = \sum_{j=1}^n \sum_{i=1}^n A_{ij} B_{ij}.$$
We use the Frobenius norm of matrices:
$$\|H\| = \left(\sum_{i=1}^n \|H_i\|^2\right)^{1/2}.$$
Let us define the function
$$\Phi(H) = \sum_{i=1}^n \|H_i - H^*_i\|^2 / 2 = \|H - H^*\|^2 / 2, \qquad (29)$$
and reformulate Properties 1–4 for the matrix updating process in (26).
Theorem 2.
Iteration (26) is equivalent to a step of minimizing Φ(H) from the point H along the direction ΔH:
$$\Delta H = (\Delta x - Hy)\, y^T / (y^T y). \qquad (30)$$
At the same time,
$$H^+ y = \Delta x, \qquad (31)$$
$$\|H^+ - H\| \le \|H_{\Delta x} - H\| \qquad (32)$$
for arbitrary matrices H_{Δx} ∈ R^{n×n} satisfying the condition in (31).
Proof of Theorem 2.
Let us show that the condition for the minimum of the function in (27) along the direction ΔH (30) is satisfied at the point H^+:
$$\langle \Delta H, \nabla\Phi(H^+)\rangle = \sum_{j=1}^n \sum_{i=1}^n (\Delta x - Hy)_i\, y_j\, (H^+_{ij} - H^*_{ij}) = \sum_{i=1}^n (\Delta x - Hy)_i\, \langle H^+_i - H^*_i, y\rangle = \sum_{i=1}^n (\Delta x - Hy)_i\, (\Delta x_i - \Delta x_i) = 0. \qquad (33)$$
Next, we prove (32) by showing that ΔH is the normal of the hyperplane of matrices satisfying the condition in (31). To do this, we prove the orthogonality of the direction in (30) to an arbitrary vector of this hyperplane, formed as the difference of matrices belonging to the hyperplane, V = H^1 − H^2:
$$\langle \Delta H, H^1 - H^2\rangle = \sum_{j=1}^n \sum_{i=1}^n (\Delta x - Hy)_i\, y_j\, (H^1_{ij} - H^2_{ij}) = \sum_{i=1}^n (\Delta x - Hy)_i\, \langle H^1_i - H^2_i, y\rangle = \sum_{i=1}^n (\Delta x - Hy)_i\, (\Delta x_i - \Delta x_i) = 0. \quad \square$$
Let us prove an analogue of Property 3 for (26).
Theorem 3.
Let the vectors y^k, k = l, l + 1, …, l + n − 1, for the sequence of n iterations in (26) be mutually orthogonal. Then, the solution H* to the minimization problem in (29) will be obtained in no more than n steps of the process in (26),
$$H^{k+1} = H_{B2}(H^k, \Delta x^k, y^k), \quad k = l, l + 1, \ldots, l + n - 1, \qquad (34)$$
for an arbitrary matrix H^l,
$$R^{l+n} = R^l\, [W_l^{l+n-1}(y)]^T = 0. \qquad (35)$$
Proof of Theorem 3.
Equality (35) follows from (28), the orthogonality of the vectors y^k, and (18). □
Theorem 4.
Let the vectors y^k, k = 0, 1, …, n − 1, of the process in (34) be given, let the vectors y^k/‖y^k‖ be the columns of a matrix P, and let the minimum eigenvalue μ of the matrix P^T P satisfy the constraint μ ≥ μ_0 > 0. Then, for the convergence rate of the process in (34), the following inequality holds:
$$\|H^n - H^*\|^2 \le \|H^0 - H^*\|^2 \exp\left(-\frac{\mu_0^2}{2n^3}\right). \qquad (36)$$
Proof of Theorem 4.
According to Property 4 and the conditions of the theorem, the rows of the matrices satisfy estimates of the form (23):
$$\|H^n_i - H^*_i\|^2 \le \|H^0_i - H^*_i\|^2 \exp\left(-\frac{\mu_0^2}{2n^3}\right), \quad i = 1, 2, \ldots, n.$$
A similar inequality holds for the sums of the left-hand and right-hand sides. Considering the relation between the norms $\|H^n - H^*\|^2 = \sum_{i=1}^n \|H^n_i - H^*_i\|^2$, we obtain the estimate in (36). □
In the case when the matrix H is symmetric, two products of the matrix H* and the vector y are known:
$$H^* y = \Delta x, \quad y^T H^* = \Delta x^T. \qquad (37)$$
Applying the process in (28) twice for (37), we obtain a new process for updating the matrix residual:
$$R^+ = W(y)\, R\, W(y). \qquad (38)$$
Expanding (38), we obtain the updating formula H^+ = H_G(H, Δx, y) of J. Greenstadt [5,6], where
$$H_G(H, \Delta x, y) = H + \frac{\langle Hy - \Delta x, y\rangle}{\langle y, y\rangle^2}\, y y^T - \frac{y\,(Hy - \Delta x)^T + (Hy - \Delta x)\, y^T}{\langle y, y\rangle}. \qquad (39)$$
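A sketch of the symmetric update (39) in NumPy (the function name is ours); it keeps H symmetric and satisfies both relations in (37):

```python
import numpy as np

def greenstadt_update(H, dx, y):
    """Symmetric Greenstadt update (39): H_new @ y == dx and y @ H_new == dx."""
    u = H @ y - dx
    yy = np.dot(y, y)
    return H + np.dot(u, y) / yy**2 * np.outer(y, y) - (np.outer(y, u) + np.outer(u, y)) / yy
```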
Let us reformulate Properties 1–4 of the matrix updating process (26) for (39).
Theorem 5.
The iteration of (39) is equivalent to a step of minimizing Φ(H) from the point H along the direction ΔH:
$$\Delta H = \frac{\langle Hy - \Delta x, y\rangle}{(y^T y)^2}\, y y^T - \frac{y\,(Hy - \Delta x)^T}{y^T y} - \frac{(Hy - \Delta x)\, y^T}{y^T y}. \qquad (40)$$
At the same time,
$$H^+ y = \Delta x, \quad y^T H^+ = \Delta x^T, \qquad (41)$$
$$\|H^+ - H\| \le \|H_{\Delta x} - H\| \qquad (42)$$
for arbitrary matrices H_{Δx} ∈ R^{n×n} satisfying the condition in (41).
Proof of Theorem 5.
Let us show that at the point H   + , the condition for the minimum of the function in (27) along the direction ΔH is satisfied:
$$\langle \Delta H, \nabla\Phi(H^+)\rangle = \langle \Delta H, H^+ - H^*\rangle = 0. \qquad (43)$$
In (43), let us consider the scalar product for each term of (40) separately. The third term of Expression (40) coincides with (30); the equality to zero of its scalar product was obtained in (33). For the first term, the calculations are similar to (33):
$$\langle \Delta H^1, \nabla\Phi(H^+)\rangle = \frac{\langle Hy - \Delta x, y\rangle}{\langle y, y\rangle^2} \sum_{j=1}^n \sum_{i=1}^n y_i\, y_j\, (H^+_{ij} - H^*_{ij}) = \frac{\langle Hy - \Delta x, y\rangle}{\langle y, y\rangle^2} \sum_{i=1}^n y_i \langle H^+_i - H^*_i, y\rangle = \frac{\langle Hy - \Delta x, y\rangle}{\langle y, y\rangle^2} \sum_{i=1}^n y_i (\Delta x_i - \Delta x_i) = 0.$$
Let us carry out the calculations for the second term using the symmetry of the matrices:
$$\langle y, y\rangle\, \langle \Delta H^2, \nabla\Phi(H^+)\rangle = \sum_{j=1}^n \sum_{i=1}^n y_i\, (Hy - \Delta x)_j\, (H^+_{ij} - H^*_{ij}) = \sum_{j=1}^n (Hy - \Delta x)_j \sum_{i=1}^n y_i\, (H^+_{ij} - H^*_{ij}) = \sum_{j=1}^n (Hy - \Delta x)_j\, \langle H^+_j - H^*_j, y\rangle = \sum_{j=1}^n (Hy - \Delta x)_j\, (\Delta x_j - \Delta x_j) = 0.$$
The proof of (43) is complete. Next, we prove (42) by showing that ΔH is the normal of the hyperplane of matrices satisfying the condition in (41). To do this, we prove that the direction ΔH is orthogonal to an arbitrary vector of this hyperplane, formed as the difference of matrices belonging to the hyperplane, V = H^1 − H^2, that is, ⟨ΔH, H^1 − H^2⟩ = 0. Since the matrices H^1 and H^2 satisfy the condition in (41), the proof is identical to the justification of the equality in (43). □
The following theorem establishes the convergence rate for a series of successive updates (39).
Theorem 6.
Let vectors yk, k = l, l + 1, …, l + n − 1, for the sequence of n iterations of (39) be mutually orthogonal. Then, the solution to the minimization problem in (29) can be obtained in no more than n steps of the process in (39),
$$H^{k+1} = H_G(H^k, \Delta x^k, y^k), \quad k = l, l + 1, \ldots, l + n - 1, \qquad (44)$$
for an arbitrary symmetric matrix H^l:
$$R^{l+n} = W_l^{l+n-1}(y)\, R^l\, [W_l^{l+n-1}(y)]^T = 0. \qquad (45)$$
Proof of Theorem 6.
The update in (45) can be represented as two successive multiplications by W_l^{l+n−1}(y), first from the left and then from the right. For each of these multiplications, the estimate in (35) is valid. □
Theorem 7.
Let the vectors y^k, k = 0, 1, …, n − 1, be given, let the vectors y^k/‖y^k‖ be the columns of a matrix P, and let the minimum eigenvalue μ of the matrix P^T P satisfy the constraint μ ≥ μ_0 > 0. Then, for the convergence rate of the process in (44), the following inequality holds:
$$\|H^n - H^*\|^2 \le \|H^0 - H^*\|^2 \exp\left(-\frac{\mu_0^2}{n^3}\right). \qquad (46)$$
Proof of Theorem 7.
The matrix residual is updated according to the rule
$$R^{l+n} = W_l^{l+n-1}(y)\, R^l\, [W_l^{l+n-1}(y)]^T,$$
which can be represented as two successive multiplications by W_l^{l+n−1}(y), first from the left and then from the right. The estimate in (36) is valid for each of the updates, which proves (46). □

4. Symmetric Positive Definite Metric and Its Analysis

Let the minimized function be the quadratic function in (6). We use the coordinate transformation
$$\hat{x} = Vx. \qquad (47)$$
Let the matrix V satisfy the relation
$$V^T V = \nabla^2 f(x) = A. \qquad (48)$$
In the new coordinate system, the minimized function takes the following form:
$$\hat{f}(\hat{x}) = f(V^{-1}\hat{x}) = f(x). \qquad (49)$$
Quadratic Function (6), considering (49), (47), and (48), takes the following form:
$$\hat{f}(\hat{x}) = \frac{1}{2}(\hat{x} - \hat{x}^*)^T V^{-T} A V^{-1} (\hat{x} - \hat{x}^*) = \frac{1}{2}\langle \hat{x} - \hat{x}^*, \hat{x} - \hat{x}^*\rangle. \qquad (50)$$
Here, x̂* is the minimum point of the function. According to (48) and (50), the matrix of second derivatives is the identity matrix, ∇²f̂(x̂) = I. Let us denote r̂ = x̂ − x̂*. The gradient is
$$\nabla \hat{f}(\hat{x}) = r(\hat{x}) = \hat{r} = \hat{x} - \hat{x}^*. \qquad (51)$$
For the characteristics of functions f ^ ( x ^ ) and f(x), the following relationships are valid:
$$\nabla \hat{f}(\hat{x}) = V^{-T}\nabla f(x), \quad \nabla^2 \hat{f}(\hat{x}) = V^{-T}\nabla^2 f(x)\, V^{-1}, \qquad (52)$$
$$\Delta \hat{x} = \hat{x}^+ - \hat{x} = Vx^+ - Vx = V\Delta x, \qquad (53)$$
$$\hat{y} = \nabla \hat{f}(\hat{x}^+) - \nabla \hat{f}(\hat{x}) = V^{-T}\nabla f(x^+) - V^{-T}\nabla f(x) = V^{-T}y, \qquad (54)$$
where the notation V^{−T} = (V^T)^{−1} is used.
From (53), (54), and the properties of the matrix V (48), the following equality holds:
$$\hat{y} = \Delta\hat{x} \equiv z. \qquad (55)$$
For the symmetric matrix Ĥ, two products of the matrix Ĥ* and the vector ŷ are known:
$$\hat{H}^*\hat{y} = \Delta\hat{x}, \quad \hat{y}^T\hat{H}^* = \Delta\hat{x}^T. \qquad (56)$$
Applying the process in (28) twice to (56), we obtain a new process for updating the matrix residual R̂ = Ĥ − I:
$$\hat{R}^+ = W(\hat{y})\, \hat{R}\, W(\hat{y}) = W(z)\, \hat{R}\, W(z). \qquad (57)$$
Taking into account (55), the update in (39) takes the form
$$\hat{H}^+_{BFGS} = H_G(\hat{H}, \Delta\hat{x}, \hat{y}) = \hat{H} + \frac{\langle \hat{H}z - z, z\rangle}{\langle z, z\rangle^2}\, z z^T - \frac{z\,(\hat{H}z - z)^T + (\hat{H}z - z)\, z^T}{\langle z, z\rangle}. \qquad (58)$$
Let us consider the methods in (1)–(4) in relation to the function f ^ ( x ^ ) in the new coordinate system.
$$\hat{x}^{k+1} = \hat{x}^k + \hat{\beta}_k \hat{s}^k, \quad \hat{s}^k = -\hat{H}^k \nabla\hat{f}(\hat{x}^k), \qquad (59)$$
$$\hat{\beta}_k = \arg\min_{\hat{\beta} \ge 0} \hat{f}(\hat{x}^k + \hat{\beta}\hat{s}^k), \qquad (60)$$
$$\Delta\hat{x}^k = \hat{x}^{k+1} - \hat{x}^k = z^k, \quad \hat{y}^k = \nabla\hat{f}(\hat{x}^{k+1}) - \nabla\hat{f}(\hat{x}^k) = z^k, \qquad (61)$$
$$\hat{H}^{k+1} = H(\hat{H}^k, \Delta\hat{x}^k, \hat{y}^k). \qquad (62)$$
Parameter β̂_k in (59) characterizes the accuracy of the one-dimensional descent. If the matrices are related by
$$\hat{H}^k = VH^kV^T, \quad H^k = V^{-1}\hat{H}^kV^{-T}, \qquad (63)$$
and the initial conditions are
$$\hat{x}^0 = Vx^0, \quad \hat{H}^0 = VH^0V^T, \qquad (64)$$
then these processes generate identical sequences f̂(x̂^k) = f(x^k) and characteristics connected by the relations in (47) and (52)–(54). In this case, the equality β̂_k = β_k holds.
Considering the equality y ^ = Δ x ^ from (55), Equation (58) can be transformed. As a result, we obtain the BFGS equation:
$$H_{BFGS}(\hat{H}, \Delta\hat{x}, \hat{y}) = \hat{H} - \frac{\langle \Delta\hat{x} - \hat{H}\hat{y}, \hat{y}\rangle}{\langle \hat{y}, \Delta\hat{x}\rangle^2}\, \Delta\hat{x}\,\Delta\hat{x}^T + \frac{(\Delta\hat{x} - \hat{H}\hat{y})\,\Delta\hat{x}^T + \Delta\hat{x}\,(\Delta\hat{x} - \hat{H}\hat{y})^T}{\langle \hat{y}, \Delta\hat{x}\rangle}. \qquad (65)$$
Equation (65) satisfies the requirement of (63) and has the same form in various coordinate systems. Similar properties are possessed by the matrix transformation equation H_DFP, which can be represented as a transformed formula of H_BFGS [29,30,31]:
$$H_{DFP}(\hat{H}, \Delta\hat{x}, \hat{y}) = H_{BFGS}(\hat{H}, \Delta\hat{x}, \hat{y}) - vv^T, \quad v = \langle \hat{y}, \hat{H}\hat{y}\rangle^{1/2}\left(\frac{\Delta\hat{x}}{\langle \Delta\hat{x}, \hat{y}\rangle} - \frac{\hat{H}\hat{y}}{\langle \hat{y}, \hat{H}\hat{y}\rangle}\right). \qquad (66)$$
Taking into account (55) and (58), we obtain the following expression in the new coordinate system:
$$\hat{H}_{DFP} = \hat{H}_{BFGS} - \hat{v}\hat{v}^T, \quad \hat{v} = \langle z, \hat{H}z\rangle^{1/2}\left(\frac{z}{\langle z, z\rangle} - \frac{\hat{H}z}{\langle z, \hat{H}z\rangle}\right). \qquad (67)$$
The form of the matrices in (65) and (66) does not change depending on the coordinate system. Consequently, the form of the processes in (1)–(4) and (59)–(62) is completely identical in different coordinate systems when using Formulas (65) and (67). Thus, for further studies of the properties of QNMs on quadratic functions, we can use Equations (58) and (67) in the coordinate system specified by the transformation in (47).
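For reference, a direct transcription of (65) and (66) into NumPy (illustrative function names); `bfgs_update` can be passed as the `update` argument of the `qn_step` sketch above.

```python
import numpy as np

def bfgs_update(H, dx, y):
    """BFGS update of the inverse-Hessian approximation, Equation (65)."""
    w = dx - H @ y
    dy = np.dot(y, dx)
    return (H - np.dot(w, y) / dy**2 * np.outer(dx, dx)
              + (np.outer(w, dx) + np.outer(dx, w)) / dy)

def dfp_update(H, dx, y):
    """DFP update expressed through BFGS and the rank-one term v v^T, Equation (66)."""
    Hy = H @ y
    yHy = np.dot(y, Hy)
    v = np.sqrt(yHy) * (dx / np.dot(dx, y) - Hy / yHy)
    return bfgs_update(H, dx, y) - np.outer(v, v)
```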
Within the iteration of the processes in (59)–(62) for a quadratic function with an identity matrix of second derivatives, the residual can be represented in the form of components
$$\hat{r}^k \equiv r(\hat{x}^k) = \hat{r}^k_z + \hat{r}^k_{z\perp}, \qquad (68)$$
where r̂^k_z is the component along the vector z^k (or, which is the same, along ŝ^k), and r̂^k_{z⊥} is the component orthogonal to z^k. With an inexact one-dimensional descent in (59), the component r̂^k_z decreases but does not disappear completely. For the convenience of theoretical studies, the residual transformation in (68) can in this case be represented by introducing a parameter γ_k ∈ (0, 2) instead of β̂_k, characterizing the degree of descent accuracy:
$$\hat{r}^{k+1} = W(z^k, \gamma_k)\, \hat{r}^k = (1 - \gamma_k)\hat{r}^k_z + \hat{r}^k_{z\perp}, \quad W(z, \gamma) = I - \gamma\frac{zz^T}{z^Tz}, \quad \gamma_k \in (0, 2). \qquad (69)$$
Here, at arbitrary γk ∈ (0, 2), the objective function decreases. With an inexact one-dimensional descent, a certain value γk ∈ (0, 2) will be attained, at which the new value of the function becomes smaller.
The restriction on the one-dimensional search in (59), imposed on γk in (69), ensures a reduction in the objective function
$$\hat{f}(\hat{x}^{k+1}) = \|\hat{r}^{k+1}\|^2/2 = \|W(z^k, \gamma_k)\hat{r}^k\|^2/2 = \left[(1 - \gamma_k)^2\|\hat{r}^k_z\|^2 + \|\hat{r}^k_{z\perp}\|^2\right]/2 < \left[\|\hat{r}^k_z\|^2 + \|\hat{r}^k_{z\perp}\|^2\right]/2 = \hat{f}(\hat{x}^k).$$
As a result of the iterations in (59)–(62) with (65), and according to (57), the matrix residual R̂^k = Ĥ^k − Ĥ* = Ĥ^k − I is transformed according to the rule
$$\hat{R}^{k+1} = W(z^k)\, \hat{R}^k\, W(z^k). \qquad (70)$$
Therefore, in the new coordinate system, one and the same system of vectors z^k is used in the QNM iteration both for minimizing the function and for minimizing the residual functional for matrices (29). With orthogonal vectors z^k and an exact one-dimensional search, the solution r̂^k = 0 will be obtained in no more than n iterations. By virtue of the equality ⟨z^i, z^j⟩ = ⟨AΔx^i, Δx^j⟩, the orthogonality of the vectors z^k in the chosen coordinate system is equivalent to the conjugacy of the vectors Δx^k.
Since the QNM iteration has an identical form in different coordinate systems, we further denote the iteration of the processes (59)–(62) and (1)–(4), considering the accuracy of the one-dimensional descent (introduced in (69) by the parameter γ_k ∈ (0, 2)), by the operator
$$QN(x^k, H^k, x^{k+1}, H^{k+1}, \gamma_k). \qquad (71)$$
To simplify the notation in further studies of quasi-Newton methods on quadratic functions, without a loss of generality, we use an iteration of the method in (71) adjusted to minimize the function
$$f(x) = \frac{1}{2}\langle x - x^*, x - x^*\rangle, \qquad (71a)$$
which allows us, without transforming the coordinate system (47), to use all associated relations for the processes in (59)–(62) with the function in (50) for studying the process in (71), omitting the hats above the variables in the notation.
Let us note some of the properties of the QNM.
Theorem 8.
Let H^k > 0 and the iteration of (71) be carried out with the matrix transformation equations H_BFGS and H_DFP (67). Then, the vector z^k is an eigenvector of the matrices H^{k+1}_BFGS, H^{k+1}_DFP, R^{k+1}_BFGS, and R^{k+1}_DFP:
$$R^{k+1}_{BFGS} z^k = 0, \quad H^{k+1}_{BFGS} z^k = z^k, \qquad (72)$$
$$R^{k+1}_{DFP} z^k = 0, \quad H^{k+1}_{DFP} z^k = z^k. \qquad (73)$$
Proof of Theorem 8.
The first of the equalities in (72) follows from (70). The second of the equalities in (72) follows from this fact and the definition of the matrix residual.
By direct verification, based on (67), we establish that the vectors zk and vk are orthogonal. Therefore, the additional term v vT in Equation (67) does not affect the multiplication of vector zk by a matrix, which together with (72) proves (73). □
As a consequence of Theorem 8, the dimension of the space being minimized is reduced by one in the case of an exact one-dimensional descent, which will be shown below. Section 5 justifies the advantages of the BFGS equation (65) over the DFP equation (66) for matrix transformation.

5. Qualitative Analysis of the Advantages of the BFGS Equation over the DFP Equation

The effectiveness of the learning algorithm is determined by the degree of orthogonality of the learning vectors in the operator factors W_{k−m}^{k}(y). In the new coordinate system, the transformation in (70) is determined by the factors W_{k−m}^{k}(z) in the residual expressions. Therefore, to analyze the orthogonality degree of the system of vectors z, it is necessary to involve the method of their formation. Let us show that the vectors z^k in (69) and (70) generated by the BFGS equation have a higher degree of orthogonality compared to those generated by DFP. To get rid of a large number of indices, consider the iteration of the QNM (71) in the form
$$QN(\hat{f}, \hat{x}, \hat{H}, \hat{x}^+, \hat{H}^+, \gamma). \qquad (74)$$
Theorem 9.
Let Ĥ > 0 and the iteration of (74) be carried out with the matrix updating equations Ĥ_BFGS (58) and Ĥ_DFP (67), and
$$\hat{v} \neq 0. \qquad (75)$$
Then, the following statements are valid.
1. The descent directions for the next iteration are of the form
$$\hat{s}^+_{BFGS} = -\hat{H}^+_{BFGS}\hat{r}^+ = -(1 - \gamma)\hat{r}_z + \langle z, \hat{H}z\rangle^{1/2}\hat{v}, \qquad (76)$$
$$\hat{s}^+_{DFP} = -\hat{H}^+_{DFP}\hat{r}^+ = -(1 - \gamma)\hat{r}_z + q\langle z, \hat{H}z\rangle^{1/2}\hat{v}, \qquad (77)$$
where
$$0 < q = \frac{\langle \hat{H}\hat{r}, \hat{H}\hat{r}\rangle^2}{\langle \hat{r}, \hat{H}\hat{r}\rangle\,\langle \hat{H}\hat{H}\hat{r}, \hat{H}\hat{r}\rangle} < 1. \qquad (78)$$
2. For the cosine of the angle between adjacent descent directions, the following estimate holds:
$$\frac{\langle \hat{s}^+_{BFGS}, z\rangle^2}{\langle z, z\rangle\,\langle \hat{s}^+_{BFGS}, \hat{s}^+_{BFGS}\rangle} \le \frac{\langle \hat{s}^+_{DFP}, z\rangle^2}{\langle z, z\rangle\,\langle \hat{s}^+_{DFP}, \hat{s}^+_{DFP}\rangle}. \qquad (79)$$
3. In the subspace of vectors orthogonal to z, the trace of the matrix Ĥ^+_BFGS does not change,
$$sp_z(\hat{H}^+_{BFGS}) = sp_z(\hat{H}), \qquad (80)$$
and the trace of the matrix Ĥ^+_DFP decreases,
$$sp_z(\hat{H}^+_{DFP}) = sp_z(\hat{H}) - \frac{\langle \hat{v}, \hat{H}z\rangle^2}{\langle \hat{v}, \hat{v}\rangle\,\langle z, \hat{H}z\rangle} < sp_z(\hat{H}). \qquad (81)$$
Proof of Theorem 9.
We represent the residual, similarly to (69), in the following form:
$$\hat{r} = \hat{r}_z + \hat{r}_{z\perp}, \quad \hat{r}_z \neq 0. \qquad (82)$$
After performing the iteration of (74), the residual takes the form
$$\hat{r}^+ = W(z, \gamma)\hat{r} = (1 - \gamma)\hat{r}_z + \hat{r}_{z\perp}. \qquad (83)$$
According to (83), the component r̂_{z⊥} of r̂^+ does not depend on the accuracy of the one-dimensional search. Therefore, we first find the new descent directions in (76) and (77) under the condition of an exact one-dimensional search, that is, with r̂^+ = r̂_{z⊥}.
Considering the gradient expression in (51), the direction of minimization in the iteration of (74) is ŝ = −Ĥr̂. Based on this, considering (55) and the equality ⟨r̂^+, z⟩ = 0, which follows from the condition of exact one-dimensional minimization (60), we obtain
$$\hat{r}^+ = W(z)\hat{r} = \hat{r} + z = \hat{r} - \hat{H}\hat{r}\frac{\langle \hat{r}, \hat{H}\hat{r}\rangle}{\langle \hat{H}\hat{r}, \hat{H}\hat{r}\rangle}. \qquad (84)$$
This implies
$$z = -\hat{H}\hat{r}\frac{\langle \hat{r}, \hat{H}\hat{r}\rangle}{\langle \hat{H}\hat{r}, \hat{H}\hat{r}\rangle}, \qquad (85)$$
$$\hat{H}\hat{r} = -z\frac{\langle \hat{H}\hat{r}, \hat{H}\hat{r}\rangle}{\langle \hat{r}, \hat{H}\hat{r}\rangle} = -z\frac{\langle \hat{H}\hat{r}, z\rangle}{\langle \hat{r}, z\rangle}. \qquad (86)$$
From (84), taking into account the orthogonality of the vectors r̂^+ and z, we obtain the equality
$$\langle \hat{r}, z\rangle = -\langle z, z\rangle. \qquad (87)$$
Let us find the expression Ĥ^+r̂^+ necessary to form the descent direction ŝ^+ = −Ĥ^+r̂^+ in the next iteration. Considering the orthogonality of the vectors r̂^+ and z and using the BFGS matrix transformation formula (58), we obtain
$$\hat{H}^+\hat{r}^+ = \hat{H}\hat{r}^+ + \frac{\langle z - \hat{H}z, \hat{r}^+\rangle}{\langle z, z\rangle}z = \hat{H}\hat{r}^+ - z\frac{\langle \hat{H}z, \hat{r}^+\rangle}{\langle z, z\rangle} = \hat{H}\hat{r} + \hat{H}z - z\frac{\langle \hat{H}z, \hat{r} + z\rangle}{\langle z, z\rangle} = \hat{H}\hat{r} - z\frac{\langle \hat{H}z, \hat{r}\rangle}{\langle z, z\rangle} + \hat{H}z - z\frac{\langle \hat{H}z, z\rangle}{\langle z, z\rangle}. \qquad (88)$$
Transformation of the equality in (86) based on (87) leads to
$$\hat{H}\hat{r} = -z\frac{\langle \hat{H}\hat{r}, z\rangle}{\langle \hat{r}, z\rangle} = z\frac{\langle \hat{H}\hat{r}, z\rangle}{\langle z, z\rangle} = z\frac{\langle \hat{H}z, \hat{r}\rangle}{\langle z, z\rangle}. \qquad (89)$$
Making the replacement (89) in the last expression in (88), we find
$$\hat{H}^+\hat{r}^+ = \hat{H}z - z\frac{\langle \hat{H}z, z\rangle}{\langle z, z\rangle}. \qquad (90)$$
According to (90), the new descent vector can be represented using the expression for v̂ from (67):
$$\hat{s}^+ = -\hat{H}^+\hat{r}^+ = z\frac{\langle \hat{H}z, z\rangle}{\langle z, z\rangle} - \hat{H}z = \langle \hat{H}z, z\rangle^{1/2}\hat{v}. \qquad (91)$$
Since the component r̂_{z⊥} in (83) does not depend on the accuracy of the one-dimensional search, Expression (91) determines its contribution to the direction of descent in (76). Finally, the property in (72), together with the representation of the residual r̂ in (82), proves (76).
The condition in (75), according to (91), prevents the completion of the minimization process: if v̂ = 0, then as a result of exact one-dimensional minimization, we obtain ŝ^+ = −Ĥ^+r̂^+ = ⟨Ĥz, z⟩^{1/2}v̂ = 0, which, taking into account Ĥ > 0, means r̂^+ = 0. As before, using (67), we find a new descent direction for the DFP method, assuming that the one-dimensional search is exact:
$$\hat{s}^+_{DFP} = -\hat{H}^+_{DFP}\hat{r}^+ = -\hat{H}^+_{BFGS}\hat{r}^+ + \hat{v}\langle \hat{v}, \hat{r}^+\rangle = \hat{s}^+_{BFGS} + \hat{v}\langle \hat{v}, \hat{r}^+\rangle. \qquad (92)$$
The last term in (92), taking into account (91) and the orthogonality of the vectors r̂^+ and z, can be represented in the form
$$\hat{v}\langle \hat{v}, \hat{r}^+\rangle = \langle z, \hat{H}z\rangle\left\langle \frac{z}{\langle z, z\rangle} - \frac{\hat{H}z}{\langle z, \hat{H}z\rangle}, \hat{r}^+\right\rangle\left(\frac{z}{\langle z, z\rangle} - \frac{\hat{H}z}{\langle z, \hat{H}z\rangle}\right) = -\frac{\langle \hat{H}z, \hat{r}^+\rangle}{\langle z, \hat{H}z\rangle}\hat{s}^+_{BFGS} = -\frac{\langle \hat{H}z, \hat{r} + z\rangle}{\langle z, \hat{H}z\rangle}\hat{s}^+_{BFGS} = \left(-\frac{\langle \hat{H}z, \hat{r}\rangle}{\langle z, \hat{H}z\rangle} - 1\right)\hat{s}^+_{BFGS}. \qquad (93)$$
Let us transform the scalar value q as follows:
$$q = -\frac{\langle \hat{H}z, \hat{r}\rangle}{\langle \hat{H}z, z\rangle} = -\frac{\langle \hat{H}\hat{H}\hat{r}, \hat{r}\rangle}{\langle \hat{H}\hat{H}\hat{r}, z\rangle} = \frac{\langle \hat{H}\hat{r}, \hat{H}\hat{r}\rangle^2}{\langle \hat{H}\hat{H}\hat{r}, \hat{H}\hat{r}\rangle\,\langle \hat{H}\hat{r}, \hat{r}\rangle}. \qquad (94)$$
Based on (92), together with (93) and (94), we obtain the expression
$$\hat{s}^+_{DFP} = -\hat{H}^+_{DFP}\hat{r}^+ = \hat{s}^+_{BFGS} + (q - 1)\hat{s}^+_{BFGS} = q\,\hat{s}^+_{BFGS}.$$
And finally, the last expression, using the property of (73) together with the representation of the residual, considering the accuracy of the one-dimensional descent (82), proves (77).
Since Ĥ > 0, the left inequality in (78) holds. We prove the right inequality by contradiction. Let us denote by Ĥ^L > 0 (L > 0) the matrix with the eigenvectors of the matrix Ĥ and eigenvalues equal to the corresponding powers of the eigenvalues of Ĥ, given by λ_i(Ĥ^L) = (λ_i(Ĥ))^L, i = 1, 2, …, n. Let u = Ĥ^{1/2}r̂. Then,
$$q = \frac{\langle \hat{H}u, u\rangle^2}{\langle \hat{H}u, \hat{H}u\rangle\,\langle u, u\rangle}.$$
Consequently, if q = 1, then the equality Ĥu = ρu holds. Therefore, u is an eigenvector of the matrix Ĥ, and hence all matrices Ĥ^L also have u as an eigenvector. Due to this fact and the equality u = Ĥ^{1/2}r̂, the vector r̂ is also an eigenvector, and u = Ĥ^{1/2}r̂ = ρ^{1/2}r̂, where ρ is the corresponding eigenvalue of the matrix Ĥ. In this case, considering the representation in (85) of the vector z, the vector v̂, according to its representation in (67), is zero, which cannot be true according to the condition in (75). Therefore, the right inequality in (78) also holds.
Due to the orthogonality of the vectors v̂ and z and according to (76) and (77), the numerators in (79) are the same, and for the denominators, taking into account (78), the inequality ⟨ŝ^+_DFP, ŝ^+_DFP⟩ < ⟨ŝ^+_BFGS, ŝ^+_BFGS⟩ holds, which proves (79). In the case of an exact one-dimensional search, equality holds in (79), since the numerators in (79) are zero.
Let us justify point 3 of the theorem. In accordance with the notation of the equations H_BFGS (58) and H_DFP (67), we introduce an orthogonal coordinate system in which the first two orthonormal vectors are determined by the following equations:
$$e_1 = z/\|z\|, \quad e_2 = p/\|p\|, \quad p = \hat{H}z - z\frac{\langle z, \hat{H}z\rangle}{\langle z, z\rangle}, \qquad (95)$$
where the vectors p and z are orthogonal and v̂ = −⟨z, Ĥz⟩^{−1/2}p. In such a coordinate system, these vectors are defined by
$$z^T = (\|z\|, 0, \ldots, 0), \quad p^T = (0, \|p\|, 0, \ldots, 0). \qquad (96)$$
Let us consider the form of the matrix Ĥ in the selected coordinate system. Let us determine the form of the vector p based on its representation in (95). Taking into account ⟨z, Ĥz⟩/⟨z, z⟩ = Ĥ_{11}, the corresponding components have the form
$$(\hat{H}z)^T = \|z\|(\hat{H}_{11}, \hat{H}_{21}, \hat{H}_{31}, \ldots, \hat{H}_{n1}), \quad z^T\frac{\langle z, \hat{H}z\rangle}{\langle z, z\rangle} = \|z\|(\hat{H}_{11}, 0, \ldots, 0).$$
Hence, p^T = ‖z‖(0, Ĥ_{21}, Ĥ_{31}, …, Ĥ_{n1}). Comparing the last expression with the expression in (96), we conclude that in the chosen coordinate system, the first column Ĥ_1 of the matrix Ĥ has the following form:
$$\hat{H}_1 = (\hat{H}_{11}, \hat{H}_{21}, 0, \ldots, 0)^T. \qquad (97)$$
From (97) and (96), it follows that
$$p^T = \|z\|(0, \hat{H}_{21}, 0, \ldots, 0), \quad \hat{v} = -\langle z, \hat{H}z\rangle^{-1/2}p = -\left(0, \hat{H}_{21}/\hat{H}_{11}^{1/2}, 0, \ldots, 0\right)^T, \qquad (98)$$
and the original matrix will have the form
$$\hat{H} = \begin{pmatrix} \hat{H}_{11} & \hat{H}_{12} & 0 & \cdots & 0 \\ \hat{H}_{21} & \hat{H}_{22} & \hat{H}_{23} & \cdots & \hat{H}_{2n} \\ 0 & \hat{H}_{32} & \hat{H}_{33} & \cdots & \hat{H}_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \hat{H}_{n2} & \hat{H}_{n3} & \cdots & \hat{H}_{nn} \end{pmatrix}. \qquad (99)$$
When correcting matrices with formulas BFGS (58) and DFP (67), changes will occur only in the space of the first two variables, determined by the unit vectors in (95). As a result of the BFGS transformation in (58), we obtain the following two-dimensional matrix:
$$\hat{H}^+_{2\times2\,BFGS} = \begin{pmatrix} \hat{H}_{11} & \hat{H}_{12} \\ \hat{H}_{12} & \hat{H}_{22} \end{pmatrix} + \begin{pmatrix} \hat{H}_{11} - 1 & 0 \\ 0 & 0 \end{pmatrix} - \begin{pmatrix} \hat{H}_{11} - 1 & \hat{H}_{12} \\ 0 & 0 \end{pmatrix} - \begin{pmatrix} \hat{H}_{11} - 1 & 0 \\ \hat{H}_{12} & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & \hat{H}_{22} \end{pmatrix}. \qquad (100)$$
Based on the relationship between the matrices expressed in (67), using (98), we obtain the result of the transformation according to the DFP equation in (67):
$$\hat{H}^+_{2\times2\,DFP} = \hat{H}^+_{2\times2\,BFGS} - \hat{v}\hat{v}^T = \begin{pmatrix} 1 & 0 \\ 0 & \hat{H}_{22} - \hat{H}_{12}^2/\hat{H}_{11} \end{pmatrix}. \qquad (101)$$
Thus, the resulting two-dimensional matrices have the following form:
$$\hat{H}^+_{2\times2\,BFGS} = \begin{pmatrix} 1 & 0 \\ 0 & \hat{H}_{22} \end{pmatrix}, \quad \hat{H}^+_{2\times2\,DFP} = \begin{pmatrix} 1 & 0 \\ 0 & \hat{H}_{22} - \hat{H}_{12}^2/\hat{H}_{11} \end{pmatrix}. \qquad (102)$$
The corresponding complete matrices are presented below:
$$\hat{H}^+_{BFGS} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & \hat{H}_{22} & \hat{H}_{23} & \cdots & \hat{H}_{2n} \\ 0 & \hat{H}_{32} & \hat{H}_{33} & \cdots & \hat{H}_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \hat{H}_{n2} & \hat{H}_{n3} & \cdots & \hat{H}_{nn} \end{pmatrix}, \qquad (103)$$
$$\hat{H}^+_{DFP} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & \hat{H}_{22} - \hat{H}_{12}^2/\hat{H}_{11} & \hat{H}_{23} & \cdots & \hat{H}_{2n} \\ 0 & \hat{H}_{32} & \hat{H}_{33} & \cdots & \hat{H}_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \hat{H}_{n2} & \hat{H}_{n3} & \cdots & \hat{H}_{nn} \end{pmatrix}. \qquad (104)$$
Due to the condition in (75), it follows from Expression (98) for v̂ that Ĥ_{21} ≠ 0. Consequently, the trace of the matrix Ĥ^+_DFP, according to (102) and (104), decreases by Ĥ_{12}²/Ĥ_{11}. The last expression can be transformed considering the definition of the coordinate system in (96). As a result, we obtain (81). From (103), we obtain (80). □
Regarding the results of Theorem 9, we can draw the following conclusions.
  • With an inexact one-dimensional descent in the DFP method, the successive descent directions are less orthogonal than in the BFGS method (79).
  • The trace of matrix H ^ in the DFP method in the unexplored space decreases (81). This makes it difficult to enter a new subspace during subsequent minimization. Moreover, in the case of an exact one-dimensional descent, in the next step, this decrease is restored; however, a new one appears.
  • Theorem 9 also shows that in the case of an exact one-dimensional search, the minimization space on quadratic functions is reduced by one.
Due to the limited computational accuracy on ill-conditioned problems (i.e., problems with a high condition number), the noted effects can significantly worsen the convergence of the DFP method.
In conjugate gradient methods [39], if the accuracy of the one-dimensional descent is violated, the sequence of vectors ceases to be conjugated. In QNMs, due to the reduction in the minimization subspace by one during exact one-dimensional descent, the effect of reducing the minimization space accumulates. In Section 6, we look at methods for replenishing the space excluded from the minimization process.

6. Methods for Reducing the Minimization Space of Quasi-Newton Methods on Quadratic Functions

We will assume that the quadratic function has the form expressed in (71a):
$$f(x) = \frac{1}{2}\langle x - x^*, x - x^*\rangle.$$
For the matrices H^{k+1} and R^{k+1} obtained using the iteration of (71), QN(x^k, H^k, x^{k+1}, H^{k+1}, γ_k), the relations in (72) and (73) hold:
$$R^{k+1}z^k = 0, \quad H^{k+1}z^k = z^k.$$
The vector z^k is an eigenvector of the matrices H^{k+1} and R^{k+1} with eigenvalues one and zero, respectively. Let us consider ways to increase the dimension of the subspace in which the quasi-Newton relations hold.
Let us denote by H ∈ I_m a matrix H > 0 that has m eigenvectors with unit eigenvalues; the corresponding matrix R = H − I, with the same eigenvectors and zero eigenvalues, will be denoted by R ∈ O_m. Let us denote by Q_m the subspace of dimension m spanned by the system of eigenvectors with unit eigenvalues of the matrix H ∈ I_m, and its complement by D_m = R^n \ Q_m.
An arbitrary orthonormal system of m vectors e_1, …, e_m of the subspace Q_m is a system of eigenvectors of the matrices H ∈ I_m and R ∈ O_m:
$$He_i = e_i, \quad Re_i = 0, \quad i = 1, \ldots, m.$$
It follows that an arbitrary vector, which is a linear combination of vectors ei, will satisfy the quasi-Newton relations.
Lemma 1.
Consider a matrix H ∈ I_m and the vectors
$$r = r_Q + r_D, \quad r_Q \in Q_m, \quad r_D \in D_m. \qquad (107)$$
Then,
$$Hr = Hr_Q + Hr_D, \quad Hr_Q = r_Q \in Q_m, \quad Hr_D \in D_m. \qquad (108)$$
Proof of Lemma 1.
The system of m eigenvectors of the matrix H ∈ I_m is contained in the set Q_m. Due to the orthogonality of the eigenvectors, the remaining eigenvectors of the matrix H ∈ I_m are contained in the set D_m. Therefore, multiplying the vectors in (107) by the matrix H, as in (108), does not take them beyond their subspaces. In this case, for the vector r_Q, the equality Hr_Q = r_Q ∈ Q_m holds, which follows from the definition of the subspace Q_m. □
Lemma 2.
Let H^k > 0, H^k ∈ I_m, m < n, r_Q^k = 0, r_D^k ≠ 0, and let the iteration QN(x^k, H^k, x^{k+1}, H^{k+1}, γ_k) be completed. Then,
$$\text{if } \gamma_k = 1, \text{ then } H^{k+1} \in I_{m+1} \text{ and } r_Q^{k+1} = 0; \qquad (109)$$
$$\text{if } \gamma_k \neq 1, \text{ then } H^{k+1} \in I_{m+1} \text{ and } r_Q^{k+1} \neq 0. \qquad (110)$$
Proof of Lemma 2.
The descent direction, taking into account (51), has the form s^k = −H^k∇f(x^k) = −H^kr^k = −H^kr_D^k. Based on Lemma 1, it follows that H^kr_D^k ∈ D_m. As follows from Theorem 8, a new eigenvector expressed in (72) and (73) with a unit eigenvalue appears in the subspace D_m, regardless of the accuracy of the one-dimensional descent, which proves (109), taking into account the accuracy of the one-dimensional search. With an inexact descent, part of the residual remains along the vector z^k, which proves (110). □
Lemma 3.
Let H^k > 0, H^k ∈ I_m, m ≤ n, r_Q^k ≠ 0, r_D^k ≠ 0, and let the iteration QN(x^k, H^k, x^{k+1}, H^{k+1}, γ_k) be completed. Then, it follows that
$$\text{if } \gamma_k = 1, \text{ then } H^{k+1} \in I_m \text{ and } r_Q^{k+1} = 0; \qquad (111)$$
$$\text{if } \gamma_k \neq 1, \text{ then } H^{k+1} \in I_m \text{ and } r_Q^{k+1} \neq 0. \qquad (112)$$
Proof of Lemma 3.
Since r_Q^k ≠ 0, we choose an orthogonal system of eigenvectors in Q_m in which one of the eigenvectors is directed along r_Q^k. From the remaining eigenvectors, we form a subspace Q_{m−1} in which there is no residual. Applying to Q_{m−1} the results of Lemma 2 under the condition H^k ∈ I_{m−1}, we obtain (111) and (112). □
By alternating operations with an exact and inexact one-dimensional descent, it is possible to obtain finite convergence on quadratic functions of QNMs.
Theorem 10.
Let H^k > 0, H^k ∈ I_m, r_Q^k ≠ 0, m < n − 1, and let the following iterations be completed:
$$QN(x^k, H^k, x^{k+1}, H^{k+1}, \gamma_k), \quad \gamma_k = 1, \qquad (113)$$
$$QN(x^{k+1}, H^{k+1}, x^{k+2}, H^{k+2}, \gamma_{k+1}), \quad \gamma_{k+1} \neq 1. \qquad (114)$$
Then,
$$H^{k+2} \in I_{m+1}, \quad r_Q^{k+2} \neq 0. \qquad (115)$$
Proof of Theorem 10.
For the iteration of (113), we apply the result of Lemma 3 (111), and for the iteration of (114), we apply the result of Lemma 2 (110). As a result, we obtain (115). □
Theorem 10 says that individual iterations with an exact one-dimensional descent make it possible to increase by one the dimension of the space where the quasi-Newton relation is satisfied. This means that after a finite number of such iterations, the matrix Hk = I will be obtained.
Let us consider another way of increasing the dimension of the subspace where the quasi-Newton relation holds. It consists of performing, after an iteration of the QNM, an additional descent iteration along the orthogonal vector v^k defined in (67), which, according to (91), in the case of an exact one-dimensional descent coincides, up to a scalar factor, with the descent direction s^{k+1} = ⟨H^kz^k, z^k⟩^{1/2}v^k of the BFGS method:
$$QN(x^k, H^k, x^{k+1/2}, H^{k+1/2}, \gamma_k), \quad \gamma_k \in (0, 2), \qquad (116)$$
$$x^{k+1} = x^{k+1/2} + \beta_{k+1/2}v^k, \quad \gamma_{k+1/2} \in (0, 2), \qquad (117)$$
$$v^k = \langle z^k, H^kz^k\rangle^{1/2}\left(\frac{z^k}{\langle z^k, z^k\rangle} - \frac{H^kz^k}{\langle z^k, H^kz^k\rangle}\right), \qquad (118)$$
$$H^{k+1} = H(H^{k+1/2}, \Delta x^{k+1/2}, y^{k+1/2}). \qquad (119)$$
Let us denote the iterations in (116)–(119) by
$$VQN(x^k, H^k, x^{k+1}, H^{k+1}, \gamma_k, \gamma_{k+1/2}), \quad \gamma_k \in (0, 2), \quad \gamma_{k+1/2} \in (0, 2). \qquad (120)$$
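A sketch of the composite iteration (116)–(119): an ordinary quasi-Newton step followed by an additional descent along the vector v^k from (118) and a second metric update. The helper names (`qn_step`, `bfgs_update`, `backtracking`) refer to the illustrative sketches given earlier, and the sketch assumes that v^k is a descent direction at the intermediate point, as holds under exact one-dimensional descent on quadratics.

```python
import numpy as np

def vqn_step(f, x, H):
    """One VQN iteration (120): a QN step (116), then descent along v^k, (117)-(119)."""
    x_half, H_half = qn_step(f, x, H, bfgs_update)     # step (116)
    z = x_half - x                                      # realized step Delta x^k
    Hz = H @ z
    zHz = np.dot(z, Hz)
    v = np.sqrt(zHz) * (z / np.dot(z, z) - Hz / zHz)    # orthogonal direction v^k, (118)
    beta = backtracking(f, x_half, v)                   # additional descent (117)
    x_new = x_half + beta * v
    _, g_half = f(x_half)
    _, g_new = f(x_new)
    H_new = bfgs_update(H_half, x_new - x_half, g_new - g_half)  # update (119)
    return x_new, H_new
```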
Lemma 4.
Let H^k > 0, H^k ∈ I_m, r_Q^k ≠ 0, r_D^k ≠ 0, m ≤ n − 1, and let the iteration of (120) be completed. Then,
$$H^{k+1} \in I_{m+1}, \quad r_Q^{k+1} \neq 0. \qquad (121)$$
Proof of Lemma 4.
For the iteration of (116), as in the proof of Lemma 3, since r_Q^k ≠ 0, we choose an orthogonal system of eigenvectors in Q_m in which one of the eigenvectors is directed along r_Q^k. From the remaining eigenvectors, we form a subspace Q_{m−1} in which there is no residual and for which H^k ∈ I_{m−1} holds. As a result of (116), according to the results of Theorem 8, an eigenvector z^k ∉ Q_{m−1} is formed. It is derived from the vector s^k = −H^kr^k ∉ Q_{m−1}, which, being the product of the matrix H^k ∈ I_{m−1} and a residual not belonging to Q_{m−1}, does not, according to the results of Lemma 1, belong to the subspace Q_{m−1}. For this reason, the vector v^k ∉ Q_{m−1} obtained by Formula (118), which is orthogonal to z^k, becomes, as a result of (117)–(119), an eigenvector of the matrix H^{k+1}. Thus, the subspace Q_{m−1} is replenished with two eigenvectors of the matrix H^{k+1}, resulting in (121). □
Theorem 11.
To obtain H^k ∈ I_n, it is sufficient to perform the iteration of (120) (n − 1) times.
Proof of Theorem 11.
In the first iteration of (120), we obtain H^{k+1} ∈ I_2. In the next (n − 2) iterations of (120), according to the results of Lemma 4, we obtain H^{k+n−1} ∈ I_n. □
The results of Theorem 11 and Lemma 5 indicate the possibility of using techniques for increasing the dimension of the subspace of quasi-Newton relations’ execution at arbitrary moments, which enables us, as will be shown below, to develop QNMs that are resistant to the inaccuracies of a one-dimensional search.
In summary, the following conclusions can be drawn about properties of QNMs on quadratic functions without the condition of an exact one-dimensional descent.
  • The dimension of the minimization subspace decreases as the dimension of the subspace of fulfillment of the quasi-Newton relation increases (Lemma 2).
  • The dimension of the subspace of fulfillment of the quasi-Newton relation does not decrease during the execution of the QNM (Lemmas 2–5).
  • Individual iterations with an exact one-dimensional descent increase the dimension of the subspace of the quasi-Newton relation (Lemma 4).
  • Separate inclusions of iterations with the transformation of matrices for pairs of conjugate vectors increase the dimension of the subspace of the quasi-Newton relation (Lemma 5).
  • It is sufficient to perform at most (n − 1) inclusions of an exact one-dimensional descent (113) at arbitrary iterations to solve the problem of minimizing a quadratic function in a finite number of steps in the QNM (Lemma 4 and Theorem 10).
  • To solve the problem of minimizing a quadratic function in a finite number of steps in the QNM, it is sufficient to perform in arbitrary iterations no more than (n − 1) inclusions of matrix transformations for pairs of descent vectors obtained as a result of the transformations in (118) and (119) (Lemma 5 and Theorem 11).

7. Methods for Increasing the Orthogonality of Learning Vectors in Quasi-Newton Methods

The term “degree of orthogonality” refers to functions of the form (71a). For functions of the form (6), this term means the degree of conjugacy of the vectors. Several conclusions can be drawn from our considerations.
Firstly, it is preferable to use the BFGS method. With imprecise one-dimensional descent in the DFP method, successive descent directions are less orthogonal than in the BFGS method (79).
Secondly, it makes sense to increase the degree of accuracy of the one-dimensional search, since individual iterations with an exact one-dimensional descent increase the dimension of the subspace of the quasi-Newton relation (Theorem 10), which reduces the dimension of the minimum search region.
Thirdly, separate inclusions of iterations with matrix transformation for pairs of conjugate vectors increase the dimension of the subspace of the quasi-Newton relation (Lemma 4). This requires applying a sequence of descent iterations for pairs of conjugate vectors (120).
On the other hand, it is important to correctly select the scaling factor ω of the initial matrix H0 = ωI from (1) in the QNM. Let us consider an example of a function of the form expressed in (6):
$$f(x) = \frac{1}{2}\sum_{i=1}^n x_i^2 / i. \qquad (122)$$
The eigenvalues of the matrix of second derivatives A and of its inverse A^{−1} are λ_i = 1/i and λ_i^{−1} = i, respectively. The gradient of the quadratic function in (122) has components (∇f(x))_i = x_i/i, i = 1, …, n. In the first stages of the search with H^0 = I, the gradients ∇f(x) = A(x − x*) and the gradient differences are dominated by the components of eigenvectors with large eigenvalues of the matrix A and, accordingly, small eigenvalues of the matrix A^{−1} = H. Let us calculate an approximation of the eigenvalues for scaling the initial matrix using the data from (3) of the first iteration of the methods in (1)–(4):
$$\lambda_{\min}(H) \le \omega = \frac{\langle \Delta x^0, \Delta x^0\rangle}{\langle y^0, \Delta x^0\rangle} = \frac{\langle A^{-1}y^0, A^{-1}y^0\rangle}{\langle y^0, A^{-1}y^0\rangle} \le \lambda_{\max}(H), \qquad (123)$$
where λ_min(H), λ_max(H) are the minimum and maximum eigenvalues of the matrix A^{−1} = H, respectively. To scale the initial matrix H^0, consider the following:
$$H^0 = K\omega I = K\frac{\langle \Delta x^0, \Delta x^0\rangle}{\langle y^0, \Delta x^0\rangle}I, \quad K \ge 1. \qquad (124)$$
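The scaling in (123)–(124) needs only the first step–gradient pair; a minimal sketch (the function name is ours):

```python
import numpy as np

def scaled_initial_matrix(dx0, y0, K=1.0):
    """Initial metric H^0 = K * omega * I with omega estimated as in (123)-(124)."""
    omega = np.dot(dx0, dx0) / np.dot(y0, dx0)
    return K * omega * np.eye(len(dx0))
```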
Let us qualitatively investigate the operation of the quasi-Newton BFGS method (71). Taking into account the predominance of eigenvectors with large eigenvalues of the matrix A and, accordingly, small eigenvalues of the matrix A^{−1} = H, it is possible to qualitatively display the picture of the reconstruction of the spectrum of the matrix A^{−1} for different values of K, under the rough assumption that the small eigenvalues are restored sequentially. A rough diagram of the process of reconstructing the spectrum of matrix eigenvalues is shown in Figure 2.
One of the components of increasing the degree of orthogonality of learning vectors in QNMs is the normalization of the initial metric matrix (124). In Section 8, we will consider the impact of the methods noted in this section on increasing the efficiency of QNMs.
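The qualitative picture in Figure 2 can be reproduced numerically. The sketch below is our own illustration rather than the code used in the experiments: it runs a few exact-descent BFGS steps on the quadratic function (122), starting from H0 = KωI, and prints the eigenvalues of Hk so that the reconstructed part of the spectrum of A−1 = diag(1, 2, …, n) can be compared for different K.

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H with the pair (s, y)."""
    rho = 1.0 / (s @ y)
    V = np.eye(H.shape[0]) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

n = 6
A = np.diag(1.0 / np.arange(1, n + 1))       # Hessian of f(x) = (1/2) sum_i x_i^2 / i

for K in (1.0, 100.0):
    x = np.ones(n)
    g = A @ x
    # preliminary steepest-descent step to obtain (dx0, y0) and omega from (123)
    p = -g
    alpha = (g @ g) / (p @ A @ p)
    dx0, y0 = alpha * p, A @ (alpha * p)
    omega = (dx0 @ dx0) / (y0 @ dx0)
    H = K * omega * np.eye(n)                # initial scaling (124)
    x, g = x + dx0, g + y0
    for _ in range(3):                       # a few exact-descent BFGS iterations
        p = -H @ g
        alpha = -(g @ p) / (p @ A @ p)       # exact step for the quadratic
        s = alpha * p
        y = A @ s
        H = bfgs_update(H, s, y)
        x, g = x + s, g + y
    print(K, np.sort(np.linalg.eigvalsh(H))) # compare with eigenvalues 1, ..., n of A^{-1}
```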

8. Numerical Study of Ways to Increase the Orthogonality of Learning Vectors in Quasi-Newton Methods

We implemented and compared the quasi-Newton BFGS and DFP methods. A one-dimensional search procedure with cubic interpolation [41] (exact one-dimensional descent) and a one-dimensional minimization procedure from [34] (inexact one-dimensional descent) were used. We used both the classical QNM with the iterations in (1)–(4) (denoted as BFGS and DFP) and the QNM including iterations with additional orthogonalization (116)–(119) in the form of a sequence of iterations (120) (denoted as BFGS_V and DFP_V). The experiments were carried out with different coefficients K in the initial normalization (124) of the metric matrices of the QNM.
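For reference, a schematic Python implementation of the classical quasi-Newton loop (1)–(4) with the standard BFGS update of the inverse Hessian is given below. The orthogonalization iterations (116)–(120) of the _V variants are not shown, the line search is abstracted as a callable, and the helper names (bfgs_minimize, line_search) are ours, not the code used in the experiments.

```python
import numpy as np

def bfgs_minimize(f, grad, x0, H0, line_search, f_star=0.0, eps=1e-10, max_iter=40_000):
    """Classical quasi-Newton descent with the BFGS update of the inverse Hessian."""
    x, H = np.asarray(x0, dtype=float), np.asarray(H0, dtype=float)
    g = grad(x)
    for _ in range(max_iter):
        if f(x) - f_star <= eps:                 # stopping criterion used in the tests
            return x
        p = -H @ g                               # descent direction
        alpha = line_search(f, grad, x, p)       # exact or inexact 1-D descent
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g              # learning pair (step, gradient difference)
        sy = s @ y
        if sy > 1e-16:                           # curvature safeguard
            rho = 1.0 / sy
            V = np.eye(x.size) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)   # BFGS inverse-Hessian update
        x, g = x_new, g_new
    return x
```

For a DFP variant, only the update line changes, to H = H − (H y)(H y)ᵀ/(yᵀ H y) + s sᵀ/(sᵀ y).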
Since the use of quasi-Newton methods is justified primarily for functions with a high degree of ill-conditioning, on which conjugate gradient methods do not work efficiently, the test functions were selected according to this principle. Since the QNM is based on a quadratic model of the function, its local convergence rate in a neighborhood of the current minimum is largely determined by how efficiently it minimizes ill-conditioned quadratic functions. The test functions are as follows:
(1) $f_1(x) = \sum_{i=1}^{n} x_i^2 i^6$, $x^0 = (10/1, 10/2, \ldots, 10/n)$.
The optimal value and minimum point are $f_1^* = 0$ and $x^* = (0, 0, \ldots, 0)$. The condition number of the matrix of second derivatives is $\mathrm{cond}(\nabla^2 f_1(x)) = \lambda_{\max}/\lambda_{\min} = n^6$. When n = 1000, the condition number is $1000^6 = 10^{18}$.
(2) $f_2(x) = \sum_{i=1}^{n} x_i^2 (n/i)^6$, $x^0 = (10, 10, \ldots, 10)$.
The optimal value and minimum point are $f_2^* = 0$ and $x^* = (0, 0, \ldots, 0)$. The condition number of the matrix of second derivatives is $\mathrm{cond}(\nabla^2 f_2(x)) = \lambda_{\max}/\lambda_{\min} = n^6$. When n = 1000, the condition number is $1000^6 = 10^{18}$.
(3) $f_3(x) = \left(\sum_{i=1}^{n} x_i^2 i\right)^r$, $x^0 = (1, 1, \ldots, 1)$, $r = 2$.
The optimal value and minimum point are $f_3^* = 0$ and $x^* = (0, 0, \ldots, 0)$. The function f3 is built on a quadratic function whose matrix of second derivatives has condition number $\mathrm{cond} = \lambda_{\max}/\lambda_{\min} = n$; for n = 1000, the condition number is 1000. The topology of the level surfaces of f3 is identical to that of the underlying quadratic function. The matrix of second derivatives of f3 tends to zero as the minimum is approached; consequently, its inverse tends to infinity. The approximation pattern for the matrix of second derivatives in the QNM then corresponds to K = 1 in Figure 2. This case makes it difficult to enter a new subspace, because the eigenvalues of the metric matrix in the already explored part of the subspace significantly exceed those in the unexplored part.
(4) $f_4(x) = \sum_{i=1}^{n/2} \left[ 10^8 \cdot (x_{2i-1}^2 - x_{2i})^2 + (x_{2i-1} - 1)^2 \right]$, $x^0 = (-1.2, 1, -1.2, 1, \ldots, -1.2, 1)$.
The optimal value and minimum point of this rescaled multidimensional Rosenbrock function [42] are $f_4^* = 0$ and $x^* = (1, 1, \ldots, 1)$. The function has a curved ravine with small values of the second derivative along the bottom of the ravine and large values of the second derivative in the direction normal to the bottom; the ratio of the second derivatives along these directions is approximately $10^8$.
The stopping criterion is
$f(x_k) - f^* \le \varepsilon = 10^{-10}$.
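A compact sketch of the test functions $f_1$–$f_4$, their gradients, starting points, and the stopping rule as given above; the vectorized NumPy formulation and the helper names (e.g., make_quadratic) are ours, not the code used in the experiments.

```python
import numpy as np

def make_quadratic(w):
    """f(x) = sum_i w_i * x_i^2 with gradient 2 * w * x."""
    w = np.asarray(w, dtype=float)
    return (lambda x: float(np.sum(w * x**2)),
            lambda x: 2.0 * w * x)

n = 1000
i = np.arange(1, n + 1, dtype=float)

f1, grad_f1 = make_quadratic(i**6)              # cond = n^6
f2, grad_f2 = make_quadratic((n / i)**6)        # cond = n^6
q, grad_q = make_quadratic(i)                   # base quadratic of f3

def f3(x, r=2):
    return q(x) ** r

def grad_f3(x, r=2):
    return r * q(x) ** (r - 1) * grad_q(x)

def f4(x):                                      # rescaled Rosenbrock, n even
    x_odd, x_even = x[0::2], x[1::2]            # x_{2i-1}, x_{2i}
    return float(np.sum(1e8 * (x_odd**2 - x_even)**2 + (x_odd - 1.0)**2))

def grad_f4(x):
    g = np.zeros_like(x)
    x_odd, x_even = x[0::2], x[1::2]
    g[0::2] = 4e8 * (x_odd**2 - x_even) * x_odd + 2.0 * (x_odd - 1.0)
    g[1::2] = -2e8 * (x_odd**2 - x_even)
    return g

x0_f1 = 10.0 / i
x0_f2 = 10.0 * np.ones(n)
x0_f3 = np.ones(n)
x0_f4 = np.tile([-1.2, 1.0], n // 2)

def stop(f_val, f_star=0.0, eps=1e-10):
    """Stopping criterion f(x_k) - f* <= eps."""
    return f_val - f_star <= eps
```

With these definitions, a run from Table 1 corresponds, for example, to calling a quasi-Newton routine such as the bfgs_minimize sketch above with the initial matrix scaled according to (124) at K = 1.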
The results of minimizing these functions for n = 1000 are given in Table 1 and Table 2. A problem was considered solved if, within the allotted number of iterations and function and gradient evaluations, the method reached a function value satisfying the stopping criterion. For each function, the first row of a cell gives the number of iterations (one-dimensional searches along a direction), and the row below it gives the number of calls to the procedure that evaluates the function and the gradient simultaneously. The number of iterations in all tests was limited to 40,000; if a method exceeded this limit, it was stopped and considered to have failed. A dash indicates the cases where a solution could not be obtained. In these cases, the methods stalled (cycled) because the minimization steps became too small and, as a consequence, the errors in the gradient differences used to update the metric matrices became large.
Let us consider the effects that reduce the convergence rate of the method. For the function f3, the matrix of second derivatives tends to zero as the minimum is approached, and hence the inverse matrix tends to infinity; the approximation pattern of the matrix of second derivatives in the QNM corresponds to K = 1 in Figure 2. In the explored part of the subspace, the metric matrix of the QNM grows, so even a slight residual in this part of the subspace is greatly amplified, while in the unexplored part of the space the eigenvalues remain fixed. This makes it difficult to enter a new subspace, since the eigenvalues of the metric matrix in the explored part of the subspace significantly exceed those in the unexplored part. To enter the unexplored part of the subspace, the residual in the explored part must first be eliminated. As a consequence, when minimizing functions with a high degree of ill-conditioning, the search steps become smaller, the errors in the gradient differences increase, and the minimization method stalls.
For exact descent, there are practically no differences between the BFGS and BFGS_V methods. With exact descent, successive descent vectors for quadratic functions are conjugate, and matrix learning, considered in a coordinate system in which the matrix of second derivatives is the identity, is carried out along an orthogonal system of vectors. Minor errors violate this orthogonality, which affects the DFP method.
For inexact descent, the BFGS_V method significantly outperforms the BFGS method. The DFP and DFP_V methods are practically ineffective on these tests, although the DFP_V method shows better results.
Thus, in the presence of one-dimensional search errors, the BFGS_V algorithm is significantly more effective than the BFGS method, while the DFP method is practically inapplicable to highly ill-conditioned problems.
Table 2 shows the experimental data with normalization of the matrix (124) at K > 1. For the functions f3(x) and f4(x), the coefficient K had to be reduced (to K = 100, marked with an asterisk) to obtain a more effective result.
As follows from the results in Table 1 and Table 2, the initial normalization of the metric matrices significantly improves the convergence of QNMs. The situation corresponds to the case K > 1 in Figure 2: large eigenvalues in the unexplored part of the subspace make it easy to find new conjugate directions and to train the metric matrices efficiently with almost orthogonal training vectors.
For exact descent, there are practically no differences between the BFGS and BFGS_V methods. For inexact descent, the BFGS_V method significantly outperforms the BFGS method. The DFP and DFP_V methods are efficient for the functions f1(x)–f3(x), and for inexact descent the DFP_V method significantly outperforms the DFP method.
Thus, in the case of one-dimensional search errors, the BFGS_V algorithm is significantly more efficient than the BFGS method, and a correct initial normalization of the metric matrices can significantly increase the convergence rate of the method.
To give a visual demonstration of the methods, we minimize the following two-dimensional function:
$f_5(x) = (x_1^2 + 100 x_2^2)^2, \quad x^0 = (1, 1)$.
To test the idea that orthogonalization increases the performance of the quasi-Newton method, the minimization conditions were deliberately worsened: the initial matrix was normalized with K = 0.000001, which should significantly complicate the solution of the problem and reveal the advantage in the degree of orthogonality of the learning vectors of the BFGS and BFGS_V methods over the DFP method.
The stopping criterion was
$f(x_k) - f^* \le \varepsilon = 10^{-2}$.
The results are shown in Table 3: the first row gives the number of iterations, and the second row gives the minimal function value achieved.
The paths of the three considered algorithms are shown in Figure 3.
These results confirm the theoretical conclusions on the influence of the degree of orthogonality of the matrix learning vectors on the convergence rate of the method. The BFGS_V method performs forced orthogonalization, which improves on the result of the BFGS method. The trajectories of the methods are listed in Table A1, Table A2 and Table A3 of Appendix A (the trajectory of the DFP method is shown only partially).

9. Conclusions

This paper presents methods for updating metric matrices in quasi-Newton methods based on gradient learning algorithms. As a result, the system of learning steps can be represented as an algorithm for minimizing a certain objective function along a system of directions, and conclusions about the convergence rate of the learning process can be drawn from the properties of this system of directions. The main conclusion is that the convergence rate depends directly on the degree of orthogonality of the learning vectors.
Based on the study of the learning algorithms in the DFP and BFGS methods, it is possible to show that the degree of orthogonality of the learning vectors in the BFGS method is higher than in the DFP method. This means that, because of noise and the inaccuracies of the one-dimensional descent, entering the unexplored region of the minimization space is more difficult in the DFP method than in the BFGS method, which explains why the BFGS updating formula gives the best results.
Studies on quadratic functions have revealed that the dimension of the minimization space is reduced when iterations with an exact one-dimensional descent, or iterations with additional orthogonalization, are included in the quasi-Newton method. It is also shown that the orthogonality of the learning vectors, and thereby the convergence rate of the method, can be increased through a special normalization of the initial metric matrix. The theoretically predicted effects of increasing the efficiency of quasi-Newton methods were confirmed by a computational experiment on complex ill-conditioned minimization problems. In future work, we plan to study minimization methods under the conditions of a linear background that adversely affects convergence.

Author Contributions

Conceptualization, V.K. and E.T.; methodology, V.K., E.T. and P.S.; software, V.K.; validation, L.K., E.T., P.S. and D.K.; formal analysis, P.S., E.T. and D.K.; investigation, E.T.; resources, L.K.; data curation, P.S. and D.K.; writing—original draft preparation, V.K.; writing—review and editing, E.T., P.S. and L.K.; visualization, V.K.; supervision, V.K. and L.K.; project administration, L.K. and D.K.; funding acquisition, D.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Higher Education of the Russian Federation (Grant No. 075-15-2022-1121). Predrag Stanimirović is supported by the Science Fund of the Republic of Serbia (No. 7750185, Quantitative Automata Models: Fundamental Problems and Applications—QUAM).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Table A1. Trajectory of the BFGS_V method.
Iteration | f5(x) | x1 | x2
0 | 0.0 | 1.0 | 1.0
1 | 9.605985 × 10−1 | 9.900005 × 10−1 | 5.073008 × 10−5
2 | 9.605986 × 10−1 | 9.900005 × 10−1 | 5.332304 × 10−5
3 | 9.605988 × 10−1 | 9.900006 × 10−1 | 5.659179 × 10−5
4 | 1.396454 × 10−1 | 6.112994 × 10−1 | 2.122674 × 10−4
5 | 7.500359 × 10−4 | 1.654885 × 10−1 | 5.745936 × 10−5
Table A2. Trajectory of the BFGS method.
Iteration | f5(x) | x1 | x2
0 | 0.0 | 1.0 | 1.0
1 | 4.865982 × 102 | 9.854078 × 10−1 | −4.592161 × 10−1
2 | 1.489919 | 9.895086 × 10−1 | −4.914216 × 10−2
3 | 9.608406 × 10−1 | 9.899878 × 10−1 | −1.220497 × 10−3
4 | 9.605957 × 10−1 | 9.899999 × 10−1 | −6.968405 × 10−6
5 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.031223 × 10−4
6 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.891454 × 10−5
7 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.945802 × 10−5
8 | 9.612290 × 10−1 | 9.899966 × 10−1 | 1.815384 × 10−3
9 | 9.641387 × 10−1 | 9.899999 × 10−1 | −4.249478 × 10−3
10 | 9.605818 × 10−1 | 9.899963 × 10−1 | 9.036299 × 10−6
11 | 9.059276 × 10−1 | 9.603298 × 10−1 | −1.719565 × 10−2
12 | 1.634266 × 10−2 | 2.596599 × 10−1 | 2.457950 × 10−2
13 | 2.586155 × 10−3 | −2.187391 × 10−2 | −2.244455 × 10−2
Table A3. Trajectory of the DFP method (shown partially).
Iteration | f5(x) | x1 | x2
0 | 0.0 | 1.0 | 1.0
1 | 9.605985 × 10−1 | 9.900005 × 10−1 | 5.073008 × 10−5
2 | 9.605986 × 10−1 | 9.900005 × 10−1 | 5.332304 × 10−5
3 | 9.605988 × 10−1 | 9.900006 × 10−1 | 5.659179 × 10−5
4 | 9.605991 × 10−1 | 9.900006 × 10−1 | 6.073300 × 10−5
5 | 9.605994 × 10−1 | 9.900007 × 10−1 | 6.601199 × 10−5
6 | 9.605942 × 10−1 | 9.899988 × 10−1 | −1.191852 × 10−4
7 | 9.605942 × 10−1 | 9.899988 × 10−1 | −1.202172 × 10−4
8 | 9.605942 × 10−1 | 9.899988 × 10−1 | −1.215724 × 10−4
9 | 9.605942 × 10−1 | 9.899988 × 10−1 | −1.233791 × 10−4
10 | 9.605942 × 10−1 | 9.899987 × 10−1 | −1.258332 × 10−4
11 | 9.605957 × 10−1 | 9.899999 × 10−1 | −8.005638 × 10−6
12 | 9.605974 × 10−1 | 9.899977 × 10−1 | −2.294417 × 10−4
13 | 9.605962 × 10−1 | 9.900000 × 10−1 | 4.809401 × 10−6
14 | 9.605946 × 10−1 | 9.899985 × 10−1 | −1.494894 × 10−4
15 | 9.605941 × 10−1 | 9.899992 × 10−1 | −8.356599 × 10−5
16 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.019703 × 10−4
17 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.866443 × 10−5
18 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.826577 × 10−5
19 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.914318 × 10−5
20 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.806531 × 10−5
21 | 9.639558 × 10−1 | 9.899569 × 10−1 | −4.240006 × 10−3
22 | 9.605941 × 10−1 | 9.899991 × 10−1 | −9.292366 × 10−5
23 | 9.605942 × 10−1 | 9.899988 × 10−1 | −1.237791 × 10−4
24 | 9.605943 × 10−1 | 9.899994 × 10−1 | −6.417724 × 10−5
25 | 9.605943 × 10−1 | 9.899986 × 10−1 | −1.365365 × 10−4
26 | 9.605942 × 10−1 | 9.899992 × 10−1 | −7.609652 × 10−5
27 | 9.605941 × 10−1 | 9.899989 × 10−1 | −1.126456 × 10−4
28 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.613542 × 10−5
29 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.018252 × 10−4
30 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.002864 × 10−4
31 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.007205 × 10−4
32 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.001624 × 10−4
33 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.006906 × 10−4
34 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.000533 × 10−4
35 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.006925 × 10−4
36 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.993761 × 10−5
37 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.007145 × 10−4
38 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.980530 × 10−5
39 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.007537 × 10−4
40 | 9.605941 × 10−1 | 9.899990 × 10−1 | −9.964890 × 10−5
41 | 9.605941 × 10−1 | 9.899990 × 10−1 | −1.008103 × 10−4

References

  1. Polyak, B.T. Introduction to Optimization; Translated from Russian; Optimization Software Inc., Publ. Division: New York, NY, USA, 1987.
  2. Nocedal, J.; Wright, S. Numerical Optimization, Series in Operations Research and Financial Engineering; Springer: New York, NY, USA, 2006.
  3. Bertsekas, D.P. Constrained Optimization and Lagrange Multiplier Methods; Academic Press: New York, NY, USA, 1982.
  4. Gill, P.E.; Murray, W.; Wright, M.H. Practical Optimization; SIAM: Philadelphia, PA, USA, 2020.
  5. Dennis, J.E.; Schnabel, R.B. Numerical Methods for Unconstrained Optimization and Nonlinear Equations; SIAM: Philadelphia, PA, USA, 1996.
  6. Evtushenko, Y.G. Methods for Solving Extremal Problems and Their Application in Optimization Systems; Nauka: Moscow, Russia, 1982. (In Russian)
  7. Polak, E. Computational Methods in Optimization: A Unified Approach; Academic Press: New York, NY, USA, 1971.
  8. Kokurin, M.M.; Kokurin, M.Y.; Semenova, A.V. Iteratively regularized Gauss–Newton type methods for approximating quasi-solutions of irregular nonlinear operator equations in Hilbert space with an application to COVID-19 epidemic dynamics. Appl. Math. Comput. 2022, 431, 127312.
  9. Zhang, J.; Tao, X.; Sun, P.; Zheng, Z. A positional misalignment correction method for Fourier ptychographic microscopy based on the quasi-Newton method with a global optimization module. Opt. Commun. 2019, 452, 296–305.
  10. Lampron, O.; Therriault, D.; Lévesque, M. An efficient and robust monolithic approach to phase-field quasi-static brittle fracture using a modified Newton method. Comput. Methods Appl. Mech. Eng. 2021, 386, 114091.
  11. Spenke, T.; Hosters, N.; Behr, M. A multi-vector interface quasi-Newton method with linear complexity for partitioned fluid–structure interaction. Comput. Methods Appl. Mech. Eng. 2020, 361, 112810.
  12. Zorrilla, R.; Rossi, R. A memory-efficient MultiVector Quasi-Newton method for black-box Fluid-Structure Interaction coupling. Comput. Struct. 2023, 275, 106934.
  13. Davis, K.; Schulte, M.; Uekermann, B. Enhancing Quasi-Newton Acceleration for Fluid-Structure Interaction. Math. Comput. Appl. 2022, 27, 40.
  14. Tourn, B.; Hostos, J.; Fachinotti, V. Extending the inverse sequential quasi-Newton method for on-line monitoring and controlling of process conditions in the solidification of alloys. Int. Commun. Heat Mass Transf. 2023, 142, 1106647.
  15. Hong, D.; Li, G.; Wei, L.; Li, D.; Li, P.; Yi, Z. A self-scaling sequential quasi-Newton method for estimating the heat transfer coefficient distribution in the air jet impingement. Int. J. Therm. Sci. 2023, 185, 108059.
  16. Berahas, A.S.; Jahani, M.; Richtárik, P.; Takác, M. Quasi-Newton Methods for Machine Learning: Forget the Past, Just Sample. Optim. Methods Softw. 2022, 37, 1668–1704.
  17. Rafati, J. Quasi-Newton Optimization Methods For Deep Learning Applications. 2019. Available online: https://arxiv.org/abs/1909.01994.pdf (accessed on 11 January 2024).
  18. Indrapriyadarsini, S.; Mahboubi, S.; Ninomiya, H.; Kamio, T.; Asai, H. Accelerating Symmetric Rank-1 Quasi-Newton Method with Nesterov’s Gradient for Training Neural Networks. Algorithms 2022, 15, 6.
  19. Davidon, W.C. Variable Metric Methods for Minimization; A.E.C. Res. and Develop. Report ANL-5990; Argonne National Laboratory: Argonne, IL, USA, 1959.
  20. Fletcher, R.; Powell, M.J.D. A rapidly convergent descent method for minimization. Comput. J. 1963, 6, 163–168.
  21. Oren, S.S. Self-scaling variable metric (SSVM) algorithms I: Criteria and sufficient conditions for scaling a class of algorithms. Manag. Sci. 1974, 20, 845–862.
  22. Oren, S.S. Self-scaling variable metric (SSVM) algorithms II: Implementation and experiments. Manag. Sci. 1974, 20, 863–874.
  23. Powell, M.J.D. Convergence Properties of a Class of Minimization Algorithms. In Nonlinear Programming; Mangasarian, O.L., Meyer, R.R., Robinson, S.M., Eds.; Academic Press: New York, NY, USA, 1975; Volume 2, pp. 1–27.
  24. Dixon, L.C. Quasi-Newton algorithms generate identical points. Math. Program. 1972, 2, 383–387.
  25. Huynh, D.Q.; Hwang, F.-N. An accelerated structured quasi-Newton method with a diagonal second-order Hessian approximation for nonlinear least squares problems. J. Comp. Appl. Math. 2024, 442, 115718.
  26. Chai, W.H.; Ho, S.S.; Quek, H.C. A Novel Quasi-Newton Method for Composite Convex Minimization. Pattern Recognit. 2022, 122, 108281.
  27. Fang, X.; Ni, Q.; Zeng, M. A modified quasi-Newton method for nonlinear equations. J. Comp. Appl. Math. 2018, 328, 44–58.
  28. Zhou, W.; Zhang, L. A modified Broyden-like quasi-Newton method for nonlinear equations. J. Comp. Appl. Math. 2020, 372, 112744.
  29. Broyden, C.G. The convergence of a class of double-rank minimization algorithms. J. Inst. Math. Appl. 1970, 6, 76–79.
  30. Fletcher, R. A new approach to variable metric algorithms. Comput. J. 1970, 13, 317–322.
  31. Goldfarb, D. A family of variable metric methods derived by variational means. Math. Comput. 1970, 24, 23–26.
  32. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528.
  33. Zhu, C.; Byrd, R.H.; Lu, P.; Nocedal, J. Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization. ACM Trans. Math. Softw. 1997, 23, 550–560.
  34. Tovbis, E.; Krutikov, V.; Stanimirović, P.; Meshechkin, V.; Popov, A.; Kazakovtsev, L. A Family of Multi-Step Subgradient Minimization Methods. Mathematics 2023, 11, 2264.
  35. Krutikov, V.; Gutova, S.; Tovbis, E.; Kazakovtsev, L.; Semenkin, E. Relaxation Subgradient Algorithms with Machine Learning Procedures. Mathematics 2022, 10, 3959.
  36. Feldbaum, A.A. On a class of dual control learning systems. Avtomat. i Telemekh. 1964, 25, 433–444. (In Russian)
  37. Aizerman, M.A.; Braverman, E.M.; Rozonoer, L.I. Method of Potential Functions in Machine Learning Theory; Nauka: Moscow, Russia, 1970. (In Russian)
  38. Tsypkin, Y.Z. Foundations of the Theory of Learning Systems; Academic Press: New York, NY, USA, 1973.
  39. Kaczmarz, S. Approximate solution of systems of linear equations. Int. J. Control 1993, 54, 1239–1241.
  40. Krutikov, V.N. On the convergence rate of minimization methods along vectors of a linearly independent system. USSR Comput. Math. Math. Phys. 1983, 23, 218–220.
  41. Rao, S.S. Engineering Optimization; Wiley: Hoboken, NJ, USA, 2009.
  42. Andrei, N. An Unconstrained Optimization Test Functions Collection. Available online: http://www.ici.ro/camo/journal/vol10/v10a10.pdf (accessed on 1 April 2024).
Figure 1. Step of process (13) on hyperplane <zk, c> = yk along the direction zk.
Figure 2. Qualitative behavior of the spectrum of matrix Hk eigenvalues for cases of scaling (124) for various values of K.
Figure 3. Level curves and paths of the optimization algorithms for function f5.
Table 1. Results of minimization with normalization of matrix (124) at K = 1 and n = 1000. For each function, the first row gives the number of iterations and the second row the number of function/gradient calls; a dash means no solution was obtained.
      | Exact Descent                | Inexact Descent
      | BFGS | BFGS_V | DFP | DFP_V | BFGS | BFGS_V | DFP | DFP_V
f1(x) | 1157 | 1157 | 1228 | 1211 | 1854 | 1648 | - | 1762
      | 2526 | 2523 | 2712 | 2667 | 3980 | 3413 | - | 3750
f2(x) | 2400 | 2370 | - | - | 4351 | 3218 | - | -
      | 5663 | 5560 | - | - | 9908 | 7242 | - | -
f3(x) | 1404 | 1396 | - | 1643 | 1905 | 1508 | 5837 | 2497
      | 3206 | 3190 | - | 3743 | 4286 | 3394 | 13,362 | 5686
f4(x) | 3328 | 2964 | - | - | - | - | - | -
      | 7455 | 6668 | - | - | - | - | - | -
Table 2. Results of minimization with normalization of matrix (124) at K = 10,000 and n = 1000. For results marked with an asterisk, K = 100. For each function, the first row gives the number of iterations and the second row the number of function/gradient calls; a dash means no solution was obtained.
      | Exact Descent                | Inexact Descent
      | BFGS | BFGS_V | DFP | DFP_V | BFGS | BFGS_V | DFP | DFP_V
f1(x) | 1038 | 1038 | 1041 | 1041 | 1221 | 1189 | 1307 | 1260
      | 2194 | 2193 | 2197 | 2195 | 2190 | 2116 | 2431 | 2343
f2(x) | 791 | 795 | 1091 | 852 | 1386 | 1012 | 2524 | 1509
      | 1863 | 1874 | 2560 | 2028 | 3129 | 2159 | 5794 | 3341
f3(x) | 1082 * | 1090 * | 8977 * | 1343 * | 1281 | 1129 | 4281 | 1845
      | 2436 | 2454 | 20,201 | 3055 | 2802 | 2453 | 9742 | 4183
f4(x) | 4062 * | 3850 * | - | - | - | - | - | -
      | 9135 | 8686 | - | - | - | - | - | -
Table 3. Results of minimization with normalization of matrix (124) at K = 0.000001 and n = 2 (exact descent).
                     | BFGS | BFGS_V | DFP_V
Number of iterations | 13 | 5 | 3733
Fmin | 2.5862 × 10−3 | 7.5003 × 10−4 | 7.7552 × 10−3