1. Introduction
The conjugate gradient (CG) and conjugate direction (CD) methods have been extended to the optimization of nonquadratic functions by several authors. Fletcher and Reeves [1] gave a direct extension of the conjugate gradient (CG) method. An approach to conjugate direction (CD) methods using only function values was developed by Powell [2]. Davidon [3] developed a variable metric algorithm, which was later modified by Fletcher and Powell [4]. According to Davidon [3], variable metric methods are considered to be very effective techniques for optimizing a nonquadratic function.
In 1952, Hestenes and Stiefel [5] developed conjugate direction (CD) methods for minimizing a quadratic function defined on a finite-dimensional space. One of their objectives was to find efficient computational methods for solving large systems of linear equations. In 1964, Fletcher and Reeves [1] extended the conjugate gradient (CG) method of Hestenes and Stiefel [5] to nonquadratic functions. The method presented here is related to those described by G.S. Smith [6], M.J.D. Powell [2] and W.I. Zangwill [7]. The method of Smith is also described by Fletcher [8], pp. 9–10, Brent [9], p. 124, and Hestenes [10], p. 210. In addition, Nocedal [11] explored the possibility of nonlinear conjugate gradient methods converging without restarts and with the use of a practical line search. In the field of numerical optimization, a number of additional authors, including Kelley [12] and Zang and Li [13], among others, investigated a wide range of approaches to the use of conjugate direction methods.
The primary purpose of this work is to implement Hestenes’ Gram–Schmidt conjugate direction method without derivatives, which uses function values and no line searches. We will refer to this method as the GSCD method; Hestenes refers to it as the CGS method. We illustrate the procedure numerically, computing asymptotic constants and the quotient convergence factors of Ortega and Rheinboldt [14]. With reference to Hestenes [10], p. 202, where he states that the CGS method has Newton’s algorithm as its limit, Russak [15] shows that n-step superlinear convergence is possible. We verify numerically that the GSCD procedure converges quadratically under appropriate conditions.
As for notation, we use capital letters, such as A, to denote matrices and lower case letters, such as a, for scalars. The value $A^{T}$ denotes the transpose of the matrix A. If F is a real-valued differentiable function of n real variables, we denote its gradient at x by $\nabla F(x)$ and the Hessian of F at x by $\nabla^{2} F(x)$. We use subscripts to distinguish vectors and superscripts to denote components when these distinctions are made together, for example, $x_k = (x_k^{1}, \dots, x_k^{n})$.
The method of steepest descent is due to Cauchy [16]. It is one of the oldest and most obvious ways to find a minimum point of a function f. There are two versions of steepest descent. The one due to Cauchy, which we call an iterative method, uses line searches; the other, described by Eells [17] in Equation (10), p. 783, uses a differential equation of steepest descent. In Equation (4.3) we describe another version of the differential equation of steepest descent. Numerically, however, both have flaws. The iterative method is generally quite slow, as shown by Rosenbrock [18] with his banana valley function.
Newton’s method applied to grad f = 0, where f is a function to be minimized, is another approach for finding a minimum of the function f. Newton’s method has rapid convergence, but it is costly because of derivative evaluations. Hestenes’ CGS method without derivatives [10], p. 202, has Newton’s method as its limit as the difference-quotient parameter σ tends to zero.
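For a concrete point of comparison, the following minimal Python sketch implements Newton’s method for minimization by solving grad f(x) = 0; the tolerance, the iteration limit, the Rosenbrock test function and the starting point are illustrative assumptions rather than values taken from this paper.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-12, max_iter=50):
    """Newton's method for minimizing f by solving grad f(x) = 0.

    grad, hess: callables returning the gradient vector and Hessian matrix.
    Returns the final iterate and the list of iterates (useful for convergence studies)."""
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Newton step: solve H(x) s = -grad f(x) rather than forming an inverse.
        s = np.linalg.solve(hess(x), -g)
        x = x + s
        history.append(x.copy())
    return x, history

# Illustrative test: Rosenbrock's banana valley function (assumed standard form).
def rb_grad(v):
    x, y = v
    return np.array([-400 * x * (y - x**2) - 2 * (1 - x), 200 * (y - x**2)])

def rb_hess(v):
    x, y = v
    return np.array([[1200 * x**2 - 400 * y + 2, -400 * x],
                     [-400 * x, 200.0]])

x_min, iters = newton_minimize(rb_grad, rb_hess, x0=[-1.2, 1.0])
print(x_min, len(iters))
```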
If the minimizing function is strongly convex quadratic and the line search is exact, then, in theory, all choices for the search direction in standard conjugate gradient algorithms are equivalent. However, for nonquadratic functions, each choice of the search direction leads to standard conjugate gradient algorithms with very different performances [19].
In this article, we investigate quotient convergence factors and root convergence factors. We computationally compare the conjugate Gram–Schmidt direction method with Newton’s method. There are other notions of convergence for the conjugate gradient, conjugate direction, gradient, Newton and steepest descent methods, such as superlinear convergence [20,21,22], Wall [23] root convergence and Ostrowski convergence factors [24], but, for the purposes of this research, quotient convergence is the most appropriate for establishing quadratic convergence.
In this article, the well-known conjugate direction algorithm for minimizing a quadratic function is modified to become an algorithm for minimizing a nonquadratic function, in the manner described in Section 2. The algorithm uses the gradient estimates and Hessian matrix estimates described in Section 3. In Section 4, a test example for minimizing a nonquadratic function with the developed derivative-free conjugate direction algorithm is analyzed. The advantage of this approach compared to Newton’s method is efficiency. The proposed approach is justified in sufficient detail. The results obtained are of theoretical and practical interest for specialists in the theory and methods of optimization.
3. Results
A brief description of the CG method for a quadratic function is given below. The CG method is the CD method described previously, with the first conjugate direction taken in the direction of the negative gradient of the function F. The remaining conjugate directions can be determined in a variety of ways, and the CG procedure described by Hestenes [10] is given below.
3.1. CG—Algorithms for Nonquadratic Approximations
One can apply the CG method to a quadratic approximation in z to obtain an estimate of the minimum of f. Let f be a function of n variables; then the quadratic approximation about a point is formed from the function value, the gradient and the Hessian at that point. Assume that the Hessian matrix is a positive definite symmetric matrix, which implies that this quadratic has a unique minimum. Applying Newton’s method to the quadratic, we obtain its minimum point in a single step.
Remark 1. We solved the resulting system directly to obtain the minimum point.
In general, Newton’s method is used to solve a system of nonlinear equations. The iteration is defined in terms of an initial guess $x_0$ and the Jacobian matrix of the system. Now, we apply Newton’s method to the gradient equation, assuming that F and its second partial derivatives are continuous. So, one can apply Newton’s method, with $x_0$ as the initial point, to obtain the minimum point of F; in this case, the Jacobian of the system is the Hessian matrix of F.
For convenience in exposition, we include formulas below from Hestenes [10], pp. 136–137 and pp. 199–202. Here, the first step of Newton’s method is applied to the quadratic function (whose matrix is positive definite symmetric), and the resulting point turns out to be its only minimum point, i.e., the point satisfying the gradient equation. Therefore, Newton’s method terminates in one iteration [10].
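For completeness, the standard one-step argument can be written out explicitly; the quadratic is written here as $F(x) = \tfrac{1}{2}x^{T}Ax - b^{T}x + c$ with $A$ positive definite symmetric, which is an assumed normalization rather than a restatement of the displayed formulas.

\[
\nabla F(x) = Ax - b, \qquad \nabla^{2} F(x) = A,
\]
so one Newton step from any starting point $x_0$ gives
\[
x_1 = x_0 - [\nabla^{2} F(x_0)]^{-1}\,\nabla F(x_0) = x_0 - A^{-1}(Ax_0 - b) = A^{-1}b,
\]
and $\nabla F(x_1) = 0$, i.e., $x_1$ is the unique minimum point, independently of $x_0$.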
The initial formulas given in Algorithm 1 imply the basic CG relations.
Algorithm 1 CG algorithm
Step 1: Select an initial point . Set , , . For each step of the cycle, perform the following iteration:
Step 2: ,
Step 3: , , or ,
Step 4: , ,
Step 5: , or .
End for.
Step 6: When the cycle is complete, consider the next estimate of the minimum point of f to be the point reached. Then choose it as the final estimate if the test quantity is sufficiently small. Otherwise, reset and repeat the CG cycle.
The CG cycle can terminate prematurely at the mth step. In this case, we replace the initial point by the current point and restart the algorithm.
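The following Python sketch shows the classical Hestenes–Stiefel conjugate gradient recurrence for a quadratic $F(x) = \tfrac12 x^{T}Ax - b^{T}x$; it illustrates the structure of Algorithm 1, though the exact update formulas used there may differ in detail, and the test matrix and tolerance are illustrative assumptions.

```python
import numpy as np

def cg_quadratic(A, b, x0, tol=1e-12):
    """Classical conjugate gradient recurrence for minimizing
    F(x) = 0.5 x^T A x - b^T x with A positive definite symmetric
    (equivalently, for solving A x = b)."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x          # negative gradient of F at x
    p = r.copy()           # first conjugate direction: steepest descent
    for _ in range(len(b)):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:   # premature termination, as in the text
            break
        beta = (r_new @ r_new) / (r @ r)  # makes the next direction A-conjugate
        p = r_new + beta * p
        r = r_new
    return x

# Small usage example with an assumed positive definite symmetric matrix.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(cg_quadratic(A, b, x0=np.zeros(2)))  # approaches the solution of A x = b
```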
If we take F to be the quadratic function, where A is positive definite symmetric, then we establish a formula for the inverse of A.
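The formula referred to here is presumably the classical conjugate-direction identity for the inverse; under the assumption that $p_1, \dots, p_n$ are mutually A-conjugate directions spanning $\mathbb{R}^n$, it reads

\[
A^{-1} = \sum_{i=1}^{n} \frac{p_i\, p_i^{T}}{p_i^{T} A\, p_i},
\]
which can be verified by checking that the right-hand side multiplied by $A$ maps each $p_j$ to itself.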
Since Step 2 involves the matrix of second derivatives only through its product with the current direction, in Algorithm 1 we can rewrite that vector as a difference quotient (see Hestenes [10]); therefore, the second derivative need not be computed explicitly.
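The difference quotient in question is presumably of the following standard Taylor-expansion type, which replaces the Hessian–vector product by gradient or function differences; the symbol σ for the small step is taken from the later discussion, and the exact form used by Hestenes may differ.

\[
F''(x_k)\,p_k \;\approx\; \frac{F'(x_k + \sigma p_k) - F'(x_k)}{\sigma},
\qquad
p_k^{T} F''(x_k)\, p_k \;\approx\; \frac{F(x_k + \sigma p_k) - 2F(x_k) + F(x_k - \sigma p_k)}{\sigma^{2}},
\]
with errors of order $O(\sigma)$ and $O(\sigma^{2})$, respectively, so both tend to the exact quantities as $\sigma \to 0$.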
In view of the development of Algorithms 1 and 2, each cycle of n steps is clearly comparable to one Newton step.
Thus, we replace the second-derivative term by the difference quotient and obtain the corresponding relations used in Algorithm 2 below.
Algorithm 2 CG algorithm without derivatives
Step 1: Initially select  and choose a positive constant . Set , , . For each step of the cycle, perform the following iteration:
Step 2: , ,
Step 3: , , ,
Step 4: , ,
Step 5: , .
End for.
Step 6: When the cycle is complete, the point reached is the next estimate of the minimum point of f. Accept it as the final estimate of the minimum if the test quantity is sufficiently small. Otherwise, reset and repeat the CG cycle.
The new initial point generated by one cycle of the modified Algorithm 2 is, therefore, given by a Newton-type formula. The above algorithm approximates the Newton algorithm and has it as a limit as σ → 0. Therefore, Algorithm 2 will have nearly identical convergence features to Newton’s algorithm if the initial point is replaced by the newly generated point at the end of each cycle.
3.2. Conjugate Gram–Schmidt (CGS)—Algorithms for Nonquadratic Functions
With an appropriate initial point, we can derive the algorithm described by Hestenes [10] on p. 135, which relates Newton’s method to a CGS algorithm. Since, as noted in [10], one vector can be approximated by another for a small value of σ, we obtain the following modification of Newton’s algorithm, the CGS algorithm (see Hestenes [10]), given as Algorithm 3 below.
In Step 2 of Algorithm 3, substitute the corresponding quantity with the following formula and repeat the CGS algorithm. Then, we obtain Newton’s algorithm.
Algorithm 3 CGS algorithm
Step 1: Select a point , a small positive constant, and n linearly independent vectors ; set , , . For each step of the cycle, having obtained the previous quantities, perform the following iteration:
Step 2: , ,
Step 3: , , ,
Step 4: ,
Step 5: ,
Step 6: .
End for.
Step 7: When the last point of the cycle has been computed, the cycle is terminated. Then choose that point as the final estimate if the test quantity is sufficiently small. Otherwise, reset and repeat the CGS cycle.
In view of (11), for small σ, the CGS Algorithm 3 is a good approximation of Newton’s algorithm and has it as a limit as σ → 0.
A simple modification of Algorithm 3 is obtained by replacing the formulas in Step 2 and Step 3 with the following, as described in Hestenes [10].
A CGS algorithm for nonquadratic functions is obtained from the following relation, where the ratios involved have the required properties and p is a nonzero vector. Moreover, for a given vector, the corresponding ratio has an analogous property.
The details are as follows. Suppose that a set of vectors forms an orthogonal basis spanning the same vector space as that spanned by a given set of linearly independent vectors. The inner product is defined by $\langle u, v \rangle = u^{T} A v$, where A is a positive definite symmetric matrix. Then, the Gram–Schmidt process works as follows:
Take the first direction equal to the first of the given vectors; each subsequent direction is then obtained by subtracting from the corresponding vector its projections, in this inner product, onto the previously constructed directions. Therefore, the resulting directions are mutually conjugate with respect to A.
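As a concrete illustration of the Gram–Schmidt process with respect to the inner product $\langle u, v \rangle = u^{T} A v$, the following Python sketch converts linearly independent vectors into A-conjugate directions; it is a direct transcription of the standard process, with an assumed test matrix, rather than the paper’s exact routine.

```python
import numpy as np

def conjugate_gram_schmidt(U, A):
    """Gram-Schmidt process with the inner product <u, v> = u^T A v.

    U: iterable of linearly independent vectors.
    A: positive definite symmetric matrix.
    Returns directions p_1, ..., p_n with p_i^T A p_j = 0 for i != j."""
    P = []
    for u in U:
        p = np.array(u, dtype=float)
        for q in P:
            # Subtract the A-projection of u onto each previously built direction.
            p -= (u @ A @ q) / (q @ A @ q) * q
        P.append(p)
    return P

# Usage check with an assumed positive definite matrix and the coordinate basis.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
P = conjugate_gram_schmidt(np.eye(2), A)
print(P[0] @ A @ P[1])  # ~0: the directions are A-conjugate
```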
Now, using function values only, a conjugate Gram–Schmidt process without derivatives is described by Hestenes [10] as the CGS routine without derivatives (Algorithm 4):
Algorithm 4 CGS algorithm without derivatives
Step 1: Select an initial point , a small constant  and a set of linearly independent unit vectors ; set , , , ; compute . For each step of the cycle, having obtained the previous quantities, perform the following iteration:
Step 2: ,
Step 3: ,
Step 4: ,
Step 5: , ,
Step 6: .
End for.
Step 7: When the last point of the cycle has been computed, the cycle is terminated. Then choose that point as the final estimate if the test quantity is sufficiently small; it is taken as the minimum point of f. Otherwise, reset and repeat the CGS cycle with the new initial condition.
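Since the detailed formulas of Algorithm 4 involve quantities defined earlier, the following simplified Python sketch conveys only the overall structure: the gradient and Hessian are estimated by difference quotients with a small step σ, the unit vectors are made conjugate with respect to the estimated Hessian by the Gram–Schmidt process, and one cycle accumulates a Newton-type step. The step size, test function and starting point are illustrative assumptions; this sketch is not a literal transcription of Hestenes’ Algorithm 4.

```python
import numpy as np

def fd_gradient(f, x, sigma):
    """Central-difference estimate of grad f(x) using function values only."""
    n = len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = sigma
        g[i] = (f(x + e) - f(x - e)) / (2 * sigma)
    return g

def fd_hessian(f, x, sigma):
    """Difference-quotient estimate of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = sigma
        H[:, i] = (fd_gradient(f, x + e, sigma) - fd_gradient(f, x - e, sigma)) / (2 * sigma)
    return 0.5 * (H + H.T)   # symmetrize the estimate

def gscd_cycle(f, x0, sigma=1e-4):
    """One cycle of a Gram-Schmidt conjugate-direction step (simplified sketch):
    estimate g and H by difference quotients, build H-conjugate directions from
    the unit vectors, and accumulate the resulting Newton-type step."""
    x0 = np.asarray(x0, dtype=float)
    g = fd_gradient(f, x0, sigma)
    H = fd_hessian(f, x0, sigma)
    P, x = [], x0.copy()
    for u in np.eye(len(x0)):
        p = u.copy()
        for q in P:                          # Gram-Schmidt with respect to H
            p -= (u @ H @ q) / (q @ H @ q) * q
        P.append(p)
        x = x - (g @ p) / (p @ H @ p) * p    # exact minimizer along p for the quadratic model
    return x

# Illustrative run on an assumed Rosenbrock function and starting point.
rosen = lambda v: 100.0 * (v[1] - v[0]**2)**2 + (1.0 - v[0])**2
x = np.array([-1.2, 1.0])
for _ in range(15):
    x = gscd_cycle(rosen, x)
print(x)   # should approach (1, 1)
```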
In addition, the conjugate Gram–Schmidt method without derivatives is described by Dennemeyer and Mookini [26]. In their program, they used notation different from Hestenes’, but they implemented the same procedure.
Initial step: select an initial point , a small constant  and a set of linearly independent vectors ; set , , , and compute .
Iterative steps: given the current point, compute the required difference quotients; for each direction, compute the corresponding updates; then proceed to the next point.
Terminate when the last point of the cycle has been obtained. If the test value is small enough, this point is taken as the minimum point of f. Otherwise, reset and repeat the program.
This term is used to terminate the algorithm because the gradient is not explicitly computed. Another termination method would be to test against a tolerance chosen beforehand. Both of these tests were used on the computer by Dennemeyer and Mookini [26] and the results were comparable.
4. Discussion
In this section, we present a computation to illustrate convergence rates, as well as the relationship between that computation and Newton’s method. Two of the most important concepts in the study of iterative processes are the following: (a) when the iterations converge; and (b) how fast the convergence is. We introduce the idea of rates of convergence, as described by Ortega and Rheinboldt [14].
4.1. Rates of Convergence
A precise formulation of the asymptotic rate of convergence of a sequence converging to a limit is motivated by the fact that estimates of the form (13), valid for all sufficiently large k, often arise naturally in the study of certain iterative processes.
Definition 1. Let $\{x_k\}$ be a sequence of points in $\mathbb{R}^n$ that converges to a point $x^*$. Let $p \in [1, \infty)$. Ortega and Rheinboldt [14] define the quantities
\[
Q_p\{x_k\} = \limsup_{k \to \infty} \frac{\lVert x_{k+1} - x^* \rVert}{\lVert x_k - x^* \rVert^{\,p}}
\]
and refer to them as quotient convergence factors, or Q-factors for short.

Definition 2. Let $C(\mathcal{I}, x^*)$ denote the set of all sequences having the limit $x^*$ that are generated by an iterative process $\mathcal{I}$. The quantities
\[
Q_p(\mathcal{I}, x^*) = \sup\bigl\{\, Q_p\{x_k\} : \{x_k\} \in C(\mathcal{I}, x^*) \,\bigr\}
\]
are the $Q_p$-factors of $\mathcal{I}$ at $x^*$ with respect to the norm in which the $Q_p\{x_k\}$ are computed.
Note that if $Q_p\{x_k\}$ is finite for some $p \ge 1$, then, for any $\varepsilon > 0$, there is some positive integer K such that (13) above holds for all $k \ge K$. In that case, we say that $\{x_k\}$ converges to $x^*$ with Q-order of convergence p, and if $Q_p\{x_k\} = 0$ for some fixed p satisfying $p > 1$, then we say that $\{x_k\}$ has superconvergence of Q-order p to $x^*$. For example, if $p = 1$ and $0 < Q_1\{x_k\} < 1$, then (13) says that $\{x_k\}$ converges to $x^*$ linearly. In addition, if $Q_1\{x_k\} = 0$, we say that $\{x_k\}$ converges to $x^*$ superlinearly.
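Quotient convergence factors can also be estimated directly from a computed sequence of iterates. The short Python sketch below forms the quotients $\lVert x_{k+1} - x^* \rVert / \lVert x_k - x^* \rVert^{p}$, whose limit superior is the Q-factor of Definition 1, for an artificially generated quadratically convergent sequence; the sample sequence is an illustrative assumption.

```python
import numpy as np

def q_quotients(iterates, x_star, p):
    """Quotients ||x_{k+1} - x*|| / ||x_k - x*||^p whose lim sup is the Q_p factor."""
    e = [np.linalg.norm(np.asarray(x) - np.asarray(x_star)) for x in iterates]
    return [e[k + 1] / e[k] ** p for k in range(len(e) - 1) if e[k] > 0]

# Illustrative sequence converging quadratically to 0 with asymptotic constant 0.5.
xs = [np.array([0.1])]
for _ in range(6):
    xs.append(0.5 * xs[-1] ** 2)
print(q_quotients(xs, np.array([0.0]), p=1))  # tends to 0 (superlinear convergence)
print(q_quotients(xs, np.array([0.0]), p=2))  # tends to 0.5 (quadratic, Q_2 factor 0.5)
```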
Definition 3. One other method of describing the convergence rate involves the root convergence factors; see [14].

4.2. Acceleration
One acceleration procedure is the following: first, apply n CD steps to an initial point to obtain a new point; then, take that point to be a new initial point and apply n CD steps again to obtain another point; finally, check for acceleration by evaluating the acceleration test; if the test succeeds, we accelerate by taking the accelerated point as our initial point; if it does not, then take the most recent point as a new initial point; after two more applications of the CD method, we check for acceleration again. The procedure continues in this manner [25].
4.3. Test Functions
4.3.1. Rosenbrock’s Banana Valley Function
We carry out the following computations for Rosenbrock’s banana valley function f. This function possesses a steep-sided valley that is nearly parabolic in shape. First, we determine values in the domain of Rosenbrock’s function for which its Hessian matrix is positive definite symmetric. Rosenbrock’s banana valley function is non-negative; computing its first and second partial derivatives, the Hessian matrix is positive definite symmetric if and only if Sylvester’s criterion holds, i.e., both of its leading principal minors are positive, which yields an explicit condition on the points (x, y) of the plane.
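Assuming the standard form of the banana valley function, the condition can be derived explicitly; this derivation is an illustration under that assumption rather than a restatement of the displayed formulas.

\[
f(x, y) = 100\,(y - x^{2})^{2} + (1 - x)^{2},
\qquad
\nabla^{2} f(x, y) =
\begin{pmatrix}
1200x^{2} - 400y + 2 & -400x\\
-400x & 200
\end{pmatrix}.
\]
Sylvester's criterion requires $1200x^{2} - 400y + 2 > 0$ and
\[
\det \nabla^{2} f(x, y) = 200\,(1200x^{2} - 400y + 2) - 160000x^{2} = 80000\,(x^{2} - y) + 400 > 0,
\]
which reduces to $y < x^{2} + \tfrac{1}{200}$; the first inequality then holds automatically.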
Figure 1 shows the maximal convex level set on which the Hessian is positive definite symmetric in the interior for Rosenbrock’s Banana Valley Function.
4.3.2. Kantorovich’s Function
The following function, which is non-negative, is called Kantorovich’s function. Calculating the Hessian matrix for Kantorovich’s function, we find its entries directly. Minimizing this function is equivalent to solving a nonlinear system of equations. Therefore, for the chosen initial point, we obtain the minimum point at (0.992779, 0.306440) [25].
4.4. Numerical Computation
The goal of this numerical computation is to provide a system of iterative approaches for finding these extreme points [10]. A significant point is that a Newton step can be performed instead by a CD sequence of n linear minimizations in n appropriately chosen directions.
It is important to keep in mind that a function acts like a quadratic function when it is in the neighborhood of a nondegenerate minimum point. Conjugacy can be thought of as a generalization of the concept of orthogonality. Conjugate direction methods include substituting conjugate bases for orthogonal bases in the foundational structure. The formulas for determining the minimum point of a quadratic function can be reduced to their simplest forms by following the CD technique.
The conjugate direction algorithms for minimizing a quadratic function that are discussed in the current work were initially presented by Hestenes and Stiefel in 1952 [5]. Davidon [3] and Fletcher and Powell [4] are best known for the modifications and additions that they made to these methods; however, numerous other authors also contributed such changes.
The iterative methods described above apply to many problems. They are used in least squares fitting, in solving linear and nonlinear systems of equations and in optimization problems with and without constraints [25]. The computational performance and numerical results of these techniques, and comparisons among them, have received a significant amount of attention in recent years. This interest has been focused on the solution of unconstrained optimization problems and large-scale applications [19,27].
The Rosenbrock function of two variables, considered in Section 4.3, was introduced by Rosenbrock [18] as a simple test function for minimization algorithms. We chose an initial point and applied the algorithm with a small value of σ, using 400-digit accuracy. Algorithm 4 is basically Newton’s algorithm.
The final estimate of the minimum point has more than 150-digit accuracy. The successive values of the quotients that lead to the quotient convergence factor oscillate. The lim sup of these quotients gives the quotient convergence factor, which indicates quadratic convergence.
For the chosen values of σ and ρ and the given initial values, we obtained the following computations for Rosenbrock’s function f, using the Gram–Schmidt conjugate direction method without derivatives (the CGS method with no derivatives) and Newton’s method applied to the gradient equation (see [28]).
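To give a reproducible flavor of such a comparison in ordinary double precision (rather than the 400-digit arithmetic used above), the following Python sketch runs Newton’s method on the standard Rosenbrock function from the common starting point (−1.2, 1) and prints the quotients whose lim sup estimates the quadratic-convergence factor; the function form, starting point and tolerances are illustrative assumptions, not the exact data of this computation, and only the first few quotients are meaningful at this precision.

```python
import numpy as np

def rosen_grad(v):
    x, y = v
    return np.array([-400 * x * (y - x**2) - 2 * (1 - x), 200 * (y - x**2)])

def rosen_hess(v):
    x, y = v
    return np.array([[1200 * x**2 - 400 * y + 2, -400 * x], [-400 * x, 200.0]])

# Newton iterates and the quadratic-convergence quotients ||e_{k+1}|| / ||e_k||^2.
x, x_star = np.array([-1.2, 1.0]), np.array([1.0, 1.0])
errors = [np.linalg.norm(x - x_star)]
for _ in range(10):
    x = x + np.linalg.solve(rosen_hess(x), -rosen_grad(x))
    errors.append(np.linalg.norm(x - x_star))
quotients = [errors[k + 1] / errors[k]**2
             for k in range(len(errors) - 1) if errors[k] > 1e-8]
print(quotients)   # oscillating quotients whose lim sup estimates the Q_2 factor
```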
4.5. Differential Equations of Steepest Descent
The following equations are known as the differential equations of steepest descent. The solution of either differential equation of steepest descent, with a given initial condition, is shown in Figure 2; one can refer to Equation (10), p. 783, in Eells [17]. For Equation (14), the solution will not include the minimum for finite values of t. For Equation (15), the solution will approach the minimum, but will blow up at the minimum.
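For reference, the two standard forms of the differential equation of steepest descent that are consistent with the behavior just described are given below; since Equations (14) and (15) are not reproduced here, these forms are stated as an assumption.

\[
\frac{dx}{dt} = -\nabla f\bigl(x(t)\bigr)
\qquad \text{and} \qquad
\frac{dx}{dt} = -\frac{\nabla f\bigl(x(t)\bigr)}{\lVert \nabla f\bigl(x(t)\bigr) \rVert^{2}} .
\]
For the first equation, $f(x(t))$ decreases but the minimizer is only approached as $t \to \infty$; for the second, $\tfrac{d}{dt} f(x(t)) = -1$, so the minimum value is reached at the finite time $t = f(x_0) - f(x^*)$, at which point the right-hand side is undefined.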
From a numerical point of view, the differential equation approach has to be used with caution. Rosenbrock [18] pointed out that the iterative method of steepest descent with line searches was not effective in steep valleys. The iterative method was introduced by Cauchy [16].
In summary, the method of steepest descent is not effective and does not compare with Hestenes’ CGS method with no derivatives, which is almost numerically equivalent to Newton’s method applied to grad f = 0, where f is the function to be minimized.
Below are level curves of Rosenbrock’s banana valley function. We used this function to compare Hestenes’ CGS method, Newton’s method and the steepest descent methods. In Figure 2, the level curves of Rosenbrock’s banana valley function show that the minimizer is at (1, 1). Level curves are plotted for several function values in Figure 3. For steepest descent, both the iterative method and the ODE approach are illustrated. The steepest descent path appears to parallel the valley floor in the graph.
We use the CGS method for the computation. For Rosenbrock’s banana valley function, it gives the minimum point at (1, 1).
This example provided us with the geometric illustrations in Figure 2. For the specific algorithms, please refer to Section 3 for the Gram–Schmidt conjugate direction method and Newton’s method, in order to compare the two methods side by side.
The outcomes of the numerical experiments performed on the standard test function using the CGS method are reported above. Based on these data, it is clear that this particular implementation of the CGS method is quite effective.
5. Conclusions
In this paper, we introduced a class of CD algorithms that, for small values of n, provided effective minimization methods. As n grew, however, the algorithms became more and more costly to run.
The computer program above showed that the CGS algorithm without derivatives can reproduce Newton’s method. Since the Hessian matrix of Rosenbrock’s function is positive definite symmetric wherever Sylvester’s criterion is satisfied, the CGS method converged whenever we began anywhere in the closed convex set near a minimum, because the CGS method relies on the Hessian being positive definite symmetric there.
Using quotient convergence factors, one can see that, for Rosenbrock’s function, the computed sequence converged quadratically. In particular, the numerical computation on p. 21 revealed that the asymptotic constant oscillated between two values, so the quotient convergence factor of Ortega and Rheinboldt [14] indicated quadratic convergence. The results agreed with those for Newton’s method.
Moreover, the CGS algorithm uses function evaluations and difference quotients for gradient and Hessian evaluations; it requires neither accurate gradient evaluation nor function minimization. This approach is the most efficient algorithm discussed in this study; yet, it is extremely sensitive to both the choice of σ used for the difference quotients and the choice of ρ used for scaling.
The Gram–Schmidt conjugate direction method without derivatives has been used quite successfully in a variety of applications, including radar designs by Norman Olsen [27] in developing corporate feed systems for antennas and aperture distributions for antenna arrays. He tuned the parameters σ and ρ in our GSCD computer programs to obtain successful radar designs.