An Active Set Limited Memory BFGS Algorithm for Machine Learning

Liu, Hanger; Li, Yan; Zhang, Maojun

doi:10.3390/sym14020378


by Hanger Liu ¹, Yan Li ²,* and Maojun Zhang ³,*

¹ Center for Applied Mathematics of Guangxi, College of Mathematics and Information Science, Guangxi University, Nanning 530004, China
² School of Mathematics and Statistics, Baise University, Baise 533000, China
³ School of Business, Suzhou University of Science and Technology, Suzhou 215011, China
* Authors to whom correspondence should be addressed.

Symmetry 2022, 14(2), 378; https://doi.org/10.3390/sym14020378

Submission received: 21 December 2021 / Revised: 8 January 2022 / Accepted: 17 January 2022 / Published: 14 February 2022

(This article belongs to the Section Mathematics)


Abstract

In this paper, a stochastic quasi-Newton algorithm for nonconvex stochastic optimization is presented. It is derived from a classical modified BFGS formula. The update formula can be extended to the framework of limited memory scheme. Numerical experiments on some problems in machine learning are given. The results show that the proposed algorithm has great prospects.

Keywords: nonconvex stochastic optimization; stochastic approximation; quasi-Newton method; damped limited-memory BFGS method; variance reduction

PACS: 62L20; 90C30; 90C15; 90C60

1. Introduction

Machine learning is an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and so on. In machine learning, an appropriate model is usually constructed from an extraordinarily large amount of data. Therefore, the traditional algorithms for solving optimization problems are no longer suitable for machine learning problems, and a stochastic algorithm must be used to solve the model optimization problems encountered in machine learning.

The following type of problem is considered in machine learning:

\min_{x \in ℜ^{d}} f (x) = E [T (x, τ)],

(1)

where $T : ℜ^{D_{x}} \times ℜ^{n} \to ℜ$ is a continuously differentiable function, $E [\cdot]$ denotes the expectation taken with respect to $τ$, and $τ$ is a random variable with distribution function $P$. In most practical cases, the function $T (\cdot, τ)$ is not given explicitly. Even worse, the distribution function $P$ may also be unknown. The objective function (1) is therefore defined using the empirical expectation

f (x) = \frac{1}{N} \sum_{i = 1}^{N} f_{i} (x),

(2)

where $f_{i} : ℜ^{D_{x}} \to ℜ$ is the loss function that corresponds to the $i$th data sample, and $N$ denotes the number of data samples, which is assumed to be extremely large.

The stochastic approximation (SA) algorithm of Robbins and Monro [1] is usually used to solve the above problems. The original SA algorithm is also called stochastic gradient descent (SGD). It is similar to the classical steepest descent method and adopts the iteration $x_{k + 1} = x_{k} - α_{k} g_{k}$. Here, the stochastic gradient $g_{k}$ is an approximation of the full gradient $\nabla f$ of $f$ at $x_{k}$, and $α_{k}$ is the step size (learning rate). The SA algorithm has been studied in depth by many scholars [2,3,4].
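As a rough illustration of the SGD iteration $x_{k + 1} = x_{k} - α_{k} g_{k}$, the following Python sketch runs mini-batch SGD on a toy problem. The function name `sgd`, the toy losses $f_{i} (x) = \frac{1}{2} ∥ x - c_{i} ∥^{2}$, and the step size $α_{k} = 1 / k$ are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def sgd(grad_i, x0, n_samples, step, n_iters, batch_size=1, seed=0):
    """Mini-batch SGD sketch: x_{k+1} = x_k - alpha_k * g_k."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(1, n_iters + 1):
        # Draw a mini-batch of sample indices (with replacement).
        batch = rng.integers(0, n_samples, size=batch_size)
        # g_k: average of the sampled component gradients.
        g = np.mean([grad_i(x, i) for i in batch], axis=0)
        x = x - step(k) * g
    return x

# Hypothetical toy problem: f_i(x) = 0.5 * ||x - c_i||^2, so grad f_i(x) = x - c_i
# and the minimizer of the average loss is the mean of the centers, (1, 1).
centers = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
x_hat = sgd(lambda x, i: x - centers[i], np.zeros(2), len(centers),
            step=lambda k: 1.0 / k, n_iters=2000)
```

With this diminishing step size the iterate is a running average of the sampled centers, so `x_hat` should end up close to the true minimizer.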

In this paper, we mainly study stochastic second-order methods, namely stochastic quasi-Newton (SQN) methods, to solve problem (2). Among the traditional optimization methods, quasi-Newton methods have faster convergence and higher accuracy than first-order methods because they use approximate second-order derivative information. A quasi-Newton method usually updates the iterate by the following formula:

x_{k + 1} = x_{k} - α_{k} B_{k}^{- 1} \nabla f (x_{k}) \quad \text{or} \quad x_{k + 1} = x_{k} - α_{k} H_{k} \nabla f (x_{k}),

(3)

where $B_{k}$ is a symmetric positive definite approximation of the Hessian matrix $\nabla^{2} f (x_{k})$, and $H_{k}$ is a symmetric positive definite approximation of ${[\nabla^{2} f (x_{k})]}^{- 1}$. In the traditional BFGS algorithm, the update formula of $B_{k}$ is as follows:

B_{k} = B_{k - 1} + \frac{y_{k - 1} y_{k - 1}^{T}}{s_{k - 1}^{T} y_{k - 1}} - \frac{B_{k - 1} s_{k - 1} s_{k - 1}^{T} B_{k - 1}}{s_{k - 1}^{T} B_{k - 1} s_{k - 1}},

(4)

where $s_{k - 1} = x_{k} - x_{k - 1} = α_{k - 1} d_{k - 1}$ (with $d_{k - 1}$ the search direction) and $y_{k - 1} = \nabla f (x_{k}) - \nabla f (x_{k - 1})$. If the Sherman–Morrison–Woodbury formula is used, the update formula of $H_{k}$ can be easily obtained:

H_{k} = (I - \frac{s_{k - 1} y_{k - 1}^{T}}{s_{k - 1}^{T} y_{k - 1}}) H_{k - 1} (I - \frac{y_{k - 1} s_{k - 1}^{T}}{s_{k - 1}^{T} y_{k - 1}}) + \frac{s_{k - 1} s_{k - 1}^{T}}{s_{k - 1}^{T} y_{k - 1}} .

(5)
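The inverse BFGS update can be checked numerically. The sketch below, with the hypothetical helper name `bfgs_inverse_update`, applies one update in the standard form $H_{k} = V^{T} H_{k - 1} V + ρ s s^{T}$ with $V = I - ρ y s^{T}$ and $ρ = 1 / (s^{T} y)$. The vectors `s` and `y` are arbitrary illustrative values chosen so that $s^{T} y > 0$.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """One inverse BFGS update: H_new = V^T H V + rho * s s^T, V = I - rho * y s^T."""
    rho = 1.0 / float(s @ y)          # requires the curvature condition s^T y > 0
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

# Illustrative values only; s^T y = 2.13 > 0.
s = np.array([1.0, 0.5, -0.2])
y = np.array([2.0, 0.3, 0.1])
H_new = bfgs_inverse_update(np.eye(3), s, y)
```

When $s^{T} y > 0$ and the previous matrix is symmetric positive definite, the updated matrix stays symmetric positive definite and satisfies the secant equation $H_{k} y_{k - 1} = s_{k - 1}$.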

It is very important to use a limited memory variant for large-scale problems. This so-called L-BFGS [5] algorithm has a linear convergence rate. It produces well-scaled and productive search directions that yield an approximate solution in fewer iterations and function evaluations. In stochastic optimization, many stochastic quasi-Newton formulas have been proposed.

The LBFGS method uses the iteration $x_{k + 1} = x_{k} - α_{k} H_{k} \nabla f (x_{k})$ and updates $H_{k}$ by the following rule:

\begin{aligned} H_{k} &= Q_{k - 1}^{T} H_{k - 1} Q_{k - 1} + ρ_{k - 1} s_{k - 1} s_{k - 1}^{T} \\ &= Q_{k - 1}^{T} [Q_{k - 2}^{T} H_{k - 2} Q_{k - 2} + ρ_{k - 2} s_{k - 2} s_{k - 2}^{T}] Q_{k - 1} + ρ_{k - 1} s_{k - 1} s_{k - 1}^{T} \\ &= \dots \\ &= [Q_{k - 1}^{T} \cdots Q_{k - r + 1}^{T}] H_{k - r + 1} [Q_{k - r + 1} \cdots Q_{k - 1}] \\ &\quad + ρ_{k - r + 1} [Q_{k - 1}^{T} \cdots Q_{k - r + 2}^{T}] s_{k - r + 1} s_{k - r + 1}^{T} [Q_{k - r + 2} \cdots Q_{k - 1}] \\ &\quad + \dots + ρ_{k - 1} s_{k - 1} s_{k - 1}^{T}, \end{aligned}

(6)

where $Q_{k - 1} = I - ρ_{k - 1} y_{k - 1} s_{k - 1}^{T}$, $ρ_{k - 1} = \frac{1}{s_{k - 1}^{T} y_{k - 1}}$, and $r$ is the memory size. Bordes, Bottou, and Gallinari studied a quasi-Newton method with a diagonal rescaling matrix based on the secant condition in [6]. In [7], Byrd et al. proposed a stochastic LBFGS method based on SA and proved its convergence for strongly convex problems. In [8], Gower, Goldfarb, and Richtárik proposed a variance-reduced block L-BFGS method that converges linearly for convex functions. It is worth noting that, in the above quasi-Newton methods, the convergence analysis requires the objective function to be convex or strongly convex.
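In practice, the product of the matrix in (6) with a vector is usually computed without forming the matrix, using the standard two-loop recursion. The paper does not describe this technique; the sketch below is added only as an illustration. The names `lbfgs_direction` and `explicit_H`, and the scalar initial matrix `gamma * I`, are assumptions of this example. The second function builds the matrix explicitly from the recursion $H ← Q^{T} H Q + ρ s s^{T}$ so that the two can be compared.

```python
import numpy as np

def lbfgs_direction(g, pairs, gamma=1.0):
    """Two-loop recursion: returns H @ g for the L-BFGS matrix H built from
    `pairs` = [(s, y), ...] (oldest first) with initial matrix gamma * I."""
    q = np.array(g, dtype=float)
    coeffs = []
    for s, y in reversed(pairs):                 # newest pair to oldest pair
        rho = 1.0 / float(s @ y)
        a = rho * float(s @ q)
        q -= a * y
        coeffs.append((a, rho))
    r = gamma * q                                # apply the initial matrix
    for (s, y), (a, rho) in zip(pairs, reversed(coeffs)):   # oldest to newest
        b = rho * float(y @ r)
        r += (a - b) * s
    return r

def explicit_H(pairs, n, gamma=1.0):
    """Reference implementation: H <- Q^T H Q + rho s s^T with Q = I - rho y s^T."""
    H = gamma * np.eye(n)
    for s, y in pairs:
        rho = 1.0 / float(s @ y)
        Q = np.eye(n) - rho * np.outer(y, s)
        H = Q.T @ H @ Q + rho * np.outer(s, s)
    return H
```

The two-loop form needs only the $r$ stored pairs and a handful of inner products per pair, which is the reason limited memory methods scale to large $n$.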

If the objective function itself does not have the property of convexity, there are several problems that the LBFGS method has difficulty overcoming:

How can we guarantee the positive definiteness of iterative matrix $H_{k}$ without line search?
How can we guarantee the convergence of the proposed L-BFGS method?

These problems seem particularly difficult. In this paper, a modified stochastic limited-memory BFGS method (LMLBFGS) is proposed to address them. On this basis, an improved algorithm with variance reduction (LMLBFGS-VR) is also proposed. Note that the presented algorithm can be adapted to approximate the solution of a nonlinear system of equations, as in [9].

The rest of this paper is organized as follows: in Section 2, the LMLBFGS and LMLBFGS-VR algorithms are presented, and their convergence properties are discussed in Section 3. Section 4 analyzes the complexity of the proposed algorithms. In Section 5, the numerical experiments are given. A summary is given in the last section.

2. Premise Setting and Algorithm

In this part, a new LBFGS algorithm (LMLBFGS) is proposed, which automatically generates a positive definite matrix $B_{k}$.

2.1. LMLBFGS Algorithm

In order to solve this kind of problem, suppose that the set $E \subset ℜ^{n}$ does not depend on $x$ and that the stochastic gradient $g (x, τ)$ at $x$ is generated by a stochastic first-order oracle (SFO), for which the distribution of $τ$ is supported on $E$. It is common to use a mini-batch stochastic gradient at the $k$-th iteration, which is described as

g_{k} = \frac{1}{z_{k}} \sum_{i \in Z_{k}} g (x_{k}, τ_{k, i}) = \frac{1}{z_{k}} \sum_{i \in Z_{k}} \nabla f_{i} (x_{k}),

(7)

and a sub-sampled Hessian defined as follows

G_{k} = \frac{1}{z_{k}^{*}} \sum_{i \in Z_{k}^{*}} G (x_{k}, τ_{k, i}) = \frac{1}{z_{k}^{*}} \sum_{i \in Z_{k}^{*}} \nabla^{2} f_{i} (x_{k}) .

(8)

Here, $Z_{k}$ and $Z_{k}^{*}$ are the sampled index subsets, $z_{k}$ and $z_{k}^{*}$ are the cardinalities of $Z_{k}$ and $Z_{k}^{*}$, respectively, and $τ_{k, i}$ is a random variable. From the definition of the stochastic gradient, it is not difficult to see that it can be computed faster than the full gradient. We assume here that the SFO generates $τ_{k}$ independently of $x_{k}$ and outputs $g (x_{k}, τ_{k, i})$. Therefore, the stochastic gradient difference and the iterate difference are defined as

y_{k} = g_{k} - g_{k - 1} = \frac{1}{z_{k}} \sum_{i \in Z_{k}} g (x_{k}, τ_{k, i}) - \frac{1}{z_{k - 1}} \sum_{i \in Z_{k - 1}} g (x_{k - 1}, τ_{k - 1, i}),

(9)

s_{k} = x_{k} - x_{k - 1} .

(10)

In traditional methods, the authors of [10] proposed a modified difference vector $\bar{y}_{k}$ defined by

\bar{y_{k}} = y_{k} + λ_{k} s_{k},

(11)

where

λ_{k} = \frac{2 [f (x_{k - 1}) - f (x_{k})] + {(g_{k} + g_{k - 1})}^{T} s_{k}}{{(s_{k}^{T} y_{k})}^{2}} \cdot (y_{k} y_{k}^{T}) .

(12)

Inspired by their methods, we have the following new definitions:

y_{k}^{*} = y_{k} + λ_{k} s_{k},

(13)

where

λ_{k} = \frac{2 [f (x_{k - 1}) - f (x_{k})] + {(g_{k} + g_{k - 1})}^{T} s_{k}}{\max \{{(s_{k}^{T} y_{k})}^{2}, {∥ s_{k} ∥}^{4}\}} \cdot (y_{k} y_{k}^{T}) .

(14)

Our $λ_{k}$ is well defined whenever $s_{k} \neq 0$, because then

\max \{{(s_{k}^{T} y_{k})}^{2}, {∥ s_{k} ∥}^{4}\} > 0 .
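The modified vector $y_{k}^{*}$ from (13) and (14) can be computed without forming the matrix $y_{k} y_{k}^{T}$, since $(y_{k} y_{k}^{T}) s_{k} = (y_{k}^{T} s_{k}) y_{k}$. The following Python sketch illustrates this; the function name `modified_y` and its argument order are assumptions of this example, not part of the paper.

```python
import numpy as np

def modified_y(f_prev, f_curr, g_prev, g_curr, s, y):
    """Sketch of y* = y + lambda_k s, where lambda_k = c * (y y^T) as in (13)-(14).

    f_prev, f_curr: f(x_{k-1}) and f(x_k); g_prev, g_curr: g_{k-1} and g_k;
    s = x_k - x_{k-1}; y = g_k - g_{k-1}. Assumes s != 0.
    """
    numer = 2.0 * (f_prev - f_curr) + float((g_curr + g_prev) @ s)
    denom = max(float(s @ y) ** 2, float(s @ s) ** 2)   # ||s||^4 = (s^T s)^2
    c = numer / denom
    # (y y^T) s = (y^T s) y, so no outer product is needed.
    return y + c * float(y @ s) * y
```

For a quadratic function with exact gradients, the numerator of $λ_{k}$ vanishes, so $y_{k}^{*} = y_{k}$; the correction only acts when the function deviates from its quadratic model along $s_{k}$.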

Hence, our stochastic LBFGS algorithm updates $B_{k}$ by

B_{k} = B_{k - 1} + \frac{y_{k - 1}^{*} y_{k - 1}^{* T}}{s_{k - 1}^{T} y_{k - 1}^{*}} - \frac{B_{k - 1} s_{k - 1} s_{k - 1}^{T} B_{k - 1}}{s_{k - 1}^{T} B_{k - 1} s_{k - 1}} .

(15)

Using the Sherman–Morrison–Woodbury formula, we can update $H_{k} = B_{k}^{- 1}$ as

H_{k} = (I - \frac{s_{k - 1} y_{k - 1}^{* T}}{s_{k - 1}^{T} y_{k - 1}^{*}}) H_{k - 1} (I - \frac{y_{k - 1}^{*} s_{k - 1}^{T}}{s_{k - 1}^{T} y_{k - 1}^{*}}) + \frac{s_{k - 1} s_{k - 1}^{T}}{s_{k - 1}^{T} y_{k - 1}^{*}} .

(16)

By simple observation, when the function is nonconvex, we cannot guarantee that $s_{k}^{T} y_{k}^{*} > 0$ holds. Thus, we add an additional condition to the algorithm to ensure the positivity of $s_{k}^{T} y_{k}^{*}$. Define the index set $K$ as follows:

K = \{k : s_{k}^{T} y_{k}^{*} \geq m ∥ s_{k} ∥^{2}\},

(17)

where m is a positive constant.

Hence, our modified stochastic L-BFGS algorithm uses the updates (18) and (19):

B_{k} = \begin{cases} B_{k - 1} + \frac{y_{k - 1}^{*} y_{k - 1}^{* T}}{s_{k - 1}^{T} y_{k - 1}^{*}} - \frac{B_{k - 1} s_{k - 1} s_{k - 1}^{T} B_{k - 1}}{s_{k - 1}^{T} B_{k - 1} s_{k - 1}}, & \text{if } k \in K, \\ B_{k - 1}, & \text{otherwise}, \end{cases}

(18)

H_{k} = \begin{cases} (I - \frac{s_{k - 1} y_{k - 1}^{* T}}{s_{k - 1}^{T} y_{k - 1}^{*}}) H_{k - 1} (I - \frac{y_{k - 1}^{*} s_{k - 1}^{T}}{s_{k - 1}^{T} y_{k - 1}^{*}}) + \frac{s_{k - 1} s_{k - 1}^{T}}{s_{k - 1}^{T} y_{k - 1}^{*}}, & \text{if } k \in K, \\ H_{k - 1}, & \text{otherwise} . \end{cases}

(19)

As is well known, the cost of computing $H_{k}$ through (19) is very high when $n$ is large. Hence, the LBFGS method is usually used instead of the BFGS method to overcome the heavy computational burden in large-scale optimization problems. The advantage of LBFGS is that it only uses the curvature information of the stored correction pairs and does not need to store the update matrix, which effectively reduces the computational cost. Using (6), we iterate

H_{k, i} = (I - ρ_{j} s_{j} y_{j}^{* T}) H_{k, i - 1} (I - ρ_{j} y_{j}^{*} s_{j}^{T}) + ρ_{j} s_{j} s_{j}^{T}, j = k - (r - i); i = 0, \dots, r - 1,

(20)

where $ρ_{j} = 1 / (s_{j}^{T} y_{j}^{*})$. The initial matrix is often chosen as $H_{k, 0} = \frac{s_{k - 1}^{T} y_{k - 1}^{*}}{y_{k - 1}^{* T} y_{k - 1}^{*}} I$. Because $s_{k - 1}^{T} y_{k - 1}^{*}$ may be exceedingly close to 0, we set

H_{k, 0} = γ_{k}^{- 1} I,

(21)

and

γ_{k} = \max \{\frac{∥ y_{k - 1}^{*} ∥^{2}}{s_{k - 1}^{T} y_{k - 1}^{*}}, δ\} \geq δ,

(22)

where $δ$ is a given positive constant.

Therefore, our modified stochastic L-BFGS algorithm is outlined in Algorithm 1.

Algorithm 1: Modified stochastic LBFGS algorithm (LMLBFGS).

Input: Given $x_{1} \in ℜ^{n}$, batch size $z_{k}$, step size $α_{k}$, the memory size $r$, a positive definite matrix $H_{1}$, and a positive constant $δ$
  1: for $k = 1, 2, \dots$ do
  2:     Compute $g_{k}$ by (7) and the matrix $H_{k}$ by Algorithm 2;
  3:     Compute the iteration point $x_{k + 1} = x_{k} - α_{k} H_{k} g_{k}$.
  4: end for

2.2. Extension of Our LMLBFGS Algorithm with Variance Reduction

Recently, variance reduction techniques have been shown to improve the properties of stochastic optimization methods. Motivated by the development of the SVRG method for nonconvex problems, we present a new modified stochastic LBFGS algorithm (called LMLBFGS-VR) with a variance reduction technique for a faster convergence speed, as shown in Algorithm 3.

In LMLBFGS-VR, the mini-batch stochastic gradient is defined as

g (x) = \frac{1}{| Z |} \sum_{i \in Z} \nabla f_{i} (x), Z \subset {1, 2, \dots, n} .

(23)

Algorithm 2: Hessian matrix updating.

Input: correction pairs $(s_{j}, y_{j}^{*})$, memory parameter $r$, and $j = k - (r - i)$; $i = 0, \dots, r - 1$
Output: new $H_{k}$
  1: $H = \frac{s_{k}^{T} y_{k}^{*}}{y_{k}^{* T} y_{k}^{*}} I$
  2: for $j = k - (r - i)$; $i = 0, \dots, r - 1$ do
  3:     $m_{j} = s_{j}^{T} y_{j}^{*} - m {∥ s_{j} ∥}^{2}$, $ρ_{j} = \frac{1}{s_{j}^{T} y_{j}^{*}}$
  4:     if $m_{j} > 0$ then
  5:         $H = (I - s_{j} y_{j}^{* T} ρ_{j}) H (I - y_{j}^{*} s_{j}^{T} ρ_{j}) + ρ_{j} s_{j} s_{j}^{T}$
  6:     end if
  7: end for
  8: return $H_{k} = H$
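The steps of Algorithm 2 can be sketched in Python as follows. This is a minimal dense-matrix illustration, not the authors' MATLAB code. The function name `update_inverse_hessian`, the list-of-pairs input, and the default value of `m` are assumptions of this example, and the sketch further assumes that the most recent pair has $s^{T} y^{*} > 0$ so that the initial scaling in step 1 is positive.

```python
import numpy as np

def update_inverse_hessian(pairs, m=1e-5):
    """Sketch of Algorithm 2.

    pairs: list of (s_j, y_j_star), oldest first; the last entry is the most
    recent pair and is assumed to satisfy s^T y* > 0 (needed for step 1).
    """
    s_last, y_last = pairs[-1]
    n = len(s_last)
    # Step 1: scaled identity as the initial matrix.
    H = (float(s_last @ y_last) / float(y_last @ y_last)) * np.eye(n)
    for s, y in pairs:                        # Step 2: loop over stored pairs
        sy = float(s @ y)
        if sy - m * float(s @ s) > 0:         # Steps 3-4: skip pairs with weak curvature
            rho = 1.0 / sy
            V = np.eye(n) - rho * np.outer(y, s)
            H = V.T @ H @ V + rho * np.outer(s, s)   # Step 5
    return H                                  # Step 8
```

Pairs that fail the test $s_{j}^{T} y_{j}^{*} > m ∥ s_{j} ∥^{2}$ are simply skipped, which is what keeps the returned matrix positive definite without a line search. For large $n$, one would avoid the dense $n \times n$ matrix and apply the same updates implicitly.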

Algorithm 3: Modified stochastic LBFGS algorithm with variance reduction (LMLBFGS-VR).

Input: Given ${\bar{x}}_{0} \in ℜ^{n}$, $H_{0} = I$, batch size $z_{k}$, step size $α_{k}$, the memory size $r$, and a constant $δ > 0$
Output: Iterate $x$ chosen uniformly at random from $\{l_{u}^{k + 1} : u = 0, \dots, q - 1; k = 0, \dots, N - 1\}$
  1: for $k = 0, 1, \dots, N - 1$ do
  2:     $l_{0}^{k + 1} = {\bar{x}}_{k}$
  3:     Compute $\nabla f ({\bar{x}}_{k})$
  4:     for $u = 0, 1, \dots, q - 1$ do
  5:         Sample a mini-batch $Z$ with $| Z | = | Z_{k} |$
  6:         Calculate $g_{u}^{k + 1} = \nabla f_{Z} (l_{u}^{k + 1}) - \nabla f_{Z} ({\bar{x}}_{k}) + \nabla f ({\bar{x}}_{k})$, where $\nabla f_{Z} (l_{u}^{k + 1}) = \frac{1}{| Z |} \sum_{i \in Z} \nabla f_{i} (l_{u}^{k + 1})$;
  7:         Compute $l_{u + 1}^{k + 1} = l_{u}^{k + 1} - α_{k} H^{k + 1} g_{u}^{k + 1}$;
  8:     end for
  9:     Generate the updated matrix $H^{k + 1}$ by Algorithm 2;
  10:    ${\bar{x}}_{k + 1} = l_{q}^{k + 1}$
  11: end for
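The variance-reduced gradient in step 6 of Algorithm 3 is the SVRG-type estimator. The Python sketch below illustrates only that step, on a hypothetical least-squares toy problem. The names `vr_gradient` and `grad_batch`, and the data `A` and `b`, are assumptions made for this example.

```python
import numpy as np

def vr_gradient(grad_batch, x, x_snap, full_grad_snap, batch):
    """Step 6 of Algorithm 3: grad_Z(x) - grad_Z(x_snap) + grad f(x_snap)."""
    return grad_batch(x, batch) - grad_batch(x_snap, batch) + full_grad_snap

# Hypothetical toy problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

def grad_batch(x, batch):
    """Average gradient of the component functions indexed by `batch`."""
    idx = np.asarray(batch)
    residual = A[idx] @ x - b[idx]
    return A[idx].T @ residual / len(idx)

all_idx = np.arange(len(b))
x_snap = rng.standard_normal(3)                 # snapshot point (x-bar_k)
x = rng.standard_normal(3)                      # current inner iterate (l_u^{k+1})
full_grad_snap = grad_batch(x_snap, all_idx)    # full gradient at the snapshot

# Averaging the estimator over all single-sample batches recovers the full gradient at x.
estimates = [vr_gradient(grad_batch, x, x_snap, full_grad_snap, [i]) for i in all_idx]
mean_estimate = np.mean(estimates, axis=0)
```

The estimator is unbiased for the full gradient at the current iterate, and it coincides with the full gradient when the iterate equals the snapshot, which is why its variance shrinks as the inner iterates stay close to ${\bar{x}}_{k}$.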

3. Global Convergence Analysis

In this section, the convergence of Algorithms 1 and 3 will be discussed and analyzed.

3.1. Basic Assumptions

In the algorithm, it is assumed that the step size satisfies

\sum_{k = 1}^{+ \infty} α_{k} = + \infty, \quad \sum_{k = 1}^{+ \infty} α_{k}^{2} < + \infty .

(24)

Assumption 1.

$f : ℜ^{D_{x}} \to ℜ$ is continuously differentiable and, for any $x \in ℜ^{n}$, $f (x)$ is bounded below. Moreover, the gradient of $f$ is Lipschitz continuous; that is, there is a constant $L > 0$ such that

∥ \nabla f (l_{1}) - \nabla f (l_{2}) ∥ \leq L ∥ l_{1} - l_{2} ∥

(25)

for any $l_{1}, l_{2} \in ℜ^{n}$.

Assumption 2.

The stochastic gradient is unbiased and its noise level is bounded by $σ$, that is,

E_{τ_{k}} [g (x_{k}, τ_{k})] = \nabla f (x_{k}),

(26)

E_{τ_{k}} [∥ g (x_{k}, τ_{k}) - \nabla f (x_{k}) ∥^{2}] \leq σ^{2},

(27)

where $σ > 0$ and $E_{τ_{k}} [\cdot]$ denotes the expectation taken with respect to $τ_{k}$.

Assumption 3.

There are positive constants $h_{1}$ and $h_{2}$ such that

h_{1} I ⪯ H_{k} ⪯ h_{2} I .

(28)

Our random variables $τ_{k}$ are defined as follows: $τ_{k} = (τ_{k, 1}, \dots, τ_{k, z_{k}})$ are the random samplings in the $k$-th iteration, and $τ_{[k]} = (τ_{1}, \dots, τ_{k})$ are the random samplings in the first $k$ iterations.

Assumption 4.

For any $k \geq 2$, the random variable $H_{k}$ depends only on $τ_{[k - 1]}$.

From Assumptions 2 and 4, we get

E [H_{k} g_{k} | τ_{[k - 1]}] = H_{k} \nabla f (x_{k}) .

(29)

3.2. Key Propositions, Lemmas, and Theorem

Lemma 1.

If Assumptions 1–4 hold and $α_{k} \leq \frac{h_{1}}{L h_{2}^{2}}$ for all $k$, we have

E [f (x_{k + 1}) | x_{k}] \leq - \frac{1}{2} α_{k} h_{1} {∥ \nabla f (x_{k}) ∥}^{2} + f (x_{k}) + \frac{L σ^{2} h_{2}^{2}}{2 z_{k}} α_{k}^{2},

(30)

where the conditional expectation is taken with respect to $τ_{k}$.

Proof.

\begin{aligned} f (x_{k + 1}) &\leq f (x_{k}) + 〈 \nabla f (x_{k}), x_{k + 1} - x_{k} 〉 + \frac{L}{2} {∥ x_{k + 1} - x_{k} ∥}^{2} \\ &= f (x_{k}) - α_{k} 〈 \nabla f (x_{k}), H_{k} g_{k} 〉 + \frac{L α_{k}^{2}}{2} {∥ H_{k} g_{k} ∥}^{2} \\ &\leq f (x_{k}) - α_{k} 〈 \nabla f (x_{k}), H_{k} \nabla f (x_{k}) 〉 - α_{k} 〈 \nabla f (x_{k}), H_{k} (g_{k} - \nabla f (x_{k})) 〉 \\ &\quad + \frac{L α_{k}^{2} h_{2}^{2}}{2} {∥ g_{k} ∥}^{2} . \end{aligned}

(31)

Taking the expectation with respect to $τ_{k}$ on both sides of (31), conditioned on $x_{k}$, we obtain

E [f (x_{k + 1}) | x_{k}] \leq f (x_{k}) - α_{k} 〈 \nabla f (x_{k}), H_{k} \nabla f (x_{k}) 〉 + \frac{L α_{k}^{2} h_{2}^{2}}{2} E [∥ g_{k} ∥^{2} | x_{k}],

(32)

where we use the fact that $E [(g_{k} - \nabla f (x_{k})) | x_{k}] = 0$. From Assumption 2, it follows that

\begin{aligned} E [∥ g_{k} ∥^{2} | x_{k}] &= E [∥ g_{k} - \nabla f (x_{k}) + \nabla f (x_{k}) ∥^{2} | x_{k}] \\ &= ∥ \nabla f (x_{k}) ∥^{2} + 2 E [〈 g_{k} - \nabla f (x_{k}), \nabla f (x_{k}) 〉 | x_{k}] + E [∥ g_{k} - \nabla f (x_{k}) ∥^{2} | x_{k}] \\ &= ∥ \nabla f (x_{k}) ∥^{2} + E [∥ g_{k} - \nabla f (x_{k}) ∥^{2} | x_{k}] \\ &\leq ∥ \nabla f (x_{k}) ∥^{2} + \frac{σ^{2}}{z_{k}} . \end{aligned}

Together with (32), we have

E [f (x_{k + 1}) | x_{k}] \leq f (x_{k}) - (α_{k} h_{1} - \frac{L}{2} α_{k}^{2} h_{2}^{2}) {∥ \nabla f (x_{k}) ∥}^{2} + \frac{L σ^{2} h_{2}^{2}}{2 z_{k}} α_{k}^{2} .

(33)

Then, combining this with $α_{k} \leq \frac{h_{1}}{L h_{2}^{2}}$ implies (30). □
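The last step of the proof of Lemma 1 can be spelled out as follows; this short derivation is added for clarity and only uses the step size condition stated in the lemma:

```latex
α_{k} \leq \frac{h_{1}}{L h_{2}^{2}}
\;\Longrightarrow\;
\frac{L}{2} α_{k}^{2} h_{2}^{2} \leq \frac{1}{2} α_{k} h_{1}
\;\Longrightarrow\;
α_{k} h_{1} - \frac{L}{2} α_{k}^{2} h_{2}^{2} \geq \frac{1}{2} α_{k} h_{1} .
```

Hence the coefficient of ${∥ \nabla f (x_{k}) ∥}^{2}$ in (33) is at most $- \frac{1}{2} α_{k} h_{1}$, which gives (30).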

Before proceeding further, the definition of a supermartingale is introduced [11].

Definition 1.

Let $\{L_{k}\}$ be an increasing sequence of σ-algebras. If $\{W_{k}\}$ is a stochastic process satisfying

(1): $E [| W_{k} |] < \infty$ ,
(2): $W_{k} \in L_{k}$ and $E [W_{k + 1} | L_{k}] \leq W_{k}$ , for all $k$,

then $\{W_{k}\}$ is called a supermartingale.

Proposition 1.

If $\{W_{k}\}$ is a nonnegative supermartingale, then $W_{k} \to W$ almost surely as $k \to \infty$ and $E [W] \leq E [W_{0}]$.

Lemma 2.

Let $\{x_{k}\}$ be generated by Algorithm 1, where the batch size is $z_{k} = z$ for all $k$. Then, there is a constant $M_{0}$ such that

E [f (x_{k})] \leq M_{0}

(34)

for all $k$.

Proof.

For convenience of explanation, we have the following definitions:

w_{k} = \frac{1}{2} α_{k} h_{1} {∥ \nabla f (x_{k}) ∥}^{2}, ψ_{k} = f (x_{k}) + \frac{L σ^{2} h_{2}^{2}}{2 z} \sum_{i = k}^{\infty} α_{i}^{2} .

(35)

Let $\underline{f}$ be the lower bound of the function and $W_{k}$ be the σ-algebra measuring $x_{k}$, $w_{k}$, and $ψ_{k}$. From the definition and Lemma 1, we obtain

\begin{aligned} E [ψ_{k + 1} | W_{k}] &= E [f (x_{k + 1}) | W_{k}] + \frac{L σ^{2} h_{2}^{2}}{2 z} \sum_{i = k + 1}^{\infty} α_{i}^{2} \\ &\leq f (x_{k}) - \frac{1}{2} α_{k} h_{1} {∥ \nabla f (x_{k}) ∥}^{2} + \frac{L σ^{2} h_{2}^{2}}{2 z} \sum_{i = k}^{\infty} α_{i}^{2} \\ &= ψ_{k} - w_{k} . \end{aligned}

(36)

Hence, we obtain

E [ψ_{k + 1} - \underline{f} | W_{k}] \leq ψ_{k} - w_{k} - \underline{f} .

As a result, we have

0 \leq E [ψ_{k + 1} - \underline{f}] \leq ψ_{1} - \underline{f} < \infty .

Since $f (x_{k}) \leq ψ_{k}$ for all $k$, (34) follows.

□

3.3. Global Convergence Theorem

In this part, we provide the convergence analysis of the proposed Algorithms 1 and 3.

Theorem 1.

Assume that Assumptions 1–4 hold for $\{x_{k}\}$ generated by Algorithm 1, where the batch size is $z_{k} = z$, and that the step size satisfies (24) and $α_{k} \leq \frac{h_{1}}{L h_{2}^{2}}$. Then, we have

\liminf_{k \to \infty} E [∥ \nabla f (x_{k}) ∥^{2}] = 0 \quad \text{with probability 1} .

(37)

Proof.

According to Definition 1, $ψ_{k} - \underline{f}$ is a supermartingale. Hence, there exists a $ψ$ such that $\lim_{k \to \infty} ψ_{k} = ψ$ with probability 1, and $E [ψ] \leq E [ψ_{1}]$ (Proposition 1). From (36), we have $E [w_{k}] \leq E [ψ_{k}] - E [ψ_{k + 1}]$. Thus,

E [\sum_{k = 1}^{\infty} w_{k}] \leq \sum_{k = 1}^{\infty} (E [ψ_{k}] - E [ψ_{k + 1}]) < \infty,

(38)

which means that

\sum_{k = 1}^{\infty} w_{k} = \frac{h_{1}}{2} \sum_{k = 1}^{\infty} α_{k} {∥ \nabla f (x_{k}) ∥}^{2} < + \infty \quad \text{with probability 1} .

(39)

Since $\sum_{k = 1}^{+ \infty} α_{k} = + \infty$ by (24), it follows that (37) holds. □

Next, the convergence of the algorithm can be given.

Theorem 2.

Assume that Assumptions 1, 2, and 4 hold for $\{x_{k}\}$ generated by Algorithm 1, where the batch size is $z_{k} = z$, and that the step size satisfies (24) and $α_{k} \leq \frac{h_{1}}{L h_{2}^{2}}$. Then, we have

\liminf_{k \to \infty} E [∥ \nabla f (x_{k}) ∥^{2}] = 0 \quad \text{with probability 1} .

(40)

Proof.

The proof proceeds by showing that the matrices $H_{k}$ are uniformly bounded, so that Theorem 1 applies. The discussion is as follows.

According to the definition of $y_{k}^{*}$, we have

\begin{aligned} s_{j}^{T} y_{j}^{*} &= s_{j}^{T} y_{j} + \frac{2 [f (x_{j - 1}) - f (x_{j})] + {(g_{j} + g_{j - 1})}^{T} s_{j}}{\max \{{(s_{j}^{T} y_{j})}^{2}, {∥ s_{j} ∥}^{4}\}} \cdot (s_{j}^{T} y_{j} y_{j}^{T} s_{j}) \\ &\leq s_{j}^{T} y_{j} + \frac{2 [f (x_{j - 1}) - f (x_{j})] + {(g_{j} + g_{j - 1})}^{T} s_{j}}{{(s_{j}^{T} y_{j})}^{2}} \cdot (s_{j}^{T} y_{j} y_{j}^{T} s_{j}) \\ &= s_{j}^{T} y_{j} + 2 [f (x_{j - 1}) - f (x_{j})] + {(g_{j} + g_{j - 1})}^{T} s_{j} \\ &= s_{j}^{T} y_{j} - 2 g {(x_{j - 1} + θ (x_{j} - x_{j - 1}))}^{T} s_{j} + {(g_{j} + g_{j - 1})}^{T} s_{j} \\ &= 2 s_{j}^{T} (g_{j} - g (x_{j - 1} + θ (x_{j} - x_{j - 1}))) \\ &\leq 2 (1 - θ) L ∥ s_{j} ∥ ∥ x_{j} - x_{j - 1} ∥ \\ &= 2 (1 - θ) L ∥ s_{j} ∥^{2}, \end{aligned}

where $θ \in (0, 1)$. It is easy to see that

m ∥ s_{j} ∥^{2} \leq s_{j}^{T} y_{j}^{*} \leq Λ {∥ s_{j} ∥}^{2},

(41)

where $Λ$ is a positive constant.

According to the definition of $y_{k}^{*}$, we have

\begin{aligned} ∥ y_{j}^{*} ∥ &= ∥ y_{j} + \frac{2 [f (x_{j - 1}) - f (x_{j})] + {(g_{j} + g_{j - 1})}^{T} s_{j}}{\max \{{(s_{j}^{T} y_{j})}^{2}, {∥ s_{j} ∥}^{4}\}} \cdot (y_{j} y_{j}^{T}) \cdot s_{j} ∥ \\ &\leq ∥ y_{j} ∥ + | 2 [f (x_{j - 1}) - f (x_{j})] + {(g_{j} + g_{j - 1})}^{T} s_{j} | ∥ \frac{y_{j} y_{j}^{T}}{∥ s_{j} ∥^{4}} s_{j} ∥ \\ &\leq ∥ y_{j} ∥ + | - 2 g {(x_{j - 1} + θ (x_{j} - x_{j - 1}))}^{T} s_{j} + {(g_{j} + g_{j - 1})}^{T} s_{j} | \cdot \frac{∥ y_{j} y_{j}^{T} ∥}{∥ s_{j} ∥^{4}} ∥ s_{j} ∥ \\ &\leq ∥ y_{j} ∥ + | {(g_{j} - g (x_{j - 1} + θ (x_{j} - x_{j - 1})))}^{T} s_{j} + {(g_{j - 1} - g (x_{j - 1} + θ (x_{j} - x_{j - 1})))}^{T} s_{j} | \cdot \frac{∥ y_{j} y_{j}^{T} ∥}{∥ s_{j} ∥^{3}} \\ &\leq L ∥ s_{j} ∥ + (L (1 - θ) ∥ s_{j} ∥^{2} + L θ ∥ s_{j} ∥^{2}) \cdot \frac{∥ y_{j} ∥ ∥ y_{j} ∥}{∥ s_{j} ∥^{3}} \\ &= 2 L ∥ s_{j} ∥ . \end{aligned}

(42)

From (41) and (42), we have

m \leq \frac{∥ y_{j}^{*} ∥^{2}}{s_{j}^{T} y_{j}^{*}} \leq \frac{{(2 L)}^{2} {∥ s_{j} ∥}^{2}}{m ∥ s_{j} ∥^{2}} = \frac{{(2 L)}^{2}}{m} = M_{0},

(43)

where the first inequality is derived from the quasi-Newton condition. This inequality shows that the eigenvalue of our initial matrix $B_{k}^{(0)} = \frac{y_{k}^{* T} y_{k}^{*}}{s_{k}^{T} y_{k}^{*}} I$ is bounded above and bounded away from zero.

Instead of directly analyzing the properties of $H_{k}$, we obtain the results by analyzing the properties of $B_{k}$. In this situation, the limited memory quasi-Newton updating formula is as follows:

(i): $B_{k}^{(0)} = \frac{y_{k}^{* T} y_{k}^{*}}{s_{k}^{T} y_{k}^{*}} I$ .
(ii): for $i = 0, \dots, r - 1$ , $j = k - (r - i)$ and

B_{k}^{(i + 1)} = B_{k}^{(i)} - \frac{B_{k}^{(i)} s_{j} s_{j}^{T} B_{k}^{(i)}}{s_{j}^{T} B_{k}^{(i)} s_{j}} + \frac{y_{j}^{*} y_{j}^{* T}}{s_{j}^{T} y_{j}^{*}} .

(44)

The trace of a matrix $B$ is denoted by $tr (B)$. Then, from (43) and (44), and the boundedness of $\{∥ B_{k}^{(0)} ∥\}$, we obtain

\begin{aligned} tr (B_{k + 1}) &\leq tr (B_{k}^{(0)}) + \sum_{i = 1}^{r} \frac{∥ y_{j_{i}}^{*} ∥^{2}}{s_{j_{i}}^{T} y_{j_{i}}^{*}} \\ &\leq tr (B_{k}^{(0)}) + r Λ \\ &= M_{1} . \end{aligned}

(45)

The determinant of $B_{k}$ is now considered, because it can be used to prove that the minimum eigenvalue of the matrix $B_{k}$ is uniformly bounded. From the theory in [12], we get the following equation for the determinant:

\begin{aligned} \det (B_{k + 1}) &= \det (B_{k}^{(0)}) \prod_{i = 1}^{r} \frac{y_{j_{i}}^{* T} s_{j_{i}}}{s_{j_{i}}^{T} B_{k}^{(i - 1)} s_{j_{i}}} \\ &= \det (B_{k}^{(0)}) \prod_{i = 1}^{r} \frac{y_{j_{i}}^{* T} s_{j_{i}}}{s_{j_{i}}^{T} s_{j_{i}}} \frac{s_{j_{i}}^{T} s_{j_{i}}}{s_{j_{i}}^{T} B_{k}^{(i - 1)} s_{j_{i}}} . \end{aligned}

(46)

It can be obtained from (45) that the maximum eigenvalue of the matrix $B_{j}$ is uniformly bounded. Therefore, according to (41) and the fact that the smallest eigenvalue of $B_{k}^{(0)}$ is bounded away from zero, the following inequality is obtained:

\begin{aligned} \det (B_{k + 1}) &\geq \det (B_{k}^{(0)}) {(\frac{m}{M_{1}})}^{r} \\ &\geq M_{2} . \end{aligned}

In this way, the maximum and minimum eigenvalues of the matrix $B_{j}$ are uniformly bounded above and bounded away from zero. Therefore, we get

h_{1} I ⪯ H_{k} ⪯ h_{2} I,

(47)

where $h_{1}$ and $h_{2}$ are positive constants. By Theorem 1, which we proved above, the convergence of the proposed Algorithm 1 follows. □

Corollary 1.

If Assumptions 1, 2, and 4 hold for $\{x_{k}\}$ generated by Algorithm 3, where the batch size is $z_{k} = z$ and the step size satisfies (24) and $α_{k} \leq \frac{h_{1}}{L h_{2}^{2}}$, then we have

\liminf_{k \to \infty} E [∥ \nabla f (x_{k}) ∥^{2}] = 0 \quad \text{with probability 1} .

(48)

4. The Complexity of the Proposed Algorithm

The convergence results of the algorithm have been discussed. Now, let us analyze the complexity of Algorithms 1 and 3.

Assumption 5.

For any k, we have

α_{k} = \frac{h_{1}}{L h_{2}^{2}} k^{- β}, β \in (0.5, 1) .

(49)

Theorem 3.

Suppose Assumptions 1–5 hold, $\{x_{k}\}$ is generated by Algorithm 1, and the batch size is $z_{k} = z$ for all $k$. Then, we have

\frac{1}{N} \sum_{k = 1}^{N} E [∥ \nabla f (x_{k}) ∥^{2}] \leq \frac{2 L (M_{0} - \underline{f}) h_{2}^{2}}{h_{1}^{2}} N^{β - 1} + \frac{σ^{2}}{(1 - β) z} (N^{- β} - N^{- 1} + \frac{1 - β}{N}),

(50)

where $N$ denotes the iteration number.

Moreover, for a given $ϵ \in (0, 1)$, to guarantee that $\frac{1}{N} \sum_{k = 1}^{N} E [∥ \nabla f (x_{k}) ∥^{2}] < ϵ$, the number of iterations $N$ needed is at most $O (ϵ^{- \frac{1}{1 - β}})$.

Proof.

Obviously, (49) satisfies (24) and the condition $α_{k} \leq \frac{h_{1}}{L h_{2}^{2}}$. Then, taking expectations on both sides of (30) and summing over all $k$ yield

\begin{aligned} \frac{1}{2} h_{1} \sum_{k = 1}^{N} E [∥ \nabla f (x_{k}) ∥^{2}] &\leq \sum_{k = 1}^{N} \frac{1}{α_{k}} (E [f (x_{k})] - E [f (x_{k + 1})]) + \frac{L σ^{2} h_{2}^{2}}{2 z} \sum_{k = 1}^{N} α_{k} \\ &= \frac{1}{α_{1}} f (x_{1}) + \sum_{k = 2}^{N} (\frac{1}{α_{k}} - \frac{1}{α_{k - 1}}) E [f (x_{k})] - \frac{E [f (x_{N + 1})]}{α_{N}} + \frac{L σ^{2} h_{2}^{2}}{2 z} \sum_{k = 1}^{N} α_{k} \\ &\leq \frac{M_{0}}{α_{1}} + M_{0} \sum_{k = 2}^{N} (\frac{1}{α_{k}} - \frac{1}{α_{k - 1}}) - \frac{\underline{f}}{α_{N}} + \frac{L σ^{2} h_{2}^{2}}{2 z} \sum_{k = 1}^{N} α_{k} \\ &= \frac{M_{0} - \underline{f}}{α_{N}} + \frac{L σ^{2} h_{2}^{2}}{2 z} \sum_{k = 1}^{N} α_{k} \\ &\leq \frac{L (M_{0} - \underline{f}) h_{2}^{2}}{h_{1}} N^{β} + \frac{σ^{2} h_{1}}{2 (1 - β) z} (N^{1 - β} - β), \end{aligned}

which results in (50), where the second inequality is due to Lemma 2, and the last inequality follows from the step size choice (49).
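The bound on the sum of step sizes used in the last inequality follows from a standard integral comparison. The short derivation below is added for completeness and relies only on the step size (49) with $β \in (0.5, 1)$:

```latex
\sum_{k = 1}^{N} k^{- β}
\leq 1 + \int_{1}^{N} x^{- β} \, dx
= 1 + \frac{N^{1 - β} - 1}{1 - β}
= \frac{N^{1 - β} - β}{1 - β},
\qquad
\frac{L σ^{2} h_{2}^{2}}{2 z} \sum_{k = 1}^{N} α_{k}
= \frac{σ^{2} h_{1}}{2 z} \sum_{k = 1}^{N} k^{- β}
\leq \frac{σ^{2} h_{1}}{2 (1 - β) z} (N^{1 - β} - β) .
```

Dividing the final bound by $\frac{1}{2} h_{1} N$ then gives the two terms on the right-hand side of (50), since $N^{- β} - \frac{β}{N} = N^{- β} - N^{- 1} + \frac{1 - β}{N}$.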

Next, for a given $ϵ > 0$, in order to obtain $\frac{1}{N} \sum_{k = 1}^{N} E [∥ \nabla f (x_{k}) ∥^{2}] \leq ϵ$, it suffices that

\frac{2 (M_{0} - \underline{f}) L h_{2}^{2}}{h_{1}^{2}} N^{β - 1} + \frac{σ^{2}}{(1 - β) z} (N^{- β} - N^{- 1} + \frac{1 - β}{N}) < ϵ .

(51)

Since $β \in (0.5, 1)$, it follows that the number of iterations $N$ needed is at most $O (ϵ^{- \frac{1}{1 - β}})$.

□

Corollary 2.

Assume that Assumptions 1, 3, 4 and (27) hold for $\{x_{k}\}$ generated by Algorithm 3 with batch size $z_{k} = z$ for all $k$. We also assume that $α_{k}$ is specifically chosen as

α_{k} = \frac{h_{1}}{L h_{2}^{2}} k^{- β}

(52)

with $β \in (0.5, 1)$. Then,

\frac{1}{N} \sum_{k = 1}^{N} E [∥ \nabla f (x_{k}) ∥^{2}] \leq \frac{2 L (M_{0} - \underline{f}) h_{2}^{2}}{h_{1}^{2}} N^{β - 1} + \frac{σ^{2}}{(1 - β) z} (N^{- β} - N^{- 1} + \frac{1 - β}{N}),

(53)

where $N$ denotes the iteration number. Moreover, for a given $ϵ \in (0, 1)$, to guarantee that $\frac{1}{N} \sum_{k = 1}^{N} E [∥ \nabla f (x_{k}) ∥^{2}] < ϵ$, the number of iterations $N$ needed is at most $O (ϵ^{- \frac{1}{1 - β}})$.

5. Numerical Results

In this section, we focus on the numerical performances of the proposed Algorithm 3 for solving nonconvex empirical risk minimization (ERM) problems and nonconvex support vector machine (SVM) problems.

5.1. Experiments with Synthetic Datasets

The models of the nonconvex SVM problems and nonconvex ERM problems are given as follows, where $λ > 0$ is a regularization parameter.

Problem 1.

The ERM problem with a nonconvex sigmoid loss function [13,14] is formulated as follows:

\min_{x \in ℜ^{D_{x}}} \frac{1}{n} \sum_{i = 1}^{n} f_{i} (x) + \frac{λ}{2} {∥ x ∥}_{2}^{2}, \quad f_{i} (x) = \frac{1}{1 + \exp (b_{i} a_{i}^{T} x)},

(54)

where $a_{i} \in ℜ^{d}$ and $b_{i} \in \{- 1, 1\}$ represent the feature vector and corresponding label, respectively.
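To make Problem 1 concrete, the following Python sketch evaluates the objective (54) and its gradient. The function name `sigmoid_erm` and the array layout (rows of `A` are the feature vectors $a_{i}$) are assumptions of this example; the gradient formula follows from $\nabla f_{i} (x) = - f_{i} (x) (1 - f_{i} (x)) b_{i} a_{i}$.

```python
import numpy as np

def sigmoid_erm(x, A, b, lam):
    """Value and gradient of (1/n) * sum_i 1/(1 + exp(b_i a_i^T x)) + (lam/2) * ||x||^2.

    A: (n, d) array whose rows are a_i; b: (n,) array of labels in {-1, +1}.
    """
    t = b * (A @ x)
    f = 1.0 / (1.0 + np.exp(t))                 # per-sample sigmoid losses
    value = f.mean() + 0.5 * lam * float(x @ x)
    # d/dt [1 / (1 + e^t)] = -f * (1 - f), and dt/dx = b_i * a_i.
    grad = A.T @ (-f * (1.0 - f) * b) / len(b) + lam * x
    return value, grad
```

A finite-difference comparison is a simple way to check such a gradient before plugging it into any of the stochastic methods discussed here.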

Problem 2.

The nonconvex support vector machine (SVM) problem with a sigmoid loss function [15,16] is formulated as follows:

\min_{x \in ℜ^{D_{x}}} \frac{1}{n} \sum_{i = 1}^{n} f_{i} (x) + λ {∥ x ∥}^{2}, \quad f_{i} (x) = 1 - \tanh (b_{i} 〈x, a_{i}〉) .

(55)
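Problem 2 can be evaluated in the same way. The sketch below computes the objective (55) and its gradient; the function name `tanh_svm` and the array layout are assumptions of this example. Note that the regularizer here is $λ ∥ x ∥^{2}$, so its gradient is $2 λ x$.

```python
import numpy as np

def tanh_svm(x, A, b, lam):
    """Value and gradient of (1/n) * sum_i [1 - tanh(b_i <x, a_i>)] + lam * ||x||^2.

    A: (n, d) array whose rows are a_i; b: (n,) array of labels in {-1, +1}.
    """
    t = np.tanh(b * (A @ x))
    value = (1.0 - t).mean() + lam * float(x @ x)
    # d/du [1 - tanh(u)] = -(1 - tanh(u)^2), with u = b_i <x, a_i>.
    grad = A.T @ (-(1.0 - t ** 2) * b) / len(b) + 2.0 * lam * x
    return value, grad
```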

We compare the proposed LMLBFGS-VR algorithm with SGD [1], SVRG [17] and SAGA [18], where LMLBFGS-VR uses a diminishing step size and the other algorithms use a constant step size $α_{k}$. The data sets in our experiments include Adult, IJCNN, Mnist, and Covtype. All the codes are written in MATLAB 2018b on a PC with an AMD Ryzen 7 5800H with Radeon Graphics (3.20 GHz) and 16 GB of memory.

5.2. Numerical Results for Problem 1

In this subsection, we present the numerical results of LMLBFGS-VR, SGD, SVRG, and SAGA for solving Problem 1 on the four data sets. For LMLBFGS-VR, the step size is $α_{k} = 0.02 \times k^{- 0.6}$, the memory size is $r = 10$, and $m = 1 \times 10^{- 5}$. The step size of the other algorithms is chosen as 0.02. The number of inner loops $q$ is chosen as $n / V$ uniformly, where $V$ is the batch size. The batch size is set to 100 for Adult, IJCNN, and Covtype, and for Mnist. In order to further test the performance of the algorithm, the regularization parameter is set to $10^{- 3}$, $10^{- 4}$, or $10^{- 5}$. Figures 1–4 show the convergence performance of all the stochastic algorithms for solving Problem 1 with $λ = 1 \times 10^{- 3}$, $λ = 1 \times 10^{- 4}$, or $λ = 1 \times 10^{- 5}$ on the four data sets. From Figures 1–4, we see that all the algorithms solve the problem successfully. However, the proposed LMLBFGS-VR algorithm converges significantly faster than the other algorithms. It is clear that the proposed LMLBFGS-VR algorithm has a great advantage for solving nonconvex ERM problems.

5.3. Numerical Results for Problem 2

The numerical results of LMLBFGS-VR, SGD, SVRG, and SAGA for solving Problem 2 on the four data sets are presented in this subsection. All parameters are the same as in the previous subsection, and the regularization parameter is again set to 1 \times 10^{-3}, 1 \times 10^{-4}, or 1 \times 10^{-5}. In the figures, the y-axis is the objective function value and the x-axis is the number of effective passes, where computing a full gradient or evaluating n component gradients is regarded as one effective pass.

Figures 5–8 show the convergence performance of the LMLBFGS-VR algorithm on the four data sets; it remarkably outperforms the other algorithms. When λ = 1 \times 10^{-3}, the objective function is almost minimized within two effective passes. In contrast, SGD, SVRG, and SAGA converge more slowly, since they use only first-order information. Owing to the use of second-order information and the limited memory technique, LMLBFGS-VR requires only a few effective passes to minimize the function value. Figures 5–8 also show that, as λ decreases, the objective function decreases to a smaller value. Thus, we can choose a smaller λ for practical problems. Combined with the previous discussion, the LMLBFGS-VR algorithm makes great progress in improving the computing efficiency for nonconvex machine learning problems.
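The notion of an effective pass used on the x-axis can be made concrete with a small Python sketch. The helper names are hypothetical, and the per-epoch cost model is the usual accounting for an SVRG-style method (one full gradient at the snapshot, then two mini-batch gradients per inner iteration); it is an assumption, not a count taken from the authors' implementation.

```python
def effective_passes(num_component_gradients, n):
    """Convert a count of component-gradient evaluations into effective passes.

    One effective pass equals n component-gradient evaluations, which is
    also the cost of one full gradient.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return num_component_gradients / n


def svrg_style_epoch_cost(n, batch_size, inner_iterations):
    """Component-gradient evaluations in one outer epoch (assumed cost model).

    Assumption: one full gradient at the snapshot point (n evaluations),
    then each inner iteration evaluates the mini-batch gradient at both the
    current iterate and the snapshot (2 * batch_size evaluations).
    """
    return n + 2 * batch_size * inner_iterations


# Hypothetical example: n = 1000 samples, V = 100, q = n / V = 10.
cost = svrg_style_epoch_cost(n=1000, batch_size=100, inner_iterations=10)
print(effective_passes(cost, n=1000))
```

With q = n / V, this cost model gives n + 2n component gradients per outer epoch, that is, three effective passes, which is why a method that nearly minimizes the objective within two effective passes has barely completed one such epoch.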

6. Conclusions

In this paper, we proposed an efficient modified stochastic limited-memory BFGS algorithm for solving nonconvex stochastic optimization problems. The proposed algorithm preserves the positive definiteness of H_{k} without any convexity assumptions. The LMLBFGS-VR method with variance reduction was also presented for solving nonconvex stochastic optimization problems. Numerical experiments on nonconvex SVM problems and nonconvex ERM problems were performed to demonstrate the performance of the proposed algorithms, and the results indicated that our algorithms are comparable to other similar methods. In the future, we could consider the following points: (i) whether a proper line search can be used to determine an appropriate step size, which could reduce the complexity and enhance the accuracy of the algorithm; (ii) further experiments on practical problems to check the performance of the presented algorithms.

Author Contributions

Writing—original draft preparation, H.L.; writing—review and editing, Y.L. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China Grant No. 11661009, the High Level Innovation Teams and Excellent Scholars Program in Guangxi institutions of higher education Grant No. [2019]52, the Guangxi Natural Science Key Fund No. 2017GXNSFDA198046, the Special Funds for Local Science and Technology Development Guided by the Central Government No. ZY20198003, the special foundation for Guangxi Ba Gui Scholars, and the Basic Ability Improvement Project for Young and Middle-Aged Teachers in Guangxi Colleges and Universities No. 2020KY30018.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407.
2. Chung, K.L. On a stochastic approximation method. Ann. Math. Stat. 1954, 25, 463–483.
3. Polyak, B.T.; Juditsky, A.B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 1992, 30, 838–855.
4. Ruszczyński, A.; Syski, W. A method of aggregate stochastic subgradients with online stepsize rules for convex stochastic programming problems. In Stochastic Programming 84 Part II; Springer: Berlin/Heidelberg, Germany, 1986; pp. 113–131.
5. Wright, S.; Nocedal, J. Numerical Optimization; Springer: Berlin/Heidelberg, Germany, 1999; Volume 35, p. 7.
6. Bordes, A.; Bottou, L. SGD-QN: Careful quasi-Newton stochastic gradient descent. J. Mach. Learn. Res. 2009, 10, 1737–1754.
7. Byrd, R.H.; Hansen, S.L.; Nocedal, J.; Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 2016, 26, 1008–1031.
8. Gower, R.; Goldfarb, D.; Richtárik, P. Stochastic block BFGS: Squeezing more curvature out of data. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1869–1878.
9. Covei, D.P.; Pirvu, T.A. A stochastic control problem with regime switching. Carpathian J. Math. 2021, 37, 427–440.
10. Wei, Z.; Li, G.; Qi, L. New quasi-Newton methods for unconstrained optimization problems. Appl. Math. Comput. 2006, 175, 1156–1188.
11. Durrett, R. Probability: Theory and Examples; Cambridge University Press: Cambridge, UK, 2019.
12. Deng, N.Y.; Li, Z.F. Some global convergence properties of a conic-variable metric algorithm for minimization with inexact line searches. Numer. Algebra Control Optim. 1995, 5, 105–122.
13. Allen-Zhu, Z.; Hazan, E. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 699–707.
14. Shalev-Shwartz, S.; Shamir, O.; Sridharan, K. Learning kernel-based halfspaces with the 0–1 loss. SIAM J. Comput. 2011, 40, 1623–1646.
15. Ghadimi, S.; Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 2013, 23, 2341–2368.
16. Mason, L.; Baxter, J.; Bartlett, P.; Frean, M. Boosting algorithms as gradient descent in function space. Proc. Adv. Neural Inf. Process. Syst. 1999, 12, 512–518.
17. Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 2013, 26, 315–323.
18. Defazio, A.; Bach, F.; Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1646–1654.

Figure 1. Comparison of all the algorithms for solving Problem 1 on Adult. From left to right: λ = 1 \times 10^{-3}, λ = 1 \times 10^{-4}, λ = 1 \times 10^{-5}.

Figure 2. Comparison of all the algorithms for solving Problem 1 on Covtype. From left to right: λ = 1 \times 10^{-3}, λ = 1 \times 10^{-4}, λ = 1 \times 10^{-5}.

Figure 3. Comparison of all the algorithms for solving Problem 1 on IJCNN. From left to right: λ = 1 \times 10^{-3}, λ = 1 \times 10^{-4}, λ = 1 \times 10^{-5}.

Figure 4. Comparison of all the algorithms for solving Problem 1 on Mnist. From left to right: λ = 1 \times 10^{-3}, λ = 1 \times 10^{-4}, λ = 1 \times 10^{-5}.

Figure 5. Comparison of all the algorithms for solving Problem 2 on Adult. From left to right: λ = 1 \times 10^{-3}, λ = 1 \times 10^{-4}, λ = 1 \times 10^{-5}.

Figure 6. Comparison of all the algorithms for solving Problem 2 on Covtype. From left to right: λ = 1 \times 10^{-3}, λ = 1 \times 10^{-4}, λ = 1 \times 10^{-5}.

Figure 7. Comparison of all the algorithms for solving Problem 2 on IJCNN. From left to right: λ = 1 \times 10^{-3}, λ = 1 \times 10^{-4}, λ = 1 \times 10^{-5}.

Figure 8. Comparison of all the algorithms for solving Problem 2 on Mnist. From left to right: λ = 1 \times 10^{-3}, λ = 1 \times 10^{-4}, λ = 1 \times 10^{-5}.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

