Article

Accelerated Randomized Coordinate Descent for Solving Linear Systems

College of Science, China University of Petroleum, Qingdao 266580, China
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(22), 4379; https://doi.org/10.3390/math10224379
Submission received: 20 October 2022 / Revised: 11 November 2022 / Accepted: 12 November 2022 / Published: 21 November 2022
(This article belongs to the Section Computational and Applied Mathematics)

Abstract

The randomized coordinate descent (RCD) method is a simple but powerful approach to solving inconsistent linear systems. In order to accelerate this approach, the Nesterov accelerated randomized coordinate descent (NARCD) method is proposed. The randomized coordinate descent with momentum (RCDm) method was proposed by Loizou and Richtárik; we provide a new convergence bound for it. The global convergence rates of the two methods are established in this paper. In addition, we show that the RCDm method has an accelerated convergence rate when a proper momentum parameter is chosen. Finally, in numerical experiments, both the RCDm and the NARCD are faster than the RCD for uniformly distributed data. Moreover, the NARCD has a better acceleration effect than the RCDm and the Nesterov accelerated stochastic gradient descent method. The stronger the linear correlation of the columns of the matrix A, the better the NARCD acceleration.

1. Introduction

Consider a large-scale overdetermined linear system
\[
A x = b, \qquad (1)
\]
where $A \in \mathbb{R}^{m \times n}$ and $m \ge n$. We can solve the least-squares problem $\min_x \|b - Ax\|^2$. We assume that the columns of $A$ are normalized:
\[
\|A_i\| = 1. \qquad (2)
\]
This assumption has no substantial impact on the implementation costs: we could simply normalize each $A_i$ the first time the algorithm encounters it. However, we do not build assumption (2) into the algorithms, but include the factors $\|A_i\|$ as needed. Regardless of whether normalization is performed, our randomized algorithms yield the same sequence of iterates.
The coordinate descent (CD) technique [1], which can also be produced by applying the conventional Gauss–Seidel iteration method to the following normal equation [2], is one of the iteration methods that may be used to solve problem (1) cheaply and effectively:
\[
A^T A x = A^T b,
\]
and it is also equivalent to the unconstrained quadratic programming problem
\[
\min f(x) = \frac{1}{2} x^T A^T A x - b^T A x, \quad x \in \mathbb{R}^n.
\]
From [1], we can obtain
\[
x_{k+1} = x_k + \frac{\langle A_i, b - A x_k \rangle}{\|A_i\|^2} e_i.
\]
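To make this update concrete, the following is a minimal MATLAB sketch of the RCD iteration. It is illustrative only: the function name, the uniform column sampling, and the residual bookkeeping are our own choices, not the authors' code.

```matlab
% Minimal randomized coordinate descent (RCD) sketch for min ||b - A*x||.
% Illustrative only; rcd_sketch and its arguments are our own naming.
function x = rcd_sketch(A, b, K)
    [~, n] = size(A);
    x = zeros(n, 1);
    r = b - A*x;                         % residual b - A*x, kept up to date
    for k = 1:K
        i = randi(n);                    % pick a column uniformly at random
        alpha = (A(:, i)' * r) / norm(A(:, i))^2;
        x(i) = x(i) + alpha;             % coordinate update along e_i
        r = r - alpha * A(:, i);         % cheap residual update, no new A*x
    end
end
```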
The coordinate descent approach has a long history in optimization and has found applications in a wide range of fields, such as biological feature selection [3], machine learning [4], protein structure [5], and tomography [6,7]. Inspired by the randomized coordinate descent (RCD) method, many related works have been presented, such as greedy versions [8,9] and block versions [10,11,12] of the randomized coordinate descent. The coordinate descent method is a column projection method, while the Kaczmarz method [13] is a row projection method. The RCD method is inspired by the randomized Kaczmarz (RK) method [14]. A lot of relevant work has also been conducted on Kaczmarz-type approaches; readers may refer to [15,16,17,18,19,20].
In this paper, we use two methods to accelerate the RCD method for solving large systems of linear equations. First, we obtain an accelerated RCD method by adding Nesterov's acceleration mechanism to the traditional RCD algorithm, called the Nesterov accelerated randomized coordinate descent (NARCD) method. It is well known that, by using an appropriate multi-step technique [21], the traditional gradient method can be turned into a faster scheme. Nesterov later improved this accelerated format for solving unconstrained minimization problems with strongly convex objectives [22]. Second, we apply the heavy ball (momentum) method to accelerate the RCD. Polyak invented the heavy ball method [23], which is a common approach for speeding up the convergence rate of gradient-type algorithms; many researchers have looked into variations of it, see [24]. With these two accelerations of the RCD, we obtain the Nesterov accelerated randomized coordinate descent (NARCD) method and the randomized coordinate descent with momentum (RCDm) method.
In this paper, given a positive semidefinite matrix $M$, $\|x\|_M$ is defined as $\sqrt{x^T M x}$; $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ stand for the scalar product and the Euclidean norm (spectral norm for matrices). The column vector $e_i$ has a 1 at the $i$th position and 0 elsewhere. In addition, for a given matrix $A$, $A_i$, $\|A\|_F$, $\sigma_{\min}(A)$, and $A^T$ are used to denote its $i$th column, Frobenius norm, smallest nonzero singular value, and transpose, respectively. $A^+$ is the Moore–Penrose pseudoinverse of $A$. Note that $\lambda_1 = \frac{1}{\|(AA^T)^+\|}$. Let us denote by $i(k)$ the index randomly generated at iteration $k$, and let $I(k)$ denote all random indices that occurred at or before iteration $k$, so that
\[
I(k) = \{ i(k), i(k-1), \ldots, i(0) \},
\]
and the sequences $x_{k+1}$, $y_{k+1}$, $v_{k+1}$ are determined by $I(k)$. In the following proofs, we use $\mathbb{E}_{i(k)\mid I(k-1)}(\cdot)$ to denote the expectation of a random variable, conditioned on $I(k-1)$, with respect to the index $i(k)$, so that
\[
\mathbb{E}_{I(k)}(\cdot) = \mathbb{E}_{I(k-1)}\big(\mathbb{E}_{i(k)\mid I(k-1)}(\cdot)\big).
\]
The organization of this paper is as follows. In Section 2, we propose the NARCD method naturally and prove the convergence of the method. In Section 3, we propose the RCDm method and prove its convergence. In Section 4, to demonstrate the efficacy of our new methods, several numerical examples are offered. Finally, we present some brief concluding remarks in Section 5.

2. Nesterov’s Accelerated Randomized Coordinate Descent

The NARCD algorithm applies the Nesterov accelerated procedure [22], which is better known in the context of the gradient descent algorithm. The Nesterov acceleration scheme creates the sequences $\{x_k\}$, $\{y_k\}$, and $\{v_k\}$. When applied to $\min_x f(x)$, gradient descent sets $x_{k+1} = x_k - \theta_k \nabla f(x_k)$, where $\nabla f$ is the objective gradient and $\theta_k$ is the step size. We define the following iterative scheme:
\[
\begin{aligned}
y_k &= \alpha_k v_k + (1-\alpha_k) x_k, \\
x_{k+1} &= y_k - \theta_k \nabla f(y_k), \\
v_{k+1} &= \beta_k v_k + (1-\beta_k) y_k - \gamma_k \nabla f(y_k).
\end{aligned}
\]
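As a point of reference, the three-sequence scheme above can be written in MATLAB as follows. This is only an illustrative sketch of the generic accelerated gradient iteration with constant placeholder parameters (in the methods of this paper the parameters vary with $k$); the test data and parameter values are our own assumptions.

```matlab
% Generic three-sequence Nesterov scheme applied to f(x) = 0.5*||A*x - b||^2.
% Constant illustrative parameters, only meant to show the shape of the scheme.
A = unifrnd(0, 1, 200, 50);  b = randn(200, 1);
theta = 1 / norm(A)^2;                     % step size 1/L for this quadratic
alpha = 0.5;  beta = 0.9;  gamma = theta;  % placeholder parameter values
x = zeros(50, 1);  v = x;
for k = 1:1000
    y = alpha*v + (1 - alpha)*x;           % interpolation step
    g = A'*(A*y - b);                      % gradient of f at y
    x = y - theta*g;                       % gradient step
    v = beta*v + (1 - beta)*y - gamma*g;   % "estimate sequence" step
end
```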
The key feature of the above scheme is that it employs suitable values for the parameters $\alpha_k$, $\beta_k$, and $\gamma_k$, resulting in improved convergence over traditional gradient descent. In [25], the Nesterov accelerated procedure is applied to the Kaczmarz method, which is a row action method. The RCD is a column action method, and the Nesterov accelerated procedure can be applied in the same way. The relationship between the parameters $\alpha_k$, $\beta_k$, and $\gamma_k$ is given in [22,25]. Now, using the general setup of Nesterov's scheme, we can obtain the NARCD algorithm (Algorithm 1).
The framework of the NARCD method is given as follows.
Algorithm 1 Nesterov's accelerated randomized coordinate descent method (NARCD)
Input: $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $K \in \mathbb{N}$, $x_0 \in \mathbb{R}^n$, $\lambda \in [0, \lambda_1]$.
1: Initialize $v_0 = x_0$, $\gamma_{-1} = 0$, $k = 0$.
2: while $k < K$ do
3:   Choose $\gamma_k$ to be the larger root of
\[
\gamma_k^2 - \frac{\gamma_k}{n} = \Big(1 - \frac{\gamma_k \lambda}{n}\Big)\gamma_{k-1}^2. \qquad (5)
\]
4:   Set $\alpha_k$ and $\beta_k$ as follows:
\[
\alpha_k = \frac{n - \gamma_k \lambda}{\gamma_k (n^2 - \lambda)}, \qquad (6)
\]
\[
\beta_k = 1 - \frac{\lambda \gamma_k}{n}. \qquad (7)
\]
5:   Set $y_k = \alpha_k v_k + (1 - \alpha_k) x_k$.
6:   Choose $i = i(k)$ from $\{1, 2, \ldots, n\}$ with equal probability.
7:   $x_{k+1} = y_k + \frac{\langle A_i, b - A y_k\rangle}{\|A_i\|^2} e_i$.
8:   Set $v_{k+1} = \beta_k v_k + (1 - \beta_k) y_k + \gamma_k \frac{\langle A_i, b - A y_k\rangle}{\|A_i\|^2} e_i$.
9:   $k = k + 1$.
10: end while
Output: $x_K$
Remark 1.
In order to avoid computing the matrix–vector product $A y_k$ in steps 7 and 8, we adopt the following update:
\[
\begin{aligned}
Y_k &= \alpha_k V_k + (1-\alpha_k) X_k, \\
Z_k &= b - Y_k, \\
\mu_k &= \frac{\langle A_i, Z_k\rangle}{\|A_i\|^2}, \\
X_{k+1} &= Y_k + \mu_k A_i, \\
V_{k+1} &= \beta_k V_k + (1-\beta_k) Y_k + (\gamma_k \mu_k) A_i,
\end{aligned}
\]
with $X_0 = A x_0$ and $V_0 = X_0$. At the same time, we can use $r_k = b - Y_k$ to estimate the residual.
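A compact MATLAB sketch of Algorithm 1, using the Remark 1 bookkeeping so that no full matrix–vector product $A y_k$ is formed inside the loop, might look as follows. This is our own illustrative implementation (function and variable names are assumptions), not the authors' code.

```matlab
% Illustrative NARCD sketch (Algorithm 1 with the Remark 1 updates).
% lambda should lie in (0, lambda_1]; names are our own.
function x = narcd_sketch(A, b, K, lambda)
    [~, n] = size(A);
    x = zeros(n, 1);  v = x;
    X = A*x;  V = X;                      % X = A*x_k, V = A*v_k (Remark 1)
    gamma_prev = 0;                       % gamma_{-1} = 0
    for k = 1:K
        % larger root of gamma^2 - gamma/n = (1 - gamma*lambda/n)*gamma_prev^2
        c = (1 - lambda*gamma_prev^2) / n;
        gamma = (c + sqrt(c^2 + 4*gamma_prev^2)) / 2;
        alpha = (n - gamma*lambda) / (gamma*(n^2 - lambda));
        beta  = 1 - lambda*gamma/n;
        y = alpha*v + (1 - alpha)*x;
        Y = alpha*V + (1 - alpha)*X;      % Y = A*y_k without a fresh product
        i  = randi(n);
        mu = (A(:, i)' * (b - Y)) / norm(A(:, i))^2;
        x = y;              x(i) = x(i) + mu;                 % step 7
        X = Y + mu*A(:, i);
        v = beta*v + (1 - beta)*y;  v(i) = v(i) + gamma*mu;   % step 8
        V = beta*V + (1 - beta)*Y + gamma*mu*A(:, i);
        gamma_prev = gamma;
    end
end
```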
Lemma 1.
For any solution $x^*$ of $A^T A x = A^T b$, any $y \in \mathbb{R}^n$, and $P(y) = y + \frac{\langle A_i, b - A y\rangle}{\|A_i\|^2} e_i$ with $\|A_i\| = 1$, we have
\[
\mathbb{E}_i\big(\|A(P(y) - x^*)\|^2\big) = \|A(y - x^*)\|^2 - \frac{1}{n}\|A^T(Ay - b)\|^2,
\]
where the random variable $i$ follows the uniform distribution on the set $\{1, 2, \ldots, n\}$.
Proof. 
Using $\mathbb{E}_i$ to denote the expectation with respect to the index $i$, we have
\[
\begin{aligned}
\mathbb{E}_i\big(\|A(P(y)-x^*)\|^2\big)
&= \mathbb{E}_i\big(\|A(y + \langle A_i, b-Ay\rangle e_i - x^*)\|^2\big) \\
&= \mathbb{E}_i\big(\|A(y-x^*) + \langle A_i, b-Ay\rangle A_i\|^2\big) \\
&= \|A(y-x^*)\|^2 + \mathbb{E}_i\big(\|\langle A_i, b-Ay\rangle A_i\|^2\big) + 2\,\mathbb{E}_i\big\langle A(y-x^*), \langle A_i, b-Ay\rangle A_i\big\rangle \\
&= \|A(y-x^*)\|^2 + \frac{1}{n}\|A^T A(y-x^*)\|^2 + \frac{2}{n}\big\langle A(y-x^*), A A^T(b-Ay)\big\rangle \\
&= \|A(y-x^*)\|^2 - \frac{1}{n}\|A^T A(y-x^*)\|^2,
\end{aligned}
\]
where the last equality uses $A^T A x^* = A^T b$. □
Lemma 2.
For any $y \in \mathbb{R}^n$, we have
\[
\mathbb{E}_i\big(\|A_i\langle A_i, b-Ay\rangle\|^2_{(AA^T)^+}\big) \le \frac{1}{n}\|A^T A(y-x^*)\|^2,
\]
where the random variable $i$ follows the uniform distribution on the set $\{1, 2, \ldots, n\}$.
Proof. 
Let the compact singular value decomposition of $A$ be $A = U\Sigma V^T$, where $U \in \mathbb{R}^{m\times r}$, $V \in \mathbb{R}^{n\times r}$, $\Sigma \in \mathbb{R}^{r\times r}$, $r$ is the rank of $A$, $U^T U = I$, $V^T V = I$, and $\Sigma$ is diagonal with positive entries; then $(AA^T)^+ = U\Sigma^{-2}U^T$. We have
\[
\begin{aligned}
\mathbb{E}_i\big(\|A_i\langle A_i, b-Ay\rangle\|^2_{(AA^T)^+}\big)
&= \frac{1}{n}\sum_{i=1}^n \big\langle A_i\langle A_i, b-Ay\rangle, (AA^T)^+ A_i\langle A_i, b-Ay\rangle \big\rangle \\
&= \frac{1}{n}\operatorname{trace}\Big[(AA^T)^+ \sum_{i=1}^n A_i \langle A_i, b-Ay\rangle^2 A_i^T\Big] \\
&= \frac{1}{n}\operatorname{trace}\big[(AA^T)^+ A \operatorname{diag}(A^T(b-Ay))^2 A^T\big] \\
&= \frac{1}{n}\operatorname{trace}\big[U\Sigma^{-2}U^T U\Sigma V^T \operatorname{diag}(A^T(b-Ay))^2 V\Sigma U^T\big] \\
&= \frac{1}{n}\operatorname{trace}\big[U\Sigma^{-1} V^T \operatorname{diag}(A^T(b-Ay))^2 V\Sigma U^T\big] \\
&= \frac{1}{n}\operatorname{trace}\big[V^T \operatorname{diag}(A^T(b-Ay))^2 V\big] \\
&= \frac{1}{n}\big\|\operatorname{diag}(A^T(b-Ay))\, V\big\|_F^2 \\
&= \frac{1}{n}\sum_{i=1}^n \langle A_i, b-Ay\rangle^2 \|v_i\|^2 \\
&\le \frac{1}{n}\|A^T A(y-x^*)\|^2.
\end{aligned}
\]
The sixth equality is a consequence of $\operatorname{trace}(ABC) = \operatorname{trace}(BCA)$, and since $V = [v_1, v_2, \ldots, v_n]^T$ with $\|v_i\| \le 1$, the last inequality holds. □
Lemma 3.
Assume $n^2 - \lambda > 0$ and define
\[
\alpha_k = \frac{n - \gamma_k\lambda}{\gamma_k(n^2 - \lambda)}, \qquad \beta_k = 1 - \frac{\lambda\gamma_k}{n}.
\]
Then both sequences $\{\alpha_k\}$ and $\{\beta_k\}$ lie in the interval $[0, 1]$ if and only if $\gamma_k$ satisfies
\[
\frac{1}{n} \le \gamma_k \le \frac{n}{\lambda}.
\]
Moreover, with $\gamma_{-1} = 0$, if $\gamma_{k-1} \le \frac{1}{\sqrt{\lambda}}$, then $\gamma_k \in \big[\gamma_{k-1}, \frac{1}{\sqrt{\lambda}}\big]$.
Proof. 
The first part of the lemma clearly holds. For the second part, recall from (5) that $\gamma_k$ is the larger root of the following convex quadratic function:
\[
g(\gamma) = \gamma^2 - \frac{\gamma}{n}\big(1 - \lambda\gamma_{k-1}^2\big) - \gamma_{k-1}^2.
\]
We note the following:
\[
g(\gamma_{k-1}) = -\frac{\gamma_{k-1}}{n}\big(1 - \lambda\gamma_{k-1}^2\big) \le 0,
\]
\[
g\Big(\frac{1}{\sqrt{\lambda}}\Big) = \frac{1}{\lambda} - \frac{1}{n\sqrt{\lambda}}\big(1 - \lambda\gamma_{k-1}^2\big) - \gamma_{k-1}^2
= \frac{1}{\lambda} - \frac{1}{n\sqrt{\lambda}} + \gamma_{k-1}^2\Big(\frac{\sqrt{\lambda}}{n} - 1\Big)
\ge \frac{1}{\lambda} - \frac{1}{n\sqrt{\lambda}} + \frac{1}{\lambda}\Big(\frac{\sqrt{\lambda}}{n} - 1\Big) = 0,
\]
which together imply that $\gamma_k \in \big[\gamma_{k-1}, \frac{1}{\sqrt{\lambda}}\big]$. □
Lemma 4.
Let $a$, $b$, and $c$ be any vectors in $\mathbb{R}^n$; then the following identity holds:
\[
2\langle a - c, c - b\rangle = \|a - b\|^2 - \|a - c\|^2 - \|c - b\|^2.
\]
Theorem 1.
Consider the coordinate descent method with Nesterov's acceleration for solving linear equations, with $\lambda \in [0, \lambda_1]$, and let $x^*$ be the least-squares solution. Define $\sigma_1 = 1 + \frac{\sqrt{\lambda}}{2n}$ and $\sigma_2 = 1 - \frac{\sqrt{\lambda}}{2n}$. Then for all $k \ge 0$, we have
\[
\mathbb{E}\big(\|A(x_{k+1} - x^*)\|^2\big) \le \frac{4\lambda\,\|A(x_0 - x^*)\|^2}{\big(\sigma_1^{k+1} - \sigma_2^{k+1}\big)^2}
\]
and
\[
\mathbb{E}\big(\|A(v_{k+1} - x^*)\|^2_{(AA^T)^+}\big) \le \frac{4\,\|A(x_0 - x^*)\|^2_{(AA^T)^+}}{\big(\sigma_1^{k+1} + \sigma_2^{k+1}\big)^2}.
\]
Proof. 
We follow the standard notation and steps in [22,25]. By (5) and (6), the following relation holds:
\[
\frac{1 - \alpha_k}{\alpha_k} = \frac{n\gamma_{k-1}^2}{\gamma_k}. \qquad (13)
\]
From (5) and (7), we have
\[
\gamma_k^2 - \frac{\gamma_k}{n} - \beta_k\gamma_{k-1}^2 = 0. \qquad (14)
\]
Now, let us define $r_k^2 = \|A(v_k - x^*)\|^2_{(AA^T)^+}$. Then we have
\[
\begin{aligned}
r_{k+1}^2 &= \|A(v_{k+1} - x^*)\|^2_{(AA^T)^+} \\
&= \big\|A\big(\beta_k v_k + (1-\beta_k)y_k + \gamma_k\langle A_i, b - Ay_k\rangle e_i - x^*\big)\big\|^2_{(AA^T)^+} \\
&= \|A(\beta_k v_k + (1-\beta_k)y_k - x^*)\|^2_{(AA^T)^+} + \gamma_k^2\|A_i\langle A_i, b - Ay_k\rangle\|^2_{(AA^T)^+} \\
&\quad + 2\gamma_k\big\langle A(\beta_k v_k + (1-\beta_k)y_k - x^*), (AA^T)^+ A_i\langle A_i, b - Ay_k\rangle\big\rangle \\
&= \|A(\beta_k v_k + (1-\beta_k)y_k - x^*)\|^2_{(AA^T)^+} + \gamma_k^2\|A_i\langle A_i, b - Ay_k\rangle\|^2_{(AA^T)^+} \\
&\quad + 2\gamma_k\Big\langle A\Big(\beta_k\Big(\tfrac{1}{\alpha_k}y_k - \tfrac{1-\alpha_k}{\alpha_k}x_k\Big) + (1-\beta_k)y_k - x^*\Big), (AA^T)^+ A_i\langle A_i, b - Ay_k\rangle\Big\rangle \\
&= \|A(\beta_k v_k + (1-\beta_k)y_k - x^*)\|^2_{(AA^T)^+} + \gamma_k^2\|A_i\langle A_i, b - Ay_k\rangle\|^2_{(AA^T)^+} \\
&\quad + 2\gamma_k\Big\langle A(y_k - x^*) + \tfrac{1-\alpha_k}{\alpha_k}\beta_k A(y_k - x_k), (AA^T)^+ A_i\langle A_i, b - Ay_k\rangle\Big\rangle. \qquad (15)
\end{aligned}
\]
Now, we divide (15) into three parts and bound them separately. From the convexity of $\|\cdot\|^2_{(AA^T)^+}$ and the fact that $\beta_k \in [0,1]$ (Lemma 3), the first part of (15) satisfies
\[
\begin{aligned}
\|A(\beta_k v_k + (1-\beta_k)y_k - x^*)\|^2_{(AA^T)^+}
&= \|\beta_k A(v_k - x^*) + (1-\beta_k)A(y_k - x^*)\|^2_{(AA^T)^+} \\
&\le \beta_k\|A(v_k - x^*)\|^2_{(AA^T)^+} + (1-\beta_k)\|A(y_k - x^*)\|^2_{(AA^T)^+} \\
&= \beta_k\|A(v_k - x^*)\|^2_{(AA^T)^+} + \frac{\gamma_k\lambda}{n}\|A(y_k - x^*)\|^2_{(AA^T)^+} \\
&\le \beta_k\|A(v_k - x^*)\|^2_{(AA^T)^+} + \frac{\gamma_k}{n}\|A(y_k - x^*)\|^2, \qquad (16)
\end{aligned}
\]
where the last inequality makes use of $\lambda \le \lambda_1 = \frac{1}{\|(AA^T)^+\|}$. Using Lemmas 1 and 2, the second part of (15) satisfies
\[
\gamma_k^2\,\mathbb{E}_{i(k)\mid I(k-1)}\big(\|A_i\langle A_i, b - Ay_k\rangle\|^2_{(AA^T)^+}\big)
\le \frac{\gamma_k^2}{n}\|A^T A(y_k - x^*)\|^2
= \gamma_k^2\|A(y_k - x^*)\|^2 - \gamma_k^2\,\mathbb{E}_{i(k)\mid I(k-1)}\big(\|A(x_{k+1} - x^*)\|^2\big). \qquad (17)
\]
We use the identity of Lemma 4 in the last part of the proof. Taking expectations in the last part of (15), we obtain
\[
\begin{aligned}
&2\gamma_k\,\mathbb{E}_{i(k)\mid I(k-1)}\Big(\Big\langle A(y_k - x^*) + \tfrac{1-\alpha_k}{\alpha_k}\beta_k A(y_k - x_k), (AA^T)^+ A_i\langle A_i, b - Ay_k\rangle\Big\rangle\Big) \\
&\quad= 2\gamma_k\Big\langle A(y_k - x^*) + \tfrac{1-\alpha_k}{\alpha_k}\beta_k A(y_k - x_k), (AA^T)^+\,\mathbb{E}_{i(k)\mid I(k-1)}\big(A_i\langle A_i, b - Ay_k\rangle\big)\Big\rangle \\
&\quad= \frac{2\gamma_k}{n}\Big\langle A(y_k - x^*) + \tfrac{1-\alpha_k}{\alpha_k}\beta_k A(y_k - x_k), (AA^T)^+\sum_{i=1}^n A_i\langle A_i, b - Ay_k\rangle\Big\rangle \\
&\quad= \frac{2\gamma_k}{n}\Big\langle A(y_k - x^*) + \tfrac{1-\alpha_k}{\alpha_k}\beta_k A(y_k - x_k), (AA^T)^+ A A^T(b - Ay_k)\Big\rangle \\
&\quad= \frac{2\gamma_k}{n}\Big\langle A(y_k - x^*) + \tfrac{1-\alpha_k}{\alpha_k}\beta_k A(y_k - x_k), b - Ay_k\Big\rangle \\
&\quad= \frac{2\gamma_k}{n}\big\langle A(y_k - x^*), b - Ay_k\big\rangle + \frac{2\gamma_k}{n}\,\tfrac{1-\alpha_k}{\alpha_k}\beta_k\big\langle A(y_k - x_k), b - Ay_k\big\rangle \\
&\quad= -\frac{2\gamma_k}{n}\|A(y_k - x^*)\|^2 + \beta_k\gamma_{k-1}^2\big(\|A(x_k - x^*)\|^2 - \|A(y_k - x^*)\|^2 - \|A(y_k - x_k)\|^2\big) \\
&\quad\le -\Big(\frac{2\gamma_k}{n} + \beta_k\gamma_{k-1}^2\Big)\|A(y_k - x^*)\|^2 + \beta_k\gamma_{k-1}^2\|A(x_k - x^*)\|^2, \qquad (18)
\end{aligned}
\]
where the sixth equality makes use of (13) and the identity in Lemma 4. Substituting the three parts (16)–(18) into (15), we have
\[
\begin{aligned}
\mathbb{E}_{i(k)\mid I(k-1)}(r_{k+1}^2)
&\le \beta_k\|A(v_k - x^*)\|^2_{(AA^T)^+} + \frac{\gamma_k}{n}\|A(y_k - x^*)\|^2 + \gamma_k^2\|A(y_k - x^*)\|^2 \\
&\quad - \gamma_k^2\,\mathbb{E}_{i(k)\mid I(k-1)}\big(\|A(x_{k+1} - x^*)\|^2\big)
 - \Big(\frac{2\gamma_k}{n} + \beta_k\gamma_{k-1}^2\Big)\|A(y_k - x^*)\|^2 + \beta_k\gamma_{k-1}^2\|A(x_k - x^*)\|^2 \\
&= \beta_k\|A(v_k - x^*)\|^2_{(AA^T)^+} + \Big(\gamma_k^2 - \frac{\gamma_k}{n} - \beta_k\gamma_{k-1}^2\Big)\|A(y_k - x^*)\|^2 \\
&\quad - \gamma_k^2\,\mathbb{E}_{i(k)\mid I(k-1)}\big(\|A(x_{k+1} - x^*)\|^2\big) + \beta_k\gamma_{k-1}^2\|A(x_k - x^*)\|^2 \\
&= \beta_k\|A(v_k - x^*)\|^2_{(AA^T)^+} - \gamma_k^2\,\mathbb{E}_{i(k)\mid I(k-1)}\big(\|A(x_{k+1} - x^*)\|^2\big) + \beta_k\gamma_{k-1}^2\|A(x_k - x^*)\|^2, \qquad (19)
\end{aligned}
\]
where the last equality is a consequence of (14). Let us define two sequences $\{A_k\}$ and $\{B_k\}$ by
\[
B_{k+1}^2 = \frac{B_k^2}{\beta_k}, \qquad A_{k+1}^2 = \gamma_k^2 B_{k+1}^2. \qquad (20)
\]
Since $\beta_k \in (0,1]$, $B_k \ge 0$, and $B_0 \ne 0$, we have $B_{k+1} \ge B_k$. Because $\gamma_{-1} = 0$, we have $A_0 = 0$; moreover, $\gamma_k \in [\gamma_{k-1}, \frac{1}{\sqrt{\lambda}}]$ by Lemma 3, so $\{A_k\}$ is also an increasing sequence. Now, multiplying both sides of (19) by $B_{k+1}^2$ and using (20), we have
\[
B_{k+1}^2\,\mathbb{E}_{i(k)\mid I(k-1)}\|A(v_{k+1} - x^*)\|^2_{(AA^T)^+} + A_{k+1}^2\,\mathbb{E}_{i(k)\mid I(k-1)}\|A(x_{k+1} - x^*)\|^2
\le B_k^2\|A(v_k - x^*)\|^2_{(AA^T)^+} + A_k^2\|A(x_k - x^*)\|^2, \qquad (21)
\]
and then
\[
\begin{aligned}
&\mathbb{E}_{I(k)}\big(B_{k+1}^2\|A(v_{k+1} - x^*)\|^2_{(AA^T)^+} + A_{k+1}^2\|A(x_{k+1} - x^*)\|^2\big) \\
&\quad= \mathbb{E}_{I(k-1)}\big(B_{k+1}^2\,\mathbb{E}_{i(k)\mid I(k-1)}\|A(v_{k+1} - x^*)\|^2_{(AA^T)^+} + A_{k+1}^2\,\mathbb{E}_{i(k)\mid I(k-1)}\|A(x_{k+1} - x^*)\|^2\big) \\
&\quad\le \mathbb{E}_{I(k-1)}\big(B_k^2\|A(v_k - x^*)\|^2_{(AA^T)^+} + A_k^2\|A(x_k - x^*)\|^2\big) \\
&\quad\le \cdots \le \mathbb{E}_{I(0)}\big(B_1^2\|A(v_1 - x^*)\|^2_{(AA^T)^+} + A_1^2\|A(x_1 - x^*)\|^2\big) \\
&\quad\le B_0^2\|A(v_0 - x^*)\|^2_{(AA^T)^+} + A_0^2\|A(x_0 - x^*)\|^2 \\
&\quad= B_0^2\|A(v_0 - x^*)\|^2_{(AA^T)^+} = B_0^2\|A(x_0 - x^*)\|^2_{(AA^T)^+}. \qquad (22)
\end{aligned}
\]
So, by (22), we can obtain
\[
\mathbb{E}\,\|A(v_{k+1} - x^*)\|^2_{(AA^T)^+} \le \frac{B_0^2}{B_{k+1}^2}\|A(x_0 - x^*)\|^2_{(AA^T)^+},
\qquad
\mathbb{E}\,\|A(x_{k+1} - x^*)\|^2 \le \frac{B_0^2}{A_{k+1}^2}\|A(x_0 - x^*)\|^2_{(AA^T)^+}. \qquad (23)
\]
We now need to analyze the growth of the two sequences $\{A_k\}$ and $\{B_k\}$. Following the proofs in [22,26] for the Nesterov accelerated scheme and in [25] for the accelerated sampling Kaczmarz–Motzkin algorithm, we have
\[
B_k^2 = \beta_k B_{k+1}^2 = \Big(1 - \frac{\lambda\gamma_k}{n}\Big)B_{k+1}^2 = \Big(1 - \frac{\lambda A_{k+1}}{n B_{k+1}}\Big)B_{k+1}^2.
\]
This implies
\[
B_k^2 = \Big(1 - \frac{\lambda A_{k+1}}{n B_{k+1}}\Big)B_{k+1}^2 = B_{k+1}^2 - \frac{\lambda}{n}A_{k+1}B_{k+1},
\]
and then
\[
\frac{\lambda}{n}A_{k+1}B_{k+1} = B_{k+1}^2 - B_k^2 = (B_{k+1} - B_k)(B_{k+1} + B_k) \le 2B_{k+1}(B_{k+1} - B_k).
\]
Moreover, because $\{B_k\}$ and $\{A_k\}$ are increasing sequences, we can simplify and obtain
\[
B_{k+1} \ge B_k + \frac{\lambda}{2n}A_{k+1} \ge B_k + \frac{\lambda}{2n}A_k. \qquad (24)
\]
Similarly, we have
\[
\frac{A_{k+1}^2}{B_{k+1}^2} - \frac{A_{k+1}}{n B_{k+1}} = \gamma_k^2 - \frac{\gamma_k}{n} = \beta_k\gamma_{k-1}^2 = \frac{A_k^2}{B_{k+1}^2},
\]
where the second equality uses (14) and the third equality uses (20). Using the above relationship, we have
\[
\frac{1}{n}A_{k+1}B_{k+1} = A_{k+1}^2 - A_k^2 = (A_{k+1} + A_k)(A_{k+1} - A_k) \le 2A_{k+1}(A_{k+1} - A_k).
\]
Therefore,
\[
A_{k+1} \ge A_k + \frac{B_k}{2n}. \qquad (25)
\]
By combining the two expressions (25) and (24), we have
\[
\begin{pmatrix} A_{k+1} \\ B_{k+1} \end{pmatrix} \ge
\begin{pmatrix} 1 & \frac{1}{2n} \\ \frac{\lambda}{2n} & 1 \end{pmatrix}^{k+1}
\begin{pmatrix} A_0 \\ B_0 \end{pmatrix}.
\]
The Jordan decomposition of the matrix in the above expression is
\[
\begin{pmatrix} 1 & \frac{1}{2n} \\ \frac{\lambda}{2n} & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 1 \\ \sqrt{\lambda} & -\sqrt{\lambda} \end{pmatrix}
\begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}
\begin{pmatrix} 1 & 1 \\ \sqrt{\lambda} & -\sqrt{\lambda} \end{pmatrix}^{-1},
\]
where $\sigma_1 = 1 + \frac{\sqrt{\lambda}}{2n}$ and $\sigma_2 = 1 - \frac{\sqrt{\lambda}}{2n}$. Because $A_0 = 0$, we have
\[
\begin{pmatrix} A_{k+1} \\ B_{k+1} \end{pmatrix}
\ge \begin{pmatrix} 1 & \frac{1}{2n} \\ \frac{\lambda}{2n} & 1 \end{pmatrix}^{k+1}
\begin{pmatrix} A_0 \\ B_0 \end{pmatrix}
= \begin{pmatrix} 1 & 1 \\ \sqrt{\lambda} & -\sqrt{\lambda} \end{pmatrix}
\begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix}^{k+1}
\begin{pmatrix} 1 & 1 \\ \sqrt{\lambda} & -\sqrt{\lambda} \end{pmatrix}^{-1}
\begin{pmatrix} 0 \\ B_0 \end{pmatrix}
= \frac{1}{2}\begin{pmatrix} \frac{(\sigma_1^{k+1} - \sigma_2^{k+1})B_0}{\sqrt{\lambda}} \\ (\sigma_1^{k+1} + \sigma_2^{k+1})B_0 \end{pmatrix}.
\]
This gives the growth bounds for the sequences $\{A_k\}$ and $\{B_k\}$. Substituting these bounds into (23), we have
\[
\mathbb{E}\,\|A(v_{k+1} - x^*)\|^2_{(AA^T)^+} \le \frac{B_0^2}{B_{k+1}^2}\|A(x_0 - x^*)\|^2_{(AA^T)^+} \le \frac{4\,\|A(x_0 - x^*)\|^2_{(AA^T)^+}}{\big(\sigma_1^{k+1} + \sigma_2^{k+1}\big)^2},
\]
\[
\mathbb{E}\,\|A(x_{k+1} - x^*)\|^2 \le \frac{B_0^2}{A_{k+1}^2}\|A(x_0 - x^*)\|^2_{(AA^T)^+} \le \frac{4\lambda\,\|A(x_0 - x^*)\|^2}{\big(\sigma_1^{k+1} - \sigma_2^{k+1}\big)^2},
\]
and the proof is complete. □
Remark 2.
From the relationship $y_k = \alpha_k v_k + (1-\alpha_k)x_k$ between $y_k$, $x_k$, and $v_k$, we know that
\[
\begin{aligned}
\mathbb{E}\,\|A(y_{k+1} - x^*)\|^2_{(AA^T)^+}
&= \mathbb{E}\,\big\|A\big(\alpha_{k+1}(v_{k+1} - x^*) + (1-\alpha_{k+1})(x_{k+1} - x^*)\big)\big\|^2_{(AA^T)^+} \\
&\le \alpha_{k+1}\,\mathbb{E}\,\|A(v_{k+1} - x^*)\|^2_{(AA^T)^+} + (1-\alpha_{k+1})\,\mathbb{E}\,\|A(x_{k+1} - x^*)\|^2_{(AA^T)^+},
\end{aligned}
\]
and
\[
\|A(x_{k+1} - x^*)\|^2_{(AA^T)^+} \le \|A(x_{k+1} - x^*)\|^2\,\|(AA^T)^+\| = \frac{\|A(x_{k+1} - x^*)\|^2}{\sigma_{\min}^4(A)},
\]
where $\sigma_{\min}(A)$ is the smallest nonzero singular value of $A$. By the above inequality and Theorem 1, we have
\[
\mathbb{E}\,\|A(y_{k+1} - x^*)\|^2_{(AA^T)^+}
\le \alpha_{k+1}\,\mathbb{E}\,\|A(v_{k+1} - x^*)\|^2_{(AA^T)^+} + \frac{1-\alpha_{k+1}}{\sigma_{\min}^4(A)}\,\mathbb{E}\,\|A(x_{k+1} - x^*)\|^2
\le \Bigg(\frac{4\alpha_{k+1}}{\big(\sigma_1^{k+1} + \sigma_2^{k+1}\big)^2} + \frac{4(1-\alpha_{k+1})\lambda}{\sigma_{\min}^4(A)\big(\sigma_1^{k+1} - \sigma_2^{k+1}\big)^2}\Bigg)\|A(x_0 - x^*)\|^2.
\]

3. Randomized Coordinate Descent with Momentum Method

The iterative formula of the gradient descent (GD) method is
\[
x_{k+1} = x_k - \lambda_k \nabla f(x_k),
\]
where $\lambda_k$ is a positive step-size parameter. Polyak [23] proposed the gradient descent method with momentum (GDm), also known as the heavy ball method, by introducing a momentum term $\delta(x_k - x_{k-1})$:
\[
x_{k+1} = x_k - \lambda_k \nabla f(x_k) + \delta(x_k - x_{k-1}),
\]
where $\delta$ is a momentum parameter. Letting $g(x_k)$ be an unbiased estimator of the true gradient $\nabla f(x_k)$, we have the stochastic gradient descent with momentum (mSGD) method:
\[
x_{k+1} = x_k - \lambda_k g(x_k) + \delta(x_k - x_{k-1}).
\]
The randomized coordinate descent with momentum (RCDm) method was proposed in [27]; we will give a new convergence bound for it. The RCDm method takes the explicit iterative form
\[
x_{k+1} = x_k + \frac{\langle A_i, b - A x_k\rangle}{\|A_i\|^2} e_i + \delta(x_k - x_{k-1}).
\]
The framework of the RCDm method is given as follows (Algorithm 2).
Algorithm 2 Randomized coordinate descent with momentum method (RCDm)
Input: $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, $K \in \mathbb{N}$, $x_0 \in \mathbb{R}^n$, $\delta$.
1: Initialize $k = 0$.
2: while $k < K$ do
3:   Choose $i = i(k)$ from $\{1, 2, \ldots, n\}$ with equal probability.
4:   $x_{k+1} = x_k + \frac{\langle A_i, b - A x_k\rangle}{\|A_i\|^2} e_i + \delta(x_k - x_{k-1})$.
5:   $k = k + 1$.
6: end while
Output: $x_K$
Remark 3.
In order to avoid computing the matrix–vector product $A x_k$ in step 4, we adopt the following update:
\[
\alpha_k = \frac{\langle A_i, r_k\rangle}{\|A_i\|^2}, \qquad
r_{k+1} = (1+\delta)r_k - \alpha_k A_i - \delta r_{k-1},
\]
with $r_0 = b - A x_0$ and $r_{-1} = r_0$.
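A minimal MATLAB sketch of Algorithm 2 with the Remark 3 residual recursion might look as follows; this is our own illustrative code and naming, with $\delta$ as the momentum parameter.

```matlab
% Illustrative RCDm sketch (Algorithm 2 with the Remark 3 residual update).
function x = rcdm_sketch(A, b, K, delta)
    [~, n] = size(A);
    x = zeros(n, 1);  x_prev = x;          % x_{-1} = x_0
    r = b - A*x;      r_prev = r;          % r_{-1} = r_0 = b - A*x_0
    for k = 1:K
        i = randi(n);
        alpha = (A(:, i)' * r) / norm(A(:, i))^2;
        x_new = x;  x_new(i) = x_new(i) + alpha;
        x_new = x_new + delta*(x - x_prev);                    % momentum term
        r_new = (1 + delta)*r - alpha*A(:, i) - delta*r_prev;  % Remark 3
        x_prev = x;  x = x_new;
        r_prev = r;  r = r_new;
    end
end
```

With delta = 0 this reduces to the plain RCD iteration.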
Lemma 5
([27]). Fix $F_1 = F_0 \ge 0$ and let $\{F_k\}_{k\ge 0}$ be a sequence of nonnegative real numbers satisfying the relation
\[
F_{k+1} \le a_1 F_k + a_2 F_{k-1}, \quad k \ge 1,
\]
where $a_2 \ge 0$, $a_1 + a_2 < 1$, and at least one of the coefficients $a_1, a_2$ is positive. Then the sequence satisfies $F_{k+1} \le q^k(1+\xi)F_0$ for all $k \ge 1$, where $q = \frac{a_1 + \sqrt{a_1^2 + 4a_2}}{2}$ and $\xi = q - a_1 \ge 0$. Moreover,
\[
q \ge a_1 + a_2,
\]
with equality if and only if $a_2 = 0$ (in that case, $q = a_1$ and $\xi = 0$).
Theorem 2.
Assume $\delta \ge 0$, and that the expressions $a_1 = 1 - \frac{\sigma_{\min}^2(A)}{n} + 3\delta - \frac{3\delta\sigma_{\min}^2(A)}{n} + 2\delta^2$ and $a_2 = 2\delta^2 + \delta - \frac{\delta\sigma_{\min}^2(A)}{n}$ satisfy $a_1 + a_2 < 1$, where $\sigma_{\min}(A)$ is the smallest nonzero singular value of $A$. Let $\{x_k\}_{k=0}^{\infty}$ be the iteration sequence generated by the RCDm method starting from the initial guess $x_0 = 0$. Then it holds that
\[
\mathbb{E}\big(\|A(x_{k+1} - x^*)\|^2\big) \le q^k(1+\xi)\|A(x_0 - x^*)\|^2,
\]
where $q = \frac{a_1 + \sqrt{a_1^2 + 4a_2}}{2}$, $\xi = q - a_1 \ge 0$, and $x^*$ is the least-squares solution. Moreover, $a_1$, $a_2$, $q$ obey $a_1 + a_2 \le q < 1$.
Proof. 
From the RCDm iteration, we have
\[
\begin{aligned}
\mathbb{E}_{i(k)\mid I(k-1)}\|A(x_{k+1} - x^*)\|^2
&= \mathbb{E}_{i(k)\mid I(k-1)}\big\|A\big(x_k + \langle A_i, b - Ax_k\rangle e_i + \delta(x_k - x_{k-1}) - x^*\big)\big\|^2 \\
&= \mathbb{E}_{i(k)\mid I(k-1)}\big\|A(x_k - x^*) + A_i\langle A_i, b - Ax_k\rangle\big\|^2 + \delta^2\|A(x_k - x_{k-1})\|^2 \\
&\quad + 2\delta\,\mathbb{E}_{i(k)\mid I(k-1)}\big\langle A(x_k - x_{k-1}), A(x_k - x^*) + A_i\langle A_i, b - Ax_k\rangle\big\rangle. \qquad (28)
\end{aligned}
\]
We consider the three terms in (28) in turn. For the first term, we have
\[
\begin{aligned}
\mathbb{E}_{i(k)\mid I(k-1)}\big\|A(x_k - x^*) + A_i\langle A_i, b - Ax_k\rangle\big\|^2
&= \|A(x_k - x^*)\|^2 + \mathbb{E}_{i(k)\mid I(k-1)}\big\|A_i\langle A_i, b - Ax_k\rangle\big\|^2 \\
&\quad + 2\,\mathbb{E}_{i(k)\mid I(k-1)}\big\langle A(x_k - x^*), A_i\langle A_i, b - Ax_k\rangle\big\rangle \\
&= \|A(x_k - x^*)\|^2 + \frac{1}{n}\sum_{i=1}^n\big\|A_i\langle A_i, b - Ax_k\rangle\big\|^2 + \frac{2}{n}\sum_{i=1}^n\big\langle A(x_k - x^*), A_i\langle A_i, b - Ax_k\rangle\big\rangle \\
&= \|A(x_k - x^*)\|^2 - \frac{1}{n}\|A^T A(x_k - x^*)\|^2 \\
&\le \|A(x_k - x^*)\|^2 - \frac{\sigma_{\min}^2(A)}{n}\|A(x_k - x^*)\|^2
= \Big(1 - \frac{\sigma_{\min}^2(A)}{n}\Big)\|A(x_k - x^*)\|^2, \qquad (29)
\end{aligned}
\]
where the last inequality is a consequence of the singular value inequality ($\|Ax\|^2 \ge \sigma_{\min}^2(A)\|x\|^2$) and $n = \|A\|_F^2 \ge \sigma_{\min}^2(A)$. For the second term, we have
\[
\delta^2\|A(x_k - x_{k-1})\|^2 = \delta^2\|A(x_k - x^*) - A(x_{k-1} - x^*)\|^2 \le 2\delta^2\big(\|A(x_k - x^*)\|^2 + \|A(x_{k-1} - x^*)\|^2\big). \qquad (30)
\]
For the third term, we have
\[
\begin{aligned}
&2\delta\,\mathbb{E}_{i(k)\mid I(k-1)}\big\langle A(x_k - x_{k-1}), A(x_k - x^*) + A_i\langle A_i, b - Ax_k\rangle\big\rangle \\
&\quad= 2\delta\big\langle A(x_k - x_{k-1}), A(x_k - x^*)\big\rangle + 2\delta\,\mathbb{E}_{i(k)\mid I(k-1)}\big\langle A(x_k - x_{k-1}), A_i\langle A_i, b - Ax_k\rangle\big\rangle \\
&\quad= \delta\big(\|A(x_k - x_{k-1})\|^2 + \|A(x_k - x^*)\|^2 - \|A(x_{k-1} - x^*)\|^2\big)
 + \frac{2\delta}{n}\Big\langle A(x_k - x_{k-1}), \sum_{i=1}^n A_i\langle A_i, b - Ax_k\rangle\Big\rangle \\
&\quad= \delta\big(\|A(x_k - x_{k-1})\|^2 + \|A(x_k - x^*)\|^2 - \|A(x_{k-1} - x^*)\|^2\big)
 + \frac{2\delta}{n}\big\langle A(x_k - x_{k-1}), AA^T(b - Ax_k)\big\rangle \\
&\quad= \delta\big(\|A(x_k - x_{k-1})\|^2 + \|A(x_k - x^*)\|^2 - \|A(x_{k-1} - x^*)\|^2\big)
 - \frac{2\delta}{n}\big\langle A^T A(x_k - x_{k-1}), A^T A(x_k - x^*)\big\rangle \\
&\quad= \delta\big(\|A(x_k - x_{k-1})\|^2 + \|A(x_k - x^*)\|^2 - \|A(x_{k-1} - x^*)\|^2\big) \\
&\qquad - \frac{\delta}{n}\big(\|A^T A(x_k - x_{k-1})\|^2 + \|A^T A(x_k - x^*)\|^2 - \|A^T A(x_{k-1} - x^*)\|^2\big) \\
&\quad\le \delta\big(\|A(x_k - x_{k-1})\|^2 + \|A(x_k - x^*)\|^2 - \|A(x_{k-1} - x^*)\|^2\big) \\
&\qquad - \frac{\delta\sigma_{\min}^2(A)}{n}\big(\|A(x_k - x_{k-1})\|^2 + \|A(x_k - x^*)\|^2 - \|A(x_{k-1} - x^*)\|^2\big) \\
&\quad= \Big(\delta - \frac{\delta\sigma_{\min}^2(A)}{n}\Big)\big(\|A(x_k - x_{k-1})\|^2 + \|A(x_k - x^*)\|^2 - \|A(x_{k-1} - x^*)\|^2\big) \\
&\quad\le \Big(\delta - \frac{\delta\sigma_{\min}^2(A)}{n}\Big)\big(3\|A(x_k - x^*)\|^2 + \|A(x_{k-1} - x^*)\|^2\big), \qquad (31)
\end{aligned}
\]
where the second and fifth equalities use the identity in Lemma 4, the first inequality uses a singular value inequality, and the last inequality is a consequence of $\delta\|A(x_k - x_{k-1})\|^2 \le 2\delta\|A(x_k - x^*)\|^2 + 2\delta\|A(x_{k-1} - x^*)\|^2$. Using (29)–(31), we obtain
\[
\begin{aligned}
\mathbb{E}_{i(k)\mid I(k-1)}\|A(x_{k+1} - x^*)\|^2
&\le \Big(1 - \frac{\sigma_{\min}^2(A)}{n}\Big)\|A(x_k - x^*)\|^2 + 2\delta^2\big(\|A(x_k - x^*)\|^2 + \|A(x_{k-1} - x^*)\|^2\big) \\
&\quad + \Big(\delta - \frac{\delta\sigma_{\min}^2(A)}{n}\Big)\big(3\|A(x_k - x^*)\|^2 + \|A(x_{k-1} - x^*)\|^2\big).
\end{aligned}
\]
Taking the full expectation, we then have
\[
\mathbb{E}\,\|A(x_{k+1} - x^*)\|^2
\le \Big(1 - \frac{\sigma_{\min}^2(A)}{n} + 3\delta - \frac{3\delta\sigma_{\min}^2(A)}{n} + 2\delta^2\Big)\mathbb{E}\,\|A(x_k - x^*)\|^2
+ \Big(2\delta^2 + \delta - \frac{\delta\sigma_{\min}^2(A)}{n}\Big)\mathbb{E}\,\|A(x_{k-1} - x^*)\|^2.
\]
By Lemma 5, with $F_k = \mathbb{E}\,\|A(x_k - x^*)\|^2$, we have the relation
\[
F_{k+1} \le a_1 F_k + a_2 F_{k-1},
\]
and therefore
\[
\mathbb{E}\,\|A(x_k - x^*)\|^2 \le q^{k-1}(1+\xi)\|A(x_0 - x^*)\|^2,
\]
where $a_1 = 1 - \frac{\sigma_{\min}^2(A)}{n} + 3\delta - \frac{3\delta\sigma_{\min}^2(A)}{n} + 2\delta^2$, $a_2 = 2\delta^2 + \delta - \frac{\delta\sigma_{\min}^2(A)}{n}$, and $\xi = q - a_1 \ge 0$. This completes the proof. □
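For a concrete matrix and momentum parameter, the constants appearing in Theorem 2 can be evaluated directly. The helper below is our own sketch; it assumes $A$ has full column rank, so that min(svd(A)) is the smallest nonzero singular value.

```matlab
% Sketch: evaluate a1, a2, q and xi from Theorem 2 for given A and delta.
% Assumes A has full column rank; otherwise use the smallest *nonzero*
% singular value instead of min(svd(A)).
sn2 = min(svd(A))^2;                      % sigma_min(A)^2
n   = size(A, 2);
a1  = 1 - sn2/n + 3*delta - 3*delta*sn2/n + 2*delta^2;
a2  = 2*delta^2 + delta - delta*sn2/n;
assert(a1 + a2 < 1, 'Theorem 2 requires a1 + a2 < 1');
q   = (a1 + sqrt(a1^2 + 4*a2)) / 2;       % contraction factor
xi  = q - a1;                             % offset in the bound q^k*(1+xi)
```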
Remark 4.
Recall $a_1 = 1 - \frac{\sigma_{\min}^2(A)}{n} + 3\delta - \frac{3\delta\sigma_{\min}^2(A)}{n} + 2\delta^2$ and $a_2 = 2\delta^2 + \delta - \frac{\delta\sigma_{\min}^2(A)}{n}$. When $\delta = 0$, we obtain $a_1 = 1 - \frac{\sigma_{\min}^2(A)}{n}$ and $a_2 = 0$, which satisfy $a_1 + a_2 < 1$; when $\delta$ takes a small value, the relation $a_1 + a_2 < 1$ is still satisfied. In addition, the RCDm method degenerates to the RCD method when $\delta = 0$. The RCDm method converges faster than the RCD method if we choose a proper $\delta$; the numerical experiments will show the effectiveness of the RCDm method.
When $\delta \ge 0$, we can conclude that $a_2 \ge 0$. For the above theorem, we have to satisfy $a_1 + a_2 < 1$, where
\[
a_1 + a_2 = 4\delta^2 + 4\Big(1 - \frac{\sigma_{\min}^2(A)}{n}\Big)\delta + 1 - \frac{\sigma_{\min}^2(A)}{n}.
\]
Setting $\omega = \frac{\sigma_{\min}^2(A)}{n}$, it can be concluded that $\delta \in \big[0, \frac{\omega - 1 + \sqrt{(1-\omega)^2 + \omega}}{2}\big]$, and for $\delta$ in this range the RCDm method converges. However, in the later experiments the choice of $\delta$ may exceed this range, because a good deal of scaling was used in the derivation, so the admissible range given here is conservative.
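The admissible momentum range discussed above can be computed as follows; this is a small illustrative helper of ours, under the same full-column-rank assumption as before.

```matlab
% Sketch: theoretical upper bound on delta from Remark 4,
% with omega = sigma_min(A)^2 / n (assumes A has full column rank).
omega     = min(svd(A))^2 / size(A, 2);
delta_max = (omega - 1 + sqrt((1 - omega)^2 + omega)) / 2;
fprintf('RCDm is guaranteed to converge for delta in [0, %.4g]\n', delta_max);
```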

4. Numerical Experiments

In this section, we compare the influence of different δ on the RCDm algorithm and the effectiveness of the RCD, RCDm, and NARCD methods for solving the large linear system A x = b . All experiments were performed in MATLAB [28] (version R2018a), on a personal laptop with a 1.60 GHz central processing unit (Intel(R) Core(TM) i5-10210U CPU), 8.00 GB memory, and a Windows operating system (64 bits, Windows 10).
In all implementations, the starting point was chosen as x0 = zeros(n, 1), and the right-hand side vector was set to $b = A x^* + \epsilon$, where $\epsilon \in N(A^T)$ and $x^*$ = ones(n, 1). The relative residual error (RRE) at the $k$th iteration is defined as
\[
\mathrm{RRE} = \frac{\|b - A x_k\|^2}{\|b\|^2}.
\]
The iterations are terminated once the relative residual error satisfies $\mathrm{RRE} < 10^{-8}$ or the number of iteration steps exceeds 5,000,000; if the number of iteration steps exceeds 5,000,000, it is denoted as "-". IT and CPU denote the number of iteration steps and the CPU time (in seconds), respectively. The reported CPU and IT are the arithmetic averages of the elapsed running times and the required iteration steps over 50 repeated runs of the corresponding method. The speed-up of the RCD method against the RCDm method is defined as follows:
\[
\text{speed-up}_1 = \frac{\text{CPU of RCD}}{\text{CPU of RCDm}},
\]
and the speed-up of the RCD method against the NARCD method is defined as follows:
\[
\text{speed-up}_2 = \frac{\text{CPU of RCD}}{\text{CPU of NARCD}}.
\]
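For concreteness, the test problems and the stopping quantity described above could be set up along the following lines. This is our own sketch of the procedure described in the text (the projection used to obtain $\epsilon \in N(A^T)$ assumes $A$ has full column rank, and rcdm_sketch refers to the illustrative solver from Remark 3), not the authors' script.

```matlab
% Sketch of the experimental setup described in the text.
m = 800;  n = 300;
A = unifrnd(0, 1, m, n);            % uniformly distributed test matrix
xstar = ones(n, 1);
z = randn(m, 1);
eps_perp = z - A*(A\z);             % component of z in N(A^T) (full column rank A)
b = A*xstar + eps_perp;             % right-hand side with noise in N(A^T)

x = rcdm_sketch(A, b, 1e5, 0.3);    % any of the sketched solvers
RRE = norm(b - A*x)^2 / norm(b)^2;  % relative residual error, as defined above
```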

4.1. Experiments for Different δ on the RCDm

The matrix A is randomly generated using the MATLAB function unifrnd(0,1,m,n). We observe that RCDm with an appropriately chosen momentum parameter $0 < \delta \le 0.4$ always converges faster than its no-momentum variant. In this subsection, we let $\delta = 0, 0.1, 0.2, 0.3, 0.4$ to compare their performances. Numerical results are reported in Table 1, Table 2 and Table 3 and Figure 1. We can draw the following observation: when $\delta = 0.1, 0.2, 0.3, 0.4$, the acceleration effect is good.

4.2. Experiments for NARCD, RCDm, RCD, NASGD

Matrix A is randomly generated using the MATLAB function unifrnd(0,1,m,n). For the RCDm method, we take the momentum parameter $\delta = 0.3$; for the NARCD method, we take the Nesterov acceleration parameter $\lambda = 0.05$; for the Nesterov accelerated stochastic gradient descent (NASGD) method, the step size is $\alpha = 0.01$. We observe the performance of the RCD, RCDm, and NARCD methods for matrices A of different sizes. From Figure 2 and Table 4, Table 5, Table 6 and Table 7, we find that both the NARCD and the RCDm with appropriate momentum parameters accelerate the RCD; the NARCD and the RCDm always converge faster than the RCD. Moreover, the NARCD has a better acceleration effect than the RCDm. From Table 7, for the matrix $A \in \mathbb{R}^{8000 \times 3000}$, the NARCD method shows the best speed-up among the tested matrices, namely 3.0206. From Figure 3, we find that the speed-ups of the NARCD and RCDm methods change only gently as the matrix becomes larger, so these two methods still provide good speed-ups when the matrix is very large. From Figure 4, we find that the NARCD converges faster than the NASGD.

4.3. Experiment with Different Correlations of Matrix A

Matrix A is randomly generated using the MATLAB function unifrnd(c,1,m,n), $c \in [0, 1)$. We let $c = 0, 0.2, 0.4, 0.9$. For the RCDm method, we take the momentum parameter $\delta = 0.3$; for the NARCD method, we take the Nesterov acceleration parameter $\lambda = 0.05$. As the value of c increases, the correlation between the columns of A becomes stronger. From Table 8, Table 9, Table 10, Table 11 and Table 12, we see that as c increases, the condition number of the matrix increases; the larger the condition number, the more ill-conditioned the matrix, and the more time it takes to solve the system. From Table 10 and Table 12, we see that the acceleration effect of the RCDm does not change much as c increases, but the acceleration effect of the NARCD becomes better.

4.4. The Two-Dimensional Tomography Test Problems

In this section, we use the previously and newly proposed methods to reconstruct a 2D seismic travel-time tomography model. The 2D seismic travel-time tomography test problem is implemented in the function seismictomo(N, s, p) in the MATLAB package AIR Tools [29], which generates a sparse matrix A, an exact solution $x^*$ (shown in Figure 5a), and the right-hand side vector $b = A x^* + \epsilon$, where $\epsilon \in N(A^T)$. We set N = 20, s = 30, and p = 100 in the function seismictomo(N, s, p). We use the RCD, RCDm ($\delta = 0.3$), and NARCD ($\lambda = 0.05$) methods to solve the linear least-squares problem (1), running 90,000 iterations of each. From Figure 5, we see that, for the same number of iteration steps, the results of the NARCD method are better than those of the RCD and RCDm methods.
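A possible driver for this experiment is sketched below. It assumes AIR Tools is on the MATLAB path and that seismictomo returns [A, b, x] in that order, and it reuses the narcd_sketch routine from Section 2; the noise component in $N(A^T)$ mentioned in the text is omitted here for brevity.

```matlab
% Sketch of the seismic travel-time tomography experiment (AIR Tools).
N = 20;  s = 30;  p = 100;
[A, b, xexact] = seismictomo(N, s, p);   % sparse A, data b, exact solution x
xrec = narcd_sketch(A, b, 90000, 0.05);  % NARCD with lambda = 0.05

subplot(1, 2, 1); imagesc(reshape(xexact, N, N)); axis image; title('Exact');
subplot(1, 2, 2); imagesc(reshape(xrec,   N, N)); axis image; title('NARCD');
```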

5. Conclusions

To solve large systems of linear equations, two new acceleration methods for the RCD method are proposed, called the NARCD method and the RCDm method. Their convergence is proved, and estimates of the convergence rates of the NARCD method and the RCDm method are given, respectively. Both methods are shown to be effective in numerical experiments: for uniformly distributed data with appropriately chosen momentum parameters, the RCDm outperforms the RCD in IT and CPU, both the NARCD and the RCDm are faster than the RCD, and the NARCD has a better acceleration effect than the RCDm. In the case of an overdetermined linear system, for the NARCD method, the fatter the matrix (the larger n is relative to m), the better the acceleration. The acceleration effect of the NARCD also becomes better as c in the MATLAB function unifrnd(c, 1, m, n) increases. The block coordinate descent method is a very efficient method for solving large linear systems; in future work, it would be interesting to apply the two accelerated formats to the block coordinate descent method.

Author Contributions

Software, W.B.; Validation, F.Z.; Investigation, Q.W.; Writing—original draft, Q.W.; Writing—review and editing, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Key Research and Development program of China (2019YFC1408400).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Leventhal, D.; Lewis, A.S. Randomized methods for linear constraints: Convergence rates and conditioning. Math. Oper. Res. 2010, 35, 641–654. [Google Scholar] [CrossRef] [Green Version]
  2. Ruhe, A. Numerical aspects of Gram-Schmidt orthogonalization of vectors. Linear Algebra Its Appl. 1983, 52, 591–601. [Google Scholar] [CrossRef] [Green Version]
  3. Breheny, P.; Huang, J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 2011, 5, 232. [Google Scholar] [CrossRef] [Green Version]
  4. Chang, K.W.; Hsieh, C.J.; Lin, C.J. Coordinate descent method for large-scale l2-loss linear support vector machines. J. Mach. Learn. Res. 2008, 9, 1369–1398. [Google Scholar]
  5. Canutescu, A.A.; Dunbrack Jr, R.L. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 2003, 12, 963–972. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Bouman, C.A.; Sauer, K. A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Process. 1996, 5, 480–492. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Ye, J.C.; Webb, K.J.; Bouman, C.A.; Millane, R.P. Optical diffusion tomography by iterative-coordinate-descent optimization in a Bayesian framework. JOSA A 1999, 16, 2400–2412. [Google Scholar] [CrossRef]
  8. Bai, Z.Z.; Wu, W.T. On greedy randomized coordinate descent methods for solving large linear least-squares problems. Numer. Linear Algebra Appl. 2019, 26, e2237. [Google Scholar] [CrossRef]
  9. Zhang, J.; Guo, J. On relaxed greedy randomized coordinate descent methods for solving large linear least-squares problems. Appl. Numer. Math. 2020, 157, 372–384. [Google Scholar] [CrossRef]
  10. Lu, Z.; Xiao, L. On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 2015, 152, 615–642. [Google Scholar] [CrossRef] [Green Version]
  11. Necoara, I.; Nesterov, Y.; Glineur, F. Random block coordinate descent methods for linearly constrained optimization over networks. J. Optim. Theory Appl. 2017, 173, 227–254. [Google Scholar] [CrossRef]
  12. Richtárik, P.; Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 2014, 144, 1–38. [Google Scholar] [CrossRef] [Green Version]
  13. Karczmarz, S. Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Pol. Sci. Lett. Cl. Sci. Math. Nat. 1937, 35, 355–357. [Google Scholar]
  14. Strohmer, T.; Vershynin, R. A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 2009, 15, 262–278. [Google Scholar] [CrossRef] [Green Version]
  15. Bai, Z.Z.; Wu, W.T. On greedy randomized Kaczmarz method for solving large sparse linear systems. SIAM J. Sci. Comput. 2018, 40, A592–A606. [Google Scholar] [CrossRef]
  16. Bai, Z.Z.; Wu, W.T. On relaxed greedy randomized Kaczmarz methods for solving large sparse linear systems. Appl. Math. Lett. 2018, 83, 21–26. [Google Scholar] [CrossRef]
  17. Liu, Y.; Gu, C.Q. Variant of greedy randomized Kaczmarz for ridge regression. Appl. Numer. Math. 2019, 143, 223–246. [Google Scholar] [CrossRef]
  18. Guan, Y.J.; Li, W.G.; Xing, L.L.; Qiao, T.T. A note on convergence rate of randomized Kaczmarz method. Calcolo 2020, 57, 1–11. [Google Scholar] [CrossRef]
  19. Du, K.; Gao, H. A new theoretical estimate for the convergence rate of the maximal weighted residual Kaczmarz algorithm. Numer. Math. Theory Methods Appl. 2019, 12, 627–639. [Google Scholar]
  20. Yang, X. A geometric probability randomized Kaczmarz method for large scale linear systems. Appl. Numer. Math. 2021, 164, 139–160. [Google Scholar] [CrossRef]
  21. Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O (1/k2). Dokl. Akad. Nauk Sssr 1983, 269, 543–547. [Google Scholar]
  22. Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 2012, 22, 341–362. [Google Scholar] [CrossRef] [Green Version]
  23. Polyak, B.T. Some methods of speeding up the convergence of iteration methods. Ussr Comput. Math. Math. Phys. 1964, 4, 1–17. [Google Scholar] [CrossRef]
  24. Sun, T.; Li, D.; Quan, Z.; Jiang, H.; Li, S.; Dou, Y. Heavy-ball algorithms always escape saddle points. arXiv 2019, arXiv:1907.09697. [Google Scholar]
  25. Sarowar Morshed, M.; Saiful Islam, M. Accelerated Sampling Kaczmarz Motzkin Algorithm for The Linear Feasibility Problem. J. Glob. Optim. 2019, 77, 361–382. [Google Scholar] [CrossRef]
  26. Liu, J.; Wright, S. An accelerated randomized Kaczmarz algorithm. Math. Comput. 2016, 85, 153–178. [Google Scholar] [CrossRef]
  27. Loizou, N.; Richtárik, P. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. Comput. Optim. Appl. 2020, 77, 653–710. [Google Scholar] [CrossRef]
  28. Higham, D.J.; Higham, N.J. MATLAB Guide; SIAM: Philadelphia, PA, USA, 2016. [Google Scholar]
  29. Hansen, P.C.; Jørgensen, J.S. AIR Tools II: Algebraic iterative reconstruction methods, improved implementation. Numer. Algorithms 2018, 79, 107–137. [Google Scholar] [CrossRef]
Figure 1. (a,b): m = 300 rows and n = 150, 100 columns for different δ . (c,d): m = 800 rows and n = 300, 200 columns for different δ . (e,f): m = 8000 rows and n = 3000, 2000 columns for different δ .
Figure 2. (a,b): m = 4000 rows and n = 800, 1000 columns for RCD, RCDm, and NARCD. (c,d): m = 8000 rows and n = 2000, 3000 columns for RCD, RCDm, and NARCD. (e,f): m = 12,000 rows and n = 2000, 4000 columns for RCD, RCDm, and NARCD.
Figure 3. The speed-up of the RCD method against the NARCD and RCDm for matrices $A \in \mathbb{R}^{m \times n}$ with m = 300 × k and n = 100 × k.
Figure 4. m = 800 and n = 300 for NARCD and NASGD.
Figure 5. Performance of RCD, RCDm, and NARCD methods for the seismictomo test problem. (a) Exact seismic. (b) RCD. (c) RCDm. (d) NARCD.
Table 1. For different δ, IT and CPU of RCDm for matrices $A \in \mathbb{R}^{m \times n}$ with m = 8000 and different n.

            IT                                               CPU
δ       m × 3000   m × 2000   m × 1000   m × 800         m × 3000   m × 2000   m × 1000   m × 800
0       416,319    209,592    59,636     43,107          21.2625    9.9795     2.3001     1.9263
0.1     338,108    155,672    52,806     38,483          18.2890    7.8586     2.0735     1.5128
0.2     352,018    146,365    46,744     37,429          19.4076    7.3941     1.7775     1.4354
0.3     326,510    135,157    47,963     32,412          17.4489    6.8153     2.2577     1.2290
0.4     279,492    123,871    41,812     31,270          15.2037    6.2775     2.0089     1.2017
Table 2. For different δ, IT and CPU of RCDm for matrices $A \in \mathbb{R}^{m \times n}$ with m = 800 and different n.

            IT                                                  CPU
δ       800 × 300   800 × 200   800 × 100   800 × 50        800 × 300   800 × 200   800 × 100   800 × 50
0       43,760      18,200      5615        2449            0.2736      0.1239      0.0442      0.0149
0.1     39,442      16,200      5696        2188            0.2413      0.0929      0.0323      0.0131
0.2     30,900      15,724      5286        2427            0.2091      0.0861      0.0303      0.0150
0.3     28,501      14,047      4645        1858            0.2408      0.0791      0.0264      0.0118
0.4     25,918      12,211      4185        1735            0.1487      0.1374      0.0244      0.0114
Table 3. For different δ, IT and CPU of RCDm for matrices $A \in \mathbb{R}^{m \times n}$ with m = 300 and different n.

            IT                                                  CPU
δ       300 × 200   300 × 150   300 × 100   300 × 50        300 × 200   300 × 150   300 × 100   300 × 50
0       90,517      32,685      13,310      3858            0.4722      0.1976      0.0641      0.0169
0.1     67,610      30,382      13,022      3121            0.4669      0.1719      0.0566      0.0137
0.2     77,748      31,879      11,382      2654            0.4529      0.1657      0.0492      0.0126
0.3     78,023      25,130      9888        2490            0.4528      0.1540      0.0489      0.0147
0.4     66,663      18,566      7965        2344            0.4037      0.1046      0.0411      0.0115
Table 4. IT and CPU of RCD, RCDm, and NARCD for matrices $A \in \mathbb{R}^{m \times n}$ with m = 4000 and different n.

            IT                               CPU
        4000 × 800   4000 × 1000         4000 × 800   4000 × 1000
RCD     57,723       81,926              1.3939       1.9264
RCDm    44,962       66,425              1.1068       1.5824
NARCD   20,075       24,184              0.8657       0.9711
Table 5. IT and CPU of RCD, RCDm, and NARCD for matrices $A \in \mathbb{R}^{m \times n}$ with m = 8000 and different n.

            IT                                 CPU
        8000 × 2000   8000 × 3000          8000 × 2000   8000 × 3000
RCD     194,046       414,465              9.4357        24.2728
RCDm    144,592       312,665              8.0861        19.1512
NARCD   45,700        71,216               4.5801        8.0357
Table 6. IT and CPU of RCD, RCDm, and NARCD for matrices $A \in \mathbb{R}^{m \times n}$ with m = 12,000 and different n.

            IT                                     CPU
        12,000 × 2000   12,000 × 4000          12,000 × 2000   12,000 × 4000
RCD     146,007         465,484                10.8746         31.4180
RCDm    110,339         340,633                8.5894          23.4675
NARCD   43,110          89,046                 5.7312          11.0535
Table 7. The speed-up of the RCD method against the NARCD and RCDm.

              4000 × 1000   8000 × 3000   12,000 × 4000
speed-up1     1.2173        1.2674        1.3387
speed-up2     1.9837        3.0206        2.8423
Table 8. The condition number of different matrices.

                        c = 0      c = 0.2    c = 0.4     c = 0.9
cond(A_{800 × 300})     75.6431    113.8689   172.7404    1425.0834
cond(A_{1000 × 800})    452.9536   673.9111   1104.0601   8969.5561
Table 9. IT and CPU of RCD, RCDm, and NARCD for matrices $A \in \mathbb{R}^{m \times n}$ with m = 800, n = 300.

            IT                                              CPU
        c = 0    c = 0.2   c = 0.4    c = 0.9           c = 0    c = 0.2   c = 0.4   c = 0.9
RCD     34,953   68,289    150,982    -                 0.1993   0.4009    0.9189    -
RCDm    30,908   54,842    120,490    4,374,385         0.1715   0.3642    0.6787    24.3558
NARCD   8921     14,960    25,714     469,083           0.0708   0.1148    0.2134    3.5497
Table 10. The speed-up of the RCD method against the NARCD and RCDm, $A \in \mathbb{R}^{m \times n}$ with m = 800, n = 300.

              c = 0     c = 0.2    c = 0.4    c = 0.9
speed-up1     1.1620    1.1007     1.3539     -
speed-up2     2.5291    3.4921     4.3059     -
Table 11. IT and CPU of RCD, RCDm, and NARCD for matrices $A \in \mathbb{R}^{m \times n}$ with m = 1000, n = 800.

            IT                                                      CPU
        c = 0       c = 0.2     c = 0.4     c = 0.9             c = 0    c = 0.2   c = 0.4   c = 0.9
RCD     1,111,363   1,694,262   3,609,833   -                   8.0508   11.9815   25.3595   -
RCDm    746,948     1,128,363   2,190,455   -                   5.0748   7.7701    15.2623   -
NARCD   90,521      145,186     209,795     1,123,632           0.8005   1.0775    1.8670    10.4081
Table 12. The speed-up of the RCD method against the NARCD and RCDm, $A \in \mathbb{R}^{m \times n}$ with m = 1000, n = 800.

              c = 0      c = 0.2    c = 0.4    c = 0.9
speed-up1     1.5864     1.5420     1.6615     -
speed-up2     10.0572    11.1197    13.5830    -
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
