Abstract
In order to perform big-data analytics, regression involving large matrices is often necessary. In particular, large-scale regression problems are encountered when one wishes to extract semantic patterns for knowledge discovery and data mining. When a large matrix can be processed in its factorized form, advantages arise in terms of computation, implementation, and data compression. In this work, we propose two new parallel iterative algorithms as extensions of the Gauss–Seidel algorithm (GSA) to solve regression problems involving many variables. A convergence study of the proposed iterative algorithms, in terms of error bounds, is also performed, and the required computational resources, namely the time- and memory-complexities, are evaluated to benchmark the efficiency of the proposed new algorithms. Finally, numerical results from both Monte Carlo simulations and real-world datasets are presented to demonstrate the effectiveness of our proposed new methods.
1. Introduction
With the advances of computer and internet technologies, tremendous amounts of data are processed and archived in our daily lives. Data-generating sources include the internet of things (IoT), social websites, smart devices, sensor networks, digital images/videos, multimedia signal archives for surveillance, business-activity records, web logs, health (medical) records, online libraries, eCommerce data, scientific research projects, smart cities, and so on [1,2]. This is why the quantity of data all over the world has been growing exponentially. The International Telecommunication Union (ITU) predicts that this exponential growth will continue and that overall data traffic for mobile devices alone will reach an astonishing five zettabytes (ZB) per month by 2030 [3].
In big-data analysis, matrices are utilized extensively in formulating problems with linear structure [4,5,6,7,8,9,10]. For example, matrix factorization techniques have been applied for topic modeling and text mining [11,12]. As another example, a bicycle demand–supply problem was formulated as a matrix-completion problem by modeling the bike-usage demand as a matrix whose two dimensions were defined as the time interval of a day and the region of a city [13]. For social networks, matrices such as adjacency and Laplacian matrices have been used to encode social–graph relations [14]. A special class of matrices, referred to as low-rank (high-dimensional) matrices, which often have many linearly dependent rows (or columns), is frequently encountered in various big-data analytics applications. Let us list several data analytics applications involving such high-dimensional, low-rank matrices: (i) system identification: low-rank (Hankel) matrices are used to represent low-order linear, time-invariant systems [15]; (ii) weight matrices: several signal-embedding problems, for example, multidimensional scaling (see [16]) and sensor positioning (see [17]), use weight matrices to represent the weights or distances between pairs of objects, and such weight matrices are often low-rank since most signals of interest appear only within subspaces of small dimensions; (iii) signals over graphs: the adjacency matrices used to describe connectivity structures, e.g., those resulting from communication and radar signals, social networks, and manifold learning, are low-rank in general (see [18,19,20,21,22]); (iv) intrinsic signal properties: various signals, such as collections of video frames, sensed signals, or network data, are highly correlated, and these signals can be represented by low-rank matrices (see [23,24,25,26,27,28,29]); (v) machine learning: the raw input data can be represented by low-rank matrices for artificial intelligence, natural language processing, and machine learning (see [30,31,32,33]).
Let us manipulate a simple algebraic expression to illustrate the underlying big-data problem. If is a high-dimensional, low-rank matrix, it is convenient to reformulate it in the factorized form of . There are quite a few advantages to working on the factorized form rather than the original matrix . The first advantage is computational efficiency. For example, the alternating least squares (ALS) method is often invoked for collaborative-filtering based recommendation systems. In the ALS method, one approximates the original matrix by , solving for by keeping fixed and then solving for by keeping fixed, iteratively. By repeating the aforementioned procedure alternately, the final solution can be obtained. The second advantage is resource efficiency. Since is usually large in dimension, the ALS method can reduce the required memory-storage space from the size to only . Such a reduction saves memory and can further reduce communication overhead significantly if one implements the ALS computations using the factorized matrices. Finally, the third advantage of applying the factorization technique to large matrices is data compression. Recall that principal component analysis (PCA) aims to extract the more relevant information from the raw data by considering those singular vectors corresponding to large singular values (deemed signals) while ignoring the data spanned by those singular vectors corresponding to small singular values (deemed noise). The objective of PCA is to efficiently approximate an original high-dimensional matrix by another matrix with a (much) smaller rank, i.e., low-rank approximation. Therefore, the factorization consequently leads to data compression.
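To make the alternating structure concrete, the following minimal NumPy sketch alternates the two least-squares updates; the rank k, iteration count, regularization, and matrix names U and V are illustrative assumptions rather than values or notation taken from this paper.

```python
import numpy as np

def als_factorize(A, k, n_iters=50, reg=1e-6):
    """Approximate A (m x n) by U @ V with U (m x k) and V (k x n) via
    alternating least squares; k, n_iters, and reg are illustrative choices."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((k, n))
    R = reg * np.eye(k)                       # small ridge term for numerical stability
    for _ in range(n_iters):
        # Keep U fixed and solve the least-squares problem for V
        V = np.linalg.solve(U.T @ U + R, U.T @ A)
        # Keep V fixed and solve the least-squares problem for U
        U = np.linalg.solve(V @ V.T + R, V @ A.T).T
    return U, V

# Example: recover a random rank-3 matrix
A = np.random.randn(200, 3) @ np.random.randn(3, 100)
U, V = als_factorize(A, k=3)
print(np.linalg.norm(A - U @ V) / np.linalg.norm(A))   # small relative error
```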
Generally speaking, given a vector (dependent variables), we are interested in the linear regression of (independent variables) onto for a better understanding of the relationship between the dependent and independent variables, because many data processing techniques are based on solving a linear-regression problem, for example, beamforming (see [34]), model selection (see [35,36]), robust matrix completion (see [37]), data processing for big data (see [38]), and kernel-based learning (see [39]). Most importantly, the Wiener–Hopf equations are frequently invoked in optimal or adaptive filter design [40]. When tremendous numbers of “taps” or “states” are considered, the correlation matrix in the Wiener–Hopf equations becomes very large in dimension. Thus, solving Wiener–Hopf equations with large dimensions is mathematically equivalent to solving a big-data-related linear-regression problem. Because the factorization of a large, big-data-related matrix (or a large correlation matrix) can bring us advantages (as previously discussed), the main contribution of this work is to propose new iterative methods that can work on the factorized matrices instead of the original matrix. By taking such a matrix-factorization approach, one can enjoy the associated benefits in computation, implementation, and representation when solving a linear-regression problem. Our main idea is to utilize a couple of stochastic iterative algorithms for solving the factorized subsystems with the Gauss–Seidel algorithm (GSA) in parallel and then combine the individual solutions to form the final approximate solution. There are many existing algorithms for solving large, linear systems of equations; however, the proposed GSA is easier to program and requires less computation per iteration than existing ones [41,42]. Moreover, we provide a parallel framework to accelerate the proposed GSA. Figure 1 presents a high-level illustration of the proposed new method. This approach can serve as a common framework for solving many large problems by use of approximate solutions. In this information-technology boom era, problems are often quite large and have to be solved by digital computers subject to finite precision. The proposed new divide-and-iterate method can be applied extensively in data processing for big data. This new approach is different from the conventional divide-and-conquer scheme, as no horizontal computations (mutual iterations among subproblems) exist in the conventional divide-and-conquer approach. Under the same divide-and-iterate approach, this work uses the GSA, instead of the Kaczmarz algorithm (KA) [43], to solve the factorized subsystems in parallel.
Figure 1.
Illustration of the proposed new divide-and-iterate approach.
The rest of this paper is organized as follows. The linear-regression problem and the Gauss–Seidel algorithm are discussed in Section 2. The proposed new iterative approach to solve a factorized system is presented in Section 3. The convergence of the proposed methods is validated in Section 4. The time- and memory-complexities of our proposed new approach are discussed in Section 5. The numerical experiments for the proposed new algorithms are presented in Section 6. Finally, conclusions are drawn in Section 7.
2. Solving Linear Regression Using Factorized Matrices and Gauss–Seidel Algorithm
A linear-regression problem (especially involving a large matrix) will be formulated using factorized matrices first in this section. Then, the Gauss–Seidel algorithm will be introduced briefly, as this algorithm needs to be invoked to solve the subproblems involving factorized matrices in parallel. Finally, the individual solutions to these subproblems will be combined to form the final solution.
2.1. Linear Regression: Divide-and-Iterate Approach
A linear regression is given by where and denotes the set of complex numbers. It is equivalent to the following:
where the matrix is decomposed as the product of the matrix and the matrix , , and . Generally, the dimension of is large in the context of big data. Therefore, it is not practical to solve the original regression problem directly. We propose to solve the following subproblems alternately:
and:
One can obtain the original linear-system solution to Equation (1) by first solving the sub-linear system given by Equation (2) and then substituting the intermediate solution into Equation (3) to obtain the final solution . A linear system is called consistent if it has at least one solution. On the other hand, it will be called inconsistent if there exists no solution. The sub-linear system can be solved by the Gauss–Seidel algorithm, which will be briefly introduced in the next subsection.
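As a concrete (non-iterative) illustration of this two-stage substitution, the following sketch solves the sub-system of Equation (2) and then Equation (3), with a generic least-squares routine standing in for the GSA; the matrix names B and C, the dimensions, and the random data are assumptions made only for this example.

```python
import numpy as np

# Assumed factorization A = B @ C with B (m x k), C (k x n), and k << min(m, n)
m, n, k = 500, 400, 10
B = np.random.randn(m, k)
C = np.random.randn(k, n)
b = np.random.randn(m)

# Step 1 (Eq. (2)): solve the sub-system B y = b for the intermediate vector y
y, *_ = np.linalg.lstsq(B, b, rcond=None)

# Step 2 (Eq. (3)): substitute y and solve C x = y for the final solution x
x, *_ = np.linalg.lstsq(C, y, rcond=None)

# With B of full column rank and C of full row rank, the two-stage solution
# coincides with the pseudo-inverse solution of the original system.
x_direct = np.linalg.pinv(B @ C) @ b
print(np.linalg.norm(x - x_direct))   # ~0
```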
2.2. Gauss–Seidel Algorithm and Its Extensions
The Gauss–Seidel algorithm (GSA) is an iterative algorithm for solving linear equations . It is named after the German mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel, and it is similar to the Jacobi method [44]. The randomized version of the Gauss–Seidel method converges linearly in expectation when the system is consistent [45].
Given and as in Equation (1), the randomized GSA will pick column of with probability , where denotes the set of complex numbers, is the j-th column of the matrix , is the Frobenius norm, and is the Euclidean norm. Thus, the solution will be updated as:
where t is the (iteration) index of the solution at the t-th step (iteration), is the j-th basis vector (a vector with 1 at the j-th position and 0 otherwise), and denotes the Hermitian adjoint of a matrix (vector).
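A minimal single-processor sketch of this randomized column-update rule is given below (real-valued data, the iteration budget, and the residual bookkeeping are illustrative assumptions; the parallel vector operations of Section 3 are omitted here).

```python
import numpy as np

def randomized_gsa(A, b, n_iters=20000, seed=0):
    """Randomized Gauss-Seidel sketch: pick column j with probability
    ||A_j||^2 / ||A||_F^2 and correct only the j-th coordinate of x."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    col_sq = np.sum(A * A, axis=0)           # ||A_j||^2 for every column j
    probs = col_sq / col_sq.sum()            # column-selection probabilities
    x = np.zeros(n)
    r = b - A @ x                            # running residual b - A x
    for _ in range(n_iters):
        j = rng.choice(n, p=probs)
        delta = A[:, j] @ r / col_sq[j]      # A_j^H (b - A x) / ||A_j||^2
        x[j] += delta                        # update only coordinate j
        r -= delta * A[:, j]                 # keep the residual up to date
    return x

# Consistent system: the iterates approach the unique solution
A = np.random.randn(300, 50)
x_true = np.random.randn(50)
print(np.linalg.norm(randomized_gsa(A, A @ x_true) - x_true))   # close to 0
```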
However, the randomized GSA updated by Equation (4) does not converge when the system of equations is inconsistent [45]. To overcome this problem, an extended GSA (EGSA) was proposed in [46]. The EGSA will pick row of with probability and pick column of with probability , where represents the i-th row of the matrix . Consequently, the solution will be updated as:
and:
When the EGSA is applied to consistent systems, it behaves exactly like the GSA. For inconsistent systems, the EGSA has been shown to converge linearly in expectation to the least-squares solution (, where denotes the pseudo-inverse based on the least-squares norm) according to [46].
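Because the exact update formulas of Equations (5) and (6) are not reproduced here, the sketch below follows the randomized extended Kaczmarz/Gauss–Seidel pattern analyzed in [46]: an auxiliary vector z estimates the component of b outside the range of the matrix, so that the iterates converge to the least-squares solution even for inconsistent systems. The specific update rules, step count, and real-valued data are assumptions for illustration.

```python
import numpy as np

def randomized_egsa(A, b, n_iters=20000, seed=0):
    """Extended randomized scheme for an inconsistent system A x = b: an
    auxiliary vector z removes the component of b outside range(A), so the
    iterate x converges in expectation to the least-squares solution."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    col_sq = np.sum(A * A, axis=0); col_p = col_sq / col_sq.sum()
    row_sq = np.sum(A * A, axis=1); row_p = row_sq / row_sq.sum()
    x = np.zeros(n)
    z = b.copy()                              # z -> part of b orthogonal to range(A)
    for _ in range(n_iters):
        j = rng.choice(n, p=col_p)            # column step: orthogonalize z against A_j
        z -= (A[:, j] @ z / col_sq[j]) * A[:, j]
        i = rng.choice(m, p=row_p)            # row step: Kaczmarz-type correction of x
        x += ((b[i] - z[i] - A[i, :] @ x) / row_sq[i]) * A[i, :]
    return x

# Inconsistent system: compare against the least-squares solution
A = np.random.randn(200, 30)
b = np.random.randn(200)                      # a generic b is not in range(A)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(randomized_egsa(A, b) - x_ls))   # close to 0
```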
3. Parallel Random Iterative Approach for Linear Systems
In this section, we will propose a novel parallel approach to deal with vector additions/subtractions and inner-product computations. This new parallel approach can increase the computational speed of the GSA and the EGSA, as stated in Section 3.1 and Section 3.2. Suppose that we have p processors (indexed by , , …, ) available to carry out vector computations in parallel. The data involved in such computations need to be allocated across the processors in a balanced manner. Such a balanced data load across all processors makes the best use of resources, maximizes the throughput, minimizes the computation time, and mitigates the chance of overloading any processor. Here we propose two strategies to assign data evenly, namely (i) cyclic distribution and (ii) block distribution. For the cyclic distribution, we assign the i-th component of a length-m vector to the corresponding processor as follows:
where denotes i modulo by p. On the other hand, for block distribution, we assign the i-th component of a length-m vector to the corresponding processor as follows:
where , ℓ specifies the block size such that , denotes the integer rounding-down operation, and denotes the integer rounding-up operation. The cyclic and block distributions for four processors are illustrated in Figure 2.
Figure 2.
Illustration of the cyclic and block distributions for p = 4.
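The two assignment rules can be written compactly as index maps; the sketch below uses 0-based indices and one common block-size convention (ceil(m/p)), which are assumptions for illustration.

```python
def cyclic_owner(i, p):
    """Cyclic distribution: component i (0-based) is assigned to processor i mod p."""
    return i % p

def block_owner(i, m, p):
    """Block distribution: contiguous blocks of size ceil(m/p) per processor."""
    block = -(-m // p)                        # ceil(m / p)
    return i // block

m, p = 12, 4
print([cyclic_owner(i, p) for i in range(m)])     # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
print([block_owner(i, m, p) for i in range(m)])   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```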
For example, the parallel computation of an inner product between two vectors using the cyclic distribution is illustrated by Figure 3.
Figure 3.
Illustration of an inner-product computation on the parallel platform using cyclic distribution ( and ).
In Figure 3, we illustrate how to compute the inner product of two vectors and in parallel via four processors. Processor 1 is employed to compute the inner product of the components indexed by 1, 5, and 9, so we obtain ; processor 2 is employed to compute the inner product of the components indexed by 2, 6, and 10, so we obtain ; processor 3 is employed to compute the inner product of the components indexed by 3, 7, and 11, so we obtain 4 × 3 + 1 × 3 + 3 × 4 = 27; finally, processor 4 is employed to compute the inner product of the components indexed by 4, 8, and 12, so we get 7 × + 4 × 4 + 4 × 0 = 2. The overall inner product is then obtained by adding the sub-inner products produced by the four processors.
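The following sketch mimics this computation with a small thread pool; the pool size, the cyclic slicing, and the use of threads instead of the paper's p hardware processors are illustrative assumptions.

```python
import numpy as np
from multiprocessing.dummy import Pool        # thread pool standing in for p processors

def partial_dot(args):
    u_part, v_part = args
    return float(u_part @ v_part)             # sub-inner product on one "processor"

def parallel_dot(u, v, p=4):
    """Cyclic distribution: processor q handles components q, q+p, q+2p, ..."""
    chunks = [(u[q::p], v[q::p]) for q in range(p)]
    with Pool(p) as pool:
        partials = pool.map(partial_dot, chunks)
    return sum(partials)                      # combine the p sub-inner products

u = np.random.randn(12)
v = np.random.randn(12)
print(np.isclose(parallel_dot(u, v), u @ v))  # True
```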
3.1. Consistent Linear Systems
The parallel random iterative algorithm to solve the original system formulated by Equation (1) is stated by Algorithm 1 if the original system is consistent. The idea here is to solve the sub-system formulated by Equation (2) and the sub-system formulated by Equation (3) alternately using the GSA. The symbols , , and represent parallel vector addition, subtraction, and inner-product, respectively, using p processors. Note that is the operation to scale a vector by a complex value. The parameter specifies the number of iterations required to perform the proposed algorithms. This quantity can be determined by the error tolerance of the solution (refer to Section 5.1 for detailed discussion).
| Algorithm 1 The Parallel GSA |
| Result: |
| Input: , , , |
| while do |
| Pick up column with probability ; |
| Update ; |
| Pick up column with probability ; |
| Update ; |
| end |
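A single-processor emulation of Algorithm 1 is sketched below: one randomized Gauss–Seidel column update on the sub-system of Equation (2) is alternated with one on the sub-system of Equation (3). The matrix names B and C, the iteration budget, and the use of plain NumPy vector operations in place of the parallel primitives are assumptions for illustration.

```python
import numpy as np

def parallel_gsa(B, C, b, T=20000, seed=0):
    """Sketch of Algorithm 1 for a consistent system B C x = b: alternate one
    randomized Gauss-Seidel column update on B y = b with one on C x = y.
    Plain NumPy vector operations stand in for the parallel primitives."""
    rng = np.random.default_rng(seed)
    m, k = B.shape
    _, n = C.shape
    B_sq = np.sum(B * B, axis=0); B_p = B_sq / B_sq.sum()
    C_sq = np.sum(C * C, axis=0); C_p = C_sq / C_sq.sum()
    y = np.zeros(k)
    x = np.zeros(n)
    for _ in range(T):
        j = rng.choice(k, p=B_p)                        # GSA column step on B y = b
        y[j] += B[:, j] @ (b - B @ y) / B_sq[j]
        l = rng.choice(n, p=C_p)                        # GSA column step on C x = y
        x[l] += C[:, l] @ (y - C @ x) / C_sq[l]
    return x

# Consistent example: b is constructed to lie in the range of B C
B = np.random.randn(400, 8); C = np.random.randn(8, 300)
b = B @ (C @ np.random.randn(300))
x_hat = parallel_gsa(B, C, b)
print(np.linalg.norm(B @ C @ x_hat - b))                # residual driven toward 0
```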
3.2. Inconsistent Linear Systems
If the original system formulated by Equation (1) is not consistent, Algorithm 2 is proposed to solve it instead. Algorithm 2 below is based on the EGSA.
| Algorithm 2 The Parallel EGSA |
| Result: |
| Input: , , , |
| while do |
| Pick up column with probability |
| Update |
| Pick up row with probability |
| Update |
| Pick up column with probability |
| Update |
| end |
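The sketch below mirrors the three picks in the listing (a column, a row, and a column) in a single-processor emulation: an extended (Kaczmarz-style) correction, as in the earlier EGSA sketch, drives the intermediate vector toward the least-squares solution of the first sub-system, and a Gauss–Seidel column step is then applied to the second sub-system. Which matrices the picks act on and the exact update formulas are our assumptions, since the paper's equations are not reproduced here.

```python
import numpy as np

def parallel_egsa(B, C, b, T=20000, seed=0):
    """Sketch of Algorithm 2 for an inconsistent system B C x = b: the first
    two picks apply an extended (Kaczmarz-style) correction that drives y
    toward the least-squares solution of B y = b, and the third pick applies
    a Gauss-Seidel column step to C x = y."""
    rng = np.random.default_rng(seed)
    m, k = B.shape
    _, n = C.shape
    Bc_sq = np.sum(B * B, axis=0); Bc_p = Bc_sq / Bc_sq.sum()   # columns of B
    Br_sq = np.sum(B * B, axis=1); Br_p = Br_sq / Br_sq.sum()   # rows of B
    Cc_sq = np.sum(C * C, axis=0); Cc_p = Cc_sq / Cc_sq.sum()   # columns of C
    z = b.copy()
    y = np.zeros(k)
    x = np.zeros(n)
    for _ in range(T):
        j = rng.choice(k, p=Bc_p)                       # pick a column of B: update z
        z -= (B[:, j] @ z / Bc_sq[j]) * B[:, j]
        i = rng.choice(m, p=Br_p)                       # pick a row of B: update y
        y += ((b[i] - z[i] - B[i, :] @ y) / Br_sq[i]) * B[i, :]
        l = rng.choice(n, p=Cc_p)                       # pick a column of C: update x
        x[l] += C[:, l] @ (y - C @ x) / Cc_sq[l]
    return x

B = np.random.randn(300, 6); C = np.random.randn(6, 200)
b = np.random.randn(300)                                # generic b: inconsistent system
x_ls = np.linalg.pinv(B @ C) @ b                        # reference least-squares solution
print(np.linalg.norm(C @ parallel_egsa(B, C, b) - C @ x_ls))   # close to 0
```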
4. Convergence Studies
The convergence studies for the two algorithms proposed in Section 3 are presented in Theorem 1 for consistent systems and Theorem 2 for inconsistent systems. The lemmas necessary for establishing the main theorems discussed in Section 4.2 are first presented in Section 4.1. All proofs will be written using vector operations without the subscript p because the parallel computations for vector operations lead to the same results regardless of the processor index p. Without loss of generality, the instances of the subscript p in Algorithms 1 and 2 are simply used to indicate that those computations can be carried out in parallel.
4.1. Auxiliary Lemmas
We define the metric for a matrix as:
where denotes the minimum nontrivial singular value of the matrix and . We present the following lemma, which establishes an identity related to the error bounds of our proposed iterative algorithms.
Lemma 1.
Let be a nonzero real matrix. For any vector in the range of , i.e., can be obtained by a linear combination of ’s columns (taking columns as vectors), we have:
Proof.
Because the singular values of and are the same and where the subscript i denotes the i-th largest singular value in magnitude, Lemma 1 is proven. □
Since the original solution to Equation (1) can be obtained by solving the factorized linear systems, Lemma 2 below can be utilized to bound the error arising from the solutions to the factorized sub-systems at each iteration.
Lemma 2.
The expected squared-norm for , or the error between the result at the t-th iteration and the optimal solution conditional on the first t iterations, is given by:
where the subscript of the statistical expectation operator indicates that the expectation should be taken over the random variable .
Proof.
Let be the one-step update in the GSA, so and: .
Then we have:
The equality results from adding and subtracting the same term “”. The equality holds because and are orthogonal to each other. The equality comes from Pythagoras’ Theorem since and are orthogonal to each other. The proof of the orthogonality between − and − is presented as follows: − is parallel to and − is perpendicular to because:
Therefore, − and − are orthogonal to each other. The relation is used to establish the identity . Recall that the expectation is conditional on the first t iterations. The law of iterated expectations in [47] is thereby applied here to establish the equality . Since the probability of selecting the column is , we can have the equality . The inequality comes from the fact that . The equality results from the definition of in Equation (9). Finally, the inequality comes from the fact that (according to the Cauchy–Schwarz inequality) for the matrix and the vector . □
Lemma 3.
Consider a linear, consistent system where has the dimension . If the Gauss–Seidel algorithm (GSA) with an initial guess ( denotes the set of real numbers) is applied to solve such a linear, consistent system, the expected squared-norm for can be bounded as follows:
Proof.
See Theorem 4.2 in [45]. □
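For orientation, the linear convergence of the randomized Gauss–Seidel method quoted from [45] is commonly stated in the following form for a consistent system with a full-column-rank matrix A (the exact norm and constants used in the paper's statement of Lemma 3, which are not reproduced above, may differ):

```latex
\mathbb{E}\left\| A\left(x^{(t)} - x^{\ast}\right) \right\|_{2}^{2}
\;\le\;
\left(1 - \frac{\sigma_{\min}^{2}(A)}{\|A\|_{F}^{2}}\right)^{\!t}
\left\| A\left(x^{(0)} - x^{\ast}\right) \right\|_{2}^{2}.
```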
The following lemma is presented to bound the iterative results for solving an inconsistent system using the extended Gauss–Seidel algorithm (EGSA).
Lemma 4.
Consider a linear, inconsistent system . If the extended Gauss–Seidel algorithm (EGSA) with an initial guess within the range of and is applied to solve such a linear, inconsistent system, the expected squared-norm for can be bounded as follows:
Proof.
Since:
and:
we have the following:
The expectation of the first term in Equation (19) can be bounded as:
Then, we have:
The expectation of the second term in Equation (19) is given by:
where the equality comes from the law of iterated expectations again for the conditional expectations (conditional on the i-th column at the -th iteration) and (conditional on the l-th row at the -th iteration), and the equality comes from the probability of selecting the l-th row to be .
From the GSA update rule, we can have the following inequality:
where the equality is established by applying the GSA one-step update, and the equality is based on the fact that . The inequality comes from Lemma 1. Based on this inequality and the law of iterated expectations, the expectation of the second term in Equation (19) can be bounded as:
Consequently, Lemma 4 is proven. □
4.2. Convergence Analysis
Having introduced the necessary lemmas in Section 4.1, we now present the main convergence theorems for the two proposed algorithms. Theorem 1 is established for consistent systems, while Theorem 2 is established for inconsistent systems.
Theorem 1.
Let be a low-rank matrix such that with a full-rank and a full-rank , where and . Suppose that the systems and have the optimal solutions and , respectively. The initial guesses are selected as and . Define . If is consistent, we have the following bound for :
On the other hand, for , we have:
Proof.
From Lemma 2, we have:
By applying the bound given by Lemma 3 to Equation (28), we get:
If , from the law of iterated expectations, we can rewrite Equation (29) as:
On the other hand, if , we have:
Consequently, Theorem 1 is proven. □
Theorem 2.
Let be a low-rank matrix such that with a full-rank and a full-rank , where and . The systems and have the optimal solutions and , respectively. The initial guesses are selected as , ,and . Define . If is inconsistent, we have the following bound for :
On the other hand, for , we have:
5. Complexity Analysis
In this section, the time- and memory-complexity analyses are presented for the algorithms proposed in Section 3. The details are given in the following subsections.
5.1. Time-Complexity Analysis
For a consistent system with , the error estimate can be bounded as:
where and is a constant related to the matrices and . If the (error) tolerance of is predefined by , one has to go through the “while-loop” in Algorithm 1 for at least times. For each while-loop in Algorithm 1, we need arithmetic operations for updating and another arithmetic operations for updating using p processors. Therefore, given the error limit not exceeding , the time-complexity for solving a consistent system with can be bounded as:
For a consistent system with , since the growth rate of the term is larger than that of the term , the error estimate can be bounded by (see Theorem 5 in [48]) as follows:
where is another constant related to the matrices and . As proven by Theorem 4 in [48], one has to iterate the while-loop in Algorithm 1 for at least times. For each while-loop in Algorithm 1, the time-complexity here (for ) is the same as that for solving the consistent system with instead. Therefore, the time-complexity for a consistent system with can be bounded as:
On the other hand, for an inconsistent system with , the error estimate can be bounded as:
where is a constant related to the matrices and . For a predefined tolerance , one has to go through the while-loop in Algorithm 2 for at least times. For each while-loop in Algorithm 2, it requires arithmetic operations for updating , arithmetic operations for updating , and another arithmetic operations for updating using p processors. Hence, given the error tolerance , the time-complexity for solving an inconsistent system with can be bounded as:
For an inconsistent system with , we can apply Theorem 5 in [48] to bound the error estimate as:
where is a constant related to the matrices and . Similar to the previous argument for solving a consistent system with , one should iterate the while-loop in Algorithm 2 for at least times. For each while-loop in Algorithm 2, the time-complexity is the same as that for solving an inconsistent system with . Therefore, the time-complexity for solving an inconsistent system with can be bounded as:
According to the time-complexity analysis earlier in this section, the worst case occurs when = 0, since it requires t→∞ in Equations (38), (41), (43), and (44). On the other hand, the best case occurs when is fairly large and all the constants (determined from the matrices and ), , , and are relatively small, so that only a single iteration is needed to make all error estimates smaller than such an .
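The iteration counts above all follow the same pattern: if the expected error after t iterations is bounded by a constant times the t-th power of a contraction factor, the smallest admissible t is obtained by taking logarithms. A minimal sketch (the symbols gamma, c, and eps are generic placeholders, not the particular constants of Equations (38)-(44)):

```python
import math

def iterations_needed(gamma, eps, c=1.0):
    """Smallest t with c * gamma**t <= eps, i.e. t >= log(eps/c) / log(gamma);
    gamma in (0, 1) is the per-iteration contraction factor and c the constant
    in the error bound (both are generic placeholders here)."""
    assert 0.0 < gamma < 1.0 and eps > 0.0
    return max(0, math.ceil(math.log(eps / c) / math.log(gamma)))

print(iterations_needed(gamma=0.99, eps=1e-6))   # 1375 iterations
```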
5.2. Memory-Complexity Analysis
In the context of big data, the memory usage considered here is extended from the conventional memory-complexity definition, i.e., the size of the working memory used by an algorithm. In addition, we also consider the memory used to store the input data. In this subsection, we will demonstrate that our two proposed algorithms, which solve the factorized sub-systems, require much lower memory-complexity than the conventional approach of solving the original system. This memory-efficiency is a main contribution of our work. For a consistent system factorized as , our proposed Algorithm 1 will require memory-units (MUs) to store the inputs , , and . In Algorithm 1, one needs MUs to store the probability values used for the column-selections. For updating various vectors, MUs are required to store the corresponding updates. Hence, the total number of required MUs is given by:
Alternatively, if one applies Algorithm 1 to the original (unfactorized) system, MUs are needed for storing the data.
On the other hand, for an inconsistent system, our proposed Algorithm 2 will require memory units to store the inputs , , and . In Algorithm 2, one needs MUs to store the probability values used for the row and column selections. For updating various vectors, MUs are required to store the corresponding updates. Therefore, the total number of the required MUs is given by:
Alternatively, if one applies Algorithm 2 to the original (unfactorized) system, MUs are required to store the data.
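The dominant comparison is between storing the two factors (on the order of mk + kn entries) and storing the full matrix (mn entries); the exact bookkeeping terms above are not reproduced here. A small sketch with illustrative (assumed) dimensions:

```python
def factored_storage(m, n, k):
    # Dominant input-storage terms only: B (m*k), C (k*n), and b (m); the
    # probability vectors and iterate vectors add O(m + n + k) more.
    return m * k + k * n + m

def unfactored_storage(m, n):
    # Storing the original m x n matrix and the right-hand side directly.
    return m * n + m

m, n, k = 10_000, 8_000, 20
saved = 1 - factored_storage(m, n, k) / unfactored_storage(m, n)
print(f"memory saved by factorization: {saved:.1%}")   # about 99.5% for these sizes
```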
6. Numerical Evaluation
The numerical evaluation of our proposed algorithms is presented in this section. The convergence, time-complexity, and memory-complexity of our proposed new algorithms are evaluated in Section 6.1, Section 6.3, and Section 6.4, respectively, while Section 6.2 validates the algorithms using real-world data.
6.1. Convergence Study
First consider a consistent system. The entries of , , and are drawn from an independently and identically distributed (i.i.d.) random Gaussian process with zero mean and unit variance, where , , and . We plot the convergence trends of the expected error and the actual -errors (shown by the shaded areas) with respect to and in Figure 4. The convergence speed subject to is slower than that subject to because the convergence speed is determined solely by according to Equation (26), where and . Each shaded region spans the actual -errors resulting from one hundred Monte Carlo trials.
Figure 4.
The effect of on the convergence of a random consistent system.
On the other hand, consider an inconsistent system. One has to apply Algorithm 2 to solve it. The entries of , , and are drawn from an independently and identically distributed (i.i.d.) random Gaussian process with zero mean and unit variance, where , , and . We plot the convergence trends of the expected error and the actual -errors (shown by the shaded areas) with respect to and in Figure 5. Again, the convergence speed subject to is slower than that subject to because the convergence speed is determined solely by according to Equation (32), where . Each shaded region spans the actual -errors resulting from one hundred Monte Carlo trials.
Figure 5.
The effect of on the convergence of a random inconsistent system.
6.2. Validation Using Real-World Data
In addition to the Monte Carlo simulations shown in Section 6.1, we also validate our proposed new algorithms on real-world data concerning wine quality and bike rental. Here we use two real-world datasets from the UCI Machine Learning Repository [49]. The first set is related to wine quality, where we chose the data related to red wine only. The owners of this dataset used twelve physicochemical and sensory variables to measure the wine quality. These variables include: 1—fixed acidity, 2—volatile acidity, 3—citric acid, 4—residual sugar, 5—chlorides, 6—free sulfur dioxide, 7—total sulfur dioxide, 8—density, 9—pH value, 10—sulphates, 11—alcohol, and 12—quality (score ranging from 0 to 10). Consequently, these twelve categories of data form an overdetermined matrix (as a matrix ) with size . If the nonnegative matrix factorization is applied to obtain the factorized matrices and for , we have and . The expected errors and the actual errors for the wine data (denoted by triangles) are depicted in Figure 6, where Algorithm 2 is applied to solve the pertinent linear-regression problem in this case. On the other hand, another dataset, about a bike-sharing system, includes categorical and numerical data. Since the underlying problem is linear regression, we work on the numerical attributes of the data only. These attributes include: 1—temperature, 2—feeling temperature, 3—humidity, 4—windspeed, 5—casual counts, 6—registered counts, and 7—total rental-bike counts. The matrix size for this dataset is thus , and the corresponding parameters of this matrix are , , and . The expected errors and the actual errors for the bike data (denoted by rhombi) are delineated in Figure 6.
Figure 6.
Error-convergence comparison for the wine data and the bike-rental data.
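For reference, the nonnegative matrix factorization step can be carried out with an off-the-shelf routine; the sketch below uses scikit-learn's NMF on a random nonnegative stand-in matrix, and the dimensions and rank are illustrative assumptions rather than the values used for the wine or bike data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random nonnegative stand-in for a samples-by-attributes table; the
# dimensions and the rank k below are illustrative assumptions only.
rng = np.random.default_rng(0)
X = rng.random((1600, 12))

k = 4
model = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
B = model.fit_transform(X)        # plays the role of the factor B (1600 x k)
C = model.components_             # plays the role of the factor C (k x 12)
print(np.linalg.norm(X - B @ C) / np.linalg.norm(X))   # relative reconstruction error
```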
6.3. Time-Complexity Study
The time-complexity is studied here according to the theoretical analysis in Section 5.1. First consider an arbitrary consistent system (a random sample drawn from the Monte Carlo trials stated in Section 6.1). The effect of error tolerance on time-complexity can be visualized in Figure 7. It can be observed that time-complexity increases as decreases. In addition, we would like to investigate the effects of the number of processors p and the dimension k on time-complexity. The time-complexity results versus the number of processors p and the dimension k are presented in Figure 8 subject to .
Figure 7.
Time-complexity versus n for an arbitrary consistent system (, ).
Figure 8.
Time-complexity versus the number of processors p and the dimension k subject to for an arbitrary consistent system ().
On the other hand, let us now focus on an arbitrary inconsistent system (an arbitrary Monte Carlo trial as stated in Section 6.1). The corresponding time-complexities for and are delineated in Figure 9. Note that, since one more vector needs to be updated in Algorithm 2 compared to Algorithm 1, the time-complexities shown in Figure 9 are higher than those shown in Figure 7 subject to the same . Because our derived error-estimate bound is tighter than that presented in [46] for the EGSA, the time-complexity of the proposed method for an inconsistent system is reduced by about 60% compared to that of the approach proposed in [46] subject to the same . How the number of processors p and the dimension k affect the time-complexity for inconsistent systems is illustrated in Figure 10 subject to the error tolerance .
Figure 9.
Time-complexity versus n for an arbitrary inconsistent system (, ). The curves denoted by “ZF” illustrate the theoretical time-complexity error-bounds for solving the original system involving the matrix without factorization (theoretical results from [46]).
Figure 10.
Time-complexity versus the number of processors p and the dimension k subject to for an inconsistent system ().
According to [50], we define the spectral radius of the “iteration matrix” , where is given by Equation (1), by:
Note that denotes the cardinality of and , , …, specify the eigenvalues of . In Figure 11, we delineate the time-complexities required by the closed-form solution (denoted by “Closed-Form” in the figure) and our proposed iterative Gauss–Seidel approach (denoted by “GS” in the figure) versus the dimension n for with different spectral radii subject to = for an inconsistent system () such that = , , and . Even under such a small error tolerance = , the time-complexity required by the closed-form solution to Equation (1) is still much larger than that required by the iterative Gauss–Seidel algorithm proposed in this work when only a single processor is used.
Figure 11.
Time-complexity versus n for with different spectral radii subject to = for an arbitrary inconsistent system () such that = , , and .
Figure 11 demonstrates that if < 1, our proposed new iterative Gauss–Seidel approach requires less time-complexity than the exact (closed-form) solution. According to Figure 11, the time-complexity advantage of our proposed approach becomes more significant as the dimension grows.
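As a reminder of the definition used for Figure 11, the spectral radius of a square matrix is simply its largest eigenvalue magnitude; a minimal sketch follows (the test matrix here is an arbitrary stand-in, not the paper's iteration matrix).

```python
import numpy as np

def spectral_radius(M):
    """Spectral radius: the largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))

M = np.random.randn(50, 50) / 10.0     # arbitrary stand-in for an iteration matrix
print(spectral_radius(M))              # typically below 1 for this scaled example
```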
The run-time results listed in Table 1 and Table 2 are evaluated for different dimensions: k = and m = with respect to different n. The run-time unit is seconds. The computer specifications are as follows: a GeForce RTX 3080 Laptop GPU, Windows 11 Home, a 12th Gen Intel Core i9 processor, and an 8 GB SSD. Table 1 compares the run-times of the LU matrix-factorization method in [51] and the alternating least-squares (ALS) method in [52] with respect to the different dimensions involved in the factorization step formulated by Equation (1). According to Table 1, the ALS method leads to a shorter run-time than the LU matrix-factorization method. Table 2 compares the run-times of the LAPACK solver [53], our proposed Gauss–Seidel algorithms with the factorization step formulated by Equation (1) (abbreviated as “Fac. Inc.” in the tables), and our proposed Gauss–Seidel algorithms without the factorization step formulated by Equation (1) (abbreviated as “Fac. Exc.” in the tables) for = and = . If = , the convergence speeds of our proposed Gauss–Seidel algorithms are slow since is close to one, and thus they require a longer time than the LAPACK solver. However, our proposed method can outperform the LAPACK solver when is small, since our proposed Gauss–Seidel algorithms then converge to the solution much faster.
Table 1.
Run-times (in seconds) for the factorization of .
Table 2.
Run-times (in seconds) for solving using the Gauss–Seidel algorithms.
6.4. Memory-Complexity Study
Memory-complexity is also investigated here according to the theoretical analysis stated in Section 5.2. Figure 12 depicts the required memory-complexity for solving an arbitrary consistent system (the same as the system used in Section 6.3) using Algorithm 1. The memory-complexity is evaluated for different dimensions: , , and . We further set . On the other hand, for an arbitrary inconsistent system (the same as the system used in Section 6.3), all of the aforementioned values of m, n, and k remain the same and Algorithm 2 is applied instead. Figure 13 plots the required memory-complexity for solving an inconsistent system using Algorithm 2. In Figure 12 and Figure 13, for both consistent and inconsistent systems, we also present the required memory-complexity for solving the original system involving the matrix without factorization. According to Figure 12 and Figure 13, the storage-efficiency can be improved significantly, by 75% to 90% (depending on the dimension k).
Figure 12.
The memory-complexity versus n for a consistent system ().
Figure 13.
The memory-complexity versus n for an inconsistent system ().
7. Conclusions
For a wide variety of big-data analytics applications, we designed two new efficient parallel algorithms, built upon the Gauss–Seidel algorithm, to solve large linear-regression problems for both consistent and inconsistent systems. This new approach saves computational resources by transforming the original problem into subproblems involving factorized matrices of much smaller dimensions. Meanwhile, theoretical expected-error estimates were derived to study the convergence of the new algorithms for both consistent and inconsistent systems. Two crucial computational resource metrics—time-complexity and memory-complexity—were evaluated for the proposed new algorithms. Numerical results from artificial simulations and real-world data demonstrated the convergence and the efficiency (in terms of computational resource usage) of the proposed new algorithms. Our proposed new approach is much more efficient in both time and memory than the conventional method. Since prevalent big-data applications frequently involve linear-regression problems of tremendous dimensions, our proposed new algorithms can be deemed impactful and convenient for future big-data computing technology. In the future, we would like to consider how to perform the matrix factorization so that = is as small as possible. With a smaller , we can expect faster convergence of our proposed Gauss–Seidel algorithms. In general, it is not always possible to obtain a linear system characterized by a small value of . The future research suggested here will help us overcome this main challenge.
Author Contributions
S.Y.C. and H.-C.W. contribute to the main theory development and draft preparation. Y.W. is responsible for some figures and manuscript editing. All authors have read and agreed to the published version of the manuscript.
Funding
This research work was partially supported by Louisiana Board of Regents Research Competitiveness Subprogram (Contract Number: LEQSF(2021-22)-RD-A-34).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Thakur, N.; Han, C.Y. An ambient intelligence-based human behavior monitoring framework for ubiquitous environments. Information 2021, 12, 81. [Google Scholar] [CrossRef]
- Chen, Y.; Ho, P.H.; Wen, H.; Chang, S.Y.; Real, S. On Physical-Layer Authentication via Online Transfer Learning. IEEE Internet Things J. 2021, 9, 1374–1385. [Google Scholar] [CrossRef]
- Tariq, F.; Khandaker, M.; Wong, K.-K.; Imran, M.; Bennis, M.; Debbah, M. A speculative study on 6G. arXiv 2019, arXiv:1902.06700. [Google Scholar] [CrossRef]
- Gu, R.; Tang, Y.; Tian, C.; Zhou, H.; Li, G.; Zheng, X.; Huang, Y. Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 2539–2552. [Google Scholar] [CrossRef]
- Dass, J.; Sarin, V.; Mahapatra, R.N. Fast and communication-efficient algorithm for distributed support vector machine training. IEEE Trans. Parallel Distrib. Syst. 2018, 30, 1065–1076. [Google Scholar] [CrossRef]
- Yu, Z.; Xiong, W.; Eeckhout, L.; Bei, Z.; Mendelson, A.; Xu, C. MIA: Metric importance analysis for big data workload characterization. IEEE Trans. Parallel Distrib. Syst. 2017, 29, 1371–1384. [Google Scholar] [CrossRef]
- Zhang, T.; Liu, X.-Y.; Wang, X.; Walid, A. cuTensor-Tubal: Efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Trans. Parallel Distrib. Syst. 2019, 31, 595–610. [Google Scholar] [CrossRef]
- Zhang, T.; Liu, X.-Y.; Wang, X. High performance GPU tensor completion with tubal-sampling pattern. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 1724–1739. [Google Scholar] [CrossRef]
- Hu, Z.; Li, B.; Luo, J. Time-and cost-efficient task scheduling across geo-distributed data centers. IEEE Trans. Parallel Distrib. Syst. 2017, 29, 705–718. [Google Scholar] [CrossRef]
- Jaulmes, L.; Moreto, M.; Ayguade, E.; Labarta, J.; Valero, M.; Casas, M. Asynchronous and exact forward recovery for detected errors in iterative solvers. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 1961–1974. [Google Scholar] [CrossRef] [Green Version]
- Chen, Y.; Wu, J.; Lin, J.; Liu, R.; Zhang, H.; Ye, Z. Affinity regularized non-negative matrix factorization for lifelong topic modeling. IEEE Trans. Knowl. Data Eng. 2019, 32, 1249–1262. [Google Scholar] [CrossRef]
- Kannan, R.; Ballard, G.; Park, H. MPI-FAUN: An MPI-based framework for alternating-updating nonnegative matrix factorization. IEEE Trans. Knowl. Data Eng. 2017, 30, 544–558. [Google Scholar] [CrossRef]
- Wang, S.; Chen, H.; Cao, J.; Zhang, J.; Yu, P. Locally balanced inductive matrix completion for demand-supply inference in stationless bike-sharing systems. IEEE Trans. Knowl. Data Eng. 2019, 32, 2374–2388. [Google Scholar] [CrossRef]
- Sharma, S.; Powers, J.; Chen, K. PrivateGraph: Privacy-preserving spectral analysis of encrypted graphs in the cloud. IEEE Trans. Knowl. Data Eng. 2018, 31, 981–995. [Google Scholar] [CrossRef]
- Liu, Z.; Vandenberghe, L. Interior-point method for nuclear norm approximation with application to system identification. SIAM J. Matrix Anal. Appl. 2009, 31, 1235–1256. [Google Scholar] [CrossRef]
- Borg, I.; Groenen, P. Modern multidimensional scaling: Theory and applications. J. Educ. Meas. 2003, 40, 277–280. [Google Scholar] [CrossRef]
- Biswas, P.; Lian, T.-C.; Wang, T.-C.; Ye, Y. Semidefinite programming based algorithms for sensor network localization. ACM Trans. Sens. Netw. 2006, 2, 188–220. [Google Scholar] [CrossRef]
- Yan, K.; Wu, H.-C.; Xiao, H.; Zhang, X. Novel robust band-limited signal detection approach using graphs. IEEE Commun. Lett. 2017, 21, 20–23. [Google Scholar] [CrossRef]
- Yan, K.; Yu, B.; Wu, H.-C.; Zhang, X. Robust target detection within sea clutter based on graphs. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7093–7103. [Google Scholar] [CrossRef]
- Costa, J.A.; Hero, A.O. Geodesic entropic graphs for dimension and entropy estimation in manifold learning. IEEE Trans. Signal Process. 2004, 52, 2210–2221. [Google Scholar] [CrossRef] [Green Version]
- Sandryhaila, A.; Moura, J.M. Big data analysis with signal processing on graphs. IEEE Signal Process. Mag. 2014, 31, 80–90. [Google Scholar] [CrossRef]
- Sandryhaila, A.; Moura, J.M. Discrete signal processing on graphs. IEEE Trans. Signal Process. 2013, 61, 1644–1656. [Google Scholar] [CrossRef] [Green Version]
- Ahmed, A.; Romberg, J. Compressive multiplexing of correlated signals. IEEE Trans. Inf. Theory 2014, 61, 479–498. [Google Scholar] [CrossRef] [Green Version]
- Davies, M.E.; Eldar, Y.C. Rank awareness in joint sparse recovery. IEEE Trans. Inf. Theory 2012, 58, 1135–1146. [Google Scholar] [CrossRef] [Green Version]
- Cong, Y.; Liu, J.; Fan, B.; Zeng, P.; Yu, H.; Luo, J. Online similarity learning for big data with overfitting. IEEE Trans. Big Data 2017, 4, 78–89. [Google Scholar] [CrossRef]
- Zhu, X.; Suk, H.-I.; Huang, H.; Shen, D. Low-rank graph-regularized structured sparse regression for identifying genetic biomarkers. IEEE Trans. Big Data 2017, 3, 405–414. [Google Scholar] [CrossRef] [Green Version]
- Liu, X.-Y.; Wang, X. LS-decomposition for robust recovery of sensory big data. IEEE Trans. Big Data 2017, 4, 542–555. [Google Scholar] [CrossRef]
- Fan, J.; Zhao, M.; Chow, T.W.S. Matrix completion via sparse factorization solved by accelerated proximal alternating linearized minimization. IEEE Trans. Big Data 2018, 6, 119–130. [Google Scholar] [CrossRef]
- Hou, D.; Cong, Y.; Sun, G.; Dong, J.; Li, J.; Li, K. Fast multi-view outlier detection via deep encoder. IEEE Trans. Big Data 2020, 1–11. [Google Scholar] [CrossRef]
- Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441. [Google Scholar] [CrossRef]
- Landauer, T.K.; Foltz, P.W.; Laham, D. An introduction to latent semantic analysis. Discourse Process. 1998, 25, 259–284. [Google Scholar] [CrossRef]
- Obozinski, G.; Taskar, B.; Jordan, M.I. Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 2010, 20, 231–252. [Google Scholar] [CrossRef] [Green Version]
- Liu, H.; Wu, J.; Liu, T.; Tao, D.; Fu, Y. Spectral ensemble clustering via weighted k-means: Theoretical and practical evidence. IEEE Trans. Knowl. Data Eng. 2017, 29, 1129–1143. [Google Scholar] [CrossRef]
- Jiang, X.; Zeng, W.-J.; So, H.C.; Zoubir, A.M.; Kirubarajan, T. Beamforming via nonconvex linear regression. IEEE Trans. Signal Process. 2015, 64, 1714–1728. [Google Scholar] [CrossRef]
- Kallummil, S.; Kalyani, S. High SNR consistent linear model order selection and subset selection. IEEE Trans. Signal Process. 2016, 64, 4307–4322. [Google Scholar] [CrossRef]
- Kallummil, S.; Kalyani, S. Residual ratio thresholding for linear model order selection. IEEE Trans. Signal Process. 2018, 67, 838–853. [Google Scholar] [CrossRef]
- So, H.C.; Zeng, W.-J. Outlier-robust matrix completion via lp-minimization. IEEE Trans. Signal Process. 2018, 66, 1125–1140. [Google Scholar]
- Berberidis, D.; Kekatos, V.; Giannakis, G.B. Online censoring for large-scale regressions with application to streaming big data. IEEE Trans. Signal Process. 2016, 64, 3854–3867. [Google Scholar] [CrossRef] [Green Version]
- Boloix-Tortosa, R.; Murillo-Fuentes, J.J.; Tsaftaris, S.A. The generalized complex kernel least-mean-square algorithm. IEEE Trans. Signal Process. 2019, 67, 5213–5222. [Google Scholar] [CrossRef]
- Widrow, B. Adaptive Signal Processing; Prentice Hall: Hoboken, NJ, USA, 1985. [Google Scholar]
- Sonneveld, P.; Van Gijzen, M.B. IDR (s): A family of simple and fast algorithms for solving large nonsymmetric systems of linear equations. SIAM J. Sci. Comput. 2009, 31, 1035–1062. [Google Scholar] [CrossRef] [Green Version]
- Bavier, E.; Hoemmen, M.; Rajamanickam, S.; Thornquist, H. Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems. Sci. Program. 2012, 20, 241–255. [Google Scholar] [CrossRef] [Green Version]
- Chang, S.Y.; Wu, H.-C. Divide-and-Iterate approach to big data systems. IEEE Trans. Serv. Comput. 2020. [Google Scholar] [CrossRef]
- Hageman, L.; Young, D. Applied Iterative Methods; Academic Press: Cambridge, MA, USA, 1981. [Google Scholar]
- Leventhal, D.; Lewis, A.S. Randomized methods for linear constraints: Convergence rates and conditioning. Math. Oper. Res. 2010, 35, 641–654. [Google Scholar] [CrossRef] [Green Version]
- Ma, A.; Needell, D.; Ramdas, A. Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods. SIAM J. Matrix Anal. Appl. 2015, 36, 1590–1604. [Google Scholar] [CrossRef] [Green Version]
- Weiss, N.A. A Course in Probability; Addison-Wesley: Boston, MA, USA, 2006. [Google Scholar]
- Harremoës, P. Bounds on tail probabilities in exponential families. arXiv 2016, arXiv:1601.05179. [Google Scholar]
- Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 3 March 2022).
- Li, C.-K.; Tam, T.-Y.; Tsing, N.-K. The generalized spectral radius, numerical radius and spectral norm. Linear Multilinear Algebra 1984, 16, 215–237. [Google Scholar] [CrossRef]
- Mittal, R.; Al-Kurdi, A. LU-decomposition and numerical structure for solving large sparse nonsymmetric linear systems. Comput. Math. Appl. 2002, 43, 131–155. [Google Scholar] [CrossRef] [Green Version]
- Kroonenberg, P.M.; De Leeuw, J. Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika 1980, 45, 69–97. [Google Scholar] [CrossRef]
- Kågström, B.; Poromaa, P. LAPACK-style algorithms and software for solving the generalized Sylvester equation and estimating the separation between regular matrix pairs. ACM Trans. Math. Softw. 1996, 22, 78–103. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).