1. Introduction
The Nyström method is a widely used technique to speed up kernel machines, and its computational efficiency has attracted much attention in the past few years [1,2,3,4,5,6,7,8]. Given a kernel matrix $\mathbf{K} \in \mathbb{R}^{n\times n}$, the Nyström method approximates $\mathbf{K}$ from a small set of randomly sampled columns in order to save computation. The price of this efficiency is a relatively large matrix approximation error in real applications [9,10]. Given a target rank $k$ and a precision parameter $\varepsilon$, Wang and Zhang [4] showed that the conventional Nyström method cannot attain a $1+\varepsilon$ bound relative to $\|\mathbf{K}-\mathbf{K}_k\|_F^2$ unless the number of sampled columns is $\Omega(\sqrt{nk/\varepsilon})$. Here, $\mathbf{K}_k$ denotes the best rank-$k$ approximation to the kernel matrix $\mathbf{K}$. Several modified Nyström methods were proposed in recent years [3,4,11,12]. The modified Nyström method of [11] needs only $O(k/\varepsilon)$ columns of the kernel matrix to obtain a $1+\varepsilon$ bound relative to $\|\mathbf{K}-\mathbf{K}_k\|_F^2$. To the best of our knowledge, it is the fastest such algorithm, achieving a $1+\varepsilon$ relative error in time roughly proportional to $\mathrm{nnz}(\mathbf{K})$, where $\mathrm{nnz}(\mathbf{K})$ denotes the number of non-zero entries of $\mathbf{K}$. Although these modified Nyström methods are superior in approximation accuracy, they carry a much higher computational burden than the conventional Nyström method.
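To make the two approximation schemes concrete, the following is a minimal NumPy sketch (our own illustration, not the authors' code): the conventional Nyström approximation $\mathbf{C}\mathbf{W}^{\dagger}\mathbf{C}^{T}$ and the modified Nyström approximation $\mathbf{C}\mathbf{U}\mathbf{C}^{T}$ with intersection matrix $\mathbf{U} = \mathbf{C}^{\dagger}\mathbf{K}(\mathbf{C}^{\dagger})^{T}$, both with plain uniform column sampling and a synthetic SPSD matrix in place of a real kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small SPSD test matrix standing in for a kernel matrix K.
B = rng.standard_normal((500, 20))
K = B @ B.T + 1e-3 * np.eye(500)

def nystrom(K, c, modified=False):
    """Conventional Nystrom: C W^+ C^T; modified Nystrom: C U C^T with U = C^+ K (C^+)^T."""
    n = K.shape[0]
    idx = rng.choice(n, size=c, replace=False)      # uniform column sampling
    C = K[:, idx]                                   # n x c sampled columns
    if modified:
        Cp = np.linalg.pinv(C)
        U = Cp @ K @ Cp.T                           # intersection matrix of the modified method
    else:
        U = np.linalg.pinv(K[np.ix_(idx, idx)])     # conventional: pseudo-inverse of the c x c block W
    return C @ U @ C.T

for modified in (False, True):
    K_approx = nystrom(K, c=40, modified=modified)
    rel = np.linalg.norm(K - K_approx, 'fro') / np.linalg.norm(K, 'fro')
    print('modified' if modified else 'conventional', round(rel, 4))
```

On a matrix with fast spectral decay, the modified variant typically attains a noticeably smaller relative error for the same number of sampled columns, at the price of the extra multiplications involving the full matrix.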
In this paper, we propose a much faster modified Nyström method that runs in time near-linear in $\mathrm{nnz}(\mathbf{K})$ while achieving a $1+\varepsilon$ bound relative to $\|\mathbf{K}-\mathbf{K}_k\|_F^2$. Our algorithm can be accelerated even further by the leverage-score sketching guaranteed by Lemma 3. Our algorithm is given in Algorithm 3. It needs only a constant number of matrix multiplications, which are easily implemented in parallel, and the computational complexity of the matrix multiplications in Algorithm 3 is near-linear in the input sparsity. In addition, for the arithmetic operations that are hard to parallelize, such as the SVD, the pseudoinverse and the QR decomposition, Algorithm 3 needs time that is sublinear in the input size $n$. At the cost of some accuracy, our method can be run with the same computational complexity as the conventional Nyström method when sampling $c$ columns. Our empirical studies further validate the efficiency of our algorithm.
In this paper, we improve several key algorithms which together constitute a faster modified Nyström method. We summarize our contributions as follows.
First and most importantly, we propose an efficient modified Nyström method with theoretical guarantees.
Second, a more computationally efficient adaptive sampling method is proposed in Lemma 2. Adaptive sampling is a cornerstone of column selection, CUR decomposition and the Nyström method [4,5,11,13], and it is also very popular in other matrix problems [14].
Finally, our proposed practical Nyström method achieves computational efficiency in real applications, as shown by our experiments.
The rest of this paper is structured as follows. In Section 2, we provide the notation used in this study. In Section 3, we improve several key algorithms that constitute the modified Nyström method. Section 4 gives our modified Nyström method. We conduct an empirical analysis and comparison in Section 5, and conclude our work in Section 6. All detailed proofs are omitted except the computational complexity analysis.
2. Notation and Preliminaries [15]
Firstly, we introduce the notation and concepts that will be used here and hereafter. $\mathbf{I}_n$ is used to represent the $n \times n$ identity matrix; sometimes we simply write $\mathbf{I}$. We also use $\mathbf{0}$ to signify a zero vector or a zero matrix of the appropriate size. The number of non-zero entries of a matrix $\mathbf{A}$ is denoted by $\mathrm{nnz}(\mathbf{A})$.
Let $\mathbf{A} \in \mathbb{R}^{m\times n}$ and $\rho = \mathrm{rank}(\mathbf{A})$. The singular value decomposition (SVD) of $\mathbf{A}$ may be expressed as
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{T} = \mathbf{U}_{k}\boldsymbol{\Sigma}_{k}\mathbf{V}_{k}^{T} + \mathbf{U}_{k\perp}\boldsymbol{\Sigma}_{k\perp}\mathbf{V}_{k\perp}^{T},$$
where the top $k$ singular values are contained in $\boldsymbol{\Sigma}_{k}$ ($k \times k$), with the corresponding left and right singular vectors in $\mathbf{U}_{k}$ ($m \times k$) and $\mathbf{V}_{k}$ ($n \times k$). The best (or closest) rank-$k$ approximation to $\mathbf{A}$ is denoted by $\mathbf{A}_{k} = \mathbf{U}_{k}\boldsymbol{\Sigma}_{k}\mathbf{V}_{k}^{T}$. The $i$-th largest singular value of $\mathbf{A}$ is denoted by $\sigma_{i}(\mathbf{A})$. The SVD coincides with the eigenvalue decomposition when $\mathbf{A}$ is symmetric positive semi-definite (SPSD), in which case $\mathbf{U} = \mathbf{V}$.
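As a quick numerical illustration of this notation (our own example), the snippet below forms $\mathbf{A}_k$ from the truncated SVD and checks two standard identities: $\|\mathbf{A}-\mathbf{A}_k\|_F^2 = \sum_{i>k}\sigma_i^2(\mathbf{A})$ and $\|\mathbf{A}\|_2 = \sigma_1(\mathbf{A})$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((80, 60))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T, s sorted in decreasing order
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation A_k

# Frobenius error of A_k equals the energy in the discarded singular values.
print(np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:] ** 2))))   # True
# The spectral norm equals the largest singular value sigma_1(A).
print(np.isclose(np.linalg.norm(A, 2), s[0]))                                    # True
```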
Furthermore, let $\mathbf{A}^{\dagger}$ be the Moore–Penrose inverse of $\mathbf{A}$, defined as $\mathbf{A}^{\dagger} = \mathbf{V}_{\rho}\boldsymbol{\Sigma}_{\rho}^{-1}\mathbf{U}_{\rho}^{T}$. When $\mathbf{A}$ is non-singular, the matrix inverse coincides with the Moore–Penrose inverse.
The matrix norms are defined as follows: the spectral norm is $\|\mathbf{A}\|_{2} = \max_{\mathbf{x}\neq\mathbf{0}} \|\mathbf{A}\mathbf{x}\|_{2}/\|\mathbf{x}\|_{2} = \sigma_{1}(\mathbf{A})$, and the Frobenius norm is $\|\mathbf{A}\|_{F} = \big(\sum_{i,j} a_{ij}^{2}\big)^{1/2}$.
Given matrices $\mathbf{A} \in \mathbb{R}^{m\times n}$ and $\mathbf{C} \in \mathbb{R}^{m\times c}$ with $c \le n$, we define $\Pi_{\mathbf{C},k}^{\xi}(\mathbf{A}) \in \mathbb{R}^{m\times n}$ as the closest approximation to $\mathbf{A}$ within the column space of $\mathbf{C}$ of rank at most $k$; that is, $\Pi_{\mathbf{C},k}^{\xi}(\mathbf{A})$ minimizes the residual $\|\mathbf{A}-\hat{\mathbf{A}}\|_{\xi}$ over all $\hat{\mathbf{A}}$ in the column space of $\mathbf{C}$ with rank at most $k$. Here, $\xi = 2$ or $\xi = F$ denotes the spectral norm or the Frobenius norm, respectively.
Given three matrices $\mathbf{A} \in \mathbb{R}^{m\times n}$, $\mathbf{C} \in \mathbb{R}^{m\times c}$ and $\mathbf{R} \in \mathbb{R}^{r\times n}$, the projection of $\mathbf{A}$ onto the column space of $\mathbf{C}$ is $\mathbf{C}\mathbf{C}^{\dagger}\mathbf{A}$, and the projection of $\mathbf{A}$ onto the row space of $\mathbf{R}$ is $\mathbf{A}\mathbf{R}^{\dagger}\mathbf{R}$.
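The pseudoinverse and the two projections can be checked numerically; the sketch below uses illustrative matrices only.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 50))
C = A[:, :8]                    # a few columns of A
R = A[:5, :]                    # a few rows of A

P_col = C @ np.linalg.pinv(C) @ A        # projection of A onto the column space of C
P_row = A @ np.linalg.pinv(R) @ R        # projection of A onto the row space of R

# A projection never increases the Frobenius norm.
print(np.linalg.norm(P_col, 'fro') <= np.linalg.norm(A, 'fro'))            # True
# Projecting twice changes nothing (idempotence of C C^+).
print(np.allclose(C @ np.linalg.pinv(C) @ P_col, P_col))                   # True
```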
We now give the definitions of leverage score sampling and subspace embedding, which are key tools in constructing our Nyström algorithm.
Definition 1 (Leverage score sampling, [13,15]). Let $\mathbf{V} \in \mathbb{R}^{n\times k}$ be column orthonormal with $n > k$, and let $\mathbf{v}_{i}$ denote the $i$-th row of $\mathbf{V}$. Let $p_{i} = \|\mathbf{v}_{i}\|_{2}^{2}/k$ for $i = 1, \ldots, n$; the $p_{i}$ are the leverage scores of $\mathbf{V}$. Let $r$ be an integer with $1 \le r \le n$. Construct the sampling matrix $\mathbf{\Omega} \in \mathbb{R}^{n\times r}$ and the rescaling matrix $\mathbf{D} \in \mathbb{R}^{r\times r}$ as follows: for each column $j$ of $\mathbf{\Omega}$ and $\mathbf{D}$, independently and with replacement, pick an index $i$ from $\{1, \ldots, n\}$ with probability $p_{i}$, and set $\mathbf{\Omega}_{ij} = 1$ and $\mathbf{D}_{jj} = 1/\sqrt{p_{i}r}$. The number of operations required by this procedure is $O(nk)$. We refer to this procedure as leverage score sampling.
Definition 2 ([16]). Let $\varepsilon, \delta \in (0, 1)$ and define $\Pi$ to be a distribution on matrices $\mathbf{S} \in \mathbb{R}^{\ell\times n}$, where $\ell$ depends on $n$, $d$, $\varepsilon$ and $\delta$. Suppose that, for any fixed matrix $\mathbf{A} \in \mathbb{R}^{n\times d}$, with probability at least $1-\delta$, a matrix $\mathbf{S}$ drawn from the distribution $\Pi$ is a $(1\pm\varepsilon)$ $\ell_{2}$-subspace embedding for $\mathbf{A}$; that is, for every $\mathbf{x} \in \mathbb{R}^{d}$, $(1-\varepsilon)\|\mathbf{A}\mathbf{x}\|_{2}^{2} \le \|\mathbf{S}\mathbf{A}\mathbf{x}\|_{2}^{2} \le (1+\varepsilon)\|\mathbf{A}\mathbf{x}\|_{2}^{2}$ with probability $1-\delta$. We then call $\Pi$ an $(\varepsilon,\delta)$-oblivious $\ell_{2}$-subspace embedding.
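For concreteness, here is a small NumPy rendering of the sampling and rescaling matrices of Definition 1; the function name and the sketch size $r$ are our own illustrative choices.

```python
import numpy as np

def leverage_score_sampling(V, r, rng):
    """Build the sampling matrix Omega (n x r) and rescaling matrix D (r x r) of Definition 1."""
    n, k = V.shape
    p = np.sum(V**2, axis=1) / k            # leverage scores of the orthonormal V, summing to 1
    idx = rng.choice(n, size=r, p=p)        # r i.i.d. draws with replacement
    Omega = np.zeros((n, r))
    D = np.zeros((r, r))
    for j, i in enumerate(idx):
        Omega[i, j] = 1.0
        D[j, j] = 1.0 / np.sqrt(p[i] * r)
    return Omega, D

rng = np.random.default_rng(2)
V, _ = np.linalg.qr(rng.standard_normal((1000, 20)))    # a column-orthonormal matrix
Omega, D = leverage_score_sampling(V, r=200, rng=rng)
S = D @ Omega.T                                          # r x n leverage-score sketching matrix
G = (S @ V).T @ (S @ V)
print(np.linalg.norm(G - np.eye(V.shape[1]), 2))         # small: S approximately preserves V's geometry
```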
The sparse subspace embedding matrix and the subsampled randomized Hadamard matrix are the two most popular subspace embedding matrices. For an $n \times d$ matrix $\mathbf{A}$ whose columns span a $k$-dimensional subspace, we can construct a sparse subspace embedding matrix for $\mathbf{A}$ with $O(k^{2}/\varepsilon^{2})$ rows, and a subsampled randomized Hadamard matrix with $O\big(\varepsilon^{-2}(\sqrt{k}+\sqrt{\log n})^{2}\log(k/\delta)\big)$ rows [16]. Composing the sparse subspace embedding with the subsampled randomized Hadamard matrix still preserves the subspace embedding property.
Let us now discuss the computational costs of the matrix operations mentioned above. Matrix multiplication is an intrinsically parallel operation; hence, it can be implemented in parallel efficiently, as most mathematical software does. However, the SVD and the QR decomposition are much harder to parallelize, so we account for the cost of matrix multiplications separately from the cost of such decompositions. For a general matrix $\mathbf{A} \in \mathbb{R}^{m\times n}$ with $m \ge n$, computing the full SVD requires $O(mn^{2})$ flops, whereas computing the truncated SVD of rank $k$ ($k < n$) requires $O(mnk)$ flops. Computing $\mathbf{A}^{\dagger}$ requires $O(mn^{2})$ flops as well. Given a Hadamard–Walsh transform matrix $\mathbf{H} \in \mathbb{R}^{n\times n}$, the Hadamard–Walsh transform $\mathbf{H}\mathbf{B}$ of an $n \times d$ matrix $\mathbf{B}$ costs $O(nd\log n)$, which is substantially quicker than the $O(n^{2}d)$ of a typical matrix multiplication. Applying a sparse subspace embedding matrix to an $m \times n$ matrix $\mathbf{A}$ needs only $O(\mathrm{nnz}(\mathbf{A}))$ arithmetic operations.
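To illustrate the input-sparsity cost of the last item, the sketch below applies a CountSketch-style sparse embedding (one random signed entry per column of the embedding, the usual construction for sparse subspace embeddings) to a sparse matrix with SciPy; the sketch size is illustrative.

```python
import numpy as np
import scipy.sparse as sp

def countsketch(n, ell, rng):
    """Sparse subspace embedding S (ell x n): one random +-1 entry per column, so S A costs O(nnz(A))."""
    rows = rng.integers(0, ell, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    return sp.csr_matrix((signs, (rows, np.arange(n))), shape=(ell, n))

rng = np.random.default_rng(3)
A = sp.random(20000, 30, density=0.01, format='csr', random_state=1)   # tall, sparse input
S = countsketch(A.shape[0], ell=2000, rng=rng)
SA = S @ A                                                             # sketched matrix, 2000 x 30

# Column norms (and, more generally, subspace geometry) are approximately preserved.
true = np.sqrt(np.asarray(A.power(2).sum(axis=0))).ravel()
approx = np.sqrt(np.asarray(SA.power(2).sum(axis=0))).ravel()
print(np.max(np.abs(approx - true) / true))
```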
3. Main Lemmas and Theorems
In this part, we will outline our principal theorems and lemmas, which are the key tools to implement Algorithm 3. In addition, these lemmas and theorems are of independent interest and have wide application.
First, we give a fast randomized SVD method, which is depicted in Algorithm 1; as far as we know, it is the fastest randomized SVD method available.
Lemma 1. Given a matrix $\mathbf{A} \in \mathbb{R}^{m\times n}$, a target rank $k$ and an error parameter $\varepsilon$, let $\tilde{\mathbf{U}}$ be returned by Algorithm 1; then, the following bound holds with high probability:
$$\|\mathbf{A} - \tilde{\mathbf{U}}\tilde{\mathbf{U}}^{T}\mathbf{A}\|_{F}^{2} \le (1+\varepsilon)\,\|\mathbf{A} - \mathbf{A}_{k}\|_{F}^{2}.$$
In addition, $\tilde{\mathbf{U}}$ can be computed in time near-linear in $\mathrm{nnz}(\mathbf{A})$. We denote Algorithm 1 by $\tilde{\mathbf{U}} = \mathrm{SparseSVD}(\mathbf{A}, k, \varepsilon)$.
Algorithm 1 Sparse SVD
1: Input: a real matrix $\mathbf{A} \in \mathbb{R}^{m\times n}$, an error parameter $\varepsilon$ and a target rank $k$;
2: Compute $\mathbf{Y} = \mathbf{A}\mathbf{S}^{T}$, where $\mathbf{S} = \mathbf{S}_{2}\mathbf{S}_{1}$; here $\mathbf{S}_{1}$ is a sparse subspace embedding matrix and $\mathbf{S}_{2}$ is a subsampled randomized Hadamard matrix;
3: Compute an orthonormal basis $\mathbf{Q}$ for $\mathbf{Y}$ by $\mathbf{Q} = \mathbf{Y}\mathbf{R}^{-1}$, where $\mathbf{R}^{T}\mathbf{R}$ is the Cholesky decomposition of $\mathbf{Y}^{T}\mathbf{Y}$;
4: Compute $\mathbf{B} = \mathbf{Q}^{T}\mathbf{A}\mathbf{T}^{T}$, where $\mathbf{T} = \mathbf{T}_{2}\mathbf{T}_{1}$; here $\mathbf{T}_{1}$ is a sparse subspace embedding matrix and $\mathbf{T}_{2}$ is a subsampled randomized Hadamard matrix;
5: Compute the SVD of $\mathbf{B}$ and let $\bar{\mathbf{U}}_{k}$ contain the top $k$ left singular vectors of $\mathbf{B}$;
6: Output: $\tilde{\mathbf{U}} = \mathbf{Q}\bar{\mathbf{U}}_{k}$.
Proof. Lemma A2 shows that $\|\mathbf{A} - \mathbf{Q}\mathbf{Q}^{T}\mathbf{A}\|_{F}^{2} \le (1+\varepsilon)\|\mathbf{A} - \mathbf{A}_{k}\|_{F}^{2}$, where $\mathbf{Q}$ has $O(k/\varepsilon)$ columns. Applying Lemma A1 and replacing $\mathbf{A}$ with $\mathbf{Q}^{T}\mathbf{A}$, we obtain the stated bound.
For the computation time analysis, computing $\mathbf{Y} = \mathbf{A}\mathbf{S}^{T}$ takes time near-linear in $\mathrm{nnz}(\mathbf{A})$, and then $\mathbf{Q} = \mathbf{Y}\mathbf{R}^{-1}$ produces the orthonormal basis, where $\mathbf{R}^{T}\mathbf{R}$ is the Cholesky decomposition of $\mathbf{Y}^{T}\mathbf{Y}$; this costs time linear in $m$ and polynomial in $k/\varepsilon$. Computing $\mathbf{B} = \mathbf{Q}^{T}\mathbf{A}\mathbf{T}^{T}$ again requires time near-linear in $\mathrm{nnz}(\mathbf{A})$. Computing the SVD of $\mathbf{B}$ requires time polynomial in $k/\varepsilon$ only, and computing $\mathbf{Q}\bar{\mathbf{U}}_{k}$ requires additional time linear in $m$. Hence, Algorithm 1 takes time near-linear in $\mathrm{nnz}(\mathbf{A})$ plus lower-order terms. □
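Below is our own simplified NumPy rendering of the two-sided sketching structure of Algorithm 1: it uses CountSketch-style embeddings only (no Hadamard step) and arbitrary sketch sizes, so it illustrates the data flow rather than reproducing the exact algorithm or its guarantees.

```python
import numpy as np

def countsketch_apply(A, ell, rng):
    """Apply an ell x m CountSketch-style embedding to A (m x n) without forming it densely."""
    m = A.shape[0]
    rows = rng.integers(0, ell, size=m)
    signs = rng.choice([-1.0, 1.0], size=m)
    SA = np.zeros((ell, A.shape[1]))
    np.add.at(SA, rows, signs[:, None] * A)     # each row of A is added (signed) to one row of SA
    return SA

def sparse_svd(A, k, eps, rng=np.random.default_rng(0)):
    m, n = A.shape
    ell1 = min(n, int(np.ceil(4 * k / eps)))          # illustrative sketch sizes, not the paper's
    ell2 = min(m, int(np.ceil(16 * k / eps)))
    Y = countsketch_apply(A.T, ell1, rng).T           # Y = A S^T: range sketch, m x ell1
    Q, _ = np.linalg.qr(Y)                            # orthonormal basis of the sketched range
    B = countsketch_apply((Q.T @ A).T, ell2, rng).T   # B = (Q^T A) T^T: small core matrix
    Ub, _, _ = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k]                              # approximate top-k left singular vectors

rng = np.random.default_rng(4)
A = rng.standard_normal((2000, 40)) @ rng.standard_normal((40, 500))   # (numerically) rank-40 matrix
U = sparse_svd(A, k=10, eps=0.5)
err = np.linalg.norm(A - U @ (U.T @ A), 'fro')
U0, s0, V0t = np.linalg.svd(A, full_matrices=False)
best = np.linalg.norm(A - U0[:, :10] @ np.diag(s0[:10]) @ V0t[:10], 'fro')
print(round(err / best, 3))    # a ratio close to 1 means a near-optimal rank-10 projection
```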
A faster adaptive sampling method, Algorithm 2, is developed based on the work of [13]. Boutsidis and Woodruff [13] compute the norm of every column of the residual matrix exactly. To further reduce the computation cost, we introduce a sketched residual to approximate the exact one; with such sketching, the column norms can be computed much more efficiently.
Algorithm 2 Adaptive Sampling
1: Input: a real matrix $\mathbf{A} \in \mathbb{R}^{m\times n}$, a matrix $\mathbf{C}_{1} \in \mathbb{R}^{m\times c_{1}}$ of previously selected columns, and the number of newly selected columns $c$;
2: Construct the sketched residual $\tilde{\mathbf{B}} = \mathbf{S}(\mathbf{A} - \mathbf{C}_{1}\mathbf{C}_{1}^{\dagger}\mathbf{A})$, where $\mathbf{S} = \mathbf{S}_{2}\mathbf{S}_{1}$; here $\mathbf{S}_{1}$ is a sparse subspace embedding matrix and $\mathbf{S}_{2}$ is a subsampled randomized Hadamard matrix;
3: Construct $\hat{\mathbf{B}} = \mathbf{G}\tilde{\mathbf{B}}$, where $\mathbf{G}$ is a normalized Gaussian matrix with a small number of rows;
4: Compute sampling probabilities $p_{j} = \|\hat{\mathbf{b}}_{j}\|_{2}^{2}/\|\hat{\mathbf{B}}\|_{F}^{2}$ for $j = 1, \ldots, n$, where $\hat{\mathbf{b}}_{j}$ is the $j$-th column of $\hat{\mathbf{B}}$;
5: Output: obtain $\mathbf{C}_{2}$ by selecting $c$ columns from $\mathbf{A}$ in $c$ i.i.d. trials; in each trial the index $j$ is chosen with probability $p_{j}$.
Lemma 2. Given $\mathbf{A} \in \mathbb{R}^{m\times n}$ and a matrix $\mathbf{C}_{1} \in \mathbb{R}^{m\times c_{1}}$ consisting of columns of $\mathbf{A}$, with residual $\mathbf{B} = \mathbf{A} - \mathbf{C}_{1}\mathbf{C}_{1}^{\dagger}\mathbf{A}$, let $\mathbf{C}_{2}$ be returned from Algorithm 2, containing $c$ columns of $\mathbf{A}$. Then, the matrix $\mathbf{C} = [\mathbf{C}_{1}, \mathbf{C}_{2}]$ satisfies, for any integer $k$,
$$\|\mathbf{A} - \Pi_{\mathbf{C},k}^{F}(\mathbf{A})\|_{F}^{2} \le \|\mathbf{A} - \mathbf{A}_{k}\|_{F}^{2} + O(k/c)\,\|\mathbf{B}\|_{F}^{2},$$
with at least constant probability. In addition, this randomized algorithm can be implemented in time near-linear in $\mathrm{nnz}(\mathbf{A})$. We denote this randomized algorithm as $\mathbf{C}_{2} = \mathrm{AdaptiveSample}(\mathbf{A}, \mathbf{C}_{1}, c)$.
Proof. Let $\mathbf{B} = \mathbf{A} - \mathbf{C}_{1}\mathbf{C}_{1}^{\dagger}\mathbf{A}$ be the residual matrix and let $\mathbf{b}_{i}$ be the $i$-th column of $\mathbf{B}$. By Theorem A4, with high probability, the sketch $\mathbf{S}$ preserves the column norms of $\mathbf{B}$ up to a small relative error. Besides, by the JL property of the normalized Gaussian matrix $\mathbf{G}$, we have $\|\hat{\mathbf{b}}_{i}\|_{2}^{2} \approx \|\mathbf{b}_{i}\|_{2}^{2}$ for every $i$. Hence, the sampling distribution $p_{i} = \|\hat{\mathbf{b}}_{i}\|_{2}^{2}/\|\hat{\mathbf{B}}\|_{F}^{2}$ used in Algorithm 2 matches, up to constant factors, the exact adaptive sampling distribution. Using Lemma A3, we obtain the bound in expectation, and using the Markov inequality, we conclude that the bound holds with at least constant probability.
As to the running time, forming the sketched residual $\tilde{\mathbf{B}}$ takes time near-linear in $\mathrm{nnz}(\mathbf{A})$ plus lower-order terms. Computing $\hat{\mathbf{B}} = \mathbf{G}\tilde{\mathbf{B}}$, the column norms of $\hat{\mathbf{B}}$, the sampling probabilities $p_{j}$ and the final selection of $c$ columns each require time at most linear in $n$, up to factors that depend only on $k$, $c$ and $\varepsilon$. Thus, the total cost is near-linear in $\mathrm{nnz}(\mathbf{A})$. □
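The following is a rough NumPy illustration of the sketched adaptive sampling idea: the residual is compressed by a sparse embedding and a small normalized Gaussian before the column norms are taken. The helper names, sketch sizes and the way the sketched residual is formed are our own simplifications and differ from the exact Algorithm 2.

```python
import numpy as np

def countsketch(m, ell, rng):
    """Return a function applying an ell x m CountSketch-style embedding."""
    rows = rng.integers(0, ell, size=m)
    signs = rng.choice([-1.0, 1.0], size=m)
    def apply(M):
        SM = np.zeros((ell, M.shape[1]))
        np.add.at(SM, rows, signs[:, None] * M)
        return SM
    return apply

def adaptive_sample(A, C1, c, ell=200, t=20, rng=np.random.default_rng(0)):
    S = countsketch(A.shape[0], ell, rng)
    # Sketched residual S (A - C1 C1^+ A), formed without materializing the full residual.
    SB = S(A) - S(C1) @ (np.linalg.pinv(C1) @ A)
    G = rng.standard_normal((t, ell)) / np.sqrt(t)   # normalized Gaussian; keeps column norms (JL)
    GB = G @ SB                                      # t x n compressed residual
    p = np.sum(GB**2, axis=0)
    p = p / p.sum()                                  # adaptive sampling probabilities
    idx = rng.choice(A.shape[1], size=c, p=p)        # c i.i.d. trials
    return A[:, idx]

rng = np.random.default_rng(5)
A = rng.standard_normal((1000, 300))
C1 = A[:, rng.choice(300, size=10, replace=False)]   # previously selected columns
C2 = adaptive_sample(A, C1, c=20, rng=rng)
print(C2.shape)                                      # (1000, 20)
```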
Lemma 3 ([15,17]). Given matrices $\mathbf{A} \in \mathbb{R}^{m\times n}$, $\mathbf{C} \in \mathbb{R}^{m\times c}$ and $\mathbf{R} \in \mathbb{R}^{r\times n}$, suppose that $\mathbf{S}_{C} \in \mathbb{R}^{s_{c}\times m}$ is a leverage-score sketching matrix for $\mathbf{C}$, sampling $s_{c}$ rows, and $\mathbf{S}_{R} \in \mathbb{R}^{s_{r}\times n}$ is a leverage-score sketching matrix for $\mathbf{R}^{T}$, so that $\mathbf{S}_{R}^{T}$ samples $s_{r}$ columns. Let
$$\hat{\mathbf{U}} = \underset{\mathbf{U}}{\arg\min}\,\|\mathbf{S}_{C}(\mathbf{A} - \mathbf{C}\mathbf{U}\mathbf{R})\mathbf{S}_{R}^{T}\|_{F} = (\mathbf{S}_{C}\mathbf{C})^{\dagger}(\mathbf{S}_{C}\mathbf{A}\mathbf{S}_{R}^{T})(\mathbf{R}\mathbf{S}_{R}^{T})^{\dagger};$$
then we can obtain
$$\|\mathbf{A} - \mathbf{C}\hat{\mathbf{U}}\mathbf{R}\|_{F} \le (1+\varepsilon)\,\min_{\mathbf{U}}\|\mathbf{A} - \mathbf{C}\mathbf{U}\mathbf{R}\|_{F}$$
with constant probability. The number of sampled rows in Lemma 3 is independent of the input dimensions of $\mathbf{A}$ and is linear in $c$. By losing some accuracy, a much faster algorithm can therefore be implemented.
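A small numerical check of the sketched regression in Lemma 3 (our own illustration, with arbitrary sketch sizes): leverage-score sampling is applied to the rows through $\mathbf{C}$ and to the columns through $\mathbf{R}$, and the sketched minimizer is compared against the exact one $\mathbf{C}^{\dagger}\mathbf{A}\mathbf{R}^{\dagger}$.

```python
import numpy as np

def lev_sketch(M, s, rng):
    """Leverage-score sampling sketch (s x rows(M)): sample and rescale rows according to M's leverage scores."""
    Q, _ = np.linalg.qr(M)
    p = np.sum(Q**2, axis=1)
    p = p / p.sum()
    idx = rng.choice(M.shape[0], size=s, p=p)
    S = np.zeros((s, M.shape[0]))
    S[np.arange(s), idx] = 1.0 / np.sqrt(p[idx] * s)
    return S

rng = np.random.default_rng(6)
A = rng.standard_normal((1500, 1500))
C = A[:, :30]                           # a given column subset
R = A[:20, :]                           # a given row subset

SC = lev_sketch(C, 300, rng)            # sketches the rows of A according to C's leverage scores
SR = lev_sketch(R.T, 300, rng)          # sketches the columns of A according to R's row space

U_hat = np.linalg.pinv(SC @ C) @ (SC @ A @ SR.T) @ np.linalg.pinv(R @ SR.T)   # sketched solution
U_opt = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)                             # exact minimizer
ratio = np.linalg.norm(A - C @ U_hat @ R, 'fro') / np.linalg.norm(A - C @ U_opt @ R, 'fro')
print(round(ratio, 4))                  # slightly above 1: near-optimal at a fraction of the cost
```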
5. Empirical Study
In this section, we compare our Practical Nyström algorithm with the uniform+adaptive algorithm [11,19], the near-optimal+adaptive algorithm [4,11,13] and the conventional Nyström method using uniform sampling. All algorithms were implemented in Matlab, and the experiments were conducted on a workstation with 32 cores at 2 GHz and 24 GB of RAM.
On each data set, we report the approximation error and the execution time of each algorithm. The approximation error is defined as
$$\mathrm{error} = \frac{\|\mathbf{K} - \mathbf{C}\mathbf{U}\mathbf{C}^{T}\|_{F}}{\|\mathbf{K}\|_{F}},$$
where $\mathbf{U}$ is the intersection matrix defined in the Nyström method.
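The reported error can be computed with a helper like the following (our own; $\mathbf{C}$ and $\mathbf{U}$ stand for whichever sampled columns and intersection matrix the evaluated method produced).

```python
import numpy as np

def relative_error(K, C, U):
    """||K - C U C^T||_F / ||K||_F for a Nystrom-type approximation with intersection matrix U."""
    return np.linalg.norm(K - C @ U @ C.T, 'fro') / np.linalg.norm(K, 'fro')
```

Note that forming the dense residual is itself an O(n^2 c) operation, so for very large kernels the evaluation can dominate the cost of the approximation being measured.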
We test all of the algorithms on three data sets, which are listed in Table 1. We create an RBF kernel matrix $\mathbf{K}$ for each data set, with $K_{ij} = \exp(-\|\mathbf{x}_{i} - \mathbf{x}_{j}\|_{2}^{2}/\sigma^{2})$, where $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are data instances and $\sigma$ is the parameter of the RBF kernel function (a construction sketch is given after this paragraph). By the definition of $\mathbf{K}$, its size $n$ is the number of instances of the data set; thus, the kernel matrices in our experiments are large. We set a different value of $\sigma$ for each data set, as Table 1 describes; however, the effectiveness of our algorithm does not depend on the setting of $\sigma$. For each data set, we set the target rank $k$ to a few different values, including 50. We sampled $c = ak$ columns from $\mathbf{K}$, where $a$ ranges from 8 to 26. We ran each algorithm 5 times and report the average approximation error and running time. All results are illustrated in Figure 1, Figure 2 and Figure 3.
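The RBF kernel matrices described above can be formed as below; note that we assume the parameterization $K_{ij} = \exp(-\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}/\sigma^{2})$ stated earlier, and the data matrix here is synthetic.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """Dense RBF kernel: K_ij = exp(-||x_i - x_j||^2 / sigma^2), for an n x d data matrix X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / sigma**2)   # clip tiny negatives from round-off

X = np.random.default_rng(7).standard_normal((2000, 16))   # stand-in for a real data set
K = rbf_kernel_matrix(X, sigma=2.0)
print(K.shape, float(K[0, 0]))                             # (2000, 2000); diagonal entries equal 1
```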
As evidenced by the empirical results in the figures, our approach is efficient. In terms of accuracy, our approach is comparable to the state-of-the-art near-optimal+adaptive algorithm [4,11,13]. As to the running time, our approach is much faster than the near-optimal+adaptive algorithm and the uniform+adaptive algorithm, and its running time also grows more slowly than theirs. The running-time advantage of our algorithm grows as the dimension of the kernel matrix $\mathbf{K}$ increases. On the kernel matrix computed from the 'PenDigits' data set, our algorithm is twice as fast as the near-optimal+adaptive algorithm. On the 'a9a' data set of 32,561 instances, our algorithm is four times faster than near-optimal+adaptive. In addition, as $c$ increases, the running-time superiority of our algorithm also increases. Our algorithm has a similar advantage over the uniform+adaptive algorithm. Hence, our algorithm is well suited to scaling to kernel matrices of high dimensions.