Article

Acceleration of Approximate Matrix Multiplications on GPUs

Department of Information Physics and Computing, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8656, Japan
*
Author to whom correspondence should be addressed.
Entropy 2023, 25(8), 1130; https://doi.org/10.3390/e25081130
Submission received: 17 July 2023 / Revised: 25 July 2023 / Accepted: 25 July 2023 / Published: 27 July 2023
(This article belongs to the Section Multidisciplinary Applications)

Abstract

Matrix multiplication is important in various information-processing applications, including the computation of eigenvalues and eigenvectors, and in combinatorial optimization algorithms. Therefore, reducing the computation time of matrix products is essential to speed up scientific and practical calculations. Several approaches have been proposed to speed up this process, including GPUs, fast matrix multiplication libraries, custom hardware, and efficient approximate matrix multiplication (AMM) algorithms. However, research to date has yet to focus on accelerating AMMs for general matrices on GPUs, despite the potential of GPUs to perform fast and accurate matrix product calculations. In this paper, we propose a method for improving Monte Carlo AMMs. We also give an analytical solution for the optimal values of the hyperparameters in the proposed method. The proposed method improves the approximation of the matrix product without increasing the computation time compared to the conventional AMMs. It is also designed to work well with parallel operations on GPUs and can be incorporated into various algorithms. Finally, the proposed method is applied to a power method used for eigenvalue computation. We demonstrate that, on an NVIDIA A100 GPU, the computation time can be halved compared to the conventional power method using cuBLAS.

1. Introduction

Computing matrix products is a fundamental and essential operation in various information-processing tasks [1,2,3]. For example, one can obtain the largest eigenvalue and eigenvector of a positive definite matrix A by repeating the following calculations:
$$\mathbf{y} \leftarrow A\mathbf{x}, \qquad \mathbf{x} \leftarrow \frac{\mathbf{y}}{\|\mathbf{y}\|}. \qquad (1)$$
In the power method, the computation of the matrix product is responsible for most of the computational burden. The same applies to optimization algorithms [4,5,6]. Therefore, it is crucial to reduce the computation time of matrix products to speed up various scientific and practical calculations.
A well-known method for accelerating matrix multiplication is to use GPUs. Other research has focused on implementing fast matrix multiplication libraries [7,8,9], designing custom hardware to accelerate specific classes of matrix multiplication [10,11,12], accelerating distributed matrix multiplication [13,14,15], designing exact and low-computation-order algorithms based on the divide-and-conquer method [16,17,18], and designing efficient approximate matrix multiplication (AMM) algorithms [19,20,21]. These areas are still active research fields. Recently, an AI-powered method that utilizes reinforcement learning to discover matrix arithmetic algorithms that are computationally less expensive than existing methods has been reported [22], building upon a previous study by Strassen [16].
For practical purposes, the exact result of the matrix product is not always required. In this paper, we propose a method for such cases. It is known that when only an approximate result of the matrix product is required, AMM methods can reduce the computation time far more than other matrix acceleration approaches. MADDNESS, proposed in [20], provides high-speed approximate results. The paper demonstrated that MADDNESS, when executed on a CPU, can reduce the computation time by a factor of 100 compared to the exact matrix product while retaining sufficient computational accuracy to enable neural network inference. Allowing approximate calculations thus enables MADDNESS to achieve an overwhelming speed-up compared with other matrix optimization approaches. However, GPU computation can typically perform matrix products two orders of magnitude faster than CPU computation. Therefore, MADDNESS is practical for running neural networks in environments like the Internet of Things (IoT), where power consumption and installation space limitations make GPUs infeasible. On the other hand, if GPUs are available and there are no such limitations, obtaining the exact result by GPU computation rather than an approximate result by MADDNESS on a CPU is preferable.
A significant body of research proposes custom hardware to improve matrix multiplication performance, often with the goal of reducing energy consumption [23,24,25,26,27,28]. Existing hardware research has introduced various methods to reduce the energy consumption of artificial intelligence (AI) algorithms while maintaining acceptable accuracy. One approach involves approximate computing, which trades computational accuracy for improved energy efficiency; this strategy is particularly effective for AI algorithms that exhibit inherent error tolerance. Several papers have proposed approximate multipliers, which simplify multiplication operations and reduce energy consumption. Another approach uses logarithmic multipliers, which also simplify multiplication while sacrificing some accuracy. This research significantly advances energy-efficient computing techniques tailored to AI algorithms, a central and pertinent topic within computer science. Our proposed method pertains to the approximate matrix multiplication algorithm and does not reduce the power consumption of a single FMA operation. However, it can potentially reduce the overall computation time of an application, thus making a valuable contribution to energy conservation.
To the best of our knowledge, no prior research other than the method for sparse matrices [29] has endeavored to accelerate approximate computations on GPUs, despite their capability for fast and precise computation of matrix products. Computations that can be efficiently accelerated by GPUs typically involve a large number of parallel executions of simple operations with regular memory access patterns. MADDNESS proposes an alternative approach to matrix multiplication by using hashing trees for table look-ups, which is considered unsuitable for GPU implementation. In addition, FD-AMM [30] incorporates the frequent direction algorithm [31] into AMM. However, the previous study [30] has reported that FD-AMM requires an SVD iteration to generate a sketch matrix whenever one of the multiplication matrices changes, resulting in longer computation times than existing methods. Thus, traditional AMM algorithms, which lack design considerations for parallel processing efficiency on GPUs, cannot guarantee the expected acceleration simply by implementing them on GPUs. Nevertheless, given that GPUs are already employed to expedite matrix products in numerous applications, it is also advantageous to leverage such devices for approximate calculations. Accordingly, we identified a known Monte Carlo AMM that can exploit the parallel computing capabilities of GPUs. Additionally, we devised an algorithm to enhance the approximation performance of this extant method while minimizing the increase in computational complexity. Our numerical experiments demonstrate that the proposed approach can expedite the computation of eigenvalues, which is crucial in both scientific and engineering domains.

2. Proposed Methods

2.1. Theory of Monte Carlo AMM

In this section, we present the theory behind the Monte Carlo method [32] used to compute the approximate matrix multiplication (AMM) of matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$. We denote the $l$th column and the $l$th row of a real matrix $M$ by $M^{(l)}$ and $M_{(l)}$, respectively. The Frobenius norm of the matrix $M$ is denoted by $\|M\|$, defined as the square root of the sum of the squares of all elements in the matrix. Furthermore, the Euclidean norm of a vector $\mathbf{v}$ is denoted as $\|\mathbf{v}\|$.
Algorithm 1 presents the pseudocode for the Monte Carlo AMM. It is a randomized algorithm that returns an approximate solution Z that is close to the exact solution AB with high probability. The expected square Frobenius error between AB and Z is evaluated using the following equation:
$$\mathbb{E}\left[\|AB - Z\|^2\right] = \frac{1}{c}\left(\sum_{l=1}^{k}\frac{\|A^{(l)}\|^2\,\|B_{(l)}\|^2}{p_l} - \|AB\|^2\right). \qquad (2)$$
Algorithm 1:  Monte Carlo AMM [32]
Require:  $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, and $c \in \mathbb{N}$
Ensure:  $Z \in \mathbb{R}^{m \times n}$ such that $\|AB - Z\|$ is small
  1:   $Z \leftarrow O$
  2:  Set $p_1, \dots, p_k$ such that $\sum_{l=1}^{k} p_l = 1$
  3:  for  $t = 1, \dots, c$   do
  4:        Pick $i_t$ with $\mathbb{P}[i_t = l] = p_l$
  5:        $Z \leftarrow Z + \dfrac{A^{(i_t)} B_{(i_t)}}{c\, p_{i_t}}$
  6:  end for
It has been proven that this error is minimized when the probability $p_l$ of sampling the $l$th column/row pair is set to its optimal value, defined as
$$p_l^{\mathrm{opt}} := \frac{\|A^{(l)}\|\,\|B_{(l)}\|}{\sum_{l'=1}^{k}\|A^{(l')}\|\,\|B_{(l')}\|}. \qquad (3)$$
In this case, the error is given by
$$\mathbb{E}\left[\|AB - Z\|^2\right]\Big|_{p_l = p_l^{\mathrm{opt}}} = \frac{1}{c}\left(\left(\sum_{l=1}^{k}\|A^{(l)}\|\,\|B_{(l)}\|\right)^{2} - \|AB\|^2\right). \qquad (4)$$
The identity $AB = \sum_{l=1}^{k} A^{(l)} B_{(l)}$ and Equation (3) imply that Algorithm 1 with the optimal Monte Carlo sampling is designed to assign a higher probability to columns/rows with larger Frobenius norms during sampling and to compute their product as an approximate solution.
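As an illustration of this sampling scheme, the following NumPy sketch mirrors Algorithm 1 with the optimal probabilities of Equation (3). It is a CPU-level reference implementation under our own naming; the experiments in Section 3 use a CUDA implementation instead.

```python
import numpy as np

def monte_carlo_amm(A, B, c, p=None, rng=None):
    """Sketch of Algorithm 1 (Monte Carlo AMM [32]).

    Samples c column/row index pairs with probabilities p and returns an
    unbiased estimate Z of the product A @ B.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = A.shape[1]
    if p is None:
        # Optimal probabilities of Equation (3): proportional to the product
        # of the column norms of A and the row norms of B.
        p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
        p = p / p.sum()
    Z = np.zeros((A.shape[0], B.shape[1]))
    idx = rng.choice(k, size=c, p=p)          # pick i_t with P[i_t = l] = p_l
    for i in idx:
        Z += np.outer(A[:, i], B[i, :]) / (c * p[i])
    return Z
```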

2.2. AMM with Preprocessing: Proposed Methods

A straightforward method to reduce the expected squared error is to increase the number of sampling iterations c. The error decreases inversely with the number of iterations. However, increasing c results in a longer computation time for the Monte Carlo AMM. Therefore, in this paper, we propose a method to reduce the expected squared error without significantly increasing the computation time by further exploiting the information used in the conventional AMM.
Let $\alpha_i$ and $\beta_i$ be the mean of the elements of $A^{(i)}$ and $B_{(i)}$, respectively. We define the following two matrices
$$X := A - \mathbf{e}_m(\boldsymbol{\alpha} + \mathbf{x})^{\top}, \qquad Y := B - (\boldsymbol{\beta} + \mathbf{y})\,\mathbf{e}_n^{\top}, \qquad (5)$$
where $\boldsymbol{\alpha} := [\alpha_1, \dots, \alpha_k]^{\top}$, $\boldsymbol{\beta} := [\beta_1, \dots, \beta_k]^{\top}$, $\mathbf{x} := [x_1, \dots, x_k]^{\top}$, $\mathbf{y} := [y_1, \dots, y_k]^{\top}$, and $\mathbf{e}_l$ is an $l$-dimensional vector whose elements are all one. That is, we obtain $X$ and $Y$ by subtracting the constant value $\alpha_i + x_i$ from the $i$th column of $A$ and $\beta_i + y_i$ from the $i$th row of $B$. Then,
$$AB = \bigl(X + \mathbf{e}_m(\boldsymbol{\alpha}+\mathbf{x})^{\top}\bigr)\bigl(Y + (\boldsymbol{\beta}+\mathbf{y})\mathbf{e}_n^{\top}\bigr) = XY + X(\boldsymbol{\beta}+\mathbf{y})\mathbf{e}_n^{\top} + \mathbf{e}_m(\boldsymbol{\alpha}+\mathbf{x})^{\top}Y + \mathbf{e}_m(\boldsymbol{\alpha}+\mathbf{x})^{\top}(\boldsymbol{\beta}+\mathbf{y})\mathbf{e}_n^{\top}. \qquad (6)$$
The second, third, and fourth terms in Equation (6) can be computed using matrix–vector and inner products instead of matrix–matrix products. By leveraging this fact, these terms can be calculated exactly with few computations for given matrices $A$, $B$, $\mathbf{x}$, and $\mathbf{y}$. Consequently, we utilize this approach and evaluate an AMM of the matrices $X$ and $Y$ instead of directly using $A$ and $B$. The expected error of this approximation then relies solely on the error of $XY$. To simplify the notation, we define $\bar{A} := A - \mathbf{e}_m\boldsymbol{\alpha}^{\top}$ and $\bar{B} := B - \boldsymbol{\beta}\mathbf{e}_n^{\top}$. Notably, the matrices $\bar{A}$ and $\bar{B}$ are constant, and the crucial observation is that all column sums of $\bar{A}$ and all row sums of $\bar{B}$ are equal to zero.
Algorithm 2 presents the pseudocode of our proposed method. The only differences from the conventional method lie in the second and third lines. The second line involves only matrix–vector multiplications and an inner product, which are far less computationally intensive than a matrix–matrix multiplication. Furthermore, the third line allows parallel execution for each pair of $i$ and $j$. Consequently, the proposed method can be regarded as adding computationally inexpensive preprocessing to the existing Monte Carlo AMM. The algorithm computes an approximate solution to $AB$, as shown by the following calculation.
$$AB = \bar{A}\bar{B} + \bar{A}\boldsymbol{\beta}\mathbf{e}_n^{\top} + \mathbf{e}_m\boldsymbol{\alpha}^{\top}\bar{B} + \mathbf{e}_m\boldsymbol{\alpha}^{\top}\boldsymbol{\beta}\mathbf{e}_n^{\top} = \bar{A}\bar{B} + A\boldsymbol{\beta}\mathbf{e}_n^{\top} + \mathbf{e}_m\boldsymbol{\alpha}^{\top}B - \mathbf{e}_m\boldsymbol{\alpha}^{\top}\boldsymbol{\beta}\mathbf{e}_n^{\top} \qquad (7)$$
Algorithm 2: Proposed method: Monte Carlo AMM with preprocessing
Require:   $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, and $c \in \mathbb{N}$
Ensure:   $Z \in \mathbb{R}^{m \times n}$, such that $\|AB - Z\|$ is small
  1:   $Z \leftarrow O$
  2:   $\mathbf{v}_c \leftarrow A\boldsymbol{\beta}$, $\mathbf{v}_r \leftarrow \boldsymbol{\alpha}^{\top}B$, $w \leftarrow \boldsymbol{\alpha}^{\top}\boldsymbol{\beta}$
  3:   $Z_{ij} \leftarrow Z_{ij} - w + [\mathbf{v}_c]_i + [\mathbf{v}_r]_j$ for all $1 \le i \le m$ and $1 \le j \le n$
  4:  Set $p_1, \dots, p_k$ such that $\sum_{l=1}^{k} p_l = 1$
  5:  for  $t = 1, \dots, c$   do
  6:        Pick $i_t$ with $\mathbb{P}[i_t = l] = p_l$
  7:        $Z \leftarrow Z + \dfrac{\bar{A}^{(i_t)} \bar{B}_{(i_t)}}{c\, p_{i_t}}$
  8:  end for
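A minimal NumPy sketch of Algorithm 2 follows (again our own naming, with the optimal sampling probabilities assumed); the correction terms of Equation (7) are computed exactly, and only the centered product is sampled.

```python
import numpy as np

def monte_carlo_amm_preprocessed(A, B, c, rng=None):
    """Sketch of Algorithm 2: Monte Carlo AMM with mean-subtraction preprocessing."""
    rng = np.random.default_rng() if rng is None else rng
    k = A.shape[1]
    alpha = A.mean(axis=0)                    # column means of A
    beta = B.mean(axis=1)                     # row means of B
    A_bar = A - alpha[np.newaxis, :]          # A_bar = A - e_m alpha^T
    B_bar = B - beta[:, np.newaxis]           # B_bar = B - beta e_n^T

    # Lines 2-3 of Algorithm 2: exact low-cost terms of Equation (7).
    v_c = A @ beta                            # length-m vector
    v_r = alpha @ B                           # length-n vector
    w = alpha @ beta                          # scalar
    Z = v_c[:, None] + v_r[None, :] - w       # Z_ij = [v_c]_i + [v_r]_j - w

    # Lines 4-8: Monte Carlo sampling on the centered matrices.
    p = np.linalg.norm(A_bar, axis=0) * np.linalg.norm(B_bar, axis=1)
    p = p / p.sum()
    idx = rng.choice(k, size=c, p=p)
    for i in idx:
        Z += np.outer(A_bar[:, i], B_bar[i, :]) / (c * p[i])
    return Z
```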
Equation (7) can be regarded as the case of $\mathbf{x} = \mathbf{y} = \mathbf{0}$ in Equation (6). Remarkably, we can prove theoretically that the expected error of Algorithm 2 is equal to or less than that of Algorithm 1. Considering the discussion on Equation (4), we see that the expected squared error for computing $AB$ using Equation (6) and the combination of AMM with exact computations is proportional to
$$E := \left(\sum_{l=1}^{k}\|X^{(l)}\|\,\|Y_{(l)}\|\right)^{2} - \|XY\|^2 = \left(\sum_{l=1}^{k}\bigl\|\bar{A}^{(l)} - x_l\mathbf{e}_m\bigr\|\,\bigl\|\bar{B}_{(l)} - y_l\mathbf{e}_n^{\top}\bigr\|\right)^{2} - \bigl\|(\bar{A} - \mathbf{e}_m\mathbf{x}^{\top})(\bar{B} - \mathbf{y}\mathbf{e}_n^{\top})\bigr\|^2. \qquad (8)$$
The expected squared error is then $E/c$. Hereafter, our objective is to establish the superior accuracy of the proposed method compared to the original Monte Carlo AMM. To achieve this, we demonstrate through several theorems that the global minimum of $E$ occurs at $\mathbf{x} = \mathbf{y} = \mathbf{0}$.
Theorem 1.
Equation (8) is minimized at $x_i = 0$ if all elements in $\bar{A}^{(i)}$ are zero.
Proof. 
Suppose that all elements of $\bar{A}^{(i)}$ are zero for a certain $i$. Let $\mathbf{z} := [z_1, \dots, z_k]^{\top}$ be the vector such that
$$z_l = \begin{cases} x_l & (l \neq i) \\ 0 & (l = i). \end{cases} \qquad (9)$$
For simplicity, we introduce the following notations.
$$f_1(\mathbf{x}, \mathbf{y}) := \left(\sum_{l=1}^{k}\bigl\|\bar{A}^{(l)} - x_l\mathbf{e}_m\bigr\|\,\bigl\|\bar{B}_{(l)} - y_l\mathbf{e}_n^{\top}\bigr\|\right)^{2} \qquad (10)$$
$$f_2(\mathbf{x}, \mathbf{y}) := \bigl\|(\bar{A} - \mathbf{e}_m\mathbf{x}^{\top})(\bar{B} - \mathbf{y}\mathbf{e}_n^{\top})\bigr\|^2 \qquad (11)$$
Since we have $\bar{A}^{(i)} - z_i\mathbf{e}_m = \mathbf{0}$, the following equalities hold:
$$f_1(\mathbf{x}, \mathbf{y}) - f_1(\mathbf{z}, \mathbf{y}) = \left(\sum_{l}\|X^{(l)}\|\|Y_{(l)}\|\right)^{2} - \left(\sum_{l\neq i}\|X^{(l)}\|\|Y_{(l)}\|\right)^{2} = \|X^{(i)}\|^2\|Y_{(i)}\|^2 + 2\,\|X^{(i)}\|\|Y_{(i)}\|\sum_{l\neq i}\|X^{(l)}\|\|Y_{(l)}\|. \qquad (12)$$
In addition, we have
$$f_2(\mathbf{x}, \mathbf{y}) - f_2(\mathbf{z}, \mathbf{y}) = \Bigl\|\sum_{l} X^{(l)}Y_{(l)}\Bigr\|^2 - \Bigl\|\sum_{l\neq i} X^{(l)}Y_{(l)}\Bigr\|^2 = \Bigl\|\sum_{l\neq i} X^{(l)}Y_{(l)}\Bigr\|^2 + \bigl\|X^{(i)}Y_{(i)}\bigr\|^2 + 2\Bigl\langle\sum_{l\neq i} X^{(l)}Y_{(l)},\, X^{(i)}Y_{(i)}\Bigr\rangle - \Bigl\|\sum_{l\neq i} X^{(l)}Y_{(l)}\Bigr\|^2 = \bigl\|X^{(i)}Y_{(i)}\bigr\|^2 + 2\Bigl\langle\sum_{l\neq i} X^{(l)}Y_{(l)},\, X^{(i)}Y_{(i)}\Bigr\rangle, \qquad (13)$$
where $\langle\cdot,\cdot\rangle$ denotes the Frobenius inner product.
We can easily see that $\|\mathbf{v}\|^2\,\|\mathbf{w}\|^2 = \|\mathbf{v}\mathbf{w}^{\top}\|^2$ holds for vectors $\mathbf{v}$ and $\mathbf{w}$. Combining this equality, the Cauchy–Schwarz inequality, and the triangle inequality, we obtain
$$\Bigl\langle\sum_{l\neq i} X^{(l)}Y_{(l)},\, X^{(i)}Y_{(i)}\Bigr\rangle \leq \Bigl\|\sum_{l\neq i} X^{(l)}Y_{(l)}\Bigr\|\,\bigl\|X^{(i)}Y_{(i)}\bigr\| = \|X^{(i)}\|\,\|Y_{(i)}\|\,\Bigl\|\sum_{l\neq i} X^{(l)}Y_{(l)}\Bigr\| \leq \|X^{(i)}\|\,\|Y_{(i)}\|\sum_{l\neq i}\|X^{(l)}\|\,\|Y_{(l)}\|. \qquad (14)$$
From Equations (12)–(14), we have
$$f_1(\mathbf{x}, \mathbf{y}) - f_2(\mathbf{x}, \mathbf{y}) - f_1(\mathbf{z}, \mathbf{y}) + f_2(\mathbf{z}, \mathbf{y}) \geq 0. \qquad (15)$$
This inequality means that $E(\mathbf{x}, \mathbf{y}) \geq E(\mathbf{z}, \mathbf{y})$; hence, Equation (8) is minimized at $x_i = 0$ if all elements of $\bar{A}^{(i)}$ are zero.    □
If all the elements in $A^{(i)}$ are identical, then, by the definitions of $\alpha_i$ and $\bar{A}$, all elements in $\bar{A}^{(i)}$ are zero. Hence, applying Theorem 1, we can deduce that the value of $x_i$ that minimizes $E$ is $0$, which implies $\bar{A}^{(i)} - x_i\mathbf{e}_m = \mathbf{0}$. Combining this with Equation (8), we can then obtain the values of $x_l$ ($l \neq i$) that minimize $E$ from the submatrices obtained by removing the $i$th column from $A$ and the $i$th row from $B$.
Therefore, to minimize $E$, we can eliminate the $i$th column of $A$ and the $i$th row of $B$ if all of their elements are identical, and minimize the error $E$ in Equation (8) using the resulting submatrices. Hereafter, we denote the standard deviation of the elements in $A^{(i)}$ and $B_{(i)}$ by $\sigma_i$ and $\delta_i$, respectively, with the assumption that $\sigma_i > 0$ and $\delta_i > 0$ for all $i$.
Theorem 2.
The following equations hold for any $i$:
$$\left.\frac{\partial E}{\partial x_i}\right|_{\mathbf{x}=\mathbf{0},\,\mathbf{y}=\mathbf{0}} = 0, \qquad \left.\frac{\partial E}{\partial y_i}\right|_{\mathbf{x}=\mathbf{0},\,\mathbf{y}=\mathbf{0}} = 0. \qquad (16)$$
Proof. 
Let $[M]_{i,j}$ be the element in the $i$th row and $j$th column of matrix $M$. For matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, we have the following equalities:
$$\|AB\|^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\left(\sum_{l=1}^{k}[A]_{i,l}[B]_{l,j}\right)^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{l_1,l_2}[A]_{i,l_1}[A]_{i,l_2}[B]_{l_1,j}[B]_{l_2,j} = \sum_{l_1,l_2}\left(\sum_{i=1}^{m}[A]_{i,l_1}[A]_{i,l_2}\right)\left(\sum_{j=1}^{n}[B]_{l_1,j}[B]_{l_2,j}\right) = \sum_{l_1,l_2}\bigl\langle A^{(l_1)}, A^{(l_2)}\bigr\rangle\bigl\langle B_{(l_1)}, B_{(l_2)}\bigr\rangle. \qquad (17)$$
Using Equation (17), we can reformulate a term in Equation (8) as
$$\bigl\|(\bar{A} - \mathbf{e}_m\mathbf{x}^{\top})(\bar{B} - \mathbf{y}\mathbf{e}_n^{\top})\bigr\|^2 = \sum_{l_1,l_2}\bigl\langle\bar{A}^{(l_1)} - x_{l_1}\mathbf{e}_m,\ \bar{A}^{(l_2)} - x_{l_2}\mathbf{e}_m\bigr\rangle\bigl\langle\bar{B}_{(l_1)} - y_{l_1}\mathbf{e}_n^{\top},\ \bar{B}_{(l_2)} - y_{l_2}\mathbf{e}_n^{\top}\bigr\rangle. \qquad (18)$$
Since the sum of each column of $\bar{A}$ and each row of $\bar{B}$ is zero, $\bar{A}^{(l)\top}\mathbf{e}_m = 0$ and $\bar{B}_{(l)}\mathbf{e}_n = 0$ hold for any $l$. Applying this to the equality above, we have
$$\bigl\|(\bar{A} - \mathbf{e}_m\mathbf{x}^{\top})(\bar{B} - \mathbf{y}\mathbf{e}_n^{\top})\bigr\|^2 = \sum_{l_1,l_2}\Bigl(\bigl\langle\bar{A}^{(l_1)},\bar{A}^{(l_2)}\bigr\rangle + m\,x_{l_1}x_{l_2}\Bigr)\Bigl(\bigl\langle\bar{B}_{(l_1)},\bar{B}_{(l_2)}\bigr\rangle + n\,y_{l_1}y_{l_2}\Bigr) = \|\bar{A}\bar{B}\|^2 + \|\bar{A}\mathbf{y}\mathbf{e}_n^{\top}\|^2 + \|\mathbf{e}_m\mathbf{x}^{\top}\bar{B}\|^2 + \|\mathbf{e}_m\mathbf{x}^{\top}\mathbf{y}\mathbf{e}_n^{\top}\|^2 = \|\bar{A}\bar{B}\|^2 + n\|\bar{A}\mathbf{y}\|^2 + m\|\bar{B}^{\top}\mathbf{x}\|^2 + mn\,(\mathbf{x}^{\top}\mathbf{y})^2. \qquad (19)$$
Since the average of the elements in $\bar{A}^{(i)}$ is zero, we have $\sigma_i^2 = \sum_{l}[\bar{A}^{(i)}]_l^2/m$, where $[\mathbf{v}]_i$ denotes the $i$th element of vector $\mathbf{v}$. From this equality, we obtain
$$\bigl\|\bar{A}^{(i)} - x_i\mathbf{e}_m\bigr\| = \sqrt{\sum_{l=1}^{m}\bigl([\bar{A}^{(i)}]_l - x_i\bigr)^2} = \sqrt{\sum_{l=1}^{m}[\bar{A}^{(i)}]_l^2 + m\,x_i^2} = \sqrt{m}\,\sqrt{\sigma_i^2 + x_i^2}. \qquad (20)$$
In the same way, we have
$$\bigl\|\bar{B}_{(i)} - y_i\mathbf{e}_n^{\top}\bigr\| = \sqrt{n}\,\sqrt{\delta_i^2 + y_i^2}. \qquad (21)$$
Substituting Equations (19)–(21) into Equation (8), we obtain
$$E = mn\left(\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}\right)^{2} - \|\bar{A}\bar{B}\|^2 - n\|\bar{A}\mathbf{y}\|^2 - m\|\bar{B}^{\top}\mathbf{x}\|^2 - mn\,(\mathbf{x}^{\top}\mathbf{y})^2. \qquad (22)$$
Because we have $\|\bar{B}^{\top}\mathbf{x}\|^2 = \mathbf{x}^{\top}\bar{B}\bar{B}^{\top}\mathbf{x} = \sum_{l_1,l_2}[\bar{B}\bar{B}^{\top}]_{l_1,l_2}x_{l_1}x_{l_2}$, the following holds:
$$\frac{\partial\|\bar{B}^{\top}\mathbf{x}\|^2}{\partial x_i} = 2\sum_{j}[\bar{B}\bar{B}^{\top}]_{i,j}x_j = 2\,[\bar{B}\bar{B}^{\top}\mathbf{x}]_i. \qquad (23)$$
Thus,
$$\frac{\partial E}{\partial x_i} = 2mn\,x_i\sqrt{\frac{\delta_i^2 + y_i^2}{\sigma_i^2 + x_i^2}}\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)} - 2m\,[\bar{B}\bar{B}^{\top}\mathbf{x}]_i - 2mn\,y_i\,(\mathbf{x}^{\top}\mathbf{y}). \qquad (24)$$
In the same way, we have
$$\frac{\partial E}{\partial y_i} = 2mn\,y_i\sqrt{\frac{\sigma_i^2 + x_i^2}{\delta_i^2 + y_i^2}}\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)} - 2n\,[\bar{A}^{\top}\bar{A}\mathbf{y}]_i - 2mn\,x_i\,(\mathbf{x}^{\top}\mathbf{y}). \qquad (25)$$
By combining Equation (24) with Equation (25) and evaluating them at $\mathbf{x} = \mathbf{y} = \mathbf{0}$, we can derive Equation (16).    □
Theorem 3.
The largest eigenvalue of $MM^{\top}$ is less than or equal to $\|M\|^2$ for any real nonzero matrix $M$. The equality holds if the rank of $M$ is one.
Proof. 
Let $\lambda_i$ be the $i$th largest eigenvalue of $MM^{\top}$. Since $MM^{\top}$ is positive semidefinite, all of its eigenvalues are non-negative, and we have $\|M\|^2 = \operatorname{Tr}(MM^{\top}) = \sum_i\lambda_i \geq \lambda_1$. Therefore, the largest eigenvalue of $MM^{\top}$ is less than or equal to $\|M\|^2$.
The equality holds if and only if all eigenvalues other than the largest one are zero, which means that the rank of $MM^{\top}$ is one. Hereafter, we assume that the rank of $MM^{\top}$ is one. Let $r$ be the rank of $M \in \mathbb{R}^{m \times n}$. We then have the singular value decomposition (SVD) of $M$ as $M = U\Sigma V^{\top}$, where $U \in \mathbb{R}^{m \times r}$, $\Sigma \in \mathbb{R}^{r \times r}$, and $V \in \mathbb{R}^{n \times r}$. Because the columns of $V$ are orthonormal and $\Sigma$ is a diagonal matrix, $MM^{\top} = U\Sigma^2 U^{\top}$ holds. This is nothing but the eigendecomposition of $MM^{\top}$, and it implies that the rank of $MM^{\top}$ is $r$ as well. Since the rank of $MM^{\top}$ is one, we have $r = 1$; thus, the rank of $M$ is one.    □
Theorem 4.
The following equation holds for any combination of $i$ and $j$:
$$\left.\frac{\partial^2 E}{\partial x_i\,\partial y_j}\right|_{\mathbf{x}=\mathbf{0},\,\mathbf{y}=\mathbf{0}} = 0. \qquad (26)$$
Proof. 
From Equation (24), we have
$$\frac{\partial^2 E}{\partial x_i\,\partial y_i} = \frac{2mn\,x_i y_i}{\sqrt{(\sigma_i^2 + x_i^2)(\delta_i^2 + y_i^2)}}\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)} - 2mn\,\mathbf{x}^{\top}\mathbf{y}. \qquad (27)$$
Furthermore,
$$\frac{\partial^2 E}{\partial x_i\,\partial y_j} = 2mn\,x_i y_j\sqrt{\frac{\delta_i^2 + y_i^2}{\sigma_i^2 + x_i^2}}\cdot\sqrt{\frac{\sigma_j^2 + x_j^2}{\delta_j^2 + y_j^2}} - 2mn\,x_j y_i \qquad (28)$$
holds for any $i \neq j$. Both expressions vanish at $\mathbf{x} = \mathbf{y} = \mathbf{0}$; therefore, we have Equation (26).    □
Theorem 5.
The Hessian matrices of $E$ with respect to $\mathbf{x}$ and $\mathbf{y}$, namely $\nabla^2_{\mathbf{xx}}E$ and $\nabla^2_{\mathbf{yy}}E$, are positive semidefinite at $\mathbf{x} = \mathbf{y} = \mathbf{0}$.
Proof. 
We define $f(x_i) := \sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}$, regarded as a function of $x_i$, for simplicity. Then, we have
$$\frac{\partial}{\partial x_i}\left(f(x_i)\,x_i\sqrt{\frac{\delta_i^2 + y_i^2}{\sigma_i^2 + x_i^2}}\right) = f(x_i)\,\frac{\partial}{\partial x_i}\left(x_i\sqrt{\frac{\delta_i^2 + y_i^2}{\sigma_i^2 + x_i^2}}\right) + x_i\sqrt{\frac{\delta_i^2 + y_i^2}{\sigma_i^2 + x_i^2}}\,\frac{\partial f(x_i)}{\partial x_i} = f(x_i)\,\frac{\sigma_i^2\sqrt{\delta_i^2 + y_i^2}}{(\sigma_i^2 + x_i^2)^{3/2}} + \frac{x_i^2(\delta_i^2 + y_i^2)}{\sigma_i^2 + x_i^2}. \qquad (29)$$
Moreover, the following equalities hold:
$$\frac{\partial}{\partial x_i}[\bar{B}\bar{B}^{\top}\mathbf{x}]_i = \frac{\partial}{\partial x_i}\sum_{j}[\bar{B}\bar{B}^{\top}]_{i,j}x_j = [\bar{B}\bar{B}^{\top}]_{i,i}, \qquad \frac{\partial}{\partial x_i}\bigl(y_i\,\mathbf{x}^{\top}\mathbf{y}\bigr) = y_i^2. \qquad (30)$$
Substituting these equalities into Equation (24), we can write $\partial^2 E/\partial x_i^2$ as
$$2mn\,\frac{\sigma_i^2\sqrt{\delta_i^2 + y_i^2}}{(\sigma_i^2 + x_i^2)^{3/2}}\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)} + 2mn\,\frac{x_i^2(\delta_i^2 + y_i^2)}{\sigma_i^2 + x_i^2} - 2m\,[\bar{B}\bar{B}^{\top}]_{i,i} - 2mn\,y_i^2. \qquad (31)$$
Since $[\bar{B}\bar{B}^{\top}]_{i,i}$ equals the squared Euclidean norm of $\bar{B}_{(i)}$, we have $[\bar{B}\bar{B}^{\top}]_{i,i} = n\delta_i^2$. Substituting this into Equation (31), we obtain
$$\frac{\partial^2 E}{\partial x_i^2} = 2mn\,\frac{\sigma_i^2}{\sigma_i^2 + x_i^2}\sqrt{\frac{\delta_i^2 + y_i^2}{\sigma_i^2 + x_i^2}}\sum_{l\neq i}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}. \qquad (32)$$
This equation implies that
$$\left.\frac{\partial^2 E}{\partial x_i^2}\right|_{\mathbf{x}=\mathbf{0},\,\mathbf{y}=\mathbf{0}} = 2mn\,\frac{\delta_i}{\sigma_i}\sum_{l\neq i}\sigma_l\delta_l. \qquad (33)$$
For $i \neq j$, we can simplify $\partial^2 E/\partial x_i\,\partial x_j$ as
$$2mn\,\sqrt{\frac{\delta_i^2 + y_i^2}{\sigma_i^2 + x_i^2}}\cdot\sqrt{\frac{\delta_j^2 + y_j^2}{\sigma_j^2 + x_j^2}}\;x_i x_j - 2m\,[\bar{B}\bar{B}^{\top}]_{i,j} - 2mn\,y_i y_j. \qquad (34)$$
This equation implies that
$$\left.\frac{\partial^2 E}{\partial x_i\,\partial x_j}\right|_{\mathbf{x}=\mathbf{0},\,\mathbf{y}=\mathbf{0}} = -2m\,[\bar{B}\bar{B}^{\top}]_{i,j}. \qquad (35)$$
Combining Equations (33) and (35), we can express the Hessian matrix of $E$ with respect to $\mathbf{x}$ at $\mathbf{x} = \mathbf{y} = \mathbf{0}$ as follows:
$$\left.\nabla^2_{\mathbf{xx}}E\right|_{\mathbf{x}=\mathbf{0},\,\mathbf{y}=\mathbf{0}} := \begin{bmatrix}\frac{\partial^2 E}{\partial x_1^2} & \frac{\partial^2 E}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 E}{\partial x_1\partial x_k}\\ \frac{\partial^2 E}{\partial x_1\partial x_2} & \frac{\partial^2 E}{\partial x_2^2} & \cdots & \frac{\partial^2 E}{\partial x_2\partial x_k}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 E}{\partial x_1\partial x_k} & \frac{\partial^2 E}{\partial x_2\partial x_k} & \cdots & \frac{\partial^2 E}{\partial x_k^2}\end{bmatrix} = 2m\left(n\sum_{l=1}^{k}\sigma_l\delta_l\begin{bmatrix}\frac{\delta_1}{\sigma_1} & & O\\ & \ddots & \\ O & & \frac{\delta_k}{\sigma_k}\end{bmatrix} - \bar{B}\bar{B}^{\top}\right) \qquad (36)$$
Let $Z$ be the diagonal and regular matrix $Z := \operatorname{diag}\bigl(\sqrt{\delta_1/\sigma_1}, \dots, \sqrt{\delta_k/\sigma_k}\bigr)$, so that $Z^2 = \operatorname{diag}(\delta_1/\sigma_1, \dots, \delta_k/\sigma_k)$. From Theorem 3, the largest eigenvalue of $Z^{-1}\bar{B}\bar{B}^{\top}(Z^{-1})^{\top}$ is less than or equal to $\|Z^{-1}\bar{B}\|^2 = \sum_{l=1}^{k}(\sigma_l/\delta_l)\|\bar{B}_{(l)}\|^2 = n\sum_{l=1}^{k}\sigma_l\delta_l$. Thus,
$$n\sum_{l=1}^{k}\sigma_l\delta_l\,I - Z^{-1}\bar{B}\bar{B}^{\top}(Z^{-1})^{\top} \succeq O, \qquad (37)$$
where $I$ is an identity matrix. Therefore, $n\sum_{l=1}^{k}\sigma_l\delta_l\,Z^2 - \bar{B}\bar{B}^{\top} \succeq O$ holds, which shows that $\nabla^2_{\mathbf{xx}}E$ is positive semidefinite at $\mathbf{x} = \mathbf{y} = \mathbf{0}$. In the same way, we can prove that $\nabla^2_{\mathbf{yy}}E$ is positive semidefinite at $\mathbf{x} = \mathbf{y} = \mathbf{0}$.    □
Theorem 6.
The function E is locally optimal at x = y = 0 .
Proof. 
The Hessian matrix of $E$ can be expressed as
$$\nabla^2 E := \begin{bmatrix}\nabla^2_{\mathbf{xx}}E & \nabla^2_{\mathbf{xy}}E\\ \nabla^2_{\mathbf{yx}}E & \nabla^2_{\mathbf{yy}}E\end{bmatrix}. \qquad (38)$$
Theorem 4 provides $\nabla^2_{\mathbf{xy}}E = \nabla^2_{\mathbf{yx}}E = O$ at $\mathbf{x} = \mathbf{y} = \mathbf{0}$. Furthermore, Theorem 5 claims that $\nabla^2_{\mathbf{xx}}E$ and $\nabla^2_{\mathbf{yy}}E$ are positive semidefinite at $\mathbf{x} = \mathbf{y} = \mathbf{0}$, which means that $\nabla^2 E$ is also positive semidefinite at $\mathbf{x} = \mathbf{y} = \mathbf{0}$. Based on this positive semidefiniteness and Theorem 2, the point $\mathbf{x} = \mathbf{y} = \mathbf{0}$ serves as a locally optimal solution for the minimization of $E$.    □
The preceding discussion demonstrates that the point x = y = 0 constitutes a local optimum of E. Hereafter, we establish that, in almost all cases, this point is a global optimum because E has only one local optimum, and, in other cases, the point is globally optimal as well because all local optima are connected.
Theorem 7.
Let $M$ be a rank-1 matrix and $\mathbf{v}$ be a vector satisfying $MM^{\top} = \mathbf{v}\mathbf{v}^{\top}$. The vector $\mathbf{v}$ is the only eigenvector of $MM^{\top}$ associated with a nonzero eigenvalue. The matrix $\|M\|^2 I - MM^{\top}$ has an eigenvalue of $0$, and only scaled versions of $\mathbf{v}$ are its eigenvectors for that eigenvalue.
Proof. 
As $\|M\|^2 = \operatorname{Tr}(MM^{\top}) = \operatorname{Tr}(\mathbf{v}\mathbf{v}^{\top}) = \|\mathbf{v}\|^2$ holds, we have $\bigl(\|M\|^2 I - MM^{\top}\bigr)\mathbf{v} = \|\mathbf{v}\|^2\mathbf{v} - \mathbf{v}\mathbf{v}^{\top}\mathbf{v} = \mathbf{0}$. Since $M$ is a rank-1 matrix, $MM^{\top}$ is a rank-1 matrix as well and has only one nonzero eigenvalue. Combining this with $\|M\|^2\mathbf{v} = MM^{\top}\mathbf{v}$, we can see that $\mathbf{v}$ is the unique eigenvector of $MM^{\top}$ associated with a nonzero eigenvalue. Moreover, based on the equality above, the matrix $\|M\|^2 I - MM^{\top}$ has an eigenvalue of $0$, and only scaled versions of $\mathbf{v}$ become its eigenvectors for that eigenvalue.    □
Theorem 8.
Let $M$ be a rank-1 matrix, and $\mathbf{v}$ be a vector satisfying $MM^{\top} = \mathbf{v}\mathbf{v}^{\top}$. Let $Z$ be a regular and diagonal matrix. The vectors $\mathbf{x}$ that satisfy $\mathbf{x}^{\top}\bigl(\|Z^{-1}M\|^2 Z^2 - MM^{\top}\bigr)\mathbf{x} = 0$ are scaled versions of $Z^{-2}\mathbf{v}$.
Proof. 
Since $M$ is a rank-1 matrix and $Z$ is regular, $Z^{-1}M$ is a rank-1 matrix as well. Let $Q := \|Z^{-1}M\|^2 I - (Z^{-1}M)(Z^{-1}M)^{\top}$, and let $\tilde{\mathbf{v}}$ be a vector satisfying $\tilde{\mathbf{v}}\tilde{\mathbf{v}}^{\top} = (Z^{-1}M)(Z^{-1}M)^{\top}$. Theorem 7 implies that $Q$ has an eigenvalue of $0$ and that only scaled versions of $\tilde{\mathbf{v}}$ are its eigenvectors for that eigenvalue. Because $(Z\tilde{\mathbf{v}})(Z\tilde{\mathbf{v}})^{\top} = MM^{\top}$ holds, we have $\mathbf{v} = Z\tilde{\mathbf{v}}$. Thus, a scaled version of $Z^{-1}\mathbf{v}$ is an eigenvector of $Q$ corresponding to the eigenvalue $0$. Therefore, the following equality holds for any real value of $c$:
$$Q\,\bigl(c\,Z^{-1}\mathbf{v}\bigr) = \mathbf{0}. \qquad (39)$$
We can easily see that $ZQZ\,\bigl(c\,Z^{-2}\mathbf{v}\bigr) = \mathbf{0}$ holds, which implies that the matrix $ZQZ$ has an eigenvalue of $0$ and that only scaled versions of $Z^{-2}\mathbf{v}$ are its eigenvectors for that eigenvalue. Moreover, we have the following equalities:
$$ZQZ = Z\bigl(\|Z^{-1}M\|^2 I - Z^{-1}MM^{\top}(Z^{-1})^{\top}\bigr)Z = \|Z^{-1}M\|^2 Z^2 - MM^{\top}. \qquad (40)$$
Therefore, the matrix $\|Z^{-1}M\|^2 Z^2 - MM^{\top}$ has an eigenvalue of $0$, and a scaled version of $Z^{-2}\mathbf{v}$ is one of its eigenvectors.
Theorem 3 claims that $Q$ is positive semidefinite, which implies that $ZQZ$ is positive semidefinite as well. Thus, a vector $\mathbf{x}$ satisfying $\mathbf{x}^{\top}\bigl(\|Z^{-1}M\|^2 Z^2 - MM^{\top}\bigr)\mathbf{x} = 0$ is the zero vector or a scaled version of an eigenvector corresponding to the eigenvalue $0$. Therefore, $\mathbf{x}$ is a scaled version of $Z^{-2}\mathbf{v}$.    □
Theorem 9.
The function E is globally optimal at  x = y = 0.
Proof. 
From Equation (24), we obtain $\nabla_{\mathbf{x}}E$ as
$$\nabla_{\mathbf{x}}E = 2mn\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}\begin{bmatrix}\sqrt{\frac{\delta_1^2 + y_1^2}{\sigma_1^2 + x_1^2}} & & O\\ & \ddots & \\ O & & \sqrt{\frac{\delta_k^2 + y_k^2}{\sigma_k^2 + x_k^2}}\end{bmatrix}\mathbf{x} - 2m\,\bar{B}\bar{B}^{\top}\mathbf{x} - 2mn\,(\mathbf{x}^{\top}\mathbf{y})\,\mathbf{y}. \qquad (41)$$
This implies that
$$\frac{\mathbf{x}^{\top}\nabla_{\mathbf{x}}E}{2m} = \mathbf{x}^{\top}\left(n\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}\begin{bmatrix}\sqrt{\frac{\delta_1^2 + y_1^2}{\sigma_1^2 + x_1^2}} & & O\\ & \ddots & \\ O & & \sqrt{\frac{\delta_k^2 + y_k^2}{\sigma_k^2 + x_k^2}}\end{bmatrix} - \bigl(\bar{B}\bar{B}^{\top} + n\,\mathbf{y}\mathbf{y}^{\top}\bigr)\right)\mathbf{x}. \qquad (42)$$
We now study the conditions that $\mathbf{x}$ must satisfy at a local solution of $E$. In that case, $\mathbf{x}^{\top}\nabla_{\mathbf{x}}E$ must be zero because $\nabla_{\mathbf{x}}E = \mathbf{0}$.
Let $\tilde{B} := [\bar{B}, \sqrt{n}\,\mathbf{y}] \in \mathbb{R}^{k \times (n+1)}$, so that $\tilde{B}\tilde{B}^{\top} = \bar{B}\bar{B}^{\top} + n\,\mathbf{y}\mathbf{y}^{\top}$, and let $Z$ be the diagonal and regular matrix
$$Z := \operatorname{diag}\left(\left(\frac{\delta_1^2 + y_1^2}{\sigma_1^2 + x_1^2}\right)^{\frac{1}{4}}, \dots, \left(\frac{\delta_k^2 + y_k^2}{\sigma_k^2 + x_k^2}\right)^{\frac{1}{4}}\right). \qquad (43)$$
From Theorem 3, the largest eigenvalue of $Z^{-1}\tilde{B}\tilde{B}^{\top}(Z^{-1})^{\top}$ is less than or equal to
$$\|Z^{-1}\tilde{B}\|^2 = \sum_{l=1}^{k}\sqrt{\frac{\sigma_l^2 + x_l^2}{\delta_l^2 + y_l^2}}\,\bigl(\|\bar{B}_{(l)}\|^2 + n\,y_l^2\bigr) = \sum_{l=1}^{k}\sqrt{\frac{\sigma_l^2 + x_l^2}{\delta_l^2 + y_l^2}}\;n\,(\delta_l^2 + y_l^2) = n\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}. \qquad (44)$$
Thus, we have
$$n\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}\;I_k - Z^{-1}\tilde{B}\tilde{B}^{\top}(Z^{-1})^{\top} \succeq O. \qquad (45)$$
In particular, "$\succ$" holds if $\operatorname{rank}(\tilde{B}) > 1$.
(i)
$\operatorname{rank}(\bar{B}) > 1$, or $\operatorname{rank}(\bar{B}) = 1$ and $\mathbf{y}$ is not a scaled version of a column vector of $\bar{B}$
In this case, the rank of $\tilde{B}$ is greater than 1, so we have $n\sum_{l=1}^{k}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}\,Z^2 - \tilde{B}\tilde{B}^{\top} \succ O$. Combining this inequality with Equation (42), we see that $\mathbf{x}^{\top}\nabla_{\mathbf{x}}E = 0$ holds only at $\mathbf{x} = \mathbf{0}$. Therefore, if $\mathbf{x}$ is a local minimum of $E$, then $\nabla_{\mathbf{x}}E = \mathbf{0}$ holds and we have $\mathbf{x} = \mathbf{0}$. Since any local minimum is limited to $\mathbf{x} = \mathbf{0}$, it is also the global minimum.
(ii)
$\operatorname{rank}(\bar{B}) = 1$ and $\mathbf{y}$ is a scaled version of a column vector of $\bar{B}$
All rows of $\bar{B}$ are linearly dependent because $\operatorname{rank}(\bar{B}) = 1$. As $\mathbf{y}$ is a scaled version of a column vector of $\bar{B}$, we can express $\bar{B}$ as $\bar{B} = \mathbf{y}\mathbf{u}^{\top} = [y_1\mathbf{u}, \dots, y_k\mathbf{u}]^{\top}$ by introducing a vector $\mathbf{u} \in \mathbb{R}^{n}$. By definition, not all $y_i$ are zero.
Substituting Equation (44) into Equation (42), we have
$$\frac{\mathbf{x}^{\top}\nabla_{\mathbf{x}}E}{2m} = \mathbf{x}^{\top}\bigl(\|Z^{-1}\tilde{B}\|^2 Z^2 - \tilde{B}\tilde{B}^{\top}\bigr)\mathbf{x}. \qquad (46)$$
The right-hand side of Equation (46) is zero because $\mathbf{x}^{\top}\nabla_{\mathbf{x}}E = 0$. Theorem 8 claims that the vectors $\mathbf{x}$ satisfying $\mathbf{x}^{\top}\bigl(\|Z^{-1}\tilde{B}\|^2 Z^2 - \tilde{B}\tilde{B}^{\top}\bigr)\mathbf{x} = 0$ are scaled versions of $Z^{-2}\mathbf{v}$, where $\mathbf{v}\mathbf{v}^{\top} = \tilde{B}\tilde{B}^{\top}$. The equality $\tilde{B} = [\mathbf{y}\mathbf{u}^{\top}, \sqrt{n}\,\mathbf{y}]$ gives us $\tilde{B}\tilde{B}^{\top} = (\|\mathbf{u}\|^2 + n)\,\mathbf{y}\mathbf{y}^{\top}$, which means $\mathbf{v} = \sqrt{\|\mathbf{u}\|^2 + n}\;\mathbf{y}$. Thus,
$$x_i = c\,\sqrt{\frac{\sigma_i^2 + x_i^2}{\delta_i^2 + y_i^2}}\,\sqrt{\|\mathbf{u}\|^2 + n}\;y_i = c\,\sqrt{\sigma_i^2 + x_i^2}\,\frac{\sqrt{\|\mathbf{u}\|^2 + n}\;y_i}{\sqrt{y_i^2(\|\mathbf{u}\|^2 + n)/n}} = c'\sqrt{\sigma_i^2 + x_i^2} \qquad (47)$$
must be satisfied with a certain real value $c'$, where $c'$ absorbs $c$, $\sqrt{n}$, and the sign of $y_i$, and where we used $\delta_i^2 + y_i^2 = y_i^2(\|\mathbf{u}\|^2 + n)/n$. Solving this equation, we have $x_i = \pm c'\sigma_i/\sqrt{1 - c'^2}$. This means that if $\mathbf{x}$ is a scaled version of the vector $[\sigma_1, \dots, \sigma_k]^{\top}$, then $\nabla_{\mathbf{x}}E = \mathbf{0}$ holds. Because $\mathbf{x}$ can be $\mathbf{0}$, the points satisfying $\nabla_{\mathbf{x}}E = \mathbf{0}$ include $\mathbf{x} = \mathbf{0}$ and take the same value of $E$ as at $\mathbf{x} = \mathbf{0}$.
We now show that $\nabla^2_{\mathbf{xx}}E \succeq O$ if $\mathbf{x}$ is a scaled version of the vector $[\sigma_1, \dots, \sigma_k]^{\top}$. Substituting $\sigma_i^2/(\sigma_i^2 + x_i^2) = 1 - c'^2$ into Equation (32), we have
$$\frac{1}{2mn}\frac{\partial^2 E}{\partial x_i^2} = (1 - c'^2)\left(\sqrt{\frac{\delta_i^2 + y_i^2}{\sigma_i^2 + x_i^2}}\sum_{l}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)} - (\delta_i^2 + y_i^2)\right). \qquad (48)$$
We also obtain the following equality for $i \neq j$ by applying $x_i^2/(\sigma_i^2 + x_i^2) = c'^2$ to Equation (34):
$$\frac{1}{2mn}\frac{\partial^2 E}{\partial x_i\,\partial x_j} = c'^2\sqrt{(\delta_i^2 + y_i^2)(\delta_j^2 + y_j^2)} - \frac{[\bar{B}\bar{B}^{\top}]_{i,j} + n\,y_i y_j}{n}. \qquad (49)$$
Summarizing Equations (48) and (49) and the equality $[\bar{B}\bar{B}^{\top}]_{i,i} + n\,y_i^2 = n(\delta_i^2 + y_i^2)$, we can simplify $\nabla^2_{\mathbf{xx}}E$ as follows:
$$\frac{\nabla^2_{\mathbf{xx}}E}{2mn} = (1 - c'^2)\left(\sum_{l}\sqrt{(\sigma_l^2 + x_l^2)(\delta_l^2 + y_l^2)}\begin{bmatrix}\sqrt{\frac{\delta_1^2 + y_1^2}{\sigma_1^2 + x_1^2}} & & O\\ & \ddots & \\ O & & \sqrt{\frac{\delta_k^2 + y_k^2}{\sigma_k^2 + x_k^2}}\end{bmatrix} - \frac{\tilde{B}\tilde{B}^{\top}}{n}\right) + c'^2\left(\begin{bmatrix}\sqrt{\delta_1^2 + y_1^2}\\ \vdots\\ \sqrt{\delta_k^2 + y_k^2}\end{bmatrix}\begin{bmatrix}\sqrt{\delta_1^2 + y_1^2} & \cdots & \sqrt{\delta_k^2 + y_k^2}\end{bmatrix} - \frac{\tilde{B}\tilde{B}^{\top}}{n}\right) \qquad (50)$$
The assumption on $\bar{B}$ and the definition of $\delta_i$ give $\delta_i^2 + y_i^2 = y_i^2(\|\mathbf{u}\|^2 + n)/n$ for all $i$. Thus,
$$\begin{bmatrix}\sqrt{\delta_1^2 + y_1^2}\\ \vdots\\ \sqrt{\delta_k^2 + y_k^2}\end{bmatrix}\begin{bmatrix}\sqrt{\delta_1^2 + y_1^2} & \cdots & \sqrt{\delta_k^2 + y_k^2}\end{bmatrix} - \frac{\tilde{B}\tilde{B}^{\top}}{n} \succeq O. \qquad (51)$$
The discussion regarding the positive semidefiniteness of Equation (36) also demonstrates that the first term on the right-hand side of Equation (50) is positive semidefinite. Therefore, $\nabla^2_{\mathbf{xx}}E \succeq O$ holds if $\mathbf{x}$ is a scaled version of the vector $[\sigma_1, \dots, \sigma_k]^{\top}$.
From this discussion of the conditions that $\mathbf{x}$ must satisfy at a local solution of $E$ when $\mathbf{y}$ is held constant, the following observations emerge:
  • If $\mathbf{x}$ is a local solution, it must be a scaled version of the vector $[\sigma_1, \dots, \sigma_k]^{\top}$.
  • All local solutions form a connected set and have the same value of $E$.
  • $\nabla^2_{\mathbf{xx}}E \succeq O$ holds if $\mathbf{x}$ is a scaled version of the vector $[\sigma_1, \dots, \sigma_k]^{\top}$.
In other words, the set of local solutions attains the global minimum, and $\mathbf{x} = \mathbf{0}$ is also a global minimum.
Both cases (i) and (ii) yield the conclusion that $\mathbf{x} = \mathbf{0}$ holds when $E$ is minimized under a fixed $\mathbf{y}$. Similar observations hold when investigating the conditions on $\mathbf{y}$ at a local solution of $E$. Hence, the function $E$ attains its global optimum at $\mathbf{x} = \mathbf{y} = \mathbf{0}$.    □

3. Results

3.1. Assessing the Accuracy of Approximate Matrix Multiplications

We evaluate the approximation performance of the proposed method for matrix multiplication. First, we set the size of the matrices to $m = n = k = 2^{15}$. Then, each element of the matrices is drawn independently from a uniform distribution on $[0, 1]$.
Figure 1A evaluates the speedup ratio of the computation time and the approximation performance obtained by applying the conventional and proposed methods for AMM. The horizontal axis compares the computation time of the approximate calculation with that of the exact matrix product calculation. In the experiments of this section, we implemented the CUDA program for AMM using CUDA 12.0 and executed it on an NVIDIA A100 GPU. The exact matrix product was computed using an API from cuBLAS (NVIDIA's official matrix library). The numerical precision was set to single-precision floating point. The vertical axis in this figure represents the median relative accuracy between an approximate matrix $Z$ and the exact solution $AB$, obtained by executing each method 100 times. The relative accuracy is defined as
$$1 - \frac{\|AB - Z\|}{\|AB\|}. \qquad (52)$$
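As a small illustration of this metric, the snippet below evaluates the relative accuracy of the two AMM sketches from Section 2 (reusing the monte_carlo_amm and monte_carlo_amm_preprocessed functions defined there) on random matrices. It is a CPU toy check under our own naming, not the GPU benchmark reported in this section.

```python
import numpy as np

def relative_accuracy(AB_exact, Z):
    """Relative accuracy of Equation (52): 1 - ||AB - Z|| / ||AB||."""
    return 1.0 - np.linalg.norm(AB_exact - Z) / np.linalg.norm(AB_exact)

rng = np.random.default_rng(0)
m = n = k = 1024                 # the paper uses 2**15 on an NVIDIA A100 GPU
A = rng.random((m, k))
B = rng.random((k, n))
AB = A @ B
c = 64                           # number of sampled column/row pairs
print(relative_accuracy(AB, monte_carlo_amm(A, B, c, rng=rng)))
print(relative_accuracy(AB, monte_carlo_amm_preprocessed(A, B, c, rng=rng)))
```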
Figure 1A demonstrates that the proposed method achieves higher acceleration while maintaining relative accuracy. For instance, at a relative accuracy of 0.95, the acceleration rate is 87.2 times faster for the conventional method and 443.2 times faster for the proposed method. Although the relative accuracy required depends on the application, the proposed method outperforms the traditional method at all levels of relative accuracy.
Figure 1B displays the trend of the speedup ratio required to achieve a relative accuracy of 0.95 when the matrix size is changed from $2^{10}$ to $2^{15}$. This demonstrates that the proposed method achieves higher speedup ratios than the conventional method regardless of the matrix size. Furthermore, the difference in the speedup ratio widens as the matrix size increases. If the matrix size is small, the exact product generally does not become the bottleneck of the entire calculation; the approximate product is required mainly for large-scale computations. The property of the proposed method that its advantage becomes more pronounced as the matrix size increases is therefore desirable.

3.2. Application to a Practical Problem: Computing Eigenvalues and Eigenvectors

Next, we evaluate the performance of the proposed methodology in practical applications. We focus on the computation of eigenvalues, particularly the largest eigenvalue. Several algorithms have been proposed for computing the largest eigenvalue, including power iteration, inverse power iteration [33], accelerated stochastic power iteration [34], and the LOBPCG method [35]. This paper applies the proposed AMM to the power iteration, which forms the backbone of many existing techniques.
We assume for simplicity that the matrix $A \in \mathbb{R}^{m \times m}$ is symmetric. Then, all eigenvalues of $A$ are real. See Algorithm 3. For any square matrix $A$, if $\lambda$ is an eigenvalue of $A$, then $A^2$ has the eigenvalue $\lambda^2$. Therefore, we substitute $A$ with $A^2$ in Algorithm 3.
Algorithm 3:  Power method
Require:   $A \in \mathbb{R}^{m \times m}$, $\mathbf{x}_0$: $m$-dimensional randomized unit vector.
Ensure:   $\lambda$: the eigenvalue of $A$ with the largest absolute value
  1:  while not converged do
  2:     $\mathbf{y} \leftarrow A\mathbf{x}$
  3:     $\lambda \leftarrow \mathbf{x}^{\top}\mathbf{y}$
  4:     $\mathbf{x} \leftarrow \mathbf{y}/\|\mathbf{y}\|$
  5:  end while
  6:  Return $\lambda$
In this case, the second line is expressed as $\mathbf{y} \leftarrow A^2\mathbf{x}$, and we apply the proposed AMM to this step. From Equation (7), we have
$$A^2 = AA = \bar{A}\bar{A}^{\top} + A\boldsymbol{\alpha}\mathbf{e}_m^{\top} + \mathbf{e}_m(A\boldsymbol{\alpha})^{\top} - \|\boldsymbol{\alpha}\|^2\,\mathbf{e}_m\mathbf{e}_m^{\top}, \qquad (53)$$
where $\bar{A} := A - \mathbf{e}_m\boldsymbol{\alpha}^{\top}$ and $\boldsymbol{\alpha}$ is the vector of column means of $A$, which equal its row means because $A$ is symmetric.
We execute the calculation of $\bar{A}\bar{A}^{\top}$ in Equation (53) using Algorithm 2. By incorporating this AMM into Algorithm 3, we obtain Algorithm 4, a power iteration algorithm that utilizes the approximation. The fourth to eighth lines of Algorithm 4 correspond to the Monte Carlo AMM with the proposed method.
Algorithm 4: Power method with the proposed Monte Carlo AMM
Require:  $A$: $m \times m$ symmetric positive-semidefinite matrix, $\mathbf{x}_0$: $m$-dimensional randomized unit vector, $c \in \mathbb{N}$
Ensure:   $\lambda$: the largest absolute eigenvalue of $A$
  1:   $\mathbf{z} \leftarrow A\boldsymbol{\alpha}$, $\mathbf{v} \leftarrow \mathbf{z} - \|\boldsymbol{\alpha}\|^2\mathbf{e}_m$
  2:  Set $p_1, \dots, p_m$, such that $\sum_{l=1}^{m} p_l = 1$
  3:  while not converged do
  4:     $\mathbf{y} \leftarrow (\mathbf{e}_m^{\top}\mathbf{x})\,\mathbf{v} + (\mathbf{z}^{\top}\mathbf{x})\,\mathbf{e}_m$
  5:    for  $t = 1, \dots, c$  do
  6:        Pick $i_t$ with $\mathbb{P}[i_t = l] = p_l$
  7:         $\mathbf{y} \leftarrow \mathbf{y} + \dfrac{\bar{A}^{(i_t)}\bigl((\bar{A}^{(i_t)})^{\top}\mathbf{x}\bigr)}{c\,p_{i_t}}$
  8:    end for
  9:     $\lambda \leftarrow \mathbf{x}^{\top}\mathbf{y}$
  10:     $\mathbf{x} \leftarrow \mathbf{y}/\|\mathbf{y}\|$
  11:  end while
  12:  Return $\lambda$
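To make the flow of Algorithm 4 concrete, here is a NumPy sketch under our own naming (the paper's experiments use a CUDA implementation on an A100 GPU). Since the iteration operates on $A^2$, the Rayleigh quotient estimates $\lambda_{\max}^2$; under the positive-semidefinite assumption of Algorithm 4, we take its square root at the end.

```python
import numpy as np

def power_method_proposed_amm(A, c, n_iter=100, rng=None):
    """Sketch of Algorithm 4: power iteration on A^2 with the proposed AMM."""
    rng = np.random.default_rng() if rng is None else rng
    m = A.shape[0]
    alpha = A.mean(axis=0)                    # column means (= row means, A symmetric)
    A_bar = A - alpha[np.newaxis, :]          # A_bar = A - e_m alpha^T
    z = A @ alpha                             # line 1 of Algorithm 4
    v = z - (alpha @ alpha) * np.ones(m)

    # Optimal probabilities for sampling A_bar @ A_bar^T:
    # p_l proportional to ||A_bar^(l)|| * ||(A_bar^T)_(l)|| = ||A_bar^(l)||^2.
    p = np.linalg.norm(A_bar, axis=0) ** 2
    p = p / p.sum()

    x = rng.standard_normal(m)
    x /= np.linalg.norm(x)
    lam_sq = 0.0
    for _ in range(n_iter):
        y = x.sum() * v + (z @ x) * np.ones(m)          # line 4: exact terms
        idx = rng.choice(m, size=c, p=p)
        for i in idx:                                   # lines 5-8: Monte Carlo part
            y += A_bar[:, i] * (A_bar[:, i] @ x) / (c * p[i])
        lam_sq = x @ y                                  # Rayleigh quotient for A^2
        x = y / np.linalg.norm(y)
    return np.sqrt(max(lam_sq, 0.0)), x
```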
In this section, we assess the performance of the proposed methodology using principal component analysis (PCA) [36] as a representative application of eigenvalue computation. PCA is a method employed to examine and reduce the dimensionality of a dataset. It identifies the principal components, i.e., linear combinations of the original features that capture the most substantial variations within the data. The eigenvalues and eigenvectors of a covariance matrix play a pivotal role in PCA. Given a dataset comprising $p$ features, the initial step of PCA is the computation of the covariance matrix, a symmetric $p$-by-$p$ matrix in which each element denotes the covariance between two specific features. Finding the most dominant principal component is tantamount to identifying the eigenvector corresponding to the largest eigenvalue of the covariance matrix. Hence, in this section, we focus on image analysis, an area in which PCA finds frequent application. Specifically, we compute the largest eigenvalue and the corresponding eigenvector of an $11{,}750 \times 11{,}750$ matrix obtained via the eigenface approach [37] on the LFW dataset [38].
Figure 2A illustrates the variation in convergence over time when CUDA programs of the power method (PM) are run on an NVIDIA A100 GPU. Both figures share a common horizontal axis representing the computation time.
The vertical axis in Figure 2A represents the relative error between the largest eigenvalue and the estimated eigenvalue during the iterations, whereas the vertical axis in Figure 2B represents the deviation between the eigenvector corresponding to the largest eigenvalue and the estimated eigenvector. Incorporating the proposed AMM into the power method requires executing the first line of Algorithm 4, resulting in a longer duration until the completion of the initial iteration compared to the power method with the conventional AMM. However, since this computation is performed only once, its impact on the overall computation time is limited. Compared to the power method with the conventional AMM, the method with the proposed AMM demonstrates favorable convergence in both indicators. Nevertheless, the power method with the proposed AMM cannot achieve the same level of convergence as the one employing exact matrix multiplication. This is because the variance introduced by the AMM violates the condition revealed in previous studies on the noisy power method [39].
To solve this problem, we propose a variance reduction method that performs exact matrix multiplication when certain criteria are met. When obtaining the approximate product of the matrices $A$ and $B + \tilde{B}$, if $C = AB$ is known, we can compute an approximate solution $\tilde{C} \approx A\tilde{B}$ and provide $C + \tilde{C}$ as the approximate result. As mentioned in Section 2, the expected error decreases if the Frobenius norm of $\tilde{B}$ is small. The estimated eigenvector converges gradually through the power method. Therefore, computing the exact matrix product only occasionally and minimizing the expected error of the AMM in the remaining steps is expected to be effective. Based on this motivation, we present Algorithm 5, which uses the proposed AMM while performing exact matrix multiplication only in the steps that satisfy the given conditions.
Algorithm 5: Variance reduction power method with the proposed Monte Carlo AMM
Require:  $A$: $m \times m$ symmetric positive-semidefinite matrix, $\mathbf{x}_0$: $m$-dimensional randomized unit vector, $c \in \mathbb{N}$
Ensure:   $\lambda$: the largest absolute eigenvalue of $A$
  1:   $\mathbf{z} \leftarrow A\boldsymbol{\alpha}$, $\mathbf{v} \leftarrow \mathbf{z} - \|\boldsymbol{\alpha}\|^2\mathbf{e}_m$, $\bar{\mathbf{x}} \leftarrow \mathbf{0}$, $\bar{\mathbf{y}} \leftarrow \mathbf{0}$
  2:  Set $p_1, \dots, p_m$, such that $\sum_{l=1}^{m} p_l = 1$
  3:  while not converged do
  4:    if criteria are met then
  5:         $\mathbf{y} \leftarrow \bar{\mathbf{y}} + \bigl(\mathbf{e}_m^{\top}(\mathbf{x}-\bar{\mathbf{x}})\bigr)\mathbf{v} + \bigl(\mathbf{z}^{\top}(\mathbf{x}-\bar{\mathbf{x}})\bigr)\mathbf{e}_m$
  6:        for  $t = 1, \dots, c$  do
  7:              Pick $i_t$ with $\mathbb{P}[i_t = l] = p_l$
  8:            $\mathbf{y} \leftarrow \mathbf{y} + \dfrac{\bar{A}^{(i_t)}\bigl((\bar{A}^{(i_t)})^{\top}(\mathbf{x}-\bar{\mathbf{x}})\bigr)}{c\,p_{i_t}}$
  9:           end for
  10:    else
  11:         $\mathbf{y} \leftarrow A^2\mathbf{x}$, $\bar{\mathbf{x}} \leftarrow \mathbf{x}$, $\bar{\mathbf{y}} \leftarrow \mathbf{y}$
  12:    end if
  13:     $\lambda \leftarrow \mathbf{x}^{\top}\mathbf{y}$
  14:     $\mathbf{x} \leftarrow \mathbf{y}/\|\mathbf{y}\|$
  15:  end while
  16:  Return $\lambda$
Various conditions can be considered for deciding when to perform exact matrix multiplication. For instance, we can employ a static strategy in which an exact computation is performed every few steps. Alternatively, a dynamic strategy is also promising, in which convergence stagnation due to approximation errors is detected and an exact computation is then executed. In this section, we adopt a hybrid approach that combines the AMM with exact computation based on this dynamic strategy. To assess the convergence of the eigenvector calculation, we adopt the following rules (a small helper implementing them is sketched after the list):
1.
If the dot product between the current and previous vectors of x exceeds a threshold value θ , then we execute the exact matrix product at the next step.
2.
Even if condition 1 is met, the next step employs the AMM if the exact matrix multiplication was executed in the previous step.
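The rules above translate into a decision helper of the following form (a hypothetical sketch with our own names; only the switching logic is shown, not the full loop of Algorithm 5):

```python
def use_exact_next_step(x, x_prev, theta, prev_step_was_exact):
    """Decide whether the next step of Algorithm 5 should use the exact product.

    Rule 1: if the dot product between the current and previous unit vectors
            exceeds the threshold theta, execute the exact matrix product next.
    Rule 2: if the previous step already used the exact product, fall back to
            the AMM regardless of rule 1.
    """
    return (float(x @ x_prev) > theta) and not prev_step_was_exact
```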
The experimental results are presented in Figure 3. We set the number of samplings $c$ to 512 and executed each method 1000 times with different random seeds. In this figure, the markers represent the median, while the shaded region represents the interquartile range from the first to the third quartile. The top row corresponds to $\theta = 1 - 10^{-4}$, while the bottom row corresponds to $\theta = 1 - 10^{-3}$. The left panels depict the relative error between the true largest eigenvalue $\lambda_{\max}$ and the estimated largest eigenvalue $\tilde{\lambda}_{\max}$. The right panels illustrate the discrepancy between the eigenvector corresponding to the largest eigenvalue, denoted $\mathbf{v}_{\max}$, and the estimated eigenvector, denoted $\tilde{\mathbf{v}}_{\max}$. The threshold is a hyperparameter, and it is not obvious how to determine an appropriate value for it; the method should therefore exhibit stable convergence for various threshold values. For $\theta = 1 - 10^{-4}$, the power method with the conventional AMM fails to converge. In contrast, the power method with the proposed AMM demonstrates convergence performance comparable to the power method employing exact matrix multiplication while achieving a reduced computation time. When the threshold is changed to $\theta = 1 - 10^{-3}$, the number of steps involving exact matrix multiplication increases. Consequently, even the power method with the conventional AMM exhibits good convergence. However, the increased frequency of exact matrix multiplications requires more computation time than the power method with the proposed AMM. Thus, the power method with the proposed AMM demonstrates more robust convergence than the one utilizing the conventional AMM, and it is faster than the power method using only exact matrix multiplication. For instance, when $\theta = 1 - 10^{-3}$, the computation time until reaching $1 - (\mathbf{v}_{\max}^{\top}\tilde{\mathbf{v}}_{\max})^2 = 10^{-11}$ is reduced by about 40%.

4. Discussion

One of the challenges of the proposed method is that its effectiveness diminishes when the column means of matrix $A$ (the entries of $\boldsymbol{\alpha}$) and the row means of matrix $B$ (the entries of $\boldsymbol{\beta}$) are close to zero. In practical applications and datasets, matrices often exhibit element distributions with specific tendencies or possess the property of being positive semidefinite, thereby deviating from the scenarios in which the proposed method is less effective. However, these conditions do arise in specific applications, which then require alternative methods to improve the approximation accuracy.
In this paper, we applied the AMM to the simple power method and computed the largest eigenvalue. By incorporating QR decomposition into the iterations of the power method, we can obtain the first several eigenvectors [39]. Furthermore, by extending the application of the AMM to other variants of the power method discussed in this paper, we anticipate accelerating various eigenvalue computations. As mentioned in the previous section, prior studies on the noisy power method analyze the convergence behavior when noise is introduced in each iteration of the power method. Although we set the threshold $\theta$ manually in this paper, it may be possible to adjust $\theta$ analytically or dynamically based on such prior research. This remains a challenge for future investigation.
Besides the power method, least-squares minimization can also be considered among the applications of AMM. Previous research has explored accelerating the analysis of large-scale single nucleotide polymorphisms by utilizing approximate calculations to reduce computational complexity, as the use of pseudo-inverse matrices for obtaining approximate solutions incurs high computational costs [19]. By applying our proposed concept to randomized numerical linear algebra (RandNLA), including these applications, there is potential to achieve faster computations and obtain higher-quality approximate solutions. Therefore, applying our method to various scientific and technological computations will be important.

Author Contributions

Conceptualization, T.O., A.R., T.M., and M.N.; methodology, T.O.; software, T.O.; writing—original draft preparation, T.O.; writing—review and editing, A.R., T.M., and M.N.; visualization, T.O.; supervision, M.N.; project administration, M.N.; funding acquisition, M.N. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported in part by the CREST project (JPMJCR17N2) funded by the Japan Science and Technology Agency, Grants-in-Aid for Scientific Research (JP20H00233), and Transformative Research Areas (A) (JP22H05197) funded by the Japan Society for the Promotion of Science. A.R. was funded by the Japan Society for the Promotion of Science as an International Research Fellow.

Data Availability Statement

Readers can download the programs used in Section 3.1 and Section 3.2 from https://github.com/Takuya-Okuyama/AMM_with_preprocessing (accessed on 24 July 2023) and https://github.com/Takuya-Okuyama/eigenvalueComputation_by_AMM (accessed on 24 July 2023), respectively.

Acknowledgments

We would like to thank Nicolas Chauvet for useful discussions in the early stages of this work.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  2. Sarle, W. Neural Networks and Statistical Models. In Proceedings of the 19th Annual SAS Users Group International Conference, Dallas, TX, USA, 10–13 April 1994; pp. 1538–1550. [Google Scholar]
  3. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef]
  4. Okuyama, T.; Sonobe, T.; Kawarabayashi, K.; Yamaoka, M. Binary optimization by momentum annealing. Phys. Rev. E 2019, 100, 012111. [Google Scholar] [CrossRef] [PubMed]
  5. Bottou, L. On-Line Learning and Stochastic Approximations. In On-Line Learning in Neural Networks; Cambridge University Press: Cambridge, MA, USA, 1999; pp. 9–42. [Google Scholar]
  6. Wright, S.J. Primal-Dual Interior-Point Methods; SIAM: Philadelphia, PA, USA, 1997. [Google Scholar]
  7. Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410. [Google Scholar] [CrossRef] [PubMed]
  8. Sanderson, C.; Curtin, R. Armadillo: A template-based C++ library for linear algebra. J. Open Source Softw. 2016, 1, 26. [Google Scholar] [CrossRef]
  9. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. In Proceedings of the NIPS 2017 Workshop on Autodiff, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  10. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. SIGARCH Comput. Archit. News 2017, 45, 1–12. [Google Scholar] [CrossRef]
  11. Zhang, Z.; Wang, H.; Han, S.; Dally, W.J. SpArch: Efficient Architecture for Sparse Matrix Multiplication. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, 22–26 February 2020; pp. 261–274. [Google Scholar] [CrossRef]
  12. Zhou, H.; Dong, J.; Cheng, J.; Dong, W.; Huang, C.; Shen, Y.; Zhang, Q.; Gu, M.; Qian, C.; Chen, H.; et al. Photonic matrix multiplication lights up photonic accelerator and beyond. Light. Sci. Appl. 2022, 11, 30. [Google Scholar] [CrossRef]
  13. Yu, Q.; Maddah-Ali, M.A.; Avestimehr, A.S. Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 2022–2026. [Google Scholar] [CrossRef]
  14. Aliasgari, M.; Simeone, O.; Kliewer, J. Private and Secure Distributed Matrix Multiplication With Flexible Communication Load. IEEE Trans. Inf. Forensics Secur. 2020, 15, 2722–2734. [Google Scholar] [CrossRef]
  15. Nissim, R.; Schwartz, O. Accelerating Distributed Matrix Multiplication with 4-Dimensional Polynomial Codes. In Proceedings of the SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23), Seattle, WA, USA, 31 May–2 June 2023; pp. 134–146. [Google Scholar] [CrossRef]
  16. Strassen, V. Gaussian elimination is not optimal. Numer. Math. 1969, 13, 354–356. [Google Scholar] [CrossRef]
  17. Li, J.; Ranka, S.; Sahni, S. Strassen’s Matrix Multiplication on GPUs. In Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems, Tainan, Taiwan, 7–9 December 2011; pp. 157–164. [Google Scholar] [CrossRef]
  18. Winograd, S. On multiplication of 2 × 2 matrices. Linear Algebra Its Appl. 1971, 4, 381–388. [Google Scholar] [CrossRef]
  19. Drineas, P.; Mahoney, M.W. RandNLA: Randomized Numerical Linear Algebra. Commun. ACM 2016, 59, 80–90. [Google Scholar] [CrossRef]
  20. Blalock, D.; Guttag, J. Multiplying Matrices Without Multiplying. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: La Jolla, CA, USA, 2021; Volume 139, pp. 992–1004. [Google Scholar]
  21. Francis, D.P.; Raimond, K. A Practical Streaming Approximate Matrix Multiplication Algorithm. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 1455–1465. [Google Scholar] [CrossRef]
  22. Fawzi, A.; Balog, M.; Huang, A.; Hubert, T.; Romera-Paredes, B.; Barekatain, M.; Novikov, A.; Ruiz, F.J.R.; Schrittwieser, J.; Swirszcz, G.; et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 2022, 610, 47–53. [Google Scholar] [CrossRef] [PubMed]
  23. Lee, K.J.; Lee, J.; Choi, S.; Yoo, H.J. The Development of Silicon for AI: Different Design Approaches. IEEE Trans. Circuits Syst. Regul. Pap. 2020, 67, 4719–4732. [Google Scholar] [CrossRef]
  24. Pilipović, R.; Risojević, V.; Božič, J.; Bulić, P.; Lotrič, U. An Approximate GEMM Unit for Energy-Efficient Object Detection. Sensors 2021, 21, 4195. [Google Scholar] [CrossRef] [PubMed]
  25. Kim, M.S.; Barrio, A.A.D.; Oliveira, L.T.; Hermida, R.; Bagherzadeh, N. Efficient Mitchell’s Approximate Log Multipliers for Convolutional Neural Networks. IEEE Trans. Comput. 2019, 68, 660–675. [Google Scholar] [CrossRef]
  26. Ansari, M.S.; Cockburn, B.F.; Han, J. An Improved Logarithmic Multiplier for Energy-Efficient Neural Computing. IEEE Trans. Comput. 2021, 70, 614–625. [Google Scholar] [CrossRef]
  27. Pilipović, R.; Bulić, P.; Lotrič, U. A Two-Stage Operand Trimming Approximate Logarithmic Multiplier. IEEE Trans. Circuits Syst. Regul. Pap. 2021, 68, 2535–2545. [Google Scholar] [CrossRef]
  28. Kim, M.S.; Del Barrio, A.A.; Kim, H.; Bagherzadeh, N. The Effects of Approximate Multiplication on Convolutional Neural Networks. IEEE Trans. Emerg. Top. Comput. 2022, 10, 904–916. [Google Scholar] [CrossRef]
  29. Liu, X.; Liu, Y.; Yang, H.; Dun, M.; Yin, B.; Luan, Z.; Qian, D. Accelerating approximate matrix multiplication for near-sparse matrices on GPUs. J. Supercomput. 2022, 78, 11464–11491. [Google Scholar] [CrossRef]
  30. Ye, Q.; Luo, L.; Zhang, Z. Frequent Direction Algorithms for Approximate Matrix Multiplication with Applications in CCA. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; AAAI Press: Atlanta, GA, USA, 2016; pp. 2301–2307. [Google Scholar]
  31. Ghashami, M.; Liberty, E.; Phillips, J.M.; Woodruff, D.P. Frequent Directions: Simple and Deterministic Matrix Sketching. SIAM J. Comput. 2016, 45, 1762–1792. [Google Scholar] [CrossRef]
  32. Drineas, P.; Kannan, R.; Mahoney, M.W. Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication. SIAM J. Comput. 2006, 36, 132–157. [Google Scholar] [CrossRef]
  33. Serre, D. Matrices; Springer: New York, NY, USA, 2010. [Google Scholar] [CrossRef]
  34. Xu, P.; He, B.; De Sa, C.; Mitliagkas, I.; Re, C. Accelerated stochastic power iteration. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Lanzarote, Canary Islands, 9–11 April 2018; pp. 58–67. [Google Scholar]
  35. Knyazev, A.V. Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method. SIAM J. Sci. Comput. 2001, 23, 517–541. [Google Scholar] [CrossRef]
  36. Jolliffe, I.T. Principal Component Analysis, 2nd ed; Springer Series in Statistics; Springer: New York, NY, USA, 2002. [Google Scholar]
  37. Turk, M.; Pentland, A. Face recognition using eigenfaces. In Proceedings of the 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Maui, HI, USA, 3–6 June 1991; pp. 586–591. [Google Scholar] [CrossRef]
  38. Huang, G.B.; Jain, V.; Learned-Miller, E. Unsupervised Joint Alignment of Complex Images. In Proceedings of the ICCV, Rio de Janeiro, Brazil, 14–20 October 2007. [Google Scholar]
  39. Hardt, M.; Price, E. The Noisy Power Method: A Meta Algorithm with Applications. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2; NIPS’14. MIT Press: Cambridge, MA, USA, 2014; pp. 2861–2869. [Google Scholar]
Figure 1. (A) The trade-off between the speedup ratio and Frobenius norm error when calculating AMM using the conventional and proposed methods. The matrices are of size $m = n = k = 2^{15}$, with elements randomly generated from a uniform distribution between 0 and 1. The speedup ratio is the ratio of the time taken to calculate the matrix product precisely using cuBLAS on a GPU to the time taken to execute AMM. (B) Comparison of speedup ratios for achieving a relative accuracy of 0.95. It is clear that as the matrix size increases, the difference in speedup rate increases.
Figure 2. Temporal convergence curves of power methods. The differences between the estimated largest eigenvalue $\tilde{\lambda}_{\max}$ and its corresponding eigenvector $\tilde{\mathbf{v}}_{\max}$ at each iteration of the power method from their respective true values, denoted as $\lambda_{\max}$ and $\mathbf{v}_{\max}$, are plotted on the vertical axis of each figure.
Figure 3. Temporal convergence curves of variance reduction power methods. The differences between the estimated largest eigenvalue $\tilde{\lambda}_{\max}$ and its corresponding eigenvector $\tilde{\mathbf{v}}_{\max}$ at each iteration of the power method from their respective true values, denoted as $\lambda_{\max}$ and $\mathbf{v}_{\max}$, are plotted on the vertical axis of each figure. The top row (A,B) corresponds to $\theta = 1 - 10^{-4}$, while the bottom row (C,D) corresponds to $\theta = 1 - 10^{-3}$. The notation $\theta$ is a parameter representing the threshold for switching between exact and approximate matrix products in the power methods; the details are described in the main text.