High Performance Graph Data Imputation on Multiple GPUs
Abstract
1. Introduction
- We design and implement the convolutional imputation algorithm on GPUs to achieve high performance and accuracy. The GPU-based convolutional imputation algorithm includes an efficient graph Fourier transform (GFT) operation with coalesced memory accesses to achieve high parallelism.
- We propose effective optimization strategies to improve GPU utilization, including stream computing and batched computing. To support large-scale graph-tensor imputation and further improve performance, we propose a multi-GPU computing scheme to perform the computation on multiple GPUs.
- We perform extensive experiments to evaluate the performance of the GPU-based convolutional imputation algorithm using both synthetic and real data. On synthetic data, the GPU-optimized implementation achieves up to speedups versus the GPU-baseline implementation running on a Quadro RTX6000 GPU, and the multi-GPU implementation achieves up to speedups on two GPUs versus the GPU-optimized implementation on a single GPU; the GPU implementation achieves recovery errors similar to those of the CPU MATLAB implementation. For the ego-Facebook dataset with various sampling rates, the GPU-optimized implementation likewise achieves up to speedups versus the GPU-baseline implementation running on a Quadro RTX6000 GPU, while achieving similar recovery errors.
2. Related Works
3. Convolutional Imputation Algorithm
3.1. Notations
3.2. Overview of the Convolutional Imputation Algorithm
Algorithm 1: Iterative convolutional imputation.
Input: the incomplete graph-tensor, number of iterations C, maximum number of iterations T.
3.3. Parallel Acceleration Analysis
4. Efficient Convolutional Imputation Algorithm on GPU
4.1. Design and Implementation of the Baseline GPU Convolutional Imputation Algorithm
4.1.1. Computing in the Graph Spectral Domain
- First, transform the incomplete graph-tensor into the graph spectral domain by applying graph Fourier transform along the third dimension.
- Then, perform the matrix imputation task for each frontal slice of the graph-tensor.
- Finally, transform the completed graph-tensor back to the time domain by applying the inverse graph Fourier transform along the third dimension.
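The three steps above can be sketched in NumPy as follows. This is an illustrative stand-in, not the paper's GPU implementation (which uses cuBLAS/cuSOLVER); the function name `impute_spectral`, the orthonormality of the GFT matrix, and the `impute_slice` callback are assumptions for the sketch.

```python
import numpy as np

def impute_spectral(T, U, impute_slice):
    """Sketch of the three-step spectral pipeline (illustrative only).

    T            : (n1, n2, N) incomplete graph-tensor
    U            : (N, N) GFT matrix, columns = eigenvectors of the graph
                   Laplacian (assumed orthonormal here)
    impute_slice : per-slice matrix-imputation routine (assumed given)
    """
    # Step 1: GFT along the third dimension -- transform every mode-3 tube.
    T_hat = np.tensordot(T, U, axes=([2], [0]))        # tube -> U^T @ tube
    # Step 2: matrix imputation on each frontal slice in the spectral domain.
    for k in range(T_hat.shape[2]):
        T_hat[:, :, k] = impute_slice(T_hat[:, :, k])
    # Step 3: inverse GFT (U orthonormal, so the inverse transform uses U).
    return np.tensordot(T_hat, U.T, axes=([2], [0]))   # tube -> U @ tube
```

With an identity `impute_slice`, the forward and inverse transforms cancel, which is a quick sanity check of the layout conventions.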
4.1.2. Data Storage
4.1.3. Parallelization of the Algorithm
- For the graph Fourier transform computation in line 8 of Algorithm 1, the algorithm accesses data along the third dimension of the graph-tensor, that is, mode-3 tube by mode-3 tube. With the data storage described earlier, this access pattern cannot provide coalesced memory accesses for the GFT computation on the GPU, because coalesced accesses require sequential reads from aligned addresses. Therefore, we introduce a data reorganization step. First, we use the eigenvalue solver provided in the cuSOLVER library [24] to obtain the graph Fourier transform matrix, which is composed of the eigenvectors of the graph Laplacian matrix. Then, we apply a mapping that reconstructs the graph-tensor, reorganizing the data from the original slice-by-slice layout into a tube-by-tube layout. Finally, we use the batched matrix-matrix multiplication routine in the cuBLAS library [24] to compute the GFT, and then convert the result back to the original data layout in the graph spectral domain. Since the benefit of batched matrix multiplication far outweighs the overhead of reorganizing the data, the overall algorithm performance improves. We design this batched scheme for the GFT computation to improve parallelism, and Figure 3 illustrates it with an example. In the original computation, each frontal slice in the graph spectral domain is a linear combination of the data matrices on the graph, so data accesses are random; each output entry is the dot product of two vectors, e.g., 0 × 0 + 2 × 6 + 1 × 2 + 1 × 8 = 22. This is time consuming because random accesses to the data are slower than sequential accesses.
Therefore, we reorganize the graph-tensor data to achieve sequential accesses: the first mode-3 tube is reorganized into the first column of the first frontal slice, the next mode-3 tube into the second column of the first frontal slice, and so on. The graph Fourier transform matrix can then multiply all the frontal slice matrices in batches by exploiting batched matrix-matrix multiplication.
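The reorganization can be expressed compactly: transposing the last two axes turns mode-3 tubes into matrix columns, after which one batched multiply replaces the strided tube-by-tube reductions. The sketch below is a NumPy stand-in for the cuBLAS batched GEMM; the function name `gft_batched` is ours, not the paper's.

```python
import numpy as np

def gft_batched(T, U):
    """GFT via tube-to-column reorganization plus a batched multiply
    (illustrative NumPy stand-in for the cuBLAS batched GEMM).

    T : (n1, n2, N) graph-tensor in the slice-by-slice layout
    U : (N, N) graph Fourier transform matrix
    """
    # Reorganize: R[i] is an (N, n2) matrix whose columns are the mode-3
    # tubes of row i, so the transform reads memory sequentially.
    R = np.transpose(T, (0, 2, 1))        # (n1, N, n2), tube-by-tube layout
    # One batched matrix-matrix multiply over all n1 reorganized slices.
    R_hat = np.matmul(U.T, R)             # broadcasts over the batch axis
    # Convert back to the original slice-by-slice layout.
    return np.transpose(R_hat, (0, 2, 1))
```

The result is identical to applying the transform tube by tube; only the memory access pattern changes.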
- The singular value soft-threshold computation in line 9 of Algorithm 1 includes a singular value decomposition and matrix multiplications for each frontal slice, as shown in the pseudocode of Algorithm 2. To perform the SVD of each frontal slice, we use the cusolverDnSgesvdj() routine in the cuSOLVER library [24], which is implemented via the Jacobi method and is faster than the standard method. Besides, we arrange the computations in Algorithm 2 into batched computations to achieve better parallelism and performance. Because only the non-zero diagonal elements are stored, we design a GPU kernel that batches the execution over the N matrices (line 3), where each thread is responsible for one tensor element. Considering that the tensors in the SVD results are stored in 1D arrays, the index of each element in its 1D array can be computed directly by each thread. Using this kernel is more efficient than calling the cuBLAS library API for each frontal slice. Further, we place the diagonal matrix on the left of the multiplication (Figure 4a) rather than on the right (Figure 4b) to allow coalesced accesses to device memory. As shown in Figure 4a, threads access contiguous memory blocks and thus benefit from coalesced memory access instructions. In line 4, we use a routine in the cuBLAS library to perform the batched matrix-matrix multiplication on the N matrices in parallel.
Algorithm 2: Implementation of the singular value soft-threshold.
Input: graph-tensor in the graph spectral domain, regularization parameter of iteration j, number of vertices N.
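For a single frontal slice, the soft-threshold step can be sketched as below. This is an illustrative NumPy version under our own naming (`soft_threshold_slice`); the GPU implementation instead batches Jacobi SVDs via cusolverDnSgesvdj and a custom scaling kernel across all N slices.

```python
import numpy as np

def soft_threshold_slice(M, lam):
    """Singular value soft-thresholding of one frontal slice (sketch).

    Each singular value is shrunk by lam and truncated at zero.
    """
    Uq, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)          # shrink, truncate at zero
    # Applying the diagonal on the left scales whole rows of Vt -- the
    # contiguous-access pattern that Figure 4a exploits on the GPU.
    return Uq @ (s_shrunk[:, None] * Vt)
```

For example, thresholding diag(3, 1) with lam = 2 shrinks the singular values to 1 and 0, yielding diag(1, 0).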
4.2. Optimizations
4.2.1. Performance Bottlenecks Analysis
4.2.2. Optimizing SVD Computation
5. Large-Scale and Multi-GPU Graph-Tensor Imputation
- The master GPU uses a partitioning strategy to split the frontal slices into n partitions, then sends one partition to each of the other GPUs using a peer-to-peer, asynchronous memory transfer routine in the CUDA library [24]. Each GPU computes its own part independently and sends the result back to the master GPU;
- The master GPU synchronizes to ensure that results have been received from all GPUs, then performs the GFT computation;
- The master GPU splits the spectral-domain graph-tensor into n partitions, then sends one partition to each of the other GPUs via the peer-to-peer, asynchronous memory transfer routine. All GPUs perform the singular value soft-threshold computation on their own data independently. After the computation completes, each of the other GPUs sends its result back to the master GPU;
- The master GPU synchronizes to ensure that all GPUs have finished their tasks, then performs the iGFT computation.
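The partition-compute-gather pattern of the steps above can be mimicked on CPU threads; the sketch below is a stand-in only, with our own names (`multi_device_soft_threshold`, `st_slice`), where shared-memory handoff replaces the real peer-to-peer asynchronous copies between devices.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def multi_device_soft_threshold(T_hat, lam, n_devices, st_slice):
    """CPU-thread stand-in for the multi-GPU scheme: the frontal slices
    are split into n_devices partitions, each worker (one "GPU") computes
    its partition independently, and the master gathers the results.
    """
    parts = np.array_split(np.arange(T_hat.shape[2]), n_devices)

    def work(idx):
        # One device applies the soft-threshold to its own slices.
        return np.stack([st_slice(T_hat[:, :, k], lam) for k in idx], axis=2)

    with ThreadPoolExecutor(max_workers=n_devices) as pool:
        results = list(pool.map(work, parts))
    # Synchronization point: gather all partitions back on the master.
    return np.concatenate(results, axis=2)
```

Because the slices are independent, the partitioned result matches a serial slice-by-slice loop exactly; only the schedule changes.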
6. Performance Evaluation
6.1. Evaluation Settings
6.1.1. Experiment Datasets and Configurations
6.1.2. Experiment Platform
6.2. Results and Analysis
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Liu, X.Y.; Wang, X. LS-decomposition for robust recovery of sensory big data. IEEE Trans. Big Data 2018, 4, 542–555.
- Sun, Q.; Yan, M.; Donoho, D. Convolutional Imputation of Matrix Networks. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4818–4827.
- Liu, X.Y.; Aeron, S.; Aggarwal, V.; Wang, X. Low-tubal-rank tensor completion using alternating minimization. IEEE Trans. Inf. Theory 2019, 66, 1714–1737.
- Zhang, T.; Liu, X.Y.; Wang, X. High Performance GPU Tensor Completion with Tubal-sampling Pattern. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 1724–1739.
- Liu, X.Y.; Zhu, M. Convolutional graph-tensor net for graph data completion. In IJCAI 2020 Workshop on Tensor Network Representations in Machine Learning; Springer: Berlin/Heidelberg, Germany, 2020.
- Kyrola, A.; Blelloch, G.; Guestrin, C. GraphChi: Large-scale graph computation on just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), Hollywood, CA, USA, 8–10 October 2012; pp. 31–46.
- Wang, Y.; Davidson, A.; Pan, Y.; Wu, Y.; Riffel, A.; Owens, J.D. Gunrock: A high-performance graph processing library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Barcelona, Spain, 12–16 March 2016; pp. 1–12.
- Dathathri, R.; Gill, G.; Hoang, L.; Dang, H.V.; Brooks, A.; Dryden, N.; Snir, M.; Pingali, K. Gluon: A communication-optimizing substrate for distributed heterogeneous graph analytics. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, Philadelphia, PA, USA, 18 June 2018; pp. 752–768.
- Perraudin, N.; Paratte, J.; Shuman, D.; Martin, L.; Kalofolias, V.; Vandergheynst, P.; Hammond, D.K. GSPBOX: A toolbox for signal processing on graphs. arXiv 2014, arXiv:1408.5781.
- Defferrard, M.; Martin, L.; Pena, R.; Perraudin, N. PyGSP: Graph Signal Processing in Python. 2017. Available online: https://github.com/epfl-lts2/pygsp (accessed on 31 January 2021).
- Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; Volume 2, pp. 729–734.
- Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
- Estrach, J.B.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and deep locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014.
- Adomavicius, G.; Kwon, Y. Multi-Criteria Recommender Systems; Springer: Boston, MA, USA, 2015; pp. 847–880.
- Li, X.; Ye, Y.; Xu, X. Low-rank tensor completion with total variation for visual data inpainting. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 2210–2216.
- Liu, J.; Musialski, P.; Wonka, P.; Ye, J. Tensor Completion for Estimating Missing Values in Visual Data. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 208–220.
- Zhang, T.; Liu, X.Y.; Wang, X.; Walid, A. cuTensor-Tubal: Efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Trans. Parallel Distrib. Syst. 2019, 31, 595–610.
- Abdelfattah, A.; Keyes, D.E.; Ltaief, H. KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators. ACM Trans. Math. Softw. 2016, 42, 18.
- Jia, Z.; Kwon, Y.; Shipman, G.; McCormick, P.; Erez, M.; Aiken, A. A distributed multi-GPU system for fast graph processing. Proc. VLDB Endow. 2017, 11, 297–310.
- Zhang, T.; Lu, H.; Liu, X.Y. High-Performance Homomorphic Matrix Completion on Multiple GPUs. IEEE Access 2020, 8, 25395–25406.
- Sandryhaila, A.; Moura, J.M.F. Big Data Analysis with Signal Processing on Graphs. IEEE Signal Process. Mag. 2014, 31, 80–90.
- Kilmer, M.E.; Martin, C.D. Factorization strategies for third-order tensors. Linear Algebra Its Appl. 2011, 435, 641–658.
- Ortega, A.; Frossard, P.; Kovačević, J.; Moura, J.M.; Vandergheynst, P. Graph signal processing: Overview, challenges, and applications. Proc. IEEE 2018, 106, 808–828.
- NVIDIA Corporation. NVIDIA CUDA SDK 10.1, NVIDIA CUDA Software Download. 2019. Available online: https://developer.nvidia.com/cuda-downloads (accessed on 31 January 2021).
- NVIDIA Corporation. NVIDIA Quadro RTX 6000. 2021. Available online: https://www.nvidia.com/en-us/design-visualization/quadro/rtx-6000 (accessed on 31 January 2021).
- NVIDIA Corporation. NVIDIA V100 Tensor Core GPU. 2021. Available online: https://www.nvidia.com/en-us/data-center/v100 (accessed on 31 January 2021).
- Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. 2014. Available online: http://snap.stanford.edu/data (accessed on 31 January 2021).
| Notation | Description |
|---|---|
| | A vector that has entries |
| | A matrix of size |
| | A third-order tensor of size |
| | A graph-tensor |
| | A graph-tensor in the frequency domain |
| | A graph Fourier transform matrix |
| GPU Algorithm | Total Time (s) | SVD Time (s) | GFT Time (s) | Memory Copy Time (s) | Right-Singular Vectors Time (s) |
|---|---|---|---|---|---|
| GPU-baseline | 149.54 | 146.37 | 0.27 | 1.96 | - |
| GPU-optimized | 2.47 | 1.43 | 0.27 | 0.11 | 0.02 |

| CPU | 0.038 | 0.032 | 0.018 | 0.008 | 0.001 |
| GPU | 0.038 | 0.032 | 0.018 | 0.008 | 0.001 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhou, C.; Zhang, T. High Performance Graph Data Imputation on Multiple GPUs. Future Internet 2021, 13, 36. https://doi.org/10.3390/fi13020036