Search Results (11)

Search Parameters:
Keywords = sparse matrix-vector multiplication (SpMV)

20 pages, 2087 KB  
Article
Automatic Sparse Matrix Format Selection via Dynamic Labeling and Clustering on Heterogeneous CPU–GPU Systems
by Zheng Shi, Yi Zou and Xianfeng Song
Electronics 2025, 14(19), 3895; https://doi.org/10.3390/electronics14193895 - 30 Sep 2025
Viewed by 134
Abstract
Sparse matrix–vector multiplication (SpMV) is a fundamental kernel in high-performance computing (HPC) whose efficiency depends heavily on the storage format across central processing unit (CPU) and graphics processing unit (GPU) platforms. Conventional supervised approaches often use execution time as training labels, but our experiments on 1786 matrices reveal two issues: labels are unstable across runs due to execution-time variability, and single-label assignment overlooks cases where multiple formats perform similarly well. We propose a dynamic labeling strategy that assigns a single label when the fastest format shows clear superiority, and multiple labels when performance differences are small, thereby reducing label noise. We further extend feature analysis to multi-dimensional structural descriptors and apply clustering to refine label distributions and enhance prediction robustness. Experiments demonstrate 99.2% accuracy in hardware (CPU/GPU) selection and up to 98.95% accuracy in format prediction, with up to 10% robustness gains over traditional methods. Under cost-aware, end-to-end evaluation that accounts for feature extraction, prediction, conversion, and kernel execution, CPUs achieve speedups up to 3.15× and GPUs up to 1.94× over a CSR baseline. Cross-round evaluations confirm stability and generalization, providing a reliable path toward automated, cross-platform SpMV optimization.
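
The dynamic labeling rule described in this abstract can be illustrated with a short sketch: assign a single label when the fastest format clearly wins, and multiple labels when several formats fall within a margin of the best. The 10% margin and the format names below are illustrative assumptions, not values taken from the paper.

```cpp
// Minimal sketch of a dynamic labeling rule for format selection. The 10%
// relative-performance margin and the format names are illustrative assumptions.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

std::vector<std::string> dynamic_labels(const std::map<std::string, double>& time_ms,
                                        double margin = 0.10) {
    // Find the fastest format.
    auto best = std::min_element(time_ms.begin(), time_ms.end(),
                                 [](const auto& a, const auto& b) { return a.second < b.second; });
    std::vector<std::string> labels;
    for (const auto& [fmt, t] : time_ms) {
        // Label every format whose runtime is within the margin of the fastest one;
        // if only the fastest qualifies, the matrix gets a single label.
        if (t <= best->second * (1.0 + margin)) labels.push_back(fmt);
    }
    return labels;
}

int main() {
    std::map<std::string, double> timings = {{"CSR", 1.00}, {"ELL", 1.05}, {"COO", 1.60}};
    for (const auto& fmt : dynamic_labels(timings))
        std::cout << fmt << '\n';   // prints CSR and ELL: both are within 10% of the best
}
```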

15 pages, 1001 KB  
Article
SRB-ELL: A Vector-Friendly Sparse Matrix Format for SpMV on Scratchpad-Augmented Architectures
by Sheng Zhang, Wuqiang Bai, Zongmao Zhang, Xuchao Xie and Xuebin Tang
Appl. Sci. 2025, 15(17), 9811; https://doi.org/10.3390/app15179811 - 7 Sep 2025
Viewed by 746
Abstract
Sparse Matrix–Vector Multiplication (SpMV) is a critical computational kernel in high-performance computing (HPC) and artificial intelligence (AI). However, its irregular memory access patterns lead to frequent cache misses on multi-level cache hierarchies, significantly degrading performance. Scratchpad memory (SPM), a software-managed, low-latency on-chip memory, offers improved data locality and control, making it a promising alternative for irregular workloads. To enhance SpMV performance, we propose a vectorized execution framework targeting SPM-augmented processors. Recognizing the limitations of traditional formats for vectorization, we introduce Sorted-Row-Block ELL (SRB-ELL), a new matrix storage format derived from ELLPACK (ELL). SRB-ELL stores only non-zero elements, partitions the matrix into row blocks, and sorts them by block size to improve load balance and SIMD efficiency. We implement and evaluate SRB-ELL on a custom processor architecture with integrated SPM using the gem5 simulator. Experimental results show that, compared to vectorized CSR-based SpMV, the SRB-ELL design achieves a speedup of up to 1.48×, with an average of 1.19×.
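
For reference, a plain ELLPACK (ELL) SpMV loop, the format from which SRB-ELL is derived, looks roughly as follows. This is a generic sketch of the baseline ELL layout, not the authors' SRB-ELL implementation; padded slots are assumed to store column 0 and value 0.

```cpp
// Generic ELLPACK (ELL) SpMV sketch: every row is padded to the same width and
// the entries are stored column-major so that slot k of consecutive rows is
// contiguous and SIMD-friendly. Padded slots are assumed to hold column 0 and
// value 0.0, so they contribute nothing to the result.
#include <vector>

void spmv_ell(int n_rows, int width,
              const std::vector<int>& col,     // size n_rows * width, column-major
              const std::vector<double>& val,  // size n_rows * width, column-major
              const std::vector<double>& x,
              std::vector<double>& y) {
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = 0; k < width; ++k) {
            int idx = k * n_rows + i;          // column-major index of slot k in row i
            sum += val[idx] * x[col[idx]];     // padded slots add 0.0
        }
        y[i] = sum;
    }
}
```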

21 pages, 6865 KB  
Article
Elegante+: A Machine Learning-Based Optimization Framework for Sparse Matrix–Vector Computations on the CPU Architecture
by Muhammad Ahmad, Sardar Usman, Ameer Hamza, Muhammad Muzamil and Ildar Batyrshin
Information 2025, 16(7), 553; https://doi.org/10.3390/info16070553 - 29 Jun 2025
Viewed by 610
Abstract
Sparse matrix–vector multiplication (SpMV) plays a significant role in the computational costs of many scientific applications such as 2D/3D robotics, power network problems, and computer vision. Numerous implementations using different sparse matrix formats have been introduced to optimize this kernel on CPUs and GPUs. However, due to the sparsity patterns of matrices and the diverse configurations of hardware, accurately modeling the performance of SpMV remains a complex challenge. SpMV computation is often a time-consuming process because of the sparse matrix structure. To address this, we propose a machine learning-based tool, namely Elegante+, that predicts optimal scheduling policies by analyzing matrix structures. This approach eliminates the need for repetitive trial and error, minimizes errors, and identifies the best configuration for the SpMV kernel, enabling users to make informed decisions about scheduling policies that maximize computational efficiency. For this purpose, we collected 1000+ sparse matrices from the SuiteSparse Matrix Collection, converted them into the compressed sparse row (CSR) format, performed SpMV computations, and extracted 14 key sparse matrix features. After creating a comprehensive dataset, we trained various machine learning models to predict the optimal scheduling policy, significantly enhancing computational efficiency and reducing overhead in high-performance computing environments. Our proposed tool, Elegante+ (XGB with all SpMV features), achieved the highest cross-validation score of 79% and performed five times faster than the default scheduling policy during SpMV in a high-performance computing (HPC) environment.
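
A few of the structural features such a predictor consumes can be computed in a single pass over the CSR row-pointer array. The features below (row count, nnz, mean and maximum nonzeros per row) are common examples chosen for illustration and are not necessarily among the 14 features used by Elegante+.

```cpp
// Sketch: computing simple structural features from a CSR row-pointer array.
// These are generic SpMV features, not necessarily the ones used by Elegante+.
#include <algorithm>
#include <cstdio>
#include <vector>

struct CsrFeatures { int rows; int nnz; double mean_npr; int max_npr; };

CsrFeatures csr_features(const std::vector<int>& row_ptr) {
    CsrFeatures f{};
    f.rows = static_cast<int>(row_ptr.size()) - 1;
    f.nnz = row_ptr.back();
    for (int i = 0; i < f.rows; ++i)
        f.max_npr = std::max(f.max_npr, row_ptr[i + 1] - row_ptr[i]);
    f.mean_npr = f.rows > 0 ? static_cast<double>(f.nnz) / f.rows : 0.0;
    return f;
}

int main() {
    std::vector<int> row_ptr = {0, 2, 2, 5, 6};   // 4 rows, 6 nonzeros
    CsrFeatures f = csr_features(row_ptr);
    std::printf("rows=%d nnz=%d mean_npr=%.2f max_npr=%d\n",
                f.rows, f.nnz, f.mean_npr, f.max_npr);
}
```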

14 pages, 926 KB  
Article
Optimizing the Performance of the Sparse Matrix–Vector Multiplication Kernel in FPGA Guided by the Roofline Model
by Federico Favaro, Ernesto Dufrechou, Juan P. Oliver and Pablo Ezzatti
Micromachines 2023, 14(11), 2030; https://doi.org/10.3390/mi14112030 - 31 Oct 2023
Cited by 4 | Viewed by 2623
Abstract
The widespread adoption of massively parallel processors over the past decade has fundamentally transformed the landscape of high-performance computing hardware. This revolution has recently driven the advancement of FPGAs, which are emerging as an attractive alternative to power-hungry many-core devices in a world increasingly concerned with energy consumption. Consequently, numerous recent studies have focused on implementing efficient dense and sparse numerical linear algebra (NLA) kernels on FPGAs. To maximize the efficiency of these kernels, a key aspect is the use of analytical tools to understand the performance of the implementations and guide the optimization process. In this regard, the roofline model (RLM) is a well-known graphical tool that facilitates the analysis of computational performance and identifies the primary bottlenecks of specific software when executed on a particular hardware platform. Our previous efforts advanced the development of efficient implementations of sparse matrix–vector multiplication (SpMV) for FPGAs, considering both speed and energy consumption. In this work, we propose an extension of the RLM that enables optimizing runtime and energy consumption for NLA kernels based on sparse blocked storage formats on FPGAs. To test the power of this tool, we use it to extend our previous SpMV kernels by leveraging a block-sparse storage format that enables more efficient data access.
(This article belongs to the Special Issue FPGA Applications and Future Trends)
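
The classical roofline bound referenced in this abstract is simply min(peak compute, memory bandwidth × arithmetic intensity). The sketch below evaluates it for a CSR-like SpMV; the platform numbers and the per-nonzero traffic estimate are made-up assumptions, not figures from the paper, and the paper's energy-aware extension is not modeled here.

```cpp
// Sketch of the classical roofline bound: attainable GFLOPS =
// min(peak_gflops, bandwidth_GBs * arithmetic_intensity). The platform numbers
// and the per-nonzero traffic estimate below are illustrative assumptions.
#include <algorithm>
#include <cstdio>

int main() {
    const double peak_gflops   = 100.0;  // assumed peak compute of the device
    const double bandwidth_gbs = 20.0;   // assumed sustained memory bandwidth

    // CSR-like SpMV: ~2 flops per nonzero; ~12 bytes per nonzero
    // (8-byte value + 4-byte column index), ignoring vector and row-pointer traffic.
    const double flops_per_nnz = 2.0;
    const double bytes_per_nnz = 12.0;
    const double intensity = flops_per_nnz / bytes_per_nnz;   // flops per byte

    const double roofline = std::min(peak_gflops, bandwidth_gbs * intensity);
    std::printf("arithmetic intensity = %.3f flops/byte, attainable = %.2f GFLOPS\n",
                intensity, roofline);
}
```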

18 pages, 873 KB  
Article
Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs
by Guangsen Zeng and Yi Zou
Electronics 2023, 12(17), 3687; https://doi.org/10.3390/electronics12173687 - 31 Aug 2023
Cited by 1 | Viewed by 2288
Abstract
Sparse matrix-vector multiplication (SpMV) is central to many scientific, engineering, and other applications, including machine learning. Compressed Sparse Row (CSR) is a widely used sparse matrix storage format. SpMV using the CSR format on GPU computing platforms is widely studied, where the memory access behavior of the GPU is often the performance bottleneck. NVIDIA's recent Ampere GPU architecture provides a new asynchronous memory copy instruction, memcpy_async, for more efficient data movement into shared memory. Leveraging this new memcpy_async instruction, we first propose the CSR-Partial-Overlap to carefully overlap the data copy from global memory to shared memory with computation, allowing the data transfer time to be hidden behind computation. In addition, we design the dynamic batch partition and the dynamic thread distribution to achieve effective load balancing, avoid the overhead of fixing up partial sums, and improve thread utilization. Furthermore, we propose the CSR-Full-Overlap based on the CSR-Partial-Overlap, which additionally overlaps data transfer from host to device with SpMV kernel execution. The CSR-Full-Overlap unifies the two major overlaps in SpMV and hides computation as much as possible behind the two dominant memory transfers on the GPU. This allows CSR-Full-Overlap to achieve the best performance gains from both overlaps. To the best of our knowledge, this paper is the first in-depth study of how memcpy_async can be applied to accelerate SpMV computation on GPU platforms. We compare CSR-Full-Overlap to the current state-of-the-art cuSPARSE, where our experimental results show an average 2.03x performance gain and up to a 2.67x performance gain.
(This article belongs to the Section Computer Science & Engineering)
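
For context, the CSR kernel whose memory traffic these overlap schemes target is, in its simplest sequential form, the loop below. This is a generic CPU reference, not the paper's memcpy_async-based CUDA implementation; the indirect access to x is the irregular part that dominates the memory behavior.

```cpp
// Generic sequential CSR SpMV reference (y = A * x). This is only the baseline
// kernel whose data movement the overlap schemes try to hide; the paper's
// memcpy_async-based CUDA kernels are not reproduced here.
#include <vector>

void spmv_csr(const std::vector<int>& row_ptr,
              const std::vector<int>& col_idx,
              const std::vector<double>& val,
              const std::vector<double>& x,
              std::vector<double>& y) {
    const int n_rows = static_cast<int>(row_ptr.size()) - 1;
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   // indirect access to x is the irregular part
        y[i] = sum;
    }
}
```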

15 pages, 1614 KB  
Article
Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP
by Yang Wang, Jie Liu, Xiaoxiong Zhu, Qingyang Zhang, Shengguo Li and Qinglin Wang
Appl. Sci. 2023, 13(15), 8952; https://doi.org/10.3390/app13158952 - 3 Aug 2023
Cited by 2 | Viewed by 1590
Abstract
Structured grid-based sparse matrix-vector multiplication and Gauss–Seidel iterations are very important kernel functions in scientific and engineering computations, both of which are memory-intensive and bandwidth-limited. GPDSP is a general-purpose digital signal processor, an important class of embedded processor that has been introduced into high-performance computing. In this paper, we designed several optimization methods for structured grid-based SpMV and Gauss–Seidel iterations on GPDSP: a blocking method to improve data locality and memory access efficiency, a multicolor reordering method to expose fine-grained parallelism in the Gauss–Seidel iteration, a data partitioning method tailored to the GPDSP memory structure, and a double-buffering method to overlap computation and memory accesses. Finally, we combined these optimization methods to design a multicore vectorization algorithm. We tested matrices generated from structured grids of different sizes on the GPDSP platform and obtained speedups of up to 41× and 47× compared to the unoptimized SpMV and Gauss–Seidel iterations, with maximum bandwidth efficiencies of 72% and 81%, respectively. The experimental results show that our algorithms can fully utilize the external memory bandwidth. We also implemented the commonly used mixed-precision algorithm on the GPDSP and obtained speedups of 1.60× and 1.45× for the SpMV and Gauss–Seidel iterations, respectively.
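
Multicolor reordering works because updates within one color depend only on points of the other colors, so they can be executed independently. The sketch below shows the textbook red-black (two-color) Gauss–Seidel sweep for a 2D five-point Poisson stencil; it is an analogue of the reordering idea, not the GPDSP implementation, and the stencil and array layout are assumptions.

```cpp
// Red-black (two-color) Gauss-Seidel sweep for the 2D five-point Poisson stencil.
// Points of one color depend only on points of the other color, so each color's
// updates are mutually independent and can be vectorized or parallelized.
// This is a textbook analogue of multicolor reordering, not the GPDSP code.
#include <vector>

void red_black_gauss_seidel(std::vector<double>& u, const std::vector<double>& f,
                            int n, double h, int sweeps) {
    auto at = [n](int i, int j) { return i * n + j; };
    for (int s = 0; s < sweeps; ++s) {
        for (int color = 0; color < 2; ++color) {        // 0 = red, 1 = black
            for (int i = 1; i < n - 1; ++i) {
                for (int j = 1; j < n - 1; ++j) {
                    if ((i + j) % 2 != color) continue;  // update one color per pass
                    u[at(i, j)] = 0.25 * (u[at(i - 1, j)] + u[at(i + 1, j)] +
                                          u[at(i, j - 1)] + u[at(i, j + 1)] +
                                          h * h * f[at(i, j)]);
                }
            }
        }
    }
}
```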

11 pages, 500 KB  
Article
Mapping and Optimization Method of SpMV on Multi-DSP Accelerator
by Sheng Liu, Yasong Cao and Shuwei Sun
Electronics 2022, 11(22), 3699; https://doi.org/10.3390/electronics11223699 - 11 Nov 2022
Cited by 4 | Viewed by 2162
Abstract
Sparse matrix-vector multiplication (SpMV) computes the product of a sparse matrix and a dense vector, and the sparsity of the matrix is often above 90%. Usually, the sparse matrix is compressed to save storage, but this causes irregular accesses to the dense vector, which are time-consuming and degrade the SpMV performance of the system. In this study, we design a dedicated DMA channel that implements indirect memory access to speed up the SpMV operation. On this basis, we propose six SpMV algorithm schemes and map them onto the multi-DSP accelerator to optimize SpMV performance. The results show that the M processor's SpMV performance reaches 6.88 GFLOPS, and the average performance on the HPCG benchmark is 2.8 GFLOPS.
(This article belongs to the Special Issue High-Performance Computing and Its Applications)
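
The irregular access that the dedicated DMA channel targets is the gather x[col[j]]. In software, the same effect can be mimicked by first gathering the needed vector entries into a contiguous buffer and then running a regular dot product, as in the sketch below; this is a generic illustration of the gather/compute split, not the accelerator's DMA design.

```cpp
// Sketch of the gather step that a dedicated DMA channel would perform in
// hardware: collect the x entries named by one row's column indices into a
// contiguous buffer, then run a dense dot product. Generic illustration only.
#include <vector>

double spmv_row_with_gather(const std::vector<int>& col_idx,
                            const std::vector<double>& val,
                            int row_begin, int row_end,
                            const std::vector<double>& x) {
    // Gather phase: indirect, irregular loads (what the DMA channel offloads).
    std::vector<double> gathered(row_end - row_begin);
    for (int j = row_begin; j < row_end; ++j)
        gathered[j - row_begin] = x[col_idx[j]];

    // Compute phase: regular, unit-stride accesses that vectorize well.
    double sum = 0.0;
    for (int j = row_begin; j < row_end; ++j)
        sum += val[j] * gathered[j - row_begin];
    return sum;
}
```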

20 pages, 931 KB  
Article
Adaptive Hybrid Storage Format for Sparse Matrix–Vector Multiplication on Multi-Core SIMD CPUs
by Shizhao Chen, Jianbin Fang, Chuanfu Xu and Zheng Wang
Appl. Sci. 2022, 12(19), 9812; https://doi.org/10.3390/app12199812 - 29 Sep 2022
Cited by 4 | Viewed by 3139
Abstract
Optimizing sparse matrix–vector multiplication (SpMV) is challenging due to the non-uniform distribution of the non-zero elements of the sparse matrix. The best-performing SpMV format changes depending on the input matrix and the underlying architecture, and there is no “one-size-fits-all” format. A hybrid scheme combining multiple SpMV storage formats allows one to choose an appropriate format for the target matrix and hardware. However, existing hybrid approaches are inadequate for utilizing the SIMD units of modern multi-core CPUs, and it remains unclear how best to mix different SpMV formats for a given matrix. This paper presents a new hybrid storage format for sparse matrices, specifically targeting multi-core CPUs with SIMD units. Our approach partitions the target sparse matrix into two segments based on the regularity of the memory access pattern, where each segment is stored in a format suited to its access pattern. Unlike prior hybrid storage schemes that rely on the user to determine the data partition among storage formats, we employ machine learning to build a predictive model that automatically determines the partition threshold on a per-matrix basis. Our predictive model is trained offline, and the trained model can be applied to any new, unseen sparse matrix. We apply our approach to 956 matrices and evaluate its performance on three distinct multi-core CPU platforms: a 72-core Intel Knights Landing (KNL) CPU, a 128-core AMD EPYC CPU, and a 64-core Phytium ARMv8 CPU. Experimental results show that our hybrid scheme, combined with the predictive model, outperforms the best-performing alternative by 2.9%, 17.5%, and 16% on average on the KNL, AMD, and Phytium platforms, respectively.
(This article belongs to the Special Issue AI-Based Image Processing)
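
The partitioning idea can be sketched as a simple split of rows by their nonzero count: rows at or below a threshold go to a regular, SIMD-friendly segment, the rest to an irregular segment. In the paper the threshold comes from a learned predictive model; the fixed threshold in the sketch below is only a placeholder for that prediction, and the ELL-like/CSR-like segment roles are assumptions.

```cpp
// Sketch of splitting a matrix into a "regular" and an "irregular" segment by
// nonzeros per row. The paper learns the threshold per matrix with a predictive
// model; the fixed threshold argument here is a placeholder for that prediction.
#include <utility>
#include <vector>

std::pair<std::vector<int>, std::vector<int>>
partition_rows(const std::vector<int>& row_ptr, int threshold) {
    std::vector<int> regular_rows, irregular_rows;
    const int n_rows = static_cast<int>(row_ptr.size()) - 1;
    for (int i = 0; i < n_rows; ++i) {
        int npr = row_ptr[i + 1] - row_ptr[i];
        // Short rows go to a SIMD-friendly (ELL-like) segment,
        // long rows to a CSR-like segment that tolerates irregularity.
        (npr <= threshold ? regular_rows : irregular_rows).push_back(i);
    }
    return {regular_rows, irregular_rows};
}
```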

30 pages, 4265 KB  
Article
Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)
by Sarah AlAhmadi, Thaha Mohammed, Aiiad Albeshri, Iyad Katib and Rashid Mehmood
Electronics 2020, 9(10), 1675; https://doi.org/10.3390/electronics9101675 - 13 Oct 2020
Cited by 17 | Viewed by 6212
Abstract
Graphics processing units (GPUs) have delivered remarkable performance for a variety of high-performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector (SpMV) computation, which is central to many scientific, engineering, and other applications, including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance for all matrices due to their varying sparsity patterns. An extensive literature review reveals that the performance of SpMV techniques on GPUs has not been studied in sufficient detail. In this paper, we provide a detailed analysis of SpMV performance on GPUs using four notable sparse matrix storage schemes (compressed sparse row (CSR), ELLPACK (ELL), hybrid ELL/COO (HYB), and compressed sparse row 5 (CSR5)), five performance metrics (execution time, giga floating point operations per second (GFLOPS), achieved occupancy, instructions per warp, and warp execution efficiency), five matrix sparsity features (nnz, anpr, npr variance, maxnpr, and distavg), and 17 sparse matrices from 10 application domains (chemical simulations, computational fluid dynamics (CFD), electromagnetics, linear programming, economics, etc.). Subsequently, based on the deeper insights gained through the detailed performance analysis, we propose a technique called the heterogeneous CPU–GPU Hybrid (HCGHYB) scheme. It utilizes both the CPU and GPU in parallel and outperforms the HYB format by an average speedup of 1.7x. Heterogeneous computing is an important direction for SpMV and other application areas. Moreover, to the best of our knowledge, this is the first work in which SpMV performance on GPUs has been discussed in such depth. We believe that this work on SpMV performance analysis and the heterogeneous scheme will open up many new directions and improvements for the SpMV computing field in the future.
(This article belongs to the Section Computer Science & Engineering)

35 pages, 1311 KB  
Article
SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs
by Thaha Muhammed, Rashid Mehmood, Aiiad Albeshri and Iyad Katib
Appl. Sci. 2019, 9(5), 947; https://doi.org/10.3390/app9050947 - 6 Mar 2019
Cited by 23 | Viewed by 4615
Abstract
Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to speed in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments, and adaptively schedule various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads used to execute each segment is adaptively chosen. Dynamic Parallelism available in NVIDIA GPUs is utilized to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open-source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools by delivering a 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
(This article belongs to the Section Computing and Artificial Intelligence)
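
One plausible reading of the SURAA preprocessing is: sort rows by nonzeros per row (npr), derive a bin width from the Freedman–Diaconis rule (2 · IQR · n^(-1/3)), and split the sorted rows into the resulting number of equal-sized segments. The sketch below follows that reading; the exact segment construction, the rough quartile estimates, and the omission of kernel scheduling are all simplifying assumptions.

```cpp
// Sketch: sort rows by nonzeros per row (npr), use the Freedman-Diaconis rule
// (2 * IQR * n^(-1/3)) to pick a bin width over the npr distribution, and split
// the sorted rows into that many equal-sized segments. This only approximates
// the SURAA preprocessing; kernel grouping and scheduling are omitted.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

std::vector<std::vector<int>> segment_rows(const std::vector<int>& row_ptr) {
    const int n = static_cast<int>(row_ptr.size()) - 1;
    if (n <= 0) return {};
    std::vector<int> npr(n), order(n);
    for (int i = 0; i < n; ++i) npr[i] = row_ptr[i + 1] - row_ptr[i];
    std::iota(order.begin(), order.end(), 0);
    // Sort row indices by npr so rows with similar nonzero counts are adjacent.
    std::sort(order.begin(), order.end(), [&](int a, int b) { return npr[a] < npr[b]; });

    // Freedman-Diaconis bin width over the npr distribution (rough quartiles).
    std::vector<int> s(npr);
    std::sort(s.begin(), s.end());
    double iqr = s[(3 * n) / 4] - s[n / 4];
    double width = std::max(1.0, 2.0 * iqr / std::cbrt(static_cast<double>(n)));
    int num_segments = std::max(1, static_cast<int>(std::ceil((s.back() - s.front()) / width)));

    // Split the sorted rows into equally sized segments.
    std::vector<std::vector<int>> segments(num_segments);
    int per_segment = (n + num_segments - 1) / num_segments;
    for (int k = 0; k < n; ++k) segments[k / per_segment].push_back(order[k]);
    return segments;
}
```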

26 pages, 3255 KB  
Article
Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU
by Mohammed Mahmoud, Mark Hoffmann and Hassan Reza
Computation 2018, 6(3), 45; https://doi.org/10.3390/computation6030045 - 24 Aug 2018
Cited by 4 | Viewed by 4796
Abstract
Sparse matrix-vector multiplication (SpMV) can be used to solve linear systems and eigenvalue problems of diverse scales that arise in numerous and varied scientific applications. One such application is Configuration Interaction (CI). CI is a linear method for solving the nonrelativistic Schrödinger equation for quantum chemical multi-electron systems, and it can deal with the ground state as well as multiple excited states. In this paper, we have developed a hybrid approach for handling CI sparse matrices. The proposed model includes a newly developed hybrid format for storing CI sparse matrices on the Graphics Processing Unit (GPU). In addition to the newly developed format, the proposed model includes an SpMV kernel for multiplying the CI matrix (in the proposed format) by a vector, implemented in the C language on the Compute Unified Device Architecture (CUDA) platform. The proposed SpMV kernel is a vector kernel that uses the warp approach. We have evaluated the newly developed model in terms of two primary factors: memory usage and performance. Our proposed kernel was compared to the cuSPARSE library and the CSR5 (Compressed Sparse Row 5) format and outperformed both.
(This article belongs to the Section Computational Chemistry)
