Search Results (11)

Search Parameters:
Keywords = sparse matrix-vector multiplication (SpMV)

20 pages, 2087 KB  
Article
Automatic Sparse Matrix Format Selection via Dynamic Labeling and Clustering on Heterogeneous CPU–GPU Systems
by Zheng Shi, Yi Zou and Xianfeng Song
Electronics 2025, 14(19), 3895; https://doi.org/10.3390/electronics14193895 - 30 Sep 2025
Viewed by 134
Abstract
Sparse matrix–vector multiplication (SpMV) is a fundamental kernel in high-performance computing (HPC) whose efficiency depends heavily on the storage format across central processing unit (CPU) and graphics processing unit (GPU) platforms. Conventional supervised approaches often use execution time as training labels, but our experiments on 1786 matrices reveal two issues: labels are unstable across runs due to execution-time variability, and single-label assignment overlooks cases where multiple formats perform similarly well. We propose a dynamic labeling strategy that assigns a single label when the fastest format shows clear superiority, and multiple labels when performance differences are small, thereby reducing label noise. We further extend feature analysis to multi-dimensional structural descriptors and apply clustering to refine label distributions and enhance prediction robustness. Experiments demonstrate 99.2% accuracy in hardware (CPU/GPU) selection and up to 98.95% accuracy in format prediction, with up to 10% robustness gains over traditional methods. Under cost-aware, end-to-end evaluation that accounts for feature extraction, prediction, conversion, and kernel execution, CPUs achieve speedups up to 3.15× and GPUs up to 1.94× over a CSR baseline. Cross-round evaluations confirm stability and generalization, providing a reliable path toward automated, cross-platform SpMV optimization.
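
The dynamic labeling rule described in this abstract can be illustrated with a short sketch: assign a single label when the fastest format clearly wins, and multiple labels when several formats fall within a margin of the best. The 10% margin and the format names below are illustrative assumptions, not values taken from the paper.

```cpp
// Minimal sketch of a dynamic labeling rule for format selection. The 10%
// relative-performance margin and the format names are illustrative assumptions.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

std::vector<std::string> dynamic_labels(const std::map<std::string, double>& time_ms,
                                        double margin = 0.10) {
    // Find the fastest format.
    auto best = std::min_element(time_ms.begin(), time_ms.end(),
                                 [](const auto& a, const auto& b) { return a.second < b.second; });
    std::vector<std::string> labels;
    for (const auto& [fmt, t] : time_ms) {
        // Label every format whose runtime is within the margin of the fastest one;
        // if only the fastest qualifies, the matrix gets a single label.
        if (t <= best->second * (1.0 + margin)) labels.push_back(fmt);
    }
    return labels;
}

int main() {
    std::map<std::string, double> timings = {{"CSR", 1.00}, {"ELL", 1.05}, {"COO", 1.60}};
    for (const auto& fmt : dynamic_labels(timings))
        std::cout << fmt << '\n';   // prints CSR and ELL: both are within 10% of the best
}
```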

15 pages, 1001 KB  
Article
SRB-ELL: A Vector-Friendly Sparse Matrix Format for SpMV on Scratchpad-Augmented Architectures
by Sheng Zhang, Wuqiang Bai, Zongmao Zhang, Xuchao Xie and Xuebin Tang
Appl. Sci. 2025, 15(17), 9811; https://doi.org/10.3390/app15179811 - 7 Sep 2025
Viewed by 746
Abstract
Sparse Matrix–Vector Multiplication (SpMV) is a critical computational kernel in high-performance computing (HPC) and artificial intelligence (AI). However, its irregular memory access patterns lead to frequent cache misses on multi-level cache hierarchies, significantly degrading performance. Scratchpad memory (SPM), a software-managed, low-latency on-chip memory, offers improved data locality and control, making it a promising alternative for irregular workloads. To enhance SpMV performance, we propose a vectorized execution framework targeting SPM-augmented processors. Recognizing the limitations of traditional formats for vectorization, we introduce Sorted-Row-Block ELL (SRB-ELL), a new matrix storage format derived from ELLPACK (ELL). SRB-ELL stores only non-zero elements, partitions the matrix into row blocks, and sorts them by block size to improve load balance and SIMD efficiency. We implement and evaluate SRB-ELL on a custom processor architecture with integrated SPM using the gem5 simulator. Experimental results show that, compared to vectorized CSR-based SpMV, the SRB-ELL design achieves a speedup of up to 1.48×, with an average of 1.19×.
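
For reference, a plain ELLPACK (ELL) SpMV loop, the format from which SRB-ELL is derived, looks roughly as follows. This is a generic sketch of the baseline ELL layout, not the authors' SRB-ELL implementation; padded slots are assumed to store column 0 and value 0.

```cpp
// Generic ELLPACK (ELL) SpMV sketch: every row is padded to the same width and
// the entries are stored column-major so that slot k of consecutive rows is
// contiguous and SIMD-friendly. Padded slots are assumed to hold column 0 and
// value 0.0, so they contribute nothing to the result.
#include <vector>

void spmv_ell(int n_rows, int width,
              const std::vector<int>& col,     // size n_rows * width, column-major
              const std::vector<double>& val,  // size n_rows * width, column-major
              const std::vector<double>& x,
              std::vector<double>& y) {
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = 0; k < width; ++k) {
            int idx = k * n_rows + i;          // column-major index of slot k in row i
            sum += val[idx] * x[col[idx]];     // padded slots add 0.0
        }
        y[i] = sum;
    }
}
```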

21 pages, 6865 KB  
Article
Elegante+: A Machine Learning-Based Optimization Framework for Sparse Matrix–Vector Computations on the CPU Architecture
by Muhammad Ahmad, Sardar Usman, Ameer Hamza, Muhammad Muzamil and Ildar Batyrshin
Information 2025, 16(7), 553; https://doi.org/10.3390/info16070553 - 29 Jun 2025
Viewed by 610
Abstract
Sparse matrix–vector multiplication (SpMV) plays a significant role in the computational costs of many scientific applications such as 2D/3D robotics, power network problems, and computer vision. Numerous implementations using different sparse matrix formats have been introduced to optimize this kernel on CPUs and GPUs. However, due to the sparsity patterns of matrices and the diverse configurations of hardware, accurately modeling the performance of SpMV remains a complex challenge. SpMV computation is often a time-consuming process because of the sparse matrix structure. To address this, we propose a machine learning-based tool, namely Elegante+, that predicts optimal scheduling policies by analyzing matrix structures. This approach eliminates the need for repetitive trial and error, minimizes errors, and identifies the best configuration for the SpMV kernel, enabling users to make informed decisions about scheduling policies that maximize computational efficiency. For this purpose, we collected 1000+ sparse matrices from the SuiteSparse Matrix Collection, converted them into the compressed sparse row (CSR) format, performed SpMV computations, and extracted 14 key sparse matrix features. After creating a comprehensive dataset, we trained various machine learning models to predict the optimal scheduling policy, significantly enhancing computational efficiency and reducing overhead in high-performance computing environments. Our proposed tool, Elegante+ (XGB with all SpMV features), achieved the highest cross-validation score of 79% and performed five times faster than the default scheduling policy during SpMV in a high-performance computing (HPC) environment.
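
A few of the structural features such a predictor consumes can be computed in a single pass over the CSR row-pointer array. The features below (row count, nnz, mean and maximum nonzeros per row) are common examples chosen for illustration and are not necessarily among the 14 features used by Elegante+.

```cpp
// Sketch: computing simple structural features from a CSR row-pointer array.
// These are generic SpMV features, not necessarily the ones used by Elegante+.
#include <algorithm>
#include <cstdio>
#include <vector>

struct CsrFeatures { int rows; int nnz; double mean_npr; int max_npr; };

CsrFeatures csr_features(const std::vector<int>& row_ptr) {
    CsrFeatures f{};
    f.rows = static_cast<int>(row_ptr.size()) - 1;
    f.nnz = row_ptr.back();
    for (int i = 0; i < f.rows; ++i)
        f.max_npr = std::max(f.max_npr, row_ptr[i + 1] - row_ptr[i]);
    f.mean_npr = f.rows > 0 ? static_cast<double>(f.nnz) / f.rows : 0.0;
    return f;
}

int main() {
    std::vector<int> row_ptr = {0, 2, 2, 5, 6};   // 4 rows, 6 nonzeros
    CsrFeatures f = csr_features(row_ptr);
    std::printf("rows=%d nnz=%d mean_npr=%.2f max_npr=%d\n",
                f.rows, f.nnz, f.mean_npr, f.max_npr);
}
```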

14 pages, 926 KB  
Article
Optimizing the Performance of the Sparse Matrix–Vector Multiplication Kernel in FPGA Guided by the Roofline Model
by Federico Favaro, Ernesto Dufrechou, Juan P. Oliver and Pablo Ezzatti
Micromachines 2023, 14(11), 2030; https://doi.org/10.3390/mi14112030 - 31 Oct 2023
Cited by 4 | Viewed by 2623
Abstract
The widespread adoption of massively parallel processors over the past decade has fundamentally transformed the landscape of high-performance computing hardware. This revolution has recently driven the advancement of FPGAs, which are emerging as an attractive alternative to power-hungry many-core devices in a world increasingly concerned with energy consumption. Consequently, numerous recent studies have focused on implementing efficient dense and sparse numerical linear algebra (NLA) kernels on FPGAs. To maximize the efficiency of these kernels, a key aspect is the use of analytical tools to understand the performance of the implementations and guide the optimization process. In this regard, the roofline model (RLM) is a well-known graphical tool that facilitates the analysis of computational performance and identifies the primary bottlenecks of specific software when executed on a particular hardware platform. Our previous efforts advanced the development of efficient implementations of sparse matrix–vector multiplication (SpMV) for FPGAs, considering both speed and energy consumption. In this work, we propose an extension of the RLM that enables optimizing runtime and energy consumption for NLA kernels based on sparse blocked storage formats on FPGAs. To test the power of this tool, we use it to extend our previous SpMV kernels by leveraging a block-sparse storage format that enables more efficient data access.
(This article belongs to the Special Issue FPGA Applications and Future Trends)
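
The classical roofline bound referenced in this abstract is simply min(peak compute, memory bandwidth × arithmetic intensity). The sketch below evaluates it for a CSR-like SpMV; the platform numbers and the per-nonzero traffic estimate are made-up assumptions, not figures from the paper, and the paper's energy-aware extension is not modeled here.

```cpp
// Sketch of the classical roofline bound: attainable GFLOPS =
// min(peak_gflops, bandwidth_GBs * arithmetic_intensity). The platform numbers
// and the per-nonzero traffic estimate below are illustrative assumptions.
#include <algorithm>
#include <cstdio>

int main() {
    const double peak_gflops   = 100.0;  // assumed peak compute of the device
    const double bandwidth_gbs = 20.0;   // assumed sustained memory bandwidth

    // CSR-like SpMV: ~2 flops per nonzero; ~12 bytes per nonzero
    // (8-byte value + 4-byte column index), ignoring vector and row-pointer traffic.
    const double flops_per_nnz = 2.0;
    const double bytes_per_nnz = 12.0;
    const double intensity = flops_per_nnz / bytes_per_nnz;   // flops per byte

    const double roofline = std::min(peak_gflops, bandwidth_gbs * intensity);
    std::printf("arithmetic intensity = %.3f flops/byte, attainable = %.2f GFLOPS\n",
                intensity, roofline);
}
```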

18 pages, 873 KB  
Article
Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs
by Guangsen Zeng and Yi Zou
Electronics 2023, 12(17), 3687; https://doi.org/10.3390/electronics12173687 - 31 Aug 2023
Cited by 1 | Viewed by 2288
Abstract
Sparse matrix-vector multiplication (SpMV) is central to many scientific, engineering, and other applications, including machine learning. Compressed Sparse Row (CSR) is a widely used sparse matrix storage format. SpMV using the CSR format on GPU computing platforms is widely studied, where the memory access behavior of the GPU is often the performance bottleneck. NVIDIA's recent Ampere GPU architecture provides a new asynchronous memory copy instruction, memcpy_async, for more efficient data movement into shared memory. Leveraging this new memcpy_async instruction, we first propose the CSR-Partial-Overlap to carefully overlap the data copy from global memory to shared memory with computation, allowing the data transfer time to be hidden behind computation. In addition, we design the dynamic batch partition and the dynamic thread distribution to achieve effective load balancing, avoid the overhead of fixing up partial sums, and improve thread utilization. Furthermore, we propose the CSR-Full-Overlap based on the CSR-Partial-Overlap, which additionally overlaps data transfer from host to device with SpMV kernel execution. The CSR-Full-Overlap unifies the two major overlaps in SpMV and hides computation as much as possible behind the two dominant memory transfers on the GPU. This allows CSR-Full-Overlap to achieve the best performance gains from both overlaps. To the best of our knowledge, this paper is the first in-depth study of how memcpy_async can be applied to accelerate SpMV computation on GPU platforms. We compare CSR-Full-Overlap to the current state-of-the-art cuSPARSE, where our experimental results show an average 2.03x performance gain and up to a 2.67x performance gain.
(This article belongs to the Section Computer Science & Engineering)
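
For context, the CSR kernel whose memory traffic these overlap schemes target is, in its simplest sequential form, the loop below. This is a generic CPU reference, not the paper's memcpy_async-based CUDA implementation; the indirect access to x is the irregular part that dominates the memory behavior.

```cpp
// Generic sequential CSR SpMV reference (y = A * x). This is only the baseline
// kernel whose data movement the overlap schemes try to hide; the paper's
// memcpy_async-based CUDA kernels are not reproduced here.
#include <vector>

void spmv_csr(const std::vector<int>& row_ptr,
              const std::vector<int>& col_idx,
              const std::vector<double>& val,
              const std::vector<double>& x,
              std::vector<double>& y) {
    const int n_rows = static_cast<int>(row_ptr.size()) - 1;
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   // indirect access to x is the irregular part
        y[i] = sum;
    }
}
```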

15 pages, 1614 KB  
Article
Improving Structured Grid-Based Sparse Matrix-Vector Multiplication and Gauss–Seidel Iteration on GPDSP
by Yang Wang, Jie Liu, Xiaoxiong Zhu, Qingyang Zhang, Shengguo Li and Qinglin Wang
Appl. Sci. 2023, 13(15), 8952; https://doi.org/10.3390/app13158952 - 3 Aug 2023
Cited by 2 | Viewed by 1590
Abstract
Structured grid-based sparse matrix-vector multiplication and Gauss–Seidel iterations are very important kernel functions in scientific and engineering computations, both of which are memory-intensive and bandwidth-limited. GPDSP is a general-purpose digital signal processor, an important class of embedded processor that has been introduced into high-performance computing. In this paper, we designed several optimization methods for structured grid-based SpMV and Gauss–Seidel iterations on GPDSP: a blocking method to improve data locality and memory access efficiency, a multicolor reordering method to expose fine-grained parallelism in the Gauss–Seidel iteration, a data partitioning method tailored to the GPDSP memory structure, and a double-buffering method to overlap computation and memory accesses. Finally, we combined these optimization methods to design a multicore vectorization algorithm. We tested matrices generated from structured grids of different sizes on the GPDSP platform and obtained speedups of up to 41× and 47× compared to the unoptimized SpMV and Gauss–Seidel iterations, with maximum bandwidth efficiencies of 72% and 81%, respectively. The experimental results show that our algorithms can fully utilize the external memory bandwidth. We also implemented the commonly used mixed-precision algorithm on the GPDSP and obtained speedups of 1.60× and 1.45× for the SpMV and Gauss–Seidel iterations, respectively.
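
Multicolor reordering works because updates within one color depend only on points of the other colors, so they can be executed independently. The sketch below shows the textbook red-black (two-color) Gauss–Seidel sweep for a 2D five-point Poisson stencil; it is an analogue of the reordering idea, not the GPDSP implementation, and the stencil and array layout are assumptions.

```cpp
// Red-black (two-color) Gauss-Seidel sweep for the 2D five-point Poisson stencil.
// Points of one color depend only on points of the other color, so each color's
// updates are mutually independent and can be vectorized or parallelized.
// This is a textbook analogue of multicolor reordering, not the GPDSP code.
#include <vector>

void red_black_gauss_seidel(std::vector<double>& u, const std::vector<double>& f,
                            int n, double h, int sweeps) {
    auto at = [n](int i, int j) { return i * n + j; };
    for (int s = 0; s < sweeps; ++s) {
        for (int color = 0; color < 2; ++color) {        // 0 = red, 1 = black
            for (int i = 1; i < n - 1; ++i) {
                for (int j = 1; j < n - 1; ++j) {
                    if ((i + j) % 2 != color) continue;  // update one color per pass
                    u[at(i, j)] = 0.25 * (u[at(i - 1, j)] + u[at(i + 1, j)] +
                                          u[at(i, j - 1)] + u[at(i, j + 1)] +
                                          h * h * f[at(i, j)]);
                }
            }
        }
    }
}
```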

11 pages, 500 KB  
Article
Mapping and Optimization Method of SpMV on Multi-DSP Accelerator
by Sheng Liu, Yasong Cao and Shuwei Sun
Electronics 2022, 11(22), 3699; https://doi.org/10.3390/electronics11223699 - 11 Nov 2022
Cited by 4 | Viewed by 2162
Abstract
Sparse matrix-vector multiplication (SpMV) computes the product of a sparse matrix and a dense vector, and the sparsity of the matrix is often above 90%. Usually, the sparse matrix is compressed to save storage, but this causes irregular accesses to the dense vector, which are time-consuming and degrade the SpMV performance of the system. In this study, we design a dedicated DMA channel that implements indirect memory access to speed up the SpMV operation. On this basis, we propose six SpMV algorithm schemes and map them onto the multi-DSP accelerator to optimize SpMV performance. The results show that the M processor's SpMV performance reaches 6.88 GFLOPS, and the average performance on the HPCG benchmark is 2.8 GFLOPS.
(This article belongs to the Special Issue High-Performance Computing and Its Applications)
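
The irregular access that the dedicated DMA channel targets is the gather x[col[j]]. In software, the same effect can be mimicked by first gathering the needed vector entries into a contiguous buffer and then running a regular dot product, as in the sketch below; this is a generic illustration of the gather/compute split, not the accelerator's DMA design.

```cpp
// Sketch of the gather step that a dedicated DMA channel would perform in
// hardware: collect the x entries named by one row's column indices into a
// contiguous buffer, then run a dense dot product. Generic illustration only.
#include <vector>

double spmv_row_with_gather(const std::vector<int>& col_idx,
                            const std::vector<double>& val,
                            int row_begin, int row_end,
                            const std::vector<double>& x) {
    // Gather phase: indirect, irregular loads (what the DMA channel offloads).
    std::vector<double> gathered(row_end - row_begin);
    for (int j = row_begin; j < row_end; ++j)
        gathered[j - row_begin] = x[col_idx[j]];

    // Compute phase: regular, unit-stride accesses that vectorize well.
    double sum = 0.0;
    for (int j = row_begin; j < row_end; ++j)
        sum += val[j] * gathered[j - row_begin];
    return sum;
}
```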

20 pages, 931 KB  
Article
Adaptive Hybrid Storage Format for Sparse Matrix–Vector Multiplication on Multi-Core SIMD CPUs
by Shizhao Chen, Jianbin Fang, Chuanfu Xu and Zheng Wang
Appl. Sci. 2022, 12(19), 9812; https://doi.org/10.3390/app12199812 - 29 Sep 2022
Cited by 4 | Viewed by 3139
Abstract
Optimizing sparse matrix–vector multiplication (SpMV) is challenging due to the non-uniform distribution of the non-zero elements of the sparse matrix. The best-performing SpMV format changes depending on the input matrix and the underlying architecture, and there is no “one-size-fits-all” format. A hybrid scheme combining multiple SpMV storage formats allows one to choose an appropriate format for the target matrix and hardware. However, existing hybrid approaches are inadequate for utilizing the SIMD units of modern multi-core CPUs, and it remains unclear how best to mix different SpMV formats for a given matrix. This paper presents a new hybrid storage format for sparse matrices, specifically targeting multi-core CPUs with SIMD units. Our approach partitions the target sparse matrix into two segments based on the regularity of the memory access pattern, where each segment is stored in a format suited to its access pattern. Unlike prior hybrid storage schemes that rely on the user to determine the data partition among storage formats, we employ machine learning to build a predictive model that automatically determines the partition threshold on a per-matrix basis. Our predictive model is trained offline, and the trained model can be applied to any new, unseen sparse matrix. We apply our approach to 956 matrices and evaluate its performance on three distinct multi-core CPU platforms: a 72-core Intel Knights Landing (KNL) CPU, a 128-core AMD EPYC CPU, and a 64-core Phytium ARMv8 CPU. Experimental results show that our hybrid scheme, combined with the predictive model, outperforms the best-performing alternative by 2.9%, 17.5%, and 16% on average on the KNL, AMD, and Phytium platforms, respectively.
(This article belongs to the Special Issue AI-Based Image Processing)
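
The partitioning idea can be sketched as a simple split of rows by their nonzero count: rows at or below a threshold go to a regular, SIMD-friendly segment, the rest to an irregular segment. In the paper the threshold comes from a learned predictive model; the fixed threshold in the sketch below is only a placeholder for that prediction, and the ELL-like/CSR-like segment roles are assumptions.

```cpp
// Sketch of splitting a matrix into a "regular" and an "irregular" segment by
// nonzeros per row. The paper learns the threshold per matrix with a predictive
// model; the fixed threshold argument here is a placeholder for that prediction.
#include <utility>
#include <vector>

std::pair<std::vector<int>, std::vector<int>>
partition_rows(const std::vector<int>& row_ptr, int threshold) {
    std::vector<int> regular_rows, irregular_rows;
    const int n_rows = static_cast<int>(row_ptr.size()) - 1;
    for (int i = 0; i < n_rows; ++i) {
        int npr = row_ptr[i + 1] - row_ptr[i];
        // Short rows go to a SIMD-friendly (ELL-like) segment,
        // long rows to a CSR-like segment that tolerates irregularity.
        (npr <= threshold ? regular_rows : irregular_rows).push_back(i);
    }
    return {regular_rows, irregular_rows};
}
```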

30 pages, 4265 KB  
Article
Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)
by Sarah AlAhmadi, Thaha Mohammed, Aiiad Albeshri, Iyad Katib and Rashid Mehmood
Electronics 2020, 9(10), 1675; https://doi.org/10.3390/electronics9101675 - 13 Oct 2020
Cited by 17 | Viewed by 6212
Abstract
Graphics processing units (GPUs) have delivered remarkable performance for a variety of high-performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector (SpMV) computation, which is central to many scientific, engineering, and other applications, including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance for all matrices due to their varying sparsity patterns. An extensive literature review reveals that the performance of SpMV techniques on GPUs has not been studied in sufficient detail. In this paper, we provide a detailed analysis of SpMV performance on GPUs using four notable sparse matrix storage schemes (compressed sparse row (CSR), ELLPACK (ELL), hybrid ELL/COO (HYB), and compressed sparse row 5 (CSR5)), five performance metrics (execution time, giga floating point operations per second (GFLOPS), achieved occupancy, instructions per warp, and warp execution efficiency), five matrix sparsity features (nnz, anpr, npr variance, maxnpr, and distavg), and 17 sparse matrices from 10 application domains (chemical simulations, computational fluid dynamics (CFD), electromagnetics, linear programming, economics, etc.). Subsequently, based on the deeper insights gained through the detailed performance analysis, we propose a technique called the heterogeneous CPU–GPU Hybrid (HCGHYB) scheme. It utilizes both the CPU and GPU in parallel and outperforms the HYB format by an average speedup of 1.7x. Heterogeneous computing is an important direction for SpMV and other application areas. Moreover, to the best of our knowledge, this is the first work in which SpMV performance on GPUs has been discussed in such depth. We believe that this work on SpMV performance analysis and the heterogeneous scheme will open up many new directions and improvements for the SpMV computing field in the future.
(This article belongs to the Section Computer Science & Engineering)

35 pages, 1311 KB  
Article
SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs
by Thaha Muhammed, Rashid Mehmood, Aiiad Albeshri and Iyad Katib
Appl. Sci. 2019, 9(5), 947; https://doi.org/10.3390/app9050947 - 6 Mar 2019
Cited by 23 | Viewed by 4615
Abstract
Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to speed in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments, and adaptively schedule various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads used to execute each segment is adaptively chosen. Dynamic Parallelism available in NVIDIA GPUs is utilized to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open-source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools by delivering a 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
(This article belongs to the Section Computing and Artificial Intelligence)
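
One plausible reading of the SURAA preprocessing is: sort rows by nonzeros per row (npr), derive a bin width from the Freedman–Diaconis rule (2 · IQR · n^(-1/3)), and split the sorted rows into the resulting number of equal-sized segments. The sketch below follows that reading; the exact segment construction, the rough quartile estimates, and the omission of kernel scheduling are all simplifying assumptions.

```cpp
// Sketch: sort rows by nonzeros per row (npr), use the Freedman-Diaconis rule
// (2 * IQR * n^(-1/3)) to pick a bin width over the npr distribution, and split
// the sorted rows into that many equal-sized segments. This only approximates
// the SURAA preprocessing; kernel grouping and scheduling are omitted.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

std::vector<std::vector<int>> segment_rows(const std::vector<int>& row_ptr) {
    const int n = static_cast<int>(row_ptr.size()) - 1;
    if (n <= 0) return {};
    std::vector<int> npr(n), order(n);
    for (int i = 0; i < n; ++i) npr[i] = row_ptr[i + 1] - row_ptr[i];
    std::iota(order.begin(), order.end(), 0);
    // Sort row indices by npr so rows with similar nonzero counts are adjacent.
    std::sort(order.begin(), order.end(), [&](int a, int b) { return npr[a] < npr[b]; });

    // Freedman-Diaconis bin width over the npr distribution (rough quartiles).
    std::vector<int> s(npr);
    std::sort(s.begin(), s.end());
    double iqr = s[(3 * n) / 4] - s[n / 4];
    double width = std::max(1.0, 2.0 * iqr / std::cbrt(static_cast<double>(n)));
    int num_segments = std::max(1, static_cast<int>(std::ceil((s.back() - s.front()) / width)));

    // Split the sorted rows into equally sized segments.
    std::vector<std::vector<int>> segments(num_segments);
    int per_segment = (n + num_segments - 1) / num_segments;
    for (int k = 0; k < n; ++k) segments[k / per_segment].push_back(order[k]);
    return segments;
}
```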

26 pages, 3255 KB  
Article
Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU
by Mohammed Mahmoud, Mark Hoffmann and Hassan Reza
Computation 2018, 6(3), 45; https://doi.org/10.3390/computation6030045 - 24 Aug 2018
Cited by 4 | Viewed by 4796
Abstract
Sparse matrix-vector multiplication (SpMV) can be used to solve linear systems and eigenvalue problems of diverse scales that arise in numerous and varied scientific applications. One such application is Configuration Interaction (CI). CI is a linear method for solving the nonrelativistic Schrödinger equation for quantum chemical multi-electron systems, and it can deal with the ground state as well as multiple excited states. In this paper, we have developed a hybrid approach for handling CI sparse matrices. The proposed model includes a newly developed hybrid format for storing CI sparse matrices on the Graphics Processing Unit (GPU). In addition to the newly developed format, the proposed model includes an SpMV kernel for multiplying the CI matrix (in the proposed format) by a vector, implemented in the C language on the Compute Unified Device Architecture (CUDA) platform. The proposed SpMV kernel is a vector kernel that uses the warp approach. We have evaluated the newly developed model in terms of two primary factors: memory usage and performance. Our proposed kernel was compared to the cuSPARSE library and the CSR5 (Compressed Sparse Row 5) format and outperformed both.
(This article belongs to the Section Computational Chemistry)
