Article

SRB-ELL: A Vector-Friendly Sparse Matrix Format for SpMV on Scratchpad-Augmented Architectures

Sheng Zhang, Wuqiang Bai, Zongmao Zhang, Xuchao Xie and Xuebin Tang
1 College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
2 School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9811; https://doi.org/10.3390/app15179811
Submission received: 12 July 2025 / Revised: 24 August 2025 / Accepted: 1 September 2025 / Published: 7 September 2025

Abstract

Sparse Matrix–Vector Multiplication (SpMV) is a critical computational kernel in high-performance computing (HPC) and artificial intelligence (AI). However, its irregular memory access patterns lead to frequent cache misses on multi-level cache hierarchies, significantly degrading performance. Scratchpad memory (SPM), a software-managed, low-latency on-chip memory, offers improved data locality and control, making it a promising alternative for irregular workloads. To enhance SpMV performance, we propose a vectorized execution framework targeting SPM-augmented processors. Recognizing the limitations of traditional formats for vectorization, we introduce Sorted-Row-Block ELL (SRB-ELL), a new matrix storage format derived from ELLPACK (ELL). SRB-ELL stores only non-zero elements, partitions the matrix into row blocks, and sorts them by block size to improve load balance and SIMD efficiency. We implement and evaluate SRB-ELL on a custom processor architecture with integrated SPM using the gem5 simulator. Experimental results show that, compared to vectorized CSR-based SpMV, the SRB-ELL design achieves up to 1.48× speedup and an average of 1.19×.

1. Introduction

Since the end of Dennard scaling and the subsequent stagnation in CPU clock frequency, computer architecture has undergone a paradigm shift. To sustain performance growth, both hardware architects and software developers have increasingly relied on exploiting parallelism. While instruction-level parallelism and thread-level parallelism have been extensively studied and deployed, data-level parallelism (DLP) remains a largely untapped opportunity. DLP can be efficiently exposed to hardware through vectorized execution, such as Single Instruction Multiple Data (SIMD) architectures [1,2], which operate on multiple data elements simultaneously. Many modern applications stand to benefit from vectorization, yielding improved performance, enhanced energy efficiency, and better utilization of computational resources [3,4,5,6].
Among the various computational workloads, sparse matrix operations—and in particular, Sparse Matrix–Vector Multiplication (SpMV)—pose significant challenges for efficient vectorization due to their inherently irregular memory access patterns and low computational intensity [7,8]. Despite these difficulties, SpMV remains a foundational kernel in a wide range of domains, including linear solvers, high-performance computing (HPC), artificial intelligence (AI), graph computing, and supercomputing [9,10,11,12,13,14]. For instance, SpMV is the core operation in Conjugate Gradient methods, which serve as a central component in the High-Performance Conjugate Gradient (HPCG) benchmark—an emerging alternative to the traditional LINPACK benchmark for evaluating supercomputer capabilities [15,16]. In AI workloads, SpMV is frequently used in optimization routines such as gradient descent in Support Vector Machine (SVM) training [17]. Additionally, in the realm of big data and graph analytics, SpMV forms the computational backbone of many graph algorithms and is the most critical kernel in the GraphBLAS specification [18]. Therefore, achieving high-performance vectorized implementations of SpMV is of paramount importance, with the potential to deliver transformative improvements across HPC, AI, and data-intensive applications.
In these domains, many real-world applications involve sparse matrices with less than 1% of non-zero elements [19]. To minimize memory usage and improve computational efficiency, a variety of compressed sparse matrix formats have been proposed [20,21,22,23,24]. However, most of these formats are poorly aligned with the requirements of modern vector architectures. Two primary challenges emerge: (1) sparse matrix formats often involve irregular memory access and indirection, which inhibit effective SIMD utilization, and (2) existing vector hardware lacks native support for the structural irregularity of sparse computations.
Among these formats, Compressed Sparse Row (CSR) remains the most commonly used due to its generality and compact representation [25,26]. However, its pointer-based row indexing and variable row lengths introduce irregular control flow and memory access, resulting in low vector lane utilization and poor cache performance.
To address these challenges, scratchpad memory (SPM) [27] has been proposed as an effective alternative to traditional cache hierarchies for irregular workloads. SPM offers low-latency, software-controlled access, allowing fine-grained management of frequently accessed data such as input vectors in SpMV, thereby reducing main memory traffic and improving data locality.
In this work, we propose Sorted-Row-Block ELL (SRB-ELL), a new sparse matrix format optimized for vectorization and SPM-based architectures. SRB-ELL stores only non-zero elements, partitions the matrix into row blocks, and sorts them by size to improve regularity and SIMD efficiency. We implement the SRB-ELL-based SpMV kernel on a custom processor architecture with integrated scratchpad memory using the gem5 simulator [28]. Experimental results demonstrate that SRB-ELL significantly improves vector efficiency and overall performance compared to vectorized CSR implementations.
The structure of this paper is as follows. Section 2 introduces related work. Section 3 details the design of the SRB-ELL format and its integration with scratchpad memory. Section 4 presents experimental evaluations of the proposed method. Finally, Section 5 concludes the paper.

2. Related Work

This section surveys existing sparse matrix formats, analyzes the limitations of current vectorization techniques for sparse computations, and highlights the motivation for hardware–software co-design to improve vectorization efficiency.

2.1. Sparse Matrix Compressed Formats

Efficient memory storage is a fundamental challenge in sparse matrix computation, motivating the development of numerous compressed formats that aim to minimize storage cost and improve access efficiency. CSR is one of the most widely used representations for sparse matrices, particularly in libraries that support sparse matrix operations. For the sparse matrix A in Figure 1a, CSR encodes the matrix using three arrays, shown in Figure 1b: data stores all non-zero values in row-major order, col_idx holds the corresponding column indices, and row_ptr indicates the starting position of each row in the data and col_idx arrays. This compact structure eliminates the need to store zero elements explicitly, significantly reducing memory usage.
Algorithm 1 presents a typical CSR-based implementation of SpMV. The algorithm traverses only the non-zero elements of matrix A, thus avoiding unnecessary computation and memory access. However, CSR’s performance can be limited by pointer-chasing overhead, since the value in col_idx must first be loaded and then used to access the corresponding entry in the dense input vector x, leading to irregular memory access and reduced vectorized efficiency.
Algorithm 1 CSR-Based SpMV Implementation.
1: for i = 0 to rows − 1 do
2:     for j = row_ptr[i] to row_ptr[i+1] − 1 do
3:         y[i] ← y[i] + data[j] × x[col_idx[j]]
4:     end for
5: end for
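For concreteness, the following is a minimal scalar C rendering of Algorithm 1. The function name, argument order, and index types are our own illustrative choices, not taken from any particular library.

#include <stddef.h>

/* Scalar CSR SpMV: y += A*x, following Algorithm 1.
 * data/col_idx hold the NNZ values and column indices;
 * row_ptr[i]..row_ptr[i+1]-1 spans row i. */
void spmv_csr(size_t rows, const int *row_ptr, const int *col_idx,
              const double *data, const double *x, double *y)
{
    for (size_t i = 0; i < rows; i++) {
        double sum = y[i];
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
            /* The indirect load x[col_idx[j]] is the pointer-chasing
             * access responsible for the irregular memory behavior
             * discussed above. */
            sum += data[j] * x[col_idx[j]];
        }
        y[i] = sum;
    }
}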
The ELLPACK (ELL) format is a structured sparse matrix representation optimized for vector and GPU architectures. It stores each row using a fixed number of entries equal to the maximum number of non-zero elements per row (K), padding with zeros when necessary. Two dense arrays of size M × K are used: data stores non-zero values and padded zeros, while indices holds the corresponding column indices, as illustrated in Figure 1c. This regular layout enables aligned memory accesses and efficient SIMD vectorization.
Algorithm 2 shows the ELL-based implementation of SpMV. Each row performs K fixed-length operations using the corresponding entries in the data and indices arrays, enabling regular and aligned memory accesses that benefit vector execution. However, when the matrix has a highly irregular distribution of non-zero elements across rows, the padding introduces redundant computations and memory overhead. This inefficiency is further amplified on CPUs, where vector units are narrower and memory bandwidth is more constrained. As a result, despite its structural regularity, the ELL format cannot be efficiently vectorized for general sparse matrices on modern CPUs.
Algorithm 2 ELL-Based SpMV Implementation.
1: for i = 0 to M − 1 do
2:     for j = 0 to K − 1 do
3:         y[i] ← y[i] + data[i][j] × x[indices[i][j]]
4:     end for
5: end for
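A corresponding scalar C sketch of Algorithm 2 follows. We assume the two M × K arrays are flattened in row-major order, the usual layout, though the text does not state it explicitly.

#include <stddef.h>

/* Scalar ELL SpMV following Algorithm 2: every row is padded to K
 * entries, so the inner trip count is fixed. Padded slots carry a
 * zero value (here, by convention, with column index 0), so they add
 * nothing to the sum but still cost a load and a multiply each --
 * exactly the overhead SRB-ELL is designed to remove. */
void spmv_ell(size_t M, size_t K, const double *data, const int *indices,
              const double *x, double *y)
{
    for (size_t i = 0; i < M; i++) {
        double sum = y[i];
        for (size_t j = 0; j < K; j++) {
            sum += data[i * K + j] * x[indices[i * K + j]];
        }
        y[i] = sum;
    }
}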

2.2. SpMV Vectorization Bottlenecks

Vectorization, or SIMD, is a fundamental optimization technique in modern processors that allows performing the same operation on multiple data elements simultaneously. By operating on vectors of data instead of scalars, SIMD improves instruction throughput and DLP, which is especially critical given the stagnation of processor clock speeds.
To address vectorization challenges in irregular applications such as sparse matrix computations, the ARM Scalable Vector Extension (SVE) provides flexible hardware features including scalable vector lengths, predicate execution, and efficient gather/scatter instructions [29,30,31,32]. These features allow computation on irregular or partial vectors, reducing control divergence and improving SIMD utilization.
As shown in Figure 2, unlike fixed-length SIMD units such as SSE, SVE uses predicate vectors (e.g., svbool_t) to enable conditional execution on vector lanes. This fine-grained control is particularly effective for sparse computations.
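As a small illustration of this mechanism, the following C fragment (our own sketch, assuming an SVE-enabled toolchain and the ACLE intrinsics from arm_sve.h) scales an array of doubles with a predicated loop; svwhilelt_b64 masks off the tail lanes in the final iteration, so no scalar epilogue is needed.

#include <arm_sve.h>
#include <stdint.h>

/* Predicated vector loop: processes the array in chunks of svcntd()
 * lanes (the number of 64-bit elements per vector). */
void scale(double *a, int64_t n, double s)
{
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);   /* active lanes: i..n-1 */
        svfloat64_t v = svld1_f64(pg, &a[i]);
        v = svmul_n_f64_m(pg, v, s);         /* multiply active lanes */
        svst1_f64(pg, &a[i], v);             /* masked store */
    }
}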
However, applying vectorization effectively to SpMV is notoriously difficult. The primary challenge arises from the irregular distribution of non-zero elements, which leads to indirect memory accesses and irregular memory access patterns. Traditional formats such as CSR suffer from pointer chasing and poor data locality, while ELL introduces padding overhead to maintain regular structure, which can result in redundant computations and increased memory bandwidth usage, particularly for highly irregular matrices.
Prior efforts have explored various strategies to enhance memory access patterns and vectorization efficiency in sparse matrix computations. VHCC [33] employs a two-dimensional jagged partitioning scheme combined with segmented summation to improve data locality and SIMD utilization. ESB [34] reduces padding overhead in ELL by sorting and blocking rows, making it more suitable for engineering applications. SELL-C [35] reduces zero padding by slicing the matrix into C-row blocks, sorting rows within each block by non-zero count, and padding only to the block’s maximum row length. Anzt et al. [36] further align row lengths within each block to the SIMD width or GPU warp size, improving parallel efficiency on architectures with strict alignment requirements. CSR5 [20] is designed for cross-platform compatibility and delivers moderate performance on x86 architectures. Chen et al. [37] propose a fine-grained comparison of non-zero elements to boost vectorization in irregular workloads. Ashari et al. introduce the BRC format, which applies two-dimensional blocking to better exploit cache and parallelism. CVR [38] introduces a vectorization-oriented sparse matrix format that enables simultaneous multi-row processing and efficient SIMD lane utilization, thereby improving cache locality while minimizing preprocessing overhead.
Block-based formats like BCSR and UBCSR improve vector utilization by grouping non-zeros into blocks [39,40]. However, these formats often suffer from zero padding overhead and alignment constraints, which may degrade performance. Graph-based reordering techniques such as Cuthill–McKee [41] can enhance data locality, but they also require complex preprocessing steps and are not always portable across different matrix structures.

3. Design and Implementation

We propose Sorted-Row-Block ELLPACK (SRB-ELL), a vector-friendly sparse matrix format based on ELL, designed to reduce padding and improve SIMD efficiency. To further optimize SpMV performance, we integrate scratchpad memory into an ARMv8 processor model in gem5.

3.1. SRB-ELL Design and Vectorized Algorithm

The ELL format is well-suited for vectorized execution due to its regular memory layout, but its performance degrades significantly when matrices have irregular row lengths. This is because ELL requires zero padding to align row lengths, which leads to additional memory usage and unnecessary computations, reducing overall SpMV efficiency. In contrast, formats like CSR are more compact but suffer from irregular access patterns and indirect memory accesses, making them less amenable to SIMD vectorization.
To address these limitations, we propose SRB-ELL, a vector-friendly sparse matrix format based on ELL, designed to improve vectorization utilization while minimizing zero padding overhead. SRB-ELL organizes the sparse matrix into blocks of rows, sorts them by row length, and removes fully empty rows, thereby preserving data regularity and reducing redundant storage and computation.
The SRB-ELL format employs four arrays: data stores all non-zero values, index stores their corresponding column indices, row_pos records the original row index of each stored row, and row_len stores the length of each row. The storage process consists of the following:
  • dividing the matrix into row blocks;
  • storing each block independently;
  • sorting blocks by increasing row length;
  • removing blocks with zero-length rows.
As illustrated in Figure 3, the matrix A is first divided into five row blocks. The position and length of each row are recorded in row_pos and row_len, respectively. The row blocks are then sorted by ascending row length to minimize padding overhead. Fully empty rows are discarded during this process. This layout enhances memory access regularity while eliminating the performance penalties of padding, making it suitable for efficient vectorized SpMV computation.
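To make the reordering step concrete, the sketch below builds the row_pos/row_len metadata from a CSR-style row_ptr array. It is a simplification of the scheme above: it treats each row as its own block (sorting individual rows by length and dropping empty ones) and omits the copying of data and index into the new order; all names and types are illustrative.

#include <stdint.h>
#include <stdlib.h>

/* One row record: original row index and its non-zero count. */
typedef struct { int32_t pos; int32_t len; } row_rec_t;

static int by_len(const void *a, const void *b)
{
    return ((const row_rec_t *)a)->len - ((const row_rec_t *)b)->len;
}

/* Build (row, length) records, drop empty rows, and sort the
 * survivors by ascending length. recs[k].pos/.len then feed the
 * row_pos[] and row_len[] arrays of the SRB-ELL format. */
size_t build_row_order(int32_t n_rows, const int32_t *row_ptr,
                       row_rec_t *recs /* capacity n_rows */)
{
    size_t kept = 0;
    for (int32_t i = 0; i < n_rows; i++) {
        int32_t len = row_ptr[i + 1] - row_ptr[i];
        if (len > 0)                      /* discard fully empty rows */
            recs[kept++] = (row_rec_t){ .pos = i, .len = len };
    }
    qsort(recs, kept, sizeof recs[0], by_len);  /* ascending row length */
    return kept;
}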
To maximize the performance of SpMV based on SRB-ELL, we implement a vectorized SpMV algorithm using SVE. Algorithm 3 outlines the computation process. First, row_pos and row_len are initialized to the current row index and its non-zero count; svcntd() returns the hardware vector length (in 64-bit lanes); and svsum is zeroed to accumulate partial dot products. Then, for each chunk of non-zeros, count determines the number of active lanes in the predicate register pg; vector instructions (svld1_f64, svld1sw_u64, and svld1_gather_u64index_f64) load the non-zero values, the column indices, and the corresponding x elements; and a vector multiply–accumulate updates svsum. Once all non-zero values in the row have been processed, the reduction instruction svaddv collapses svsum to produce the y entry for row_pos.
Algorithm 3 SVE Vectorized SRB-ELL SpMV Implementation.
 1: for i = 0 to N − 1 do
 2:     row_pos ← row_pos[i], row_len ← row_len[i]
 3:     vec_size ← svcntd(), svsum ← svdup_f64(0.0)
 4:     for j = 0 to row_len step vec_size do
 5:         count ← min(row_len − j, vec_size)
 6:         pg ← svwhilelt_b64(0, count)
 7:         svval ← svld1_f64(pg, data[i][j])
 8:         idx ← svld1sw_u64(pg, index[i][j])
 9:         svxv ← svld1_gather_u64index_f64(pg, x, idx)
10:         svsum ← svmla_f64_m(pg, svsum, svval, svxv)
11:     end for
12:     y[row_pos] ← svaddv(svptrue_b64(), svsum)
13: end for
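Assuming the non-zeros of the kept rows are stored back to back in data and index, so that each row’s start offset can be accumulated from row_len (a detail the pseudocode leaves implicit), Algorithm 3 maps onto the ACLE intrinsics roughly as follows. This is a sketch, not the authors’ exact kernel.

#include <arm_sve.h>
#include <stdint.h>

/* SVE sketch of Algorithm 3. `off` tracks the running start offset
 * of row i within the flattened data/index arrays. */
void spmv_srb_ell(int64_t n_rows, const int32_t *row_pos,
                  const int32_t *row_len, const double *data,
                  const int32_t *index, const double *x, double *y)
{
    int64_t off = 0;
    for (int64_t i = 0; i < n_rows; i++) {
        int64_t len = row_len[i];
        svfloat64_t svsum = svdup_f64(0.0);
        for (int64_t j = 0; j < len; j += svcntd()) {
            svbool_t pg = svwhilelt_b64(j, len);            /* tail mask  */
            svfloat64_t val = svld1_f64(pg, &data[off + j]);
            svuint64_t idx = svld1sw_u64(pg, &index[off + j]); /* widen i32 */
            svfloat64_t xv = svld1_gather_u64index_f64(pg, x, idx);
            svsum = svmla_f64_m(pg, svsum, val, xv);        /* fused MAC  */
        }
        y[row_pos[i]] = svaddv_f64(svptrue_b64(), svsum);   /* lane reduce */
        off += len;
    }
}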

3.2. Scratchpad Memory Implementation in gem5

In SpMV operations, the irregular distribution of non-zero elements in sparse matrices often leads to non-contiguous and unpredictable memory accesses to the dense input vector x. This irregularity severely limits data reuse and results in frequent cache misses, making SpMV a memory-bound computation. SPM is a high-speed on-chip storage structure with simpler allocation and control mechanisms than general-purpose caches. To alleviate the inefficiency of caches under irregular access patterns in SpMV operations, we integrate an SPM into the processor. In our design, the SPM is exclusively allocated for storing the dense input vector x, which is repeatedly accessed during computation. This explicit management of x within the SPM reduces memory hierarchy traffic and access latency. The sparse matrix—including its non-zero values and associated index metadata—remains in off-chip memory and is streamed during execution. This design choice simplifies the memory allocation strategy.
To evaluate the effectiveness of our proposed architecture and storage format, we rely on the gem5 simulator. It is a widely used, modular, and extensible simulator for computer-system architecture research. It supports detailed microarchitectural modeling and provides a flexible environment for integrating and analyzing hardware modifications. Compared to real hardware prototyping, gem5 offers faster iteration, lower cost, and fine-grained observability of internal behaviors. These features make gem5 an ideal platform for implementing and evaluating customized memory hierarchies.
In gem5, processor cores, caches, and main memory communicate through a port-based architecture that mimics real-world bus connections. Each module consists of a C++ class that defines behavior and a Python configuration script for system instantiation. To maintain modularity while expanding the memory hierarchy, we introduce a custom interconnect component, Xbar, which abstracts communication between components at different levels of the memory system.
To integrate an SPM module into the memory system, we design a new interconnect class, SPM_Xbar, derived from our custom Xbar. This component adds additional slave ports for connecting the CPU and SPM, while preserving existing connections to the instruction and data caches. On the CPU side, a new master port is added to interface with SPM_Xbar without interfering with the original cache and memory pathways.
The SPM is implemented as an extension of the base memory class in gem5. We augment it with configurable parameters such as latency, bandwidth, and capacity. Figure 4 illustrates the architectural integration of the SPM into the processor’s memory hierarchy. The CPU is connected to both the traditional cache subsystem and the newly introduced SPM via a custom interconnect module, SPM_Xbar. This interconnect module enables parallel access paths: one for regular memory operations through caches and DRAM, and another dedicated path for accessing the SPM.
We implement the SPM as a configurable memory block with dedicated read/write ports; it bypasses the cache hierarchy entirely and is mapped to a fixed physical address range. During SpMV execution, the dense input vector x is stored in the SPM, providing low-latency access and avoiding cache competition with matrix metadata. Meanwhile, the sparse matrix data is streamed from off-chip memory through the cache subsystem.
To enable software-level control of the SPM, we implement a user-space interface: spm_mem_alloc(uint64_t mem_size, uint64_t mem_address), where mem_size specifies the size of the memory region to be allocated and mem_address defines the starting physical address of the SPM where the allocation should begin. The function internally invokes the mmap system call to map the specified physical memory region into the process’s virtual address space. It returns a generic pointer to the mapped region, which enables user-space programs to explicitly access and manage data within the SPM. Through this interface, fine-grained software-level control over the SPM is achieved, facilitating customized memory management strategies for performance-critical workloads.
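A plausible implementation of this interface is sketched below. The paper does not specify the kernel mechanism behind the mmap call; mapping the SPM’s physical range through /dev/mem is one conventional route and stands in here as an assumption, as does the SPM_BASE_ADDR constant in the usage comment.

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical sketch of the user-space SPM allocator described
 * above: map a fixed physical address range (the SPM region) into
 * the calling process via /dev/mem. */
void *spm_mem_alloc(uint64_t mem_size, uint64_t mem_address)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, (off_t)mem_address);
    close(fd);                      /* the mapping survives the close */
    return (p == MAP_FAILED) ? NULL : p;
}

/* Usage: place the dense vector x in the SPM region, e.g.
 *   double *x = spm_mem_alloc(n * sizeof(double), SPM_BASE_ADDR); */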
In our prototype, the SPM is configured with 128 KB of capacity. The overall memory space is partitioned into 4 GB for main memory and 2 GB for I/O space, with the SPM address region isolated for conflict-free access.
By executing index-dependent operations on the input vector entirely within the SPM, the design minimizes memory hierarchy traffic and enhances effective bandwidth for streaming the sparse matrix. This hardware–software co-optimization enables higher throughput and improves the efficiency of vectorized SpMV kernels.

4. Results

This section presents the experimental evaluation of the proposed approach. We first describe the simulation environment and dataset configurations. Then, we compare the storage overhead of SRB-ELL with the CSR and ELL formats, followed by a performance analysis of vectorized SRB-ELL on an SPM-based processor architecture.

4.1. Full-System Simulation Infrastructure

We conduct full-system simulations using gem5 with a single-core ARMv8 in-order CPU model. The simulated processor integrates a software-managed scratchpad memory and its user-level interface. The system runs Ubuntu 16.04 with Linux kernel 4.18 and employs a stride prefetcher to improve cache performance. Key simulation parameters are summarized in Table 1.

4.2. Real-World Data Sets

We use 2562 sparse matrices from the publicly available SuiteSparse Matrix Collection [42,43], covering a wide range of realistic application domains, including graph analytics, geometry processing, computational fluid dynamics, optimization, chemical simulation, circuit modeling, networks, and machine learning. The matrix characteristics are summarized in Table 2, including the number of non-zero elements (NNZ), average non-zeros per row (NNZ_Per_Row), and sparsity, defined as NNZ/(M × N). The maximum row length is limited to 16,384 (2^14) due to the 128 KB capacity of the SPM, which can hold up to 16,384 double-precision values.

4.3. Experimental Results

4.3.1. Storage Overhead

CSR stores the non-zero elements of a sparse matrix and their row and column information using three one-dimensional arrays: data, row_ptr, and col_idx, with sizes NNZ, N + 1, and NNZ, respectively, where NNZ denotes the number of non-zero elements and N denotes the number of rows. Taking double-precision floating-point values as an example, the storage cost of a CSR matrix is given by Equation (1), which indicates that the cost depends on both the number of non-zero elements and the number of rows.

Cost_CSR = 8 × NNZ + 4 × NNZ + 4 × (N + 1)    (1)
ELL employs two two-dimensional arrays to store non-zero values and column indices, with the first dimension representing rows. Both arrays have size N × k , where k denotes the maximum row length of the sparse matrix. The storage cost for ELL is given by Equation (2), showing that the cost depends on the matrix size and is closely related to the data distribution. Since sparse matrices often have irregular distributions, the ELL format is typically suitable for matrices with relatively regular data patterns.
Cost_ELL = 8 × N × k + 4 × N × k    (2)
SRB-ELL is similar to ELL but stores only non-zero values and their column indices, while additionally using two one-dimensional arrays, row_pos and row_len, to record row positions and lengths. The storage cost of SRB-ELL is given by Equation (3); like CSR, it depends mainly on the number of non-zeros and rows. Compared to CSR, SRB-ELL introduces only the extra storage for the row position and length arrays.

Cost_SRB-ELL = 8 × NNZ + 4 × NNZ + 4 × N + 4 × N    (3)
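As an illustrative example (numbers chosen for exposition, not drawn from our dataset), consider a matrix with N = 10,000 rows, NNZ = 100,000, and maximum row length k = 50. Equations (1)–(3) give Cost_CSR = 800,000 + 400,000 + 40,004 ≈ 1.24 MB, Cost_ELL = (8 + 4) × 10,000 × 50 = 6 MB, and Cost_SRB-ELL = 800,000 + 400,000 + 80,000 = 1.28 MB: SRB-ELL pays a small fixed overhead relative to CSR, while ELL’s cost is inflated nearly fivefold by padding.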
Figure 5 shows the storage costs of different formats across our datasets, where the differently colored points indicate the actual storage overhead of each matrix under the corresponding format. CSR and SRB-ELL exhibit similar storage trends, as both store only non-zero elements. Compared to CSR, SRB-ELL introduces an additional overhead of approximately 4 N bytes to record row positions and lengths, but it avoids dependence on data distribution, making it more broadly applicable and general-purpose.
In contrast, ELL’s storage cost is highly sensitive to the distribution of non-zero elements. Although it compresses the matrix, it requires zero padding to maintain uniform row lengths. When rows have similar lengths, ELL performs comparably to CSR; however, for irregular matrices, its storage cost grows significantly because every row is padded to the maximum row length.

4.3.2. SpMV Performance Analysis

To evaluate the performance of the proposed SRB-ELL format, we conduct full-system simulations in gem5, comparing it with traditional ELL and CSR formats under both memory and SPM environments. The system configuration is shown in Table 1.
Figure 6 illustrates the speedup of SRB-ELL over the ELL format. Across all tested matrices, SRB-ELL achieves a maximum speedup of 914.1× under standard memory, primarily due to its avoidance of excessive zero padding and redundant computations. ELL’s strict layout requires padding to the maximum row length, leading to unnecessary operations and increased memory access. In contrast, SRB-ELL stores only non-zero elements and reorganizes rows into sorted blocks, significantly reducing both computation and memory traffic.
When SPM is used to store the frequently accessed vector x, the performance of SRB-ELL improves further. As shown in Figure 6a,b, the maximum speedup reaches 927.4×, and the average speedup improves from 9.02× (in memory) to 9.05× (with SPM). This demonstrates the advantage of combining optimized data layout with conflict-free local memory.
We also compare SRB-ELL with the CSR format, which is efficient in storage but difficult to vectorize due to indirect memory accesses. As shown in Figure 7, SRB-ELL achieves an average speedup of 1.19× and a maximum of 1.48× compared to CSR under memory. The improved performance stems from SRB-ELL’s regular memory access pattern, which better exploits vector units and improves cache utilization.
In some cases, the performance gap is narrow or slightly negative (minimum speedup 0.95×), particularly for small and uniformly structured matrices. Here, the benefits of vectorization are offset by SRB-ELL’s additional metadata overhead (e.g., row_pos, row_len arrays).
Under SPM conditions (Figure 7a,b), the performance advantage of SRB-ELL remains, with an average speedup of 1.05×. Although SPM mitigates memory contention for both formats, SRB-ELL continues to benefit from its vector-friendly layout and reduced cache miss rate.
Furthermore, the cache miss statistics in Figure 8 show that, on average, ELL incurs 31.76× (memory) and 42.73× (SPM) more cache misses than SRB-ELL. This confirms that SRB-ELL’s compact and regular layout significantly improves memory locality and reduces contention, especially in memory-bound computations. Figure 8c,d show that CSR experiences 3.77× (memory) and 8.26× (SPM) more cache misses than SRB-ELL.

5. Conclusions

SpMV suffers from irregular memory accesses that cause severe cache conflicts and limit vectorization efficiency on modern architectures. To address this, we propose SRB-ELL, a new sparse matrix format derived from ELL, specifically designed for efficient vectorization. SRB-ELL organizes non-zero elements into row blocks sorted by block size, reducing load imbalance and eliminating unnecessary zero padding, which improves SIMD utilization. To further mitigate the cache conflicts caused by irregular access to the input vector x, we leverage scratchpad memory (SPM)—a low-latency, software-managed memory that offers precise control over data placement and avoids cache contention. In our design, vector x is allocated in SPM to reduce access conflicts during computation. We integrate SPM into a custom processor architecture in the gem5 simulator and evaluate the SVE-vectorized SpMV algorithm using 2562 sparse matrices from real-world applications. Experimental results demonstrate that our approach significantly improves performance over conventional formats, achieving up to 1.48× speedup and an average of 1.19× compared to a vectorized CSR SpMV.

Author Contributions

Writing—original draft preparation, S.Z.; writing—review and editing, W.B. and Z.Z.; Supervision, X.T. and X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We appreciate Xing Su for his kind suggestions in the paper writing.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SpMV      Sparse Matrix–Vector Multiplication
HPC       High-Performance Computing
AI        Artificial Intelligence
DLP       Data-Level Parallelism
SIMD      Single Instruction Multiple Data
SPM       Scratchpad Memory
CPU       Central Processing Unit
HPCG      High-Performance Conjugate Gradient
SVM       Support Vector Machine
CSR       Compressed Sparse Row
ELL       ELLPACK
SVE       Scalable Vector Extension
SRB-ELL   Sorted-Row-Block ELLPACK
MSHR      Miss Status Holding Register
SSE       Streaming SIMD Extensions

References

  1. Asanović, K. Vector Microprocessors. Ph.D. Dissertation, University of California, Berkeley, CA, USA, 1998.
  2. Espasa, R.; Valero, M.; Smith, J.E. Vector Architectures: Past, Present and Future. In Proceedings of the 12th International Conference on Supercomputing (ICS), Melbourne, Australia, 13–17 July 1998; pp. 425–432. Available online: https://dl.acm.org/doi/10.1145/277830.277935 (accessed on 31 August 2025).
  3. AMD Corporation. AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit and 256-Bit Media Instructions; AMD: Sunnyvale, CA, USA, 2019.
  4. Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A: Instruction Set Reference; Intel: Santa Clara, CA, USA, 2015.
  5. Hennessy, J.L.; Patterson, D.A. Computer Architecture: A Quantitative Approach, 6th ed.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2017.
  6. Satish, N.; Kim, C.; Chhugani, J.; Saito, H.; Krishnaiyer, R.; Smelyanskiy, M.; Girkar, M.; Dubey, P. Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications? In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 9–13 June 2012; pp. 440–451. Available online: http://dl.acm.org/citation.cfm?id=2337159.2337210 (accessed on 31 August 2025).
  7. Bramas, B.; Kus, P. Computing the Sparse Matrix Vector Product Using Block-Based Kernels without Zero Padding on Processors with AVX-512 Instructions. arXiv 2018, arXiv:1801.01134. Available online: http://arxiv.org/abs/1801.01134 (accessed on 31 August 2025).
  8. D’Azevedo, E.F.; Fahey, M.R.; Mills, R.T. Vectorized Sparse Matrix Multiply for Compressed Row Storage Format. In Computational Science–ICCS 2005; Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J.J., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 99–106.
  9. Gilbert, J.R.; Reinhardt, S.; Shah, V.B. High-Performance Graph Algorithms from Parallel Sparse Matrices. In International Workshop on Applied Parallel Computing; Springer: Berlin/Heidelberg, Germany, 2006; pp. 260–269.
  10. Greathouse, J.L.; Daga, M. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. In Proceedings of SC’14: The International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA, 16–21 November 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 769–780.
  11. Kepner, J.; Alford, S.; Gadepally, V.; Jones, M.; Milechin, L.; Robinett, R.; Samsi, S. Sparse Deep Neural Network Graph Challenge. In Proceedings of the 2019 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 24–26 September 2019; pp. 1–7.
  12. Parashar, A.; Rhu, M.; Mukkara, A.; Puglielli, A.; Venkatesan, R.; Khailany, B.; Emer, J.; Keckler, S.; Dally, W. SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada, 24–28 June 2017; pp. 27–40.
  13. Yuan, F.; Yang, X.; Li, S.; Dong, D.; Huang, C.; Wang, Z. Optimizing Multi-Grid Preconditioned Conjugate Gradient Method on Multi-Cores. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 768–779.
  14. Fu, X.; Su, X.; Dong, D.Z.; Qian, C.D. Dense Linear Solver on Many-Core CPUs: Characterization and Optimization. Comput. Eng. Sci. 2024, 46, 984.
  15. Dongarra, J.J.; Luszczek, P.; Petitet, A. The LINPACK Benchmark: Past, Present and Future. Concurr. Comput. Pract. Exp. 2003, 15, 803–820.
  16. Dongarra, J.; Heroux, M.A. Toward a New Metric for Ranking High Performance Computing Systems; Sandia Report SAND2013-4744; Sandia National Laboratories: Albuquerque, NM, USA, 2013; pp. 150–168.
  17. Evgeniou, T.; Pontil, M. Support Vector Machines: Theory and Applications. In Advanced Course on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 1999; pp. 249–257.
  18. Kepner, J.; Aaltonen, P.; Bader, D.; Buluç, A.; Franchetti, F.; Gilbert, J.; Hutchison, D.; Kumar, M.; Lumsdaine, A.; Meyerhenke, H.; et al. Mathematical Foundations of the GraphBLAS. In Proceedings of the 2016 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 13–15 September 2016; pp. 1–9.
  19. Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. Available online: http://snap.stanford.edu/data (accessed on 31 August 2025).
  20. Liu, W.; Vinter, B. CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS), Newport Beach, CA, USA, 8–11 June 2015; pp. 339–350.
  21. Kourtis, K.; Karakasis, V.; Goumas, G.; Koziris, N. CSX: An Extended Compression Format for SpMV on Shared Memory Systems. ACM SIGPLAN Not. 2011, 46, 247–256.
  22. Yan, S.; Li, C.; Zhang, Y.; Zhou, H. yaSpMV: Yet Another SpMV Framework on GPUs. ACM SIGPLAN Not. 2014, 49, 107–118.
  23. Cao, W.; Yao, L.; Li, Z.; Wang, Y.; Wang, Z. Implementing Sparse Matrix-Vector Multiplication Using CUDA Based on a Hybrid Sparse Matrix Format. In Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM), Taiyuan, China, 22–24 October 2010; Volume 11, pp. V11-161–V11-165.
  24. Buluç, A.; Fineman, J.T.; Frigo, M.; Gilbert, J.R.; Leiserson, C.E. Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), Calgary, AB, Canada, 11–13 August 2009; pp. 233–244.
  25. Eigen. The Matrix Class, Dense Matrix and Array Manipulation. Available online: https://www.eigen.tuxfamily.org/dox/group__TutorialMatrixClass.html (accessed on 31 August 2025).
  26. Kjolstad, F.; Chou, S.; Lugato, D.; Kamil, S.; Amarasinghe, S. TACO: A Tool to Generate Tensor Algebra Kernels. In Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), Urbana, IL, USA, 30 October–3 November 2017; pp. 943–948.
  27. Pavon, J.; Valdivieso, I.V.; Barredo, A.; Ramasubramanian, N.; Hernandez, R.; Badia, R.M.; Ayguadé, E.; Labarta, J. VIA: A Smart Scratchpad for Vector Units with Application to Sparse Matrix Computations. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Republic of Korea, 27 February–3 March 2021; pp. 921–934.
  28. Lowe-Power, J.; Ahmad, A.M.; Akram, A.; Alian, M.; Amslinger, R.; Andreozzi, M.; Arvanitis, A.; Bates, S.; Bewick, G.; Black, G.; et al. The gem5 Simulator: Version 20.0+. arXiv 2020, arXiv:2007.03152. Available online: https://arxiv.org/abs/2007.03152 (accessed on 31 August 2025).
  29. Arm. Arm Compiler for Linux: Scalable Vector Extension (SVE) Intrinsics Reference Guide. Available online: https://developer.arm.com/documentation/dht0002/ (accessed on 31 August 2025).
  30. Arm. Arm Architecture Reference Manual: Armv8, for Armv8-A Architecture Profile. Available online: https://developer.arm.com/documentation/102476/latest/ (accessed on 31 August 2025).
  31. Stephens, N.; Biles, S.; Boettcher, M.; Eapen, J.; Eyole, M.; Gabrielli, G.; Horsnell, M.; Magklis, G.; Martinez, A.; Premillieu, N.; et al. The ARM Scalable Vector Extension. IEEE Micro 2017, 37, 26–39.
  32. Arm. Arm Architecture Reference Manual Supplement, the Scalable Vector Extension (SVE), for Armv8-A; Version Beta. Available online: https://developer.arm.com/documentation/ddi0584/ag/ (accessed on 31 August 2025).
  33. Tang, W.T.; Zhao, R.; Lu, M.; Liang, Y.; Huynh, P.H.; Li, X.; Goh, R.S.M. Optimizing and Autotuning Sparse Matrix-Vector Multiplication on Intel Xeon Phi. In Proceedings of the 13th IEEE/ACM International Symposium on Code Generation and Optimization (CGO), San Francisco, CA, USA, 7–11 February 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 136–145.
  34. Liu, X.; Smelyanskiy, M.; Chow, E.; Dubey, P. Efficient Sparse Matrix-Vector Multiplication on x86-based Manycore Processors. In Proceedings of the 27th ACM International Conference on Supercomputing (ICS), Eugene, OR, USA, 10–14 June 2013; ACM: New York, NY, USA, 2013.
  35. Kreutzer, M.; Hager, G.; Wellein, G.; Fehske, H.; Bishop, A.R. A Unified Sparse Matrix Data Format for Modern Processors with Wide SIMD Units. arXiv 2013, arXiv:1307.6209. Available online: http://arxiv.org/abs/1307.6209 (accessed on 31 August 2025).
  36. Anzt, H.; Tomov, S.; Dongarra, J. Implementing a Sparse Matrix Vector Product for the SELL-C/SELL-C-σ Formats on NVIDIA GPUs; Technical Report ut-eecs-14-727; University of Tennessee: Knoxville, TN, USA, 2014.
  37. Chen, L.; Jiang, P.; Agrawal, G. Exploiting Recent SIMD Architectural Advances for Irregular Applications. In Proceedings of the 14th International Symposium on Code Generation and Optimization (CGO), Barcelona, Spain, 12–18 March 2016; ACM: New York, NY, USA, 2016; pp. 47–58.
  38. Xie, B.; Zhan, J.; Liu, X.; Gao, W.; Jia, Z.; He, X.; Zhang, L. CVR: Efficient Vectorization of SpMV on x86 Processors. In Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO), Vienna, Austria, 24–28 February 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 149–162.
  39. Pinar, A.; Heath, M.T. Improving Performance of Sparse Matrix-Vector Multiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, Portland, OR, USA, 14–19 November 1999; p. 30.
  40. Vuduc, R.W.; Moon, H.J. Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure. In Proceedings of the International Conference on High Performance Computing and Communications; Springer: Berlin/Heidelberg, Germany, 2005; pp. 807–816.
  41. Cuthill, E.; McKee, J. Reducing the Bandwidth of Sparse Symmetric Matrices. In Proceedings of the 1969 24th National Conference, New York, NY, USA, 26–28 August 1969; ACM: New York, NY, USA, 1969; pp. 157–172.
  42. SuiteSparse Matrix Collection. Available online: https://sparse.tamu.edu/ (accessed on 31 August 2025).
  43. Davis, T.A.; Hu, Y. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 2011, 38, 1–25.
Figure 1. Overview of the sparse matrix storage. (a) Sparse Matrix A. (b) The CSR format. (c) The ELL format.
Figure 2. Illustration of a scalar operation and a vectorial operation with predicate.
Figure 3. Conversion process to SRB-ELL format. (a) Sparse Matrix A. (b) The SRB-ELL format.
Figure 4. Architecture of scratchpad memory integration in gem5.
Figure 5. Storage overhead comparison of CSR, ELL, and SRB-ELL formats.
Figure 6. Performance comparison between ELL and SRB-ELL. (a) Speedup of SRB-ELL over ELL-based SpMV under memory architecture. (b) Speedup of SRB-ELL over ELL-based SpMV with SPM.
Figure 7. Performance comparison between CSR and SRB-ELL. (a) Speedup of SRB-ELL over CSR-based SpMV under memory architecture. (b) Speedup of SRB-ELL over CSR-based SpMV with SPM.
Figure 8. L1 cache miss comparisons of three matrix formats. (a) Cache miss comparison between ELL and SRB-ELL under memory-based storage. (b) Cache miss comparison between ELL and SRB-ELL under SPM. (c) Cache miss comparison between CSR and SRB-ELL under memory-based storage. (d) Cache miss comparison between CSR and SRB-ELL under SPM.
Table 1. Simulated processor system configuration.

Component              Configuration
CPU                    1 GHz; ARMv8; in-order core
L1 Data Cache          32 KB; 2-way assoc; 2 cycles; 64 B line; 4 MSHRs
L1 Instruction Cache   32 KB; 2-way assoc; 2 cycles; 64 B line; 4 MSHRs
L2 Cache               1 MB; 8-way assoc; 20 cycles; 64 B line; 20 MSHRs
DRAM                   4 GB DDR4
Scratchpad Memory      128 KB; 2 cycles; 512 GB/s
Table 2. Structural characteristics of the sparse matrix datasets.

Matrix Feature     Max          Min       Avg
Num_Of_Rows        16,384       18        5320
Sparsity (%)       100.00       0.00017   1.17
NNZ (Non-Zeros)    18,068,388   24        58,990
NNZ_Per_Row        14,231       0         11
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


