Article

Performance Analysis of RCU-Style Non-Blocking Synchronization Mechanisms on a Manycore-Based Operating System

Department of AI Convergence Engineering, Gyeongsang National University, Jinju-si 52828, Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(7), 3458; https://doi.org/10.3390/app12073458
Submission received: 3 February 2022 / Revised: 19 March 2022 / Accepted: 24 March 2022 / Published: 29 March 2022

Abstract
Multi-core machines with tens or hundreds of cores have recently become widespread, placing an increasing emphasis on software structure. Many synchronization techniques have been developed to improve the performance and scalability of manycore systems. As non-blocking algorithms are promising for overcoming the performance limits of traditional lock-based blocking synchronization mechanisms, both the usage and the number of non-blocking synchronization algorithms are increasing. For example, the usage of RCU has grown sharply in recent years. Since RCU exhibits low write performance and is difficult to use, the research community introduced the RLU and MV-RLU synchronization algorithms to address these issues. RLU and MV-RLU, which we call RCU-style synchronization mechanisms, are promising in terms of providing easy-to-use APIs (Application Programming Interfaces) and better performance on manycore machines. To expand the applicability of RCU-style mechanisms, their performance needs to be measured and analyzed in various environments. To meet this goal, we evaluate them at the user and kernel level on sv6 variant, a research operating system for manycore systems. To enable RCU-style synchronization algorithms on sv6 variant, we implemented and modified some of its libraries and memory allocators. We use micro-benchmarks that exercise a linked list and a hash table to measure performance while varying the benchmark parameters and the types of data structures. In most of the experiments, we observed that MV-RLU is scalable; with 70 running threads, MV-RLU exhibits about thirteen times higher throughput than RCU. In addition, we compare the operation procedures and APIs of each RCU-style synchronization algorithm to analyze their pros and cons.

1. Introduction

Microprocessor technology has shifted toward larger numbers of cores and sockets after increasing transistor density reached its limit as a way to improve performance [1,2]. Manycore systems have tens or hundreds of CPU cores to improve processing power [3,4,5]. However, in manycore systems, contention for shared data between the cores becomes more intense, so a large number of CPU cores cannot be fully utilized, which degrades overall system performance [6]. As the number of cores increases, the synchronization cost, including the cost of cache coherence, increases exponentially, which prevents the system from reaching the expected performance improvement [6]. The main cause of such performance degradation is the lack of a proper synchronization mechanism [7,8]. System performance is greatly affected by the synchronization mechanism. Synchronization mechanisms are essential in designing operating systems, database systems, network stacks, storage systems, etc., and have a significant impact on performance. Synchronization approaches for linearly scalable performance on modern manycore machines have been extensively investigated, although they remain a work in progress.
Synchronization mechanisms for scalable performance have been studied in various ways. The traditional way to synchronize shared data accesses between many threads is to exploit a lock mechanism for mutually exclusive access to the data [7,9]. Lock-based synchronization mechanisms block competing threads while a thread holds a lock on a shared resource. spinlock is a representative lock algorithm that is intuitive and simple; however, its performance is very poor under high contention for shared data, and it can cause deadlock, priority inversion, etc. Although many researchers have worked to improve the performance of spinlock by reducing the contention overhead [10,11,12], the issue still remains in recent manycore environments [8,13]. To address the issues of blocking synchronization mechanisms, the research community shifted its focus to non-blocking synchronization mechanisms [14,15,16,17,18,19]. A non-blocking mechanism guarantees that a thread is neither blocked nor forced to fail while accessing shared data [20]. RCU is a non-blocking mechanism that has been in use for some time due to its high performance in manycore environments [21]. Linux adopted RCU in the kernel in 2002, and its usage has rapidly increased from about 3000 uses in 2010 to more than 16,000 in 2020 [21,22,23]. One of the merits of RCU is its scalability: RCU allows multiple threads to read a shared resource while a thread is writing to it. However, RCU has the disadvantages that it is difficult to program when the data structure is complex (e.g., a binary search tree), and its scalability is poor when there are write operations. To overcome these shortcomings of RCU, RLU (Read-log-update) [24] and MV-RLU (Multi-version read-log-update) [25] were introduced.
RLU is a non-blocking synchronization mechanism designed to compensate for the disadvantages of RCU: its programming complexity and write performance degradation. RLU provides database transaction-like programming interfaces, making it much simpler to write applications compared to the interface of RCU. It allows objects to be read concurrently and different objects to be written simultaneously. However, RLU keeps only two versions per object. Thus, when there are multiple write requests to an object, RLU suffers from severe processing delays caused by cleaning up the previous version. Another non-blocking synchronization mechanism, MV-RLU, overcomes this issue by introducing multi-version logging. Unlike RLU, MV-RLU minimizes the request processing delay by keeping a number of different versions of an object. Since MV-RLU keeps multi-version logs of an object, multiple threads can access shared resources at the same time.
RCU, RLU, and MV-RLU are non-blocking synchronization mechanisms, each improving on its predecessor in scalability and ease of use. We call them RCU-style synchronization mechanisms because their design principles and APIs are similar to each other. Although RCU has been utilized and evaluated in Linux and other applications [21,26,27,28,29], RLU and MV-RLU have not been applied or tested rigorously by the community. To expand the applicability of RCU-style synchronization mechanisms, their performance has to be evaluated and analyzed in different environments. As systems with hundreds of cores are being introduced to the market, RCU-style mechanisms have to be rigorously tested in order to exploit them in such systems.
In this work, we evaluate and analyze the performance of RCU-style synchronization mechanisms on sv6 variant [30], a research operating system designed for manycore machines. To attain the purpose of the paper, we first compare the structure, pros, and cons of the RCU-style synchronization mechanisms. Then, we implement RCU, RLU, and MV-RLU on sv6 variant and measure their performance using micro-benchmarks. The results show that the scalability of the mechanisms differs greatly. We find that the performance depends on the number of running threads, the write ratio, and the type of shared data structure. In most of the experiments, MV-RLU exhibits better performance and scalability than RCU and RLU, but it requires care when being ported to and evaluated on some platforms. When running seventy threads, MV-RLU showed about thirteen times higher performance than RCU. We discuss the performance and limitations in Section 5 and Section 6, respectively. The rest of the paper is organized as follows. Section 2 presents related work. Section 3 explains how RCU-style synchronization mechanisms operate, their pros and cons, and describes their APIs. Section 4 describes the implementation of the RCU-style mechanisms on sv6 variant. Section 5 evaluates and analyzes the performance of the mechanisms, and Section 6 discusses the results. Finally, Section 7 concludes the paper.

2. Related Work

In this section, we briefly summarize synchronization mechanisms and efforts to improve scalability on manycore machines. Synchronization mechanisms have been developed in many different ways and fall into two categories: blocking and non-blocking [7,9,20]. The typical and widely used blocking algorithm is spinlock [9]. Blocking synchronization algorithms guarantee mutually exclusive access to shared data by blocking threads competing for the data. Although their behavior is simple and intuitive, because they block threads, they are prone to deadlock and priority inversion [20]. When multiple threads race to access shared data simultaneously, system performance degrades significantly because of cache line bouncing [6]. RW-lock was proposed in an attempt to increase the concurrency level of lock-based algorithms [7,9]. It allows concurrent reads of the shared data; however, it blocks all threads from reading while a thread is writing to the shared data, because it provides mutually exclusive write access. Queued spinlock addresses the limits of spinlock by inserting competing threads into a queue to reduce cache coherency traffic [10,11,31]. Although it effectively reduces the traffic, it has been shown that the queuing approach is not scalable. Recent manycore machines have a NUMA architecture with multiple CPU sockets. A noteworthy characteristic of the NUMA architecture is that the access speeds of remote and local memory differ. There is a plethora of work that considers the characteristics of the NUMA architecture [8,12,13,32,33]. Nevertheless, lock-based synchronization algorithms have not successfully solved the performance issue originating from blocking. To address this issue, non-blocking synchronization algorithms were proposed.
Unlike blocking algorithms, non-blocking synchronization algorithms do not block a thread while it is competing for shared data. Since threads are not blocked while trying to access the data, they are guaranteed to make progress within a finite period of time [9,20]. Non-blocking algorithms are also known as lock-free (lockless) algorithms because they tend not to use locks to control access to the shared data [16,20]. A stronger guarantee, wait-freedom, additionally ensures that every thread completes its operation in a bounded number of steps [20]. There are a number of studies that exploit lock-free synchronization algorithms to increase the concurrency of accesses to basic data structures [14,17,34]. Compared to lock-based algorithms, these algorithms [14,17,34] improved the parallelism of data accesses. However, the performance issue still remains when there is high contention for shared data. When multiple threads access the shared data simultaneously, the system issues compare-and-swap (CAS) instructions frequently, and performance then drops because this leads to cache line bouncing. It has also been reported that there is a performance bottleneck due to the cost of memory recycling [35]. On top of that, it is not an easy task to write a correct lock-free program, because we have to consider the atomicity of operations, instruction reordering due to compiler optimization, memory barriers, etc. [36].
RCU is a type of non-blocking algorithm that was introduced into the Linux kernel to increase performance on manycore systems [21,27,37]. Its usage is increasing in many different environments. RCU increases concurrency by allowing multiple threads to read while a thread is writing to shared data [22,38]. It does so by making a copy of the shared data when a thread writes to it. While the copy is being updated, the rest of the threads are allowed to access the original data, which effectively increases concurrency. An update is reflected in the data structure with a single atomic pointer assignment, even while threads are reading the data. To avoid interfering with reading threads, the write thread has to wait for the readers to complete their requests before releasing the memory. One downside of RCU is that multiple write threads are synchronized with locks, which affects performance. It also introduces a new set of APIs, different from lock-based APIs, that have to be learned before programming an application. Therefore, it is not easy to exploit RCU to synchronize some data structures [22,24]. RLU was introduced to address the limits of RCU: a single writing thread at a time and complex APIs [24]. RLU keeps a write log per thread. Each updated version is recorded in the log with the time of its update. Read threads can access the most up-to-date version of the data within their update time frame; thus, read threads can read the data simultaneously regardless of write operations. However, this comes with a cleanup overhead, including the cost of recycling each thread's write log [24,25]. MV-RLU is an extension of RLU that reduces the waiting time for recycling the write log. RLU keeps at most two versions per object in the log. By contrast, MV-RLU creates multiple versions of the data and also has an efficient recycling policy. These decisions were made to increase the performance of read and write workloads [25]. Versioned programming [39] also takes a similar approach to MV-RLU, keeping multiple versions and allowing each read thread to access the appropriate data. However, its version traversal cost is high due to the complexity of its version management. Moreover, it exposes lock-free APIs, which means that caution is required when writing code with versioned programming [36]. Multi-version concurrency control (MVCC) is a concurrency technique widely used in database systems [40,41,42], which was reintroduced by MV-RLU as a scalable concurrency framework for manycore systems. Park et al. [43] proposed RCX (RCU extension) to improve the write performance of RCU. RCX takes the NUMA architecture into account and exploits Hardware Transactional Memory (HTM) to provide fine-grained locks to synchronize write threads. Fine-grained locks are generally considered difficult to write, and even if they are written properly, they may suffer from unpredictable race conditions.
RCU has been tested on many different operating systems, including Linux, and applications [21,26,27,29,30,37]. On the contrary, only a few studies have evaluated RLU and MV-RLU, in limited environments [24,25]. For RLU and MV-RLU to be widely used in different systems, they have to be rigorously tested in other environments. In this paper, we consider an operating system made for manycore machines for the evaluation and analysis, to reduce the influence of the operating system on the RCU-style algorithms. There are a number of operating systems designed for scalability on manycore machines [30,44,45,46,47,48]. We chose an open-source research operating system, sv6 variant [49], as the platform for the performance evaluation, after due consideration of library support and ease of analysis. The basis of sv6 is a research operating system called xv6 [50], and it adheres to the POSIX standard [30,48]. The sv6 operating system is designed around a manycore-scalable software interface. sv6's file system and virtual memory system exhibited exceptionally good scalability in a performance comparison with Linux on a manycore machine. Thus, we believe it is appropriate to use sv6 variant as the operating system on which to evaluate the RCU-style algorithms.

3. RCU-Style Non-Blocking Synchronization Algorithms

This section describes the algorithms and implementations of the RCU-style non-blocking synchronization schemes. Blocking and non-blocking synchronization algorithms are differentiated by whether they block the execution of a thread when it accesses shared data. Blocking and non-blocking refer to the handling properties of individual operations, so one may use both approaches when synchronizing a specific data structure. As a representative example, RCU is non-blocking for read operations and blocking for writes [22,38]. RLU and MV-RLU synchronize both read and write operations in a non-blocking manner. Table 1 compares the pros and cons of RW-lock, a representative blocking method, and the RCU-style non-blocking synchronization methods. In the following subsections, we introduce the structure, operations, and API usage of RCU, RLU, and MV-RLU through an example of updating a linked list node.

3.1. Read Copy Update (RCU)

RCU [52] is a synchronization scheme that allows multiple read threads and one write thread to access shared data simultaneously. The advantage of RCU is that reading threads can proceed concurrently without blocking while the writing thread is updating shared data. RCU can increase the level of concurrency of read operations because it does not block existing read operations: it creates a copy before updating the data. In this way, RCU overcomes the limitation of RW-lock, where all other threads are blocked by a single writing thread. Once all threads reading the previous data have completed, the previous data need to be freed for memory management. RCU is suitable for synchronizing pointer-based data structures such as linked lists, but updates are applied by a single atomic pointer assignment, so its use is limited for data structures that require multiple pointer assignments. This is because the update must be atomic for a reading thread to unambiguously see the data either before or after the update. Therefore, applying RCU to a data structure such as a doubly linked list or a tree, which requires multiple pointer assignments, suffers from programming complexity and operational limitations [22,24,26]. The core of the RCU synchronization scheme is an algorithm that detects the completion of reads of the previous data. The algorithm for detecting read completion is closely related to the operating system's scheduler, interrupt handler, etc., so it is selected according to performance optimization goals and application requirements [19]. There are several algorithms for reclamation, including quiescent-state-based reclamation (QSBR), epoch-based reclamation (EBR), hazard-pointer-based reclamation (HPR), and lock-free reference counting (LFRC). QSBR and EBR are the most frequently used due to their performance [19,38]. We introduce the algorithms we used and analyze their performance in Section 4 and Section 5.2. Performance degradation owing to sequential execution of write threads and API usage complexity due to the atomic single-pointer-assignment update model are both disadvantages of RCU [24].

3.1.1. Procedure of a Node Update

We begin by introducing the RCU procedure with an example of a node update in a singly linked list. The operating procedure is discussed using the example of Figure 1a.
(1) Thread R1 enters the critical section with negligible synchronization overhead, traverses the linked list, and reads node B (❶ in Figure 1a).
(2) Thread W creates a copy of node B, node B′, and updates its value to change node B (❷). Thread R1 is able to read concurrently without interfering with the node modification operation of thread W.
(3) Thread W updates the next-node pointer of node A to node B′ (❸). From this point on, the threads are split between those reading the original node and those reading the copy.
(4) Thread R2 reads the new node B′ via node A by traversing the nodes (❹).
(5) Thread W defers memory release until R1 has completed reading node B. Once it is confirmed that no other thread is accessing node B, thread W releases node B and completes the update process.

3.1.2. Use of RCU APIs

We introduce the main APIs of RCU and show examples of their use. RCU consists of various functions depending on the implementation, and the main APIs are as follows.
  • rcu_read_lock(), rcu_read_unlock(): APIs that notify the entry and exit of a critical section of reading threads, respectively.
  • rcu_assign_pointer() or rcu_assign_ptr(): Used by write threads to apply changes to a node. Used when changing the next pointer of node A to node B’ in the example of Figure 1a (❸).
  • rcu_dereference() or rcu_deref(): An API used by threads to read shared data in a critical section. It guarantees the order of operations through memory barriers, compiler directives, etc., to ensure that appropriate values of changed shared data can be read.
  • rcu_synchronize(): An API that waits for the currently reading threads to exit their critical sections so that the original of a copied node can be safely released.
  • rcu_call(): A non-blocking counterpart of rcu_synchronize(). It releases nodes in a non-blocking manner by registering the nodes to be freed and a callback function instead of directly calling rcu_synchronize(), for applications that do not allow the write thread to block. Frequent calls of rcu_call() may cause memory management problems due to the accumulation of a large number of nodes to be freed [53,54].
Figure 1b illustrates the relationship between the threads and the API usage required in the process of Figure 1a. Thread R1 enters the critical section through rcu_read_lock() and reads node B using rcu_deref() in the critical section (❶). Thread W acquires the mutually exclusive lock for write operations, creates B′, a copy of node B (❷), and performs the update through a pointer assignment using rcu_assign_pointer() (❸). Then, thread W releases the lock (at the end of the Write Lock region in Figure 1b). While thread W updates node B, R1 still reads node B without interruption. R1 reads node B (denoted as rcu_deref(B) in the figure) because it reads A's next node through rcu_deref() before the rcu_assign_ptr() of thread W (❸). R2 reads node B′ (denoted as rcu_deref(B′) in the figure) because it reads A's next node through rcu_deref() after the rcu_assign_ptr() of thread W (❸). rcu_assign_pointer() and rcu_deref() prevent instruction reordering through memory barriers, compiler directives, etc., so that the appropriate node is read before or after the change. Thread W calls rcu_synchronize() to wait until R1 finishes reading node B, and then it releases the original node (denoted as free in Figure 1b). The waiting period is referred to as the Grace Period of thread W in Figure 1b. A quiescent state is a state in which a thread is not accessing shared data; the corresponding area on thread R1's timeline in Figure 1b represents the period after R1 has finished reading node B. The Grace Period ends when all read threads have passed through the Quiescent State [37,55]. To summarize RCU: read threads access nodes without any lock, and a write thread releases a node once all readers in the critical section have passed the Quiescent State after the node was modified.
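To make the walkthrough concrete, the following minimal sketch assembles the APIs above into the Figure 1 scenario. It is an illustrative user-level sketch, not any particular RCU library's code: the node type, list head, and writer mutex are our own, and rcu_deref()/rcu_assign_ptr() are assumed to be provided as described above.

#include <pthread.h>
#include <stdlib.h>

struct node { int key; int value; struct node *next; };

struct node *head;                        /* shared, RCU-protected list */
pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER; /* serializes writers */

int lookup(int key)                       /* read side: threads R1 and R2 */
{
    int found = 0;
    rcu_read_lock();                      /* enter the critical section */
    for (struct node *n = rcu_deref(head); n != NULL; n = rcu_deref(n->next))
        if (n->key == key) { found = 1; break; }
    rcu_read_unlock();                    /* exit the critical section */
    return found;
}

void update(struct node *a, int new_value)   /* write side: thread W */
{
    pthread_mutex_lock(&write_lock);
    struct node *b = a->next;                /* node B, to be replaced */
    struct node *b_new = malloc(sizeof(*b_new));
    *b_new = *b;                             /* copy B' of node B (❷) */
    b_new->value = new_value;
    rcu_assign_ptr(a->next, b_new);          /* publish B' atomically (❸) */
    pthread_mutex_unlock(&write_lock);
    rcu_synchronize();                       /* wait out the Grace Period */
    free(b);                                 /* safely release the old node B */
}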

3.2. Read Log Update (RLU)

RLU was proposed to overcome the major limitations of RCU: the complexity of using its API and the synchronization overhead between write threads due to the lock [24]. RLU maintains a write log for each thread, creates updates in the log, and commits these updates atomically, releasing them to other threads at once. This makes it possible to provide more straightforward APIs, because multiple pointer update operations are performed atomically at once. Each update is versioned in the write log according to its commit time. The previous versions need to be cleaned up to reclaim each thread's write log, which involves considerable overhead. Similar to RCU, multiple threads can read at the same time without interfering with write threads, since read threads read the latest valid version. For both reading and writing, RLU conducts operations based on a clock. The system maintains a global clock, and each thread reads the moment it entered a critical section (also known as the RLU section) from the global clock and keeps it in the thread's metadata as its local clock.
For data consistency, read operations read the most recently updated version according to the thread's local clock. A copy of a node (a version) to be read may exist in another thread's write log. Once a write thread executes multiple operations for an update and commits them with a write clock, other threads can atomically identify the change. Conducting a commit with a write clock results in all updates being published to other threads at the same time. While RCU allows a single pointer modification, RLU enables multiple changes atomically by conducting a commit operation that specifies a write clock.
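The clock rule can be summarized in a few lines. The following sketch illustrates only the visibility check; it is not the RLU library's internals, and the header layout and names are assumptions.

extern volatile unsigned long global_clock;  /* advanced on each commit */

struct obj_header {
    unsigned long write_clock;   /* commit time of the copy, if one exists */
    void *copy;                  /* copy in some thread's write log, or NULL */
};

/* On entering the RLU section, a thread snapshots the global clock
 * as its local clock. */
void rlu_section_enter(unsigned long *local_clock)
{
    *local_clock = global_clock;
}

/* A thread may read a copy only if it was committed at or before the
 * thread's snapshot; otherwise it keeps reading the original object. */
void *visible_version(struct obj_header *h, void *original,
                      unsigned long local_clock)
{
    if (h->copy != NULL && h->write_clock <= local_clock)
        return h->copy;
    return original;
}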

3.2.1. Procedure of a Node Update

Using Figure 2a, we discuss the operation procedure of a singly linked list synchronized with RLU.
(1) When thread R1 enters the RLU section, it reads the global clock, stores it as its local clock in R1's metadata, and reads the latest version of node B based on the local clock (❶ in Figure 2a).
(2) Thread W creates a new version, B′, in its log and updates the value to change node B (❷).
(3) Thread W assigns global clock + 1 as the update clock of version B′ (denoted as write clock ❸) and commits the update (other threads can now read version B′). At this stage, thread R1 is still reading node B without interruption from thread W's updates.
(4) After that, thread R2 enters the RLU section and reads the latest version B′ (❹).
(5) Thread W, which updated node B, waits until all threads reading node B exit the RLU section. Since the commit of version B′ is complete, threads newly entering the RLU section cannot read node B.
(6) Once it is confirmed that no thread is reading node B, thread W overwrites the value of version B′ onto the original node B (❺) and cleans up version B′.

3.2.2. Use of RLU APIs

Since RLU's API is compatible with RCU's, many applications based on RCU can be converted to RLU. The main APIs of RLU are as follows.
  • rlu_reader_lock(), rlu_reader_unlock(): APIs for entering and exiting an RLU section. rlu_reader_lock() reads the global clock and stores it as the local clock in the thread's metadata. rlu_reader_unlock() performs the commit of an update and log reclamation.
  • rlu_try_lock(): Acquires exclusive rights to modify an object and makes a copy in the log.
  • rlu_assign_pointer() or rlu_assign_ptr(): Used by write threads to modify a pointer of a node. For instance, it is used to change the next pointer of node A to the copied version B′ in Figure 2a.
  • rlu_dereference() or rlu_deref(): An API used by threads to read shared data. It guarantees a consistent view of shared objects according to the thread's local clock.
  • rlu_free(): An API to logically delete an object.
Figure 2b shows the thread relationships and the API usage required in the procedure of Figure 2a.
All threads in RLU enter the RLU section through rlu_reader_lock(). A thread reads a copy of an object (a version) if the object has a version whose write clock is less than or equal to the thread's local clock; otherwise, it reads the original node. In Figure 2b, the first two rlu_deref(B) calls performed by R1 and R2 upon entering the RLU section read the original node B. Thread W acquires the exclusive right to modify node B, creates a new version B′ in its log (❷), and modifies the value. Additionally, global clock + 1 is stored as the update clock of version B′ (denoted as write clock ❸). This procedure is represented by ❸ in Figure 2a and by try_lock(B) and assign_pointer(B) in Figure 2b. Then, thread W calls rlu_synchronize() and waits for the threads reading node B to leave the RLU section. This is denoted as the Grace Period in Figure 2b. When the Grace Period ends, thread W overwrites node B with version B′ (❺ in Figure 2a and Write Back in Figure 2b). While thread W is modifying the node, threads entering the RLU section read the copied node B′, as shown in the second RLU sections of R1 and R2. In this way, by allowing reads of copies (versions) that exist in the logs of other threads, reads and writes can be performed at the same time; however, the waiting time for recycling the per-thread log is pointed out as a performance bottleneck.
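To make the flow concrete, the following minimal sketch updates one node's value under RLU. It assumes the API shape above plus the rlu_abort() retry mentioned in Section 4.1; the thread-handle type and the retry loop are illustrative, and the commit, Grace Period wait, and Write Back all happen inside rlu_reader_unlock().

void update_value(rlu_thread_data_t *self, struct node *a, int new_value)
{
restart:
    rlu_reader_lock(self);              /* snapshot global clock as local clock */
    struct node *b = rlu_deref(self, a->next);
    if (!rlu_try_lock(self, &b)) {      /* copy B into this thread's log (❷);
                                           b is redirected to the copy B' */
        rlu_abort(self);                /* lost the race with another writer */
        goto restart;
    }
    b->value = new_value;               /* modify the log copy B' */
    rlu_reader_unlock(self);            /* commit with write clock (❸); Write
                                           Back follows the Grace Period (❺) */
}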

3.3. Multi-Version Read Log Update (MV-RLU)

RLU's performance bottleneck is the blocking wait for per-thread log recycling. The main reason for the performance degradation is that RLU allows only two versions per object, so waiting for log recycling is required. Multi-Version Read-Log-Update (MV-RLU) [25] avoids this synchronous wait for object reclamation, the limit of RLU, by allowing as many versions as the log can accommodate when write operations are requested on the same object. Similar to RLU, updates are made in a per-thread write log, and each update appears in the form of a new version of the original object. MV-RLU provides a consistent view of a shared object for each thread by selecting the latest valid version using the global clock. MV-RLU provides performance scalability on modern manycore systems through multiple versioning and efficient distributed garbage collection.

3.3.1. Procedure of a Node Update

Figure 3a shows the operation procedure of a singly linked list synchronized with MV-RLU.
(1) A thread stores the local clock in its metadata when it enters the MV-RLU section, in the same way as in RLU.
(2) Thread W1 creates a new version, B′, in its per-thread log to modify node B and modifies the value (❶ in Figure 3a). Similar to RLU, MV-RLU specifies global clock + 1 as the update clock (denoted as write clock in the log of W1 ❷) and commits the update.
(3) After version B′ is created, thread R1 enters the critical section (also known as the MV-RLU section) and reads version B′ (❸).
(4) Upon request, thread W2 also creates version B″ in its log to modify node B and updates the value (❹). Thread W2 stores the update clock in the log (assuming the write clock is 15 (❺)) and commits it.
(5) After that, thread R2 enters the MV-RLU section and reads the latest valid version, B″, based on its local clock (❻).
(6) During the above procedure, log reclamation starts when a thread's log usage exceeds a certain level. MV-RLU waits until no thread is reading any version other than the latest one and then overwrites the latest version onto the original node (❼). Then, areas of the log that are no longer accessed by threads are recycled by MV-RLU's garbage collector.

3.3.2. Use of MV-RLU APIs

MV-RLU's API is almost identical to RLU's, so most applications are compatible without modification. The main APIs of MV-RLU are as follows.
  • mvrlu_reader_lock(), mvrlu_reader_unlock(): APIs used by threads to enter/exit a critical section. mvrlu_reader_lock() begins an MV-RLU section, reads the global clock, and stores it as the local clock in the thread's metadata. mvrlu_reader_unlock() commits an update of a shared object and finishes the MV-RLU section.
  • mvrlu_try_lock(): An API that acquires exclusive rights to modify an object and creates a version in the log.
  • mvrlu_try_lock_const(): An optimized form of mvrlu_try_lock() for objects to be deleted. Objects to be deleted do not need a new version in the log, so MV-RLU only acquires exclusive rights to modify the object.
  • mvrlu_assign_pointer() or mvrlu_assign_ptr(): An API by which a write thread performs pointer mutation operations on shared objects.
  • mvrlu_dereference() or mvrlu_deref(): An API that searches for the latest valid version based on each thread's local clock (MV-RLU searches for the version with the latest write clock not exceeding the local clock while traversing the version chain of Figure 3a).
  • mvrlu_free(): An API that performs the logical deletion of objects.
Figure 3b shows the relationship between the threads and the API usage required in the process of Figure 3a. In order to access a data structure synchronized with MV-RLU, each thread needs to enter the MV-RLU section, and the thread reads the latest valid version based on the time it entered the MV-RLU section (its local clock). Thread W1 calls mvrlu_reader_lock() to begin the MV-RLU section and access the shared data. Then, to modify node B, the thread calls try_lock(B) to get exclusive modification rights. Thread W1 creates version B′ in its log, adds it to the front of node B's version chain, and releases node B's mutation right. Then, the thread modifies the pointer of the target node using mvrlu_assign_pointer() and stores the update clock of version B′ as global clock + 1 (stored in the write clock ❷). Thread R1 enters the MV-RLU section after thread W1's modification, so it reads B′, the latest valid version of node B, through mvrlu_deref(B). While thread R1 reads B′, thread W2 creates a new version B″ to update node B once more using try_lock(B) and mvrlu_assign_pointer(). Then, the thread records the write clock of version B″ (❺) and completes the modification. After the update of thread W2 is completed, thread R2 enters the MV-RLU section and reads the copied node (version) B″, as B″ is the most recent valid version that R2 can read. Similar to RCU and RLU, each thread safely releases versions that have passed the Grace Period in order to recycle the log. Log reclamation in MV-RLU takes a certain amount of time regardless of the number of running threads, because all threads reclaim their logs in parallel when log usage exceeds a certain threshold.
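The sketch below combines the MV-RLU APIs above into a node deletion, the operation our benchmarks exercise, using the hand-over-hand policy described in Section 4.1. It is a sketch under stated assumptions: the list has a sentinel head node so the predecessor is never NULL, the node type is as in the earlier sketches, and the thread-handle type and mvrlu_abort() (mirroring RLU's rlu_abort()) are illustrative.

int delete_key(mvrlu_thread_t *self, struct node *sentinel, int key)
{
restart:
    mvrlu_reader_lock(self);                 /* enter the MV-RLU section */
    struct node *prev = sentinel;
    struct node *cur  = mvrlu_deref(self, prev->next);
    while (cur != NULL && cur->key != key) { /* hand-over-hand traversal */
        prev = cur;
        cur  = mvrlu_deref(self, cur->next);
    }
    if (cur == NULL) { mvrlu_reader_unlock(self); return 0; }
    if (!mvrlu_try_lock(self, &prev) ||      /* new version of the predecessor */
        !mvrlu_try_lock_const(self, cur)) {  /* deleted node needs no new
                                                version, only the lock */
        mvrlu_abort(self);                   /* conflict with another writer */
        goto restart;
    }
    mvrlu_assign_ptr(self, &prev->next, mvrlu_deref(self, cur->next));
    mvrlu_free(self, cur);                   /* logical deletion */
    mvrlu_reader_unlock(self);               /* commit with write clock */
    return 1;
}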

4. Implementation

In this section, we describe the implementation work required to run and evaluate the RCU-style synchronization schemes on sv6 variant. We first show API examples of a node deletion on a singly linked list that adopt RW-lock and the RCU-style synchronizations to protect the shared data structure. Then, we describe the implementation details.

4.1. Example of API Use of RCU-Style Synchronizations

Figure 4 demonstrates API examples of RW-lock and the RCU-style synchronizations applied to the deletion of a singly linked list node. RW-lock and RCU employ the coarse-grained locking approach, while RLU and MV-RLU use hand-over-hand locking, one of the fine-grained locking policies, according to each synchronization method's design [9]. Hand-over-hand locking is a mechanism that allows several threads to access data structures safely [9]. It acquires locks on a target node and its predecessor together, conducts the operations, and then releases the locks.
RW-lock ensures mutual exclusion for the critical section using the lock at lines 6 and 17, which is simple and intuitive. For read operations, RCU uses the intuitive rcu_read_lock() and rcu_read_unlock() APIs to access the shared object. However, for write operations with RCU, the lock acquire and release on lines 6 and 15 are executed in order. The shared object is read with rcu_deref(), and the node is removed through rcu_assign_ptr(), as seen in lines 7 and 8 of the RCU example. At line 15, the critical section ends; then, the thread waits for the Grace Period to expire via rcu_synchronize() at line 16. RLU begins its RLU section on line 7. rlu_deref() reads the shared object, and rlu_try_lock() is used to obtain exclusive access to modify it. To delete a node, exclusive access to both the target node and the previous node must be obtained according to the hand-over-hand locking policy, so that neither is modified by other threads. Upon failure of rlu_try_lock() at lines 16 and 17, rlu_abort() is called and the operation is re-executed. On line 22, the node is removed from the linked list by reassigning the previous node's next pointer. The node is logically removed at line 23. The RLU section ends on line 28. MV-RLU has the same APIs as RLU except for the newly introduced mvrlu_try_lock_const(). While RW-lock provides simple APIs, RCU allows only a single pointer modification within the lock, which means that when multiple pointer changes are required (e.g., removing a node in a binary tree), multiple locks are required, complicating API usage. Multiple pointer changes are permitted in the critical sections of RLU and MV-RLU, which makes writing applications considerably easier and more intuitive than with RCU.
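Since Figure 4 itself is not reproduced here, the following sketch follows the RCU variant as described above: a coarse-grained writer lock, a single atomic pointer assignment to unlink the node, and a Grace Period wait before freeing. The sentinel head and lock name are assumptions, and the node type and headers are as in the sketch of Section 3.1.2.

int rcu_delete(struct node *sentinel, pthread_mutex_t *lock, int key)
{
    pthread_mutex_lock(lock);                  /* coarse-grained writer lock */
    struct node *prev = sentinel;
    struct node *cur  = rcu_deref(prev->next);
    while (cur != NULL && cur->key != key) {
        prev = cur;
        cur  = rcu_deref(cur->next);
    }
    if (cur == NULL) { pthread_mutex_unlock(lock); return 0; }
    rcu_assign_ptr(prev->next, cur->next);     /* unlink with one atomic store */
    pthread_mutex_unlock(lock);                /* end of the critical section */
    rcu_synchronize();                         /* wait out the Grace Period */
    free(cur);                                 /* now safe to release memory */
    return 1;
}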

4.2. RCU-Style Synchronization Libraries

RCU implementation: Due to the close relationship between an RCU implementation and the operating system's CPU scheduling and interrupt handling, the user- and kernel-level implementation approaches differ [22,23,28]. The kernel-level implementation of RCU is optimized in several ways in comparison to the user level by utilizing sv6 variant kernel functions. spinlock in sv6 variant, which is used to synchronize write threads, prevents interrupts in critical sections, hence avoiding overhead caused by context switching. Additionally, the non-blocking rcu_call() is used to wait for read completion (the grace period) before releasing the memory associated with a logically deleted object. By employing the non-blocking rcu_call(), performance improvements in node deletion can be expected compared to the blocking rcu_synchronize(). The EBR method is used to determine whether reads of logically deleted nodes have completed so that their memory can be freed. EBR is an epoch-based memory reclamation policy that divides the entire execution into epochs and collects the nodes whose memory should be released in each epoch [19,56]. After a certain number of epochs (usually 2 in optimized implementations), the nodes collected in an epoch are guaranteed to have no remaining readers. Nodes that no longer have any reading threads are safely released.
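To illustrate the idea, the following is a deliberately simplified EBR sketch, not the sv6 variant implementation: a global epoch advances only when every thread has observed the current one, and a retire list is freed two epochs after its nodes were logically deleted. Synchronization of the globals is omitted, real implementations also track whether a thread is currently inside a critical section, and all names are illustrative; struct node and the headers are as in the earlier sketches.

#define N_EPOCHS    3                 /* three rotating retire lists */
#define MAX_THREADS 72

struct retired_node { struct node *node; struct retired_node *next; };

static unsigned long global_epoch;
static unsigned long thread_epoch[MAX_THREADS]; /* last epoch each thread saw */
static struct retired_node *retired[N_EPOCHS];  /* per-epoch retire lists */

void ebr_enter(int tid)               /* called on critical-section entry */
{
    thread_epoch[tid] = global_epoch;
}

void ebr_retire(struct node *n)       /* n is unlinked but may still have readers */
{
    struct retired_node *r = malloc(sizeof(*r));
    r->node = n;                      /* wrapper keeps n->next intact for readers */
    r->next = retired[global_epoch % N_EPOCHS];
    retired[global_epoch % N_EPOCHS] = r;
}

void ebr_try_advance(int nthreads)
{
    for (int t = 0; t < nthreads; t++)
        if (thread_epoch[t] != global_epoch)
            return;                   /* some reader has not caught up yet */
    global_epoch++;
    /* The slot now reused for the new epoch was filled two epochs ago;
     * no reader can still hold a reference to those nodes. */
    struct retired_node *r = retired[global_epoch % N_EPOCHS];
    while (r != NULL) {
        struct retired_node *next = r->next;
        free(r->node);
        free(r);
        r = next;
    }
    retired[global_epoch % N_EPOCHS] = NULL;
}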
The user-level RCU implementation makes use of the recently introduced Citrus RCU implementation, which improves performance by avoiding overhead such as memory barriers [26]. This implementation identifies the nodes that need to be freed in a manner similar to EBR by synchronizing between threads. Each read thread signals entering and exiting the critical section by manipulating per-thread variables and flags, so rcu_synchronize() can be implemented without global coordination between threads [26]. The major difference from the kernel-level implementation is the use of the blocking rcu_synchronize(), which results in significant degradation on workloads with writes.
RLU implementation: We use the open-source RLU library [57] for the user- and kernel-level benchmarks. When porting RLU to sv6 variant, the primary consideration is memory allocation for the per-thread logs and for the nodes of the linked lists and chained hash tables. At the user level, the logs and nodes are allocated using the POSIX-compliant malloc() and free() functions of sv6 variant. At the kernel level, kmalloc() and kfree() are used to allocate the per-thread logs and the nodes of the shared data structures. However, because kfree() in sv6 variant requires the size of the memory to be released, it is necessary to extend the metadata of each log to keep the size of the allocated memory.
MV-RLU implementation: We use MV-RLU's open-source library [58] for the user- and kernel-level benchmarks. Porting the MV-RLU library to the user level of sv6 variant requires implementing some missing libraries and system calls, as follows. MV-RLU runs a garbage collector periodically to help recycle each thread's log. The garbage collector waits in a sleep state for a certain period, using condition variables to operate periodically. However, because condition variables are not supported in the sv6 variant pthread library, we implemented them using the futex (fast user-level mutex) system call and open-source code [59]. The log allocation of MV-RLU uses the same interface as RLU, and the following implementation is additionally required. As a performance matter, we implemented a simple memory allocator for assigning each thread's log while porting the MV-RLU library to the kernel level of sv6 variant. When using vmalloc() to allocate the logs, we observed that the overhead of log accesses is considerable. Using the profiling tool of sv6 variant, we observed that page fault handling takes up the most cycles in the entire execution. To address this problem, we developed a log allocator using kmalloc() so that all log areas are allocated logically contiguously. The log allocator reserves an area equal to the size of the logs of all threads in advance and then assigns a region of a certain size to each thread when the thread begins. The logs for each thread are allocated in a contiguous address space because MV-RLU tries to keep the time complexity of determining the version of a node (original, or copy in the log) at O(1). Since MV-RLU needs to distinguish between the original and copied versions whenever an object is accessed, this time must be minimized for performance. Since the log area of a thread is part of the contiguous memory allocated by the log allocator, the decision reduces to whether the memory address of the object lies within the memory range of the log allocator.
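The O(1) check enabled by the contiguous log area reduces to a single range comparison, sketched below; the variable names are assumptions.

static char  *log_base;        /* start of the pre-reserved log area */
static size_t log_area_size;   /* total size covering all threads' logs */

/* Returns nonzero if p points into some thread's log, i.e., p is a copied
 * version rather than an original node. One pair of comparisons replaces
 * any per-object bookkeeping. */
static inline int is_log_copy(const void *p)
{
    const char *c = (const char *)p;
    return c >= log_base && c < log_base + log_area_size;
}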

4.3. Benchmarks

We implemented user- and kernel-level micro-benchmarks to measure the performance of the synchronization schemes on sv6 variant. While the execution spaces of the two benchmarks differ, their operation procedures and algorithms are identical. The micro-benchmark performs multiple operations on a singly linked list and a chained hash table, which are synchronized using the RCU, RLU, and MV-RLU techniques. The chained hash table comprises fixed-size buckets, each of which contains a singly linked list sorted by key value. Since each bucket has its own lock, different buckets can be accessed at the same time. To insert, remove, or search for nodes in the hash table, a thread generates a random integer, hashes it to obtain a bucket number, and then traverses the linked list of the corresponding bucket using the random integer as the key. Node accesses are uniformly random. The main implementation of synchronized node deletion in a linked list is described in Section 4.1. The benchmark follows the same algorithm as the benchmarks of existing synchronization studies [24,25,39].
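One benchmark thread's main loop, under the parameters of Section 5.2, looks roughly like the sketch below. The bench struct, the ht_* functions, and the use of rand_r() are illustrative: the update ratio splits evenly between inserts and deletes, and every other operation is a lookup.

#include <stdlib.h>

struct hash_table;                      /* synchronized with RCU/RLU/MV-RLU */
extern void ht_insert(struct hash_table *ht, int key);
extern void ht_remove(struct hash_table *ht, int key);
extern int  ht_lookup(struct hash_table *ht, int key);

struct bench {
    int tid, key_range, update_ratio;   /* update_ratio in percent: 2, 20, 50 */
    volatile int stop;                  /* set by the main thread after 10 s */
    unsigned long ops;                  /* throughput counter, read at the end */
    struct hash_table *ht;
};

void *bench_thread(void *arg)
{
    struct bench *b = arg;
    unsigned int seed = (unsigned int)b->tid;
    while (!b->stop) {
        int key = rand_r(&seed) % b->key_range;  /* uniform over all keys */
        int op  = rand_r(&seed) % 100;
        if (op < b->update_ratio / 2)
            ht_insert(b->ht, key);      /* half of the updates insert... */
        else if (op < b->update_ratio)
            ht_remove(b->ht, key);      /* ...half delete, keeping the node
                                           count nearly constant */
        else
            ht_lookup(b->ht, key);      /* the remainder are reads */
        b->ops++;
    }
    return NULL;
}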

5. Performance Evaluation

In this section, we evaluate the RCU-style synchronization schemes by answering the questions below.
  • Do RCU-style synchronization techniques show performance scalability in a manycore-based operating system?
  • How well does each synchronization scheme perform when simultaneously accessing the linked list and hash table, which are representative data structures?
  • How does each synchronization mechanism perform according to the read/write ratio of the workload?

5.1. Evaluation Platform

To evaluate the performance of the RCU-style synchronization methods, we installed sv6 variant with the ported RCU-style synchronization schemes on a 36-core machine and conducted the benchmark experiments. Both the user- and kernel-level benchmark experiments were run on sv6 variant on a 36-core (hyperthreaded 72-core) machine equipped with two NUMA sockets holding two Intel Xeon E5-2697 v4 2.4 GHz processors (18 physical cores each) and 256 GB of memory.

5.2. Configuration

There are two types of benchmarks: user level and kernel level. The benchmark performs operations on linked list and hash table data structures. The linked list is initialized with 1000 nodes. The chained hash table is set to 1000 buckets and 5000 initial nodes. The keys of the initial nodes are generated as uniformly distributed random integers, so an average of five nodes is created per bucket. To compare the performance of each scheme according to the amount of writes, the update ratio is set to 2%, 20%, and 50%, respectively. Updates comprise node insertions and deletions in equal parts; as a result, the total number of nodes remains nearly constant at all times. Operations are distributed uniformly at random over all nodes, and the execution time per experiment is 10 s. The write log size for each thread of MV-RLU is 128 KB, and all other settings use the defaults. The benchmarks pin running threads to cores to minimize the performance interference caused by migration between cores. For this experiment, we use the hyperthreading feature of the CPUs. The benchmark's main thread is always executed on core 0 of CPU 0, and the threads created by the benchmarks are pinned in order to cores 1 to 17 of CPU 0 on NUMA node 0 and cores 0 to 17 of CPU 1 on NUMA node 1. As a result, hyperthreading is not used until 36 threads have been created. Additional threads are again pinned, in order, to cores 0 to 17 of CPU 0 and cores 0 to 17 of CPU 1, where they run on the hyperthread siblings of those cores.
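The pinning order reduces to a small mapping from worker index to logical CPU. The sketch below assumes a Linux-style affinity API and a logical CPU numbering in which 0-17 are the physical cores of CPU 0, 18-35 those of CPU 1, and 36-71 their hyperthread siblings in the same order; sv6 variant's actual interface and numbering may differ.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Worker i (0-based): cores 1..17 of CPU 0 first, then cores 0..17 of
 * CPU 1, then the hyperthread siblings in the same order. The main
 * thread stays on core 0 of CPU 0. */
static int cpu_for_worker(int i)
{
    if (i < 17) return 1 + i;            /* CPU 0, physical cores 1..17 */
    if (i < 35) return 18 + (i - 17);    /* CPU 1, physical cores 0..17 */
    return 36 + (i - 35);                /* hyperthread siblings, in order */
}

static void pin_worker(pthread_t t, int i)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_for_worker(i), &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}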

5.3. Performance Results

In this section, we analyze the results of user-level and kernel-level benchmarks. The performance of each synchronization mechanism is measured in terms of the operation’s throughput as the number of threads increases.

5.3.1. User-Level Benchmark

Figure 5 shows the experimental results of the user-level linked list and chained hash table. The top three graphs show the results for the linked list, while the bottom three represent the chained hash table's results.
Linked list: RCU shows linear performance scalability up to 50 threads at an update ratio of 2%, but its performance drops significantly at update ratios of 20% and 50%. RLU shows the best performance of the three. MV-RLU shows the lowest performance at an update ratio of 2%, but it offers linear performance scalability in all update cases.
RCU performance decreases significantly as the update ratio or the number of running threads increases. As seen in Section 4.1, RCU uses spinlock for write operations, so write threads are processed serially. In particular, since user-level RCU uses a blocking wait (rcu_synchronize()) for memory reclamation, performance scalability decreases as the update ratio increases or the number of running threads grows. Therefore, if the update ratio is low and the number of running threads is not high, RCU shows linear scalability, as shown in Figure 5a, whereas performance drops significantly as the update ratio increases, as shown in Figure 5b,c.
The performance of RLU and MV-RLU increases with the number of threads regardless of the update ratio. The reason MV-RLU performs worse than RLU here lies in how MV-RLU traverses data structures. MV-RLU and RLU read data structures in a similar fashion: a thread reads the latest valid version as of its entry into the critical section. However, MV-RLU maintains more versions than RLU to increase parallelism. Therefore, in MV-RLU, when the critical section is long, as with a linked list, the number of versions grows, which increases the version traversal cost during data structure traversal. The average length of the linked list we evaluated (1000 nodes) is 200 times the average chain length of the hash table (five nodes per bucket). Therefore, as the critical section of MV-RLU becomes longer, the cost of version traversal increases, affecting performance. RLU outperforms the others due to its lower version traversal cost and fewer write conflicts. RLU allows concurrent operations between threads unless they write to the same object. When the number of threads is small compared to the number of nodes, the probability of a write conflict on the same object falls, and the log reclamation cost of RLU is reduced. Since RLU maintains only two versions, its version search cost is relatively low when traversing the nodes of the linked list. In the case of RCU, since write threads are synchronized with a spinlock, performance decreases considerably as the number of threads and the update ratio increase.
Hash table: The performance results for the chained hash table are quite different from those for the linked list. RCU shows a performance increase in proportion to the number of threads at all update ratios, but beyond a certain number of threads, performance drops or saturates. These phenomena are clearly observed when running 18 and 24 threads, which cross the NUMA node boundary, in all update cases. Note that the main thread of the benchmark is pinned to core 0 of CPU 0, and the threads that perform benchmark operations are sequentially allocated to cores 1 to 17 of CPU 0, so the 18th thread is assigned to CPU 1. RLU shows the best performance in the linked list evaluation but the lowest in the chained hash table. At the 2% update ratio, its performance tends to increase relatively linearly, but at the 20% and 50% update ratios, performance decreases or saturates beyond 18 threads. MV-RLU shows the best performance among the synchronization schemes. Although the performance of MV-RLU is almost on par with RCU at the 2% update ratio, MV-RLU shows the best performance in almost all cases. Although the performance impact varies depending on the synchronization mechanism, all schemes show the NUMA boundary effect. As shown in Figure 5d–f, when threads are allocated across NUMA nodes 0 and 1, performance degrades or saturates. When running threads are spread across multiple NUMA nodes, the cost of accessing data structures and synchronizing caches increases.
Since the chained hash table has a lock for each bucket, simultaneous operations on different buckets are possible without synchronization. The critical section is relatively short because each bucket has only a few linked nodes. Due to the per-bucket parallelism, the performance of RCU on the chained hash table is better than on the linked list. When accessing a data structure that shares a lock, the write threads of RCU are executed sequentially. Since all write threads share one spinlock in the linked list, the higher the update ratio, the worse the performance. In contrast, the chained hash table spreads out the competition for spinlocks because each bucket is synchronized with its own spinlock, so the performance degradation is not as significant as with the linked list.
Due to the small number of chained nodes per bucket, the hash table's critical section is relatively short. As a result, the load of rcu_synchronize(), the blocking wait API for releasing RCU memory, is reduced. In the case of MV-RLU, the version traversal cost required for node traversal is reduced by the shortened critical section. In the case of RLU, a write operation necessitates a blocking wait for the Grace Period and Write Back, as shown in Figure 2b. Therefore, RLU shows little or no performance improvement as the number of threads increases, as shown in Figure 5e,f. MV-RLU shows high performance at all update ratios because it improves write throughput and solves the bottleneck of RLU, the blocking wait for log recycling, through the creation of multiple versions. The reason MV-RLU's performance on the hash table is better than RLU's is that the critical section is significantly shorter than with the linked list, so the version traversal cost during node traversal is greatly reduced.

5.3.2. Kernel-Level Benchmark

Figure 6 shows the experimental results of the kernel-level linked list and chained hash table. The top three graphs show the results for the linked list, while the bottom three represent the chained hash table's results.
Linked list: RCU shows the best performance at the 2% update ratio, whereas performance decreases significantly at the 20% and 50% update ratios, gradually declining after 16 and 8 threads, respectively. Comparing performance efficiency by the number of threads, a 70-thread execution has a performance efficiency as low as that of a two- to three-thread execution. Contention owing to the lock synchronization between write threads is the leading cause of RCU's performance degradation. When the number of threads is modest, RLU shows scalable performance, but it degrades at update ratios of 2%, 20%, and 50% beyond 30, 50, and 50 threads, respectively. This is because when RLU updates an object, it overwrites the original version with the updated copy. To safely overwrite the original object, the writing thread must wait until all threads reading the original object have exited the RLU section. Therefore, as the update ratio and the number of threads increase, RLU performance decreases. MV-RLU performs worse than RCU at the 2% update ratio, but it outperforms RCU at the 20% and 50% update ratios. The multi-version and distributed log recycling policies of MV-RLU contribute to performance improvement even as the number of running threads increases. Comparing MV-RLU and RLU, the performance difference grows as the number of threads increases. The reason is that the rlu_synchronize() cost of RLU increases significantly with the number of threads. MV-RLU delays log recycling until the log is filled, and only the latest of the multiple versions is overwritten (written back) to the original object, so the number of overwrites is reduced to 1/(number of versions).
The performance results of RCU and MV-RLU are similar to those of the user-level linked list, showing that RCU and MV-RLU perform consistently regardless of user or kernel level. The performance results of RLU, however, differ significantly from the user-level linked list results: the user-level RLU shows linear performance scalability, whereas the kernel-level RLU does not. The difference between the user- and kernel-level implementations of RLU is that the kernel-level RLU maintains the size of the allocated memory in the metadata of each log and each node so that kfree() can release the proper amount of allocated memory (described in Section 4). We analyze that this small increase in metadata size adversely affects performance. In particular, the cost of managing memory is expected to rise as the number of threads increases: as the metadata size grows, cache line bouncing between competing threads becomes more frequent, and the kmalloc()/kfree() cost grows.
Hash table: The performance trends of the kernel-level hash table are comparable to those of the user-level hash table, with a few exceptions. RCU's performance increases linearly with the number of threads and subsequently decreases after 64, 48, and 16 threads at the 2%, 20%, and 50% update ratios, respectively. The performance trends of RLU are comparable to those of the user-level chained hash table, with performance degrading at all update ratios: performance saturates or decreases beyond 16 threads, and the decrease is especially pronounced at the 50% update ratio. MV-RLU shows a linear performance increase at the 2% and 20% update ratios, but at the 50% update ratio, performance saturates and decreases when running more than 50 threads.
Kernel-level RCU performs better than user-level RCU due to several implementation differences: the introduction of the EBR algorithm to wait for the memory recycling of logically deleted nodes, the use of the non-blocking rcu_call() API, and the use of the optimized locks inside sv6 variant. Based on the performance results of the RCU benchmark on machines with more CPU cores [25], we anticipate a more significant performance degradation as the number of threads increases further. The Grace Period and Write Back overhead for log reclamation, the causes of RLU's performance decrease, have a more significant effect as the number of running threads increases. We analyze that the performance of MV-RLU decreases at the 50% update ratio due to overhead related to log reclamation, which is discussed in the next section.

6. Discussion

RCU's performance degrades as the number of threads increases because write threads execute sequentially under a spinlock and must wait for memory to be reclaimed. RLU was proposed to address these problems, yet in many cases it performs worse than RCU, because the cost of log recycling rises sharply with the number of threads. Nevertheless, RLU offers an easier and more intuitive programming interface than RCU and is therefore a meaningful advance.
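This serialized write path is the classic RCU update pattern, sketched below with generic kernel-style primitives; the list layout is illustrative, not the benchmark's actual code.

```c
/* Generic kernel-style primitives, assumed to be provided elsewhere. */
typedef struct { volatile int locked; } spinlock_t;
extern void spin_lock(spinlock_t *l);
extern void spin_unlock(spinlock_t *l);
extern void synchronize_rcu(void);
extern void kfree(void *p);

struct node { int key; struct node *next; };

static spinlock_t list_lock;   /* every writer funnels through this lock */

/* Replace oldn with newn at list position *prevp. */
void rcu_replace(struct node **prevp, struct node *oldn, struct node *newn)
{
    spin_lock(&list_lock);     /* writers execute one at a time */
    newn->next = oldn->next;
    *prevp = newn;             /* publish; rcu_assign_pointer() in real code */
    spin_unlock(&list_lock);

    synchronize_rcu();         /* block until pre-existing readers exit */
    kfree(oldn);               /* only now is reclamation safe */
}
```

Both costs named above are visible here: the spinlock serializes all writers, and synchronize_rcu() makes each writer wait for readers before memory can be reclaimed.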
In the case of MV-RLU, we measured a performance drop as the number of threads increased in the 50% update run of the kernel-level hash-table benchmark. Profiling the kernel benchmark with perf, a performance profiling tool offered by the sv6 variant, revealed that log_reclaim_force() consumes a significant share of CPU cycles. log_reclaim_force(), an internal function of the MV-RLU library, lets a thread forcefully empty its log when the log usage exceeds 75%: it asks the garbage collector to recycle the logs and waits. Because the MV-RLU garbage collector normally recycles logs periodically, log usage rarely exceeds 75%, so heavy CPU consumption in log_reclaim_force() is abnormal. MV-RLU manages its global clock without a performance bottleneck using a hardware clock-based technique called ORDO [60]. We suspect that the abnormal behavior stems from an incorrect ORDO boundary value. This value is machine-dependent and must be measured on the machine one intends to use; if it is wrong, log reclamation does not work correctly and the log usage rate rises. When we measured the ORDO boundary of the manycore machine used in the experiment, the measurements deviated by 15–30%. For MV-RLU, a correct ORDO boundary value has previously been observed to have a significant effect on performance when running in a new environment [25].
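To illustrate why the boundary matters, the sketch below gives our simplified rendering of an ORDO-style [60] timestamp comparison; it is not the MV-RLU source, and ORDO_BOUNDARY stands in for the machine-dependent constant whose measurement deviated by 15–30% on our machine.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp(): invariant per-core TSC read */

/* Machine-dependent uncertainty between per-core clocks; must be
 * measured on the target machine. The value below is illustrative. */
static const uint64_t ORDO_BOUNDARY = 2000;

static uint64_t ordo_read_clock(void)
{
    unsigned int aux;
    return __rdtscp(&aux);
}

/* +1: a is definitely after b; -1: definitely before;
 *  0: within the uncertainty window, so the order is unknown. */
static int ordo_cmp(uint64_t a, uint64_t b)
{
    if (a > b + ORDO_BOUNDARY) return  1;
    if (b > a + ORDO_BOUNDARY) return -1;
    return 0;
}
```

With a wrong boundary, the garbage collector can rarely prove that a version is older than every active reader, so versions accumulate, log usage climbs past the 75% threshold, and log_reclaim_force() starts consuming cycles, matching the profile we observed.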

7. Conclusions

RCU-style synchronization schemes are promising, combining high performance on modern manycore machines with intuitive APIs. Although RCU has been evaluated in many environments, performance reports on RLU and MV-RLU remain scarce; widening the adoption of RCU-style schemes therefore requires evaluating and analyzing them in diverse settings. We evaluated these schemes at both the user and kernel level on the sv6 variant running on a manycore machine. We implemented user- and kernel-level micro-benchmarks in which multiple threads concurrently access a linked list and a hash table, and we modified libraries and system calls to port the RCU-style synchronization schemes to the sv6 variant. Each synchronization mechanism has performance advantages and drawbacks depending on the data structure and benchmark parameters. MV-RLU generally delivers the highest performance, but its memory allocation and global clock management need attention when adopting it in a new environment. Based on the micro-benchmark results, we expect better scalability from adopting RCU-style synchronization schemes in various layers of an operating system, such as the file system, network stack, and memory management, on a manycore machine.

Author Contributions

Conceptualization, J.K.; Data curation, E.C. and M.H.; Software, C.K.; Writing, review and editing, S.L. and J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2014-3-00035) and by a National Research Foundation of Korea (NRF) grant (No. 2021R1F1A1063524). The APC was funded by Gyeongsang National University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Moore, C. Data Processing in Exascale-Class Computer Systems; Salishan Conference on High Speed Computing: Gleneden Beach, OR, USA, 2011.
2. Hill, M.D.; Marty, M.R. Amdahl’s Law in the Multicore Era. IEEE Comput. 2008, 41, 33–38.
3. NetworkWorld. Ampere Announces 128-Core Arm Server Processor. 2020. Available online: https://www.networkworld.com/article/3564514/ampere-announces-128-core-arm-server-processor.html (accessed on 3 February 2022).
4. Intel. Intel® Xeon® Platinum Processor. 2021. Available online: https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable/platinum.html (accessed on 3 February 2022).
5. Supermicro. SuperServer 7089P-TR4T. 2022. Available online: https://www.supermicro.com/en/products/system/7U/7089/SYS-7089P-TR4T.cfm (accessed on 3 February 2022).
6. Boyd-Wickizer, S.; Kaashoek, M.F.; Morris, R.; Zeldovich, N. Non-Scalable Locks Are Dangerous. In Proceedings of the Ottawa Linux Symposium, OLS ’12, Ottawa, ON, Canada, 11–13 July 2012; pp. 119–130.
7. McKenney, P.E. Is Parallel Programming Hard, and, If So, What Can You Do about It? (Release v2021.12.22a). arXiv 2021, arXiv:cs.DC/1701.00854.
8. Kashyap, S.; Calciu, I.; Cheng, X.; Min, C.; Kim, T. Scalable and Practical Locking with Shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP ’19, Huntsville, ON, Canada, 27–30 October 2019; pp. 586–599.
9. Herlihy, M.; Shavit, N. The Art of Multiprocessor Programming, Revised Reprint, 1st ed.; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 2012.
10. Mellor-Crummey, J.M.; Scott, M.L. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. Comput. Syst. 1991, 9, 21–65.
11. Corbet, J. MCS Locks and Qspinlocks. LWN.net. Available online: https://lwn.net/Articles/590243/ (accessed on 3 February 2022).
12. Chabbi, M.; Fagan, M.; Mellor-Crummey, J. High Performance Locks for Multi-Level NUMA Systems. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, San Francisco, CA, USA, 7–11 February 2015; pp. 215–226.
13. Kashyap, S.; Min, C.; Kim, T. Scalable NUMA-Aware Blocking Synchronization Primitives. In Proceedings of the 2017 USENIX Annual Technical Conference, USENIX ATC ’17, Santa Clara, CA, USA, 12–14 July 2017; pp. 603–615.
14. Harris, T.L. A Pragmatic Implementation of Non-Blocking Linked-Lists. In Proceedings of the 15th International Conference on Distributed Computing, DISC ’01, Lisbon, Portugal, 3–5 October 2001; pp. 300–314.
15. Fomitchev, M.; Ruppert, E. Lock-Free Linked Lists and Skip Lists. In Proceedings of the Twenty-Third Annual ACM Symposium on Principles of Distributed Computing, PODC ’04, St. John’s, NL, Canada, 25–28 July 2004; pp. 50–59.
16. Herlihy, M.; Luchangco, V.; Moir, M. Obstruction-Free Synchronization: Double-Ended Queues as an Example. In Proceedings of the 23rd International Conference on Distributed Computing Systems, ICDCS ’03, Providence, RI, USA, 19–22 May 2003; p. 522.
17. Michael, M.M. High Performance Dynamic Lock-Free Hash Tables and List-Based Sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’02, New York, NY, USA, 10–13 August 2002; pp. 73–82.
18. Hart, T.; McKenney, P.; Brown, A. Making Lockless Synchronization Fast: Performance Implications of Memory Reclamation. In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, 25–29 April 2006; p. 10.
19. Hart, T.E.; McKenney, P.E.; Brown, A.D.; Walpole, J. Performance of Memory Reclamation for Lockless Synchronization. J. Parallel Distrib. Comput. 2007, 67, 1270–1285.
20. Wikipedia. Non-Blocking Algorithm. 2021. Available online: https://en.wikipedia.org/wiki/Non-blocking_algorithm (accessed on 3 February 2022).
21. McKenney, P.E.; Fernandes, J.; Boyd-Wickizer, S.; Walpole, J. RCU Usage in the Linux Kernel: Eighteen Years Later. SIGOPS Oper. Syst. Rev. 2020, 54, 47–63.
22. McKenney, P. What Is RCU, Fundamentally? 2007. Available online: https://lwn.net/Articles/262464/ (accessed on 3 February 2022).
23. Desnoyers, M.; McKenney, P.E.; Stern, A.S.; Dagenais, M.R.; Walpole, J. User-Level Implementations of Read-Copy Update. IEEE Trans. Parallel Distrib. Syst. 2012, 23, 375–382.
24. Matveev, A.; Shavit, N.; Felber, P.; Marlier, P. Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, Monterey, CA, USA, 4–7 October 2015; pp. 168–183.
25. Kim, J.; Mathew, A.; Kashyap, S.; Ramanathan, M.K.; Min, C. MV-RLU: Scaling Read-Log-Update with Multi-Versioning. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, Providence, RI, USA, 13–17 April 2019; pp. 779–792.
26. Arbel, M.; Attiya, H. Concurrent Updates with RCU: Search Tree as an Example. In Proceedings of the 2014 ACM Symposium on Principles of Distributed Computing, PODC ’14, Paris, France, 15–18 July 2014; pp. 196–205.
27. McKenney, P.E. Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels. Ph.D. Thesis, Oregon Health & Science University, Portland, OR, USA, 2004.
28. McKenney, P.E.; Desnoyers, M.; Jiangshan, L. User-Space RCU. LWN.net. Available online: https://lwn.net/Articles/573424/ (accessed on 3 February 2022).
29. Clements, A.T.; Kaashoek, M.F.; Zeldovich, N. Scalable Address Spaces Using RCU Balanced Trees. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, London, UK, 3–7 March 2012; pp. 199–210.
30. Clements, A.T.; Kaashoek, M.F.; Zeldovich, N.; Morris, R.T.; Kohler, E. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, Farminton, PA, USA, 3–6 November 2013; pp. 1–17.
31. Magnusson, P.S.; Landin, A.; Hagersten, E. Queue Locks on Cache Coherent Multiprocessors. In Proceedings of the 8th International Symposium on Parallel Processing, Cancún, Mexico, 1 April 1994; pp. 165–171.
32. Luchangco, V.; Nussbaum, D.; Shavit, N. A Hierarchical CLH Queue Lock. In Euro-Par 2006; Springer: Dresden, Germany, 2006; pp. 801–810.
33. Dice, D.; Kogan, A. BRAVO: Biased Locking for Reader-Writer Locks. In Proceedings of the 2019 USENIX Annual Technical Conference, USENIX ATC ’19, Renton, WA, USA, 10–12 July 2019; pp. 315–328.
34. Fraser, K.; Harris, T. Concurrent Programming without Locks. ACM Trans. Comput. Syst. 2007, 25, 5-es.
35. Michael, M.M. Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic Reads and Writes. In Proceedings of the Twenty-First Annual Symposium on Principles of Distributed Computing, PODC ’02, Monterey, CA, USA, 21–24 July 2002; pp. 21–30.
36. Preshing, J. An Introduction to Lock-Free Programming. 2012. Available online: https://preshing.com/20120612/an-introduction-to-lock-free-programming/ (accessed on 3 February 2022).
37. DPDK. RCU Library. 2019. Available online: https://doc.dpdk.org/guides/prog_guide/rcu_lib.html (accessed on 3 February 2022).
38. Wikipedia. Read-Copy-Update. 2021. Available online: https://en.wikipedia.org/wiki/Read-copy-update (accessed on 3 February 2022).
39. Zhan, Y.; Porter, D.E. Versioned Programming: A Simple Technique for Implementing Efficient, Lock-Free, and Composable Data Structures. In Proceedings of the 9th ACM International Systems and Storage Conference, SYSTOR ’16, Haifa, Israel, 6–8 June 2016.
40. Wikipedia. List of Databases Using MVCC. 2021. Available online: https://en.wikipedia.org/wiki/List_of_databases_using_MVCC (accessed on 3 February 2022).
41. Diaconu, C.; Freedman, C.; Ismert, E.; Larson, P.A.; Mittal, P.; Stonecipher, R.; Verma, N.; Zwilling, M. Hekaton: SQL Server’s Memory-Optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, New York, NY, USA, 22–27 June 2013; pp. 1243–1254.
42. Lim, H.; Kaminsky, M.; Andersen, D.G. Cicada: Dependably Fast Multi-Core In-Memory Transactions. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, Chicago, IL, USA, 14–19 May 2017; pp. 21–35.
43. Park, S.; McKenney, P.E.; Dufour, L.; Yeom, H.Y. An HTM-Based Update-Side Synchronization for RCU on NUMA Systems. In Proceedings of the Fifteenth European Conference on Computer Systems, EuroSys ’20, Heraklion, Greece, 27–30 April 2020.
44. Boyd-Wickizer, S.; Chen, H.; Chen, R.; Mao, Y.; Kaashoek, F.; Morris, R.; Pesterev, A.; Stein, L.; Wu, M.; Dai, Y.; et al. Corey: An Operating System for Many Cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI ’08, San Diego, CA, USA, 8–10 December 2008; pp. 43–57.
45. Baumann, A.; Barham, P.; Dagand, P.E.; Harris, T.; Isaacs, R.; Peter, S.; Roscoe, T.; Schüpbach, A.; Singhania, A. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, Big Sky, MT, USA, 11–14 October 2009; pp. 29–44.
46. Park, Y.; Van Hensbergen, E.; Hillenbrand, M.; Inglett, T.; Rosenburg, B.; Ryu, K.D.; Wisniewski, R.W. FusedOS: Fusing LWK Performance with FWK Functionality in a Heterogeneous Environment. In Proceedings of the 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, Washington, DC, USA, 24–26 October 2012; pp. 211–218.
47. Wentzlaff, D.; Agarwal, A. Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores. SIGOPS Oper. Syst. Rev. 2009, 43, 76–85.
48. Clements, A.T.; Kaashoek, M.F.; Zeldovich, N.; Morris, R.T.; Kohler, E. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. ACM Trans. Comput. Syst. 2015, 32, 1–47.
49. Bhat, S.S.; Eqbal, R.; Clements, A.T.; Kaashoek, M.F.; Zeldovich, N. Scaling a File System to Many Cores Using an Operation Log. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, Shanghai, China, 28–31 October 2017; pp. 69–86.
50. Cox, R.; Kaashoek, M.F.; Morris, R.T. Xv6, a Simple Unix-Like Teaching Operating System. Available online: https://pdos.csail.mit.edu/6.828/2021/xv6.html (accessed on 3 February 2022).
51. Wikipedia. Readers–Writer Lock. 2021. Available online: https://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock (accessed on 3 February 2022).
52. McKenney, P.; Slingwine, J. Read-Copy Update: Using Execution History to Solve Concurrency Problems. In Proceedings of Parallel and Distributed Computing and Systems, 1998; pp. 509–518.
53. McKenney, P.E.; Boyd-Wickizer, S.; Walpole, J. FAQ for “RCU Usage in the Linux Kernel: One Decade Later”. Available online: https://pdos.csail.mit.edu/6.S081/2020/lec/rcu-faq.txt (accessed on 3 February 2022).
54. kernel.org. What Is RCU? Available online: https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt (accessed on 3 February 2022).
55. McKenney, P. Hierarchical RCU. 2008. Available online: https://lwn.net/Articles/305782/ (accessed on 3 February 2022).
56. Fraser, K. Practical Lock-Freedom; Technical Report UCAM-CL-TR-579; University of Cambridge, Computer Laboratory: Cambridge, UK, 2004.
57. GitHub. RLU. Available online: https://github.com/rlu-sync/rlu (accessed on 3 February 2022).
58. GitHub. MV-RLU. Available online: https://github.com/cosmoss-vt/mv-rlu (accessed on 3 February 2022).
59. Denis-Courmont, R. Condition Variable with Futex. Available online: https://www.remlab.net/op/futex-condvar.shtml (accessed on 3 February 2022).
60. Kashyap, S.; Min, C.; Kim, K.; Kim, T. A Scalable Ordering Primitive for Multicore Machines. In Proceedings of the Thirteenth EuroSys Conference, EuroSys ’18, Porto, Portugal, 23–26 April 2018.
Figure 1. Example of updating a node in a linked list with RCU. (a) assumes that multiple threads concurrently access a linked list consisting of three nodes: A, B, and C. (b) shows the API usage of each thread for the operation example in (a).
Figure 2. Example of updating a node in a linked list with RLU. The assumptions and notations follow those of Figure 1.
Figure 3. Example of updating a node in a linked list with MV-RLU. The assumptions and notations follow those of Figure 1.
Figure 4. Example of deleting a node from a singly linked list using the synchronization algorithms.
Figure 5. Linked list (upper) and hash table (lower) on the user level.
Figure 6. Linked list (upper) and hash table (lower) on the kernel level.
Table 1. High-level comparison of blocking and non-blocking synchronization mechanisms. The mechanisms differ in their concurrent execution parallelism, major design factors, main performance overheads, and API usage difficulty. RW-lock allows concurrent access for read-only operations, while write operations require exclusive access. RCU and RLU are designed for read-mostly workloads. In RCU, read operations are non-blocking, since concurrent reads are allowed, whereas write operations are blocking, because write executions are serialized. In RLU, concurrent writes are allowed unless two writers conflict on the same data; RLU therefore supports non-blocking reads and writes. MV-RLU extends RLU for write-heavy workloads using multi-versioning while maintaining RLU's optimal performance for read-mostly workloads along with its intuitive programming model.

|                           | RW-Lock [51]                       | RCU [38]                                               | RLU [24]                           | MV-RLU [25]                                |
|---------------------------|------------------------------------|--------------------------------------------------------|------------------------------------|--------------------------------------------|
| Policy                    | lock-based                         | lock and non-blocking                                  | non-blocking                       | non-blocking                               |
| Concurrency: R-R          | •                                  | •                                                      | •                                  | •                                          |
| Concurrency: R-W          | ×                                  | •                                                      | •                                  | •                                          |
| Concurrency: W-W          | ×                                  | ×                                                      | ▴                                  | ▴                                          |
| Design factor             | concurrent R, mutually exclusive W | concurrent R, single W                                 | concurrent R, multiple W           | concurrent R, multiple W w/o blocking wait |
| API usage difficulty      | low                                | high                                                   | medium                             | medium                                     |
| Main performance overhead | lock and unlock, mutual exclusion of W | serialized execution of W with a lock, wait for memory reclamation | blocking wait for log reclamation | version chain traversal                    |

NOTE. R: reader; W: writer; ×: no parallelism; •: full parallelism; ▴: write–write conflict for the same data.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
