Article

CAL: Core-Aware Lock for the big.LITTLE Multicore Architecture

School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6449; https://doi.org/10.3390/app14156449
Submission received: 28 April 2024 / Revised: 8 July 2024 / Accepted: 18 July 2024 / Published: 24 July 2024

Abstract
The concept of “all cores are created equal” has guided CPU (Central Processing Unit) design for decades because of its simplicity and effectiveness: the more cores a CPU has, the higher the performance of the host, but also the higher the power consumption. However, power saving is a key goal for servers in data centers and for embedded devices (e.g., mobile phones). The big.LITTLE multicore architecture, which combines high-performance cores (big cores) and power-efficient cores (little cores), has been developed by ARM (Advanced RISC Machine) and Intel to trade off performance against power efficiency. On this heterogeneous architecture, traditional lock algorithms, which were designed for homogeneous architectures, no longer work optimally and run into performance problems caused by the difference between big and little cores. In our preliminary experiment, we observed that all of these lock algorithms exhibit sub-optimal performance on the big.LITTLE multicore architecture. FIFO-based (First In First Out) locks suffer throughput degradation, while competition-based locks fall into two categories: big-core-friendly locks, whose tail latency increases significantly, and little-core-friendly locks, whose tail latency increases and whose throughput is degraded as well. Motivated by this observation, we propose CAL, a Core-Aware Lock for the big.LITTLE multicore architecture, which gives each core an equal opportunity to enter the critical section. The core idea of CAL is to use the slowdown ratio as the metric for reordering lock requests from big and little cores. Evaluations on benchmarks and on a real-world application, LevelDB, confirm that CAL achieves its fairness goal on heterogeneous architectures without sacrificing big-core performance. Compared with several traditional lock algorithms, CAL improves fairness by up to 67%, and its throughput is 26% higher than FIFO-based locks and 53% higher than competition-based locks. In addition, the tail latency of CAL always remains at a low level.

1. Introduction

Single-core CPU (Central Processing Unit) performance has stagnated in recent years, and CPU manufacturers have turned to developing multicore processors. In early multicore processors, the cores are symmetric and deliver identical performance [1,2], meaning that each core has the same computational ability and energy-consumption characteristics. However, as the core count grows from 2 to 8 or more, the overall chip power consumption also increases, inducing high costs for data centers and a sharp decrease in the standby time of battery-powered embedded devices (e.g., mobile phones). To trade off the performance and battery lifetime of embedded devices, ARM (Advanced RISC Machine), AMD (Advanced Micro Devices), and Intel have devoted themselves to developing asymmetric multicore processors (e.g., 13th Gen Intel (R) Core (TM) CPUs, the ARM A15 CPU), which contain high-performance cores (namely big cores) and power-efficient cores (namely little cores). The ideal scenario for the big.LITTLE multicore architecture is that the big cores run computing-intensive tasks, while the little cores run less computing-intensive tasks.
However, it is impractical for the OS (Operating System) to schedule multiprocessing programs only on big cores or only on little cores [3,4]. Existing scheduling strategies tend to overlook the disparities among cores when scheduling tasks, so multiprocessing programs are assigned to both big and little cores at random. When big and little cores collaborate on a task, locks are required to synchronize data between cores. Note that locks are employed by the OS to synchronize data between processes and to achieve mutually exclusive access to critical resources. Existing lock algorithms are designed and optimized for a specific hardware topology; e.g., the hierarchical lock was proposed for the NUMA (Non-Uniform Memory Access) architecture to reduce cross-node communication overhead, and the ttas (Test-And-Test-And-Set) lock was proposed to mitigate the frequent cache-line invalidations caused by atomic instructions. The ttas lock is a spinlock algorithm used in concurrent programming, particularly where multiple threads need mutually exclusive access to shared resources to prevent race conditions; it is an optimization over the traditional Test-and-Set lock, designed to reduce the contention caused by spinning (continuous checking) when the lock is already held. A spinlock is a simple synchronization primitive in which a thread that attempts to acquire a lock held by another thread does not relinquish its CPU timeslice but instead keeps checking (spinning) until the lock becomes available. Ticket locks manage access through a queueing mechanism: each thread requesting the lock obtains an incrementally increasing “ticket” number and then waits until its ticket number is called to acquire the lock. The cptltkt lock realizes a queued (cohort) lock scheme that combines a global partitioned ticket lock with local ticket locks for lock management. In the big.LITTLE multicore architecture, the performance gap between big and little cores means that a big core executes the same critical section faster than a little core, which undermines the inherent fairness of traditional locks.
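To make the difference between these spinning strategies concrete, the following C11 sketch contrasts a plain test-and-set spinlock with the ttas variant described above (an illustration under our own naming, not code from the paper or from any specific library):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spin_lock_t;

static void tas_lock(spin_lock_t *l) {
    /* plain test-and-set: every iteration performs an atomic exchange,
     * which invalidates the lock's cache line on all waiting cores */
    while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
        ;
}

static void ttas_lock(spin_lock_t *l) {
    for (;;) {
        /* test first with a plain load; attempt the atomic exchange only
         * when the lock looks free, reducing cache-line invalidations */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;
    }
}

static void spin_unlock(spin_lock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

In either variant, which core wins the exchange depends on hardware arbitration, which is why, as discussed later, competition-based locks can turn out to be big-core-friendly or little-core-friendly on the big.LITTLE architecture.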
To illustrate the lock fairness issue that existing (core-unaware) locks exhibit on the big.LITTLE multicore architecture, we ran an experiment to observe the relationship between lock throughput, tail latency, and core type. The lock algorithms used in the experiment come from the LiTL library, which transparently replaces Pthread mutex operations in Linux and allows plugging in different lock algorithms under a common API (Application Programming Interface). Figure 1 shows the performance and latency breakdown of traditional locks on the big.LITTLE multicore architecture. We can observe that the throughput of the four locks increases with the core count but drops quickly as the core count goes from 6 to 10; the first 6 cores are big cores and the last 4 are little cores. We analyze this phenomenon further in Section 2.3.
In this paper, we study the lock fairness issue in the big.LITTLE multicore architecture. We introduce the slowdown ratio [5] into lock design to measure the fairness of entering the critical section, and we design a core-aware lock for this novel architecture. The goal of CAL is to make the big cores and the little cores experience slowdown ratios that are as equal as possible, thereby achieving fairness and improving the throughput of programs running on both core types.
This paper makes the following contributions:
  • We conduct a preliminary experiment to observe the fairness issue of traditional locks in the big.LITTLE multicore architecture and carry out an in-depth analysis of their performance problems. We point out that core-unaware locks on the big.LITTLE multicore architecture induce performance degradation.
  • Motivated by our observation, we design a core-aware lock for the big.LITTLE multicore architecture to trade off lock assignment between big and little cores. We build this scheme on top of the ticket lock and the MCS (John Mellor-Crummey and Michael Scott) lock and reorder lock requests according to our lock fairness model. Furthermore, we implement CAL on top of the open-source LiTL lock framework, which is widely used for studying lock optimization.
  • We conduct a series of experiments on benchmarks and on the key-value store engine LevelDB. The experimental results validate the effectiveness of our scheme. Compared to the SOTA (state-of-the-art), CAL improves fairness by up to 67%, and its throughput is 26% higher than FIFO-based (First In First Out) locks and 53% higher than competition-based locks. In addition, the tail latency of CAL always remains at a low level.
In the rest of the paper, Section 2 discusses the lock background and motivates our design. Section 3 presents the detailed scheme. Section 4 describes the experiment methodology and analyzes the results. Section 5 gives the related work, and Section 6 concludes the paper.

2. Background and Research Motivation

2.1. The big.LITTLE Multicore Architecture

In the traditional multicore processor architecture, each core is symmetric in performance and energy consumption, which greatly simplifies operating system scheduling: whether an application is performance-oriented or energy-oriented, the operating system can schedule it with a predefined scheme. To trade off performance against power, the big.LITTLE multicore architecture, which aims to provide the most suitable processor for different applications, was proposed to solve the power-consumption problem. This architecture combines high-performance cores and power-efficient cores: the high-performance cores run performance-oriented applications, and the power-efficient cores run energy-oriented applications. In the ideal scenario, the OS perceives the type of each application and schedules it onto the suitable core, so the big.LITTLE multicore architecture can efficiently meet the design goals of applications while saving power.

2.2. Lock Classification

Since the invention of concurrent programming, a lot of research has been carried out on the design of lock algorithms. At present, the most widely used locks are mainly divided into three categories:
1. FIFO-based locks, which assign the lock to requesting threads in FIFO order. Representative FIFO locks include the ticket lock [6], the CLH (Craig, Landin and Hagersten) lock [7], and so on (a minimal ticket-lock sketch follows this list).
2. Competition-based locks, which are lightweight locks. They include the ttas lock [8], the simple spinlock [9], the pthread lock, and so on. The competing threads each issue a single atomic operation (e.g., CAS (compare and swap)) to capture the lock. If the atomic instruction succeeds, the thread enters the critical section; if it fails, the thread keeps retrying the atomic instruction until it succeeds.
3. Cohort locks [10], also known as hierarchical locks, which are designed for the NUMA architecture [10,11,12]. These locks aim to extend the useful lifetime of data in the CPU cache. A cohort lock generally consists of two layers: the top layer uses a global lock, and the bottom layer uses a local lock for each NUMA node. The hierarchical lock prioritizes threads in the same NUMA node for a fixed period, avoiding cross-node traffic overhead, while periodically rotating the global lock between NUMA nodes to achieve long-term fairness.
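The FIFO ordering of category 1 can be captured in a few lines. The following is a minimal ticket-lock sketch in C11 (illustrative only; names are our own and are not taken from the paper or from LiTL):

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket number to hand out    */
    atomic_uint now_serving;   /* ticket currently allowed to enter */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l) {
    /* take a ticket; threads enter strictly in ticket (arrival) order */
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;                                      /* spin until it is our turn */
}

static void ticket_unlock(ticket_lock_t *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}

On a big.LITTLE machine, this strict ordering is exactly what causes the throughput drop discussed in Section 2.3: a slow little core at the head of the ticket order delays every faster big core queued behind it.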

2.3. Motivation

2.3.1. Problem Statement

For multiprocessing programs, the lock works as a coordinator that synchronizes operations among threads on the CPU, and it should give each thread an equal chance to enter the critical section (i.e., locking fairness), preventing any thread from being blocked for a long time. Popular lock algorithms, such as FIFO locks, competition-based locks, cohort locks, and their variants, ensure locking fairness either explicitly or implicitly. A FIFO lock assigns the lock to threads/processes in their arrival order. The TAS (Test-and-Set) lock, a variant of the competition-based lock, lets threads apply for the lock with an atomic instruction, and the success rate of the atomic operation is equal across cores. The cohort lock is suited to the NUMA architecture; it prefers to assign the lock within one CPU socket, reducing cross-socket traffic overhead. This scheme sacrifices short-term fairness, but by periodically rotating across NUMA nodes, the cohort lock provides long-term fairness [10]. On homogeneous computing architectures, these locks work effectively.
However, the above locks become sub-optimal on the big.LITTLE multicore architecture, for the following reasons. Figure 2 illustrates the issue for FIFO-based locks. Due to the strict locking order, a little core has the same chance as a big core to obtain the lock and execute the critical section; but for the same critical section, the little core runs more slowly than the big core, so the critical path stays on the little cores much longer, resulting in locking unfairness and a decrease in throughput. Competition-based locks, in contrast, rely on atomic operations to acquire the lock, but the success ratio of atomic operations is no longer the same on big and little cores, resulting in extreme unfairness. As shown in Figure 2(II), once the success rate of atomic operations is higher on the little cores (we call this little-core-friendly), the big cores can barely get the lock, resulting in a significant increase in tail latency; moreover, the critical path mainly falls on the slower little cores, so throughput is also significantly reduced. On the other hand, when the success ratio of atomic operations is higher on the big cores (we call this big-core-friendly), throughput benefits, but the tail latency can be much longer, as illustrated in Figure 2(III).
To observe the fairness issue in existing locks, we conducted a preliminary experiment on a 13th Gen Intel Core i5 processor, which contains 6 big cores and 4 little cores. Four widely used locks (ticket lock, spinlock, ttas lock, and cptltkt) were selected to evaluate the performance deviation on the big.LITTLE multicore architecture. The result is shown in Figure 1. As discussed above, FIFO-based locks, such as the ticket lock, suffer a throughput drop once little cores begin to be scheduled. Spinlock and ttas are competition-based locks; the difference is that ttas is big-core-friendly, so it has the highest throughput, but its tail latency rises significantly compared to the FIFO-based locks, while spinlock is little-core-friendly here, so both its throughput and tail latency collapse. The cptltkt lock exhibits performance similar to the ticket lock. All of these lock algorithms suffer from performance issues.

2.3.2. Problem Definition

To illustrate the performance issue in the big.LITTLE multicore architecture, we introduce the slowdown ratio to measure the degree of slowdown when several cores compete for a lock. In general, when multiple threads request entry to a critical section at the same time, only one thread is permitted to access the critical section, and the others must wait until the lock is released; each thread therefore takes longer than it would if it requested the lock alone. This slowdown issue becomes worse in the big.LITTLE multicore architecture. To quantify it, the slowdown ratio [5] of each thread is defined as:
$slowdown_i = T_i^{shared} / T_i^{alone}$
$T_i^{shared}$ is the average completion time of the critical section of thread i when it runs together with other threads, and $T_i^{alone}$ is the average completion time of the critical section of thread i when it executes alone. In the $T_i^{alone}$ case, thread i does not need to wait for any other thread, whereas $T_i^{shared}$ equals the sum of the completion times of the preceding i−1 threads plus the $T_i^{alone}$ of thread i itself. Based on the preliminary experimental results, the following observation can be made: the big.LITTLE multicore architecture gives the big core higher performance than the little core, so the big core suffers a higher slowdown ratio because it must wait for the slower little core to complete the critical section. We introduce lock fairness to measure this issue, and it is defined as follows:
$Fairness = \min_i\{slowdown_i\} / \max_i\{slowdown_i\}$
$Fairness$ is the ratio of the minimum to the maximum slowdown ratio suffered by threads in a concurrent execution, and its value lies between 0 and 1. The closer it is to 1, the more evenly the slowdown is shared among threads. Figure 3 shows the fairness variation of the ticket lock. When running on cores of the same type, the slowdown ratios tend to be equal and the fairness approaches 1. When the multicore process begins to use little cores, the big and little cores suffer different slowdown ratios, and the fairness drops sharply. The lock algorithm we design aims to make each thread suffer the same slowdown ratio, as it would on a homogeneous computing architecture. Fairness close to 1 means the cores work effectively and no thread suffers a long waiting overhead, as shown in our experimental results.
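To make the two definitions concrete, the following sketch (our own illustration; the array names are assumptions, not the paper's code) computes the per-thread slowdown ratios and the resulting fairness value from measured completion times:

#include <stddef.h>

/* t_shared[i] and t_alone[i] are the measured T_i^shared and T_i^alone of
 * thread i; returns MIN(slowdown) / MAX(slowdown), a value in [0, 1] */
double lock_fairness(const double *t_shared, const double *t_alone, size_t n) {
    double min_s = 0.0, max_s = 0.0;
    for (size_t i = 0; i < n; i++) {
        double s = t_shared[i] / t_alone[i];        /* slowdown_i */
        if (i == 0 || s < min_s) min_s = s;
        if (i == 0 || s > max_s) max_s = s;
    }
    return min_s / max_s;   /* 1.0 means all threads are slowed down equally */
}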

3. Design

3.1. Overview

Traditional locks become sub-optimal in the big.LITTLE multicore architecture mainly because lock fairness deteriorates due to the inherent asymmetry of the cores. Therefore, we design CAL, a Core-Aware Lock for the big.LITTLE multicore architecture, based on the fairness computational model.
CAL is built on the ticket lock and the MCS lock to perform lock-reordering operations: the ticket lock mainly handles low contention and shortens the lock path, while the MCS lock performs better under medium and high contention. Furthermore, we extend the MCS node with the information needed for our reordering operation: the enqueue time T_enqueue, the individual execution time T_alone, a rearranger flag, a backward pointer (we expand the MCS queue into a doubly linked list), and at_head, which indicates whether the node is at the head of the queue. Figure 4 shows the CAL lock structure and the MCS node information. The bottom layer of the CAL lock structure contains a native ticket lock. In addition, CAL includes pointers to the head and tail of each queue, pop_flag, which indicates whether a thread is exiting the queue, and no_steal, which controls whether a shorter lock path can be used under low contention.
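A possible C layout of these structures, following Figure 4, is sketched below; the field names and types are our assumptions and may differ from the released code:

#include <stdatomic.h>
#include <stdint.h>

struct cal_node {                        /* extended MCS queue node            */
    uint64_t t_enqueue;                  /* enqueue time T_enqueue             */
    uint64_t t_alone;                    /* individual execution time T_alone  */
    _Atomic int rearranger;              /* 1: this node performs reordering   */
    _Atomic int at_head;                 /* 1: node is at the queue head       */
    struct cal_node *_Atomic pre;        /* backward link (doubly linked list) */
    struct cal_node *_Atomic next;       /* forward link, as in a plain MCS node */
};

struct cal_lock {
    atomic_uint ticket_next, ticket_serving;                  /* native ticket lock */
    struct cal_node *_Atomic perf_head, *_Atomic perf_tail;   /* big-core queue     */
    struct cal_node *_Atomic effi_head, *_Atomic effi_tail;   /* little-core queue  */
    _Atomic int perf_popflag, effi_popflag;  /* a thread is leaving the queue       */
    _Atomic int no_steal;                    /* 1: FastPath stealing is disabled    */
};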
The CAL algorithm is divided into three stages, as shown in Algorithm 1. The first stage is core-aware queue insertion, which inserts lock-requesting threads into the tail of the waiting queue corresponding to the core type they run on. The second stage is fairness-aware node reordering, which moves each thread to the most suitable position in the current queue to maximize the fairness of the whole queue. The third stage is item selection, which selects either the head of the big-core waiting queue or the head of the little-core waiting queue to acquire the underlying ticket lock. After these three stages, big-core threads obtain a certain degree of priority while the fairness of the system is maximized.
Algorithm 1 CAL algorithm

int CAL_mutex_lock(CAL_mutex_t lock) {
  if lock.Ticket = UNLOCK && lock.no_steal = 0 then
    goto FastPath;
  end if
  // If the FastPath is unavailable, the thread enqueues into the corresponding wait queue
  node = init_node(T_enqueue, T_alone, at_head = 0, rearranger = 0, pre = next = NULL);
  if is_perf_core() then
    // Threads on performance cores are enrolled in the wait queue of the performance cores
    pre_node = SWAP(&lock.perf_tail, &node);
    if pre_node = NULL then
      // The thread is the first to join the queue
      CAS(&lock.perf_no_steal, 0, 1);
      // Disable the FastPath to prevent starvation
      node.at_head = 1;
      node.rearranger = 1;
      // The head node serves as the initial reordering node
    else
      wait_until_to_head(lock, pre_node, &node);
      // Spin-wait until the thread becomes the head node
    end if
    while True do
      if node.rearranger = 1 then
        reorder(lock, &node, true);
      end if
      if !lock.effi_popflag && PW_get(&node, lock.effi_head) = node then
        // The thread is selected for dequeuing
        lock.perf_popflag = true;
        break;
      end if
    end while
    pop_from_perfqueue(lock, &node);
    // Dequeuing acquires the underlying Ticket lock
  else
    // Threads executing on efficiency cores are added to the wait queue of the efficiency cores
    ...  // Handling is analogous to that of the performance-core queue
  end if
FastPath:
  Ticket_mutex_lock(&lock.Ticket);
  // Successfully acquiring the Ticket lock signifies a successful locking operation
  return 0;
}

3.2. Design of CAL Lock Algorithm

Core-aware queue insertion: Once a thread issues a lock request, CAL first checks whether the underlying ticket lock is locked. If it is not locked and no_steal is 0, the current state is one of low contention, so the thread directly obtains the ticket lock and returns success. Otherwise, the thread checks the type of core on which it runs: if it is running on a big core, it enters the big-core waiting queue; otherwise, it enters the little-core waiting queue. T_enqueue and T_alone (we measure the T_alone of each critical section in the warm-up phase) are recorded when the thread enters the queue. After enqueuing, if the thread is at the head of its queue, it sets no_steal to 1 to prevent subsequent threads from stealing the ticket lock, which could otherwise block the queued threads unfairly. At the same time, CAL appoints the head of the queue as the rearranger, which is responsible for the reordering process.
Core-aware node reordering: The second stage realizes the internal reordering of the big- and little-core queues. In real applications, multiple critical sections often need to be protected by the same lock (e.g., the same variable shared by multiple threads is used in different procedures). Therefore, threads running on the same core type can still have different processing times because they access different critical sections.
Stage 2 aims to achieve the fairness goal by reordering within the big- and little-core queues. Figure 5 shows the reordering process for the big-core queue (the little-core queue is handled in the same way).
For each newly arriving thread, the rearranger goes through the queue twice to find the best insertion position. On the first pass, CAL calculates the slowdown ratio of each thread from the rearranger thread to the new entrant. Note that the $T^{shared}$ of thread i equals the sum of the completion times of the preceding i−1 threads plus the $T_{alone}$ of thread i itself. On the second pass, for each available insertion position, CAL recalculates the slowdown ratios of the two affected thread nodes (the newly enqueued node and the node it delays) in reverse order and then recalculates the fairness of the queue. We record the insertion position that yields the maximum fairness and insert the new entrant node there. In this way, threads with shorter critical sections receive a certain priority and are placed as close to the head of the queue as possible, while threads with long critical sections do not wait too long, maintaining a high level of fairness within the big- and little-core queues.
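The position search of stage 2 can be sketched as follows. This snippet works on a flat array snapshot of the waiting window rather than on the linked queue, and all names are illustrative: it simply tries every insertion slot for the new node and keeps the one that maximizes the fairness defined in Section 2.3.2.

#include <stddef.h>
#include <float.h>

/* t_alone[0..n-1] holds the T_alone of the n nodes already waiting (head
 * first); new_alone is the T_alone of the newly arrived node. Returns the
 * insertion slot (0..n) that maximizes the fairness of the queue. */
size_t best_insert_pos(const double *t_alone, size_t n, double new_alone) {
    double best_fair = -1.0;
    size_t best_pos = n;                          /* default: append at the tail */
    for (size_t pos = 0; pos <= n; pos++) {
        double elapsed = 0.0, min_s = DBL_MAX, max_s = 0.0;
        for (size_t i = 0; i <= n; i++) {         /* walk the hypothetical queue */
            double alone = (i == pos) ? new_alone
                                      : t_alone[i < pos ? i : i - 1];
            elapsed += alone;                     /* T_shared of node i          */
            double s = elapsed / alone;           /* slowdown ratio of node i    */
            if (s < min_s) min_s = s;
            if (s > max_s) max_s = s;
        }
        double fair = min_s / max_s;
        if (fair > best_fair) { best_fair = fair; best_pos = pos; }
    }
    return best_pos;
}

The real rearranger performs this search incrementally on the doubly linked queue and only over the sliding window described in Section 3.4.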
Node selection: Finally, we need to select a node from the head of the big-core queue or the head of the little-core queue to acquire the ticket lock; a thread that successfully obtains the ticket has completed the lock operation. To achieve higher fairness, we take into account both the computing performance and the waiting time of the big and little cores. We refer to the classic highest-response-ratio scheduling algorithm and adapt it to our design requirements. We define the proportional wait time [5] of a thread as:
$PW = (T_{wait} + T_{other}) / T_{alone}$
In the above formula, $T_{wait}$ is the time the thread has been waiting since it joined the queue, $T_{alone}$ is the individual completion time of the thread itself, and $T_{other}$ is the $T_{alone}$ of the thread at the head of the other queue. We calculate the PW of the thread at the head of the big-core queue and of the little-core queue, respectively, and then dequeue the thread with the larger PW so that it obtains the underlying ticket lock. Note that if the little-core queue is empty, the thread at the head of the big-core queue obtains the underlying ticket lock directly without calculating the PW, and vice versa.
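Reusing the cal_node sketch from Section 3.1, stage 3 can be expressed as a small selection routine (the timestamp source and the function names are assumptions, not the paper's code):

#include <stdint.h>

static double proportional_wait(const struct cal_node *head,
                                const struct cal_node *other_head,
                                uint64_t now) {
    uint64_t t_wait  = now - head->t_enqueue;                 /* waiting time       */
    uint64_t t_other = other_head ? other_head->t_alone : 0;  /* other queue's head */
    return (double)(t_wait + t_other) / (double)head->t_alone;
}

/* pick the queue head with the larger PW; if one queue is empty, the other
 * head proceeds directly to the underlying ticket lock */
static struct cal_node *select_next(struct cal_node *big_head,
                                    struct cal_node *little_head,
                                    uint64_t now) {
    if (big_head == 0)    return little_head;
    if (little_head == 0) return big_head;
    return proportional_wait(big_head, little_head, now) >=
           proportional_wait(little_head, big_head, now) ? big_head
                                                         : little_head;
}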
CAL Optimization: We put the big-core and little-core threads into different queues, and we decouple the locking process from the reordering process; that is, waiting threads perform the reordering operation, so reordering happens off the critical path, greatly reducing the latency overhead of our algorithm. In addition, our reordering process guarantees three key properties: (a) reordering always starts from the head of the queue; (b) at any time, only one thread per queue acts as the rearranger (i.e., reordering is single-threaded); and (c) the rearranger identity can be transferred to another thread.

3.3. An Example of CAL Lock

CAL achieves fairness both within and between the big- and little-core queues. Figure 6 demonstrates an example of the CAL lock algorithm: (a) no thread has applied for the lock yet, so the underlying ticket lock is unlocked and the pointers to the big-core and little-core queues are NULL; (b) thread t0 requests the lock; because the ticket lock is not locked and the no_steal field is 0, t0 directly acquires the ticket lock and enters the critical section; (c) t1 enters the waiting queue through the SWAP atomic instruction because the underlying ticket lock is locked, and sets no_steal to 1 to prevent subsequent threads from stealing the ticket lock; (d)-(e) t2-t8 enter their corresponding waiting queues in turn, and since t1 and t2 are the heads of the big-core and little-core queues, respectively, they act as the rearrangers and start the reordering operation; (f) t0 completes its task and exits the critical section, so the ticket lock becomes unlocked; the heads of the big-core and little-core queues (exiting the reordering process if they are in it) calculate their PW, and one thread is selected to obtain the ticket lock and enter the critical section (here t1), while the rearranger identity is passed to t5 and t4, respectively; (g) t3 becomes the new head of the big-core queue and t2 remains unchanged; the new rearrangers t5 and t4 begin the rearrangement operation; (h) finally, only thread t8 holds the ticket lock, the pointers to the big-core and little-core queues are set to NULL, and the no_steal field is reset to re-enable the stealing operation; (i) t8 completes its task and exits the critical section, and the locking process ends.

3.4. Overhead Analysis

The time overhead mainly comes from the second stage of the CAL algorithm, which requires going through the queue twice for each node and thus has a time complexity close to n^2 for n nodes. We apply two optimizations to mitigate this overhead. First, we decouple the reordering operation from the lock operation, letting the threads waiting in the queue perform the reordering so that this time-consuming work occurs off the critical path. Second, we use a sliding window to bound the reordering: the number of threads reordered at a time is limited to a predefined window size. When the rearranger reaches at_head and the ticket lock is unlocked, or when its status changes from waiting to at_head, the rearranger exits the reordering process and participates in the ticket lock acquisition in a timely manner. With these optimizations, the time cost of our algorithm becomes negligible.
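The windowed reordering loop described above might look roughly like this, reusing the structures sketched in Section 3.1; REORDER_WINDOW, ticket_is_unlocked, and reorder_one_arrival are hypothetical names standing in for the real helpers:

#define REORDER_WINDOW 8

extern int ticket_is_unlocked(const struct cal_lock *lock);
extern int reorder_one_arrival(struct cal_lock *lock, struct cal_node *self);

static void rearrange(struct cal_lock *lock, struct cal_node *self) {
    int handled = 0;
    while (handled < REORDER_WINDOW) {
        /* bail out as soon as this node can compete for the ticket lock,
         * so reordering never sits on the lock hand-off path */
        if (self->at_head && ticket_is_unlocked(lock))
            break;
        if (!reorder_one_arrival(lock, self))    /* no new arrival left to place */
            break;
        handled++;
    }
}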
In terms of space overhead, we reuse the native MCS nodes and extend each node with five members: T_enqueue, T_alone, rearranger, at_head, and the backward pointer pre. This memory overhead is negligible. In the experimental section, we implement the CAL algorithm and compare it with the default lock algorithms in Linux, validating its effectiveness. According to the performance results on benchmarks and a real application, the overhead induced by the CAL algorithm is acceptable.

3.5. Implementation

Our implementation is built on top of the LiTL library [14], an open-source library widely used to transparently replace native pthread mutexes in applications [13,15,16,17,18,19]. It intercepts a series of Pthread synchronization interfaces through the LD_PRELOAD environment variable in Linux and redirects them to our own lock algorithm library. LiTL has low overhead and also supports condition variables. In total, we added 368 lines of code to implement our lock algorithm on top of the LiTL library. LiTL supports pluggable lock algorithms, allowing us to implement custom lock and unlock functions, and it maintains a mapping between each instance of the standard Pthread lock (pthread_mutex_t) and an instance of the plugged-in lock type. Leveraging the LiTL library, we implemented the CAL lock and unlock functions; when the application calls a Pthread lock function intercepted by LiTL, it invokes the CAL algorithm.
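Conceptually, the interposition works as below. This is a simplified sketch of the mechanism, not LiTL's actual source; cal_lookup_or_create and the CAL_mutex_* functions are placeholders for the mapping and locking routines mentioned above.

#include <pthread.h>

typedef struct CAL_mutex CAL_mutex_t;                 /* opaque CAL lock instance */
extern CAL_mutex_t *cal_lookup_or_create(pthread_mutex_t *m);
extern int CAL_mutex_lock(CAL_mutex_t *l);
extern int CAL_mutex_unlock(CAL_mutex_t *l);

/* Exported from the preloaded shared library: these definitions shadow the
 * libc symbols, so unmodified applications transparently use CAL. */
int pthread_mutex_lock(pthread_mutex_t *m) {
    return CAL_mutex_lock(cal_lookup_or_create(m));
}

int pthread_mutex_unlock(pthread_mutex_t *m) {
    return CAL_mutex_unlock(cal_lookup_or_create(m));
}

An application is then run unchanged, e.g., by setting LD_PRELOAD to the CAL-enabled LiTL library before launching it.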

4. Experiment

We carried out extensive tests and analyses within both microbenchmark and real-world application environments to confirm the effectiveness of our algorithm. We employed a triad of performance metrics: fairness, throughput, and tail latency.

4.1. Experimental Environment

Table 1 presents the server configuration information of our experiment. Our experiment ran on the 13th Gen Intel (R) Core (TM) i5-13400 processor, with the comprehensive CPU specifications depicted in Table 2. This processor is equipped with six big cores and four little cores, where the base frequency for the former is 2.5 GHz and that for the latter is 1.8 GHz. In order to obtain stable experimental results, we turned off the dynamic frequency tuning and hyper-threading technology of the processor. This allows the cores to operate at relatively stable frequencies during processing tasks. At the same time, we bound the threads to a fixed core during runtime to prevent them from migrating among cores, which is also a widely used evaluation method [2,10,16,20]. We compared CAL with several representative classical lock algorithms to validate the effectiveness of our algorithm.
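Thread-to-core binding of this kind is typically done with the Linux affinity API; the following sketch (core numbering is an assumption based on Table 2: ids 0-5 for big cores, 6-9 for little cores) shows how a thread can be pinned to one core:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* pin a thread to a single core so it cannot migrate during the run */
static int pin_to_core(pthread_t thread, int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(thread, sizeof(cpu_set_t), &set);
}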

4.2. Micro-Benchmarks

We conducted two sets of benchmarks to evaluate the performance of CAL across a series of workload scenarios.
A benchmark with high contentions: First, we conduct a benchmark with a fixed high contention level, which is representative of the most prevalent and critical contention levels encountered in lock-intensive multithreaded applications and is pivotal for assessing the effectiveness of lock algorithms. We used all cores (six big cores and four little cores) to run the threads competing for a shared lock to execute four distinct critical sections with varying lengths. Each critical section manipulated a predefined number of rows within the shared cache. In order to emulate the execution of non-critical section operations, we inserted a constant quantity of 12,000 nop instructions between every pair of lock acquisitions.
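The structure of this microbenchmark is roughly the following (a sketch under our own naming; the shared data, stop flag, and counters are illustrative):

#include <pthread.h>
#include <stdint.h>

#define NOP_PADDING 12000               /* nops between two lock acquisitions */

extern pthread_mutex_t shared_lock;     /* transparently replaced by CAL via LiTL */
extern volatile uint64_t shared_rows[64];
extern volatile int stop;               /* set by the main thread when the run ends */

static void *worker(void *arg) {
    int rows = *(int *)arg;             /* length of this thread's critical section */
    uint64_t ops = 0;
    while (!stop) {
        pthread_mutex_lock(&shared_lock);
        for (int r = 0; r < rows; r++)  /* critical section: touch shared lines */
            shared_rows[r]++;
        pthread_mutex_unlock(&shared_lock);
        for (int i = 0; i < NOP_PADDING; i++)   /* non-critical work */
            __asm__ volatile("nop");
        ops++;
    }
    return (void *)(uintptr_t)ops;      /* per-thread throughput sample */
}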
In this section, we selected five classical lock algorithms to compare with CAL.
  • Ticket lock [21]. This lock algorithm embodies FIFO locking: big and little cores acquire the lock in FIFO order.
  • Pthread lock [22], ttas [8], and spinlock [8,23]. These lock algorithms are competition-based locks; big and little cores compete for the lock in an unpredictable order.
  • Cptltkt lock [10]. This lock algorithm implements a cohort lock, comprising a global partitioned ticket lock and local ticket locks.
The fairness results are shown in Figure 7a. Under high contention, CAL maintains a high fairness of nearly 0.96, while the fairness of the other locks drops as low as 0.53. Figure 7b shows the throughput and latency results. Compared with the FIFO-based ticket lock, CAL achieves a 7.5% higher throughput while maintaining a lower tail latency. Among the three competition-based locks, ttas is big-core-friendly (i.e., the success rate of atomic operations on big cores is much higher), so its throughput is the highest; however, this advantage comes at the cost of a non-negligible tail latency compared to CAL. Both the pthread lock and the spinlock are little-core-friendly, which leads to a collapse of both throughput and latency. Given that our experimental environment comprises only a single NUMA node, the global lock component within the hierarchical lock (cptltkt) does not achieve its full potential; hence, its performance is comparable to that of the local ticket lock.
A benchmark with varying contention: Secondly, we conduct a benchmark under variable contention to emulate the dynamic nature of contention levels, which typically vary with time and workload in real-world scenarios. In this benchmark, we gradually increased the number of threads from 1 to 10, simulating the transition from mild to high contention. The results are shown in Figure 8. Figure 8a shows the variation in fairness across contention levels. When running on the first six big cores, FIFO locks have the highest fairness, and the three competition-based locks can also maintain relatively high fairness because the success rates of their atomic operations are approximately equal. However, as the workload expands onto the little cores, these lock algorithms face fairness issues, while CAL consistently sustains relatively high fairness. The throughput and latency results are shown in Figure 8b and Figure 8c, respectively. When the number of threads is low, all lock algorithms exhibit similar throughput. When extended to little cores, the big-core-friendly ttas lock achieves the highest throughput but suffers from relatively high latency, while the pthread lock and spinlock collapse in both throughput and latency. The ticket lock and cptltkt show the lowest latency among the tested algorithms, but their throughput is not as good as CAL's.

4.3. Application Benchmarks

We evaluate LevelDB to demonstrate the practical effectiveness of CAL. LevelDB is a widely adopted high-performance key-value storage database [24]. We use readrandom in LevelDB's built-in db_bench for testing, in which the get operation acquires a global lock to obtain a snapshot of the internal database structure. To conduct throughput and latency evaluations, we made appropriate modifications to db_bench: we inserted a FLAGS_duration flag to provide a fixed program running time and modified the readrandom function to honor this parameter. We set FLAGS_duration to 10 s, and each test was repeated five times to obtain average results. In the fairness experiment, we varied the number of threads from 1 to 10 and measured the slowdown ratio of each thread performing one million readrandom operations to calculate the fairness value.
The experimental results are shown in Figure 9. In Figure 9a, we observe that all the lock algorithms maintain high fairness when running on the first six big cores. The FIFO-based ticket lock, due to its inherent strict lock order, demonstrates higher fairness compared to ttas, Pthread lock, and other competition-based locks. With the introduction of little cores, the fairness of locks except CAL has significantly declined, with the lowest drop to 0.56 (ttas), while CAL has always maintained high fairness (above 0.9). In Figure 9b, the throughput of CAL is only lower than Pthread lock’s, but the tail latency is less than half of Pthread lock’s. The ticket lock and cptltkt lock show slightly higher latency than CAL but significantly lower throughput levels. Spinlock has the lowest throughput, which is 53% lower than CAL’s throughput. Furthermore, the tail latency of spinlock and ttas are appreciably higher than that of CAL.

5. Related Work

Locks remain the prevailing synchronization mechanism in concurrent systems and have been widely studied for various application scenarios in recent years. However, lock algorithms for the big.LITTLE multicore architecture have not been studied widely.
Lock algorithms for NUMA. In recent years, the advent of the NUMA architecture has brought non-negligible challenges to the design of lock algorithms tailored for such systems. David Dice et al. [10] designed a general lock scheme for the NUMA architecture: the cohort lock. A cohort lock consists of a single global lock and a local lock for each NUMA node. The global lock must be thread-oblivious (i.e., the locking thread and the unlocking thread can be different), and the local lock must incorporate cohort detection (i.e., a thread needs to detect whether there are waiting threads). The cohort lock allows the local lock to be passed within the same NUMA node, thereby minimizing cross-node traffic overhead, and it periodically switches between NUMA nodes to achieve long-term fairness. Radovic et al. designed the hierarchical backoff lock (HBO) [25], a ttas lock with a backoff scheme that reduces cross-node lock contention. Sanidhya Kashyap et al. proposed shuffling [15], which decouples lock design from policy execution, such as NUMA awareness and parking/waking policies, to achieve high performance without increasing memory overhead. The above lock algorithms are designed for the NUMA architecture, accounting for the non-uniform memory costs of different NUMA nodes while periodically switching between nodes to ensure long-term fairness. However, the inherent asymmetry of the big.LITTLE architecture results in throughput degradation when big and little cores compete for the same lock, as detailed in Section 2.3, and these NUMA-oriented designs do not address it. Therefore, these lock algorithms cannot be directly adopted on the big.LITTLE multicore architecture.
Delegation-based lock algorithms. These locks let a thread delegate the execution of its critical section to another (server) thread, thereby enhancing the cache locality of critical-section operations and adapting better to extreme lock contention. Jean-Pierre Lozi et al. [26] designed Remote Core Locking (RCL) to accelerate the execution of critical sections of applications on multicore architectures. The idea of RCL is to replace lock acquisitions with remote calls to a dedicated server hardware thread. RCL effectively mitigates the performance collapse that occurs when many threads try to acquire a lock, and it reduces the need to transfer shared data, since such data often remains cached on the server thread's core. Sepideh Roghanchi et al. [27] studied delegation and synchronized access to shared memory and found that delegation locking is much faster than traditional locking in a series of common cases; they proposed ffwd, a fast and lightweight delegation lock with a highly optimized design that sustains its scalability advantages. Although delegation-based algorithms can accelerate the processing of server threads, they require a lot of polling, which can lead to high latency. Moreover, these algorithms require critical sections to be formulated as closures, increasing the complexity of the programming task. Therefore, the delegation-based approach is not suitable for the big.LITTLE architecture.
Automatically tuned locking algorithms. These algorithms automatically tune internal parameters at runtime. For example, reactive locking tunes the back-off time in a spinlock to reduce contention, and mutlock adjusts the number of busy-spinning threads to hide the wake-up latency of blocking locks. Anna R. Karlin et al. [28] set a spin threshold before blocking to obtain better throughput. In general, these automatically tuned locking algorithms adopt different tuning strategies for different goals. Our proposed CAL can also be seen as a kind of automatic tuning, in that it reorders lock requests according to the performance difference between big and little cores to improve fairness and throughput.
Lock-free data structures. With the development of parallel applications, lock-free data structures have increasingly attracted attention from industry and academia [29,30,31,32]; common examples include lock-free queues. Implementing lock-free data structures relies mainly on the underlying atomic instructions, because an atomic instruction either succeeds or fails as a single indivisible operation. When multiple threads execute an atomic instruction simultaneously, only one thread succeeds, and the others roll back to the initial state. Lock-free data structures are efficient under low contention; however, frequent rollbacks caused by failed atomic operations introduce substantial overhead in highly contended scenarios. Furthermore, the prevalence of legacy code exacerbates the complexity of integrating lock-free data structures into existing systems. As a result, despite their advantages, the practical application scope of lock-free data structures remains relatively constrained.

6. Conclusions

In this paper, we analyze the throughput decline and long-tail-latency issues induced by skewed lock assignment between big and little cores, which stems from the inherent performance diversity of the big.LITTLE multicore architecture. Targeting this architecture, we design a core-aware lock named CAL, which reorders lock requests based on a fairness computational model. CAL operates in three main stages: core-aware queue insertion, fairness-aware node reordering, and lock request selection. These stages ensure that both big cores and little cores have equal opportunities to access critical sections. Evaluations on benchmarks and on the real-world application LevelDB show that CAL achieves the fairness goal for the big.LITTLE multicore architecture. Compared with the competition-based lock algorithms, CAL's fairness increases by up to 67%, and its throughput is 26% higher than FIFO-based locks and 53% higher than competition-based locks. In addition, the tail latency of CAL always remains at a low level.

Author Contributions

Conceptualization, S.N.; methodology, S.N.; software, Y.L.; validation, S.N.; formal analysis, Y.L.; investigation, Y.L.; resources, W.W.; data curation, J.N.; writing—original draft preparation, Y.L.; writing—review and editing, S.N.; visualization, J.N.; supervision, W.W.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the open project of Satellite Internet Key Laboratory in 2023 (Project 3: Research on Mixed reality deduction technology for mega constellation operation and maintenance) funded by Shanghai Key Laboratory of Satellite Network and Shanghai Satellite Network Research Institute Co., Ltd.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

The authors would like to thank the reviewers for their thoughtful comments and efforts toward improving our manuscript.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Boyd-Wickizer, S.; Kaashoek, M.F.; Morris, R.; Zeldovich, N. Non-scalable locks are dangerous. In Proceedings of the Linux Symposium, San Diego, CA, USA, 27–31 August 2012; pp. 119–130. [Google Scholar]
  2. Dice, D. Malthusian locks. In Proceedings of the Twelfth European Conference on Computer Systems, Belgrade, Serbia, 23–26 April 2017; pp. 314–327. [Google Scholar]
  3. Salami, B.; Noori, H.; Naghibzadeh, M. Fairness-aware energy efficient scheduling on heterogeneous multi-core processors. IEEE Trans. Comput. 2020, 70, 72–82. [Google Scholar] [CrossRef]
  4. Mascitti, A.; Cucinotta, T. Dynamic Partitioned Scheduling of Real-Time DAG Tasks on ARM big. LITTLE Architectures. In Proceedings of the 29th International Conference on Real-Time Networks and Systems, Nantes, France, 7–9 April 2021; pp. 1–11. [Google Scholar]
  5. Tavakkol, A.; Sadrosadati, M.; Ghose, S.; Kim, J.; Luo, Y.; Wang, Y.; Ghiasi, N.M.; Orosa, L.; Gómez-Luna, J.; Mutlu, O. FLIN: Enabling fairness and enhancing performance in modern NVMe solid state drives. In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, USA, 1–6 June 2018; pp. 397–410. [Google Scholar]
  6. Dice, D. Brief announcement: A partitioned ticket lock. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures, New York, NY, USA, 4–6 June 2011; pp. 309–310. [Google Scholar]
  7. Craig, T. Building FIFO and Priority-Queuing Spin Locks from Atomic Swap; Technical Report TR 93-02-02; Department of Computer Science, University of Washington: Seattle, WA, USA, 1993. [Google Scholar]
  8. Anderson, T.E. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst. 1990, 1, 6–16. [Google Scholar] [CrossRef]
  9. Scott, M.L. Shared-Memory Synchronization; Morgan & Claypool Publishers: San Rafael, CA, USA, 2013. [Google Scholar]
  10. Dice, D.; Marathe, V.J.; Shavit, N. Lock cohorting: A general technique for designing NUMA locks. ACM Sigplan Not. 2012, 47, 247–256. [Google Scholar] [CrossRef]
  11. Chabbi, M.; Fagan, M.; Mellor-Crummey, J. High performance locks for multi-level NUMA systems. ACM Sigplan Not. 2015, 50, 215–226. [Google Scholar] [CrossRef]
  12. Kashyap, S.; Min, C.; Kim, T. Scalable {NUMA-aware} Blocking Synchronization Primitives. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 17), Santa Clara, CA, USA, 12–14 July 2017; pp. 603–615. [Google Scholar]
  13. Liu, N.; Gu, J.; Tang, D.; Li, K.; Zang, B.; Chen, H. Asymmetry-aware scalable locking. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, 2–6 April 2022; pp. 294–308. [Google Scholar]
  14. Guiroux, H.; Lachaize, R.; Quéma, V. Multicore Locks: The Case Is Not Closed Yet. In Proceedings of the USENIX Annual Technical Conference, Denver, CO, USA, 22–24 June 2016. [Google Scholar]
  15. Kashyap, S.; Calciu, I.; Cheng, X.; Min, C.; Kim, T. Scalable and practical locking with shuffling. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, Huntsville, ON, Canada, 27–30 October 2019; pp. 586–599. [Google Scholar]
  16. Dice, D.; Kogan, A. Compact NUMA-aware locks. In Proceedings of the Fourteenth EuroSys Conference 2019, Dresden, Germany, 25–28 March 2019; pp. 1–15. [Google Scholar]
  17. Guerraoui, R.; Guiroux, H.; Lachaize, R.; Quéma, V.; Trigonakis, V. Lock–unlock: Is that all? a pragmatic analysis of locking in software systems. ACM Trans. Comput. Syst. 2019, 36, 1–149. [Google Scholar] [CrossRef]
  18. de Lima Chehab, R.L.; Paolillo, A.; Behrens, D.; Fu, M.; Härtig, H.; Chen, H. Clof: A compositional lock framework for multi-level numa systems. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, Virtual Event, 26–29 October 2021; pp. 851–865. [Google Scholar]
  19. Guiroux, H. Understanding the Performance of Mutual Exclusion Algorithms on Modern Multicore Machines. Ph.D. Thesis, Université Grenoble Alpes, Grenoble, France, 2018. [Google Scholar]
  20. Hendler, D.; Incze, I.; Shavit, N.; Tzafrir, M. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, Thira Santorini, Greece, 13–15 June 2010; pp. 355–364. [Google Scholar]
  21. Reed, D.P.; Kanodia, R.K. Synchronization with eventcounts and sequencers. Commun. ACM 1979, 22, 115–123. [Google Scholar] [CrossRef]
  22. Kylheku, K. What is PTHREAD_MUTEX_ADAPTIVE_NP. Retrieved Novemb. 2014, 8, 2018. [Google Scholar]
  23. Mellor-Crummey, J.M.; Scott, M.L. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. (TOCS) 1991, 9, 21–65. [Google Scholar] [CrossRef]
  24. Ghemawat, S.; Dean, J. LevelDB. 2024. Available online: https://github.com/google/leveldb (accessed on 17 July 2024).
  25. Radovic, Z.; Hagersten, E. Hierarchical backoff locks for nonuniform communication architectures. In Proceedings of the Ninth International Symposium on High-Performance Computer Architecture, Anaheim, CA, USA, 12 February 2003; HPCA-9. pp. 241–252. [Google Scholar]
  26. Lozi, J.P.; David, F.; Thomas, G.; Lawall, J.; Muller, G. Fast and portable locking for multicore architectures. ACM Trans. Comput. Syst. (TOCS) 2016, 33, 1–62. [Google Scholar] [CrossRef]
  27. Roghanchi, S.; Eriksson, J.; Basu, N. Ffwd: Delegation is (much) faster than you think. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017; pp. 342–358. [Google Scholar]
  28. Karlin, A.R.; Li, K.; Manasse, M.S.; Owicki, S. Empirical studies of competitive spinning for a shared-memory multiprocessor. ACM Sigops Oper. Syst. Rev. 1991, 25, 41–55. [Google Scholar] [CrossRef]
  29. Al Bahra, S. Nonblocking Algorithms and Scalable Multicore Programming: Exploring some alternatives to lock-based synchronization. Queue 2013, 11, 40–64. [Google Scholar] [CrossRef]
  30. Harris, T.L. A pragmatic implementation of non-blocking linked-lists. In International Symposium on Distributed Computing; Springer: Berlin/Heidelberg, Germany, 2001; pp. 300–314. [Google Scholar]
  31. Hart, T.E.; McKenney, P.E.; Brown, A.D.; Walpole, J. Performance of memory reclamation for lockless synchronization. J. Parallel Distrib. Comput. 2007, 67, 1270–1285. [Google Scholar] [CrossRef]
  32. Michael, M.M. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, Winnipeg, MB, Canada, 10–13 August 2002; pp. 73–82. [Google Scholar]
Figure 1. Traditional locking algorithms suffer from throughput degradation and latency increases under the big.LITTLE multicore architecture. The CPU in our experiment includes 6 big cores and 4 little cores. The first 6 threads are bound to different big cores, and the last 4 threads are bound to different little cores. Threads acquire the same lock to access the critical section, and they execute a predefined number of nop instructions in the non-critical section to simulate program operations. The CAL algorithm is open-source: https://github.com/nsq974487195/CAL-Lock.git (accessed on 17 July 2024). The throughput in (a) is the number of critical-section operations completed per microsecond. The latency in (b) is the P99 tail latency of staying in the critical section.
Figure 2. An example of performance issue caused by core-unaware locking assignment in big.LITTLE multicore architecture [13].
Figure 3. Fairness and the slowdown ratio of ticket lock in big.LITTLE multicore architecture.
Figure 4. CAL lock structure and MCS node information.
Figure 5. The second stage in CAL: when a newly arriving thread enqueues (the big-core queue is shown above), it goes through the queue twice to find the most suitable insertion position.
Figure 6. An example of CAL lock algorithm.
Figure 7. Comparison in a high contention environment. (a) Lock fairness. (b) Throughput and tail latency.
Figure 8. Variation in a variable contention environment. (a) Fairness. (b) Throughput. (c) Tail latency.
Figure 9. Comparison in LevelDB. (a) Fairness. (b) Throughput and tail latency.
Table 1. Configuration of server.
Property        Parameter
Processor       13th Gen Intel(R) Core(TM) i5-13400
OS              Ubuntu 20.04
Linux Kernel    Linux 5.15.0-89-generic
Glibc           2.31
Gcc             9.4.0
Table 2. CPU parameters.
Property                           Value
Total cores                        10
Number of performance-cores        6
Number of efficient-cores          4
Performance-core base frequency    2.50 GHz
Efficient-core base frequency      1.80 GHz
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
