1. Introduction
Key–value stores (KV stores) are used in many industrial and academic applications due to their simplicity and high performance [1,2,3,4,5,6,7]. KV stores are non-relational databases that store, retrieve, and manage data as a collection of key–value pairs. Each key in a KV store serves as a unique identifier for accessing its associated value. Both keys and values can be simple data types, such as strings or numbers, or more complex structures, such as JSON objects or binary data. Unlike relational databases (RDBs), which rely on predefined schemas such as tables composed of fields with well-defined data types, KV stores are schemaless, allowing simple and flexible data storage. Because they simplify the management of data and of the relationships between data items, KV stores are used in many different layers of operating systems, including KV store-based file systems [8,9,10,11,12,13]. As KV stores are adopted in a wide variety of applications, it is becoming increasingly important to improve their performance on emerging hardware.
In recent years, advancements in hardware, such as multi-core CPUs and high-performance storage devices, have been introduced to improve application performance. However, these hardware innovations require corresponding optimizations in the software stack to fully leverage the hardware’s capabilities. For instance, multi-core CPUs require parallelization of the existing software stack to take full advantage of the increased number of cores. Similarly, high-performance storage devices need more aggressive polling-based, multi-threaded I/O operations to exploit their fast response time and concurrency features. Without proper optimization in the software stack, the potential benefits offered by the emerging hardware may be lost. In some cases, the overall performance of the application could remain unchanged or even degrade due to mismatches between the hardware capabilities and the existing software stack.
To handle these issues, previous studies have optimized the I/O stack to fully exploit the high-performance capabilities of multi-core CPUs and emerging storage devices. IceFS [14] and SpanFS [15] leverage hardware capabilities by partitioning the journaling file system. SpanFS consists of a collection of micro file system services known as domains. It distributes files and directories across these domains and delegates I/O operations to the corresponding domain to avoid lock contention. It also exploits device parallelism for shared data structures and parallelizes file system services. IceFS enables concurrent I/O operations by separating the file system into directory subtrees. Son et al. [16] optimized the journaling and checkpointing operations of an existing file system by utilizing idle application threads to participate in I/O operations. ScaleFS [17] introduced a per-core operation log within the file system to improve scalability across multiple cores. MAX [18] introduced a multi-core-friendly log-structured file system that exploits flash parallelism by scaling user I/O operations with newly designed semaphores. These studies concentrate on optimizing the I/O stack within operating systems, particularly improving file system scalability. In contrast, our approach aims to optimize performance independently of the underlying systems (e.g., file systems) by utilizing idle computing resources within the KV store to accelerate I/O operations.
We first analyze the I/O operations performed in an existing KV store, focusing on the widely used RocksDB [19,20], to identify the performance bottleneck. Our analysis of the I/O path in RocksDB reveals that background I/O operations, such as flush and compaction, cause stalls during foreground I/O operations. As a result, even with an increased number of threads allocated to handle foreground I/O operations, most of these threads remain in a wait state until the background jobs are completed. To address this issue, we focus on the compaction procedure, one of the KV store's background I/O operations, in which multiple files must be read to sort key–value pairs. This procedure involves numerous recursive read operations, but the existing implementation of RocksDB utilizes only a single thread for them. To optimize this, we enable concurrent file read operations by allocating a temporary read buffer and prefetching files into it.
We implement our scheme within the existing RocksDB and evaluate its performance using the db_bench benchmark, varying the number of threads and using different machines. The experimental results demonstrate that our scheme improves the performance of KV stores by parallelizing I/O operations utilizing idle computing resources.
In summary, our main contributions are as follows:
We analyze the I/O path and scalability within the existing KV store, RocksDB.
We design parallel read operations to accelerate background jobs, optimizing the KV store for multi-core CPUs and high-performance storage devices.
We demonstrate that our proposed scheme improves the overall performance of the KV store by involving idle computing resources in I/O operations.
The rest of this paper is organized as follows:
Section 2 discusses the related work.
Section 3 describes the background and presents a performance analysis of RocksDB on high-performance storage devices under various configurations.
Section 4 presents the design of our scheme to fully utilize hardware capabilities.
Section 5 shows the experimental results.
Section 6 explains the limitations of the study and discusses future work.
Section 7 concludes the paper.
3. Background and Motivation
3.1. High-Performance Storage Devices
With the advancement of flash memory, flash-based SSDs and NVMe SSDs are widely used and are rapidly replacing traditional tape-based storage and magnetic hard disk drives (HDDs). SSDs offer significantly faster throughput and lower latency compared to HDDs and have no moving parts, which enhances their durability and endurance.
A key difference that must be considered in the software stack is that SSDs are typically equipped with multiple channels. In contrast to HDDs, where multiple requests from the host are serialized into a single queue and handled one at a time, SSDs can handle multiple requests simultaneously. This capability arises from the architecture of SSDs, which connect multiple chips (groups of flash cells) to multiple channels. Although each channel can accept only one request at a time, the presence of multiple channels allows simultaneous access to different chips.
This parallelism enables the host to issue requests to the SSD in a parallel and concurrent manner, rather than a serialized one. However, existing applications and I/O stacks are optimized for HDDs and are thus designed to issue requests serially. This approach utilizes only a single channel, despite the storage device’s ability to handle requests across multiple channels. Therefore, a new approach is required to fully exploit the potential of multiple channels in storage devices by employing multiple threads to issue I/O requests in a parallel and concurrent manner.
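The contrast between serialized and parallel issuance can be sketched in a few lines of Python. This is an illustration only: the file names, sizes, and worker count are arbitrary, and the channel-level parallelism itself is realized inside the device and the kernel I/O stack, not in user code.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    # One whole-file read request; with several such calls in flight,
    # the SSD can service them on different channels concurrently.
    with open(path, "rb") as f:
        return f.read()

def read_serial(paths):
    # HDD-era pattern: one outstanding request at a time.
    return [read_file(p) for p in paths]

def read_parallel(paths, workers=4):
    # SSD-friendly pattern: multiple requests in flight at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_file, paths))

# Create a few dummy files to read back.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(tmpdir, f"chunk{i}.bin")
    with open(p, "wb") as f:
        f.write(bytes([i]) * 1024)
    paths.append(p)

serial_result = read_serial(paths)
parallel_result = read_parallel(paths)
```

Both functions return the same data; the difference lies only in how many requests are outstanding at once, which is precisely what multi-channel SSDs can exploit.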
3.2. Existing I/O Path Analysis of RocksDB
As RocksDB is a widely used KV store, it is important to fully understand its I/O path and examine how it is implemented to support high performance and data robustness.
Figure 1 illustrates the existing architecture of RocksDB. Incoming insert operations store key–value pairs in the in-memory write buffer, known as the memtable. When the memtable becomes immutable after accumulating sufficient key–value pairs, a background job, the flush operation, stores the immutable memtable as a level 0 file in persistent storage. RocksDB organizes these files, which store key–value pairs, into multiple levels based on data access recency. Since duplicate keys are not allowed within a given level and key–value pairs are stored in order within each level, the compaction job continuously performs read and write operations to manage the number of files at each level. This process keeps the number of files at each level below a threshold, thereby maintaining the desired characteristics of the KV store across all levels.
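This memtable/flush/compaction flow can be illustrated with a deliberately simplified sketch. The thresholds, the two-level layout, and the in-memory "runs" are all illustrative assumptions; real RocksDB persists sorted SST files on storage and manages many levels.

```python
MEMTABLE_LIMIT = 4      # illustrative flush threshold
LEVEL0_LIMIT = 2        # illustrative compaction trigger

class MiniLSM:
    """Toy model of the write path: memtable -> flush -> level 0 -> compaction."""
    def __init__(self):
        self.memtable = {}  # mutable in-memory write buffer
        self.level0 = []    # flushed runs, newest last; key ranges may overlap
        self.level1 = []    # one merged, sorted run; no duplicate keys

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Flush: write the immutable memtable out as a sorted level-0 run.
        self.level0.append(sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.level0) > LEVEL0_LIMIT:
            self.compact()

    def compact(self):
        # Compaction: merge all runs; newer entries win for duplicate keys.
        merged = dict(self.level1)
        for run in self.level0:            # oldest run first
            merged.update(run)
        self.level1 = sorted(merged.items())
        self.level0 = []

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.level0):  # newest flushed run first
            for k, v in run:
                if k == key:
                    return v
        for k, v in self.level1:
            if k == key:
                return v
        return None
```

The sketch makes the key point of this section concrete: every flush and every compaction is extra read/write work that happens off the foreground insert path, yet foreground inserts cannot proceed indefinitely without it.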
Algorithm 1 presents a pseudocode representation of the existing I/O path in RocksDB during an insert operation.
Algorithm 1 Existing I/O path pseudocode
1:  procedure DoWrite()
2:      if write_thread_ is Leader then
3:          mutex_.Lock()
4:          PreprocessWrite()
5:          if write_controller_.IsStopped() or .NeedsDelay() then
6:              DelayWrite()
7:              Wait()
8:          end if
9:          mutex_.Unlock()
10:         WriteWAL()
11:         SyncWAL()
12:         Signal()
13:     else
14:         DelayWrite()
15:         Wait()
16:     end if
17:     LaunchParallelMemTableWriters()
18: end procedure
As shown, RocksDB initiates the insert operation with the DoWrite() function. In this function, the first thread to arrive is selected as the leader thread (lines 2–12). The leader thread acquires a lock and performs preprocessing operations (PreprocessWrite()) to handle the write operation (lines 3–9). It then checks the write controller to determine whether the write operation can proceed, taking into account the status of background operations (line 5). For example, if the flush operation is too slow and no memtable is available for inserting new key–value pairs, the leader thread waits for the flush job to complete, allowing a memtable to become available. Another background job, the compaction job, also causes stalls in foreground operations. As new key–value pairs continue to arrive, the compaction job is executed to keep the number of files in level 0 below a certain threshold, which may cause the leader thread to wait until space becomes available in level 0. In other words, if the system is in a state where new key–value pairs cannot be inserted due to ongoing background I/O operations, the leader thread is stalled (lines 5–8). In summary, the overall progress of the foreground insert operation depends heavily on the performance of background jobs. If progress is possible, the leader thread proceeds with foreground I/O operations, such as writing to the write-ahead log and flushing the key–value pairs in the memtable to the non-volatile storage device (lines 10–12).
In contrast, threads that are not selected as the leader thread are stalled, waiting for the leader thread’s operations to complete. This means that, despite the availability of a large number of threads (a number that keeps growing with CPU core counts), only a single thread is responsible for executing the actual I/O operations. This limitation not only reduces the efficiency of I/O operations but also hampers the overall progress of insert operations within the KV store. The insert operation is delayed until sufficient space is available to temporarily store the key–value pairs. This clearly indicates that the current implementation of RocksDB is not optimized for multi-core CPUs and high-performance storage devices.
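The leader/follower pattern just described can be sketched as follows. This is a minimal illustration, not RocksDB's actual code: the in-memory list stands in for the write-ahead log, and records are assumed to be unique.

```python
import threading

class WriteGroup:
    """Sketch of leader-based group commit: the first writer to arrive
    becomes the leader and performs the batched I/O; later arrivals
    (followers) block until the leader commits their records."""
    def __init__(self):
        self.cond = threading.Condition()
        self.pending = []          # records batched for the next commit
        self.leader_active = False
        self.log = []              # stand-in for the write-ahead log

    def write(self, record):
        with self.cond:
            self.pending.append(record)
            if self.leader_active:
                # Follower: wait until the leader has committed our record.
                while record not in self.log:
                    self.cond.wait()
                return
            self.leader_active = True
        # Leader: the single thread that performs the actual I/O.
        with self.cond:
            batch, self.pending = self.pending, []
            self.log.extend(batch)     # WriteWAL()/SyncWAL() stand-in
            self.leader_active = False
            self.cond.notify_all()     # Signal() the waiting followers
```

With eight threads writing concurrently, one becomes the leader and commits the records in a few batches while the rest merely wait, mirroring the single-writer bottleneck measured in Section 3.3.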
3.3. Existing KV Store Performance Analysis
In this section, we analyze the performance of the existing RocksDB when used with multi-core CPUs and high-performance storage devices, focusing on various configuration parameters that affect scalability. We use the db_bench benchmark, which is included as part of RocksDB and widely used for evaluating the performance of KV stores [30,31]. The experiments are conducted on a machine equipped with an Intel Xeon E5-2620 v4 processor (Intel, Santa Clara, CA, USA) and 32 GB of memory. For storage, we use a Samsung 960 NVMe SSD (Samsung, Suwon-si, Republic of Korea) with a capacity of 512 GB. The evaluation parameters used in the experiments are described in Section 5.1.
Figure 2 presents the performance of RocksDB with varying numbers of application threads, background flush threads, and background compaction threads. By increasing the number of threads, we increase the number of application threads requesting I/O operations from RocksDB. As illustrated in the figure, RocksDB’s performance not only fails to improve but actually decreases as the number of threads increases. This occurs because only a single leader thread is responsible for foreground I/O operations, while the other, non-leader threads often remain idle or compete for locks, leading to increased lock contention as the thread count rises. Since these threads are the ones inserting key–value pairs, the experimental results indicate that the current RocksDB implementation is not optimized for modern multi-core CPUs.
We also measure RocksDB’s performance with increasing numbers of background flush and compaction threads.
Figure 2 includes performance results based on varying the number of threads for background jobs. As previously mentioned, foreground I/O operations are heavily impacted by flush and compaction jobs. Therefore, we analyze the effects of increasing the number of background job threads. While increasing the number of flush and compaction threads could theoretically improve performance, the results show that RocksDB’s performance does not improve with more compaction threads and actually degrades with additional flush threads. Similar to the earlier findings in Figure 2, which showed decreased performance with an increasing number of foreground operation threads, these results indicate that a single thread handles the actual I/O operations while other threads either remain idle or exacerbate lock contention, leading to diminished performance with more flush threads.
In summary, this section demonstrates RocksDB’s performance under different configurations intended to increase parallelism and concurrency by adding more threads. However, as shown in all figures, increasing the number of application, flush, and compaction threads generally does not improve performance and often reduces it. These findings suggest that the current implementation of RocksDB is not optimized for multi-core CPUs and high-performance storage devices, indicating a need for a new approach to fully leverage the potential of emerging hardware.
4. Design and Implementation
In this section, we present our scheme to efficiently utilize emerging hardware. We propose a scheme that engages idle threads in I/O operations, thereby taking advantage of both the high core counts of modern CPUs and the multiple channels within storage devices.
4.1. Design
Figure 3 illustrates the overall design of the proposed scheme. The top part depicts the existing scheme, while the bottom part presents our proposed scheme. In the existing scheme, as discussed in Section 3, only a single thread is selected as the leader and is solely responsible for handling foreground I/O operations, such as insert operations.
During flush and compaction operations, the thread initiating the operation first acquires the data to be sorted and written to the storage device. In the flush operation, the data are already in memory, as the sorted data are received from the user and formed into a table in memory, known as a memtable, as illustrated in Figure 1. In contrast, during compaction, the data must be retrieved from the storage device by locating the files containing keys in the target range and reading them into memory. However, because these I/O operations are not performed in parallel, increasing the number of flush or compaction threads does not improve the performance of RocksDB, as shown in Figure 2.
In our proposed scheme, we expedite background I/O operations by involving threads that are not selected as the leader, utilizing them during their idle states. To achieve this, we adopt a technique commonly used in file system optimization for scalability [15,16]. Instead of having these threads sleep or contend for locks, we enable them to participate in the I/O operations. Given our focus on multi-core systems, we expect a sufficient number of application threads to be available for I/O operations rather than sitting idle. This approach increases the number of threads involved in I/O operations without incurring the overhead associated with thread creation.
As shown in Figure 3a, in the existing design, the threads allocated for foreground I/O operations, except for the leader thread, remain in a wait state. Furthermore, if the write controller delays the execution of the leader thread (during the execution of DelayWrite()) due to intensive background I/O operations, all threads allocated for foreground I/O operations will remain idle. Our key idea is to accelerate the background jobs that delay foreground I/O operations by parallelizing I/O operations using threads that would otherwise be in a wait state.
To implement this, we first identify the files that need to be read during the compaction operation. As shown in the example in Figure 3b, the input files #1 through #n are identified as containing the target key range for the compaction job. We also allocate a temporary read buffer to hold the prefetched target input files. Instead of waiting, the threads originally allocated for foreground I/O operations are now engaged in these background jobs. Consequently, while the leader thread handles foreground jobs such as PreprocessWrite(), the other threads assist by concurrently reading the target input files and loading their contents into the preallocated read buffer. This approach allows the background thread to expedite the compaction process by accessing data from the in-memory read buffer rather than from the storage device, resulting in reduced stall time during foreground I/O operations. Our scheme effectively utilizes the high core count of the CPU and the parallelism provided by emerging storage devices.
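A minimal sketch of this design is shown below. The names and structure are hypothetical; file contents are simulated with a caller-supplied read function, whereas the real implementation operates on RocksDB's SST files.

```python
import threading

class ReadBuffer:
    """Sketch of the proposed prefetch path: idle foreground threads load
    compaction input files into memory so that the compaction thread reads
    from RAM instead of from the storage device."""
    def __init__(self, input_files):
        self.todo = list(input_files)   # compaction targets, in read order
        self.buffer = {}                # file -> prefetched contents
        self.lock = threading.Lock()

    def prefetch_one(self, read_fn):
        # Called by an otherwise-idle foreground thread.
        with self.lock:
            if not self.todo:
                return False            # nothing left to prefetch
            path = self.todo.pop(0)     # claim the file under the lock
        data = read_fn(path)            # slow read, done outside the lock
        with self.lock:
            self.buffer[path] = data
        return True

    def consume(self, path, read_fn):
        # Called by the compaction thread: the fast path hits the buffer.
        with self.lock:
            data = self.buffer.pop(path, None)
        return data if data is not None else read_fn(path)
```

Here the idle foreground threads would repeatedly call prefetch_one() before entering their wait state, while the compaction thread calls consume(); a buffer hit turns a storage read into a memory access.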
4.2. Implementation
We implement the proposed scheme in RocksDB version 5.13. To enable idle foreground threads to participate in background I/O operations, we introduce a global data structure called ReadQueue. Algorithm 2 presents the pseudocode for the process by which multiple threads add to, or remove from, the ReadQueue information about files that need to be read, facilitating concurrent I/O operations. One of the background jobs, the compaction job, adds information about files requiring read I/O operations (lines 1–7). The main procedure of the compaction job is performed in the function ProcessKeyValueCompaction(), which begins by identifying the compaction target files that need to be read from the storage devices (line 2). Since all files stored on the storage devices are sorted, the target files are read sequentially to merge and re-sort valid data. In other words, while all target files for compaction must eventually be read, they are not read all at once. For instance, if six files with file numbers 1 through 6 are selected as compaction targets, the process begins with file number 1. Files 2 and 3, which store overlapping key ranges, are read first, but the read operations for files 4, 5, and 6 are delayed until the merge and re-sorting operations for the preceding key ranges are completed. The proposed scheme records in the global queue the files that must eventually be read, allowing idle threads to perform these inevitable I/O operations in advance.
Algorithm 2 Pseudocode for the proposed scheme
1:  procedure ProcessKeyValueCompaction()
2:      // find compaction target files
3:      if file is target range then
4:          AddFileToReadQueue()
5:      end if
6:      // existing works for compaction    ▹ Do the original jobs
7:  end procedure
8:  procedure DelayWrite()
9:      PrefetchQueueData()
10:     if write_controller_.IsStopped() or .NeedsDelay() then
11:         Wait()
12:     end if
13: end procedure
14: procedure PrefetchQueueData()
15:     if queue.IsEmpty() then
16:         return                          ▹ nothing to do
17:     else
18:         SelectFileToPrefetch()          ▹ protected by a lock
19:         AllocateBuffer()
20:         Read()
21:         RemoveFileFromReadQueue()
22:     end if
23: end procedure
As shown in Algorithm 1 (lines 6–7, 14–15), all threads allocated for foreground I/O operations are stalled and wait for the background jobs to complete. In the proposed scheme, before the threads allocated to foreground I/O operations enter a waiting state, they first check the global queue to see whether any files require read I/O for background jobs (lines 9, 14–23). The PrefetchQueueData() function first checks whether the global queue is empty. If the queue is empty, other threads have already read the target files and no work remains; the function returns, and the thread checks again whether waiting is still necessary (lines 10–11). If the queue is not empty, a file is selected from the global queue for reading (lines 17–22). At this point, to avoid duplicate read operations by multiple threads, a thread acquires a lock, reads the file information, and updates the status of the file to indicate that the read operation is in progress. After selecting a file, the thread allocates space in the read buffer so that the file can be preloaded, and the data contained in the file are read into memory. Once the read operation is complete, the status is updated, and the file information is removed from the global queue to prevent duplicate I/O operations.
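The lock-protected claim inside PrefetchQueueData() is what prevents duplicate reads; the threaded sketch below shows this one mechanism in isolation. It simplifies Algorithm 2 in one labeled way: the queue entry is removed at claim time rather than being marked in progress and removed after the read completes.

```python
import threading

read_queue = [f"file{i:02d}" for i in range(16)]  # files awaiting read I/O
queue_lock = threading.Lock()
completed = []   # (thread_id, path) pairs; list.append is atomic in CPython

def prefetch_queue_data(thread_id):
    # Mirror of PrefetchQueueData(): drain the global queue, claiming each
    # file under the lock so that no two threads read the same file.
    while True:
        with queue_lock:
            if not read_queue:
                return                # queue empty: nothing left to prefetch
            path = read_queue.pop(0)  # select + remove = claim the file
        # The actual (slow) read I/O would happen here, outside the lock.
        completed.append((thread_id, path))

workers = [threading.Thread(target=prefetch_queue_data, args=(i,))
           for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Regardless of how the four workers interleave, each file is claimed exactly once, so no read I/O is duplicated.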
5. Evaluation
5.1. Experimental Setup
We conduct experiments using the db_bench benchmark provided by RocksDB. While db_bench offers multiple benchmark scenarios, we select the fillrandom workload to evaluate our scheme because it incurs high I/O processing overhead. The fillrandom workload first generates keys in random order and then inserts the generated random key–value pairs into the pre-configured database location. The key and value sizes used are 16 B and 16 KB, respectively. We generate and insert 625,000 key–value pairs into RocksDB.
Table 1 presents the evaluation parameters used in this study. Parameter values not listed in the table are set to their default values [20]. The parameters max_background_flushes and max_background_compactions indicate the number of threads allocated for flush and compaction jobs, respectively. To verify the effectiveness of the proposed scheme, we conduct experiments with varying numbers of threads allocated for foreground I/O operations.
We use two different machines in the following experiments to evaluate RocksDB’s performance. The machine specifications are provided in Table 2. I/O operations are configured as direct by setting the parameters use_direct_io_for_flush_and_compaction and use_direct_reads to true, thereby avoiding the overhead associated with transferring data through the kernel. We implement our scheme in RocksDB version 5.13 on Linux kernel version 4.17.0, using the ext4 file system.
5.2. Performance Results
Figure 4 illustrates the performance of RocksDB with both the existing and proposed schemes on two different machines. As depicted, the proposed scheme improves RocksDB’s performance by up to 16% across a range of thread counts, from 1 to 16, on both machines. Specifically, on machine A, the proposed scheme improves performance by 9%, 15%, 15%, 10%, and 11% when 1, 2, 4, 8, and 16 threads are allocated for foreground I/O operations, respectively. On machine B, the proposed scheme yields performance improvements of 9%, 15%, 16%, 13%, and 13% for the same thread allocations.
The performance gains result from utilizing idle computing resources to participate in I/O operations rather than merely waiting for the completion of background jobs. By leveraging threads originally scheduled for foreground I/O operations, the thread allocated to background jobs can concentrate on processing data fetched from the read buffer during compaction. This approach reduces the time required, as the thread responsible for the background job can read data from a prefetched memory region instead of directly from the storage device. Additionally, the proposed scheme not only accelerates overall processing performance but also reduces lock contention among threads attempting to become the leader during foreground I/O operations. However, as shown in the figure, the performance gains diminish when using 8 or 16 threads compared to 2 or 4 threads on both machines. Increasing the number of threads does not yield proportional performance improvements. This is because, as the thread count increases, the overhead associated with acquiring locks during foreground I/O operations and becoming the leader thread also increases.
Furthermore, to assess the effectiveness of the proposed scheme with an increasing number of background threads, we measure performance while varying the max_background_compactions parameter between 2 and 4. Since our scheme is designed to perform concurrent read operations on the files required for compaction, we focus on adjusting the number of threads allocated to compaction jobs.
Figure 5 presents the performance results with varying numbers of background threads. The results indicate that the proposed scheme improves performance by accelerating the background compaction job through concurrent reading of target files while reducing delays during foreground I/O operations.
In summary, our evaluation demonstrates that the proposed scheme can effectively improve the performance of KV stores when a high number of threads are used in conjunction with emerging fast storage devices.
6. Limitation and Future Work
One limitation of our proposed scheme is its reliance on idle computing resources to participate in I/O operations. While this approach can lead to performance gains, these gains diminish as the number of threads allocated for foreground I/O operations increases. As discussed in Section 5.2, when a large number of threads are allocated for foreground I/O operations, contention for acquiring the lock to become the leader thread becomes significant. To address this issue, our scheme could be enhanced by incorporating lock-free data structures and operations based on atomic instructions, as introduced in [16].
For future work, our scheme has the potential to be extended to applications that intermittently require a large amount of I/O operations within a short period. For example, in log-structured file systems, the garbage collection process is triggered when free space falls below a certain threshold. During garbage collection, valid data must be read, and space occupied by invalid data must be reclaimed, both of which demand numerous I/O operations. Additionally, other operations are typically stalled while garbage collection is performed. Our scheme could reduce stall time by utilizing idle resources to assist in garbage collection within multi-core systems, thereby improving overall system efficiency.
7. Conclusions
In this paper, we propose a multi-threaded I/O operation scheme for KV stores to improve performance. Our analysis reveals that the main cause of performance degradation is the lack of scalability in handling I/O operations. While multiple threads are allocated to manage foreground I/O operations, only a single thread performs these tasks, forcing all other threads to wait until this operation is completed, which significantly hinders overall system performance. To address this issue, we utilize idle computing resources to accelerate I/O operations. Specifically, we engage these idle threads to perform background I/O operations that would otherwise delay the execution of foreground tasks. In this way, we effectively reduce the waiting time and increase the efficiency of the I/O handling process.
Previous studies [14,15,16,17,18] have also attempted to enhance performance by increasing parallelism through the use of idle computing resources. These studies have mainly focused on file systems within operating systems, meaning that their performance improvements are realized only when operating within such systems. Although the potential benefits of the proposed scheme may be limited in resource-constrained systems, our study leverages idle resources allocated at the application level, allowing the proposed method to achieve performance gains without being constrained by the underlying system architecture. Our experimental results demonstrate that the proposed scheme improves KV store performance by up to 16% compared to the existing KV store.