Article

A Framework for Integrating Log-Structured Merge-Trees and Key–Value Separation in Tiered Storage

Department of Software, Dankook University, Yongin 16890, Republic of Korea
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(3), 564; https://doi.org/10.3390/electronics14030564
Submission received: 30 December 2024 / Revised: 27 January 2025 / Accepted: 28 January 2025 / Published: 30 January 2025
(This article belongs to the Special Issue Future Trends of Artificial Intelligence (AI) and Big Data)

Abstract

This paper presents an approach that integrates tiered storage into the Log-Structured Merge (LSM)-tree to balance Key–Value Store (KVS) performance and storage financial cost trade-offs. The implementation focuses on applying tiered storage to LSM-tree-based KVS architectures, using vertical and horizontal storage alignment strategies, or a combination of the two. Additionally, these configurations leverage key–value (KV) separation to further improve performance. Our experiments reveal that this approach reduces storage financial costs while offering trade-offs in write and read performance. For write-intensive workloads, our approach achieves competitive performance compared to a fast NVMe Solid State Drive (SSD)-only approach while storing 96% of data on more affordable SATA SSDs. Additionally, it exhibits lookup performance comparable to BlobDB, and improves range query performance by 1.8x over RocksDB on NVMe SSDs. Overall, the approach results in a 49.5% reduction in storage financial cost compared to RocksDB and BlobDB on NVMe SSDs. The integration of selective KV separation further advances these improvements, setting the stage for future research into offloading data to remote storage in LSM-tree-based tiered storage systems.

1. Introduction

In today’s digital era, the exponential growth of data has a profound effect on technology and business operations. The rapidity at which data are generated and processed presents both opportunities and challenges [1]. Storage devices are critical components of the big data age, enabling the storage, retrieval, and processing of vast amounts of data. Advancements in storage technology are opening up new avenues for leveraging data more effectively, driving insights, and fostering innovation.
Cloud storage and distributed storage software have redefined data storage architectures by adopting tiered storage that consists of hierarchical heterogeneous storage devices such as Persistent memory (also called Non-volatile Memory, or NVM), NVMe SSD, SATA SSD, and Hard Disk Drive (HDD) [2,3,4,5]. These systems distribute data across multiple devices, providing virtually limitless capacity while ensuring fault tolerance through replication and redundancy. Their ability to adapt to growing data demands and accessibility has made them indispensable in modern computing.
KVSs are widely used storage software for data management due to their strong performance and flexibility in supporting a wide range of application scenarios [6,7]. KVSs like LevelDB [8] and RocksDB [9] are foundational to applications requiring fast, low-latency access to large datasets. Among the various data structures used in KVSs, the LSM-tree [10] is the most prevalent, offering high write throughput and efficient lookup performance. Its design also aligns well with the features of SSDs, which are widely used in modern computing environments [11].
LSM-tree-based KVSs can leverage two characteristics to optimize performance and storage financial cost: levels and KV separation. The leveled structure of LSM-trees aligns well with tiered storage, as bottom levels, which contain less frequently accessed data, can be stored on cheaper, higher-latency storage devices. Similarly, KV separation allows values, which are typically larger, to be stored independently on cost-effective storage devices, while keys remain on faster storage for efficient indexing. Together, these characteristics enable flexible storage configurations that balance performance and storage financial cost.
Even though LSM-tree-based KVSs have the potential to offer inherent advantages, optimizing their performance in tiered storage, especially when balancing storage financial cost trade-offs, remains a significant challenge. One key consideration is how different storage devices affect data retrieval and write efficiency. In this context, we explore the design space of how to integrate KVSs into tiered storage while harnessing the benefits of KV separation.
We propose a scheme that utilizes configuration diversity in LSM-tree-based KVS across multiple storage devices. By analyzing the trade-offs in RocksDB, a widely used KVS, we apply tiered storage, particularly focusing on KV separation. We explore various alignment strategies for tiered storage, partitioning the LSM-tree levels, and segregating KV-separated entries onto different storage devices. In this paper, we use the terms Vertical alignment and Vertical, as well as Horizontal alignment and Horizontal, interchangeably to refer to our tiered storage alignment strategies.
In summary, we make the following contributions:
  • Provide a detailed analysis of how LSM-tree performance is affected by heterogeneous storage and identify areas for improvement (Section 4).
  • Explore two fundamental strategies for partitioning schemes in tiered storage—Vertical and Horizontal—by analyzing their trade-offs and enhancing their effectiveness through the selective integration of KV separation (Section 5).
  • Implement our proposal on top of RocksDB and conduct analysis to determine how to maximize gains in performance while lowering storage financial costs (Section 6.2).
The remainder of this paper is organized as follows. We discuss related works in Section 2. Section 3 provides the context of tiered storage and LSM-trees, followed by how we explore the design space of tiered storage combined with KV separation. We then present our motivation in Section 4. Section 5 discusses our approach to configuring the tiered storage alignments and how we combine it with KV separation. Section 6 shows our experimental results, going over the lessons we learned in this study. Finally, we conclude this paper in Section 7, and provide insights for future research directions.

2. Related Works

Analyzing and improving the performance of LSM-trees has been the subject of a spectrum of research [12,13,14,15,16,17,18]. For instance, PebblesDB [19] reduces write amplification (WA) by employing a fragmented LSM-tree structure that decreases the amount of data rewritten during compaction. SILK [20] utilizes an IO scheduler for an LSM-based KV store to reduce tail latency, demonstrating improvements in synthetic and production workloads. Monkey [21] strikes an optimal balance between the costs of updates and lookups by minimizing lookup cost, predicting performance, and autotuning.
WiscKey [22] introduced KV separation, which proved to be an effective way to minimize WA. The concept is widely adopted by KV stores and has laid the groundwork for further research. ChameleonDB [23] adopts a multi-shard structure to enhance performance, while DiffKV [24] optimizes merges and introduces fine-grained KV separation to balance performance and storage costs. FenceKV [25] improves range scan efficiency with fence-based data grouping and further proposes key-range garbage collection. Similar to these studies, our scheme leverages KV separation by offloading KV pairs into different storage devices and also utilizes selective KV separation.
Schemes leveraging heterogeneous storage have taken advantage of advancements in storage technologies to address scalability and performance challenges in KVSs. NoveLSM [26] employs a byte-addressable Skiplist tailored for NVM, enhancing (de)serialization processes. MatrixKV [27] introduces a matrix container to mitigate write stalls, while SpanDB [28] utilizes SPDK [29] for asynchronous request processing and high-speed logging, achieving cost-effective performance improvements. p2KVS [30] focuses on parallelism across multiple KVS instances, emphasizing portability and scalability. Jaranilla et al. [31] utilize different storage devices and hybrid compression on the levels of the LSM-tree to navigate the trade-offs between performance and storage space utilization.
While our approach also utilizes heterogeneous storage, these schemes mostly use NVM, which offers byte-addressability, persistence, and DRAM-like performance. However, Intel’s Optane series was discontinued [32], and alternative solutions, while offering certain advantages, have different capabilities and considerations. We utilize NVMe and SATA SSDs to show how tiered storage can be implemented cost-efficiently without sacrificing performance. Moreover, our approach does not require changes to the storage software stack components (e.g., SPDK, FTL, and the file system).
Schemes that integrate KV separation and heterogeneous storage exemplify the synergy between these two strategies. HashKV [33] implements a hash-based data grouping to optimize garbage collection and further proposes selective KV separation. PRISM [34] employs a heterogeneous storage index table to synergistically utilize multiple devices, thereby improving performance, scalability, and crash consistency. ThanosKV [35] holistically improves KVS performance through hybrid compaction and NVM indexing.
Table 1 provides a comparative overview of various KVS schemes, highlighting their key techniques, optimizations, and storage configurations. In contrast to these existing schemes, our approach offers greater flexibility by enabling configurable separation of (1) keys and values across distinct storage devices, and (2) levels that store both keys and their corresponding values. Additionally, we leverage selective KV separation, allowing tailored optimization based on workload characteristics and storage hierarchies. This adaptability distinguishes our scheme as a more versatile solution for balancing performance, scalability, and cost-efficiency.

3. Background

In this section, we first discuss why tiered storage is popularly employed for managing today’s expanding data landscape. Then, we describe the key concept of the LSM-tree data structure and KV separation in KVSs. Finally, we explore fundamental design space and how KVSs can be integrated into tiered storage.

3.1. Tiered Storage

Tiered storage is a data management solution that organizes and stores data across different types of storage devices based on performance and storage financial cost requirements. As data continues to grow exponentially, tiered storage becomes essential for handling large volumes effectively and economically. Modern tiered storage extends beyond just local storage, frequently incorporating a blend of on-premises and cloud storage solutions, along with integration with edge computing [36].
Figure 1 illustrates a typical tiered storage hierarchy where various storage devices are being used in different storage tiers, highlighting differences in performance, cost per byte, and capacity. In general, frequently accessed data are placed on faster, more expensive storage tiers to ensure low latency and high throughput. Conversely, infrequently accessed data are stored on slower, more cost-effective storage devices [37,38,39]. In this paper, we investigate how this tiered storage can be utilized by KVSs, a de facto standard database for unstructured data. We especially focus on how to exploit the characteristics of the LSM-tree and KV separation using RocksDB’s KV separation mechanism, called BlobDB [40].

3.2. Log-Structured Merge-Tree

The LSM-tree is a data structure used to efficiently manage large write-intensive workloads. It is adopted by most KVSs, including RocksDB, whose LSM-tree implementation is shown in Figure 2a. RocksDB manages two core components: a memory component called the Memtable, and a storage component called the Sorted String Table (SStable). Note that SStable files are distributed across multiple levels, denoted as L0, L1, and so on.
The Memtable is a 64 MB write buffer wherein inserted KV pairs are batched before being written into storage. These KV pairs are also simultaneously written to the Write-Ahead Log (WAL) for recovery purposes. Once the Memtable is full, it is converted into an immutable Memtable, and a new Memtable is created. Subsequently, the immutable Memtable is flushed into storage and managed as an SStable file. After successful flushing, the corresponding entries in the WAL are deleted.
As more data are written over time, SStables accumulate at a level. When the overall used space at a level is above a predefined threshold, RocksDB performs compaction. Compaction is the process of merge-sorting multiple SStables into a new set of SStables on the next level, which also includes eliminating obsolete data. This helps reduce disk input/output (IO) and improve read performance. However, compaction introduces a trade-off by consuming CPU and IO resources during the process, which can temporarily impact write performance and increase latency for ongoing operations.
When a lookup is requested, RocksDB first searches the Memtable, which is implemented as a Skiplist with O(log N) search complexity. If the key is not found in the Memtable, RocksDB searches the SStable files, from L0 to higher levels. Therefore, files in L0 are the most frequently accessed, while files in higher levels are accessed less frequently.
When we apply tiered storage to KVSs, we can exploit this feature of the LSM-tree; that is, it consists of multiple levels, and each level has different access hotness. One intuitive way is placing hot data in low-latency, high-performance storage while storing cold data in cheaper, higher-latency storage, as shown in Figure 2b. This approach improves access times for frequently used data and reduces storage financial costs for infrequently accessed data, making it ideal for systems with varying data access patterns. However, compaction across different storage tiers can introduce overhead because of the data movement between storage devices. Despite these challenges, tiered storage enhances scalability and cost-efficiency, making it suitable for large datasets and workloads with diverse performance requirements [28,31,35].
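To make this concrete, the sketch below shows one way to approximate such level-aware placement with stock RocksDB. It relies only on options we believe are part of the public C++ API (write_buffer_size and db_paths); the mount points and size budgets are placeholders, and db_paths selects a path by cumulative target size rather than by level number, so this only approximates the per-level placement discussed above.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Sketch: approximate "hot upper levels on NVMe, cold deeper levels on SATA"
// with stock RocksDB. Placement via db_paths is driven by each path's
// target_size budget, not by an explicit level number. Paths are placeholders.
int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 64 << 20;  // 64 MB Memtable, as described above

  // A small budget on the fast device keeps roughly the upper levels there;
  // SStables beyond that budget spill over to the cheaper device.
  options.db_paths.emplace_back("/mnt/nvme/db", 4ull << 30);   // ~4 GB on NVMe SSD
  options.db_paths.emplace_back("/mnt/sata/db", 1ull << 40);   // remainder on SATA SSD

  rocksdb::DB* db = nullptr;
  rocksdb::Status status = rocksdb::DB::Open(options, "/mnt/nvme/db", &db);
  if (!status.ok()) return 1;
  delete db;
  return 0;
}
```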

3.3. Key–Value Separation

WiscKey [22] is a seminal study that proposes a technique called KV separation. Unlike the traditional LSM-tree implementation, where KV pairs are stored together in the tree, KV separation stores keys in a sorted order within the LSM-tree, while storing values separately in a log file. This separation reduces the amount of data rewritten during the compaction process, resulting in lower WA and improved write performance. BlobDB is RocksDB’s implementation of KV separation, designed to handle Binary Large Objects (Blobs) efficiently. As shown in Figure 3a, it stores keys only in the LSM-tree, while the separated values are manipulated in Blob files.
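For reference, the snippet below is a minimal illustration of enabling RocksDB's integrated BlobDB. The option names are, to the best of our knowledge, part of the public C++ API, but the threshold and file size shown are illustrative values rather than the settings used in our experiments.

```cpp
#include <rocksdb/options.h>

// Illustrative sketch: enable RocksDB's integrated KV separation (BlobDB).
// Values at or above min_blob_size are redirected to Blob files during flush,
// while keys (and small values) remain in the SStables.
rocksdb::Options MakeBlobOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.enable_blob_files = true;               // turn on KV separation
  options.min_blob_size = 512;                    // illustrative threshold in bytes
  options.blob_file_size = 256ull << 20;          // target Blob file size: 256 MB
  options.enable_blob_garbage_collection = true;  // reclaim space held by stale values
  return options;
}
```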
In contrast to the original WiscKey, where KV separation occurs before data are written to Memtable, BlobDB implements KV separation during flush operations when data are moved from memory to storage. This choice determines whether to store keys only or KV pairs in Memtable, ultimately resulting in opposing impacts on write and read performance. Specifically, WiscKey stores keys only in Memtable, which reduces flush frequencies and eventually enhances write performance. On the other hand, BlobDB stores KV pairs in Memtable, enabling direct retrieval of values when a key is found (eliminating the need for additional access via pointer), which eventually gives a positive impact on reads. Note that both WiscKey and BlobDB store keys only in SStables, and store values separately.
In BlobDB, the lookup process involves three steps. First, it checks the Memtable for the KV pair. If the key is not found, it searches the SStables in the LSM-tree to retrieve the key metadata, which includes a pointer to the value’s location. Finally, using this pointer, the actual value is fetched from the Blob file. This separation of keys and values introduces additional IO operations, impacting performance.
This KV separation technique triggers another challenge when we apply tiered storage to KVSs. One simple approach is allocating different storage devices to the LSM-tree and Blob files in an isolated manner, as shown in Figure 3b. Since Blob files, which contain values, account for the majority of data, placing them on cost-efficient storage can optimize storage financial cost management. However, this strategy requires careful evaluation of its impact on overall performance, which is one issue we tackle in this study.

3.4. Exploring Vertical and Horizontal Tiered Storage

KVSs make it possible to arrange various configurations in tiered storage. First, the existence of multiple levels in the LSM-tree allows for horizontal tiered storage, as discussed in Section 3.2. Second, the KV separation technique allows for vertical tiered storage, as explained in Section 3.3. Taking into account these two perspectives, we introduce a unified framework integrating both horizontal and vertical dimensions, as illustrated in Figure 4, to explore diverse design spaces.
This paper discusses the challenges arising from our exploration of design spaces and investigates new solutions. Specifically, we quantitatively evaluate (1) performance effects in RocksDB when using different storage devices, (2) the hotness of levels in the LSM-tree, and (3) the impact of KV separation. Furthermore, we address some issues caused by KV separation in horizontal tiered storage and propose selective KV separation as a solution.
By leveraging the complementary strengths of high-performance and cost-effective storage devices, our solution achieves a fine-grained balance between performance and storage financial cost efficiency. In addition, our exploration offers adaptability to diverse workload requirements, enabling tailored trade-offs in throughput, latency, and storage financial costs, which will be discussed further in Section 6.

4. Benefits of Tiered Storage

To construct a baseline for this work’s motivation, we perform a series of preliminary experiments on RocksDB and its KV separation implementation, BlobDB, in the experimental environment summarized in Table 2.

4.1. Device Latency

Figure 5 shows the effect of the latency difference between storage devices on RocksDB performance. We conduct microbenchmark experiments for RocksDB on NVMe and SATA SSDs separately. First, we issue 56 million random write operations. After all compactions are done, we perform random reads to the same KV pairs.
The average latency, as presented in Figure 5a, for write and read operations in RocksDB varies significantly depending on the storage devices, with RocksDB exhibiting double the average latency when using SATA SSD compared to when using NVMe SSD. For RocksDB on NVMe SSD, the average latency is 6.657 μs per write and 26.084 μs per read, showcasing the high-speed capabilities of NVMe SSD. In contrast, RocksDB in SATA SSD shows considerably higher latencies, with 13.015 μs for writes and 50.366 μs for reads. This disparity is expected due to the fundamental differences in the performance characteristics of the two storage technologies, with NVMe interface benefiting from faster data transfer rates and lower latency compared to the older SATA interface.
The cumulative write operations shown in Figure 5b also reflect a noticeable difference in the performance of RocksDB between NVMe SSD and SATA SSD. For NVMe SSD, the 56 million write operations take a total of 372.8 s. In comparison, the same number of writes on SATA SSD takes 728.8 s, with latencies ranging from 2 μs to 6811 μs and a median write time of 3.67 μs. This result demonstrates the existence of long-tailed latencies in SATA SSD.
For cumulative read operations shown in Figure 5c, the performance difference between NVMe SSD and SATA SSD is even more pronounced. RocksDB on NVMe SSD reads take a total of 1833.5 s for 56 million operations. SATA SSD reads, on the other hand, take 3526.1 s for the same number of operations, highlighting the slower and more variable read performance of RocksDB in SATA SSD compared to NVMe SSD.

4.2. Characteristics of the LSM-Tree

The traditional implementation stores the whole LSM-tree in a single storage device. This causes the storage to be oblivious to the levels wherein the data are stored. However, in tiered storage, we need to decide how to allocate levels or separate values into different devices. To obtain guidelines for this decision, we conduct experiments with the Yahoo! Cloud Serving Benchmark (YCSB) [41] using the Zipfian distribution. In total, 50 million KV pairs are inserted during the Load phase, and 10 million KV pairs are processed on the remaining workloads. The keys and values are configured as 24 and 1000 bytes, respectively.
As shown in Figure 6a, compactions in RocksDB vary significantly between levels due to differences in data volume and structure. At lower levels, compactions are smaller and more frequent. Higher levels, however, handle larger, less frequent compactions. On the other hand, since the LSM-tree in BlobDB only stores key–pointer pairs instead of key–value pairs, each SStable can hold more keys, triggering fewer compactions. In this experiment, all SStables are located at either L0 or L1 in BlobDB. In addition, compactions in BlobDB are mostly concentrated at the lowest level, and the overall compaction time is much smaller than that of RocksDB.
During its lifetime, data are rewritten multiple times with compaction, causing WA [42,43]. Figure 6b shows that WA varies between levels. This variability arises because compactions behave differently depending on the characteristics of each level, such as the number of SStable files and compaction frequencies. In terms of overall WA, the figure shows that KV separation is an effective way to reduce WA by avoiding the rewriting of values during compactions.
Compactions move files to subsequent levels, shown as the increased number of SStables in RocksDB in Figure 6c. However, since the topmost levels, particularly L0, store the most updated data, many reads can retrieve the required data without searching deeper levels. This minimizes read latency and highlights the critical role of the levels in optimizing access to frequently accessed keys. Note that Figure 6c only shows the distribution and hits of SStable files in RocksDB since BlobDB only has SStable files up to L1 in our experiments.

4.3. Issues of KV Separation

Figure 7 highlights the impact of KV separation in BlobDB on read latency compared to RocksDB when running a read-only workload (YCSB Workload C, specifically). RocksDB, which reads both keys and values only from the LSM-tree, accumulates a total read data size of 10 billion bytes with an average latency of 11.728 μs per read. In contrast, BlobDB’s KV separation causes reads to be split between the LSM-tree (for keys, totaling approximately 9.99 billion bytes) and Blob files (for values, totaling 10.55 billion bytes), resulting in a higher average latency of 15.323 μs per read.
The higher read latency in BlobDB stems from the additional IO operations required to fetch values stored in Blob files after retrieving the corresponding keys from the LSM-tree. From now on, we refer to it as double IO. This separation increases the total IO overhead compared to RocksDB, where both keys and values are stored together in the LSM-tree, enabling more streamlined reads. Consequently, while BlobDB offers benefits for certain workloads by offloading values to separate storage, it introduces a trade-off in the form of increased latency for read-heavy operations.
The data access pattern shown in Figure 6c reveals that the files hit during lookups are mostly in L0. Additionally, regarding the time taken during compactions, L0 has an inconsequential effect since compactions mostly occur in the middle levels. This leads us to the question: is there merit in separating the values at the topmost level? Both merits and demerits are amplified in a tiered storage environment because of the latency difference between the storage devices, so we need to carefully consider where to apply KV separation and on which device the separated values should be stored.

5. Design

Considering the aforementioned challenges, we address the following research questions:
  • How can tiered storage strategies be applied to LSM-trees with KV separation to balance performance and cost-efficiency?
  • What are the trade-offs in separating not only the levels of the LSM-tree (keys) into heterogeneous storage but also the Blob files (values)?
  • In what workload scenarios does selective KV separation provide the most significant performance benefits in a tiered storage environment?
In this paper, we propose a design strategy specifically for KV separation that leverages tiered storage. This scheme allows users to navigate through the trade-offs between storage financial cost and performance. Our approach focuses on flexibility in choosing which storage device houses the SStables (keys) and Blob files (values), regardless of whether they are separated or not. As Figure 3b shows, users can configure both the levels of the LSM-tree for the keys and their corresponding values to be stored in their specified storage devices. Additionally, as shown in Figure 4, the division can be extended to the values by placing the Blob files in a different storage device. Lastly, we leverage selective KV separation to optimize the performance.

5.1. Configuration Strategy

As discussed in Section 3.4, we configure the tiered storage in two prominent approaches, as shown in Figure 8, which offer distinct trade-offs by leveraging the properties of storage media and data access patterns. These configurations aim to optimize the use of low-latency and cost-effective devices to address the challenges of LSM-tree compactions and read–write operations when utilized with KV separation.
Vertical alignment. Taking inspiration from SpanDB’s poor man’s approach of storing the bulk of data in a cheaper storage device [28], the vertical alignment leans towards storage financial cost efficiency, as shown in Figure 8a. The LSM-tree, which holds the keys and pointers, is stored in the faster (but expensive) NVMe SSD. Since the keys are typically small [12], they consume less storage space on the more expensive device. Furthermore, compactions are performed within the LSM-tree, and processing less data on each compaction, aside from reducing overhead, mitigates wear on the device’s limited P/E cycles. Meanwhile, values, which are larger, are stored on the slower (but cheaper) SATA SSD. However, since the larger values are on a higher-latency device, actual Blob file writes (during flushes) and reads (during lookups) have higher overhead. This is the trade-off that lowers the overall storage financial cost with moderate performance.
Horizontal alignment. As shown in Figure 8b, horizontal alignment places the values on the same storage device that stores the level of their corresponding keys. In addition to the storage efficiency advantages offered by vertical alignment, it also improves read performance, since the bulk of data access (the topmost level) is now on the faster NVMe SSD for both keys and values, while the rest of the levels reside on the slower SATA SSD. However, further separating the bottom levels, such as L1 and L2, would mean that each compaction also needs to rewrite values to their respective storage devices, which negates the advantages of KV separation. As shown in Figure 6a, compactions greatly impact these levels and, due to this, we opt to use the same storage device for the remaining levels. Having the same device in consecutive levels retains the advantages of KV separation, while still allowing L0 to benefit from the low-latency device. In terms of read performance, the most frequently accessed data are at the top levels, as shown in Figure 6c, and with L0 using a low-latency device, the majority of reads can be served by the fast, low-latency device.
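To make the two alignments concrete, the sketch below expresses them as an explicit placement policy. The structure and names are purely illustrative (in our prototype, placement is realized by modifying RocksDB's Blob file builder, as noted in Section 6.2.1); the device paths and the number of levels are assumptions.

```cpp
#include <map>
#include <string>

// Hypothetical placement policy, used only to illustrate the two alignments.
// key_path maps an LSM-tree level to the device holding its SStables (keys);
// value_path maps a level to the device holding its Blob files (values).
struct TierPolicy {
  std::map<int, std::string> key_path;
  std::map<int, std::string> value_path;
};

// Vertical alignment: every level's keys on NVMe SSD, every level's values on SATA SSD.
TierPolicy Vertical() {
  TierPolicy policy;
  for (int level = 0; level <= 4; ++level) {
    policy.key_path[level] = "/mnt/nvme";
    policy.value_path[level] = "/mnt/sata";
  }
  return policy;
}

// Horizontal alignment: L0 keys and values on NVMe SSD, deeper levels on SATA SSD,
// keeping each level's keys and values on the same device.
TierPolicy Horizontal() {
  TierPolicy policy;
  for (int level = 0; level <= 4; ++level) {
    const std::string device = (level == 0) ? "/mnt/nvme" : "/mnt/sata";
    policy.key_path[level] = device;
    policy.value_path[level] = device;
  }
  return policy;
}
```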

5.2. Combining with Selective KV Separation

Our approach of tiered storage alignment to diversify the storage configurations in LSM-tree takes advantage of BlobDB’s capability to adjust KV separation among levels. Figure 9 shows our tiered storage alignment approach combined with selective KV separation. Users can configure each of the levels (keys) and Blob files (values) with their respective storage device. In addition, they can opt to use both Vertical and Horizontal alignments with non-separated KV pairs as well. We refer to this as selective KV separation, whose impact will be further discussed in Section 6.2.5.
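As an illustration of how selective KV separation can be expressed, the sketch below defers separation to L1. We believe recent RocksDB releases expose an option for this purpose (shown here as blob_file_starting_level); the exact option name and its availability depend on the RocksDB version and should be treated as an assumption.

```cpp
#include <rocksdb/options.h>

// Sketch of selective KV separation: keep full KV pairs in L0 and separate
// values into Blob files only from L1 downward. The option name below is
// believed to exist in recent RocksDB releases; treat it as an assumption.
rocksdb::Options SelectiveSeparationFromL1() {
  rocksdb::Options options;
  options.enable_blob_files = true;       // enable KV separation overall
  options.min_blob_size = 0;              // once active, separate every value
  options.blob_file_starting_level = 1;   // L0 stays unseparated; separation starts at L1
  return options;
}
```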

6. Evaluation

We present the results of our experiments on RocksDB and BlobDB, covering throughput, latency breakdown, storage financial cost, and the effect of selective KV separation.

6.1. Experiment Setup

We benchmark RocksDB and BlobDB on NVMe and SATA SSDs separately, utilizing the experimental environment summarized in Table 2. Additionally, we set up the vertical alignment with LSM-tree (keys) stored in NVMe SSD and Blobs (values) in SATA SSD. Finally, we set up the horizontal alignment with NVMe SSD for L0 keys and values and SATA SSD for the rest of the levels.
We utilize YCSB, which has been used as a standard benchmark tool for KVS evaluation. YCSB simulates a variety of workloads, summarized in Table 3, to assess database performance in real production environments. In this experiment, we initially load 50 million requests of 24 and 1000 byte KV pairs to form an approximately 53 GB database. We then proceed with 10 million requests for each of the YCSB workloads A through F.

6.2. Experimental Results

6.2.1. Throughput

Figure 10 shows the throughput comparison results under 6 configurations: (1) RocksDB on NVMe SSD, (2) RocksDB on SATA SSD, (3) BlobDB on NVMe SSD, (4) BlobDB on SATA SSD, (5) Vertical approach in tiered storage, and (6) Horizontal approach in tiered storage. RocksDB on NVMe SSD dominates in performance because it stores all of its data in the fast and expensive storage device. BlobDB on NVMe SSD exhibits the best performance for the Load workload, demonstrating its effectiveness in write-only workloads. SATA SSD-only configurations trail their NVMe SSD counterparts in performance. Aside from these obvious results, we highlight five significant findings.
First, in vertical alignment, since the values are stored on a slower device (SATA SSD), writes incur higher overhead than in the default BlobDB, which only utilizes the fast NVMe SSD. Nevertheless, it outperforms RocksDB on either storage device, as well as BlobDB on SATA SSD. With regard to the horizontal alignment, the compactions from L0 to L1 have a higher overhead because the Blob files, residing on a different storage device, must be rewritten.
For Workload A, RocksDB enables efficient access and updates because it does not have to manage values separately. Horizontal alignment achieves similar throughput to RocksDB and BlobDB by using NVMe SSD in L0, which facilitates faster flushes. In contrast, vertical alignment has lower throughput because values are stored on slower SATA SSDs. Since this workload involves both reads and writes, RocksDB outperforms BlobDB by balancing read overhead, a benefit also observed in horizontal alignment.
The vertical alignment has similar performance to BlobDB on SATA SSD in Workload A despite using the fast NVMe SSD to store the keys. While optimized for balanced storage financial costs and decent writes, it struggles to deliver high read performance. Similarly, in Workload F, out-of-place updates on the faster NVMe SSD offset the SATA SSD’s higher overhead. By writing a new version of the data to L0 instead of updating in place, the LSM-tree avoids penalties from SATA SSD’s slower writes.
For read-intensive workloads B, C, and D, horizontal alignment performs comparably to BlobDB on NVMe SSD. This is because most requested files are hit in L0 (Figure 6c), and the overhead at this level is similar for these workload approaches, resulting in nearly identical throughput. Vertical alignment, while slower than BlobDB on NVMe SSD, outperforms BlobDB on SATA SSD due to its use of NVMe SSD for storing keys.
With Workload E, which involves range queries, the horizontal alignment gains an advantage despite range queries being a known limitation of KV separation. To implement our tiered storage alignments, we modified the Blob file builder to use the device path directly when creating Blob files. While this change offers minimal improvement for individual lookups, it significantly enhances range query performance. During range scans, only the Blob files within the relevant level are searched, rather than all Blob files, which reduces unnecessary overhead.
Our main takeaway in these experiments is that our approach offers competitive performance with RocksDB and BlobDB stored in NVMe SSD while storing most of the data in SATA SSD. This results in storage financial cost savings, which we discuss further in Section 6.2.4.

6.2.2. Average and Per-Level Latency

Since we apply tiered storage to the LSM-tree, it is crucial to analyze the average latency and its impact on individual levels. The read latencies per level reflect the time taken to perform read operations on each level of the LSM-tree under different configurations. Observing these values can reveal the impact of storage device, data layout, and configuration on read performance at various levels. In these experiments, we discuss our observations with latency differences when running the YCSB Workload A.
In line with expectations, as shown in Figure 11, implementations stored in NVMe SSD always deliver the best results for both reads and writes. On the other hand, BlobDB excels in slower storage (SATA SSD), which makes it ideal for cost-sensitive environments. Note that, in Figure 11b, L3 and L4 reflect only the results for RocksDB, since the SStables in BlobDB and in the vertical alignment store more keys because the values are separated into Blob files. In the case of horizontal alignment, we report the latency for retrieving the Blob file separately, alongside the other configurations using KV separation.
Ancillary to these observations, we extract five noteworthy findings. First, as shown in Figure 11a, vertical alignment exhibits a fairly similar average latency to BlobDB on SATA SSD. For writes, this is due to the separate writes of Blob files (values) to a higher-latency device (SATA SSD). For reads, this can be attributed to the overhead of fetching a key from NVMe SSD but its associated value from SATA SSD. On the other hand, horizontal alignment balances tiered storage effectively, achieving near-NVMe SSD performance for both reads and writes by optimizing data placement and taking advantage of the low-latency device for frequently accessed data.
Second, as shown in Figure 11b, L0 latencies for all configurations are higher than their L1 counterparts. While this may seem counterintuitive because data in L0 comes first during lookups, note that SStables in L0 are products of the flush operations, which means they can have significant overlaps. This requires lookups to search across multiple files, as every overlapping SStable may need to be checked for matches. In contrast, data in the remaining levels, in this case, L1, are generated through the compaction process, which includes merge-sort operations.
Third, L3 latencies are higher than L4 latencies. Upon further inspection, we notice that there are fewer SStables in L4, which results from the LSM-tree’s tree-like structure. Since data traverses through the levels of the LSM-tree from top to bottom, L3 serves more read requests than L4. Note that RocksDB’s internal thread pool handles compactions and reads concurrently. RocksDB is designed to handle multiple operations in parallel, allowing compaction tasks to run in the background without locking up the database. Since Workload A is a mixed workload, RocksDB does not wait for compactions to finish before processing reads.
Fourth, vertical alignment has similar latency with BlobDB on SATA SSD when retrieving Blob files, given that both of them store all values in SATA SSD. However, since the vertical alignment stores keys in NVMe SSD, it has lower latencies for the levels that store its keys. As mentioned, the vertical alignment incurs an overhead penalty due to the difference in latency of the devices for keys and values, which is reflected in its throughput shown in Workload A results in Figure 10.
Fifth, the horizontal alignment is consistently competitive with RocksDB on NVMe SSD, not only in L0, which uses NVMe SSD, but also in the bottom levels using SATA SSD. Moreover, it has similar latency with BlobDB on NVMe SSD when retrieving Blob files. This shows that horizontal alignment can handle most of the requests with minimal overhead penalties, especially for more frequently accessed keys, which we discuss next in exploring tail latencies.

6.2.3. Tail Latency

Tail latency is crucial for LSM-tree-based KVSs, as they are extensively used in production environments to support write-intensive workloads and latency-sensitive applications [27]. Table 4 shows how our tiered storage alignments impact latency, while storing most of the data on cheaper devices, compared to RocksDB and BlobDB in single-storage setups.
As expected, NVMe SSD configurations consistently outperform their SATA SSD counterparts. This is evident in both read and write latencies, where the 99th and 99.9th percentiles on NVMe SSD are significantly lower than on SATA SSD. Across all configurations, the latencies at the 99th and 99.9th percentiles are much higher than those at the 50th or 75th percentiles. This indicates that tail latency (worst-case requests) is more sensitive to system bottlenecks.
From these results, we present three key observations pertaining to both vertical and horizontal alignments. First, the write latency at the 99th and 99.9th percentiles with the vertical alignment is higher than in all other configurations, indicating bottlenecks in handling high-volume or skewed access patterns. Nevertheless, it exhibits competitive latencies at the 50th and 75th percentiles with the other configurations, BlobDB on SATA SSD in particular, which is reflected in the comparable average latency shown in Figure 11a.
Second, the horizontal alignment offers more balanced performance for reads, with competitive performance with RocksDB on NVMe SSD in 50th and 75th percentiles. In these percentiles, most requests are likely to hit frequently accessed keys stored in faster tiers (NVMe SSD) and/or benefit from caching. However, since horizontal alignment spreads the LSM-tree levels across different storage devices, tail latencies can spike if SStables on slower storage are involved in read operations, especially when accessing less frequently queried keys or during a compaction event.
Third, at higher percentiles (99th and 99.9th), less frequently accessed keys that reside in slower tiers (SATA SSD) dominate. This results in significantly higher latencies for horizontal alignment compared to RocksDB on NVMe SSD, which stores all of the data in NVMe SSD, avoiding the need to interact with slower storage devices, resulting in more predictable latencies.

6.2.4. Data Storage Implications

We examine storage consumption and financial cost efficiency. Additionally, we discuss the implications for WA, highlighting how our approach helps reduce these issues and provides a more efficient storage layout compared to RocksDB and BlobDB. Figure 12a shows the storage consumption of the database after we run the experiments. Upon completing all the workloads, RocksDB, BlobDB, and both vertical and horizontal alignments each generate databases of approximately 53 GB, with only slight variations between them. Since both RocksDB and BlobDB are single-storage setups, 100% of their data are on the NVMe SSD. In the case of vertical alignment, the values that make up 96.38% of the database are on a cheaper SATA SSD, with the rest, namely the keys and pointers, on the NVMe SSD. On the other hand, in horizontal alignment, the keys, pointers, and values for all levels except L0 are stored on the SATA SSD, which holds 94.96% of the total data.
Given the data distribution between the storage devices in Figure 12a, our approach reduces the overall storage financial cost, which is calculated by
$\mathrm{Storage\_Financial\_Cost} = \sum_{i=1}^{n} C_i \cdot P_i$
where $C_i$ is the consumption (in GB) of the $i$-th storage device, $P_i$ is the cost-per-GB of the $i$-th storage device, and $n$ is the total number of storage devices. Specifically, both RocksDB and BlobDB on NVMe SSD result in USD 20.7, while vertical and horizontal alignments result in USD 10.45 (49.5% less) and USD 10.6 (48.7% less), respectively. Note that, in actual implementations, data will continue to grow, increasing storage financial costs. However, our approach offsets these storage financial costs with significant savings while maintaining competitive performance.
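As a small worked illustration of this formula, the snippet below follows the vertical-alignment split reported above (~53 GB total, 3.62% on NVMe SSD and 96.38% on SATA SSD); the per-GB prices are placeholders, not the values from Table 5.

```cpp
#include <cstdio>
#include <vector>

// Worked illustration of Storage_Financial_Cost = sum_i C_i * P_i.
// Consumption mirrors the vertical-alignment split reported above;
// the per-GB prices are placeholders, not the Table 5 values.
int main() {
  struct Device { const char* name; double consumed_gb; double price_per_gb; };
  const std::vector<Device> devices = {
      {"NVMe SSD", 53.0 * 0.0362, 0.40 /* placeholder USD/GB */},
      {"SATA SSD", 53.0 * 0.9638, 0.15 /* placeholder USD/GB */},
  };
  double cost = 0.0;
  for (const auto& d : devices) cost += d.consumed_gb * d.price_per_gb;  // C_i * P_i
  std::printf("Storage financial cost: %.2f USD\n", cost);
  return 0;
}
```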
We simplify our calculation to facilitate an easier understanding of the storage financial costs when considering multiple storage media. This calculation is limited to the retail price of the storage media shown in Table 5, and does not fully consider other aspects like the whole system hardware cost, energy consumption cost, air-conditioning cost, labor cost, and the difference in prices among different regions, brands, and times of purchase. Additionally, the hardware setup and workloads in an actual data center are more complex than ours, so the service costs will vary accordingly.
Over time, RocksDB can cause the storage device to suffer from fragmentation, especially with large values, as the storage layout becomes suboptimal. WA exacerbates this negative impact since RocksDB writes more data into the storage device than the actual data requested by the user. Specifically, WA is caused by compactions, the WAL, and metadata updates, which result in additional writes. This is mitigated with KV separation by keeping large values in separate files and processing only keys during compactions.
WA in a KVS is defined as the ratio of the total data written to the disk ($W_{\mathrm{disk}}$) to the data written by the user ($W_{\mathrm{user}}$):

$\mathrm{WA} = \frac{W_{\mathrm{disk}}}{W_{\mathrm{user}}}$
Since our approach offloads the data into tiered storage, the Blob files must be rewritten to the corresponding storage device instead of just updating the pointers. As shown in Figure 12b, the WA for both storage alignments is higher than that of BlobDB, but still lower than that of RocksDB. For vertical alignment, its WA is almost identical to that of BlobDB, since it performs essentially the same operations, except that it writes Blob files to a different storage device.
On the other hand, the horizontal alignment has higher WA because it stores the Blob files in the same storage device as their corresponding keys. In cases where subsequent levels utilize different storage devices, compacting from Li to Li+1 will require the compaction process to rewrite the Blob files. This can be mitigated by configuring consecutive levels in the same storage device. For example, L0 and L1 can use the NVMe SSD, L2 and L3 can use SATA SSD, and so on.

6.2.5. Leveraging Selective KV Separation

We analyze how selective KV separation affects our approach in Figure 13, comparing results when KV separation starts in either L0 or L1, as depicted in Figure 9. When KV separation starts in L0 (default configuration), keys and values are separated during flush operations. On the other hand, if KV separation is delayed to L1 (or subsequent levels), KV separation happens during compaction.
Building upon these observations, we identify five significant findings. First, as evidenced in the Load phase, vertical alignment maintains almost the same write performance. However, the throughput of the horizontal alignment drops noticeably. Since L0 does not have KV separation, compactions from L0 to L1 trigger writes of both the keys in the LSM-tree and the Blob files on the SATA SSD.
Second, in Workload A, both alignments exhibit throughput improvements when KV separation is delayed to L1. By keeping values with keys in L0, write operations in this level are simpler and avoid the separate value write overhead, improving write throughput in the presence of frequent updates. This approach minimizes write latency for hot data by avoiding unnecessary Blob file write operations for frequently updated entries. This is also true for Workload F, which is a read-modify workload.
Third, in Workload C, horizontal alignment shows similar performance, while vertical alignment improves by 1.3x. Vertical alignment takes advantage of having many of the file hits in L0, which has no KV separation and thus no double IO, meaning that read requests served at this level go to the lower-latency NVMe SSD. Regardless of whether KV separation is performed from the start or not, horizontal alignment already uses NVMe SSD in L0, which means that it already benefits from this and cannot exploit it further.
SStables in L0 are not organized in a strictly sorted manner, unlike the subsequent levels, where SStables are products of the compaction process. Recall that RocksDB achieved the highest throughput in the earlier comparison; in the case of both vertical and horizontal alignments with KV separation starting from L1, the SStables in L0 are processed the same way as in traditional RocksDB, which helps boost their performance.
This behavior is true for the other two read-dominated workloads, B and D, as well. However, horizontal alignment actually exhibits improvements in Workload D. This is because Workload D is a read latest workload, which allows the horizontal alignment to take full advantage of having its L0 in NVMe SSD, which stores the most updated version of the data.
Fourth, in Workload E, while vertical alignment exhibits a drop in performance, horizontal alignment improves. Note that the vertical alignment stores all the values in the higher latency SATA SSD. On the other hand, because the horizontal alignment subdivides the values to be in the same storage device as their respective keys, the iterator only needs to consider the Blob files at each particular level with which their keys are associated. This is the trade-off of rewriting the Blob files into the next level if consecutive levels have different storage devices, causing higher WA, as discussed in Section 6.2.4.
In summary, selective KV separation enhances both vertical and horizontal tiered storage for write-intensive and read-modify workloads (A and F). For all read-intensive workloads, vertical tiered storage benefits from selective KV separation, while horizontal tiered storage only improves in read-latest workloads.

6.3. Limitations

Currently, our approach supports up to three different devices. During our initial experiments, we included Hard Disk Drives (HDDs) as one of the storage options. However, despite their cost-effectiveness, we discovered that using HDDs was not advantageous. There was a significant reduction in throughput when using HDDs in both alignments, rendering it 4x lower. Consequently, the minor storage financial cost savings from using HDDs do not justify the considerable performance drop, leading us to exclude HDDs from our recommendation, even though they can still be used by the user. During our experiments, we constrain the storage devices to those we have on hand, rather than using emulated environments, to conduct a fair comparison.
Our approach currently supports only one column family (CF) to maintain simplicity and efficiency. RocksDB can partition the database into multiple CFs, enabling a consistent view of the entire database while configuring each CF independently. While modifying RocksDB to implement our scheme, we found that introducing multiple CFs adds significant complexity. Specifically, if one CF generates a lot of Blob files while another does not, the overall resource utilization may become unbalanced, leading to suboptimal disk usage.
Additionally, if one CF dominates the cache usage, it could push out keys from other CFs, leading to higher cache misses. This can affect performance, especially if the workload varies significantly across CFs. These instances are also present in the default implementations of RocksDB and BlobDB. However, the effects can be amplified because our approach uses multiple storage devices with varying latencies.

7. Conclusions

In this work, we present a scheme that leverages configuration diversity in LSM-tree-based KVS with KV separation across multiple storage devices. Our approach achieves competitive performance with RocksDB on fast NVMe SSDs for write-intensive workloads while significantly reducing storage costs by storing 96% of the data in more economical SATA SSDs. Horizontal alignment further demonstrates lookup performance comparable to BlobDB on NVMe SSDs. Notably, our approach excels in range query scenarios, as evidenced by a 1.8x improvement in YCSB Workload E over RocksDB on NVMe SSDs, alongside up to 49.5% lower storage financial costs.
Both horizontal and vertical alignments exploit KV separation to achieve meaningful performance gains in read and mixed workloads. This demonstrates the adaptability of our scheme to varying workload demands. Beyond performance, our work introduces a paradigm of flexibility in tiered storage configuration, moving beyond the traditional top-down approach of using low-latency, but expensive, devices in the lower levels and high-latency, but cheaper, devices in the higher levels. Instead, we propose not only allocating levels across heterogeneous storage devices but also separating the values, while ensuring that keys and their corresponding values can remain on the same storage device.
We intend to continue this work by optimizing how the Blob file builder and reader are implemented. Specifically, we plan to extend our approach to accommodate a wider range of storage devices and multiple CFs. Additionally, we intend to efficiently offload data into remote or cloud storage, which has different considerations.

Author Contributions

Conceptualization, C.J., G.Z. and J.C.; methodology, C.J., G.Z. and J.C.; software, C.J.; validation, C.J., G.C., S.P. and J.C.; formal analysis, C.J., G.Z. and J.C.; investigation, C.J., G.C. and J.C.; resources, J.C.; data curation, C.J., G.Z. and J.C.; writing—original draft preparation, C.J. and G.Z.; writing—review and editing, C.J., G.C. and J.C.; visualization, C.J. and G.Z.; supervision, S.P. and J.C.; project administration, C.J., S.P. and J.C.; funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

The present research was supported by the research fund of Dankook University in 2022, by the MSIT (Ministry of Science and ICT), Korea, under the Global Research Support Program in the Digital Field program (RS-2024-00428758) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation), and by the IITP grant funded by the Korea government (MSIT) (No. 2021-0-01475, (SW StarLab) Development of Novel Key-Value DB for Unstructured Bigdata).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

The authors appreciate all the reviewers and editors for their precious comments and work on this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Elhemali, M.; Gallagher, N.; Tang, B.; Gordon, N.; Huang, H.; Chen, H.; Idziorek, J.; Wang, M.; Krog, R.; Zhu, Z.; et al. Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service. In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC 22), Carlsbad, CA, USA, 11–13 July 2022; pp. 1037–1048.
  2. Sabitha, R.; Sydulu, S.J.; Karthik, S.; Kavitha, M. Distributed File Systems for Cloud Storage Design and Evolution. In Proceedings of the 2023 First International Conference on Advances in Electrical, Electronics and Computational Intelligence (ICAEECI), Tiruchengode, India, 19–20 October 2023; pp. 1–8.
  3. Microsoft. Azure Disk Storage. Available online: https://azure.microsoft.com/en-us/products/storage/disks/ (accessed on 24 November 2024).
  4. Google. Google Cloud Storage. Available online: https://cloud.google.com/ (accessed on 24 November 2024).
  5. Amazon. Hybrid Cloud Storage. Available online: https://aws.amazon.com/cn/products/storage/hybrid-cloud-storage/ (accessed on 24 November 2024).
  6. Dong, S.; Kryczka, A.; Jin, Y.; Stumm, M. Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience. In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 21), Virtual, 23–25 February 2021; pp. 33–49.
  7. Idreos, S.; Callaghan, M. Key-Value Storage Engines. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 2667–2672.
  8. LevelDB. LevelDB—A Fast and Lightweight Key/Value Database Library by Google. Available online: https://github.com/google/leveldb (accessed on 24 November 2024).
  9. Meta. RocksDB. Available online: https://github.com/facebook/rocksdb (accessed on 24 November 2024).
  10. O’Neil, P.; Cheng, E.; Gawlick, D.; O’Neil, E. The Log-Structured Merge-tree (LSM-tree). Acta Inform. 1996, 33, 351–385. [Google Scholar] [CrossRef]
  11. Li, C.; Chen, H.; Ruan, C.; Ma, X.; Xu, Y. Leveraging NVMe SSDs for Building a Fast, Cost-effective, LSM-tree-based KV Store. ACM Trans. Storage (TOS) 2021, 17, 1–29. [Google Scholar] [CrossRef]
  12. Cao, Z.; Dong, S.; Vemuri, S.; Du, D.H. Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST 20), Santa Clara, CA, USA, 24–27 February 2020; pp. 209–223. [Google Scholar]
  13. Dayan, N.; Idreos, S. Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; pp. 505–520. [Google Scholar]
  14. Jin, H.; Choi, W.G.; Choi, J.; Sung, H.; Park, S. Improvement of RocksDB Performance via Large-Scale Parameter Analysis and Optimization. J. Inf. Process. Syst. 2022, 18, 374–388. [Google Scholar]
  15. Yoo, S.; Shin, H.; Lee, S.; Choi, J. A Read Performance Analysis with Storage Hierarchy in Modern KVS: A RocksDB Case. In Proceedings of the 2022 IEEE 11th Non-Volatile Memory Systems and Applications Symposium (NVMSA), Taipei, Taiwan, 23–25 August 2022; pp. 45–50. [Google Scholar]
  16. Kaiyrakhmet, O.; Lee, S.; Nam, B.; Noh, S.H.; Choi, Y.R. SLM-DB: Single-Level Key-Value Store with Persistent Memory. In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST 19), Boston, MA, USA, 25–28 February 2019; pp. 191–205. [Google Scholar]
  17. Duan, Z.; Yao, J.; Liu, H.; Liao, X.; Jin, H.; Zhang, Y. Revisiting Log-Structured Merging for KV Stores in Hybrid Memory Systems. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, 25–29 March 2023; Volume 2, pp. 674–687. [Google Scholar]
  18. Yu, J.; Noh, S.H.; Choi, Y.R.; Xue, C.J. ADOC: Automatically Harmonizing Dataflow Between Components in Log-Structured Key-Value Stores for Improved Performance. In Proceedings of the 21st USENIX Conference on File and Storage Technologies (FAST 23), Santa Clara, CA, USA, 21–23 February 2023; pp. 65–80. [Google Scholar]
  19. Raju, P.; Kadekodi, R.; Chidambaram, V.; Abraham, I. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017; pp. 497–514. [Google Scholar]
  20. Balmau, O.; Dinu, F.; Zwaenepoel, W.; Gupta, K.; Chandhiramoorthi, R.; Didona, D. SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores. In Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC 19), Renton, WA, USA, 10–12 July 2019; pp. 753–766. [Google Scholar]
  21. Dayan, N.; Athanassoulis, M.; Idreos, S. Monkey: Optimal Navigable Key-Value Store. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; pp. 79–94. [Google Scholar]
  22. Lu, L.; Pillai, T.S.; Gopalakrishnan, H.; Arpaci-Dusseau, A.C.; Arpaci-Dusseau, R.H. WiscKey: Separating Keys from Values in SSD-Conscious Storage. ACM Trans. Storage (TOS) 2017, 13, 1–28. [Google Scholar] [CrossRef]
  23. Zhang, W.; Zhao, X.; Jiang, S.; Jiang, H. ChameleonDB: A Key-value Store for Optane Persistent Memory. In Proceedings of the Sixteenth European Conference on Computer Systems, Edinburgh, UK, 26–28 April 2021; pp. 194–209. [Google Scholar]
  24. Li, Y.; Liu, Z.; Lee, P.P.; Wu, J.; Xu, Y.; Wu, Y.; Tang, L.; Liu, Q.; Cui, Q. Differentiated Key-Value Storage Management for Balanced I/O Performance. In Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21), Virtual, 14–16 July 2021; pp. 673–687. [Google Scholar]
  25. Tang, C.; Wan, J.; Xie, C. FenceKV: Enabling Efficient Range Query for Key–Value Separation. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 3375–3386. [Google Scholar] [CrossRef]
  26. Kannan, S.; Bhat, N.; Gavrilovska, A.; Arpaci-Dusseau, A.; Arpaci-Dusseau, R. Redesigning LSMs for Nonvolatile Memory with NoveLSM. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 18), Boston, MA, USA, 11–13 July 2018; pp. 993–1005. [Google Scholar]
  27. Yao, T.; Zhang, Y.; Wan, J.; Cui, Q.; Tang, L.; Jiang, H.; Xie, C.; He, X. MatrixKV: Reducing Write Stalls and Write Amplification in LSM-tree Based KV Stores with Matrix Container in NVM. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC 20), Boston, MA, USA, 15–17 July 2020; pp. 17–31. [Google Scholar]
  28. Chen, H.; Ruan, C.; Li, C.; Ma, X.; Xu, Y. SpanDB: A Fast, Cost-Effective LSM-tree Based KV Store on Hybrid Storage. In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 21), Virtual, 23–25 February 2021; pp. 17–32. [Google Scholar]
  29. Intel. SPDK: Storage Performance Development Kit. Available online: https://spdk.io (accessed on 24 November 2024).
  30. Lu, Z.; Cao, Q.; Jiang, H.; Wang, S.; Dong, Y. p2KVS: A Portable 2-Dimensional Parallelizing Framework to Improve Scalability of Key-value Stores on SSDs. In Proceedings of the Seventeenth European Conference on Computer Systems, Rennes, France, 5–8 April 2022; pp. 575–591. [Google Scholar]
  31. Jaranilla, C.; Shin, H.; Yoo, S.; Cho, S.j.; Choi, J. Tiered Storage in Modern Key-Value Stores: Performance, Storage-Efficiency, and Cost-Efficiency Considerations. In Proceedings of the 2024 IEEE International Conference on Big Data and Smart Computing (BigComp), Bangkok, Thailand, 18–21 February 2024; pp. 151–158. [Google Scholar]
  32. Intel Corporation. Customer Support Options for Discontinued Intel® Optane™ Solid-State Drives and Modules. Available online: https://www.intel.com/content/www/us/en/support/articles/000024320/memory-and-storage.html (accessed on 24 November 2024).
  33. Chan, H.H.; Liang, C.J.M.; Li, Y.; He, W.; Lee, P.P.; Zhu, L.; Dong, Y.; Xu, Y.; Xu, Y.; Jiang, J.; et al. HashKV: Enabling Efficient Updates in KV Storage via Hashing. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 18), Boston, MA, USA, 11–13 July 2018; pp. 1007–1019. [Google Scholar]
  34. Song, Y.; Kim, W.H.; Monga, S.K.; Min, C.; Eom, Y.I. PRISM: Optimizing Key-Value Store for Modern Heterogeneous Storage Devices. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vancouver, BC, Canada, 25–29 March 2023; Volume 2, pp. 588–602. [Google Scholar]
  35. Zhao, G.; Shin, H.; Yoo, S.; Cho, S.j.; Choi, J. ThanosKV: A Holistic Approach to Utilize NVM for LSM-tree based Key-Value Stores. In Proceedings of the 2024 IEEE International Conference on Big Data and Smart Computing (BigComp), Bangkok, Thailand, 18–21 February 2024; pp. 143–150. [Google Scholar]
  36. Ren, Y.; Ren, Y.; Li, X.; Hu, Y.; Li, J.; Lee, P.P. ELECT: Enabling Erasure Coding Tiering for LSM-tree-based Storage. In Proceedings of the 22nd USENIX Conference on File and Storage Technologies (FAST 24), Santa Clara, CA, USA, 2024; pp. 293–310. [Google Scholar]
  37. Elnably, A.; Wang, H.; Gulati, A.; Varman, P.J. Efficient QoS for Multi-Tiered Storage Systems. In Proceedings of the HotStorage, Boston, MA, USA, 13–14 June 2012. [Google Scholar]
  38. Kakoulli, E.; Herodotou, H. OctopusFS: A Distributed File System with Tiered Storage Management. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; pp. 65–78. [Google Scholar]
  39. Karim, S.; Wünsche, J.; Broneske, D.; Kuhn, M.; Saake, G. Assessing Non-volatile Memory in Modern Heterogeneous Storage Landscape using a Write-optimized Storage Stack. Grundlagen von Datenbanken 2023. [Google Scholar]
  40. Meta. BlobDB. Available online: https://github.com/facebook/rocksdb/wiki/BlobDB (accessed on 24 November 2024).
  41. Cooper, B.F.; Silberstein, A.; Tam, E.; Ramakrishnan, R.; Sears, R. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, Indianapolis, IN, USA, 10–11 June 2010; pp. 143–154. [Google Scholar]
  42. Wang, X.; Jin, P.; Hua, B.; Long, H.; Huang, W. Reducing Write Amplification of LSM-Tree with Block-Grained Compaction. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Virtual, 9–12 May 2022; pp. 3119–3131. [Google Scholar]
  43. Lee, H.; Lee, M.; Eom, Y.I. SFM: Mitigating Read/Write Amplification Problem of LSM-Tree-Based Key-Value Stores. IEEE Access 2021, 9, 103153–103166. [Google Scholar] [CrossRef]
  44. Amazon.com. SAMSUNG 970 PRO SSD 1TB-M.2 NVMe Interface Internal Solid State Drive with V-NAND Technology (MZ-V7P1T0BW). Available online: https://www.samsung.com/us/computing/memory-storage/solid-state-drives/ssd-970-pro-nvme-m2-1tb-mz-v7p1t0bw/ (accessed on 24 November 2024).
  45. Amazon.com. Samsung SSD 860 EVO 1TB 2.5 Inch SATA III Internal SSD (MZ-76E1T0B/AM). Available online: https://www.samsung.com/sec/support/model/MZ-76E250B/KR/ (accessed on 24 November 2024).
Figure 1. Tiered storage consists of hierarchical heterogeneous storage devices, each offering different characteristics from the viewpoint of latency, capacity, and storage financial cost.
Figure 2. LSM-tree structure in RocksDB: (a) shows the conventional implementation on a single storage device, and (b) shows the LSM-tree in tiered storage.
Figure 3. KV separation in RocksDB: (a) shows the structure of BlobDB, which is stored on a single storage device by default, and (b) shows BlobDB in tiered storage.
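For reference, the integrated BlobDB layout in Figure 3a is enabled entirely through RocksDB options. The following minimal C++ sketch shows how KV separation is switched on; the database path and thresholds are illustrative assumptions, not necessarily the settings used in the experiments.

    #include <cassert>
    #include <string>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Integrated BlobDB: values at or above min_blob_size are written to
      // Blob files; keys and small values stay in the LSM-tree (SSTables).
      options.enable_blob_files = true;
      options.min_blob_size = 256;                    // illustrative threshold
      options.enable_blob_garbage_collection = true;  // reclaim space of stale values

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blobdb_example", &db);
      assert(s.ok());
      // This 1 KiB value exceeds min_blob_size, so it is stored in a Blob file.
      s = db->Put(rocksdb::WriteOptions(), "key1", std::string(1024, 'v'));
      assert(s.ok());
      delete db;
      return 0;
    }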
Figure 4. Vertical and horizontal alignment when applying heterogeneous storage to the LSM-tree with KV separation. The tiered storage system can be divided vertically, with the LSM-tree (keys) on one storage device and the Blob files (values) on another. It can also be divided horizontally, with the levels of the LSM-tree (keys) spread across multiple storage media together with their respective Blob files (values).
Figure 5. Latency analysis of RocksDB with NVMe SSD and SATA SSD: (a) presents the average latency for both write and read operations. (b,c) track the write and read performance over time, showing how the system behaves under sustained workloads on each storage device.
Figure 6. Analyzing the effects of compaction in the LSM-tree: (a) shows the difference in time spent on compactions, and (b) the WA across the levels of RocksDB and BlobDB. (c) shows the distribution of SSTables and file access patterns in RocksDB.
Figure 7. Read analysis during Workload C (read-only) of the YCSB experiment: (a) illustrates the accumulated data size read, distinguishing between key reads from the LSM-tree and value reads from Blob files. (b) shows the average latency of the read operations.
Figure 8. Our configurations of tiered storage in the LSM-tree with KV separation. Vertical alignment in (a) uses the NVMe SSD for the LSM-tree (keys) and the SATA SSD for Blob files (values). Horizontal alignment in (b) stores the topmost level (L0) on the NVMe SSD and the subsequent levels on the SATA SSD, with keys and values still separated but residing on the same storage device.
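The horizontal alignment in Figure 8b can be roughly approximated with stock RocksDB's db_paths option, which fills the listed paths in order up to each target size, so the upper (hotter) part of the tree tends to stay on the first, faster device. Vertical alignment additionally requires directing Blob files to a separate device, which stock RocksDB does not expose as a single option, so the C++ sketch below covers only the horizontal case; the mount points and target sizes are illustrative assumptions.

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.enable_blob_files = true;   // KV separation, as in Figure 8
      options.min_blob_size = 256;        // illustrative threshold

      // Horizontal alignment (sketch): the first path (NVMe SSD) holds data up
      // to its target size, so the upper levels tend to reside there, while the
      // bulk of the tree spills over to the second path (SATA SSD).
      options.db_paths.emplace_back("/mnt/nvme/kvstore", 8ULL << 30);    //   8 GiB on NVMe SSD
      options.db_paths.emplace_back("/mnt/sata/kvstore", 512ULL << 30);  // 512 GiB on SATA SSD

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/mnt/nvme/kvstore", &db);
      if (s.ok()) {
        delete db;
      }
      return 0;
    }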
Figure 9. Selective KV separation in tiered storage. We delay KV separation until L1. In other words, KV separation is not applied at L0 (both keys and values remain in the LSM-tree), while it is applied at the subsequent levels (keys in the LSM-tree and values in the Blob files).
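Recent RocksDB releases expose a blob_file_starting_level option that approximates the selective KV separation of Figure 9, keeping values inline above the given level and emitting Blob files only from that level downward. The snippet below is a sketch of this idea and may differ from the exact implementation evaluated here.

    #include <rocksdb/options.h>

    // Sketch: options approximating the selective KV separation of Figure 9.
    rocksdb::Options SelectiveKvSeparationOptions() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.enable_blob_files = true;  // KV separation is enabled overall
      options.min_blob_size = 256;       // illustrative threshold
      // Keep keys and values together at L0; write values to Blob files only
      // once data is compacted into L1 and deeper levels (cf. Figure 9).
      options.blob_file_starting_level = 1;
      return options;
    }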
Figure 10. We run the YCSB workloads to compare the performance of our configurations against RocksDB and BlobDB, each run separately on the NVMe and SATA SSDs. As discussed in Section 5.1, vertical alignment stores the LSM-tree (keys) on the NVMe SSD and Blob files (values) on the SATA SSD, while horizontal alignment stores L0 (keys and values) on the NVMe SSD and the remaining levels (L1 to LN) on the SATA SSD.
Figure 11. Average latency and per-level read latency: (a) illustrates the average latency of reads and writes to provide an overview of the general efficiency of each approach. (b) breaks down the read latency at the different levels as well as for Blob files.
Figure 12. Data storage analysis: (a) illustrates the difference in how data are allocated to the storage devices for the single-storage approaches (NVMe SSD or SATA SSD) and the tiered storage configurations (vertical and horizontal alignment). (b) shows how these configurations affect WA.
Figure 13. Performance analysis of vertical and horizontal alignment with selective KV separation. This illustrates how delaying KV separation to L1 affects overall throughput under the YCSB workloads.
Table 1. Comparison of KVS schemes and their storage configurations.
Scheme | Key Technique | Optimization | KV Separation | Storage
WiscKey (2017) [22] | KV separation | Write/Read amplification | Yes | SSD
HashKV (2018) [33] | Hash-based data grouping; Selective KV separation | Garbage collection | Yes | SATA SSD *
NoveLSM (2018) [26] | Byte-addressable Skiplist | (De)serialization | No | NVM, SATA SSD
MatrixKV (2020) [27] | Matrix Container | Write stall & amplification | No | NVM, SATA SSD
ChameleonDB (2021) [23] | Multi-shard structure | Performance | Yes | NVM *
SpanDB (2021) [28] | Asynchronous request processing; High-speed Logging via SPDK | Performance; Storage Cost | No | NVMe SSD *, SATA SSD *
DiffKV (2021) [24] | Merge Optimizations; Fine-grained KV separation | Performance; Storage Cost | Yes | SATA SSD
FenceKV (2022) [25] | Fence-based data grouping; Key-range garbage collection | Range Scan | Yes | SATA SSD
p2KVS (2022) [30] | Multiple KVS instances; Inter/Intra Parallelism | Performance; Portability | No | NVMe SSD *
PRISM (2023) [34] | Heterogeneous Storage Index Table | Scalability; Crash consistency | Yes | NVM, NVMe SSD *
Jaranilla et al. (2024) [31] | Tiered storage; Hybrid compression | Performance & storage space utilization trade-off | No | NVMe SSD *, SATA SSD
ThanosKV (2024) [35] | Hybrid compaction; NVM indexing | Write stall | Supported | NVM, SATA SSD
Our approach | Flexible storage alignments with Selective KV separation | Performance; Storage financial cost | Yes | NVMe SSD, SATA SSD
* Multiple instances of the same storage device.
Table 2. Experimental environment.
Component | Model/Specification
Hardware | Intel i7 processor with 16 cores; 32 GB DRAM; 1 TB Samsung V-NAND NVMe M.2 SSD 970 PRO; 250 GB Samsung 860 EVO SATA SSD
Operating System | Ubuntu 20.04.4 LTS (Focal Fossa); Linux kernel version 5.4
KVS | RocksDB 9.0.0
Table 3. Summary of YCSB workloads.
Operation | Load | A | B | C | D | E | F
Insert | 100% | - | - | - | - | - | -
Update | - | 50% | 5% | - | 5% | 5% | -
Read | - | 50% | 95% | 100% | 95% | - | 50%
Range Query | - | - | - | - | - | 95% | -
Read-Modify-Write | - | - | - | - | - | - | 50%
Table 4. Tail latency analysis.
All latencies in μs.
Configuration | Write P50 | Write P75 | Write P99 | Write P99.9 | Read P50 | Read P75 | Read P99 | Read P99.9
RocksDB on NVMe SSD | 5.96 | 8.33 | 14.7 | 20.91 | 4.21 | 4.53 | 149.64 | 533.9
RocksDB on SATA SSD | 10.39 | 14.42 | 37.34 | 70.24 | 7.32 | 9.13 | 364.84 | 2689.05
BlobDB on NVMe SSD | 6 | 8.42 | 16.26 | 26.1 | 7.38 | 20.38 | 148.2 | 240.04
BlobDB on SATA SSD | 8.56 | 11.6 | 22.82 | 33.72 | 9.48 | 23.92 | 219.64 | 376.86
Vertical alignment | 7.84 | 10.14 | 48.01 | 72.75 | 8.47 | 24.28 | 241.13 | 1882.26
Horizontal alignment | 6.09 | 8.59 | 29.98 | 98.22 | 4.21 | 4.95 | 179.36 | 358.97
Table 5. Summary of storage devices used in our experiments.
Model | Sequential Read/Write | Random Read/Write (4K Blocks) | Price
970 PRO NVMe® M.2 SSD [44] | 3500 MBps / 2700 MBps | 15,000–500,000 IOPS / 55,000–500,000 IOPS | USD 399.99 (USD 0.39/GB)
860 EVO SATA 2.5” SSD [45] | 550 MBps / 520 MBps | 10,000–98,000 IOPS / 42,000–90,000 IOPS | USD 199.99 (USD 0.19/GB)
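As a back-of-the-envelope illustration (our own arithmetic, not a measured result), the per-GB prices in Table 5 indicate the cost benefit of tiering: if a fraction f of the stored data resides on the SATA SSD and the remainder on the NVMe SSD, the effective storage cost is C_eff = f × 0.19 + (1 − f) × 0.39 USD/GB. For f ≈ 0.96, this gives C_eff ≈ 0.198 USD/GB, roughly 49% below the 0.39 USD/GB of an NVMe-only deployment, in line with the storage financial cost reduction reported in this work.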
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
