This section first gives an overview of InK, and then explains how InK handles various types of key-value operations and concerns. To take full advantage of the byte-addressable, over-writable address space, we designed InK around a B+ tree and optimized it for efficient concurrent accesses. At the same time, we considered the characteristics of the non-volatile memory device to guarantee the crash consistency of the stored key-value data.
3.1. Managing Address Space on DCPM
DCPM provides two modes for utilizing its persistent address space. One of the modes is Memory mode, in which DCPM presents its vast address space to the system as regular memory. The system can utilize DCPM as a large, energy-efficient main memory module without modification. However, data stored in Memory mode is not persistent (i.e., it is destroyed upon a power cycle), and the memory access latency varies as the internal memory controller (IMC), which controls DCPM, migrates data back and forth between DCPM and main memory [6].
The other mode is the so-called App Direct mode [6]. In this mode, the IMC provides the system with a separate, persistent address range that the system can explicitly manage. In practice, the Linux kernel manages the persistent address range similarly to a device-mapped memory region; the address range exists, but the system does not use it for serving the virtual memory of processes. Instead, the address range should be mapped into an address space and then accessed using ordinary memory instructions such as load and store.
InK utilizes DCPM in App Direct mode. InK maps the entire memory region of DCPM into the kernel address space. The mapping address of the DCPM memory region is not fixed at a particular address but can change across system reboots. Thus, InK represents data in DCPM with an offset relative to the start of the mapped region; InK calculates the location by adding the offset to the starting address, and directly accesses the target address through ordinary memory instructions. Since DCPM allows in-place updates, modifications can be applied directly to the address space.
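This offset-based addressing can be sketched as follows. The names (`ink_base`, `pmem_ptr`, `pmem_off`) are our own illustration rather than InK's actual identifiers, and the sketch runs in user space instead of the kernel:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Base virtual address of the mapped DCPM region; it is obtained anew
 * after every reboot, so it must never be stored persistently. */
uint8_t *ink_base;

/* Persistent structures record offsets, never raw pointers. */
typedef uint64_t pmem_off_t;

/* Translate a persistent offset into a live virtual address. */
void *pmem_ptr(pmem_off_t off)
{
    return ink_base + off;
}

/* Translate a live virtual address back into a persistent offset. */
pmem_off_t pmem_off(const void *p)
{
    return (pmem_off_t)((const uint8_t *)p - ink_base);
}
```

Because only offsets are persisted, the data remains valid no matter where the region lands in the address space after a reboot.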
InK partitions the non-volatile address space on DCPM into three areas, and Figure 2 illustrates them. The first area, fixed at the beginning of the DCPM address space, is called the metadata area; it stores the metadata of InK such as the address of the B+ tree root node, the information for managing the DCPM address space, the descriptor for the logging area, and so forth. During system initialization, InK reconstructs the respective in-memory data structures by reading the data from the metadata area. Moreover, InK uses the logging descriptor to identify an unexpected crash of the system.
The second area is for storing the actual data of key-value pairs and their index. It comes right after the metadata area and occupies the majority of the space on DCPM. Basically, keys and values are stored in this area as contiguous chunks. To accelerate lookups of stored key-value pairs, InK maintains a B+ tree instance in this area. The tree indexes all key-value pairs in InK. The leaf nodes of the tree are composed of index keys and value pointers, which store the locations of the chunks holding the actual keys and values. The usage of the area is managed by the space allocation information stored in the metadata area (see Section 3.4).
The last area is for logging changes in InK. Internally, changes to the B+ tree index and key-value data are updated in place on the persistent address space of DCPM. This in-place update approach is a double-edged sword: it allows InK to leverage the byte addressability of DCPM, but it complicates the case where updates are interrupted in the middle of an operation. Thus, it is essential for InK to detect such a crash and recover from the transient, inconsistent state. Combined with the logging descriptor in the metadata area, InK detects such an inconsistent state and recovers from it (details are discussed in Section 3.5).
The sizes of these areas can be set while initializing the InK instance.
3.2. Indexing Key-Value Pairs
Many state-of-the-art key-value store systems [11,12,13,14] internally index key-value pairs with the log-structured merge (LSM) tree [15]. As we explained in Section 2.2, the LSM tree can maximize I/O performance by leveraging the superior sequential I/O performance of block devices. The LSM tree has, however, inherent shortcomings when integrated on top of a byte-addressable, over-writable address space [30].
First, LSM tree operations are designed assuming I/O at a block granularity. Small updates are inevitably merged in memory (i.e., in the memtable) and written to storage devices at the block granularity (i.e., as sstables). Reads from the storage devices are also performed at the same block granularity. This amplifies the read and write traffic and slows down I/O. Second, the LSM tree is designed to leverage performance characteristics of block devices, but many of them no longer hold on byte-addressable non-volatile memory. For instance, the LSM tree performs out-of-place updates to leverage sequential write performance. However, random writes to DCPM show latency comparable to that of sequential writes. Updates can be applied directly to DCPM, allowing small in-place updates. Moreover, lifespan is not as significant an issue as it is on flash memory-based devices. Third, reads from the LSM tree exhibit a long tail latency. Since items are stored in a hierarchical order, the LSM tree needs to scan many sstables to access cold items, incurring multiple accesses to the storage device and extending read time. Lastly, compaction has been problematic in the LSM tree. Compaction is the process of cleaning outdated/deleted data in the LSM tree [31]. To do so, the LSM tree reads sstables, merges them, and produces new sstables. This process generates heavy I/O traffic in the background [32,33], interfering with foreground key-value services.
Based on these observations, we opted to implement InK based on the B+ tree. As one of the most traditional and popular in-memory data structures for indexing items, the B+ tree does not have these shortcomings. As a balanced tree, it keeps all items at the same distance from the root node, thereby providing consistent and bounded lookup time. Since the DCPM address space can be directly accessed through load and store instructions, we can access and update key-value data in place without amplifying I/O traffic. Moreover, the B+ tree can continue operating without depending on critical background processes.
Figure 3 illustrates the layout of index nodes and leaf nodes of the B+ tree with order 5. An index node has five pointers pointing to five child nodes, and four pointers are placed between these child pointers. These four pointers store the locations of the index keys (thus one index node is effectively composed of nine pointers). The index key is not inlined in the node but stored separately in memory. This layout is not generally used on block-based devices since such indirection causes additional I/Os to access keys, significantly degrading overall performance. The indirection is, however, not so costly on byte-addressable DCPM, and it even helps InK use large fan-out degrees for tree nodes.
Leaf nodes are organized in a similar way. Each leaf node contains eight pointers; four of them point to the addresses of values and four of them to keys. The value pointer on the left of an index key points to the location of the value for the corresponding index key. The rightmost pointer is always set to null in leaf nodes.
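Under this description, the node layouts could be modeled as below. This is our reading of Figure 3 (struct and field names are hypothetical), with keys referenced through pointers rather than inlined:

```c
#include <stdint.h>

#define INK_ORDER 5            /* fan-out of the tree in Figure 3 */

typedef uint64_t pmem_off_t;   /* offset into the mapped DCPM region */

/* Index node: five child pointers interleaved with four key pointers,
 * nine pointers in total. Keys are not inlined; each key pointer
 * refers to a chunk holding the actual index key elsewhere. */
struct ink_index_node {
    pmem_off_t child[INK_ORDER];     /* subtrees */
    pmem_off_t key[INK_ORDER - 1];   /* out-of-line index keys */
    int nkeys;                       /* number of valid index keys */
};

/* Leaf node: the same shape, but the slots that hold child pointers
 * in an index node hold value pointers here; value[i] is the value of
 * key[i], and the rightmost slot (value[INK_ORDER - 1]) stays null. */
struct ink_leaf_node {
    pmem_off_t value[INK_ORDER];
    pmem_off_t key[INK_ORDER - 1];
    int nkeys;
};
```

Keeping both node kinds the same size simplifies allocating them from a single size class of the space manager.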
3.3. Handling Basic Key-Value Operations
InK maintains a B+ tree for indexing the key-value pairs stored on the non-volatile address space. When an application queries the value for a key through an InK system call, InK looks up the requested key in the B+ tree index. Starting from the root node, InK finds the proper index key and subtree within the node, using binary search to accelerate the search. If InK finds a proper subtree in the node, it continues searching for the requested key in the pointed-to subtree. This procedure is repeated until the search reaches a leaf node. If the leaf node contains the requested key, InK returns the value that the corresponding value pointer in the leaf node points to. If the key is not in the leaf node, the requested key does not exist in InK, and InK returns an error code (-ENOENT) indicating the situation.
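The per-node binary search and the -ENOENT miss path can be illustrated with a minimal leaf lookup. Keys are inlined as strings here for brevity, although InK actually stores them out of line, and all names are our own:

```c
#include <errno.h>
#include <string.h>

#define LEAF_KEYS 4

struct leaf {
    const char *key[LEAF_KEYS];     /* kept sorted */
    const char *value[LEAF_KEYS];
    int nkeys;
};

/* Binary search over the sorted keys of a node, as InK performs at
 * each level. On a hit, *out receives the value pointer and 0 is
 * returned; on a miss, -ENOENT is returned, mirroring InK's error
 * code for a nonexistent key. */
int leaf_get(const struct leaf *l, const char *key, const char **out)
{
    int lo = 0, hi = l->nkeys - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, l->key[mid]);
        if (cmp == 0) {
            *out = l->value[mid];
            return 0;
        }
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -ENOENT;
}
```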
When a write for a key is requested, InK starts traversing the index tree similarly to a key-value lookup. When it reaches a leaf node and the requested key is not in it, the key-value pair for the requested key does not exist in InK. In this case, InK allocates memory chunks for the key and value from the data area (see Section 3.4 for space management), copies the requested key and value into the allocated chunks, and inserts them at the appropriate position in the leaf node. To keep the index keys in the node sorted, InK finds the position for the new key through binary search, shifts the index keys and value pointers from that position, and then puts the newly allocated key-value pair in place. If the key in the write request already exists in the tree index, InK replaces the old value with the new one. If the size of the value changes, InK allocates a new memory chunk for the updated value, fills it with the new value, and replaces the value pointer of the key. The memory chunk storing the previous value is reclaimed through the memory management mechanism of InK. When the size of the value does not change, InK just copies the new value into the existing chunk. Note that the leaf node containing the requested key is not changed when updating a key-value pair with a same-sized value, so InK can omit logging the leaf node. Since the value size for a key does not change frequently [34,35], this approach effectively improves the overall performance of InK.
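The two update paths for an existing key can be sketched as follows; `malloc`/`free` stand in for InK's own space manager over DCPM (Section 3.4), and the names are our own:

```c
#include <stdlib.h>
#include <string.h>

struct kv {
    char *value;    /* chunk holding the value */
    size_t vlen;    /* current value size */
};

/* Update the value of an existing pair. A same-sized value is copied
 * over the old one in place, leaving the leaf node untouched (so its
 * logging can be skipped); a different size gets a fresh chunk and
 * the value pointer is swapped, with the old chunk reclaimed. */
int kv_update(struct kv *pair, const void *val, size_t vlen)
{
    if (vlen == pair->vlen) {
        memcpy(pair->value, val, vlen);   /* fast path: in-place */
        return 0;
    }

    char *chunk = malloc(vlen);           /* new chunk for the new size */
    if (!chunk)
        return -1;
    memcpy(chunk, val, vlen);
    free(pair->value);                    /* reclaim the old chunk */
    pair->value = chunk;
    pair->vlen = vlen;
    return 0;
}
```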
It is crucial for a key-value store system to efficiently control concurrent accesses to the stored key-value pairs so that applications using the system can maximize their aggregate I/O throughput. InK uses fine-grained reader-writer locks (i.e., rw_semaphore in the Linux kernel). Each B+ tree node contains a reader-writer lock instance that serializes concurrent accesses to the data in the node. In general, a context accessing the B+ tree index starts from the root node and moves toward the leaf nodes, comparing the key with the index keys. InK starts accessing the index by grabbing the lock of the root node. For operations that may alter the index, InK grabs the writer lock, whereas it takes the reader lock for read-only operations, allowing multiple readers to access the index concurrently. To descend toward the leaf nodes, InK grabs the same type of lock on the next level while holding the lock of the current node. The lock of the current node is released when it is guaranteed that further processing of the operation will not modify the node. In this way, multiple get requests can be processed simultaneously, whereas write requests lock only a part of the index, thereby maximizing concurrent access to the index.
Figure 4 shows the benefit of the fine-grained locking mechanism for a B+ tree. With a coarse-grained lock, only one thread can access the tree while the other threads are blocked on the lock of the root node. In contrast, the fine-grained locking mechanism allows multiple threads to access different parts of the tree concurrently.
Applying the proposed locking mechanism to a B+ tree seems straightforward. However, controlling concurrent access becomes challenging when a node needs to be split, possibly even increasing the tree height. When a leaf node becomes full, it needs to be split to accommodate more keys that fall into the index key range the node covers. The index key at the middle of the node is inserted into its parent node, and the node is split into two nodes, so that one contains the index keys smaller than the middle index key while the other contains the rest. The split nodes are attached to the left and right of the index key newly inserted into the parent node. Since the split inserts an index key into the parent node, internal index nodes can become full just as leaf nodes do, triggering splits of index nodes. Thus, a node split first takes place at a leaf node and is then recursively propagated toward the root node as long as nodes keep becoming full. However, this direction of node splitting, climbing up from the leaf nodes toward the root, is opposite to the direction of request processing, which descends from the root node to the leaf nodes, complicating concurrent access control in the B+ tree. Worse, since a node split modifies at least three nodes (the current node, the newly allocated node, and the parent node), the split must be controlled so that other contexts do not access nodes in transition.
To handle node splits efficiently, we propose a lazy split scheme for the B+ tree. The key idea is to postpone the split of a full node until the node is accessed again. In the original B+ tree, a node is split immediately when it becomes full; thus, a split at a leaf node can trigger a chain of node splits up to the root node. In contrast, InK splits nodes lazily. When a node becomes full, InK leaves it unchanged until the node is traversed again. As stated above, InK grabs the lock of the child node while holding the lock of the current node when traversing the tree from the root toward the leaf nodes. If the child node is not full, InK releases the lock of the current node and continues searching for the target index key. If the child node is full, InK splits it while holding the locks of both the current node and the child node. The middle index key of the child node is inserted into the current node, and the child node is split into two nodes (in fact, InK makes two child nodes by allocating a new node and moving half of the index keys from the original node to the new one). All nodes influenced by the split are protected by the locks; therefore, the split can be performed without causing concurrency issues. After the split, InK releases the lock of the current node and continues traversing into the proper child node. This approach unifies the node access direction of normal operations and node splits, thereby simplifying concurrency control. Moreover, one write operation splits at most one node at each level it visits, which contributes to steady performance for write operations.
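A lazy split performed on the way down could look like the following sketch. Locking is elided (both nodes would be write-locked as described above), keys are integers, and the promotion is B-tree-style; in InK's B+ tree, a leaf split would additionally keep the separator key in a leaf:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_KEYS 4    /* an order-5 node holds at most four keys */

struct bnode {
    int nkeys;
    int key[MAX_KEYS];
    struct bnode *child[MAX_KEYS + 1];
    int is_leaf;
};

/* Split the full child at slot `idx` of `parent`, on the way down.
 * The middle key is promoted into the parent, and a newly allocated
 * right sibling takes the upper half of the keys. */
void split_child(struct bnode *parent, int idx)
{
    struct bnode *left = parent->child[idx];
    struct bnode *right = calloc(1, sizeof(*right));
    int mid = MAX_KEYS / 2;

    right->is_leaf = left->is_leaf;
    right->nkeys = MAX_KEYS - mid - 1;
    memcpy(right->key, &left->key[mid + 1], right->nkeys * sizeof(int));
    if (!left->is_leaf)
        memcpy(right->child, &left->child[mid + 1],
               (right->nkeys + 1) * sizeof(right->child[0]));
    left->nkeys = mid;

    /* Make room in the parent, then insert the promoted key and the
     * new right sibling next to the original child. */
    memmove(&parent->key[idx + 1], &parent->key[idx],
            (parent->nkeys - idx) * sizeof(int));
    memmove(&parent->child[idx + 2], &parent->child[idx + 1],
            (parent->nkeys - idx) * sizeof(parent->child[0]));
    parent->key[idx] = left->key[mid];
    parent->child[idx + 1] = right;
    parent->nkeys++;
}
```

Because the split touches only the parent and child that are already locked by the descent, no context can observe the nodes in transition.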
In practice, the majority of key-value operations in key-value stores are get, put, and update operations, and deletion is uncommon [34]. We designed InK to handle deletion in an optimistic, lazy way and focused on making the common cases fast. When an application asks InK to delete a key-value pair, InK finds the corresponding index key and value pointer in the B+ tree index. If such a key-value pair exists in a leaf node of the B+ tree, InK reclaims the space storing the actual key and value of the pair and modifies the pointers in the corresponding leaf node. The update takes place only at the leaf node; index nodes are not modified to remove key-value pairs. Moreover, unlike the original B+ tree, InK does not actively merge nodes with low utilization. Instead, InK can merge such nodes in the background, triggered during an idle period of the system. Since this operation is not critical to the correctness of the system, InK can perform it only while the system is idle so that it does not disturb system performance. As storage systems are known to have a plentiful amount of idle time, we believe this approach does not harm the overall system performance of InK.
3.4. Managing DCPM Address Space
The memory manager in InK manages the key-value area in the DCPM address space, which stores the actual key-value data and B+ tree nodes. Many studies analyzing key-value store systems in production commonly report that the majority of keys and values are very small, and that their sizes are highly skewed toward only a few sizes [34,35]. If InK managed the space at a large, fixed-size granularity, such as pages or blocks, it would suffer from high internal fragmentation since small keys and values are common. On the other hand, employing a general, sophisticated memory management scheme in the kernel increases the code base size, which is undesirable for the stability and security of operating systems, and cannot exploit the characteristics of key-value data. We designed the memory manager considering these characteristics and design constraints so that InK can manage the huge non-volatile address space at a low overhead.
The memory manager manages the address space of DCPM with two data structures: the allocation pointer and the free space lists. Figure 5 illustrates the key mechanisms that InK employs to manage the address space. The allocation pointer points to the address of free memory; initially, it points to the start address of the key-value area in Figure 5. InK may allocate space from the address that the allocation pointer points to. After allocating the space, the allocation pointer is advanced by the allocated size, so that it always points to the start address of the free space.
The free space lists are an array of lists, as illustrated in Figure 5. The number of list array entries is 10 by default, which is configurable during InK instance initialization. The list at index i links the free space chunks whose size is 2^(4+i) bytes; for example, the third list (index 2) links free chunks of 2^6 = 64 bytes. A pointer is embedded at the beginning of each free chunk, pointing to the next available free chunk. When InK needs a free chunk, it looks up the free space list that corresponds to the requested size. If a free chunk is available in the corresponding free space list, it is detached from the list. If no free chunk corresponds to the requested size, InK allocates a new chunk at the allocation pointer as explained above. When InK frees a space, it is attached to the head of the corresponding free space list.
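The allocation-pointer and free-space-list mechanisms can be sketched as below. The class mapping assumes zero-based list indices (list i serves chunks of 2^(4+i) bytes), a DRAM buffer stands in for the DCPM key-value area, and all identifiers are our own:

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_CLASSES 10                           /* default; configurable in InK */
#define CLASS_SIZE(i) ((size_t)1 << (4 + (i)))   /* 16 B ... 8 KiB */

uint8_t *area;                   /* the key-value area (DCPM in InK) */
size_t alloc_ptr;                /* the bump-style allocation pointer */
void *free_list[NUM_CLASSES];    /* one free chunk list per size class */

/* Map a requested size to the smallest size class that fits it. */
int class_of(size_t size)
{
    for (int i = 0; i < NUM_CLASSES; i++)
        if (size <= CLASS_SIZE(i))
            return i;
    return -1;                   /* larger requests not handled here */
}

void *ink_alloc(size_t size)
{
    int c = class_of(size);
    if (c < 0)
        return NULL;
    if (free_list[c]) {          /* reuse a freed chunk of this class */
        void *p = free_list[c];
        free_list[c] = *(void **)p;   /* next pointer embedded in chunk */
        return p;
    }
    void *p = area + alloc_ptr;  /* otherwise bump the allocation pointer */
    alloc_ptr += CLASS_SIZE(c);
    return p;
}

void ink_free(void *p, size_t size)
{
    int c = class_of(size);
    *(void **)p = free_list[c];  /* push onto the head of the class list */
    free_list[c] = p;
}
```

Embedding the next pointer inside the free chunk itself means the lists consume no extra space beyond the array of list heads.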
Our evaluation of memory management shows that this approach inevitably incurs external fragmentation. However, due to the workload characteristics of key-value data, the external fragmentation remained at a manageable level. Specifically, reclaimed space is immediately reused for subsequent requests of the same size, leaving less than 0.2% of the space in the free lists throughout the evaluation.
3.5. Consistency
Unlike flash memory, DCPM allows in-place updates, and InK leverages this feature. However, DCPM guarantees the atomicity of operations only at the cache line granularity; neither the ordering between cache lines nor the atomicity of a larger area is guaranteed. This complicates the situation when a B+ tree node modification in InK is interrupted in the middle of an operation by a system crash. Suppose a B+ tree node has become full and is about to be split by the lazy split scheme. The split requires inserting a new index key into the current node and populating two nodes, each of which contains half of the index keys of the original child node. To keep the B+ tree consistent and durable, these changes should be applied to the B+ tree index atomically. However, the DCPM architecture does not provide a way to atomically update multiple locations.
To provide the required consistency and durability on the DCPM address space, InK employs a logging technique combined with in-place updates. Before applying a modification to key-value data or B+ tree nodes, InK copies the original data to the logging area in the DCPM address space. The copied data includes the allocation pointer, the free space lists for memory management, the previous key-value data in the chunks, and the data in the tree nodes. InK maintains a logging descriptor stored in the metadata area. The logging descriptor initially points to the start address of the logging area and is then adjusted to point to the end of the ongoing log. InK copies the data to be preserved to the location the logging descriptor points to, and then moves the logging descriptor to the end of the copied data. Only after copying all the data to be changed does InK start applying the changes to their original locations. When the changes are completely applied, InK resets the logging descriptor to the beginning of the logging area. Whenever the value of the logging descriptor changes, the value is persisted to the metadata area. Consequently, if the logging descriptor does not point to the start address of the logging area, there was an unexpected interruption while applying changes. In this case, InK recovers from the inconsistent state by restoring the original values from the log. After restoring the original values, InK resets the logging descriptor to indicate that InK is fully recovered. InK can also recover from another failure during the recovery, since copying the original values again is an idempotent operation.
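The undo-logging protocol and its recovery path can be sketched as follows. `pmem_persist` is a stand-in for the clwb-based flushing described below (a no-op here so the sketch runs anywhere), and all names and the record format are our own:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the clwb-based flush InK uses; a no-op in this sketch. */
void pmem_persist(const void *p, size_t len) { (void)p; (void)len; }

#define LOG_SIZE 4096
_Alignas(8) uint8_t log_area[LOG_SIZE];   /* the logging area */
size_t log_desc;                          /* logging descriptor: 0 == log empty */

struct log_rec {
    void *orig;                 /* where the saved data came from */
    size_t len;
};

/* Undo-log a region before modifying it in place: save the old
 * contents, persist them, then advance and persist the descriptor. */
void ink_log(void *orig, size_t len)
{
    struct log_rec *r = (struct log_rec *)&log_area[log_desc];

    r->orig = orig;
    r->len = len;
    memcpy(r + 1, orig, len);
    pmem_persist(r, sizeof(*r) + len);
    log_desc += sizeof(*r) + len;
    pmem_persist(&log_desc, sizeof(log_desc));
}

/* Once all in-place updates are durably applied, discard the log. */
void ink_commit(void)
{
    log_desc = 0;
    pmem_persist(&log_desc, sizeof(log_desc));
}

/* On restart, a nonzero descriptor means a crash happened mid-update:
 * restore every logged region. Re-running this is idempotent, so a
 * crash during recovery is also tolerated. */
void ink_recover(void)
{
    size_t off = 0;

    while (off < log_desc) {
        struct log_rec *r = (struct log_rec *)&log_area[off];
        memcpy(r->orig, r + 1, r->len);
        off += sizeof(*r) + r->len;
    }
    ink_commit();
}
```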
When InK writes data to the DCPM address space, it should consider the side effects of processor caches. Specifically, due to the memory hierarchy of the system, updates to memory are first applied to the cache, which is volatile, and only later written back to the non-volatile DCPM. This may cause a partial update of the DCPM address space, leading the system to an inconsistent state [36]. InK prevents this by leveraging architectural support. Intel introduced the clwb instruction, which makes the processor write the specified cache line back to memory [37]. When InK updates data in the DCPM address space, it ensures that the updates are committed to DCPM by invoking clwb for the updated memory range.
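A persist routine following this pattern is sketched below. The cache-line flush and store fence are stubbed out so the sketch runs anywhere; on real hardware they would be the clwb instruction (e.g., the `_mm_clwb` intrinsic) and sfence, and the stub only counts flushed lines to make the line-rounding of the loop visible:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHELINE 64

int lines_flushed;   /* instrumentation for this sketch only */

/* On real hardware: clwb on the cache line containing p. */
void flush_line(const void *p)
{
    (void)p;
    lines_flushed++;
}

/* On real hardware: sfence, ordering the write-backs. */
void store_fence(void) { }

/* Flush every cache line covering [addr, addr + len) and fence, so
 * that the stores durably reach DCPM before execution proceeds. */
void pmem_persist(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        flush_line((const void *)p);
    store_fence();
}
```

Rounding the start address down to a cache-line boundary ensures that a range straddling line boundaries is flushed in its entirety.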