Article

Efficient Key-Value Data Placement for ZNS SSD

School of Computer Science and Engineering, Pusan National University, 2, Busandaehak-ro 63beon-gil, Geumjeong-gu, Busan 46241, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(24), 11842; https://doi.org/10.3390/app112411842
Submission received: 20 November 2021 / Revised: 3 December 2021 / Accepted: 8 December 2021 / Published: 13 December 2021
(This article belongs to the Special Issue System Software Issues in Future Computing Systems)

Abstract

Log-structured merge-tree (LSM-Tree)-based key–value stores are attracting attention for their high I/O (Input/Output) performance, which stems from their sequential write characteristics. However, the excessive writes caused by compaction shorten the lifespan of the Solid-State Drive (SSD). Therefore, several studies have aimed to reduce garbage collection overhead by using the Zoned Namespace (ZNS) SSD, in which the host can determine data placement. However, the existing studies have limitations in terms of performance improvement because the lifetime and hotness of key–value data are not considered. In this paper, we propose a technique that improves the space efficiency and minimizes the garbage collection overhead of the SSD by placing data according to the characteristics of key–value data. The proposed method was implemented by modifying ZenFS of RocksDB and, according to the results of the performance evaluation, the space efficiency could be improved by up to 75%.

1. Introduction

Recently, key–value stores have been widely used in emerging technologies (e.g., deep learning, blockchain) [1,2] due to benefits such as high performance, flexibility and simplicity [3]. In particular, log-structured merge-tree (LSM-Tree)-based key–value stores such as RocksDB and LevelDB can achieve high Input/Output (I/O) performance on NAND flash memory-based Solid-State Drives (SSDs) due to the sequential write characteristics caused by out-of-place updates [4,5]. However, a large number of writes occur during compaction, which can reduce the performance and lifespan of the SSD [6]. In addition, in the case of legacy block interface SSDs, the garbage collection of the flash translation layer (FTL) further increases write amplification [7].
ZNS (Zoned Namespace) SSDs use a new zone-based interface that divides the NAND flash memory area into zones of a fixed size and allows the host to manage the zones directly [8]. Therefore, a ZNS SSD can minimize garbage collection overhead and write amplification, because data placement is performed directly by the host [9,10,11]. Currently, research is being conducted to minimize the write amplification of LSM-Trees using ZNS SSDs.
In the case of RocksDB, the most popular key–value store, developed by Facebook, ZNS SSDs can be used through ZenFS, a simplified file system for ZNS [12]. However, the current ZenFS does not effectively utilize the information available from RocksDB; therefore, files with different hotness are written to the same zone and space is used inefficiently. Moreover, because the current ZenFS does not support garbage collection, RocksDB cannot manage a large key–value set on ZNS SSDs.
Therefore, in this paper, we propose a method to quickly secure empty zones during compaction through zone-based data placement optimized for LSM-Tree-based key–value stores. In addition, a garbage collection technique that considers the lifetime of the data is proposed to minimize write amplification. The proposed method was implemented by modifying the ZenFS of RocksDB [13] and the performance was evaluated using an emulated ZNS SSD in Quick Emulator (QEMU) 5.1.0 [14]. As a result of the performance evaluation, the space efficiency could be improved by up to 75%. We were also able to improve the performance by up to 10% using the improved garbage collection algorithm.
The rest of the paper is structured as follows: In Section 2, we briefly describe RocksDB and ZNS SSD and, in Section 3, we provide the motivation behind this paper. In Section 4, we propose the zone allocation method according to the hotness and the garbage collection logic considering the lifetime of the file. The implementation issues are discussed in Section 5. Section 6 shows the evaluation results of the proposed methods, and we conclude the paper in Section 7.

2. Background

2.1. Zoned Namespace Solid-State Drive (SSD)

The zoned namespace SSD was proposed based on the ZAC/ZBC standards [15] for SMR (Shingled Magnetic Recording) Hard Disk Drives (HDDs) and the open-channel SSD [16]. Unlike in the conventional HDD, the tracks in an SMR HDD are not separated but overlapped. Although this provides greater capacity, the SMR HDD cannot serve in-place updates, and disallowing in-place updates prevents the HDD from using the legacy block layer. Because of this, new I/O command sets were proposed for the SMR HDD, named zoned ATA commands (ZAC) and zoned block commands (ZBC).
Each standard splits the device space into units called zones. Each zone’s I/O is independent from the others and a zone only accepts sequential writes. The standards propose drive-managed, host-managed and host-aware methods for managing zones efficiently [17]. The drive-managed method uses firmware inside the device, similar to the flash translation layer of an SSD, to support the legacy block interface. The host-managed method enables the host to manage zones directly and only allows sequential writes within a zone. Because of this, the host can place data at an appropriate position on the disk by using host-side information, which enables the disk to fully utilize its performance. The host-aware method is a hybrid of the drive-managed and host-managed methods; therefore, it has no restrictions on writing data in a zone and the host can manage the status of each zone. However, this method sometimes generates long-tail latencies [18].
For this reason, the drive-managed method is preferred in ordinary environments, whereas the host-managed method is preferred in environments requiring extreme performance. For example, a disk with the host-managed method can achieve greater key–value performance than one with the drive-managed method [17].
Because of these advantages, the zoned namespace (ZNS) interface was developed as an extension of ZAC/ZBC. A ZNS SSD is an SSD that exposes a zoned namespace interface through NVMe. As can be seen in Figure 1, a ZNS SSD abstracts its space into units named zones. The host utilizes these abstracted zones to manage and serve I/O. This means that the SSD needs only a very simple FTL, or no FTL at all. As in the ZAC/ZBC specifications, a ZNS SSD zone normally permits only sequential writes within its space. The sequential write restriction enables nameless writes in the ZNS SSD, which dramatically reduces the size of the logical-to-physical (L2P) mapping table. These advantages enhance the SSD performance [19].
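To make the zone abstraction concrete, the following minimal C++ sketch models a zone with a write pointer and enforces the sequential-write rule; the structure and names (Zone, Append) are illustrative assumptions, not the actual ZNS or ZenFS data structures.

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative model of a ZNS zone: writes are only accepted at the
// current write pointer (wp) and advance it sequentially.
struct Zone {
    uint64_t start;     // first byte address of the zone
    uint64_t capacity;  // writable bytes in the zone
    uint64_t wp;        // write pointer, relative to start

    Zone(uint64_t start_addr, uint64_t cap) : start(start_addr), capacity(cap), wp(0) {}

    // Sequential append: the host never chooses an offset inside the zone,
    // which is what removes the need for a fine-grained L2P mapping table.
    uint64_t Append(uint64_t nbytes) {
        if (wp + nbytes > capacity)
            throw std::runtime_error("zone full: reset required before reuse");
        uint64_t written_at = start + wp;
        wp += nbytes;
        return written_at;  // device-side location of the appended data
    }

    void Reset() { wp = 0; }              // whole-zone erase, the only way to reclaim space
    bool Full() const { return wp == capacity; }
};
```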
Moreover, a ZNS SSD enables the host to predict the SSD’s latencies. A conventional SSD has a complex FTL and behaves as a complete black box controlled by the vendor. However, because a ZNS SSD requires only a very simple FTL or no FTL at all, the host can leverage the information exposed by the ZNS SSD to predict its latency better than that of a conventional SSD. Additionally, with a ZNS SSD the host can choose the data placement position, which can enhance the performance of the disk. For example, as can be seen in Figure 1a, data on a conventional SSD are placed randomly by the FTL. However, as can be seen in Figure 1b, data on a ZNS SSD are grouped by application. Let us assume that Application 1 generates a very write-intensive workload. Under this workload, the conventional SSD incurs write amplification to erase blocks, because valid pages from other applications must be copied out first. On the other hand, in the ZNS SSD the workload data are grouped by the host, so the SSD does not incur write amplification when erasing blocks. Because of these advantages, the zoned namespace received attention as a next-generation interface and has been added to the NVMe standard.

2.2. RocksDB

RocksDB, developed by Facebook, is a log-structured merge-tree (LSM-Tree)-based open-source key–value store. Unlike other key–value stores, it supports multi-threading in the flush and compaction routines [20].
An LSM-Tree reduces random writes by using out-of-place updates and has a compaction routine to collect the invalid data generated by those updates. These characteristics give the LSM-Tree fast write performance and help it utilize space efficiently, and they are the main reasons why the LSM-Tree is adopted as the backend of various key–value stores [4,20].
Figure 2 shows the RocksDB architecture, which consists of the Memtable, the write-ahead log (WAL) file and the sorted sequence table (SSTable) files. The Memtable is a write buffer in memory that aggregates user write I/O until the buffer threshold is reached. This buffering creates coarse-grained write I/O, which makes better use of the disk. SSTables are generated by flushes and compactions and are the centerpiece of the LSM-Tree in RocksDB. RocksDB uses leveled compaction, which restricts the number of overlapping tables to achieve optimal read performance. However, leveled compaction suffers from high write amplification when it executes the sorted merge [6]. Therefore, the SSTables located at Level 1 suffer from write amplification.
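As a concrete illustration of this architecture, the hedged C++ sketch below opens a RocksDB instance configured with leveled compaction and a Memtable (write buffer) sized to match one SSTable; the sizes and the database path are illustrative assumptions rather than values prescribed by the paper.

```cpp
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    // Memtable (write buffer): user writes are aggregated here before a flush
    // turns the buffer into an SSTable file.
    options.write_buffer_size = 16 << 20;              // 16 MiB, assumed value
    // Leveled compaction limits overlapping SSTables per level for read speed,
    // at the cost of write amplification during sorted merges.
    options.compaction_style = rocksdb::kCompactionStyleLevel;
    options.target_file_size_base = 16 << 20;          // SSTable size, assumed

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_example", &db);
    assert(s.ok());

    // The WAL and Memtable absorb the write; SSTables are created later by flushes.
    db->Put(rocksdb::WriteOptions(), "key", "value");

    delete db;
    return 0;
}
```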

3. Motivation

3.1. ZenFS

ZenFS is a file system plugin for RocksDB developed by Western Digital [12]. It is located between RocksDB and the Linux kernel. To use Portable Operating System Interface (POSIX)-compliant I/O functions, it obtains a file descriptor for the zoned block device using libzbd [21]. After the file descriptor is obtained, ZenFS can flush the RocksDB data to the ZNS SSD as direct I/O with POSIX-compliant functions [22]. Since data have to be written sequentially in each zone, the I/O scheduler needs to be changed to ensure I/O ordering [23].
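The snippet below is a minimal sketch of how a host-side component such as ZenFS might obtain a file descriptor and the zone layout through libzbd before issuing direct, sequential writes; it is not ZenFS code, and the device path as well as the exact use of the libzbd calls are assumptions.

```cpp
#include <fcntl.h>
#include <libzbd/zbd.h>
#include <cstdio>
#include <vector>

int main() {
    struct zbd_info info;
    // Open the zoned block device; O_DIRECT matches ZenFS's direct I/O usage.
    int fd = zbd_open("/dev/nvme0n1", O_RDWR | O_DIRECT, &info);
    if (fd < 0) {
        perror("zbd_open");
        return 1;
    }
    printf("zone size: %llu bytes, zones: %u\n",
           (unsigned long long)info.zone_size, info.nr_zones);

    // Fetch the zone descriptors to learn each zone's write pointer.
    std::vector<struct zbd_zone> zones(info.nr_zones);
    unsigned int nr = info.nr_zones;
    if (zbd_report_zones(fd, 0, 0, ZBD_RO_ALL, zones.data(), &nr) == 0 && nr > 0) {
        // A sequential write must land exactly at the zone's write pointer,
        // e.g., pwrite(fd, buf, len, zbd_zone_wp(&zones[0])).
        printf("zone 0 write pointer: %llu\n",
               (unsigned long long)zbd_zone_wp(&zones[0]));
    }

    zbd_close(fd);
    return 0;
}
```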
Figure 3 shows how ZenFS manages the zones, which are divided into a journal area and a data area. The number of zones in the journal area is statically defined at compile time, whereas the number of zones in the data area is the total number of zones minus the number of journal-area zones.
The journal area contains the ZenFS superblock and zonefile metadata. The superblock area holds ZenFS creation information. The remaining part of the journal area holds zonefile metadata, such as the zone-mapping information of the WAL and SSTable files. The zonefile metadata contain a file ID, file size, file name and an array of extent metadata.
A zonefile is the file abstraction that ZenFS uses to handle RocksDB files. Its metadata are kept in memory and also written to the journal area so that they can be recovered when the system crashes. Its data are written to the data area in units of extents. An extent is a contiguous part of the zonefile data written in a zone. In an extent, “start” indicates the position of the write pointer (WP) in a zone when the data are first written and “length” indicates the contiguously written size in the zone.
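The following C++ sketch illustrates the zonefile and extent metadata described above; the field layout follows the text, but the type names and exact fields are assumptions for illustration rather than ZenFS’s actual definitions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One contiguous piece of zonefile data inside a single zone.
struct ExtentMeta {
    uint32_t zone_id;   // zone holding the extent
    uint64_t start;     // write pointer position when the data were written
    uint64_t length;    // contiguously written bytes in that zone
};

// ZenFS-style abstraction of a RocksDB file (WAL or SSTable).
struct ZoneFileMeta {
    uint64_t file_id;
    uint64_t file_size;
    std::string file_name;
    uint32_t lifetime_hint;          // e.g., WLTH_SHORT ... WLTH_EXTREME
    std::vector<ExtentMeta> extents; // ordered extents forming the file
};
```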
Figure 4 depicts the allocation of a zone in ZenFS. In the figure, L3 (Level 3) data request a zone. In this case, the zone allocator checks all zones in the ZNS SSD and finds the zones that contain only invalid data; all such zones are erased. For example, in the figure, Zone #2 contains only invalid data. Therefore, Zone #2 is selected as an erase target and erased. After erasing these zones, the zone allocator finds the zone whose level differs the least from that of the requested data. In the figure, Zone #1 has the smallest level difference with the requested data. As a result, the zone allocator returns Zone #1 for writing the requested data. Moreover, the allocation result is written to the zonefile metadata.
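This routine can be summarized by the following hedged sketch; the helper names (ResetZone, AllocateZoneOriginal) and the integer level encoding are assumptions made for readability, not ZenFS identifiers.

```cpp
#include <climits>
#include <cstdlib>
#include <vector>

struct ZoneState {
    int id;
    bool has_only_invalid_data;  // every extent in the zone is invalidated
    bool full;                   // no writable space left
    int level;                   // level of the data already stored (-1 if empty)
};

void ResetZone(ZoneState& z) { z.has_only_invalid_data = false; z.full = false; z.level = -1; }

// Conventional ZenFS-style allocation: erase fully-invalid zones, then pick the
// zone whose stored level is closest to the level of the requested data.
int AllocateZoneOriginal(std::vector<ZoneState>& zones, int requested_level) {
    for (auto& z : zones)
        if (z.has_only_invalid_data) ResetZone(z);   // reclaim on the allocation path

    int best = -1, best_diff = INT_MAX;
    for (const auto& z : zones) {
        if (z.full) continue;
        int diff = (z.level < 0) ? INT_MAX - 1                  // empty zone: last resort
                                 : std::abs(z.level - requested_level);
        if (diff < best_diff) { best_diff = diff; best = z.id; }
    }
    return best;  // -1 if no writable zone exists
}
```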

3.2. Limitations of ZenFS

We found that ZenFS does not utilize host information and does not have garbage collection logic. To verify this, we created a ZNS SSD emulation environment [14] and evaluated the current state of ZenFS. The emulated ZNS SSD had a zone size of 16 MiB and 512 zones, for a total size of 8 GiB. RocksDB’s internal benchmark program, “db_bench”, was used to evaluate ZenFS. The “fillseq + overwrite” workload was used for the experiments: “fillseq” generates sequential keys with values and “overwrite” randomly generates keys and values within the previously written key range. The experiment was conducted with 4 million (around 3.1 GB) and 5 million (around 3.9 GB) key–value pairs.
The experiment results are shown in Figure 5. The figure shows that inserting 4 million key–value pairs did not cause any problems. However, with 5 million key–value pairs the workload could not be executed to the end of the overwrite phase. This indicates that ZenFS did not efficiently utilize the space of the ZNS SSD.
Moreover, the results for 4 million key–value pairs in the figure show that ZenFS could only reserve a small number of available zones when the overwrite phase began. This is because zones were not reclaimed efficiently: the zone allocation method of ZenFS does not properly consider the lifetime of the zonefile.
Figure 6 shows the average interval between logical deletion and physical deletion. The data are classified into SHORT, MEDIUM and EXTREME according to their lifetime. “Logical deletion” means that data are deleted by ZenFS and marked as invalid. “Physical deletion” means that data are physically erased from the ZNS SSD. The figure shows the average time it takes for data in a zone to be physically deleted after being logically deleted. According to the figure, MEDIUM and EXTREME data were physically deleted immediately after logical deletion, whereas the physical deletion of data with a short lifetime was delayed by long-lifetime data sharing the same zone. Consequently, in Figure 5, only a small number of available zones remained when the overwrite phase began.
Therefore, ZenFS cannot efficiently utilize its space, and data placement needs to be separated according to data lifetime. Hence, this paper proposes a lifetime-based zone allocation method and a garbage collection method for the key–value store on the ZNS SSD.

4. Design

The architecture of the proposed method is shown in Figure 7. Unlike the conventional ZenFS, where the zone allocator is in charge of both garbage collection and allocation, the proposed method transfers the garbage collection responsibility to a dedicated garbage collector, so the zone allocator is only in charge of allocating zones.
The zone allocator analyzes the incoming data and, if there is a zone with the same lifetime, the data are assigned to that zone. If not, they are assigned to a new empty zone. This ensures that a zone holds either one file or multiple files with the same lifetime, which minimizes deletion delays caused by other data with different lifetimes. This separate placement also reduces the amount of valid data copying in subsequent garbage collection.
The garbage collector copies the valid data of a victim zone that contains invalidated data to another zone and then deletes the victim zone once it contains only invalid data. The garbage collector runs when a certain fraction of the disk capacity is used, and the number of victim zones selected for garbage collection is determined dynamically. Furthermore, victim zones are selected from a set of queues that manage zones with invalid pages, moving from the queue with the smallest amount of valid data to the queue with the largest. With these methods, it is possible to obtain available zones while minimizing unnecessary valid data copying and reducing the load caused by garbage collection.

4.1. Lifetime-Based Zone Allocation

Figure 8 shows the lifetime-based zone allocation method. Let us suppose that we need to assign a zone to write Extent #0 of Zonefile #2 in Figure 8. First, we check the status of each zone, starting from Zone #0. Zone #0 is excluded from the candidate list because all of its space has been exhausted. Since Zone #1 has free space, we compare its write lifetime hint (WLTH_EXTREME), declared by the host, to the write lifetime hint (WLTH_SHORT) of Zonefile #2, which contains Extent #0. If the lifetimes are the same, that zone is assigned. However, as we can see in Figure 8, Zone #1 is excluded because the lifetimes differ. Finally, Zone #2 is an entirely empty zone that has not been assigned any write lifetime hint by the host; therefore, it is assigned as the zone for writing Extent #0 of Zonefile #2. Note that, regardless of zone order, zones with the same lifetime as the requested data and with remaining space are allocated in preference to empty zones.
Such a method induces one zonefile to be placed in the same zone as much as possible. Moreover, even if different zonefiles are placed in a single zone, zonefiles with different lifetimes are not mixed; therefore, physical deletion is not delayed. This effectively solves the problem of physical deletion delays caused by zonefiles with different lifetimes being mixed into one zone, reduces the amount of valid data copying and enhances the performance.
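A hedged sketch of the lifetime-based allocation described in this subsection is given below; it follows the decision order in Figure 8 (same-lifetime zone with free space first, then an empty zone), but the function and field names are illustrative assumptions.

```cpp
#include <vector>

enum class Lifetime { NONE, SHORT, MEDIUM, LONG, EXTREME };  // write lifetime hints

struct ZoneInfo {
    int id;
    bool full;           // no writable space left
    bool empty;          // zone has never been written (or was reset)
    Lifetime lifetime;   // hint of the data already placed in the zone
};

// Lifetime-based allocation: prefer a partially filled zone whose lifetime hint
// matches the requested data; otherwise fall back to an empty zone.
int AllocateZoneByLifetime(std::vector<ZoneInfo>& zones, Lifetime requested) {
    int empty_candidate = -1;
    for (auto& z : zones) {
        if (z.full) continue;
        if (!z.empty && z.lifetime == requested)
            return z.id;                       // same lifetime: best match
        if (z.empty && empty_candidate < 0)
            empty_candidate = z.id;            // remember the first empty zone
    }
    if (empty_candidate >= 0) {
        zones[empty_candidate].empty = false;
        zones[empty_candidate].lifetime = requested;  // zone inherits the hint
    }
    return empty_candidate;  // -1 means no zone available: garbage collection is needed
}
```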

4.2. Garbage Collector for ZNS SSD

The operation of the garbage collector is shown in Figure 9. When a delete command is issued for a zonefile, each zone that contains data belonging to that zonefile records that those data have been invalidated. Zones containing invalidated data may be subjected to garbage collection in the future; therefore, the invalid zone collector checks the amount of valid data in each such zone and inserts it into the appropriate position of the victim queue.
The victim queue consists of multiple queues, each of which holds zones with a different amount of valid data. In other words, zones in the Level 1 queue have the smallest amount of valid data, while zones in the Level 4 queue have the largest. A zone with the longest lifetime and the least valid data is selected as the victim first. Priority queues are used so that zones with little valid data can be found quickly. This reduces the probability that zones containing short-lifetime data, or zones with a lot of valid data, become garbage collection targets, and consequently prevents unnecessary copying of valid data.
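The multi-level victim queue can be sketched as follows; the number of levels, the bucketing rule and the names are assumptions chosen to mirror the description above (Level 1 holds the zones with the least valid data, Level 4 the most).

```cpp
#include <array>
#include <deque>
#include <optional>

constexpr int kLevels = 4;  // Level 1 .. Level 4 queues, assumed from the text

struct VictimQueue {
    // queues[0] = Level 1 (least valid data) ... queues[3] = Level 4 (most).
    std::array<std::deque<int>, kLevels> queues;

    // Bucket a zone by its ratio of valid data (0.0 .. 1.0).
    void Insert(int zone_id, double valid_ratio) {
        int level = static_cast<int>(valid_ratio * kLevels);
        if (level >= kLevels) level = kLevels - 1;
        queues[level].push_back(zone_id);
    }

    // Pick the next victim, scanning only up to max_level (see Equation (2)),
    // so zones with a lot of valid data are skipped when space is not critical.
    std::optional<int> PopVictim(int max_level) {
        for (int l = 0; l < max_level && l < kLevels; ++l) {
            if (!queues[l].empty()) {
                int zone_id = queues[l].front();
                queues[l].pop_front();
                return zone_id;
            }
        }
        return std::nullopt;  // nothing eligible below the allowed level
    }
};
```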
To reduce the overhead of valid data copying, garbage collection is only executed when there are insufficient empty zones. This threshold is determined dynamically, as described in the following subsection.

4.2.1. Dynamic Victim Selection

Since the overhead of the zone deletion command is large, the garbage collector dynamically determines the number of zones that need to be deleted. When garbage collection is invoked, the maximum number of deletions is defined by Equation (1), where Z_t is the total number of zones on the disk and Z_e is the number of empty zones.
Z_t × 1/(1 + Z_e),  Z_t > 0; Z_t ≥ Z_e ≥ 0    (1)
If Z_t is 512, plotting this value against the number of empty zones gives the graph shown in Figure 10. Equation (1) selects from one to four zones as victims as long as there are enough empty zones. This prevents the problem of too many zones being deleted at once, which would increase the garbage collection delay even when the number of empty zones is sufficient. However, if the number of empty zones falls below 10%, the disk is considered short of space and the number of victim zones for garbage collection is increased sharply to create more empty zones.
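A small sketch of Equation (1) as it might be evaluated by the garbage collector is shown below (assuming the reconstruction of the equation given above); the function name and the use of integer division for rounding are illustrative assumptions.

```cpp
#include <cstdio>

// Equation (1): maximum number of victim zones to delete in one GC pass.
// total_zones = Z_t, empty_zones = Z_e (0 <= Z_e <= Z_t).
int MaxVictimZones(int total_zones, int empty_zones) {
    if (total_zones <= 0) return 0;
    return total_zones / (1 + empty_zones);  // integer division plays the role of flooring
}

int main() {
    // With Z_t = 512: plenty of empty zones -> 1..4 victims; few empty zones -> many more.
    printf("%d\n", MaxVictimZones(512, 400));  // 1
    printf("%d\n", MaxVictimZones(512, 127));  // 4
    printf("%d\n", MaxVictimZones(512, 51));   // 9  (about 10% empty zones)
    printf("%d\n", MaxVictimZones(512, 5));    // 85
    return 0;
}
```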
To minimize the amount of valid data copying and to reduce the overhead of searching the queues, the maximum level of queue that can be selected is determined from the empty space ratio by Equation (2).
L_max × 1/(1 + E/5),  L_max ≥ 1; 100 ≥ E ≥ 0    (2)
Figure 11 shows the gradual change in the maximum level of queue determined by the empty space ratio in Equation (2). The solid line indicates the maximum level of queue that can be selected when searching for victim zones. Note that queue selection for garbage collection is only allowed when the remaining space is less than 25%. The dotted line indicates the maximum level of queue that could be selected without the threshold on the remaining space. As a result, the garbage collector reduces the overhead of traversing the queues.
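Similarly, a hedged sketch of Equation (2), combined with the 25% remaining-space threshold mentioned above, might look as follows; the names and the exact rounding are assumptions.

```cpp
// Equation (2): maximum victim-queue level that may be scanned.
// max_level = L_max (here 4), empty_space_pct = E in [0, 100].
int MaxQueueLevel(int max_level, double empty_space_pct) {
    if (empty_space_pct >= 25.0) return 0;                  // enough space: skip the GC scan
    int level = static_cast<int>(max_level / (1.0 + empty_space_pct / 5.0));
    return level < 1 ? 1 : level;                           // scan at least the Level 1 queue
}
```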

4.2.2. Valid Page Copying

Figure 12 shows the process of copying valid data. Let us suppose that Zone #1 is selected as the victim in Figure 12a. First, the zonefile extents residing in Zone #1 are looked up. Each extent is read in sequence and copied to a new zone, and then the region corresponding to the old extent is invalidated. When the invalidation is completed, the metadata of the zonefile are updated to reflect the new position of the extent.
This process moves Extent #1 of File-A and Extent #1 of File-B to a new empty zone, Zone #3. The contents copied to the new zone are placed in a zone that holds data with the same lifetime, according to the lifetime-based zone allocation described in Section 4.1. As a result, when the valid data copying is completed, the layout is as shown in Figure 12b.
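Combining the pieces above, the garbage collector’s copy step can be sketched as below; it reuses the illustrative ExtentMeta/ZoneFileMeta, ZoneInfo and AllocateZoneByLifetime definitions from the earlier listings, and the I/O itself is abstracted behind hypothetical ReadExtent/AppendToZone helpers.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical I/O helpers: read an extent's payload and append it at the
// write pointer of the destination zone, returning the new start offset.
std::vector<uint8_t> ReadExtent(const ExtentMeta& e);
uint64_t AppendToZone(int zone_id, const std::vector<uint8_t>& data);

// Copy every valid extent out of the victim zone so the zone can be reset.
void CopyValidData(int victim_zone,
                   std::vector<ZoneFileMeta>& files,
                   std::vector<ZoneInfo>& zones) {
    for (auto& file : files) {
        for (auto& extent : file.extents) {
            if (extent.zone_id != victim_zone) continue;        // extent is not in the victim
            int dest = AllocateZoneByLifetime(
                zones, static_cast<Lifetime>(file.lifetime_hint));
            std::vector<uint8_t> payload = ReadExtent(extent);  // read the valid data
            uint64_t new_start = AppendToZone(dest, payload);   // sequential append
            extent.zone_id = dest;                              // update zonefile metadata
            extent.start = new_start;                           // old region is now invalid
        }
    }
    // After all extents are relocated, the victim holds only invalid data
    // and can be reset (erased) to become an empty zone.
}
```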

5. Implementation

The proposed method was implemented in the RocksDB file system plugin, ZenFS. The zone allocation method of ZenFS was modified to allocate zones based on the lifetime of each zonefile. Furthermore, the original erase method did not fully utilize the ZNS space; therefore, by modifying the original erase scheme, we added garbage collection logic for the key–value store on the ZNS SSD to utilize the space more efficiently. Additionally, the garbage collection thread was implemented to execute separately from the main thread.
As stated by the NVMe 1.4 standard [24], the ZNS interface restricts the number of active and open zones, and these limits are defined by the vendor. In ZenFS, active zones are zones in which a zonefile is currently being written, and open zones are all zones that are not in the closed state. For this reason, all active zones must be open. Therefore, the developer must take care of these limits.
If the garbage collection thread timer is too short, unnecessary overhead is produced by repeatedly checking the garbage collection routine. Conversely, if the timer is too long, empty zones may be exhausted before garbage is collected. Therefore, the developer must set the thread wake timer to an appropriate value. The proposed method wakes the garbage collection thread every 1000 ms, the value used in the host-level FTL for open-channel SSDs [25].
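A minimal sketch of such a background garbage collection thread is shown below; the 1000 ms wake interval follows the text, while the trigger condition and the function names are assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> running{true};

// Hypothetical hooks into the rest of the system.
bool NeedGarbageCollection();   // e.g., too few empty zones remain
void RunGarbageCollection();    // select victims, copy valid data, reset zones

// Background GC thread: wakes every 1000 ms and runs GC only when needed,
// so the main I/O path is not blocked by victim selection or copying.
void GarbageCollectorThread() {
    while (running.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        if (NeedGarbageCollection())
            RunGarbageCollection();
    }
}

// Usage: std::thread gc(GarbageCollectorThread); ... running = false; gc.join();
```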
To copy the valid data, information about all the zonefiles contained in the victim zone is required. Therefore, each zone needs to know which zonefiles it contains. However, finding the extents associated with a specific zone by scanning the zonefiles incurs overhead. Therefore, the proposed method adds a zonefile-to-extent mapping table to each zone so that the extents in a zone can be found faster. The space overhead of adding this mapping table to each zone is negligible: in our implementation, the per-zone table used 768 bytes in the worst case and 118 bytes on average. Assuming that the disk has 512 zones, maintaining the tables for all zones requires around 59 KiB on average and around 384 KiB in the worst case, which is negligible for modern memory capacities.
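The per-zone mapping table can be sketched as follows; the container choice is an illustrative assumption, and the overhead arithmetic in the comments uses the per-zone figures from the text together with the assumed 512 zones.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Per-zone index: which extents of which zonefiles live in this zone.
// With it, the GC can enumerate a victim zone's valid data directly instead
// of scanning every zonefile's extent list.
struct ZoneExtentIndex {
    // file id -> indices into that zonefile's extent array.
    std::unordered_map<uint64_t, std::vector<uint32_t>> extents_by_file;

    void Add(uint64_t file_id, uint32_t extent_index) {
        extents_by_file[file_id].push_back(extent_index);
    }
    void Remove(uint64_t file_id) { extents_by_file.erase(file_id); }
};

// Space overhead (from the text): ~118 B per zone on average, 768 B worst case.
// For 512 zones: 512 * 118 B ~= 59 KiB average, 512 * 768 B = 384 KiB worst case.
```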

6. Experiments

6.1. Environments

The experiments were conducted under the host environments described in Table 1. The virtual machine on the host was configured as in Table 2 and its emulated ZNS SSD configuration is shown in Table 3.
The zone size of the emulated ZNS SSD was 16 MiB and the number of zones was 512; therefore, the total size of the ZNS SSD was 8 GiB. Because of the sector size, host I/O must be aligned to 512 bytes. The implementation must also respect the maximum numbers of open and active zones. Each specification value followed the QEMU default configuration [14].
The proposed method was evaluated using RocksDB’s internal benchmark, db_bench. The configuration values are shown in Table 4. The key and value sizes were equal to the ZenFS default evaluation configuration values. Furthermore, the write buffer size and the maximum SSTable size were set equal to the zone size.
In the experiments, db_bench used the “fillseq + overwrite” workload. First, it filled the ZNS SSD with sequential key–value pairs. Next, it generated random key–value pairs that overwrite the already filled key range. In each experiment, the number of keys was increased in increments of 1 million, from 4 million to 7 million.
In this section, the space utilization and I/O performance are evaluated against several evaluation targets, as described in Table 5. In addition, the effect of the dynamic victim selection is examined.

6.2. Analysis of Space Utilization

To evaluate how effectively the proposed method uses the space of the ZNS SSD, we checked how large a key range could be handled by the “fillseq + overwrite” workload with four different implementations.
The results of the evaluation are shown in Figure 13. The conventional method could handle a key range of up to 4 million keys. “Original + GC” (the conventional method with only the garbage collection (GC) logic added) and “hot/cold” (which allocates zones based on lifetime but has no garbage collection logic) handled up to 6 million keys. “Hot/cold + GC”, which has all the features of the proposed technique, could handle up to 7 million keys, a 75% increase compared to the conventional method.
This means that an 8 GiB ZNS SSD with 16-byte keys and 800-byte values cannot freely overwrite if the number of unique keys exceeds 4 million (approximately 3.2 GB). However, by adopting either garbage collection or the lifetime-based zone allocation method, up to 6 million (about 4.9 GB) unique keys can be freely overwritten. Using both, up to 7 million (about 5.7 GB) unique keys can be freely overwritten.
The above evaluation results show that the proposed methods alleviate the delayed physical deletion caused by zonefiles with different lifetimes being mixed into one zone. As a result, zones can be reclaimed faster and more empty zones can be secured. The proposed lifetime-aware zone allocation is a preventive solution that minimizes the physical deletion delay by separating short-lifetime data from long-lifetime data, whereas garbage collection is a reactive solution that reduces the physical deletion delay by extracting long-lifetime valid data from zones containing invalidated data.
Figure 14 shows how the lifetime-based zone allocation method minimizes the physical deletion delays caused by this interference problem. The figure shows the pending time from logical deletion to physical deletion when “fillseq + overwrite” was executed for 4 million keys. The ratio of the amount of deletion that occurred for each lifetime was 1:1.5:1 for SHORT, MEDIUM and EXTREME, respectively.
It can be confirmed that, with the conventional method, the deletion of SHORT data, which have a short lifetime, was delayed by MEDIUM and EXTREME data, which have relatively long lifetimes. With the lifetime-based zone allocation method, however, data with a short lifetime are aggregated into separate zones, so physical deletion is performed virtually immediately after logical deletion. In exchange, the physical deletion of long-lifetime data is delayed further, because data with long lifetimes interfere with each other. However, this is negligible, because the benefit of removing the physical deletion delay of short-lifetime data is much greater.
For example, let us suppose that a zone with invalid data follows the deletion ratio by lifetime, so that two SHORT, three MEDIUM and two EXTREME physical deletion delays occur. With the conventional method, a delay of 118 s can be predicted from the average deletion delays of 45 s, 4 s and 8 s for SHORT, MEDIUM and EXTREME, respectively. In contrast, the lifetime-based zone allocation method needs 58 s, calculated from average delays of 0.08 s, 12 s and 11 s. This means that the lifetime-based zone allocation method reduces the overall physical deletion delay by dramatically reducing the delay for data with a short lifetime, even though the delay for zones with relatively long-lifetime data increases.
Figure 15 shows that combining garbage collection with the lifetime-based zone allocation method is useful for efficient space utilization. The graph analyzes the additional writes generated by valid data copying during garbage collection while the key range is changed from 4 million to 6 million in increments of 1 million. The write amplification factor (WAF) in the graph refers to the ratio of the number of additional writes to the number of original writes.
The WAF increased with the key range when garbage collection was simply added to the conventional method. This is because valid data with a long lifetime, after being copied to a zone containing valid data with a short lifetime, are soon copied again. The proposed method is free from this problem; therefore, it showed a consistent WAF even when the key range was large. As a result, the proposed method reduces unnecessary writes and induces efficient space utilization.

6.3. I/O Performance

Figure 16 shows the normalized I/O performance of the “fillseq + overwrite” workload with 4 million keys. The lifetime-based zone allocation method improved the IOPS by 5% compared to the conventional method. On the other hand, the proposed garbage collection logic could handle larger key ranges, but its valid page copying caused performance degradation: introducing garbage collection into the conventional method resulted in a 9% performance drop, while the proposed “hot/cold + GC” technique showed only a 3% drop.
The reason for the above results is the zone allocation latency, described in Figure 17. Figure 17a shows that the lifetime-based zone allocation method without garbage collection had a lower zone allocation latency than the conventional method. In other words, the lifetime-based zone allocation method was interrupted less often to find free zones, which resulted in better performance than the conventional method.
On the other hand, Figure 17b shows the latency required to obtain space when garbage collection was introduced. Although it is not shown in the graph, the latency was lower than that of the methods without garbage collection up to the 90th percentile, but it increased sharply beyond the 90th percentile.
The cause of the latency increase beyond the 99th percentile in the garbage collection case was valid data copying. Moreover, the lifetime-based zone allocation method with garbage collection had a higher 99th-percentile latency than the conventional one. This is because the proposed method produces zones in which all data are invalidated more quickly, causing garbage collection to run earlier than in the conventional method.
Nevertheless, the lifetime-based zone allocation method with garbage collection had lower latencies beyond the 99.9th percentile than the conventional method, owing to the reduction in valid page copying. Because these latencies were reduced, I/O could be processed faster. For this reason, the proposed method achieved a relatively high IOPS compared to the conventional zone allocation with garbage collection.

6.4. Effect of the Dynamic Victim Selection

If all the zones that can be erased during garbage collection are deleted at once, a long delay is inevitable. Therefore, in this paper, we propose equations that dynamically select the number of victim zones and the victim queue level during garbage collection.
Figure 18 shows the normalized IOPS obtained with and without applying the equations in the garbage collection process of the proposed method. The equations did not have a great influence on the result of generating 4 million keys and executing “fillseq + overwrite”.
The reason is that a certain amount of I/O must be generated for the equations to take effect, but not enough I/O was generated; thus, the calculation of the equations introduced overhead and the performance degraded slightly. However, with more than 5 million keys, the equations yielded better performance, because they selected the number of victim zones and the victim queue elements more efficiently than the method without them.

7. Conclusions

This paper proposed a lifetime-based zone allocation method and a garbage collection method for the key–value store on the ZNS SSD. The zone allocation method based on data lifetime exploits the characteristics of the LSM-Tree. Moreover, it was found that ZenFS does not have garbage collection logic; therefore, this paper added garbage collection logic to improve the space utilization of the ZNS SSD. According to the experimental results, the proposed method utilizes the space of the ZNS SSD more efficiently by combining lifetime-aware zone separation and garbage collection.

Author Contributions

Conceptualization, G.O. and S.A.; methodology, G.O.; software, G.O. and J.Y.; validation, G.O.; formal analysis, G.O.; investigation, G.O. and J.Y.; resources, G.O. and J.Y.; data curation, G.O. and J.Y.; writing—original draft preparation, G.O.; writing—review and editing, S.A.; visualization, G.O.; supervision, S.A.; project administration, S.A.; funding acquisition, S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2019-0-01343, Regional strategic industry convergence security core talent training business) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1A4A4079859).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gandhi, S.; Iyer, A.P. P3: Distributed deep graph learning at scale. In Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), Santa Clara, CA, USA, 14–16 July 2021; pp. 551–568.
  2. Kim, H.; Park, J.H.; Jung, S.H.; Lee, S.W. Optimizing RocksDB for better read throughput in Blockchain systems. In Proceedings of the 23rd International Computer Science and Engineering Conference (ICSEC), Phuket, Thailand, 30 October–1 November 2019; pp. 305–309.
  3. Han, J.; Haihong, E.; Le, G.; Du, J. Survey on NoSQL database. In Proceedings of the 2011 6th International Conference on Pervasive Computing and Applications, Port Elizabeth, South Africa, 26–28 October 2011; pp. 363–366.
  4. Luo, C.; Carey, M.J. LSM-based storage techniques: A survey. VLDB J. 2020, 29, 393–418.
  5. O’Neil, P.; Cheng, E.; Gawlick, D.; O’Neil, E. The log-structured merge-tree (LSM-tree). Acta Inform. 1996, 33, 351–385.
  6. Zhong, W.; Chen, C.; Wu, X.; Jiang, S. REMIX: Efficient range query for LSM-trees. In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 21), Santa Clara, CA, USA, 23–25 February 2021; pp. 51–64.
  7. Yang, M.-C.; Chang, Y.-M.; Tsao, C.-W.; Huang, P.-C.; Chang, Y.-H.; Kuo, T.-W. Garbage collection and wear leveling for flash memory: Past and future. In Proceedings of the 2014 International Conference on Smart Computing (SMARTCOMP), Hong Kong, China, 3–5 November 2014; pp. 66–73.
  8. Bjørling, M. From open-channel SSDs to zoned namespaces. In Proceedings of the Linux Storage and Filesystems Conference (Vault 19), USENIX Association, Boston, MA, USA, 26 February 2019.
  9. Choi, G.; Lee, K.; Oh, M.; Choi, J.; Jhin, J.; Oh, Y. A new LSM-style garbage collection scheme for ZNS SSDs. In Proceedings of the 12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 20), Boston, MA, USA, 13–14 July 2020.
  10. Hwang, J. Towards even lower total cost of ownership of data center IT infrastructure. In Proceedings of the NVRAMOS Workshop, Jeju, Korea, 24–26 October 2019.
  11. Yang, F.; Dou, K.; Chen, S.; Hou, M.; Kang, J.-U.; Cho, S. Optimizing NoSQL DB on flash: A case study of RocksDB. In Proceedings of the 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), Beijing, China, 10–14 August 2015; pp. 1062–1069.
  12. Bjørling, M.; Aghayev, A.; Holmberg, H.; Ramesh, A.; Le Moal, D.; Ganger, G.R.; Amvrosiadis, G. ZNS: Avoiding the block interface tax for flash-based SSDs. In Proceedings of the 2021 USENIX Annual Technical Conference (ATC 21), Santa Clara, CA, USA, 14–16 July 2021; pp. 689–703.
  13. Western Digital. RocksDB. Available online: https://github.com/westerndigitalcorporation/rocksdb (accessed on 23 March 2021).
  14. Western Digital. Getting Started with Emulated NVMe ZNS Devices. Available online: https://zonedstorage.io/docs/getting-started/zns-emulation (accessed on 23 March 2021).
  15. Campello, J. SMR: The next generation of storage technology. In Proceedings of the Storage Developer Conference (SDC 15), Santa Clara, CA, USA, 21 September 2015.
  16. Bjørling, M. Open-channel solid state drives. In Proceedings of the 2015 Linux Storage and Filesystems Conference (Vault 15), Boston, MA, USA, 12 March 2015.
  17. Manzanares, A.; Watkins, N.; Guyot, C.; Le Moal, D.; Maltzahn, C.; Bandic, Z. ZEA, a data management approach for SMR. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 16), Denver, CO, USA, 20–21 June 2016.
  18. Wu, F.; Yang, M.-C.; Fan, Z.; Zhang, B.; Ge, X.; Du, D.H. Evaluating host aware SMR drives. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 16), Denver, CO, USA, 21 June 2016.
  19. Zhang, Y.; Arulraj, L.P.; Arpaci-Dusseau, A.C.; Arpaci-Dusseau, R.H. De-indirection for flash-based SSDs with Nameless Writes. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST 12), San Jose, CA, USA, 14–17 February 2012.
  20. Cao, Z.; Dong, S.; Vemuri, S.; Du, D.H. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST 20), Santa Clara, CA, USA, 25–27 February 2020; pp. 209–223.
  21. Western Digital. libzbd. Available online: https://github.com/westerndigitalcorporation/libzbd (accessed on 22 March 2021).
  22. Holmberg, H. ZenFS, Zones and RocksDB—Who likes to take out the garbage anyway? In Proceedings of the Storage Developer Conference (SDC 20), Santa Clara, CA, USA, 23 September 2020.
  23. Western Digital. ZenFS. Available online: https://github.com/westerndigitalcorporation/zenfs (accessed on 7 November 2021).
  24. NVM Express. TP 4053a Zoned Namespace. Available online: https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs_09022021.zip (accessed on 10 November 2021).
  25. Bjørling, M.; Gonzalez, J.; Bonnet, P. LightNVM: The Linux open-channel SSD subsystem. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (FAST 17), Santa Clara, CA, USA, 27 February–2 March 2017; pp. 359–374.
Figure 1. Data placement on the (a) conventional Solid-state Drive (SSD) and (b) zoned namespace SSD.
Figure 2. RocksDB architecture with ZenFS. Ln represents level n of LSM-tree in RocksDB. Zone #0~#7 represent a part of zones of ZNS SSD.
Figure 3. Overall architecture of ZenFS.
Figure 4. Routine for zone allocation in ZenFS. L1, L2 and L3 represent the extents containing the SSTable files from level 1, 2, and 3 in RocksDB, respectively. IV represents invalid extent of the deleted SSTable file.
Figure 5. Zone usage variation in the “fillseq + overwrite” workloads.
Figure 6. Average deletion interval time in each zonefile based on the lifetime.
Figure 7. The overall architecture of the proposed method. L1, L2 and L3 represent the extents containing the SSTable files from level 1, 2, and 3 in RocksDB, respectively. IV represents invalid extent of the deleted SSTable file. Also, GC means garbage collection.
Figure 8. Lifetime-based zone allocation routine.
Figure 9. Overall architecture of garbage collector.
Figure 10. As obtained from Equation (1), the graph shows the maximum number of zones to be deleted with varying empty zones ratio. Here, Z_e and Z_t = 512 are the number of empty zones and the total number of zones, respectively.
Figure 11. As obtained from Equation (2), the graph shows the maximum level of queue to scan victim zones for garbage collection with varying empty space ratio.
Figure 12. Valid page copying logic: (a) before copying; (b) after copying. A1–A2 and B1–B5 represent the extents containing File–A and File–B, respectively. For example, A1 means extent #1 of File–A. IV and E represent invalid extent and empty space in a zone, respectively.
Figure 13. The key range that can be used for each evaluation target.
Figure 14. Pending time for physical deletion after logical deletion was submitted.
Figure 15. Impact of lifetime-based zone allocation method and garbage collection on WAF (Write Amplification Factor).
Figure 16. Input/Output Operations Per Second (IOPS) of “fillseq + overwrite” for 4 million keys.
Figure 17. Latencies of “fillseq + overwrite” for 4 million keys (a) without garbage collection and (b) with garbage collection.
Figure 18. IOPS depending on the adoption of the garbage collection equation.
Table 1. Hardware and software configurations.
Component | Specification
Central Processing Unit | Intel Xeon CPU E5-2620 v4 @ 2.1 GHz
Memory (CPU) | Samsung 32 GB PC17000/ECC/REG × 4
Storage | Intel SSD DC P4500 Series (1.0 TB, 2.5 in PCIe 3.1 x4, 3D1, TLC)
Operating System (Kernel) | Ubuntu 18.04.2 LTS (Linux Kernel 5.3)
Table 2. Virtual machine configurations.
Configuration | Value
Number of Cores | 32
Memory Size | 32 GB
Storage | Emulated ZNS SSD 8 GiB (QEMU NVMe Ctrl)
OS (Kernel) | Debian Bullseye (Linux Kernel 5.10)
Table 3. Emulated ZNS SSD configurations.
Component | Specification
Zone Model | Host-Managed
Capacity | 8 GiB
Sector Size | 512 bytes
Number of Zones | 512
Zone Size | 16 MiB
Max. Open Zones | 384
Max. Active Zones | 384
Table 4. db_bench configurations.
Configuration | Value
Key Size | 16 bytes
Value Size | 800 bytes
Write Buffer Size | 16 MiB
Max. SST File Size | 16 MiB
Max. Background Threads | 16
Table 5. Evaluation targets.
Item | Lifetime-Based Zone Allocation | Garbage Collection
Original | X | X
Original + GC | X | O
Hot/Cold | O | X
Hot/Cold + GC | O | O
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
