Article

HMB-I/O: Fast Track for Handling Urgent I/Os in Nonvolatile Memory Express Solid-State Drives

1 Department of Computer Engineering, Kwangwoon University, 20, Gwangun-ro, Nowon-gu, Seoul 01897, Korea
2 School of Computer and Information Engineering, Kwangwoon University, 20, Gwangun-ro, Nowon-gu, Seoul 01897, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(12), 4341; https://doi.org/10.3390/app10124341
Submission received: 8 June 2020 / Revised: 23 June 2020 / Accepted: 23 June 2020 / Published: 24 June 2020
(This article belongs to the Special Issue Operating System Issues in Emerging Systems and Applications)

Abstract
Differentiated I/O services for applications with their own requirements are very important for user satisfaction. Nonvolatile memory express (NVMe) solid-state drive (SSD) architecture can improve the I/O bandwidth with its numerous submission queues, but the quality of service (QoS) of each I/O request is never guaranteed. In particular, if many I/O requests are pending in the submission queues due to a bursty I/O workload, urgent I/O requests can be delayed, and consequently, the QoS requirements of applications that need fast service cannot be met. This paper presents a scheme that handles urgent I/O requests without delay even if there are many pending I/O requests. Since the pending I/O requests in the submission queues cannot be controlled by the host, the host memory buffer (HMB), which is part of the DRAM of the host that can be accessed from the controller, is used to process urgent I/O requests. Instead of sending urgent I/O requests into the SSDs through legacy I/O paths, the latency is removed by directly inserting them into the HMB. Emulator experiments demonstrated that the proposed scheme could reduce the average and tail latencies by up to 99% and 86%, respectively.

1. Introduction

Nonvolatile memory express (NVMe) is a storage interface designed for fast nonvolatile storage media such as solid-state drives (SSDs) [1]. Unlike an advanced host controller interface (AHCI), which has only one queue, NVMe provides numerous submission queues that can be scaled for parallel I/O processing within SSDs. To take full advantage of the numerous queues in this NVMe SSD architecture, the previous single-queue block layer was replaced with a multiqueue block layer in the Linux operating system [2]. Consequently, the multiple queues in the NVMe interface and Linux support enable many I/O requests to be handled simultaneously.
This scalable architecture offers a high I/O bandwidth and a high number of IOPS, but no quality of service (QoS) support is provided for each I/O request. If there are numerous pending I/O requests in the submission queues, I/O requests that should be urgently processed cannot be serviced immediately. For example, application startup or foreground process execution that requires low latency can be delayed when an SSD is processing numerous I/O requests. Since the host operating systems cannot control pending I/O requests in the NVMe submission queues, the existing I/O scheduling algorithms at the block I/O layer are not very effective in this case [3,4].
To solve this problem, we present a scheme in which urgent I/O requests are not delayed even when there are many pending I/O requests in the submission queues. To this end, we exploit the host memory buffer (HMB), which is one of the extended features provided by NVMe. It allows an SSD to utilize a portion of the DRAM of the host when it needs more memory. If an SSD supports the HMB feature, the SSD controller can access the host memory via the high-speed NVMe interface backed by peripheral component interconnect express (PCIe), and it can use a portion of the host’s memory as a storage cache for an address mapping table or regular data [5,6,7,8].
As the HMB is a part of the host DRAM, it can be accessed from the host operating system as well as the SSD controller [9]. It is regarded as a shared region of the host and SSD controller, so urgent I/O requests can be handled immediately by communicating via the HMB. When there are many pending I/O requests in the submission queues, causing a bottleneck, the HMB can be used as a fast I/O path for processing urgent I/O requests. Usually, normal I/O requests issued by file systems are sent to the SSD controller via software and hardware queues in the block I/O layer and submission queues in the NVMe interface. If urgent I/O requests are issued, they are sent via the HMB directly, bypassing those queues. In other words, the host sends an urgent I/O request by writing it in the HMB and the SSD controller responds by writing a response to the HMB. In our scheme, it is important not to violate the protocol of the NVMe interface while providing QoS for urgent I/O requests. This issue is addressed by adding another submission queue or using an administration queue, which is a queue for processing NVMe administration commands. Various experiments performed using our emulator demonstrated that the proposed scheme reduces both the average and tail latencies of urgent I/O requests and thus reduces the application launch time significantly.
The remainder of this paper is organized as follows. Section 2 provides an overview of the related previous works. Section 3 describes the bottleneck for urgent I/O requests that can be caused by many I/O requests waiting in the submission queues when I/O traffic is bursty. Section 4 explains our HMB I/O scheme, which is a fast track for urgent I/O requests using the HMB. Section 5 presents an evaluation of our scheme with various workloads. Finally, Section 6 concludes the paper.

2. Related Work

2.1. HMB of NVMe Interface

As the NVMe interface was designed with the SSD structure in mind, it provides many features that allow SSDs to be exploited fully [10]. For example, it employs submission queues to send I/O requests to the SSD and completion queues to receive the responses from the SSD; the number of queues can be scaled up to 64K, and each queue can hold up to 64K entries [3,4,11,12,13]. This scalable architecture allows SSDs to process I/O requests in parallel without the unnecessary delay caused by waiting in a single queue.
The HMB is another notable feature of the NVMe interface, first introduced in the NVMe 1.2 specification. Because it allows SSDs to use the host DRAM, SSDs can obtain more DRAM for caching address mapping tables or regular data. By utilizing the HMB, a DRAM-less SSD can provide performance comparable to that of an SSD with internal DRAM at a lower manufacturing cost. How the HMB is actually exploited remains an open design choice that depends on the SSD manufacturer [11,14]. Using the HMB feature requires support from both the host and the SSD. Modern operating systems such as Windows 10 and Linux, as well as some DRAM-less SSDs, support the HMB feature [15,16,17,18,19].
Several steps are required to activate the HMB. First, the host uses the host memory buffer preferred size (HMPRE) and host memory buffer minimum size (HMMIN) fields in the response to the NVMe Identify command to determine whether the NVMe SSD supports the HMB and how much HMB space the SSD needs. If the HMPRE is not zero, the SSD supports the HMB. The HMB size requested by the SSD is at least the HMMIN and at most the HMPRE. After checking both values, the host determines the HMB size and allocates host DRAM accordingly. In recent Linux NVMe device drivers, the HMB size is determined by the smaller of the HMPRE value and a device driver parameter. The HMB usually consists of multiple physically contiguous memory regions; for example, a 64 MB HMB may consist of 16 physically contiguous 4 MB areas. The host then issues an NVMe Set Features command with feature identifier 0x0d, which activates the HMB and passes information about the allocated regions to the SSD. Finally, if the SSD responds to the request successfully, the HMB can be accessed from the SSD. In Linux, all of these procedures are performed when an NVMe SSD device is initialized.
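For illustration, this activation sequence can be sketched as follows. It is a minimal sketch, assuming 4 MB contiguous chunks and using hypothetical helper routines (nvme_identify_hmb_sizes(), dma_alloc_contiguous_chunk(), nvme_set_features()) in place of a real driver's primitives; it is not the actual Linux NVMe driver code.

```c
/* Hedged sketch of HMB activation; the nvme_* and dma_* helpers are
 * hypothetical stand-ins, not the real Linux NVMe driver API. */
#include <stdint.h>
#include <stddef.h>

#define NVME_FEAT_HMB    0x0d            /* feature identifier for the HMB */
#define HMB_CHUNK_BYTES  (4ull << 20)    /* assume 4 MB contiguous chunks */

struct hmb_desc { uint64_t addr; uint32_t nbytes; };   /* simplified entry */

/* hypothetical driver primitives (prototypes only) */
int      nvme_identify_hmb_sizes(uint64_t *hmpre_bytes, uint64_t *hmmin_bytes);
uint64_t dma_alloc_contiguous_chunk(uint64_t nbytes);
int      nvme_set_features(uint8_t fid, struct hmb_desc *list, size_t n,
                           uint64_t total_bytes);

int activate_hmb(uint64_t driver_limit, struct hmb_desc *list, size_t max_n)
{
    uint64_t hmpre, hmmin;

    if (nvme_identify_hmb_sizes(&hmpre, &hmmin) || hmpre == 0)
        return -1;                       /* HMPRE == 0: no HMB support */

    /* HMB size: at least HMMIN, at most HMPRE, capped by a driver parameter */
    uint64_t total = hmpre < driver_limit ? hmpre : driver_limit;
    if (total < hmmin)
        return -1;

    size_t n = (size_t)(total / HMB_CHUNK_BYTES);   /* e.g., 64 MB -> 16 chunks */
    if (n > max_n)
        n = max_n;
    for (size_t i = 0; i < n; i++) {
        list[i].addr   = dma_alloc_contiguous_chunk(HMB_CHUNK_BYTES);
        list[i].nbytes = (uint32_t)HMB_CHUNK_BYTES;
    }

    /* Set Features 0x0d enables the HMB and hands the descriptor list to the
     * SSD; only after a successful response may the SSD access the buffer. */
    return nvme_set_features(NVME_FEAT_HMB, list, n,
                             (uint64_t)n * HMB_CHUNK_BYTES);
}
```

The essential contract is visible in the sketch: the host never allocates less than HMMIN or more than HMPRE, and the SSD gains access only after the Set Features command completes successfully.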
HMB has significant potential to improve the I/O performance of NVMe SSDs, but there has not been much research on this possibility yet. In [20,21], the authors used the HMB as a mapping table cache or data buffer and showed that it can mitigate the performance degradation of DRAM-less SSDs. Hong et al. [22] proposed an architecture that uses the HMB as a data cache by modifying an NVMe command process and adding another DMA path between the system memory and HMB. They showed that the proposed architecture improved the I/O performance by 23% in the case of sequential writes compared to a device buffer architecture. Kim et al. [8] presented an HMB-supported SSD emulator that enables easy integration and evaluation of I/O techniques for the HMB feature. Through implementation using their emulator, they also demonstrated that the I/O performance could be improved significantly when the HMB feature was used as a write buffer.

2.2. Support in the Block I/O Layer for NVMe SSDs

The Linux block I/O layer has changed significantly in response to NVMe SSDs with numerous queues [2]. In the previous single-queue block I/O layer, all I/O requests issued by processes running on each CPU core were sent to a single request queue. Consequently, a performance bottleneck occurred because of lock contention on the request queue. To alleviate this problem, the multiqueue block I/O layer employs two levels of queues: software queues and hardware queues. I/O requests issued by a process running on a CPU core are first sent to the software queue attached to that CPU core and then to the hardware queue mapped to the software queue. The I/O requests are finally sent to the submission queues of the NVMe controller, which are mapped to the hardware queues.
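The two-level path can be pictured with the following conceptual sketch; the structures, sizes, and the CPU-to-queue mapping are deliberate simplifications for illustration and do not reflect the kernel's actual blk-mq types.

```c
/* Conceptual sketch of the two-level queue path; illustrative only. */
#include <stddef.h>

enum { NR_CPUS = 12, NR_HW_QUEUES = 12, SWQ_DEPTH = 256 };  /* assumed sizes */

struct request;                               /* an I/O request (opaque here) */

struct sw_queue { struct request *staged[SWQ_DEPTH]; size_t n; }; /* per CPU core */
struct hw_queue { int nvme_sq_id; };          /* backed by one NVMe submission queue */

static struct sw_queue swq[NR_CPUS];
static struct hw_queue hwq[NR_HW_QUEUES];

/* A request issued on a CPU core is staged in that core's software queue and
 * later dispatched to the hardware queue mapped to it, which in turn feeds
 * the NVMe submission queue with the same index. */
static struct hw_queue *stage_and_map(int cpu, struct request *rq)
{
    struct sw_queue *sq = &swq[cpu];
    if (sq->n < SWQ_DEPTH)
        sq->staged[sq->n++] = rq;             /* level 1: per-CPU software queue */
    return &hwq[cpu % NR_HW_QUEUES];          /* level 2: mapped hardware queue */
}
```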
The multiqueue block I/O layer is well suited to handling I/O requests in parallel in cooperation with NVMe SSDs that have many queues. However, because it lacks QoS support, several I/O scheduling techniques for NVMe SSDs have been proposed. Lee et al. [9] observed that write requests still negatively affect the latency of read requests in NVMe SSDs. They proposed a queue isolation technique that eliminates this write interference and improves read performance by separating read and write requests. Joshi et al. [23] presented a scheme that efficiently implements a weighted round robin (WRR) scheduler, one of the NVMe features for providing differentiated I/O service. With WRR, submission queues can be marked as urgent, high, medium, or low priority, and WRR support was implemented in a Linux NVMe driver. These studies provide some QoS support for read requests or high-priority requests, but they cannot solve the problem when the SSD itself is overburdened.
Zhang et al. [12] classified applications into latency-sensitive online applications and relatively less sensitive offline applications and provided a separate I/O path for each application type. In their approach, the multiqueue block I/O layer is bypassed to provide low latency for I/O requests issued by online applications, whereas I/O requests issued by offline applications follow the original procedures of the multiqueue block I/O layer. Using this method, the average latency of servers running multiple applications was improved by 22%. This work is similar to ours in that application types are classified and urgent I/O requests are processed directly by bypassing the multiqueue block I/O layer. However, our scheme uses the HMB space as a direct communication channel between the host and the SSD, so it can respond faster to urgent I/O requests even when there are many pending I/O requests in the submission queues.

3. I/O Latency with High Workload

As mentioned above, I/O requests originating from file systems are passed to SSD devices via the software and hardware queues of the Linux block I/O layer and the submission queues of the NVMe interface. Once an I/O request leaves the hardware queue, the host can no longer control it. If I/O requests arrive while earlier ones are still pending, they must wait in the submission queues, which adversely affects the processing of urgent I/O requests. In this section, we experimentally analyze how many I/O requests are queued in the submission queue when the I/O workload is high and show how much this queueing can delay the processing of urgent I/O requests.
In fact, it is difficult to know exactly how many I/O requests are waiting in a submission queue at any given time. Both the head and tail positions are required to determine the number of I/O requests in the submission queue, but the host manages only the tail (the next insertion point), not the head (the next deletion point). The head position can be obtained from the SQ Head Pointer field of a completion queue entry, but it is updated only when the SSD device finishes processing an I/O request and inserts an entry into the completion queue. Because of this limitation of NVMe, our method updates the average number of entries in each submission queue whenever the host dequeues completion queue entries.
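A minimal sketch of this bookkeeping is shown below, assuming the host records the tail it writes and refreshes the head from the SQ Head Pointer of each completion entry; the structure and function names are illustrative rather than taken from the driver.

```c
/* Sketch of the submission queue occupancy estimate described above. */
#include <stdint.h>

struct sq_stats {
    uint16_t qsize;       /* submission queue size (e.g., 16,384) */
    uint16_t tail;        /* next insertion point, managed by the host */
    uint16_t last_head;   /* latest SQ Head Pointer seen in a CQ entry */
    uint64_t samples;     /* number of occupancy samples taken */
    uint64_t acc;         /* accumulated occupancy, for the average */
};

/* called whenever the host dequeues a completion queue entry */
static void update_occupancy(struct sq_stats *s, uint16_t sq_head_from_cqe)
{
    s->last_head = sq_head_from_cqe;
    /* entries still pending in the SQ, accounting for ring wrap-around */
    uint16_t pending = (uint16_t)((s->tail - s->last_head + s->qsize) % s->qsize);
    s->acc += pending;
    s->samples++;
}

static double average_occupancy(const struct sq_stats *s)
{
    return s->samples ? (double)s->acc / (double)s->samples : 0.0;
}
```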
In this way, we measured the number of I/O requests enqueued in the submission queue in the environment summarized in Table 1. These experiments were performed on the three SSDs described in Table 2, and fio, one of the most widely used storage benchmark tools, was employed for I/O workload generation [24]. We used libaio as the I/O engine for fio and varied the queue depth, i.e., the number of simultaneous I/O requests issued to the SSD, from 1 to 65,536 [25]. We also set the submission queue size of each SSD to 16,384, the maximum size supported. In each experiment, we used the O_DIRECT flag to remove the effects of the buffer and cache layers inside the host operating system [26]. We also ran fio on only one of the CPU cores to avoid interference with other processes. The configuration for this experiment is summarized in Table 3a. Note that we repeated all experiments 10 times to make the results reliable.
Figure 1 shows the average number of I/O requests enqueued in the submission queue after running fio for about 1 min. This number increases dramatically when the number of I/O requests issued by fio is equal to or greater than 256 for SSD-A and SSD-B and equal to or greater than 512 for SSD-C. This finding indicates that the SSDs can no longer process I/O requests immediately, so the requests start waiting in the submission queue. The submission queue of each SSD also becomes almost full when the number of I/O requests issued is greater than or equal to 16,384. In this situation, the pending I/O requests in the submission queue cannot be rearranged or controlled, so a significant delay can be expected if an I/O request that needs to be handled urgently arrives.
Next, a simple experiment was conducted to measure the delay of an urgent I/O request. We first ran fio with the number of outstanding I/O requests set from 0 to 20,000 to place a sufficient I/O load on the submission queues. Then, we ran the microbenchmark tool described in Table 3b, which issues a single read request of 512 B, the minimum request size, and measures the time needed to process it. We set the process priority of the microbenchmark tool to the highest value to mimic a process that is more urgent than the others.
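A minimal userspace sketch of such a microbenchmark is shown below; the device path is a placeholder, and the tool we actually used may differ in details such as the number of iterations.

```c
/* Minimal sketch of the microbenchmark in Table 3b: one 512 B direct read
 * from offset 0, timed with clock_gettime(); "/dev/nvme0n1" is a placeholder. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    setpriority(PRIO_PROCESS, 0, -20);          /* highest process priority */

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 512, 512))         /* O_DIRECT requires alignment */
        return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = pread(fd, buf, 512, 0);         /* one 512 B read from offset 0 */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("read %zd bytes in %.1f us\n", n, us);

    free(buf);
    close(fd);
    return 0;
}
```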
Figure 2 shows the latency of an I/O request issued by a microbenchmark when each SSD is very busy because of fio execution. Naturally, the latency linearly increases as the I/O load on the SSD increases in each case. When the number of I/O requests issued is 16,384, which is the maximum submission queue size of the SSDs, the latencies become almost constant. Most importantly, there were numerous pending I/O requests in the submission queue in each case, which eventually delayed the execution of the microbenchmark tool with high process priority. Hence, we experimentally confirmed that urgent I/O requests made by high priority processes can be delayed when SSDs are busy due to processing other I/O requests.

4. HMB-Based Fast Track for Urgent I/O Requests

This section presents a new I/O handling method called HMB I/O, which processes urgent I/O requests immediately by using the HMB as a fast track for these requests [27]. As the HMB is a shared region that can be accessed by both the host and SSD, data requested for read or write operations can be transferred between these components via the HMB instead of the legacy I/O stack. Figure 3 briefly shows the I/O processing procedure in our scheme. Normal I/O requests pass through the software and hardware queues in the multiqueue block I/O layer and I/O submission queue of the NVMe device driver as in the original procedure. As described in Section 3, if numerous I/O requests are pending in any queue, the new I/O requests should wait until the previous I/O requests have been dequeued.
When an I/O request is urgent and should be processed immediately, our scheme first bypasses the software and hardware queues. Then, it writes some information related to the I/O requests into the HMB and sends an HMB I/O command, which is a newly added NVMe command in our scheme. As the HMB I/O command should not wait in the submission queues with numerous pending I/O commands, a dedicated queue is necessary in our scheme. A simple solution is to create a new submission queue and then use it as the dedicated queue for the HMB I/O commands. However, even if the highest priority is given to the dedicated submission queue, the latency cannot be avoided completely because submission queues are dispatched in a WRR manner to the SSD devices.
As an alternative, in our approach, the HMB I/O commands are sent to the administration submission queue, which contains only device management commands, such as those for activating the HMB or creating a submission queue. Because the administration submission queue is usually empty and, unlike the other submission queues, is always dispatched preferentially, the HMB I/O commands can be delivered quickly to the SSD device. When the SSD device receives an HMB I/O command, it processes it by referring to the information within the HMB and, if necessary, finally writes the response of the completed I/O request into the HMB.
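The following sketch illustrates how an HMB I/O command might be issued through the administration submission queue; the opcode value, the command layout, and the synchronous submit helper are assumptions made for illustration and are not defined by the NVMe specification or taken from our exact implementation.

```c
/* Hedged sketch of queuing an HMB I/O command on the admin submission queue. */
#include <stdint.h>
#include <stddef.h>

#define OPC_HMB_IO 0xC1          /* hypothetical vendor-specific admin opcode */

struct hmb_io_cmd {
    uint8_t  opcode;             /* OPC_HMB_IO */
    uint32_t meta_segment;       /* HMB segment holding the HMB I/O metadata */
    uint32_t meta_offset;        /* offset of the metadata entry in that segment */
};

/* hypothetical helper: submit a command on the admin queue and wait */
int nvme_submit_admin_cmd_sync(const void *cmd, size_t len);

static int send_hmb_io_command(uint32_t seg, uint32_t off)
{
    struct hmb_io_cmd cmd = {
        .opcode       = OPC_HMB_IO,
        .meta_segment = seg,
        .meta_offset  = off,
    };
    /* The admin queue is normally near-empty and dispatched ahead of the
     * I/O submission queues, so the command reaches the SSD quickly. */
    return nvme_submit_admin_cmd_sync(&cmd, sizeof(cmd));
}
```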
To support HMB I/O, we designed two modules—an HMB I/O requester within the host and an HMB I/O handler within the SSD (Figure 4). In the block I/O layer, the HMB I/O requester first picks out urgent I/O requests. If an I/O request is urgent, the HMB I/O requester writes information related to the I/O request in the HMB I/O metadata area located in the HMB and issues an HMB I/O command to the SSD. The HMB I/O handler receives the HMB I/O command and starts to process the I/O request in the SSD. It retrieves information related to the I/O request from the HMB I/O metadata area and copies data from NAND to the HMB I/O data area located in the HMB or from the HMB I/O data area to NAND in accordance with the I/O type, read or write. Finally, after the SSD completes the HMB I/O command, the HMB I/O requester receives the completion and the I/O request handling is finished.
The detailed operation of HMB I/O for each I/O type is described in Figure 5. In the case of a read request, the HMB I/O requester is invoked by submit_bio(), which starts processing the I/O request at the beginning of the block I/O layer, and determines whether or not the request is urgent. There are various ways to determine urgency; one of the simplest is to pass the priority of the process that issued the I/O request down to the block I/O layer and use it there. As the SSD controller manages I/O requests in page units, whereas the host uses the bio structure expressed in sectors, the HMB I/O requester first translates the sector addresses into logical page addresses. This translation is originally performed in the device driver, but in our work it is performed in advance so that the I/O requests can be sent directly to the SSD via the HMB.
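The requester-side checks described here can be sketched as follows, assuming 512 B sectors, 4 KB logical pages, and a simple priority threshold for urgency; the constants and the threshold are illustrative.

```c
/* Sketch of urgency detection and sector-to-page translation. */
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT     9                  /* 512 B sectors */
#define PAGE_SHIFT_SSD   12                 /* 4 KB logical pages */
#define SECTORS_PER_PAGE (1u << (PAGE_SHIFT_SSD - SECTOR_SHIFT))

/* assume urgency is inferred from the issuing process's nice value */
static bool is_urgent(int nice_value)
{
    return nice_value <= -20;               /* e.g., only the highest priority */
}

/* convert a sector extent [sector, sector + nr_sectors) into a page extent */
static void sectors_to_pages(uint64_t sector, uint32_t nr_sectors,
                             uint64_t *lpn, uint32_t *nlp)
{
    uint64_t first = sector / SECTORS_PER_PAGE;
    uint64_t last  = (sector + nr_sectors - 1) / SECTORS_PER_PAGE;

    *lpn = first;                           /* starting logical page number */
    *nlp = (uint32_t)(last - first + 1);    /* number of logical pages */
}
```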
After translation, the converted information is written to the HMB I/O metadata area. As shown in Figure 6, the HMBIOMetadata entry contains a starting logical page number (lpn), the number of pages (nlp), and the request type (type). Both this area and the HMB I/O data area, which holds the data to be written to or read from the NAND flash, are allocated when the SSD receives the Set Features command, which the host issues when requesting that the SSD activate the HMB. To share these areas with the host, the SSD writes their location information to the HMB control block after allocating them. The host can locate the areas using the data in the HMB control block: segment, the index of the physically contiguous block of the HMB, and offset, the internal offset within that segment. The HMB I/O requester then sends the HMB I/O command to the SSD, where the HMB I/O handler receives it. Using the HMB I/O metadata, the data are copied from the NAND flash to the HMB I/O data area, and the completion message is enqueued to the dedicated queue. After the host receives the completion message, it copies the data from the HMB I/O data area into its own memory pages, whose locations are described in the bio, and completes the request by calling bio_endio().
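A possible layout of these shared structures is sketched below; the field names (lpn, nlp, type, segment, offset) follow the text, but the exact sizes, packing, and the host-side address resolution helper are assumptions.

```c
/* Hedged sketch of the shared HMB areas in Figure 6. */
#include <stdint.h>

enum hmb_io_type { HMB_IO_READ = 0, HMB_IO_WRITE = 1 };

struct hmb_io_metadata {
    uint64_t lpn;            /* starting logical page number */
    uint32_t nlp;            /* number of logical pages */
    uint32_t type;           /* HMB_IO_READ or HMB_IO_WRITE */
};

struct hmb_area_loc {        /* written by the SSD into the HMB control block */
    uint32_t segment;        /* index of the physically contiguous HMB chunk */
    uint32_t offset;         /* byte offset of the area within that chunk */
};

struct hmb_control_block {
    struct hmb_area_loc metadata_area;   /* where the HMB I/O metadata lives */
    struct hmb_area_loc data_area;       /* where the HMB I/O data lives */
};

/* host side: resolve (segment, offset) to a mapped virtual address, given the
 * host's array of per-segment base mappings */
static inline void *hmb_resolve(void *const seg_base[], struct hmb_area_loc l)
{
    return (uint8_t *)seg_base[l.segment] + l.offset;
}
```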
The process in the write request case differs from that in the read request case in two respects. One is that the HMB I/O requester copies the data from the host memory pages to the HMB I/O data area before issuing an HMB I/O command. The other is that the HMB I/O handler copies data from the HMB I/O data area to the NAND flash after reading the request information from the HMB I/O metadata area.
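On the device side, the handler's dispatch on the request type can be sketched as follows, reusing the hmb_io_metadata layout from the sketch above; nand_read_page(), nand_write_page(), and the page size constant are hypothetical stand-ins for the emulator's own primitives.

```c
/* Device-side sketch: reads copy NAND pages into the HMB data area, writes
 * copy from the HMB data area into NAND. Helpers are hypothetical. */
#include <stdint.h>
#include <stddef.h>

#define SSD_PAGE_SIZE 4096

/* struct hmb_io_metadata and enum hmb_io_type as in the previous sketch */

int nand_read_page(uint64_t lpn, void *dst);        /* hypothetical FTL helpers */
int nand_write_page(uint64_t lpn, const void *src);

static int handle_hmb_io(const struct hmb_io_metadata *md, uint8_t *hmb_data)
{
    for (uint32_t i = 0; i < md->nlp; i++) {
        uint8_t *page = hmb_data + (size_t)i * SSD_PAGE_SIZE;
        int rc = (md->type == HMB_IO_READ)
                   ? nand_read_page(md->lpn + i, page)    /* NAND -> HMB data area */
                   : nand_write_page(md->lpn + i, page);  /* HMB data area -> NAND */
        if (rc)
            return rc;
    }
    return 0;   /* the completion is then posted back to the host via the HMB */
}
```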

5. Performance Evaluation

We implemented the proposed HMB I/O scheme in the QEMU-based SSD emulator by referring to prior works [8,28]. Our implementation includes HMB-related functions such as activating the HMB, dividing the HMB space for multiple purposes, and sharing the HMB with the host. We also added the HMB I/O handler and requester to the emulated SSD and Linux kernel, respectively.
We conducted various experiments to verify the effectiveness of HMB I/O in providing QoS to urgent I/Os when an SSD is under a bursty I/O load. Table 4 describes the experimental environment. The submission queue size of the emulated NVMe SSD was set to 16,384, and fio was used to generate heavy I/O workloads by issuing random read requests. In our experiments, the average number of I/O requests in each submission queue was about 14,890 when only fio was executed. To generate urgent I/O requests to be handled by HMB I/O, we used I/O workloads collected on a Nexus 5 Android smartphone [29]. The workloads consist of I/O records collected at the block I/O layer and device driver of the smartphone while applications or system functions, such as sending and receiving calls, were running (Table 5). To replay them, all I/O requests bypassed the caches and buffers of the host operating system and were delivered directly to the block I/O layer. Before executing the workloads in Table 5, we first ran fio for a while to sufficiently fill the submission queues with I/O requests.
Figure 7 shows the average latency of urgent I/O requests in four cases: urgent I/O requests processed via the original block I/O layer while fio is not executed (Idle (original)), processed by HMB I/O while fio is not executed (Idle (HMB I/O)), processed via the original block I/O layer while fio is executed (Busy (original)), and processed by HMB I/O while fio is executed (Busy (HMB I/O)). When the SSD is busy with a heavy workload, the average latency of urgent I/O requests reaches 996.8 ms. However, by processing urgent I/O requests with HMB I/O, the average latency is improved by a factor of 31.6 to 459.7. Above all, the latencies of urgent I/O requests are much more stable when they are processed with HMB I/O. When fio is executed simultaneously, the latencies of urgent I/O requests range from 0.11 to 2.48 ms if they are processed with HMB I/O, but from 22.31 to 96.98 ms if they are processed via the original block I/O layer.
Using the same experiment as in Figure 7, we also analyzed the results in terms of tail latencies, which are often more important than average latencies. Figure 8 depicts the performance relative to the case in which fio and the microbenchmark are executed simultaneously without HMB I/O. As shown, by servicing the urgent workloads with HMB I/O, the tail latencies were reduced to 14–39% of the baseline. In addition, they differ little from those of the idle situation in which only the microbenchmark was executed, being about 1.4 times higher on average.
Figure 9 shows the individual I/O latencies of the workloads when fio was executed concurrently. Only three sets of results are shown (the Movie, Amazon, and Twitter workloads) because they exhibit high, medium, and low improvements in average latency, respectively. As expected, the I/O latencies are very long when differentiated service is not provided, reaching 600–1200 ms in many cases; this level cannot be ignored in terms of user satisfaction. In contrast, the latencies are very short when HMB I/O is used to process the workloads urgently, again demonstrating the effectiveness of our scheme.
We also analyzed the I/O latency as a function of the I/O load on the SSD, namely, the number of I/O requests issued by fio. To this end, we varied the number of I/Os issued from 8 to 16,384 and measured the I/O latencies with the Amazon workload, whose performance improvement was in the middle of the range among the workloads we tested. As can be seen in Figure 10, when HMB I/O is not used, the average latency gradually increases as the I/O load increases. If the Amazon workload is processed by HMB I/O, this situation improves significantly, and the average latency remains roughly constant at about 15 ms regardless of the number of I/Os issued by fio.
Since our scheme gives high priority to I/O requests that must be processed urgently even when other I/O requests are pending in the submission queues, the processing of less urgent I/O requests is inevitably delayed. Figure 11 shows the average fio latency as a function of the number of I/Os issued. As the I/O load on the SSD increases, the fio latency also increases regardless of whether HMB I/O is used. In particular, when HMB I/O is employed, the fio requests in the submission queue must wait longer because urgent I/O requests are processed via HMB I/O, so the fio latency increases by 49% on average. Although the microbenchmark is given a higher process priority, this priority has no effect here because it is not considered in the original multiqueue block I/O layer.
When the SSD is very busy with numerous I/O requests, HMB I/O can also be used to improve application launch times. Launching an application requires many I/O requests to read files such as executables, configuration files, and shared libraries. We measured the launch times of several widely used applications as the interval between the start time and the time at which the initial screen is fully displayed (Table 6). As in the previous experiments, the three applications were given higher process priority, and all I/O requests issued by these applications were processed by HMB I/O.
Figure 12 shows the launch time of each application as a function of the I/O load on the SSD. If the number of I/O requests issued by fio is equal to or less than 16, the effectiveness of HMB I/O is not obvious. However, when the number of I/O requests issued is greater than 16, the application launch time increases rapidly without HMB I/O, whereas with HMB I/O, it remains almost constant. This finding indicates that HMB I/O is an effective means of providing a guaranteed launch time for applications, which is very important for user satisfaction.

6. Conclusions

This paper presented an I/O scheme that provides guaranteed performance even when an SSD is overburdened by numerous I/O requests. The approach uses the HMB, a region of host DRAM that the SSD can access, as a fast track for I/O requests requiring urgent processing. To avoid delaying urgent I/O requests when there are numerous pending I/O requests in the submission queues, they are sent to the SSD by bypassing the original block I/O layer. Various experiments showed that the HMB I/O scheme provides stable I/O performance in all cases in terms of both average and tail latency. In particular, our scheme can guarantee application launch times, which is crucial for end users.
Since the HMB enables cooperation between the host and the SSD, we believe it has considerable potential to improve I/O performance when an NVMe SSD is used as storage. Just as our work bypasses the multiqueue block I/O layer via the HMB, other parts of the existing kernel I/O stack can also be optimized through it. In particular, because the DRAM within modern SSDs is not always sufficient to run the FTL, the host's DRAM can be used instead. Moreover, as the host has greater computational capability and more workload information than the SSD, the HMB space can be used efficiently as a write buffer or a mapping table cache. In future work, we will continue to explore use cases in which the HMB improves I/O performance in NVMe SSDs.

Author Contributions

Software, K.K.; writing—Original Draft preparation, K.K.; writing—review and editing, T.K. and S.K.; supervision, T.K.; project administration, T.K.; funding acquisition, T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research Grant of Kwangwoon University in 2020 and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT), grant number 2020R1F1A1074676.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cobb, D.; Huffman, A. NVM Express and the PCI Express SSD Revolution. In Proceedings of the Intel Developer Forum, San Francisco, CA, USA, 13 September 2012. [Google Scholar]
  2. Bjørling, M.; Axboe, J.; Nellans, D.; Bonnet, P. Linux Block IO: Introducing Multi-Queue SSD Access on Multi-Core Systems. In Proceedings of the 6th International Systems and Storage Conference, Haifa, Israel, 30 June–2 July 2013; ACM: New York, NY, USA, 2013. [Google Scholar]
  3. Kim, S.; Yang, J.S. Optimized I/O determinism for emerging NVM-based NVMe SSD in an enterprise system. In Proceedings of the 55th Annual Design Automation Conference, San Francisco, CA, USA, 24–29 June 2018. [Google Scholar]
  4. Kim, H.J.; Lee, Y.S.; Kim, J.S. NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems, Denver, CO, USA, 20–21 June 2016. [Google Scholar]
  5. Kim, H.; Shin, D. SHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device. In Proceedings of the 15th USENIX Conference on File and Storage Technologies, Santa Clara, CA, USA, 27 February–2 March 2017. [Google Scholar]
  6. Xie, W.; Chen, Y.; Roth, P.C. Exploiting internal parallelism for address translation in solid-state drives. ACM Trans. Storage 2018, 14, 1–30. [Google Scholar] [CrossRef]
  7. Jung, M.; Kandemir, M. Revisiting Widely Held SSD Expectations and Rethinking System-level Implications. Available online: http://camelab.org/uploads/paper/MJ-SIGMETRICS13.pdf (accessed on 29 May 2020).
  8. Kim, K.; Lee, E.; Kim, T. HMB-SSD: Framework for efficient exploiting of the host memory buffer in the NVMe SSD. IEEE Access 2019, 7, 150403–150411. [Google Scholar] [CrossRef]
  9. Lee, M.; Kang, D.H.; Lee, M.; Eom, Y.I. Improving Read Performance by Isolating Multiple Queues in NVMe SSDs. Available online: https://dl.acm.org/doi/pdf/10.1145/3022227.3022262 (accessed on 29 May 2020).
  10. NVMe Overview. Available online: https://www.nvmexpress.org/wp-content/uploads/NVMe_Overview.pdf (accessed on 25 May 2020).
  11. Huffman, A. NVM Express Base Specification Revision 1.3c. Available online: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3c-2018.05.24-Ratified.pdf (accessed on 2 September 2019).
  12. Zhang, J.; Kwon, M.; Gouk, D.; Koh, S.; Lee, C.; Alian, M.; Chun, M.; Kandemir, M.T.; Kim, N.S.; Kim, J.; et al. FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation, Carlsbad, CA, USA, 8–10 October 2018. [Google Scholar]
  13. Peng, B.; Zhang, H.; Yao, J.; Dong, Y.; Xu, Y.; Guan, H. MDev-NVMe: A NVMe Storage Virtualization Solution with Mediated Pass-Through. In Proceedings of the 2018 USENIX Annual Technical Conference, Boston, MA, USA, 11–13 July 2018. [Google Scholar]
  14. Huang, S. DRAM-Less SSD Facilitates HDD Replacement. In Proceedings of the Flash Memory Summit, Santa Clara, CA, USA, 10–11 August 2015. [Google Scholar]
  15. Chen, M. Which PCIe BGA SSD Architecture is Right for Your Application. In Proceedings of the Flash Memory Summit, Santa Clara, CA, USA, 7–10 August 2017. [Google Scholar]
  16. Linux Kernel NVMe Device Driver. Available online: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/nvme/host/pci.c?h=v4.13.10#n2179 (accessed on 25 May 2020).
  17. Ramseyer, C. Silicon Motion SM2263XT HMB SSD Preview. Available online: https://www.tomshardware.com/reviews/silicon-motion-sm2263xt-controller-preview,5404.html (accessed on 25 May 2020).
  18. Marvell 88NV1160 Product Brief. Available online: https://www.marvell.com/storage/assets/Marvell-88NV1160-Product-Brief-20160830.pdf (accessed on 27 December 2019).
  19. Silicon Power PCIe Gen3x2 P32M85. Available online: https://www.silicon-power.com/web/product-P32M85 (accessed on 25 May 2020).
  20. Dorgelo, J.; Chen, M.C. Host Memory Buffer (HMB) based SSD System. In Proceedings of the Flash Memory Summit, Santa Clara, CA, USA, 10–11 August 2015. [Google Scholar]
  21. Yang, S. Improving the Design of DRAM-Less PCIe SSD. In Proceedings of the Flash Memory Summit, Santa Clara, CA, USA, 7–10 August 2017. [Google Scholar]
  22. Hong, J.; Han, S.; Chung, E.Y. A RAM Cache Approach Using Host Memory Buffer of the NVMe Interface. In Proceedings of the 2016 International SoC Design Conference, Jeju, Korea, 23–26 October 2016; pp. 109–110. [Google Scholar]
  23. Joshi, K.; Yadav, K.; Choudhary, P. Enabling NVMe WRR Support in Linux Block Layer. In Proceedings of the 9th USENIX Workshop on Hot Topics in Storage and File Systems, Santa Clara, CA, USA, 10–11 July 2017. [Google Scholar]
  24. Axboe, J. Fio: Flexible I/O Tester. Available online: https://github.com/axboe/fio (accessed on 25 May 2020).
  25. Libaio. Available online: https://pagure.io/libaio (accessed on 25 May 2020).
  26. Open (2)—Linux Manual Page. Available online: http://man7.org/linux/man-pages/man2/open.2.html (accessed on 25 May 2020).
  27. Kim, K.; Kim, S.; Kim, T. FAST I/O: QoS Supports for Urgent I/Os in NVMe SSDs. In Proceedings of the 2020 5th International Conference on Intelligent Information Technology, Hanoi, Vietnam, 19–22 February 2020. [Google Scholar]
  28. Li, H.; Hao, M.; Tong, M.H.; Sundararaman, S.; Bjørling, M.; Gunawi, H.S. The CASE of FEMU: Cheap, Accurate, Scalable and Extensible Flash Emulator. In Proceedings of the 16th USENIX Conference on File and Storage Technologies, Oakland, CA, USA, 12–15 February 2018. [Google Scholar]
  29. Zhou, D.; Pan, W.; Wang, W.; Xie, T. I/O Characteristics of Smartphone Applications and Their Implications for eMMC Design. In Proceedings of the 2015 IEEE International Symposium on Workload Characterization, Atlanta, GA, USA, 4–6 October 2015; IEEE Computer Society: Washington, DC, USA, 2015. [Google Scholar]
Figure 1. Number of I/O requests enqueued in the submission queue. (a) SSD-A; (b) SSD-B; (c) SSD-C.
Figure 2. Latency of an I/O request when each SSD is busy. (a) SSD-A; (b) SSD-B; (c) SSD-C.
Figure 3. I/O handling via host memory buffer (HMB). (a) Normal I/Os; (b) urgent I/Os.
Figure 4. System architecture supporting HMB I/O.
Figure 5. HMB I/O procedures. (a) Read requests; (b) write requests.
Figure 6. HMB I/O data structures.
Figure 7. Average latency of urgent I/O requests.
Figure 8. Normalized tail latency of urgent I/O requests in the HMB I/O scheme.
Figure 9. I/O latencies of some workloads. (a) Movie; (b) Amazon; (c) Twitter.
Figure 10. Average I/O latency of Amazon workload.
Figure 11. Average I/O latency of fio.
Figure 12. Launch time of applications. (a) Firefox; (b) Visual Studio Code; (c) ATOM.
Table 1. Experimental environment.
Processor: Intel i7-8700 (12 cores)
Main memory: 64 GB
Main storage: Samsung 860 PRO 256 GB
Operating system: Ubuntu 16.04 64-bit (kernel 4.13.10, installed on the main storage)
Table 2. Target solid-state drives (SSDs) used. The submission queue size and the number of submission queues are 16,384 and 12, respectively, for all drives.
SSD | Model | Capacity (GB)
SSD-A | Samsung 970 PRO 512 GB | 512
SSD-B | Samsung 970 EVO Plus 500 GB | 500
SSD-C | WD 3D Black 500 GB | 500
Table 3. Configuration for generating I/O workloads to (a) count the number of I/O requests enqueued in the submission queue and (b) measure I/O delays caused by pending I/O requests in the submission queue.
(a) fio
- Random read; I/O unit: 4 KB
- Number of threads: 1; direct I/O enabled
- libaio used (depth: 1–65,536)
(b) fio
- Random read; I/O unit: 4 KB
- Number of threads: 12; direct I/O enabled
- libaio used (depth: 1–20,000); process priority (nice value): 19
Micro-benchmark tool
- Read from offset 0; I/O unit and size: 512 B
- Number of threads: 1; direct I/O enabled
- Process priority (nice value): −20
Table 4. Experimental environment.
Processor: 6 cores of Intel i7-8700
Main memory: 16 GB
SSD: NVMe SSD emulator based on QEMU 2.9.0
- Total capacity: 36 GB; page size: 4 KB
- Block size: 1 MB; submission queue size: 16,384
- Cell read: 40 µs; cell program: 800 µs
- Register read: 60 µs; register write: 40 µs
- Block erase: 3000 µs
Operating system: Ubuntu 16.04 64-bit (kernel 4.13.10)
I/O workloads:
fio
- Random read; I/O unit: 4 KB
- Number of threads: 9; direct I/O enabled
- libaio used (depth: 16,384); pre-running: 1 min
- Process priority (nice value): 19
Micro-benchmark for replaying the Nexus 5 workloads
- Number of threads: 1; direct I/O enabled
- Process priority (nice value): −20
Table 5. Workloads collected from Nexus 5 smartphone (adapted from [29]).
Type | Workload | Collection Period (min) | Data Size (MB) | I/O Rate (req/s) | Ratio of Reads (%)
App. | Movie | 16.6 | 127.4 | 4.8 | 94.6
App. | CameraVideo | 57.0 | 2229.7 | 2.7 | 70.5
Sys. func. | Booting | 0.7 | 959.2 | 460.4 | 66.9
App. | Music | 63.4 | 234.4 | 1.8 | 47.2
App. | Email | 12.3 | 57.9 | 3.9 | 29.6
App. | Facebook | 18.5 | 95.2 | 3.5 | 25.6
App. | WebBrowsing | 81.7 | 93.7 | 0.8 | 19.3
App. | AngryBird | 33.7 | 92.5 | 1.6 | 15.5
App. | GoogleMaps | 28.7 | 193.2 | 7.3 | 13.2
App. | Twitter | 14.3 | 183.1 | 16.1 | 11.5
App. | Amazon | 13.7 | 65.8 | 3.9 | 37.0
Sys. func. | Idle | 489.4 | 120.3 | 0.2 | 11.1
App. | Messaging | 9.8 | 62.2 | 9.7 | 2.7
App. | Youtube | 78.2 | 28.0 | 0.4 | 2.5
App. | Installing | 16.3 | 1615.1 | 18.4 | 1.7
App. | Radio | 74.2 | 113.3 | 1.3 | 1.3
Sys. func. | CallOut | 61.7 | 26.7 | 0.4 | 1.1
Sys. func. | CallIn | 62.8 | 26.7 | 0.4 | 0.1
Table 6. Applications tested for measuring the launch time.
Application | Version | Description
Firefox | 68.0.1 | Open-source web browser developed by the Mozilla Foundation
Visual Studio Code | 1.37.1 | Open-source text editor developed by Microsoft
ATOM | 1.40.0 | Open-source text editor developed by GitHub
